We often need to hide the confidential and sensitive information within the documents. In other articles, we have discussed the different strategies to search words and even search synonyms within multiple documents. This article guides you about how to redact PDF text and text in images within a document using C#.

The following topics will be covered below:

.NET API for Text and Image Redaction

GroupDocs.Redaction provides the document redaction .NET API that allows hiding and removing confidential information within documents of various file formats. Along with the simple text redaction and rasterization, the API also allows identifying the text in images that may have been inside any document like most commonly used scanned PDF files. The complete list of supported file formats is available in the documentation.

You can download the DLLs or MSI installer from the downloads section or install the API in your .NET application via NuGet.

Install via Package Manager Console

PM> Install-Package GroupDocs.Redaction

Install via NuGet Package Manager

GroupDocs.Redaction - NuGet Package - Install

Redact PDF Text and Scanned Image Text using C#

There are many different ways to find and replace text in documents that have already been discussed. You can find specific words in any document, find with case sensitivity, or by using regular expressions. I will be using the following PDF document, that contains some text and also an image with some text in it. Here we will combine the OCR and redaction process using GroupDocs.Redaction for .NET. Firstly, we will identify the text in the document and also the text which is inside the image of the document. Secondly, we will cover it with a black box to demonstrate how to programmatically hide any legal or confidential information even if is as text within a scanned document image.

PDF with text and scanned image

The following steps will detect and replace the text in a PDF document, that contains regular text along with some text within an embedded image.

  • Prepare the redactor settings using any OCR Connector.
  • Load the PDF document using Redactor class with the prepared settings and any specific loading options.
  • Define the replacement option. I have defined to black out the text.
  • For the text redaction, use the appropriate text selection strategy. I have used RegEx.
  • Apply the redactions using the Apply method.
  • Save the redacted document using Save method.

The following source code redacts the selected text within a PDF document using C#.

The output of the above code is as follows that black out the selected text of the PDF document.

Redact PDF text and scanned image text

Get a Free API License

You can get a free temporary license to use the API without the evaluation limitations.

Conclusion

To sum up, you have learned to redact text in documents. More importantly and precisely, we discussed how to redact text in images within a PDF document using C#. We selected the text to redact using regular expressions, however, it can be selected using many different ways as discussed earlier. Later we blackout the search results using a black rectangle box over the searched text.

For more details to learn about the API, visit the documentation. For queries, contact us via the forum.

See Also