GroupDocs.Parser for .NET

Today, we are excited to announce the release of version 18.7 of GroupDocs.Parser for .NET. This release adds support for extracting text areas from document pages, which can help you collect data for text analysis. We recommend upgrading to the latest version of the API and sharing your feedback with us.

Extracting Text Areas

Extracting text areas is useful when you need data for text analysis. To support it, each text extractor implements its own internal class and exposes it through the DocumentContent property (see PdfTextExtractor for an example). The DocumentContent class has the following members:

Member          Description
PageCount       Returns a total number of document pages
Dispose         Releases resources used by the class
GetPage         Returns a document page (see below)
GetTextAreas    Returns a collection of TextArea objects (see below)

The following code sample shows how to get text areas from a PDF document.

// Create a text extractor
PdfTextExtractor extractor = new PdfTextExtractor("invoice.pdf");
 
// Create search options
TextAreaSearchOptions searchOptions = new TextAreaSearchOptions();
// Set a regular expression that matches text such as 'INVOICE # 123'
searchOptions.Expression = "\\s?INVOICE\\s?#\\s?[0-9]+";
// Limit the search with a rectangle
searchOptions.Rectangle = new GroupDocs.Parser.Rectangle(10, 10, 300, 150);
 
// Get text areas
IList<TextArea> texts = extractor.DocumentContent.GetTextAreas(0, searchOptions);
             
// Iterate over a list
foreach(TextArea area in texts)
{
    // Print a text
    Console.WriteLine(area.Text);
}
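
The search expression in the sample is easier to tune if you first check what it matches on plain strings. The sketch below is not part of GroupDocs.Parser: it runs the same pattern through the standard .NET Regex class, and the InvoicePatternDemo class and FindMatches helper are hypothetical names used only for illustration. Whether GroupDocs.Parser applies the expression with the same options (for example, case sensitivity) is an assumption you should verify against your own documents.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for trying out the search expression outside GroupDocs.Parser.
public static class InvoicePatternDemo
{
    // Same pattern as in the sample above, written as a verbatim string.
    public const string Pattern = @"\s?INVOICE\s?#\s?[0-9]+";

    // Returns every substring of 'text' that matches the pattern.
    public static List<string> FindMatches(string text, RegexOptions options = RegexOptions.None)
    {
        var results = new List<string>();
        foreach (Match match in Regex.Matches(text, Pattern, options))
        {
            results.Add(match.Value);
        }
        return results;
    }

    public static void Main()
    {
        string[] samples =
        {
            "INVOICE # 12345",  // one optional space on each side of '#'
            "INVOICE#7",        // no spaces at all
            "Invoice # 12345",  // mixed case
            "INVOICE  # 9"      // two spaces before '#'
        };

        foreach (string sample in samples)
        {
            Console.WriteLine($"{sample} -> {FindMatches(sample).Count} match(es)");
        }
    }
}
```

With the default options, the first two samples match and the last two do not: the pattern is case-sensitive, and each `\s?` accepts at most one whitespace character. Passing RegexOptions.IgnoreCase to FindMatches makes the mixed-case sample match as well.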

Available Channels and Resources

Here are a few channels and resources for you to download, learn, try and get technical support on GroupDocs.Parser:

Have Queries?

If you have any questions or concerns about the API, please get in touch with us on the forum. We’ll be glad to help.