Today, we are excited to announce the release of GroupDocs.Parser for .NET version 18.7. This release adds support for extracting text areas from document pages, which can help you collect data for text analysis. We recommend upgrading to the latest version and welcome your feedback.
Extracting Text Areas
Extracting text areas is useful when you need data for text analysis. To support this, text extractors implement their own internal private class and expose it through the DocumentContent property (see PdfTextExtractor for an example). The DocumentContent class has the following members:
| Member | Description |
| --- | --- |
| PageCount | Returns a total number of document pages |
| Dispose | Releases resources used by the class |
| GetPage | Returns a document page (see below) |
| GetTextAreas | Returns a collection of TextArea objects (see below) |
The following code sample shows how to get text areas from a PDF document.
// Create a text extractor
PdfTextExtractor extractor = new PdfTextExtractor("invoice.pdf");
// Create search options
TextAreaSearchOptions searchOptions = new TextAreaSearchOptions();
// Set a regular expression to search 'Invoice # XXX' text
searchOptions.Expression = "\\s?INVOICE\\s?#\\s?[0-9]+";
// Limit the search with a rectangle
searchOptions.Rectangle = new GroupDocs.Parser.Rectangle(10, 10, 300, 150);
// Get text areas
IList<TextArea> texts = extractor.DocumentContent.GetTextAreas(0, searchOptions);
// Iterate over the list of text areas
foreach (TextArea area in texts)
{
    // Print the text of the area
    Console.WriteLine(area.Text);
}
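To see which strings the search expression in the sample above is meant to catch, the pattern can be tried on its own with the standard .NET Regex class. This is only a sketch: it does not call GroupDocs.Parser, and it assumes the library evaluates the expression with default .NET options (in particular, case-sensitive matching), which the release notes do not confirm. The class name InvoicePatternDemo, the helper FindInvoiceLabel, and the sample strings are hypothetical and exist only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InvoicePatternDemo
{
    // Same pattern as in the sample above, written as a verbatim string
    // so the backslashes do not need to be doubled.
    private const string Pattern = @"\s?INVOICE\s?#\s?[0-9]+";

    // Hypothetical helper: returns the first match with surrounding
    // whitespace trimmed, or null when the pattern does not match.
    public static string FindInvoiceLabel(string text)
    {
        Match match = Regex.Match(text, Pattern);
        return match.Success ? match.Value.Trim() : null;
    }

    public static void Main()
    {
        // Illustrative inputs only; they are not taken from a real document.
        string[] samples =
        {
            "INVOICE # 10045",   // matches
            "INVOICE#7",         // matches: every whitespace part is optional
            "Invoice # 12",      // no match: default matching is case-sensitive
            "INVOICE number 3"   // no match: the '#' character is required
        };

        foreach (string sample in samples)
        {
            string found = FindInvoiceLabel(sample);
            Console.WriteLine("{0} -> {1}", sample, found ?? "(no match)");
        }
    }
}
```

Because each `\s?` allows at most one whitespace character, text with wider gaps (for example two spaces before the `#`) would not match under these assumptions. If the text in your documents varies in spacing or letter case, a looser expression such as `\s*` in place of `\s?` may be worth testing before you pass it to the search options.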
Available Channels and Resources
Here are a few channels and resources for you to download, learn, try and get technical support on GroupDocs.Parser:
- Installation - Install GroupDocs.Parser using NuGet
- Documentation - Product Docs
- Examples - GitHub Source Code Examples
- Video Tutorials - YouTube Video Tutorials
- Product Support Forum - Technical Support Forum for GroupDocs.Parser