Parse Documents to Extract Text and Metadata using Java

GroupDocs.Parser for Java has been on the market since last year and has proved to be a powerful document parsing API. It parses and reads popular formats of word processing documents, spreadsheets, presentations, ebooks, emails, markup documents, notes, archives, and databases. Besides text, you can also extract images and metadata properties from various document formats including PDF, XLS, XLSX, CSV, DOC, DOCX, PPT, PPTX, MPP, EML, MSG, OST, PST, ONE, and many more.

To improve how the API works and simplify its usage for developers, we have revamped its architecture from scratch. The improved and simplified API is now available as GroupDocs.Parser for Java 19.11.

What is new in Parser API for Java?

If you are using an older version, the following are the key reasons to upgrade to the latest release.

The Parser class has been introduced to read and extract data from documents of any supported format.
The process of data extraction has been unified for all data types.
Product architecture has been revamped from scratch in order to simplify the usage of different options and classes to manipulate data.
The process of getting document information and preview generation has been simplified.

How to migrate?

Since the product has gone through major updates, the classes, the methods, and the way they are used have also changed. However, we haven’t removed the legacy API from the package yet. Instead, we have moved it to the com.groupdocs.parser.legacy package. Once you upgrade to v19.11, you just need to perform a project-wide replacement of the package name from com.groupdocs.parser to com.groupdocs.parser.legacy. This gets rid of the immediate build issues. You can then gradually update the source code to use the classes and methods of the new public API.
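The project-wide package replacement described above is usually done with an IDE or a find-and-replace tool, but the rule itself can be sketched in a few lines of plain Java. This is only an illustration under assumptions: the class name LegacyPackageMigrator and the imported SomeLegacyClass are hypothetical placeholders, not part of the GroupDocs.Parser API. The sketch leaves references that already point to the legacy package unchanged, so it is safe to run more than once on the same source.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of GroupDocs.Parser: rewrites old package
// references in a piece of source text to the legacy package.
public class LegacyPackageMigrator {

    // Matches "com.groupdocs.parser" as a whole package name, but not when it
    // is already followed by ".legacy" (avoids "...parser.legacy.legacy").
    private static final Pattern OLD_PACKAGE =
            Pattern.compile("\\bcom\\.groupdocs\\.parser\\b(?!\\.legacy\\b)");

    public static String migrate(String source) {
        return OLD_PACKAGE.matcher(source).replaceAll("com.groupdocs.parser.legacy");
    }

    public static void main(String[] args) {
        // SomeLegacyClass is a placeholder for any class used by pre-19.11 code.
        String before = "import com.groupdocs.parser.SomeLegacyClass;";
        System.out.println(migrate(before));
    }
}
```

Apply such a replacement only to code written against the older API. Code that you later port to the new API should import from com.groupdocs.parser again.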

Code Comparison - Extract Text and Metadata from Documents using Java

Let’s now look at how the code for extracting text and metadata has changed in the latest release.

Extract Text from PDF in Java

v19.11 or Later

Legacy API

Extract Metadata from Documents in Java

v19.11 or Later

Legacy API

For more details on code comparison, please have a look at the migration notes.

Well, this was a brief overview of the latest release. Now, you can evaluate the recent changes yourself by downloading or cloning the updated source code examples from the GitHub repository. We have also updated the documentation as per the latest release.

If you face any issues while migrating to the latest release or using a particular feature, feel free to let us know via our forum.
