In the era of online business, the use of digital invoices and receipts has grown considerably, and so has the demand for extracting data from them efficiently. In this article, you will learn how to extract data from PDF invoices or receipts programmatically in Java. An earlier post covered extracting invoice data using C#.

Document Parsing and Data Extraction Java API
I will be using GroupDocs.Parser for Java to parse PDF invoices and extract data values within a Java application. This API can also extract text, images, and metadata from documents, images, presentations, archives, emails, and many other supported formats.
Download or Configure
From the downloads section, you can download the JAR file, or get the repository and dependency configurations for the pom.xml of your Maven-based Java applications.
How to Extract PDF Invoice Data in Java
The following steps let you extract data from PDF invoices using Java.
- Create a template.
- Parse the PDF invoice according to the created template.
- Extract the information from the parsed PDF.
Create Template for the Invoice
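Below is the template created to match the layout of the sample invoice. It uses fixed positions for the sender and recipient blocks, and labels located by regular expression, each paired with a linked position that reads the value next to the label. You can download the invoice used here from the sample files available in the GitHub repository.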
// Create Template to Parse Data from Invoice using Java
import com.groupdocs.parser.data.Point;
import com.groupdocs.parser.data.Rectangle;
import com.groupdocs.parser.data.Size;
import com.groupdocs.parser.templates.*;
import java.util.Arrays;

// First create Template Items
TemplateItem[] templateItems = new TemplateItem[]
{
    // Fields read from fixed rectangles on the page
    new TemplateField(new TemplateFixedPosition(new Rectangle(new Point(35, 135), new Size(100, 10))), "FromCompany"),
    new TemplateField(new TemplateFixedPosition(new Rectangle(new Point(35, 150), new Size(100, 35))), "FromAddress"),
    new TemplateField(new TemplateFixedPosition(new Rectangle(new Point(35, 190), new Size(150, 2))), "FromEmail"),
    new TemplateField(new TemplateFixedPosition(new Rectangle(new Point(35, 250), new Size(100, 2))), "ToCompany"),
    new TemplateField(new TemplateFixedPosition(new Rectangle(new Point(35, 260), new Size(100, 15))), "ToAddress"),
    new TemplateField(new TemplateFixedPosition(new Rectangle(new Point(35, 290), new Size(150, 2))), "ToEmail"),
    // Labels found by regular expression, each followed by a field linked to that label
    new TemplateField(new TemplateRegexPosition("Invoice Number"), "InvoiceNumber"),
    new TemplateField(new TemplateLinkedPosition(
        "InvoiceNumber",
        new Size(200, 15),
        new TemplateLinkedPositionEdges(false, false, true, false)),
        "InvoiceNumberValue"),
    new TemplateField(new TemplateRegexPosition("Order Number"), "InvoiceOrder"),
    new TemplateField(new TemplateLinkedPosition(
        "InvoiceOrder",
        new Size(200, 15),
        new TemplateLinkedPositionEdges(false, false, true, false)),
        "InvoiceOrderValue"),
    new TemplateField(new TemplateRegexPosition("Invoice Date"), "InvoiceDate"),
    new TemplateField(new TemplateLinkedPosition(
        "InvoiceDate",
        new Size(200, 15),
        new TemplateLinkedPositionEdges(false, false, true, false)),
        "InvoiceDateValue"),
    new TemplateField(new TemplateRegexPosition("Due Date"), "DueDate"),
    new TemplateField(new TemplateLinkedPosition(
        "DueDate",
        new Size(200, 15),
        new TemplateLinkedPositionEdges(false, false, true, false)),
        "DueDateValue"),
    new TemplateField(new TemplateRegexPosition("Total Due"), "TotalDue"),
    new TemplateField(new TemplateLinkedPosition(
        "TotalDue",
        new Size(200, 15),
        new TemplateLinkedPositionEdges(false, false, true, false)),
        "TotalDueValue")
};
// Transform into template
Template template = new Template(Arrays.asList(templateItems));
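The label-plus-value pairs in the template follow a common pattern: find a known label, then read what sits next to it. As a rough plain-text analogue only (GroupDocs.Parser works on page coordinates, and this is not its implementation), the idea can be sketched with `java.util.regex`. The class `LabelValueSketch`, the method `valueAfterLabel`, and the sample text are hypothetical names introduced for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not part of GroupDocs.Parser.
final class LabelValueSketch {

    private LabelValueSketch() { }

    // Returns the text that follows the label on the same line, if the label is present.
    // The label is matched literally; the value is everything up to the end of the line.
    static Optional<String> valueAfterLabel(String pageText, String label) {
        Pattern pattern = Pattern.compile(Pattern.quote(label) + "[ \\t]+(.+)");
        Matcher matcher = pattern.matcher(pageText);
        return matcher.find() ? Optional.of(matcher.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed sample text: each label and its value share one line.
        String pageText = "Order Number 12345\nInvoice Date January 25, 2016\n";
        // prints "January 25, 2016"
        System.out.println(valueAfterLabel(pageText, "Invoice Date").orElse("(not found)"));
    }
}
```

A text-only approach like this breaks down when the label and value are not adjacent in the text stream, which is one reason a position-based template is the better fit for PDF layouts.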
Parse PDF Invoice/Receipt for Data Extraction
The following code parses the PDF invoice according to the template created above and prints the name and text of each extracted field.
// Parse the PDF Invoice using the defined Template in Java
import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.DocumentData;
import com.groupdocs.parser.data.PageTextArea;

// try-with-resources closes the parser and releases the file when done
try (Parser parser = new Parser("filePath/invoice.pdf")) {
    DocumentData data = parser.parseByTemplate(template);
    // Print the extracted data
    for (int i = 0; i < data.getCount(); i++) {
        // Printing Field Name
        System.out.print(data.get(i).getName() + ": ");
        // Cast PageArea property value to PageTextArea
        // as we have defined only text fields in the template
        PageTextArea area = data.get(i).getPageArea() instanceof PageTextArea
                ? (PageTextArea) data.get(i).getPageArea()
                : null;
        System.out.println(area == null ? "Not a template field" : area.getText());
    }
}
The Output
The following is the output of the above code after extraction of data from the invoice.
**FROMCOMPANY:** DEMO - Sliced Invoices
**FROMADDRESS:** Suite 5A-1204
123 Somewhere Street
Your City AZ 12345
**FROMEMAIL:** admin@slicedinvoices.com
**TOCOMPANY:** Test Business
**TOADDRESS:** 123 Somewhere St
Melbourne, VIC 3000
**INVOICENUMBER:** Invoice Number
**INVOICENUMBERVALUE:** NV-3337
**INVOICEORDER:** Order Number
**INVOICEORDERVALUE:** 12345
**INVOICEDATE:** Invoice Date
**INVOICEDATEVALUE:** January 25, 2016
**DUEDATE:** Due Date
**DUEDATEVALUE:** January 31, 2016
**TOTALDUE:** Total Due
**TOTALDUEVALUE:** $93.50
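The extracted values are plain text, so downstream code usually needs to convert them to typed values. Below is a minimal sketch using only the Java standard library. It assumes dates in the US English form shown above ("January 25, 2016") and amounts written with a dollar sign, comma grouping, and a dot as the decimal separator. The class `InvoiceValues` and its methods are hypothetical helpers, not part of GroupDocs.Parser.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper, not part of GroupDocs.Parser: converts extracted text to typed values.
final class InvoiceValues {

    // Assumes dates look like "January 25, 2016" (full US English month name).
    private static final DateTimeFormatter INVOICE_DATE =
            DateTimeFormatter.ofPattern("MMMM d, uuuu", Locale.US);

    private InvoiceValues() { }

    // Throws DateTimeParseException if the text does not match the assumed format.
    static LocalDate parseDate(String text) {
        return LocalDate.parse(text.trim(), INVOICE_DATE);
    }

    // Assumes amounts look like "$93.50" or "$1,093.50".
    // Throws NumberFormatException if anything other than a decimal number remains.
    static BigDecimal parseAmount(String text) {
        String digits = text.trim().replace("$", "").replace(",", "");
        return new BigDecimal(digits);
    }

    public static void main(String[] args) {
        System.out.println(parseDate("January 25, 2016")); // prints 2016-01-25
        System.out.println(parseAmount("$93.50"));         // prints 93.50
    }
}
```

`BigDecimal` is used rather than `double` so that currency amounts keep their exact decimal value. Invoices in other locales or currencies would need a different formatter and cleanup rule.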
Many other open-source examples are available in the GitHub repository; you can download the code and run them right away. For more guidance and other ways to use templates for parsing and data extraction in Java, see the developer guide in the documentation. If you run into any difficulty, you can reach the support team for free on the forum at any time.