If you have a document and want to reuse its images in other documents, here is one solution. In this article, we will learn how to programmatically extract images from PDF, Excel, PowerPoint, and Word documents using Java.

Extract Images from Documents in Java

Image Extraction Java API

Parse Documents and Extract Data in Java

For the extraction of images, we will use GroupDocs.Parser for Java. This Java API supports the parsing of documents and extraction of images, text, and metadata from word-processing documents, spreadsheets, presentations, archives, and email documents. The following are the document formats supported by the Java API for image extraction.

Document Type | File Formats
Word Processing Documents | DOC, DOCX, DOCM, DOT, DOTX, DOTM, ODT, OTT, RTF
Spreadsheets | XLS, XLSX, XLSM, XLSB, XLT, XLTX, XLTM, ODS, OTS, XLA, XLAM, NUMBERS
Presentations | PPT, PPTX, PPTM, PPS, PPSX, PPSM, POT, POTX, POTM, ODP, OTP
Portable Documents | PDF
Emails | EML, EMLX, MSG
Archives | ZIP
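
The list above can also serve as a quick pre-check before a file is handed to the parser. The following is a minimal sketch in plain Java, with no GroupDocs calls involved. The class name SupportedFormats and the method isListed are hypothetical helpers introduced for illustration; they are not part of the GroupDocs.Parser API, and the extension set is simply copied from the table.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class SupportedFormats {
    // Extensions copied from the table above, in lower case (hypothetical helper data).
    private static final Set<String> IMAGE_EXTRACTION_FORMATS = new HashSet<>(Arrays.asList(
            "doc", "docx", "docm", "dot", "dotx", "dotm", "odt", "ott", "rtf",
            "xls", "xlsx", "xlsm", "xlsb", "xlt", "xltx", "xltm", "ods", "ots",
            "xla", "xlam", "numbers",
            "ppt", "pptx", "pptm", "pps", "ppsx", "ppsm", "pot", "potx", "potm",
            "odp", "otp",
            "pdf",
            "eml", "emlx", "msg",
            "zip"));

    /** Returns true if the file name ends with an extension listed in the table. */
    static boolean isListed(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false; // no extension at all
        }
        // Compare case-insensitively so that "report.DOCX" matches "docx".
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return IMAGE_EXTRACTION_FORMATS.contains(extension);
    }

    public static void main(String[] args) {
        System.out.println(isListed("report.DOCX")); // true
        System.out.println(isListed("photo.png"));   // false
    }
}
```

A check like this only looks at the file name, so it is a convenience filter rather than a guarantee that extraction will succeed.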

Before you start with the examples below, I recommend setting up the environment by downloading the latest version of the document parsing Java API from the downloads section. Alternatively, you can add the following configuration to your Maven-based Java application:

<repository>
	<id>GroupDocsJavaAPI</id>
	<name>GroupDocs Java API</name>
	<url>http://repository.groupdocs.com/repo/</url>
</repository>
<dependency>
	<groupId>com.groupdocs</groupId>
	<artifactId>groupdocs-parser</artifactId>
	<version>20.8</version> 
</dependency>

Extract Images from PDF Documents in Java

PDF Document to Extract Images

Follow these simple steps to get all images from the PDF document.

  1. Instantiate an object of the Parser class.
  2. Call the getImages method of the Parser class to get all the images.
  3. Iterate over the images; each one is a PageImageArea.
  4. Save each image using the save method of PageImageArea.

It’s done. See the full code below. Extracted images can be saved in BMP, GIF, JPEG, PNG, and WebP formats.
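
The four steps above can be sketched roughly as follows. This is an illustration rather than the original listing: the import paths, the ImageOptions and ImageFormat classes, and the null check on getImages are written as I recall them from the GroupDocs.Parser for Java documentation, so please verify them against the version you install. The class name ExtractAllImages, the helper outputName, the output folder, and the image-N.png naming scheme are all hypothetical.

```java
import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.PageImageArea;
import com.groupdocs.parser.options.ImageFormat;
import com.groupdocs.parser.options.ImageOptions;

import java.nio.file.Files;
import java.nio.file.Paths;

class ExtractAllImages {
    // Hypothetical naming scheme for the output files: image-0.png, image-1.png, ...
    static String outputName(String outputDir, int index) {
        return outputDir + "/image-" + index + ".png";
    }

    /** Saves every image of the document as PNG and returns how many were saved. */
    static int extractAll(String documentPath, String outputDir) throws Exception {
        Files.createDirectories(Paths.get(outputDir));
        // Step 1: instantiate the Parser; try-with-resources closes it afterwards.
        try (Parser parser = new Parser(documentPath)) {
            // Step 2: get all images of the document.
            Iterable<PageImageArea> images = parser.getImages();
            if (images == null) {
                // Assumption: null means image extraction is not supported for this file.
                return 0;
            }
            // Save as PNG; other formats are selected through ImageFormat.
            ImageOptions options = new ImageOptions(ImageFormat.Png);
            int count = 0;
            // Step 3: iterate over the images.
            for (PageImageArea image : images) {
                // Step 4: save each image to its own file.
                image.save(outputName(outputDir, count), options);
                count++;
            }
            return count;
        }
    }

    public static void main(String[] args) throws Exception {
        int saved = extractAll("path/document.pdf", "output");
        System.out.println("Saved " + saved + " image(s)");
    }
}
```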

These are the images retrieved from the PDF document using the above code.

Extracted Images from Document using Java

Extract Images from Word, Excel, PowerPoint Files in Java

Similarly, images can be extracted from word-processing files, spreadsheets, and presentations with the same code. The only thing you have to change is the source document path, including the right file extension.

Parser parser = new Parser("path/document.docx"); // Word Document
// Parser parser = new Parser("path/document.xlsx"); // Excel Spreadsheet
// Parser parser = new Parser("path/document.pptx"); // PowerPoint Presentation
// Parser parser = new Parser("path/document.pdf"); // PDF Document

Image Extraction from Specific Document Page in Java

You may want to extract images from a specific page rather than from the whole document. The code below demonstrates how to extract images from a particular page of a document in Java.
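
A minimal sketch of page-level extraction is given here. It assumes that getImages accepts a zero-based page index, that getFeatures().isImages() reports whether image extraction is supported, and that getDocumentInfo().getPageCount() returns the number of pages; this is how I recall the GroupDocs.Parser for Java API, so check the signatures against your installed version. The class name ExtractPageImages, the helper outputName, and the file naming scheme are hypothetical.

```java
import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.PageImageArea;
import com.groupdocs.parser.options.ImageFormat;
import com.groupdocs.parser.options.ImageOptions;

import java.nio.file.Files;
import java.nio.file.Paths;

class ExtractPageImages {
    // Hypothetical naming scheme: page-<pageIndex>-image-<imageIndex>.png
    static String outputName(String outputDir, int pageIndex, int imageIndex) {
        return outputDir + "/page-" + pageIndex + "-image-" + imageIndex + ".png";
    }

    /** Saves the images of a single page as PNG and returns how many were saved. */
    static int extractFromPage(String documentPath, int pageIndex, String outputDir)
            throws Exception {
        Files.createDirectories(Paths.get(outputDir));
        try (Parser parser = new Parser(documentPath)) {
            // Skip formats for which image extraction is not available.
            if (!parser.getFeatures().isImages()) {
                return 0;
            }
            // Reject page indexes outside the document (assumed to be zero-based).
            int pageCount = parser.getDocumentInfo().getPageCount();
            if (pageIndex < 0 || pageIndex >= pageCount) {
                throw new IllegalArgumentException(
                        "Page index " + pageIndex + " is outside 0.." + (pageCount - 1));
            }
            ImageOptions options = new ImageOptions(ImageFormat.Png);
            int count = 0;
            // Only the images located on the requested page are returned.
            for (PageImageArea image : parser.getImages(pageIndex)) {
                image.save(outputName(outputDir, pageIndex, count), options);
                count++;
            }
            return count;
        }
    }

    public static void main(String[] args) throws Exception {
        // Extract the images of the first page only.
        int saved = extractFromPage("path/document.pdf", 0, "output");
        System.out.println("Saved " + saved + " image(s) from page index 0");
    }
}
```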

Conclusion

Today, we learned how to extract images from a whole document, or from a specific page, of word-processing documents, spreadsheets, presentations, and PDFs in Java. The code is the same regardless of the file format. We just have to pass the right file path and name. That’s it.

See Also