If you want to reuse the images embedded in a document elsewhere, extracting them programmatically is one option. In this article, we will learn how to extract images from PDF, Excel, PowerPoint, and Word documents using Java.
- Image Extraction Java API
- Extract Images from PDF Documents in Java
- Extract Images from Word, Excel, PowerPoint Files in Java
- Image Extraction from Specific Document Page in Java
Image Extraction Java API
To extract images, we will use GroupDocs.Parser for Java. This Java API parses documents and extracts images, text, and metadata from word-processing documents, spreadsheets, presentations, archives, and email documents. The table below lists the document formats the API supports for image extraction.
Document Type | File Formats |
---|---|
Word Processing Documents | DOC, DOCX, DOCM, DOT, DOTX, DOTM, ODT, OTT, RTF |
Spreadsheets | XLS, XLSX, XLSM, XLSB, XLT, XLTX, XLTM, ODS, OTS, XLA, XLAM, NUMBERS |
Presentations | PPT, PPTX, PPTM, PPS, PPSX, PPSM, POT, POTX, POTM, ODP, OTP |
Portable Documents | PDF |
Emails | EML, EMLX, MSG |
Archives | ZIP |
Before trying the examples below, set up the environment: either download the latest version of the document parsing Java API from the downloads section, or add the following configuration to your Maven-based Java application:
<repository>
<id>GroupDocsJavaAPI</id>
<name>GroupDocs Java API</name>
<url>http://repository.groupdocs.com/repo/</url>
</repository>
<dependency>
<groupId>com.groupdocs</groupId>
<artifactId>groupdocs-parser</artifactId>
<version>20.8</version>
</dependency>
Extract Images from PDF Documents in Java
Follow these simple steps to get all images from the PDF document.
- Instantiate an object of the Parser class.
- Call the getImages method of the Parser class to get all the images.
- Iterate over the images, each of which is a PageImageArea.
- Save each image using the save method of PageImageArea.
It’s done. See the full code below. Extracted images can be saved in BMP, GIF, JPEG, PNG, and WebP formats.
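A minimal sketch of the four steps above might look like the following. It relies on the groupdocs-parser dependency from the Maven configuration shown earlier. The package names in the imports, the use of ImageOptions with ImageFormat.Png to choose the output format, the null check on the result of getImages, the input path, and the output file names are all assumptions made for illustration, so check them against the API reference for your version.

```java
import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.PageImageArea;
import com.groupdocs.parser.options.ImageFormat;
import com.groupdocs.parser.options.ImageOptions;

public class ExtractAllImages {
    public static void main(String[] args) throws Exception {
        // Hypothetical input path; replace it with your own document.
        String documentPath = "path/document.pdf";

        // Step 1: instantiate the Parser; try-with-resources closes it afterwards.
        try (Parser parser = new Parser(documentPath)) {
            // Step 2: get all the images in the document.
            Iterable<PageImageArea> images = parser.getImages();

            // Assumption: a null result means image extraction is not supported
            // for this document format.
            if (images == null) {
                System.out.println("Image extraction is not supported for this document.");
                return;
            }

            // Assumption: ImageOptions selects the output format (PNG here).
            ImageOptions options = new ImageOptions(ImageFormat.Png);

            // Steps 3 and 4: iterate over the images and save each one.
            int imageNumber = 0;
            for (PageImageArea image : images) {
                // Hypothetical output name: image-0.png, image-1.png, ...
                image.save("image-" + imageNumber + ".png", options);
                imageNumber++;
            }
            System.out.println("Saved " + imageNumber + " image(s).");
        }
    }
}
```

To save in another of the listed formats, the sketch would pass a different ImageFormat value and use the matching file extension.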
These are the images retrieved from the PDF document using the above code.
Extract Images from Word, Excel, PowerPoint Files in Java
Similarly, images can be extracted from word-processing files, spreadsheets, and presentations with the same code. The only thing you have to change is the source document path, including the right file extension.
Parser parser = new Parser("path/document.docx"); // Word Document
// Parser parser = new Parser("path/document.xlsx"); // Excel Spreadsheet
// Parser parser = new Parser("path/document.pptx"); // PowerPoint Presentation
// Parser parser = new Parser("path/document.pdf"); // PDF Document
Image Extraction from Specific Document Page in Java
If you only want the images from a specific page rather than the whole document, you can extract them from that page alone. The following code demonstrates how to extract images from a particular page of a document in Java.
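As a hedged sketch of page-level extraction, the example below assumes that the Parser class offers a getImages overload taking a zero-based page index, that getFeatures().isImages() reports whether image extraction is available, and that getDocumentInfo().getPageCount() returns the number of pages. The package names, the input path, the chosen page index, and the output file names are likewise illustrative assumptions rather than details taken from this article.

```java
import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.PageImageArea;
import com.groupdocs.parser.options.ImageFormat;
import com.groupdocs.parser.options.ImageOptions;

public class ExtractPageImages {
    public static void main(String[] args) throws Exception {
        // Hypothetical input path and page; replace them with your own values.
        String documentPath = "path/document.pdf";
        int pageIndex = 1; // assumed to be zero-based, so this is the second page

        try (Parser parser = new Parser(documentPath)) {
            // Assumption: the features object reports image-extraction support.
            if (!parser.getFeatures().isImages()) {
                System.out.println("Image extraction is not supported for this document.");
                return;
            }

            // Guard against asking for a page the document does not have.
            int pageCount = parser.getDocumentInfo().getPageCount();
            if (pageIndex >= pageCount) {
                System.out.println("The document has only " + pageCount + " page(s).");
                return;
            }

            // Assumption: ImageOptions selects the output format (PNG here).
            ImageOptions options = new ImageOptions(ImageFormat.Png);

            // Extract and save only the images found on the requested page.
            int imageNumber = 0;
            for (PageImageArea image : parser.getImages(pageIndex)) {
                // Hypothetical output name: page-1-image-0.png, ...
                image.save("page-" + pageIndex + "-image-" + imageNumber + ".png", options);
                imageNumber++;
            }
            System.out.println("Saved " + imageNumber + " image(s) from page index " + pageIndex + ".");
        }
    }
}
```

Wrapping the page-level call in a loop from 0 to the page count would visit every page in turn, which can be handy when the output names need to record the page each image came from.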
Conclusion
Today, we learned how to extract images from a whole document, or from a specific page, of word-processing documents, spreadsheets, presentations, and PDF files in Java. The code is the same regardless of the file format. We just have to pass the right file path and name. That’s it.