Extract Images from PDF, Excel, PPT, Word Documents in C#

In the previous post, we discussed how to extract images from documents in Java. Today, we will achieve the same objective in C#. No worries if you have not read the previous post. In this article, we will learn how to programmatically extract images from PDF, Excel, PowerPoint, and Word documents in a C# application using a document parsing .NET API.

The following topics are covered here:

Image, Text, and Metadata Extraction .NET API
Image Extraction from PDF documents
Extract Images from Word, Excel, PowerPoint documents
Extract Image from Specific Page
Supported Formats for Image Extraction

Image, Text, and Metadata Extraction .NET API

Parse Documents and Extract Data in .NET

GroupDocs.Parser for .NET is a document parsing and data extraction API for .NET. It supports parsing documents and extracting images, text, and metadata from word-processing documents, spreadsheets, presentations, archives, and email documents. The document formats that the API supports for image extraction are listed at the end of this article.

We will use this API throughout the article, so I recommend downloading its binaries or installing it from NuGet to prepare your environment.

Extract Images from PDF Documents in C#

You can retrieve all the images from any PDF document by following these steps.

Instantiate a Parser class object with the source document.
Call the GetImages method of the Parser class to get the collection of all the images as PageImageArea objects.
Iterate over the collection to access each image.
Save each image to disk using the Save method of PageImageArea.

Extracted images can be saved in BMP, GIF, JPEG, PNG, and WebP formats.
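The steps above can be sketched as follows. This is a minimal example rather than the original listing: the input path sample.pdf and the image-N.png output names are placeholders, and it assumes the GroupDocs.Parser NuGet package is referenced by the project.

```csharp
using System;
using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Options;

class ExtractImagesFromPdf
{
    static void Main()
    {
        // Placeholder path: point this at your own document.
        using (Parser parser = new Parser("sample.pdf"))
        {
            // GetImages returns null when image extraction is not supported for the document.
            IEnumerable<PageImageArea> images = parser.GetImages();
            if (images == null)
            {
                Console.WriteLine("Image extraction is not supported for this document.");
                return;
            }

            // Save every image as PNG; the other ImageFormat values select the other output formats.
            ImageOptions options = new ImageOptions(ImageFormat.Png);
            int imageNumber = 0;
            foreach (PageImageArea image in images)
            {
                // Placeholder naming scheme: image-0.png, image-1.png, ...
                image.Save($"image-{imageNumber}.png", options);
                imageNumber++;
            }

            Console.WriteLine($"Saved {imageNumber} image(s).");
        }
    }
}
```

To write a different format, pass another ImageFormat value to ImageOptions and change the file extension to match.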

Extracted Images from Document using GroupDocs.Parser

Image Extraction from Word, Excel, PowerPoint Files in C#

Image extraction is not restricted to the PDF format. The same code can extract all the images from word-processing documents, spreadsheets, and presentations. Just change the source document path and file extension, and the document will be parsed so that all of its images are extracted and saved to disk.

using (Parser parser = new Parser("path/document.docx")) // Word document
// using (Parser parser = new Parser("path/document.xlsx")) // Excel spreadsheet
// using (Parser parser = new Parser("path/document.pptx")) // Presentation
// using (Parser parser = new Parser("path/document.pdf")) // PDF document
{
    // Extract and save the images exactly as for a PDF document.
}
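Because only the path changes, the extraction logic can be wrapped in one method and reused for every document type. The sketch below assumes this; the SaveAllImages helper, the extracted-images output folder, and the file naming scheme are hypothetical and not part of the GroupDocs.Parser API.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Options;

class ExtractImagesFromOfficeFiles
{
    // Hypothetical helper: saves every image found in sourcePath into outputDir
    // and returns the number of images written.
    static int SaveAllImages(string sourcePath, string outputDir)
    {
        Directory.CreateDirectory(outputDir);

        using (Parser parser = new Parser(sourcePath))
        {
            IEnumerable<PageImageArea> images = parser.GetImages();
            if (images == null)
            {
                // Image extraction is not supported for this document.
                return 0;
            }

            ImageOptions options = new ImageOptions(ImageFormat.Png);
            // Keep the source extension in the name so document.docx and document.xlsx do not collide.
            string prefix = Path.GetFileName(sourcePath);
            int count = 0;
            foreach (PageImageArea image in images)
            {
                image.Save(Path.Combine(outputDir, $"{prefix}-{count}.png"), options);
                count++;
            }
            return count;
        }
    }

    static void Main()
    {
        // The same paths as in the snippet above; replace them with real files.
        string[] sources =
        {
            "path/document.docx",
            "path/document.xlsx",
            "path/document.pptx",
            "path/document.pdf"
        };

        foreach (string source in sources)
        {
            int saved = SaveAllImages(source, "extracted-images");
            Console.WriteLine($"{source}: {saved} image(s) saved");
        }
    }
}
```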

Extract Images from Specific Document Page in C#

If you want to extract images from a specific page of a document, follow these steps.

Get the information about the document using the GetDocumentInfo method.
Read the total PageCount from the document information.
Call the GetImages(pageIndex) method with the index of your target page.
Iterate over the returned image collection and save each image using the Save method.
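A minimal sketch of these steps is shown below. The path sample.pdf, the chosen page index, and the output file names are placeholders; the example assumes page indexes are zero-based, so the first page is index 0.

```csharp
using System;
using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Options;

class ExtractImagesFromPage
{
    static void Main()
    {
        // Placeholder: zero-based index of the page to extract images from.
        int pageIndex = 0;

        using (Parser parser = new Parser("sample.pdf"))
        {
            // Read the page count so the requested index can be validated.
            IDocumentInfo info = parser.GetDocumentInfo();
            if (pageIndex < 0 || pageIndex >= info.PageCount)
            {
                Console.WriteLine($"Page index {pageIndex} is out of range (pages: {info.PageCount}).");
                return;
            }

            // Extract images from the selected page only.
            IEnumerable<PageImageArea> images = parser.GetImages(pageIndex);
            if (images == null)
            {
                Console.WriteLine("Image extraction is not supported for this document.");
                return;
            }

            ImageOptions options = new ImageOptions(ImageFormat.Png);
            int imageNumber = 0;
            foreach (PageImageArea image in images)
            {
                // Placeholder naming scheme: page-0-image-0.png, page-0-image-1.png, ...
                image.Save($"page-{pageIndex}-image-{imageNumber}.png", options);
                imageNumber++;
            }
        }
    }
}
```

Looping pageIndex from 0 to PageCount - 1 extends the same pattern to a page-by-page export.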

Supported Formats for Image Extraction in C#

The GroupDocs.Parser for .NET API supports image extraction from the following document formats.

Document Type	File Formats
Word Processing Documents	DOC, DOCX, DOCM, DOT, DOTX, DOTM, ODT, OTT, RTF
Spreadsheets	XLS, XLSX, XLSM, XLSB, XLT, XLTX, XLTM, ODS, OTS, XLA, XLAM, NUMBERS
Presentations	PPT, PPTX, PPTM, PPS, PPSX, PPSM, POT, POTX, POTM, ODP, OTP
Portable Documents	PDF
Emails	EML, EMLX, MSG
Archives	ZIP
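Rather than hard-coding this table, an application can ask the API at run time whether a given file supports image extraction. The sketch below assumes the Features.Images property of the Parser class reports this; the file paths are taken from the command line and are up to you.

```csharp
using System;
using GroupDocs.Parser;

class CheckImageSupport
{
    static void Main(string[] args)
    {
        // Each command-line argument is treated as a path to a document.
        foreach (string path in args)
        {
            using (Parser parser = new Parser(path))
            {
                // Assumption: Features.Images is true when GetImages can be used for this document.
                Console.WriteLine($"{path}: image extraction supported = {parser.Features.Images}");
            }
        }
    }
}
```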

More about GroupDocs.Parser

Documentation
Source Code Examples
API Reference
Family (On-Premise APIs | Cloud APIs | Free Online App)

Let’s talk some more @ Free Support Forum
