We’re happy to announce GroupDocs.Parser for Python via .NET 25.12, the first release of the product, available as of December 2025. This initial version brings the .NET parsing engine to Python developers and supports extraction of text, images, attachments, barcodes, OCR content, and structured data from a wide range of document formats.

What’s new in this release

Major features

  • Text extraction – Retrieve plain or formatted text from PDFs, Office documents, emails, e‑books, archives and more.
  • Advanced search – Page‑level access with case‑sensitive, whole‑word, and regular‑expression search options.
  • Structured content parsing – Detect and extract document hierarchy such as headings, paragraphs, tables and custom text areas.
  • Template parsing – Use predefined templates to pull strongly‑typed fields from invoices, receipts and other business documents.
  • Image extraction – Pull embedded raster images from supported document and image formats.
  • Attachment extraction – Export file attachments embedded in documents.
  • Barcode scanning – Detect and read barcodes present in documents.
  • OCR support – Perform optical character recognition on scanned PDFs and raster images, with optional spell‑checking.
  • Metadata extraction – Access document properties like author, creation date, and custom metadata.
  • Table of contents extraction – Retrieve TOC structures from supported formats.
  • Hyperlink extraction – Extract hyperlinks (currently limited to a subset of formats).

Supported document formats

  • Word processing – DOC, DOCX, RTF, TXT, ODT
  • PDF & markup – PDF, HTML/MHTML, Markdown, XML
  • Spreadsheets – XLS, XLSX, ODS, CSV
  • Presentations – PPT, PPTX, ODP
  • Email & notes – PST, OST, EML, MSG, ONE
  • eBooks & web content – EPUB, MOBI, AZW3, CHM, FB2
  • Images – JPEG, PNG, TIFF, GIF, BMP, SVG
  • Archives & containers – ZIP, RAR, 7Z, TAR, GZ, BZ2
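
Before handing files to the parser, it can be useful to pre-filter them against the format list above. The following is a minimal sketch using only the Python standard library. The extension set is transcribed from the list above, the mapping from format names to file extensions (for example Markdown to .md, JPEG to .jpg/.jpeg) is an assumption, and the helper names is_listed_format and filter_listed are hypothetical, not part of the GroupDocs.Parser API.

```python
from pathlib import Path

# Extensions transcribed from the format list in this announcement.
# The format-name-to-extension mapping is an assumption of this sketch.
LISTED_EXTENSIONS = {
    ".doc", ".docx", ".rtf", ".txt", ".odt",              # word processing
    ".pdf", ".html", ".htm", ".mhtml", ".md", ".xml",     # PDF & markup
    ".xls", ".xlsx", ".ods", ".csv",                      # spreadsheets
    ".ppt", ".pptx", ".odp",                              # presentations
    ".pst", ".ost", ".eml", ".msg", ".one",               # email & notes
    ".epub", ".mobi", ".azw3", ".chm", ".fb2",            # eBooks & web content
    ".jpg", ".jpeg", ".png", ".tif", ".tiff", ".gif", ".bmp", ".svg",  # images
    ".zip", ".rar", ".7z", ".tar", ".gz", ".bz2",         # archives
}


def is_listed_format(path):
    """Return True if the file's last extension is in the list above."""
    return Path(path).suffix.lower() in LISTED_EXTENSIONS


def filter_listed(paths):
    """Keep only the paths whose extension is in the list above."""
    return [p for p in paths if is_listed_format(p)]
```

The check looks only at the final extension, so a name such as backup.tar.gz is matched through .gz. It says nothing about whether a given feature (text, images, OCR and so on) is available for that format.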

Platform support

  • Windows, Linux, and macOS
  • Python 3.5+

Installation

  1. Download the appropriate WHL package for your platform from the GroupDocs Releases page:

    • Windows x64
    • Windows x32
    • Linux
    • macOS
    • macOS ARM
  2. Install the package with pip (replace * with the actual file name you downloaded):

pip install groupdocs_parser_net-25.12-*.whl
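
If you are unsure which package from step 1 matches your environment, the running interpreter can tell you. The sketch below uses only the standard library and maps the interpreter to one of the labels listed in step 1. The function name wheel_variant is hypothetical, and the mapping rules (for example treating any 64-bit Windows interpreter as "Windows x64") are assumptions rather than documented GroupDocs behavior.

```python
import platform
import struct


def wheel_variant(system=None, machine=None, pointer_bits=None):
    """Map an interpreter to one of the package labels from step 1.

    All arguments default to the values of the running interpreter;
    they are parameters only so the mapping can be checked by hand.
    """
    system = system or platform.system()
    machine = (machine or platform.machine()).lower()
    # Pointer size of the interpreter itself (32 or 64), which is what
    # matters for binary wheels, not the bitness of the operating system.
    bits = pointer_bits or struct.calcsize("P") * 8

    if system == "Windows":
        return "Windows x64" if bits == 64 else "Windows x32"
    if system == "Linux":
        return "Linux"
    if system == "Darwin":
        # Apple Silicon interpreters report arm64.
        return "macOS ARM" if machine in ("arm64", "aarch64") else "macOS"
    raise RuntimeError("No package listed for platform: " + system)
```

Calling print(wheel_variant()) in the Python environment where you plan to install the package prints the label to look for on the download page.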

Getting started

The following snippet shows how to extract plain text from a PDF file:

from groupdocs.parser import Parser

# Create a Parser instance for your document
with Parser("sample.pdf") as parser:
    # Extract text from the document
    text = parser.GetText()
    
    # Print all extracted text to the console
    print(text)

For more complex scenarios, such as using templates, OCR, or barcode scanning, refer to the API reference and the code samples repository linked below.
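
When the same extraction has to run over many files, it helps to keep one failing document from stopping the whole run. The following is a minimal sketch of that pattern. The default extractor simply mirrors the getting-started snippet above (Parser used as a context manager, then GetText()) and has not been checked against the API reference. The names extract_many and _extract_with_parser are hypothetical helpers introduced for illustration.

```python
def _extract_with_parser(path):
    """Default extractor: mirrors the getting-started snippet above.

    Assumption: the Parser usage is copied from that snippet as-is.
    The import is local so this module loads even without the package.
    """
    from groupdocs.parser import Parser

    with Parser(path) as parser:
        return parser.GetText()


def extract_many(paths, extract=_extract_with_parser):
    """Run `extract` on each path and collect results and failures.

    Returns a pair of dicts: path -> extracted text, and
    path -> error description for the files that raised.
    """
    texts, errors = {}, {}
    for path in paths:
        try:
            texts[path] = extract(path)
        except Exception as exc:
            # Record the failure for this file and continue with the rest.
            errors[path] = "{}: {}".format(type(exc).__name__, exc)
    return texts, errors
```

Because the extractor is passed in as a plain callable, you can swap in a different call (for example one that uses OCR) without changing the loop.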

How to get the update

  • Direct download – Choose the WHL package matching your OS from the GroupDocs Releases page.
  • pip upgrade – Once a newer version is published, upgrade with:
pip install --upgrade groupdocs_parser_net
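
To confirm which version ended up in your environment after an install or upgrade, you can query the package metadata from Python. This is a minimal sketch using importlib.metadata from the standard library, which requires Python 3.8 or later. The distribution name groupdocs_parser_net is taken from the pip command above, and the helper names installed_version and version_tuple are hypothetical.

```python
from importlib import metadata


def installed_version(dist_name="groupdocs_parser_net"):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Turn a dotted version such as '25.12' into (25, 12) for comparison.

    Non-numeric parts are skipped, so this is only suitable for plain
    dotted release numbers.
    """
    return tuple(int(part) for part in version.split(".") if part.isdigit())
```

For example, version_tuple(installed_version()) >= (25, 12) checks that at least this release is installed, provided installed_version() did not return None.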

Resources