Extract Text from ZIP/RAR Archives using GroupDocs.Parser

Introduction

When your business needs to ingest large batches of invoices, legal documents, or email exports that arrive as compressed ZIP or RAR files, the traditional approach is to unzip them to disk, open each file with a separate reader, and then discard the temporary files. This round‑trip adds costly I/O, complicates cleanup, and makes handling nested archives a nightmare.

GroupDocs.Parser for .NET eliminates those pain points. It lets you open an archive directly, enumerate every entry, and extract raw text (and metadata) completely in memory. In this article you will learn how to:

Install the Parser NuGet package.
Pull text from a flat archive in a single pass.
Recursively walk nested ZIP/RAR files.
Apply best‑practice settings for robust processing.

Why In‑Memory Archive Parsing Matters

Processing archives in memory gives you:

Zero temporary files – no disk clutter, no leftover files.
Speed – avoid the extra read/write cycle for each entry.
Scalability – handle large archives or cloud‑based streams where a file system may not be available.

Prerequisites

.NET 6.0 or later.
GroupDocs.Parser for .NET (latest version) – a temporary license is available for a free evaluation.
A ZIP or RAR archive containing supported documents (PDF, DOCX, TXT, etc.).

Installation

dotnet add package GroupDocs.Parser

Add the required namespaces:

using System;
using System.Collections.Generic;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Exceptions;
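If you have obtained a temporary license, it would typically be applied once at startup, before the first Parser instance is created. The following is a minimal sketch rather than a definitive setup: the LicenseSetup class and the license path are illustrative placeholders, and it assumes the License class in the GroupDocs.Parser namespace with its SetLicense(string) method.

```csharp
using System;
using System.IO;
using GroupDocs.Parser;

static class LicenseSetup
{
    // Applies a license file if it exists; otherwise the library stays in evaluation mode.
    // licensePath is a placeholder supplied by the caller, not a path defined by the library.
    public static bool TryApply(string licensePath)
    {
        if (!File.Exists(licensePath))
        {
            Console.WriteLine("License file not found; continuing in evaluation mode.");
            return false;
        }

        License license = new License();
        license.SetLicense(licensePath);
        return true;
    }
}
```

Calling LicenseSetup.TryApply with your own path at the top of Main keeps licensing separate from the extraction logic.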

Step 1 – Open the Archive

The first step is to create a Parser instance that points at the archive file. GetContainer() returns a collection of ContainerItem objects – one per entry inside the archive.

// Path to the archive you want to scan
string archivePath = "./SampleDocs/InvoicesArchive.zip";

using (Parser parser = new Parser(archivePath))
{
    // Retrieve every file (or nested archive) inside the container
    IEnumerable<ContainerItem> attachments = parser.GetContainer();

    if (attachments == null)
    {
        Console.WriteLine("Container extraction is not supported for this file.");
        return;
    }

    // Hand off the collection to a helper that extracts text/metadata
    ExtractDataFromAttachments(attachments);
}

What’s happening:

The Parser constructor loads the archive without extracting it to disk.
GetContainer() reads the archive's entry list and gives you ContainerItem objects you can work with; it returns null when the file is not a supported container.
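Because the archive does not have to live on disk, the same pattern should also work when the bytes arrive as a stream, for example from blob storage or an HTTP upload. The sketch below assumes the Parser(Stream) constructor overload and a readable, seekable stream; the StreamArchiveReader class and ListEntries method are illustrative names, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;

static class StreamArchiveReader
{
    // Returns the path of every entry in an archive supplied as a stream.
    // The caller owns the stream and is responsible for disposing it.
    public static List<string> ListEntries(Stream archiveStream)
    {
        var paths = new List<string>();

        using (Parser parser = new Parser(archiveStream))
        {
            IEnumerable<ContainerItem> items = parser.GetContainer();
            if (items == null)
            {
                // Container extraction is not supported for this input
                return paths;
            }

            foreach (ContainerItem item in items)
            {
                paths.Add(item.FilePath);
            }
        }

        return paths;
    }
}
```

A caller might pass a MemoryStream holding downloaded bytes, or a FileStream when the archive happens to be on disk.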

Step 2 – Process Each Entry

ExtractDataFromAttachments walks the ContainerItem list, prints basic metadata, detects nested archives, and extracts text from regular documents. The method is completely reusable – call it once for a top‑level archive and again for any nested archive you discover.

/// <summary>
/// Recursively extracts metadata and plain‑text from each item in an archive.
/// </summary>
static void ExtractDataFromAttachments(IEnumerable<ContainerItem> attachments)
{
    foreach (ContainerItem item in attachments)
    {
        // Print a quick line with file name and size (optional)
        Console.WriteLine($"File: {item.FilePath} | Size: {item.Size} bytes");

        try
        {
            // Each ContainerItem can open its own Parser instance
            using (Parser itemParser = item.OpenParser())
            {
                if (itemParser == null)
                {
                    // The item is not a supported document – skip it
                    continue;
                }

                // Detect nested archives by extension (case‑insensitive)
                bool isArchive = item.FilePath.EndsWith(".zip", StringComparison.OrdinalIgnoreCase) ||
                                 item.FilePath.EndsWith(".rar", StringComparison.OrdinalIgnoreCase);

                if (isArchive)
                {
                    // Recursively process the inner archive
                    IEnumerable<ContainerItem>? nested = itemParser.GetContainer();
                    if (nested != null)
                    {
                        ExtractDataFromAttachments(nested);
                    }
                }
                else
                {
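                    // Regular document – extract its raw text
                    using (TextReader reader = itemParser.GetText())
                    {
                        // GetText() returns null when text extraction is not supported
                        if (reader == null)
                        {
                            Console.WriteLine($"Text extraction not supported: {item.FilePath}");
                            continue;
                        }

                        string text = reader.ReadToEnd();
                        Console.WriteLine($"Extracted {text.Length} characters from {item.FilePath}");
                        // Here you could store `text` in a database, index it, etc.
                    }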
                }
            }
        }
        catch (UnsupportedDocumentFormatException)
        {
            // The file type is not supported by GroupDocs.Parser – ignore gracefully
            Console.WriteLine($"Skipping unsupported format: {item.FilePath}");
        }
    }
}
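The extension check in the method above could be factored into a small helper so that supporting another archive type means changing one set. This is a sketch using only the .NET base library; ArchiveNames and IsArchive are hypothetical names, and the extension list simply mirrors the two formats discussed here.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class ArchiveNames
{
    // Extensions treated as nested archives; compared case-insensitively.
    private static readonly HashSet<string> ArchiveExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".zip", ".rar" };

    // Returns true when the entry path ends with a known archive extension.
    public static bool IsArchive(string filePath)
    {
        if (string.IsNullOrEmpty(filePath))
        {
            return false;
        }

        return ArchiveExtensions.Contains(Path.GetExtension(filePath));
    }
}
```

Inside the loop, the two EndsWith calls would then become a single ArchiveNames.IsArchive(item.FilePath) call.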

Key Points

Metadata access – item.Name, item.FilePath, and item.Size describe each entry, and item.Metadata exposes any additional metadata items, all without reading the file contents.
Recursive handling – The same method calls itself when it encounters another ZIP/RAR, so nested archives are handled to whatever depth the call stack and available memory allow.
Error resilience – UnsupportedDocumentFormatException is caught so a single bad file won’t abort the whole run.
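When archives come from untrusted sources, unbounded recursion can be risky, so one option is to cap the nesting depth. The following sketch counts the regular documents in an archive while descending at most maxDepth levels; BoundedArchiveWalker, CountDocuments, and the maxDepth parameter are illustrative names, and the code relies only on the ContainerItem and Parser members already used in this article.

```csharp
using System;
using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Exceptions;

static class BoundedArchiveWalker
{
    // Counts non-archive entries, opening nested archives only down to maxDepth levels.
    public static int CountDocuments(IEnumerable<ContainerItem> items, int maxDepth, int depth = 0)
    {
        int count = 0;

        foreach (ContainerItem item in items)
        {
            bool isArchive = item.FilePath.EndsWith(".zip", StringComparison.OrdinalIgnoreCase) ||
                             item.FilePath.EndsWith(".rar", StringComparison.OrdinalIgnoreCase);

            if (!isArchive)
            {
                count++;
                continue;
            }

            if (depth >= maxDepth)
            {
                // Too deep: report the archive and do not open it
                Console.WriteLine($"Skipping nested archive beyond depth {maxDepth}: {item.FilePath}");
                continue;
            }

            try
            {
                using (Parser nestedParser = item.OpenParser())
                {
                    IEnumerable<ContainerItem> nested = nestedParser?.GetContainer();
                    if (nested != null)
                    {
                        count += CountDocuments(nested, maxDepth, depth + 1);
                    }
                }
            }
            catch (UnsupportedDocumentFormatException)
            {
                Console.WriteLine($"Skipping unsupported archive: {item.FilePath}");
            }
        }

        return count;
    }
}
```

With maxDepth set to 0 only the top-level entries are counted; a value of 2 or 3 is probably enough for most real-world document batches.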

Step 3 – Putting It All Together

Below is a minimal, copy‑pasteable program that combines the two snippets above. It demonstrates a full end‑to‑end flow: open, process, and report.

using System;
using System.Collections.Generic;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Exceptions;

class ArchiveTextExtractor
{
    static void Main(string[] args)
    {
        string archivePath = args.Length > 0 ? args[0] : "./SampleDocs/InvoicesArchive.zip";
        using (Parser parser = new Parser(archivePath))
        {
            IEnumerable<ContainerItem> attachments = parser.GetContainer();
            if (attachments == null)
            {
                Console.WriteLine("Container extraction is not supported for this file.");
                return;
            }
            ExtractDataFromAttachments(attachments);
        }
    }

    static void ExtractDataFromAttachments(IEnumerable<ContainerItem> attachments)
    {
        foreach (ContainerItem item in attachments)
        {
            Console.WriteLine($"File: {item.FilePath} | Size: {item.Size} bytes");
            try
            {
                using (Parser itemParser = item.OpenParser())
                {
                    if (itemParser == null) continue;

                    bool isArchive = item.FilePath.EndsWith(".zip", StringComparison.OrdinalIgnoreCase) ||
                                     item.FilePath.EndsWith(".rar", StringComparison.OrdinalIgnoreCase);

                    if (isArchive)
                    {
                        var nested = itemParser.GetContainer();
                        if (nested != null) ExtractDataFromAttachments(nested);
                    }
                    else
                    {
                        using (TextReader reader = itemParser.GetText())
                        {
                            // GetText() returns null when text extraction is not supported
                            if (reader == null) continue;

                            string text = reader.ReadToEnd();
                            Console.WriteLine($"Extracted {text.Length} chars from {item.FilePath}");
                        }
                    }
                }
            }
            catch (UnsupportedDocumentFormatException)
            {
                Console.WriteLine($"Unsupported format: {item.FilePath}");
            }
        }
    }
}

Run the program with the path to your archive:

dotnet run -- ./Data/LegalDocs.zip

Best Practices & Tips

Extract only what you need – If you only need text, call GetText() and skip heavier methods such as GetImages().
Large archives – Process items sequentially as shown; avoid loading all texts into memory at once.
Performance – Skip nested archives you don’t need by checking the file extension before recursing.
Error handling – Always catch UnsupportedDocumentFormatException; many corporate archives contain binaries that the parser cannot read.
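For the large-archive tip, one way to keep your own memory use flat is to consume the TextReader line by line instead of calling ReadToEnd(). The sketch below uses only the .NET base library; TextStats and Count are hypothetical names, and whether the parser itself buffers the text internally is outside what this code controls.

```csharp
using System;
using System.IO;

static class TextStats
{
    // Reads the text one line at a time and returns the line and character totals.
    // Line terminators are not included in the character count.
    public static (int Lines, long Characters) Count(TextReader reader)
    {
        int lines = 0;
        long characters = 0;
        string line;

        while ((line = reader.ReadLine()) != null)
        {
            lines++;
            characters += line.Length;
        }

        return (lines, characters);
    }
}
```

In the extraction loop you would pass the reader returned by itemParser.GetText() after checking it for null, and replace the counting with whatever per-line work you need, such as writing to an index.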

Conclusion

GroupDocs.Parser for .NET provides a clean, in‑memory way to read every document inside ZIP or RAR archives, no matter how deeply they are nested. With just a few lines of code you can replace complex unzip‑plus‑parse pipelines, reduce I/O overhead, and build reliable document‑ingestion services.

Next steps

Explore document comparison or metadata extraction features.
Learn how to extract images from archived files with the same API.
Integrate the extracted text into a search index or AI pipeline.
