Introduction
When your business needs to ingest large batches of invoices, legal documents, or email exports that arrive as compressed ZIP or RAR files, the traditional approach is to unzip them to disk, open each file with a separate reader, and then discard the temporary files. This round‑trip adds costly I/O, complicates cleanup, and makes handling nested archives a nightmare.
GroupDocs.Parser for .NET eliminates those pain points. It lets you open an archive directly, enumerate every entry, and extract raw text (and metadata) completely in memory. In this article you will learn how to:
- Install the Parser NuGet package.
- Pull text from a flat archive in a single pass.
- Recursively walk nested ZIP/RAR files.
- Apply best‑practice settings for robust processing.
Why In‑Memory Archive Parsing Matters
Processing archives in memory gives you:
- Zero temporary files – no disk clutter, no leftover files.
- Speed – avoid the extra read/write cycle for each entry.
- Scalability – handle large archives or cloud‑based streams where a file system may not be available.
Prerequisites
- .NET 6.0 or later.
- GroupDocs.Parser for .NET (latest version) – a temporary license is available for free evaluation.
- A ZIP or RAR archive containing supported documents (PDF, DOCX, TXT, etc.).
Installation
dotnet add package GroupDocs.Parser
Add the required namespaces:
using System;
using System.Collections.Generic;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Exceptions;
Step 1 – Open the Archive
The first step is to create a Parser instance that points at the archive file. GetContainer() returns a collection of ContainerItem objects, one per entry inside the archive, or null when container extraction is not supported for the file.
// Path to the archive you want to scan
string archivePath = "./SampleDocs/InvoicesArchive.zip";
using (Parser parser = new Parser(archivePath))
{
    // Retrieve every file (or nested archive) inside the container
    IEnumerable<ContainerItem> attachments = parser.GetContainer();
    if (attachments == null)
    {
        Console.WriteLine("Archive is empty or could not be read.");
        return;
    }
    // Hand off the collection to a helper that extracts text/metadata
    ExtractDataFromAttachments(attachments);
}
What’s happening:
- The Parser constructor loads the archive without extracting it to disk.
- GetContainer() lazily reads the archive’s directory and gives you ContainerItem objects you can work with.
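When the archive arrives as a stream rather than a file path (an upload or a blob download, for example), the same flow should work through the Parser constructor that accepts a Stream. The sketch below is a minimal illustration, assuming that overload is available in your version and that the stream is readable and seekable. The class name StreamArchiveExample, the helper CountEntries, and the use of File.OpenRead as a stand-in stream source are hypothetical choices made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;

static class StreamArchiveExample
{
    // Hypothetical helper: lists and counts the entries of an archive supplied as a stream.
    // Returns -1 when GetContainer() returns null for the input.
    public static int CountEntries(Stream archiveStream)
    {
        // Assumption: the Parser(Stream) constructor overload is available.
        using (Parser parser = new Parser(archiveStream))
        {
            IEnumerable<ContainerItem> items = parser.GetContainer();
            if (items == null) return -1;

            int count = 0;
            foreach (ContainerItem item in items)
            {
                Console.WriteLine(item.FilePath);
                count++;
            }
            return count;
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("Usage: pass the path to an archive.");
            return;
        }

        // File.OpenRead stands in for any readable stream source.
        using (Stream stream = File.OpenRead(args[0]))
        {
            Console.WriteLine($"Entries: {CountEntries(stream)}");
        }
    }
}
```

Keeping the helper stream-based means the caller decides where the bytes come from, so the parsing code does not change when the source moves from local disk to cloud storage.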
Step 2 – Process Each Entry
ExtractDataFromAttachments walks the ContainerItem list, prints basic metadata, detects nested archives, and extracts text from regular documents. The method is completely reusable – call it once for a top‑level archive and again for any nested archive you discover.
/// <summary>
/// Recursively extracts metadata and plain‑text from each item in an archive.
/// </summary>
static void ExtractDataFromAttachments(IEnumerable<ContainerItem> attachments)
{
    foreach (ContainerItem item in attachments)
    {
        // Print a quick line with file name and size (optional)
        Console.WriteLine($"File: {item.FilePath} | Size: {item.Size} bytes");
        try
        {
            // Each ContainerItem can open its own Parser instance
            using (Parser itemParser = item.OpenParser())
            {
                if (itemParser == null)
                {
                    // The item is not a supported document – skip it
                    continue;
                }
                // Detect nested archives by extension (case‑insensitive)
                bool isArchive = item.FilePath.EndsWith(".zip", StringComparison.OrdinalIgnoreCase) ||
                                 item.FilePath.EndsWith(".rar", StringComparison.OrdinalIgnoreCase);
                if (isArchive)
                {
                    // Recursively process the inner archive
                    IEnumerable<ContainerItem>? nested = itemParser.GetContainer();
                    if (nested != null)
                    {
                        ExtractDataFromAttachments(nested);
                    }
                }
                else
                {
                    // Regular document – extract its raw text
                    using (TextReader reader = itemParser.GetText())
                    {
                        // GetText() returns null when text extraction is not supported
                        string text = reader?.ReadToEnd() ?? string.Empty;
                        Console.WriteLine($"Extracted {text.Length} characters from {item.FilePath}");
                        // Here you could store `text` in a database, index it, etc.
                    }
                }
            }
        }
        catch (UnsupportedDocumentFormatException)
        {
            // The file type is not supported by GroupDocs.Parser – ignore gracefully
            Console.WriteLine($"Skipping unsupported format: {item.FilePath}");
        }
    }
}
Key Points
- Metadata access – item.Metadata gives you file name, size, creation date, etc., without reading the file contents.
- Recursive handling – The same method calls itself when it encounters another ZIP/RAR, giving you unlimited nesting support.
- Error resilience – UnsupportedDocumentFormatException is caught so a single bad file won’t abort the whole run.
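Unbounded recursion is convenient, but with untrusted input a deeply nested archive can exhaust the call stack or waste time. One possible guard is to pass the current depth along and stop at a cap. The sketch below reuses only the calls shown in this article. The class name DepthLimitedWalker, the Walk method, and the MaxDepth value of 5 are hypothetical and chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Exceptions;

static class DepthLimitedWalker
{
    // Hypothetical cap: entries at levels 0 through MaxDepth - 1 are listed.
    const int MaxDepth = 5;

    // depth 0 is the top-level archive; each nested archive adds one level.
    public static void Walk(IEnumerable<ContainerItem> items, int depth = 0)
    {
        foreach (ContainerItem item in items)
        {
            // Indent the path by nesting level to make the structure visible.
            Console.WriteLine($"{new string(' ', depth * 2)}{item.FilePath}");

            bool isArchive = item.FilePath.EndsWith(".zip", StringComparison.OrdinalIgnoreCase)
                          || item.FilePath.EndsWith(".rar", StringComparison.OrdinalIgnoreCase);
            if (!isArchive) continue;

            // Refuse to open archives whose contents would sit at or beyond the cap.
            if (depth + 1 >= MaxDepth)
            {
                Console.WriteLine($"Skipping {item.FilePath}: nesting limit reached.");
                continue;
            }

            try
            {
                using (Parser nestedParser = item.OpenParser())
                {
                    IEnumerable<ContainerItem> nested = nestedParser?.GetContainer();
                    if (nested != null) Walk(nested, depth + 1);
                }
            }
            catch (UnsupportedDocumentFormatException)
            {
                Console.WriteLine($"Skipping unsupported format: {item.FilePath}");
            }
        }
    }
}
```

Call Walk(parser.GetContainer()) on the top-level archive after checking the result for null. A suitable cap depends on your data; a small number is usually enough for legitimate business archives.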
Step 3 – Putting It All Together
Below is a minimal, copy‑pasteable program that combines the two snippets above. It demonstrates a full end‑to‑end flow: open the archive, process each entry, and report the results.
using System;
using System.Collections.Generic;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Exceptions;

class ArchiveTextExtractor
{
    static void Main(string[] args)
    {
        string archivePath = args.Length > 0 ? args[0] : "./SampleDocs/InvoicesArchive.zip";
        using (Parser parser = new Parser(archivePath))
        {
            IEnumerable<ContainerItem> attachments = parser.GetContainer();
            if (attachments == null)
            {
                Console.WriteLine("No items found in the archive.");
                return;
            }
            ExtractDataFromAttachments(attachments);
        }
    }

    static void ExtractDataFromAttachments(IEnumerable<ContainerItem> attachments)
    {
        foreach (ContainerItem item in attachments)
        {
            Console.WriteLine($"File: {item.FilePath} | Size: {item.Size} bytes");
            try
            {
                using (Parser itemParser = item.OpenParser())
                {
                    if (itemParser == null) continue;
                    bool isArchive = item.FilePath.EndsWith(".zip", StringComparison.OrdinalIgnoreCase) ||
                                     item.FilePath.EndsWith(".rar", StringComparison.OrdinalIgnoreCase);
                    if (isArchive)
                    {
                        var nested = itemParser.GetContainer();
                        if (nested != null) ExtractDataFromAttachments(nested);
                    }
                    else
                    {
                        using (TextReader reader = itemParser.GetText())
                        {
                            // GetText() returns null when text extraction is not supported
                            string text = reader?.ReadToEnd() ?? string.Empty;
                            Console.WriteLine($"Extracted {text.Length} chars from {item.FilePath}");
                        }
                    }
                }
            }
            catch (UnsupportedDocumentFormatException)
            {
                Console.WriteLine($"Unsupported format: {item.FilePath}");
            }
        }
    }
}
Run the program with the path to your archive:
dotnet run -- ./Data/LegalDocs.zip
Best Practices & Tips
- Limit parsing options – By default Parser extracts all supported content. If you only need text, avoid calling additional heavy methods like GetImages().
- Large archives – Process items sequentially as shown; avoid loading all texts into memory at once.
- Performance – Skip nested archives you don’t need by checking the file extension before recursing.
- Error handling – Always catch UnsupportedDocumentFormatException; many corporate archives contain binaries that the parser cannot read.
Conclusion
GroupDocs.Parser for .NET provides a clean, in‑memory way to read every document inside ZIP or RAR archives, no matter how deeply they are nested. With just a few lines of code you can replace complex unzip‑plus‑parse pipelines, reduce I/O overhead, and build reliable document‑ingestion services.
Next steps
- Explore document comparison or metadata extraction features.
- Learn how to extract images from archived files with the same API.
- Integrate the extracted text into a search index or AI pipeline.