Make your corporate documents AI-ready — reliably, on-premise, and semantically.
It is quite common case when organizations keep their documentation in PDF, DOCX, XLSX and ePub formats. While LLMs (large language models) work well with HTML or plain text, these native document formats need conversion before they can be used effectively in LLM + RAG pipelines where we want to chat with a document or a set of documents.
LLM (Large Language Model) — попередньо навчену модель ШІ, яка генерує текст і відповіді на основі великих текстових корпусів.
RAG (Retrieval‑Augmented Generation) — підхід, який поєднує LLM з зовнішньою базою знань (наприклад, корпоративними документами), щоб модель могла отримувати та розмірковувати над доменною інформацією.
The following diagram illustrates the typical steps involved in generating an answer to a question:
The quality of answers you get from a System (LLM + RAG) depends both on the System itself and on how well the source documents preserve their structure and meaning when fed into the retrieval pipeline.
Проблема
Document formatting is not only visual — it carries semantics. Headings, lists, tables, bold/italic emphasis, captions, and inline images all convey meaning that helps an LLM understand context. Naïvely converting documents (for example, using OCR that treats every page as a flat image) often loses those semantics. As a result, RAG retrieval and downstream LLM answers can become inaccurate or noisy.
OCR can help for scanned documents but frequently removes structure (lists split across pages, table borders misinterpreted, lost annotations). It also adds cost and infrastructure overhead when processing large archives.
Рішення
An alternative approach is to parse documents with structural awareness and export that structure to a semantic, LLM‑friendly format — Markdown. Markdown is lightweight, widely supported, and preserves headings, lists, tables, code blocks, emphasis, captions, and image references — exactly the features that improve retrieval quality.
GroupDocs.Markdown for .NET converts popular document formats (PDF, DOCX, XLSX, ePub, and more) into clean, semantic Markdown suitable for ingestion into RAG systems. It’s an on‑premise .NET library, so all processing happens inside your environment — no external services, no data leakage, and no dependency on remote GPUs.
Як розпочати
GroupDocs.Markdown for .NET is available as a NuGet package, and also as MSI and ZIP downloads.
Install the NuGet package with the .NET CLI:
dotnet add package GroupDocs.Markdown
Or download installers and assemblies from the official downloads page: https://releases.groupdocs.com/markdown/net/
Example usage (add to Program.cs):
// Import the namespace
using GroupDocs.Markdown;
// Set the license (optional for evaluation)
License.Set("GroupDocs.Markdown.lic");
// Instantiate the converter for a source document
var converter = new MarkdownConverter("rich-text-formatting.docx");
// Convert and save output to file
converter.Convert("rich-text-formatting.md");
The converted rich-text-formatting.md file will be saved in the same folder as your application.
The following screenshot shows input DOCX file and output Markdown.
If you run without a license, the evaluation mode will process a limited number of pages (for example, the first three pages). To try the full product, request a temporary license.
To request a temporary license, open the Purchase Wizard, provide contact details, and click Get a temporary license on the Contact Details step. The temporary license will be emailed to you.
Learn more about temporary licenses: https://purchase.groupdocs.com/temporary-license/.
Підтримувані формати файлів
GroupDocs.Markdown for .NET supports a broad set of common enterprise and ebook formats. The full list of supported extensions:
- PDF
pdf
- Таблиці
.xls,.xlsx,.xlsb,.xlsm,.xlt,.xltx,.xltm,.xlam,.csv,.tsv,.ods,.ots,.fods,.numbers,.sxc
- Word / Rich Text
.doc,.docx,.dot,.dotm,.dotx,.docm,.rtf,.odt,.ott
- Електронні книги
.azw3,.mobi,.epub
- Текст / Розмітка / Довідка
.chm,.xml,.txt
Як це працює (внутрішньо — високий рівень)
When a document is processed, two main phases occur:
-
Видобуток моделі документа
The document is parsed into an in‑memory object model that represents structural elements (paragraphs, headings, lists, tables, images, footnotes, annotations, etc.). The parser strives to preserve semantics (for example, list nesting, table cells, and image captions). -
Генерація Markdown
The object model is walked and converted to Markdown according to configurable conversion options (how to handle images, table formatting, heading levels, special annotations, etc.). The result is a readable, semantically meaningful Markdown file ready for indexing by your RAG pipeline.
Приклад експорту
The code example above shows how to export DOCX to Markdown. Lets take this code example and take a look at the source and output files as a demonstration.
Вихідний DOCX
The source file rich-text-formatting.docx contains various content blocks and is heavily formatted to highlight the main semantic elements.
Вихідний Markdown
The output content of rich-text-formatting.md is provided below, showing how different formatting elements are represented in the generated Markdown file.
This document contains a variety of formatted elements that are used to test document rendering quality during file conversion
# <a name="_toc76372684"></a>**Font Formatting**
Source Sans Pro Light, 14 pt.
Simple text in Times New Roman 12 followed by an empty paragraph<sub>subscript</sub> and<sup>superscript</sup>.
Various characters: ‘ “ & < > £ ¥ § ¨ © ª « ® » ¼
Paragraph with multiple segments of text formatted in different fonts, sizes and colors. Very different sizes and colors including **bold**, *italic*, underline and 1 2 3 4 5 ~~strikethrough~~. Make sure that the lines wrapped in the same way in Word and in Pdf.
This text has shading and highlighting and borders, and it is supported.
# <a name="_toc76372685"></a>**Paragraph Formatting**
Paragraph shading should not form empty gaps even with spacing 12 after.
Centered paragraph with a line break had a problem.\
Centered paragraph with a line break had a problem.
Right aligned paragraphs must be right aligned properly.
Right aligned paragraph with line break works well.\
Right aligned paragraph with line break works well.
This paragraph has a border.
Right aligned condensed text had a problem.
Right aligned expanded text had a problem.
Spacing after and before do not add up, just the greater is used. This paragraph has 12 after. Also, when indents are different, the shading does not join.
This paragraph has 12 before, but in total there is only 12 above. Also note that shading belongs to the paragraph at the top and shading of this paragraph does not go down unless next paragraph has shading too. There are 24 points below.
There are 24 points above, but the gap between this and previous paragraph is only 24.
This paragraph is a test for double line spacing. This paragraph is a test for double line spacing. It also have 0.5” for the first line.
This is a test for 1.5 line spacing. This is a test for 1.5 line spacing. Also has -0.5” indent for the first line.
This paragraph has a page break
and centered. It actually creates two paragraphs.
This is a test for Exactly 20 points of spacing. This is a test for Exactly 20 points of spacing. TTTTTT (20, 22, 24, 26, 28, 30).
There is a continuous section break after this line.
This line is in the new section. Next here is an empty section.
This line is in the fourth section.
# <a name="_toc76372686"></a>**Non-English Characters**
Wingdings: (x, Symbol: WÄ
Russian: Теперь немного по русски.
# <a name="_toc76372687"></a>**Tables**
|Cell 1.1 Left|Cell 1.2 Right|||
| :- | -: | :- | :- |
|Cell 2.1 Centered vertically|Cell 2.2 with background|Cell 2.3 with line break<br>and coloured border.||
|Cell 3.1 Bottom vertically|<p>Cell 3.2</p><p>Centered</p><p>Horizontally</p>|Cell 3.3 No border||
|Left red, blue top, green right and yellow bottom.|
| :- |
|Table with left indent and merged cells.||||
| :- | :- | :- | :- |
|||||
|||||
**Cell padding etc.**
|<p>Cell padding.</p><p>Top: 0.1, bottom 0.2</p><p>Left: 0.5, Right 0.4</p>|Zero padding on all sides, right aligned.|
| :- | -: |
|Outer 1.1|Outer 1.2. There is a nested table here||
| :- | :-: | -: |
|**Nested 1.1**|**Nested 1.2**|
| :- | :- |
|||Outer 1.3|
| :- | :-: | -: |
#
# <a name="_toc76372688"></a>**Lists**
**Numbered list:**
1. Item 1
1. Item 2
1. Item 2.1
1. Item 2.2
1. Item 3
**Bulleted list:**
- Item 1
- Item 2
- Item 2.1
- Item 2.2
- Item 3
#
# <a name="_toc76372689"></a>**Images**
This section starts from a new page.
**Ellipse text**

Inline text box  is here and inline ellipse  this text is after the image.
Зображення у таблиці. Вирівняно ліворуч та праворуч.
|||
| :- | -: | :- | -: |
Вбудований текстовий блок  тут, а вбудована еліпса  цей текст після зображення.
Новий розділ, який починається з нової сторінки, тут.
Він має портретну орієнтацію та поля.
# **Поля**
Поле злиття «FirstName»
Номер сторінки 5
Hyperlink [Aspose.com](http://www.aspose.com)
Зміст
[Форматування шрифту 1](#_toc76372684)
[Форматування абзацу 1](#_toc76372685)
[Неанглійські символи 2](#_toc76372686)
[Таблиці 2](#_toc76372687)
[Списки 2](#_toc76372688)
[Зображення 4](#_toc76372689)
[Поля 5](#_toc76372690)
# **Поля форми**
Редагувати <a name="text1"></a>тестовий текст
Прапорець <a name="check1"></a>
Комбобокс <a name="dropdown1"></a>
# **Виноски та кінцеві примітки**
У цьому рядку є виноска в кінці.[^1]
У цьому рядку є кінцева примітка в кінці.[^2]
[^1]: Виноска 1.
[^2]: Кінцева примітка 1.
Підсумок
GroupDocs.Markdown for .NET допомагає конвертувати широкий спектр форматів документів у семантичний Markdown, готовий для систем LLM + RAG. Він зберігає структуру та зміст документів, працює локально та підтримує поширені корпоративні формати — що робить його практичним вибором для організацій, які потребують підготовки великих колекцій документів для споживання ШІ.
Дізнатись більше
- Домашня сторінка продукту: https://products.groupdocs.com/markdown/net/
- Документація: https://docs.groupdocs.com/markdown/net/
- Інформація про ліцензію: https://about.groupdocs.com/legal/
- Завантаження: https://releases.groupdocs.com/markdown/net/
Підтримка та зворотний зв’язок
Для запитань або технічної допомоги, будь ласка, використовуйте наш Безкоштовний форум підтримки — ми будемо раді допомогти.