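Make your enterprise documents AI-ready: reliable, on-premises, and semantic.
Organizations often keep their documents as PDF, DOCX, XLSX, and ePub files. LLMs (large language models) work well with HTML or plain text, but these native document formats have to be converted before they can be used in an LLM + RAG workflow, such as chatting with a single document or with a set of documents.
LLM (Large Language Model): a pretrained AI model that generates text and answers questions based on a large text corpus.
RAG (Retrieval-Augmented Generation): an approach that combines an LLM with an external knowledge base (for example, enterprise documents) so that the model can retrieve and reason over domain content.
The following sequence diagram shows the typical steps involved in producing an answer to a question:
The quality of the answers you get from the system (LLM + RAG) depends both on the system itself and on how well the source documents retain their structure and meaning when they enter the retrieval pipeline.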
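The problem
Document formatting is more than visual presentation; it also carries meaning. Headings, lists, tables, bold/italic emphasis, captions, and inline images all convey information that helps an LLM understand context. Converting a document naively (for example, running OCR on each page as a flat image) often loses these semantics, which makes RAG retrieval and the LLM's subsequent answers inaccurate or noisy.
OCR works for scanned documents, but it often strips structure: lists that span pages are broken, table borders are misdetected, and annotations are lost. It also adds cost and infrastructure overhead when you process large numbers of files.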
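The solution
An alternative is to parse the document in a structure-aware way and export that structure to a semantic, LLM-friendly format: Markdown. Markdown is lightweight, widely supported, and preserves headings, lists, tables, code blocks, emphasis, captions, and image references, which are exactly the features that improve retrieval quality.
GroupDocs.Markdown for .NET converts common document formats (PDF, DOCX, XLSX, ePub, and others) into clean, semantic Markdown suitable for ingestion into RAG systems. It is an on-premises .NET library: all processing happens inside your environment, with no external services, no data leaving your environment, and no dependence on remote GPUs.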
Getting started
GroupDocs.Markdown for .NET is available as a NuGet package and as downloadable MSI and ZIP installers.
Install the NuGet package with the .NET CLI:
dotnet add package GroupDocs.Markdown
Or get the installers and assemblies from the official downloads page: https://releases.groupdocs.com/markdown/net/
Sample code (add it to Program.cs):
// Import the namespace
using GroupDocs.Markdown;
// Set the license (optional for evaluation)
License.Set("GroupDocs.Markdown.lic");
// Instantiate the converter for a source document
var converter = new MarkdownConverter("rich-text-formatting.docx");
// Convert and save output to file
converter.Convert("rich-text-formatting.md");
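The converted rich-text-formatting.md file is saved in the same folder as the application.
The screenshot below shows the input DOCX file and the output Markdown.
Without a license, evaluation mode processes only a limited number of pages (for example, the first three). To try the full functionality, request a temporary license.
To request a temporary license, open the Purchase Wizard, fill in your contact information, and click Get a temporary license at the Contact Details step. The temporary license will be sent to you by email.
Learn more about temporary licenses: https://purchase.groupdocs.com/temporary-license/
Supported file formats
GroupDocs.Markdown for .NET supports a broad range of business and e-book formats. The full list of file extensions is as follows: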
- PDF
.pdf
- Spreadsheets
.xls, .xlsx, .xlsb, .xlsm, .xlt, .xltx, .xltm, .xlam, .csv, .tsv, .ods, .ots, .fods, .numbers, .sxc
- Word / Rich Text
.doc, .docx, .dot, .dotm, .dotx, .docm, .rtf, .odt, .ott
- E-books
.azw3, .mobi, .epub
- Text / markup / help
.chm, .xml, .txt
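When you prepare a mixed folder for conversion, it can help to filter files against the extension list above before handing them to the converter. The following is a minimal sketch, not part of the GroupDocs API: `IsSupported` and `PlanConversions` are hypothetical helper names, the file names are made up, and the extension set is simply copied from the list above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Extensions copied from the list above; lookups ignore case.
var supported = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    ".pdf",
    ".xls", ".xlsx", ".xlsb", ".xlsm", ".xlt", ".xltx", ".xltm", ".xlam",
    ".csv", ".tsv", ".ods", ".ots", ".fods", ".numbers", ".sxc",
    ".doc", ".docx", ".dot", ".dotm", ".dotx", ".docm", ".rtf", ".odt", ".ott",
    ".azw3", ".mobi", ".epub",
    ".chm", ".xml", ".txt",
};

// Hypothetical helper: true when the file extension is in the list above.
bool IsSupported(string path) => supported.Contains(Path.GetExtension(path));

// Hypothetical helper: pair each supported input with a ".md" output path next to it.
IEnumerable<(string Input, string Output)> PlanConversions(IEnumerable<string> paths) =>
    paths.Where(IsSupported)
         .Select(p => (Input: p, Output: Path.ChangeExtension(p, ".md")));

// Example with made-up file names; unsupported files are skipped.
foreach (var job in PlanConversions(new[] { "report.PDF", "budget.xlsx", "logo.png" }))
    Console.WriteLine($"{job.Input} -> {job.Output}");
```

Each planned pair can then be passed to the converter calls shown in the sample code earlier in this article.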
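How it works (internal overview)
A document goes through two main stages when it is processed:
- Document model extraction
The document is parsed into an in-memory object model that represents its structural elements (paragraphs, headings, lists, tables, images, footnotes, annotations, and so on). The parser preserves semantics wherever possible (for example, list levels, table cells, and image captions).
- Markdown generation
The object model is traversed and converted to Markdown according to configurable conversion options (image handling, table formatting, heading levels, special annotations, and so on). The result is a readable, semantically complete Markdown file that your RAG pipeline can index.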
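To make the two stages concrete, here is a toy sketch of the idea. It does not reflect the library's actual object model or options: the `(Kind, Level, Text)` tuple, the `RenderMarkdown` function, and the `headingOffset` option are all hypothetical, and stage 1 is written by hand instead of being produced by a parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Stage 1 (stand-in): a tiny in-memory model, one (Kind, Level, Text) tuple per block.
// A real parser would build this from the source file; here it is written by hand.
var model = new List<(string Kind, int Level, string Text)>
{
    ("heading", 1, "Lists"),
    ("paragraph", 0, "Bulleted list:"),
    ("listItem", 0, "Item 2"),
    ("listItem", 1, "Item 2.1"),
    ("paragraph", 0, "Text after the list."),
};

// Stage 2: walk the model and emit Markdown.
// headingOffset is a hypothetical option that shifts heading levels.
string RenderMarkdown(IEnumerable<(string Kind, int Level, string Text)> blocks, int headingOffset = 0)
{
    var sb = new StringBuilder();
    var inList = false;
    foreach (var block in blocks)
    {
        var isListItem = block.Kind == "listItem";
        if (inList && !isListItem) sb.Append('\n'); // a blank line closes the list
        inList = isListItem;

        switch (block.Kind)
        {
            case "heading":
                // Markdown supports heading levels 1 through 6.
                var level = Math.Clamp(block.Level + headingOffset, 1, 6);
                sb.Append(new string('#', level)).Append(' ').Append(block.Text).Append("\n\n");
                break;
            case "listItem":
                // Two spaces of indentation per nesting level.
                sb.Append(new string(' ', block.Level * 2)).Append("- ").Append(block.Text).Append('\n');
                break;
            default:
                sb.Append(block.Text).Append("\n\n");
                break;
        }
    }
    return sb.ToString();
}

Console.Write(RenderMarkdown(model, headingOffset: 1));
```

Keeping the model separate from the renderer is what allows the same parsed document to be exported with different options without parsing it again.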
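Export example
The code above shows how to export a DOCX file to Markdown. The rest of this section uses that example to show what the source and output files actually contain.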
Source DOCX
The source file, rich-text-formatting.docx, contains many kinds of content blocks with rich formatting to highlight the main semantic elements.
Output Markdown
The contents of rich-text-formatting.md are shown below to illustrate how the different formatting elements appear in the generated Markdown file.
This document contains a variety of formatted elements that are used to test document rendering quality during file conversion
# <a name="_toc76372684"></a>**Font Formatting**
Source Sans Pro Light, 14 pt.
Simple text in Times New Roman 12 followed by an empty paragraph<sub>subscript</sub> and<sup>superscript</sup>.
Various characters: ‘ “ & < > £ ¥ § ¨ © ª « ® » ¼
Paragraph with multiple segments of text formatted in different fonts, sizes and colors. Very different sizes and colors including **bold**, *italic*, underline and 1 2 3 4 5 ~~strikethrough~~. Make sure that the lines wrapped in the same way in Word and in Pdf.
This text has shading and highlighting and borders, and it is supported.
# <a name="_toc76372685"></a>**Paragraph Formatting**
Paragraph shading should not form empty gaps even with spacing 12 after.
Centered paragraph with a line break had a problem.\
Centered paragraph with a line break had a problem.
Right aligned paragraphs must be right aligned properly.
Right aligned paragraph with line break works well.\
Right aligned paragraph with line break works well.
This paragraph has a border.
Right aligned condensed text had a problem.
Right aligned expanded text had a problem.
Spacing after and before do not add up, just the greater is used. This paragraph has 12 after. Also, when indents are different, the shading does not join.
This paragraph has 12 before, but in total there is only 12 above. Also note that shading belongs to the paragraph at the top and shading of this paragraph does not go down unless next paragraph has shading too. There are 24 points below.
There are 24 points above, but the gap between this and previous paragraph is only 24.
This paragraph is a test for double line spacing. This paragraph is a test for double line spacing. It also have 0.5” for the first line.
This is a test for 1.5 line spacing. This is a test for 1.5 line spacing. Also has -0.5” indent for the first line.
This paragraph has a page break
and centered. It actually creates two paragraphs.
This is a test for Exactly 20 points of spacing. This is a test for Exactly 20 points of spacing. TTTTTT (20, 22, 24, 26, 28, 30).
There is a continuous section break after this line.
This line is in the new section. Next here is an empty section.
This line is in the fourth section.
# <a name="_toc76372686"></a>**Paragraph Justify**
This is a justified paragraph with a single segment. 111111111111111111111111111111111111111111.
Also a justified **paragraph** reset to left because of multiple segments. 111111111111111111111111111111111111111111.
# **Non-English Characters**
Wingdings: (x, Symbol: WÄ
Russian: Теперь немного по русски.
# <a name="_toc76372687"></a>**Tables**
|Cell 1.1 Left|Cell 1.2 Right|||
| :- | -: | :- | :- |
|Cell 2.1 Centered vertically|Cell 2.2 with background|Cell 2.3 with line break<br>and coloured border.||
|Cell 3.1 Bottom vertically|<p>Cell 3.2</p><p>Centered</p><p>Horizontally</p>|Cell 3.3 No border||
|Left red, blue top, green right and yellow bottom.|
| :- |
|Table with left indent and merged cells.||||
| :- | :- | :- | :- |
|||||
|||||
**Cell padding etc.**
|<p>Cell padding.</p><p>Top: 0.1, bottom 0.2</p><p>Left: 0.5, Right 0.4</p>|Zero padding on all sides, right aligned.|
| :- | -: |
|Outer 1.1|Outer 1.2. There is a nested table here||
| :- | :-: | -: |
|**Nested 1.1**|**Nested 1.2**|
| :- | :- |
|||Outer 1.3|
| :- | :-: | -: |
#
# <a name="_toc76372688"></a>**Lists**
**Numbered list:**
1. Item 1
1. Item 2
1. Item 2.1
1. Item 2.2
1. Item 3
**Bulleted list:**
- Item 1
- Item 2
- Item 2.1
- Item 2.2
- Item 3
#
# <a name="_toc76372689"></a>**Images**
This section starts from a new page.
**Ellipse text**
There is an image in a black border in the top right corner, but it will drop down into the text. There is also a transparent ellipse with text that overlaps the picture.
Inline JPEG in a separate paragraph.

Inline GIF scaled 50% and WMF scaled 25% in a paragraph. This text is before the image and  this text is after the image.
Images in a table. Left and right aligned.
|||
| :- | -: |
Inline text box  is here and inline ellipse  is here.
New section that starts from a new page is here.
It has portrait orientation and margins.
# <a name="_toc76372690"></a>**Fields**
Merge field «FirstName»
Page number 5
Hyperlink [Aspose.com](http://www.aspose.com)
TOC
[Font Formatting 1](#_toc76372684)
[Paragraph Formatting 1](#_toc76372685)
[Non-English Characters 2](#_toc76372686)
[Tables 2](#_toc76372687)
[Lists 2](#_toc76372688)
[Images 4](#_toc76372689)
[Fields 5](#_toc76372690)
# **Form Fields**
Edit <a name="text1"></a>test text
Checkbox <a name="check1"></a>
Combobox <a name="dropdown1"></a>
# **Footnotes and Endnotes**
This line has a footnote at the end.[^1]
This line has an endnote at the end.[^2]
[^1]: Footnote 1.
[^2]: Endnote 1.
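Output like the above lends itself to heading-based chunking before indexing, so that each chunk in a RAG index keeps its section title as context. The sketch below is one possible downstream step and is not part of GroupDocs.Markdown: `SplitByHeadings` and `CleanHeading` are hypothetical helpers, and the sample string only imitates the heading style seen in the output (bookmark anchors and bold markers inside headings).

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Short sample in the style of the output above.
var sample = "Intro text\n# <a name=\"_toc1\"></a>**Tables**\n|a|b|\n# **Lists**\n- Item 1\n";
foreach (var section in SplitByHeadings(sample))
    Console.WriteLine($"[{section.Title}] {section.Body}");

// Split Markdown into (Title, Body) sections, one per ATX heading ("# ...").
// Text that appears before the first heading gets an empty title.
List<(string Title, string Body)> SplitByHeadings(string markdown)
{
    var sections = new List<(string Title, string Body)>();
    var title = "";
    var body = new StringBuilder();

    void Flush()
    {
        var text = body.ToString().Trim();
        if (title.Length > 0 || text.Length > 0)
            sections.Add((title, text));
        body.Clear();
    }

    foreach (var rawLine in markdown.Split('\n'))
    {
        var line = rawLine.TrimEnd('\r');
        // One to six '#' characters, then either end of line or whitespace plus a title.
        var match = Regex.Match(line, @"^#{1,6}(?:\s+(.*))?$");
        if (match.Success)
        {
            Flush();
            title = CleanHeading(match.Groups[1].Value);
        }
        else
        {
            body.Append(line).Append('\n');
        }
    }
    Flush();
    return sections;
}

// Remove bookmark anchors such as <a name="_toc1"></a> and bold markers from a heading.
string CleanHeading(string heading)
{
    var withoutAnchors = Regex.Replace(heading, @"<a\s+name=""[^""]*""></a>", "");
    return withoutAnchors.Replace("**", "").Trim();
}
```

Splitting on headings is only one chunking strategy; long sections may still need to be divided further by size before embedding.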
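Summary
GroupDocs.Markdown for .NET helps you convert a wide range of document formats into semantic Markdown for LLM + RAG systems. It preserves document structure and meaning, runs locally, and supports common enterprise formats, which makes it a practical choice for organizations that need to prepare large document collections for AI.
Learn more
- Product page: https://products.groupdocs.com/markdown/net/
- Documentation: https://docs.groupdocs.com/markdown/net/
- Licensing information: https://about.groupdocs.com/legal/
- Downloads: https://releases.groupdocs.com/markdown/net/
Support and feedback
If you have questions or need technical help, please use our Free Support Forum. We are happy to help.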