[Notes] 7-7 RAG Context (Text) Chunking

sky · July 28, 2025, 05:38

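▌Course Highlights

- Why text chunking is needed
- Two approaches: "smart/logical chunking" and "simple/fixed-size chunking"
- Concrete strategies for different data types (articles, books, images, slide decks, and so on)
- Prompt examples for text cleanup and JSON output
- A Python code example
- The idea of using LLM as a Judge for evaluation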

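▌Why Is Text Chunking Needed?

When building a RAG system, we often have to handle very long material such as academic papers, book chapters, or meeting transcripts. Putting the entire document into the large language model (LLM) prompt for every user query causes several problems:

- High cost: LLM usage is typically billed by token count, so cost grows with the length of the input text.
- Low efficiency: a longer context takes the model longer to process.
- Context length limits: models have a maximum input length, which a long document may exceed.
- Noise: only a small part of a long document is relevant to a given query, and supplying the irrelevant remainder can distract the model from producing a precise answer.

Text chunking is therefore a necessary preprocessing step. Its goal is to break one large document into many smaller, more manageable "chunks". When the RAG system receives a query, it first retrieves the few most relevant chunks and then passes only those chunks to the LLM, which yields answers that are more accurate, faster, and cheaper.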
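
The retrieve-then-prompt flow described above can be sketched in a few lines. This is only a toy illustration: the word-overlap scoring, the function names, and the sample chunks are assumptions made for this sketch, and a real system would use a search engine or vector index for retrieval.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve_top_k(query, chunks, k=2):
    """Rank chunks by how many query words they share and return the best k.

    Toy scoring for illustration only (an assumption of this sketch).
    """
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(chunk)), chunk) for chunk in chunks]
    # sorted() is stable, so chunks with equal scores keep document order.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in ranked[:k] if score > 0]


def build_prompt(query, context_chunks):
    """Assemble a prompt that contains only the retrieved chunks."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


chunks = [
    "To get a certificate you must complete a project.",
    "Office hours take place every week.",
    "The project must demonstrate what you learned in the course.",
]
query = "What project do I need for the certificate?"
top_chunks = retrieve_top_k(query, chunks, k=2)
prompt = build_prompt(query, top_chunks)
```

Only the two chunks that share words with the query reach the prompt; the unrelated chunk about office hours is left out, which is the cost and noise reduction the section describes.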

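▌Two Main Approaches to Processing and Chunking Text

The course introduces two common chunking approaches: smart/logical chunking and simple/fixed-size chunking.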

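》Method 1: Smart/Logical Chunking

This method uses the LLM's language understanding to split text at meaningful boundaries. Rather than cutting mechanically, it tries to keep each chunk semantically complete and logically coherent. The process usually has two steps:

1. Text cleanup and initial structuring
- Goal: turn raw, unstructured text (for example a meeting transcript full of filler words and pauses) into a fluent, grammatical, readable article.
- Implementation: a carefully designed prompt instructs the LLM to act as a professional editor and carry out this task.
- Prompt example (the full prompt is at the end of this post):

  You are a professional editor in the data science field. Your task is to turn a podcast transcript into fluent, readable text while preserving as much of the original information as possible. Remove unnecessary filler words (such as "um", "uh", "so", "right"), rephrase sentences so they read smoothly, and split the text into short paragraphs.
2. Naming the blocks and outputting a structured format (JSON)
- Goal: once the text is clean, have the LLM identify the logical blocks, give each one a title, and output the result in a format that programs can process easily, such as JSON.
- Implementation: pass the clean text from the previous step to the LLM again with a new instruction.
- Prompt example (the full prompt is at the end of this post):

  You are a professional editor in the data science field. I will give you a text already divided into logical blocks (separated by -~~~-). Your task is to give each block a title and, where a better arrangement is possible, rearrange the blocks, split them further, or move paragraphs between them. Follow the specified JSON format exactly and do not add any extra formatting.
- Output example: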
```
{"blocks": [
  {"title": "Welcome and Introduction",
   "sentences": ["Hi everyone, welcome to our office hours! It's been a while since we had these sessions, so it's nice to have you here.", "I can't see you, but I see one person has joined so far. Great to have you!"]},
  {"title": "Project Requirements and Overview",
   "sentences": ["Let's start with this file.", "Remember, to get a certificate of completion for this course, you must complete a project.", "The key requirement for passing the course and getting a certificate is completing a project that demonstrates your ability to apply what you've learned."]}
]}
```
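
Once the LLM returns JSON in this shape, each block can become one retrievable chunk. The sketch below shows one way to do that. The function name, the `office_hours` document id, and the decision to keep the `title` alongside the joined sentences are assumptions for illustration, not part of the course material.

```python
import json


def blocks_to_chunks(llm_output, doc_id):
    """Turn the {"blocks": [...]} JSON into one chunk record per block."""
    data = json.loads(llm_output)
    chunks = []
    for number, block in enumerate(data["blocks"], start=1):
        chunks.append({
            "doc_id": doc_id,
            "chunk_id": f"{doc_id}_{number}",  # doc_id + chunk number
            "title": block["title"],
            # Join the sentences back into one passage for indexing.
            "text": " ".join(block["sentences"]),
        })
    return chunks


sample_output = '''
{"blocks": [
  {"title": "Welcome and Introduction",
   "sentences": ["Hi everyone, welcome to our office hours!", "Great to have you!"]},
  {"title": "Project Requirements and Overview",
   "sentences": ["Let's start with this file.", "You must complete a project."]}
]}
'''
records = blocks_to_chunks(sample_output, doc_id="office_hours")
```

Keeping the title in the record is optional; it may help retrieval because it summarizes the block in a few words.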

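》Method 2: Simple/Fixed-Size Chunking

This is a more direct approach: a sliding window cuts the text into fixed-length chunks and lets neighboring chunks partially overlap.

Concepts:
- Chunk size: how many words or characters each chunk contains.
- Overlap size: how many words or characters two adjacent chunks share. With a moderate overlap, a sentence or idea cut at one chunk's boundary still appears intact in the neighboring chunk, which helps preserve context.

Python implementation example (see the end of this post for details):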

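```python
def chunk_text(text, chunk_size=240, overlap_size=20):
    """
    Split text into fixed-size chunks of words, with overlap between neighbors.

    Args:
        text (str): The input text to split.
        chunk_size (int): Number of words per chunk.
        overlap_size (int): Number of words shared by consecutive chunks.

    Returns:
        list[str]: The text chunks, in order.
    """
    # Guard against a non-positive step, which would make the loop run forever.
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must satisfy 0 <= overlap_size < chunk_size")

    # split() separates on whitespace, so this assumes space-delimited text.
    words = text.split()
    chunks = []
    start = 0
    text_length = len(words)

    while start < text_length:
        # End index of the current chunk (slicing clips it to the text length)
        end = start + chunk_size

        # Add the chunk to the list
        chunk = " ".join(words[start:end])
        chunks.append(chunk)

        # Stop once this chunk has reached the end of the text
        if end >= text_length:
            break

        # Advance by (chunk_size - overlap_size) so consecutive chunks overlap
        start += chunk_size - overlap_size

    return chunks

# Usage example:
# article_text = "Your long article text..."
# chunks = chunk_text(article_text, chunk_size=240, overlap_size=20)
# for i, chunk in enumerate(chunks):
#     print(f"Chunk {i+1}:\n{chunk}\n")
```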
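
To reason about what the sliding window produces without running it on a real article, the window positions can be computed directly from the word count. The helper names below are hypothetical and the sketch assumes the same stepping rule as the loop above: each window starts `chunk_size - overlap_size` words after the previous one, and the last window is clipped to the end of the text.

```python
import math


def expected_chunk_count(n_words, chunk_size=240, overlap_size=20):
    """Number of chunks the sliding window yields for a text of n_words words."""
    if n_words <= 0:
        return 0
    stride = chunk_size - overlap_size
    if stride <= 0:
        raise ValueError("overlap_size must be smaller than chunk_size")
    # One window covers the first chunk_size words; each further window
    # adds `stride` new words.
    return max(1, math.ceil((n_words - chunk_size) / stride) + 1)


def window_bounds(n_words, chunk_size=240, overlap_size=20):
    """(start, end) word indices of each window, with end clipped to n_words."""
    stride = chunk_size - overlap_size
    count = expected_chunk_count(n_words, chunk_size, overlap_size)
    return [(i * stride, min(i * stride + chunk_size, n_words)) for i in range(count)]


bounds = window_bounds(500)
```

For a 500-word text with the default settings, `bounds` is `[(0, 240), (220, 460), (440, 500)]`: three chunks, each sharing 20 words with its neighbor, and a shorter final chunk.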

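▌Processing Strategies for Different Data Types

Different kinds of data call for different processing strategies.

| Data type | Processing strategy |
| --- | --- |
| Multiple articles | 1. Assign each article a unique `doc_id`. 2. Chunk each article. 3. Assign each resulting chunk a unique `chunk_id` (for example `doc_id_chunk_number`). 4. Store each chunk in the index (such as Elasticsearch) together with its `doc_id` and `chunk_id`. 5. When evaluating retrieval, measure both the document hit rate (was the right article found?) and the chunk hit rate (was the most precise chunk found?). |
| Single article / transcript | The flow mirrors the multi-article case at a smaller scale. It can be built as an on-the-fly system: the user supplies a source (such as a YouTube video link), and the system fetches the content, chunks it, indexes it, and lets the user start chatting right away. Because the data volume is small, an in-memory database is an option. |
| Books / long-form content | 1. Try dividing the content at a higher level first, for example treating each chapter or section as a separate document. 2. Then chunk those "documents" more finely. 3. Because many strategies are possible, use experiments and evaluation (such as `LLM as a Judge`) to decide which structure works best. |
| Images | 1. Text description: use a multimodal model (such as `gpt-4o-mini`) to generate a detailed text description of each image. 2. Vector embedding: use a model such as CLIP to convert the image directly into a vector embedding. 3. Index each image (with its generated description or embedding) as a separate document. |
| Slides | 1. Treat one slide deck as one document. 2. Treat each slide in the deck as one chunk. 3. Optionally have an LLM generate a description of each slide and index those descriptions, which brings the problem back to the multi-article case. |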
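
The multi-article steps in the table (assigning `doc_id` and `chunk_id`, then measuring hit rate at both levels) can be sketched as follows. All function names, the sentence-based stand-in chunker, and the retrieval results are made up for illustration. The sketch also assumes chunk ids have the form `doc_id` + `_` + chunk number, so the document id can be recovered by dropping the last `_` segment.

```python
def make_chunk_records(articles, chunker):
    """Build index-ready records; `articles` maps doc_id -> full text."""
    records = []
    for doc_id, text in articles.items():
        for number, chunk in enumerate(chunker(text), start=1):
            records.append({
                "doc_id": doc_id,
                "chunk_id": f"{doc_id}_{number}",
                "text": chunk,
            })
    return records


def hit_rate(retrieved_ids, expected_ids):
    """Fraction of queries whose expected id appears among the retrieved ids."""
    hits = sum(
        expected in retrieved
        for retrieved, expected in zip(retrieved_ids, expected_ids)
    )
    return hits / len(expected_ids)


def split_sentences(text):
    """Hypothetical stand-in chunker: one chunk per sentence."""
    return [part.strip() + "." for part in text.split(".") if part.strip()]


articles = {
    "a1": "Chunking splits documents. Overlap preserves context.",
    "a2": "Evaluation compares strategies.",
}
records = make_chunk_records(articles, split_sentences)

# Retrieved chunk ids for two test queries (made-up results for illustration).
retrieved_chunks = [["a1_2", "a2_1"], ["a1_1"]]
expected_chunks = ["a1_2", "a1_2"]
chunk_hit_rate = hit_rate(retrieved_chunks, expected_chunks)

# Document-level hit rate: drop the chunk number and compare doc ids only.
retrieved_docs = [[cid.rsplit("_", 1)[0] for cid in ids] for ids in retrieved_chunks]
expected_docs = [cid.rsplit("_", 1)[0] for cid in expected_chunks]
doc_hit_rate = hit_rate(retrieved_docs, expected_docs)
```

In this made-up data the second query finds the right article but the wrong chunk, so the chunk hit rate is 0.5 while the document hit rate is 1.0. Tracking both numbers separates "wrong document" failures from "right document, wrong passage" failures.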

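▌Summary and Evaluation

Whichever chunking strategy you choose, evaluation is essential. You need a way to tell whether a change (for example a different chunk_size or a different chunking method) actually improved the final performance of the RAG system.
LLM as a Judge, mentioned in the course, is a powerful evaluation tool: a separate LLM judges whether the answer generated by the RAG system accurately answers the question. Comparing scores across strategies lets us make data-driven decisions and find the text-processing approach best suited to the data and application at hand.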
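
A comparison of this kind can be organized as below. This is a sketch under stated assumptions: the `RELEVANT` / `NON_RELEVANT` labels, the function names, and the sample answers are all hypothetical, and the judge here is a keyword stub so the example runs offline. In practice `judge` would wrap a call to an LLM, which is not shown.

```python
def judge_strategy(qa_pairs, judge):
    """Return the share of answers that the judge labels RELEVANT.

    qa_pairs: list of (question, answer) tuples produced by one RAG setup.
    judge: any callable returning "RELEVANT" or "NON_RELEVANT".
    """
    labels = [judge(question, answer) for question, answer in qa_pairs]
    return labels.count("RELEVANT") / len(labels)


def stub_judge(question, answer):
    """Placeholder judge so the sketch runs offline; NOT a real evaluator."""
    return "RELEVANT" if "project" in answer.lower() else "NON_RELEVANT"


# Made-up answers from two chunking configurations for the same questions.
results = {
    "chunk_size=240": [
        ("How do I get the certificate?", "Complete a project."),
        ("What should the project show?", "It should apply what you learned."),
    ],
    "chunk_size=120": [
        ("How do I get the certificate?", "Complete a project."),
        ("What should the project show?", "The project applies what you learned."),
    ],
}
scores = {name: judge_strategy(pairs, stub_judge) for name, pairs in results.items()}
best = max(scores, key=scores.get)
```

The same question set is used for every configuration so that the scores are comparable; only the chunking settings change between runs.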

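▌Text Chunking Prompt Examples

In this lesson the instructor uses three prompts to show, step by step, how a raw video transcript is turned into data suitable for RAG.

Each prompt is given below in the instructor's original wording, preceded by a reference translation:

1. Turn the video transcript into a fluent article
2. Edit that article and output it as JSON
3. Write a Python program that chunks the text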

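》Prompt 1: Turn the Video Transcript into a Fluent Article

This prompt asks the generative AI (GAI) to apply a set of editing rules that turn a raw transcript into a fluent, readable, clearly structured article.

Prompt translation for reference:

You are a professional editor highly skilled in data science. Your task is to turn a podcast transcript into fluent, readable text while preserving as much of the original information as possible.

Instructions:

- Remove filler words such as "um" and "uh".
- Remove "so", "right", and "like" when they serve only as fillers and carry no meaning.
- Rephrase sentences for clarity.
- Rearrange words so the result is grammatical.
- If a sentence starts with "and", rephrase it.
- If a sentence ends with "right?", rewrite it as a proper question.
- Use as many words from the original sentence as possible.
- When a thought logically ends, start a new paragraph with a line break.
- Keep paragraphs short (about 3-4 sentences or lines each) for readability.
- Split the text into logical blocks for readability, separating the blocks with -~~~-.
- Always follow the provided format and do not add extra formatting such as headers or bold.
- Use only the provided information and do not add anything.

Automatic transcripts sometimes contain errors, and you need to correct them. We will provide context: the questions we prepared before the event. Also use your own judgment and knowledge.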

Instructor's original prompt 1

You're a professional editor highly skilled in data science. Your task is to turn a podcast transcript into a readable text while preserving as much of the original information as possible

Instructions:

- remove filler words, uhms, mhms and so on
- remove "so", "right", "like" when they are not needed in the text and used only as fillers
- rephrase sentences for clarity
- rearrange words so the result is grammatically correct
- if a sentence starts with "and", rephase it 
- if a sentence ends with "right?", rephrase it - make it a question 
- use as many words from the original sentence as possible
- when a thought logically ends, start a new paragraph by simply adding a linebreak
- keep paragraphs short (3-4 sentences or lines each) to enhance readability
- split the texts into logical blocks to enhance readability, separate paragraphs between blocks by adding -~~~-
- always follow the provided format and don't add any extra formatting like headers, bold, etc
- use only the provided information, don't add anything 

Sometimes there are errors in the automatic transcription, and you will need to correct them. We will give you context - the questions we prepared in advance before the event. Also use your own judgement and knowledge.

Format:

Sentence 1 of paragraph 1. Sentence 2 of paragraph 1. ...  
Sentence 1 of paragraph 2. Sentence 2 of paragraph 2. ...  

-~~~-

Sentence 1 of paragraph 3. Sentence 2 of paragraph 3. ...  
Sentence 1 of paragraph 4. Sentence 2 of paragraph 4. ...  

Transcript:
(The subtitle transcript goes here; it is omitted in this post.)
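
The output format requested by Prompt 1 is easy to parse in code: blocks are separated by `-~~~-`, and each non-empty line inside a block is one paragraph. The function below is a minimal sketch with a hypothetical name; it assumes the model followed the format exactly.

```python
BLOCK_SEPARATOR = "-~~~-"


def parse_edited_transcript(text):
    """Split the model's output into blocks, each a list of paragraphs."""
    blocks = []
    for raw_block in text.split(BLOCK_SEPARATOR):
        # Each non-empty line is one paragraph, per the prompt's format.
        paragraphs = [line.strip() for line in raw_block.splitlines() if line.strip()]
        if paragraphs:
            blocks.append(paragraphs)
    return blocks


edited = """Sentence 1 of paragraph 1. Sentence 2 of paragraph 1.
Sentence 1 of paragraph 2. Sentence 2 of paragraph 2.

-~~~-

Sentence 1 of paragraph 3. Sentence 2 of paragraph 3.
Sentence 1 of paragraph 4. Sentence 2 of paragraph 4.
"""
blocks = parse_edited_transcript(edited)
```

Empty blocks are skipped, so a stray separator at the start or end of the output does not produce an empty chunk.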

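》Prompt 2: Edit the Article and Output It as JSON

This prompt tells the AI to act as an editor: tidy up the text that has already been split into blocks, name each block, and output the result as structured JSON.

Prompt translation for reference:

You are a professional editor highly skilled in data science.

I will give you an edited transcript that has already been divided into logical blocks, with the blocks separated by -~~~-.

Your task is to give each block a name and to rearrange the blocks where a better arrangement is possible. This includes splitting blocks further or moving a paragraph from one block to another.

Follow the format exactly and do not add any extra formatting.

The output must be parsable JSON. Do not include code blocks in the output.
{"blocks": [
  {"title": "<title>",
   "sentences": ["<sentence 1>", "<sentence 2>", ...]},
  {"title": "<title>",
   "sentences": ["<sentence 1>", "<sentence 2>", ...]}
  ...
]}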

Instructor's original prompt 2

You're a professional editor highly skilled at data science. I give you an edited transcript already broken down by logical blocks, The blocks are separated using -~~~-

Your task is to give each block a name and re-arrange blocks if better arrangement is possible. This includes splitting the blocks further or moving a paragraph from one block to another
Follow the format exactly and don't add any extra formatting.

The output format should be a parsable JSON. Don't include codeblocks in the output.

{"blocks": [
  {"title": "<TITLE>",
   "sentences": ["<SENTENCE1>", "<SENTENCE2>", ...]},
  {"title": "<TITLE>",
   "sentences": ["<SENTENCE1>", "<SENTENCE2>", ...]}
  ...
]}

The text:
(The article polished in step 1 goes here; it is omitted in this post.)
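
Even when a prompt says not to include code blocks, a model may still wrap its reply in a Markdown fence, so it can be worth parsing the reply defensively. The following is a sketch with a hypothetical function name; it strips a surrounding fence if present, parses the JSON, and checks that every block has the `title` and `sentences` fields the prompt asks for.

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def parse_blocks(raw_output):
    """Parse the model's reply into a list of validated blocks."""
    cleaned = raw_output.strip()
    # Drop a surrounding Markdown code fence if the model added one anyway.
    if cleaned.startswith(FENCE):
        lines = cleaned.splitlines()
        if lines[-1].strip() == FENCE:
            lines = lines[:-1]
        cleaned = "\n".join(lines[1:])
    data = json.loads(cleaned)  # raises json.JSONDecodeError on invalid JSON
    blocks = data["blocks"]
    for block in blocks:
        if not isinstance(block.get("title"), str):
            raise ValueError("each block needs a string 'title'")
        sentences = block.get("sentences")
        if not isinstance(sentences, list) or not all(
            isinstance(sentence, str) for sentence in sentences
        ):
            raise ValueError("each block needs a list of string 'sentences'")
    return blocks


reply = (
    FENCE + "json\n"
    '{"blocks": [{"title": "Intro", "sentences": ["Hello.", "Welcome."]}]}\n'
    + FENCE
)
parsed = parse_blocks(reply)
```

Failing loudly on a malformed reply makes it easy to retry the request instead of silently indexing broken chunks.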

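》Prompt 3: Write a Python Text-Chunking Function for RAG

This prompt asks the GAI to write Python code that cuts a long article into small overlapping chunks for later use in RAG.

Prompt translation for reference:

I have an article and I want to split it into chunks so I can use it for RAG. Write a Python function that does this.

I want each chunk to be 240 words in total (configurable), with an overlap of about 20 words (also configurable). That is, the current chunk should contain 20 words from the previous chunk and 20 words from the next chunk.

Instructor's original prompt 3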

I have an article and I want to chunk the code so I can use it for RAG. Write a function in Python for doing it

I want the chunks to be each 240 words in total (configurable), and have overlap of ~20 words (also configurable) - i.e. the current chunk should contain 20 words from the previous chunk and 20 from the next.

▌References

github.com/DataTalksClub/llm-zoomcamp: 06-project-example/content-processing-summary.md (main branch)

# Content Processing Cases and Steps

## Case: Multiple Articles

- Assign each article a document id
- Chunk the articles
- Assign each chunk a unique chunk id (could be doc_id + chunk_number)
- Evaluate retrieval: separate hitrate for both doc_id and chunk_id
- Evaluate RAG: LLM as a Judge
- Tuning chunk size: use metrics from Evaluate RAG

Example JSON structure for a chunk:
```json
{
  "doc_id": "ashdiasdh",
  "chunk_id": "ashdiasdh_1",
  "text": "actual text"
}
```
