TrustRAG project repository🌟: https://github.com/gomate-community/TrustRAG
A configurable, modular RAG framework.
In RAG (Retrieval-Augmented Generation) pipelines, chunk splitting is a key step, especially when dealing with structurally complex PDF documents. PDFs may contain images, unusual layouts, and so on, which makes chunking considerably harder.
In short, there is no best chunking strategy, only a suitable one. To make query results more accurate, you sometimes need to combine several different strategies selectively.
Take LangChain / Langchain-chatchat as an example: LangChain provides many text-splitting utilities and uses RecursiveCharacterTextSplitter by default, alongside several other splitters; a minimal usage sketch follows.
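The sketch below is illustrative only: the import path may differ depending on your LangChain version, and the chunk_size/chunk_overlap values are examples rather than recommendations.

# Minimal sketch of LangChain's default splitter (illustrative values;
# the import path may differ across LangChain versions).
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,     # maximum characters per chunk
    chunk_overlap=50,   # characters shared between adjacent chunks
    separators=["\n\n", "\n", "。", ".", " ", ""],  # tried from coarse to fine
)
chunks = splitter.split_text(long_document_text)  # long_document_text: your raw text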
Determining the best chunk size usually requires A/B testing: run a set of queries to evaluate answer quality and compare how different chunk sizes perform. This is an iterative process of testing different chunk sizes against different queries until you find the best one.
Embedding models are less effective on multi-topic, multi-turn corpora than on simple ones, which is why RAG (Retrieval-Augmented Generation) favors shorter chunks.
Finding the optimal chunk size is like hyperparameter tuning: you have to experiment with your own data and documents, as sketched below.
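A rough sketch of such a sweep might look like the following; build_chunks and eval_retrieval are hypothetical placeholders for your own chunking and evaluation code (for example, retrieval hit rate over a labeled query set).

# Hypothetical chunk-size sweep; build_chunks() and eval_retrieval()
# are placeholders for your own chunking and evaluation logic.
candidate_sizes = [128, 256, 512, 1024]
scores = {}
for size in candidate_sizes:
    chunks = build_chunks(documents, chunk_size=size)       # placeholder
    scores[size] = eval_retrieval(chunks, labeled_queries)  # placeholder: e.g. hit rate
best_size = max(scores, key=scores.get)
print(f"best chunk_size: {best_size}, scores: {scores}")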
Overlap helps link adjacent chunks together and gives each chunk better context. However, even a fairly aggressive 25% overlap only improved accuracy by 1.5 percentage points, from 42.4% to 43.9%, so overlap is not the most effective lever for optimizing RAG performance.
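For reference, the sketch below shows one simple way to add sentence-level overlap with a sliding window. It is an illustrative snippet, not part of the TrustRAG code that follows, and assumes sentences is a list of already-split sentences.

# Illustrative sliding-window chunking with overlap (not TrustRAG code);
# `sentences` is assumed to be a list of already-split sentences.
def chunk_with_overlap(sentences, chunk_len=8, overlap=2):
    step = chunk_len - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        window = sentences[start:start + chunk_len]
        if window:
            chunks.append(''.join(window))
        if start + chunk_len >= len(sentences):
            break  # the last window already reached the end of the text
    return chunks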
Below is how the TrustRAG project implements sentence-level chunking:
The SentenceChunker class works as follows:
1. The __init__ method sets the tokenizer and the maximum number of tokens per chunk: tokenizer is rag_tokenizer, used to count the tokens in each sentence, and chunk_size defaults to 512, the maximum number of tokens allowed per chunk.
2. The regex re.compile(r'([。!?.!?])') matches sentence-ending punctuation (Chinese: 。!?; English: .!?), and the split_sentences method uses it to split the text into a list of sentences.
3. When building chunks, a chunks list and a current_chunk list are initialized to hold the finished chunks and the sentences (plus running token count) of the chunk being built. If adding the next sentence keeps the current chunk within chunk_size, the sentence is appended to the current chunk; otherwise the current chunk is appended to the chunks list and a new chunk is started.
4. Finally, the process_text_chunks method post-processes the chunks.
The full code is as follows:
import re
from trustrag.modules.document import rag_tokenizer
from trustrag.modules.chunks.base import BaseChunker
class SentenceChunker(BaseChunker):
    """
    A class for splitting text into chunks based on sentences, ensuring each chunk does not exceed a specified token size.

    This class is designed to handle both Chinese and English text, splitting it into sentences using punctuation marks.
    It then groups these sentences into chunks, ensuring that the total number of tokens in each chunk does not exceed
    the specified `chunk_size`. The class also provides methods to preprocess the text chunks by normalizing excessive
    newlines and spaces.

    Attributes:
        tokenizer (callable): A tokenizer function used to count tokens in sentences.
        chunk_size (int): The maximum number of tokens allowed per chunk.

    Methods:
        split_sentences(text: str) -> list[str]:
            Splits the input text into sentences based on Chinese and English punctuation marks.
        process_text_chunks(chunks: list[str]) -> list[str]:
            Preprocesses text chunks by normalizing excessive newlines and spaces.
        get_chunks(paragraphs: list[str]) -> list[str]:
            Splits a list of paragraphs into chunks based on a specified token size.
    """

    def __init__(self, chunk_size=512):
        """
        Initializes the SentenceChunker with a tokenizer and a specified chunk size.

        Args:
            chunk_size (int, optional): The maximum number of tokens allowed per chunk. Defaults to 512.
        """
        super().__init__()
        self.tokenizer = rag_tokenizer
        self.chunk_size = chunk_size

    def split_sentences(self, text: str) -> list[str]:
        """
        Splits the input text into sentences based on Chinese and English punctuation marks.

        Args:
            text (str): The input text to be split into sentences.

        Returns:
            list[str]: A list of sentences extracted from the input text.
        """
        # Use regex to split text by sentence-ending punctuation marks
        sentence_endings = re.compile(r'([。!?.!?])')
        sentences = sentence_endings.split(text)

        # Merge punctuation marks with their preceding sentences
        result = []
        for i in range(0, len(sentences) - 1, 2):
            if sentences[i]:
                result.append(sentences[i] + sentences[i + 1])

        # Handle the last sentence if it lacks punctuation
        if sentences[-1]:
            result.append(sentences[-1])

        # Remove whitespace and filter out empty sentences
        result = [sentence.strip() for sentence in result if sentence.strip()]
        return result

    def process_text_chunks(self, chunks: list[str]) -> list[str]:
        """
        Preprocesses text chunks by normalizing excessive newlines and spaces.

        Args:
            chunks (list[str]): A list of text chunks to be processed.

        Returns:
            list[str]: A list of processed text chunks with normalized formatting.
        """
        processed_chunks = []
        for chunk in chunks:
            # Normalize four or more consecutive newlines
            while '\n\n\n\n' in chunk:
                chunk = chunk.replace('\n\n\n\n', '\n\n')
            # Normalize four or more consecutive spaces
            while '    ' in chunk:
                chunk = chunk.replace('    ', '  ')
            processed_chunks.append(chunk)
        return processed_chunks

    def get_chunks(self, paragraphs: list[str]) -> list[str]:
        """
        Splits a list of paragraphs into chunks based on a specified token size.

        Args:
            paragraphs (list[str]|str): A list of paragraphs to be chunked.

        Returns:
            list[str]: A list of text chunks, each containing sentences that fit within the token limit.
        """
        # Combine paragraphs into a single text
        text = ''.join(paragraphs)

        # Split the text into sentences
        sentences = self.split_sentences(text)

        # If no sentences are found, treat paragraphs as sentences
        if len(sentences) == 0:
            sentences = paragraphs

        chunks = []
        current_chunk = []
        current_chunk_tokens = 0

        # Iterate through sentences and build chunks based on token count
        for sentence in sentences:
            tokens = self.tokenizer.tokenize(sentence)
            if current_chunk_tokens + len(tokens) <= self.chunk_size:
                # Add sentence to the current chunk if it fits
                current_chunk.append(sentence)
                current_chunk_tokens += len(tokens)
            else:
                # Finalize the current chunk and start a new one
                chunks.append(''.join(current_chunk))
                current_chunk = [sentence]
                current_chunk_tokens = len(tokens)

        # Add the last chunk if it contains any sentences
        if current_chunk:
            chunks.append(''.join(current_chunk))

        # Preprocess the chunks to normalize formatting
        chunks = self.process_text_chunks(chunks)
        return chunks
if __name__ == '__main__':
    with open("../../../data/docs/news.txt", "r", encoding="utf-8") as f:
        content = f.read()
    tc = SentenceChunker(chunk_size=128)
    chunks = tc.get_chunks([content])
    for chunk in chunks:
        print(f"Chunk Content:\n{chunk}")
The output is as follows:
Chunk Content:
韩国总统警卫处长辞职
#韩国总统警卫处长辞职#更新:韩国总统警卫处长朴钟俊今天(1月10日)到案接受调查前,向代总统崔相穆递交辞呈。#韩国总统警卫处长到案接受调查#今天上午,朴钟俊抵达韩国警察厅国家调查本部,接受警方调查。他在接受调查前向现场记者表示,针对被停职总统尹锡悦的逮捕令存在法理上的争议,对尹锡悦的调查程序应符合总统身份,而不是以逮捕令的形式进行。
Chunk Content:
他还说,政府机构间不能出现流血事件。韩国高级公职人员犯罪调查处(公调处)1月3日组织人员前往位于首尔市龙山区汉南洞的总统官邸进行抓捕,但遭总统警卫处抵抗,双方对峙5个多小时后,公调处宣布抓捕行动失败。韩国“共同调查本部”以涉嫌妨碍执行特殊公务为由对朴钟俊立案,要求其到案接受调查。朴钟俊曾两次拒绝到案接受警方调查。(总台记者 张昀