TrustRAG project repository🌟: https://github.com/gomate-community/TrustRAG
A configurable, modular RAG framework.
In RAG (Retrieval-Augmented Generation) pipelines, chunk splitting is a key step, especially when dealing with structurally complex PDF documents. PDFs may contain images, unusual layouts, and so on, which makes chunking considerably harder.
In short, there is no best chunking strategy, only a suitable one. To make query results more accurate, you sometimes need to combine several different strategies selectively.
Take LangChain / Langchain-chatchat as an example: LangChain provides many text-splitting utilities and uses RecursiveCharacterTextSplitter by default, alongside several other splitters; a minimal usage sketch follows.
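The sketch below is illustrative only: the import path may differ depending on your LangChain version, and the chunk_size/chunk_overlap values are examples rather than recommendations.

# Minimal sketch of LangChain's default splitter (illustrative values;
# the import path may differ across LangChain versions).
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,     # maximum characters per chunk
    chunk_overlap=50,   # characters shared between adjacent chunks
    separators=["\n\n", "\n", "。", ".", " ", ""],  # tried from coarse to fine
)
chunks = splitter.split_text(long_document_text)  # long_document_text: your raw text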
Determining the best chunk size usually requires A/B testing: run a set of queries to evaluate answer quality and compare how different chunk sizes perform. This is an iterative process of testing different chunk sizes against different queries until you find the best one.
Embedding models are less effective on multi-topic, multi-turn corpora than on simple ones, which is why RAG (Retrieval-Augmented Generation) favors shorter chunks.
Finding the optimal chunk size is like hyperparameter tuning: you have to experiment with your own data and documents, as sketched below.
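A rough sketch of such a sweep might look like the following; build_chunks and eval_retrieval are hypothetical placeholders for your own chunking and evaluation code (for example, retrieval hit rate over a labeled query set).

# Hypothetical chunk-size sweep; build_chunks() and eval_retrieval()
# are placeholders for your own chunking and evaluation logic.
candidate_sizes = [128, 256, 512, 1024]
scores = {}
for size in candidate_sizes:
    chunks = build_chunks(documents, chunk_size=size)       # placeholder
    scores[size] = eval_retrieval(chunks, labeled_queries)  # placeholder: e.g. hit rate
best_size = max(scores, key=scores.get)
print(f"best chunk_size: {best_size}, scores: {scores}")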
Overlap helps link adjacent chunks together and gives each chunk better context. However, even a fairly aggressive 25% overlap only improved accuracy by 1.5 percentage points, from 42.4% to 43.9%, so overlap is not the most effective lever for optimizing RAG performance.
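For reference, the sketch below shows one simple way to add sentence-level overlap with a sliding window. It is an illustrative snippet, not part of the TrustRAG code that follows, and assumes sentences is a list of already-split sentences.

# Illustrative sliding-window chunking with overlap (not TrustRAG code);
# `sentences` is assumed to be a list of already-split sentences.
def chunk_with_overlap(sentences, chunk_len=8, overlap=2):
    step = chunk_len - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        window = sentences[start:start + chunk_len]
        if window:
            chunks.append(''.join(window))
        if start + chunk_len >= len(sentences):
            break  # the last window already reached the end of the text
    return chunks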
Below is how the TrustRAG project implements sentence-level chunking:
The SentenceChunker class works as follows:
1. The __init__ method sets the tokenizer and the maximum number of tokens per chunk: tokenizer is rag_tokenizer, used to count the tokens in each sentence, and chunk_size defaults to 512, the maximum number of tokens allowed per chunk.
2. The regex re.compile(r'([。!?.!?])') matches sentence-ending punctuation (Chinese: 。!?; English: .!?), and the split_sentences method uses it to split the text into a list of sentences.
3. When building chunks, a chunks list and a current_chunk list are initialized to hold the finished chunks and the sentences (plus running token count) of the chunk being built. If adding the next sentence keeps the current chunk within chunk_size, the sentence is appended to the current chunk; otherwise the current chunk is appended to the chunks list and a new chunk is started.
4. Finally, the process_text_chunks method post-processes the chunks.
The full code is as follows:
import re
from trustrag.modules.document import rag_tokenizer
from trustrag.modules.chunks.base import BaseChunker
class SentenceChunker(BaseChunker):
    """
    A class for splitting text into chunks based on sentences, ensuring each chunk does not exceed a specified token size.

    This class is designed to handle both Chinese and English text, splitting it into sentences using punctuation marks.
    It then groups these sentences into chunks, ensuring that the total number of tokens in each chunk does not exceed
    the specified `chunk_size`. The class also provides methods to preprocess the text chunks by normalizing excessive
    newlines and spaces.

    Attributes:
        tokenizer (callable): A tokenizer function used to count tokens in sentences.
        chunk_size (int): The maximum number of tokens allowed per chunk.

    Methods:
        split_sentences(text: str) -> list[str]:
            Splits the input text into sentences based on Chinese and English punctuation marks.
        process_text_chunks(chunks: list[str]) -> list[str]:
            Preprocesses text chunks by normalizing excessive newlines and spaces.
        get_chunks(paragraphs: list[str]) -> list[str]:
            Splits a list of paragraphs into chunks based on a specified token size.
    """

    def __init__(self, chunk_size=512):
        """
        Initializes the SentenceChunker with a tokenizer and a specified chunk size.

        Args:
            chunk_size (int, optional): The maximum number of tokens allowed per chunk. Defaults to 512.
        """
        super().__init__()
        self.tokenizer = rag_tokenizer
        self.chunk_size = chunk_size

    def split_sentences(self, text: str) -> list[str]:
        """
        Splits the input text into sentences based on Chinese and English punctuation marks.

        Args:
            text (str): The input text to be split into sentences.

        Returns:
            list[str]: A list of sentences extracted from the input text.
        """
        # Use regex to split text by sentence-ending punctuation marks
        sentence_endings = re.compile(r'([。!?.!?])')
        sentences = sentence_endings.split(text)

        # Merge punctuation marks with their preceding sentences
        result = []
        for i in range(0, len(sentences) - 1, 2):
            if sentences[i]:
                result.append(sentences[i] + sentences[i + 1])

        # Handle the last sentence if it lacks punctuation
        if sentences[-1]:
            result.append(sentences[-1])

        # Remove whitespace and filter out empty sentences
        result = [sentence.strip() for sentence in result if sentence.strip()]
        return result

    def process_text_chunks(self, chunks: list[str]) -> list[str]:
        """
        Preprocesses text chunks by normalizing excessive newlines and spaces.

        Args:
            chunks (list[str]): A list of text chunks to be processed.

        Returns:
            list[str]: A list of processed text chunks with normalized formatting.
        """
        processed_chunks = []
        for chunk in chunks:
            # Normalize four or more consecutive newlines
            while '\n\n\n\n' in chunk:
                chunk = chunk.replace('\n\n\n\n', '\n\n')
            # Normalize four or more consecutive spaces
            while '    ' in chunk:
                chunk = chunk.replace('    ', '  ')
            processed_chunks.append(chunk)
        return processed_chunks

    def get_chunks(self, paragraphs: list[str]) -> list[str]:
        """
        Splits a list of paragraphs into chunks based on a specified token size.

        Args:
            paragraphs (list[str]|str): A list of paragraphs to be chunked.

        Returns:
            list[str]: A list of text chunks, each containing sentences that fit within the token limit.
        """
        # Combine paragraphs into a single text
        text = ''.join(paragraphs)

        # Split the text into sentences
        sentences = self.split_sentences(text)

        # If no sentences are found, treat paragraphs as sentences
        if len(sentences) == 0:
            sentences = paragraphs

        chunks = []
        current_chunk = []
        current_chunk_tokens = 0

        # Iterate through sentences and build chunks based on token count
        for sentence in sentences:
            tokens = self.tokenizer.tokenize(sentence)
            if current_chunk_tokens + len(tokens) <= self.chunk_size:
                # Add sentence to the current chunk if it fits
                current_chunk.append(sentence)
                current_chunk_tokens += len(tokens)
            else:
                # Finalize the current chunk and start a new one
                chunks.append(''.join(current_chunk))
                current_chunk = [sentence]
                current_chunk_tokens = len(tokens)

        # Add the last chunk if it contains any sentences
        if current_chunk:
            chunks.append(''.join(current_chunk))

        # Preprocess the chunks to normalize formatting
        chunks = self.process_text_chunks(chunks)
        return chunks
if __name__ == '__main__':
    with open("../../../data/docs/news.txt", "r", encoding="utf-8") as f:
        content = f.read()
    tc = SentenceChunker(chunk_size=128)
    chunks = tc.get_chunks([content])
    for chunk in chunks:
        print(f"Chunk Content:\n{chunk}")
The output is as follows:
Chunk Content:
韩国总统警卫处长辞职
#韩国总统警卫处长辞职#更新:韩国总统警卫处长朴钟俊今天(1月10日)到案接受调查前,向代总统崔相穆递交辞呈。#韩国总统警卫处长到案接受调查#今天上午,朴钟俊抵达韩国警察厅国家调查本部,接受警方调查。他在接受调查前向现场记者表示,针对被停职总统尹锡悦的逮捕令存在法理上的争议,对尹锡悦的调查程序应符合总统身份,而不是以逮捕令的形式进行。
Chunk Content:
他还说,政府机构间不能出现流血事件。韩国高级公职人员犯罪调查处(公调处)1月3日组织人员前往位于首尔市龙山区汉南洞的总统官邸进行抓捕,但遭总统警卫处抵抗,双方对峙5个多小时后,公调处宣布抓捕行动失败。韩国“共同调查本部”以涉嫌妨碍执行特殊公务为由对朴钟俊立案,要求其到案接受调查。朴钟俊曾两次拒绝到案接受警方调查。(总台记者 张昀