语义索引(可通俗理解为向量索引)技术是搜索引擎、推荐系统、广告系统在召回阶段的核心技术之一。语义索引模型的目标是:给定输入文本,模型可以从海量候选召回库中快速、准确地召回一批语义相关文本。语义索引模型的效果直接决定了语义相关的物料能否被成功召回进入系统参与上层排序,从基础层面影响整个系统的效果。
我们采用百度paddleNLP里提到的In-batch Negatives方案。
In-batch Negatives 策略的训练数据为语义相似的 Pair 对,策略核心是在 1 个 Batch 内同时基于 N 个负例进行梯度更新,将Batch 内除自身之外其它所有 Source Text 的相似文本 Target Text 作为负例,例如: 上例中“我手机丢了,我想换个手机” 有 1 个正例(”我想买个新手机,求推荐“),3 个负例(1.求秋色之空全集漫画,2.手机学日语的软件,3.侠盗飞车罪恶都市怎么改车)。
具体来说,In-batch negatives策略的实施步骤如下:
In-batch negatives策略的优势在于:
然而,In-batch negatives策略也存在一些潜在的问题,例如:
一般通过 Recall@1,Recall@5 ,Recall@10 ,Recall@20 和 Recall@50 指标来评估语义索引模型的召回效果。按照paddleNLP给出的基线:
策略 | 模型 | Recall@1 | Recall@5 | Recall@10 | Recall@20 | Recall@50 |
---|---|---|---|---|---|---|
In-batch Negatives | ernie 1.0 | 51.301 | 65.309 | 69.878 | 73.996 | 78.881 |
In-batch Negatives | rocketqa-zh-base-query-encoder | 59.622 | 75.089 | 79.668 | 83.404 | 87.773 |
rocketqa作为打底transformer模型效果更好。
总结,为什么为采用In-batch negatives,一方面能充分利用现有数据,不用单独准备负样例,减少投入,另外一方面模型的区分能力也比较好。
流传一句话,用1亿条数据,训练10个epoch,不如用10亿数据训练一个epoch,也就是见多识广,大力出奇迹。
我们要训练一个给搜索用的向量召回模型,核心就是让准备足够多的正样例数据。正样例数据,一方面网上有较多的开源数据,可以直接利用。另外一方面,之间了解SimBERT 时,他们的数据很多也源自于搜索数据,所以可以通过搜索引擎将query和召回结果的doc作为相似句对。
作为试验,我们构造了8000万的一个小训练集,用rocketqa-zh-mini-query-encoder
作为打底模型,训练256维的embedding模型。
root_path=inbatch
python -u -m paddle.distributed.launch --gpus "0" \
train_batch_neg.py \
--device gpu \
--save_dir ./checkpoints/${root_path} \
--batch_size 64 \
--learning_rate 5E-5 \
--epochs 3 \
--output_emb_size 256 \
--model_name_or_path rocketqa-zh-mini-query-encoder \
--save_steps 5000 \
--max_seq_length 128 \
--margin 0.2 \
--train_set_file recall/train.csv \
--recall_result_dir "recall_result_dir" \
--recall_result_file "recall_result.txt" \
--hnsw_m 100 \
--hnsw_ef 100 \
--recall_num 50 \
--similar_text_pair_file "recall/dev.csv" \
--corpus_file "recall/corpus.csv"
训练完成导出onnx模型:
def convert_model(model_path):
try:
import onnx
import onnxruntime as ort
import paddle2onnx
from onnxconverter_common import float16
except ImportError:
print(
"The inference precision is change to 'fp32', please install the dependencies that required for 'fp16' inference, pip install onnxruntime-gpu onnx onnxconverter-common"
)
onnx_dir = os.path.join(model_path, "onnx")
if not os.path.exists(onnx_dir):
os.mkdir(onnx_dir)
float_onnx_file = os.path.join(onnx_dir, "model.onnx")
if not os.path.exists(float_onnx_file):
onnx_model = paddle2onnx.command.c_paddle_to_onnx(
model_file=os.path.join(model_path, "inference.pdmodel"),
params_file=os.path.join(model_path, "inference.pdiparams"),
opset_version=13,
enable_onnx_checker=True,
)
with open(float_onnx_file, "wb") as f:
f.write(onnx_model)
fp16_model_file = os.path.join(onnx_dir, "fp16_model.onnx")
if not os.path.exists(fp16_model_file):
onnx_model = onnx.load_model(float_onnx_file)
trans_model = float16.convert_float_to_float16(onnx_model, keep_io_types=True)
onnx.save_model(trans_model, fp16_model_file)
加载测试:
class MiniRocketQAEmbedding():
def __init__(self, model_file: str = model_file, use_gpu: bool = True):
providers = ['CUDAExecutionProvider'] if use_gpu else ['CPUExecutionProvider']
sess_options = ort.SessionOptions()
self.predictor = ort.InferenceSession(
model_file, sess_options=sess_options, providers=providers)
self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
def embeding(self, embeding_text):
features = self.tokenizer(embeding_text, max_seq_len=128,
pad_to_max_seq_len=True, truncation_strategy="longest_first")
vecs = self.predictor.run(None, features.data)
return vecs[0]
def similarity(self, pairs):
query = pairs[0][0]
texts = [item[1] for item in pairs]
emdbeding_text = [query]
emdbeding_text.extend(texts)
features = self.tokenizer(emdbeding_text, max_seq_len=128,
pad_to_max_seq_len=True, truncation_strategy="longest_first")
vecs = self.predictor.run(None, features.data)
# print(vecs)
query_embeding = vecs[0][0]
vecs_text1 = query_embeding / (query_embeding**2).sum() ** 0.5
result = []
for i in range(1, len(vecs[0])):
vecs_text2 = vecs[0][i]
vecs_text2 = vecs_text2 / (vecs_text2**2).sum() ** 0.5
similarity = (vecs_text1 * vecs_text2).sum()
result.append({"similarity": float(similarity)})
return result
if __name__ == "__main__":
bert = MiniRocketQAEmbedding(use_gpu=False)
import time
start = time.time()
bert.embeding(["双鱼座性格特点","双鱼座性格特点"])
print((time.time() - start) * 1000)
通过MTEB框架来测试自建搜索测试集效果:
if __name__ == '__main__':
model = MyModel()
task_names = ["SSRetrieval"]
for task in task_names:
model.query_instruction_for_retrieval = None
evaluation = MTEB(tasks=[task], task_langs=['zh', 'zh-CN'])
evaluation.run(model, output_folder=f"zh_results/256_model", batch_size=64)
测试结果:
{
"dataset_revision": null,
"dev": {
"evaluation_time": 251.86,
"map_at_1": 0.13427,
"map_at_10": 0.62859,
"map_at_100": 0.72526,
"map_at_1000": 0.72564,
"map_at_3": 0.31398,
"map_at_5": 0.45025,
"mrr_at_1": 0.71863,
"mrr_at_10": 0.81982,
"mrr_at_100": 0.82077,
"mrr_at_1000": 0.82078,
"mrr_at_3": 0.80707,
"mrr_at_5": 0.81587,
"ndcg_at_1": 0.71803,
"ndcg_at_10": 0.77357,
"ndcg_at_100": 0.83634,
"ndcg_at_1000": 0.83907,
"ndcg_at_3": 0.72048,
"ndcg_at_5": 0.73003,
"precision_at_1": 0.71803,
"precision_at_10": 0.53373,
"precision_at_100": 0.07386,
"precision_at_1000": 0.00747,
"precision_at_3": 0.68889,
"precision_at_5": 0.65699,
"recall_at_1": 0.13427,
"recall_at_10": 0.78675,
"recall_at_100": 0.98082,
"recall_at_1000": 0.99181,
"recall_at_3": 0.35371,
"recall_at_5": 0.53211
},
"mteb_dataset_name": "SSRetrieval",
"mteb_version": "1.1.1"
}
同样的数据集,用peg模型测试:
{
"dataset_revision": null,
"dev": {
"evaluation_time": 1036.11,
"map_at_1": 0.09911,
"map_at_10": 0.42835,
"map_at_100": 0.49497,
"map_at_1000": 0.49681,
"map_at_3": 0.2277,
"map_at_5": 0.31901,
"mrr_at_1": 0.56794,
"mrr_at_10": 0.67111,
"mrr_at_100": 0.6737,
"mrr_at_1000": 0.67386,
"mrr_at_3": 0.65495,
"mrr_at_5": 0.66559,
"ndcg_at_1": 0.56794,
"ndcg_at_10": 0.56275,
"ndcg_at_100": 0.62991,
"ndcg_at_1000": 0.64939,
"ndcg_at_3": 0.55564,
"ndcg_at_5": 0.54815,
"precision_at_1": 0.56794,
"precision_at_10": 0.38468,
"precision_at_100": 0.05755,
"precision_at_1000": 0.00641,
"precision_at_3": 0.53329,
"precision_at_5": 0.49464,
"recall_at_1": 0.09911,
"recall_at_10": 0.55328,
"recall_at_100": 0.7634,
"recall_at_1000": 0.84758,
"recall_at_3": 0.25931,
"recall_at_5": 0.38263
},
"mteb_dataset_name": "SSRetrieval",
"mteb_version": "1.1.1"
}
模型 | Recall@1 | Recall@10 | Recall@100 | Recall@1000 |
---|---|---|---|---|
peg模型 | 9.911 | 55.328 | 76.34 | 84.758 |
微调256模型 | 13.427 | 78.675 | 98.082 | 99.181 |
可以看到,微调的模型,用更小的参数,见多识广后,整体效果明显优于未经历大规模数据训练的更大尺寸的模型。