Kaggle Patent Matching Competition: Post-Competition Summary

致Great
Published 2022-09-09 10:34:02
Included in the column: 程序生活

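Competition overview

In the patent matching dataset, participants must judge the similarity of two phrases, an anchor and a target, and output their similarity under a given semantic context (context) as a score in the range 0-1. Our team ID was xlyhq; we ranked 13th on the public leaderboard and 12th on the private leaderboard. Many thanks to my four teammates @heng zheng, @pythonlan, @leolu1998, and @syzong for their effort and contributions; in the end we were lucky enough to scrape a gold medal.

Our core ideas are much the same as those of the other top teams, so here we mainly share our competition journey, the concrete results of the related experiments, and a few interesting attempts.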

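Text processing

The dataset mainly has the fields anchor, target, and context, plus additional text information that can be concatenated. During the competition we mainly tried the following concatenation schemes:

  • v1: test['anchor'] + '[SEP]' + test['target'] + '[SEP]' + test['context_text']
  • v2: test['anchor'] + '[SEP]' + test['target'] + '[SEP]' + test['context'] + '[SEP]' + test['context_text'], which amounts to appending the raw code (for example A47) directly
  • v3: test['text'] = test['anchor'] + '[SEP]' + test['target'] + '[SEP]' + test['context'] + '[SEP]' + test['context_text'], with more text pulled in for concatenation, which amounts to also appending the subclasses under A47, such as A47B and A47C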
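import re

import pandas as pd
from tqdm import tqdm

# Titles of the top-level CPC sections, keyed by section letter
context_mapping = {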
    "A": "Human Necessities",
    "B": "Operations and Transport",
    "C": "Chemistry and Metallurgy",
    "D": "Textiles",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering",
    "G": "Physics",
    "H": "Electricity",
    "Y": "Emerging Cross-Sectional Technologies",
}

titles = pd.read_csv('./input/cpc-codes/titles.csv')


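def process(text):
    # Strip parenthesized, braced, and bracketed asides from a CPC title
    return re.sub(r"\(.*?\)|\{.*?}|\[.*?]", "", text)


def get_context(cpc_code):
    # Collect titles of codes with at most 4 characters that contain cpc_code
    # (e.g. A47, A47B, A47C for cpc_code "A47")
    cpc_data = titles[(titles['code'].map(len) <= 4) & (titles['code'].str.contains(cpc_code, regex=False))]
    texts = cpc_data['title'].values.tolist()
    texts = [process(text) for text in texts]
    # Prefix the section title, then join everything with ';'
    return ";".join([context_mapping[cpc_code[0]]] + texts)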


def get_cpc_texts():
    cpc_texts = dict()
    for code in tqdm(train['context'].unique()):
        cpc_texts[code] = get_context(code)
    return cpc_texts


cpc_texts = get_cpc_texts()

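This concatenation gives a sizable improvement, but the text becomes longer: we set the maximum length to 300, which makes training slower.

  • v4: the key concatenation scheme: test['text'] = test['text'] + '[SEP]' + test['target_info']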
# Concatenate target info: all targets that share the same anchor and context
test['text'] = test['anchor'] + '[SEP]' + test['target'] + '[SEP]' + test['context_text']
target_info = test.groupby(['anchor', 'context'])['target'].agg(list).reset_index()
# Deduplicate targets within each (anchor, context) group
target_info['target'] = target_info['target'].apply(lambda x: list(set(x)))
target_info['target_info'] = target_info['target'].apply(lambda x: ', '.join(x))
# Distribution of the number of targets per group (notebook inspection only)
target_info['target_info'].apply(lambda x: len(x.split(', '))).describe()

del target_info['target']
test = test.merge(target_info, on=['anchor', 'context'], how='left')
test['text'] = test['text'] + '[SEP]' + test['target_info']
test.head()

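This concatenation substantially improved both the CV and LB scores. Comparing v3 and v4 shows that concatenating higher-quality text helps the model more: v3 adds a lot of redundant information, whereas v4 adds a lot of entity-level key information.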
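To make the v4 idea concrete, here is a minimal pure-Python sketch of the same grouping logic on a few made-up rows. The function name add_target_info and the sample rows are hypothetical and not part of our pipeline. Unlike the pandas version, which deduplicates through set, this sketch keeps targets in first-seen order so that the output is reproducible.

```python
from collections import defaultdict

SEP = "[SEP]"


def add_target_info(rows):
    """Build v4-style texts: anchor, target, context_text, then all targets
    that share the row's (anchor, context) pair."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["anchor"], row["context"])
        if row["target"] not in groups[key]:  # deduplicate, keep first-seen order
            groups[key].append(row["target"])

    texts = []
    for row in rows:
        target_info = ", ".join(groups[(row["anchor"], row["context"])])
        texts.append(SEP.join([row["anchor"], row["target"], row["context_text"], target_info]))
    return texts


# Made-up rows for illustration only
rows = [
    {"anchor": "abatement", "target": "abatement of pollution", "context": "A47", "context_text": "furniture"},
    {"anchor": "abatement", "target": "act of abating", "context": "A47", "context_text": "furniture"},
    {"anchor": "abatement", "target": "forest region", "context": "F24", "context_text": "heating"},
]
texts = add_target_info(rows)
print(texts[0])
```

Each row now carries its sibling targets, which is the entity-level signal that v4 relies on.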

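Data splitting

During the competition we tried different data splitting schemes, including:

  • StratifiedGroupKFold: with this split the gap between CV and LB is fairly small, and the score is slightly better
  • StratifiedKFold: offline CV is higher
  • Plain KFold and GroupKFold did not work well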

Loss functions

The main loss functions worth considering are:

  • BCE: nn.BCEWithLogitsLoss(reduction="mean")
  • MSE: nn.MSELoss()
  • Mixed loss: MSECorrLoss
import torch.nn as nn


class CorrLoss(nn.Module):
    """
    Use 1 - Pearson correlation coefficient between the network output and the target as the loss.
    input (o, t):
        o: tensor of size (batch_size, 1), output of the network
        t: tensor of size (batch_size, 1), target value
    output:
        tensor of size (1), equal to 1 - corr
    """
    def __init__(self, eps=1e-8):
        super().__init__()
        self.eps = eps  # guards against division by zero on constant batches

    def forward(self, o, t):
        assert o.size() == t.size()
        # z-score o and t; the population std (unbiased=False) matches the
        # batch mean taken below, so corr agrees with the Pearson coefficient
        o_z = (o - o.mean(dim=0)) / (o.std(dim=0, unbiased=False) + self.eps)
        t_z = (t - t.mean(dim=0)) / (t.std(dim=0, unbiased=False) + self.eps)

        # correlation between o and t
        corr = (o_z * t_z).mean(dim=0)
        return 1 - corr


class MSECorrLoss(nn.Module):
    def __init__(self, p=1.5):
        super().__init__()
        self.p = p  # weight of the correlation term
        self.mseLoss = nn.MSELoss()
        self.corrLoss = CorrLoss()

    def forward(self, o, t):
        mse = self.mseLoss(o, t)
        corr = self.corrLoss(o, t)
        return mse + self.p * corr

We used this loss function in our experiments; it performed slightly better than BCE.
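The correlation term is worth a quick worked check. The pure-Python sketch below (helper names are illustrative, not from our code) shows that the mean of z-score products equals the Pearson coefficient when the standard deviation divides by n, and comes out as (n - 1)/n times Pearson when it divides by n - 1, which is the default of torch.std.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def zscore_corr(xs, ys, ddof):
    """Mean of the product of z-scores, with std computed using n - ddof."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - ddof))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - ddof))
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / n


def mse_corr_loss(preds, targets, p=1.5):
    """MSE plus p times (1 - Pearson), mirroring the mixed loss."""
    mse = sum((o - t) ** 2 for o, t in zip(preds, targets)) / len(preds)
    return mse + p * (1 - pearson(preds, targets))


# Made-up predictions and labels
preds = [0.1, 0.4, 0.35, 0.8]
targets = [0.0, 0.5, 0.25, 1.0]
r = pearson(preds, targets)
print(r, zscore_corr(preds, targets, ddof=0), zscore_corr(preds, targets, ddof=1))
print(mse_corr_loss(preds, targets))
```

With small batches the (n - 1)/n factor is noticeable, so the choice of standard deviation changes the effective weight of the correlation term.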

Model design

To increase model diversity, we mainly chose different model variants, including the following five models:

  • Deberta-v3-large
  • Bert-for-patents
  • Roberta-large
  • Ernie-en-2.0-Large
  • Electra-large-discriminator

The detailed CV scores are as follows (per-fold scores in brackets):

deberta-v3-large:            [0.8494, 0.8455, 0.8523, 0.8458, 0.8658]  cv 0.85176
bert-for-patents:            [0.8393, 0.8403, 0.8457, 0.8402, 0.8564]  cv 0.8444
roberta-large:               [0.8183, 0.8172, 0.8203, 0.8193, 0.8398]  cv 0.8233
ernie-en-2.0-large:          [0.8276, 0.8277, 0.8251, 0.8296, 0.8466]  cv 0.8310
electra-large-discriminator: [0.8429, 0.8309, 0.8259, 0.8416, 0.846]   cv 0.8376

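Training optimization

Based on experience from past competitions, we mainly used the following training optimizations:

  • Adversarial training: we tried FGM, which improved model training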
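import torch


class FGM():
    def __init__(self, model):
        self.model = model
        self.backup = {}

    def attack(self, epsilon=1., emb_name='word_embeddings'):
        # Replace emb_name with the name of the embedding parameter in your model
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    r_at = epsilon * param.grad / norm
                    param.data.add_(r_at)

    def restore(self, emb_name='word_embeddings'):
        # emb_name must match the one passed to attack()
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                assert name in self.backup
                param.data = self.backup[name]
        self.backup = {}

# Typical use inside a training step (sketch, with fgm = FGM(model)):
#   loss.backward()        # gradients on the clean batch
#   fgm.attack()           # perturb the embeddings along the gradient
#   loss_adv.backward()    # loss recomputed on the perturbed model; gradients accumulate
#   fgm.restore()          # put the original embeddings back
#   optimizer.step(); optimizer.zero_grad()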
  • Model generalization: we added multi-sample dropout
  • EMA improved model training
class EMA():
    def __init__(self, model, decay):
        self.model = model
        self.decay = decay
        self.shadow = {}
        self.backup = {}
 
    def register(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                self.shadow[name] = param.data.clone()
 
    def update(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                assert name in self.shadow
                new_average = (1.0 - self.decay) * param.data + self.decay * self.shadow[name]
                self.shadow[name] = new_average.clone()
 
    def apply_shadow(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                assert name in self.shadow
                self.backup[name] = param.data
                param.data = self.shadow[name]
 
    def restore(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                assert name in self.backup
                param.data = self.backup[name]
        self.backup = {}
 
# Initialization
ema = EMA(model, 0.999)
ema.register()

# During training, update the shadow weights right after each parameter update
def train():
    optimizer.step()
    ema.update()

# Before eval, apply the shadow weights; after eval, restore the original parameters
def evaluate():
    ema.apply_shadow()
    # evaluate
    ema.restore()
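Multi-sample dropout is listed above without code, so here is a minimal PyTorch sketch of one common form of it: the same linear head is applied to several differently dropped-out copies of the pooled features and the logits are averaged. The class name, the number of samples, and the dropout rates are illustrative assumptions, not the exact settings we used.

```python
import torch
import torch.nn as nn


class MultiSampleDropoutHead(nn.Module):
    """Regression head that averages logits over several dropout masks."""

    def __init__(self, hidden_size, num_samples=5, max_p=0.5):
        super().__init__()
        # One dropout layer per sample, with rates spread evenly up to max_p
        # (an illustrative choice)
        self.dropouts = nn.ModuleList(
            [nn.Dropout(max_p * (i + 1) / num_samples) for i in range(num_samples)]
        )
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, pooled):
        # Shared linear layer, different dropout masks, averaged logits
        logits = [self.fc(dropout(pooled)) for dropout in self.dropouts]
        return torch.stack(logits, dim=0).mean(dim=0)


torch.manual_seed(0)
head = MultiSampleDropoutHead(hidden_size=16)
pooled = torch.randn(4, 16)  # stand-in for the encoder's pooled output

head.train()
train_out = head(pooled)  # stochastic: masks differ per dropout layer
head.eval()
eval_out = head(pooled)   # deterministic: dropout is the identity in eval mode
print(train_out.shape, eval_out.shape)
```

In eval mode the head reduces to a single linear layer, so inference cost is unchanged.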

Attempts that did not help:

  • AWP
  • PGD

Model ensembling

Guided by offline cross-validation scores and online leaderboard feedback, we blended the models with a weighted average.

from sklearn.preprocessing import MinMaxScaler
MMscaler = MinMaxScaler()
predictions1 = MMscaler.fit_transform(submission['predictions1'].values.reshape(-1,1)).reshape(-1)
predictions2 = MMscaler.fit_transform(submission['predictions2'].values.reshape(-1,1)).reshape(-1)
predictions3 = MMscaler.fit_transform(submission['predictions3'].values.reshape(-1,1)).reshape(-1)
predictions4 = MMscaler.fit_transform(submission['predictions4'].values.reshape(-1,1)).reshape(-1)
predictions5 = MMscaler.fit_transform(submission['predictions5'].values.reshape(-1,1)).reshape(-1)


# final_predictions=(predictions1+predictions2)/2
# final_predictions=(predictions1+predictions2+predictions3+predictions4+predictions5)/5
# 5:2:1:1:1
final_predictions=0.5*predictions1+0.2*predictions2+0.1*predictions3+0.1*predictions4+0.1*predictions5
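The same blend can be written as a small dependency-free helper, sketched below with made-up predictions. The function names are hypothetical; the weights are normalized so that ratios such as 5:2:1:1:1 can be passed directly.

```python
def min_max_scale(values):
    """Rescale a list of predictions to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant predictions carry no ranking information
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def weighted_blend(model_preds, weights):
    """Min-max scale each model's predictions, then take a weighted average."""
    if len(model_preds) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    scaled = [min_max_scale(preds) for preds in model_preds]
    n_rows = len(scaled[0])
    return [
        sum(w * s[i] for w, s in zip(weights, scaled)) / total
        for i in range(n_rows)
    ]


# Made-up predictions from two models on three rows
preds_a = [0.10, 0.40, 0.90]
preds_b = [0.30, 0.35, 0.50]
blended = weighted_blend([preds_a, preds_b], weights=[5, 2])
print(blended)
```

Scaling each model to a common range first keeps a model with a narrow output range from being drowned out in the average.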

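Other attempts

  • Two-stage: early on we fine-tuned many different pretrained models, so we had a relatively large number of features. We tried stacking with tree-based models on text statistics features plus the model predictions, and at the time it gave a fairly good ensembling gain. Part of the code is shown below.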
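import gc

import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

# CFG, TestDataset, CustomModel, inference_fn, device, train, test, and titles
# are defined elsewhere in the notebook and are not shown here.

# ====================================================
# predictions1
# ====================================================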

def get_fold_pred(CFG, path, model):
    CFG.path = path
    CFG.model = model
    CFG.config_path = CFG.path + "config.pth"
    CFG.tokenizer = AutoTokenizer.from_pretrained(CFG.path)
    test_dataset = TestDataset(CFG, test)

    test_loader = DataLoader(test_dataset,
                             batch_size=CFG.batch_size,
                             shuffle=False,
                             num_workers=CFG.num_workers, pin_memory=True, drop_last=False)
    predictions = []
    for fold in CFG.trn_fold:
        model = CustomModel(CFG, config_path=CFG.config_path, pretrained=False)
        state = torch.load(CFG.path + f"{CFG.model.split('/')[-1]}_fold{fold}_best.pth",
                           map_location=torch.device('cpu'))
        model.load_state_dict(state['model'])
        prediction = inference_fn(test_loader, model, device)
        predictions.append(prediction.flatten())
        del model, state, prediction
        gc.collect()
        torch.cuda.empty_cache()
    # predictions1 = np.mean(predictions, axis=0)

    # fea_df = pd.DataFrame(predictions).T
    # fea_df.columns = [f"{CFG.model.split('/')[-1]}_fold{fold}" for fold in CFG.trn_fold]
    # del test_dataset, test_loader

    return predictions


model_paths = [
    "../input/albert-xxlarge-v2/albert-xxlarge-v2/",
    "../input/bert-large-cased-cv5/bert-large-cased/",
    "../input/deberta-base-cv5/deberta-base/",
    "../input/deberta-v3-base-cv5/deberta-v3-base/",
    "../input/deberta-v3-small/deberta-v3-small/",
    "../input/distilroberta-base/distilroberta-base/",
    "../input/roberta-large/roberta-large/",
    "../input/xlm-roberta-base/xlm-roberta-base/",
    "../input/xlmrobertalarge-cv5/xlm-roberta-large/",
]

print("train.shape, test.shape", train.shape, test.shape)
print("titles.shape", titles.shape)


# for model_path in model_paths:
#     with open(f'{model_path}/oof_df.pkl', "rb") as fh:
#         oof = pickle.load(fh)[['id', 'fold', 'pred']]
# #     oof = pd.read_pickle(f'{model_path}/oof_df.pkl')[['id', 'fold', 'pred']]
#     oof[f"{model_path.split('/')[1]}"] = oof['pred']
#     train = train.merge(oof[['id', f"{model_path.split('/')[1]}"]], how='left', on='id')
    
oof_res=pd.read_csv('../input/train-res/train_oof.csv')

train = train.merge(oof_res, how='left', on='id')

model_infos = {
    'albert-xxlarge-v2': ['../input/albert-xxlarge-v2/albert-xxlarge-v2/', "albert-xxlarge-v2"],
    'bert-large-cased': ['../input/bert-large-cased-cv5/bert-large-cased/', "bert-large-cased"],
    'deberta-base': ['../input/deberta-base-cv5/deberta-base/', "deberta-base"],
    'deberta-v3-base': ['../input/deberta-v3-base-cv5/deberta-v3-base/', "deberta-v3-base"],
    'deberta-v3-small': ['../input/deberta-v3-small/deberta-v3-small/', "deberta-v3-small"],
    'distilroberta-base': ['../input/distilroberta-base/distilroberta-base/', "distilroberta-base"],
    'roberta-large': ['../input/roberta-large/roberta-large/', "roberta-large"],
    'xlm-roberta-base': ['../input/xlm-roberta-base/xlm-roberta-base/', "xlm-roberta-base"],
    'xlm-roberta-large': ['../input/xlmrobertalarge-cv5/xlm-roberta-large/', "xlm-roberta-large"],
}

for model, path_info in model_infos.items():
    print(model)
    model_path, model_name = path_info[0], path_info[1]
    fea_df = get_fold_pred(CFG, model_path, model_name)
    model_infos[model].append(fea_df)
    del model_path, model_name

del oof_res

Training code:

for fold_ in range(5):
    print("Fold:", fold_)

    trn_ = train[train['fold'] != fold_].index
    val_ = train[train['fold'] == fold_].index
#     print(train.iloc[val_].sort_values('id'))
    trn_x, trn_y = train[train_features].iloc[trn_], train['score'].iloc[trn_]
    val_x, val_y = train[train_features].iloc[val_], train['score'].iloc[val_]

    # train_folds = folds[folds['fold'] != fold].reset_index(drop=True)
    # valid_folds = folds[folds['fold'] == fold].reset_index(drop=True)

    reg = lgb.LGBMRegressor(**params,n_estimators=1100)
    xgb = XGBRegressor(**xgb_params, n_estimators=1000)
    cat = CatBoostRegressor(iterations=1000,learning_rate=0.03,
                            depth=10,
                            eval_metric='RMSE',
                            random_seed = 42,
                            bagging_temperature = 0.2,
                            od_type='Iter',
                            metric_period = 50,
                            od_wait=20)
    print("-"* 20 + "LightGBM Training" + "-"* 20)
    reg.fit(trn_x, np.log1p(trn_y),eval_set=[(val_x, np.log1p(val_y))],early_stopping_rounds=50,verbose=100,eval_metric='rmse')
    print("-"* 20 + "XGboost Training" + "-"* 20)
    xgb.fit(trn_x, np.log1p(trn_y),eval_set=[(val_x, np.log1p(val_y))],early_stopping_rounds=50,eval_metric='rmse',verbose=100)
    print("-"* 20 + "Catboost Training" + "-"* 20)
    cat.fit(trn_x, np.log1p(trn_y), eval_set=[(val_x, np.log1p(val_y))],early_stopping_rounds=50,use_best_model=True,verbose=100)
    
    imp_df = pd.DataFrame()
    imp_df['feature'] = train_features
    imp_df['gain_reg'] = reg.booster_.feature_importance(importance_type='gain')
    imp_df['fold'] = fold_ + 1
    importances = pd.concat([importances, imp_df], axis=0, sort=False)
    
    
    for model, values in model_infos.items():
        test[model] = values[2][fold_]
        
    for model, values in uspppm_model_infos.items():
        test[f"uspppm_{model}"] = values[2][fold_]

        
        
        
#     for f in tqdm(amount_feas, desc="amount_feas basic aggregate features"):
#         for cate in category_fea:
#             if f != cate:
#                 test['{}_{}_medi'.format(cate, f)] = test.groupby(cate)[f].transform('median')
#                 test['{}_{}_mean'.format(cate, f)] = test.groupby(cate)[f].transform('mean')
#                 test['{}_{}_max'.format(cate, f)] = test.groupby(cate)[f].transform('max')
#                 test['{}_{}_min'.format(cate, f)] = test.groupby(cate)[f].transform('min')
#                 test['{}_{}_std'.format(cate, f)] = test.groupby(cate)[f].transform('std')
            
            
            
    # LightGBM
    oof_reg_preds[val_] = reg.predict(val_x, num_iteration=reg.best_iteration_)
#     oof_reg_preds[oof_reg_preds < 0] = 0
    lgb_preds = reg.predict(test[train_features], num_iteration=reg.best_iteration_)
#     lgb_preds[lgb_preds < 0] = 0
    
    
    # Xgboost
    oof_reg_preds1[val_] = xgb.predict(val_x)
    oof_reg_preds1[oof_reg_preds1 < 0] = 0
    xgb_preds = xgb.predict(test[train_features])
#     xgb_preds[xgb_preds < 0] = 0
    
    # catboost
    oof_reg_preds2[val_] = cat.predict(val_x)
    oof_reg_preds2[oof_reg_preds2 < 0] = 0
    cat_preds = cat.predict(test[train_features])
    cat_preds[cat_preds < 0] = 0  # clip negative CatBoost predictions
        
#     merge all prediction
    merge_pred[val_] = oof_reg_preds[val_] * 0.4 + oof_reg_preds1[val_] * 0.3 +oof_reg_preds2[val_] * 0.3
    
#     sub_reg_preds += np.expm1(_preds) / len(folds)
#     sub_reg_preds += np.expm1(_preds) / len(folds)

    sub_preds += (lgb_preds / 5) * 0.6 + (xgb_preds / 5) * 0.2 + (cat_preds / 5) * 0.2  # 5-fold test predictions of the three models (log1p space)

    sub_reg_preds += lgb_preds / 5  # 5-fold test predictions of LightGBM (log1p space)
print("lgb",pearsonr(train['score'], np.expm1(oof_reg_preds))[0]) # lgb
print("xgb",pearsonr(train['score'], np.expm1(oof_reg_preds1))[0]) # xgb
print("cat",pearsonr(train['score'], np.expm1(oof_reg_preds2))[0]) # cat
print("xgb lgb cat",pearsonr(train['score'], np.expm1(merge_pred))[0]) # xgb lgb cat
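
The out-of-fold logic behind this second stage can be reduced to a few lines. The sketch below is a simplified stand-in under stated assumptions: it swaps the tree models for a plain least-squares linear meta-model, and it uses synthetic data in which the features array plays the role of the base models' out-of-fold predictions. It is not the code we ran.

```python
import numpy as np


def oof_linear_stack(features, targets, folds):
    """Fit a linear meta-model on the other folds and predict the held-out
    fold, so that every row gets an out-of-fold prediction."""
    oof = np.zeros(len(targets))
    for k in np.unique(folds):
        trn, val = folds != k, folds == k
        # Add a bias column, then solve ordinary least squares on the training folds
        X_trn = np.column_stack([features[trn], np.ones(trn.sum())])
        coef, *_ = np.linalg.lstsq(X_trn, targets[trn], rcond=None)
        X_val = np.column_stack([features[val], np.ones(val.sum())])
        oof[val] = X_val @ coef
    return oof


# Synthetic data: two noisy "base model" predictions of the same target
rng = np.random.default_rng(0)
targets = rng.uniform(0, 1, size=200)
features = np.column_stack([targets + rng.normal(0, 0.1, size=200) for _ in range(2)])
folds = np.arange(200) % 5

oof = oof_linear_stack(features, targets, folds)
print("stacked OOF Pearson:", np.corrcoef(targets, oof)[0, 1])
```

The meta-model never sees the rows it predicts, which is what keeps the stacked CV score honest.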

Originally published on the author's personal site/blog on 2022-06-23.
