CCF BDCI 2020 房产行业聊天问答匹配赛道 A榜47/2985

Overview

CCF BDCI 2020 房产行业聊天问答匹配 A榜47/2985

赛题描述详见:https://www.datafountain.cn/competitions/474

文件说明

data: 存放训练数据和测试数据以及预处理代码

model_bert.py: 网络模型结构定义

adv_train.py: 对抗训练代码

run_bert_pse_adv.py: 运行bert-wwm + 对抗训练 + 伪标签模型

run_roberta_cls_pse_reinit_adv.py: 运行roberta-large last2embedding_cls + reinit + 对抗训练 + 伪标签模型

个人方案

我的baseline是将query和answer拼接后传入预训练好的bert进行特征提取,之后将提取的特征传入一个全连接层,最后接一个softmax进行分类。

其中尝试的预训练模型有bert(谷歌),bert_wwm(哈工大版本),roberta_large(哈工大版本),xlneternie等,其中效果较好的有bert-wwm和roberta-large。之后在baseline的基础上进行了各种尝试,主要尝试有以下:

模型 线上F1
bert-wwm 0.78
bert-wwm + 对抗训练 0.783
bert-wwm + 对抗训练 + 伪标签 0.7879
roberta-large 0.774
roberta-large + reinit + 对抗训练 0.786
roberta-large + reinit+对抗训练 + 伪标签 0.7871
roberta-large last2embedding_cls + reinit + 对抗训练 + 伪标签 0.7879

对抗训练

其基本的原理呢,就是通过添加扰动构造一些对抗样本,放给模型去训练,以攻为守,提高模型在遇到对抗样本时的鲁棒性,同时一定程度也能提高模型的表现和泛化能力。

参考链接:https://zhuanlan.zhihu.com/p/91269728

伪标签

将测试数据和预测结果进行拼接,之后当成训练数据传入到模型中重新进行训练。为了减少对训练数据的原始分布的影响并增加伪标签的置信度,我只在五个采用不同预训练模型的baseline预测一致的数据中采样了6000条测试数据加入到训练集进行训练。

重新初始化

参考链接:如何让Bert在finetune小数据集时更“稳”一点 https://zhuanlan.zhihu.com/p/148720604

大致思想是靠近底部的层(靠近input)学到的是比较通用的语义方面的信息,比如词性、词法等语言学知识,而靠近顶部的层会倾向于学习到接近下游任务的知识,对于预训练来说就是类似masked word prediction、next sentence prediction任务的相关知识。当使用bert预训练模型finetune其他下游任务(比如序列标注)时,如果下游任务与预训练任务差异较大,那么bert顶层的权重所拥有的知识反而会拖累整体的finetune进程,使得模型在finetune初期产生训练不稳定的问题。

因此,我们可以在finetune时,只保留接近底部的bert权重,对于靠近顶部的层的权重,可以重新随机初始化,从头开始学习。

在本次比赛中,我只对最后roberta-large的最后五层进行重新初始化。在实验中,我发现对于bert,重新初始化会降低效果,而roberta-large则有提升。

bert 不同embedding和cls组合

思路主要是参考 CCF BDCI 2019 互联网新闻情感分析 复赛top1解决方案

参考链接:https://github.com/cxy229/BDCI2019-SENTIMENT-CLASSIFICATION

即对bert不同embedding进行组合后传入全连接层进行分类。该方案尝试时间较晚,只实验last2embedding_cls这种组合,结果也确实有提升。

模型融合

对于单模,我采用五折交叉验证,对每一个单模的五个模型结果,我尝试了相加融合和投票的方式,结果是融合相加的线上f1较高

对于不同模型,我也只是采用的相加融合的方式(由于时间问题没有尝试投票和stacking的方式)。最后a榜效果最好的是bert-wwm + 对抗训练 + 伪标签、roberta-large + reinit+对抗训练 + 伪标签、roberta-large last2embedding_cls + reinit + 对抗训练 + 伪标签 三个模型的融合,线上F1有 0.7908 , 排名47;B榜我尝试只对两个效果最好的模型进行融合,即 bert-wwm + 对抗训练 + 伪标签last2embedding_cls + reinit + 对抗训练 + 伪标签,最终F1为0.80,排名72。

总结

本次参加比赛完全是数据挖掘课程要求,也是我第一次参加大数据比赛。因为我的研究方向是图像,所以基本可以说是从零开始,写这个github只是想记录一下这一个月自己从零开始的参赛经历,也希望对同样参加类似比赛的新人有帮助。最后,希望看到了顺手给star,万分感谢。

Owner
shuo
shuo
Athena is an open-source implementation of end-to-end speech processing engine.

Athena is an open-source implementation of end-to-end speech processing engine. Our vision is to empower both industrial application and academic research on end-to-end models for speech processing.

Ke Technologies 34 Sep 08, 2022
CCF BDCI BERT系统调优赛题baseline(Pytorch版本)

CCF BDCI BERT系统调优赛题baseline(Pytorch版本) 此版本基于Pytorch后端的huggingface进行实现。由于此实现使用了Oneflow的dataloader作为数据读入的方式,因此也需要安装Oneflow。其它框架的数据读取可以参考OneflowDataloade

Ziqi Zhou 9 Oct 13, 2022
SurvTRACE: Transformers for Survival Analysis with Competing Events

⭐ SurvTRACE: Transformers for Survival Analysis with Competing Events This repo provides the implementation of SurvTRACE for survival analysis. It is

Zifeng 13 Oct 06, 2022
Just a Basic like Language for Zeno INC

zeno-basic-language Just a Basic like Language for Zeno INC This is written in 100% python. this is basic language like language. so its not for big p

Voidy Devleoper 1 Dec 18, 2021
gaiic2021-track3-小布助手对话短文本语义匹配复赛rank3、决赛rank4

决赛答辩已经过去一段时间了,我们队伍ac milan最终获得了复赛第3,决赛第4的成绩。在此首先感谢一些队友的carry~ 经过2个多月的比赛,学习收获了很多,也认识了很多大佬,在这里记录一下自己的参赛体验和学习收获。

102 Dec 19, 2022
Textlesslib - Library for Textless Spoken Language Processing

textlesslib Textless NLP is an active area of research that aims to extend NLP t

Meta Research 379 Dec 27, 2022
Easy-to-use CPM for Chinese text generation

CPM 项目描述 CPM(Chinese Pretrained Models)模型是北京智源人工智能研究院和清华大学发布的中文大规模预训练模型。官方发布了三种规模的模型,参数量分别为109M、334M、2.6B,用户需申请与通过审核,方可下载。 由于原项目需要考虑大模型的训练和使用,需要安装较为复杂

382 Jan 07, 2023
L3Cube-MahaCorpus a Marathi monolingual data set scraped from different internet sources.

L3Cube-MahaCorpus L3Cube-MahaCorpus a Marathi monolingual data set scraped from different internet sources. We expand the existing Marathi monolingual

21 Dec 17, 2022
TextFlint is a multilingual robustness evaluation platform for natural language processing tasks,

TextFlint is a multilingual robustness evaluation platform for natural language processing tasks, which unifies general text transformation, task-specific transformation, adversarial attack, sub-popu

TextFlint 587 Dec 20, 2022
novel deep learning research works with PaddlePaddle

Research 发布基于飞桨的前沿研究工作,包括CV、NLP、KG、STDM等领域的顶会论文和比赛冠军模型。 目录 计算机视觉(Computer Vision) 自然语言处理(Natrual Language Processing) 知识图谱(Knowledge Graph) 时空数据挖掘(Spa

1.5k Jan 03, 2023
Implementing SimCSE(paper, official repository) using TensorFlow 2 and KR-BERT.

KR-BERT-SimCSE Implementing SimCSE(paper, official repository) using TensorFlow 2 and KR-BERT. Training Unsupervised python train_unsupervised.py --mi

Jeong Ukjae 27 Dec 12, 2022
Speech Recognition for Uyghur using Speech transformer

Speech Recognition for Uyghur using Speech transformer Training: this model using CTC loss and Cross Entropy loss for training. Download pretrained mo

Uyghur 11 Nov 17, 2022
A machine learning model for analyzing text for user sentiment and determine whether its a positive, neutral, or negative review.

Sentiment Analysis on Yelp's Dataset Author: Roberto Sanchez, Talent Path: D1 Group Docker Deployment: Deployment of this application can be found her

Roberto Sanchez 0 Aug 04, 2021
IMS-Toucan is a toolkit to train state-of-the-art Speech Synthesis models

IMS-Toucan is a toolkit to train state-of-the-art Speech Synthesis models. Everything is pure Python and PyTorch based to keep it as simple and beginner-friendly, yet powerful as possible.

Digital Phonetics at the University of Stuttgart 247 Jan 05, 2023
PyJPBoatRace: Python-based Japanese boatrace tools 🚤

pyjpboatrace :speedboat: provides you with useful tools for data analysis and auto-betting for boatrace.

5 Oct 29, 2022
Official Stanford NLP Python Library for Many Human Languages

Official Stanford NLP Python Library for Many Human Languages

Stanford NLP 6.4k Jan 02, 2023
Implementation of the Hybrid Perception Block and Dual-Pruned Self-Attention block from the ITTR paper for Image to Image Translation using Transformers

ITTR - Pytorch Implementation of the Hybrid Perception Block (HPB) and Dual-Pruned Self-Attention (DPSA) block from the ITTR paper for Image to Image

Phil Wang 17 Dec 23, 2022
MEDIALpy: MEDIcal Abbreviations Lookup in Python

A small python package that allows the user to look up common medical abbreviations.

Aberystwyth Systems Biology 7 Nov 09, 2022
ThinkTwice: A Two-Stage Method for Long-Text Machine Reading Comprehension

ThinkTwice ThinkTwice is a retriever-reader architecture for solving long-text machine reading comprehension. It is based on the paper: ThinkTwice: A

Walle 4 Aug 06, 2021
NLP-SentimentAnalysis - Coursera Course ( Duration : 5 weeks ) offered by DeepLearning.AI

Coursera Natural Language Processing Specialization This repository contains material related to Coursera Natural Language Processing Specialization.

Nishant Sharma 1 Jun 05, 2022