A practical summary of the latest NLP competition!
2022-07-01 15:43:00 【Datawhale】
Competition background
To make their products more competitive and to expand into overseas markets, Chinese automakers need intelligent interaction for those markets. However, countries around the world impose strict legal restrictions on data security, so the biggest challenge for domestic companies building overseas intelligent interaction is the lack of data. This competition asks contestants to use NLP algorithms to achieve multilingual transfer learning in the automotive domain.
Competition page: https://challenge.xfyun.cn/topic/info?type=car-multilingual&ch=ds22-dw-gzh01
Task
In this transfer learning task, the iFLYTEK Smart Car BU provides a large in-vehicle human-computer interaction corpus in Chinese, plus small Chinese-English, Chinese-Japanese, and Chinese-Arabic parallel corpora, as the training set.
Contestants build models from the data provided to perform intent classification and key information extraction. The models are finally tested and scored on English, Japanese, and Arabic.
1. Preliminary round
- Training set: 30,000 Chinese utterances, 1,000 Chinese-English parallel pairs, 1,000 Chinese-Japanese parallel pairs
- Test set A: 500 English utterances, 500 Japanese utterances
- Test set B: 500 English utterances, 500 Japanese utterances
2. Second round
- Training set: the same Chinese corpus as in the preliminary round, plus 1,000 Chinese-Arabic parallel pairs
- Test set A: 500 Arabic utterances
- Test set B: 500 Arabic utterances
Data
The competition provides in-vehicle interaction corpora of three types: command control, navigation, and music.
Both the larger Chinese corpus and the smaller multilingual parallel corpora are annotated with intent labels and key information. Contestants need to make full use of this data to achieve good results on intent classification and key information extraction for English, Japanese, and Arabic. The label types and value types in the data are shown in the table below.
Variable | Format | Description |
---|---|---|
intent | string | Intent label for the whole utterance |
device | string | Name of the device being operated |
mode | string | Mode of the device being operated |
offset | string | Adjustment applied to the device |
endloc | string | Destination |
landmark | string | Reference landmark for a nearby search |
singer | string | Singer |
song | string | Song |
Evaluation metrics
Submissions are scored by accuracy, computed on the submitted result file.
Intent classification accuracy = number of samples with the correct intent / total number of samples
Key information extraction accuracy = number of samples with completely correct key information / total number of samples
Notes:
Extracting too much or too little key information for a sample counts as an error. The final score is the average of the intent classification accuracy and the key information extraction accuracy;
Language conversion is not allowed during prediction. Intent classification and key information extraction must be performed directly in the language provided by the test set.
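The scoring rule above can be sketched in a few lines. This is only an illustration of the description, not the organizers' scoring script: the `score` function, the `(intent, slots)` tuple layout, and the intent names in the sample data are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# One sample: (intent label, {slot name: slot value}). Layout is hypothetical.
Sample = Tuple[str, Dict[str, str]]


def score(gold: List[Sample], pred: List[Sample]) -> float:
    """Average of intent accuracy and exact-match key information accuracy."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and the same length")
    total = len(gold)
    # A sample counts for intent accuracy when the intent label matches.
    intent_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # A sample counts for extraction accuracy only when every slot matches:
    # an extra or a missing slot makes the whole sample wrong.
    slot_hits = sum(g[1] == p[1] for g, p in zip(gold, pred))
    return (intent_hits / total + slot_hits / total) / 2


# Made-up labels, for illustration only.
gold = [
    ("navigate", {"endloc": "airport"}),
    ("play_music", {"singer": "Adele", "song": "Hello"}),
]
pred = [
    ("navigate", {"endloc": "airport"}),
    ("play_music", {"singer": "Adele"}),  # one slot missing, so this sample is wrong
]
final_score = score(gold, pred)
print(final_score)
```

In this toy case both intents are right but only one sample has fully correct slots, so the two accuracies are 1.0 and 0.5 before averaging.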
Approach
Intent classification is a typical text classification task;
Information extraction is an entity extraction task;
The competition task has the following characteristics:
Multilingual text, so multilingual BERT is worth considering;
Short text, so keyword matching is worth trying;
Here we start with TF-IDF + logistic regression. Later posts will share approaches based on BERT and keyword matching.
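To make the keyword matching idea concrete, here is a minimal sketch. It is not part of the baseline: the `match_slots` function and the small lexicon are hypothetical, and in practice the lexicon would presumably be collected from the slot values in the training data.

```python
from typing import Dict, List


def match_slots(text: str, lexicon: Dict[str, List[str]]) -> Dict[str, str]:
    """For each slot, return the longest lexicon entry that occurs in the text."""
    lowered = text.lower()
    found: Dict[str, str] = {}
    for slot, values in lexicon.items():
        hits = [value for value in values if value.lower() in lowered]
        if hits:
            # Prefer the longest match so a short keyword does not shadow
            # a more specific one.
            found[slot] = max(hits, key=len)
    return found


# Hypothetical lexicon; the slot names follow the table in the data section.
lexicon = {
    "device": ["air conditioner", "window", "seat heater"],
    "singer": ["adele", "jay chou"],
}
slots = match_slots("Turn on the air conditioner please", lexicon)
print(slots)
```

Plain substring matching is a deliberately simple choice for short utterances. It cannot find values that never appear in the lexicon, which is one reason to combine it with a learned model.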
Step 1: Import libraries
import pandas as pd # Read and write the data files
import numpy as np # Numerical computation
import nagisa # Japanese word segmentation
from sklearn.feature_extraction.text import TfidfVectorizer # Text feature extraction
from sklearn.linear_model import LogisticRegression # Logistic regression
from sklearn.pipeline import make_pipeline # Chain the vectorizer and the classifier
Step 2: Read the data
# Read the data (folder, file, and sheet names must match the downloaded competition files)
train_cn = pd.read_excel('The preliminary training set of the multilingual transfer learning challenge in the automotive field/chinese_trian.xlsx')
train_ja = pd.read_excel('The preliminary training set of the multilingual transfer learning challenge in the automotive field/Japanese_train.xlsx')
train_en = pd.read_excel('The preliminary training set of the multilingual transfer learning challenge in the automotive field/english_train.xlsx')
test_ja = pd.read_excel('testA.xlsx', sheet_name='Japanese_testA')
test_en = pd.read_excel('testA.xlsx', sheet_name='english_testA')
Step 3: Tokenize the text
# Japanese: segment with nagisa and join the tokens with spaces
# English: lowercase only, since words are already separated by spaces
train_ja['words'] = train_ja['Original text'].apply(lambda x: ' '.join(nagisa.tagging(x).words))
train_en['words'] = train_en['Original text'].apply(lambda x: x.lower())
test_ja['words'] = test_ja['Original text'].apply(lambda x: ' '.join(nagisa.tagging(x).words))
test_en['words'] = test_en['Original text'].apply(lambda x: x.lower())
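One detail worth knowing about the space-joined Japanese text: with default settings, `TfidfVectorizer` tokenizes using the pattern `(?u)\b\w\w+\b`, which only keeps tokens of two or more word characters. The sketch below applies that pattern with `re` alone to show the effect. The sentence is segmented by hand for illustration and is an assumption, not actual nagisa output.

```python
import re

# Default token pattern of scikit-learn's TfidfVectorizer.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Alternative pattern: every whitespace-separated chunk is a token.
WHITESPACE_PATTERN = r"(?u)\S+"

# Hand-segmented example sentence (assumption, not real nagisa output).
segmented = "エアコン を 26 度 に 設定 し て"

default_tokens = re.findall(DEFAULT_PATTERN, segmented)
whitespace_tokens = re.findall(WHITESPACE_PATTERN, segmented)

print(default_tokens)     # single-character tokens such as を and 度 are dropped
print(whitespace_tokens)  # all segmented tokens are kept
```

Passing `token_pattern=r"(?u)\S+"` to `TfidfVectorizer` would keep the single-character tokens. Whether that improves the score on this data has not been checked here.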
Step 4: Build the model
# Train TF-IDF + logistic regression on the Japanese and English training sets together
pipeline = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression()
)
pipeline.fit(
    train_ja['words'].tolist() + train_en['words'].tolist(),
    train_ja['Intention'].tolist() + train_en['Intention'].tolist()
)
# Predict the intent for both test sets
test_ja['Intention'] = pipeline.predict(test_ja['words'])
test_en['Intention'] = pipeline.predict(test_en['words'])
# This baseline does not extract key information, so the slot columns are left empty
test_en['Slot value 1'] = np.nan
test_en['Slot value 2'] = np.nan
test_ja['Slot value 1'] = np.nan
test_ja['Slot value 2'] = np.nan
# Write the submission file; the context manager saves and closes the workbook
with pd.ExcelWriter('submit.xlsx') as writer:
    test_en.drop(['words'], axis=1).to_excel(writer, sheet_name='english_testA', index=False)
    test_ja.drop(['words'], axis=1).to_excel(writer, sheet_name='Japanese_testA', index=False)