The latest NLP competition practice summary!
2022-07-01 15:43:00 【Datawhale】
Introduction to the competition task
To make their products more competitive and to expand into overseas markets, Chinese automakers need intelligent in-car interaction for those markets. However, countries around the world impose strict legal restrictions on data security, so the biggest challenge for domestic companies building intelligent interaction for overseas markets is the lack of data. This competition asks contestants to use NLP algorithms to achieve multilingual transfer learning in the automotive domain.
Competition page: https://challenge.xfyun.cn/topic/info?type=car-multilingual&ch=ds22-dw-gzh01
Competition task
In this transfer learning task, the iFLYTEK Smart Car BU provides a larger Chinese corpus of in-car human-computer interaction, plus small Chinese-English, Chinese-Japanese, and Chinese-Arabic parallel corpora, as training sets.
Contestants build models from the provided data to perform intent classification and key information extraction. The final evaluation uses English, Japanese, and Arabic test sets.
1. Preliminary round
- Training set: 30,000 Chinese entries, 1,000 Chinese-English parallel entries, 1,000 Chinese-Japanese parallel entries
- Test set A: 500 English entries, 500 Japanese entries
- Test set B: 500 English entries, 500 Japanese entries
2. Second round
- Training set: the same Chinese corpus as in the preliminary round, plus 1,000 Chinese-Arabic parallel entries
- Test set A: 500 Arabic entries
- Test set B: 500 Arabic entries
Competition data
The competition provides three categories of in-car interaction corpus: command control, navigation, and music.
Both the larger Chinese corpus and the smaller multilingual parallel corpora are annotated with intent labels and key information. Contestants need to make full use of the provided data to achieve good results on intent classification and key information extraction for the English, Japanese, and Arabic corpora. The label types and value formats in the data are listed in the table below.
Variable | Value format | Description |
---|---|---|
intent | string | Intent label for the whole utterance |
device | string | Name of the device being operated |
mode | string | Mode of the device being operated |
offset | string | Adjustment applied to the device |
endloc | string | Destination |
landmark | string | Reference landmark for a nearby search |
singer | string | Singer |
song | string | Song |
Evaluation metrics
Submissions are scored by accuracy, computed from the submitted result file.
Intent classification accuracy = number of samples with the correct intent / total number of samples
Key information extraction accuracy = number of samples with completely correct key information / total number of samples
Notes:
A sample counts as wrong if it has any extra or missing key information. The final score is the average of the intent classification accuracy and the key information extraction accuracy.
Converting the text into another language is not allowed during prediction. Intent classification and key information extraction must be performed directly in the language provided by the test set.
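The scoring rule described above can be sketched in a few lines of Python. This is only an illustration of the description, not the organizers' official scoring script: the function names, the intent labels, and the representation of key information as a dict from slot type to value are assumptions made for the example.

```python
from typing import Dict, List


def intent_accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of samples whose predicted intent equals the gold intent."""
    assert gold and len(gold) == len(pred)
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def slot_accuracy(gold: List[Dict[str, str]], pred: List[Dict[str, str]]) -> float:
    """Fraction of samples whose predicted slots match the gold slots exactly.

    A sample with an extra or a missing slot counts as wrong.
    """
    assert gold and len(gold) == len(pred)
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def final_score(gold_intents, pred_intents, gold_slots, pred_slots) -> float:
    """Average of intent accuracy and key-information accuracy."""
    return (intent_accuracy(gold_intents, pred_intents)
            + slot_accuracy(gold_slots, pred_slots)) / 2


# Hypothetical toy data; the labels are invented for illustration.
gold_intents = ["navigate", "play_music", "adjust_device"]
pred_intents = ["navigate", "play_music", "navigate"]
gold_slots = [
    {"endloc": "airport"},
    {"singer": "Adele", "song": "Hello"},
    {"device": "air conditioner"},
]
pred_slots = [
    {"endloc": "airport"},
    {"singer": "Adele"},  # the song is missing, so the whole sample is wrong
    {"device": "air conditioner"},
]

score = final_score(gold_intents, pred_intents, gold_slots, pred_slots)
print(f"final score: {score:.4f}")
```

In the toy data, two of three intents and two of three slot sets are correct, so both accuracies and their average are 2/3.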
Approach
Intent classification is a typical text classification task;
Information extraction is an entity extraction task;
The competition task has the following characteristics:
The text is multilingual, so multilingual BERT is worth considering;
The text is short, so keyword matching is worth trying;
Here we start with a TF-IDF + logistic regression baseline, and will share approaches based on BERT and keyword matching in later posts.
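The keyword-matching idea mentioned above is not implemented in this post. As a rough sketch of what it might look like, the snippet below collects slot values seen in training data into a lexicon and looks them up as substrings in new text. The function names, the training examples, and the dict-based slot format are all hypothetical, and plain substring matching is only one of several possible matching strategies.

```python
from typing import Dict, List, Tuple


def build_lexicon(examples: List[Tuple[str, Dict[str, str]]]) -> Dict[str, str]:
    """Map each slot value seen in training (lowercased) to its slot type."""
    lexicon: Dict[str, str] = {}
    for _text, slots in examples:
        for slot_type, value in slots.items():
            lexicon[value.lower()] = slot_type
    return lexicon


def extract_slots(text: str, lexicon: Dict[str, str]) -> Dict[str, str]:
    """Return slots whose known value occurs in the text.

    Longer values are tried first, and only the first match per slot type
    is kept. The returned value is sliced from the original text so that
    its casing is preserved (this assumes lowercasing keeps string length,
    which holds for ASCII text).
    """
    found: Dict[str, str] = {}
    lowered = text.lower()
    for value in sorted(lexicon, key=len, reverse=True):
        slot_type = lexicon[value]
        idx = lowered.find(value)
        if idx != -1 and slot_type not in found:
            found[slot_type] = text[idx:idx + len(value)]
    return found


# Hypothetical training examples: (utterance, labelled slots).
train_examples = [
    ("play hello by adele", {"song": "hello", "singer": "adele"}),
    ("navigate to the airport", {"endloc": "airport"}),
]
lexicon = build_lexicon(train_examples)

print(extract_slots("Please play Hello by Adele", lexicon))
print(extract_slots("turn on the radio", lexicon))
```

A lexicon like this can only find values it has already seen, so on its own it is likely to miss unseen song titles or destinations.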
Step 1: Import libraries
import pandas as pd # Read and write Excel files
import numpy as np # Numerical computation
import nagisa # Japanese word segmentation
from sklearn.feature_extraction.text import TfidfVectorizer # TF-IDF text features
from sklearn.linear_model import LogisticRegression # Logistic regression
from sklearn.pipeline import make_pipeline # Chain the vectorizer and the classifier
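Step 2: Read the data
# Read the training and test data
# Note: file, sheet, and column names below must match the files you downloaded; adjust them if they differ.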
train_cn = pd.read_excel(' The preliminary training set of the multilingual transfer learning challenge in the automotive field / chinese _trian.xlsx')
train_ja = pd.read_excel(' The preliminary training set of the multilingual transfer learning challenge in the automotive field / Japanese _train.xlsx')
train_en = pd.read_excel(' The preliminary training set of the multilingual transfer learning challenge in the automotive field / english _train.xlsx')
test_ja = pd.read_excel('testA.xlsx', sheet_name=' Japanese _testA')
test_en = pd.read_excel('testA.xlsx', sheet_name=' english _testA')
Step 3: Tokenize the text
# Segment Japanese text with nagisa and join the words with spaces; lowercase English text
train_ja['words'] = train_ja[' Original text '].apply(lambda x: ' '.join(nagisa.tagging(x).words))
train_en['words'] = train_en[' Original text '].apply(lambda x: x.lower())
test_ja['words'] = test_ja[' Original text '].apply(lambda x: ' '.join(nagisa.tagging(x).words))
test_en['words'] = test_en[' Original text '].apply(lambda x: x.lower())
Step 4: Build the model
# Train TF-IDF + logistic regression on the Japanese and English training data
pipeline = make_pipeline(
TfidfVectorizer(),
LogisticRegression()
)
pipeline.fit(
train_ja['words'].tolist() + train_en['words'].tolist(),
train_ja[' Intention '].tolist() + train_en[' Intention '].tolist()
)
# Predict intents for the test sets
test_ja[' Intention '] = pipeline.predict(test_ja['words'])
test_en[' Intention '] = pipeline.predict(test_en['words'])
# This baseline does not extract key information, so the slot values are left empty
test_en[' Slot value 1'] = np.nan
test_en[' Slot value 2'] = np.nan
test_ja[' Slot value 1'] = np.nan
test_ja[' Slot value 2'] = np.nan
# Write the submission file; the context manager saves and closes the workbook
with pd.ExcelWriter('submit.xlsx') as writer:
    test_en.drop(['words'], axis=1).to_excel(writer, sheet_name=' english _testA', index=False)
    test_ja.drop(['words'], axis=1).to_excel(writer, sheet_name=' Japanese _testA', index=False)
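The baseline above trains on all labelled data and predicts the test set directly, so it gives no local estimate of accuracy. One possible way to get such an estimate is to hold out part of the training data. The sketch below does this on a small made-up English dataset. The utterances, the intent names, the 25% split, and `max_iter=1000` are assumptions chosen for the example, not part of the original baseline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy utterances and intent labels (4 per class).
texts = [
    "navigate to the airport", "take me to the nearest gas station",
    "find a route to the city center", "drive to the train station",
    "play a song by adele", "play some jazz music",
    "i want to listen to hello", "put on the next song",
    "turn on the air conditioner", "open the sunroof",
    "turn up the volume", "close the left window",
]
labels = ["navigation"] * 4 + ["music"] * 4 + ["control"] * 4

# Stratified split keeps one validation sample per class here.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Same model family as the baseline: TF-IDF features + logistic regression.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

val_pred = model.predict(X_val)
val_accuracy = accuracy_score(y_val, val_pred)
print(f"validation intent accuracy: {val_accuracy:.2f}")
```

With the real data, the same split could be applied to the combined Japanese and English `words` and intent columns before fitting the pipeline.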
Follow the Datawhale official account and reply “NLP” to be invited to the NLP competition discussion group. If you are already in the group, there is no need to join again.