
The latest NLP competition: a practice summary!

2022-07-01 15:43:00 【Datawhale】

Introduction to the competition task

To make their products more competitive and to expand into overseas markets, domestic automakers need intelligent in-car interaction for those markets. However, countries around the world impose strict legal restrictions on data security, so the biggest challenge for domestic companies building intelligent interaction for overseas markets is the lack of data. This competition asks contestants to use NLP algorithms to achieve multilingual transfer learning in the automotive domain.

Event address: https://challenge.xfyun.cn/topic/info?type=car-multilingual&ch=ds22-dw-gzh01

Competition task

In this transfer learning task, the iFLYTEK Smart Car BU provides a large Chinese corpus of in-car human-computer interaction, plus a small number of Chinese-English, Chinese-Japanese, and Chinese-Arabic parallel corpora, as training sets.

Contestants build models from the data provided to perform intent classification and key information extraction. The models are ultimately tested and scored on English, Japanese, and Arabic corpora.

1. Preliminary round

- Training set: 30,000 Chinese entries, 1,000 Chinese-English parallel entries, 1,000 Chinese-Japanese parallel entries
- Test set A: 500 English entries, 500 Japanese entries
- Test set B: 500 English entries, 500 Japanese entries

2. Second round

- Training set: the same Chinese corpus as in the preliminary round, plus 1,000 Chinese-Arabic parallel entries
- Test set A: 500 Arabic entries
- Test set B: 500 Arabic entries

Competition data

The competition provides contestants with in-car interaction corpora covering three functional categories: command control, navigation, and music.

Both the larger Chinese corpus and the smaller multilingual parallel corpora are annotated with intent classes and key information. Contestants need to make full use of the data provided to achieve good results on intent classification and key information extraction for the English, Japanese, and Arabic corpora. The label types and value types contained in the data are shown in the following table.

Variable	Type	Description
intent	string	Intent label for the whole sentence
device	string	Name of the device being operated
mode	string	Mode of the device being operated
offset	string	Adjustment applied to the device
endloc	string	Destination
landmark	string	Reference landmark for a nearby search
singer	string	Singer
song	string	Song

Evaluation metrics

Models are evaluated by accuracy, computed from the submitted result file.

Intent classification accuracy = number of entries with the correct intent / total number of entries

Key information extraction accuracy = number of entries with completely correct key information / total number of entries

Notes:

For each entry, extracting either too much or too little key information counts as wrong. The final score is the average of the intent classification accuracy and the key information extraction accuracy.
Language conversion is not allowed during prediction. Intent classification and key information extraction must be performed directly in the language of the test set.
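
The scoring rule above can be sketched in plain Python. This is only an illustration under assumptions: the organizers' actual scoring script is not shown in this article, and the function name `score`, the record layout (a dict with an `intent` string and a `slots` dict), and the sample intent names are all hypothetical.

```python
def score(gold, pred):
    """Return (intent accuracy, slot accuracy, average of the two).

    Each record is a dict with two hypothetical keys:
      "intent": str
      "slots":  dict mapping a slot name (e.g. "singer") to its value
    """
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")
    total = len(gold)
    intent_hits = sum(g["intent"] == p["intent"] for g, p in zip(gold, pred))
    # An entry counts only when its slots match exactly:
    # one extra or one missing slot makes the whole entry wrong.
    slot_hits = sum(g["slots"] == p["slots"] for g, p in zip(gold, pred))
    intent_acc = intent_hits / total
    slot_acc = slot_hits / total
    return intent_acc, slot_acc, (intent_acc + slot_acc) / 2


# Made-up records, for illustration only.
gold = [
    {"intent": "play_music", "slots": {"singer": "Adele", "song": "Hello"}},
    {"intent": "navigate", "slots": {"endloc": "airport"}},
]
pred = [
    {"intent": "play_music", "slots": {"singer": "Adele"}},  # "song" is missing
    {"intent": "navigate", "slots": {"endloc": "airport"}},
]
print(score(gold, pred))
```

In this toy case both intents are right, but the first entry misses a slot, so only one of the two entries counts for key information extraction.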

Approach

Intent classification is a typical text classification task;
Key information extraction is an entity extraction task;

The competition task has the following characteristics:

The text is multilingual, so multilingual BERT is worth considering;
The text is short, so keyword matching is worth trying;

We therefore start with TF-IDF plus logistic regression. In follow-up posts we will share approaches based on BERT and on keyword matching.

Step 1: Import libraries

import pandas as pd # Read and write files
import numpy as np # Numerical computation
import nagisa # Japanese word segmentation
from sklearn.feature_extraction.text import TfidfVectorizer # Text feature extraction
from sklearn.linear_model import LogisticRegression # Logistic regression
from sklearn.pipeline import make_pipeline # Pipeline

Step 2: Read the data

# Read the data
# Note: the file, sheet, and column names in this walkthrough must match those in the downloaded dataset
train_cn = pd.read_excel('The preliminary training set of the multilingual transfer learning challenge in the automotive field/chinese_trian.xlsx')
train_ja = pd.read_excel('The preliminary training set of the multilingual transfer learning challenge in the automotive field/Japanese_train.xlsx')
train_en = pd.read_excel('The preliminary training set of the multilingual transfer learning challenge in the automotive field/english_train.xlsx')

test_ja = pd.read_excel('testA.xlsx', sheet_name='Japanese_testA')
test_en = pd.read_excel('testA.xlsx', sheet_name='english_testA')

Step 3: Tokenize the text

# Tokenize the text: segment Japanese with nagisa and join the tokens with spaces; lowercase English
train_ja['words'] = train_ja['Original text'].apply(lambda x: ' '.join(nagisa.tagging(x).words))
train_en['words'] = train_en['Original text'].apply(lambda x: x.lower())

test_ja['words'] = test_ja['Original text'].apply(lambda x: ' '.join(nagisa.tagging(x).words))
test_en['words'] = test_en['Original text'].apply(lambda x: x.lower())
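
One detail worth keeping in mind when feeding space-joined tokens to scikit-learn: the default `token_pattern` of `TfidfVectorizer` only keeps tokens of two or more word characters, so single-character tokens (common among Japanese particles) are dropped. The sketch below reproduces that default pattern with the standard `re` module; the example sentence is segmented by hand and is hypothetical, and `WHITESPACE_PATTERN` is just one possible alternative, not something the original baseline uses.

```python
import re

# Default token_pattern of scikit-learn's TfidfVectorizer / CountVectorizer.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Alternative: treat every whitespace-separated chunk as one token.
WHITESPACE_PATTERN = r"(?u)\S+"


def tokens(text, pattern):
    """Split text the way a vectorizer with this token_pattern would."""
    return re.findall(pattern, text)


ja = "エアコン を つけ て"  # hypothetical, hand-segmented sentence
print(tokens(ja, DEFAULT_PATTERN))     # single-character tokens are dropped
print(tokens(ja, WHITESPACE_PATTERN))  # every segmented token is kept
```

Passing `token_pattern=r"(?u)\S+"` to `TfidfVectorizer` would keep the single-character tokens. Whether that improves the score on this task has not been verified here.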

Step 4: Build the model

# Train TF-IDF features and a logistic regression classifier
pipeline = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression()
)
pipeline.fit(
    train_ja['words'].tolist() + train_en['words'].tolist(),
    train_ja['Intention'].tolist() + train_en['Intention'].tolist()
)

# Predict the intent for each test set
test_ja['Intention'] = pipeline.predict(test_ja['words'])
test_en['Intention'] = pipeline.predict(test_en['words'])

# This baseline does not extract key information, so the slot values are left empty
test_en['Slot value 1'] = np.nan
test_en['Slot value 2'] = np.nan

test_ja['Slot value 1'] = np.nan
test_ja['Slot value 2'] = np.nan

# Write the submission file; the context manager saves and closes the workbook
# (writer.save() was removed in pandas 2.0)
with pd.ExcelWriter('submit.xlsx') as writer:
    test_en.drop(['words'], axis=1).to_excel(writer, sheet_name='english_testA', index=False)
    test_ja.drop(['words'], axis=1).to_excel(writer, sheet_name='Japanese_testA', index=False)
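
The baseline above leaves both slot-value columns empty. As a rough illustration of the keyword-matching idea mentioned in the approach, the sketch below collects known slot values from labelled examples and looks them up as substrings of a new sentence. It is not the authors' method: the function names `build_lexicon` and `match_slots`, the `(text, slots)` input layout, and the sample sentences are all hypothetical.

```python
def build_lexicon(examples):
    """Collect known slot values per slot name from labelled examples.

    `examples` is an iterable of (text, slots) pairs, where `slots` maps a
    slot name such as "singer" to its value (a hypothetical layout).
    """
    lexicon = {}
    for _text, slots in examples:
        for name, value in slots.items():
            lexicon.setdefault(name, set()).add(value.lower())
    return lexicon


def match_slots(text, lexicon):
    """Return, per slot name, the longest known value that occurs in text."""
    lowered = text.lower()
    found = {}
    for name, values in lexicon.items():
        hits = [v for v in values if v in lowered]
        if hits:
            # Prefer the longest hit; break ties alphabetically.
            best = max(hits, key=lambda v: (len(v), v))
            start = lowered.find(best)
            # Slice the original text so the original casing is preserved.
            found[name] = text[start:start + len(best)]
    return found


# Made-up training examples, for illustration only.
train = [
    ("play hello by adele", {"singer": "Adele", "song": "Hello"}),
    ("navigate to the airport", {"endloc": "airport"}),
]
lexicon = build_lexicon(train)
print(match_slots("Please play Hello", lexicon))
```

Plain substring matching is crude: it can fire inside longer words and it only finds values seen in training, so it is best treated as a starting point rather than a finished extractor.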
