[About Text Classification Tricks] Things You Don't Know
2022-07-03 23:48:00 【Necther】
1. Data Preprocessing
1.1 Vocabulary (vocab) construction
- Decide how to select vocab entries (e.g., keep the top-N most frequent words, or filter out words that occur fewer than 3 times).
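A minimal vocab-building sketch of the two selection rules above; `max_size` and `min_freq` are illustrative defaults, and `tokenized_texts` is an assumed list of token lists:

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size=50000, min_freq=3):
    """Keep the top-N most frequent tokens, dropping rare ones."""
    counter = Counter(tok for text in tokenized_texts for tok in text)
    # Reserve indices 0/1 for padding and out-of-vocabulary tokens.
    vocab = {"<PAD>": 0, "<UNK>": 1}
    for tok, freq in counter.most_common(max_size):
        if freq < min_freq:
            break  # most_common is sorted, so all remaining tokens are rarer
        vocab[tok] = len(vocab)
    return vocab
```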
1.2 Model input
- Depending on the model, the input can be processed at char level, word level, etc.
- Adding part-of-speech features and sentiment features of words to the training data tends to improve results.
- For the PAD length, take the mean sequence length or something slightly larger (e.g., the 75th percentile); just don't take the maximum length (see the sketch after this list).
- You can keep only certain parts of speech, e.g., only adjectives and nouns.
- Adding stemmed forms to the training data tends to improve results.
- Adding topic vectors to the training data tends to improve results.
- Adding position vectors tends to improve results (embed the position of each token, then concatenate it with the word vector).
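A minimal sketch of picking the padding length as the 75th percentile of sequence lengths rather than the maximum; `tokenized_texts` is an assumed list of token lists:

```python
import numpy as np

# The maximum length wastes computation on a few outliers; the 75th
# percentile covers most sequences at a fraction of the cost.
lengths = [len(tokens) for tokens in tokenized_texts]
max_len = int(np.percentile(lengths, 75))
```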
1.3 Handling noisy data
- Noise types (for a dataset D(X, Y)):
    - Type I: X is noisy (e.g., the text is colloquial or generated by ordinary internet users);
    - Type II: Y is noisy (some samples are clearly mislabeled, some are hard even for humans to categorize, and some categories are inherently ambiguous);
    - Type III: fixed-pattern data (e.g., high-frequency boilerplate such as "XX reports" or "edited by XX"; tokens that clearly bias the model's judgment, such as punctuation).
- Solutions for noise type I:
    - Method 1: from the word-vector vs. char-vector angle:
        - s1: use char-level input (for Chinese, character granularity) as the model input;
        - s2: compare training from scratch (no pretrained word vectors) against word-level input;
        - s3: prefer whichever performs better.
    - Method 2: train word vectors with fastText using special hyperparameters (see the sketch after this item):
        - Background: for English, fastText's char n-gram window is usually 3–6, but for Chinese, if the goal is to remove input noise, limit the window to 1–2. A small window helps the model capture typos (when we mistype, one character usually becomes a homophone). word2vec might learn that the nearest neighbor of "seems" is "as if", whereas a small-ngram-window fastText will learn that its nearest neighbors are variants of "seems" containing typos. This pulls lightly misspelled variants of a word back together, and can even resist, to some extent, the noise produced by the word segmenter (one word being cut into several pieces).
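A minimal sketch of the small-ngram-window trick, assuming gensim's FastText implementation and a hypothetical `corpus` (list of token lists):

```python
from gensim.models import FastText

model = FastText(
    sentences=corpus,
    vector_size=100,
    window=5,
    min_count=3,
    min_n=1,   # char n-gram lower bound: usually 3 for English, 1-2 for Chinese typos
    max_n=2,
)
# Typo variants of a word should now rank high among its neighbors
# (the query word here is purely illustrative).
print(model.wv.most_similar("似乎", topn=5))
```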
    - Method 3: text correction:
        - Normal Chinese text mostly doesn't need this, but malicious text often contains illegal characters: special symbols inserted between normal characters, reversed ordering, full-width/half-width tricks, and other strange characters. You may need to maintain a conversion table yourself.
        - For English, this means spell checking, which can be done with the Python package pyenchant, e.g., mothjer -> mother (see the sketch below).
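A naive spell-correction sketch using pyenchant's dictionary API (replace each unknown word with the top suggestion):

```python
import enchant  # pip install pyenchant

d = enchant.Dict("en_US")

def correct(tokens):
    out = []
    for tok in tokens:
        if tok.isalpha() and not d.check(tok):
            suggestions = d.suggest(tok)
            tok = suggestions[0] if suggestions else tok  # keep as-is if no suggestion
        out.append(tok)
    return out

print(correct(["mothjer"]))  # -> ['mother']
```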
    - Method 4: text generalization:
        - Replace emoticons, numbers, person names, addresses, URLs, named entities, etc. with placeholder tokens. How fine-grained this should be depends on the task: there are many kinds of numbers (ordinary numbers, phone numbers, line numbers, hotline numbers, bank card numbers, QQ numbers, WeChat IDs, money amounts, distances, etc.), and in many tasks each kind can also serve as a separate feature dimension. Also account for both Chinese numerals and Arabic numerals (a sketch follows).
        - For Chinese, convert characters to pinyin; much malicious text substitutes homophones.
        - For English, you may also need stemming and lemmatization, e.g., fucking, fucked -> fuck.
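A minimal generalization sketch mapping URLs and numbers to placeholder tokens; the patterns and tags are illustrative, not exhaustive, and order matters (more specific patterns first):

```python
import re

PATTERNS = [
    (re.compile(r"https?://\S+"), " <URL> "),
    (re.compile(r"\b\d{11}\b"), " <PHONE> "),  # e.g., 11-digit mobile numbers
    (re.compile(r"\d+\.?\d*"), " <NUM> "),
]

def generalize(text):
    for pattern, tag in PATTERNS:
        text = pattern.sub(tag, text)
    return text

print(generalize("Call 13812345678 or visit https://example.com, price 9.9"))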
- Solutions for noise type II:
    - Method 1: cross-validation (see the sketch after this list):
        - s1: train a model;
        - s2: let the model select samples in the training and validation sets whose predicted labels disagree with the annotated labels;
        - s3: analyze the bad cases;
            - sources of error:
                - regular patterns: write rules to filter them out;
                - no pattern, but likely annotation errors: delete them;
                - others, ...
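A minimal sketch of step s2: out-of-fold predictions via cross-validation, flagging samples whose prediction disagrees with the label as candidates for bad-case analysis. `X` and `y` are the assumed feature matrix and labels; the classifier choice is illustrative:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

clf = LogisticRegression(max_iter=1000)
preds = cross_val_predict(clf, X, y, cv=5)  # each sample predicted by a model that never saw it
suspect_idx = [i for i, (p, t) in enumerate(zip(preds, y)) if p != t]
```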
- Solutions for noise type III:
    - Method 1: compute statistics over corpus fragments or tokens and remove very-high-frequency elements that carry no signal (a sketch follows).
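A minimal sketch of that statistic: count each token's document frequency and drop tokens that appear in almost every document (likely boilerplate). The 0.9 cutoff and `tokenized_texts` are assumptions:

```python
from collections import Counter

doc_freq = Counter()
for tokens in tokenized_texts:
    doc_freq.update(set(tokens))  # count each token once per document

n_docs = len(tokenized_texts)
boilerplate = {tok for tok, df in doc_freq.items() if df / n_docs > 0.9}
cleaned = [[t for t in tokens if t not in boilerplate] for tokens in tokenized_texts]
```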
1.4 Word segmentation for Chinese tasks
- You can keep only words longer than one character.
    - Result: no effect on accuracy, but it effectively reduces the feature dimension.
- Make sure the segmenter's granularity matches the tokens in the word-vector table.
    - Reason: if a word is segmented correctly but cannot be found in the word-vector table, it becomes OOV, so even perfect segmentation doesn't help.
    - Strategies (see the jieba sketch below):
        - use the segmenter that matches your word2vec / GloVe / fastText vectors;
        - add the word-vector vocabulary to jieba's user dictionary.
- Casing: convert everything to upper or lower case, to prevent case-induced OOV.
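A minimal sketch of registering the word-vector vocabulary as a jieba user dictionary, so the segmenter's output matches the embedding table; `embedding_vocab` is an assumed iterable of the embedding table's words:

```python
import jieba

# jieba's user-dict format is one word per line (frequency/POS optional).
with open("userdict.txt", "w", encoding="utf-8") as f:
    for word in embedding_vocab:
        f.write(word + "\n")

jieba.load_userdict("userdict.txt")
tokens = jieba.lcut("待切分的文本")
```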
1.5 Stop-word handling
- Use a common stop-word list:
    - source: available online;
    - adjustment: add or remove stop words according to the specific task.
- Word filtering: consider removing words that occur too often or too rarely.
    - Too often: words common to this whole genre of text;
    - too rarely: usually spelling mistakes, named entities, special phrases, etc.
- Filter by tf-idf (a sketch follows).
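A minimal sketch letting sklearn's TfidfVectorizer drop over- and under-frequent terms; the `max_df`/`min_df` values are illustrative, and `raw_texts` is an assumed list of strings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_df=0.9,  # drop terms appearing in more than 90% of documents
    min_df=3,    # drop terms appearing in fewer than 3 documents
)
X_tfidf = vectorizer.fit_transform(raw_texts)
```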
2. Models
2.1 Model selection

2.2 Word vector selection

2.3 Pretraining word / char vectors
- Word vectors: widen the context window during pretraining (see the sketch below).
- For the choice of word vectors, you can use pretrained vectors open-sourced by Google, Facebook, etc. When the training set is fairly large, you can also fine-tune them, or randomly initialize and train them along with the model. Don't fine-tune when the training set is small.
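A minimal pretraining sketch, assuming gensim and a hypothetical `corpus` (list of token lists), with the context window widened beyond the default of 5:

```python
from gensim.models import Word2Vec

model = Word2Vec(
    sentences=corpus,
    vector_size=300,
    window=10,     # widened context window, per the tip above
    min_count=3,
    workers=4,
)
model.wv.save_word2vec_format("vectors.txt")
```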
3. Hyperparameters
3.1 Regularization
- BN and dropout (< 0.5), and their relative position and order: they bring some benefit, but need to be tuned against the corpus.
- Where to add dropout: after the word-embedding layer, after pooling, and after the FC layer (sketched below).
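A minimal PyTorch sketch of the three dropout placements named above; the layer sizes and dropout rates are illustrative:

```python
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size=50000, embed_dim=300, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.drop_embed = nn.Dropout(0.3)   # after the embedding layer
        self.conv = nn.Conv1d(embed_dim, 128, kernel_size=3, padding=1)
        self.drop_pool = nn.Dropout(0.3)    # after pooling
        self.fc = nn.Linear(128, 64)
        self.drop_fc = nn.Dropout(0.3)      # after the FC layer
        self.out = nn.Linear(64, num_classes)

    def forward(self, x):                    # x: (batch, seq_len) token ids
        h = self.drop_embed(self.embed(x))   # (batch, seq_len, embed_dim)
        h = torch.relu(self.conv(h.transpose(1, 2)))
        h = self.drop_pool(h.max(dim=2).values)  # global max pooling
        h = self.drop_fc(torch.relu(self.fc(h)))
        return self.out(h)
```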
3.2 Learning rate
- Setting: start with the default learning rate (usually 1e-3).
- Decay: after several epochs, multiply lr by 0.1.
- Another strategy (sketched after this list):
    - train at the default learning rate (usually 1e-3) and keep the model that performs best on the validation set;
    - load that best model, drop the learning rate to 1e-4, continue training, and again keep the best model on the validation set;
    - load that best model, remove the regularization strategies (dropout, etc.), drop the learning rate to 1e-5, and train to get the final model.
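A minimal PyTorch sketch of the staged schedule above; `model`, `train_epochs` (train + early stop on validation, saving the best checkpoint) and `disable_dropout` are assumed helpers:

```python
import torch

stages = [(1e-3, False), (1e-4, False), (1e-5, True)]  # (lr, drop regularization)
for i, (lr, drop_reg) in enumerate(stages):
    if i > 0:
        model.load_state_dict(torch.load("best.pt"))  # resume from best so far
    if drop_reg:
        disable_dropout(model)  # e.g., set all nn.Dropout layers to p=0
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    train_epochs(model, optimizer, save_path="best.pt")
```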
4. Tasks
4.1 Binary classification
- Output-layer choice: sigmoid or softmax.
    - softmax: sometimes brings a small improvement.
4.2 Multi-label classification
- Problem: a sample can carry several labels at once; the labels may even form a DAG (directed acyclic graph).
- Method: train a baseline with binary cross-entropy (turn each category into a binary classification problem, so an N-category multi-label task becomes N binary tasks).
- Tooling: tf.nn.sigmoid_cross_entropy_with_logits (see the sketch below).
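A minimal multi-label loss sketch: one independent sigmoid per label, with multi-hot targets. The shapes and values are illustrative:

```python
import tensorflow as tf

logits = tf.constant([[2.0, -1.0, 0.5]])   # (batch, N) raw scores
labels = tf.constant([[1.0, 0.0, 1.0]])    # a sample carrying labels 0 and 2
loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
loss = tf.reduce_mean(loss)                # average over labels and batch
```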
4.3 Long text
- Method 1: crude truncation:
    - head + tail: the beginning and end usually carry the most information; keep both and concatenate them;
    - random truncation: if fixed truncation loses too much information, truncate at a different random position on each DataLoader pass so the model sees more varied cases;
    - head + tf-idf keywords: keep the beginning, extract keywords from the middle and tail with a keyword-extraction method, and splice them onto the end;
    - truncation & sliding window + prediction averaging: split one sample into several by random truncation or a fixed sliding window, then average the predictions over the pieces (sketched at the end of this section);
- Method 2: from the model angle, use architectures built for long inputs, e.g., XLNet, Reformer, Longformer.
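A minimal sliding-window + prediction-averaging sketch; `predict_proba` (model inference returning class probabilities for one chunk of token ids) is an assumed helper, and the window/stride values are illustrative:

```python
import numpy as np

def predict_long(tokens, window=512, stride=256):
    # Overlapping chunks; the final chunk may be shorter than `window`.
    chunks = [tokens[i:i + window] for i in range(0, max(len(tokens), 1), stride)]
    probs = np.stack([predict_proba(c) for c in chunks])
    return probs.mean(axis=0)  # average class probabilities over chunks
```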
4.4 Robustness
- Brute-force data augmentation: insert stop words and punctuation, delete words, substitute synonyms, and so on; if performance drops, clean the augmented training data.
- Adversarial training, contrastive learning, and other advanced techniques can bring further improvement.
5. Label System Construction
5.1 Building the label system
- Long-tail labels: some labels have very few samples. Fold them into an "other" label, then handle the long-tail labels at the next level.
- Confusable labels: when samples under certain labels are hard to tell apart, first consider whether those labels can simply be merged; if not, unify them first and separate them with rules at the next level.
- Many labels: some scenarios have hundreds of labels; build a multi-level label hierarchy, e.g., first major categories, then subcategories. You can also set up several parallel second-level classifiers, which suits groups of labels that are relatively independent and frequently added or modified, since each classifier stays independent and easy to maintain.
- Unknown labels: at a business cold start, when it is unclear which labels are appropriate, try text clustering for a preliminary partition and have experts refine it; this is an iterative loop.
5.2 Evaluating the rationality of the label system

6. Strategy Building
6.1 Algorithm strategies
- Algorithm strategies:
    - common rule-based methods: caching important cases, pattern mining, keyword + rule settings, etc.;
    - rule mining (let rules handle what rules handle best): cover high-frequency cases and hard cases with rules or dictionaries first, so that model updates across iterations don't make the handling of these cases fragile;
    - model generalization: the model handles the cases that no rule hits, since it generalizes. An alternative logic (sketched below): if a case hits a rule but the model assigns very low confidence to the rule's predicted class (i.e., it is much more confident in another class), trust the model and take its output.
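A minimal sketch of that rule-vs-model arbitration; `match_rules` and `model_predict` are assumed helpers, and the 0.1 threshold is illustrative:

```python
def classify(text):
    rule_label = match_rules(text)          # None if no rule hits
    probs = model_predict(text)             # dict: label -> confidence
    if rule_label is None:
        return max(probs, key=probs.get)    # no rule hit: trust the model
    if probs.get(rule_label, 0.0) < 0.1:    # rule hit, but the model strongly disagrees
        return max(probs, key=probs.get)
    return rule_label
```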
6.2 Feature mining
- Discrete feature mining:
    - build high-dimensional sparse keyword features: similar to structured-data mining (e.g., wide&deep in CTR). For instance, mine the text content against a keyword list, build high-dimensional sparse features, feed them to xDeepFM [3] for intermediate processing, and finally concatenate them with the text vector;
    - other business features: e.g., disease category, visiting department, and similar business signals.
- Text feature mining:
    - splice keywords & entity words onto the text: append the keywords or entity words extracted from the text sequence after it, then classify; in BERT: [CLS][original text][SEP][keyword 1][SEP][entity 1]...;
    - embed keywords: group keywords into category attributes and embed them; unlike discrete feature mining, these embeddings should not be sparse;
    - domain vector mining: besides continuing to pretrain word vectors on the domain corpus, you can build supervised word vectors. For a 21-class problem, first train 21 SVM-based binary classifiers with weak supervision, then read each word's weight in the 21 SVMs to build a 21-dimensional vector for that word (sketched below).
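A minimal sketch of those supervised word vectors: one linear SVM per class (one-vs-rest), stacking each word's 21 weights into a 21-dimensional vector. `texts` and `labels` are assumed:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

weights = []
for c in range(21):
    clf = LinearSVC().fit(X, (np.asarray(labels) == c).astype(int))
    weights.append(clf.coef_.ravel())        # one weight per vocabulary word

word_matrix = np.stack(weights, axis=1)      # (vocab_size, 21)
word_vec = dict(zip(vectorizer.get_feature_names_out(), word_matrix))
```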
- Label-feature fusion:
    - label embeddings: create label embeddings, let them interact with the word vectors through an attention mechanism, and extract a global vector for classification;
    - label-information splicing: concatenate the category label with the original text and run binary classification; in BERT: [CLS][original text][SEP][category label] (see the sketch below). You can also supplement label information dynamically via reinforcement learning; see [4] for details.
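A minimal sketch of the [CLS][text][SEP][label] splicing, assuming the HuggingFace transformers tokenizer; each (text, candidate label) pair becomes a binary "does this label apply?" example, and the model/label names are illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

def make_pair(text, label_name):
    # Pair encoding inserts [CLS]/[SEP] automatically: [CLS] text [SEP] label [SEP]
    return tokenizer(text, label_name, truncation=True, max_length=256)

enc = make_pair("这家餐厅的牛肉面非常好吃", "美食")  # expected binary target: 1
```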
6.3 Data imbalance
- Kinds of imbalance:
    - imbalanced data volume;
    - imbalanced data diversity.
- Common remedies (a re-weighting sketch follows this list):
    - resampling (re-sampling);
    - reweighting (re-weighting);
    - data augmentation;
    - gradient scaling;
    - pseudo-labeling.
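A minimal re-weighting sketch: inverse-frequency class weights that can be fed to a weighted loss; `y` is the assumed label array:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes, weights))  # e.g., pass to a weighted loss function
```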
6.3.1 Resampling (re-sampling)

6.3.2 Reweighting (re-weighting)



6.3.3 Data augmentation
- Data augmentation techniques (a sketch follows this list):
    - "add and delete": randomly add or delete some words in the sentence;
    - back-translation: translate the text into another language and back;
    - synonyms: replace some words with synonyms;
    - expansion;
    - truncation.
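A minimal augmentation sketch covering random deletion and synonym replacement; `synonyms` is an assumed dict mapping a word to a list of synonyms, and the probabilities are illustrative:

```python
import random

def augment(tokens, p_delete=0.1, p_synonym=0.1):
    out = []
    for tok in tokens:
        if random.random() < p_delete:
            continue                                   # random deletion
        if tok in synonyms and random.random() < p_synonym:
            tok = random.choice(synonyms[tok])         # synonym replacement
        out.append(tok)
    return out or tokens  # never return an empty sample
```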
6.4 Fusing pretrained models
- You generally don't need to finetune directly. Of course, you can also finetune BERT, XLNet, and ALBERT separately first, then fuse their features together.
- For tokenization, use the best pretrained model's tokenizer, or use several pretrained models' tokenizers at the same time.
- Don't ignore plain word vectors: adding word vectors and bi-gram vectors is critical to the richness of the bottom model.
- When configuring the upper model, mind the learning rates. Feed the fused bottom features into a biLSTM or CNN, or concatenate a biLSTM and a CNN as the upper model. During training, first freeze the pretrained bottom models and tune only the upper model's (larger) learning rate, then finish with a global pass at a smaller learning rate (see the sketch below).
- Always use the CLS vector at the end, whatever the upper model is: the CLS feature should go directly into the fully connected layer.
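A minimal PyTorch sketch of that two-stage schedule; `encoder` (the pretrained bottom) and `upper` are assumed nn.Module parts of the full model, and the learning rates are illustrative:

```python
import torch

# Stage 1: freeze the pretrained encoder, train only the upper model.
for p in encoder.parameters():
    p.requires_grad = False
opt_stage1 = torch.optim.Adam(upper.parameters(), lr=1e-3)
# ... train stage 1 ...

# Stage 2: unfreeze and train globally, with a smaller lr for the encoder.
for p in encoder.parameters():
    p.requires_grad = True
opt_stage2 = torch.optim.Adam([
    {"params": encoder.parameters(), "lr": 2e-5},
    {"params": upper.parameters(), "lr": 1e-4},
])
# ... train stage 2 ...
```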
6.5 Catastrophic forgetting
- Phenomenon: after learning new knowledge, the model almost completely forgets what it learned before.
- Illustration: suppose we build a deep network to recognize animals, but a stingy data provider hands over data for only one animal at a time and takes it back once the network has learned that animal, before providing the next animal's training data. The curious result: after the network learns to recognize dogs, it can no longer recognize the cats it learned earlier. This is catastrophic forgetting, a long-standing and serious problem in deep learning.
- Why it happens:
    - Once a deep network's architecture is fixed, it is hard to adjust during training. The architecture determines the model's capacity, and a fixed architecture means finite capacity; with finite capacity, the network must erase old knowledge in order to learn a new task.
    - Hidden-layer neurons are global: a small change in a single neuron can affect the output of the whole network at once. Moreover, every parameter of a feedforward network is connected to every input dimension, so new data can change all parameters in the network. For a network whose architecture is already fixed, the parameters are the only carriers of knowledge; if the changed parameters include ones strongly tied to historical knowledge, the net effect is that new knowledge overwrites old knowledge.
- Remedies (remedy 2 is sketched after this list):
    - simply mix the new data with the original data and train on both;
    - freeze the feature-extraction layers and update only the softmax / fully connected layer for the new categories;
    - use knowledge distillation: when training on the mix of new and old data, distill the original categories so the old model guides the new one;
    - unify classification labels as label embeddings: new categories get their own label embeddings without affecting the old ones, turning classification into a match-and-rank problem.
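A minimal PyTorch sketch of remedy 2: freeze the encoder and train only a widened classification head that covers old + new classes; `encoder` and the sizes are assumptions:

```python
import torch.nn as nn

for p in encoder.parameters():
    p.requires_grad = False                 # old knowledge stays intact

old_classes, new_classes, hidden = 10, 3, 768
head = nn.Linear(hidden, old_classes + new_classes)  # only this layer is trained
```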
6.6 Small models, great wisdom
- Motivation: BERT is powerful, but in low-resource, few-machine settings, deploying BERT directly as the classifier usually doesn't work. Can a lightweight model such as TextCNN approach BERT's performance?
- Idea: knowledge distillation. Distillation is essentially function approximation, but directly distilling BERT (the teacher model) into a very lightweight TextCNN (the student model) generally hurts the metrics.
- Methods:
    - model distillation;
    - data distillation.
6.6.1 Model distillation
If the business has little unlabeled data, we usually have TextCNN learn to approximate the teacher's logits; call this model distillation. It is offline distillation: first finetune the teacher model, then freeze it and let the student model learn. To avoid an obvious metric drop after distillation, the following help (a loss sketch comes after this list):
- Data augmentation: apply text-augmentation techniques while distilling; for concrete techniques, see the companion article on the small-sample dilemma in NLP. TinyBERT also uses augmentation to assist distillation.
- Ensemble distillation: ensemble the logits of different teacher models (e.g., different pretrained models) and let TextCNN learn from the ensemble. **Ensemble distillation + data augmentation** effectively prevents an obvious metric drop.
- Joint distillation: unlike offline distillation, this is joint training: while the teacher model trains, it passes its logits to the student model. Joint distillation narrows the gap between heterogeneous teacher and student models; the student can learn gradually from the teacher's intermediate states and imitate the teacher's behavior better.
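A minimal soft-target distillation loss sketch in PyTorch: KL divergence between temperature-softened teacher and student logits, mixed with the ordinary cross-entropy on gold labels. `T` and `alpha` are illustrative:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                              # standard T^2 gradient scaling
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```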
6.6.2 Data distillation
If the business has plenty of unlabeled data, we can instead have TextCNN learn to approximate the teacher's labels; call this data distillation. It is essentially pseudo-labeling: the teacher model pseudo-labels the unlabeled data, and the student model learns from it. The steps (sketched below):
- Training 1: finetune BERT on the labeled dataset A to get bert_model.
- Pseudo-labeling: bert_model predicts on a large unlabeled pool U; rank the predictions by confidence, take the high-confidence subset B, and add it to A, so the labeled data becomes (A+B).
- Training 2: train TextCNN on (A+B) to get textcnn_model_1.
- Training 3 (optional): take textcnn_model_1 from the previous step and train it again on the labeled data A alone, producing the final model textcnn_model_2.
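A minimal pipeline sketch of those steps; `bert_model.predict` (returning class probabilities), `train_textcnn`, and the 0.95 confidence threshold are assumptions:

```python
import numpy as np

probs = bert_model.predict(U_texts)                    # pseudo-label the pool U
conf, pseudo = probs.max(axis=1), probs.argmax(axis=1)
keep = conf > 0.95                                     # high-confidence subset B

X_aug = A_texts + [t for t, k in zip(U_texts, keep) if k]
y_aug = A_labels + list(pseudo[keep])

textcnn_1 = train_textcnn(X_aug, y_aug)                       # train on A+B
textcnn_2 = train_textcnn(A_texts, A_labels, init=textcnn_1)  # optional: refine on A
```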
I ran both distillation methods on one of our business tasks, a 21-class problem with 100 samples per class. The takeaway: when you can obtain plenty of unlabeled data, data distillation is the more effective of the two and lets a lightweight TextCNN come closest to BERT.
Some readers may ask: why not distill into a shallow BERT instead? You certainly can, but I recommend TextCNN here because it is extremely lightweight, and it is more convenient to add business-specific features to it.
If you still want to distill into a shallow BERT, first ask whether your domain has a large gap from BERT's original pretraining domain. If the gap is large, don't skip pretraining: continue domain pretraining and then distill, or re-pretrain a shallow BERT from scratch. Also, when putting BERT online, consider operator fusion (Faster Transformer) or mixed precision.
References
- [Xiaoxi's collection] How to solve the class-imbalance problem gracefully and fashionably
- In text classification tasks, which tricks rarely appear in papers yet have a significant impact on performance?
- xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems
- Description Based Text Classification with Reinforcement Learning
- How to solve 11 key issues in NLP classification tasks: class imbalance & low-resource computation & small samples & robustness & test inspection & long-text classification