[About Text Classification Tricks] Things You Don't Know
2022-07-03 23:48:00 【Necther】
1. Data Preprocessing
1.1 Vocabulary (vocab) construction
- Decide how to select vocab entries (e.g., keep the top-N most frequent words, or filter out words that occur fewer than 3 times).
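A minimal vocab-building sketch of the two selection rules above; `max_size` and `min_freq` are illustrative defaults, and `tokenized_texts` is an assumed list of token lists:

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size=50000, min_freq=3):
    """Keep the top-N most frequent tokens, dropping rare ones."""
    counter = Counter(tok for text in tokenized_texts for tok in text)
    # Reserve indices 0/1 for padding and out-of-vocabulary tokens.
    vocab = {"<PAD>": 0, "<UNK>": 1}
    for tok, freq in counter.most_common(max_size):
        if freq < min_freq:
            break  # most_common is sorted, so all remaining tokens are rarer
        vocab[tok] = len(vocab)
    return vocab
```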
1.2 Model input
- Depending on the model, the input can be processed at char level, word level, etc.
- Adding part-of-speech features and sentiment features of words to the training data tends to improve results.
- For the PAD length, take the mean sequence length or something slightly larger (e.g., the 75th percentile); just don't take the maximum length (see the sketch after this list).
- You can keep only certain parts of speech, e.g., only adjectives and nouns.
- Adding stemmed forms to the training data tends to improve results.
- Adding topic vectors to the training data tends to improve results.
- Adding position vectors tends to improve results (embed the position of each token, then concatenate it with the word vector).
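A minimal sketch of picking the padding length as the 75th percentile of sequence lengths rather than the maximum; `tokenized_texts` is an assumed list of token lists:

```python
import numpy as np

# The maximum length wastes computation on a few outliers; the 75th
# percentile covers most sequences at a fraction of the cost.
lengths = [len(tokens) for tokens in tokenized_texts]
max_len = int(np.percentile(lengths, 75))
```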
1.3 Handling noisy data
- Noise types (for a dataset D(X, Y)):
    - Type I: X is noisy (e.g., the text is colloquial or generated by ordinary internet users);
    - Type II: Y is noisy (some samples are clearly mislabeled, some are hard even for humans to categorize, and some categories are inherently ambiguous);
    - Type III: fixed-pattern data (e.g., high-frequency boilerplate such as "XX reports" or "edited by XX"; tokens that clearly bias the model's judgment, such as punctuation).
- Solutions for noise type I:
    - Method 1: from the word-vector vs. char-vector angle:
        - s1: use char-level input (for Chinese, character granularity) as the model input;
        - s2: compare training from scratch (no pretrained word vectors) against word-level input;
        - s3: prefer whichever performs better.
    - Method 2: train word vectors with fastText using special hyperparameters (see the sketch after this item):
        - Background: for English, fastText's char n-gram window is usually 3–6, but for Chinese, if the goal is to remove input noise, limit the window to 1–2. A small window helps the model capture typos (when we mistype, one character usually becomes a homophone). word2vec might learn that the nearest neighbor of "seems" is "as if", whereas a small-ngram-window fastText will learn that its nearest neighbors are variants of "seems" containing typos. This pulls lightly misspelled variants of a word back together, and can even resist, to some extent, the noise produced by the word segmenter (one word being cut into several pieces).
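A minimal sketch of the small-ngram-window trick, assuming gensim's FastText implementation and a hypothetical `corpus` (list of token lists):

```python
from gensim.models import FastText

model = FastText(
    sentences=corpus,
    vector_size=100,
    window=5,
    min_count=3,
    min_n=1,   # char n-gram lower bound: usually 3 for English, 1-2 for Chinese typos
    max_n=2,
)
# Typo variants of a word should now rank high among its neighbors
# (the query word here is purely illustrative).
print(model.wv.most_similar("似乎", topn=5))
```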
    - Method 3: text correction:
        - Normal Chinese text mostly doesn't need this, but malicious text often contains illegal characters: special symbols inserted between normal characters, reversed ordering, full-width/half-width tricks, and other strange characters. You may need to maintain a conversion table yourself.
        - For English, this means spell checking, which can be done with the Python package pyenchant, e.g., mothjer -> mother (see the sketch below).
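A naive spell-correction sketch using pyenchant's dictionary API (replace each unknown word with the top suggestion):

```python
import enchant  # pip install pyenchant

d = enchant.Dict("en_US")

def correct(tokens):
    out = []
    for tok in tokens:
        if tok.isalpha() and not d.check(tok):
            suggestions = d.suggest(tok)
            tok = suggestions[0] if suggestions else tok  # keep as-is if no suggestion
        out.append(tok)
    return out

print(correct(["mothjer"]))  # -> ['mother']
```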
    - Method 4: text generalization:
        - Replace emoticons, numbers, person names, addresses, URLs, named entities, etc. with placeholder tokens. How fine-grained this should be depends on the task: there are many kinds of numbers (ordinary numbers, phone numbers, line numbers, hotline numbers, bank card numbers, QQ numbers, WeChat IDs, money amounts, distances, etc.), and in many tasks each kind can also serve as a separate feature dimension. Also account for both Chinese numerals and Arabic numerals (a sketch follows).
        - For Chinese, convert characters to pinyin; much malicious text substitutes homophones.
        - For English, you may also need stemming and lemmatization, e.g., fucking, fucked -> fuck.
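A minimal generalization sketch mapping URLs and numbers to placeholder tokens; the patterns and tags are illustrative, not exhaustive, and order matters (more specific patterns first):

```python
import re

PATTERNS = [
    (re.compile(r"https?://\S+"), " <URL> "),
    (re.compile(r"\b\d{11}\b"), " <PHONE> "),  # e.g., 11-digit mobile numbers
    (re.compile(r"\d+\.?\d*"), " <NUM> "),
]

def generalize(text):
    for pattern, tag in PATTERNS:
        text = pattern.sub(tag, text)
    return text

print(generalize("Call 13812345678 or visit https://example.com, price 9.9"))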
- Solutions for noise type II:
    - Method 1: cross-validation (see the sketch after this list):
        - s1: train a model;
        - s2: let the model select samples in the training and validation sets whose predicted labels disagree with the annotated labels;
        - s3: analyze the bad cases;
            - sources of error:
                - regular patterns: write rules to filter them out;
                - no pattern, but likely annotation errors: delete them;
                - others, ...
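A minimal sketch of step s2: out-of-fold predictions via cross-validation, flagging samples whose prediction disagrees with the label as candidates for bad-case analysis. `X` and `y` are the assumed feature matrix and labels; the classifier choice is illustrative:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

clf = LogisticRegression(max_iter=1000)
preds = cross_val_predict(clf, X, y, cv=5)  # each sample predicted by a model that never saw it
suspect_idx = [i for i, (p, t) in enumerate(zip(preds, y)) if p != t]
```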
- Solutions for noise type III:
    - Method 1: compute statistics over corpus fragments or tokens and remove very-high-frequency elements that carry no signal (a sketch follows).
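A minimal sketch of that statistic: count each token's document frequency and drop tokens that appear in almost every document (likely boilerplate). The 0.9 cutoff and `tokenized_texts` are assumptions:

```python
from collections import Counter

doc_freq = Counter()
for tokens in tokenized_texts:
    doc_freq.update(set(tokens))  # count each token once per document

n_docs = len(tokenized_texts)
boilerplate = {tok for tok, df in doc_freq.items() if df / n_docs > 0.9}
cleaned = [[t for t in tokens if t not in boilerplate] for tokens in tokenized_texts]
```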
1.4 Word segmentation for Chinese tasks
- You can keep only words longer than one character.
    - Result: no effect on accuracy, but it effectively reduces the feature dimension.
- Make sure the segmenter's granularity matches the tokens in the word-vector table.
    - Reason: if a word is segmented correctly but cannot be found in the word-vector table, it becomes OOV, so even perfect segmentation doesn't help.
    - Strategies (see the jieba sketch below):
        - use the segmenter that matches your word2vec / GloVe / fastText vectors;
        - add the word-vector vocabulary to jieba's user dictionary.
- Casing: convert everything to upper or lower case, to prevent case-induced OOV.
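A minimal sketch of registering the word-vector vocabulary as a jieba user dictionary, so the segmenter's output matches the embedding table; `embedding_vocab` is an assumed iterable of the embedding table's words:

```python
import jieba

# jieba's user-dict format is one word per line (frequency/POS optional).
with open("userdict.txt", "w", encoding="utf-8") as f:
    for word in embedding_vocab:
        f.write(word + "\n")

jieba.load_userdict("userdict.txt")
tokens = jieba.lcut("待切分的文本")
```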
1.5 Stop-word handling
- Use a common stop-word list:
    - source: available online;
    - adjustment: add or remove stop words according to the specific task.
- Word filtering: consider removing words that occur too often or too rarely.
    - Too often: words common to this whole genre of text;
    - too rarely: usually spelling mistakes, named entities, special phrases, etc.
- Filter by tf-idf (a sketch follows).
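A minimal sketch letting sklearn's TfidfVectorizer drop over- and under-frequent terms; the `max_df`/`min_df` values are illustrative, and `raw_texts` is an assumed list of strings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_df=0.9,  # drop terms appearing in more than 90% of documents
    min_df=3,    # drop terms appearing in fewer than 3 documents
)
X_tfidf = vectorizer.fit_transform(raw_texts)
```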
2. Models
2.1 Model selection

2.2 Word vector selection

2.3 Pretraining word / char vectors
- Word vectors: widen the context window during pretraining (see the sketch below).
- For the choice of word vectors, you can use pretrained vectors open-sourced by Google, Facebook, etc. When the training set is fairly large, you can also fine-tune them, or randomly initialize and train them along with the model. Don't fine-tune when the training set is small.
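A minimal pretraining sketch, assuming gensim and a hypothetical `corpus` (list of token lists), with the context window widened beyond the default of 5:

```python
from gensim.models import Word2Vec

model = Word2Vec(
    sentences=corpus,
    vector_size=300,
    window=10,     # widened context window, per the tip above
    min_count=3,
    workers=4,
)
model.wv.save_word2vec_format("vectors.txt")
```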
3. Hyperparameters
3.1 Regularization
- BN and dropout (< 0.5), and their relative position and order: they bring some benefit, but need to be tuned against the corpus.
- Where to add dropout: after the word-embedding layer, after pooling, and after the FC layer (sketched below).
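A minimal PyTorch sketch of the three dropout placements named above; the layer sizes and dropout rates are illustrative:

```python
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size=50000, embed_dim=300, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.drop_embed = nn.Dropout(0.3)   # after the embedding layer
        self.conv = nn.Conv1d(embed_dim, 128, kernel_size=3, padding=1)
        self.drop_pool = nn.Dropout(0.3)    # after pooling
        self.fc = nn.Linear(128, 64)
        self.drop_fc = nn.Dropout(0.3)      # after the FC layer
        self.out = nn.Linear(64, num_classes)

    def forward(self, x):                    # x: (batch, seq_len) token ids
        h = self.drop_embed(self.embed(x))   # (batch, seq_len, embed_dim)
        h = torch.relu(self.conv(h.transpose(1, 2)))
        h = self.drop_pool(h.max(dim=2).values)  # global max pooling
        h = self.drop_fc(torch.relu(self.fc(h)))
        return self.out(h)
```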
3.2 Learning rate
- Setting: start with the default learning rate (usually 1e-3).
- Decay: after several epochs, multiply lr by 0.1.
- Another strategy (sketched after this list):
    - train at the default learning rate (usually 1e-3) and keep the model that performs best on the validation set;
    - load that best model, drop the learning rate to 1e-4, continue training, and again keep the best model on the validation set;
    - load that best model, remove the regularization strategies (dropout, etc.), drop the learning rate to 1e-5, and train to get the final model.
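A minimal PyTorch sketch of the staged schedule above; `model`, `train_epochs` (train + early stop on validation, saving the best checkpoint) and `disable_dropout` are assumed helpers:

```python
import torch

stages = [(1e-3, False), (1e-4, False), (1e-5, True)]  # (lr, drop regularization)
for i, (lr, drop_reg) in enumerate(stages):
    if i > 0:
        model.load_state_dict(torch.load("best.pt"))  # resume from best so far
    if drop_reg:
        disable_dropout(model)  # e.g., set all nn.Dropout layers to p=0
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    train_epochs(model, optimizer, save_path="best.pt")
```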
4. Tasks
4.1 Binary classification
- Output-layer choice: sigmoid or softmax.
    - softmax: sometimes brings a small improvement.
4.2 Multi-label classification
- Problem: a sample can carry several labels at once; the labels may even form a DAG (directed acyclic graph).
- Method: train a baseline with binary cross-entropy (turn each category into a binary classification problem, so an N-category multi-label task becomes N binary tasks).
- Tooling: tf.nn.sigmoid_cross_entropy_with_logits (see the sketch below).
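A minimal multi-label loss sketch: one independent sigmoid per label, with multi-hot targets. The shapes and values are illustrative:

```python
import tensorflow as tf

logits = tf.constant([[2.0, -1.0, 0.5]])   # (batch, N) raw scores
labels = tf.constant([[1.0, 0.0, 1.0]])    # a sample carrying labels 0 and 2
loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
loss = tf.reduce_mean(loss)                # average over labels and batch
```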
4.3 Long text
- Method 1: crude truncation:
    - head + tail: the beginning and end usually carry the most information; keep both and concatenate them;
    - random truncation: if fixed truncation loses too much information, truncate at a different random position on each DataLoader pass so the model sees more varied cases;
    - head + tf-idf keywords: keep the beginning, extract keywords from the middle and tail with a keyword-extraction method, and splice them onto the end;
    - truncation & sliding window + prediction averaging: split one sample into several by random truncation or a fixed sliding window, then average the predictions over the pieces (sketched at the end of this section);
- Method 2: from the model angle, use architectures built for long inputs, e.g., XLNet, Reformer, Longformer.
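A minimal sliding-window + prediction-averaging sketch; `predict_proba` (model inference returning class probabilities for one chunk of token ids) is an assumed helper, and the window/stride values are illustrative:

```python
import numpy as np

def predict_long(tokens, window=512, stride=256):
    # Overlapping chunks; the final chunk may be shorter than `window`.
    chunks = [tokens[i:i + window] for i in range(0, max(len(tokens), 1), stride)]
    probs = np.stack([predict_proba(c) for c in chunks])
    return probs.mean(axis=0)  # average class probabilities over chunks
```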
4.4 Robustness
- Brute-force data augmentation: insert stop words and punctuation, delete words, substitute synonyms, and so on; if performance drops, clean the augmented training data.
- Adversarial training, contrastive learning, and other advanced techniques can bring further improvement.
5. Label System Construction
5.1 Building the label system
- Long-tail labels: some labels have very few samples. Fold them into an "other" label, then handle the long-tail labels at the next level.
- Confusable labels: when samples under certain labels are hard to tell apart, first consider whether those labels can simply be merged; if not, unify them first and separate them with rules at the next level.
- Many labels: some scenarios have hundreds of labels; build a multi-level label hierarchy, e.g., first major categories, then subcategories. You can also set up several parallel second-level classifiers, which suits groups of labels that are relatively independent and frequently added or modified, since each classifier stays independent and easy to maintain.
- Unknown labels: at a business cold start, when it is unclear which labels are appropriate, try text clustering for a preliminary partition and have experts refine it; this is an iterative loop.
5.2 Evaluating the rationality of the label system

6. Strategy Building
6.1 Algorithm strategies
- Algorithm strategies:
    - common rule-based methods: caching important cases, pattern mining, keyword + rule settings, etc.;
    - rule mining (let rules handle what rules handle best): cover high-frequency cases and hard cases with rules or dictionaries first, so that model updates across iterations don't make the handling of these cases fragile;
    - model generalization: the model handles the cases that no rule hits, since it generalizes. An alternative logic (sketched below): if a case hits a rule but the model assigns very low confidence to the rule's predicted class (i.e., it is much more confident in another class), trust the model and take its output.
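A minimal sketch of that rule-vs-model arbitration; `match_rules` and `model_predict` are assumed helpers, and the 0.1 threshold is illustrative:

```python
def classify(text):
    rule_label = match_rules(text)          # None if no rule hits
    probs = model_predict(text)             # dict: label -> confidence
    if rule_label is None:
        return max(probs, key=probs.get)    # no rule hit: trust the model
    if probs.get(rule_label, 0.0) < 0.1:    # rule hit, but the model strongly disagrees
        return max(probs, key=probs.get)
    return rule_label
```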
6.2 Feature mining
- Discrete feature mining:
    - build high-dimensional sparse keyword features: similar to structured-data mining (e.g., wide&deep in CTR). For instance, mine the text content against a keyword list, build high-dimensional sparse features, feed them to xDeepFM [3] for intermediate processing, and finally concatenate them with the text vector;
    - other business features: e.g., disease category, visiting department, and similar business signals.
- Text feature mining:
    - splice keywords & entity words onto the text: append the keywords or entity words extracted from the text sequence after it, then classify; in BERT: [CLS][original text][SEP][keyword 1][SEP][entity 1]...;
    - embed keywords: group keywords into category attributes and embed them; unlike discrete feature mining, these embeddings should not be sparse;
    - domain vector mining: besides continuing to pretrain word vectors on the domain corpus, you can build supervised word vectors. For a 21-class problem, first train 21 SVM-based binary classifiers with weak supervision, then read each word's weight in the 21 SVMs to build a 21-dimensional vector for that word (sketched below).
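A minimal sketch of those supervised word vectors: one linear SVM per class (one-vs-rest), stacking each word's 21 weights into a 21-dimensional vector. `texts` and `labels` are assumed:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

weights = []
for c in range(21):
    clf = LinearSVC().fit(X, (np.asarray(labels) == c).astype(int))
    weights.append(clf.coef_.ravel())        # one weight per vocabulary word

word_matrix = np.stack(weights, axis=1)      # (vocab_size, 21)
word_vec = dict(zip(vectorizer.get_feature_names_out(), word_matrix))
```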
- Label-feature fusion:
    - label embeddings: create label embeddings, let them interact with the word vectors through an attention mechanism, and extract a global vector for classification;
    - label-information splicing: concatenate the category label with the original text and run binary classification; in BERT: [CLS][original text][SEP][category label] (see the sketch below). You can also supplement label information dynamically via reinforcement learning; see [4] for details.
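A minimal sketch of the [CLS][text][SEP][label] splicing, assuming the HuggingFace transformers tokenizer; each (text, candidate label) pair becomes a binary "does this label apply?" example, and the model/label names are illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

def make_pair(text, label_name):
    # Pair encoding inserts [CLS]/[SEP] automatically: [CLS] text [SEP] label [SEP]
    return tokenizer(text, label_name, truncation=True, max_length=256)

enc = make_pair("这家餐厅的牛肉面非常好吃", "美食")  # expected binary target: 1
```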
6.3 Data imbalance
- Kinds of imbalance:
    - imbalanced data volume;
    - imbalanced data diversity.
- Common remedies (a re-weighting sketch follows this list):
    - resampling (re-sampling);
    - reweighting (re-weighting);
    - data augmentation;
    - gradient scaling;
    - pseudo-labeling.
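A minimal re-weighting sketch: inverse-frequency class weights that can be fed to a weighted loss; `y` is the assumed label array:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes, weights))  # e.g., pass to a weighted loss function
```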
6.3.1 Resampling (re-sampling)

6.3.2 Reweighting (re-weighting)



6.3.3 Data augmentation
- Data augmentation techniques (a sketch follows this list):
    - "add and delete": randomly add or delete some words in the sentence;
    - back-translation: translate the text into another language and back;
    - synonyms: replace some words with synonyms;
    - expansion;
    - truncation.
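A minimal augmentation sketch covering random deletion and synonym replacement; `synonyms` is an assumed dict mapping a word to a list of synonyms, and the probabilities are illustrative:

```python
import random

def augment(tokens, p_delete=0.1, p_synonym=0.1):
    out = []
    for tok in tokens:
        if random.random() < p_delete:
            continue                                   # random deletion
        if tok in synonyms and random.random() < p_synonym:
            tok = random.choice(synonyms[tok])         # synonym replacement
        out.append(tok)
    return out or tokens  # never return an empty sample
```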
6.4 Fusing pretrained models
- You generally don't need to finetune directly. Of course, you can also finetune BERT, XLNet, and ALBERT separately first, then fuse their features together.
- For tokenization, use the best pretrained model's tokenizer, or use several pretrained models' tokenizers at the same time.
- Don't ignore plain word vectors: adding word vectors and bi-gram vectors is critical to the richness of the bottom model.
- When configuring the upper model, mind the learning rates. Feed the fused bottom features into a biLSTM or CNN, or concatenate a biLSTM and a CNN as the upper model. During training, first freeze the pretrained bottom models and tune only the upper model's (larger) learning rate, then finish with a global pass at a smaller learning rate (see the sketch below).
- Always use the CLS vector at the end, whatever the upper model is: the CLS feature should go directly into the fully connected layer.
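A minimal PyTorch sketch of that two-stage schedule; `encoder` (the pretrained bottom) and `upper` are assumed nn.Module parts of the full model, and the learning rates are illustrative:

```python
import torch

# Stage 1: freeze the pretrained encoder, train only the upper model.
for p in encoder.parameters():
    p.requires_grad = False
opt_stage1 = torch.optim.Adam(upper.parameters(), lr=1e-3)
# ... train stage 1 ...

# Stage 2: unfreeze and train globally, with a smaller lr for the encoder.
for p in encoder.parameters():
    p.requires_grad = True
opt_stage2 = torch.optim.Adam([
    {"params": encoder.parameters(), "lr": 2e-5},
    {"params": upper.parameters(), "lr": 1e-4},
])
# ... train stage 2 ...
```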
6.5 Catastrophic forgetting
- Phenomenon: after learning new knowledge, the model almost completely forgets what it learned before.
- Illustration: suppose we build a deep network to recognize animals, but a stingy data provider hands over data for only one animal at a time and takes it back once the network has learned that animal, before providing the next animal's training data. The curious result: after the network learns to recognize dogs, it can no longer recognize the cats it learned earlier. This is catastrophic forgetting, a long-standing and serious problem in deep learning.
- Why it happens:
    - Once a deep network's architecture is fixed, it is hard to adjust during training. The architecture determines the model's capacity, and a fixed architecture means finite capacity; with finite capacity, the network must erase old knowledge in order to learn a new task.
    - Hidden-layer neurons are global: a small change in a single neuron can affect the output of the whole network at once. Moreover, every parameter of a feedforward network is connected to every input dimension, so new data can change all parameters in the network. For a network whose architecture is already fixed, the parameters are the only carriers of knowledge; if the changed parameters include ones strongly tied to historical knowledge, the net effect is that new knowledge overwrites old knowledge.
- Remedies (remedy 2 is sketched after this list):
    - simply mix the new data with the original data and train on both;
    - freeze the feature-extraction layers and update only the softmax / fully connected layer for the new categories;
    - use knowledge distillation: when training on the mix of new and old data, distill the original categories so the old model guides the new one;
    - unify classification labels as label embeddings: new categories get their own label embeddings without affecting the old ones, turning classification into a match-and-rank problem.
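A minimal PyTorch sketch of remedy 2: freeze the encoder and train only a widened classification head that covers old + new classes; `encoder` and the sizes are assumptions:

```python
import torch.nn as nn

for p in encoder.parameters():
    p.requires_grad = False                 # old knowledge stays intact

old_classes, new_classes, hidden = 10, 3, 768
head = nn.Linear(hidden, old_classes + new_classes)  # only this layer is trained
```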
6.6 Small models, great wisdom
- Motivation: BERT is powerful, but in low-resource, few-machine settings, deploying BERT directly as the classifier usually doesn't work. Can a lightweight model such as TextCNN approach BERT's performance?
- Idea: knowledge distillation. Distillation is essentially function approximation, but directly distilling BERT (the teacher model) into a very lightweight TextCNN (the student model) generally hurts the metrics.
- Methods:
    - model distillation;
    - data distillation.
6.6.1 Model distillation
If the business has little unlabeled data, we usually have TextCNN learn to approximate the teacher's logits; call this model distillation. It is offline distillation: first finetune the teacher model, then freeze it and let the student model learn. To avoid an obvious metric drop after distillation, the following help (a loss sketch comes after this list):
- Data augmentation: apply text-augmentation techniques while distilling; for concrete techniques, see the companion article on the small-sample dilemma in NLP. TinyBERT also uses augmentation to assist distillation.
- Ensemble distillation: ensemble the logits of different teacher models (e.g., different pretrained models) and let TextCNN learn from the ensemble. **Ensemble distillation + data augmentation** effectively prevents an obvious metric drop.
- Joint distillation: unlike offline distillation, this is joint training: while the teacher model trains, it passes its logits to the student model. Joint distillation narrows the gap between heterogeneous teacher and student models; the student can learn gradually from the teacher's intermediate states and imitate the teacher's behavior better.
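A minimal soft-target distillation loss sketch in PyTorch: KL divergence between temperature-softened teacher and student logits, mixed with the ordinary cross-entropy on gold labels. `T` and `alpha` are illustrative:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                              # standard T^2 gradient scaling
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```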
6.6.2 Data distillation
If the business has plenty of unlabeled data, we can instead have TextCNN learn to approximate the teacher's labels; call this data distillation. It is essentially pseudo-labeling: the teacher model pseudo-labels the unlabeled data, and the student model learns from it. The steps (sketched below):
- Training 1: finetune BERT on the labeled dataset A to get bert_model.
- Pseudo-labeling: bert_model predicts on a large unlabeled pool U; rank the predictions by confidence, take the high-confidence subset B, and add it to A, so the labeled data becomes (A+B).
- Training 2: train TextCNN on (A+B) to get textcnn_model_1.
- Training 3 (optional): take textcnn_model_1 from the previous step and train it again on the labeled data A alone, producing the final model textcnn_model_2.
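A minimal pipeline sketch of those steps; `bert_model.predict` (returning class probabilities), `train_textcnn`, and the 0.95 confidence threshold are assumptions:

```python
import numpy as np

probs = bert_model.predict(U_texts)                    # pseudo-label the pool U
conf, pseudo = probs.max(axis=1), probs.argmax(axis=1)
keep = conf > 0.95                                     # high-confidence subset B

X_aug = A_texts + [t for t, k in zip(U_texts, keep) if k]
y_aug = A_labels + list(pseudo[keep])

textcnn_1 = train_textcnn(X_aug, y_aug)                       # train on A+B
textcnn_2 = train_textcnn(A_texts, A_labels, init=textcnn_1)  # optional: refine on A
```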
I ran both distillation methods on one of our business tasks, a 21-class problem with 100 samples per class. The takeaway: when you can obtain plenty of unlabeled data, data distillation is the more effective of the two and lets a lightweight TextCNN come closest to BERT.
Some readers may ask: why not distill into a shallow BERT instead? You certainly can, but I recommend TextCNN here because it is extremely lightweight, and it is more convenient to add business-specific features to it.
If you still want to distill into a shallow BERT, first ask whether your domain has a large gap from BERT's original pretraining domain. If the gap is large, don't skip pretraining: continue domain pretraining and then distill, or re-pretrain a shallow BERT from scratch. Also, when putting BERT online, consider operator fusion (Faster Transformer) or mixed precision.
References
- [Xiaoxi's collection] How to solve the class-imbalance problem gracefully and fashionably
- In text classification tasks, which tricks rarely appear in papers yet have a significant impact on performance?
- xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems
- Description Based Text Classification with Reinforcement Learning
- How to solve 11 key issues in NLP classification tasks: class imbalance & low-resource computation & small samples & robustness & test inspection & long-text classification