NLP natural language processing - Introduction to machine learning and natural language processing (3)

2022-07-26 07:25:00 【Emperor Confucianism is supreme】

NLP natural language processing - Introduction to machine learning and natural language processing - New word discovery and TF-IDF
1. New word discovery
（1） Why do new word discovery
①　 Without an existing vocabulary, we need some way to find words in raw text;
②　 As the amount of data grows, an old vocabulary gradually fails to cover later needs;
③　 A supplemented vocabulary helps downstream tasks;
④　 A word behaves like a fixed collocation. Its inside is stable, which is measured by internal cohesion (also called solidification degree); its outside is unstable, which is measured by left and right entropy.
For example, the word Hebei (河北) is stable as a unit, but the character that follows it is not fixed.
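The cohesion and entropy ideas above can be sketched in Python. This is a minimal illustration under stated assumptions, not a definitive implementation: cohesion is approximated by pointwise mutual information over two-character candidates, probabilities are estimated by dividing counts by the total character count (a simplification), and the toy corpus and the function names `ngram_stats`, `entropy` and `cohesion` are made up for the example.

```python
import math
from collections import Counter, defaultdict

def ngram_stats(corpus, max_len=2):
    """Count character n-grams and the characters seen to their left and right."""
    counts = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for sent in corpus:
        for n in range(1, max_len + 1):
            for i in range(len(sent) - n + 1):
                gram = sent[i:i + n]
                counts[gram] += 1
                if i > 0:
                    left[gram][sent[i - 1]] += 1
                if i + n < len(sent):
                    right[gram][sent[i + n]] += 1
    return counts, left, right

def entropy(neighbors):
    """Shannon entropy (natural log) of a neighbor-character distribution."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in neighbors.values())

def cohesion(gram, counts, total_chars):
    """Pointwise mutual information of a two-character candidate (assumed measure)."""
    p_gram = counts[gram] / total_chars
    p_first = counts[gram[0]] / total_chars
    p_second = counts[gram[1]] / total_chars
    return math.log(p_gram / (p_first * p_second))

# Hypothetical toy corpus: "河北" is followed or preceded by varying characters.
corpus = ["河北省", "河北人", "河北的", "去河北", "在河北"]
counts, left, right = ngram_stats(corpus)
total_chars = sum(len(s) for s in corpus)

hebei_cohesion = cohesion("河北", counts, total_chars)
hebei_right = entropy(right["河北"])      # three different right neighbors
beisheng_right = entropy(right["北省"])   # never followed by anything here
```

In this toy corpus the candidates 河北 and 北省 get the same cohesion score, but only 河北 has varied right neighbors, so only 河北 looks like a complete word. A practical system would set thresholds on both scores.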
（2） What is an important word
①　 After segmenting a text, we rely on its words to understand the document, so what we need are the important words in that document;

②　 If a word appears many times in one category of text (say category A) and rarely in the other categories (non-A), then it is an important word (a high-weight word) for category A.
Conversely, if a word appears in many categories, it carries little importance for any single category.

③　 The importance of a word can be described mathematically. In NLP this is TF-IDF. TF is term frequency: the number of times a word appears in a category divided by the total number of words in that category. IDF is inverse document frequency: a high IDF means the word rarely appears in other documents.
Calculation: each word gets one TF·IDF value per category. A high TF·IDF value means the word is highly important in that category.
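A minimal Python sketch of this calculation follows. TF matches the definition given above. The text does not spell out an IDF formula, so the common form log(N / df) is assumed here, where N is the number of categories and df is the number of categories containing the word. Smoothed variants also exist. The function name `tf_idf` and the sample categories are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs maps a category name to its list of already segmented tokens."""
    n_docs = len(docs)
    # Document frequency: in how many categories does each word occur?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            # TF = count / total words; IDF = log(N / df) (assumed form)
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        }
    return scores

# Hypothetical toy data
docs = {
    "sports": ["match", "goal", "team", "the", "goal"],
    "finance": ["stock", "market", "the", "bank"],
    "tech": ["chip", "the", "software", "chip"],
}
scores = tf_idf(docs)
top_sports_word = max(scores["sports"], key=scores["sports"].get)
```

Here "the" occurs in every category, so its IDF is log(3 / 3) = 0 and its score is zero everywhere, while "goal" is both frequent in "sports" and absent elsewhere, so it gets the highest score in that category.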

2. Characteristics of the TF-IDF algorithm
（1） The TF-IDF calculation depends heavily on the word segmentation result. If the segmentation is wrong, the statistics lose much of their meaning;
（2） Each word has a different TF-IDF value for each document, so TF-IDF cannot be discussed apart from the data;
（3） TF-IDF cannot be computed from a single text;
（4） Balance between the amounts of data in each category matters;
（5） It is easily affected by special symbols, so some preprocessing is advisable.
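Points (3) and (5) can be illustrated with a short Python sketch. It assumes the IDF form log(N / df) and uses whitespace splitting as a stand-in for a real word segmenter. The helper names `clean` and `idf` and the sample string are hypothetical.

```python
import math
import re
from collections import Counter

def clean(text):
    """Lowercase, then replace everything except word characters and whitespace."""
    return re.sub(r"[^\w\s]", " ", text.lower())

def idf(token_lists):
    """IDF = log(N / df), an assumed common form without smoothing."""
    n_docs = len(token_lists)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    return {word: math.log(n_docs / d) for word, d in df.items()}

raw = "TF-IDF!!! is simple... & fast ###"
tokens = clean(raw).split()   # special symbols no longer become tokens

# With a single text, N = df = 1 for every word, so every IDF is log(1) = 0.
single_text_idf = idf([tokens])
```

Without the cleaning step, strings such as "###" or "&" would be counted as words. With only one text every IDF is zero, so the TF-IDF values carry no information.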
3. Applications of the TF-IDF algorithm
（1） TF-IDF application: search engine
①　 For every existing web page (text), compute the TF-IDF value of each word;
②　 Segment the input query into words;
③　 For a document D, sum the TF-IDF values that the query words have in D, and use the sum as the relevance score between the query and the document.
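The three steps above can be sketched as follows. This is a toy illustration, not how a production search engine works. It assumes pages are already segmented into tokens and uses IDF = log(N / df). The names `build_index` and `search` and the sample pages are hypothetical.

```python
import math
from collections import Counter

def build_index(pages):
    """Step 1: compute a TF-IDF weight for every word of every page."""
    n_pages = len(pages)
    df = Counter()
    for tokens in pages.values():
        df.update(set(tokens))
    index = {}
    for page_id, tokens in pages.items():
        counts = Counter(tokens)
        index[page_id] = {
            word: count / len(tokens) * math.log(n_pages / df[word])
            for word, count in counts.items()
        }
    return index

def search(query_tokens, index):
    """Step 3: score each page by the summed TF-IDF of the query words."""
    scores = {
        page_id: sum(weights.get(word, 0.0) for word in query_tokens)
        for page_id, weights in index.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical pages, already segmented
pages = {
    "p1": ["python", "tutorial", "for", "beginners"],
    "p2": ["java", "tutorial", "for", "beginners"],
    "p3": ["python", "web", "framework", "python"],
}
index = build_index(pages)
# Step 2 (query segmentation) is assumed to have produced these tokens.
ranking = search(["python", "framework"], index)
```

Page p3 ranks first because it contains both query words, p1 second because it contains only "python", and p2 scores zero.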
（2） TF-IDF application: text summarization
①　 Use TF-IDF values to obtain the keywords of each text;
②　 Treat sentences that contain many keywords as key sentences;
③　 Select some key sentences as the summary of the text.
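A minimal extractive sketch of these steps is shown below. It makes several assumptions: a small set of background texts is available for computing IDF, IDF takes the form log(N / df), and a sentence is scored simply by how many keyword occurrences it contains. The function name `summarize` and the sample data are hypothetical.

```python
import math
from collections import Counter

def summarize(sentences, background, top_k=3, n_sentences=1):
    """sentences: token lists of the target text; background: token lists of other texts."""
    doc_tokens = [word for sentence in sentences for word in sentence]
    all_docs = [doc_tokens] + background
    df = Counter()
    for tokens in all_docs:
        df.update(set(tokens))
    counts = Counter(doc_tokens)
    tfidf = {
        word: count / len(doc_tokens) * math.log(len(all_docs) / df[word])
        for word, count in counts.items()
    }
    # Step 1: the top_k words by TF-IDF are the keywords.
    keywords = set(sorted(tfidf, key=tfidf.get, reverse=True)[:top_k])
    # Step 2: rank sentences by the number of keyword occurrences.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: sum(word in keywords for word in sentences[i]),
        reverse=True,
    )
    # Step 3: keep the best sentences, restored to their original order.
    chosen = sorted(ranked[:n_sentences])
    return keywords, [sentences[i] for i in chosen]

# Hypothetical data, already segmented
sentences = [
    ["the", "weather", "is", "nice"],
    ["tfidf", "ranks", "keywords", "by", "tfidf", "weight"],
    ["the", "end"],
]
background = [
    ["the", "weather", "is", "cold", "by", "the", "end"],
    ["is", "the", "weight", "nice"],
]
keywords, summary = summarize(sentences, background)
```

The words that occur only in the target text become its keywords, and the second sentence, which contains all of them, is selected as the summary.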
（3） TF-IDF application: text similarity calculation
After computing TF-IDF for all texts, select the top n words by TF-IDF from each text to obtain a word set S. For each text D, compute the term frequency of every word in S and use the result as the vector of that text. The cosine of the angle between two vectors gives their similarity, which serves as the similarity of the texts.
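The cosine of the angle between two vectors A and B is calculated as:
$$\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^2}\,\sqrt{\sum_{i} B_i^2}}$$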
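The similarity procedure can be sketched in Python as follows. It is an illustration only, assuming segmented input and IDF = log(N / df). The names `cosine` and `text_vectors` and the sample texts are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def text_vectors(texts, n=2):
    """Build the word set S from each text's top-n TF-IDF words, then TF vectors over S."""
    df = Counter()
    for tokens in texts:
        df.update(set(tokens))
    vocab = set()
    for tokens in texts:
        counts = Counter(tokens)
        tfidf = {
            word: count / len(tokens) * math.log(len(texts) / df[word])
            for word, count in counts.items()
        }
        vocab.update(sorted(tfidf, key=tfidf.get, reverse=True)[:n])
    vocab = sorted(vocab)
    vectors = []
    for tokens in texts:
        counts = Counter(tokens)
        vectors.append([counts[word] / len(tokens) for word in vocab])
    return vocab, vectors

# Hypothetical texts, already segmented
texts = [
    ["cat", "cat", "cat", "sat", "sat", "mat"],
    ["cat", "cat", "cat", "ate", "ate", "fish"],
    ["stock", "stock", "stock", "market", "market", "fell"],
]
vocab, vectors = text_vectors(texts)
sim_cats = cosine(vectors[0], vectors[1])    # the two texts share "cat"
sim_other = cosine(vectors[0], vectors[2])   # no shared words in S
```

The two texts about cats share one dimension of S and get a similarity of 9/13, about 0.69, while the unrelated text has no words in common and gets 0.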
4. Advantages of TF-IDF
①　 Good interpretability: the keywords are clearly visible, so even when a prediction is wrong the cause is easy to find;
②　 Fast to compute: word segmentation takes most of the time, and the rest is simple counting;
③　 Little dependence on labeled data: part of the work can be done with an unlabeled corpus;
④　 Combines well with many algorithms: the values can be treated as word weights.
5. Disadvantages of TF-IDF
①　 Strongly affected by the quality of word segmentation;
②　 Captures no semantic similarity between words (this problem is fatal);
③　 Carries no word order information (it is a bag-of-words model);
④　 Limited capacity: it cannot handle complex tasks such as machine translation or entity mining;
⑤　 Sample imbalance strongly affects the results;
⑥　 The distribution of samples within a class is not considered.
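Disadvantages ② and ③ are easy to reproduce with a small sketch. It uses plain bag-of-words counts rather than full TF-IDF weights, which is enough to show the effect. The function name `bow_cosine` and the sample sentences are hypothetical.

```python
import math
from collections import Counter

def bow_cosine(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

# Same words, opposite meaning: word order is invisible to the model.
reordered = bow_cosine(["dog", "bites", "man"], ["man", "bites", "dog"])

# Near-synonymous phrases with no shared words: similarity collapses to zero.
synonyms = bow_cosine(["great", "film"], ["excellent", "movie"])
```

The reordered sentences are judged identical (similarity 1 up to floating-point rounding), and the two phrases with near-identical meaning are judged completely unrelated (similarity 0).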