当前位置：网站首页>ML - natural language processing - Key Technologies

ML - natural language processing - Key Technologies

2022-07-25 15:23:00 【sword_ csdn】

Catalog

Reference resources

Huawei cloud College

participle

Chinese word segmentation （Chinese Word Segmentation）： It refers to the segmentation of a Chinese character sequence into a single word . Word segmentation is the process of recombining consecutive word sequences into word sequences according to certain norms .
for example ： 1998 / China / Realization / Import and export / Gross value / reach / 109.82 billion / dollar

Regular participle

Regular participle ： A mechanical word segmentation method , Mainly through the maintenance of dictionaries , When cutting sentences , Match each string in the statement with the words in the thesaurus one by one , If you find it, cut it , Otherwise, it can't be divided . According to the way of matching and segmentation , There are mainly ：
（1） Forward maximum matching （Maximum Match Method,MM Law ）
（2） Reverse maximum matching （Reverse Maximum Match Method,RMM Law ）
（3） Two way maximum matching （Bi-direction Match Method,MM Law ）
characteristic ： A simple and efficient , Dictionary maintenance is difficult . Network neologisms emerge in endlessly , It is difficult for a dictionary to cover all words .

Statistical analysis

The word segmentation is realized as the sequence annotation task of words in the string . Each word occupies a certain word formation position when constructing a specific word , If the connected words appear more frequently in different texts , It proves that the connected word is probably a word .
Insert picture description here
step ：
（1） Build statistical language model
（2） Divide sentences into words , Then calculate the probability of the result , Get the word segmentation with the greatest probability . Such as hidden Markov （HMM）、 Conditional random field （CRF） etc. .

Deep learning word segmentation

Use word2vec Embed the words of the word material , Get word embedding , Use word embedding feature input to two-way LSTM, Add a linear layer to the hidden layer of the output , And then add one CRF Get the final implementation model .
Insert picture description here

Mixed participle

In practical engineering applications , Most of them are based on a word segmentation algorithm , The most common way is to segment words based on dictionaries , Then use statistical word segmentation to assist .

Definition of part of speech tagging

Part of speech tagging refers to the process of tagging a correct part of speech for each word in the word segmentation result . For example, a word is a noun 、 Verb 、 Adjectives or other parts of speech .
The part of speech ： Basic grammatical attributes of vocabulary .
Purpose ： A lot NLP Task preprocessing steps , Such as parsing 、 Information extraction , The text marked with part of speech will bring great convenience , But it is not indispensable .
Method ： A rule-based approach 、 Statistical based methods 、 Methods based on deep learning .

Named entity recognition

Named entity recognition （Named Entities Recognition,NER）： Also known as “ Proper name recognition ”, It refers to the recognition of entities with specific meaning in the text , Mainly including people's names 、 Place names 、 Organization name 、 Proper nouns, etc . for example ： metallurgy /n Ministry of industry /n luoyang /ns Refractories /l research institute /n.
NER The named entities studied are generally divided into 3 Categories: （ Entity class 、 Time class and number class ） and 7 Subclass （ The person's name 、 Place names 、 Organization name 、 Time 、 date 、 Currency and percentage ）.
And automatic word segmentation 、 Part of speech tagging , Named entity recognition is also a basic task in natural language , It's information extraction 、 Information retrieval 、 Machine translation 、 Q & a system and other essential components of Technology .
step ：（1） Entity boundary recognition .（2） Determine the entity class （ The person's name 、 Place names 、 Organization name ）
difficulty ：（1） There are many kinds of named entities .（2） The composition of named entities is complex .（3） Nesting is complex .（4） The length is uncertain

Deep learning NER

Insert picture description here

Keywords extraction

Keywords are a group of words that represent the important content of an article , In reality, a large number of texts do not contain keywords , Therefore, automatic keyword extraction technology can make people browse and obtain information conveniently , Clustering text 、 classification 、 Automatic summarization plays an important role .
Keyword extraction algorithms can also be divided into supervised and unsupervised .
Supervised ： By classification , By building a relatively rich and perfect The vocabulary of , Then by judging the matching degree between each document and each word in the Thesaurus , In a manner similar to labeling , Achieve the effect of extracting keywords .
Unsupervised ： No need to manually generate 、 Maintain a vocabulary , Don't train with the help of manual standard corpus . for example ,TF-IDF Algorithm 、TextRank Algorithm 、 Topic model algorithm （LSA、LSI、LDA）

TF-IDF Algorithm

Word frequency - Inverse document frequency algorithm （Term Frequency-Inverse Document Frequency,TF-IDF）： It is a calculation method based on Statistics , It is often used to evaluate the importance of a word to a document in a document set .
Insert picture description here

TextRank Algorithm

TextRank The basic idea of the algorithm comes from Google Of PageRank Algorithm .PR Algorithm is a method to evaluate the importance of web pages covered by search system . There are two basic ideas ：
（1） Number of links . A web page is linked by more other web pages , The more important this page is .
（2） Link quality . A web page is linked by a higher weight web page , It also shows that this webpage is important .
Insert picture description here

LSA/LSI/LDA Algorithm

The topic model believes that there is no direct connection between words and documents , They should also have a dimension that connects them , This dimension is called theme . Each document should correspond to one or more topics , And each topic will have a corresponding word distribution , The word distribution of each document can be obtained through the topic .
Insert picture description here