NLP (natural language processing) natural language processing learning
2022-07-26 08:19:00 【I am I】
One: Basic terminology
1. Segmentation: splitting text into sentences at punctuation marks such as full stops and commas.
2. Tokenizing (tokenization): splitting sentences into individual tokens (words).
3. Stop words: high-frequency function words such as "was", "are", "and", "in" that are usually filtered out.
4. Stemming: extracting the stem of a word; "skipped", "skipping", "skips" all stem from "skip".
5. Lemmatization: reducing a word to its dictionary form; "are", "am", "is" all map to the verb "be".
6. Part-of-speech tagging: labeling each token as a noun, verb, preposition, and so on.
7. Named entity tagging: labeling named entities such as people, places, and organizations.
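To make the stemming idea concrete, here is a toy suffix-stripping stemmer. It is a hypothetical illustration only, not the Porter algorithm that real toolkits such as nltk implement:

```python
def toy_stem(word):
    """Strip a few common English suffixes to approximate a stem.
    A real stemmer (e.g. nltk.stem.PorterStemmer) uses many more rules."""
    for suffix in ("ping", "ped", "ing", "ed", "s"):
        # Only strip when enough of the word remains to be a plausible stem.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(toy_stem("skipped"))   # skip
print(toy_stem("skipping"))  # skip
print(toy_stem("skips"))     # skip
```

This matches the example in the list above: all three inflected forms collapse to "skip".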
Two: Language modeling (n-gram, RNN)
Purpose: predict the next word. For example, after you type a few words into Google search, it suggests completions ranked by the probability of the next word.
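The next-word idea can be sketched with a tiny bigram model. The corpus and function names here are invented for illustration; a real language model would use smoothing and a much larger corpus:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count consecutive word pairs so we can rank candidate next words."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word, k=3):
    """Return the k most frequent continuations of `word`."""
    return [w for w, _ in counts[word.lower()].most_common(k)]

corpus = [
    "natural language processing is fun",
    "natural language models predict the next word",
]
model = train_bigram(corpus)
print(predict_next(model, "natural"))  # ['language']
```

This is exactly the 2-gram case discussed later: the prediction only looks at one previous word, which is why n-gram models see just a small window of the sentence.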


Pros and cons of the RNN model:

A typical NLP pipeline:
- Raw text (raw data)
- Word segmentation (tokenization)
- Cleaning: remove useless punctuation, special symbols, and stop words
- Normalization: stemming and lemmatization
- Feature extraction: tf-idf, word2vec
- Modeling: similarity algorithms, classification algorithms
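The front half of this pipeline, from raw text to normalized tokens, can be sketched in a few lines of plain Python. The stop-word list and regular expressions below are illustrative assumptions, not from the original post:

```python
import re

# A tiny stop-word list for demonstration; real lists (e.g. nltk's) are larger.
STOP_WORDS = {"was", "are", "and", "in", "the", "is"}

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)   # cleaning: drop HTML-like markup
    text = re.sub(r"[^\w\s]", " ", text)   # cleaning: drop punctuation
    tokens = text.lower().split()          # segmentation + lowercasing
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word removal

print(preprocess("<b>The cats are sleeping, and the dogs run!</b>"))
# ['cats', 'sleeping', 'dogs', 'run']
```

The output tokens would then feed the feature-extraction and modeling stages.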
Text preprocessing:
1. Remove the non-text parts of the data
Use regular expressions to strip unwanted symbols and punctuation (e.g. with re.compile)
- Word segmentation
English: split()
Chinese: pip install jieba
2. Remove stop words
English: install nltk
Chinese: build your own Chinese stop-word list (about 1208 entries)
3. Normalize English words
Stemming and lemmatization, e.g. using nltk's WordNet
4. Convert English words to lowercase
word = word.lower()
5. Feature handling
Bag-of-words model (BoW, tf-idf)
n-gram language model (bigram, trigram)
word2vec distributed representations
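A minimal sketch of the tf-idf weighting mentioned in step 5, assuming the standard tf * log(N/df) formulation. The documents are invented for illustration:

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [d.split() for d in docs]

def tf_idf(doc, word, all_docs):
    """Term frequency in `doc` times inverse document frequency over `all_docs`."""
    tf = Counter(doc)[word] / len(doc)
    df = sum(1 for d in all_docs if word in d)
    idf = math.log(len(all_docs) / df)
    return tf * idf

# "cat" appears in only one document, so it gets a higher weight than
# "the", which appears in two of the three documents.
print(round(tf_idf(tokenized[0], "cat", tokenized), 3))
print(round(tf_idf(tokenized[0], "the", tokenized), 3))
```

Rare, document-specific words get large weights; words common across the corpus get weights near zero, which is why tf-idf makes a better feature than raw counts.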
Two: Using RNNs in NLP (Recurrent Neural Network)
Original reference: https://zhuanlan.zhihu.com/p/40797277
Difference from n-grams: an RNN can take the whole sentence, before and after, into account, while 2-gram and 3-gram models only ever see a small window of the sentence, so their error is relatively large.
Basic recurrent neural network:
Input layer -> hidden layer -> output layer; the recurrence lets the network take arbitrarily many earlier inputs into account.

In a traditional neural network (including a CNN), inputs and outputs are independent of each other; for example, the cat and the dog in a picture are recognized separately. But in some tasks the later outputs depend on the earlier content, and local information alone is not enough to carry the task forward. An RNN is a neural network that uses information from earlier steps of the sequence to keep the task going. RNNs introduce the concept of "memory": the word "recurrent" comes from the fact that every element performs the same task, but each output depends on both the current input and the "memory". The structure is shown below:
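The "memory" idea can be shown with a one-unit Elman-style RNN step. The weights below are arbitrary illustrative numbers, not trained values:

```python
import math

def rnn_step(x, h, w_x, w_h, b):
    """One step of a scalar RNN: the new hidden state mixes the current
    input x with the previous hidden state h (the 'memory')."""
    return math.tanh(w_x * x + w_h * h + b)

# The same weights are reused at every position -- that reuse is the recurrence.
h = 0.0
for x in [1.0, 0.5, -0.3]:
    h = rnn_step(x, h, w_x=0.8, w_h=0.5, b=0.1)
print(h)  # final hidden state carries a trace of the whole sequence
```

Because h is fed back in at every step, the final state depends on all earlier inputs, not just the latest one.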



Bidirectional recurrent neural network (an English cloze answer depends not only on the words before the blank but also on the words after it)


Deep recurrent neural network (stacking two or more hidden layers)


Three: Recursive neural networks (Recursive Neural Network)
In practice their results are not much different from CNNs, but they train much more slowly than CNNs.
Four: Why add an LSTM to the RNN? https://zhuanlan.zhihu.com/p/40797277
As the time interval grows, an RNN loses its connection to information far in the past; in other words, its memory capacity is limited. The LSTM reworks the memory cell: what should be remembered (for example, important new input) keeps being passed along, and what should not be remembered is cut off.
RNN cell structure:

LSTM cell structure: built on top of the RNN cell


Step 1: forget

Step 2: update

Step 3: output
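The three steps (forget, update, output) can be sketched as a one-unit LSTM cell. The gate weights here are arbitrary placeholders, not trained values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h, c, p):
    """One scalar LSTM step. p maps each gate name to an
    (input weight, hidden weight, bias) triple."""
    f = sigmoid(p["f"][0] * x + p["f"][1] * h + p["f"][2])    # step 1: forget gate
    i = sigmoid(p["i"][0] * x + p["i"][1] * h + p["i"][2])    # step 2: input gate
    g = math.tanh(p["g"][0] * x + p["g"][1] * h + p["g"][2])  # candidate memory
    c = f * c + i * g                                         # update the cell state
    o = sigmoid(p["o"][0] * x + p["o"][1] * h + p["o"][2])    # step 3: output gate
    h = o * math.tanh(c)
    return h, c

params = {k: (0.5, 0.5, 0.0) for k in ("f", "i", "g", "o")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

The forget gate f scales the old cell state, the input gate i decides how much new content enters, and the output gate o decides how much of the cell state is exposed as the hidden state; this gating is what lets the cell keep or cut off memories over long spans.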

Several variants of the LSTM:


