Chinese text classification based on CNN
text-classification-cnn
Text classification in natural language processing (NLP) using convolutional neural networks (CNN). This article walks through the TensorFlow code for:
- Word embedding
  - Padding
  - Embedding
- Convolution layer
  - Convolution (tf.nn.conv1d)
  - Pooling
- Fully connected layer
  - Dropout
- Output layer
  - Softmax
File test

Keyboard input test

Network structure and interpretation
The main structure of the network is as follows:
The detailed flow chart of the code is:
1. Word embedding (word-vector conversion)
This is the most important step in the NLP text-classification task: once we know how to use word vectors to represent natural-language sentences (sequences) in a form a computer can work with, text classification degenerates into a plain data-classification problem, essentially no different from classifying MNIST digits.
Before solving that problem, let's first review how language expresses meaning.
How to express the meaning of a word
Let's first look at how "meaning" is defined. In English, meaning refers to the idea that a person or a word intends to express. This is a recursive definition: look up idea in a dictionary and it is probably explained in terms of meaning. A dictionary typically gives:
1. The idea that is represented by a word, phrase, etc.
2. The idea that a person wants to express by using words, signs, etc.
3. The idea that is expressed in a work of writing, art, etc.
However, this kind of definition cannot be used directly by a computer system to process language.
How computers represent the meaning of words
The simplest word vector is the one-hot vector: only the dimension corresponding to the word is 1 and every other dimension is 0, and the vector length equals the size of the vocabulary (vocab). As shown in the following table:
| Character | Word vector |
|---|---|
| One | [0,0,0,0,0,0,0,1,0,……,0,0,0,0,0,0,0] |
| individual | [0,0,0,0,0,0,1,0,0,……,0,0,0,0,0,0,0] |
| Small | [0,0,0,0,0,1,0,0,0,……,0,0,0,0,0,0,0] |
| can | [0,0,1,0,0,0,0,0,0,……,0,0,0,0,0,0,0] |
| Love | [0,0,0,1,0,0,0,0,0,……,0,0,0,0,0,0,0] |
The disadvantages of this representation are obvious:
1. **It easily causes the curse of dimensionality.** To represent 5,000 common characters we already need 5,000-dimensional word vectors; representing words or idioms requires even larger vectors.
2. **It cannot express semantic relationships between words.** The distance between any two one-hot vectors is the same, so words with similar meanings cannot be placed close to each other (see the short sketch below).
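A quick numpy sketch (a toy illustration, not from the project code) makes the second point concrete: any two distinct one-hot vectors are equally far apart and have zero dot product, so no similarity structure can be encoded.

```python
import numpy as np

vocab_size = 5000

def one_hot(idx, size=vocab_size):
    # Build a one-hot vector: 1 at position `idx`, 0 everywhere else
    v = np.zeros(size)
    v[idx] = 1.0
    return v

cat, dog, car = one_hot(10), one_hot(11), one_hot(4000)

# Euclidean distance between any two distinct one-hot vectors is always sqrt(2)
print(np.linalg.norm(cat - dog), np.linalg.norm(cat - car))   # 1.414..., 1.414...

# Dot product (hence cosine similarity) between distinct one-hot vectors is always 0
print(cat @ dog, cat @ car)                                    # 0.0, 0.0
```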
Can we reduce the dimensionality of the word vector?
Distributed representation solves the problems of one-hot representation. The idea is to train a mapping from each word to a shorter word vector. All of these word vectors form a vector space, and we can then use ordinary statistical methods to study the relationships between words. How long should this shorter vector be? That is usually something we specify ourselves before training.
In other words, the representation itself is learned by training a neural network.
Ideally, we could train word representations like the one shown in the figure below. In practice, however, we only specify the dimensionality of the word vector; we do not know what each individual dimension means after training.

We map the word king from a very sparse high-dimensional vector into a space of, say, four-dimensional vectors. The mapping must satisfy the following properties:
(1) the mapping is injective (one-to-one);
(2) the mapped vector does not lose the information contained in the original vector.
This process is called word embedding: embedding high-dimensional word vectors into a low-dimensional space.

Once we have obtained short distributed-representation word vectors through this dimensionality reduction, we can easily analyze the relationships between words. For example, if we reduce word vectors to 2 dimensions, an interesting study shows that with word vectors like the ones below we can find:
$$\overrightarrow{King} - \overrightarrow{Man} + \overrightarrow{Woman} = \overrightarrow{Queen}$$
Isn't the learning ability of machine learning impressive!

How the meaning of Chinese words is handled
Chinese text differs from English: English words are easily separated by spaces, whereas Chinese usually requires a word-segmentation step first, after which the segmented words are encoded. Encoding Chinese characters directly, without prior segmentation, is called character-level encoding.
This article first applies character-level one-hot encoding to the text and then uses a neural network to learn the mapping from the high-dimensional space to a low-dimensional one. The detailed steps are:
1. Build a vocabulary (vocab): the set of all letters, digits, symbols and Chinese characters that may appear in the input (this article uses a vocab of size 5,000). The vocab looks like the following table:
| id | Character |
|---|---|
| … | … |
| 15 | LAN |
| 16 | cloth |
| 17 | Cypress |
| 18 | Tit |
| 19 | art |
| 20 | Of |
| 21 | Do you |
| 22 | g |
| 23 | ( Space ) |
| … | … |
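As a side note, here is a minimal sketch of how such a character-level vocabulary could be built from a corpus (the actual project builds and saves its own vocab file; the `build_vocab` helper and the `<PAD>` entry below are illustrative assumptions):

```python
from collections import Counter

def build_vocab(texts, vocab_size=5000):
    """Count characters over the corpus and keep the most frequent ones as the vocabulary."""
    counter = Counter()
    for text in texts:
        counter.update(text)
    # Reserve id 0 for a padding placeholder so padded zeros do not collide with real characters
    return ['<PAD>'] + [ch for ch, _ in counter.most_common(vocab_size - 1)]

words = build_vocab(['a toy corpus', 'another toy text'])
word_to_id = dict(zip(words, range(len(words))))
```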
2. Use the vocabulary (vocab) to convert the input text into a list of ids. The code is:
```python
with open_file(vocab_dir) as fp:                       # open the vocab file
    words = [_.strip() for _ in fp.readlines()]        # read one character per line into a list
word_to_id = dict(zip(words, range(len(words))))       # pair each character with its id, as a dict
# word_to_id = {'LAN': 15, 'cloth': 16, 'Cypress': 17, ...}
```
Suppose the input text is:
'Lambertic cloth art bed 1.8 m double bed soft bed wedding bed'
Converted with the vocabulary (vocab) above, its id form is:
[15, 17, 18, 22, 23, 16, …]
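A minimal sketch of this text-to-id conversion, assuming the `word_to_id` dictionary built above (the `text_to_ids` helper is illustrative; the project wraps this step in its data-loading code):

```python
def text_to_ids(text, word_to_id):
    """Map each character of `text` to its vocabulary id, skipping characters not in the vocab."""
    return [word_to_id[ch] for ch in text if ch in word_to_id]

# Toy vocabulary for illustration (the real vocab has 5000 entries)
word_to_id = {'a': 0, 'b': 1, 'c': 2, ' ': 3}
print(text_to_ids('ab c', word_to_id))   # [0, 1, 3, 2]
```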
The one-hot encoding matrix is:
```text
[ 0, 0, 0, 0, 0, 0, ...]
...
[ 1, 0, 0, 0, 0, 0, ...] # Subscript 15
[ 0, 0, 0, 0, 0, 1, ...]
[ 0, 1, 0, 0, 0, 0, ...]
[ 0, 0, 1, 0, 0, 0, ...]
[ 0, 0, 0, 0, 0, 0, ...]
[ 0, 0, 0, 0, 0, 0, ...] # Subscript 20
[ 0, 0, 0, 0, 0, 0, ...]
[ 0, 0, 0, 1, 0, 0, ...]
[ 0, 0, 0, 0, 1, 0, ...]
...
```
3. Pad the text to a fixed length
```python
x_pad = kr.preprocessing.sequence.pad_sequences(data_id, max_length)
```
Here max_length is set to 100, meaning the text length cannot exceed max_length; the id list is padded into a fixed-length list. The padding (pad) fills zeros at the front:
```text
[0, 0, 0, 0, 0, ...... , 15, 17, 18, 22, 23, 16, ...]   # zero-padded at the front so the length is max_length
```
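A minimal runnable sketch of the padding step (assuming `kr` refers to the Keras packaged with TensorFlow, as in the project's `import tensorflow.contrib.keras as kr`; `pad_sequences` pads at the front by default):

```python
from tensorflow import keras as kr   # the project uses `import tensorflow.contrib.keras as kr` (TF 1.x)

data_id = [[15, 17, 18, 22, 23, 16]]   # one sentence, already converted to ids
max_length = 10                        # the project uses 100

x_pad = kr.preprocessing.sequence.pad_sequences(data_id, max_length)
print(x_pad)
# [[ 0  0  0  0 15 17 18 22 23 16]]  -- zeros are prepended up to max_length
```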
4. Word embedding (embedding)
```python
embedding = tf.get_variable('embedding', [vocab_size, embedding_dim])   # shape 5000 x 64
embedding_inputs = tf.nn.embedding_lookup(embedding, input_x)
```
The code above converts the 5,000-dimensional one-hot-encoded input text into lower-dimensional (embedding_dim-dimensional) word vectors of real numbers. In the project code embedding_dim is set to 64. To simplify the problem and study how tf.nn.embedding_lookup works, let's assume:
```python
embedding_dim = 2   # assume each word vector uses only 2 real-valued dimensions
input_x = [[0, 0, 0, 0, 0, ...... , 15, 17, 18, 22, 23, 16, ...]]   # nested list; the outer list holds input sentences, here just one
embedding = [[0, 0], ..., [0.1, 1.5], [1.0, 0.1], [0.2, 0.1], [1.0, 0.3], [0.5, 0.1],   # indices 15-19
             [0.3, 1.5], [0.1, 0.6], [0.4, 0.8], [0.5, 0.5], ...]                        # indices 20-23
```
The embedding expressed as a table:
| Index | Vector |
|---|---|
| 0 | [0.0,0.0] |
| … | … |
| 15 | [0.1,1.5] |
| 16 | [1.0,0.1] |
| 17 | [0.2,0.1] |
| 18 | [1.0,0.3] |
| 19 | [0.5,0.1] |
| 20 | [0.3,1.5] |
| 21 | [0.1,0.6] |
| 22 | [0.4,0.8] |
| 23 | [0.5,0.5] |
| … | … |
Note that these parameters are continually updated during training.
Using the embedding above, the result of tf.nn.embedding_lookup(embedding, input_x) is:
```text
[[[ 0.0  0.0]
  ...
  [ 0.1  1.5]   # 15 - LAN
  [ 0.2  0.1]   # 17 - Cypress
  [ 1.0  0.3]   # 18 - Tit
  [ 0.4  0.8]   # 22 - g
  [ 0.5  0.5]   # 23 - (space)
  [ 1.0  0.1]   # 16 - cloth
  ...]]
```
In other words, input_x is mapped from a 5000×100 one-hot encoding to a 2×100 word-vector representation (each character is mapped to a 2-dimensional word vector, and the sequence length is 100). The one-hot encoding never appears explicitly in the code: tf.nn.embedding_lookup simply takes the rows of embedding at the indices given in input_x. Since each index i lies in [0, 5000) and embedding[i] is a 2-dimensional vector, this realizes the mapping from 5000 dimensions (one-hot form) to 2 dimensions, giving the same result as first one-hot encoding and then applying the mapping. When the embedding dimension is n, the principle is the same as for 2; the vectors simply become n-dimensional.
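A minimal runnable sketch of the lookup with the toy 2-dimensional embedding above (written for TensorFlow 2.x eager execution so the values print directly; the project itself uses the TF 1.x graph API):

```python
import numpy as np
import tensorflow as tf

# Toy embedding table: row i is the 2-D vector for vocabulary id i
embedding = np.zeros((24, 2), dtype=np.float32)
embedding[15] = [0.1, 1.5]; embedding[16] = [1.0, 0.1]; embedding[17] = [0.2, 0.1]
embedding[18] = [1.0, 0.3]; embedding[22] = [0.4, 0.8]; embedding[23] = [0.5, 0.5]

input_x = [[15, 17, 18, 22, 23, 16]]   # one sentence of ids (batch size 1)

looked_up = tf.nn.embedding_lookup(embedding, input_x)
print(looked_up.shape)   # (1, 6, 2): each id is replaced by its 2-D row of `embedding`
print(looked_up[0, 0])   # [0.1 1.5], the vector for id 15
```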
2. conv1d Convolution
```python
conv = tf.layers.conv1d(embedding_inputs, filters=256, kernel_size=5)
```
The convolution is computed as shown in the figure below:

Unlike the two-dimensional convolution used in image processing, one-dimensional convolution is used for text. As shown in the figure above, 256 convolution kernels are used, each of size 5. Each kernel slides to the right over all embedding dimensions of the features simultaneously; at every position the output is the sum of the products of each feature dimension with the kernel weights, plus a bias (see the red area in the figure). Note that when two characters in a sentence are more than 5 positions apart, they are never covered by the same kernel window, so the relationship between them is not considered. This is a limitation of CNNs for text; an LSTM can remedy this shortcoming.
3. max_pool Maximum pooling
```python
max_pool = tf.reduce_max(conv, reduction_indices=[1])
```
During convolution, a kernel of length 5 slides over text of length 100 and produces 96 output values. Since there are 256 kernels, the output size after convolution is 96×256.
The code uses a simplified global max pooling: the maximum is taken directly over the 96 outputs of each kernel (rather than sliding a pooling window), so the output size after pooling is 256.
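A minimal shape-check sketch for the convolution and pooling steps, written with `tf.keras.layers.Conv1D` as a stand-in for the original `tf.layers.conv1d` call (TF 2.x eager execution assumed so the shapes print directly):

```python
import tensorflow as tf

embedding_inputs = tf.random.normal([1, 100, 64])   # (batch, sequence length, embedding_dim)

# 256 kernels of size 5, 'valid' padding: output length is 100 - 5 + 1 = 96
conv = tf.keras.layers.Conv1D(filters=256, kernel_size=5)(embedding_inputs)
print(conv.shape)        # (1, 96, 256)

# Global max pooling: take the maximum over the 96 positions for each of the 256 kernels
max_pool = tf.reduce_max(conv, axis=1)
print(max_pool.shape)    # (1, 256)
```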
4. dense Fully connected layer and output layer
```python
fc = tf.layers.dense(max_pool, units=512)          # fully connected: 256 -> 512
fc = tf.contrib.layers.dropout(fc, keep_prob)      # dropout regularization
fc = tf.nn.relu(fc)                                # ReLU activation
logits = tf.layers.dense(fc, units=num_classes)    # output layer: 512 -> num_classes
y = tf.nn.softmax(logits)                          # probability output
y_pred_cls = tf.argmax(y, 1)                       # index of the predicted category
```
The fully connected layer transforms the 256-dimensional intermediate features into 512 dimensions, and the output layer further converts them into a probability output over the 1258 categories. The index with the highest probability is the predicted class; finally, looking that index up in categories gives the predicted label.
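A minimal sketch of this classifier head with the shapes described above, using `tf.keras` layers as stand-ins for the original `tf.layers`/`tf.contrib.layers` calls (the dropout rate corresponds to 1 - keep_prob; `num_classes = 1258` follows the category count stated above):

```python
import tensorflow as tf

num_classes = 1258
max_pool = tf.random.normal([1, 256])                        # pooled features from the previous step

fc = tf.keras.layers.Dense(512)(max_pool)                    # fully connected: 256 -> 512
fc = tf.keras.layers.Dropout(rate=0.5)(fc, training=True)    # dropout (rate = 1 - keep_prob)
fc = tf.nn.relu(fc)                                          # ReLU activation

logits = tf.keras.layers.Dense(num_classes)(fc)              # output layer: 512 -> 1258
y = tf.nn.softmax(logits)                                    # class probabilities
y_pred_cls = tf.argmax(y, axis=1)                            # index of the predicted category
print(y.shape, int(y_pred_cls[0]))                           # (1, 1258) and some class index
```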
The end, thanks for reading ~~
