A detailed explanation of Chinese text classification with CNN
2022-07-01 17:50:00 【Huawei cloud developer community】
Abstract: This article explains how to classify Chinese text with a CNN and compares the result with naive Bayes, decision tree, logistic regression, random forest, KNN, SVM, and other classification algorithms.
This article is shared from the Huawei Cloud community post 《[Python Artificial Intelligence] 21. Word2Vec+CNN Chinese text classification explained and compared with machine learning algorithms》, author: eastmount.
One. Text classification
Text classification automatically assigns labels to a collection of texts according to a given classification scheme or standard; it is a form of automatic classification built on a classification system. The field dates back to the 1950s, when classification relied mainly on expert-defined rules. Expert systems based on knowledge engineering appeared in the 1980s. In the 1990s machine learning methods took over, classifying text with hand-crafted feature engineering and shallow classification models. Today, text is classified with word vectors and deep neural networks.
Mr. Niu Yafeng summarized the traditional text classification process as shown in the figure below. In traditional text classification, most machine learning methods have been applied. They mainly include:
- Naive Bayes
- Random forest / decision tree
- Ensemble methods
- Maximum entropy
- Neural networks
The basic process of text classification with Keras is as follows:
- Step 1: Preprocess the text: word segmentation -> remove stop words -> select the top n words by frequency as feature words
- Step 2: Generate an ID for each feature word
- Step 3: Convert the text into an ID sequence and pad it on the left
- Step 4: Shuffle the training set
- Step 5: Use an Embedding layer to convert words into word vectors
- Step 6: Add the model and build the neural network structure
- Step 7: Train the model
- Step 8: Compute precision, recall, and F1
Note that if you represent documents with TF-IDF instead of word vectors, you segment the text, remove stop words, build the TF-IDF matrix, and feed that matrix directly to the model. This article experiments with both word vectors and TF-IDF.
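Steps 1 to 3 above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the reviews are taken to be already segmented and stripped of stop words, and the helper names `build_vocab` and `to_id_sequences` as well as the sample data are hypothetical, not part of the article's code.

```python
from collections import Counter

def build_vocab(tokenized_docs, top_n):
    """Map the top_n most frequent words to IDs 1..top_n (0 stays free for padding)."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    return {word: idx
            for idx, (word, _) in enumerate(counts.most_common(top_n), start=1)}

def to_id_sequences(tokenized_docs, vocab):
    """Replace every feature word by its ID and drop words outside the vocabulary."""
    return [[vocab[w] for w in doc if w in vocab] for doc in tokenized_docs]

# Hypothetical reviews, already segmented and filtered
docs = [
    ["waterfall", "spectacular"],
    ["queue", "waterfall", "crowded"],
    ["waterfall", "queue"],
]
vocab = build_vocab(docs, top_n=2)          # keep the two most frequent words
sequences = to_id_sequences(docs, vocab)
print(vocab)      # {'waterfall': 1, 'queue': 2}
print(sequences)  # [[1], [2, 1], [1, 2]]
```

The most frequent word receives the smallest ID, and words that do not make the top n simply disappear from the sequences.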
Following the summary by a Zhihu author at https://zhuanlan.zhihu.com/p/34212945, text classification based on deep learning falls into 5 broad categories:
- Word embedding vectorization: word2vec, FastText, etc.
- Convolutional neural network feature extraction: TextCNN (convolutional neural network), Char-CNN, etc.
- Context mechanisms: TextRNN (recurrent neural network), BiRNN, BiLSTM, RCNN, TextRCNN (TextRNN+CNN), etc.
- Memory mechanisms: EntNet, DMN, etc.
- Attention mechanisms: HAN, TextRNN+Attention, etc.
Two. Text classification based on random forest
This part walks through a conventional text classification case. Because random forest works comparatively well, it is the method shared here. The steps are:
- Read the Chinese text from a CSV file
- Call the Jieba library for Chinese word segmentation and data cleaning
- Extract features as TF-IDF weights or Word2Vec word vectors
- Classify with machine learning
- Compute and evaluate precision, recall, and F-measure
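To make the TF-IDF step concrete, here is a minimal hand-rolled sketch. It assumes the smoothed formula that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row; the function name `tfidf_rows` and the two sample reviews are hypothetical.

```python
import math

def tfidf_rows(tokenized_docs):
    """Smoothed tf-idf with L2-normalised rows, one row per document."""
    vocab = sorted({w for doc in tokenized_docs for w in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each word occur?
    df = {w: sum(1 for doc in tokenized_docs if w in doc) for w in vocab}
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenized_docs:
        raw = [doc.count(w) * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Hypothetical segmented reviews
docs = [["waterfall", "beautiful"], ["waterfall", "queue", "queue"]]
vocab, rows = tfidf_rows(docs)
for word, weight in zip(vocab, rows[0]):
    print(word, round(weight, 4))
```

In the first review, "beautiful" gets a larger weight than "waterfall" because "waterfall" occurs in every document and therefore carries less information.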
1. Text classification
(1) Dataset
The data in this article are recent tourist reviews of Huangguoshu Waterfall in Guizhou Province, collected from dianping.com: 240 reviews in total, of which 114 are negative and 126 are positive, as shown in the figure below:
(2) Random forest text classification
This article does not describe the implementation in detail. Many earlier articles have covered it, and the source code carries detailed comments for reference.
    # -*- coding:utf-8 -*-
    import csv
    import numpy as np
    import jieba
    import jieba.analyse
    from sklearn import feature_extraction
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report
    from sklearn.ensemble import RandomForestClassifier

    #---------------------------------- Step 1: read the file --------------------------------
    file = "data.csv"
    with open(file, "r", encoding="UTF-8") as f:
        # Use csv.DictReader to read the file
        reader = csv.DictReader(f)
        labels = []
        contents = []
        for row in reader:
            # Label: positive review ('好评') -> 0, anything else -> 1
            if row['label'] == '好评':
                res = 0
            else:
                res = 1
            labels.append(res)
            content = row['content']
            seglist = jieba.cut(content, cut_all=False)  # accurate mode
            output = ' '.join(list(seglist))             # join with spaces
            #print(output)
            contents.append(output)

    print(labels[:5])
    print(contents[:5])

    #---------------------------------- Step 2: data preprocessing --------------------------------
    # Convert the texts to a word-frequency matrix; a[i][j] is the count of word j in text i
    vectorizer = CountVectorizer()

    # Computes the tf-idf weight of every word
    transformer = TfidfTransformer()

    # The outer fit_transform computes tf-idf; the inner one turns the text into a word-frequency matrix
    tfidf = transformer.fit_transform(vectorizer.fit_transform(contents))
    for n in tfidf[:5]:
        print(n)
    #tfidf = tfidf.astype(np.float32)
    print(type(tfidf))

    # All words in the bag-of-words model
    # (scikit-learn 1.2 removed this method; use get_feature_names_out() there)
    word = vectorizer.get_feature_names()
    for n in word[:5]:
        print(n)
    print("Number of words:", len(word))

    # Extract the tf-idf matrix; w[i][j] is the tf-idf weight of word j in text i
    X = tfidf.toarray()
    print(X.shape)

    # Split X and y with train_test_split
    # X_train rows correspond one-to-one to y_train -->> used to train the model
    # X_test rows correspond one-to-one to y_test   -->> used to test the accuracy of the model
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=1)

    #---------------------------------- Step 3: machine learning classification --------------------------------
    # Random forest classifier
    # n_estimators: number of trees in the forest
    clf = RandomForestClassifier(n_estimators=20)

    # Train the model
    clf.fit(X_train, y_train)

    # Accuracy of the model on the test set
    print('Model accuracy: {}'.format(clf.score(X_test, y_test)))
    print("\n")

    # Predictions
    pre = clf.predict(X_test)
    print('Predictions:', pre[:10])
    print(len(pre), len(y_test))
    print(classification_report(y_test, pre))
The output is shown in the figure below. The random forest's average precision is 0.86, its recall is 0.86, and its F-measure is 0.86.
2. Algorithm evaluation
The author then implements custom calculations of precision (Precision), recall (Recall), and F-measure (F-measure). The formulas are as follows:
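Written out for one class, the three quantities read as below. The subscripted symbol names are ours; the definitions follow the comments in the author's `classification_pj` function (correctly identified, identified, and present in the test set).

```latex
P = \frac{N_{\text{correct}}}{N_{\text{predicted}}}, \qquad
R = \frac{N_{\text{correct}}}{N_{\text{actual}}}, \qquad
F = \frac{2 \, P \, R}{P + R}
```

Here $N_{\text{correct}}$ is the number of samples of the class that were predicted correctly, $N_{\text{predicted}}$ the number of samples predicted as that class, and $N_{\text{actual}}$ the number of samples of that class present in the test set.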
Because this article deals with a binary classification problem, the evaluation is split into the two classes 0 and 1. The complete code is as follows:
    # -*- coding:utf-8 -*-
    import csv
    import numpy as np
    import jieba
    import jieba.analyse
    from sklearn import feature_extraction
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report
    from sklearn.ensemble import RandomForestClassifier

    #---------------------------------- Step 1: read the file --------------------------------
    file = "data.csv"
    with open(file, "r", encoding="UTF-8") as f:
        # Use csv.DictReader to read the file
        reader = csv.DictReader(f)
        labels = []
        contents = []
        for row in reader:
            # Label: positive review ('好评') -> 0, anything else -> 1
            if row['label'] == '好评':
                res = 0
            else:
                res = 1
            labels.append(res)
            content = row['content']
            seglist = jieba.cut(content, cut_all=False)  # accurate mode
            output = ' '.join(list(seglist))             # join with spaces
            #print(output)
            contents.append(output)

    print(labels[:5])
    print(contents[:5])

    #---------------------------------- Step 2: data preprocessing --------------------------------
    # Convert the texts to a word-frequency matrix; a[i][j] is the count of word j in text i
    vectorizer = CountVectorizer()

    # Computes the tf-idf weight of every word
    transformer = TfidfTransformer()

    # The outer fit_transform computes tf-idf; the inner one turns the text into a word-frequency matrix
    tfidf = transformer.fit_transform(vectorizer.fit_transform(contents))
    for n in tfidf[:5]:
        print(n)
    #tfidf = tfidf.astype(np.float32)
    print(type(tfidf))

    # All words in the bag-of-words model
    # (scikit-learn 1.2 removed this method; use get_feature_names_out() there)
    word = vectorizer.get_feature_names()
    for n in word[:5]:
        print(n)
    print("Number of words:", len(word))

    # Extract the tf-idf matrix; w[i][j] is the tf-idf weight of word j in text i
    X = tfidf.toarray()
    print(X.shape)

    # Split X and y with train_test_split
    # X_train rows correspond one-to-one to y_train -->> used to train the model
    # X_test rows correspond one-to-one to y_test   -->> used to test the accuracy of the model
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=1)

    #---------------------------------- Step 3: machine learning classification --------------------------------
    # Random forest classifier
    # n_estimators: number of trees in the forest
    clf = RandomForestClassifier(n_estimators=20)

    # Train the model
    clf.fit(X_train, y_train)

    # Accuracy of the model on the test set
    print('Model accuracy: {}'.format(clf.score(X_test, y_test)))
    print("\n")

    # Predictions
    pre = clf.predict(X_test)
    print('Predictions:', pre[:10])
    print(len(pre), len(y_test))
    print(classification_report(y_test, pre))

    #---------------------------------- Step 4: evaluate the results --------------------------------
    def classification_pj(name, y_test, pre):
        print("Algorithm evaluation:", name)

        # Precision = individuals correctly identified / individuals identified
        # Recall    = individuals correctly identified / individuals present in the test set
        # F-measure = precision * recall * 2 / (precision + recall)

        YC_B, YC_G = 0, 0  # predicted: bad, good
        ZQ_B, ZQ_G = 0, 0  # correct: bad, good
        CZ_B, CZ_G = 0, 0  # present in the test set: bad, good

        # 0-good 1-bad; both classes are counted together to guard against the class labels changing
        i = 0
        while i < len(pre):
            z = int(y_test[i])  # actual
            y = int(pre[i])     # predicted

            if z == 0:
                CZ_G += 1
            else:
                CZ_B += 1

            if y == 0:
                YC_G += 1
            else:
                YC_B += 1

            if z == y and z == 0 and y == 0:
                ZQ_G += 1
            elif z == y and z == 1 and y == 1:
                ZQ_B += 1
            i = i + 1

        print(ZQ_B, ZQ_G, YC_B, YC_G, CZ_B, CZ_G)
        print("")

        # Output the results
        P_G = ZQ_G * 1.0 / YC_G
        P_B = ZQ_B * 1.0 / YC_B
        print("Precision Good 0:", P_G)
        print("Precision Bad 1:", P_B)

        R_G = ZQ_G * 1.0 / CZ_G
        R_B = ZQ_B * 1.0 / CZ_B
        print("Recall Good 0:", R_G)
        print("Recall Bad 1:", R_B)

        F_G = 2 * P_G * R_G / (P_G + R_G)
        F_B = 2 * P_B * R_B / (P_B + R_B)
        print("F-measure Good 0:", F_G)
        print("F-measure Bad 1:", F_B)

    # Function call
    classification_pj("RandomForest", y_test, pre)
The output is shown in the figure below. For positive reviews, precision, recall, and F-measure are 0.9268, 0.9268, and 0.9268; for negative reviews they are 0.9032, 0.9032, and 0.9032.
3. Algorithm comparison
Finally, the author gives the text classification results of the machine learning algorithms RF, DTC, SVM, KNN, NB, and LR. Such a comparison is also very common when writing papers.
    # -*- coding:utf-8 -*-
    import csv
    import numpy as np
    import jieba
    import jieba.analyse
    from sklearn import feature_extraction
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn import svm
    from sklearn import neighbors
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.linear_model import LogisticRegression

    #---------------------------------- Step 1: read the file --------------------------------
    file = "data.csv"
    with open(file, "r", encoding="UTF-8") as f:
        # Use csv.DictReader to read the file
        reader = csv.DictReader(f)
        labels = []
        contents = []
        for row in reader:
            # Label: positive review ('好评') -> 0, anything else -> 1
            if row['label'] == '好评':
                res = 0
            else:
                res = 1
            labels.append(res)
            content = row['content']
            seglist = jieba.cut(content, cut_all=False)  # accurate mode
            output = ' '.join(list(seglist))             # join with spaces
            #print(output)
            contents.append(output)

    print(labels[:5])
    print(contents[:5])

    #---------------------------------- Step 2: data preprocessing --------------------------------
    # Convert the texts to a word-frequency matrix; a[i][j] is the count of word j in text i
    vectorizer = CountVectorizer()

    # Computes the tf-idf weight of every word
    transformer = TfidfTransformer()

    # The outer fit_transform computes tf-idf; the inner one turns the text into a word-frequency matrix
    tfidf = transformer.fit_transform(vectorizer.fit_transform(contents))
    for n in tfidf[:5]:
        print(n)
    #tfidf = tfidf.astype(np.float32)
    print(type(tfidf))

    # All words in the bag-of-words model
    # (scikit-learn 1.2 removed this method; use get_feature_names_out() there)
    word = vectorizer.get_feature_names()
    for n in word[:5]:
        print(n)
    print("Number of words:", len(word))

    # Extract the tf-idf matrix; w[i][j] is the tf-idf weight of word j in text i
    X = tfidf.toarray()
    print(X.shape)

    # Split X and y with train_test_split
    # X_train rows correspond one-to-one to y_train -->> used to train the model
    # X_test rows correspond one-to-one to y_test   -->> used to test the accuracy of the model
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=1)

    #---------------------------------- Step 4: evaluate the results --------------------------------
    def classification_pj(name, y_test, pre):
        print("Algorithm evaluation:", name)

        # Precision = individuals correctly identified / individuals identified
        # Recall    = individuals correctly identified / individuals present in the test set
        # F-measure = precision * recall * 2 / (precision + recall)

        YC_B, YC_G = 0, 0  # predicted: bad, good
        ZQ_B, ZQ_G = 0, 0  # correct: bad, good
        CZ_B, CZ_G = 0, 0  # present in the test set: bad, good

        # 0-good 1-bad; both classes are counted together to guard against the class labels changing
        i = 0
        while i < len(pre):
            z = int(y_test[i])  # actual
            y = int(pre[i])     # predicted

            if z == 0:
                CZ_G += 1
            else:
                CZ_B += 1

            if y == 0:
                YC_G += 1
            else:
                YC_B += 1

            if z == y and z == 0 and y == 0:
                ZQ_G += 1
            elif z == y and z == 1 and y == 1:
                ZQ_B += 1
            i = i + 1

        print(ZQ_B, ZQ_G, YC_B, YC_G, CZ_B, CZ_G)

        # Output the results
        P_G = ZQ_G * 1.0 / YC_G
        P_B = ZQ_B * 1.0 / YC_B
        print("Precision Good 0:{:.4f}".format(P_G))
        print("Precision Bad 1:{:.4f}".format(P_B))
        print("Avg_precision:{:.4f}".format((P_G+P_B)/2))

        R_G = ZQ_G * 1.0 / CZ_G
        R_B = ZQ_B * 1.0 / CZ_B
        print("Recall Good 0:{:.4f}".format(R_G))
        print("Recall Bad 1:{:.4f}".format(R_B))
        print("Avg_recall:{:.4f}".format((R_G+R_B)/2))

        F_G = 2 * P_G * R_G / (P_G + R_G)
        F_B = 2 * P_B * R_B / (P_B + R_B)
        print("F-measure Good 0:{:.4f}".format(F_G))
        print("F-measure Bad 1:{:.4f}".format(F_B))
        print("Avg_fmeasure:{:.4f}".format((F_G+F_B)/2))

    #---------------------------------- Step 3: machine learning classification --------------------------------
    # Random forest classifier
    rf = RandomForestClassifier(n_estimators=20)
    rf.fit(X_train, y_train)
    pre = rf.predict(X_test)
    print("Random forest classification")
    print(classification_report(y_test, pre))
    classification_pj("RandomForest", y_test, pre)
    print("\n")

    # Decision tree classifier
    dtc = DecisionTreeClassifier()
    dtc.fit(X_train, y_train)
    pre = dtc.predict(X_test)
    print("Decision tree classification")
    print(classification_report(y_test, pre))
    classification_pj("DecisionTree", y_test, pre)
    print("\n")

    # SVM classifier
    SVM = svm.LinearSVC()  # support vector machine classifier LinearSVC
    SVM.fit(X_train, y_train)
    pre = SVM.predict(X_test)
    print("Support vector machine classification")
    print(classification_report(y_test, pre))
    classification_pj("LinearSVC", y_test, pre)
    print("\n")

    # KNN classifier
    knn = neighbors.KNeighborsClassifier()  # n_neighbors=11
    knn.fit(X_train, y_train)
    pre = knn.predict(X_test)
    print("Nearest neighbor classification")
    print(classification_report(y_test, pre))
    classification_pj("KNeighbors", y_test, pre)
    print("\n")

    # Naive Bayes classifier
    nb = MultinomialNB()
    nb.fit(X_train, y_train)
    pre = nb.predict(X_test)
    print("Naive Bayes classification")
    print(classification_report(y_test, pre))
    classification_pj("MultinomialNB", y_test, pre)
    print("\n")

    # Logistic regression classifier
    LR = LogisticRegression(solver='liblinear')
    LR.fit(X_train, y_train)
    pre = LR.predict(X_test)
    print("Logistic regression classification")
    print(classification_report(y_test, pre))
    classification_pj("LogisticRegression", y_test, pre)
    print("\n")
The output is as follows. Naive Bayes turns out to perform very well on text classification, and random forest, logistic regression, and SVM are not bad either.
The complete results are as follows :
    Random forest classification
                  precision    recall  f1-score   support

               0       0.92      0.88      0.90        41
               1       0.85      0.90      0.88        31

        accuracy                           0.89        72
       macro avg       0.89      0.89      0.89        72
    weighted avg       0.89      0.89      0.89        72

    Algorithm evaluation: RandomForest
    28 36 33 39 31 41
    Precision Good 0:0.9231
    Precision Bad 1:0.8485
    Avg_precision:0.8858
    Recall Good 0:0.8780
    Recall Bad 1:0.9032
    Avg_recall:0.8906
    F-measure Good 0:0.9000
    F-measure Bad 1:0.8750
    Avg_fmeasure:0.8875

    Decision tree classification
                  precision    recall  f1-score   support

               0       0.81      0.73      0.77        41
               1       0.69      0.77      0.73        31

        accuracy                           0.75        72
       macro avg       0.75      0.75      0.75        72
    weighted avg       0.76      0.75      0.75        72

    Algorithm evaluation: DecisionTree
    24 30 35 37 31 41
    Precision Good 0:0.8108
    Precision Bad 1:0.6857
    Avg_precision:0.7483
    Recall Good 0:0.7317
    Recall Bad 1:0.7742
    Avg_recall:0.7530
    F-measure Good 0:0.7692
    F-measure Bad 1:0.7273
    Avg_fmeasure:0.7483

    Support vector machine classification
    Nearest neighbor classification
    Naive Bayes classification
    Logistic regression classification
    ......
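The RandomForest figures above can be re-derived from the six counts printed on its second line (ZQ_B ZQ_G YC_B YC_G CZ_B CZ_G = 28 36 33 39 31 41). The short check below is a sketch; the helper name `prf` is ours and not part of the author's code.

```python
def prf(correct, predicted, actual):
    """Precision, recall and F-measure for one class from three counts."""
    precision = correct / predicted
    recall = correct / actual
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Counts printed for RandomForest: ZQ_B ZQ_G YC_B YC_G CZ_B CZ_G
zq_b, zq_g, yc_b, yc_g, cz_b, cz_g = 28, 36, 33, 39, 31, 41

good = prf(zq_g, yc_g, cz_g)   # class 0, positive reviews
bad = prf(zq_b, yc_b, cz_b)    # class 1, negative reviews
averages = [(g + b) / 2 for g, b in zip(good, bad)]

print("Good 0: {:.4f} {:.4f} {:.4f}".format(*good))      # 0.9231 0.8780 0.9000
print("Bad  1: {:.4f} {:.4f} {:.4f}".format(*bad))       # 0.8485 0.9032 0.8750
print("Avg   : {:.4f} {:.4f} {:.4f}".format(*averages))  # 0.8858 0.8906 0.8875
```

The values agree with both the custom output and, after rounding to two decimals, the classification_report table.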
Three. Text classification based on CNN
Next we implement text classification with a CNN. The method applies to many fields: as long as you have a dataset, you can run the analysis. Only the most basic workable method and source code are given here; I hope they are of some help to you.
1. Data preprocessing
The previous part on machine learning text classification already covered preprocessing operations such as Chinese word segmentation, so why introduce them again here? Because two new operations are added:
- Stop-word removal
- Part-of-speech tagging
Both operations are very important in text mining. They can improve the classification result and filter out irrelevant feature words, and part-of-speech tagging also supports other analyses such as sentiment analysis and public opinion mining.
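Both filters boil down to dropping tokens from a list. The sketch below shows the idea on hand-written data: the stop-word set, the tokens, and the (word, flag) pairs are hypothetical stand-ins for what a stop-word file and a part-of-speech tagger would supply, and the helper names are ours.

```python
def remove_stop_words(tokens, stop_words):
    """Keep tokens that are neither stop words nor pure whitespace."""
    return [t for t in tokens if t not in stop_words and t.strip()]

def filter_by_pos(tagged_tokens, dropped_flags):
    """Keep words whose part-of-speech flag is not in dropped_flags."""
    return [word for word, flag in tagged_tokens if flag not in dropped_flags]

# Hypothetical stop words and segmented review
stop_words = {"also", ",", "often"}
tokens = ["waterfall", ",", "also", "spectacular", " "]
kept = remove_stop_words(tokens, stop_words)

# Hypothetical tagger output: ns = place name, n = noun, m = numeral, a = adjective
tagged = [("Guizhou", "ns"), ("waterfall", "n"), ("three", "m"), ("spectacular", "a")]
content_words = filter_by_pos(tagged, {"ns", "m"})

print(kept)           # ['waterfall', 'spectacular']
print(content_words)  # ['waterfall', 'spectacular']
```

Using a set for the stop words keeps each membership test cheap even when the stop-word list is long.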
The code of this part is as follows :
    # -*- coding:utf-8 -*-
    import csv
    import numpy as np
    import jieba
    import jieba.analyse
    import jieba.posseg as pseg
    from sklearn import feature_extraction
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    #---------------------------------- Step 1: data preprocessing --------------------------------
    file = "data.csv"

    # Load the stop-word list
    def stopwordslist():
        stopwords = [line.strip() for line in open('stop_words.txt', encoding="UTF-8").readlines()]
        return stopwords

    # Remove stop words
    def deleteStop(sentence):
        stopwords = stopwordslist()
        outstr = ""
        for i in sentence:
            # print(i)
            if i not in stopwords and i != "\n":
                outstr += i
        return outstr

    # Chinese word segmentation
    Mat = []
    with open(file, "r", encoding="UTF-8") as f:
        # Use csv.DictReader to read the file
        reader = csv.DictReader(f)
        labels = []
        contents = []
        for row in reader:
            # Label: positive review ('好评') -> 0, anything else -> 1
            if row['label'] == '好评':
                res = 0
            else:
                res = 1
            labels.append(res)

            # Chinese word segmentation
            content = row['content']
            #print(content)
            seglist = jieba.cut(content, cut_all=False)  # accurate mode
            #print(seglist)

            # Remove stop words
            stc = deleteStop(seglist)  # note: the returned sentence contains no spaces

            # Segment again and join with spaces
            seg_list = jieba.cut(stc, cut_all=False)
            output = ' '.join(list(seg_list))
            #print(output)
            contents.append(output)

            # Part-of-speech tagging
            res = pseg.cut(stc)
            seten = []
            for word, flag in res:
                if flag not in ['nr','ns','nt','mz','m','f','ul','l','r','t']:
                    seten.append(word)
            Mat.append(seten)

    print(labels[:5])
    print(contents[:5])
    print(Mat[:5])

    # Write to file
    fileDic = open('wordCut.txt', 'w', encoding="UTF-8")
    for i in Mat:
        fileDic.write(" ".join(i))
        fileDic.write('\n')
    fileDic.close()

    # Read back from the same file name that was just written
    words = [line.strip().split(" ") for line in open('wordCut.txt', encoding='UTF-8').readlines()]
    print(words[:5])
The results are shown in the figure below. The original text has been segmented, and stop words such as "also", ",", and "often" have been filtered out. The result is kept in two forms, so readers can do follow-up analysis according to their own needs. The segmented text is also written to the file wordCut.txt.
- contents: a list of segmented sentences, with the words of each sentence joined by spaces
- Mat: a list of segmented word sequences, one list of words per text
2. Feature extraction and Word2Vec word vector conversion
(1) Feature word numbering
First, we call Tokenizer and its fit_on_texts method to number every word in the text; the more frequent a word, the smaller its number. As shown in the figure below, "The waterfall", "The scenic spot", "line up", and "Waterfall cave" are among the frequent feature words. Note that the blank space, "Comment on", and "Retract" can be filtered out as well by adding them to the stop-word list.
    from keras.preprocessing.text import Tokenizer  # Keras 2.x import path

    # fit_on_texts numbers each word of the input text by word frequency
    # (the more frequent the word, the smaller its number)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(Mat)
    vocab = tokenizer.word_index  # stop words are already filtered; maps each word to its number
    print(vocab)
The output result is shown in the figure below :
(2) Word2Vec word vector training
With the feature words numbered, the header of the feature matrix is defined. Next, each text is converted into a one-dimensional sequence of word numbers, and these sequences form the feature matrix used for training and classification. Note that pad_sequences brings all sequences to the same length so that the CNN can train on them. With the length set to 100, a sentence longer than 100 words is cut to 100 and a shorter one is filled with 0. With the default settings (padding='pre', truncating='pre'), both the zeros and the cut are applied at the front of the sequence; the figure below shows the zero-padding. The labels are one-hot encoded as well: [1,0] stands for class label 0 (positive review) and [0,1] for class label 1 (negative review), which matches the printed output further down.
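The padding and label encoding can be imitated in plain Python to see exactly what comes out. This is a sketch of the behaviour, assuming the Keras defaults padding='pre' and truncating='pre' for pad_sequences; the helper names `pad_pre` and `one_hot` are ours.

```python
def pad_pre(seq, maxlen, value=0):
    """Pad or cut at the front, mirroring padding='pre' and truncating='pre'."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]                      # keep the last maxlen items
    return [value] * (maxlen - len(seq)) + seq    # zeros go in front

def one_hot(label, num_classes):
    """Row with a single 1.0 at the position of the class label."""
    row = [0.0] * num_classes
    row[label] = 1.0
    return row

print(pad_pre([7, 8, 9], 5))            # [0, 0, 7, 8, 9]
print(pad_pre([1, 2, 3, 4, 5, 6], 5))   # [2, 3, 4, 5, 6]
print(one_hot(0, 2), one_hot(1, 2))     # [1.0, 0.0] [0.0, 1.0]
```

Label 0 lands in the first column and label 1 in the second, so a row printed as [0. 1.] belongs to a sample with class label 1.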
The complete code at this time is as follows :
    from gensim.models import word2vec
    from keras.preprocessing.sequence import pad_sequences
    from keras.utils import to_categorical

    # Split X and y with train_test_split
    X_train, X_test, y_train, y_test = train_test_split(Mat, labels, test_size=0.3, random_state=1)
    print(X_train[:5])
    print(y_train[:5])

    #---------------------------------- Step 3: word vector construction --------------------------------
    # Word2Vec training
    maxLen = 100         # maximum length of a word sequence
    num_features = 100   # word vector dimension
    min_word_count = 3   # minimum frequency for a word to be considered
    num_workers = 4      # number of CPU cores used for parallel training
    context = 4          # context window size

    # Build the model (gensim 3.x API; gensim 4.x renamed size to vector_size)
    model = word2vec.Word2Vec(Mat, workers=num_workers, size=num_features,
                              min_count=min_word_count, window=context)
    # Force unit normalization
    model.init_sims(replace=True)
    # Save the trained model; if a path such as ./data/model is used, the directory must exist in advance
    model.save("CNNw2vModel")
    model.wv.save_word2vec_format("CNNVector", binary=False)
    print(model)
    # Load the model; if word2vec has already been trained, start directly from this line
    w2v_model = word2vec.Word2Vec.load("CNNw2vModel")

    # Convert words to feature numbers (gaps are filled with 0 below)
    trainID = tokenizer.texts_to_sequences(X_train)
    print(trainID)
    testID = tokenizer.texts_to_sequences(X_test)
    print(testID)
    # Bring every sequence to the same length for CNN training
    trainSeq = pad_sequences(trainID, maxlen=maxLen)
    print(trainSeq)
    # The test set needs the same padding; testSeq is used in the prediction step
    testSeq = pad_sequences(testID, maxlen=maxLen)

    # Convert the labels to one-hot codes
    trainCate = to_categorical(y_train, num_classes=2)  # binary problem
    print(trainCate)
    testCate = to_categorical(y_test, num_classes=2)    # binary problem
    print(testCate)
The output is as follows :
    [['scenery', ' ', 'The scenic spot', 'too', 'mature', 'from', 'Big', 'The waterfall', 'The scenic spot', 'set out', 'The scenic spot', 'Sightseeing car', 'fully', 'tourists', 'Half an hour', 'the World Expo', 'Road', 'jostle one another on the way', 'appreciate', 'Beautiful scenery', 'Mood', 'Sightseeing car', 'Get on the train', "It's about", 'mark', 'Destination', 'entrance', 'guide', 'go', 'Wrong way', 'Muddleheaded', 'Get on the train', 'ask', 'The driver', 'arrive', 'The driver', 'Vaguely', 'say', 'Drive out', 'The scenic spot', 'Passenger station', 'Seven holes', 'The scenic spot', 'Development', 'perfect', 'Retract', 'Comment on'],
     ['off season', 'The waterfall', 'people', 'Less', 'Jingmei', 'Plane ticket', 'cheap', 'Worth', 'Go to'],
     ['The waterfall', 'Experience', 'Bad', 'Five stars', 'Praise', 'whole', 'yes', 'brush', 'road', 'Very narrow', 'Lead to', 'A large area', 'jam', 'line up', 'collapse', 'The scenic spot', 'guide', 'Clear', 'line up', 'The heavy rain', 'Shelter from rain', 'Design', 'To make', 'adults', 'The child', 'the elderly', 'In the rain', 'The scenic spot', 'Reception', 'Poor ability', 'The waterfall', 'Really?', 'enjoy undeserved fame', 'Seven holes', 'Retract', 'Comment on'],
     ['dad', 'branch', 'The waterfall', 'The waterfall', 'The waterfall', 'visit', 'The waterfall', 'tickets', 'Anyway', 'exceed', ' ', 'Came to', 'be familiar with', 'inform', 'Can only', 'Out', 'Get into', 'mouth', 'go back to', 'High speed', 'exit', 'Go straight', 'Go back', 'pour', 'instructions', 'Clear', 'Isolation', 'Railing', 'Self driving', 'Guide in', 'The parking lot', 'The parking lot', 'charge', 'And', 'Time', ' ', 'The parking lot', 'After the check', 'The scenic spot', 'tickets', 'Single person', 'contain', 'traffic', 'fare', 'A traffic car', 'Need to be', 'Pay separately', 'From the outside', 'around', 'road', 'flowers', 'Less than', 'minute', 'fare', 'sincerely', 'Accept', ' ', 'whole family', "Don't want to", '┐', '(', '─', '__', '─', ')', '┌', 'Benefits', 'Collude with', 'severe', 'Increase the fee', 'Gejin', 'The waterfall', 'good-looking', 'Bad', 'review', ' ', 'picture', 'not', 'Development', 'The waterfall', 'Tian hang', 'The waterfall', 'spectacular', 'spectacular', 'Yes', 'Lingxiu', 'The scenic spot', 'inflation', 'become', 'Retract', 'Comment on'],
     ['Whole family', 'ticket', 'Resident', 'Exclusive', 'Discount', 'ticket']]
    [1, 0, 1, 1, 1]
    Word2Vec(vocab=718, size=100, alpha=0.025)
    [[   0    0    0 ... 2481    5    4]
     [   0    0    0 ...  570   52   90]
     [   0    0    0 ...  187    5    4]
     ...
     [   0    0    0 ...   93    5    4]
     [   0    0    0 ...   30    5    4]
     [   0    0    0 ...   81   18   78]]
    [[0. 1.]
     [1. 0.]
     [0. 1.]
     [0. 1.]
     [0. 1.]
     [0. 1.]
     [1. 0.]
3. CNN construction
Next we train on the feature matrix built above. The idea is to compute the similarity between different texts, that is, between their one-dimensional vectors, so that the positive and negative reviews fall into two classes according to that similarity. Word2Vec is used here as well; the core code is:
model = word2vec.Word2Vec( Mat, workers=num_workers, size=num_features, min_count=min_word_count, window=context);
Printing the trained model gives "Word2Vec(vocab=718, size=100, alpha=0.025)". The minimum frequency is set to 3 here, so words occurring fewer than 3 times are filtered out, which leaves 718 feature words. num_features is 100, meaning 100-dimensional word vectors. sg defaults to the continuous bag-of-words (CBOW) model and can be set to 1 for the skip-gram model. The default optimization method is negative sampling; readers can look up the remaining parameters themselves.
Refer to the author's previous article: [Python Artificial Intelligence] 9. gensim word vectors: Word2Vec installation and Chinese short-text similarity computation on 《庆余年》
Suppose we have a training set and a test set, and a feature word has no trained vector. How do we handle it? When we fetch the word vector of each feature word and copy it into the training matrix, the lookup is wrapped in try-except. If the feature word is not found it is skipped, and its row simply stays filled with 0.
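The skip-and-leave-zeros behaviour is easy to see in isolation. The sketch below uses a plain dict in place of a trained Word2Vec model and nested lists in place of a NumPy array; the function name `build_embedding_matrix` and the toy vectors are hypothetical.

```python
def build_embedding_matrix(vocab, vectors, dim):
    """Row i holds the vector of the word numbered i; unknown words keep a zero row."""
    # Word numbers start at 1, so one extra row is needed; row 0 belongs to the padding value
    matrix = [[0.0] * dim for _ in range(len(vocab) + 1)]
    for word, idx in vocab.items():
        try:
            matrix[idx] = list(vectors[word])
        except KeyError:
            continue  # no trained vector for this word: its row stays all zeros
    return matrix

# Hypothetical word numbers and 2-dimensional vectors; "rare" was never trained
vocab = {"waterfall": 1, "queue": 2, "rare": 3}
vectors = {"waterfall": [0.1, 0.2], "queue": [0.3, 0.4]}
matrix = build_embedding_matrix(vocab, vectors, dim=2)
print(matrix)  # [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]
```

A word below the Word2Vec minimum frequency ends up like "rare" here: it keeps its number from the tokenizer but contributes an all-zero embedding.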
This part of the code is as follows :
#---------------------------------- Step 4: CNN construction --------------------------------
# Use the trained Word2Vec vectors to build a custom Embedding weight matrix.
# Each row holds the vector of one word (think of one-hot encoding combined with matrix multiplication).
embedding_matrix = np.zeros((len(vocab)+1, 100))  # Word IDs start at 1, so add 1 row to match the feature word IDs
for word, i in vocab.items():
    try:
        # Look up the word vector and copy it into the matrix
        embedding_vector = w2v_model.wv[str(word)]
        embedding_matrix[i] = embedding_vector
    except KeyError:
        # Word not found: skip it, its row stays all zeros
        continue

# Train the model
main_input = Input(shape=(maxLen,), dtype='float64')  # Not used by the Sequential model below
# Word embedding: use the pre-trained Word2Vec vectors as a custom weight matrix; 100 is the output word vector dimension
embedder = Embedding(len(vocab)+1, 100, input_length=maxLen,
                     weights=[embedding_matrix], trainable=False)  # No further training

# Build the model
model = Sequential()
model.add(embedder)                                          # Embedding layer
model.add(Conv1D(256, 3, padding='same', activation='relu')) # Convolution layer: 256 filters, kernel size 3
model.add(MaxPool1D(maxLen-5, 3, padding='same'))            # Pooling layer: pool size maxLen-5, stride 3
model.add(Conv1D(32, 3, padding='same', activation='relu'))  # Convolution layer: 32 filters, kernel size 3
model.add(Flatten())                                         # Flatten
model.add(Dropout(0.3))                                      # Prevent overfitting: drop 30% of units during training
model.add(Dense(256, activation='relu'))                     # Fully connected layer
model.add(Dropout(0.2))                                      # Prevent overfitting
model.add(Dense(units=2, activation='softmax'))              # Output layer

# Print the model structure
model.summary()

# Compile the neural network
model.compile(optimizer = 'adam',                    # Optimizer
              loss = 'categorical_crossentropy',     # Loss
              metrics = ['accuracy']                 # Report accuracy
              )

# Train (training data, training labels, batch size 256, epochs, 20% held out for validation)
history = model.fit(trainSeq, trainCate, batch_size=256, epochs=6, validation_split=0.2)
model.save("TextCNN")

#---------------------------------- Step 5: Prediction --------------------------------
# Predict and evaluate
mainModel = load_model("TextCNN")
result = mainModel.predict(testSeq)  # Test samples
#print(result)
print(np.argmax(result,axis=1))
score = mainModel.evaluate(testSeq, testCate, batch_size=32)
print(score)
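Because every feature word without a Word2Vec vector keeps an all-zero row in the embedding matrix, it can help to measure how many words are affected. The following is a minimal sketch under stated assumptions, not part of the original code: `embedding_coverage`, `toy_vocab` and `toy_known` are hypothetical names, and the toy data stands in for `tokenizer.word_index` and the trained Word2Vec vocabulary.

```python
def embedding_coverage(vocab, known_words):
    """Return (coverage ratio, sorted list of words that have no vector).

    vocab: dict mapping word -> integer ID (shaped like tokenizer.word_index)
    known_words: set of words for which a trained vector exists
    """
    missing = sorted(w for w in vocab if w not in known_words)
    if not vocab:
        return 0.0, missing
    covered = len(vocab) - len(missing)
    return covered / len(vocab), missing


# Toy data (hypothetical): four indexed words, three of them with vectors.
toy_vocab = {"好": 1, "不错": 2, "失望": 3, "一般般": 4}
toy_known = {"好", "不错", "失望"}

ratio, missing = embedding_coverage(toy_vocab, toy_known)
print(f"coverage: {ratio:.2f}, all-zero rows for: {missing}")
```

With a real model, the set of known words could be taken from the Word2Vec vocabulary (for example `w2v_model.wv.vocab` in gensim 3.x or `w2v_model.wv.key_to_index` in gensim 4.x). A low coverage ratio would mean that many inputs reach the convolution layers as zero vectors.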
The model is as follows :
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_2 (Embedding)      (None, 100, 100)          290400
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 100, 256)          77056
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 34, 256)           0
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 34, 32)            24608
_________________________________________________________________
flatten_1 (Flatten)          (None, 1088)              0
_________________________________________________________________
dropout_1 (Dropout)          (None, 1088)              0
_________________________________________________________________
dense_1 (Dense)              (None, 256)               278784
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 514
=================================================================
Total params: 671,362
Trainable params: 380,962
Non-trainable params: 290,400
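The numbers in this summary can be checked by hand. The sketch below recomputes them with the standard parameter formulas for embedding, 1D convolution and dense layers. The helper names are hypothetical, and the 2904 embedding rows are an inference from the summary (290,400 parameters divided by 100 dimensions, which would correspond to `len(vocab)+1`), not a value stated in the article.

```python
def embedding_params(rows, dim):
    # One vector of length `dim` per row, no bias
    return rows * dim

def conv1d_params(in_channels, filters, kernel_size):
    # kernel_size * in_channels weights per filter, plus one bias per filter
    return kernel_size * in_channels * filters + filters

def dense_params(in_units, out_units):
    # Full weight matrix plus one bias per output unit
    return in_units * out_units + out_units

def pooled_length(length, stride):
    # With padding='same', pooling yields ceil(length / stride) steps
    return -(-length // stride)

seq_len = pooled_length(100, 3)   # sequence length after max pooling
flat = seq_len * 32               # size of the flattened conv1d_2 output

frozen = embedding_params(2904, 100)          # Embedding layer, trainable=False
trainable = (conv1d_params(100, 256, 3)       # conv1d_1
             + conv1d_params(256, 32, 3)      # conv1d_2
             + dense_params(flat, 256)        # dense_1
             + dense_params(256, 2))          # dense_2

print(seq_len, flat)
print(trainable, frozen, trainable + frozen)
```

The frozen embedding accounts for all non-trainable parameters, and `dense_1` alone holds most of the trainable ones.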
The output is shown in the figure below. The model's predictions are not very good: the accuracy is only 0.625. Why? The author is still studying how to optimize the deep model. The main purpose of this article is to provide a usable method, so apologies for the poor result.
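One way to put an accuracy figure in context is to compare it with the accuracy of always predicting the most frequent class. The following is a minimal sketch, not something the article does: `majority_baseline` is a hypothetical helper, and `toy_labels` is made-up data that says nothing about the real label distribution in `data.csv`.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most frequent label, accuracy of always predicting it)."""
    if not labels:
        raise ValueError("labels must not be empty")
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Toy labels (hypothetical): 6 samples of class 0 and 4 samples of class 1.
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
label, baseline_acc = majority_baseline(toy_labels)
print(f"always predicting {label} gives accuracy {baseline_acc:.3f}")
```

Applied to `y_test`, a baseline close to the model's accuracy would suggest that the network may be predicting a single class for almost every sample, which `np.argmax(result, axis=1)` can confirm.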
4. Test Visualization
Finally, add the visualization code to draw the figure shown below. Again, the algorithm does not perform well: the loss does not decrease steadily and the accuracy does not rise steadily. If readers find the cause or a way to optimize it, please let us know. Thank you.
Finally, attach the complete code :
# -*- coding:utf-8 -*-
import csv
import numpy as np
import jieba
import jieba.analyse
import jieba.posseg as pseg
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from keras import models
from keras import layers
from keras import Input
from gensim.models import word2vec
from keras.preprocessing.text import Tokenizer
from keras.utils.np_utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.models import Sequential
from keras.models import load_model
from keras.layers import Flatten, Dense, Dropout, Conv1D, MaxPool1D, Embedding

#---------------------------------- Step 1: Data preprocessing --------------------------------
file = "data.csv"

# Get stop words
def stopwordslist():
    # Load the stop word list
    stopwords = [line.strip() for line in open('stop_words.txt', encoding="UTF-8").readlines()]
    return stopwords

# Remove stop words
def deleteStop(sentence):
    stopwords = stopwordslist()
    outstr = ""
    for i in sentence:
        # print(i)
        if i not in stopwords and i!="\n":
            outstr += i
    return outstr

# Chinese word segmentation
Mat = []
with open(file, "r", encoding="UTF-8") as f:
    # Use csv.DictReader to read the file
    reader = csv.DictReader(f)
    labels = []
    contents = []
    for row in reader:
        # Get the label: '好评' (positive review) -> 0, anything else -> 1
        if row['label'] == '好评':
            res = 0
        else:
            res = 1
        labels.append(res)

        # Chinese word segmentation
        content = row['content']
        #print(content)
        seglist = jieba.cut(content,cut_all=False)  # Accurate mode
        #print(seglist)
        # Remove stop words
        stc = deleteStop(seglist)  # Note: the result has no spaces between words
        # Join with spaces
        seg_list = jieba.cut(stc,cut_all=False)
        output = ' '.join(list(seg_list))
        #print(output)
        contents.append(output)

        # Part-of-speech tagging
        res = pseg.cut(stc)
        seten = []
        for word,flag in res:
            if flag not in ['nr','ns','nt','mz','m','f','ul','l','r','t']:
                #print(word,flag)
                seten.append(word)
        Mat.append(seten)

print(labels[:5])
print(contents[:5])
print(Mat[:5])

#---------------------------------- Step 2: Feature numbering --------------------------------
# fit_on_texts numbers each word of the input text by word frequency
# (the more frequent the word, the smaller its number)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(Mat)
vocab = tokenizer.word_index  # Stop words are already filtered; maps each word to its number
print(vocab)

# Use train_test_split to split X and y
X_train, X_test, y_train, y_test = train_test_split(Mat, labels, test_size=0.3, random_state=1)
print(X_train[:5])
print(y_train[:5])

#---------------------------------- Step 3: Word vector construction --------------------------------
# Word2Vec training
maxLen = 100         # Maximum length of a word sequence
num_features = 100   # Word vector dimension
min_word_count = 3   # Minimum frequency for a word to be kept
num_workers = 4      # Number of CPU cores used for parallel training
context = 4          # Context window size

# Set up the model (`size` is the gensim 3.x argument name)
model = word2vec.Word2Vec(Mat, workers=num_workers, size=num_features,
                          min_count=min_word_count, window=context)
# Force unit normalization
model.init_sims(replace=True)
# Save the trained model (if a directory path is used, it must already exist)
model.save("CNNw2vModel")
model.wv.save_word2vec_format("CNNVector",binary=False)
print(model)
# Load the model; if Word2Vec has already been trained, this line can be used directly
w2v_model = word2vec.Word2Vec.load("CNNw2vModel")

# Feature numbering (gaps are padded with 0)
trainID = tokenizer.texts_to_sequences(X_train)
print(trainID)
testID = tokenizer.texts_to_sequences(X_test)
print(testID)
# Pad the sequences so that the CNN inputs all have the same length
trainSeq = pad_sequences(trainID, maxlen=maxLen)
print(trainSeq)
testSeq = pad_sequences(testID, maxlen=maxLen)
print(testSeq)

# Convert the labels to one-hot codes
trainCate = to_categorical(y_train, num_classes=2)  # Binary classification
print(trainCate)
testCate = to_categorical(y_test, num_classes=2)    # Binary classification
print(testCate)

#---------------------------------- Step 4: CNN construction --------------------------------
# Use the trained Word2Vec vectors to build a custom Embedding weight matrix.
# Each row holds the vector of one word (think of one-hot encoding combined with matrix multiplication).
embedding_matrix = np.zeros((len(vocab)+1, 100))  # Word IDs start at 1, so add 1 row to match the feature word IDs
for word, i in vocab.items():
    try:
        # Look up the word vector and copy it into the matrix
        embedding_vector = w2v_model.wv[str(word)]
        embedding_matrix[i] = embedding_vector
    except KeyError:
        # Word not found: skip it, its row stays all zeros
        continue

# Train the model
main_input = Input(shape=(maxLen,), dtype='float64')  # Not used by the Sequential model below
# Word embedding: use the pre-trained Word2Vec vectors as a custom weight matrix; 100 is the output word vector dimension
embedder = Embedding(len(vocab)+1, 100, input_length=maxLen,
                     weights=[embedding_matrix], trainable=False)  # No further training

# Build the model
model = Sequential()
model.add(embedder)                                          # Embedding layer
model.add(Conv1D(256, 3, padding='same', activation='relu')) # Convolution layer: 256 filters, kernel size 3
model.add(MaxPool1D(maxLen-5, 3, padding='same'))            # Pooling layer: pool size maxLen-5, stride 3
model.add(Conv1D(32, 3, padding='same', activation='relu'))  # Convolution layer: 32 filters, kernel size 3
model.add(Flatten())                                         # Flatten
model.add(Dropout(0.3))                                      # Prevent overfitting: drop 30% of units during training
model.add(Dense(256, activation='relu'))                     # Fully connected layer
model.add(Dropout(0.2))                                      # Prevent overfitting
model.add(Dense(units=2, activation='softmax'))              # Output layer

# Print the model structure
model.summary()

# Compile the neural network
model.compile(optimizer = 'adam',                    # Optimizer
              loss = 'categorical_crossentropy',     # Loss
              metrics = ['accuracy']                 # Report accuracy
              )

# Train (training data, training labels, batch size 256, epochs, 20% held out for validation)
history = model.fit(trainSeq, trainCate, batch_size=256, epochs=6, validation_split=0.2)
model.save("TextCNN")

#---------------------------------- Step 5: Prediction --------------------------------
# Predict and evaluate
mainModel = load_model("TextCNN")
result = mainModel.predict(testSeq)  # Test samples
print(result)
print(np.argmax(result,axis=1))
score = mainModel.evaluate(testSeq, testCate, batch_size=32)
print(score)

#---------------------------------- Step 6: Visualization --------------------------------
import matplotlib.pyplot as plt

# Accuracy curves (left panel)
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train','Valid'], loc='upper left')

# Loss curves (right panel), kept separate so the two plots do not overwrite each other
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train','Valid'], loc='upper left')
plt.show()
Four. Summary
In summary, this article uses Keras to implement a CNN text classification example, and also introduces the principles of text classification and a comparison with machine learning algorithms.