
R language for text mining, Part 4: Text classification



Part 4: Text classification

Part 3 on text clustering already mentioned the simple difference between clustering and classification.

So, for classification we need to prepare a training set of texts that are already clearly labeled; a test set, for which the training set can substitute here; and a prediction set, the unlabeled texts to which the classification method is ultimately applied.

1. Data preparation

Preparing the training set is a very tedious task; I haven't found a labor-saving way yet, so it has to be sorted manually according to the text content. The data here come from a brand's official Weibo account. Based on the content, I grouped its posts into: promotional information (promotion), product promotion (product), public welfare information (publicWelfare), chicken soup for the soul (life), fashion news (fashionNews), and movie/entertainment (showbiz). Each category has 20-50 entries. Below you can see the number of texts per category in the training set; Chinese category labels would also work without problems.

The training set is hlzj.train; it will also serve as the test set later.

The prediction set is hlzj from Part 2.

> hlzj.train <- read.csv("hlzj_train.csv", header = T, stringsAsFactors = F)

> length(hlzj.train)

[1] 2

> table(hlzj.train$type)

  fashionNews          life       product
           27            34            38
    promotion publicWelfare       showbiz
           45            22            36

> length(hlzj)

[1] 1639
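A side note: length() applied to a data frame returns its number of columns, which is why length(hlzj.train) is 2: the file has a type column (the class label) and a text column (the Weibo content), and the six category counts above sum to 202 rows. A quick structural check (not part of the original transcript):

> str(hlzj.train)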

2. Word segmentation

The training set, test set, and prediction set all need to be segmented into words before classification is possible.

The process won't be detailed here; it is similar to what Part 2 covered.

The segmented training set is hlzjTrainTemp; the file from the earlier segmentation of hlzj is hlzjTemp.

Then remove stop words from both hlzjTrainTemp and hlzjTemp.

> library(Rwordseg)

Loading required package: rJava

# Version: 0.2-1

> hlzjTrainTemp <- gsub("[0-90123456789 < > ~]", "", hlzj.train$text)

> hlzjTrainTemp <- segmentCN(hlzjTrainTemp)

> hlzjTrainTemp2 <- lapply(hlzjTrainTemp, removeStopWords, stopwords)

> hlzjTemp2 <- lapply(hlzjTemp, removeStopWords, stopwords)
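The removeStopWords helper and the stopwords vector come from Part 2 and are not repeated here. For readers without Part 2 at hand, a minimal sketch, assuming the helper simply filters a word vector against a stop-word list (the actual Part 2 version may differ):

> # hypothetical reconstruction of the Part 2 helper:

> # keep only the words that are not in the stop-word list

> removeStopWords <- function(words, stopwords) {

+   words[!(words %in% stopwords)]

+ }

> # 'stopwords' is assumed to be a character vector loaded earlier, e.g.

> # stopwords <- readLines("stopwords.txt", encoding = "UTF-8")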

3. Get the matrix

As mentioned in Part 3, clustering first requires converting the text into a matrix, and classification needs the same process, using the tm package. First combine the training set and the prediction set (both with stop words removed) into hlzjAll; remember that the first 202 entries (1:202) are the training set and the following 1639 entries (203:1841) are the prediction set. Build a corpus from hlzjAll, obtain the document-term matrix, and then convert it into an ordinary matrix.

> hlzjAll <- character(0)

> hlzjAll[1:202] <- hlzjTrainTemp2

> hlzjAll[203:1841] <- hlzjTemp2

> length(hlzjAll)

[1] 1841

> library(tm)

> corpusAll <- Corpus(VectorSource(hlzjAll))

> (hlzjAll.dtm <-DocumentTermMatrix(corpusAll,control=list(wordLengths = c(2,Inf))))

<<DocumentTermMatrix (documents: 1841, terms: 10973)>>

Non-/sparse entries: 33663/20167630

Sparsity : 100%

Maximal term length: 47

Weighting : term frequency (tf)

> dtmAll_matrix <-as.matrix(hlzjAll.dtm)
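One practical note: lapply() returns a list of word vectors, while VectorSource() expects one string per document. If Corpus() complains about the input type when you reproduce this, a minimal fix (assuming the objects above) is to collapse each document back into a space-separated string before building the corpus:

> # collapse each list element into one space-separated string

> hlzjAll <- sapply(c(hlzjTrainTemp2, hlzjTemp2), paste, collapse = " ")

> length(hlzjAll)    # should still be 1841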

4. Classification

We use the kNN algorithm (k-nearest neighbors), provided by the class package.

The first 202 rows of the matrix are the training set and already have class labels; the remaining 1639 rows are unclassified. We build the classification model from the training set and then predict classes for those rows.

Finally, the predicted classes are put back together with the original Weibo text. Viewing the result with fix(), you can see the assigned classes; the effect is quite noticeable.

> rownames(dtmAll_matrix)[1:202] <-hlzj.train$type

> rownames(dtmAll_matrix)[203:1841] <- c("")

> train <- dtmAll_matrix[1:202,]

> predict <-dtmAll_matrix[203:1841,]

> trainClass <-as.factor(rownames(train))

> library(class)

> hlzj_knnClassify <-knn(train,predict,trainClass)

> length(hlzj_knnClassify)

[1] 1639

> hlzj_knnClassify[1:10]

[1] product product product promotion product fashionNews life

[8] product product fashionNews

Levels: fashionNews life product promotion publicWelfare showbiz

> table(hlzj_knnClassify)

hlzj_knnClassify

fashionNews life product promotion publicWelfare showbiz

40 869 88 535 28 79

> hlzj.knnResult <-list(type=hlzj_knnClassify,text=hlzj)

> hlzj.knnResult <-as.data.frame(hlzj.knnResult)

> fix(hlzj.knnResult)
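Note that knn() uses k = 1 by default. Other values of k are worth trying, and the same class package provides knn.cv() for leave-one-out cross-validation on the training set. A sketch, reusing train and trainClass from above (the candidate k values are illustrative):

> # try a few k values with leave-one-out cross-validation

> for (k in c(1, 3, 5, 7)) {

+   cv <- knn.cv(train, trainClass, k = k)

+   cat("k =", k, "LOOCV accuracy:", mean(cv == trainClass), "\n")

+ }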

kNN is the simplest of the classification algorithms. When I later tried the neural network algorithm (nnet()), the support vector machine algorithm (svm()), and the random forest algorithm (randomForest()), I ran into insufficient memory: my computer has 4 GB, and the memory monitor showed usage peaking at 3.92 GB.

It seems I need a more powerful computer ╮(╯▽╰)╭

When the hardware is up to it, classification with these methods should pose no problem. For any of the algorithms, you can view its documentation with ??methodName.
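For reference, here is a sketch of how the alternative classifiers mentioned above could be called on the same matrices (packages e1071 and randomForest; untested here, since on a document-term matrix of this size they may exhaust memory, as noted; nnet() additionally needs the target in a different format, so it is omitted):

> # support vector machine from e1071 -- memory permitting

> library(e1071)

> svm.model <- svm(train, trainClass)

> svm.pred <- predict(svm.model, predict)

> # random forest; ntree = 100 is an illustrative choice

> library(randomForest)

> rf.model <- randomForest(train, trainClass, ntree = 100)

> rf.pred <- predict(rf.model, predict)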

5. Classification effect

The testing step was not shown above. For this example it amounts to passing train as both of knn()'s first two arguments; since the same data are used for training and testing, the resulting accuracy reaches 100%. When the training set is large enough, it can instead be split randomly 7:3 or 8:2 into two parts, the former for training and the latter for testing. I won't go into detail here.
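A minimal sketch of that 7:3 split, reusing train and trainClass from section 4 (set.seed() is only there to make the split reproducible):

> set.seed(42)

> idx <- sample(nrow(train), round(0.7 * nrow(train)))

> knn.pred <- knn(train[idx, ], train[-idx, ], trainClass[idx])

> mean(knn.pred == trainClass[-idx])    # held-out accuracy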

When the classification results are not ideal, improving them requires enriching the training set and making the features of each class as distinctive as possible. In practice this is a tedious process that must not be done perfunctorily.

Suggestions for improvement are welcome. Please indicate the source when reposting, thank you!
