当前位置:网站首页>R language for text mining Part4 text classification
R language for text mining Part4 text classification
2022-07-06 21:14:00 【Full stack programmer webmaster】
Hello everyone , I meet you again , I'm the king of the whole stack .
Part4 Text classification
Part3 Text clustering mentioned . Simple difference from clustering classification .
that , We need to sort out the classification of training sets , There is clearly classified text ; Test set , Can use training set to replace . Prediction set , Is unclassified text . It is the final application of classification method .
1. Data preparation
Training set preparation is a very tedious function , I didn't find any labor-saving way temporarily , Sort out manually according to the text content . Here is the official wechat data of a brand , According to the content of Weibo . I divide the main content of its Weibo into : Promotional information (promotion)、 Product promotion (product)、 Public welfare information (publicWelfare)、 Chicken soup (life)、 Fashion information (fashionNews)、 Movie entertainment (showbiz). Each category has 20-50 Data . For example, we can see the text number of each category under the training set below , There is no problem that the training set is classified as Chinese .
The training set is hlzj.train, It will also be used as a test set later .
The prediction set is Part2 Inside hlzj.
> hlzj.train <-read.csv(“hlzj_train.csv”,header=T,stringsAsFactors=F)
> length(hlzj.train)
[1] 2
> table(hlzj.train$type)
fashionNews life product
27 34 38
promotion publicWelfare showbiz
45 22 36
> length(hlzj)
[1] 1639
2. Word segmentation
Training set 、 Test set 、 Prediction sets need word segmentation before possible classification .
It will not be specified here , The process is similar to Part2 Talked about .
After word segmentation in the training set hlzjTrainTemp. Previous pair hlzj After word segmentation, the file is hlzjTemp.
And then they will hlzjTrainTemp and hlzjTemp Remove stop words .
> library(Rwordseg)
Load the required program package :rJava
# Version: 0.2-1
> hlzjTrainTemp <- gsub(“[0-90123456789 < > ~]”,””,hlzj.train$text)
> hlzjTrainTemp <-segmentCN(hlzjTrainTemp)
> hlzjTrainTemp2 <-lapply(hlzjTrainTemp,removeStopWords,stopwords)
>hlzjTemp2 <-lapply(hlzjTemp,removeStopWords,stopwords)
3. Get the matrix
stay Part3 Speak to the . When doing clustering, first convert the text into a matrix , The same process is needed for classification . be used tm software package . First, combine the results of the training set and the prediction set after removing the stop words into hlzjAll, Remember before 202(1:202) Data is a training set , after 1639(203:1841) Bars are prediction sets . obtain hlzjAll The corpus of , And get documents - Entry matrix . Convert it to a normal matrix .
> hlzjAll <- character(0)
> hlzjAll[1:202] <- hlzjTrainTemp2
> hlzjAll[203:1841] <- hlzjTemp2
> length(hlzjAll)
[1] 1841
> corpusAll <-Corpus(VectorSource(hlzjAll))
> (hlzjAll.dtm <-DocumentTermMatrix(corpusAll,control=list(wordLengths = c(2,Inf))))
<<DocumentTermMatrix(documents: 1841, terms: 10973)>>
Non-/sparse entries: 33663/20167630
Sparsity : 100%
Maximal term length: 47
Weighting : term frequency (tf)
> dtmAll_matrix <-as.matrix(hlzjAll.dtm)
4. classification
be used knn Algorithm (K Nearest neighbor algorithm ). The algorithm is class In the software package .
Before the matrix 202 Row data is a training set , There are already classifications , hinder 1639 Pieces of data are not classified . We should get the classification model according to the training set, and then make the classification prediction for it .
Put the classified results together with the original Weibo . use fix() see , You can see the classification results , The effect is quite obvious .
> rownames(dtmAll_matrix)[1:202] <-hlzj.train$type
> rownames(dtmAll_matrix)[203:1841]<- c(“”)
> train <- dtmAll_matrix[1:202,]
> predict <-dtmAll_matrix[203:1841,]
> trainClass <-as.factor(rownames(train))
> library(class)
> hlzj_knnClassify <-knn(train,predict,trainClass)
> length(hlzj_knnClassify)
[1] 1639
> hlzj_knnClassify[1:10]
[1] product product product promotion product fashionNews life
[8] product product fashionNews
Levels: fashionNews life productpromotion publicWelfare showbiz
> table(hlzj_knnClassify)
hlzj_knnClassify
fashionNews life product promotion publicWelfare showbiz
40 869 88 535 28 79
> hlzj.knnResult <-list(type=hlzj_knnClassify,text=hlzj)
> hlzj.knnResult <-as.data.frame(hlzj.knnResult)
> fix(hlzj.knnResult)
Knn Classification algorithm is the simplest one . Later, try to use neural network algorithm (nnet())、 Support vector machine algorithm (svm())、 Random forest algorithm (randomForest()) when . There is a problem of insufficient computer memory , My computer is 4G Of , When looking at memory monitoring, you can see that the maximum usage is 3.92G.
It seems that we need to change a computer with more power ╮(╯▽╰)╭
When the hardware conditions can be met , There should be no problem with classification . Relevant algorithms can be used :?? Method name , To view its documentation .
5. Classification effect
The test process is not mentioned above , For the example above , Namely knn The first two parameters are used train, Because using data sets is the same . Therefore, the accuracy of the obtained results can reach 100%. When there are many training sets . Can press it randomly 7:3 Or is it 8:2 Divided into two parts , The former is good for training and the latter is good for testing . I won't go into details here .
When the classification effect is not ideal . To improve the classification effect, we need to enrich the training set . Make the features of the training set as obvious as possible . This practical problem is a very tedious but cannot be perfunctory process .
What can be improved? Welcome to correct , Reprint please indicate the source , thank you !
Copyright notice : This article is the original article of the blogger , Blog , Do not reprint without permission .
Publisher : Full stack programmer stack length , Reprint please indicate the source :https://javaforall.cn/117093.html Link to the original text :https://javaforall.cn
边栏推荐
- Is this the feeling of being spoiled by bytes?
- What are RDB and AOF
- New database, multidimensional table platform inventory note, flowus, airtable, seatable, Vig table Vika, Feishu multidimensional table, heipayun, Zhixin information, YuQue
- 每个程序员必须掌握的常用英语词汇(建议收藏)
- What is the difference between procedural SQL and C language in defining variables
- 039. (2.8) thoughts in the ward
- Reference frame generation based on deep learning
- 全网最全的新型数据库、多维表格平台盘点 Notion、FlowUs、Airtable、SeaTable、维格表 Vika、飞书多维表格、黑帕云、织信 Informat、语雀
- c#使用oracle存储过程获取结果集实例
- The most comprehensive new database in the whole network, multidimensional table platform inventory note, flowus, airtable, seatable, Vig table Vika, flying Book Multidimensional table, heipayun, Zhix
猜你喜欢

The biggest pain point of traffic management - the resource utilization rate cannot go up

HMS core machine learning service creates a new "sound" state of simultaneous interpreting translation, and AI makes international exchanges smoother

【OpenCV 例程200篇】220.对图像进行马赛克处理

3D face reconstruction: from basic knowledge to recognition / reconstruction methods!

每个程序员必须掌握的常用英语词汇(建议收藏)

SAP Fiori应用索引大全工具和 SAP Fiori Tools 的使用介绍

966 minimum path sum

面试官:Redis中有序集合的内部实现方式是什么?

The difference between break and continue in the for loop -- break completely end the loop & continue terminate this loop

KDD 2022 | 通过知识增强的提示学习实现统一的对话式推荐
随机推荐
Laravel notes - add the function of locking accounts after 5 login failures in user-defined login (improve system security)
OneNote 深度评测:使用资源、插件、模版
1_ Introduction to go language
After working for 5 years, this experience is left when you reach P7. You have helped your friends get 10 offers
【Redis设计与实现】第一部分 :Redis数据结构和对象 总结
Chris LATTNER, the father of llvm: why should we rebuild AI infrastructure software
Distributed ID
ICML 2022 | flowformer: task generic linear complexity transformer
968 edit distance
3D人脸重建:从基础知识到识别/重建方法!
正则表达式收集
Simple continuous viewing PTA
Statistical inference: maximum likelihood estimation, Bayesian estimation and variance deviation decomposition
El table table - get the row and column you click & the sort of El table and sort change, El table column and sort method & clear sort clearsort
OneNote in-depth evaluation: using resources, plug-ins, templates
嵌入式开发的7大原罪
el-table表格——sortable排序 & 出现小数、%时排序错乱
JS traversal array and string
华为设备命令
代理和反向代理