R language for text mining Part4 text classification
2022-07-06 21:14:00 【Full stack programmer webmaster】
Part4 Text classification
Part 3 covered text clustering and briefly explained how clustering differs from classification.
Classification needs three sets of data. The training set is text whose categories are already known. The test set can be replaced by the training set here. The prediction set is unclassified text, and it is where the classification method is finally applied.
1. Data preparation
Preparing the training set is tedious work. I have not found a labor-saving way so far, so I sorted the posts manually by content. The data here comes from a brand's official Weibo account. Based on the content, I divided its posts into six categories: promotional information (promotion), product promotion (product), public welfare information (publicWelfare), inspirational "chicken soup" posts (life), fashion news (fashionNews), and film and entertainment (showbiz). Each category has 20 to 50 posts. The number of posts per category in the training set is shown below. The category labels could also be written in Chinese without causing problems.
The training set is hlzj.train. It will also serve as the test set later.
The prediction set is hlzj from Part 2.
> hlzj.train <- read.csv("hlzj_train.csv", header=T, stringsAsFactors=F)
> length(hlzj.train)
[1] 2
> table(hlzj.train$type)
  fashionNews          life       product
           27            34            38
    promotion publicWelfare       showbiz
           45            22            36
> length(hlzj)
[1] 1639
2. Word segmentation
The training set, test set, and prediction set all need word segmentation before they can be classified.
The details are not repeated here, because the process is similar to the one described in Part 2.
The segmented training set is hlzjTrainTemp. The result of segmenting hlzj earlier is hlzjTemp.
Stop words are then removed from both hlzjTrainTemp and hlzjTemp.
> library(Rwordseg)
Loading required package: rJava
# Version: 0.2-1
> hlzjTrainTemp <- gsub("[0-90123456789 < > ~]", "", hlzj.train$text)
> hlzjTrainTemp <- segmentCN(hlzjTrainTemp)
> hlzjTrainTemp2 <- lapply(hlzjTrainTemp, removeStopWords, stopwords)
> hlzjTemp2 <- lapply(hlzjTemp, removeStopWords, stopwords)
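The removeStopWords helper and the stopwords vector come from Part 2 and are not shown in this part. As a rough idea of what such a helper might look like, here is a minimal sketch. It assumes each document is a character vector of tokens (the shape segmentCN returns) and that stopwords is a plain character vector. The function body and the toy tokens are illustrative assumptions, not the author's actual code.

```r
# Hypothetical helper: drop every token that appears in the stop-word list.
removeStopWords <- function(tokens, stopwords) {
  tokens[!(tokens %in% stopwords)]
}

# Toy data standing in for segmentCN() output (one token vector per post).
docs <- list(c("new", "the", "dress", "is", "on", "sale"),
             c("the", "movie", "is", "great"))
stopwords <- c("the", "is", "on")

# Apply the helper to every document, as the transcript above does with lapply().
docsClean <- lapply(docs, removeStopWords, stopwords)
print(docsClean)
```

Because lapply() passes stopwords through as an extra argument, the same call shape works for both the training set and the prediction set.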
3. Get the matrix
As discussed in Part 3, clustering first converts the text into a matrix, and classification needs the same step, using the tm package. First combine the stop-word-filtered training set and prediction set into hlzjAll, keeping in mind that the first 202 entries (1:202) are the training set and the remaining 1639 entries (203:1841) are the prediction set. Then build a corpus from hlzjAll, derive its document-term matrix, and convert that to an ordinary matrix.
> hlzjAll <- character(0)
> hlzjAll[1:202] <- hlzjTrainTemp2
> hlzjAll[203:1841] <- hlzjTemp2
> length(hlzjAll)
[1] 1841
> library(tm)
> corpusAll <- Corpus(VectorSource(hlzjAll))
> (hlzjAll.dtm <-DocumentTermMatrix(corpusAll,control=list(wordLengths = c(2,Inf))))
<<DocumentTermMatrix(documents: 1841, terms: 10973)>>
Non-/sparse entries: 33663/20167630
Sparsity : 100%
Maximal term length: 47
Weighting : term frequency (tf)
> dtmAll_matrix <-as.matrix(hlzjAll.dtm)
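To make concrete what the document-term matrix holds, the following base-R sketch builds one by hand for three toy documents. It is not how tm implements DocumentTermMatrix; the documents and variable names are made up for illustration only.

```r
# Toy tokenized documents (made up for illustration).
docs <- list(c("sale", "dress", "sale"),
             c("movie", "star"),
             c("dress", "star", "star"))

# Vocabulary: every distinct term across all documents.
terms <- sort(unique(unlist(docs)))

# Count how often each vocabulary term occurs in each document.
# vapply() returns one column per document, so transpose to get documents as rows.
dtm <- t(vapply(docs,
                function(tokens) tabulate(match(tokens, terms), nbins = length(terms)),
                integer(length(terms))))
dimnames(dtm) <- list(paste0("doc", seq_along(docs)), terms)
print(dtm)
```

Each row is a document and each column a term, which is the layout that knn() later expects for its training and test matrices.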
4. classification
We use the kNN (k-nearest neighbors) algorithm, which is provided by the class package.
The first 202 rows of the matrix are the training set and already have categories. The following 1639 rows are unclassified. We classify those rows based on the labelled training rows.
The classification results are then placed alongside the original Weibo posts. Viewing them with fix() shows the results, and the effect is quite noticeable.
> rownames(dtmAll_matrix)[1:202] <-hlzj.train$type
> rownames(dtmAll_matrix)[203:1841] <- c("")
> train <- dtmAll_matrix[1:202,]
> predict <-dtmAll_matrix[203:1841,]
> trainClass <-as.factor(rownames(train))
> library(class)
> hlzj_knnClassify <-knn(train,predict,trainClass)
> length(hlzj_knnClassify)
[1] 1639
> hlzj_knnClassify[1:10]
[1] product product product promotion product fashionNews life
[8] product product fashionNews
Levels: fashionNews life product promotion publicWelfare showbiz
> table(hlzj_knnClassify)
hlzj_knnClassify
  fashionNews          life       product     promotion publicWelfare       showbiz
           40           869            88           535            28            79
> hlzj.knnResult <-list(type=hlzj_knnClassify,text=hlzj)
> hlzj.knnResult <-as.data.frame(hlzj.knnResult)
> fix(hlzj.knnResult)
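The same knn() call can be tried on a tiny hand-made matrix to see how it behaves. This is a sketch with invented term counts and labels; only knn() from the class package and its train, test, cl, and k arguments are real.

```r
library(class)  # knn() lives in the class package

# Toy term-count matrix: rows are labelled training posts, columns are terms.
train <- rbind(c(3, 0, 0),
               c(2, 1, 0),
               c(0, 0, 3),
               c(0, 1, 2))
colnames(train) <- c("sale", "new", "movie")
trainClass <- factor(c("promotion", "promotion", "showbiz", "showbiz"))

# Two unlabelled posts to classify.
unlabelled <- rbind(c(4, 0, 0),
                    c(0, 0, 4))
colnames(unlabelled) <- colnames(train)

# k = 1 is knn()'s default: each post takes the label of its single nearest neighbour.
pred <- knn(train, unlabelled, trainClass, k = 1)
print(pred)
```

The sketch names the second matrix unlabelled rather than predict, to avoid confusion with R's built-in predict() function.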
kNN is the simplest of the classification algorithms. When I later tried a neural network (nnet()), a support vector machine (svm()), and a random forest (randomForest()), the computer ran out of memory. My machine has 4 GB of RAM, and memory monitoring showed usage peaking at 3.92 GB.
It looks like I need a more powerful computer ╮(╯▽╰)╭
With adequate hardware, classification with these algorithms should not be a problem. To view the documentation for any of them, type ?? followed by the function name, for example ??svm.
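One way to ease the memory pressure, which the original post did not try, is to drop rare terms before training, since a matrix with 10973 term columns is mostly zeros. The tm package offers removeSparseTerms() for this. The base-R sketch below shows the same idea on a toy matrix; the counts and the minDocs threshold are arbitrary assumptions.

```r
# Toy document-term matrix: 4 documents, 5 terms (made-up counts).
m <- matrix(c(1, 0, 0, 2, 0,
              1, 1, 0, 0, 0,
              0, 0, 1, 1, 0,
              2, 0, 0, 1, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(NULL, c("sale", "gift", "film", "dress", "star")))

# Number of documents in which each term occurs at least once.
docFreq <- colSums(m > 0)

# Keep only terms that occur in at least minDocs documents (assumed threshold).
minDocs <- 2
mSmall <- m[, docFreq >= minDocs, drop = FALSE]
print(dim(mSmall))
```

Whether this helps accuracy as well as memory depends on the data, so the threshold would need to be checked against a held-out test set.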
5. Classification effect
The testing step was not shown above. For this example it amounts to passing train as both of the first two arguments to knn(), because the test set is the same data as the training set. With the default k = 1, every test point is then its own nearest neighbor, so the accuracy comes out at 100%, which says little about real performance. When the training set is large enough, it can be split at random 7:3 or 8:2, with the larger part used for training and the smaller part for testing. I won't go into the details here.
When the classification results are not satisfactory, improving them means enriching the training set and making the features of each category as distinct as possible. In practice this is tedious work, but it cannot be done carelessly.
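A random 7:3 split might be sketched as follows. The data here is synthetic (two well-separated clusters generated with rnorm) and stands in for a labelled document-term matrix; the variable names and the seed are assumptions for illustration.

```r
library(class)

set.seed(42)  # make the random split reproducible

# Toy labelled data: two well-separated classes of 20 points each.
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 10), ncol = 2))
y <- factor(rep(c("life", "product"), each = 20))

# Randomly assign 70% of the rows to training; the rest are held out for testing.
n <- nrow(x)
trainIdx <- sample(n, size = round(0.7 * n))

# Classify the held-out rows using only the training rows.
pred <- knn(x[trainIdx, ], x[-trainIdx, ], y[trainIdx], k = 1)

# Accuracy on the held-out 30%.
accuracy <- mean(pred == y[-trainIdx])
print(accuracy)
```

Unlike testing on the training set itself, the held-out rows are never seen by knn() as neighbours, so the accuracy reflects how the classifier handles new text.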
Corrections and suggestions for improvement are welcome. When reprinting, please credit the source. Thank you!
Copyright notice: this is the blogger's original article. Do not reprint without permission.
Publisher: Full stack programmer webmaster. When reprinting, please credit the source: https://javaforall.cn/117093.html Link to the original text: https://javaforall.cn