R language for text mining Part4 text classification
2022-07-06 21:14:00 【Full stack programmer webmaster】
Part4 Text classification
Part3 covered text clustering. Put simply, classification differs from clustering in that it needs training data whose categories are already known.
So we need to prepare three things: a training set, which is text with known categories; a test set, for which the training set can be reused; and a prediction set, which is unclassified text and is where the classification method is finally applied.
1. Data preparation
Preparing the training set is a very tedious job. I have not found a labor-saving way so far, so I sorted the texts manually according to their content. The data here are the posts of a brand's official Weibo account. Based on their content, I divided the posts into these categories: promotional information (promotion), product promotion (product), public welfare information (publicWelfare), chicken-soup-style life advice (life), fashion news (fashionNews), and film and entertainment (showbiz). Each category has 20-50 records. The number of texts in each category of the training set is shown below. Using Chinese category names in the training set would also work without problems.
The training set is hlzj.train; it will also serve as the test set later.
The prediction set is hlzj from Part2.
> hlzj.train <- read.csv("hlzj_train.csv", header=T, stringsAsFactors=F)
> length(hlzj.train)
[1] 2
> table(hlzj.train$type)
fashionNews life product
27 34 38
promotion publicWelfare showbiz
45 22 36
> length(hlzj)
[1] 1639
2. Word segmentation
The training set, the test set, and the prediction set all need word segmentation before they can be classified.
I will not go into detail here; the process is similar to the one described in Part2.
The segmented training set is hlzjTrainTemp. The result of segmenting hlzj earlier is hlzjTemp.
Stop words are then removed from both hlzjTrainTemp and hlzjTemp.
> library(Rwordseg)
Loading required package: rJava
# Version: 0.2-1
> hlzjTrainTemp <- gsub("[0-90123456789 < > ~]", "", hlzj.train$text)
> hlzjTrainTemp <- segmentCN(hlzjTrainTemp)
> hlzjTrainTemp2 <- lapply(hlzjTrainTemp, removeStopWords, stopwords)
> hlzjTemp2 <- lapply(hlzjTemp, removeStopWords, stopwords)
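The removeStopWords function and the stopwords vector come from Part2 and are not shown in this post. A minimal sketch of what such a helper might look like, assuming each document is a character vector of tokens like the ones segmentCN() returns. The toy tokens and the stop-word list below are made up for illustration, and the real helper from Part2 may differ:

```r
# Hypothetical helper: keep only the tokens that are not in the stop-word list.
# The actual removeStopWords from Part2 may be implemented differently.
removeStopWords <- function(tokens, stopwords) {
  tokens[!(tokens %in% stopwords)]
}

# Toy data standing in for segmentCN() output (one character vector per post).
docs <- list(
  c("the", "new", "coat", "is", "on", "sale"),
  c("a", "movie", "is", "coming")
)
stopwords <- c("the", "is", "a", "on")

# lapply() passes stopwords as the second argument, as in the transcript above.
docs2 <- lapply(docs, removeStopWords, stopwords)
print(docs2)
```

Because the helper takes the stop-word list as its second argument, the same lapply() call works for both the training set and the prediction set.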
3. Get the matrix
As mentioned in Part3, the text must first be converted into a matrix before clustering, and classification needs the same step. The tm package is used. First, combine the stop-word-free results of the training set and the prediction set into hlzjAll. Keep in mind that the first 202 items (1:202) are the training set and the last 1639 items (203:1841) are the prediction set. Then build the corpus from hlzjAll, get the document-term matrix, and convert it to an ordinary matrix.
> hlzjAll <- character(0)
> hlzjAll[1:202] <- hlzjTrainTemp2
> hlzjAll[203:1841] <- hlzjTemp2
> length(hlzjAll)
[1] 1841
> library(tm)
> corpusAll <- Corpus(VectorSource(hlzjAll))
> (hlzjAll.dtm <-DocumentTermMatrix(corpusAll,control=list(wordLengths = c(2,Inf))))
<<DocumentTermMatrix(documents: 1841, terms: 10973)>>
Non-/sparse entries: 33663/20167630
Sparsity : 100%
Maximal term length: 47
Weighting : term frequency (tf)
> dtmAll_matrix <-as.matrix(hlzjAll.dtm)
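To make clear what the document-term matrix contains, here is a small sketch that builds one by hand in base R instead of tm. The three toy documents are invented, and this is only an illustration of the structure (one row per document, one column per term, cells holding counts), not how tm works internally:

```r
# Toy tokenized documents (assumed to be already segmented and cleaned).
docs <- list(
  d1 = c("sale", "coat", "sale"),
  d2 = c("movie", "star"),
  d3 = c("coat", "movie")
)

# Vocabulary: every distinct term, sorted for a stable column order.
terms <- sort(unique(unlist(docs)))

# Count each term per document; sapply() returns terms x documents,
# so transpose to get one row per document.
dtm <- t(sapply(docs, function(tokens) {
  as.integer(table(factor(tokens, levels = terms)))
}))
colnames(dtm) <- terms
print(dtm)
```

Most cells are zero even in this tiny example, which is why the real matrix above reports a sparsity of 100% after rounding.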
4. Classification
The knn algorithm (k-nearest neighbors) is used. It is in the class package.
The first 202 rows of the matrix are the training set and already have categories; the following 1639 rows are unclassified. We build the classifier from the training set and then predict the categories of the remaining rows.
Put the classification results together with the original Weibo posts and view them with fix(). You can then inspect the classification results, and the effect is fairly clear.
> rownames(dtmAll_matrix)[1:202] <-hlzj.train$type
> rownames(dtmAll_matrix)[203:1841] <- c("")
> train <- dtmAll_matrix[1:202,]
> predict <-dtmAll_matrix[203:1841,]
> trainClass <-as.factor(rownames(train))
> library(class)
> hlzj_knnClassify <-knn(train,predict,trainClass)
> length(hlzj_knnClassify)
[1] 1639
> hlzj_knnClassify[1:10]
[1] product product product promotion product fashionNews life
[8] product product fashionNews
Levels: fashionNews life product promotion publicWelfare showbiz
> table(hlzj_knnClassify)
hlzj_knnClassify
fashionNews life product promotion publicWelfare showbiz
40 869 88 535 28 79
> hlzj.knnResult <-list(type=hlzj_knnClassify,text=hlzj)
> hlzj.knnResult <-as.data.frame(hlzj.knnResult)
> fix(hlzj.knnResult)
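The same knn() call can be tried on a toy matrix that is small enough to check by hand. This is a sketch with invented term counts and only two categories; the variable name unlabeled is my own choice and stands in for the prediction rows:

```r
library(class)  # provides knn()

# Toy term-count matrix: rows are labeled documents, columns are terms.
train <- matrix(
  c(3, 0, 0,
    2, 1, 0,
    0, 3, 1,
    0, 2, 2),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("sale", "movie", "star"))
)
trainClass <- factor(c("promotion", "promotion", "showbiz", "showbiz"))

# Two unlabeled documents to classify.
unlabeled <- matrix(
  c(4, 0, 0,
    0, 4, 1),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("sale", "movie", "star"))
)

# k = 1 is the default of knn(); it is written out to make the choice visible.
pred <- knn(train, unlabeled, trainClass, k = 1)
print(pred)
```

Each unlabeled row receives the category of its closest training row by Euclidean distance, so a document dominated by "sale" ends up with the promotion documents.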
kNN is the simplest of the classification algorithms. When I later tried a neural network (nnet()), a support vector machine (svm()), and a random forest (randomForest()), my computer ran out of memory. It has 4 GB of RAM, and memory monitoring showed usage peaking at 3.92 GB.
It seems I need a more powerful computer ╮(╯▽╰)╭
Once the hardware is adequate, classification with these algorithms should not be a problem. To read the documentation for any of them, type ?? followed by the function name.
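One thing that might ease the memory pressure, although I did not try it in this post, is to drop terms that appear in very few documents before training, since the matrix has 10973 columns and most of them are nearly empty. A minimal base R sketch on a made-up matrix follows; the threshold minDocs is arbitrary. The tm package also has removeSparseTerms() for a similar purpose on the document-term matrix itself:

```r
# Toy document-term matrix: 4 documents, 5 terms.
m <- matrix(
  c(1, 0, 0, 2, 0,
    1, 1, 0, 0, 0,
    0, 0, 0, 1, 0,
    2, 0, 1, 1, 0),
  ncol = 5, byrow = TRUE,
  dimnames = list(NULL, c("sale", "coat", "gift", "movie", "rare"))
)

# Number of documents in which each term occurs at least once.
docFreq <- colSums(m > 0)

# Keep only terms that occur in at least minDocs documents.
minDocs <- 2
mSmall <- m[, docFreq >= minDocs, drop = FALSE]

print(dim(m))
print(dim(mSmall))
```

Whether this hurts or helps accuracy depends on the data, so the reduced matrix should be checked against a held-out test set.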
5. Classification effect
The testing step was not shown above. For this example, it simply means passing train as both of the first two arguments of knn(), because the test set is the same data as the training set. Since knn() uses k = 1 by default, every document is its own nearest neighbor, so the accuracy can reach 100%. When the training set is large enough, you can split it randomly 7:3 or 8:2 into two parts, the first for training and the second for testing. I will not go into details here.
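A random 7:3 split could be sketched as follows. The data here are randomly generated toy points rather than the Weibo matrix, and the seed, the class names, and the variable names are all arbitrary choices for illustration:

```r
library(class)  # provides knn()

set.seed(42)  # arbitrary seed so the random split is repeatable

# Toy labeled data: 20 documents, 2 features, 2 well-separated classes.
x <- rbind(
  matrix(rnorm(20, mean = 0), ncol = 2),
  matrix(rnorm(20, mean = 5), ncol = 2)
)
y <- factor(rep(c("life", "product"), each = 10))

# Random 7:3 split of the row indices.
n <- nrow(x)
trainIdx <- sample(n, size = round(0.7 * n))
testIdx <- setdiff(seq_len(n), trainIdx)

# Train on 70%, predict the held-out 30%, and measure accuracy.
pred <- knn(x[trainIdx, , drop = FALSE], x[testIdx, , drop = FALSE],
            y[trainIdx], k = 1)
accuracy <- mean(pred == y[testIdx])
print(accuracy)

# Testing on the training rows themselves: with k = 1 every row is its own
# nearest neighbor, so this score says nothing about unseen text.
selfPred <- knn(x, x, y, k = 1)
print(mean(selfPred == y))
```

The held-out accuracy is the number worth watching, because it is measured on rows the classifier never saw during training.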
When the classification results are not satisfactory, the way to improve them is to enrich the training set and make the features of each category as distinct as possible. In practice this is tedious work, but it cannot be done carelessly.
Corrections and suggestions for improvement are welcome. Please credit the source when reprinting. Thank you!
Copyright notice: This is an original article by the blogger. Do not reprint without permission.
Publisher: Full stack programmer webmaster. Please credit the source when reprinting: https://javaforall.cn/117093.html Link to the original text: https://javaforall.cn