[Information Retrieval] Experiment on Classification and Clustering
2022-07-04 14:29:00 [Alex_SCY]
(1) Use Java or another common programming language to implement the two feature selection methods introduced in Chapter 13 of the textbook *Introduction to Information Retrieval*: the mutual-information (MI) based method of Section 13.5.1 and the χ²-based method of Section 13.5.2.
Please collect 2021 news documents from the university's official website (by crawling or manual download). The corpus must include the following 150 news documents:
the 30 latest news documents of the "Party and Government Office",
the 30 latest of the "Academic Affairs Office",
the 30 latest of the "Admissions Office",
the 30 latest of the "Graduate School",
the 30 latest of the "Science and Technology Office".
Treat these five departments as 5 classes. Using mutual information and χ², select the 15 most relevant features for each class (giving the feature name and the corresponding score, rounded to 2 decimal places), and briefly analyze the results.
Code screenshots, screenshots of the running results, and a detailed description:
Step 1: Crawling
The text is crawled from the official site with Python and Selenium, then extracted and saved to files. Roughly, the Chrome automation driver switches to each department's 2021 news page, the crawler collects the links of the 30 latest news items, and for each link it locates the corresponding div and reads its text. The implementation is as follows:
Automatic page switching:
Extracting the text:
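As an illustration of the link-collection step, here is a minimal sketch. The original drove Chrome with Selenium (shown only as screenshots), so this sketch covers just the parsing half with the standard library; the list-page markup and the newest-first ordering are assumptions, not the site's real structure.

```python
# Sketch of the "take the 30 latest links from a section page" step.
# The real site's markup is unknown: this parser simply collects every
# <a href=...> on the (assumed newest-first) list page.
from html.parser import HTMLParser

class NewsListParser(HTMLParser):
    """Collect the href of every anchor on a section's news-list page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def latest_links(list_html, n=30):
    """Return the first n article links; list pages are assumed newest-first."""
    parser = NewsListParser()
    parser.feed(list_html)
    return parser.links[:n]
```

With Selenium the same idea becomes `driver.find_elements` over the list page, followed by `driver.get` on each link to read the article's div.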
Step 2: Processing the text
1. Read the documents and classify them by file name; file[0-5] holds the articles of the five categories in order.
2. Segment the text with jieba and generate the word bag.
Segmentation result:
The word bag:
Note: the crawled articles contain stray characters such as \u3000, so extra cleaning is required.
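The cleaning and segmentation step can be sketched roughly as follows. The exact cleaning rules and file layout of the original are assumptions; jieba is a third-party package, so the sketch falls back to whitespace splitting when it is absent.

```python
# Sketch of the cleaning + segmentation step. Only stray whitespace such as
# \u3000 (ideographic space) is removed here; the original may do more.
import re

try:
    import jieba  # third-party: pip install jieba
except ImportError:
    jieba = None

def clean(text):
    """Collapse \\u3000, NBSP and other whitespace runs left by the crawl."""
    return re.sub(r"[\u3000\xa0\s]+", " ", text).strip()

def tokenize(text):
    """Precise-mode jieba segmentation; whitespace split if jieba is absent."""
    if jieba is not None:
        return [t for t in jieba.cut(clean(text)) if t.strip()]
    return clean(text).split()

def build_vocab(docs):
    """The 'word bag': the sorted set of all tokens over all documents."""
    vocab = set()
    for doc in docs:
        vocab.update(tokenize(doc))
    return sorted(vocab)
```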
Step 3: Feature selection
The MI formula is as follows:
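The formula screenshots are not reproduced here; for reference, these are the standard estimates from IIR Sections 13.5.1 and 13.5.2, written with the document counts used below (N₁₁: in-class documents containing the term; N₁₀: out-of-class documents containing it; N₀₁ and N₀₀ analogously for absence; N the total; N₁. = N₁₁ + N₁₀, N.₁ = N₁₁ + N₀₁, etc.):

```latex
I(U;C) \approx \frac{N_{11}}{N}\log_2\frac{N\,N_{11}}{N_{1\cdot}N_{\cdot 1}}
             + \frac{N_{01}}{N}\log_2\frac{N\,N_{01}}{N_{0\cdot}N_{\cdot 1}}
             + \frac{N_{10}}{N}\log_2\frac{N\,N_{10}}{N_{1\cdot}N_{\cdot 0}}
             + \frac{N_{00}}{N}\log_2\frac{N\,N_{00}}{N_{0\cdot}N_{\cdot 0}}

\chi^2(D,t,c) = \frac{N\,(N_{11}N_{00}-N_{10}N_{01})^2}
                     {(N_{11}+N_{01})(N_{11}+N_{10})(N_{10}+N_{00})(N_{01}+N_{00})}
```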
To avoid repeated computation, the occurrences of each term across the categories are tallied first. For each term, statistics of the following form are collected:
| Category | Docs containing the term | Docs not containing it |
|---|---|---|
| Category 1 | 1 | 29 |
| Category 2 | 3 | 27 |
| Category 3 | 5 | 25 |
| Category 4 | 7 | 23 |
| Category 5 | 9 | 21 |
From this table, the four values N11, N10, N01, and N00 can be computed quickly for each category, and the MI and χ² scores are then obtained from their respective formulas.
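Given a per-term table like the one above (documents containing / not containing the term, per class), the four counts and both scores can be computed as in this sketch. The formulas are the standard IIR 13.5 ones; the helper names are mine, not the original code.

```python
import math

def mi(n11, n10, n01, n00):
    """Mutual information I(U;C) from the 2x2 document counts."""
    n = n11 + n10 + n01 + n00
    total = 0.0
    # Each row: (cell count, marginal of term presence, marginal of class membership)
    for nij, nt, nc in [(n11, n11 + n10, n11 + n01),
                        (n10, n11 + n10, n10 + n00),
                        (n01, n01 + n00, n11 + n01),
                        (n00, n01 + n00, n10 + n00)]:
        if nij:  # zero cells contribute 0 to the sum
            total += nij / n * math.log2(n * nij / (nt * nc))
    return total

def chi2(n11, n10, n01, n00):
    """Chi-square statistic from the same counts (closed form)."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den if den else 0.0

def counts_for_class(table, c):
    """table[k] = (present_k, absent_k) per class; derive N11, N10, N01, N00
    for class c by pooling the remaining classes."""
    n11, n01 = table[c]
    n10 = sum(p for k, (p, a) in enumerate(table) if k != c)
    n00 = sum(a for k, (p, a) in enumerate(table) if k != c)
    return n11, n10, n01, n00
```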
The final calculation results are as follows:
Step 4: Sorting
Since only the 15 largest scores are required, a top-K algorithm based on a min-heap is used:
First, a sift-down routine that restores the min-heap property:
Then the top-K procedure: build a min-heap from the first K elements; whenever a later element is larger than the heap top, replace the top and sift down again. Finally, heap sort is applied to order the K elements.
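The top-K step just described can be sketched as follows, with a hand-rolled sift-down to mirror the text rather than Python's `heapq` (names are mine):

```python
def sift_down(heap, i, size):
    """Restore the min-heap property at index i (the 'rebuild' step)."""
    while True:
        left, right, smallest = 2 * i + 1, 2 * i + 2, i
        if left < size and heap[left] < heap[smallest]:
            smallest = left
        if right < size and heap[right] < heap[smallest]:
            smallest = right
        if smallest == i:
            return
        heap[i], heap[smallest] = heap[smallest], heap[i]
        i = smallest

def top_k(values, k):
    """Largest k values in descending order, via a size-k min-heap.
    Assumes len(values) >= k."""
    heap = list(values[:k])
    for i in range(k // 2 - 1, -1, -1):  # heapify the first k elements
        sift_down(heap, i, k)
    for v in values[k:]:
        if v > heap[0]:                  # larger than the current k-th largest
            heap[0] = v
            sift_down(heap, 0, k)
    for end in range(k - 1, 0, -1):      # heap sort: min to the back => descending
        heap[0], heap[end] = heap[end], heap[0]
        sift_down(heap, 0, end)
    return heap
```

In practice `heapq.nlargest(k, values)` does the same job in one call; the explicit version above follows the report's description.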
The result is as follows:
A brief introduction to the Chinese word segmentation tool used:
Segmentation is done by calling jieba.cut.
jieba 0.4 and above supports four segmentation modes:
1. Precise mode: cuts the sentence as precisely as possible; suitable for text analysis.
2. Full mode: scans out every word that can be formed from the sentence; very fast, but does not resolve ambiguity.
3. Search-engine mode: on top of precise mode, long words are cut again to increase recall; suitable for search-engine indexing.
4. Paddle mode: uses the PaddlePaddle deep learning framework and a trained sequence-labeling (bidirectional GRU) model; also supports part-of-speech tagging.
As can be seen, full mode is the coarsest: all candidate words are returned. Its main problems are:
1. Without context it is prone to ambiguity: e.g. "collaborative filtering" gets split into overlapping fragments ("collaborate" + "same" + "filter").
2. It mishandles out-of-vocabulary words: e.g. 鲁棒 ("robust") is split into the single characters 鲁 + 棒.
Precise mode or search-engine mode can be chosen according to the specific need.
The 15 most relevant features selected per class by mutual information:
A brief analysis of the MI-selected features for each class:
1. Party and Government Office: the selected features fit well; terms such as "party and government", "party committee", and "grassroots" match the class closely.
2. Academic Affairs Office: most of the selected articles cluster in December, when most posts summarized the year's work; comparing against the articles, the features roughly capture that month's main work.
3. Admissions Office: a good fit. Many features clearly come from the outreach notices published at the time for promotion in high schools; checking confirmed that a large number of posts belonged to the "Famous Teachers Visit Middle Schools" series.
4. Graduate School: the selected features fit well; terms such as "master", "supervisor", and "doctor" match closely.
5. Science and Technology Office: the selected features fit well; terms such as "NSFC", "natural science", and "fund" match closely.
The 15 most relevant features selected per class by χ²:
A brief analysis of the χ²-selected features for each class:
The per-class observations largely mirror those under mutual information: party/committee/grassroots terms again dominate the Party and Government Office; the Academic Affairs Office features again concentrate on the December year-end summaries; the Admissions Office features again reflect the "Famous Teachers Visit Middle Schools" outreach series; the Graduate School is again characterized by "master", "supervisor", and "doctor"; and the Science and Technology Office by "NSFC", "natural science", and "fund".
A brief comparison of the 15 features selected per class by MI and by χ²:
Because of the crawler, every article contains a sentence like "(This article was last updated on 2021/12/29 19:05:00, cumulative hits: 877)". Both algorithms filter out such information that recurs in every category: for these terms both N11 and N10 are very high, so the scores stay low and the terms are filtered effectively.
Comparing the two methods, the selection and ranking of the top features largely agree, while the lower-ranked ones differ in emphasis. χ² is based on statistical significance, so it selects more rare terms than MI, and such terms are less reliable for classification. Of course, MI does not necessarily select the terms that maximize classification accuracy either. A better remedy, I think, is to enlarge the sample.
(2) Use Java or another common programming language to implement a simple document classification system based on the Naive Bayes algorithm: given a notice, decide which of the five classes ("Party and Government Office", "Academic Affairs Office", "Admissions Office", "Graduate School", "Science and Technology Office") it is most relevant to.
Compare and analyze the classification performance with and without feature selection. Use the documents from task (1) for training and testing: in each category, 20 articles for training and 10 for testing.
The report should include the overall system design, code screenshots (screenshots, not copied source), screenshots of the running results, and a detailed description. The program should be thoroughly commented. Briefly introduce the Chinese word segmentation tool used. (20 points)
Overall design of the system:
Code screenshots, screenshots of the running results, and a detailed description:
Step 1: Read the article data set
While reading, the text needs cleaning: the crawled articles contain stray characters such as \u3000, so extra processing is required. The text is then segmented with jieba to build the article list. postingList and classVec correspond one-to-one, pairing each text with its correct class label.
A word bag is then generated from the article list.
The next step is training the NB classifier.
The Naive Bayes formula is as follows:
The training pseudocode:
The concrete implementation:
Finally, the conditional probability of each term is obtained:
condprob[term][c] denotes the conditional probability of term in class c.
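A sketch of the training step, in the spirit of TrainMultinomialNB (IIR Fig. 13.2) with add-one smoothing; variable names and the document representation (pre-tokenized token lists) are assumptions, not the original code:

```python
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels):
    """docs: list of token lists; labels: class label per doc.
    Returns (vocab, prior, condprob) where condprob[term][c] is the
    Laplace-smoothed probability of term in class c."""
    classes = sorted(set(labels))
    vocab = sorted({t for d in docs for t in d})
    prior = {}
    condprob = defaultdict(dict)
    for c in classes:
        class_docs = [d for d, lbl in zip(docs, labels) if lbl == c]
        prior[c] = len(class_docs) / len(docs)
        tf = Counter(t for d in class_docs for t in d)   # term frequencies in c
        denom = sum(tf.values()) + len(vocab)            # add-one smoothing
        for t in vocab:
            condprob[t][c] = (tf[t] + 1) / denom
    return vocab, prior, condprob
```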
Applying the Naive Bayes algorithm:
The formula is as follows; computing in log space avoids the precision loss of multiplying many small probabilities:
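The decision rule being described, written in log space (the standard multinomial NB scoring rule):

```latex
c_{\mathrm{map}} = \underset{c \in \mathbb{C}}{\arg\max}
  \Big[\log \hat P(c) + \sum_{t \in d} \log \hat P(t \mid c)\Big]
```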
The application pseudocode:
applyMultinomialNB returns the document class with the highest probability.
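A sketch of the application step, mirroring ApplyMultinomialNB (IIR Fig. 13.2) with the log-space fix mentioned above; names are assumptions:

```python
import math

def apply_multinomial_nb(prior, condprob, doc):
    """Score every class in log space and return the argmax class.
    prior: class -> P(c); condprob: term -> {class -> P(t|c)}; doc: token list."""
    best_c, best_score = None, float("-inf")
    for c, p in prior.items():
        score = math.log(p)
        for t in doc:
            if t in condprob:  # terms unseen in training are skipped
                score += math.log(condprob[t][c])
        if score > best_score:
            best_c, best_score = c, score
    return best_c
```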
Classification performance with feature selection:
Overall accuracy: 94%
Classification performance without feature selection:
Overall accuracy: 86%
Comparison and analysis of the two settings:
As the numbers show, classification is better with feature selection: the selected features tie keywords to categories more accurately, while without selection many redundant terms interfere.
In both settings, the accuracy on the Academic Affairs Office class is not ideal. Looking at the specific articles, a likely reason is that this class contains too many kinds of articles while the data set is too small for a regular pattern to emerge, which lowers the accuracy. A feasible remedy is to enlarge the sample and enrich the characteristic vocabulary of the class.