[Information Retrieval] Experiment on Classification and Clustering

2022-07-04 14:29:00 【Alex_ SCY】

(1) Use Java or another commonly used language to implement the two feature selection methods introduced in Chapter 13 of the textbook "Introduction to Information Retrieval": the mutual information (MI) method described in Section 13.5.1 and the X^2 (chi-square) method described in Section 13.5.2.

Obtain the school's 2021 news documents yourself from the school's official document system (by crawling or by manual download). The following 150 news documents are required:
"Party and government offices": the latest 30 news documents,
"The Ministry of Education": the latest 30 news documents,
"Admissions office": the latest 30 news documents,
"graduate school": the latest 30 news documents,
"Ministry of science and technology": the latest 30 news documents.
Treat "Party and government offices", "The Ministry of Education", "Admissions office", "graduate school" and "Ministry of science and technology" as 5 classes, use mutual information and X^2 to select the 15 most relevant features for each class (give each feature name and its value, rounded to 2 decimal places), and briefly analyze the results.

Code screenshots, screenshots of the results, and a detailed description:

Step 1: Crawler

Python's selenium automation tool is used to crawl the official documents, extract the text, and store it in files. Roughly, the Chrome automation driver switches automatically to the 2021 page of each department; the crawler then collects the links to the first 30 news items and, for each link, locates the corresponding div and reads its text. The implementation is as follows:

Automatic page switching:

Getting the text:

Step 2: Processing the text

1. Read the documents and classify them by file name.

Here file[0-5] holds the articles of the five categories in turn.

2. Use jieba for word segmentation and build the bag of words.

Segmentation result:

The bag of words:

Note: the crawled articles contain extra characters such as \u3000, so additional processing is required.
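The cleaning and bag-of-words steps can be sketched as follows. This is a minimal illustration, not the code from the screenshots: the function names clean_text and build_vocabulary are hypothetical, and the token lists are assumed to come from a segmenter such as jieba.

```python
import re
from collections import Counter


def clean_text(text):
    # Replace the ideographic space (U+3000) left over from the crawl,
    # then collapse any run of whitespace into a single space.
    text = text.replace("\u3000", " ")
    return re.sub(r"\s+", " ", text).strip()


def build_vocabulary(tokenized_docs):
    # tokenized_docs: one list of tokens per document (already segmented).
    # Returns the sorted vocabulary and, for each term, the number of
    # documents that contain it.
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return sorted(doc_freq), doc_freq
```

Counting each term once per document (rather than once per occurrence) gives the document counts that the feature selection step needs.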

Step 3: Feature selection

The MI formula is as follows:
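For reference, Section 13.5.1 of the textbook defines MI from four document counts N11, N10, N01 and N00, where the first index says whether the document contains the term and the second whether it belongs to the class; Section 13.5.2 defines X^2 from the same counts. That the implementation here uses exactly these forms is an assumption.

```latex
I(U;C) = \frac{N_{11}}{N}\log_2\frac{N\,N_{11}}{N_{1.}\,N_{.1}}
       + \frac{N_{01}}{N}\log_2\frac{N\,N_{01}}{N_{0.}\,N_{.1}}
       + \frac{N_{10}}{N}\log_2\frac{N\,N_{10}}{N_{1.}\,N_{.0}}
       + \frac{N_{00}}{N}\log_2\frac{N\,N_{00}}{N_{0.}\,N_{.0}}

X^2 = \frac{(N_{11}+N_{10}+N_{01}+N_{00})\,(N_{11}N_{00}-N_{10}N_{01})^2}
           {(N_{11}+N_{01})\,(N_{11}+N_{10})\,(N_{10}+N_{00})\,(N_{01}+N_{00})}
```

Here N = N11 + N10 + N01 + N00, N1. = N11 + N10, N0. = N01 + N00, N.1 = N11 + N01 and N.0 = N10 + N00.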

To avoid repeated counting, I first tallied the occurrences of each term in the different categories. For each term, the statistics are first collected in a table of the following form:

Category	Documents containing the term	Documents not containing the term
Category 1	1	29
Category 2	3	27
Category 3	5	25
Category 4	7	23
Category 5	9	21

With this table, the four values N11, N10, N01 and N00 can be computed quickly for each category, and the MI and X^2 scores then follow from their formulas.
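As a sketch of this step, the code below derives the four counts from one row set of the table and evaluates both scores with the textbook definitions. The function names are hypothetical, and treating a zero count as contributing 0 to MI is a convention chosen for this sketch.

```python
from math import log2


def contingency(containing, class_sizes, c):
    # containing[i]  = documents of class i that contain the term
    # class_sizes[i] = total documents of class i
    n11 = containing[c]                       # term present, in class c
    n10 = sum(containing) - n11               # term present, other classes
    n01 = class_sizes[c] - n11                # term absent, in class c
    n00 = sum(class_sizes) - class_sizes[c] - n10  # term absent, other classes
    return n11, n10, n01, n00


def mutual_information(n11, n10, n01, n00):
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00   # documents with / without the term
    n_1, n_0 = n11 + n01, n10 + n00   # documents in / not in the class

    def part(n_tc, n_t, n_c):
        # A zero count contributes nothing (0 * log 0 is taken as 0).
        if n_tc == 0:
            return 0.0
        return n_tc / n * log2(n * n_tc / (n_t * n_c))

    return (part(n11, n1_, n_1) + part(n01, n0_, n_1)
            + part(n10, n1_, n_0) + part(n00, n0_, n_0))


def chi_square(n11, n10, n01, n00):
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:
        return 0.0  # degenerate table: term in all or none of the documents
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


# Category 1 of the table above: 30 documents per class, 150 in total.
counts = contingency([1, 3, 5, 7, 9], [30] * 5, 0)  # (1, 24, 29, 96)
```

For Category 1 this gives N11 = 1, N10 = 3 + 5 + 7 + 9 = 24, N01 = 29 and N00 = 27 + 25 + 23 + 21 = 96, which sum to the 150 documents.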

The final results are as follows:

Step 4: Sorting

Since the task only asks for the 15 largest feature values, a TopK algorithm based on a min-heap is used.
First, a routine that rebuilds (sifts down) the min-heap is written:

Then comes the TopK selection: build a min-heap from the first K elements; whenever a subsequent element is larger than the element at the top of the heap, replace the top with it and rebuild the heap. Finally, sort the K remaining elements with heap sort.
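The same idea can be sketched with Python's standard heapq module in place of a hand-written heap. The function name top_k and the dict input are assumptions made for this illustration.

```python
import heapq


def top_k(scores, k):
    # scores: dict mapping term -> score.
    # Keeps a min-heap of at most k (score, term) pairs; heap[0] is always
    # the smallest score currently kept.
    heap = []
    for term, score in scores.items():
        if len(heap) < k:
            heapq.heappush(heap, (score, term))
        elif score > heap[0][0]:
            # Pop the smallest kept pair and push the new one in one step.
            heapq.heapreplace(heap, (score, term))
    # Sort the k survivors from highest to lowest score.
    return [(term, score) for score, term in sorted(heap, reverse=True)]


best = top_k({"a": 0.1, "b": 0.5, "c": 0.3, "d": 0.9}, 2)
```

Each term costs at most O(log K) work, so selecting 15 features from a large vocabulary is cheaper than sorting the whole vocabulary. A value can be printed with two decimals using f"{score:.2f}".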

The result is as follows:

A brief introduction to the Chinese word segmentation tool used:

Segmentation is done by calling the jieba.cut function.
jieba versions 0.4 and above support four segmentation modes:
1. Accurate mode: tries to cut the sentence as precisely as possible; suitable for text analysis.
2. Full mode: scans out every word that can be formed from the sentence; very fast, but cannot resolve ambiguity.
3. Search engine mode: based on accurate mode, splits long words again to increase recall; suitable for search engine indexing.
4. paddle mode: uses the PaddlePaddle deep learning framework to train a sequence labeling network (bidirectional GRU) for segmentation. It also supports part-of-speech tagging.

As can be seen, full mode is the crudest: it returns every possible word. It has two main problems:
1. It ignores context, so ambiguity arises easily: the phrase for "collaborative filtering" is cut into "collaborative" + a spurious overlapping word + "filtering".
2. It does not recognize unknown words: the word for "robust" is cut character by character ("Lu" + "great").

Accurate mode or search engine mode can be chosen according to specific needs.

The 15 most relevant features selected for each class by mutual information:

Brief analysis of the 15 most relevant features selected for each class by mutual information:

1. Party and government offices: the selected features fit the class well; words such as "party and government", "Party Committee" and "grassroots" are highly consistent with it.
2. The Ministry of Education: most of the selected articles are concentrated in December, when most articles summarized the year's work. On comparison, the features roughly summarize the main work of that month.
3. Admissions office: a good fit. It is clear that many news items about going into high schools for outreach were published at that time. On checking, a large number of articles in the series "Famous teachers go to middle school" had indeed been published.
4. graduate school: the selected features fit the class well; words such as "master", "mentor" and "doctor" are highly consistent with it.
5. Ministry of science and technology: the selected features fit the class well; words such as "NSFC", "natural science" and "funds" are highly consistent with it.

The 15 most relevant features selected for each class by X^2:

Brief analysis of the 15 most relevant features selected for each class by X^2:

1. Party and government offices: the selected features fit the class well; words such as "party and government", "Party Committee" and "grassroots" are highly consistent with it.

2. The Ministry of Education: most of the selected articles are concentrated in December, when most articles summarized the year's work. On comparison, the features roughly summarize the main work of that month.

3. Admissions office: a good fit. It is clear that many news items about going into high schools for outreach were published at that time. On checking, a large number of articles in the series "Famous teachers go to middle school" had indeed been published.

4. graduate school: the selected features fit the class well; words such as "master", "mentor" and "doctor" are highly consistent with it.

5. Ministry of science and technology: the selected features fit the class well; words such as "NSFC", "natural science" and "funds" are highly consistent with it.

Brief comparative analysis of the 15 most relevant features selected for each class by mutual information and by X^2:

Because of the crawler, every article contains a sentence like (This article was last updated on 2021/12/29 19:05:00, cumulative hits: 877). Both algorithms nevertheless filter out this kind of information that repeats across all categories: for the terms in this sentence, N11 and N10 are both very high, so the terms score low and are filtered out.

Otherwise, the two methods agree fairly well on the selection and ranking of the first few features. The later features differ in emphasis: because X^2 selects on statistical significance, it picks more rare terms than MI does, and such terms are not reliable for classification. Of course, MI does not necessarily select the terms that maximize classification accuracy either. I therefore think the better remedy is to increase the sample size.

(2) Use Java or another commonly used language to implement a simple document classification system based on the Naive Bayes algorithm (decide whether a given official-document notice belongs to "Party and government offices", "The Ministry of Education", "Admissions office", "graduate school" or "Ministry of science and technology", choosing the most relevant of the 5 categories).

Compare and analyze the classification results with and without feature selection. Use the documents from question (1) for training and testing: in each category, 20 articles are used for training and 10 for testing.
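A minimal sketch of this split is shown below. The task does not say which 20 articles go into the training set, so taking the first 20 of each class is an assumption, and split_per_class is a hypothetical helper name.

```python
def split_per_class(docs_by_class, n_train=20):
    # docs_by_class: dict mapping class label -> list of documents.
    # Assumption: the first n_train documents of each class are used for
    # training and the remaining ones for testing.
    train, test = [], []
    for label, docs in docs_by_class.items():
        train += [(doc, label) for doc in docs[:n_train]]
        test += [(doc, label) for doc in docs[n_train:]]
    return train, test


# Placeholder corpus: 5 classes with 30 documents each.
corpus = {c: [f"{c}-{i}" for i in range(30)] for c in "ABCDE"}
train_set, test_set = split_per_class(corpus)
```

With 5 classes of 30 documents each, this yields 100 training documents and 50 test documents.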
In the report, include the overall design of the system, code screenshots (do not copy the source code; use screenshots), screenshots of the results, and a detailed description. The program should be commented in detail. Briefly introduce the Chinese word segmentation tool used. (20 points)

Overall design of the system:

Overall design:

Code screenshots, screenshots of the results, and a detailed description:

Step 1: Read the article data set

The text must be cleaned while it is read: the crawled articles contain extra characters such as \u3000, so additional processing is required. jieba is then used for segmentation to produce the list of articles. postingList and classVec correspond one to one, pairing each text with its correct class label.

Generate the bag of words from the list of articles

The next step is training the NB classifier

The Naive Bayes formula is as follows:
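For reference, the multinomial Naive Bayes estimates given in Chapter 13 of the textbook, with add-one (Laplace) smoothing, are shown below. That the implementation here uses exactly this smoothing is an assumption.

```latex
\hat{P}(c) = \frac{N_c}{N}
\qquad
\hat{P}(t \mid c) = \frac{T_{ct} + 1}{\sum_{t' \in V} T_{ct'} + |V|}
```

Here N_c is the number of training documents in class c, N the total number of training documents, T_{ct} the number of occurrences of term t in the training documents of class c, and V the vocabulary.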

The training pseudocode is as follows:

The implementation is as follows:

Finally, we obtain the conditional probability of each term:
condprob[term][c] is the conditional probability of term in category c
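The training step can be sketched as below, following the structure of the textbook's TrainMultinomialNB with add-one smoothing. This is an illustration rather than the code in the screenshots; only the condprob[term][c] layout is taken from the text, and the small corpus is the toy example from Chapter 13 of the textbook.

```python
from collections import Counter, defaultdict


def train_multinomial_nb(classes, docs):
    # docs: list of (tokens, label) pairs, tokens already segmented.
    vocab = sorted({t for tokens, _ in docs for t in tokens})
    prior = {}
    condprob = defaultdict(dict)  # condprob[term][c] = P(term | c)
    for c in classes:
        class_docs = [tokens for tokens, label in docs if label == c]
        prior[c] = len(class_docs) / len(docs)
        # Term occurrence counts over all documents of class c.
        counts = Counter(t for tokens in class_docs for t in tokens)
        total = sum(counts.values()) + len(vocab)  # add-one smoothing
        for t in vocab:
            condprob[t][c] = (counts[t] + 1) / total
    return vocab, prior, condprob


toy_docs = [
    (["Chinese", "Beijing", "Chinese"], "China"),
    (["Chinese", "Chinese", "Shanghai"], "China"),
    (["Chinese", "Macao"], "China"),
    (["Tokyo", "Japan", "Chinese"], "Japan"),
]
vocab, prior, condprob = train_multinomial_nb(["China", "Japan"], toy_docs)
```

On this toy corpus the vocabulary has 6 terms, so P(Chinese | China) = (5 + 1) / (8 + 6) = 3/7 and P(Chinese | Japan) = (1 + 1) / (3 + 6) = 2/9.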

Applying the Naive Bayes algorithm:

The formula is as follows; taking logarithms avoids the loss of precision (underflow) caused by multiplying many small probabilities:

The application pseudocode is as follows:

applyMultinomialNB returns the document category with the highest probability.
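In log form the decision rule is c_map = argmax over c of [ log P(c) + sum over k of log P(t_k | c) ]. The sketch below applies it to hand-written probabilities taken from the toy example in Chapter 13 of the textbook; the function signature is an assumption, and tokens outside the vocabulary are simply skipped.

```python
from math import log


def apply_multinomial_nb(classes, vocab, prior, condprob, tokens):
    # Keep only tokens seen in training; unknown tokens carry no evidence.
    known = [t for t in tokens if t in vocab]
    score = {}
    for c in classes:
        # Summing logs instead of multiplying probabilities avoids underflow.
        score[c] = log(prior[c]) + sum(log(condprob[t][c]) for t in known)
    return max(score, key=score.get)


# Probabilities from the textbook's toy example (layout: condprob[term][c]).
prior = {"China": 3 / 4, "Japan": 1 / 4}
condprob = {
    "Chinese": {"China": 3 / 7, "Japan": 2 / 9},
    "Tokyo": {"China": 1 / 14, "Japan": 2 / 9},
    "Japan": {"China": 1 / 14, "Japan": 2 / 9},
}
predicted = apply_multinomial_nb(
    ["China", "Japan"], set(condprob), prior, condprob,
    ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"],
)
```

Here the three occurrences of "Chinese" outweigh "Tokyo" and "Japan", so the document is assigned to the class "China".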

Classification results with feature selection:

Overall accuracy: 94%

Classification results without feature selection:

Overall accuracy: 86%
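For clarity, overall and per-class accuracy can be computed as in the sketch below; the helper names are hypothetical. Assuming the test set holds the 50 documents described in the task (10 per class), 94% corresponds to 47 correct predictions and 86% to 43.

```python
from collections import defaultdict


def accuracy(predicted, actual):
    # Fraction of test documents whose predicted label equals the true one.
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def per_class_accuracy(predicted, actual):
    # Accuracy computed separately over the documents of each true class.
    hits, totals = defaultdict(int), defaultdict(int)
    for p, a in zip(predicted, actual):
        totals[a] += 1
        hits[a] += p == a
    return {c: hits[c] / totals[c] for c in totals}
```

The per-class figures show which category drags the overall number down.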

Comparison of the classification results with and without feature selection:

As can be seen, classification is better with feature selection. With feature selection, the keywords that distinguish the categories are identified more precisely; without it, more redundant words interfere with the decision.

In addition, under both settings the classification accuracy for the academic affairs department is not ideal. Looking at the specific articles, I think the likely cause is that this department publishes too many kinds of articles and the data set is too small to show a regular pattern, which lowers the classification accuracy. A feasible fix, I suspect, is to increase the sample size and enrich the corresponding vocabulary.

Copyright notice
This article was written by [Alex_ SCY]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/185/202207041213182154.html
