
[Semi-supervised classification] Semi-supervised web page classification based on K-means and label propagation

2022-06-10 03:56:00 FPGA and MATLAB

1. Software version

MATLAB 2013b

2. Theoretical knowledge of this algorithm

This method integrates the K-means algorithm with the local and global consistency (LGC) algorithm. It is not a simple patchwork of the two algorithms; rather, it combines the ideas of both. The basic steps of the algorithm are as follows:

-----------------------------------------------------------------------------------------------------

Input: a data set (with training and test samples in given proportions) and the corresponding images, in which a small number of samples have already been labeled with their categories, and each class has at least one labeled training sample.

-----------------------------------------------------------------------------------------------------

Step1 Compute the class-wise mean of the small number of labeled samples to obtain the c (number of categories) initial cluster centers;

Step2 Use the Euclidean distance to compute the distance from each unlabeled sample to the c initial centers, assign each unlabeled sample to the category of its nearest center, and divide the data into c clusters;

Step3 Using the geodesic-distance similarity measure, select in each cluster the samples whose similarity is greater than or equal to 0.9 (the number differs per cluster), compute their mean as the c new centers, and obtain the c average radii;

Step4 Repeat Steps 2 and 3 until the c centers no longer change;

Step5 Label the samples that lie within the average radius of each center;

Step6 Label the remaining unlabeled samples with the local and global consistency algorithm, where the labeled data use only the c centers (an existing implementation is available);

Step7 After all samples have been labeled, recompute the c centers.

Step8 For new test data, compute the similarity between the test data and each center and assign the label of the center with the highest confidence.

-----------------------------------------------------------------------------------------------------

Output: the data set is divided into labeled, unlabeled, and test subsets; the test subset accounts for 30%, and the labeled and unlabeled data together account for 70%. Ten-fold cross-validation is used, and the F1-measure and related indicators are output, together with the classified images. The labeled data serve as the training set, ensuring that each class has at least one labeled training sample; the training set is then expanded in different proportions. The precision and recall of a data set are reported as the mean values over the unlabeled data and the test data. The test set is evaluated under different proportions of labeled data.
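To make the metric output concrete, here is a minimal MATLAB sketch of computing per-class precision, recall, and F1-measure from true and predicted labels; the variable names (true_labels, pred_labels) and the toy data are illustrative assumptions, not the original evaluation code.

% Minimal sketch (assumed variable names): per-class precision, recall, F1
true_labels = [1 1 2 2 3 3];          % example ground-truth labels
pred_labels = [1 2 2 2 3 1];          % example predicted labels
classes = unique(true_labels);
F1 = zeros(1, numel(classes));
for k = 1:numel(classes)
    c  = classes(k);
    tp = sum(pred_labels == c & true_labels == c);
    fp = sum(pred_labels == c & true_labels ~= c);
    fn = sum(pred_labels ~= c & true_labels == c);
    precision = tp / max(tp + fp, eps);
    recall    = tp / max(tp + fn, eps);
    F1(k)     = 2 * precision * recall / max(precision + recall, eps);
end
fprintf('mean F1-measure: %.4f\n', mean(F1));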

3. Programming notes

Here, the corresponding code is written for each step, and the final results are output:

Step1 Compute the class-wise mean of the small number of labeled samples to obtain the c (number of categories) initial cluster centers;

This step is the main content of step1 in the code.
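As an illustration, a minimal MATLAB sketch of this step follows; the variable names (X_labeled, y_labeled, centers) and the toy data are assumptions for demonstration rather than the original step1 code.

% Step1 sketch: class-wise means of the labeled samples as initial centers
X_labeled = [1 2; 1.2 1.8; 5 5; 5.5 4.8];   % example labeled samples (rows)
y_labeled = [1; 1; 2; 2];                    % example class labels
classes   = unique(y_labeled);
c         = numel(classes);                  % number of categories
centers   = zeros(c, size(X_labeled, 2));
for k = 1:c
    centers(k, :) = mean(X_labeled(y_labeled == classes(k), :), 1);
end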

Step2 Use the Euclidean distance to compute the distance from each unlabeled sample to the c initial centers, assign each unlabeled sample to the category of its nearest center, and divide the data into c clusters;

This step produces a preliminary classification of the data.
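A minimal sketch of this assignment, with assumed variable names (X_unlabeled, centers) and toy data; it is not the original code:

% Step2 sketch: assign each unlabeled sample to its nearest initial center
X_unlabeled = [1.1 2.1; 4.9 5.2; 0.8 1.9];   % example unlabeled samples
centers     = [1 2; 5 5];                     % example initial centers from Step1
c = size(centers, 1);
n = size(X_unlabeled, 1);
dist2 = zeros(n, c);
for k = 1:c
    diff        = X_unlabeled - repmat(centers(k, :), n, 1);
    dist2(:, k) = sum(diff.^2, 2);            % squared Euclidean distance
end
[~, cluster_idx] = min(dist2, [], 2);          % cluster index per sample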

Step3 Using the geodesic-distance similarity measure, select in each cluster the samples whose similarity is greater than or equal to 0.9 (the number differs per cluster), compute their mean as the c new centers, and obtain the c average radii;

Step4 Repeat Steps 2 and 3 until the c centers no longer change;

Since cosine similarity is used as the similarity measure in this implementation, both of these steps are completed together in the code.
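A combined sketch of Steps 2 to 4, wrapping the nearest-center assignment in a loop and using cosine similarity with the 0.9 threshold from the algorithm description; the variable names, toy data, and convergence tolerance are illustrative assumptions:

% Step2-Step4 sketch: assign by Euclidean distance, update centers from
% samples whose cosine similarity to their center is >= 0.9, repeat until
% the centers no longer move
X       = [1 2; 1.2 1.8; 5 5; 5.5 4.8; 1.1 2.1; 4.9 5.2];  % example samples
centers = [1 2; 5 5];                                       % example initial centers (Step1)
c       = size(centers, 1);
n       = size(X, 1);
radius  = zeros(c, 1);
for iter = 1:100
    old_centers = centers;
    % Step2: nearest center by Euclidean distance
    dist2 = zeros(n, c);
    for k = 1:c
        diff        = X - repmat(centers(k, :), n, 1);
        dist2(:, k) = sum(diff.^2, 2);
    end
    [~, cluster_idx] = min(dist2, [], 2);
    % Step3: keep only members with cosine similarity >= 0.9 to their center
    sim = (X * centers') ./ (sqrt(sum(X.^2, 2)) * sqrt(sum(centers.^2, 2))');
    for k = 1:c
        in_k = (cluster_idx == k) & (sim(:, k) >= 0.9);
        if any(in_k)
            centers(k, :) = mean(X(in_k, :), 1);
            d = sqrt(sum((X(in_k, :) - repmat(centers(k, :), sum(in_k), 1)).^2, 2));
            radius(k)     = mean(d);           % average radius of cluster k
        end
    end
    % Step4: stop when the centers are fixed
    if max(abs(centers(:) - old_centers(:))) < 1e-6, break; end
end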

Step5 Label the samples that lie within the average radius of each center;

For this step, the radius is computed in the code; for the convenience of the later steps, classified samples are marked with their cluster number and unclassified samples are numbered 0.
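A sketch of this marking scheme, with assumed names (centers, radius carried over from the previous steps) and toy data; samples outside every average radius keep the label 0:

% Step5 sketch: samples inside a cluster's average radius get that cluster's
% number; all others are marked 0 (unclassified) for the later steps
X       = [1 2; 1.2 1.8; 5 5; 9 9];        % example samples
centers = [1 2; 5 5];                       % example centers from Step3/4
radius  = [0.5; 0.5];                       % example average radii
n       = size(X, 1);
labels  = zeros(n, 1);                      % 0 = not yet classified
for i = 1:n
    d = sqrt(sum((repmat(X(i, :), size(centers, 1), 1) - centers).^2, 2));
    [dmin, k] = min(d);
    if dmin <= radius(k)
        labels(i) = k;                      % mark with its cluster number
    end
end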

Step6 Label the remaining unlabeled samples with the local and global consistency algorithm, where the labeled data use only the c centers;
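For reference, a minimal sketch of the standard local and global consistency propagation F = (I - alpha*S)^(-1) * Y, in which only the c centers carry labels as described; the Gaussian kernel width sigma, alpha, and the toy data are illustrative assumptions, not the values used in the original program:

% Step6 sketch: label propagation with local and global consistency (LGC)
X      = [1 2; 5 5; 1.1 2.1; 4.9 5.2; 3 3];   % first c rows = the c centers
c      = 2;                                    % number of classes
alpha  = 0.99;                                 % LGC propagation parameter (assumed)
sigma  = 1.0;                                  % Gaussian kernel width (assumed)
n      = size(X, 1);
% affinity matrix W with zero diagonal
D2 = zeros(n);
for i = 1:n
    for j = 1:n
        D2(i, j) = sum((X(i, :) - X(j, :)).^2);
    end
end
W = exp(-D2 / (2 * sigma^2));
W(logical(eye(n))) = 0;
% symmetrically normalized similarity S = D^(-1/2) * W * D^(-1/2)
Dinv = diag(1 ./ sqrt(sum(W, 2)));
S    = Dinv * W * Dinv;
% initial label matrix: only the c centers are labeled
Y = zeros(n, c);
Y(1:c, :) = eye(c);
F = (eye(n) - alpha * S) \ Y;                  % closed-form LGC solution
[~, labels] = max(F, [], 2);                   % propagated labels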

Step7 After all samples have been labeled, recompute the c centers.

Step8 For new test data, compute the similarity between the test data and each center and assign the label of the center with the highest confidence.
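A minimal sketch of this final step, labeling each test sample with its most similar center (cosine similarity, as used above); the names and toy data are illustrative:

% Step8 sketch: label each test sample with the most similar center
centers = [1 2; 5 5];                          % example final centers from Step7
X_test  = [1.05 2.0; 5.2 4.9];                 % example test samples
sim = (X_test * centers') ./ ...
      (sqrt(sum(X_test.^2, 2)) * sqrt(sum(centers.^2, 2))');
[confidence, test_labels] = max(sim, [], 2);   % highest-confidence class per sample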

The running results are as follows:

The classification results and classification accuracy of the test set.

First, the images are read in and features are extracted; the extracted features are then classified; finally, the images to be tested are classified and the classification accuracy is computed.

The operation results are as follows:

A09-18


Copyright notice
This article was created by [FPGA and MATLAB]. Please include a link to the original article when reposting. Thank you.
https://yzsam.com/2022/161/202206100338097572.html