
[Semi-supervised classification] Semi-supervised web page classification based on K-means and label propagation

2022-06-10 03:56:00 FPGA and MATLAB

1. Software version

MATLAB 2013b

2. Theoretical knowledge of this algorithm

This method integrates the K-means algorithm with the local and global consistency (LGC) algorithm. It is not a simple patchwork of the two algorithms; rather, it combines the ideas of both. The basic steps of the algorithm are as follows:

-----------------------------------------------------------------------------------------------------

Input: a data set (with training and test samples in given proportions) and the corresponding images, in which a small number of samples have already been labeled with their categories, and each class has at least one labeled training sample.

-----------------------------------------------------------------------------------------------------

Step1 Compute the class-wise mean of the small number of labeled samples to obtain the c (number of categories) initial cluster centers;

Step2 Use the Euclidean distance to compute the distance from each unlabeled sample to the c initial centers, assign each unlabeled sample to the category of its nearest center, and divide the data into c clusters;

Step3 Using the geodesic-distance similarity measure, select in each cluster the samples whose similarity is greater than or equal to 0.9 (the number differs per cluster), compute their mean as the c new centers, and obtain the c average radii;

Step4 Repeat Steps 2 and 3 until the c centers no longer change;

Step5 Label the samples that lie within the average radius of each center;

Step6 Label the remaining unlabeled samples with the local and global consistency algorithm, where the labeled data use only the c centers (an existing implementation is available);

Step7 After all samples have been labeled, recompute the c centers.

Step8 For new test data, compute the similarity between the test data and each center and assign the label of the center with the highest confidence.

-----------------------------------------------------------------------------------------------------

Output: the data set is divided into labeled, unlabeled, and test subsets; the test subset accounts for 30%, and the labeled and unlabeled data together account for 70%. Ten-fold cross-validation is used, and the F1-measure and related indicators are output, together with the classified images. The labeled data serve as the training set, ensuring that each class has at least one labeled training sample; the training set is then expanded in different proportions. The precision and recall of a data set are reported as the mean values over the unlabeled data and the test data. The test set is evaluated under different proportions of labeled data.
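To make the metric output concrete, here is a minimal MATLAB sketch of computing per-class precision, recall, and F1-measure from true and predicted labels; the variable names (true_labels, pred_labels) and the toy data are illustrative assumptions, not the original evaluation code.

% Minimal sketch (assumed variable names): per-class precision, recall, F1
true_labels = [1 1 2 2 3 3];          % example ground-truth labels
pred_labels = [1 2 2 2 3 1];          % example predicted labels
classes = unique(true_labels);
F1 = zeros(1, numel(classes));
for k = 1:numel(classes)
    c  = classes(k);
    tp = sum(pred_labels == c & true_labels == c);
    fp = sum(pred_labels == c & true_labels ~= c);
    fn = sum(pred_labels ~= c & true_labels == c);
    precision = tp / max(tp + fp, eps);
    recall    = tp / max(tp + fn, eps);
    F1(k)     = 2 * precision * recall / max(precision + recall, eps);
end
fprintf('mean F1-measure: %.4f\n', mean(F1));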

3. Programming notes

Here, the corresponding code is written for each step, and the final results are output:

Step1 Compute the class-wise mean of the small number of labeled samples to obtain the c (number of categories) initial cluster centers;

This step is the main content of step1 in the code.
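As an illustration, a minimal MATLAB sketch of this step follows; the variable names (X_labeled, y_labeled, centers) and the toy data are assumptions for demonstration rather than the original step1 code.

% Step1 sketch: class-wise means of the labeled samples as initial centers
X_labeled = [1 2; 1.2 1.8; 5 5; 5.5 4.8];   % example labeled samples (rows)
y_labeled = [1; 1; 2; 2];                    % example class labels
classes   = unique(y_labeled);
c         = numel(classes);                  % number of categories
centers   = zeros(c, size(X_labeled, 2));
for k = 1:c
    centers(k, :) = mean(X_labeled(y_labeled == classes(k), :), 1);
end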

Step2 Use the Euclidean distance to compute the distance from each unlabeled sample to the c initial centers, assign each unlabeled sample to the category of its nearest center, and divide the data into c clusters;

This step produces a preliminary classification of the data.
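A minimal sketch of this assignment, with assumed variable names (X_unlabeled, centers) and toy data; it is not the original code:

% Step2 sketch: assign each unlabeled sample to its nearest initial center
X_unlabeled = [1.1 2.1; 4.9 5.2; 0.8 1.9];   % example unlabeled samples
centers     = [1 2; 5 5];                     % example initial centers from Step1
c = size(centers, 1);
n = size(X_unlabeled, 1);
dist2 = zeros(n, c);
for k = 1:c
    diff        = X_unlabeled - repmat(centers(k, :), n, 1);
    dist2(:, k) = sum(diff.^2, 2);            % squared Euclidean distance
end
[~, cluster_idx] = min(dist2, [], 2);          % cluster index per sample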

Step3 Using the geodesic-distance similarity measure, select in each cluster the samples whose similarity is greater than or equal to 0.9 (the number differs per cluster), compute their mean as the c new centers, and obtain the c average radii;

Step4 Repeat Steps 2 and 3 until the c centers no longer change;

Since cosine similarity is used as the similarity measure in this implementation, both of these steps are completed together in the code.
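A combined sketch of Steps 2 to 4, wrapping the nearest-center assignment in a loop and using cosine similarity with the 0.9 threshold from the algorithm description; the variable names, toy data, and convergence tolerance are illustrative assumptions:

% Step2-Step4 sketch: assign by Euclidean distance, update centers from
% samples whose cosine similarity to their center is >= 0.9, repeat until
% the centers no longer move
X       = [1 2; 1.2 1.8; 5 5; 5.5 4.8; 1.1 2.1; 4.9 5.2];  % example samples
centers = [1 2; 5 5];                                       % example initial centers (Step1)
c       = size(centers, 1);
n       = size(X, 1);
radius  = zeros(c, 1);
for iter = 1:100
    old_centers = centers;
    % Step2: nearest center by Euclidean distance
    dist2 = zeros(n, c);
    for k = 1:c
        diff        = X - repmat(centers(k, :), n, 1);
        dist2(:, k) = sum(diff.^2, 2);
    end
    [~, cluster_idx] = min(dist2, [], 2);
    % Step3: keep only members with cosine similarity >= 0.9 to their center
    sim = (X * centers') ./ (sqrt(sum(X.^2, 2)) * sqrt(sum(centers.^2, 2))');
    for k = 1:c
        in_k = (cluster_idx == k) & (sim(:, k) >= 0.9);
        if any(in_k)
            centers(k, :) = mean(X(in_k, :), 1);
            d = sqrt(sum((X(in_k, :) - repmat(centers(k, :), sum(in_k), 1)).^2, 2));
            radius(k)     = mean(d);           % average radius of cluster k
        end
    end
    % Step4: stop when the centers are fixed
    if max(abs(centers(:) - old_centers(:))) < 1e-6, break; end
end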

Step5 Label the samples that lie within the average radius of each center;

For this step, the radius is computed in the code; for the convenience of the later steps, classified samples are marked with their cluster number and unclassified samples are numbered 0.
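A sketch of this marking scheme, with assumed names (centers, radius carried over from the previous steps) and toy data; samples outside every average radius keep the label 0:

% Step5 sketch: samples inside a cluster's average radius get that cluster's
% number; all others are marked 0 (unclassified) for the later steps
X       = [1 2; 1.2 1.8; 5 5; 9 9];        % example samples
centers = [1 2; 5 5];                       % example centers from Step3/4
radius  = [0.5; 0.5];                       % example average radii
n       = size(X, 1);
labels  = zeros(n, 1);                      % 0 = not yet classified
for i = 1:n
    d = sqrt(sum((repmat(X(i, :), size(centers, 1), 1) - centers).^2, 2));
    [dmin, k] = min(d);
    if dmin <= radius(k)
        labels(i) = k;                      % mark with its cluster number
    end
end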

Step6 Label the remaining unlabeled samples with the local and global consistency algorithm, where the labeled data use only the c centers;
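For reference, a minimal sketch of the standard local and global consistency propagation F = (I - alpha*S)^(-1) * Y, in which only the c centers carry labels as described; the Gaussian kernel width sigma, alpha, and the toy data are illustrative assumptions, not the values used in the original program:

% Step6 sketch: label propagation with local and global consistency (LGC)
X      = [1 2; 5 5; 1.1 2.1; 4.9 5.2; 3 3];   % first c rows = the c centers
c      = 2;                                    % number of classes
alpha  = 0.99;                                 % LGC propagation parameter (assumed)
sigma  = 1.0;                                  % Gaussian kernel width (assumed)
n      = size(X, 1);
% affinity matrix W with zero diagonal
D2 = zeros(n);
for i = 1:n
    for j = 1:n
        D2(i, j) = sum((X(i, :) - X(j, :)).^2);
    end
end
W = exp(-D2 / (2 * sigma^2));
W(logical(eye(n))) = 0;
% symmetrically normalized similarity S = D^(-1/2) * W * D^(-1/2)
Dinv = diag(1 ./ sqrt(sum(W, 2)));
S    = Dinv * W * Dinv;
% initial label matrix: only the c centers are labeled
Y = zeros(n, c);
Y(1:c, :) = eye(c);
F = (eye(n) - alpha * S) \ Y;                  % closed-form LGC solution
[~, labels] = max(F, [], 2);                   % propagated labels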

Step7 After all samples have been labeled, recompute the c centers.

Step8 For new test data, compute the similarity between the test data and each center and assign the label of the center with the highest confidence.
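A minimal sketch of this final step, labeling each test sample with its most similar center (cosine similarity, as used above); the names and toy data are illustrative:

% Step8 sketch: label each test sample with the most similar center
centers = [1 2; 5 5];                          % example final centers from Step7
X_test  = [1.05 2.0; 5.2 4.9];                 % example test samples
sim = (X_test * centers') ./ ...
      (sqrt(sum(X_test.^2, 2)) * sqrt(sum(centers.^2, 2))');
[confidence, test_labels] = max(sim, [], 2);   % highest-confidence class per sample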

The running results are as follows:

The classification results and classification accuracy of the test set.

First, the images are read in and features are extracted; the extracted features are then classified; finally, the images to be tested are classified and the classification accuracy is computed.

The operation results are as follows:

A09-18


Copyright notice
This article was created by [FPGA and MATLAB]. Please include a link to the original article when reposting. Thank you.
https://yzsam.com/2022/161/202206100338097572.html