
Li Mu [Practical Machine Learning] 1.4 Data Annotation

2022-06-12 20:54:00 A summer of swans



Preface

Data annotation: a mind map

I. Semi-supervised learning

Only a small part of the data is labeled; most of it has no labels.
For example, on a website a small number of visitors have clear labels, but for most users we do not know what they did, because they left no feedback or comments. The question is how to use the small labeled set and the large unlabeled set together.
Assumptions:
1. Continuity assumption: samples with similar features are likely to have the same label.
2. Cluster assumption: users in the same group behave similarly; if the data has a good cluster structure, samples in the same cluster are assumed to share a label.
3. Manifold assumption: the data intrinsically lies on a low-dimensional manifold, so dimensionality reduction can yield cleaner data.

Key algorithm: self-training

1. The key question is how to choose the confident samples.
2. A more expensive model (for example, a deep neural network) can be used here, because it is only used for data labeling and will never be deployed online. This makes the labels more accurate.
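
As a rough illustration of self-training, the loop below trains on the labeled data, predicts labels for the unlabeled pool, moves predictions above a confidence threshold into the labeled set, and repeats. This is a minimal sketch, not the lecture's code: the nearest-centroid classifier, the softmax-over-distance confidence score, and the names `self_train`, `threshold`, and `max_rounds` are all assumptions chosen to keep the example free of dependencies. In practice the model would be something much stronger, such as a deep neural network.

```python
import math


def fit_centroids(X, y):
    """Train a tiny nearest-centroid classifier: one mean vector per label."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_proba(centroids, x):
    """Return {label: probability} from a softmax over negative distances."""
    scores = {label: math.exp(-math.dist(x, c)) for label, c in centroids.items()}
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


def self_train(X_lab, y_lab, X_unlab, threshold=0.9, max_rounds=10):
    """Grow the labeled set with confident pseudo-labels (hypothetical helper)."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        model = fit_centroids(X_lab, y_lab)
        remaining, added = [], 0
        for x in pool:
            proba = predict_proba(model, x)
            label = max(proba, key=proba.get)
            if proba[label] >= threshold:
                # Confident prediction: keep it as a pseudo-label.
                X_lab.append(x)
                y_lab.append(label)
                added += 1
            else:
                # Not confident enough: leave it unlabeled for a later round.
                remaining.append(x)
        pool = remaining
        if added == 0 or not pool:
            break
    return X_lab, y_lab, pool
```

The threshold is the knob behind point 1 above: set it too low and wrong pseudo-labels pollute the training set, set it too high and almost nothing gets added.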

II. Crowdsourced labeling

Recruit many people over the Internet to label the data by hand.
The ImageNet dataset was built this way, with millions of images labeled.
Many data companies also offer data labeling as a service.

Things to consider

1. Tasks need to be relatively simple, because labelers have different educational backgrounds.
2. Cost: estimate how many tasks the data will generate and how long each task takes, then multiply the two to work out how much the labeling will cost.
3. Labeling quality
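
The cost estimate in point 2 is simple arithmetic, sketched below. The function name, the hourly rate, and the `labels_per_task` parameter (how many labelers see each task) are assumptions added for illustration; the numbers in the example call are made up.

```python
def estimate_labeling_cost(num_tasks, seconds_per_task, hourly_rate, labels_per_task=1):
    """Estimate total cost as tasks x time per task x pay rate (hypothetical helper)."""
    total_hours = num_tasks * seconds_per_task * labels_per_task / 3600
    return total_hours * hourly_rate


# 10,000 images, 30 seconds each, 3 labelers per image, 12 currency units per hour.
print(estimate_labeling_cost(10_000, 30, 12.0, labels_per_task=3))  # 3000.0
```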

Solutions

1. When designing the task, reduce its complexity.
2. Some images are simple enough that they do not need to be labeled by people.

Active learning

A human is in the loop.
The model picks the most important unlabeled data and sends it to people for labeling.
Algorithms:
1. Uncertainty sampling: train a model on the labeled data, then select the samples it is least sure about and have people label them.
2. Query by committee: train multiple models, let them vote on which samples are hard, then select those samples for labeling.
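
Both selection strategies can be sketched in a few lines. The code below is an illustration under assumptions, not the lecture's implementation: "least confident" (lowest top-class probability) is one of several possible uncertainty measures, vote entropy is one possible way to score committee disagreement, and the function names are hypothetical.

```python
import math
from collections import Counter


def least_confident(probas, k):
    """Uncertainty sampling: indices of the k samples with the lowest top-class probability."""
    order = sorted(range(len(probas)), key=lambda i: max(probas[i]))
    return order[:k]


def vote_entropy(votes):
    """Committee disagreement on one sample: entropy of the vote distribution."""
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in Counter(votes).values())


def query_by_committee(committee_votes, k):
    """committee_votes[i] holds the label each model predicted for sample i."""
    order = sorted(
        range(len(committee_votes)),
        key=lambda i: vote_entropy(committee_votes[i]),
        reverse=True,
    )
    return order[:k]


probas = [[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]]
print(least_confident(probas, 1))  # [1]

votes = [["cat", "cat", "cat"], ["cat", "dog", "cat"], ["cat", "dog", "bird"]]
print(query_by_committee(votes, 2))  # [2, 1]
```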

Combining self-training and active learning
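
One plausible way to combine the two, offered here as an assumption rather than as the lecture's exact procedure, is to route each unlabeled sample by the model's confidence: confident predictions become pseudo-labels (self-training), the least confident samples go to human labelers (active learning), and the rest wait for the next round. The thresholds `high` and `low` and the function name are hypothetical.

```python
def route_predictions(probas, high=0.9, low=0.6):
    """Split unlabeled sample indices by the model's top-class confidence."""
    pseudo_labeled, send_to_humans, undecided = [], [], []
    for i, p in enumerate(probas):
        confidence = max(p)
        if confidence >= high:
            # Confident: keep the predicted class index as a pseudo-label.
            pseudo_labeled.append((i, p.index(confidence)))
        elif confidence < low:
            # Very uncertain: worth paying a person to label.
            send_to_humans.append(i)
        else:
            # In between: leave for a later round.
            undecided.append(i)
    return pseudo_labeled, send_to_humans, undecided
```

After each round, the pseudo-labels and the human labels are merged into the training set and the model is retrained.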

3. Quality control
Labelers make mistakes.
1. Send each image or task to multiple labelers and combine their answers; the drawback is that the number of tasks multiplies.
2. Send a task to more people only when the result is uncertain.
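
The two quality-control ideas can be combined: take a majority vote over a few labelers and request more votes only when agreement is low. The sketch below assumes majority voting as the aggregation rule; the `min_agreement` threshold and the function name are hypothetical.

```python
from collections import Counter


def aggregate_labels(votes, min_agreement=0.7):
    """Return (majority_label, needs_more_votes) for one task."""
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    # Low agreement means the result is uncertain, so ask more labelers.
    return label, agreement < min_agreement


print(aggregate_labels(["cat", "cat", "cat"]))  # ('cat', False)
print(aggregate_labels(["cat", "dog", "cat"]))  # ('cat', True)
```

Escalating only the uncertain tasks keeps the extra labeling cost confined to the hard cases.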

Weakly supervised learning

Labels are generated semi-automatically. They are somewhat worse than manual labels, but good enough to train a model.
Data programming: use heuristics to label the data.
For example, summarize some labeling rules, put them into a program, and let the program label the data according to these rules.
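
A minimal sketch of data programming in plain Python follows. The spam-comment task, the three rules, and all names are made-up illustrations, and the votes are combined by simple majority; real data programming systems typically learn how much to trust each rule instead.

```python
ABSTAIN, HAM, SPAM = -1, 0, 1


def lf_contains_link(text):
    """Rule: comments containing a link are likely spam."""
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_asks_to_subscribe(text):
    """Rule: comments asking readers to subscribe are likely spam."""
    return SPAM if "subscribe" in text.lower() else ABSTAIN


def lf_short_comment(text):
    """Rule: very short comments are usually genuine."""
    return HAM if len(text.split()) <= 5 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_asks_to_subscribe, lf_short_comment]


def weak_label(text):
    """Combine the rules by majority vote; abstain when they are silent or tied."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    spam, ham = votes.count(SPAM), votes.count(HAM)
    if spam == ham:
        return ABSTAIN
    return SPAM if spam > ham else HAM
```

Each rule is allowed to abstain, so a rule only needs to be right on the cases it covers; samples where every rule abstains simply stay unlabeled.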


Summary

Ways to get more labels:
1. Self-training (for easy data)
2. Crowdsourcing: have people label the data (for hard data)
3. Weakly supervised learning (find the general rules people use to decide labels, and let the machine apply them)
