
Li Mu [Practical Machine Learning] 1.4 Data Annotation

2022-06-12 20:54:00 【A summer of swans】


Table of contents

Preface
I. Semi-supervised learning
- Important algorithm: self-training
II. Crowdsourced labeling
Weakly supervised learning
Summary

Preface

Data annotation: a mind map

I. Semi-supervised learning

Only a small part of the data is labeled; most of it has no labels.
For example, on a web page a small number of visitors leave clear labels, but for most users we do not know what they did, because they gave no feedback and no comments. The question is how to use a small amount of labeled data together with a large amount of unlabeled data.
Assumptions:
1. Continuity assumption: if two samples have similar features, they are likely to have the same label.
2. Cluster assumption: users in the same group behave similarly. If the data has a good cluster structure, samples in the same cluster are assumed to share a label.
3. Manifold assumption: the data essentially lies in a lower-dimensional space, so dimensionality reduction can yield cleaner data.

Important algorithm: self-training

1. The key question is how to choose the confident samples, for example by keeping only predictions whose confidence exceeds a threshold.
2. A more expensive model (such as a deep neural network) can be used here, because it only labels data and is never deployed online. Accuracy matters more than cost.
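
The self-training loop can be sketched as follows: train on the labeled data, predict on the unlabeled data, move the confident predictions into the labeled set as pseudo-labels, and repeat. This is only a minimal sketch. The 1-D nearest-centroid model, the softmax-style confidence score, and the 0.9 threshold are assumptions chosen to keep the example self-contained, not something taken from the course.

```python
import math

def fit_centroids(xs, ys):
    """Fit a toy nearest-centroid model: the mean feature value of each class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict_with_confidence(centroids, x):
    """Return (label, confidence); confidence is a softmax over negative distances."""
    scores = {label: math.exp(-abs(x - c)) for label, c in centroids.items()}
    total = sum(scores.values())
    label = max(scores, key=scores.get)
    return label, scores[label] / total

def self_train(labeled_x, labeled_y, unlabeled_x, threshold=0.9, max_rounds=10):
    """Repeatedly pseudo-label the confident unlabeled samples and retrain."""
    xs, ys = list(labeled_x), list(labeled_y)
    pool = list(unlabeled_x)
    for _ in range(max_rounds):
        model = fit_centroids(xs, ys)
        remaining, added = [], 0
        for x in pool:
            label, confidence = predict_with_confidence(model, x)
            if confidence >= threshold:
                # Confident prediction: keep it as a pseudo-label.
                xs.append(x)
                ys.append(label)
                added += 1
            else:
                # Not confident enough: leave it unlabeled for a later round.
                remaining.append(x)
        pool = remaining
        if added == 0:
            break  # nothing new was labeled, so further rounds change nothing
    return fit_centroids(xs, ys), xs, ys, pool

if __name__ == "__main__":
    model, xs, ys, pool = self_train([0.0, 10.0], ["a", "b"], [0.5, 9.5, 5.0])
    print(ys)    # ['a', 'b', 'a', 'b']
    print(pool)  # [5.0]
```

The sample at 5.0 sits exactly between the two classes, so its confidence never reaches the threshold and it stays unlabeled. In practice the toy model would be replaced by a stronger one, which fits point 2 above.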

II. Crowdsourced labeling

Recruit many people over the Internet to label the data by hand.
The ImageNet dataset was built this way, with millions of images labeled.
Many data companies also offer data labeling as a service.

Things to consider

1. The tasks must be relatively simple, because annotators have different educational backgrounds.
2. Cost: estimate how many tasks the data will generate and how long each task takes, then multiply the two to work out how much the labeling will cost.
3. Annotation quality
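
The cost estimate in point 2 is simple arithmetic and can be sketched as below. The hourly rate and the number of annotators per task are assumed parameters added for illustration, and the figures in the usage example are made up.

```python
def estimate_labeling_cost(num_tasks, seconds_per_task, hourly_rate, labels_per_task=1):
    """Estimate the total labeling cost.

    num_tasks        -- how many tasks the data generates
    seconds_per_task -- how long one annotator needs for one task
    hourly_rate      -- assumed pay per annotator hour (hypothetical parameter)
    labels_per_task  -- how many annotators see each task (assumed, default 1)
    """
    total_hours = num_tasks * labels_per_task * seconds_per_task / 3600
    return total_hours * hourly_rate

if __name__ == "__main__":
    # 3600 tasks at 10 seconds each is 10 hours of work; at 12.0 per hour that is 120.0.
    print(estimate_labeling_cost(num_tasks=3600, seconds_per_task=10, hourly_rate=12.0))  # 120.0
```

Sending every task to three annotators triples the estimate, which is why the number of annotators per task is worth keeping as an explicit parameter.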

Solutions

1. In task design, reduce the complexity of the task.
2. Some easy images do not need to be labeled by people at all.

Active learning

A human stays in the loop.
The most informative unlabeled samples are selected and sent to people for labeling.
Algorithms:
1. Uncertainty sampling: train a model on the labeled data, then pick the samples the model is least sure about and send them to annotators.
2. Query by committee: train multiple models and let them vote on which samples are hardest, then select those samples for labeling.
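
Both selection strategies can be sketched in a few lines. This is an illustrative sketch, assuming the models' outputs are already available: predicted class probabilities for the first strategy and one predicted label per model for the second. Using entropy as the uncertainty measure and vote disagreement as the committee measure are assumptions; other measures would work as well.

```python
import math
from collections import Counter

def entropy(probs):
    """Predictive entropy of one probability vector (higher means less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(prob_rows, k):
    """Indices of the k samples whose predicted distribution has the highest entropy."""
    ranked = sorted(range(len(prob_rows)),
                    key=lambda i: entropy(prob_rows[i]), reverse=True)
    return ranked[:k]

def vote_disagreement(votes):
    """Fraction of committee members that disagree with the majority label."""
    majority_count = Counter(votes).most_common(1)[0][1]
    return 1 - majority_count / len(votes)

def select_by_committee(vote_rows, k):
    """Indices of the k samples on which the committee disagrees most."""
    ranked = sorted(range(len(vote_rows)),
                    key=lambda i: vote_disagreement(vote_rows[i]), reverse=True)
    return ranked[:k]

if __name__ == "__main__":
    probs = [[0.99, 0.01], [0.5, 0.5], [0.8, 0.2]]
    print(select_most_uncertain(probs, 2))  # [1, 2]

    votes = [["cat", "cat", "cat"], ["cat", "dog", "cat"], ["cat", "dog", "bird"]]
    print(select_by_committee(votes, 1))    # [2]
```

The returned indices are the samples that would be sent to human annotators next.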

The combination of self-learning and active learning

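The two can be combined: the model's confident predictions are kept as pseudo-labels, while the samples it is least sure about are sent to human annotators.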
3. Quality control
Annotators make mistakes.
1. Send each image or task to multiple annotators, although this multiplies the number of tasks.
2. To limit that cost, send a task to more annotators only when its result is uncertain.
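
One way to apply both points is to combine the answers by majority vote and request extra annotators only when agreement is low. The sketch below assumes this scheme; the 0.7 agreement threshold is an arbitrary illustrative value.

```python
from collections import Counter

def aggregate_labels(labels, min_agreement=0.7):
    """Combine several annotators' answers for one task.

    Returns (majority_label, agreement, needs_more_labels), where agreement is
    the fraction of annotators who chose the majority label.
    """
    majority_label, count = Counter(labels).most_common(1)[0]
    agreement = count / len(labels)
    return majority_label, agreement, agreement < min_agreement

if __name__ == "__main__":
    print(aggregate_labels(["cat", "cat", "cat"]))  # ('cat', 1.0, False)
    label, agreement, needs_more = aggregate_labels(["cat", "dog", "cat"])
    print(label, needs_more)                        # cat True
```

Tasks flagged with needs_more_labels would go to additional annotators, while tasks with clear agreement are accepted as they are.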

Weakly supervised learning

Labels are generated semi-automatically. They are somewhat less accurate than manually annotated labels, but good enough to train a model.
Data programming: use heuristics to label the data.
For example, summarize some labeling rules, encode them in a program, and let the program label the data according to those rules.
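
Data programming can be illustrated with a few rule functions whose votes are combined. This is a minimal sketch on a made-up spam-detection task. The three rules, the label constants, and the majority-vote combination are all hypothetical choices for illustration, not rules given in the course.

```python
SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_contains_link(text):
    """Heuristic: messages containing a link are likely spam."""
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_mentions_prize(text):
    """Heuristic: prize-style wording is likely spam."""
    lowered = text.lower()
    return SPAM if "free" in lowered or "winner" in lowered else ABSTAIN

def lf_short_message(text):
    """Heuristic: very short messages are usually ordinary chat."""
    return HAM if len(text.split()) <= 4 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]

def weak_label(text):
    """Majority vote over the rules that did not abstain; ABSTAIN on a tie or no votes."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    spam_votes = sum(votes)  # SPAM is 1 and HAM is 0, so the sum counts spam votes
    if spam_votes * 2 == len(votes):
        return ABSTAIN
    return SPAM if spam_votes * 2 > len(votes) else HAM

if __name__ == "__main__":
    print(weak_label("You are a WINNER, claim your free prize at http://example.com"))  # 1
    print(weak_label("see you tomorrow"))                                               # 0
    print(weak_label("The quarterly report is attached for your review"))               # -1
```

Samples on which every rule abstains, or on which the rules tie, stay unlabeled and can be left out of training or sent to human annotators.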

Summary

Ways to get more labels:
1. Self-training (for easy data)
2. Crowdsourcing: have people label the data (for hard data)
3. Weakly supervised learning (find the general rules people use to judge labels and let the machine apply them)

Original site

Copyright notice
This article was written by [A summer of swans]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202281435091811.html