
Active learning overview, strategy and uncertainty measurement

2022-06-22 04:18:00 【deephub】

Active learning is the process of prioritizing which data to label, so that labeling effort goes to the data that will have the greatest impact on training a supervised model.

Active learning is a learning strategy in which the algorithm interactively queries a user (a teacher or oracle) to label new data points with their true labels. The process is also called optimal experimental design.
The motivation for active learning is the recognition that not all labeled samples are equally important.
By prioritizing the experts' labeling work, active learning can greatly reduce the amount of labeled data needed to train a model, lowering cost while improving accuracy.
Active learning is a strategy or algorithm that enhances an existing model. It is not a new model architecture.
Active learning is easy to understand but not easy to implement.

"The key idea behind active learning is that a machine learning algorithm can achieve greater accuracy with fewer training labels if it is allowed to choose the data from which it learns." (Active Learning Literature Survey, Burr Settles)

Introduction to active learning

Rather than collecting labels for all of the data at once, active learning prioritizes the data that the model finds most difficult and requests labels only for those. The model is then trained on this small amount of labeled data, and after training it again requests labels for the data it is most uncertain about.

By prioritizing uncertain samples, the model lets the (human) experts focus on providing the most useful information. This helps the model learn faster and lets experts skip data that would not help the model much. In some cases this greatly reduces the number of labels that must be collected from experts while still producing a good model, which saves time and money on machine learning projects.

Active learning strategies

There are many papers on how to choose the data points and how to iterate on the approach. This article covers the most common and direct methods, because they are the simplest and easiest to understand.

The steps for using active learning on an unlabeled dataset are:

First, manually label a very small subsample of the data.
Once a small amount of labeled data is available, train a model on it. The model will certainly not be great, but it helps us see which regions of the parameter space should be labeled first.
After training, use the model to predict the class of every remaining unlabeled data point.
Based on the model's predictions, compute a score for each unlabeled data point (a later section introduces some of the most commonly used scores).
Once a scoring method for prioritizing labels has been chosen, the process is repeated iteratively: a new model is trained on the dataset extended with the points that were labeled according to the priority score. Once the new model is trained, the remaining unlabeled data points are run through it, the priority scores are updated, and labeling continues.

In this way, the labeling strategy keeps improving as the model gets better.
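
The loop described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the author's implementation: the data is a 1-D grid, `oracle` is a hypothetical stand-in for the human expert (with a true class boundary at 0.6), and the "model" is a simple threshold placed midway between the two classes.

```python
import math

def oracle(x):
    """Hypothetical human expert: the true class boundary sits at 0.6."""
    return 1 if x > 0.6 else 0

def train(labeled):
    """Fit a toy 1-D model: a threshold midway between the two classes."""
    zeros = [x for x, y in labeled if y == 0]
    ones = [x for x, y in labeled if y == 1]
    return (max(zeros) + min(ones)) / 2.0

def predict_proba(threshold, x, sharpness=10.0):
    """Probability of class 1, from a sigmoid centered on the threshold."""
    return 1.0 / (1.0 + math.exp(-sharpness * (x - threshold)))

def uncertainty(threshold, x):
    """Priority score: 0 means certain, 0.5 means a coin flip."""
    p = predict_proba(threshold, x)
    return 1.0 - max(p, 1.0 - p)

def active_learning_loop(pool, seed, rounds):
    labeled = [(x, oracle(x)) for x in seed]      # label a tiny seed set by hand
    unlabeled = [x for x in pool if x not in seed]
    for _ in range(rounds):
        threshold = train(labeled)                # train on what is labeled so far
        # score every unlabeled point and pick the most uncertain one
        query = max(unlabeled, key=lambda x: uncertainty(threshold, x))
        labeled.append((query, oracle(query)))    # ask the expert, then repeat
        unlabeled.remove(query)
    return train(labeled), labeled

pool = [i / 200 for i in range(201)]              # 201 unlabeled points in [0, 1]
threshold, labeled = active_learning_loop(pool, seed=[0.0, 1.0], rounds=8)
print(f"estimated boundary {threshold:.3f} using {len(labeled)} labels")
```

With only ten labels out of 201 points, the estimated boundary lands close to the true one, because every query is spent near the current decision boundary.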

Active learning method based on data streams

In stream-based active learning, the set of all training samples is presented to the algorithm as a stream. Each sample is sent to the algorithm individually, and the algorithm must decide immediately whether to request a label for it. The selected samples are labeled by the oracle (a human domain expert), and the algorithm receives the label immediately, before the next sample is shown.
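
A minimal sketch of this per-sample decision, assuming a fixed uncertainty threshold of 0.3 (an arbitrary value chosen for illustration). The names `toy_predict_proba` and `toy_oracle` are hypothetical stand-ins for a real model and a human expert. A real system would usually also retrain the model as labels arrive.

```python
def least_confidence(probs):
    """1 minus the top probability: higher means the model is less sure."""
    return 1.0 - max(probs)

def stream_selective_sampling(stream, predict_proba, ask_oracle, threshold=0.3):
    """Decide for each sample, as it arrives, whether to request a label."""
    labeled = []
    for x in stream:
        if least_confidence(predict_proba(x)) > threshold:
            # the label is received before the next sample is looked at
            labeled.append((x, ask_oracle(x)))
    return labeled

def toy_predict_proba(x):
    """Hypothetical fixed model on 1-D inputs: P(class 1) = x clipped to [0, 1]."""
    p = min(max(x, 0.0), 1.0)
    return [1.0 - p, p]

def toy_oracle(x):
    """Hypothetical expert."""
    return int(x > 0.5)

queried = stream_selective_sampling([0.05, 0.45, 0.9, 0.6], toy_predict_proba, toy_oracle)
print(queried)  # [(0.45, 0), (0.6, 1)]
```

Only the two samples the toy model is unsure about are sent to the oracle; the confident ones are skipped.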

Active learning method based on a data pool

In pool-based sampling, training samples are selected from a large pool of unlabeled data. The samples selected from this pool are labeled by the oracle.
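
As a rough sketch, pool-based selection amounts to scoring the whole pool and sending the top-ranked items to the oracle as a batch. The probabilities below are made-up model outputs, and ranking by the lowest top probability is just one possible choice of score.

```python
import heapq

def select_batch(pool_probs, batch_size):
    """Return the ids of the items whose top predicted probability is lowest."""
    return heapq.nsmallest(batch_size, pool_probs, key=lambda i: max(pool_probs[i]))

# Hypothetical model outputs for four unlabeled items.
pool_probs = {
    "a": [0.98, 0.01, 0.01],
    "b": [0.40, 0.35, 0.25],
    "c": [0.70, 0.20, 0.10],
    "d": [0.55, 0.30, 0.15],
}
print(select_batch(pool_probs, 2))  # ['b', 'd']
```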

Active learning method based on query by committee

The query-by-committee approach uses multiple models instead of one.

Query by Committee maintains a collection of models (the collection is called a committee) and, by querying them (a vote), selects the most "controversial" data point as the next one to be labeled. With such a committee, the approach can overcome the restrictive hypothesis that a single model can express (at the start of a task we do not know which hypothesis to use).
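
One common way to measure how "controversial" a point is uses the entropy of the committee's votes; the article does not say which disagreement measure it has in mind, so vote entropy is an assumption here. The committee below is a hypothetical set of three threshold classifiers on 1-D inputs.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy (in bits) of the committee's votes for one data point."""
    total = len(votes)
    return 0.0 - sum((c / total) * math.log2(c / total) for c in Counter(votes).values())

def most_controversial(pool, committee):
    """Return the pool item on which the committee's votes are most split."""
    return max(pool, key=lambda x: vote_entropy([member(x) for member in committee]))

# Hypothetical committee: three threshold classifiers that disagree between 0.3 and 0.7.
committee = [
    lambda x: int(x > 0.3),
    lambda x: int(x > 0.5),
    lambda x: int(x > 0.7),
]
pool = [0.1, 0.4, 0.9]
print(most_controversial(pool, committee))  # 0.4
```

The committee is unanimous on 0.1 and 0.9, so 0.4, where the vote is split two to one, is chosen for labeling.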

Uncertainty measurement

The process of identifying the most valuable samples to label next is called the "sampling strategy" or "query strategy". The scoring function used in this process is called the "acquisition function". The score means: the higher a data point's score, the more valuable its label is expected to be for training the model. There are many different sampling strategies, such as uncertainty sampling, diversity sampling, and expected model change. This article focuses only on uncertainty measures, the most commonly used strategy.

Uncertainty sampling is a set of techniques for identifying unlabeled samples near the decision boundary of the current machine learning model. The most informative samples are the ones the classifier is most uncertain about, and these are likely to lie near the classification boundary. By observing these most difficult samples, the learning algorithm gains more information about the class boundaries.

Let's take a concrete example. Suppose you are building a multi-class classifier to distinguish three classes: cat, dog, and horse. The model might give the following prediction:

{
    "Prediction": {
        "Label": "Cat",
        "Prob": {
            "Cat": 0.9352784428596497,
            "Horse": 0.05409964170306921,
            "Dog": 0.038225741147994995
        }
    }
}

This output most likely comes from a softmax, which uses exponentials to convert the logits into scores in the 0-1 range.
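
For reference, a softmax can be written in a few lines. The logits below are made-up values for Cat, Horse, and Dog, chosen only to illustrate the conversion.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([4.0, 1.0, 0.5])              # hypothetical logits: Cat, Horse, Dog
print([round(p, 4) for p in probs])
```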

Least confidence

Least confidence is the difference between 1 (100% confidence) and the confidence of the most confident label for each item.

Items can be ranked by confidence alone, but it is useful to convert the uncertainty score to the 0-1 range, where 1 is the most uncertain score. To do this we must normalize the score: we subtract the top confidence from 1 and multiply the result by n/(n-1), where n is the number of labels. This works because the top confidence can never be lower than 1/n, the value it takes when all labels have the same predicted confidence.

Let's apply it to the above example. The uncertainty score is: (1 - 0.9352) * (3/2) = 0.0972.

Least confidence is the simplest and most common method. It gives a ranking of the predictions, so that the items whose predicted label has the lowest confidence can be sampled.
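
The same calculation as a short Python function, applied to the example probabilities (kept at full precision here, so the result differs slightly in the last digits from the hand calculation with rounded values):

```python
def least_confidence(probs):
    """Normalized least-confidence score: 0 = certain, 1 = maximally uncertain."""
    n = len(probs)
    return (1.0 - max(probs)) * n / (n - 1)

probs = [0.9352784428596497, 0.05409964170306921, 0.038225741147994995]
print(round(least_confidence(probs), 3))  # 0.097
```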

Margin of confidence sampling

The most intuitive form of uncertainty sampling is the difference between the two most confident predictions. In other words, how large is the gap between the label the model predicted and the label with the second-highest confidence? With p1 the highest predicted probability and p2 the second highest, this is defined as:

margin of confidence = 1 - (p1 - p2)

Subtracting the difference from 1 again puts the score in the 0-1 range, with 1 as the most uncertain value. Because the maximum possible score is already 1, no further normalization is needed.

Let's apply margin of confidence sampling to the example data above. "Cat" and "Horse" are the top two labels, so the uncertainty score is 1.0 - (0.9352 - 0.0540) = 0.1188.
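
A minimal Python version of this score, using the example probabilities:

```python
def margin_of_confidence(probs):
    """1 minus the gap between the two highest probabilities (1 = most uncertain)."""
    top, second = sorted(probs, reverse=True)[:2]
    return 1.0 - (top - second)

probs = [0.9352784428596497, 0.05409964170306921, 0.038225741147994995]
print(round(margin_of_confidence(probs), 3))  # 0.119
```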

Ratio sampling

The ratio of confidence is a variation on the margin of confidence: it uses the ratio between the top two scores rather than their difference.
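
The article gives no formula for the ratio, so the sketch below makes an assumption: it divides the second-highest probability by the highest, which keeps the score in the 0-1 range with 1 as the most uncertain value, like the other measures. Some presentations use the inverse ratio instead.

```python
def ratio_of_confidence(probs):
    """Second-highest probability divided by the highest (assumed convention)."""
    top, second = sorted(probs, reverse=True)[:2]
    return second / top

probs = [0.9352784428596497, 0.05409964170306921, 0.038225741147994995]
print(round(ratio_of_confidence(probs), 3))  # 0.058
```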

Entropy sampling

Entropy applied to a probability distribution is computed by multiplying each probability by its own logarithm, summing the results, and negating the sum:

entropy = -sum(p_i * log2(p_i))

Let's calculate the entropy of the example data, taking the terms for Cat, Horse, and Dog in that order:

This gives 0 - sum(-0.0903, -0.2277, -0.1800) = 0.4980

Dividing by the log of the number of labels normalizes the score to the 0-1 range: 0.4980 / log2(3) = 0.3142
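
A small Python sketch of normalized entropy. The two distributions below are made-up inputs chosen so the results are easy to check by hand: a uniform distribution is maximally uncertain, and a distribution of 0.5, 0.25, 0.25 has an entropy of exactly 1.5 bits.

```python
import math

def normalized_entropy(probs):
    """Entropy in bits divided by log2(number of labels), so the score is in 0-1."""
    entropy = 0.0 - sum(p * math.log2(p) for p in probs if p > 0)  # skip p = 0 terms
    return entropy / math.log2(len(probs))

print(round(normalized_entropy([1 / 3, 1 / 3, 1 / 3]), 4))  # 1.0
print(round(normalized_entropy([0.5, 0.25, 0.25]), 4))      # 0.9464
```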

Summary

Much of the machine learning community's attention goes to creating better algorithms for learning from data. Obtaining useful labeled data is very important for training, but labeling can be very laborious, and poor annotation quality strongly affects training. Active learning is one way to address this problem, and a very promising direction.

https://avoid.overfit.cn/post/26eeaad603b540dbba4962c9179f6c64

Author: Zakarya ROUZKI
