[Machine Learning] Learning Notes 01 - Introduction
2022-06-12 22:06:00 【NRbene】
Introduction to machine learning
Distinguishing machine learning
Machine learning:
Uses existing experience E to continuously improve its performance on an established task T.
A function-estimation problem based on empirical data.
x -> y: looking for a mapping f; that is, machine learning tries to build a mapping function from inputs to outputs.
Machine learning uses statistical theory to build mathematical models, because its core task is to make inferences from samples. The role of computer science is twofold: first, during training, we need efficient algorithms for solving the optimization problem and for storing and processing the massive amounts of data we typically face; second, once a model has been learned, its representation and the algorithms used for inference must also be efficient. In a particular application, the efficiency of the learning or inference algorithm, i.e. its space and time complexity, may be as important as its predictive accuracy.
Deep learning and machine learning
Artificial intelligence ⊃ machine learning ⊃ deep learning
Machine learning: input -> feature extraction (feature engineering done by humans) -> classification -> output
Deep learning: input -> feature extraction + classification (learned together) -> output
Machine learning and data mining
Machine learning and statistical learning
Machine learning: data-driven, optimization problems, predicting the future; the focus is on model prediction (prediction)
Statistical learning: theory-driven, regression and hypothesis testing, explaining cause and effect; the focus is on parameter inference (inference)
Machine learning and traditional programming
Machine learning enables computers to simulate human decision-making.
For example, the designed model is y = ax + b; after feeding in a data set, machine learning builds the concrete model y = 0.8x - 100 (i.e. it infers the parameters), as sketched below.
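A minimal sketch of that "infer the parameters" step: generate synthetic data around y = 0.8x - 100 (the values from the example above) and recover a and b by least squares:

```python
import numpy as np

# Hypothetical data roughly following y = 0.8*x - 100 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 500, size=200)
y = 0.8 * x - 100 + rng.normal(0, 5, size=200)

# Fit y = a*x + b by least squares: the "learning" step infers a and b.
A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"learned model: y = {a:.2f}x + {b:.2f}")   # close to y = 0.8x - 100
```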
Machine learning concepts
Applicable conditions
- The problem contains some underlying regularity
- It is difficult to solve with ordinary programming, e.g. image recognition, speech recognition
- A large number of data samples are available
Challenges
Model stability
Adversarial examples: input samples formed by deliberately adding subtle perturbations to the data, causing the model to give a wrong output with high confidence.
In the context of regularization, adversarial training, i.e. training the network on adversarially perturbed training samples, can reduce the error rate on the original i.i.d. test set (a minimal sketch follows this list).
Model interpretability
Algorithmic bias (discrimination)?
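A minimal sketch of one way such a perturbation can be constructed, in the style of the fast gradient sign method (FGSM) on a toy logistic-regression model; the trained weights, the input, and the step size are all illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical trained logistic-regression parameters and one input sample.
w = np.array([2.0, -1.0])
b = 0.5
x = np.array([0.3, 0.8])
y = 1.0                       # true label

# Gradient of the log loss with respect to the input x: (p - y) * w
p = sigmoid(w @ x + b)
grad_x = (p - y) * w

# FGSM-style adversarial example: a small step in the sign of the gradient.
eps = 0.1
x_adv = x + eps * np.sign(grad_x)
print("clean prob      :", sigmoid(w @ x + b))
print("adversarial prob:", sigmoid(w @ x_adv + b))   # confidence in the true class drops

# Adversarial training would add such perturbed samples back into the training set.
```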
history
Symbolism
Believes that all intelligent behavior can be reduced to symbol manipulation within a logical system.
Automated theorem proving, expert systems, knowledge graphs
Bayesian school
Explain with probability
Connectionism
neural network
Other concepts
Weak supervision
Causal learning: most current machine learning algorithms can only infer correlation (correlation) and cannot obtain causality (causality). Models obtained through causal learning have good interpretability.
Basic concepts
Input $X \in \mathcal{X}$
Output $Y \in \mathcal{Y}$
Input instance $x=(x_{1},x_{2},\dots,x_{d})\in \mathcal{X}$
Output instance $y \in \mathcal{Y}$
Data set $D=\left\{ (x_{1},y_{1}),\dots,(x_{N},y_N) \right\}$
Objective (target) function $Y=f(X)$
Target distribution $P(Y\mid X)$
For a specific input: $y=f(x)$ or $P(y\mid x)$
The learned mapping should be close to the true $f$
Three elements
Model
Model: a decision function or a conditional probability distribution
Hypothesis space: the set of candidate decision functions or conditional probability distributions
Set of decision functions: $H=\left\{ f \mid Y=f(X;\theta ),\ \theta \in R^{n} \right\}$
Set of conditional probabilities: $H=\left\{ P \mid P(Y\mid X;\theta ),\ \theta \in R^{n} \right\}$
Strategy
Strategy: how to select the optimal model from the hypothesis space
Loss function: a function that measures the prediction error
- 0-1 loss function: if the prediction is correct, the loss is 0; if wrong, the loss is 1
- Squared loss function $L(Y,f(X))=(Y-f(X))^{2}$
- Absolute loss function $L(Y,f(X))=\left | Y-f(X) \right |$
- Logarithmic loss function $L(Y,P(Y\mid X))=-\log P(Y\mid X)$
In a classification problem with $M$ classes, let $p_i(x)$ be the probability with which the classifier predicts $x$ as class $i$; then the logarithmic loss is
$L(y, P(y \mid \mathbf{x}))=-\log P(y \mid \mathbf{x})=-\sum_{i=1}^{M} \left [ y=i \right ] \log p_{i}(\mathbf{x})$
If the data set contains $N$ samples, the average loss over the data set (the cost function) is
$L=-\frac{1}{N} \sum_{j=1}^{N} \sum_{i=1}^{M} \left [ y_{j}=i \right ]\log p_{i}(x_{j})$
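A minimal numpy sketch of these loss functions (the function names and the tiny example are mine, just for illustration):

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """0-1 loss: 0 when the prediction is correct, 1 when it is wrong (averaged)."""
    return np.mean(y_true != y_pred)

def squared_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def absolute_loss(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def log_loss(y_true, proba):
    """Average log loss: proba[j, i] is p_i(x_j); y_true holds class indices."""
    n = len(y_true)
    return -np.mean(np.log(proba[np.arange(n), y_true]))

# Tiny example with M = 3 classes and N = 4 samples.
y = np.array([0, 2, 1, 2])
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3],
                  [0.1, 0.1, 0.8]])
print(zero_one_loss(y, proba.argmax(axis=1)), log_loss(y, proba))
```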
Risk function (cost)
Expected risk (generalization error): $R_{exp}(f)=\int _{\mathcal{X}\times \mathcal{Y}}L(y,f(x))\,P(x,y)\,dx\,dy$
- Given a training set $T=\left\{ (x_{1},y_{1}),\dots,(x_{N},y_N) \right\}$
$P(x,y)$ is unknowable, so supervised learning is an ill-posed problem
In theory, therefore, we cannot tell in advance which model is better
Empirical risk: $R_{emp}(f)= \frac{1}{N} \sum_{i=1}^{N}L(y_{i},f(x_{i}))$
Structural risk: $R_{srm}(f)=\frac{1}{N} \sum_{i=1}^{N}L(y_i,f(x_i)) + \lambda J(f)$
Structural risk incorporates prior domain knowledge
$J(f)$ denotes the model complexity
$\lambda$ is a balancing factor, ranging from 0 to $\infty$
Strategy (find an $f$ in the hypothesis space that makes the empirical risk or the structural risk as small as possible): $\min_{f \in H} R_{emp}(f)$ or $\min_{f \in H} R_{srm}(f)$
Structural risk minimization corresponds to regularization
The smaller the empirical risk or structural risk, the better; a small numeric sketch follows.
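A small numeric sketch contrasting empirical and structural risk for two candidate polynomial models, using squared loss and the number of parameters as a stand-in for $J(f)$ (that choice of complexity measure, and the data, are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 20)

def empirical_risk(coefs):
    """Average squared loss on the training data."""
    return np.mean((np.polyval(coefs, x) - y) ** 2)

def structural_risk(coefs, lam=0.01):
    J = len(coefs)                    # crude complexity measure: number of parameters
    return empirical_risk(coefs) + lam * J

for deg in (3, 9):
    c = np.polyfit(x, y, deg)
    print(deg, empirical_risk(c), structural_risk(c))
# The degree-9 fit drives the empirical risk down but pays a larger complexity penalty.
```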
Algorithm
Algorithm: the concrete procedure for learning the model, i.e. for solving for the optimal model
Optimization problem: $\min_{w,b}J(w,b)$
- Extremum problems
- Gradient descent (a minimal sketch follows this list)
- Newton's method and quasi-Newton methods
- Constrained optimization problems: the Lagrange multiplier method
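A minimal gradient-descent sketch for the linear-regression objective $J(w,b)=\frac{1}{N}\sum_i (wx_i+b-y_i)^2$; the learning rate, iteration count, and synthetic data are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 100)
y = 3.0 * x + 1.0 + rng.normal(0, 0.1, 100)

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    err = w * x + b - y
    grad_w = 2 * np.mean(err * x)   # dJ/dw
    grad_b = 2 * np.mean(err)       # dJ/db
    w -= lr * grad_w
    b -= lr * grad_b
print(w, b)   # should approach 3.0 and 1.0
```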
Inductive preference
An inductive preference can be seen as the learning algorithm's own heuristic or value system for choosing among hypotheses in a possibly huge hypothesis space.
Occam's razor (Occam's razor): if several hypotheses are consistent with the observations, choose the simplest one.
![Occam's razor illustration](img/1.2.1.jpg)
If all of them meet the objective, a three-parameter model is preferred over a seven-parameter one.
Theorem 1 (No Free Lunch theorem (no free lunch theorem))
For any two learning algorithms A and B, if A is better than B on some problems, there must be other problems on which B is better than A.
![no free lunch illustration](img/1.2.2.jpg)
No machine learning algorithm is suitable for all situations. In practice, however, problems are not uniformly distributed, and whether the inductive preference of the learning algorithm matches the problem at hand often plays a decisive role. Divorced from specific problems, talking abstractly about "which learning algorithm is better" is meaningless.
Proof (sketch)
For simplicity, assume that both the sample space $\mathcal{X}$ and the hypothesis space $\mathcal{H}$ are discrete.
For a particular learning algorithm, let $P(h \mid D)$ denote the probability that it produces hypothesis $h$ from the training data $D$
(the algorithm is treated as non-deterministic: $h$ is found from the training set and the algorithm).
Let $f$ denote the true objective function we want to learn.
Assume the true $f$ is uniformly distributed over all possible target functions.
Then the expected error of the algorithm over all samples outside the training set is:
![expected off-training-set error](img/1.2.30.jpg)
Since $f$ is given and $h$ is random, we expand this expectation.
Weighting by $P(x)$ is the discrete counterpart of integrating against the distribution (the continuous analogue would be $E(f)=\int f(x)\,p(x)\,dx$).
Consider the binary classification problem, where the true objective function can be any function
$f:\mathcal{X} \to \left\{0,1 \right\}$
The indicator term is 0 when the prediction equals the label, and 1 when it does not.
Summing the errors over all possible target functions according to the uniform distribution gives:
![sum of errors over all possible target functions](img/1.2.31.jpg)
There are $|\mathcal{X}|$ points and each point can be labeled 0 or 1, so there are $2^{|\mathcal{X}|}$ possible target functions in total; summed over all of them, the total error is the same for every algorithm.
This example is from the watermelon book (Zhou Zhihua, Machine Learning); the full derivation can be found there.
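A tiny enumeration that illustrates the key step of the argument: summed over all $2^{|\mathcal{X}|}$ possible binary target functions, any fixed set of predictions on the off-training-set points makes exactly the same total number of mistakes, so the summed error does not depend on the learner (a toy check, not the full proof):

```python
from itertools import product

# Three off-training-set points; a target function assigns each one a label in {0, 1}.
points = ["x1", "x2", "x3"]
all_targets = list(product([0, 1], repeat=len(points)))   # 2^3 = 8 target functions

def total_error(predictions):
    """Sum of mistakes over every possible target function."""
    return sum(sum(p != f for p, f in zip(predictions, f_values))
               for f_values in all_targets)

# Two "algorithms" that predict differently: their total error is identical.
print(total_error((0, 0, 0)))   # 12
print(total_error((1, 0, 1)))   # 12
```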
The goal of machine learning
The purpose of machine learning is to make the learned mapping approximate the true target function.
- Not only should the training error be small; more importantly, the generalization error should be small, i.e. the learned model should apply well to unseen examples (data outside the training set).
Machine learning is divided into a training (Training) phase and a testing (Testing) phase:
Training: given a training set of $N$ samples $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, adjust the parameters of the model $f$ so that the predictions $f(x_i)$ are as close as possible to the labels $y_i$.
Testing: apply the learned function $f$ to a separate test set $T = \{(\tilde{x}_1, \tilde{y}_1), (\tilde{x}_2, \tilde{y}_2), \dots, (\tilde{x}_{N'}, \tilde{y}_{N'})\}$ and verify whether it makes accurate predictions on this set, i.e. whether $f(\tilde{x}_i)$ is close to $\tilde{y}_i$.
Underfitting and overfitting
Underfitting (underfitting): the learner has not captured the general properties of the training samples well; the training error is high, and generalization ability is low.
Overfitting (overfitting): the learner learns the training samples "too well", taking peculiarities of the training samples themselves as general properties of all potential samples, which reduces generalization ability.
- Parameter overfitting: the amount of data is too small, or training goes on for too long
- Structural overfitting
  - The model's learning capacity is too strong (too many parameters)
  - Too much noise in the data
Structural overfitting:
The model's capacity is too strong: a quadratic curve would have sufficed, but a much higher-degree one was used.
The data are noisy and of low quality.
![underfitting vs overfitting](img/1.2.4.jpg)

As the number of iterations increases, the error of the underfitted model keeps decreasing (the gap in the y direction).
Learning "too much", however, is also harmful.
![polynomial fits of increasing degree](img/1.2.6.jpg)
The green curve is the model's prediction:
(a) degree-0 polynomial
(b) degree-1 polynomial
... (and so on for higher degrees; a runnable sketch follows)
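A runnable polynomial curve-fitting sketch in the spirit of that figure: low degrees underfit (large training and test error), while a very high degree drives the training error down but tends to increase the test error again. The data-generating function and the degrees tried are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

for degree in (0, 1, 3, 9):
    coefs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train={train_err:.3f}  test={test_err:.3f}")
# Degrees 0 and 1 underfit; degree 9 typically fits the training points
# almost exactly while the test error grows again.
```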
The generalization error ( a key )
Theorem: the bias-variance decomposition (bias-variance decomposition)
bias, variance (+ noise)
Let the true target value at a test input $x$ be $h(x)$, the observed target value be $t$, the model's prediction be $y(x)$, and the joint distribution of $x$ and $t$ be $p(x,t)$. Then the expected error between the target value $t$ and the prediction $y(x)$ is
(taking the expectation):
![decomposition of the expected error](img/1.2.7.jpg)
Distinguishing between the true value and the observed value $t$ is what allows the concept of noise to be introduced.
The observed value $t$ amounts to a sample near the true value, but its expectation at a given point equals the true value, i.e. $E(t)=h(x)$.
In the second line the square is expanded; integrating first over $t$ and then over $x$, and using $E(t)=h(x)$, the cross term cancels, giving the third line.
![bias-variance-noise decomposition](img/1.2.8.jpg)
Different training sets $D$ yield different learned functions $y$, so we average over $D$, i.e. take the expectation of $y$.
Variance (Variance): how much a model learned from a single data set differs from the average over many learned models.
Bias (Bias): the difference between the (average) learned model and the true model.
![bias and variance illustration](img/1.2.10.jpg)
The generalization performance of a learning algorithm is determined jointly by:
- the capability of the learning algorithm,
- the sufficiency of the data, and
- the difficulty of the learning task itself.

The bias-variance dilemma (bias-variance dilemma):
- For a given learning task, when training is insufficient, the learner's fitting ability is weak and perturbations of the training data cannot yet change the learner significantly; bias dominates the generalization error.
- As training proceeds, the learner's fitting ability gradually strengthens and perturbations of the training data are gradually picked up by the learner; variance gradually comes to dominate the generalization error.
- After sufficient training, the learner's fitting ability is very strong, and even slight perturbations of the training data cause significant changes in the learner; if idiosyncratic, non-global features of the training data are learned, overfitting occurs.
![bias-variance trade-off vs model complexity](img/1.2.11.jpg)
The generalization error first decreases and then rises (see the comparison figure above).
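A minimal simulation of this decomposition: train many models on independently drawn training sets and estimate bias², variance, and noise on a grid of test points (the data-generating process, the polynomial model family, and the sample sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
noise_std = 0.2
x_test = np.linspace(0, 1, 50)
h = np.sin(2 * np.pi * x_test)          # true target values h(x)

def train_once(degree, n=25):
    x = rng.uniform(0, 1, n)
    t = np.sin(2 * np.pi * x) + rng.normal(0, noise_std, n)
    return np.polyval(np.polyfit(x, t, degree), x_test)

for degree in (1, 3, 9):
    preds = np.array([train_once(degree) for _ in range(200)])   # 200 data sets D
    avg = preds.mean(axis=0)                                      # E_D[y(x; D)]
    bias2 = np.mean((avg - h) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree {degree}: bias^2={bias2:.3f}  variance={variance:.3f}  "
          f"noise={noise_std**2:.3f}")
# Low-degree models: high bias, low variance; high-degree models: the reverse.
```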

Mitigating overfitting
- Start with a simple model (Occam's razor principle)
- Regularization (reduces the effective complexity of the model and can reduce the test error)
- Data cleaning and preprocessing (the effect is not guaranteed)
  - Noise removal
  - Data augmentation
- Enlarge the data set
- Use a validation set (used to estimate the generalization error, usually for model selection).
Understanding generalization error in depth
Polynomial curve fitting problem
![polynomial curve fitting](img/1.2.12.jpg)
![polynomial curve fitting, continued](img/1.2.13.jpg)
![polynomial curve fitting, continued](img/1.2.14.jpg)
When the complexity of the model increases, noise perturbations make the fitted curve swing wildly between the data points.
![effect of the regularization coefficient](img/1.2.15.jpg)
$\ln\lambda=-\infty$ (i.e. $\lambda=0$): the penalty term is essentially inactive.
The larger $\lambda$ is, the more weight is given to model complexity.
$\lambda$ controls the model complexity: a large $\lambda$ forces a low model complexity. If the model becomes too simple, it may underfit and the error increases again (compare figures 1 and 2). A runnable sketch of the effect of $\lambda$ follows.
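A minimal ridge-regression sketch of the effect of $\lambda$, fitting a degree-9 polynomial with an L2 penalty on the coefficients; the data-generating function, the degree, and the candidate $\lambda$ values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
x_train = rng.uniform(0, 1, 12)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 12)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

def design(x, degree=9):
    return np.vander(x, degree + 1)

def ridge_fit(lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 (structural risk with J(f)=||w||^2)."""
    X = design(x_train)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y_train)

for lam in (1e-8, 1e-3, 1.0, 100.0):   # 1e-8 plays the role of "lambda close to 0"
    w = ridge_fit(lam)
    test_err = np.mean((design(x_test) @ w - y_test) ** 2)
    print(f"lambda={lam:g}: test MSE={test_err:.3f}")
# lambda too small -> overfitting; lambda too large -> underfitting (model too simple).
```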
| Training performance | Test performance | Interpretation |
|---|---|---|
| good (+) | good (+) | good fit |
| good (+) | poor (-) | overfitting |
| poor (-) | good (+) | — |
| poor (-) | poor (-) | underfitting |
Basic concepts
Test set (testing set)
- The test error on the test set is used as an approximation of the generalization error
- It is used to assess the learner's ability to discriminate new samples and to evaluate the final model
- The test set should be mutually exclusive with the training set; we assume the test samples are also drawn independently and identically distributed (i.i.d.) from the true sample distribution.
Validation set (validation set)
- Split the training data into a training set and a validation set
- Perform model selection and hyperparameter tuning based on performance on the validation set
For example, train $f_1$ with $\lambda_1$ and $f_2$ with $\lambda_2$ on the training set, then run both on the validation set and choose between $f_1$ and $f_2$ (a small sketch follows).
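A minimal sketch of that selection step: fit one regularized model per candidate $\lambda$ on the training portion and keep whichever performs better on the validation portion (the split, the ridge-style model, and the candidate values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 40)

# Hold out the last 10 samples as the validation set.
x_tr, y_tr, x_val, y_val = x[:30], y[:30], x[30:], y[30:]

def ridge_fit(x, y, lam, degree=9):
    X = np.vander(x, degree + 1)
    return np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)

def val_error(w, degree=9):
    return np.mean((np.vander(x_val, degree + 1) @ w - y_val) ** 2)

candidates = [1e-4, 1e-1]                     # lambda_1 and lambda_2
errors = {lam: val_error(ridge_fit(x_tr, y_tr, lam)) for lam in candidates}
best = min(errors, key=errors.get)
print(errors, "-> choose lambda =", best)
```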
Method
Hold-out method (hold-out):
- Directly split the training data into two mutually exclusive sets: a training set and a validation set
- The training/validation split should preserve the consistency of the data distribution as much as possible
- Use several random splits, each producing a training/validation pair for experimental evaluation, and finally report the mean and standard deviation over all results
Common practice: split several times and take the average.
- Typically about 2/3 to 4/5 of the data is used for training
K-fold cross-validation
- Randomly partition the training data into K disjoint subsets of equal size
- Train the model on K-1 of the subsets and use the remaining subset as the validation set
- Repeat this process for each of the K possible choices of validation subset
- Finally choose the model with the smallest average validation error over the K evaluations (a runnable sketch appears after the notes on leave-one-out below)
Notes:
The most common value of K is 10; 5 and 20 are also commonly used.
Each subset should preserve the consistency of the data distribution as much as possible.
To reduce the variability introduced by different random splits, cross-validation is usually repeated several times, e.g. 10 repetitions of 10-fold cross-validation.
Folds: 1 | 2 | 3 | … | K
![k-fold cross-validation](img/1.2.16.jpg)
Repeating k-fold cross-validation k times requires $k \times k$ training runs in total, which is computationally expensive.
Special case: leave-one-out (leave-one-out)
- K-fold cross-validation with K equal to the number of samples (each subset contains exactly one sample)
- Since each subset has only one sample, leave-one-out is not affected by the randomness of the sample split.
- The evaluation given by leave-one-out is often considered relatively accurate. However, when the data set is large, the computational cost of training that many models becomes intolerable.
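A minimal K-fold cross-validation sketch in plain numpy; the model being evaluated (polynomial fitting), the error metric, and the candidate degrees are illustrative choices, and setting k equal to the number of samples gives leave-one-out:

```python
import numpy as np

def k_fold_cv(x, y, k, fit, predict, seed=0):
    """Return the average validation MSE over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)            # k disjoint subsets
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(x[train], y[train])
        errors.append(np.mean((predict(model, x[val]) - y[val]) ** 2))
    return np.mean(errors)

# Toy usage: compare polynomial degrees with 10-fold cross-validation.
rng = np.random.default_rng(7)
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 60)
for degree in (1, 3, 9):
    err = k_fold_cv(x, y, 10,
                    fit=lambda xs, ys, d=degree: np.polyfit(xs, ys, d),
                    predict=np.polyval)
    print(f"degree {degree}: CV error = {err:.3f}")
```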
summary
![summary](img/1.2.17.jpg)
The general process of machine learning
Obtain a finite set of training data
Determine the hypothesis space that contains all candidate models, i.e. the set of learning models
Determine the criterion for model selection, i.e. the learning strategy
- Decide whether to minimize the empirical risk or the structural risk
Determine the algorithm for solving for the optimal model, i.e. the learning algorithm
Select the optimal model by the learning method
Use the learned optimal model to predict or analyze new data
Machine learning classification
By whether labels are available
Supervised learning: spam classification, housing price prediction
- Labeled examples are available
Unsupervised learning: anomaly detection
- We do not know in advance which samples are anomalous
Semi-supervised learning: labeling speech
- Labeling data is expensive, so only part of the data is labeled. A common approach is to first apply supervised learning on the labeled part, then use the trained model to label the unlabeled data (see the sketch after this list).
Reinforcement learning: AlphaGo
- The data are not available at the start; they are generated continuously through interaction.
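A minimal self-training (pseudo-labeling) sketch of that idea, using a nearest-centroid classifier as the supervised learner; the two-blob data, the classifier, and the one-shot pseudo-labeling step are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(8)
# Two Gaussian blobs; only 10 of the 200 samples are labeled.
x = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
labeled = np.concatenate([rng.choice(100, 5, replace=False),
                          100 + rng.choice(100, 5, replace=False)])
unlabeled = np.setdiff1d(np.arange(200), labeled)

def fit_centroids(xs, ys):
    return np.array([xs[ys == c].mean(axis=0) for c in (0, 1)])

def predict(centroids, xs):
    d = np.linalg.norm(xs[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

# Step 1: supervised learning on the small labeled subset.
centroids = fit_centroids(x[labeled], y[labeled])
# Step 2: pseudo-label the unlabeled data with that model, then retrain on everything.
pseudo = predict(centroids, x[unlabeled])
x_all = np.vstack([x[labeled], x[unlabeled]])
y_all = np.concatenate([y[labeled], pseudo])
centroids = fit_centroids(x_all, y_all)
print("accuracy on all data:", np.mean(predict(centroids, x) == y))
```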
By output space
- Binary classification: spam classification
- Multi-class classification: image classification
- Regression: housing price prediction
- Structured learning: machine translation, speech recognition, chatbots
By model
In the end we want $P(y|x)$.
Discriminative models: decision trees, support vector machines
- Directly determine $P(y|x)$ or $f(x)$
Generative models: e.g. GANs
- First determine the joint distribution $P(x,y)$
- Then use Bayes' theorem: $P(y|x)=\frac{P(x,y)}{P(x)}$ (a tiny sketch follows the figure)
![discriminative vs generative models](img/1.2.18.jpg)
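A tiny sketch of the generative route: estimate the joint distribution $P(x,y)$ from counts on a discrete toy data set, then recover $P(y\mid x)$ via Bayes' theorem (the data set is an illustrative assumption):

```python
import numpy as np

# Toy discrete data: feature x in {0, 1, 2}, label y in {0, 1}.
x = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 1])

# Generative step: estimate the joint distribution P(x, y) from counts.
joint = np.zeros((3, 2))
for xi, yi in zip(x, y):
    joint[xi, yi] += 1
joint /= joint.sum()

# Bayes' theorem: P(y | x) = P(x, y) / P(x).
p_x = joint.sum(axis=1, keepdims=True)
posterior = joint / p_x
print(posterior)          # row i gives P(y=0 | x=i), P(y=1 | x=i)
```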
By algorithm
Batch learning: all data are fed to the learning algorithm in one batch; this can be vividly described as "cramming".
Online learning: data arrive in sequence and are learned one after another; the model is continuously revised and optimized.
Active learning: use some strategy to find the most valuable samples among the unlabeled data, hand them to human experts for labeling, then add the newly labeled data and their class labels to the training set and iteratively retrain the model to improve its performance.
- While learning, the model asks about the samples it finds most informative: "what is the label of this one?"