[Machine Learning] Learning Notes 01 - Introduction
2022-06-12 22:06:00 【NRbene】
Introduction to machine learning
Distinguishing machine learning
Machine learning:
Uses existing experience E to continuously improve its performance on an established task T.
A function-estimation problem based on empirical data.
x -> y: looking for a mapping f; that is, machine learning tries to build a mapping function from inputs to outputs.
Machine learning uses statistical theory to build mathematical models, because its core task is to make inferences from samples. The role of computer science is twofold: first, during training, we need efficient algorithms for solving the optimization problem and for storing and processing the massive amounts of data we typically face; second, once a model has been learned, its representation and the algorithms used for inference must also be efficient. In a particular application, the efficiency of the learning or inference algorithm, i.e. its space and time complexity, may be as important as its predictive accuracy.
Deep learning and machine learning
Artificial intelligence ⊃ machine learning ⊃ deep learning
Machine learning: input -> feature extraction (feature engineering done by humans) -> classification -> output
Deep learning: input -> feature extraction + classification (learned together) -> output
Machine learning and data mining
Machine learning and statistical learning
Machine learning: data-driven, optimization problems, predicting the future; the focus is on model prediction (prediction)
Statistical learning: theory-driven, regression and hypothesis testing, explaining cause and effect; the focus is on parameter inference (inference)
Machine learning and traditional programming
Machine learning enables computers to simulate human decision-making.
For example, the designed model is y = ax + b; after feeding in a data set, machine learning builds the concrete model y = 0.8x - 100 (i.e. it infers the parameters), as sketched below.
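A minimal sketch of that "infer the parameters" step: generate synthetic data around y = 0.8x - 100 (the values from the example above) and recover a and b by least squares:

```python
import numpy as np

# Hypothetical data roughly following y = 0.8*x - 100 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 500, size=200)
y = 0.8 * x - 100 + rng.normal(0, 5, size=200)

# Fit y = a*x + b by least squares: the "learning" step infers a and b.
A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"learned model: y = {a:.2f}x + {b:.2f}")   # close to y = 0.8x - 100
```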
Machine learning concepts
Applicable conditions
- The problem contains some underlying regularity
- It is difficult to solve with ordinary programming, e.g. image recognition, speech recognition
- A large number of data samples are available
Challenges
Model stability
Adversarial examples: input samples formed by deliberately adding subtle perturbations to the data, causing the model to give a wrong output with high confidence.
In the context of regularization, adversarial training, i.e. training the network on adversarially perturbed training samples, can reduce the error rate on the original i.i.d. test set (a minimal sketch follows this list).
Model interpretability
Algorithmic bias (discrimination)?
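A minimal sketch of one way such a perturbation can be constructed, in the style of the fast gradient sign method (FGSM) on a toy logistic-regression model; the trained weights, the input, and the step size are all illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical trained logistic-regression parameters and one input sample.
w = np.array([2.0, -1.0])
b = 0.5
x = np.array([0.3, 0.8])
y = 1.0                       # true label

# Gradient of the log loss with respect to the input x: (p - y) * w
p = sigmoid(w @ x + b)
grad_x = (p - y) * w

# FGSM-style adversarial example: a small step in the sign of the gradient.
eps = 0.1
x_adv = x + eps * np.sign(grad_x)
print("clean prob      :", sigmoid(w @ x + b))
print("adversarial prob:", sigmoid(w @ x_adv + b))   # confidence in the true class drops

# Adversarial training would add such perturbed samples back into the training set.
```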
history
Symbolism
Believes that all intelligent behavior can be reduced to symbol manipulation within a logical system.
Automated theorem proving, expert systems, knowledge graphs
Bayesian school
Explain with probability
Connectionism
neural network
Other concepts
Weak supervision
Causal learning: most current machine learning algorithms can only infer correlation (correlation) and cannot obtain causality (causality). Models obtained through causal learning have good interpretability.
Basic concepts
Input $X \in \mathcal{X}$
Output $Y \in \mathcal{Y}$
Input instance $x=(x_{1},x_{2},\dots,x_{d})\in \mathcal{X}$
Output instance $y \in \mathcal{Y}$
Data set $D=\left\{ (x_{1},y_{1}),\dots,(x_{N},y_N) \right\}$
Objective (target) function $Y=f(X)$
Target distribution $P(Y\mid X)$
For a specific input: $y=f(x)$ or $P(y\mid x)$
The learned mapping should be close to the true $f$
Three elements
Model
Model: a decision function or a conditional probability distribution
Hypothesis space: the set of candidate decision functions or conditional probability distributions
Set of decision functions: $H=\left\{ f \mid Y=f(X;\theta ),\ \theta \in R^{n} \right\}$
Set of conditional probabilities: $H=\left\{ P \mid P(Y\mid X;\theta ),\ \theta \in R^{n} \right\}$
Strategy
Strategy: how to select the optimal model from the hypothesis space
Loss function: a function that measures the prediction error
- 0-1 loss function: if the prediction is correct, the loss is 0; if wrong, the loss is 1
- Squared loss function $L(Y,f(X))=(Y-f(X))^{2}$
- Absolute loss function $L(Y,f(X))=\left | Y-f(X) \right |$
- Logarithmic loss function $L(Y,P(Y\mid X))=-\log P(Y\mid X)$
In a classification problem with $M$ classes, let $p_i(x)$ be the probability with which the classifier predicts $x$ as class $i$; then the logarithmic loss is
$L(y, P(y \mid \mathbf{x}))=-\log P(y \mid \mathbf{x})=-\sum_{i=1}^{M} \left [ y=i \right ] \log p_{i}(\mathbf{x})$
If the data set contains $N$ samples, the average loss over the data set (the cost function) is
$L=-\frac{1}{N} \sum_{j=1}^{N} \sum_{i=1}^{M} \left [ y_{j}=i \right ]\log p_{i}(x_{j})$
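A minimal numpy sketch of these loss functions (the function names and the tiny example are mine, just for illustration):

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """0-1 loss: 0 when the prediction is correct, 1 when it is wrong (averaged)."""
    return np.mean(y_true != y_pred)

def squared_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def absolute_loss(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def log_loss(y_true, proba):
    """Average log loss: proba[j, i] is p_i(x_j); y_true holds class indices."""
    n = len(y_true)
    return -np.mean(np.log(proba[np.arange(n), y_true]))

# Tiny example with M = 3 classes and N = 4 samples.
y = np.array([0, 2, 1, 2])
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3],
                  [0.1, 0.1, 0.8]])
print(zero_one_loss(y, proba.argmax(axis=1)), log_loss(y, proba))
```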
Risk function (cost)
Expected risk (generalization error): $R_{exp}(f)=\int _{\mathcal{X}\times \mathcal{Y}}L(y,f(x))\,P(x,y)\,dx\,dy$
- Given a training set $T=\left\{ (x_{1},y_{1}),\dots,(x_{N},y_N) \right\}$
$P(x,y)$ is unknowable, so supervised learning is an ill-posed problem
In theory, therefore, we cannot tell in advance which model is better
Empirical risk: $R_{emp}(f)= \frac{1}{N} \sum_{i=1}^{N}L(y_{i},f(x_{i}))$
Structural risk: $R_{srm}(f)=\frac{1}{N} \sum_{i=1}^{N}L(y_i,f(x_i)) + \lambda J(f)$
Structural risk incorporates prior domain knowledge
$J(f)$ denotes the model complexity
$\lambda$ is a balancing factor, ranging from 0 to $\infty$
Strategy (find an $f$ in the hypothesis space that makes the empirical risk or the structural risk as small as possible): $\min_{f \in H} R_{emp}(f)$ or $\min_{f \in H} R_{srm}(f)$
Structural risk minimization corresponds to regularization
The smaller the empirical risk or structural risk, the better; a small numeric sketch follows.
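A small numeric sketch contrasting empirical and structural risk for two candidate polynomial models, using squared loss and the number of parameters as a stand-in for $J(f)$ (that choice of complexity measure, and the data, are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 20)

def empirical_risk(coefs):
    """Average squared loss on the training data."""
    return np.mean((np.polyval(coefs, x) - y) ** 2)

def structural_risk(coefs, lam=0.01):
    J = len(coefs)                    # crude complexity measure: number of parameters
    return empirical_risk(coefs) + lam * J

for deg in (3, 9):
    c = np.polyfit(x, y, deg)
    print(deg, empirical_risk(c), structural_risk(c))
# The degree-9 fit drives the empirical risk down but pays a larger complexity penalty.
```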
Algorithm
Algorithm: the concrete procedure for learning the model, i.e. for solving for the optimal model
Optimization problem: $\min_{w,b}J(w,b)$
- Extremum problems
- Gradient descent (a minimal sketch follows this list)
- Newton's method and quasi-Newton methods
- Constrained optimization problems: the Lagrange multiplier method
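A minimal gradient-descent sketch for the linear-regression objective $J(w,b)=\frac{1}{N}\sum_i (wx_i+b-y_i)^2$; the learning rate, iteration count, and synthetic data are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 100)
y = 3.0 * x + 1.0 + rng.normal(0, 0.1, 100)

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    err = w * x + b - y
    grad_w = 2 * np.mean(err * x)   # dJ/dw
    grad_b = 2 * np.mean(err)       # dJ/db
    w -= lr * grad_w
    b -= lr * grad_b
print(w, b)   # should approach 3.0 and 1.0
```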
Inductive preference
An inductive preference can be seen as the learning algorithm's own heuristic or value system for choosing among hypotheses in a possibly huge hypothesis space.
Occam's razor (Occam's razor): if several hypotheses are consistent with the observations, choose the simplest one.
![Occam's razor illustration](img/1.2.1.jpg)
If all of them meet the objective, a three-parameter model is preferred over a seven-parameter one.
Theorem 1 (No Free Lunch theorem (no free lunch theorem))
For any two learning algorithms A and B, if A is better than B on some problems, there must be other problems on which B is better than A.
![no free lunch illustration](img/1.2.2.jpg)
No machine learning algorithm is suitable for all situations. In practice, however, problems are not uniformly distributed, and whether the inductive preference of the learning algorithm matches the problem at hand often plays a decisive role. Divorced from specific problems, talking abstractly about "which learning algorithm is better" is meaningless.
Proof (sketch)
For simplicity, assume that both the sample space $\mathcal{X}$ and the hypothesis space $\mathcal{H}$ are discrete.
For a particular learning algorithm, let $P(h \mid D)$ denote the probability that it produces hypothesis $h$ from the training data $D$
(the algorithm is treated as non-deterministic: $h$ is found from the training set and the algorithm).
Let $f$ denote the true objective function we want to learn.
Assume the true $f$ is uniformly distributed over all possible target functions.
Then the expected error of the algorithm over all samples outside the training set is:
![expected off-training-set error](img/1.2.30.jpg)
Since $f$ is given and $h$ is random, we expand this expectation.
Weighting by $P(x)$ is the discrete counterpart of integrating against the distribution (the continuous analogue would be $E(f)=\int f(x)\,p(x)\,dx$).
Consider the binary classification problem, where the true objective function can be any function
$f:\mathcal{X} \to \left\{0,1 \right\}$
The indicator term is 0 when the prediction equals the label, and 1 when it does not.
Summing the errors over all possible target functions according to the uniform distribution gives:
![sum of errors over all possible target functions](img/1.2.31.jpg)
There are $|\mathcal{X}|$ points and each point can be labeled 0 or 1, so there are $2^{|\mathcal{X}|}$ possible target functions in total; summed over all of them, the total error is the same for every algorithm.
This example is from the watermelon book (Zhou Zhihua, Machine Learning); the full derivation can be found there.
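A tiny enumeration that illustrates the key step of the argument: summed over all $2^{|\mathcal{X}|}$ possible binary target functions, any fixed set of predictions on the off-training-set points makes exactly the same total number of mistakes, so the summed error does not depend on the learner (a toy check, not the full proof):

```python
from itertools import product

# Three off-training-set points; a target function assigns each one a label in {0, 1}.
points = ["x1", "x2", "x3"]
all_targets = list(product([0, 1], repeat=len(points)))   # 2^3 = 8 target functions

def total_error(predictions):
    """Sum of mistakes over every possible target function."""
    return sum(sum(p != f for p, f in zip(predictions, f_values))
               for f_values in all_targets)

# Two "algorithms" that predict differently: their total error is identical.
print(total_error((0, 0, 0)))   # 12
print(total_error((1, 0, 1)))   # 12
```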
The goal of machine learning
The purpose of machine learning is to make the learned mapping approximate the true target function.
- Not only should the training error be small; more importantly, the generalization error should be small, i.e. the learned model should apply well to unseen examples (data outside the training set).
Machine learning is divided into a training (Training) phase and a testing (Testing) phase:
Training: given a training set of $N$ samples $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, adjust the parameters of the model $f$ so that the predictions $f(x_i)$ are as close as possible to the labels $y_i$.
Testing: apply the learned function $f$ to a separate test set $T = \{(\tilde{x}_1, \tilde{y}_1), (\tilde{x}_2, \tilde{y}_2), \dots, (\tilde{x}_{N'}, \tilde{y}_{N'})\}$ and verify whether it makes accurate predictions on this set, i.e. whether $f(\tilde{x}_i)$ is close to $\tilde{y}_i$.
Underfitting and overfitting
Underfitting (underfitting): the learner has not captured the general properties of the training samples well; the training error is high, and generalization ability is low.
Overfitting (overfitting): the learner learns the training samples "too well", taking peculiarities of the training samples themselves as general properties of all potential samples, which reduces generalization ability.
- Parameter overfitting: the amount of data is too small, or training goes on for too long
- Structural overfitting
  - The model's learning capacity is too strong (too many parameters)
  - Too much noise in the data
Structural overfitting:
The model's capacity is too strong: a quadratic curve would have sufficed, but a much higher-degree one was used.
The data are noisy and of low quality.
![underfitting vs overfitting](img/1.2.4.jpg)

As the number of iterations increases, the error of the underfitted model keeps decreasing (the gap in the y direction).
Learning "too much", however, is also harmful.
![polynomial fits of increasing degree](img/1.2.6.jpg)
The green curve is the model's prediction:
(a) degree-0 polynomial
(b) degree-1 polynomial
... (and so on for higher degrees; a runnable sketch follows)
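A runnable polynomial curve-fitting sketch in the spirit of that figure: low degrees underfit (large training and test error), while a very high degree drives the training error down but tends to increase the test error again. The data-generating function and the degrees tried are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

for degree in (0, 1, 3, 9):
    coefs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train={train_err:.3f}  test={test_err:.3f}")
# Degrees 0 and 1 underfit; degree 9 typically fits the training points
# almost exactly while the test error grows again.
```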
The generalization error ( a key )
Theorem: the bias-variance decomposition (bias-variance decomposition)
bias, variance (+ noise)
Let the true target value at a test input $x$ be $h(x)$, the observed target value be $t$, the model's prediction be $y(x)$, and the joint distribution of $x$ and $t$ be $p(x,t)$. Then the expected error between the target value $t$ and the prediction $y(x)$ is
(taking the expectation):
![decomposition of the expected error](img/1.2.7.jpg)
Distinguishing between the true value and the observed value $t$ is what allows the concept of noise to be introduced.
The observed value $t$ amounts to a sample near the true value, but its expectation at a given point equals the true value, i.e. $E(t)=h(x)$.
In the second line the square is expanded; integrating first over $t$ and then over $x$, and using $E(t)=h(x)$, the cross term cancels, giving the third line.
![bias-variance-noise decomposition](img/1.2.8.jpg)
Different training sets $D$ yield different learned functions $y$, so we average over $D$, i.e. take the expectation of $y$.
Variance (Variance): how much a model learned from a single data set differs from the average over many learned models.
Bias (Bias): the difference between the (average) learned model and the true model.
![bias and variance illustration](img/1.2.10.jpg)
The generalization performance of a learning algorithm is determined jointly by:
- the capability of the learning algorithm,
- the sufficiency of the data, and
- the difficulty of the learning task itself.

The bias-variance dilemma (bias-variance dilemma):
- For a given learning task, when training is insufficient, the learner's fitting ability is weak and perturbations of the training data cannot yet change the learner significantly; bias dominates the generalization error.
- As training proceeds, the learner's fitting ability gradually strengthens and perturbations of the training data are gradually picked up by the learner; variance gradually comes to dominate the generalization error.
- After sufficient training, the learner's fitting ability is very strong, and even slight perturbations of the training data cause significant changes in the learner; if idiosyncratic, non-global features of the training data are learned, overfitting occurs.
![bias-variance trade-off vs model complexity](img/1.2.11.jpg)
The generalization error first decreases and then rises (see the comparison figure above).
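A minimal simulation of this decomposition: train many models on independently drawn training sets and estimate bias², variance, and noise on a grid of test points (the data-generating process, the polynomial model family, and the sample sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
noise_std = 0.2
x_test = np.linspace(0, 1, 50)
h = np.sin(2 * np.pi * x_test)          # true target values h(x)

def train_once(degree, n=25):
    x = rng.uniform(0, 1, n)
    t = np.sin(2 * np.pi * x) + rng.normal(0, noise_std, n)
    return np.polyval(np.polyfit(x, t, degree), x_test)

for degree in (1, 3, 9):
    preds = np.array([train_once(degree) for _ in range(200)])   # 200 data sets D
    avg = preds.mean(axis=0)                                      # E_D[y(x; D)]
    bias2 = np.mean((avg - h) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree {degree}: bias^2={bias2:.3f}  variance={variance:.3f}  "
          f"noise={noise_std**2:.3f}")
# Low-degree models: high bias, low variance; high-degree models: the reverse.
```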

Mitigating overfitting
- Start with a simple model (Occam's razor principle)
- Regularization (reduces the effective complexity of the model and can reduce the test error)
- Data cleaning and preprocessing (the effect is not guaranteed)
  - Noise removal
  - Data augmentation
- Enlarge the data set
- Use a validation set (used to estimate the generalization error, usually for model selection).
Understanding generalization error in depth
Polynomial curve fitting problem
![polynomial curve fitting](img/1.2.12.jpg)
![polynomial curve fitting, continued](img/1.2.13.jpg)
![polynomial curve fitting, continued](img/1.2.14.jpg)
When the complexity of the model increases, noise perturbations make the fitted curve swing wildly between the data points.
![effect of the regularization coefficient](img/1.2.15.jpg)
$\ln\lambda=-\infty$ (i.e. $\lambda=0$): the penalty term is essentially inactive.
The larger $\lambda$ is, the more weight is given to model complexity.
$\lambda$ controls the model complexity: a large $\lambda$ forces a low model complexity. If the model becomes too simple, it may underfit and the error increases again (compare figures 1 and 2). A runnable sketch of the effect of $\lambda$ follows.
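A minimal ridge-regression sketch of the effect of $\lambda$, fitting a degree-9 polynomial with an L2 penalty on the coefficients; the data-generating function, the degree, and the candidate $\lambda$ values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
x_train = rng.uniform(0, 1, 12)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 12)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

def design(x, degree=9):
    return np.vander(x, degree + 1)

def ridge_fit(lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 (structural risk with J(f)=||w||^2)."""
    X = design(x_train)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y_train)

for lam in (1e-8, 1e-3, 1.0, 100.0):   # 1e-8 plays the role of "lambda close to 0"
    w = ridge_fit(lam)
    test_err = np.mean((design(x_test) @ w - y_test) ** 2)
    print(f"lambda={lam:g}: test MSE={test_err:.3f}")
# lambda too small -> overfitting; lambda too large -> underfitting (model too simple).
```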
| Training performance | Test performance | Interpretation |
|---|---|---|
| good (+) | good (+) | good fit |
| good (+) | poor (-) | overfitting |
| poor (-) | good (+) | — |
| poor (-) | poor (-) | underfitting |
Basic concepts
Test set (testing set)
- The test error on the test set is used as an approximation of the generalization error
- It is used to assess the learner's ability to discriminate new samples and to evaluate the final model
- The test set should be mutually exclusive with the training set; we assume the test samples are also drawn independently and identically distributed (i.i.d.) from the true sample distribution.
Validation set (validation set)
- Split the training data into a training set and a validation set
- Perform model selection and hyperparameter tuning based on performance on the validation set
For example, train $f_1$ with $\lambda_1$ and $f_2$ with $\lambda_2$ on the training set, then run both on the validation set and choose between $f_1$ and $f_2$ (a small sketch follows).
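A minimal sketch of that selection step: fit one regularized model per candidate $\lambda$ on the training portion and keep whichever performs better on the validation portion (the split, the ridge-style model, and the candidate values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 40)

# Hold out the last 10 samples as the validation set.
x_tr, y_tr, x_val, y_val = x[:30], y[:30], x[30:], y[30:]

def ridge_fit(x, y, lam, degree=9):
    X = np.vander(x, degree + 1)
    return np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)

def val_error(w, degree=9):
    return np.mean((np.vander(x_val, degree + 1) @ w - y_val) ** 2)

candidates = [1e-4, 1e-1]                     # lambda_1 and lambda_2
errors = {lam: val_error(ridge_fit(x_tr, y_tr, lam)) for lam in candidates}
best = min(errors, key=errors.get)
print(errors, "-> choose lambda =", best)
```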
Method
Hold-out method (hold-out):
- Directly split the training data into two mutually exclusive sets: a training set and a validation set
- The training/validation split should preserve the consistency of the data distribution as much as possible
- Use several random splits, each producing a training/validation pair for experimental evaluation, and finally report the mean and standard deviation over all results
Common practice: split several times and take the average.
- Typically about 2/3 to 4/5 of the data is used for training
K-fold cross-validation
- Randomly partition the training data into K disjoint subsets of equal size
- Train the model on K-1 of the subsets and use the remaining subset as the validation set
- Repeat this process for each of the K possible choices of validation subset
- Finally choose the model with the smallest average validation error over the K evaluations (a runnable sketch appears after the notes on leave-one-out below)
Notes:
The most common value of K is 10; 5 and 20 are also commonly used.
Each subset should preserve the consistency of the data distribution as much as possible.
To reduce the variability introduced by different random splits, cross-validation is usually repeated several times, e.g. 10 repetitions of 10-fold cross-validation.
Folds: 1 | 2 | 3 | … | K
![k-fold cross-validation](img/1.2.16.jpg)
Repeating k-fold cross-validation k times requires $k \times k$ training runs in total, which is computationally expensive.
Special case: leave-one-out (leave-one-out)
- K-fold cross-validation with K equal to the number of samples (each subset contains exactly one sample)
- Since each subset has only one sample, leave-one-out is not affected by the randomness of the sample split.
- The evaluation given by leave-one-out is often considered relatively accurate. However, when the data set is large, the computational cost of training that many models becomes intolerable.
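A minimal K-fold cross-validation sketch in plain numpy; the model being evaluated (polynomial fitting), the error metric, and the candidate degrees are illustrative choices, and setting k equal to the number of samples gives leave-one-out:

```python
import numpy as np

def k_fold_cv(x, y, k, fit, predict, seed=0):
    """Return the average validation MSE over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)            # k disjoint subsets
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(x[train], y[train])
        errors.append(np.mean((predict(model, x[val]) - y[val]) ** 2))
    return np.mean(errors)

# Toy usage: compare polynomial degrees with 10-fold cross-validation.
rng = np.random.default_rng(7)
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 60)
for degree in (1, 3, 9):
    err = k_fold_cv(x, y, 10,
                    fit=lambda xs, ys, d=degree: np.polyfit(xs, ys, d),
                    predict=np.polyval)
    print(f"degree {degree}: CV error = {err:.3f}")
```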
summary
![summary](img/1.2.17.jpg)
The general process of machine learning
Obtain a finite set of training data
Determine the hypothesis space that contains all candidate models, i.e. the set of learning models
Determine the criterion for model selection, i.e. the learning strategy
- Decide whether to minimize the empirical risk or the structural risk
Determine the algorithm for solving for the optimal model, i.e. the learning algorithm
Select the optimal model by the learning method
Use the learned optimal model to predict or analyze new data
Machine learning classification
By whether labels are available
Supervised learning: spam classification, housing price prediction
- Labeled examples are available
Unsupervised learning: anomaly detection
- We do not know in advance which samples are anomalous
Semi-supervised learning: labeling speech
- Labeling data is expensive, so only part of the data is labeled. A common approach is to first apply supervised learning on the labeled part, then use the trained model to label the unlabeled data (see the sketch after this list).
Reinforcement learning: AlphaGo
- The data are not available at the start; they are generated continuously through interaction.
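A minimal self-training (pseudo-labeling) sketch of that idea, using a nearest-centroid classifier as the supervised learner; the two-blob data, the classifier, and the one-shot pseudo-labeling step are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(8)
# Two Gaussian blobs; only 10 of the 200 samples are labeled.
x = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
labeled = np.concatenate([rng.choice(100, 5, replace=False),
                          100 + rng.choice(100, 5, replace=False)])
unlabeled = np.setdiff1d(np.arange(200), labeled)

def fit_centroids(xs, ys):
    return np.array([xs[ys == c].mean(axis=0) for c in (0, 1)])

def predict(centroids, xs):
    d = np.linalg.norm(xs[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

# Step 1: supervised learning on the small labeled subset.
centroids = fit_centroids(x[labeled], y[labeled])
# Step 2: pseudo-label the unlabeled data with that model, then retrain on everything.
pseudo = predict(centroids, x[unlabeled])
x_all = np.vstack([x[labeled], x[unlabeled]])
y_all = np.concatenate([y[labeled], pseudo])
centroids = fit_centroids(x_all, y_all)
print("accuracy on all data:", np.mean(predict(centroids, x) == y))
```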
By output space
- Binary classification: spam classification
- Multi-class classification: image classification
- Regression: housing price prediction
- Structured learning: machine translation, speech recognition, chatbots
By model
In the end we want $P(y|x)$.
Discriminative models: decision trees, support vector machines
- Directly determine $P(y|x)$ or $f(x)$
Generative models: e.g. GANs
- First determine the joint distribution $P(x,y)$
- Then use Bayes' theorem: $P(y|x)=\frac{P(x,y)}{P(x)}$ (a tiny sketch follows the figure)
![discriminative vs generative models](img/1.2.18.jpg)
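A tiny sketch of the generative route: estimate the joint distribution $P(x,y)$ from counts on a discrete toy data set, then recover $P(y\mid x)$ via Bayes' theorem (the data set is an illustrative assumption):

```python
import numpy as np

# Toy discrete data: feature x in {0, 1, 2}, label y in {0, 1}.
x = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 1])

# Generative step: estimate the joint distribution P(x, y) from counts.
joint = np.zeros((3, 2))
for xi, yi in zip(x, y):
    joint[xi, yi] += 1
joint /= joint.sum()

# Bayes' theorem: P(y | x) = P(x, y) / P(x).
p_x = joint.sum(axis=1, keepdims=True)
posterior = joint / p_x
print(posterior)          # row i gives P(y=0 | x=i), P(y=1 | x=i)
```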
By algorithm
Batch learning: all data are fed to the learning algorithm in one batch; this can be vividly described as "cramming".
Online learning: data arrive in sequence and are learned one after another; the model is continuously revised and optimized.
Active learning: use some strategy to find the most valuable samples among the unlabeled data, hand them to human experts for labeling, then add the newly labeled data and their class labels to the training set and iteratively retrain the model to improve its performance.
- While learning, the model asks about the samples it finds most informative: "what is the label of this one?"