Understanding the mathematical essence of machine learning
2022-07-26 05:58:00 【Datawhale】
Datawhale insights
Author: Academician Weinan E. Source: AI for Science Institute (AISI). At 22:30 Beijing time on the evening of July 8, 2022, Academician Weinan E delivered a one-hour plenary talk at the 2022 International Congress of Mathematicians. Today we share the content of his lecture. Professor E first presented his understanding of the mathematical essence of machine learning (function approximation; approximation and sampling of probability distributions; solution of the Bellman equation); he then introduced the mathematical theory of approximation error, generalization, and training for machine learning models; finally, he described how machine learning can be used to attack hard problems in scientific computing and in science itself, i.e., AI for Science. Compiled by Hertz.

The mathematical nature of machine learning problems
As everyone knows, the development of machine learning has completely changed people's understanding of artificial intelligence. Machine learning has many amazing achievements, for example:
· Recognizing images more accurately than humans: given a set of labeled images, a machine learning algorithm can accurately identify the category of each image:

The CIFAR-10 problem: classify images into ten categories.
Source: https://www.cs.toronto.edu/~kriz/cifar.html
· AlphaGo defeating humans at Go: the Go-playing algorithm is realized entirely by machine learning:

Reference: https://www.bbc.com/news/technology-35761246
· Generating face images realistic enough to pass for the real thing:

Reference: https://arxiv.org/pdf/1710.10196v3.pdf
Machine learning has many other applications. In everyday life, people often use services powered by machine learning without even realizing it: spam filtering in our email systems, speech recognition in our cars and mobile phones, fingerprint unlocking on our phones...
All of these great achievements, in essence, come down to successfully solving some classical mathematical problems.
* For image classification, what we are really interested in is a function

f : image → category.

This function f maps each image to the category it belongs to. We know the values of f on the training set, and we want to find a function that approximates f well enough.
Generally speaking, the essence of a supervised learning problem is to produce, from a finite training set S, an efficient approximation of the target function.
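This view of supervised learning as function approximation can be sketched in a few lines. The target function, the hypothesis space (low-degree polynomials), and the sample sizes below are illustrative choices of ours, not anything from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# The unknown target function f*; in practice we only see it through S.
f_star = lambda x: np.sin(2 * np.pi * x)

# Finite training set S = {(x_i, y_i = f*(x_i))}.
x_train = rng.uniform(0.0, 1.0, 50)
y_train = f_star(x_train)

# Hypothesis space: polynomials of degree <= 9, fitted by least squares.
f_hat = np.poly1d(np.polyfit(x_train, y_train, deg=9))

# Measure how well f_hat approximates f* away from the training points.
x_test = np.linspace(0.0, 1.0, 200)
test_error = float(np.mean((f_hat(x_test) - f_star(x_test)) ** 2))
print(test_error)
```

In one dimension this works well; the rest of the talk explains why such classical hypothesis spaces break down as the input dimension grows.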
* For face generation, the essence is to approximate and sample an unknown probability distribution. Here a "face" is a random variable whose probability distribution we do not know. We do, however, have samples of "faces": a huge number of face photos. From these samples we can approximate the probability distribution of "faces" and then draw new samples from it (that is, generate new faces).
Generally speaking, the essence of unsupervised learning is to use a finite set of samples to approximate, and sample from, the unknown probability distribution behind the data.
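As a toy sketch of this two-step pattern (a one-dimensional Gaussian stands in for the distribution of faces, and a parametric fit stands in for a generative model; all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Unknown distribution (the analogue of "faces"); we only get finite samples.
samples = rng.normal(loc=3.0, scale=0.5, size=10_000)

# Step 1: approximate the distribution, here within the Gaussian family.
mu_hat = float(samples.mean())
sigma_hat = float(samples.std())

# Step 2: sample from the approximation, i.e. "generate" new data points.
new_faces = rng.normal(mu_hat, sigma_hat, size=5)
print(mu_hat, sigma_hat, new_faces)
```

Real generative models replace the Gaussian family with a neural network, but the structure (approximate the distribution, then sample it) is the same.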
* For AlphaGo, once the opponent's strategy is given, the dynamics of Go become a dynamic programming problem whose optimal strategy satisfies the Bellman equation. In essence, AlphaGo solves a Bellman equation.
Generally speaking, the essence of reinforcement learning is to find the optimal policy of a Markov decision process.
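For a small, finite Markov decision process, the Bellman equation can be solved directly by value iteration. The tiny two-state MDP below is an invented toy (nothing to do with Go), meant only to show what "solving the Bellman equation" means:

```python
import numpy as np

# Toy MDP: P[a, s, s'] = transition probability, R[s, a] = reward, gamma = discount.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # transitions under action 0
              [[0.5, 0.5], [0.3, 0.7]]])   # transitions under action 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

# Value iteration: apply the Bellman optimality operator
#   V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) V(s') ]
# until it reaches a fixed point, which is the solution of the Bellman equation.
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * np.einsum("ast,t->sa", P, V)  # Q[s, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-12:
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)  # greedy (optimal) policy
print(V, policy)
```

This works because the state space has two elements. For Go, the state space is astronomically large, which is exactly the high-dimensional difficulty discussed next.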
However, these are all classical problems in computational mathematics! After all, function approximation, the approximation and sampling of probability distributions, and the numerical solution of differential and difference equations are extremely classical problems in that field. So what distinguishes these problems, in the machine learning setting, from their classical counterparts? The answer is:
dimensionality
For example, in image recognition the input dimension is d = 32 × 32 × 3 = 3072 (for CIFAR-10). For classical numerical approximation methods applied to a d-dimensional problem, a model with m parameters has approximation error on the order of m^(−α/d), where α reflects the smoothness of the target. In other words, to reduce the error by a factor of 10, the number of parameters must grow by a factor of 10^(d/α). As the dimension d increases, the computational cost grows exponentially. This phenomenon is commonly called:
the curse of dimensionality
All classical algorithms, such as polynomial and wavelet approximation, suffer from the curse of dimensionality. Clearly, the success of machine learning tells us that on high-dimensional problems, deep neural networks perform far better than the classical algorithms. But how is this "success" achieved? Why, when nothing else works in high dimensions, have deep neural networks achieved such unprecedented success?
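The parameter-count arithmetic behind the curse can be made concrete. The smoothness exponent alpha = 2 below is an arbitrary illustrative choice:

```python
# For a classical method with approximation error ~ m^(-alpha/d), achieving
# error eps requires on the order of m ~ eps^(-d / alpha) parameters.
def params_needed(eps: float, d: int, alpha: float = 2.0) -> float:
    return eps ** (-d / alpha)

# Reducing the error 10x multiplies the parameter count by 10^(d/alpha):
for d in (2, 10, 100):
    print(d, params_needed(0.1, d))
```

Already at d = 100 (far below the d = 3072 of CIFAR-10), the required parameter count exceeds anything computable.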
Understanding the "black magic" of machine learning from a mathematical standpoint: the mathematical theory of supervised learning
2.1 Notation and setup
A neural network is a special kind of function. For example, a two-layer neural network is

f_m(x) = (1/m) Σ_{j=1}^{m} a_j σ(⟨w_j, x⟩),

with two groups of parameters, {a_j} and {w_j}. Here σ is the activation function; common choices include:
· σ(z) = max(z, 0), the ReLU function;
· σ(z) = 1/(1 + e^{−z}), the sigmoid function.
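In code, this two-layer network (with the 1/m scaling, ReLU activation, and biases omitted as in the text; the random weights are just placeholders we chose for illustration) looks like:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def two_layer_net(x, a, W):
    # f_m(x) = (1/m) * sum_j a_j * sigma(<w_j, x>), sigma = ReLU, no biases.
    return float(a @ relu(W @ x) / len(a))

d, m = 5, 100                  # input dimension and network width
a = rng.normal(size=m)         # outer coefficients a_j
W = rng.normal(size=(m, d))    # inner weights w_j, stored as the rows of W
x = rng.normal(size=d)
print(two_layer_net(x, a, W))
```

Note that without biases a ReLU network is positively homogeneous: scaling the input by t > 0 scales the output by t.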
The basic building blocks of a neural network are linear transformations and componentwise (one-dimensional) nonlinear transformations. A deep neural network is, in general, a composition of such maps:

f(x) = W_L σ(W_{L−1} σ(⋯ σ(W_1 x))).

For simplicity we omit all bias terms here. Each W_l is a weight matrix, and the activation function σ acts on each component.
We approximate the target function f* on the training set S = {(x_i, y_i = f*(x_i)), i = 1, …, n}. Suppose the domain of f* is X, and let μ be the distribution of the inputs x. Our goal is then to minimize the testing error (also known as the population risk or generalization error):

R(f) = E_{x∼μ} [ (f(x) − f*(x))² ].
2.2 The errors of supervised learning
Supervised learning generally proceeds in the following steps:
* Step 1: choose a hypothesis space H_m (a set of trial functions), where m is proportional to the dimension of the trial space;
* Step 2: choose a loss function to optimize. Usually we choose the empirical risk to fit the data:

R̂_n(f) = (1/n) Σ_{i=1}^{n} (f(x_i) − y_i)²,

sometimes with additional penalty (regularization) terms.
* Step 3: solve the optimization problem, for example by
· gradient descent:

θ_{k+1} = θ_k − η ∇R̂_n(θ_k);

· stochastic gradient descent:

θ_{k+1} = θ_k − η ∇ℓ_{i_k}(θ_k),

where ℓ_i is the loss on the i-th sample and i_k is drawn at random from {1, …, n}.
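A minimal sketch of these three steps together, with SGD minimizing the empirical risk of a linear model (a stand-in for the network; the data, learning rate, and step count are all illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data y_i = <theta*, x_i>, and empirical risk
#   R_n(theta) = (1/n) * sum_i (<theta, x_i> - y_i)^2.
n, d = 200, 3
X = rng.normal(size=(n, d))
theta_star = np.array([1.0, -2.0, 0.5])
y = X @ theta_star

theta = np.zeros(d)
lr = 0.05
for _ in range(2000):
    i = rng.integers(n)                        # draw one index at random from 1..n
    grad = 2.0 * (X[i] @ theta - y[i]) * X[i]  # gradient of the i-th loss term
    theta -= lr * grad                         # the SGD update from the text

print(theta)
```

Each SGD step uses one sample instead of the full gradient of R̂_n, which is what makes the method cheap on large data sets.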
If we denote the output of the machine learning procedure by f̂, the total error is f̂ − f*. Define further:
* f_m = argmin_{f∈H_m} R(f), the best approximation of f* in the hypothesis space;
* f_{m,n} = argmin_{f∈H_m} R̂_n(f), the best approximation in the hypothesis space based on the data set S.
The error can then be decomposed into three parts:

f̂ − f* = (f_m − f*) + (f_{m,n} − f_m) + (f̂ − f_{m,n}),

where:
* f_m − f* is the approximation error, determined entirely by the choice of hypothesis space;
* f_{m,n} − f_m is the estimation error, the additional error caused by the finite size of the data set;
* f̂ − f_{m,n} is the optimization error, the additional error introduced by training (optimization).
2.3 Approximation error
Let us focus first on the approximation error.
Let us compare with the traditional Fourier representation

f(x) = ∫ a(ω) e^{i⟨ω, x⟩} dω.

If we approximate f by a discrete Fourier sum over a fixed grid of frequencies,

f_m(x) = (1/m) Σ_j a(ω_j) e^{i⟨ω_j, x⟩},

the error ‖f − f_m‖ is proportional to m^(−α/d), and thus undoubtedly suffers from the curse of dimensionality.
But if a function can be expressed in the expectation form

f(x) = E_{ω∼π} [ a(ω) e^{i⟨ω, x⟩} ],

then, letting {ω_j} be i.i.d. samples from the measure π, we can set

f_m(x) = (1/m) Σ_{j=1}^{m} a(ω_j) e^{i⟨ω_j, x⟩},

and the error satisfies

E ‖f − f_m‖² ≲ Var(f) / m, i.e., the error is of order m^(−1/2).

As you can see, this rate is independent of the dimension!
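This dimension-independence is just the Monte Carlo rate, and it is easy to check numerically. The test integrand g(x) = x_1² (with E[g] = 1 under a standard Gaussian in every dimension) and the sample counts are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_error(d: int, m: int, trials: int = 300) -> float:
    # Average |(1/m) sum_k g(X_k) - E[g]| over independent runs, with
    # g(x) = x_1^2 and X ~ N(0, I_d), so E[g] = 1 exactly in every dimension.
    errs = np.empty(trials)
    for t in range(trials):
        X = rng.normal(size=(m, d))
        errs[t] = abs((X[:, 0] ** 2).mean() - 1.0)
    return float(errs.mean())

# The error depends on m (as ~ 1/sqrt(m)), not on the dimension d.
print(mc_error(d=2, m=400), mc_error(d=200, m=400))
```

Raising d by a factor of 100 leaves the error essentially unchanged, while raising m by a factor of 100 cuts it by about 10.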
If the activation function is σ(z) = e^{iz}, then f_m is exactly a two-layer neural network with activation function σ. This result means that functions of this kind, i.e., those that can be expressed as an expectation, can be approximated by two-layer neural networks at a rate that is independent of the dimension!
For general two-layer neural networks we can obtain a series of similar approximation results. The key question is: which functions can be approximated by two-layer neural networks? To answer it, we introduce the Barron space:

The definition of the Barron space.
Reference: E, Chao Ma, Lei Wu (2019)
For any Barron function f, there is a two-layer neural network f_m with m neurons whose approximation error satisfies

‖f − f_m‖ ≲ ‖f‖_B / √m.

This approximation error is independent of the dimension! (For details of this theory, see E, Ma and Wu (2018, 2019) and E and Wojtowytsch (2020). For other results on the Barron space, see Kurkova (2001), Bach (2017), Siegel and Xu (2021).)
Similar theories can be extended to residual neural networks, where the Barron space is replaced by flow-induced function spaces.
2.4 Generalization: the gap between training error and testing error
One usually expects the gap between training error and testing error to be proportional to 1/√n (where n is the number of samples). However, the trained model is strongly correlated with the training data, so this Monte Carlo rate does not necessarily hold. We therefore need the following generalization theory:

In short, we use the Rademacher complexity to measure a function space's ability to fit random noise on a data set. It is defined as

Rad_S(H) = (1/n) E_ξ [ sup_{f∈H} Σ_{i=1}^{n} ξ_i f(x_i) ],

where the ξ_i are i.i.d. random variables taking the values +1 and −1 with equal probability.
When H is the unit ball of a Lipschitz space, its Rademacher complexity is proportional to n^(−1/d).
As d increases, the sample size required for fitting therefore grows exponentially. This is the curse of dimensionality in yet another form.
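The Rademacher complexity of a simple class can be estimated directly by sampling random signs. The class of linear functions with unit-norm weights below is a standard textbook example we chose for illustration; for it, the supremum over the class has a closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

def rademacher_linear(X: np.ndarray, draws: int = 2000) -> float:
    # Empirical Rademacher complexity of H = {x -> <w, x> : ||w||_2 <= 1}:
    #   Rad_S(H) = (1/n) * E_xi [ sup_{f in H} sum_i xi_i f(x_i) ].
    # For this class, sup_{||w||<=1} <w, sum_i xi_i x_i> = ||sum_i xi_i x_i||_2.
    n = len(X)
    vals = np.empty(draws)
    for t in range(draws):
        xi = rng.choice([-1.0, 1.0], size=n)   # i.i.d. random signs ("noise")
        vals[t] = np.linalg.norm(xi @ X) / n
    return float(vals.mean())

n, d = 400, 5
X = rng.normal(size=(n, d))
print(rademacher_linear(X))   # on the order of sqrt(d / n) ~ 0.11 here
```

Small Rademacher complexity means the class cannot fit random labels, which is what controls the gap between training and testing error.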
2.5 Mathematical understanding of the training process
Regarding the training of neural networks, there are two basic questions:
* Does the gradient descent method converge quickly?
* Does the result of training generalize well?
For the first question the answer is, I am afraid, pessimistic. A lemma of Shamir (2018) tells us that the convergence rate of gradient-based training methods also suffers from the curse of dimensionality. The Barron space mentioned above, although a good vehicle for building approximation theory, is too large a space for understanding the training of neural networks.
In particular, such negative results can be described concretely in the highly over-parameterized regime (i.e., m ≫ n). In this regime the parameter dynamics exhibit scale separation: for the two-layer neural network

f_m(x) = (1/m) Σ_{j=1}^{m} a_j σ(⟨w_j, x⟩),

the two groups of parameters evolve on different time scales during training, and when m is very large the dynamics of the inner weights w_j are almost frozen.
In this regime, the good news is that we have exponential convergence (Du et al., 2018); the bad news is that the resulting network is then no better than a random feature model.
We can also understand gradient descent from the mean-field perspective. Write u_j = (a_j, w_j) and let

ρ_m = (1/m) Σ_{j=1}^{m} δ_{u_j},

so that f_m(x) = ∫ a σ(⟨w, x⟩) ρ_m(da, dw). Then the parameters (u_j) follow the gradient-descent dynamics if and only if ρ = ρ_m solves the mean-field equation

∂_t ρ = ∇ · ( ρ ∇ (δR/δρ) )

(see Chizat and Bach (2018), Mei, Montanari and Nguyen (2018), Rotskoff and Vanden-Eijnden (2018), Sirignano and Spiliopoulos (2018)).
This mean-field dynamics is in fact a gradient flow in the sense of the Wasserstein metric. It has been proved that if the support of the initial value ρ_0 is the whole space and the gradient flow does converge, then the limit must be a global optimum (see Chizat and Bach (2018, 2020), Wojtowytsch (2020)).
Applications of machine learning
3.1 Solving high-dimensional problems in scientific computing
Since machine learning is an effective tool for high-dimensional problems, we can use it to solve problems that traditional computational mathematics cannot handle.
The first example is stochastic control. Traditional methods for stochastic control require solving an extremely high-dimensional Bellman equation, whereas machine learning methods can solve such problems effectively; the idea is quite similar to residual neural networks (see Jiequn Han and E (2016)):

The second example is solving nonlinear parabolic equations. A nonlinear parabolic equation can be rewritten as a stochastic control problem whose minimizer is unique and corresponds to the solution of the original equation.

3.2 AI for Science
With machine learning able to handle high-dimensional problems, we can tackle many more scientific problems. Here are two examples. The first is AlphaFold.

Reference: J. Jumper et al. (2021)
The second example is our own work: Deep Potential Molecular Dynamics (DeePMD), which achieves molecular dynamics with ab initio accuracy. The new simulation "paradigm" we use is:
* use first-principles quantum mechanics calculations to provide the data;
* use neural networks to fit an accurate potential energy surface (see Behler and Parrinello (2007), Jiequn Han et al. (2017), Linfeng Zhang et al. (2018)).
With DeePMD we can simulate a wide range of materials and molecules with first-principles accuracy:

We have also achieved a first-principles-accuracy simulation of 100 million atoms, for which we received the 2020 ACM Gordon Bell Prize:

Reference: Weile Jia et al., SC20, 2020 ACM Gordon Bell Prize
We have also computed the phase diagram of water:

Reference: Linfeng Zhang, Han Wang, et al. (2021)
In fact, physical modeling spans multiple scales: macroscopic, mesoscopic, and microscopic. Machine learning happens to provide the tools for modeling across these scales.

AI for Science, that is, using machine learning to solve scientific problems, has already produced a series of important breakthroughs, for example:
* quantum many-body problems: RBM (2017), DeePWF (2018), FermiNet (2019), PauliNet (2019), …;
* density functional theory: DeePKS (2020), NeuralXC (2020), DM21 (2021), …;
* molecular dynamics: DeePMD (2018), DeePCG (2019), …;
* kinetic equations: machine learning moment closure (Han et al., 2019);
* continuum mechanics: (2020).
In the next five to ten years, it may become possible to model and compute across all physical scales. This would completely change how we solve practical problems in areas such as drug design, materials, combustion engines, and catalysis.

Summary
Machine learning is, at bottom, a set of mathematical problems in high dimension. Neural networks are an effective means of approximating high-dimensional functions, and this offers many new possibilities to artificial intelligence and to science and technology.
It has also created a new theme within mathematics: high-dimensional analysis. In short, it can be summarized as follows:
* supervised learning: theory of high-dimensional functions;
* unsupervised learning: theory of high-dimensional probability distributions;
* reinforcement learning: high-dimensional Bellman equations;
* time-series learning: high-dimensional dynamical systems.

About AISI
The Beijing AI for Science Institute (AISI) was founded in September 2021. Led by Academician Weinan E, AISI is committed to combining AI with scientific research, accelerating developments and breakthroughs across scientific fields, promoting innovation in the scientific research paradigm, and building a world-leading "AI for Science" infrastructure.
AISI's researchers come from top universities, research institutions, and technology companies at home and abroad, and focus together on core problems in physical modeling, numerical algorithms, artificial intelligence, high-performance computing, and related interdisciplinary areas.
AISI strives to create an academic environment where ideas collide, encouraging free exploration and cross-disciplinary collaboration, and jointly exploring new possibilities for combining artificial intelligence with scientific research.
