Understanding the mathematical essence of machine learning
2022-07-26 05:58:00 【Datawhale】
Datawhale insights
Author: Academician Weinan E. Source: AI for Science Institute (AISI). At 22:30 Beijing time on the evening of July 8, 2022, Academician Weinan E delivered a one-hour plenary talk at the 2022 International Congress of Mathematicians. Today we share the content of his lecture. Professor E first presented his understanding of the mathematical essence of machine learning (function approximation; approximation and sampling of probability distributions; solution of the Bellman equation); he then introduced the mathematical theory of approximation error, generalization, and training for machine learning models; finally, he described how machine learning can be used to attack hard problems in scientific computing and in science itself, i.e., AI for Science. Compiled by Hertz.

The mathematical nature of machine learning problems
As everyone knows, the development of machine learning has completely changed people's understanding of artificial intelligence. Machine learning has many amazing achievements, for example:
· Recognizing images more accurately than humans: given a set of labeled images, a machine learning algorithm can accurately identify the category of each image:

The CIFAR-10 problem: classify images into ten categories.
Source: https://www.cs.toronto.edu/~kriz/cifar.html
· AlphaGo defeating humans at Go: the Go-playing algorithm is realized entirely by machine learning:

Reference: https://www.bbc.com/news/technology-35761246
· Generating face images realistic enough to pass for the real thing:

Reference: https://arxiv.org/pdf/1710.10196v3.pdf
Machine learning has many other applications. In everyday life, people often use services powered by machine learning without even realizing it: spam filtering in our email systems, speech recognition in our cars and mobile phones, fingerprint unlocking on our phones...
All of these great achievements, in essence, come down to successfully solving some classical mathematical problems.
* For image classification, what we are really interested in is a function

f : image → category.

This function f maps each image to the category it belongs to. We know the values of f on the training set, and we want to find a function that approximates f well enough.
Generally speaking, the essence of a supervised learning problem is to produce, from a finite training set S, an efficient approximation of the target function.
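This view of supervised learning as function approximation can be sketched in a few lines. The target function, the hypothesis space (low-degree polynomials), and the sample sizes below are illustrative choices of ours, not anything from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# The unknown target function f*; in practice we only see it through S.
f_star = lambda x: np.sin(2 * np.pi * x)

# Finite training set S = {(x_i, y_i = f*(x_i))}.
x_train = rng.uniform(0.0, 1.0, 50)
y_train = f_star(x_train)

# Hypothesis space: polynomials of degree <= 9, fitted by least squares.
f_hat = np.poly1d(np.polyfit(x_train, y_train, deg=9))

# Measure how well f_hat approximates f* away from the training points.
x_test = np.linspace(0.0, 1.0, 200)
test_error = float(np.mean((f_hat(x_test) - f_star(x_test)) ** 2))
print(test_error)
```

In one dimension this works well; the rest of the talk explains why such classical hypothesis spaces break down as the input dimension grows.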
* For face generation, the essence is to approximate and sample an unknown probability distribution. Here a "face" is a random variable whose probability distribution we do not know. We do, however, have samples of "faces": a huge number of face photos. From these samples we can approximate the probability distribution of "faces" and then draw new samples from it (that is, generate new faces).
Generally speaking, the essence of unsupervised learning is to use a finite set of samples to approximate, and sample from, the unknown probability distribution behind the data.
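As a toy sketch of this two-step pattern (a one-dimensional Gaussian stands in for the distribution of faces, and a parametric fit stands in for a generative model; all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Unknown distribution (the analogue of "faces"); we only get finite samples.
samples = rng.normal(loc=3.0, scale=0.5, size=10_000)

# Step 1: approximate the distribution, here within the Gaussian family.
mu_hat = float(samples.mean())
sigma_hat = float(samples.std())

# Step 2: sample from the approximation, i.e. "generate" new data points.
new_faces = rng.normal(mu_hat, sigma_hat, size=5)
print(mu_hat, sigma_hat, new_faces)
```

Real generative models replace the Gaussian family with a neural network, but the structure (approximate the distribution, then sample it) is the same.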
* For AlphaGo, once the opponent's strategy is given, the dynamics of Go become a dynamic programming problem whose optimal strategy satisfies the Bellman equation. In essence, AlphaGo solves a Bellman equation.
Generally speaking, the essence of reinforcement learning is to find the optimal policy of a Markov decision process.
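For a small, finite Markov decision process, the Bellman equation can be solved directly by value iteration. The tiny two-state MDP below is an invented toy (nothing to do with Go), meant only to show what "solving the Bellman equation" means:

```python
import numpy as np

# Toy MDP: P[a, s, s'] = transition probability, R[s, a] = reward, gamma = discount.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # transitions under action 0
              [[0.5, 0.5], [0.3, 0.7]]])   # transitions under action 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

# Value iteration: apply the Bellman optimality operator
#   V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) V(s') ]
# until it reaches a fixed point, which is the solution of the Bellman equation.
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * np.einsum("ast,t->sa", P, V)  # Q[s, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-12:
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)  # greedy (optimal) policy
print(V, policy)
```

This works because the state space has two elements. For Go, the state space is astronomically large, which is exactly the high-dimensional difficulty discussed next.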
However, these are all classical problems in computational mathematics! After all, function approximation, the approximation and sampling of probability distributions, and the numerical solution of differential and difference equations are extremely classical problems in that field. So what distinguishes these problems, in the machine learning setting, from their classical counterparts? The answer is:
dimensionality
For example, in image recognition the input dimension is d = 32 × 32 × 3 = 3072 (for CIFAR-10). For classical numerical approximation methods applied to a d-dimensional problem, a model with m parameters has approximation error on the order of m^(−α/d), where α reflects the smoothness of the target. In other words, to reduce the error by a factor of 10, the number of parameters must grow by a factor of 10^(d/α). As the dimension d increases, the computational cost grows exponentially. This phenomenon is commonly called:
the curse of dimensionality
All classical algorithms, such as polynomial and wavelet approximation, suffer from the curse of dimensionality. Clearly, the success of machine learning tells us that on high-dimensional problems, deep neural networks perform far better than the classical algorithms. But how is this "success" achieved? Why, when nothing else works in high dimensions, have deep neural networks achieved such unprecedented success?
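The parameter-count arithmetic behind the curse can be made concrete. The smoothness exponent alpha = 2 below is an arbitrary illustrative choice:

```python
# For a classical method with approximation error ~ m^(-alpha/d), achieving
# error eps requires on the order of m ~ eps^(-d / alpha) parameters.
def params_needed(eps: float, d: int, alpha: float = 2.0) -> float:
    return eps ** (-d / alpha)

# Reducing the error 10x multiplies the parameter count by 10^(d/alpha):
for d in (2, 10, 100):
    print(d, params_needed(0.1, d))
```

Already at d = 100 (far below the d = 3072 of CIFAR-10), the required parameter count exceeds anything computable.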
Understanding the "black magic" of machine learning from a mathematical standpoint: the mathematical theory of supervised learning
2.1 Notation and setup
A neural network is a special kind of function. For example, a two-layer neural network is

f_m(x) = (1/m) Σ_{j=1}^{m} a_j σ(⟨w_j, x⟩),

with two groups of parameters, {a_j} and {w_j}. Here σ is the activation function; common choices include:
· σ(z) = max(z, 0), the ReLU function;
· σ(z) = 1/(1 + e^{−z}), the sigmoid function.
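In code, this two-layer network (with the 1/m scaling, ReLU activation, and biases omitted as in the text; the random weights are just placeholders we chose for illustration) looks like:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def two_layer_net(x, a, W):
    # f_m(x) = (1/m) * sum_j a_j * sigma(<w_j, x>), sigma = ReLU, no biases.
    return float(a @ relu(W @ x) / len(a))

d, m = 5, 100                  # input dimension and network width
a = rng.normal(size=m)         # outer coefficients a_j
W = rng.normal(size=(m, d))    # inner weights w_j, stored as the rows of W
x = rng.normal(size=d)
print(two_layer_net(x, a, W))
```

Note that without biases a ReLU network is positively homogeneous: scaling the input by t > 0 scales the output by t.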
The basic building blocks of a neural network are linear transformations and componentwise (one-dimensional) nonlinear transformations. A deep neural network is, in general, a composition of such maps:

f(x) = W_L σ(W_{L−1} σ(⋯ σ(W_1 x))).

For simplicity we omit all bias terms here. Each W_l is a weight matrix, and the activation function σ acts on each component.
We approximate the target function f* on the training set S = {(x_i, y_i = f*(x_i)), i = 1, …, n}. Suppose the domain of f* is X, and let μ be the distribution of the inputs x. Our goal is then to minimize the testing error (also known as the population risk or generalization error):

R(f) = E_{x∼μ} [ (f(x) − f*(x))² ].
2.2 The errors of supervised learning
Supervised learning generally proceeds in the following steps:
* Step 1: choose a hypothesis space H_m (a set of trial functions), where m is proportional to the dimension of the trial space;
* Step 2: choose a loss function to optimize. Usually we choose the empirical risk to fit the data:

R̂_n(f) = (1/n) Σ_{i=1}^{n} (f(x_i) − y_i)²,

sometimes with additional penalty (regularization) terms.
* Step 3: solve the optimization problem, for example by
· gradient descent:

θ_{k+1} = θ_k − η ∇R̂_n(θ_k);

· stochastic gradient descent:

θ_{k+1} = θ_k − η ∇ℓ_{i_k}(θ_k),

where ℓ_i is the loss on the i-th sample and i_k is drawn at random from {1, …, n}.
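A minimal sketch of these three steps together, with SGD minimizing the empirical risk of a linear model (a stand-in for the network; the data, learning rate, and step count are all illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data y_i = <theta*, x_i>, and empirical risk
#   R_n(theta) = (1/n) * sum_i (<theta, x_i> - y_i)^2.
n, d = 200, 3
X = rng.normal(size=(n, d))
theta_star = np.array([1.0, -2.0, 0.5])
y = X @ theta_star

theta = np.zeros(d)
lr = 0.05
for _ in range(2000):
    i = rng.integers(n)                        # draw one index at random from 1..n
    grad = 2.0 * (X[i] @ theta - y[i]) * X[i]  # gradient of the i-th loss term
    theta -= lr * grad                         # the SGD update from the text

print(theta)
```

Each SGD step uses one sample instead of the full gradient of R̂_n, which is what makes the method cheap on large data sets.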
If we denote the output of the machine learning procedure by f̂, the total error is f̂ − f*. Define further:
* f_m = argmin_{f∈H_m} R(f), the best approximation of f* in the hypothesis space;
* f_{m,n} = argmin_{f∈H_m} R̂_n(f), the best approximation in the hypothesis space based on the data set S.
The error can then be decomposed into three parts:

f̂ − f* = (f_m − f*) + (f_{m,n} − f_m) + (f̂ − f_{m,n}),

where:
* f_m − f* is the approximation error, determined entirely by the choice of hypothesis space;
* f_{m,n} − f_m is the estimation error, the additional error caused by the finite size of the data set;
* f̂ − f_{m,n} is the optimization error, the additional error introduced by training (optimization).
2.3 Approximation error
Let us focus first on the approximation error.
Let us compare with the traditional Fourier representation

f(x) = ∫ a(ω) e^{i⟨ω, x⟩} dω.

If we approximate f by a discrete Fourier sum over a fixed grid of frequencies,

f_m(x) = (1/m) Σ_j a(ω_j) e^{i⟨ω_j, x⟩},

the error ‖f − f_m‖ is proportional to m^(−α/d), and thus undoubtedly suffers from the curse of dimensionality.
But if a function can be expressed in the expectation form

f(x) = E_{ω∼π} [ a(ω) e^{i⟨ω, x⟩} ],

then, letting {ω_j} be i.i.d. samples from the measure π, we can set

f_m(x) = (1/m) Σ_{j=1}^{m} a(ω_j) e^{i⟨ω_j, x⟩},

and the error satisfies

E ‖f − f_m‖² ≲ Var(f) / m, i.e., the error is of order m^(−1/2).

As you can see, this rate is independent of the dimension!
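This dimension-independence is just the Monte Carlo rate, and it is easy to check numerically. The test integrand g(x) = x_1² (with E[g] = 1 under a standard Gaussian in every dimension) and the sample counts are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_error(d: int, m: int, trials: int = 300) -> float:
    # Average |(1/m) sum_k g(X_k) - E[g]| over independent runs, with
    # g(x) = x_1^2 and X ~ N(0, I_d), so E[g] = 1 exactly in every dimension.
    errs = np.empty(trials)
    for t in range(trials):
        X = rng.normal(size=(m, d))
        errs[t] = abs((X[:, 0] ** 2).mean() - 1.0)
    return float(errs.mean())

# The error depends on m (as ~ 1/sqrt(m)), not on the dimension d.
print(mc_error(d=2, m=400), mc_error(d=200, m=400))
```

Raising d by a factor of 100 leaves the error essentially unchanged, while raising m by a factor of 100 cuts it by about 10.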
If the activation function is σ(z) = e^{iz}, then f_m is exactly a two-layer neural network with activation function σ. This result means that functions of this kind, i.e., those that can be expressed as an expectation, can be approximated by two-layer neural networks at a rate that is independent of the dimension!
For general two-layer neural networks we can obtain a series of similar approximation results. The key question is: which functions can be approximated by two-layer neural networks? To answer it, we introduce the Barron space:

The definition of the Barron space.
Reference: E, Chao Ma, Lei Wu (2019)
For any Barron function f, there is a two-layer neural network f_m with m neurons whose approximation error satisfies

‖f − f_m‖ ≲ ‖f‖_B / √m.

This approximation error is independent of the dimension! (For details of this theory, see E, Ma and Wu (2018, 2019) and E and Wojtowytsch (2020). For other results on the Barron space, see Kurkova (2001), Bach (2017), Siegel and Xu (2021).)
Similar theories can be extended to residual neural networks, where the Barron space is replaced by flow-induced function spaces.
2.4 Generalization: the gap between training error and testing error
One usually expects the gap between training error and testing error to be proportional to 1/√n (where n is the number of samples). However, the trained model is strongly correlated with the training data, so this Monte Carlo rate does not necessarily hold. We therefore need the following generalization theory:

In short, we use the Rademacher complexity to measure a function space's ability to fit random noise on a data set. It is defined as

Rad_S(H) = (1/n) E_ξ [ sup_{f∈H} Σ_{i=1}^{n} ξ_i f(x_i) ],

where the ξ_i are i.i.d. random variables taking the values +1 and −1 with equal probability.
When H is the unit ball of a Lipschitz space, its Rademacher complexity is proportional to n^(−1/d).
As d increases, the sample size required for fitting therefore grows exponentially. This is the curse of dimensionality in yet another form.
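The Rademacher complexity of a simple class can be estimated directly by sampling random signs. The class of linear functions with unit-norm weights below is a standard textbook example we chose for illustration; for it, the supremum over the class has a closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

def rademacher_linear(X: np.ndarray, draws: int = 2000) -> float:
    # Empirical Rademacher complexity of H = {x -> <w, x> : ||w||_2 <= 1}:
    #   Rad_S(H) = (1/n) * E_xi [ sup_{f in H} sum_i xi_i f(x_i) ].
    # For this class, sup_{||w||<=1} <w, sum_i xi_i x_i> = ||sum_i xi_i x_i||_2.
    n = len(X)
    vals = np.empty(draws)
    for t in range(draws):
        xi = rng.choice([-1.0, 1.0], size=n)   # i.i.d. random signs ("noise")
        vals[t] = np.linalg.norm(xi @ X) / n
    return float(vals.mean())

n, d = 400, 5
X = rng.normal(size=(n, d))
print(rademacher_linear(X))   # on the order of sqrt(d / n) ~ 0.11 here
```

Small Rademacher complexity means the class cannot fit random labels, which is what controls the gap between training and testing error.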
2.5 Mathematical understanding of the training process
Regarding the training of neural networks, there are two basic questions:
* Does the gradient descent method converge quickly?
* Does the result of training generalize well?
For the first question the answer is, I am afraid, pessimistic. A lemma of Shamir (2018) tells us that the convergence rate of gradient-based training methods also suffers from the curse of dimensionality. The Barron space mentioned above, although a good vehicle for building approximation theory, is too large a space for understanding the training of neural networks.
In particular, such negative results can be described concretely in the highly over-parameterized regime (i.e., m ≫ n). In this regime the parameter dynamics exhibit scale separation: for the two-layer neural network

f_m(x) = (1/m) Σ_{j=1}^{m} a_j σ(⟨w_j, x⟩),

the two groups of parameters evolve on different time scales during training, and when m is very large the dynamics of the inner weights w_j are almost frozen.
In this regime, the good news is that we have exponential convergence (Du et al., 2018); the bad news is that the resulting network is then no better than a random feature model.
We can also understand gradient descent from the mean-field perspective. Write u_j = (a_j, w_j) and let

ρ_m = (1/m) Σ_{j=1}^{m} δ_{u_j},

so that f_m(x) = ∫ a σ(⟨w, x⟩) ρ_m(da, dw). Then the parameters (u_j) follow the gradient-descent dynamics if and only if ρ = ρ_m solves the mean-field equation

∂_t ρ = ∇ · ( ρ ∇ (δR/δρ) )

(see Chizat and Bach (2018), Mei, Montanari and Nguyen (2018), Rotskoff and Vanden-Eijnden (2018), Sirignano and Spiliopoulos (2018)).
This mean-field dynamics is in fact a gradient flow in the sense of the Wasserstein metric. It has been proved that if the support of the initial value ρ_0 is the whole space and the gradient flow does converge, then the limit must be a global optimum (see Chizat and Bach (2018, 2020), Wojtowytsch (2020)).
Applications of machine learning
3.1 Solving high-dimensional problems in scientific computing
Since machine learning is an effective tool for high-dimensional problems, we can use it to solve problems that traditional computational mathematics cannot handle.
The first example is stochastic control. Traditional methods for stochastic control require solving an extremely high-dimensional Bellman equation, whereas machine learning methods can solve such problems effectively; the idea is quite similar to residual neural networks (see Jiequn Han and E (2016)):

The second example is solving nonlinear parabolic equations. A nonlinear parabolic equation can be rewritten as a stochastic control problem whose minimizer is unique and corresponds to the solution of the original equation.

3.2 AI for Science
With machine learning able to handle high-dimensional problems, we can tackle many more scientific problems. Here are two examples. The first is AlphaFold.

Reference: J. Jumper et al. (2021)
The second example is our own work: Deep Potential Molecular Dynamics (DeePMD), which achieves molecular dynamics with ab initio accuracy. The new simulation "paradigm" we use is:
* use first-principles quantum mechanics calculations to provide the data;
* use neural networks to fit an accurate potential energy surface (see Behler and Parrinello (2007), Jiequn Han et al. (2017), Linfeng Zhang et al. (2018)).
With DeePMD we can simulate a wide range of materials and molecules with first-principles accuracy:

We have also achieved a first-principles-accuracy simulation of 100 million atoms, for which we received the 2020 ACM Gordon Bell Prize:

Reference: Weile Jia et al., SC20, 2020 ACM Gordon Bell Prize
We have also computed the phase diagram of water:

Reference: Linfeng Zhang, Han Wang, et al. (2021)
In fact, physical modeling spans multiple scales: macroscopic, mesoscopic, and microscopic. Machine learning happens to provide the tools for modeling across these scales.

AI for Science, that is, using machine learning to solve scientific problems, has already produced a series of important breakthroughs, for example:
* quantum many-body problems: RBM (2017), DeePWF (2018), FermiNet (2019), PauliNet (2019), …;
* density functional theory: DeePKS (2020), NeuralXC (2020), DM21 (2021), …;
* molecular dynamics: DeePMD (2018), DeePCG (2019), …;
* kinetic equations: machine learning moment closure (Han et al., 2019);
* continuum mechanics: (2020).
In the next five to ten years, it may become possible to model and compute across all physical scales. This would completely change how we solve practical problems in areas such as drug design, materials, combustion engines, and catalysis.

Summary
Machine learning is, at bottom, a set of mathematical problems in high dimension. Neural networks are an effective means of approximating high-dimensional functions, and this offers many new possibilities to artificial intelligence and to science and technology.
It has also created a new theme within mathematics: high-dimensional analysis. In short, it can be summarized as follows:
* supervised learning: theory of high-dimensional functions;
* unsupervised learning: theory of high-dimensional probability distributions;
* reinforcement learning: high-dimensional Bellman equations;
* time-series learning: high-dimensional dynamical systems.

About AISI
The Beijing AI for Science Institute (AISI) was founded in September 2021. Led by Academician Weinan E, AISI is committed to combining AI with scientific research, accelerating developments and breakthroughs across scientific fields, promoting innovation in the scientific research paradigm, and building a world-leading "AI for Science" infrastructure.
AISI's researchers come from top universities, research institutions, and technology companies at home and abroad, and focus together on core problems in physical modeling, numerical algorithms, artificial intelligence, high-performance computing, and related interdisciplinary areas.
AISI strives to create an academic environment where ideas collide, encouraging free exploration and cross-disciplinary collaboration, and jointly exploring new possibilities for combining artificial intelligence with scientific research.
