cs231n learning record
2022-08-05 06:45:00 【ProfSnail】
cs231n is Stanford University's open course on deep learning for computer vision, and its materials are publicly available in many places. Since my recent work is related to computer vision, I am studying the course and recording the parts I found insightful. Because many complete, systematic sets of notes already exist, I will not write full course notes. As with my previous posts, this one will be updated as my study goes deeper.
After a first rough pass through convolutional neural networks, I often got stuck on back propagation, in particular on how gradients flow backward through a pooling layer. The chain rule gives the answer. With max pooling, the gradient handed down from upstream is passed in full to the branch that held the maximum in the forward pass; branches that were not the maximum receive no gradient. With average pooling, the forward function is f(x) = (x_1 + x_2 + x_3 + x_4)/4, a sum followed by a division, so ordinary differentiation applies and each input receives one quarter of the upstream gradient.
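The two pooling rules can be sketched on a single pooling window. This is a minimal illustration, assuming one flattened 2x2 window and a scalar upstream gradient; the function names are made up for this example.

```python
def max_pool_backward(window, upstream):
    """Route the whole upstream gradient to the position that held the maximum."""
    winner = max(range(len(window)), key=lambda i: window[i])
    return [upstream if i == winner else 0.0 for i in range(len(window))]


def avg_pool_backward(window, upstream):
    """Share the upstream gradient equally: d(mean)/dx_i = 1/n for every input."""
    n = len(window)
    return [upstream / n for _ in window]


window = [1.0, 5.0, 2.0, 3.0]  # one flattened 2x2 pooling window
max_grads = max_pool_backward(window, upstream=2.0)  # [0.0, 2.0, 0.0, 0.0]
avg_grads = avg_pool_backward(window, upstream=2.0)  # [0.5, 0.5, 0.5, 0.5]
```

Only the input that won the max receives gradient, while average pooling spreads it evenly.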
In short, gradient propagation in a neural network is a long chain of applications of the chain rule, and a good command of the chain rule makes back propagation easy to follow. Moreover, to know the gradient at a layer during back propagation, you only need to remember how values passed through that layer in the forward pass, so it is convenient to build a class that stores both the forward computation and its derivative rule. PyTorch does this well.
A summary of activation functions. This course finally untied a knot I had carried for three years: what exactly is a vanishing gradient?
When the input of the sigmoid function tends to positive or negative infinity, the function flattens out. This flatness is bad news for back propagation: the local derivative there is very small, close to 0, so even if the gradient arriving from upstream is large, what is passed further downstream is a value close to 0. This is the vanishing gradient problem.
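A small numeric check makes the effect concrete. This is a sketch only: the upstream gradient of 100 and the input values are arbitrary choices for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Local derivative of the sigmoid: s * (1 - s), at most 0.25 (at x = 0)."""
    s = sigmoid(x)
    return s * (1.0 - s)


upstream = 100.0  # a deliberately large upstream gradient
grad_at_zero = upstream * sigmoid_grad(0.0)   # 25.0: the best case
grad_at_ten = upstream * sigmoid_grad(10.0)   # tiny: the unit is saturated
```

Even in the best case the sigmoid passes on only a quarter of the upstream gradient, and in the flat region almost nothing survives.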
Similarly, saturation was a concept I had only understood vaguely. It can be compared to a saturated solution in chemistry: once enough salt is dissolved in water, any further salt, no matter how much, will not dissolve and stays in the water as crystals. A neural network saturates in the same sense: once the network has reached a certain state, its parameters can no longer be changed by gradient propagation, and this is called the saturated state. As for the remedies the course gives for vanishing gradients, it seems tanh and ReLU can help, but I learned this two or three weeks ago and have mostly forgotten. This blog is mainly a memo, so I am writing it down first and will patch in the solutions after reviewing the earlier material. As I recall, ReLU has many variants, including Leaky ReLU. On the positive side the function always has a gradient; on the negative side the input is either ignored (plain ReLU) or given a small gradient (Leaky ReLU).
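The difference between the two on the negative side can be sketched through their derivatives. The slope of 0.01 is an assumed value for illustration, not one taken from the course notes above.

```python
def relu_grad(x):
    """ReLU derivative: 1 on the positive side, 0 on the negative side."""
    return 1.0 if x > 0 else 0.0


def leaky_relu_grad(x, slope=0.01):
    """Leaky ReLU derivative: 1 on the positive side, a small slope otherwise."""
    return 1.0 if x > 0 else slope


positive = (relu_grad(3.0), leaky_relu_grad(3.0))    # (1.0, 1.0)
negative = (relu_grad(-3.0), leaky_relu_grad(-3.0))  # (0.0, 0.01)
```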
On the meaning of the bias term. When computing an activation, we always compare wx + b with 0 to decide whether the unit activates. I used to think only of the comparison with 0. After studying, I realized that wx is really being compared with -b: the bias determines the activation threshold. This matters because the bias is an adjustable parameter that back propagation updates. Does that mean that, with continued learning, the minimum stimulus the unit accepts is itself constantly adjusted? Or should the threshold not be adjusted, and stay the same from beginning to end? I lean toward the first interpretation.
A multilayer perceptron can solve the XOR classification problem. I had long been confused about this without a clear answer. Now that the course has settled it, I no longer need to struggle.
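One way to convince yourself is a tiny two-layer perceptron with hand-picked weights. This is a sketch under assumptions: the weights and thresholds below are chosen by hand for illustration, not learned, and a step function stands in for the activation.

```python
def step(z):
    """Fire (1) when the weighted input minus the threshold is positive."""
    return 1 if z > 0 else 0


def xor_mlp(x1, x2):
    h_or = step(x1 + x2 - 0.5)       # hidden unit 1: fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)      # hidden unit 2: fires only if both inputs are 1
    return step(h_or - h_and - 0.5)  # output: "OR but not AND", which is XOR


truth_table = {(a, b): xor_mlp(a, b) for a in (0, 1) for b in (0, 1)}
```

A single unit cannot separate these four points with one line, but the hidden layer turns them into a representation that one output unit can separate.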
The idea of regularization keeps coming up in the course. Regularization: regularizing what? The concept appears often in introductions to deep learning and also in some traditional machine vision theory. I do not understand the idea very well yet.
Batch normalization. I studied this part late at night and was sleepy, so in retrospect I do not remember it especially well. The main point of batch normalization is to bring the data seen during training back to a distribution centered at 0 with uniform variance. The advantage is that, after normalization, a small change in the slope of a line passing through the center has a less pronounced effect on data away from the center.
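The normalization step itself is short. The sketch below covers only the training-time normalization of one feature over one batch; the learnable scale and shift and the running statistics used at test time are left out, and `eps` is an assumed small constant for numerical stability.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize one feature over a batch to roughly zero mean and unit variance."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [(x - mean) / math.sqrt(var + eps) for x in batch]


normalized = batch_norm([2.0, 4.0, 6.0, 8.0])
new_mean = sum(normalized) / len(normalized)
new_var = sum((x - new_mean) ** 2 for x in normalized) / len(normalized)
```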
Normalization over a batch, rather than over the whole dataset, is chosen because computing power is limited. Normalizing over all the data would usually take unacceptably long. The course also covered several parameter initialization methods, which I think are quite useful but did not memorize. I will add them after reviewing. If the weights are initialized too small, the values become smaller and smaller as they propagate and nothing is learned. If they are initialized too large, the saturation described earlier appears and again nothing can be learned. A good initialization lets the weights of every layer learn effectively.
Thoughts on the contour plot of a loss function.
The figure just shown is a contour plot of a loss function. The center of the plot is the point of minimum loss. Each closed curve around it consists of points with equal loss, and the farther a curve is from the center, the larger the loss it represents. The contours are elongated ellipses, so moving vertically takes you from a higher-loss contour to a lower-loss one over a short distance, while moving horizontally between the same two contours requires a much longer distance. If the step size is the same in both directions, the update can overshoot in the steep direction, even jumping across the optimum, while making too little progress in the shallow direction to reach the minimum. This motivates methods that apply different learning rates and corrections to different parameter directions (different components of the vector).
Another problem with stochastic gradient descent (SGD) is that at a local minimum, or near a relatively flat stretch of the curve, the gradient is 0, so SGD stalls there and loses the drive to keep searching. The remedy is to give the update a velocity: even when it hits a flat spot, the velocity is still there, so it cannot stop at once and keeps sliding. If it overshoots, the gradient pulls it back until it settles at an extremum, or at least at the best minimum found so far. It is like a high school physics problem: a ball rolling on a surface with friction may get over the hill if its initial velocity is large, and may stop in the current shallow trough if it is not. The method I just described is called momentum optimization, SGD+Momentum.
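The effect of momentum on an elongated loss surface can be sketched on a toy quadratic. Everything here is an assumption made for illustration: the loss f(x, y) = 0.5 * (x^2 + 20 * y^2), the starting point, the learning rate, and rho = 0.9. The gradient is exact rather than stochastic.

```python
def grad(x, y):
    """Gradient of f(x, y) = 0.5 * (x**2 + 20 * y**2): shallow along x, steep along y."""
    return x, 20.0 * y


def gradient_descent(steps, lr=0.01):
    x, y = 10.0, 1.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y


def momentum_descent(steps, lr=0.01, rho=0.9):
    x, y = 10.0, 1.0
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        # The velocity keeps a decayed memory of past gradients.
        vx = rho * vx - lr * gx
        vy = rho * vy - lr * gy
        x, y = x + vx, y + vy
    return x, y


plain = gradient_descent(200)          # x is still far from 0 in the shallow direction
with_momentum = momentum_descent(200)  # x is very close to 0
```

With the same learning rate and the same number of steps, the plain update has barely crawled along the shallow x direction, while the accumulated velocity carries the momentum version almost all the way to the minimum.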
Plain momentum computes its correction from the gradient at the current point. An improved version, Nesterov Momentum, first takes the step implied by the current velocity, looks at where that lands, computes the gradient at that look-ahead position, and then returns to the original position to correct the route.
AdaGrad proposes a correction strategy based on how fast the gradient changes along each coordinate. The idea is to accumulate the squared gradients: if a direction has been corrected repeatedly and quickly, its update is divided by a larger number so that it slows down; if the gradient in a direction has been small, its update is divided by a smaller number so that it speeds up. In addition, because the squared gradients keep accumulating, the divisor grows larger and larger, so the effective learning rate becomes smaller and smaller over time. This effect is in fact intended.
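The shrinking step size of AdaGrad is easy to see for one parameter. A minimal sketch, assuming a constant gradient of 1 and an arbitrary learning rate of 0.1; `eps` guards against division by zero.

```python
import math


def adagrad_step_sizes(grads, lr=0.1, eps=1e-8):
    """Return the size of each AdaGrad update for a single parameter."""
    cache = 0.0  # running sum of squared gradients, never decays
    sizes = []
    for g in grads:
        cache += g * g
        sizes.append(lr * abs(g) / (math.sqrt(cache) + eps))
    return sizes


sizes = adagrad_step_sizes([1.0] * 100)
# With a constant gradient the t-th step is about lr / sqrt(t):
# roughly 0.1 on the first step and 0.01 on the hundredth.
```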
However, AdaGrad has a drawback, the one just mentioned: the learning speed keeps slowing, so in non-convex optimization the parameters can stall near a local extremum and be unable to move or correct further. This is a bad thing. RMSProp was proposed to address it: instead of letting the squared gradients accumulate without limit, it applies a decay so that the accumulated value does not grow so fast. Combining the advantages of momentum with this decay strategy gives an algorithm that is almost Adam.
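The decayed accumulation can be sketched in the same single-parameter setting. The decay rate of 0.9 and learning rate of 0.1 are assumed values for illustration.

```python
import math


def rmsprop_step_sizes(grads, lr=0.1, decay=0.9, eps=1e-8):
    """Return the size of each RMSProp update for a single parameter."""
    cache = 0.0  # decayed running average of squared gradients
    sizes = []
    for g in grads:
        cache = decay * cache + (1.0 - decay) * g * g
        sizes.append(lr * abs(g) / (math.sqrt(cache) + eps))
    return sizes


sizes = rmsprop_step_sizes([1.0] * 100)
# The cache levels off near g**2 instead of growing forever,
# so the step settles near lr rather than shrinking toward 0.
```

Note that the very first steps are larger than `lr` because the cache starts at 0, which is the start-up issue that a bias correction addresses.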
But this approach is flawed. At initialization, second_momentum starts out very small when beta = 0.99, and since it sits in the denominator, the first steps become very large. Adam uses the iteration count to correct for this and prevent the problem. The course also gives recommended parameters for the Adam optimizer.
On adjusting the learning rate: the learning rate can be halved after a certain number of steps, or decayed exponentially or by an inverse (1/t) schedule. However, since Adam already shrinks its step size gradually, learning rate decay does not have to be combined with Adam.
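The start-up problem and its fix can be sketched for a single parameter. The hyperparameters below (lr = 0.001, beta1 = 0.9, beta2 = 0.999) are commonly used defaults assumed here for illustration, not values taken from this post, and the `bias_correction` switch exists only to show the difference.

```python
import math


def adam_steps(grads, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8,
               bias_correction=True):
    """Return the signed Adam update for each gradient of a single parameter."""
    m = 0.0  # first moment: decayed average of gradients
    v = 0.0  # second moment: decayed average of squared gradients
    steps = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        if bias_correction:
            # Dividing by (1 - beta**t) undoes the pull toward the zero start.
            m_hat = m / (1.0 - beta1 ** t)
            v_hat = v / (1.0 - beta2 ** t)
        else:
            m_hat, v_hat = m, v
        steps.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return steps


corrected = adam_steps([2.0] * 3)
uncorrected = adam_steps([2.0] * 3, bias_correction=False)
```

With the correction, the first update has magnitude about `lr`; without it, the tiny second moment in the denominator makes the first update roughly three times larger in this setting.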
Second-order optimization strategies. I could not understand this formula; I will come back to it when my math is stronger.
The course also mentions second-order optimization strategies, which I did not follow because the formulas are heavy. The general idea is that memory is insufficient: the Hessian matrix is an n×n matrix and cannot be stored, so an approximation strategy is used instead.
Model ensembling trains several independent models and averages their results. This strategy reliably improves results by about two percent and improves the model's generalization ability.
Random dropout is used to improve the model's general performance on the data as a whole. As for the earlier question of what regularization is, it probably refers to techniques like this that keep the model from fitting too closely: even if some parts are missing, or part of the model is chosen differently, the previously learned results can still be applied. This should be the general way of thinking about regularization: in short, it prevents overfitting.
In addition, the course mentions that when dropout is used, neurons are randomly dropped according to a probability p. In convolutional networks this is now done by randomly removing entire channels, while in fully connected networks randomly chosen neurons are set to 0. To compensate for what was dropped, at test time the computed outputs are multiplied by p, the probability of keeping a neuron, which gives the mathematical expectation of the output under random dropping.
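The train-time and test-time behaviour can be sketched as follows. This assumes `keep_prob` is the probability of keeping a neuron and uses a seeded random generator so the run is repeatable; the function names are made up for this example.

```python
import random


def dropout_train(activations, keep_prob, rng):
    """Training: keep each activation with probability keep_prob, else zero it."""
    return [a if rng.random() < keep_prob else 0.0 for a in activations]


def dropout_test(activations, keep_prob):
    """Testing: keep everything, but scale by keep_prob to match the expectation."""
    return [a * keep_prob for a in activations]


rng = random.Random(0)
activations = [1.0] * 10000
train_out = dropout_train(activations, 0.5, rng)
test_out = dropout_test(activations, 0.5)
train_mean = sum(train_out) / len(train_out)  # close to 0.5 on average
```

The average of the randomly masked training outputs matches the scaled test output, which is the point of multiplying by p.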
Other regularization methods add a term to the loss function: the sum of squares of all weights, the sum of absolute values of the weights, or a linear combination of the two. Another is to take random crops of an image, or to rotate and flip it. These are standard regularization methods that effectively reduce overfitting to the training data. Image contrast can also be randomly perturbed, for example by combining PCA (principal component analysis) with random jitter, so that overfitting is prevented without changing the overall meaning of the data.
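The three penalty terms can be written down directly. A minimal sketch: the weights, the regularization strength `lam`, and the mixing factor `beta` are assumed values for illustration.

```python
def l2_penalty(weights):
    """Sum of squared weights."""
    return sum(w * w for w in weights)


def l1_penalty(weights):
    """Sum of absolute values of the weights."""
    return sum(abs(w) for w in weights)


def elastic_net_penalty(weights, beta=0.5):
    """Linear combination of the L2 and L1 penalties."""
    return beta * l2_penalty(weights) + l1_penalty(weights)


def regularized_loss(data_loss, weights, lam=0.1):
    """Total loss = data loss + lam * L2 penalty."""
    return data_loss + lam * l2_penalty(weights)


weights = [0.5, -1.0, 2.0]
l2 = l2_penalty(weights)                 # 5.25
l1 = l1_penalty(weights)                 # 3.5
total = regularized_loss(1.0, weights)   # about 1.525
```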
A variant of dropout is DropConnect, which sacrifices random connections, that is, it randomly sets some weights w to 0.
Transfer learning. It sounds impressive, but according to the course, when you already have a trained network, for example a trained VGG convolutional network, you simply freeze the earlier convolutional layers, replace the final output layer, and retrain the parameters of the fully connected layer. This is a good way to reuse an existing network when moving to a small dataset.
When the dataset is not so small, fine-tuning is used to update all the learnable parameters, but with only about one tenth of the original learning rate. This lets a model with strong generalization ability adapt gently to your own dataset.
When the data the model was trained on is similar to the data being transferred to, the fine-tuning above works. When the two differ greatly, transfer is hard with a small dataset, while a large dataset needs more training. Transfer learning is quite common.
Use cuDNN as much as possible to accelerate neural network code, and make as much use of the GPU's high-speed computing power as possible.
TensorFlow, PyTorch, Caffe. Caffe evolved from UC Berkeley's Caffe into Facebook's Caffe2, NYU/Facebook's Torch evolved into Facebook's PyTorch, and U Montreal's Theano evolved into Google's TensorFlow. Frameworks develop very quickly.
A passing idea. Human memory follows the Ebbinghaus forgetting curve, and in neural network training each small set of data fed to the network is called a batch. Could batches be fed on a schedule that deliberately follows the Ebbinghaus forgetting curve, so that while learning the features of a new batch of images the network also reviews past images? This would amount to designing a new dataloader. I think it is feasible and plan to try it.
A passing idea. When GoogLeNet was introduced, I saw that the network is divided into three sections, with the first and middle sections each having an output of their own from which gradients are back-propagated. The idea is interesting, because we learned earlier that gradients can vanish as they propagate, and the problem becomes more obvious in the front part of the network as depth grows. It feels like a gas station: on a long drive home, without somewhere to refuel along the way it is hard to make it back to your hometown. Worth thinking about.
A passing idea. An important contribution of the residual network is that it proved, through proper model design, that deeper networks can give better training results. It provides the confidence to keep using deeper networks. When the training data is not a square image as before, the square convolution kernels used before do not apply. If the training data is one-dimensional, or is two- or three-dimensional data whose dimensions have no spatial correlation (that is, no correspondence in the three-dimensional world), one can use a one-dimensional convolution, or a convolution of the matching dimension, and still apply the residual training scheme.
LeNet was the first convolutional neural network model successfully applied to image recognition. The model is simple, containing only convolutional layers, pooling layers, and fully connected layers, and it was applied successfully to handwritten digit recognition.
In 2012 AlexNet burst onto the scene, using a deep convolutional neural network to dominate the field. I finally understand what the upper and lower halves in the AlexNet architecture figure mean. In the early days computing capacity was short and the number of parameters was huge, so training had to be split across two graphics cards. Another thing I had wondered about is confirmed: in certain convolutional layers the two GPUs do not communicate with each other, and each GPU trains on only half of the feature maps. But I still have questions. In AlexNet's middle layers the GPUs do communicate, so how do they do it? And what are the low-level details of training a network in parallel? These are not yet clear to me.
The VGG network was proposed in 2014 and achieved good results. An important idea here is to stack smaller convolution kernels in place of a larger kernel. The stack can replace the large kernel because the area covered grows with each stacked layer, so details out to the edges on all four sides are still captured. The stack has fewer parameters and allows a deeper network. It is a good design, but the memory footprint and the number of parameters are still too large. As a result, it is hard to put many input images in one training batch, because too much has to be kept in memory. VGG also found that local response normalization did nothing useful and removed it.
I finally understand how Inception in GoogLeNet works. Although it uses convolution kernels of several sizes, padding makes the feature maps produced by the different kernels the same spatial size, so they can be concatenated along the depth dimension. This had puzzled me for years. Another doubt is also resolved: why so many 1×1 kernels are used. The purpose is to reduce the depth of the input before a large convolution, for example by using half as many 1×1 kernels as the input depth; after the convolution, another 1×1 convolution can restore a greater depth. This is called a "bottleneck layer", a very interesting design. One more doubt has been resolved, namely why there are multiple outputs; this was explained earlier in this post, so I will not repeat it.
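Both the stacking argument and the bottleneck argument come down to counting weights. The sketch below ignores biases and assumes channel counts of 64 and 256 purely for illustration; the bottleneck halves the depth, as described above.

```python
def conv_params(kernel, c_in, c_out):
    """Number of weights in one convolution layer (biases ignored)."""
    return kernel * kernel * c_in * c_out


def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions."""
    return 1 + layers * (kernel - 1)


# Three stacked 3x3 layers versus a single 7x7 layer, 64 channels throughout.
C = 64
three_3x3 = 3 * conv_params(3, C, C)     # 110592 weights
one_7x7 = conv_params(7, C, C)           # 200704 weights
field = stacked_receptive_field(3, 3)    # 7, the same as one 7x7 kernel

# A 3x3 convolution at depth 256, directly versus through a 1x1 bottleneck.
D = 256
direct = conv_params(3, D, D)            # 589824 weights
bottleneck = (conv_params(1, D, D // 2)           # 1x1: reduce depth to 128
              + conv_params(3, D // 2, D // 2)    # 3x3 at the reduced depth
              + conv_params(1, D // 2, D))        # 1x1: restore depth to 256
```

The stacked small kernels see the same 7×7 region with about half the weights, and the bottleneck version needs well under half the weights of the direct convolution.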
GoogLeNet really gives a lot to think about. One more point: the several convolution kernels in an Inception module can run in parallel. Computing performance is a problem worth studying, and designing the steps of a computation so that they can run in parallel is one way to speed it up.
ResNet, the residual neural network (2015), resolved a question that had puzzled me for years: what does "residual" mean? The residual is the difference between a layer's input and the output that layer should produce once the network has learned something. Intuitively, each layer no longer needs to learn the actual value; it only needs to learn this delta. I never thought there would be such a network, and it opened my mind. ResNet also uses bottleneck layers to reduce the number of parameters that have to be trained.
The concept of the bottleneck layer comes from the NiN network. Many works focus on the width of the network. The stochastic depth model is a network built on the idea of ensembling: during training it randomly selects a subset of the network's layers to train. Fractal network architectures also aim to reduce the effective depth and improve the efficiency of gradient back propagation. Densely connected networks are likewise designed to tighten the connections between layers so that gradients propagate smoothly.
When training residual networks, an L2 regularization term is added to the loss function. I did not notice this the first time; I only found it when reviewing before the RNN lecture. The L2 regularization term adds to the loss a number equal to the sum of squares of all the network's parameters. To minimize the loss, the parameters are made as small as possible, which gives the residual network a straight-in, straight-out path: it pushes the network to give up useless layers, that is, it drives the parameters of redundant layers toward 0.
Residual networks. Because a block's output is the sum of two parts, the input before the convolution and the residual computed by the convolution, the gradient handed down from upstream during back propagation can pass directly to the input. This reduces, to some extent, the chance of a vanishing gradient.
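The effect of the skip connection on the gradient can be sketched with scalars. For y = x + F(x) the local derivative is 1 + F'(x), so even a very small F'(x) leaves a factor close to 1. The depth of 20 and the local derivative of 0.01 are arbitrary values chosen for illustration.

```python
def chained_gradient(local_derivative, layers, residual):
    """Multiply the per-layer derivatives along the backward path."""
    grad = 1.0
    for _ in range(layers):
        if residual:
            grad *= 1.0 + local_derivative  # y = x + F(x): derivative is 1 + F'(x)
        else:
            grad *= local_derivative        # y = F(x): derivative is F'(x)
    return grad


plain = chained_gradient(0.01, 20, residual=False)  # collapses toward 0
skip = chained_gradient(0.01, 20, residual=True)    # stays of order 1
```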
GoogLeNet and ResNet, unlike VGG, drop the large block of fully connected layers at the end, together with their many parameters, in favor of a more carefully designed architecture that maintains depth. They use global pooling at the end instead of parameter-heavy fully connected layers.
RNN, the recurrent neural network. At each time step the recurrent network does not take the highest-scoring class as the next input; instead it chooses by sampling. The lecture did not seem to explain how the sampling works, only that it is probabilistic. My understanding is that the higher the score, the greater the probability of being selected, while low-scoring candidates still have a chance to be chosen as the next input.
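One common way to do this is to turn the scores into probabilities with a softmax and draw from that distribution. This is a sketch under assumptions: the four-character vocabulary and the scores are invented for illustration, and the lecture may use a different sampling scheme.

```python
import math
import random


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next(vocab, scores, rng):
    """Pick the next symbol at random, weighted by its softmax probability."""
    return rng.choices(vocab, weights=softmax(scores), k=1)[0]


vocab = ["h", "e", "l", "o"]
scores = [1.0, 2.0, 0.5, 0.1]  # hypothetical scores for the next character
probs = softmax(scores)
rng = random.Random(0)
samples = [sample_next(vocab, scores, rng) for _ in range(1000)]
```

The highest-scoring character is drawn most often, but the others still appear, which matches the intuition above.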
VGG's approach of first training an 11-layer network to convergence and then gradually adding layers, and GoogLeNet's auxiliary outputs that feed gradients to the lower layers, were both designed before batch normalization came into wide use. With Batch Normalization, these tricks are no longer necessary.
RNNs can handle one-to-many, many-to-many, and many-to-one problems with differing numbers of inputs and outputs, for example text translation and image captioning. Because the corpus can be very large, and the back propagation path gets longer as the sequence grows, using dense matrices makes the computation very slow, so RNNs tend to use one-hot vectors. I am not sure how best to translate "one-hot", but I know what it means: a row or column vector in which exactly one position is 1 and all others are 0. The position of the 1 corresponds to one element of the corpus, which may be a word, a character, a location, or content obtained by some other encoding.
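A one-hot encoding of characters can be sketched as follows. The vocabulary and the weight matrix `W` are invented for illustration. The last lines show why the encoding is cheap to use: multiplying a one-hot vector by a weight matrix just selects one row.

```python
def one_hot(index, size):
    """Vector of length `size` with a 1 at `index` and 0 everywhere else."""
    return [1 if i == index else 0 for i in range(size)]


vocab = ["h", "e", "l", "o"]
char_to_index = {ch: i for i, ch in enumerate(vocab)}
encoded = [one_hot(char_to_index[ch], len(vocab)) for ch in "hello"]


def vec_times_matrix(vec, matrix):
    """Multiply a row vector (length n) by an n x m matrix."""
    columns = range(len(matrix[0]))
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in columns]


W = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]  # hypothetical 4 x 2 weights
selected = vec_times_matrix(one_hot(2, 4), W)          # equals row 2 of W
```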
When an RNN learns from a long corpus, a single iteration would take too long, so truncated training is used. Truncated training means running back propagation once every fixed number of steps. Research shows that during learning some of the recurrent units specialize in a particular structure: they are active inside that structure and effectively switched off elsewhere.
How are variable-length inputs or outputs handled? The answer is to put start and end tokens in the corpus; when the end token is predicted, the prediction process stops.
A recurrent network can be designed to decide where to place its attention at each time step. There are two kinds of attention mechanism. Soft attention takes a weighted sum over all locations. Hard attention forces the attention onto a single location in the image. I did not fully understand this part: in the slides each step does two things, and I am not sure how. I may read the paper mentioned later to see how it is actually implemented.
Recurrent networks can also be made deeper, meaning more hidden layers are stacked so that each time step passes through more transformations. But three or four stacked layers already count as very deep here, and more are not needed.
The vanilla RNN has a drawback: when the recurrence runs over many steps, every gradient passed back includes the same weight matrix W, and when W is multiplied many times the gradient either explodes or vanishes, which makes the network very hard to train. One remedy is to scale the back-propagated gradient down when it exceeds a threshold, but this is rather crude, so the design evolved into LSTM, the long short-term memory network. It has four gates, i, f, o, g (input, forget, output, gate). The first three use sigmoid as the nonlinearity, squashing values to between 0 and 1; the last uses tanh, squashing values to between -1 and 1.
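The threshold-based remedy, usually called gradient clipping, can be sketched as follows. This version clips by the overall norm of the gradient vector; the threshold of 1.0 is an assumed value for illustration.

```python
import math


def clip_by_norm(grad, threshold):
    """Scale the gradient down so its L2 norm does not exceed `threshold`."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad]
    return list(grad)  # small gradients pass through unchanged


clipped = clip_by_norm([3.0, 4.0], threshold=1.0)    # norm 5 is scaled down to norm 1
unchanged = clip_by_norm([0.3, 0.4], threshold=1.0)  # norm 0.5 is left alone
```

Clipping keeps the direction of the gradient and only limits its length, so it guards against explosion but does nothing for vanishing gradients.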
The GRU (gated recurrent unit). I did not study it carefully, so my understanding is not deep. My intuition is that it handles the hidden state in a different way, making the propagation and combination of information more complex in order to get better network performance.