Self-Learning Neural Network Series - 7: Feedforward Neural Network Prerequisites
2022-06-26 09:10:00 【ML_python_get√】
7 Feedforward Neural Network Prerequisites
One: The Perceptron Algorithm
1 Model Form
$z = \sum_{d=1}^{D} w_d x_d + b$
2 Linear Classifier
- Accepts multiple input signals and outputs a single signal.
- Neuron: computes a weighted sum of the input signals; if the weighted sum exceeds a threshold it outputs 1, otherwise 0 (see the sketch below).
- The weights represent the importance of each input signal.
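As a concrete illustration of the model form above, here is a minimal sketch of a single perceptron with a step activation; the function name perceptron and the particular weights are illustrative choices, not from the original post.

import numpy as np

def perceptron(x, w, b):
    '''Weighted sum of the inputs plus a bias, followed by a step function.'''
    z = np.sum(w * x) + b        # z = sum_d w_d * x_d + b
    return 1 if z > 0 else 0     # fire (1) only if the weighted sum clears the threshold

# Example: with these weights and bias the perceptron behaves like an AND gate
w = np.array([0.5, 0.5])
b = -0.7
print(perceptron(np.array([1, 1]), w, b))  # 1
print(perceptron(np.array([1, 0]), w, b))  # 0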
3 Limitations
AND gate: outputs 1 only when both inputs are 1; outputs 0 in every other case.
NAND gate: the inverse of the AND gate; outputs 0 only when both inputs are 1, otherwise outputs 1.
OR gate: outputs 1 as long as at least one input is 1; outputs 0 only when both inputs are 0.
XOR gate: outputs 1 only when exactly one input is 1; if both inputs are 1 (or both are 0), it outputs 0, as shown in Figure 1 and the truth table below.
A single-layer perceptron cannot implement the XOR gate.
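For reference, the truth table of the four gates (the values follow directly from the definitions above); only XOR cannot be produced by a single linear threshold on x1 and x2:

x1  x2 | AND  NAND  OR  XOR
 0   0 |  0    1    0    0
 0   1 |  0    1    1    1
 1   0 |  0    1    1    1
 1   1 |  1    0    1    0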
The idea behind the perceptron algorithm: by adjusting the perceptron's parameters, the same structure can be switched between the different gates; parameter tuning is left to the computer, which learns what kind of gate to realize.

4 Python Implementation
(1) AND gate
def AND(x1, x2):
    '''AND gate implemented with weights and an explicit threshold'''
    w1, w2, theta = 0.5, 0.5, 0.7
    tmp = x1*w1 + x2*w2
    if tmp <= theta:
        return 0
    elif tmp > theta:
        return 1

# Test the function
print(AND(1, 1))  # 1
print(AND(0, 0))  # 0
print(AND(1, 0))  # 0
print(AND(0, 1))  # 0
# Implementation with a bias term and numpy
import numpy as np

def AND(x1, x2):
    x = np.array([x1, x2])
    w = np.array([0.5, 0.5])
    b = -0.7  # bias: controls how easily the neuron is activated
    tmp = np.sum(w*x) + b
    if tmp <= 0:
        return 0
    else:
        return 1

# Test: same results as the plain implementation
print(AND(1, 1))  # 1
print(AND(0, 0))  # 0
print(AND(1, 0))  # 0
print(AND(0, 1))  # 0
(2) NAND gate
- The output is the inverse of the AND gate: the weights and bias simply have opposite signs.
def NAND(x1, x2):
    '''NAND gate'''
    x = np.array([x1, x2])
    w = np.array([-0.5, -0.5])
    b = 0.7
    tmp = np.sum(w*x) + b
    if tmp <= 0:
        return 0
    else:
        return 1

# Test
print(NAND(1, 1))  # 0
print(NAND(1, 0))  # 1
print(NAND(0, 1))  # 1
print(NAND(0, 0))  # 1
(3) OR gate
- The absolute value of the bias only needs to be smaller than the weight 0.5, which makes the neuron easier to activate.
def OR(x1, x2):
    x = np.array([x1, x2])
    w = np.array([0.5, 0.5])  # with these weights, the output is 1 unless both inputs are 0
    b = -0.2
    tmp = np.sum(w*x) + b
    if tmp <= 0:
        return 0
    else:
        return 1

# Test
print(OR(1, 1))  # 1
print(OR(0, 1))  # 1
print(OR(1, 0))  # 1
print(OR(0, 0))  # 0
5 A Multi-Layer Perceptron Solves the XOR Problem
- XOR gate: its outputs cannot be separated by a single straight line (not linearly separable).
- Introduce nonlinearity by stacking perceptron layers.
- Pass the inputs through a NAND gate to get s1 and through an OR gate to get s2; feeding (s1, s2) into an AND gate then produces the XOR output.
- Multi-layer perceptron: several linear classifiers stacked together; combining their outputs (here with an AND gate) realizes a nonlinear fit.
def XOR(x1, x2):
    s1 = NAND(x1, x2)  # NAND gate
    s2 = OR(x1, x2)    # OR gate
    y = AND(s1, s2)    # combine the two intermediate signals with an AND gate
    return y

# Test
print(XOR(1, 1))  # 0
print(XOR(0, 0))  # 0
print(XOR(1, 0))  # 1
print(XOR(0, 1))  # 1
Two: Neural Network Structure
Basic idea: in the previous section, the multi-layer perceptron achieved nonlinear classification by combining several linear functions through logical operations. It is natural to generalize this: apply a nonlinear transformation to linear functions in order to solve classification problems that are not linearly separable. In a neural network, this nonlinear transformation is the activation function.
Neural network: a multi-layer perceptron model equipped with activation functions, i.e. a statistical model that learns nonlinear feature representations. A common fully connected network structure is shown in Figure 2:

1 Common Activation Functions
- Any curve can be approximated using activation functions, much as a polynomial can fit any finite set of points exactly.
- Any curve can be written as a sum of activation functions, similar in spirit to spline estimation.
- The network is organized into layers: every unit in a layer uses the same activation function, and the units differ only in their weights and biases.
- Two ReLU functions can be combined to construct a step function (taking the values 0 and 1) or an approximation of a sigmoid (see the sketch after this list).
- Consequently, to achieve the same effect, roughly twice as many ReLU units (neurons) are needed.
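A minimal sketch (my own illustration, not from the original post) of how the difference of two shifted ReLUs produces an approximate 0-1 step function:

import numpy as np

def relu(x):
    return np.maximum(0, x)

def approx_step(x, eps=0.1):
    '''Difference of two shifted ReLUs: 0 for x <= 0, 1 for x >= eps, linear in between.'''
    return (relu(x) - relu(x - eps)) / eps

x = np.array([-1.0, -0.01, 0.05, 0.2, 3.0])
print(approx_step(x))  # [0.  0.  0.5 1.  1. ]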
(1) Sigmoid activation function
- In machine learning, the sigmoid activation function is the CDF of the logistic distribution and has the form:
$\sigma(x) = \dfrac{1}{1+\exp(-x)}$
The logistic distribution belongs to the exponential family of distributions.
The logistic distribution is often used to model cyclical processes, for example the economic cycle of depression, recovery, boom, and decline: at first the economy grows slowly, then grows rapidly during the recovery, stagnates after the boom, slows down, and finally declines back into depression. Such an S-shaped trajectory matches the logistic curve.
The distribution function takes values between 0 and 1, so its output can be interpreted as a probability and used for classification.
The function responds strongly to intermediate inputs and is suppressed (saturated) at both ends, which matches the behaviour of biological neurons.
The vanishing gradient problem:
- The derivative is close to 0 at both ends.
- The gradient is always less than 1 (at most 0.25), so when the chain rule is applied across many layers, the gradients shrink toward zero (see the sketch below).
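A minimal sketch (my own illustration) of why chains of sigmoids suffer from vanishing gradients: the derivative σ'(x) = σ(x)(1 − σ(x)) is at most 0.25, so a product of many such factors shrinks rapidly.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

print(sigmoid_grad(0.0))  # 0.25, the maximum possible value
print(sigmoid_grad(5.0))  # ~0.0066, nearly zero in the saturated region
print(0.25 ** 10)         # ~9.5e-07: the best case after ten chained layers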
(2) Tanh activation function
- The Tanh activation function has the form:
$\tanh(x) = \dfrac{\exp(x)-\exp(-x)}{\exp(x)+\exp(-x)}$
- Tanh can be seen as a rescaled sigmoid activation; both belong to the exponential family of functions (the identity is checked numerically below):
$\tanh(x) = 2\sigma(2x) - 1$
- tanh(x) has range (-1, 1), so its outputs are zero-centered, whereas sigmoid(x) has range (0, 1) and its outputs are always positive.
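A quick numerical check (my own sketch) of the identity tanh(x) = 2σ(2x) − 1:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 7)
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # True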
(3) ReLU activation function
- ReLU is the most commonly used activation function in neural networks; its form is:
$\mathrm{ReLU}(x)=\begin{cases} x, & x>0 \\ 0, & x \le 0 \end{cases}$
- One-sided suppression: ReLU is left-saturated. Saturation means that, far out along an axis, the function value no longer changes appreciably; sigmoid and Tanh saturate at both ends, while ReLU saturates only on the left.
- Wide excitation boundary: the activated region is wide; any positive input activates the unit.
- Alleviates vanishing gradients: the derivative is 1 for positive inputs, which mitigates the vanishing gradient problem to some extent.
- The dying ReLU problem: when an input is an outlier, the prediction deviates strongly from the target, and back-propagation (propagating the error backwards) produces a large, biased update to the bias b, possibly driving it far negative. Once the bias is that negative, no sample can make the hidden unit's pre-activation positive, so after the ReLU activation the gradient is 0 and the parameters are never updated again. This phenomenon is called the dying ReLU problem (sketched below).
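A minimal sketch (my own illustration) of a dead ReLU unit: once the pre-activation is negative for every input, both the output and the gradient are 0, so gradient descent can never revive the unit.

import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    return (x > 0).astype(float)

w, b = np.array([0.5, 0.5]), -10.0      # a bias that has been driven far negative
X = np.array([[0, 1], [1, 1], [3, 2]])  # ordinary inputs
z = X @ w + b                           # every pre-activation is negative
print(relu(z))       # [0. 0. 0.] -> the unit never fires
print(relu_grad(z))  # [0. 0. 0.] -> no gradient flows, so w and b stay frozen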
(4) Leaky ReLU
- To improve on ReLU and avoid the dying ReLU problem, the Leaky ReLU activation introduces a small slope when the input x is negative:
$\mathrm{LeakyReLU}(x)=\begin{cases} x, & \text{if } x>0 \\ \gamma_i x, & \text{if } x \le 0 \end{cases}$
- where $\gamma_i$ can either be a learned parameter or a fixed constant.
- ELU activation function: replace $\gamma_i x$ with $\gamma_i(\exp(x)-1)$ for the negative part (see the sketch below).
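A minimal sketch (my own illustration) of Leaky ReLU and ELU with a fixed γ; the values 0.01 and 1.0 for γ are illustrative choices:

import numpy as np

def leaky_relu(x, gamma=0.01):
    return np.where(x > 0, x, gamma * x)

def elu(x, gamma=1.0):
    return np.where(x > 0, x, gamma * (np.exp(x) - 1))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))  # approximately [-0.02 -0.005 0. 1.5]
print(elu(x))         # approximately [-0.865 -0.393 0. 1.5]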
(5) Softplus activation function
- A smooth version of the ReLU function, of the form:
$\mathrm{Softplus}(x) = \log(1+\exp(x))$
- Its derivative is the sigmoid function (checked numerically below), so some gradient attenuation remains, and it does not produce sparse activations.
- One-sided suppression and a wide excitation boundary, like ReLU.
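A quick numerical check (my own sketch) that the derivative of Softplus is the sigmoid, using a central finite difference:

import numpy as np

def softplus(x):
    return np.log(1 + np.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x, h = 1.3, 1e-6
numeric_grad = (softplus(x + h) - softplus(x - h)) / (2 * h)
print(numeric_grad, sigmoid(x))  # both are approximately 0.7858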
(6) Other activation functions
Swish function: $\mathrm{swish}(x) = x\,\sigma(\beta x)$
- The sigmoid σ acts as a gating unit that controls how much of x is output.
- It lies between ReLU and the linear function.
GELU function: $\mathrm{GELU}(x) = x\,P(X \le x)$
- The Gaussian CDF acts as the gating unit that controls the output of x.
Maxout unit
- A piecewise linear function.
- Uses all outputs of the previous layer rather than a single one, with several parameter vectors.
- The output is the maximum over the several linear transformations (see the sketches below).
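Minimal sketches (my own illustrations, following the definitions above) of Swish, GELU via the Gaussian CDF, and a Maxout unit with k linear pieces; the parameter shapes and values are illustrative:

import numpy as np
from math import erf, sqrt

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)   # the sigmoid gate controls how much of x passes through

def gelu(x):
    # Gaussian CDF P(X <= x) as the gate
    return x * 0.5 * (1 + np.vectorize(erf)(x / sqrt(2)))

def maxout(x, W, b):
    # W: (k, d), b: (k,) -- k linear transformations of the same input, keep the maximum
    return np.max(W @ x + b)

x = np.array([-1.0, 0.0, 2.0])
print(swish(x))  # approximately [-0.269 0. 1.762]
print(gelu(x))   # approximately [-0.159 0. 1.955]

W = np.array([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]])
b = np.zeros(3)
print(maxout(np.array([2.0, 1.0]), W, b))  # 1.5 = max(1.0, 1.5, -1.0)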
2 Network Structure
(1) Feedforward neural networks
- Information flows in one direction only; there is no feedback flow of information.
- Fully connected neural networks (a minimal forward pass is sketched at the end of this subsection).

- Convolutional neural networks

- Difference from fully connected networks: neurons in adjacent layers are only partially connected and share weights (the convolution kernel), which reduces the number of parameters to estimate.
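A minimal sketch (my own illustration) of a single forward pass through a small fully connected feedforward network, with ReLU in the hidden layer and a sigmoid output; the layer sizes and random weights are arbitrary:

import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # input (3) -> hidden (4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)  # hidden (4) -> output (1)

x = np.array([0.5, -1.0, 2.0])
h = relu(W1 @ x + b1)      # hidden layer: linear transform followed by the activation
y = sigmoid(W2 @ h + b2)   # output layer: a probability-like value in (0, 1)
print(y)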

(2) Recurrent neural networks
- A neuron can receive information not only from other neurons but also from its own past states (see the sketch below).
- Gated variants combine forgetting of past information with updates from the current input and, in bidirectional versions, from future inputs.
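A minimal sketch (my own illustration, not from the original post) of a vanilla recurrent cell that mixes the current input with its own previous hidden state:

import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, b):
    '''One step of a simple recurrent cell: h_t = tanh(Wx x_t + Wh h_prev + b).'''
    return np.tanh(Wx @ x_t + Wh @ h_prev + b)

rng = np.random.default_rng(1)
Wx, Wh, b = rng.normal(size=(3, 2)), rng.normal(size=(3, 3)), np.zeros(3)

h = np.zeros(3)                      # initial hidden state
for x_t in [np.array([1.0, 0.0]), np.array([0.0, 1.0])]:
    h = rnn_step(x_t, h, Wx, Wh, b)  # the history is carried forward through h
print(h)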

(3) Graph neural networks
- Model data with a graph structure.
- Embed the nodes and edges as vectors in a vector space.
- Build neural networks for the nodes and for the edges separately (a one-step message-passing sketch follows).
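A minimal sketch (my own illustration) of one message-passing step on a toy graph: each node aggregates its neighbours' feature vectors and passes the result through a small linear layer with a nonlinearity.

import numpy as np

# A toy graph with 3 nodes: adjacency matrix and 2-dimensional node features
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])

rng = np.random.default_rng(2)
W = rng.normal(size=(2, 2))

# One message-passing step: aggregate neighbour features, then transform them
H_new = np.tanh((A @ H) @ W)
print(H_new)  # updated node embeddings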

References:
1. Xipeng Qiu, Neural Networks and Deep Learning.
2. Mu Li, Dive into Deep Learning.