The four most common errors when using PyTorch
2022-07-04 15:43:00 【Xiaobai learns vision】
Reading guide
These 4 mistakes are ones I'm sure most people have made at some point; I hope this serves as a small reminder.
The most common neural network errors: 1) you didn't try to overfit a single batch first; 2) you forgot to toggle train/eval mode for the network; 3) you forgot to call .zero_grad() (in PyTorch) before .backward(); 4) you passed softmaxed outputs to a loss that expects raw logits. Anything else?
This article walks through each of these errors with PyTorch code examples. Code: https://github.com/missinglinkai/common-nn-mistakes
Common mistake 1: You didn't try to overfit a single batch first
Andrej says we should overfit a single batch. Why? Well, when you overfit a single batch, you are really making sure the model works at all. I don't want to waste hours of training on a huge dataset only to discover that a small bug left me with 50% accuracy. When your model can fully memorize the input, what you get is a good estimate of its best possible performance.
Maybe that best performance is zero, because an exception is thrown during execution. But that's fine: we find the problem quickly and fix it. To sum up, here is why you should start by overfitting a small subset of the dataset:
- Find bugs
- Estimate the best possible loss and accuracy
- Fast iteration
With a PyTorch dataset, you usually iterate over a DataLoader. Your first attempt might be to index train_loader directly.
# TypeError: 'DataLoader' object does not support indexing
first_batch = train_loader[0]
You will see an error right away, because DataLoaders want to support streaming and other scenarios that don't require indexing. So there is no __getitem__ method, which is why the [0] operation fails. Next, you might try converting it to a list, which does support indexing.
# slow, wasteful
first_batch = list(train_loader)[0]
But that means evaluating the entire dataset, which costs both time and memory. So what else can we try?
In a Python for loop, when you write the following:
for item in iterable:
    do_stuff(item)

you effectively get this:
iterator = iter(iterable)
try:
    while True:
        item = next(iterator)
        do_stuff(item)
except StopIteration:
    pass
Calling the iter function creates an iterator, and calling next on that iterator returns the next item, until StopIteration is raised once we run out. In this loop we just keep calling next, next, next... To mimic this behavior but stop after the first item, we can use this:
first = next(iter(iterable))
We call iter to get the iterator, but we only call next on it once. Note that, for clarity, I assign the result of next to a variable named first. I call this the "next-iter" trick. In the following code, you can see the full example with the train data loader:
for batch_idx, (data, target) in enumerate(train_loader):
    # training code here
Here's how to modify the loop to use the next-iter trick:
first_batch = next(iter(train_loader))
for batch_idx, (data, target) in enumerate([first_batch] * 50):
    # training code here
You can see that I repeat first_batch 50 times to make sure I overfit.
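To make the idea concrete, here is a self-contained sketch of the whole overfitting check, using a toy model and made-up tensor shapes rather than the repo's code; if the setup is healthy, the printed loss should end up close to zero:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in for one batch pulled from a real DataLoader
first_batch = (torch.randn(32, 20), torch.randint(0, 5, (32,)))

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 5))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for batch_idx, (data, target) in enumerate([first_batch] * 50):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(data), target)
    loss.backward()
    optimizer.step()

print(loss.item())  # should approach zero once the model memorizes the batch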
Common mistake 2: Forgetting to set train/eval mode for the network
Why does PyTorch care whether we are training or evaluating the model? The biggest reason is dropout, a technique that randomly removes neurons during training.
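As a quick illustration (not from the article's repo), here is how nn.Dropout behaves differently depending on the mode:

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))  # roughly half the entries are zeroed, the rest are scaled by 1 / (1 - p)

drop.eval()
print(drop(x))  # identity: all ones, dropout is disabled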
Imagine that the red neuron on the right is the only one that contributes to the correct result. Once we remove it, the other neurons are forced to learn how to be accurate without it. Dropout like this improves final test performance, but it hurts performance during training because the network is incomplete. Run the script and watch the accuracy in the MissingLink dashboard, and keep that in mind.
In this particular case, it looks like accuracy drops every 50 iterations.
If we check the code, we see that train mode is set inside the train function.
def train(model, optimizer, epoch, train_loader, validation_loader):
    model.train()  # <-- train mode is set here, once, outside the batch loop
    for batch_idx, (data, target) in experiment.batch_loop(iterable=train_loader):
        data, target = Variable(data), Variable(target)
        # Inference
        output = model(data)
        loss_t = F.nll_loss(output, target)
        # The iconic grad-back-step trio
        optimizer.zero_grad()
        loss_t.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            train_loss = loss_t.item()
            train_accuracy = get_correct_count(output, target) * 100.0 / len(target)
            experiment.add_metric(LOSS_METRIC, train_loss)
            experiment.add_metric(ACC_METRIC, train_accuracy)
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx, len(train_loader),
                100. * batch_idx / len(train_loader), train_loss))
            with experiment.validation():
                val_loss, val_accuracy = test(model, validation_loader)  # <-- test() is called inside the training loop
                experiment.add_metric(LOSS_METRIC, val_loss)
                experiment.add_metric(ACC_METRIC, val_accuracy)

This problem is not easy to notice: inside the loop, we call the test function.
def test(model, test_loader):
    model.eval()
    # ...

Inside the test function we set the mode to eval! That means that if we call test during training, we stay in eval mode until the next time train is called. As a result, in every epoch only one batch actually uses dropout, which causes the performance degradation we see.
The fix is easy: we move model.train() one line down, into the training loop. Ideally, the mode should be set as close as possible to the inference step, so you never forget to set it. After the fix, our training curve looks much more reasonable, without the spikes in the middle. Note that, because of dropout, training accuracy will be lower than validation accuracy.
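A minimal sketch of the corrected loop, keeping the structure of the snippet above (experiment, args, Variable, and the metric helpers are assumed from the original repo):

def train(model, optimizer, epoch, train_loader, validation_loader):
    for batch_idx, (data, target) in experiment.batch_loop(iterable=train_loader):
        model.train()  # moved inside the loop, right before inference
        data, target = Variable(data), Variable(target)
        output = model(data)
        loss_t = F.nll_loss(output, target)
        optimizer.zero_grad()
        loss_t.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            with experiment.validation():
                # test() still switches to eval mode, but model.train() above
                # restores train mode at the start of the next iteration
                val_loss, val_accuracy = test(model, validation_loader)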
Common mistake 3: Forgetting to call .zero_grad() before .backward()
When you call backward on the loss tensor, you tell PyTorch to go backwards from the loss and compute how much each weight affects the loss, that is, the gradient of each node in the computation graph. Using these gradients, we can update the weights optimally.
Here is how it looks in PyTorch code. The final step method updates the weights based on the result of the backward step. What may not be obvious from this code is that if we keep doing this batch after batch, the gradients keep accumulating and blow up, and the step we take keeps getting bigger.
output = model(input)  # forward-pass
loss_fn.backward()     # backward-pass
optimizer.step()       # update weights by an ever growing gradient
To keep the step from getting too big, we use the zero_grad method.
output = model(input)  # forward-pass
optimizer.zero_grad()  # reset gradient
loss_fn.backward()     # backward-pass
optimizer.step()       # update weights using a reasonably sized gradient
This may feel a little too explicit, but it gives precise control over the gradients. One way to make sure you don't get confused is to keep these three functions together:
zero_grad
backward
step
In our code example, with zero_grad not used at all, the network starts out getting better because it is improving, but the gradients eventually blow up, all the updates become more and more garbage, and the network finally becomes useless.
Calling zero_grad after backward instead: nothing happens, because we erased the gradients, so the weights are never updated. The only variation left comes from dropout.
I think it would make sense to automatically reset the gradients every time the step method is called.
One reason you would not want to zero the gradients on every backward is when you call backward more than once per step(), for example if you can only fit one sample per batch in memory: a single gradient would be too noisy, so you want to accumulate the gradients of several batches for each step. Another reason might be calling backward on different parts of the graph, but in that case you can also add up the losses and call backward on the sum. An accumulation sketch follows below.
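A minimal sketch of gradient accumulation, reusing the model / optimizer / train_loader names from earlier; accumulation_steps is a name chosen here purely for illustration:

accumulation_steps = 8  # hypothetical: accumulate gradients over 8 small batches

optimizer.zero_grad()
for batch_idx, (data, target) in enumerate(train_loader):
    output = model(data)
    loss = F.nll_loss(output, target) / accumulation_steps  # scale so the summed gradient matches one big batch
    loss.backward()                                          # gradients add up across iterations
    if (batch_idx + 1) % accumulation_steps == 0:
        optimizer.step()       # one update built from several batches' gradients
        optimizer.zero_grad()  # reset only after the update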
Common mistake 4: You fed softmaxed outputs to a loss function that expects raw logits
Logits are the activation values of the last fully connected layer. Softmax produces the same activations, but normalized. Looking at the logits, you can see that some values are positive and some are negative, whereas after log_softmax all values are negative. If you look at the bar chart, the distributions look the same; the only difference is the scale. But that subtle difference makes the downstream math completely different. So why is this a common mistake? In PyTorch's official MNIST example, look at the forward method: at the end you can see the last fully connected layer self.fc2, followed by log_softmax.
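A tiny sketch with made-up numbers shows the sign and scale difference described above:

import torch
import torch.nn.functional as F

logits = torch.tensor([[1.5, -0.3, 0.8]])   # raw outputs of a last fully connected layer: mixed signs
print(F.log_softmax(logits, dim=1))         # all values negative
print(F.softmax(logits, dim=1).sum())       # normalized: probabilities summing to 1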
But when you look at the official PyTorch ResNet or AlexNet models, you'll see that they don't end with a softmax layer; the final output is the fully connected layer itself, i.e., the logits.
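You can check this yourself, assuming torchvision is installed:

import torch
from torchvision import models

model = models.resnet18()                 # no pretrained weights needed for this check
out = model(torch.randn(1, 3, 224, 224))
print(out.shape)                          # torch.Size([1, 1000]); raw logits, no softmax applied
print((out < 0).any().item())             # some values are negative, as expected for logits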
The documentation doesn't make this difference clear. If you check the nll_loss function, it never says whether the input should be logits or softmax; your only hope is to discover that nll_loss expects log_softmax as its input.
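To make the relationship concrete, here is a small sketch (random tensors, not the MNIST model) of the pairings that work and the mistake described above:

import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)                  # pretend these are raw outputs of the last linear layer
target = torch.randint(0, 10, (4,))

loss_nll = F.nll_loss(F.log_softmax(logits, dim=1), target)  # nll_loss expects log-probabilities
loss_ce = F.cross_entropy(logits, target)                    # cross_entropy applies log_softmax internally
print(torch.allclose(loss_nll, loss_ce))                     # True: the two are equivalent

# The mistake: softmax first, then a loss that expects raw logits
loss_bad = F.cross_entropy(F.softmax(logits, dim=1), target) # runs without error, but the loss values are wrong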