
Deep Learning with PyTorch > A Gentle Introduction to torch.autograd

Author: 望天 · Last modified: 2024-06-07 · Column: along的开发之旅

torch.autograd is PyTorch’s automatic differentiation engine that powers neural network training. In this section, you will get a conceptual understanding of how autograd helps a neural network train.

Background

Neural networks (NNs) are a collection of nested functions that are executed on some input data. These functions are defined by parameters (consisting of weights and biases), which in PyTorch are stored in tensors.

Training a NN happens in two steps:

Forward Propagation: In forward prop, the NN makes its best guess about the correct output. It runs the input data through each of its functions to make this guess.

Backward Propagation: In backprop, the NN adjusts its parameters proportionate to the error in its guess. It does this by traversing backwards from the output, collecting the derivatives of the error with respect to the parameters of the functions (gradients), and optimizing the parameters using gradient descent. For a more detailed walkthrough of backprop, check out this video from 3Blue1Brown.
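
To make these two steps concrete, here is a minimal sketch (not part of the original tutorial) of one forward/backward/update cycle on a toy one-parameter model; the numbers and the 0.1 learning rate are chosen purely for illustration:

Code language: python
import torch

# A toy "model": y = w * x, with a single trainable parameter w
w = torch.tensor(1.0, requires_grad=True)
x = torch.tensor(2.0)
target = torch.tensor(6.0)

# Forward propagation: make a guess
y = w * x

# Measure the error and backpropagate it
loss = (y - target) ** 2
loss.backward()            # fills w.grad with d(loss)/dw

# Gradient descent: adjust the parameter against its gradient
with torch.no_grad():
    w -= 0.1 * w.grad      # learning rate 0.1
w.grad.zero_()             # clear the gradient before the next iteration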

Usage in PyTorch

Let’s take a look at a single training step. For this example, we load a pretrained resnet18 model from torchvision. We create a random data tensor to represent a single image with 3 channels, and height & width of 64, and its corresponding label initialized to some random values. Label in pretrained models has shape (1,1000).

This tutorial works only on the CPU and will not work on GPU devices (even if tensors are moved to CUDA).

Code language: python
import torch
from torchvision.models import resnet18, ResNet18_Weights
model = resnet18(weights=ResNet18_Weights.DEFAULT)
data = torch.rand(1, 3, 64, 64)
labels = torch.rand(1, 1000)
Output:
Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /var/lib/ci-user/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth

  0%|          | 0.00/44.7M [00:00<?, ?B/s]
 38%|###7      | 16.9M/44.7M [00:00<00:00, 176MB/s]
 76%|#######5  | 33.9M/44.7M [00:00<00:00, 177MB/s]
100%|##########| 44.7M/44.7M [00:00<00:00, 177MB/s]

On an Ascend NPU, refer to vision_npu, but install only torchvision. Install it from source so you do not break the existing torch/torch_npu dependencies; torch 2.1 pairs with torchvision 0.16. Everything else works the same as above.

Code language: shell
git clone https://github.com/pytorch/vision.git
cd vision
git checkout v0.16.0
# build the wheel
python setup.py bdist_wheel
# install
cd dist
pip3 install torchvision-0.16.*.whl

Next, we run the input data through the model, layer by layer, to make a prediction. This is the forward pass.

Code language: python
prediction = model(data) # forward pass

We use the model’s prediction and the corresponding label to calculate the error (loss). The next step is to backpropagate this error through the network. Backward propagation is kicked off when we call .backward() on the error tensor. Autograd then calculates and stores the gradients for each model parameter in the parameter’s .grad attribute.

Code language: python
loss = (prediction - labels).sum()
loss.backward() # backward pass

Next, we load an optimizer, in this case SGD with a learning rate of 0.01 and momentum of 0.9. We register all the parameters of the model in the optimizer.

Code language: python
optim = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

Finally, we call .step() to initiate gradient descent. The optimizer adjusts each parameter by its gradient stored in .grad.

Code language: python
optim.step() #gradient descent

At this point, you have everything you need to train your neural network. The below sections detail the workings of autograd - feel free to skip them.
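
Putting the pieces together, one complete training iteration with the objects defined above (model, data, labels, optim) looks roughly like the sketch below; the zero_grad() call is not shown in the steps above, but it is what you would normally use to clear old gradients between iterations:

Code language: python
optim.zero_grad()                    # clear any previously accumulated gradients
prediction = model(data)             # forward pass
loss = (prediction - labels).sum()   # compute the error
loss.backward()                      # backward pass: populate .grad attributes
optim.step()                         # gradient descent: update the parameters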


Differentiation in Autograd

Let’s take a look at how autograd collects gradients. We create two tensors a and b with requires_grad=True. This signals to autograd that every operation on them should be tracked.

Code language: python
import torch

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

We create another tensor Q from a and b.

Q = 3a^3 - b^2
Code language: python
Q = 3*a**3 - b**2
Assume a and b to be parameters of a neural network, and Q to be its error. In training, we want the gradients of the error with respect to the parameters, i.e.

\frac{\partial Q}{\partial a} = 9a^2

\frac{\partial Q}{\partial b} = -2b

When we call .backward() on Q, autograd calculates these gradients and stores them in the respective tensors’ .grad attribute.

We need to explicitly pass a gradient argument in Q.backward() because it is a vector. gradient is a tensor of the same shape as Q, and it represents the gradient of Q w.r.t. itself, i.e.

\frac{\partial Q}{\partial Q} = 1

Equivalently, we can also aggregate Q into a scalar and call backward implicitly, like Q.sum().backward().

Code language: python
external_grad = torch.tensor([1., 1.])
Q.backward(gradient=external_grad)

Gradients are now deposited in a.grad and b.grad.

Code language: python
# check if collected gradients are correct
print(9*a**2 == a.grad)
print(-2*b == b.grad)
Output:
tensor([True, True])
tensor([True, True])
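
As noted above, aggregating Q into a scalar and calling backward on the result gives the same gradients. A quick sketch (the tensors are recreated here because the graph from the earlier backward call has already been consumed):

Code language: python
a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)
Q = 3*a**3 - b**2

# Summing Q to a scalar lets us call backward() without a gradient argument
Q.sum().backward()
print(a.grad)   # tensor([36., 81.])  == 9*a**2
print(b.grad)   # tensor([-12., -8.]) == -2*b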

Optional Reading - Vector Calculus using autograd


Mathematically, if you have a vector valued function \vec{y}=f(\vec{x}), then the gradient of \vec{y} with respect to \vec{x} is a Jacobian matrix J:

\begin{aligned}J = \left(\begin{array}{cc} \frac{\partial \bf{y}}{\partial x_{1}} & ... & \frac{\partial \bf{y}}{\partial x_{n}} \end{array}\right) = \left(\begin{array}{ccc} \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\ \vdots & \ddots & \vdots\\ \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}} \end{array}\right) \end{aligned}

Generally speaking, torch.autograd is an engine for computing vector-Jacobian products. That is, given any vector \vec{v}, it computes the product J^{T}\cdot \vec{v}.

If \vec{v} happens to be the gradient of a scalar function l=g\left(\vec{y}\right) :

\vec{v}= \left(\begin{array}{ccc}\frac{\partial l}{\partial y_{1}} & \cdots & \frac{\partial l}{\partial y_{m}}\end{array}\right)^{T}

then by the chain rule, the vector-Jacobian product would be the gradient of l with respect to \vec{x} :

\begin{aligned} J^{T}\cdot \vec{v}=\left(\begin{array}{ccc} \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{1}}\\ \vdots & \ddots & \vdots\\ \frac{\partial y_{1}}{\partial x_{n}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}} \end{array}\right)\left(\begin{array}{c} \frac{\partial l}{\partial y_{1}}\\ \vdots\\ \frac{\partial l}{\partial y_{m}} \end{array}\right)=\left(\begin{array}{c} \frac{\partial l}{\partial x_{1}}\\ \vdots\\ \frac{\partial l}{\partial x_{n}} \end{array}\right) \end{aligned}

This characteristic of vector-Jacobian product is what we use in the above example; external_grad represents \vec{v} .
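
The same product can be requested directly from autograd. The sketch below uses a toy element-wise function y = x**2 (an assumption made here for illustration, not part of the original example), passes a hand-picked v to backward, and compares the result with the Jacobian-transpose product computed by hand:

Code language: python
x = torch.tensor([1., 2., 3.], requires_grad=True)
y = x ** 2                            # element-wise, so J is diag(2*x)

v = torch.tensor([0.1, 1.0, 10.0])    # an arbitrary vector v
y.backward(gradient=v)                # autograd computes J^T @ v

print(x.grad)                         # tensor([ 0.2000,  4.0000, 60.0000])
print(2 * x.detach() * v)             # the same product, computed by hand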

Computational Graph

Conceptually, autograd keeps a record of data (tensors) & all executed operations (along with the resulting new tensors) in a directed acyclic graph (DAG) consisting of Function objects. In this DAG, leaves are the input tensors, roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.

In a forward pass, autograd does two things simultaneously:

  • run the requested operation to compute a resulting tensor. During the forward pass we typically apply a series of operations (convolutions, fully connected layers, and so on) to the input data to compute the model's output; at this stage torch.autograd simply executes those operations, which can be any of the tensor operations PyTorch provides, such as addition, multiplication, or matrix multiplication, to produce the output tensor.
  • maintain the operation's gradient function in the DAG. Besides executing the operations, torch.autograd also maintains a computational graph (a directed acyclic graph, DAG) that records the dependencies between them: each node in the graph represents an operation and carries an associated gradient function describing how to compute its gradient, while the edges represent the dependencies between operations.

The backward pass kicks off when .backward() is called on the DAG root. autograd then:

  • computes the gradients from each .grad_fn. Every tensor produced by a tracked operation carries a .grad_fn attribute, a function object that knows how to compute the gradient of that operation; during backpropagation torch.autograd walks the graph and calls each node's .grad_fn to obtain the gradients (see the short sketch after this list).
  • accumulates them in the respective tensor's .grad attribute. Once computed, gradients are accumulated (summed) into the corresponding tensor's .grad attribute, so a tensor that is used in several places in the graph ends up with the sum of the gradients from all of those uses; this is exactly what we want, since we care about the gradient of the loss with respect to every trainable parameter, not the contribution of any single operation.
  • using the chain rule, propagates all the way to the leaf tensors. The graph contains two kinds of tensors: leaf tensors, which are the inputs to the graph (typically model parameters or input data), and non-leaf tensors, which are intermediate results produced by applying operations to them. Backpropagation starts from the loss (or any other scalar output) and works backwards, applying the chain rule step by step until the gradients reach the leaf tensors.
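
A sketch of how these pieces can be inspected directly (the exact class name printed for .grad_fn, such as SubBackward0, may vary between PyTorch versions):

Code language: python
a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)
Q = 3*a**3 - b**2

# Q was produced by a subtraction, so its grad_fn is a subtraction-backward node;
# user-created leaf tensors have no grad_fn of their own.
print(Q.grad_fn)    # e.g. <SubBackward0 object at 0x...>
print(a.grad_fn)    # None

# Gradients accumulate in .grad across backward calls
Q.sum().backward(retain_graph=True)
Q.sum().backward()
print(a.grad)       # 2 * 9*a**2 = tensor([ 72., 162.]), the two calls summed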

Below is a visual representation of the DAG in our example. In the graph, the arrows are in the direction of the forward pass. The nodes represent the backward functions of each operation in the forward pass.

The leaf nodes in blue represent our leaf tensors a and b.

DAGs are dynamic in PyTorch. An important thing to note is that the graph is recreated from scratch; after each .backward() call, autograd starts populating a new graph. This is exactly what allows you to use control flow statements in your model; you can change the shape, size and operations at every iteration if needed.
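
Because the graph is rebuilt on every call, ordinary Python control flow works naturally. A minimal sketch (the branching rule here is invented purely for illustration):

Code language: python
import torch

def forward(x, w):
    # The operations recorded in the graph can differ from call to call
    if x.sum() > 0:
        return (w * x).sum()
    return (w * x ** 2).sum()

w = torch.tensor([1., 2.], requires_grad=True)
for x in (torch.tensor([1., 2.]), torch.tensor([-1., -2.])):
    out = forward(x, w)
    out.backward()        # a fresh graph is built and consumed each time
    print(w.grad)         # tensor([1., 2.]) then tensor([1., 4.])
    w.grad.zero_()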

Exclusion from the DAG

torch.autograd tracks operations on all tensors which have their requires_grad flag set to True. For tensors that don't require gradients, setting this attribute to False excludes it from the gradient computation DAG.

The output tensor of an operation will require gradients even if only a single input tensor has requires_grad=True.

Code language: python
x = torch.rand(5, 5)
y = torch.rand(5, 5)
z = torch.rand((5, 5), requires_grad=True)

a = x + y
print(f"Does `a` require gradients?: {a.requires_grad}")
b = x + z
print(f"Does `b` require gradients?: {b.requires_grad}")
Output:
Does `a` require gradients?: False
Does `b` require gradients?: True

In a NN, parameters that don't compute gradients are usually called frozen parameters. It is useful to "freeze" part of your model if you know in advance that you won't need the gradients of those parameters (this offers some performance benefits by reducing autograd computations).

In finetuning, we freeze most of the model and typically only modify the classifier layers to make predictions on new labels. Let's walk through a small example to demonstrate this. As before, we load a pretrained resnet18 model, and freeze all the parameters.

Code language: python
from torch import nn, optim

model = resnet18(weights=ResNet18_Weights.DEFAULT)

# Freeze all the parameters in the network
for param in model.parameters():
    param.requires_grad = False

Let's say we want to finetune the model on a new dataset with 10 labels. In resnet, the classifier is the last linear layer model.fc. We can simply replace it with a new linear layer (unfrozen by default) that acts as our classifier.

model.fc = nn.Linear(512, 10)

Now all parameters in the model, except the parameters of model.fc, are frozen. The only parameters that compute gradients are the weights and bias of model.fc.
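
A quick way to confirm this (a sketch using named_parameters, not part of the original tutorial):

Code language: python
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)   # ['fc.weight', 'fc.bias']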

Code language: python
# Optimize only the classifier
optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

Notice although we register all the parameters in the optimizer, the only parameters that are computing gradients (and hence updated in gradient descent) are the weights and bias of the classifier.

The same exclusionary functionality is available as a context manager in torch.no_grad().
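
For example, a short sketch of running inference under the context manager so that no graph is built:

Code language: python
x = torch.rand(1, 3, 64, 64)

with torch.no_grad():
    # operations inside this block are not tracked by autograd
    out = model(x)

print(out.requires_grad)   # False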

Mainly adapted from

https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html

Original statement: This article was published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

In case of infringement, please contact cloudcommunity@tencent.com for removal.

