PyTorch Tutorial - 2.5. Automatic Differentiation - 电子发烧友网 (Elecfans)

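Recall from Section 2.4 that calculating derivatives is the crucial step in all of the optimization algorithms that we will use to train deep networks. While the calculations are straightforward, working them out by hand can be tedious and error-prone, and this problem only grows as our models become more complex.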

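Fortunately, all modern deep learning frameworks take this work off our plates by offering automatic differentiation (often shortened to autograd). As we pass data through each successive function, the framework builds a computational graph that tracks how each value depends on the others. To calculate derivatives, automatic differentiation works backwards through this graph, applying the chain rule. The computational algorithm for applying the chain rule in this fashion is called backpropagation.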
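To make the computational graph and the backward chain-rule sweep concrete, here is a minimal, framework-free sketch of reverse-mode differentiation on scalars. The Value class and its attributes are hypothetical names invented for this illustration. They are not part of PyTorch or any other library, and real frameworks are far more elaborate.

```python
class Value:
    """A scalar node in a computational graph (hypothetical helper, not a library API)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # nodes this value was computed from
        self.local_grads = local_grads  # d(self)/d(parent), one entry per parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    __radd__ = __add__
    __rmul__ = __mul__

    def backward(self):
        # Post-order traversal lists every node after the nodes it depends on
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk from the output back to the inputs, applying the chain rule
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad


# y = 2 * (x . x), evaluated at x = [0, 1, 2, 3]
xs = [Value(v) for v in (0.0, 1.0, 2.0, 3.0)]
y = 2 * sum(x * x for x in xs)
y.backward()
grads = [x.grad for x in xs]
print(y.data, grads)
```

Each operation records its inputs and local derivatives while the forward computation runs. The backward call then visits the nodes in reverse order, so every node has received its full gradient before passing it on to its inputs.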

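While autograd libraries have become a hot topic over the past decade, they have a long history. In fact, the earliest references to autograd date back over half a century (Wengert, 1964). The core ideas behind modern backpropagation date to a PhD thesis from 1980 (Speelpenning, 1980) and were further developed in the late 1980s (Griewank, 1989). While backpropagation has become the default method for computing gradients, it is not the only option. For instance, the Julia programming language employs forward propagation (Revels et al., 2016). Before exploring methods, let's first master the autograd package.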

import torch

from mxnet import autograd, np, npx

npx.set_np()

from jax import numpy as jnp

import tensorflow as tf

2.5.1. A Simple Function

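Let's assume that we are interested in differentiating the function y = 2x⊤x with respect to the column vector x. To start, we assign x an initial value.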

x = torch.arange(4.0)
x

tensor([0., 1., 2., 3.])

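Before we calculate the gradient of y with respect to x, we need a place to store it. In general, we avoid allocating new memory every time we take a derivative because deep learning requires successively computing derivatives with respect to the same parameters thousands or millions of times, and we might risk running out of memory. Note that the gradient of a scalar-valued function with respect to a vector x is vector-valued and has the same shape as x.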

# Can also create x = torch.arange(4.0, requires_grad=True)
x.requires_grad_(True)
x.grad # The gradient is None by default

x = np.arange(4.0)
x

array([0., 1., 2., 3.])

As with PyTorch, before we calculate the gradient of y with respect to x in MXNet, we need a place to store it. The gradient has the same shape as x.

# We allocate memory for a tensor's gradient by invoking `attach_grad`
x.attach_grad()
# After we calculate a gradient taken with respect to `x`, we will be able to
# access it via the `grad` attribute, whose values are initialized with 0s
x.grad

array([0., 0., 0., 0.])

x = jnp.arange(4.0)
x

No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

Array([0., 1., 2., 3.], dtype=float32)

x = tf.range(4, dtype=tf.float32)
x

x = tf.Variable(x)

We now calculate our function of x and assign the result to y.

y = 2 * torch.dot(x, x)
y

tensor(28., grad_fn=<MulBackward0>)

We can now take the gradient of y with respect to x by calling its backward method. Next, we can access the gradient via x's grad attribute.

y.backward()
x.grad

tensor([ 0., 4., 8., 12.])

# Our code is inside an `autograd.record` scope to build the computational
# graph
with autograd.record():
  y = 2 * np.dot(x, x)
y

array(28.)

We can now take the gradient of y with respect to x by calling its backward method. Next, we can access the gradient via x’s grad attribute.

y.backward()
x.grad

[09:38:36] src/base.cc:49: GPU context requested, but no GPUs found.

array([ 0., 4., 8., 12.])

y = lambda x: 2 * jnp.dot(x, x)
y(x)

Array(28., dtype=float32)

We can now take the gradient of y with respect to x by passing through the grad transform.

from jax import grad

# The `grad` transform returns a Python function that
# computes the gradient of the original function
x_grad = grad(y)(x)
x_grad

Array([ 0., 4., 8., 12.], dtype=float32)

# Record all computations onto a tape
with tf.GradientTape() as t:
  y = 2 * tf.tensordot(x, x, axes=1)
y

We can now calculate the gradient of y with respect to x by calling the gradient method.

x_grad = t.gradient(y, x)
x_grad

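We already know that the gradient of the function y = 2x⊤x with respect to x should be 4x. We can now verify that the automatic gradient computation and the expected result are identical.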

x.grad == 4 * x

tensor([True, True, True, True])
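As an independent sanity check on the claim that the gradient is 4x, the following framework-free sketch estimates the gradient with central finite differences. Since y = 2 Σ x_i², each partial derivative is 4 x_i. The helper name numerical_gradient and the step size h are assumptions chosen for illustration.

```python
def y(x):
    """y = 2 * x^T x for a plain Python list x."""
    return 2.0 * sum(v * v for v in x)


def numerical_gradient(f, x, h=1e-5):
    """Central-difference estimate of the gradient of f at x (illustrative helper)."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad


x = [0.0, 1.0, 2.0, 3.0]
approx = numerical_gradient(y, x)
expected = [4.0 * v for v in x]
max_error = max(abs(a - e) for a, e in zip(approx, expected))
print(approx, max_error)
```

Finite differences need two function evaluations per input component, which is one reason frameworks rely on backpropagation instead. They remain useful as a test for hand-written or automatic gradients.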

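Now let's calculate another function of x and take its gradient. Note that PyTorch does not automatically reset the gradient buffer when we record a new gradient. Instead, the new gradient is added to the gradient already stored. This behavior comes in handy when we want to optimize the sum of multiple objective functions. To reset the gradient buffer, we can call x.grad.zero_() as follows: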

x.grad.zero_() # Reset the gradient
y = x.sum()
y.backward()
x.grad

tensor([1., 1., 1., 1.])
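The accumulation behavior is easiest to see by calling backward twice without a reset in between. The sketch below uses a fresh tensor and the variable names first, accumulated, and after_reset, which are chosen here purely for illustration.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

# First backward pass: the gradient of sum(x) is a vector of ones
x.sum().backward()
first = x.grad.clone()

# Second backward pass without zeroing: the new gradient is added to the old one
x.sum().backward()
accumulated = x.grad.clone()

# After an explicit reset, only the latest gradient is stored
x.grad.zero_()
x.sum().backward()
after_reset = x.grad.clone()

print(first, accumulated, after_reset)
```

Forgetting the reset inside a training loop is a common source of silently wrong gradients, so it is worth checking whenever results look off.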

x.grad == 4 * x

array([ True, True, True, True])

Now let’s calculate another function of x and take its gradient. Note that MXNet resets the gradient buffer whenever we record a new gradient.

with autograd.record():
  y = x.sum()
y.backward()
x.grad # Overwritten by the newly calculated gradient

array([1., 1., 1., 1.])

x_grad == 4 * x

Array([ True, True, True, True], dtype=bool)

y = lambda x: x.sum()
grad(y)(x)

Array([1., 1., 1., 1.], dtype=float32)

x_grad == 4 * x

Now let’s calculate another function of x and take its gradient. Note that TensorFlow resets the gradient buffer whenever we record a new gradient.

with tf.GradientTape() as t:
  y = tf.reduce_sum(x)
t.gradient(y, x) # Overwritten by the newly calculated gradient

2.5.2. Backward for Non-Scalar Variables

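When y is a vector, the most natural interpretation of the derivative of y with respect to a vector x is a matrix called the Jacobian, which contains the partial derivatives of each component of y with respect to each component of x. Likewise, for higher-order y and x, the differentiation result could be an even higher-order tensor.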

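While Jacobians do show up in some advanced machine learning techniques, more commonly we want to sum up the gradients of each component of y with respect to the full vector x, yielding a vector of the same shape as x. For example, we often have a vector representing the value of our loss function calculated separately for each example in a batch of training examples. Here, we just want to sum up the gradients computed individually for each example.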

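Because deep learning frameworks vary in how they interpret gradients of non-scalar tensors, PyTorch takes some steps to avoid confusion. Invoking backward on a non-scalar elicits an error unless we tell PyTorch how to reduce the object to a scalar. More formally, we need to provide some vector $\mathbf{v}$ such that backward will compute $\mathbf{v}^\top \partial_{\mathbf{x}} \mathbf{y}$ rather than $\partial_{\mathbf{x}} \mathbf{y}$. This next part may be confusing, but for reasons that will become clear later, this argument (representing $\mathbf{v}$) is named gradient. For a more detailed description, see Yang Zhang's Medium post.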

x.grad.zero_()
y = x * x
y.backward(gradient=torch.ones(len(y))) # Faster: y.sum().backward()
x.grad

tensor([0., 2., 4., 6.])
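With a vector of ones, the vector-Jacobian product reduces to the gradient of the sum, which hides the role of v. The sketch below uses an arbitrary non-uniform v (an assumption chosen for illustration) and compares the result of backward with an explicit product against the full Jacobian obtained from torch.autograd.functional.jacobian.

```python
import torch
from torch.autograd.functional import jacobian

x = torch.arange(4.0, requires_grad=True)
v = torch.tensor([1.0, 0.5, 0.0, 2.0])  # arbitrary weights, for illustration only

y = x * x
y.backward(gradient=v)  # stores v^T (dy/dx) in x.grad

# Full 4x4 Jacobian of the elementwise square, which is diag(2 * x)
J = jacobian(lambda t: t * t, x.detach())
vjp = v @ J

print(x.grad)
print(vjp)
```

Building the full Jacobian costs memory quadratic in the length of x, whereas backward only ever materializes the product, which is why the latter is the default interface.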

MXNet handles this problem by reducing all tensors to scalars by summing before computing a gradient. In other words, rather than returning the Jacobian $\partial_{\mathbf{x}} \mathbf{y}$, it returns the gradient of the sum $\partial_{\mathbf{x}} \sum_i y_i$.

with autograd.record():
  y = x * x
y.backward()
x.grad # Equals the gradient of y = sum(x * x)

array([0., 2., 4., 6.])

y = lambda x: x * x
# grad is only defined for scalar output functions
grad(lambda x: y(x).sum())(x)

Array([0., 2., 4., 6.], dtype=float32)

By default, TensorFlow returns the gradient of the sum. In other words, rather than returning the Jacobian $\partial_{\mathbf{x}} \mathbf{y}$, it returns the gradient of the sum $\partial_{\mathbf{x}} \sum_i y_i$.

with tf.GradientTape() as t:
  y = x * x
t.gradient(y, x) # Same as y = tf.reduce_sum(x * x)

2.5.3. Detaching Computation

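Sometimes, we wish to move some calculations outside of the recorded computational graph. For example, say that we use the input to create some auxiliary intermediate terms for which we do not want to compute a gradient. In this case, we need to detach the respective computational graph from the final result. The following toy example makes this clearer: suppose we have z = x * y and y = x * x, but we want to focus on the direct influence of x on z rather than the influence conveyed via y. In this case, we can create a new variable u that takes the same value as y but whose provenance (how it was created) has been wiped out. Thus u has no ancestors in the graph, and gradients do not flow through u to x. For example, taking the gradient of z = x * u will yield the result u (not 3 * x * x, as you might have expected since z = x * x * x).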

x.grad.zero_()
y = x * x
u = y.detach()
z = u * x

z.sum().backward()
x.grad == u

tensor([True, True, True, True])
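To see what detaching changes, it helps to run the same computation with and without it. The following sketch uses a fresh tensor, and the names grad_full and grad_detached are chosen here for illustration.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

# Without detach: z = x * x * x, so the gradient is 3 * x^2
z_full = (x * x) * x
z_full.sum().backward()
grad_full = x.grad.clone()

x.grad.zero_()

# With detach: u is treated as a constant, so the gradient is just u = x^2
u = (x * x).detach()
z_detached = u * x
z_detached.sum().backward()
grad_detached = x.grad.clone()

print(grad_full)      # matches 3 * x * x
print(grad_detached)  # matches x * x
```

The detached tensor u shares its values with x * x but reports requires_grad as False, which is why no gradient reaches x through it.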

with autograd.record():
  y = x * x
  u = y.detach()
  z = u * x
z.backward()
x.grad == u

array([ True, True, True, True])

import jax

y = lambda x: x * x
# jax.lax primitives are Python wrappers around XLA operations
u = jax.lax.stop_gradient(y(x))
z = lambda x: u * x

grad(lambda x: z(x).sum())(x) == y(x)

Array([ True, True, True, True], dtype=bool)

# Set persistent=True to preserve the compute graph.
# This lets us run t.gradient more than once
with tf.GradientTape(persistent=True) as t:
  y = x * x
  u = tf.stop_gradient(y)
  z = u * x

x_grad = t.gradient(z, x)
x_grad == u

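Note that while this procedure detaches y's ancestors from the graph leading to z, the computational graph leading to y persists, and thus we can still calculate the gradient of y with respect to x.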

x.grad.zero_()
y.sum().backward()
x.grad == 2 * x

tensor([True, True, True, True])

y.backward()
x.grad == 2 * x

array([ True, True, True, True])

grad(lambda x: y(x).sum())(x) == 2 * x

Array([ True, True, True, True], dtype=bool)

t.gradient(y, x) == 2 * x

2.5.4. Gradients and Python Control Flow

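So far we have reviewed cases where the path from input to output was well defined via a function such as z = x * x * x. Programming offers us a lot more freedom in how we compute results. For instance, we can make them depend on auxiliary variables or condition choices on intermediate results. One benefit of using automatic differentiation is that even if building the computational graph of a function requires passing through a maze of Python control flow (e.g., conditionals, loops, and arbitrary function calls), we can still calculate the gradient of the resulting variable. To illustrate this, consider the following code snippet, where the number of iterations of the while loop and the evaluation of the if statement both depend on the value of the input a.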

def f(a):
  b = a * 2
  while b.norm() < 1000:
    b = b * 2
  if b.sum() > 0:
    c = b
  else:
    c = 100 * b
  return c

def f(a):
  b = a * 2
  while np.linalg.norm(b) < 1000:
    b = b * 2
  if b.sum() > 0:
    c = b
  else:
    c = 100 * b
  return c

def f(a):
  b = a * 2
  while jnp.linalg.norm(b) < 1000:
    b = b * 2
  if b.sum() > 0:
    c = b
  else:
    c = 100 * b
  return c

def f(a):
  b = a * 2
  while tf.norm(b) < 1000:
    b = b * 2
  if tf.reduce_sum(b) > 0:
    c = b
  else:
    c = 100 * b
  return c

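Below, we call this function, passing in a random value as input. Since the input is a random variable, we do not know what form the computational graph will take. However, whenever we execute f(a) on a specific input, we realize a specific computational graph and can subsequently run backward.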

a = torch.randn(size=(), requires_grad=True)
d = f(a)
d.backward()

a = np.random.normal()
a.attach_grad()
with autograd.record():
  d = f(a)
d.backward()

from jax import random

a = random.normal(random.PRNGKey(1), ())
d = f(a)
d_grad = grad(f)(a)

a = tf.Variable(tf.random.normal(shape=()))
with tf.GradientTape() as t:
  d = f(a)
d_grad = t.gradient(d, a)
d_grad

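Even though our function f is a bit contrived for demonstration purposes, its dependence on the input is quite simple: it is a linear function of a with a piecewise-defined scale. As such, f(a) / a is a vector of constant entries and, moreover, f(a) / a needs to match the gradient of f(a) with respect to a.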

a.grad == d / a

tensor(True)
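The piecewise-linear structure can also be checked without any framework. The sketch below re-implements f for plain Python floats (using abs in place of the norm, an adaptation made for this illustration) and compares the ratio f(a) / a with a central-difference slope at a few hand-picked inputs. The helper slope and the chosen inputs are assumptions, picked to stay away from a = 0 and from the points where the loop count changes.

```python
import math


def f(a):
    """Scalar, plain-float version of the control-flow example."""
    b = a * 2
    while abs(b) < 1000:
        b = b * 2
    if b > 0:
        c = b
    else:
        c = 100 * b
    return c


def slope(func, a, h=1e-6):
    """Central-difference estimate of d func / d a (illustrative helper)."""
    return (func(a + h) - func(a - h)) / (2 * h)


# Each input takes a different path through the loop and the branch
results = [(a, f(a) / a, slope(f, a)) for a in (1.0, -1.0, 0.3)]
for a, ratio, estimate in results:
    print(a, ratio, estimate, math.isclose(ratio, estimate, rel_tol=1e-6))
```

Different inputs yield different scale factors because the loop runs a different number of times or the other branch is taken, yet within each region the function is linear, so the ratio and the slope agree.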

a.grad == d / a

array(True)

d_grad == d / a

Array(True, dtype=bool)

d_grad == d / a

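Dynamic control flow is very common in deep learning. For instance, when processing text, the computational graph depends on the length of the input. In these cases, automatic differentiation becomes vital for statistical modeling, since it is impossible to compute the gradient a priori.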

2.5.5. Discussion

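You have now had a taste of the power of automatic differentiation. The development of libraries for calculating derivatives both automatically and efficiently has been a massive productivity booster for deep learning practitioners, freeing them to focus on higher-level concerns. Moreover, autograd lets us design massive models for which pen-and-paper gradient computations would be prohibitively time-consuming. Interestingly, while we use autograd to optimize models (in a statistical sense), the optimization of autograd libraries themselves (in a computational sense) is a rich subject of vital interest to framework designers. Here, tools from compilers and graph manipulation are used to compute results in the most expedient and memory-efficient manner.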

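For now, try to remember these basics: (i) attach gradients to those variables with respect to which we desire derivatives; (ii) record the computation of the target value; (iii) execute the backpropagation function; and (iv) access the resulting gradient.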

2.5.6. Exercises

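Why is the second derivative much more expensive to compute than the first derivative?

After running the function for backpropagation, immediately run it again and see what happens. Why?

In the control flow example where we calculate the derivative of d with respect to a, what would happen if we changed the variable a to a random vector or a matrix? At this point, the result of the calculation f(a) is no longer a scalar. What happens to the result? How do we analyze this?

Let $f(x) = \sin(x)$. Plot the graph of $f$ and of its derivative $f'$. Do not exploit the fact that $f'(x) = \cos(x)$ but rather use automatic differentiation to get the result.

Let $f(x) = ((\log x^2) \cdot \sin x) + x^{-1}$. Write out a dependency graph tracing results from $x$ to $f(x)$.

Use the chain rule to compute the derivative $\frac{df}{dx}$ of the aforementioned function, placing each term on the dependency graph that you constructed previously.

Given the graph and the intermediate derivative results, you have a number of options when computing the gradient. Evaluate the result once starting from $x$ to $f$ and once from $f$ tracing back to $x$. The path from $x$ to $f$ is commonly known as forward differentiation, whereas the path from $f$ to $x$ is known as backward differentiation.

When might you want to use forward differentiation, and when backward differentiation? Hint: consider the amount of intermediate data needed, the ability to parallelize steps, and the size of the matrices and vectors involved.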
