PyTorch Tutorial 13.3: Automatic Parallelism


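Deep learning frameworks (e.g., MXNet and PyTorch) automatically construct computational graphs at the backend. Using a computational graph, the system is aware of all the dependencies and can selectively execute multiple non-interdependent tasks in parallel to improve speed. For instance, Fig. 13.2.2 in Section 13.2 initializes two variables independently. Consequently, the system can choose to execute them in parallel.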
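To make the notion of non-interdependent tasks concrete, the following is a minimal sketch in plain Python of how a scheduler could group the nodes of a dependency graph into stages whose members may run in parallel. The graph, the task names, and the helper `parallel_stages` are hypothetical illustrations; this is not how PyTorch or MXNet implement scheduling.

```python
def parallel_stages(deps):
    """Group tasks into stages; tasks within one stage are independent.

    deps maps each task name to the set of tasks it depends on.
    Every dependency must itself appear as a key of deps.
    """
    remaining = {task: set(d) for task, d in deps.items()}
    stages = []
    while remaining:
        # A task is ready once all of its dependencies have been scheduled.
        ready = sorted(t for t, d in remaining.items() if not d)
        if not ready:
            raise ValueError("dependency cycle detected")
        stages.append(ready)
        for t in ready:
            del remaining[t]
        for d in remaining.values():
            d.difference_update(ready)
    return stages


# Hypothetical graph: a and b are initialized independently, c needs both.
graph = {"a": set(), "b": set(), "c": {"a", "b"}}
stages = parallel_stages(graph)
print(stages)
```

Here `a` and `b` land in the same stage, so a backend would be free to run them at the same time, while `c` has to wait for both.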

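Typically, a single operator will use all the computational resources on all CPUs or on a single GPU. For example, the dot operator will use all cores (and threads) on all CPUs, even if there are multiple CPU processors on a single machine. The same applies to a single GPU. Hence parallelization is not very useful on single-device computers. With multiple devices it matters more. While parallelization is typically most relevant between multiple GPUs, adding the local CPU will increase performance slightly. For example, see Hadjis et al. (2016), which focuses on training computer vision models combining a GPU and a CPU. With the convenience of an automatically parallelizing framework, we can accomplish the same goal in a few lines of Python code. More broadly, our discussion of automatic parallel computation focuses on parallel computation using both CPUs and GPUs, as well as the parallelization of computation and communication.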

Note that we need at least two GPUs to run the experiments in this section.

import torch
from d2l import torch as d2l

MXNet version:

from mxnet import np, npx
from d2l import mxnet as d2l

npx.set_np()

13.3.1. Parallel Computation on GPUs

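Let's start by defining a reference workload to test: the run function below performs 50 matrix-matrix multiplications on the device of our choice, using data allocated into two variables: x_gpu1 and x_gpu2.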

devices = d2l.try_all_gpus()
def run(x):
  return [x.mm(x) for _ in range(50)]

x_gpu1 = torch.rand(size=(4000, 4000), device=devices[0])
x_gpu2 = torch.rand(size=(4000, 4000), device=devices[1])

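Now we apply the function to the data. To ensure that caching does not play a role in the results, we warm up the devices by performing a single pass on each of them prior to measuring. torch.cuda.synchronize() waits for all kernels in all streams on a CUDA device to complete. It takes a device argument, the device that we need to synchronize. If the device argument is None (the default), it uses the current device, as given by current_device().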

run(x_gpu1)
run(x_gpu2) # Warm-up all devices
torch.cuda.synchronize(devices[0])
torch.cuda.synchronize(devices[1])

with d2l.Benchmark('GPU1 time'):
  run(x_gpu1)
  torch.cuda.synchronize(devices[0])

with d2l.Benchmark('GPU2 time'):
  run(x_gpu2)
  torch.cuda.synchronize(devices[1])

GPU1 time: 0.4967 sec
GPU2 time: 0.5151 sec

If we remove the synchronize statement between both tasks, the system is free to parallelize computation on both devices automatically.

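with d2l.Benchmark('GPU1 & GPU2'):
  run(x_gpu1)
  run(x_gpu2)
  # Wait for both devices; synchronize() without an argument only waits for the current one
  torch.cuda.synchronize(devices[0])
  torch.cuda.synchronize(devices[1])

GPU1 & GPU2: 0.5000 sec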
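The pattern of launching both tasks and synchronizing only once at the end can be imitated without any GPU. The sketch below is an analogy in plain Python, assuming that a sleeping thread is an acceptable stand-in for a kernel running on a device; `fake_kernel`, `run_serial`, and `run_overlapped` are hypothetical names, not part of PyTorch or d2l.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_kernel(seconds):
    """Stand-in for work on one device; sleeping does not block other threads."""
    time.sleep(seconds)
    return seconds


def run_serial(durations):
    start = time.perf_counter()
    for d in durations:
        fake_kernel(d)  # wait for each task before starting the next
    return time.perf_counter() - start


def run_overlapped(durations):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(durations)) as pool:
        futures = [pool.submit(fake_kernel, d) for d in durations]
        for f in futures:
            f.result()  # single synchronization point at the end
    return time.perf_counter() - start


serial = run_serial([0.2, 0.2])
overlapped = run_overlapped([0.2, 0.2])
print(f"serial: {serial:.2f} sec, overlapped: {overlapped:.2f} sec")
```

The overlapped run should take roughly as long as one task rather than two, mirroring the benchmark above in which the combined time is close to the time of a single GPU.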

						

MXNet version:

devices = d2l.try_all_gpus()
def run(x):
  return [x.dot(x) for _ in range(50)]

x_gpu1 = np.random.uniform(size=(4000, 4000), ctx=devices[0])
x_gpu2 = np.random.uniform(size=(4000, 4000), ctx=devices[1])

Now we apply the function to the data. To ensure that caching does not play a role in the results, we warm up the devices by performing a single pass on each of them prior to measuring.

run(x_gpu1) # Warm-up both devices
run(x_gpu2)
npx.waitall()

with d2l.Benchmark('GPU1 time'):
  run(x_gpu1)
  npx.waitall()

with d2l.Benchmark('GPU2 time'):
  run(x_gpu2)
  npx.waitall()

GPU1 time: 0.5233 sec
GPU2 time: 0.5158 sec

If we remove the waitall statement between both tasks the system is free to parallelize computation on both devices automatically.

with d2l.Benchmark('GPU1 & GPU2'):
  run(x_gpu1)
  run(x_gpu2)
  npx.waitall()

GPU1 & GPU2: 0.5214 sec


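In the case above, the total execution time is less than the sum of its parts, since the deep learning framework automatically schedules computation on both GPU devices without the need for sophisticated code on behalf of the user.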
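As a quick check of this claim, the arithmetic below uses only the PyTorch timings reported above. The variable names are ours, and the "sequential estimate" assumes the two tasks would simply run back to back.

```python
# Timings reported above for the PyTorch run (seconds).
gpu1_time = 0.4967
gpu2_time = 0.5151
both_time = 0.5000

# Cost if the two tasks ran one after the other.
sequential_estimate = gpu1_time + gpu2_time
speedup = sequential_estimate / both_time

print(f"sequential estimate: {sequential_estimate:.4f} sec")
print(f"speedup from overlap: {speedup:.2f}x")
```

The ratio comes out at about 2, which is what one would hope for with two equally loaded GPUs. Any value slightly above 2 is best read as measurement noise.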

13.3.2. Parallel Computation and Communication

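In many cases we need to move data between different devices, say between the CPU and GPU, or between different GPUs. For instance, this occurs when we want to perform distributed optimization, where we need to aggregate the gradients over multiple accelerator cards. Let's simulate this by computing on the GPU and then copying the results back to the CPU.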

def copy_to_cpu(x, non_blocking=False):
  return [y.to('cpu', non_blocking=non_blocking) for y in x]

with d2l.Benchmark('Run on GPU1'):
  y = run(x_gpu1)
  torch.cuda.synchronize()

with d2l.Benchmark('Copy to CPU'):
  y_cpu = copy_to_cpu(y)
  torch.cuda.synchronize()

Run on GPU1: 0.5019 sec
Copy to CPU: 2.7168 sec

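This is somewhat inefficient. Note that we could already start copying parts of y to the CPU while the remainder of the list is still being computed. This situation occurs, e.g., when we compute the (backprop) gradient on a minibatch. The gradients of some of the parameters will be available earlier than those of others. Hence it works to our advantage to start using PCI-Express bus bandwidth while the GPU is still running. In PyTorch, several functions such as to() and copy_() accept an explicit non_blocking argument, which lets the caller bypass synchronization when it is unnecessary. Setting non_blocking=True allows us to simulate this scenario.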

with d2l.Benchmark('Run on GPU1 and copy to CPU'):
  y = run(x_gpu1)
  y_cpu = copy_to_cpu(y, True)
  torch.cuda.synchronize()

Run on GPU1 and copy to CPU: 2.4682 sec


MXNet version:

def copy_to_cpu(x):
  return [y.copyto(npx.cpu()) for y in x]

with d2l.Benchmark('Run on GPU1'):
  y = run(x_gpu1)
  npx.waitall()

with d2l.Benchmark('Copy to CPU'):
  y_cpu = copy_to_cpu(y)
  npx.waitall()

Run on GPU1: 0.5796 sec
Copy to CPU: 3.0989 sec

This is somewhat inefficient. Note that we could already start copying parts of y to the CPU while the remainder of the list is still being computed. This situation occurs, e.g., when we compute the gradient on a minibatch. The gradients of some of the parameters will be available earlier than that of others. Hence it works to our advantage to start using PCI-Express bus bandwidth while the GPU is still running. Removing waitall between both parts allows us to simulate this scenario.

with d2l.Benchmark('Run on GPU1 and copy to CPU'):
  y = run(x_gpu1)
  y_cpu = copy_to_cpu(y)
  npx.waitall()

Run on GPU1 and copy to CPU: 3.3488 sec


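The total time required for both operations is (as expected) less than the sum of their parts. Note that this task is different from parallel computation, as it uses a different resource: the bus between the CPU and GPUs. In fact, we could compute on both devices and communicate, all at the same time. As noted above, there is a dependency between computation and communication: y[i] must be computed before it can be copied to the CPU. Fortunately, the system can copy y[i-1] while computing y[i] to reduce the total running time.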
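The overlap of computing y[i] with copying y[i-1] can be sketched as a simple timing model. This is an idealized illustration under stated assumptions: every item costs the same to compute and to copy, there is one compute unit and one copy channel, and the per-item costs below are hypothetical rather than measured. The helper names `serial_time` and `pipelined_time` are ours.

```python
def serial_time(n, compute, copy):
    """All n items are computed first, then all of them are copied."""
    return n * compute + n * copy


def pipelined_time(n, compute, copy):
    """Each item is copied while the next one is being computed."""
    compute_done = 0.0  # time at which the current item finishes computing
    copy_done = 0.0     # time at which the copy channel becomes free
    for _ in range(n):
        compute_done += compute
        # A copy starts once its item exists and the previous copy has finished.
        copy_done = max(copy_done, compute_done) + copy
    return copy_done


# Hypothetical per-item costs in seconds, chosen only for illustration.
n, compute, copy = 50, 0.01, 0.05
print(f"no overlap:   {serial_time(n, compute, copy):.2f} sec")
print(f"with overlap: {pipelined_time(n, compute, copy):.2f} sec")
```

With these numbers the model gives roughly 3.0 seconds without overlap and roughly 2.51 seconds with it: only the first computation is exposed, and the rest is hidden behind the slower copies. Real measurements will deviate from such a model, so it should be read as intuition rather than a prediction.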

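We conclude with an illustration of the computational graph and its dependencies for a simple two-layer MLP when training on a CPU and two GPUs, as depicted in Fig. 13.3.1. It would be quite painful to schedule the resulting parallel program manually. This is where it is advantageous to have a graph-based computing backend for optimization.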