总结FasterTransformer Encoder(BERT)的cuda相关优化技巧-电子发烧友网

FasterTransformer BERT

FasterTransformer BERT 包含优化的 BERT 模型、高效的 FasterTransformer 和 INT8 量化推理。

模型结构

标准的 BERT 和高效的 FasterTransformer

FasterTransformer 编码器支持以下配置。

Batch size (B1): 批量大小 <= 4096

Sequence length (S): 序列长度 <= 4096。对于 INT8 模型，当 S > 384 时 S 需要是 32 的倍数。

Size per head (N): 小于 128 的偶数。

Head number (H): 在 FP32 下满足 H * N <= 1024 或在 FP16 下满足 H * N <= 2048 的任何数字。

Data type: FP32, FP16, BF16, INT8 and FP8 (Experimental).

如果内存足够，任意层数（N1）

在 FasterTransformer v1.0 中，我们提供了高度优化的 BERT 等效编码器模型。

接下来，基于Effective Transformer的思想，我们在 FasterTransformer v2.1 中通过去除无用的 padding 来进一步优化BERT推理，并提供 Effective FasterTransformer。

在 FasterTransformer v3.0 中，我们提供了 INT8 量化推理以获得更好的性能。

在 FasterTransformer v3.1 中，我们优化了 INT8 Kernel 以提高 INT8 推理的性能，并将 TensorRT 的多头注意力插件集成到 FasterTransformer 中。

在 FasterTransformer v4.0 中，我们添加了多头注意力 Kernel 支持 V100 的 FP16 模式和 T4, A100 的 INT8 模式。

下图演示了除 INT8 外的这些优化的流程图。在FasterTransformer v5.0中，我们重构了代码，将 mask building 和 padding 移动到 Bert 的 forward 函数中，并在 Ampere GPU 上基于稀疏特性来加速GEMM。

在 FasterTransformer v5.1 中，我们支持对 Bert FP16 进行进行多节点多 GPU 推理。

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-v85Qr1r0-1674747365445)(null)]

BERT 模型是 google 在2018年提出的。FasterTransformer 的encoder 相当于 BERT 模型，但是做了很多优化。

图 1 最左边的流程显示了 FasterTransformer 中的优化。经过优化后，FasterTransformer 仅使用 8 或 6 个 gemms（蓝色块）和 6 个自定义 CUDA kernel（绿色块）来实现一个 transformer 块。

对于 Effective FasterTransformer，主要思想是去除句子的填充以防止计算无用的标记。当一个 Batch 的平均序列长度远小于最大序列长度时，此方法可以节省大量时间。

图 2 显示了我们使用的想法和偏移量（橙色）。要实现 Effective FasterTransformer，我们需要考虑两个问题。

首先，我们需要去除 BERT 之前的 padding，离开 BERT 之后重建 padding 以保持结果的形状。

这很简单，带来的开销基本可以忽略。第二个问题是多头注意力的计算。

一个天真的解决方案是在多头注意力之前重建填充并在多头注意力之后移除填充，如图 1 的第二个流程图所示。

因为我们可以将这些重建/移除融合到其他 kernel 中，额外的开销也是可以忽略的。

为了进一步提高多头注意力的性能，我们集成了 TensorRT 的多头注意力，将整个注意力计算融合到一个 kernel 中。

源代码在这里。该 kernel 同时支持 Effective FasterTransformer 和标准 BERT 模型。

图 1 中的第三个和第四个流程图显示了工作流程。

有了这样的 kernel ，我们就不用担心多头注意力的填充问题了。

这个 kernel 需要另一个偏移量，如图 2 所示。

第一个偏移量 [0, 0, 1, 3, 3, 3]比较好理解，直接和[0, 1, 2, 3, 4, 5]迭代就可以得到原始的位置了。第二个偏移量是从0位置开始，记录连续的原始token个数，比如我们将[0, 2, 3, 6]做差分，得到[2, 1, 3]也对应了原始的数据中每行做的padding的tokn数目。

此外，我们发现 padding 会影响某些任务的准确性，尽管它们应该是无用的。因此，我们建议删除下游任务最终输出中的填充。

编码器的参数、输入和输出：

Constructor of BERT

Classification	Name	Data Type	Description
[0]	max_batch_size	int	Deprecated, move to input
[1]	max_seq_len	int	Deprecated, move to input
[2]	head_num	int	Head number for model configuration
[3]	size_per_head	int	Size per head for model configuration
[4]	inter_size	int	The inter size of feed forward network. It is often set to 4 * head_num * size_per_head.
[5]	num_layer	int	Number of transformer layers for model configuration
[6]	sm	int	The compute capacity of GPU
[7]	q_scaling	float	It is used to scale the query before the batch multiplication of query and key
[8]	stream	cudaStream_t	CUDA stream
[9]	cublas_wrapper	cublasMMWrapper*	Pointer of cuBLAS wrapper, which is declared in src/fastertransformer/utils/cublasMMWrapper.h
[10]	allocator	IAllocator*	Pointer of memory allocator, which is declared in src/fastertransformer/utils/allocator.h
[11]	is_free_buffer_after_forward	bool	If setting to be true, FasterTransformer will allocate buffer before forward, and free buffer after forward. When the allocator is based on memory pool, setting to true may help reducing the memory usage during inference.
[12]	attention_type	AttentionType	Determine fusing the attention or not, remove padding or not, which is declared in src/fastertransformer/layers/attention_layers/BaseAttentionLayer.h
[13]	sparse	bool	Is using sparsity. Experimental feature
[14]	activation_type	ActivationType	Determine the activation in FFN, which is declared in src/fastertransformer/layers/attention_layers/FfnLayer.h
[15]	layernorm_type	LayerNormType	Determine using pre-layernorm or post-layernorm, which is declared in src/fastertransformer/kernels/layernorm_kernels.h
[16]	tensor_para	NcclParam	Tensor Parallel information, which is declared in src/fastertransformer/utils/nccl_utils.h
[17]	pipeline_para	NcclParam	Pipeline Parallel information, which is declared in src/fastertransformer/utils/nccl_utils.h
[18]	custom_all_reduce_comm	AbstractCustomComm	Custom all reduction communication for custom all reduction in model parallelism. It is only supported in 8-way tensor parallelism
[19]	enable_custom_all_reduce	int	Flag of enabling custom all reduction or not

Input of BERT

Name	Tensor/Parameter Shape	Location	Data Type	Description
input_hidden_state	[batch_size, sequence_length, head_num * size_per_head]	GPU	fp32/fp16/bf16	The input of transformer layer
input_lengths	[batch_size]	GPU	int	The lengths of input_hidden_state

Output of BERT

Name	Tensor/Parameter Shape	Location	Data Type	Description
output_hidden_state	[batch_size, sequence_length, head_num * size_per_head]	GPU	fp32/fp16/bf16	The output of transformer layer

上面声明了 Bert 模型的输入参数，以及输入和输出Tensor的shape。

此外，注意到 TensorRT 的多头注意力Kernel虽然功能很强大但是也有一些限制。首先，这个kernel需要 Turing 或者更新架构的 GPU，并且每个头的大小必须是64。当条件不满足时，我们使用FasterTransformer的原始多头注意力实现。其次，它需要一个额外的序列长度偏移量，如Figure2所示，更多的细节在这里。

当输入有 padding 时，序列长度偏移的形状为。假设这里有3个序列，长度分别为 , , ，然后 padding 之后的序列长度为。那么序列长度偏移时。即，序列长度偏移记录了每个句子的序列长度。当我们有 padding 时，我们将 padding 视为一些独立的句子。

在 FasterTransformer v4.0 中，我们实现了两条 INT8 推理的流水线，如图 3 所示。对于 int8_mode == 1 (int8v1)，我们不量化残差连接，使用 int32 作为 int8 gemms 的输出，并对权重采用逐通道的量化方式。

对于 int8_mode == 2 (int8v2)，我们量化残差连接，使用 int8 作为 int8 gemms 的输出，并对权重采用逐张量的量化。一般来说，int8_mode == 1 的精度更高，而 int8_mode == 2 的性能更好。

feature	int8_mode == 1	int8_mode == 2
quantize residual	No	Yes
int8 output gemm	No	Yes
per-channel quantiztion for weights	Yes	No

对于 INT8 推理，需要量化模型。我们提供了 TensorFlow 量化工具和示例代码，同时还提供了带有 TensorRT 量化工具的 PyTorch 示例代码。

请先参考bert-quantization/bert-tf-quantization和examples/pytorch/bert/bert-quantization-sparsity中的README。

在 FasterTransformer v5.0 中，我们支持稀疏 gemm 以利用 Ampere GPU 的稀疏特性。我们还提供了一个关于 PyTorch 的示例。

在 FasterTransformer v5.1 中，我们支持 BERT 模型的多 GPU 多节点推理。

优化点解读

优化主要是针对 Figure 1 也就是 BERT 的编码器模块的各个组件来讲(我这里忽略了 Figure1 的和 padding 相关的组建的讲解，感兴趣的读者可以自己看看 FasterTransformer)。

我么先把 BERT 的多头注意力机制的实现贴一下，方便下面的讲解：

importtorch.nnasnn

classAttention(nn.Module):
"""
Compute'ScaledDotProductAttention
"""

defforward(self,query,key,value,mask=None,dropout=None):
scores=torch.matmul(query,key.transpose(-2,-1))
/math.sqrt(query.size(-1))

ifmaskisnotNone:
scores=scores.masked_fill(mask==0,-1e9)

p_attn=F.softmax(scores,dim=-1)

ifdropoutisnotNone:
p_attn=dropout(p_attn)

returntorch.matmul(p_attn,value),


classMultiHeadedAttention(nn.Module):
"""
Takeinmodelsizeandnumberofheads.
"""

def__init__(self,h,d_model,dropout=0.1):
super().__init__()
assertd_model%h==0

#Weassumed_valwaysequalsd_k
self.d_k=d_model//h
self.h=h

self.linear_layers=nn.ModuleList([nn.Linear(d_model,d_model)for_inrange(3)])
self.output_linear=nn.Linear(d_model,d_model)
self.attention=Attention()

self.dropout=nn.Dropout(p=dropout)

defforward(self,query,key,value,mask=None):
batch_size=query.size(0)

#1)Doallthelinearprojectionsinbatchfromd_model=>hxd_k
query,key,value=[l(x).view(batch_size,-1,self.h,self.d_k).transpose(1,2)
forl,xinzip(self.linear_layers,(query,key,value))]

#2)Applyattentiononalltheprojectedvectorsinbatch.
x,attn=self.attention(query,key,value,mask=mask,dropout=self.dropout)

#3)"Concat"usingaviewandapplyafinallinear.
x=x.transpose(1,2).contiguous().view(batch_size,-1,self.h*self.d_k)

returnself.output_linear(x)

Compute Q, K, V by three GEMMs or one Batch GEMM

这里的意思就是计算 Q，K，V 的时候有两种方式，一种是用3个单独的gemm算子对应FasterTransformer v1.0版本的这段代码：

另外一种就是通过一个 Batch GEMM算子同时完成对 Q, K, V 的计算，

add_QKV_bias 优化

这个是针对上面forward函数中 (1) 这部分存在的分别对 Q, K, V进行bias_add以及transpose的优化，将其融合成一个cuda kernel。这里启动 add_QKV_bias 的参数来看。

对于FP32，FasterTransformer是启动 batch_size * seq_len * 3 个 Block，每个 Block 里面启动 head_num * size_per_head 个线程只处理一个token（对应 head_num * size_per_head 次计算）的 bias_add 计算。

我们注意到这里还将输入的shape进行了改变，也就是将原始的[batch_size, seq_length, head_num * size_per_head] -> [batch_size, seq_length, head_num, size_per_head]（对应 .view(batch_size, -1, self.h, self.d_k)）->[batch_size, head_num, seq_length, size_per_head]（对应.transpose(1, 2)），这个过程对应了 https://github.com/NVIDIA/FasterTransformer/blob/release/v1.0_tag/fastertransformer/cuda/open_attention.cu#L149 这里的索引代码。

而对于FP16模式，FasterTransformer是启动 batch_size * seq_len 个 Block，,每个 Block 里面启动 head_num * size_per_head 个线程同时处理QKV的同一个token（对应head_num * size_per_head次计算），在实际计算时会把half pack成half2进行计算：https://github.com/NVIDIA/FasterTransformer/blob/release/v1.0_tag/fastertransformer/cuda/open_attention.cu#L172 ，并使用了half2相关的数学函数。这样不仅仅可以达到2倍于half的访存带宽和计算吞吐，还可以极大地减少指令的发射数量。

高效的softmax kernel

这里我没有怎么看，因为oneflow已经有一个比FasterTransformer更好的softmax kernel实现了。

transpose kernel

这个 kernel 是对应上面 BERT 的 Encoder 部分的：

x=x.transpose(1,2).contiguous().view(batch_size,-1,self.h*self.d_k)

这里的 x 的 shape 仍然和之前的 q 的 shape 一致，为[batch_size, head_num, seq_length, size_per_head]。

因为Attetion 层不会改变输入的形状，因为 Attention 的计算过程是：q * k 转置(.transpose(2, 3))，除以 d_k ** 0.5，输出维度是 [b, head_num , seq_length, seq_length] 即单词和单词直接的相似性，然后对最后一个维度进行 softmax 操作得到 [b, head_num, seq_length, seq_length] , 最后和 v（shape 也是 [batch_size, head_num, seq_length, size_per_head]）做一个矩阵乘法，结果的 shape 和输入的 shape 形状都是：[batch_size, head_num, seq_length, size_per_head] 。因此这里的 x.transpose(1, 2) 就是把 shape 为 [batch_size, head_num, seq_length, size_per_head] 的 x 重新排列为 [batch_size, head_num, size_per_head, seq_length]。然后 x.contiguous().view(batch_size, -1, self.h * self.d_k) 进一步将 shape 重新排列为 [batch_size, seq_length, head_num * size_per_head] 。

对于 FP32 模式，启动 batch_size * head_num * seq_length 个 Block , 然后每个 Block 启动 size_per_head 个线程处理一个序列（一个序列对应 size_per_head 个元素）。如下：

constintseq_per_block=1;
grid.x=batch_size*head_num*seq_len/seq_per_block;
block.x=seq_per_block*size_per_head;
transpose<<>>(transpose_dst_,dst,
batch_size,seq_len,head_num,size_per_head);

而 transpose 的kernel实现也比较简单，根据blockIdx.x计算下batch_id和seq_id以及head_id（输入 x 的 shape 为 [batch_size, head_num, seq_length, size_per_head]）：

template
__global__
voidtranspose(T*src,T*dst,constintbatch_size,constintseq_len,constinthead_num,constintsize_per_head)
{
intbatch_id=blockIdx.x/(head_num*seq_len);
intseq_id=blockIdx.x%seq_len;
inthead_id=(blockIdx.x%(head_num*seq_len))/seq_len;
dst[batch_id*(head_num*seq_len*size_per_head)+seq_id*head_num*size_per_head
+head_id*size_per_head+threadIdx.x]=src[blockIdx.x*size_per_head+threadIdx.x];
}

对于 half 来说，采用和 add_QKV_bias 一样的优化方式，每个 block 处理 4 个sequence。具体来说，就是现在启动 batch_size * head_num * seq_len / 4 个 Block，每个 Block 使用 2 * size_per_head 个线程处理 4 个序列。为什么 2 * size_per_head 个线程可以处理 4 个序列（一个序列对应 size_per_head 个元素），原因是因为使用了 half2 来做数据读取。half 类型的 kernel 实现如下：

__inline____device__
inttarget_index(intid1,intid2,intid3,intid4,intdim_1,intdim_2,intdim_3,intdim_4)
{
returnid1*(dim_2*dim_3*dim_4)+id3*(dim_2*dim_4)+id2*dim_4+id4;
}

template<>
__global__
voidtranspose(__half*src,__half*dst,
constintbatch_size,constintseq_len,constinthead_num,constintsize_per_head)
{
inttid=blockIdx.x*blockDim.x+threadIdx.x;

intbatch_id=tid/(head_num*seq_len*size_per_head);
inthead_id=(tid%(head_num*seq_len*size_per_head))/(seq_len*size_per_head);
intseq_id=(tid%(seq_len*size_per_head))/size_per_head;
intid=tid%size_per_head;

inttarget_id=target_index(batch_id,head_id,seq_id,id,batch_size,head_num,seq_len,size_per_head);
half2*src_ptr=(half2*)src;
half2*dst_ptr=(half2*)dst;

dst_ptr[target_id]=src_ptr[tid];
}

trt_add_QKV_bias 和 TensorRT fused multi-head attention kernel

实际上从 Figure1 也可以看出我们上面讲到的 batch GEMM，softmax, GEMM，transpose 等操作都可以被合成一个超大的 cuda kernel，进一步进行优化，也就是这里的 TensorRT fused multi-head attention kernel。这个是将 TensorRT 的这个插件作为第三方仓库引入到 FasterTransformer 进行加速的，具体的代码我没有研究过，这里就不展开了。

现在 MultiHeadAttention 部分涉及到的优化其实就讲完了，我们接着看一下FasterTransformer 对 BERT Encoder 的其它部分的优化。我们这里贴一下 Transformer 的结构图：

在 MultiHeadAttention 的后面接了一个 Add & Norm，这里的 Add 其实就是残差，Norm 就是 LayerNorm。所以 Encoder 部分的两个 Add & Norm 可以总结为：

add_bias_input_layernorm

这里的 LayerNorm(X + MultiHeadAttention(X)) 就对应了 FasterTransformer 里面的 add_bias_input_layernorm 这个优化

实际上 layernorm 在 oneflow 也有比 Faster Transformer 更好的 kernel，

对于 softmax 和 layernorm 我还没看 FasterTransformer 的源码，后续研究了之后再分享。

总的来说就是 add_bias_input_layernorm 这个优化把残差连接和LayerNorm fuse到一起了，性能更好并且降低了kernel launch的开销。

add_bias_act

在上图的 Feed Forward 的实现中，还有一个 bias_add 和 gelu 激活函数挨着的 pattern ，所以 FasterTransformer 实现了这个 add_bias_act kernel 将两个操作融合起来，常规操作。

番外

除了上述优化技巧之外，我们可以为 BERT 模型在我们的GPU上试跑 GEMM，并且保存 GEMM 性能最高的超参数配置，这个对于 cublas 和 cutlass 实现的卷积应该都是成立的。

另外我观察到在 Faster Transformer 的cuda kernel实现中，大量应用了__ldg这个指令，查了一下资料说这个是只读缓存指令，在读地址比较分散的情况下，这个只读缓存比L1的表现要好，对一些带宽受限的kernel有性能提升。后续有空继续研究下...

总结

我这边从文档上初步看的东西也就这么多，后续可能会继续研究学习下Faster Transformer的softmax/layernorm实现，或者解读一下其它Transformer架构的优化技巧。

审核编辑：刘清

声明：本文内容及配图由入驻作者撰写或者入驻合作网站授权转载。文章观点仅代表作者本人，不代表电子发烧友网立场。文章及其配图仅供工程师学习之用，如有内容侵权或者其他违规问题，请联系本站处理。举报投诉

编码器

编码器

+关注

关注
45

文章
3640

浏览量
134451
J-BERT

J-BERT

+关注

关注
0

文章
5

浏览量
7795
GEM

GEM

+关注

关注
0

文章
8

浏览量
6684

原文标题：【BBuf的CUDA笔记】六，总结 FasterTransformer Encoder(BERT) 的cuda相关优化技巧

文章出处：【微信号：GiantPandaCV，微信公众号：GiantPandaCV】欢迎添加关注！文章转载请注明出处。

解析优化的调度逻辑和cuda实现

/how-to-optim-algorithm-in-cuda 这个仓库整理了一些cuda优化相关链接以及大模型训练推理相关的知识链接（la

发表于 08-24 11:15 •1343次阅读

Faster Transformer v2.1版本源码解读

在 FasterTransformer v1.0 中，Nvidia 提供了一个高度优化的 BERT Transformer Encoder 模块，主要应用于序列标注推理场景，笔者针对

发表于 09-19 11:39 •1385次阅读

BERT原理详解

BERT原理详解

发表于 07-02 16:45

LInux安装cuda sdk

1.安装toolkit(1)cd /home/CUDA_train/software/cuda4.1(2)./cudatoolkit_4.1.28_linux_64_rhel6.x.run

发表于 07-24 06:11

CUDA教程之Linux系统下CUDA安装教程

CUDA教程之1：Linux系统下CUDA安装教程

发表于 06-02 16:53

什么是CUDA？

在大家开始深度学习时，几乎所有的入门教程都会提到CUDA这个词。那么什么是CUDA？她和我们进行深度学习的环境部署等有什么关系？通过查阅资料，我整理了这份简洁版CUDA入门文档，希望能帮助大家用最快

发表于 07-26 06:28

什么是CUDA？

什么是CUDA？

发表于 09-28 07:37

DSP程序优化总结

发表于 10-23 14:24 •2次下载

DSP程序<b class='flag-5'>优化</b><b class='flag-5'>总结</b>

基于Hadoop+CUDA平台实现软相关器的方法

根据2ICMA相关器的算法特点，在对比基于CPU并行的MPI集群、MPI+CUDA异构并行集群和Hadoop+ CUDA异构并行集群的架构特点的基础上，提出了一种基于Hadoop+ CUDA

发表于 12-06 10:12 •0次下载

一篇BERT用于推荐系统的文章

今天给大家介绍一篇BERT用于推荐系统的文章，题目是《BERT4Rec: Sequential Recommendation with Bidirectional Encoder

发表于 11-03 17:11 •3048次阅读

CUDA简介: CUDA编程模型概述

在 CUDA 编程模型中，线程是进行计算或内存操作的最低抽象级别。从基于 NVIDIA Ampere GPU 架构的设备开始，CUDA 编程模型通过异步编程模型为内存操作提供加速。异步编程模型定义了与 CUDA 线程

发表于 04-20 17:16 •2999次阅读

NVIDIA FasterTransformer库的概述及好处

这是讨论 NVIDIA FasterTransformer 库的两部分系列的第一部分，该库是用于对任意大小（多达数万亿个参数）的 Transformer 进行分布式推理的最快库之一。它提供了 FasterTransformer 的概述，包括使用该库的好处。

发表于 08-31 09:30 •1538次阅读

CUDA矩阵乘法优化手段详解

单精度矩阵乘法（SGEMM）几乎是每一位学习 CUDA 的同学绕不开的案例，这个经典的计算密集型案例可以很好地展示 GPU 编程中常用的优化技巧。本文将详细介绍 CUDA SGEMM 的优化

发表于 09-28 09:46 •1920次阅读

STM32 Encoder编码器使用总结

目录 Encoder 原理 STM32 Encoder 计数原理模型仿真模拟Encoder 基于Encoder计算角度和速度关于启动的仿真代码生成运行演示

发表于 05-06 09:44 •3次下载

总结FasterTransformer Encoder优化技巧

FasterTransformer BERT 包含优化的 BERT 模型、高效的 FasterTransformer 和 INT8 量化推理

发表于 05-30 15:15 •1280次阅读