The Process of Customizing LLMs with NVIDIA NeMo

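Over the past few years, generative AI has captured the public's attention and imagination. Given a natural language prompt, these generative models can produce human-quality results, from well-articulated children's stories to product prototype visualizations.

Large language models (LLMs) are at the center of this revolution. An LLM is a general-purpose language understander that codifies human knowledge and can be readily applied, out of the box, to many natural language and programming language understanding tasks. These include summarization, translation, question answering, and code annotation and completion.

The ability of a single foundation language model to complete many tasks opens up a whole new AI software paradigm, in which one foundation model can serve multiple downstream language tasks across every department of a company. This simplifies AI software development, deployment, and maintenance, and reduces their cost.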

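Introduction to creating a custom large language model

Although LLMs are powerful and promising, there is still a gap between their off-the-shelf performance, obtained through zero-shot or few-shot learning, and what specific use cases require. In particular, zero-shot performance tends to be low and unreliable. Few-shot learning, on the other hand, relies on finding an optimal discrete prompt, which is a nontrivial process.

As described in GPT Understands, Too, minor variations in the prompt template used to solve a downstream problem can have a significant impact on final accuracy. In addition, few-shot inference costs more because the prompts are larger.

Parameter-efficient fine-tuning techniques have been proposed to address this problem. Prompt learning is one such technique: it appends virtual prompt tokens to a request. These virtual tokens are learnable parameters that can be optimized with standard optimization methods, while the LLM parameters stay frozen.

This post walks through the process of customizing LLMs with NVIDIA NeMo, a universal framework for training, customizing, and deploying foundation models.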

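What is NVIDIA NeMo?

NVIDIA NeMo is a universal framework for training, customizing, and deploying large-scale foundation models. NeMo uses various parallelism techniques to accelerate training and inference, and can be deployed on multi-node, multi-GPU systems in the user's preferred cloud, on premises, and at the edge. To learn more, see NVIDIA AI Platform Delivers Big Gains for Large Language Models and Accelerated Inference for Large Transformer Models Using NVIDIA Triton Inference Server.

The NeMo ecosystem consists of the following main components:

NVIDIA NeMo service: Provides a fast path to productizing LLMs through an NVIDIA-managed service. Developers can use LLM capabilities to build enterprise AI applications quickly and easily, without worrying about the underlying infrastructure. You can also try Megatron 530B, one of the largest language models, through the cloud API or the web playground interface. The service is currently in early access.

NVIDIA NeMo framework: An end-to-end, containerized framework that lets developers efficiently train and deploy language models with billions to trillions of parameters, delivering high training efficiency across thousands of GPUs. The NeMo framework container is currently in open beta and available through NGC.

NVIDIA/NeMo: An open-source conversational AI toolkit built for researchers working on speech AI and NLP, including LLMs. Available through GitHub.

NeMo models: NVIDIA recently open-sourced pretrained NeMo framework models, ranging from small models such as 1.3B GPT-3, 5B GPT-3, and the 3B mT5 model to large models such as 20B GPT-3.

NVIDIA/FasterTransformer: An open-source toolkit for high-performance LLM inference, available through GitHub. To learn more about deploying the public NeMo framework models with FasterTransformer, see Deploying a 1.3B GPT-3 Model with NVIDIA NeMo Megatron.

This post explains how to customize the public NeMo models with prompt learning techniques, using the NeMo framework container.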

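Prompt learning with NeMo

Prompt learning refers collectively to two parameter-efficient fine-tuning techniques, described below. For more information, see Adapting P-Tuning to Solve Non-English Downstream Tasks.

In prompt tuning, soft prompt embeddings are initialized as a 2D matrix. Each task has its own 2D embedding matrix, and tasks do not share any parameters during training or inference. All LLM parameters are frozen, and only the embedding parameters of each task are updated during training. The NeMo prompt tuning implementation is based on The Power of Scale for Parameter-Efficient Prompt Tuning.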
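To make the "one 2D matrix per task" description concrete, the following minimal sketch counts the trainable parameters that prompt tuning would add. The function name and the hidden size of 1024 are illustrative assumptions, not values taken from NeMo or from this post.

```python
def prompt_tuning_param_count(total_virtual_tokens: int, hidden_size: int, num_tasks: int = 1) -> int:
    """Trainable parameters when each task owns its own (tokens x hidden) soft-prompt matrix."""
    per_task = total_virtual_tokens * hidden_size  # one 2D embedding matrix
    return per_task * num_tasks                    # tasks share no parameters


# hidden_size=1024 is an assumed value chosen only for illustration.
print(prompt_tuning_param_count(total_virtual_tokens=10, hidden_size=1024, num_tasks=3))
```

Because the LLM weights stay frozen, this count is the entire set of parameters the optimizer updates, which is why the technique is called parameter-efficient.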

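In p-tuning, an LSTM model, or "prompt encoder", is used to predict the virtual token embeddings. The LSTM parameters are randomly initialized at the start of p-tuning. All LLM parameters are frozen, and only the LSTM weights are updated at each training step. The LSTM parameters are shared across all tasks that are p-tuned at the same time, but the LSTM model outputs unique virtual token embeddings for each task. The NeMo p-tuning implementation is based on GPT Understands, Too.

Prompt learning in this example uses two open-source components of the NeMo ecosystem: the NeMo OSS toolkit and the public NeMo models.

The NeMo Multitask Prompt and P-Tuning tutorial on GitHub details the prompt learning process on a small GPT-3 model with 345M parameters. The tutorial demonstrates prompt learning end to end: downloading and preprocessing the data, downloading the model, training the prompt learning model, and running inference on three different applications.

The sections below first walk through the notebook while summarizing the main concepts. The notebook is then extended to prompt learning on larger NeMo models.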

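Prerequisites

You can try NeMo through the NeMo Docker container, which provides a self-contained and reproducible environment for experimenting with NeMo. The NeMo Multitask Prompt and P-Tuning tutorial was tested with the NeMo 22.09 container, but you can try later releases of the same container. Download and run the container with the following script: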

docker run  -u $(id -u ${USER}):$(id -g ${USER}) --rm -it --net=host nvcr.io/nvidia/nemo:22.09 bash

Then launch Jupyter lab from within the container's interactive bash environment:

cd /workspace
jupyter lab --ip 0.0.0.0 --allow-root --port=8888

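In Jupyter lab, you can find the NeMo examples, including the notebook mentioned above, at /workspace/nemo/tutorials/nlp/Multitask_Prompt_and_PTuning.ipynb.

In addition, you need one GPU to work with the smaller 5B and 1.3B GPT-3 models, and four NVIDIA Ampere architecture or NVIDIA Hopper architecture GPUs to work with the 20B model, because it has a tensor parallelism (TP) degree of four.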

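Data preparation

The notebook walks you through data collection and preprocessing for three different applications: the Financial PhraseBank dataset for a sentiment analysis task, the SQuAD dataset for a question answering task, and the Assistant Benchmarking dataset for an intent and slot classification task.

The dataset should be in .jsonl format, containing a collection of JSON objects. Each JSON object must include the field taskname, a string identifier for the task that the data example belongs to. Each JSON object should also include one or more fields corresponding to the different sections of the discrete text prompt. See Figure 1 for an example.

Figure 1. Dataset format for NeMo prompt learning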
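As a rough illustration of the format described above, the sketch below serializes examples to .jsonl text (one JSON object per line) and rejects any record without a string taskname. The helper name and the sample records are hypothetical; only the taskname requirement comes from the text.

```python
import json


def to_jsonl(examples):
    """Serialize examples to .jsonl text: one JSON object per line."""
    lines = []
    for i, example in enumerate(examples):
        # Every record must carry a string task identifier.
        if not isinstance(example.get("taskname"), str):
            raise ValueError(f"example {i} is missing a string 'taskname' field")
        lines.append(json.dumps(example, ensure_ascii=False))
    return "\n".join(lines) + "\n"


# Hypothetical records; field names other than 'taskname' are illustrative.
examples = [
    {"taskname": "sentiment", "sentence": "Profit rose sharply.", "label": "positive"},
    {"taskname": "intent_and_slot", "utterance": "wake me at 7", "label": "alarm_set"},
]
print(to_jsonl(examples), end="")
```

Writing the returned string to a file with a .jsonl extension gives a dataset in the shape this section describes.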

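Prompt template

When forming prompts, you should settle on a pattern and adhere to it. This pattern is called a prompt template, and it varies with the use case. An example for sentiment analysis is shown below.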

{
    "taskname": "sentiment",
    "prompt_template": "<|VIRTUAL_PROMPT_0|> {sentence} sentiment:{label}",
    "total_virtual_tokens": 10,
    "virtual_token_splits": [10],
    "truncate_field": None,
    "answer_only_loss": True,
    "answer_field": "label",
}

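The prompt contains all 10 virtual tokens at the beginning, followed by the target sentence to be classified. Next comes a text marker ("sentiment:") and finally the label of the sentence, which is used for training. The corresponding fields in the training data JSON objects are mapped into this prompt template to form complete training examples. NeMo supports truncating a specific field to meet the model's token length limit (typically 2,048 tokens for the public NeMo models that use the HuggingFace GPT-2 tokenizer).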
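The mapping from a JSON record to a full training prompt can be sketched as follows. This is not NeMo's implementation: render_prompt is a hypothetical helper, and the visible "<v>" placeholder only stands in for virtual tokens, which in reality are learned embeddings rather than text.

```python
import re


def render_prompt(template: str, example: dict, virtual_token_splits: list) -> str:
    """Fill {field} slots from the example and expand virtual-prompt markers."""

    def expand_marker(match):
        index = int(match.group(1))
        # One visible placeholder per virtual token assigned to this marker.
        return "<v>" * virtual_token_splits[index]

    text = re.sub(r"<\|VIRTUAL_PROMPT_(\d+)\|>", expand_marker, template)
    # Extra keys in the example (such as 'taskname') are simply ignored.
    return text.format(**example)


template = "<|VIRTUAL_PROMPT_0|> {sentence} sentiment:{label}"
example = {"taskname": "sentiment", "sentence": "Profit rose sharply.", "label": "positive"}
print(render_prompt(template, example, [10]))
```

The rendered string shows the order described above: ten virtual tokens, the sentence, the "sentiment:" marker, and the label.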

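Training

The default NeMo prompt tuning configuration is provided in a yaml file, available through NVIDIA/NeMo on GitHub. The notebook loads this yaml file, then overrides the training options to suit the 345M GPT model. NeMo p-tuning enables multiple tasks to be learned simultaneously. NeMo uses the PyTorch Lightning interface, so training can be done simply by calling trainer.fit(model).

Inference

Finally, once trained, the model can be used for inference on new samples (omitting the answer_field) by calling model.generate(inputs=test_examples).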
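Omitting the answer field from test examples can be sketched with a small helper. The function name and the sample record are hypothetical; the only detail taken from the text is that inference inputs leave out the field named by answer_field.

```python
def to_inference_inputs(examples, answer_field="label"):
    """Return copies of the examples with the answer field dropped."""
    return [
        {key: value for key, value in example.items() if key != answer_field}
        for example in examples
    ]


# Hypothetical labeled record turned into an inference input.
labeled = [{"taskname": "sentiment", "sentence": "Profit rose sharply.", "label": "positive"}]
test_examples = to_inference_inputs(labeled)
print(test_examples)
```

The resulting list has the same shape as the test_examples argument in the call shown above, with the label left for the model to generate.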

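Prompt learning for large models

The prompt learning process demonstrated in the notebook for the 345M GPT-3 model can be applied to larger public NeMo GPT-3 models, up to the 1.3B GPT-3 and 5B GPT-3. Models of this size need only a single GPU with sufficient memory capacity, such as the NVIDIA V100, NVIDIA A100, or NVIDIA H100. After downloading the model, substitute the model name, in particular in the following cell: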

# Download the model from NGC
gpt_file_name = "megatron_gpt_345m.nemo"
!wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/nemo/megatron_gpt_345m/versions/1/files/megatron_gpt_345m.nemo -

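Instead of downloading the 345M GPT model from NGC, follow the instructions on HuggingFace to download the 1.3B GPT-3 or 5B GPT-3 model, then point the gpt_file_name variable to the .nemo model file.

Note that the 5B model comes in two variants, one with a TP degree of 1 (nemo_gpt5B_fp16_tp1.nemo) and one with TP = 2 (nemo_gpt5B_fp16_tp2.nemo, nemo_gpt5B_bf16_tp2.nemo). The notebook can only support the TP = 1 variant. With everything else unchanged, you can run the same notebook end to end.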

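Multi-GPU prompt learning

Because of the limitations of the Jupyter notebook environment, the prompt learning notebook supports only single-GPU training. Using multi-GPU training for larger models with a higher TP degree (such as 4 for the 20B GPT-3, and 2 for the other 5B GPT-3 variants) requires a different NeMo prompt learning script. This script is backed by a config file, where you can find the default values of many parameters.

Model

This section demonstrates prompt learning for a large model using multiple GPUs, on the assistant dataset that was downloaded and preprocessed as part of the prompt learning notebook.

You can download the 5B GPT model with TP = 2 (nemo_gpt5B_fp16_tp2.nemo) or the 20B GPT-3 model with TP = 4. Note that these models are stored in the .nemo compressed archive format. To speed up model loading considerably, unpack the model ahead of time and use the unpacked folder in the NeMo config. Use the following script: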

# tar -C expects the target directory to exist already
mkdir -p nemo_gpt5B_fp16_tp2.nemo.extracted
tar -xvf nemo_gpt5B_fp16_tp2.nemo -C nemo_gpt5B_fp16_tp2.nemo.extracted

Then use the extracted directory nemo_gpt5B_fp16_tp2.nemo.extracted in the NeMo config.
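If you prefer to unpack the archive from Python (for example inside a notebook cell), a minimal sketch using the standard tarfile module could look like the following. It assumes the .nemo file is a tar archive, as the tar command above implies; extract_nemo is a hypothetical helper, not part of NeMo.

```python
import tarfile
from pathlib import Path


def extract_nemo(archive_path: str, dest_dir: str) -> Path:
    """Unpack a .nemo archive into dest_dir, creating the directory if needed."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect compression (or its absence) automatically.
    with tarfile.open(archive_path, "r:*") as archive:
        archive.extractall(dest)
    return dest


# Example call (paths follow the file names used in this post):
# extract_nemo("nemo_gpt5B_fp16_tp2.nemo", "nemo_gpt5B_fp16_tp2.nemo.extracted")
```

Only unpack archives from sources you trust, since extracting a tar file writes whatever paths the archive contains.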

Configuration

A config file suitable for the assistant dataset (an intent and slot detection application) is shown below:

name: megatron_virtual_prompt_gpt

trainer:
  devices: 2
  accelerator: gpu
  num_nodes: 1
  precision: 16
  logger: False # logger provided by exp_manager
  enable_checkpointing: False
  replace_sampler_ddp: False
  max_epochs: 25 # min 25 recommended
  max_steps: -1 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
  log_every_n_steps: 10 # frequency with which training steps are logged 
  val_check_interval: 1.0 # If is an int n > 1, will run val every n training steps, if a float 0.0 - 1.0 will run val every epoch fraction, e.g. 0.25 will run val every quarter epoch
  gradient_clip_val: 1.0
  resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
  benchmark: False


exp_manager:
  explicit_log_dir: null
  exp_dir: null
  name: ${name}
  create_wandb_logger: False
  wandb_logger_kwargs:
    project: null
    name: null
  resume_if_exists: True
  resume_ignore_no_checkpoint: True
  create_checkpoint_callback: True
  checkpoint_callback_params:
    monitor: val_loss
    save_top_k: 2
    mode: min
    save_nemo_on_train_end: False # Should be false, correct prompt learning model file is saved at model.nemo_path set below, 
    filename: 'megatron_gpt_prompt_tune--{val_loss:.3f}-{step}'
    model_parallel_size: ${model.tensor_model_parallel_size}
    save_best_model: True

model:
  seed: 1234
  nemo_path: ${name}.nemo # .nemo filename/absolute path to where the virtual prompt model parameters will be saved
  virtual_prompt_style: 'p-tuning' # one of 'prompt-tuning', 'p-tuning', or 'inference'
  tensor_model_parallel_size: 1 # intra-layer model parallelism
  pipeline_model_parallel_size: 1 # inter-layer model parallelism
  global_batch_size: 8
  micro_batch_size: 4

  restore_path: null # Path to an existing p-tuned/prompt tuned .nemo model you wish to add new tasks to or run inference with
  language_model_path: ??? # Path to the GPT language model .nemo file, always required
  save_nemo_on_validation_end: True # Saves an inference ready .nemo file every time a checkpoint is saved during training. 
  existing_tasks: [] # List of tasks the model has already been p-tuned/prompt-tuned for, needed when a restore path is given
  new_tasks: ['intent_and_slot'] # List of new tasknames to be prompt-tuned
  


  ## Sequence Parallelism
  # Makes tensor parallelism more memory efficient for LLMs (20B+) by parallelizing layer norms and dropout sequentially
  # See Reducing Activation Recomputation in Large Transformer Models: https://arxiv.org/abs/2205.05198 for more details.
  sequence_parallel: False

  ## Activation Checkpoint 
  activations_checkpoint_granularity: null # 'selective' or 'full' 
  activations_checkpoint_method: null # 'uniform', 'block', not used with 'selective'
  # 'uniform' divides the total number of transformer layers and checkpoints the input activation
  # of each chunk at the specified granularity
  # 'block' checkpoints the specified number of layers per pipeline stage at the specified granularity
  activations_checkpoint_num_layers: null # not used with 'selective'

  task_templates: # Add more/replace tasks as needed, these are just examples
  - taskname: "intent_and_slot"    
    prompt_template: "<|VIRTUAL_PROMPT_0|>Predict intent and slot: {utterance} \nLabel:{label}"
    total_virtual_tokens: 10
    virtual_token_splits: [10]
    truncate_field: null
    answer_only_loss: False
    answer_field: "label"


  prompt_tuning: # Prompt tuning specific params
    new_prompt_init_methods: ['text'] # List of 'text' or 'random', should correspond to tasks listed in new tasks
    new_prompt_init_text: ['some init text goes here'] # some init text if init method is text, or None if init method is random

  p_tuning: # P-tuning specific params
    encoder_type: "tpmlp" # ['tpmlp', 'lstm', 'biglstm', 'mlp'] 
    dropout: 0.0
    num_layers: 2  # number of layers for MLP or LSTM layers. Note, it has no effect for tpmlp currently as it always assumes it is two layers.
    encoder_hidden: 2048 # encoder hidden for biglstm and tpmlp
    init_std: 0.023  # init std for tpmlp layers

  data:
    train_ds: ???
    validation_ds: ???
    add_eos: True
    shuffle: True
    num_workers: 8
    pin_memory: True
    train_cache_data_path: null  # the path to the train cache data 
    validation_cache_data_path: null  # the path to the validation cache data 
    test_cache_data_path: null  # the path to the test cache data 
    load_cache: False  # whether to load from the cache data


  optim:
    name: fused_adam
    lr: 1e-4
    weight_decay: 0.01 
    betas: 
    - 0.9
    - 0.98
    sched:
      name: CosineAnnealing
      warmup_steps: 50
      min_lr: 0.0 # min_lr must be 0.0 for prompt learning when pipeline parallel > 1
      constant_steps: 0 # Constant steps should also be 0 when min_lr=0
      monitor: val_loss
      reduce_on_plateau: false

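Thanks to the yaml text format and the comments, most of the hyperparameters are self-explanatory. Using the Jupyter lab interface, create a file with this content and save it as /workspace/nemo/examples/nlp/language_modeling/conf/megatron_gpt_prompt_learning_intent_n_slot.yaml.

The most important part of the config file is the prompt template, shown below: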

    prompt_template: "<|VIRTUAL_PROMPT_0|>Predict intent and slot: {utterance} \nLabel:{label}"
    total_virtual_tokens: 10
    virtual_token_splits: [10]
    truncate_field: null

Here, 10 virtual prompt tokens are used together with some permanent text markers.
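A quick consistency check on these three settings can catch typos before a long training run. The sketch below encodes two rules as I understand them from the NeMo prompt learning documentation, and they should be treated as assumptions here: there is one entry in virtual_token_splits per <|VIRTUAL_PROMPT_n|> marker, and the entries sum to total_virtual_tokens. The function name is hypothetical.

```python
import re


def check_virtual_tokens(prompt_template, total_virtual_tokens, virtual_token_splits):
    """Validate that the template markers and the token splits agree."""
    markers = re.findall(r"<\|VIRTUAL_PROMPT_\d+\|>", prompt_template)
    # Assumed rule: one split entry per virtual-prompt marker.
    if len(markers) != len(virtual_token_splits):
        raise ValueError(
            f"{len(markers)} marker(s) in template but {len(virtual_token_splits)} split(s)"
        )
    # Assumed rule: the splits add up to the total number of virtual tokens.
    if sum(virtual_token_splits) != total_virtual_tokens:
        raise ValueError(
            f"splits sum to {sum(virtual_token_splits)}, expected {total_virtual_tokens}"
        )
    return True


template = "<|VIRTUAL_PROMPT_0|>Predict intent and slot: {utterance} \nLabel:{label}"
print(check_virtual_tokens(template, total_virtual_tokens=10, virtual_token_splits=[10]))
```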

Training

To start training, open a terminal window in the Jupyter lab interface (File → New → Terminal). Then issue the bash command:

python /workspace/nemo/examples/nlp/language_modeling/megatron_gpt_prompt_learning.py \
    --config-name=megatron_gpt_prompt_learning_intent_n_slot.yaml \
    trainer.devices=2 \
    trainer.num_nodes=1 \
    trainer.max_epochs=25 \
    trainer.precision=bf16 \
    model.language_model_path=/workspace/nemo/tutorials/nlp/nemo-megatron-gpt-5B/nemo_gpt5B_fp16_tp2.nemo.extracted \
    model.nemo_path=/workspace/nemo/examples/nlp/language_modeling/intent_n_slot.nemo \
    model.tensor_model_parallel_size=2 \
    model.pipeline_model_parallel_size=1 \
    model.global_batch_size=16 \
    model.micro_batch_size=1 \
    model.optim.lr=1e-4 \
    model.data.train_ds=[/workspace/nemo/tutorials/nlp/data/assistant/assistant_train.jsonl] \
    model.data.validation_ds=[/workspace/nemo/tutorials/nlp/data/assistant/assistant_val.jsonl]

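Note the following:

model.tensor_model_parallel_size should be set to 2 for the 5B GPT model (nemo_gpt5B_fp16_tp2.nemo) and to 4 for the 20B GPT-3 model

trainer.devices should be set to a multiple of the TP value. If it is set to 4 for the 5B model, there will be two data-parallel workers, each with two GPUs

model.language_model_path should be set to the absolute path of the model's extracted directory

model.data.train_ds and model.data.validation_ds should be set to the locations of the training and validation data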
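The relationship between trainer.devices, the TP value, and the batch sizes can be worked through with a little arithmetic. The sketch below assumes the usual Megatron-style layout, in which the data-parallel size is the total GPU count divided by TP × PP, and the global batch is micro batch × data-parallel size × gradient accumulation steps (consistent with the consumed_samples comment in the config). The function name is hypothetical.

```python
def parallel_layout(devices, num_nodes, tp, pp, global_batch_size, micro_batch_size):
    """Derive data-parallel size and gradient accumulation steps from the settings."""
    world_size = devices * num_nodes
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError("total GPUs must be a multiple of TP x PP")
    data_parallel_size = world_size // model_parallel
    samples_per_step = micro_batch_size * data_parallel_size
    if global_batch_size % samples_per_step != 0:
        raise ValueError("global batch must be divisible by micro batch x data-parallel size")
    return {
        "data_parallel_size": data_parallel_size,
        "grad_accumulation_steps": global_batch_size // samples_per_step,
    }


# Settings from the training command above: 2 GPUs, TP = 2.
print(parallel_layout(2, 1, tp=2, pp=1, global_batch_size=16, micro_batch_size=1))
# Same model on 4 GPUs: two data-parallel workers of two GPUs each.
print(parallel_layout(4, 1, tp=2, pp=1, global_batch_size=16, micro_batch_size=1))
```

With two GPUs, a single model replica spans both devices, so the global batch of 16 is reached purely through gradient accumulation; with four GPUs the accumulation is halved.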

Inference

Finally, once training is complete, use the following script to run inference in NeMo:

python /workspace/nemo/examples/nlp/language_modeling/megatron_gpt_prompt_learning_eval.py \
    virtual_prompt_model_file=/workspace/nemo/examples/nlp/language_modeling/intent_n_slot.nemo \
    gpt_model_file=/workspace/nemo/tutorials/nlp/nemo-megatron-gpt-5B/nemo_gpt5B_fp16_tp2.nemo.extracted \
    inference.greedy=True \
    inference.add_BOS=False \
    inference.tokens_to_generate=128 \
    trainer.devices=2 \
    trainer.num_nodes=1 \
    tensor_model_parallel_size=2 \
    pipeline_model_parallel_size=1 \
    data_paths=["/workspace/nemo/tutorials/nlp/data/assistant/assistant_test.jsonl"] \
    pred_file_path="test-results.txt"

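Note the following:

tensor_model_parallel_size should be set to 2 for the 5B GPT model (nemo_gpt5B_fp16_tp2.nemo) and to 4 for the 20B GPT-3 model

trainer.devices should be set equal to the TP value (as above)

pred_file_path is the file where the test results are recorded, one line per test sample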
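Once the prediction file exists, a simple exact-match score can be computed against the test set. The sketch below rests on assumptions that this post does not confirm: that each prediction line contains the "Label:" marker from the prompt template followed by the generated answer, and that line order matches the test .jsonl. The function name is hypothetical, so check the actual file contents before relying on it.

```python
import json


def exact_match_accuracy(prediction_lines, test_jsonl_lines, marker="Label:", answer_field="label"):
    """Fraction of predictions whose text after the last marker equals the gold label."""
    predictions = [line.rstrip("\n") for line in prediction_lines if line.strip()]
    golds = [json.loads(line)[answer_field] for line in test_jsonl_lines if line.strip()]
    if len(predictions) != len(golds):
        raise ValueError("prediction and test files have different numbers of samples")
    hits = 0
    for prediction, gold in zip(predictions, golds):
        # Assumed layout: the generated answer follows the last "Label:" marker.
        answer = prediction.rsplit(marker, 1)[-1].strip()
        hits += answer == gold.strip()
    return hits / len(golds) if golds else 0.0


# Usage with the files from this post (run after inference has finished):
# with open("test-results.txt") as preds, open("assistant_test.jsonl") as tests:
#     print(exact_match_accuracy(preds, tests))
```

Exact match is a strict metric for combined intent and slot labels; a per-slot score would be more forgiving, but it depends on the label format used in the dataset.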
