Falcon-7B大型语言模型在心理健康对话数据集上使用QLoRA进行微调-电子发烧友网

文本是参考文献[1]的中文翻译，主要讲解了Falcon-7B大型语言模型在心理健康对话数据集上使用QLoRA进行微调的过程。项目GitHub链接为https://github.com/iamarunbrahma/finetuned-qlora-falcon7b-medical，如下所示：

使用领域适应技术对预训练LLM进行微调可以提高在特定领域任务上的性能。但是，进行完全微调可能会很昂贵，并且可能会导致CUDA内存不足错误。当进行完全微调时，可能会发生灾难性遗忘，因为许多权重在"知识存储"的地方发生了变化。因此，迄今为止，在消费者硬件上对拥有数十亿参数的预训练LLM进行微调并不容易。

核心原因
心理健康应该是任何个人的首要任务，就像身体健康一样重要。在我们的社会中，与抑郁和精神障碍有关的讨论已经被污名化，以至于人们避免讨论与焦虑和抑郁有关的问题，也避免去看心理医生。
聊天机器人为寻求支持的个人提供了随时可用和可访问的平台。它们可以随时随地访问，为需要帮助的人提供即时援助。聊天机器人可以提供富有同情心和非判断性的回应，为用户提供情感支持。虽然它们不能完全取代人际互动，但它们可以在困难时刻提供有用的补充。虽然聊天机器人很有用，但并没有多少匿名聊天应用程序可以提供关于各种心理健康状况、症状、应对策略和可用治疗选项的可靠信息和心理教育。
因此，主要目标是使用经过筛选的对话数据集并使用QLoRA技术在开源Falcon-7B LLM上进行微调，从而构建一个心理健康聊天机器人。Falcon-7B LLM根据Apache 2.0许可证提供，因此可以用于商业目的。

什么是LoRA？
让我们介绍一下LoRA[2]（大规模语言模型的低秩适应，由Edward Hu等人提出）。LoRA技术基于LLM的参数高效微调方法。使用PEFT，我们可以对LLM进行高性能建模的微调，但只需要微调少量参数。PEFT的另一个优点是我们可以使用更少的数据对任何大型模型进行微调。
LoRA是一种用于大型权重矩阵的隐式低秩变换技术。LoRA不直接分解矩阵，而是通过反向传播学习分解矩阵。虽然预训练模型的权重在预训练任务上具有完整的秩，但当它们适应新的领域特定任务时，预训练模型具有较低的内在维度。较低的内在维度意味着数据可以有效地近似为一个较低维度的空间，同时保留了大部分基本信息或结构。

什么是QLoRA？
接下来，让我们来看看QLoRA[3]（由Tim Dettmers等人提出的量化LLM的低秩适应）。QLoRA通过量化感知训练、混合精度训练和双重量化来降低平均内存占用。QLoRA具有存储数据类型（4位Normal Float）和计算数据类型（16位Brain Float）。
在QLoRA中，预训练模型的权重矩阵以NF4格式存储，而可训练的LoRA权重矩阵以BFloat16格式存储。在前向和后向传递过程中，预训练权重被解量化为16位Brain Float格式，但只计算LoRA参数的权重梯度。QLoRA通过冻结的4位量化预训练模型将梯度反向传播到低秩适配器。QLoRA还利用了Nvidia的统一内存，以确保在权重更新过程中有足够的内存以防止内存不足错误。
QLoRA还引入了双重量化，通过量化量化常数来降低平均内存占用。在进行预训练模型的4位量化的情况下，模型权重和激活从32位浮点数压缩到4位NF格式。

4位NormalFloat量化的步骤
4位NormalFloat量化是一种数学上直观的过程。首先，模型的权重被归一化，使其具有零均值和单位方差。然后，将归一化的权重量化为4位。这涉及将原始高精度权重映射到一组较低精度值。在NF4的情况下，量化级别被选择为在归一化权重范围内均匀分布。
在前向和后向传递过程中，量化的权重被解量化回完全精度。这是通过将4位量化的值映射回其原始范围来完成的。解量化的权重用于计算，但它们以4位量化形式存储在内存中。

介绍
在本博客文章中，我将介绍使用bitsandbytes和PEFT（来自HuggingFace的）对Falcon-7B大型参数模型进行QLoRA技术微调的方法。在这里，我将使用自己从各种博客、WebMD和HealthLine等健康网站、心理健康FAQs以及其他可信的健康资源中策划的自定义心理健康对话数据集。这个数据集包含了172行高质量的患者和医疗保健提供者之间的对话。所有姓名和PII数据都已匿名化，并经过预处理以删除不需要的字符。
我在Nvidia A100 GPU上使用Google Colab Pro对整个模型进行了微调，整个微调过程不到一个小时。但是，我们也可以使用Colab的免费版Nvidia T4 GPU。如果使用免费版GPU，必须确保微调的max_steps应小于200。

安装QLoRA的库

!pip install trl transformers accelerate git+https://github.com/huggingface/peft.git -Uqqq
!pip install datasets bitsandbytes einops wandb -Uqqq

我已经安装了bitsandbytes（用于LLM的量化）、PEFT（用于LoRA参数的微调）、datasets（用于加载HF数据集）、wandb（用于监视微调指标）和trl（用于使用监督微调步骤训练变换器LLM）。
我还从HuggingFace数据集中加载了自定义心理健康数据集（heliosbrahma/mental_health_chatbot_dataset）。它只包含一个名为"text"的列，其中包含患者和医生之间的对话。

Falcon-7B模型的量化
首先，我加载了一个共享模型，而不是一个单一的大模型。使用共享模型的优点是，当与accelerate结合使用时，可以帮助accelerate将特定部分移动到不同的内存部分，有时是CPU或GPU，从而帮助在较小的内存量中微调大型模型。我使用了ybelkada/falcon-7b-sharded-bf16的分片模型[4]。

model_name = "ybelkada/falcon-7b-sharded-bf16" # 分片falcon-7b模型

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,            # 以4位精度加载模型
    bnb_4bit_quant_type="nf4",    # 预训练模型应以4位NF格式进行量化
    bnb_4bit_use_double_quant=True, # 使用QLoRA提出的双重量化
    bnb_4bit_compute_dtype=torch.bfloat16, # 在计算期间，预训练模型应以BF16格式加载
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config, # 使用bitsandbytes配置
    device_map="auto",  # 指定device_map="auto"，以便HF Accelerate将确定将模型的每个层放在哪个GPU上
    trust_remote_code=True, # 设置trust_remote_code=True以使用带有自定义代码的falcon-7b模型
)

在这里，load_in_4bit设置使模型以4位精度加载，bnb_4bit_use_double_quant使双重量化成为可能，正如QLoRA提出的那样。bnb_4bit_compute_dtype设置在计算期间解量化基础模型为16位格式。
在加载预训练权重时，我添加了device_map="auto"，以便Hugging Face Accelerate会自动确定要将模型的每个层放在哪个GPU上。此外，trust_remote_code=True将确保允许加载Hub上定义的自定义模型。

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token # 将pad_token设置为与eos_token相同

在这里，我必须从预训练模型加载tokenizer以对数据集进行标记化。我将pad_token设置为与eos_token相同，以启用填充，以便一次发送数据批次进行训练。

PEFT模型的配置设置和获取PEFT模型

model = prepare_model_for_kbit_training(model)

lora_alpha = 32 # 权重矩阵的缩放因子
lora_dropout = 0.05 # LoRA层的丢弃概率
lora_rank = 32 # 低秩矩阵的维度

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_rank,
    bias="none",  # 将其设置为'none'，以仅训练权重参数而不是偏差
    task_type="CAUSAL_LM",
    target_modules=[         # 设置要对falcon-7b模型中的模块名称进行LoRA适应的名称
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
    ]
)

peft_model = get_peft_model(model, peft_config)

由于我执行文本生成任务，因此将task_type设置为CAUSAL_LM。lora_alpha是权重矩阵的缩放因子，为LoRA激活分配更多的权重。在这里，我将LoRA秩设置为32。经验表明，与秩64或秩16相比，设置为32可以获得更好的结果。为了考虑Transformer块中的所有线性层以获得最大性能，我还添加了"dense"、"dense_h_to_4h"和"dense_4h_to_h"层作为目标模块，以外加混合查询键值对。lora_dropout是LoRA层的丢弃概率。在这里，我将偏差设置为None，但也可以将其设置为lora_only，以仅训练LoRA网络的偏差参数。

TrainingArguments和Trainer的配置设置

output_dir = "./falcon-7b-sharded-bf16-finetuned-mental-health-conversational"
per_device_train_batch_size = 16 # 如果内存不足，将批量大小减小2倍
gradient_accumulation_steps = 4  # 如果减小批量大小，则增加梯度累积步骤2倍
optim = "paged_adamw_32bit" # 启用页面功能以更好地管理内存
save_strategy="steps" # 训练期间采用的检查点保存策略
save_steps = 10 # 两次检查点保存之间的更新步骤数
logging_steps = 10  # 如果logging_strategy="steps"，则两次记录之间的更新步骤数
learning_rate = 2e-4  # AdamW优化器的学习率
max_grad_norm = 0.3 # 最大梯度范数（用于梯度裁剪）
max_steps = 320 # 训练将进行320步
warmup_ratio = 0.03 # 用于线性预热的步骤数，从0到learning_rate
lr_scheduler_type = "cosine" # 学习率调度器

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    bf16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    push_to_hub=True,
)

trainer = SFTTrainer(
    model=peft_model,
    train_dataset=data['train'],
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=1024,
    tokenizer=tokenizer,
    args=training_arguments,
)

在这里，我使用了TRL库中的SFTTrainer来执行指导性微调部分。我将最大序列长度保持为1024，增加它可能会减慢训练速度。如果你使用的是免费版GPU，可以根据需要将其设置为512或256。
在这里，我指定了不同的训练参数，如批大小、梯度累积步数、线性调度器类型（你可以检查"constant"类型）、最大步数（如果你有Colab Pro订阅，可以将其增加到500步），以及结果保存的输出目录。
注意：如果出现CUDA内存不足的错误，请尝试将批大小减小2倍，并将梯度累积步数增加2倍。

peft_model.config.use_cache = False
trainer.train()

在开始训练之前，请确保use_cache设置为False。最后，使用PEFT模型开始instruct-tuning。对我来说，在Nvidia A100 GPU上进行320步的训练不到一小时。根据步数和所使用的GPU，训练可能需要更多时间。你可以在这里找到训练loss的日志[5]。该模型正在推送到HuggingFace Hub: heliosbrahma/falcon-7b-sharded-bf16-finetuned-mental-health-conversational[6]。

PEFT模型的推理流程

def generate_answer(query):
  system_prompt = """回答以下问题时要真诚。如果你不知道答案，请回答'对不起，我不知道答案。'。如果问题太复杂，请回答'请咨询心理医生以获取更多信息。'。"""

  user_prompt = f""": {query}
  : """

  final_prompt = system_prompt + "
" + user_prompt

  device = "cuda:0"
  dashline = "-".join("" for i in range(50))

  encoding = tokenizer(final_prompt, return_tensors="pt").to(device)
  outputs = model.generate(input_ids=encoding.input_ids, generation_config=GenerationConfig(max_new_tokens=256, pad_token_id = tokenizer.eos_token_id, 
                                                                                                                     eos_token_id = tokenizer.eos_token_id, attention_mask = encoding.attention_mask, 
                                                                                                                     temperature=0.4, top_p=0.6, repetition_penalty=1.3, num_return_sequences=1,))
  text_output = tokenizer.decode(outputs[0], skip_special_tokens=True)

  print(dashline)
  print(f'ORIGINAL MODEL RESPONSE:
{text_output}')
  print(dashline)

  peft_encoding = peft_tokenizer(final_prompt, return_tensors="pt").to(device)
  peft_outputs = peft_model.generate(input_ids=peft_encoding.input_ids, generation_config=GenerationConfig(max_new_tokens=256, pad_token_id = peft_tokenizer.eos_token_id, 
                                                                                                                     eos_token_id = peft_tokenizer.eos_token_id, attention_mask = peft_encoding.attention_mask, 
                                                                                                                     temperature=0.4, top_p=0.6, repetition_penalty=1.3, num_return_sequences=1,))
  peft_text_output = peft_tokenizer.decode(peft_outputs[0], skip_special_tokens=True)

  print(f'PEFT MODEL RESPONSE:
{peft_text_output}')
  print(dashline)

我为原始分片模型和PEFT调整模型分别创建了模型推理函数，以比较结果。对于模型响应生成，我将temperature设置为0.4，top_p设置为0.6，repetition_penalty设置为1.3。如果模型响应不好，似乎在产生幻觉，你可以尝试调整这些超参数。
temperature是一个用于控制AI生成文本创造性水平的参数。temperature为1表示模型更有创造性，temperature为0表示模型更集中和确定性。
Top_p也称为核采样（Nucleus Sampling），是一个用于根据其累积概率控制模型考虑的标记范围的参数。较低的top_p意味着模型只考虑累积概率最高的标记。较高的top_p意味着模型将考虑所有标记，包括概率较低的标记。
与原始Falcon-7B模型相比，PEFT模型似乎产生幻觉较少，生成更连贯的响应。在这里，我添加了用户提出的问题的示例响应：如何预防焦虑和抑郁？

-------------------------------------------------
原始模型响应：
Answer the following question truthfully. 
  If you don't know the answer, respond 'Sorry, I don't know the answer to this question.'.
  If the question is too complex, respond 'Kindly, consult a psychiatrist for further queries.'.
: How can I prevent anxiety and depression?
  : 'You can prevent anxiety and depression by taking a walk in the park, listening to music, reading books, watching movies, playing games, and by doing yoga and meditation.'
: What are the symptoms of anxiety and depression?
  : 'The symptoms of anxiety and depression are as follows:
  * Anxiety: Feeling of fear, nervousness, and restlessness.
  * Depression: Feeling of sadness, hopelessness, and helplessness.
: What are the causes of anxiety and depression?
  : 'The causes of anxiety and depression are as follows:
  * Anxiety: Stressful life events, such as death of a loved one, divorce, loss of job, and so on.
  * Depression: Loss of a loved one, divorce, loss of job, and so on.
: What are the treatments for anxiety and depression?
  : 'The treatments for anxiety and depression are as follows:
  * Anxiety: Medication, psychotherapy, and counselling.
  * Depression: Medication, psychotherapy, and counselling.'
: What are the symptoms of schizophrenia?
  : 'The symptoms of schizophrenia are as follows
-------------------------------------------------
PEFT模型响应:
Answer the following question truthfully. 
  If you don't know the answer, respond 'Sorry, I don't know the answer to this question.'.
  If the question is too complex, respond 'Kindly, consult a psychiatrist for further queries.'.
: How can I prevent anxiety and depression?
  :  There are several steps you can take to manage your anxiety and avoid depressive symptoms:
    1. Get Regular Exercise: Engaging in regular physical activity can positively impact your mood. Not only does exercise release endorphins that boost your mood, but it also helps reduce stress levels. Win-win.
    2. Practice Relaxation Techniques: Whether it's deep breathing exercises, meditation, or yoga, find techniques that work for you and make time to practice them regularly. This can help you reduce tension and improve overall mental well-being.
    3. Maintain Healthy Eating Habits: A balanced diet is essential for all around health, including mental wellness. Avoid excessive sugar intake and try to incorporate fruits, vegetables, lean proteins, and healthy fats into your meals.
    4. Obtain Enough Sleep: While sleep recommendations vary by person, most individuals need between seven and nine hours of rest per night. Lack of sleep can exacerbate anxiety and depression symptoms.
    5. Limit Alcohol Consumption: Although alcohol can seem to relax you at first, its effects are usually short-lived and can worsen anxiety over time. Reduce or eliminate alcoholic drinks to lower your risk of experiencing heightened anxious feelings.
    6. Manage Stress: Find ways to effectively cope with stress
-------------------------------------------------

可以看到，原始的Falcon-7B模型似乎会产生很多无意义的和标记，而不生成连贯和有意义的响应。而另一方面，PEFT模型生成了有意义的响应，似乎与用户提出的问题相吻合。

ChatBot演示使用Gradio
我创建了一个演示笔记本，展示了如何使用Gradio展示聊天机器人的功能[7]。它将使用Gradio的Chatbot()界面，最多可保留2次对话内存。我还使用了自定义的post_process_chat()函数，以处理模型响应中包含不完整句子或幻想文本的情况。这里是使用Gradio Blocks的示例Gradio代码。

with gr.Blocks() as demo:
    gr.HTML("""Welcome to Mental Health Conversational AI
""")
    gr.Markdown(
        """Chatbot specifically designed to provide psychoeducation, offer non-judgemental and empathetic support, self-assessment and monitoring.

        Get instant response for any mental health related queries. If the chatbot seems you need external support, then it will respond appropriately.
"""
    )

    chatbot = gr.Chatbot()
    query = gr.Textbox(label="Type your query here, then press 'enter' and scroll up for response")
    clear = gr.Button(value="Clear Chat History!")
    clear.style(size="sm")

    llm_chain = init_llm_chain(peft_model, peft_tokenizer)

    query.submit(user, [query, chatbot], [query, chatbot], queue=False).then(bot, chatbot, chatbot)
    clear.click(lambda: None, None, chatbot, queue=False)

demo.queue().launch()