Aligning an LLM with Human Preferences
To better align the responses that instruction-tuned LLMs generate with what humans prefer, we can train them against a reward model or a dataset of human preferences, a process known as RLHF (Reinforcement Learning from Human Feedback).
DataDreamer makes this process simple and straightforward. Below, we demonstrate it with DPO (Direct Preference Optimization), a more stable and efficient alignment method than traditional RLHF, using LoRA so that only a small fraction of the model's weights are trained.
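For intuition, DPO skips training an explicit reward model and instead optimizes the policy directly on preference pairs. A sketch of its objective, where $\pi_\theta$ is the model being aligned, $\pi_{\text{ref}}$ is a frozen reference copy of it, $y_w$ and $y_l$ are the chosen and rejected responses for a prompt $x$, $\sigma$ is the sigmoid function, and $\beta$ controls how far the model may drift from the reference:

$$\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$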
from datadreamer import DataDreamer
from datadreamer.steps import HFHubDataSource
from datadreamer.trainers import TrainHFDPO
from peft import LoraConfig

with DataDreamer("./output"):
    # Get the DPO dataset
    dpo_dataset = HFHubDataSource(
        "Get DPO Dataset", "Intel/orca_dpo_pairs", split="train"
    )

    # Keep only 1000 examples as a quick demo
    dpo_dataset = dpo_dataset.take(1000)

    # Create training data splits
    splits = dpo_dataset.splits(train_size=0.90, validation_size=0.10)

    # Align the TinyLlama chat model with human preferences
    trainer = TrainHFDPO(
        "Align TinyLlama-Chat",
        model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        peft_config=LoraConfig(),
        device=["cuda:0", "cuda:1"],
        dtype="bfloat16",
    )

    # Train on the preference pairs (prompts with chosen vs. rejected responses)
    trainer.train(
        train_prompts=splits["train"].output["question"],
        train_chosen=splits["train"].output["chosen"],
        train_rejected=splits["train"].output["rejected"],
        validation_prompts=splits["validation"].output["question"],
        validation_chosen=splits["validation"].output["chosen"],
        validation_rejected=splits["validation"].output["rejected"],
        epochs=3,
        batch_size=1,
        gradient_accumulation_steps=32,
    )
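
Once training completes, DataDreamer saves the trained LoRA adapter inside the ./output folder. Below is a minimal sketch of how you might try out the aligned model with Hugging Face Transformers and PEFT; the adapter_path shown is an assumption for illustration, so check the trainer's subfolder under ./output for the actual saved location.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base chat model and its tokenizer
base = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Hypothetical adapter path for illustration; check the trainer's folder under ./output
adapter_path = "./output/align-tinyllama-chat/_model"
model = PeftModel.from_pretrained(base, adapter_path)

# Chat with the aligned model
messages = [{"role": "user", "content": "Explain LoRA in one sentence."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))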