DataDreamer
Prompt · Generate Synthetic Data · Train & Align Models
DataDreamer is a powerful open-source Python library for prompting, synthetic data generation, and training workflows. It is designed to be simple, extremely efficient, and research-grade.
Installation:
pip3 install datadreamer.dev
demo.py:
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import DataFromPrompt, ProcessWithPrompt
from datadreamer.trainers import TrainHFFineTune
from peft import LoraConfig
with DataDreamer("./output"):
    # Load GPT-4
    gpt_4 = OpenAI(model_name="gpt-4")

    # Generate synthetic arXiv-style research paper abstracts with GPT-4
    arxiv_dataset = DataFromPrompt(
        "Generate Research Paper Abstracts",
        args={
            "llm": gpt_4,
            "n": 1000,
            "temperature": 1.2,
            "instruction": (
                "Generate an arXiv abstract of an NLP research paper."
                " Return just the abstract, no titles."
            ),
        },
        outputs={"generations": "abstracts"},
    )

    # Ask GPT-4 to convert the abstracts to tweets
    abstracts_and_tweets = ProcessWithPrompt(
        "Generate Tweets from Abstracts",
        inputs={"inputs": arxiv_dataset.output["abstracts"]},
        args={
            "llm": gpt_4,
            "instruction": (
                "Given the abstract, write a tweet to summarize the work."
            ),
            "top_p": 1.0,
        },
        outputs={"inputs": "abstracts", "generations": "tweets"},
    )

    # Create training data splits
    splits = abstracts_and_tweets.splits(train_size=0.90, validation_size=0.10)

    # Train a model to convert research paper abstracts to tweets
    # with the synthetic dataset
    trainer = TrainHFFineTune(
        "Train an Abstract => Tweet Model",
        model_name="google/t5-v1_1-base",
        peft_config=LoraConfig(),
    )
    trainer.train(
        train_input=splits["train"].output["abstracts"],
        train_output=splits["train"].output["tweets"],
        validation_input=splits["validation"].output["abstracts"],
        validation_output=splits["validation"].output["tweets"],
        epochs=30,
        batch_size=8,
    )

    # Publish and share the synthetic dataset
    abstracts_and_tweets.publish_to_hf_hub(
        "datadreamer-dev/abstracts_and_tweets",
        train_size=0.90,
        validation_size=0.10,
    )

    # Publish and share the trained model
    trainer.publish_to_hf_hub("datadreamer-dev/abstracts_to_tweet_model")
🎉 For more demonstrations and recipes, see the Quick Tour page.
With DataDreamer you can:
💬 Create Prompting Workflows: Create and run multi-step, complex prompting workflows easily with major open-source or API-based LLMs (a minimal open-source-LLM sketch follows this list).
📊 Generate Synthetic Datasets: Generate synthetic datasets for novel tasks or augment existing datasets with LLMs.
⚒️ Train Models: Align models. Fine-tune models. Instruction-tune models. Distill models. Train on existing data or synthetic data.
… learn more about what's possible in the Overview Guide.
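For instance, the demo above can be adapted to prompt a local open-source model rather than an API. The sketch below is a minimal illustration, not part of the official demo: it assumes the HFTransformers LLM class and the DataSource step, and the model name, toy documents, and column names are placeholder assumptions.

from datadreamer import DataDreamer
from datadreamer.llms import HFTransformers
from datadreamer.steps import DataSource, ProcessWithPrompt

with DataDreamer("./output"):
    # Load an open-source instruction-tuned model from Hugging Face
    # (the model name here is an illustrative assumption)
    llm = HFTransformers("mistralai/Mistral-7B-Instruct-v0.1")

    # Wrap a few in-memory documents as a step output
    # (DataSource and these toy documents are assumptions for this sketch)
    docs = DataSource(
        "Get Documents",
        data={"documents": ["First document text...", "Second document text..."]},
    )

    # Summarize each document with a single prompting step
    summaries = ProcessWithPrompt(
        "Summarize Documents",
        inputs={"inputs": docs.output["documents"]},
        args={
            "llm": llm,
            "instruction": "Summarize the text in one sentence.",
        },
        outputs={"inputs": "documents", "generations": "summaries"},
    )

Because each step is named, the same workflow can later be extended with further steps (filtering, judging, training) without re-running the earlier ones.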
DataDreamer is:
🧩 Simple: Simple and approachable to use with sensible defaults, yet powerful, with support for bleeding-edge techniques.
🔬 Research-Grade: Built for researchers, by researchers, but accessible to all. A focus on correctness, best practices, and reproducibility.
🏎️ Efficient: Aggressive caching and resumability built in. Support for techniques like quantization, parameter-efficient training (LoRA), and more (a resumability sketch follows this list).
🔄 Reproducible: Workflows built with DataDreamer are easily shareable, reproducible, and extendable.
🤝 Makes Sharing Easy: Publishing datasets and models is simple. Automatically generate data cards and model cards with metadata, and generate a list of any citations required.
… learn more about the motivation and design principles behind DataDreamer.
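As a concrete illustration of the caching and resumability mentioned above: when a script is re-run with the same session folder and step names, completed steps load their outputs from disk instead of re-executing. A minimal sketch, reusing only names from the demo above:

from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import DataFromPrompt

# On a second run of this script, the step named
# "Generate Research Paper Abstracts" in the "./output" session is
# detected as already complete, so its cached results are loaded
# rather than re-calling the LLM.
with DataDreamer("./output"):
    gpt_4 = OpenAI(model_name="gpt-4")
    arxiv_dataset = DataFromPrompt(
        "Generate Research Paper Abstracts",
        args={
            "llm": gpt_4,
            "n": 1000,
            "temperature": 1.2,
            "instruction": (
                "Generate an arXiv abstract of an NLP research paper."
                " Return just the abstract, no titles."
            ),
        },
        outputs={"generations": "abstracts"},
    )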
Citation
Please cite the DataDreamer paper:
@misc{patel2024datadreamer,
    title={DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows},
    author={Ajay Patel and Colin Raffel and Chris Callison-Burch},
    year={2024},
    eprint={2402.10379},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
Contact
Please reach out to us via email (ajayp@upenn.edu) or on Discord if you have any questions, comments, or feedback.