DataDreamer
Prompt · Generate Synthetic Data · Train & Align Models
DataDreamer is a powerful open-source Python library for prompting, synthetic data generation, and training workflows. It is designed to be simple, extremely efficient, and research-grade.
Installation:
pip3 install datadreamer.dev
demo.py:
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import DataFromPrompt, ProcessWithPrompt
from datadreamer.trainers import TrainHFFineTune
from peft import LoraConfig
with DataDreamer("./output"):
    # Load GPT-4
    gpt_4 = OpenAI(model_name="gpt-4")

    # Generate synthetic arXiv-style research paper abstracts with GPT-4
    arxiv_dataset = DataFromPrompt(
        "Generate Research Paper Abstracts",
        args={
            "llm": gpt_4,
            "n": 1000,
            "temperature": 1.2,
            "instruction": (
                "Generate an arXiv abstract of an NLP research paper."
                " Return just the abstract, no titles."
            ),
        },
        outputs={"generations": "abstracts"},
    )

    # Ask GPT-4 to convert the abstracts to tweets
    abstracts_and_tweets = ProcessWithPrompt(
        "Generate Tweets from Abstracts",
        inputs={"inputs": arxiv_dataset.output["abstracts"]},
        args={
            "llm": gpt_4,
            "instruction": (
                "Given the abstract, write a tweet to summarize the work."
            ),
            "top_p": 1.0,
        },
        outputs={"inputs": "abstracts", "generations": "tweets"},
    )

    # Create training data splits
    splits = abstracts_and_tweets.splits(train_size=0.90, validation_size=0.10)

    # Train a model to convert research paper abstracts to tweets
    # with the synthetic dataset
    trainer = TrainHFFineTune(
        "Train an Abstract => Tweet Model",
        model_name="google/t5-v1_1-base",
        peft_config=LoraConfig(),
    )
    trainer.train(
        train_input=splits["train"].output["abstracts"],
        train_output=splits["train"].output["tweets"],
        validation_input=splits["validation"].output["abstracts"],
        validation_output=splits["validation"].output["tweets"],
        epochs=30,
        batch_size=8,
    )

    # Publish and share the synthetic dataset
    abstracts_and_tweets.publish_to_hf_hub(
        "datadreamer-dev/abstracts_and_tweets",
        train_size=0.90,
        validation_size=0.10,
    )

    # Publish and share the trained model
    trainer.publish_to_hf_hub("datadreamer-dev/abstracts_to_tweet_model")
🎉 For more demonstrations and recipes, see the Quick Tour page.
With DataDreamer you can:
💬 Create Prompting Workflows: Create and run multi-step, complex prompting workflows easily with major open-source or API-based LLMs (a minimal open-source-LLM sketch follows this list).
📊 Generate Synthetic Datasets: Generate synthetic datasets for novel tasks or augment existing datasets with LLMs.
⚒️ Train Models: Align models. Fine-tune models. Instruction-tune models. Distill models. Train on existing data or synthetic data.
… learn more about what's possible in the Overview Guide.
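For instance, the demo above can be adapted to prompt a local open-source model rather than an API. The sketch below is a minimal illustration, not part of the official demo: it assumes the HFTransformers LLM class and the DataSource step, and the model name, toy documents, and column names are placeholder assumptions.

from datadreamer import DataDreamer
from datadreamer.llms import HFTransformers
from datadreamer.steps import DataSource, ProcessWithPrompt

with DataDreamer("./output"):
    # Load an open-source instruction-tuned model from Hugging Face
    # (the model name here is an illustrative assumption)
    llm = HFTransformers("mistralai/Mistral-7B-Instruct-v0.1")

    # Wrap a few in-memory documents as a step output
    # (DataSource and these toy documents are assumptions for this sketch)
    docs = DataSource(
        "Get Documents",
        data={"documents": ["First document text...", "Second document text..."]},
    )

    # Summarize each document with a single prompting step
    summaries = ProcessWithPrompt(
        "Summarize Documents",
        inputs={"inputs": docs.output["documents"]},
        args={
            "llm": llm,
            "instruction": "Summarize the text in one sentence.",
        },
        outputs={"inputs": "documents", "generations": "summaries"},
    )

Because each step is named, the same workflow can later be extended with further steps (filtering, judging, training) without re-running the earlier ones.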
DataDreamer is:
🧩 Simple: Simple and approachable to use with sensible defaults, yet powerful, with support for bleeding-edge techniques.
🔬 Research-Grade: Built for researchers, by researchers, but accessible to all. A focus on correctness, best practices, and reproducibility.
🏎️ Efficient: Aggressive caching and resumability built in. Support for techniques like quantization, parameter-efficient training (LoRA), and more (a resumability sketch follows this list).
🔄 Reproducible: Workflows built with DataDreamer are easily shareable, reproducible, and extendable.
🤝 Makes Sharing Easy: Publishing datasets and models is simple. Automatically generate data cards and model cards with metadata, and generate a list of any citations required.
… learn more about the motivation and design principles behind DataDreamer.
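As a concrete illustration of the caching and resumability mentioned above: when a script is re-run with the same session folder and step names, completed steps load their outputs from disk instead of re-executing. A minimal sketch, reusing only names from the demo above:

from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import DataFromPrompt

# On a second run of this script, the step named
# "Generate Research Paper Abstracts" in the "./output" session is
# detected as already complete, so its cached results are loaded
# rather than re-calling the LLM.
with DataDreamer("./output"):
    gpt_4 = OpenAI(model_name="gpt-4")
    arxiv_dataset = DataFromPrompt(
        "Generate Research Paper Abstracts",
        args={
            "llm": gpt_4,
            "n": 1000,
            "temperature": 1.2,
            "instruction": (
                "Generate an arXiv abstract of an NLP research paper."
                " Return just the abstract, no titles."
            ),
        },
        outputs={"generations": "abstracts"},
    )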
Citation
Please cite the DataDreamer paper:
@misc{patel2024datadreamer,
    title={DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows},
    author={Ajay Patel and Colin Raffel and Chris Callison-Burch},
    year={2024},
    eprint={2402.10379},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
Contact
Please reach out to us via email (ajayp@upenn.edu) or on Discord if you have any questions, comments, or feedback.