DataDreamer

Prompt · Generate Synthetic Data · Train & Align Models


DataDreamer is a powerful open-source Python library for prompting, synthetic data generation, and training workflows. It is designed to be simple, extremely efficient, and research-grade.

Installation:
pip3 install datadreamer.dev
Train a model to generate a tweet summarizing a research paper abstract using synthetic data.
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import DataFromPrompt, ProcessWithPrompt
from datadreamer.trainers import TrainHFFineTune
from peft import LoraConfig

with DataDreamer("./output"):
   # Load GPT-4
   gpt_4 = OpenAI(model_name="gpt-4")

   # Generate synthetic arXiv-style research paper abstracts with GPT-4
   arxiv_dataset = DataFromPrompt(
      "Generate Research Paper Abstracts",
      args={
         "llm": gpt_4,
         "n": 1000,
         "temperature": 1.2,
         "instruction": (
            "Generate an arXiv abstract of an NLP research paper."
            " Return just the abstract, no titles."
         ),
      },
      outputs={"generations": "abstracts"},
   )

   # Ask GPT-4 to convert the abstracts to tweets
   abstracts_and_tweets = ProcessWithPrompt(
      "Generate Tweets from Abstracts",
      inputs={"inputs": arxiv_dataset.output["abstracts"]},
      args={
         "llm": gpt_4,
         "instruction": (
            "Given the abstract, write a tweet to summarize the work."
         ),
         "top_p": 1.0,
      },
      outputs={"inputs": "abstracts", "generations": "tweets"},
   )

   # Create training data splits
   splits = abstracts_and_tweets.splits(train_size=0.90, validation_size=0.10)

   # Train a model to convert research paper abstracts to tweets
   # with the synthetic dataset
   trainer = TrainHFFineTune(
      "Train an Abstract => Tweet Model",
      model_name="google/t5-v1_1-base",
      peft_config=LoraConfig(),
   )
   trainer.train(
      train_input=splits["train"].output["abstracts"],
      train_output=splits["train"].output["tweets"],
      validation_input=splits["validation"].output["abstracts"],
      validation_output=splits["validation"].output["tweets"],
      epochs=30,
      batch_size=8,
   )

   # Publish and share the synthetic dataset
   abstracts_and_tweets.publish_to_hf_hub(
      "datadreamer-dev/abstracts_and_tweets",
      train_size=0.90,
      validation_size=0.10,
   )

   # Publish and share the trained model
   trainer.publish_to_hf_hub("datadreamer-dev/abstracts_to_tweet_model")

Demo

See the synthetic dataset and the trained model

πŸš€ For more demonstrations and recipes see the Quick Tour page.


With DataDreamer you can:

  • 💬 Create Prompting Workflows: Easily create and run complex, multi-step prompting workflows with major open-source or API-based LLMs.

  • 📊 Generate Synthetic Datasets: Generate synthetic datasets for novel tasks or augment existing datasets with LLMs.

  • βš™οΈ Train Models: Align models. Fine-tune models. Instruction-tune models. Distill models. Train on existing data or synthetic data.

  • … learn more about what’s possible in the Overview Guide.
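The multi-step workflow idea above can be sketched in plain Python. This is a conceptual illustration, not DataDreamer's API: `fake_llm` is a hypothetical stand-in for a real model call, and the two functions mirror the demo's two steps, where each step consumes the previous step's outputs.

```python
def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for an API or local model call.
    return f"response to: {prompt}"

def generate_abstracts(n: int) -> list[str]:
    # Step 1: generate n synthetic abstracts from an instruction.
    instruction = "Generate an arXiv abstract of an NLP research paper."
    return [fake_llm(f"{instruction} (sample {i})") for i in range(n)]

def abstracts_to_tweets(abstracts: list[str]) -> list[str]:
    # Step 2: transform each abstract with a second instruction.
    instruction = "Given the abstract, write a tweet to summarize the work."
    return [fake_llm(f"{instruction}\n\n{a}") for a in abstracts]

abstracts = generate_abstracts(3)
tweets = abstracts_to_tweets(abstracts)
```

In DataDreamer itself, the same chaining is expressed by passing one step's `output` as another step's `inputs`, as in the demo above.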

DataDreamer is:

  • 🧩 Simple: Simple and approachable to use with sensible defaults, yet powerful, with support for bleeding-edge techniques.

  • 🔬 Research-Grade: Built for researchers, by researchers, but accessible to all. A focus on correctness, best practices, and reproducibility.

  • 🏎️ Efficient: Aggressive caching and resumability built-in. Support for techniques like quantization, parameter-efficient training (LoRA), and more.

  • 🔄 Reproducible: Workflows built with DataDreamer are easily shareable, reproducible, and extendable.

  • 🤝 Makes Sharing Easy: Publishing datasets and models is simple. Automatically generate data cards and model cards with metadata. Generate a list of any citations required.

  • … learn more about the motivation and design principles behind DataDreamer.
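The caching and resumability bullet above can be illustrated with a minimal sketch. This is not DataDreamer's internal implementation, only the general idea: a step's results are keyed on a stable hash of its name and arguments, so re-running an identical workflow returns cached results instead of recomputing them.

```python
import hashlib
import json

_cache: dict[str, object] = {}

def run_step(name: str, args: dict, fn):
    # Key the cache on a stable hash of the step's name and arguments.
    key = hashlib.sha256(
        json.dumps({"name": name, "args": args}, sort_keys=True).encode()
    ).hexdigest()
    if key in _cache:
        return _cache[key]  # resume: skip recomputation
    result = fn(**args)
    _cache[key] = result
    return result

calls = []
def expensive_generation(n):
    calls.append(n)  # track how many times the work actually runs
    return list(range(n))

first = run_step("Generate", {"n": 5}, expensive_generation)
second = run_step("Generate", {"n": 5}, expensive_generation)  # cache hit
```

Changing the step's name or any argument produces a new key, so modified steps re-run while unchanged ones are reused.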

Citation

Please cite the DataDreamer paper:

@misc{patel2024datadreamer,
   title={DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows},
   author={Ajay Patel and Colin Raffel and Chris Callison-Burch},
   year={2024},
   eprint={2402.10379},
   archivePrefix={arXiv},
   primaryClass={cs.CL}
}

Contact

Please reach out to us via email (ajayp@upenn.edu) or on Discord if you have any questions, comments, or feedback.