Overview Guide

The following is a high-level conceptual overview of the DataDreamer library to help you quickly understand how different components work together to create prompting, synthetic data generation, and training workflows.

DataDreamer Sessions

Any code you write with the DataDreamer library goes inside a DataDreamer session, like so:

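from datadreamer import DataDreamer

# Everything run inside this block is saved under ./output/
with DataDreamer('./output/'):
    # ... run steps or trainers here ...
    pass  # placeholder so the empty block is valid Python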

Within a session, you can run any steps or trainers you want. DataDreamer automatically organizes, caches, and saves the results of each step or trainer run to the output folder. This makes the session easy to resume if interrupted, and reproducible when the code is shared along with the session output folder.
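To make the caching and resuming behavior concrete, here is a toy sketch of the idea using only the Python standard library. It is not DataDreamer's implementation: `ToySession`, `run_step`, and the one-JSON-file-per-step layout are hypothetical names and choices made up for illustration.

```python
import json
import tempfile
from pathlib import Path


class ToySession:
    """Hypothetical stand-in for a session: caches step results in a folder."""

    def __init__(self, output_dir):
        self.output_dir = Path(output_dir)

    def __enter__(self):
        # Create the output folder on entry so results have somewhere to go.
        self.output_dir.mkdir(parents=True, exist_ok=True)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Nothing to clean up; returning False lets exceptions propagate.
        return False

    def run_step(self, name, compute):
        """Return the cached result for `name`, computing it only if missing."""
        cache_file = self.output_dir / f"{name}.json"
        if cache_file.exists():
            return json.loads(cache_file.read_text())
        result = compute()
        cache_file.write_text(json.dumps(result))
        return result


calls = []


def expensive_step():
    calls.append("ran")  # Record every real execution.
    return [n * n for n in range(5)]


with tempfile.TemporaryDirectory() as folder:
    with ToySession(folder) as session:
        first = session.run_step("squares", expensive_step)
    # A second session over the same folder resumes from the cached file.
    with ToySession(folder) as session:
        second = session.run_step("squares", expensive_step)
```

In this sketch the second session never calls `expensive_step`, because the result is read back from the folder. That is the property that makes an interrupted run cheap to resume.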

Steps

A step in DataDreamer transforms some input data into some output data. Steps are the core operators in a DataDreamer session and are useful for generating data from LLMs, augmenting existing datasets with synthetic data, or any other data processing task. The output of one step can be used as the input to another step, allowing you to chain together multiple steps to create complex workflows.

For example, the HFHubDataSource step lets you load an existing dataset from the Hugging Face Hub. Steps like Prompt, FewShotPrompt, and FewShotPromptWithRetrieval help you produce generations from LLMs. You can see all of the available built-in steps here. Although it is not required in order to use DataDreamer, you may be interested in creating your own steps to encapsulate a custom technique or routine.
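The chaining idea can be illustrated with a small, library-free sketch. `ToyStep` below is a hypothetical class written for this example, not a DataDreamer step, and the string formatting stands in for whatever a real step would do.

```python
class ToyStep:
    """Hypothetical step: applies a function to each input row."""

    def __init__(self, name, transform, inputs):
        self.name = name
        # Materialize the output so later steps can read it as their input.
        self.output = [transform(row) for row in inputs]


# A "data source" step that passes its rows through unchanged.
source = ToyStep("Load questions", lambda row: row, ["what is 2+2", "name a prime"])

# A second step consumes the first step's output.
prompts = ToyStep("Build prompts", lambda q: f"Q: {q}?\nA:", source.output)

# A third step consumes the second step's output, and so on.
lengths = ToyStep("Measure prompts", len, prompts.output)
```

Each step only needs the `output` of the step before it, which is what lets a workflow be assembled from small, independently named pieces.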

Trainers

A trainer in DataDreamer trains on a dataset, usually the output of a step, and produces a model. Trainers are useful for alignment, fine-tuning, instruction-tuning, training classifiers, and training a model from scratch.

Many training schemes are supported, including TrainHFFineTune, TrainSentenceTransformer, and TrainHFDPO. You can see the full list of available trainers here. Trainers also support training on multiple GPUs, as well as training with quantization and parameter-efficient techniques like LoRA.
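As a rough illustration of the "dataset in, model out" shape of a trainer, the sketch below fits a tiny word-count classifier on two parallel columns. `ToyTrainer`, its `train` and `predict` methods, and the example data are all hypothetical. None of this reflects how the DataDreamer trainers named above work internally.

```python
from collections import Counter, defaultdict


class ToyTrainer:
    """Hypothetical trainer: learns which label each word most often signals."""

    def __init__(self, name):
        self.name = name
        self.word_label_counts = defaultdict(Counter)

    def train(self, train_input, train_output):
        # Count how often each word co-occurs with each label.
        for text, label in zip(train_input, train_output):
            for word in text.lower().split():
                self.word_label_counts[word][label] += 1
        return self

    def predict(self, text):
        # Sum the per-word label counts and pick the most common label.
        votes = Counter()
        for word in text.lower().split():
            votes.update(self.word_label_counts.get(word, {}))
        return votes.most_common(1)[0][0] if votes else None


# Two columns, as they might come out of an earlier step.
texts = ["great movie", "terrible movie", "great acting", "terrible plot"]
labels = ["positive", "negative", "positive", "negative"]

model = ToyTrainer("Train sentiment").train(texts, labels)
```

Here `model.predict("great plot")` returns `"positive"`, because "great" was seen twice with that label while "plot" was seen once with the other. Text made only of unseen words yields `None`.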

Models

You can instantiate models such as an LLM or an Embedder. A step may require a model as an argument in order to run; for example, Prompt takes an LLM as an argument. Models make it easy to load and run the latest open-source models as well as models served by API (OpenAI, Anthropic, Together AI, Mistral AI, etc.). DataDreamer also makes it simple to swap models for experimentation. You can see the full list of available LLMs here.
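One way to picture why swapping models is simple is that a step depends only on a shared interface. The sketch below assumes a hypothetical `ToyLLM` interface with a single `run` method; the two "models" are trivial string functions and are not DataDreamer classes.

```python
from typing import Protocol


class ToyLLM(Protocol):
    """Hypothetical interface: anything with run(prompts) counts as a model."""

    def run(self, prompts: list[str]) -> list[str]: ...


class UppercaseLLM:
    def run(self, prompts: list[str]) -> list[str]:
        return [p.upper() for p in prompts]


class ReverseLLM:
    def run(self, prompts: list[str]) -> list[str]:
        return [p[::-1] for p in prompts]


def toy_prompt_step(llm: ToyLLM, prompts: list[str]) -> list[str]:
    # The step depends only on the shared interface, not on a concrete model.
    return llm.run(prompts)


prompts = ["hello", "data"]
outputs_a = toy_prompt_step(UppercaseLLM(), prompts)
outputs_b = toy_prompt_step(ReverseLLM(), prompts)
```

Changing the model is a one-argument change at the call site, and the step itself is untouched.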

DataDreamer provides a variety of utilities for efficient generation, including loading models with quantization, running them on multiple GPUs, caching generations, and more.

Publishing Datasets and Models

DataDreamer makes it simple to export and publish both datasets (the outputs of steps) and models (the outputs of trainers) to the Hugging Face Hub. This makes it easier to openly share the datasets and models you create, along with reproducibility information.

DataDreamer also automatically generates and publishes data cards and model cards alongside your dataset or model. These cards contain useful metadata such as software license information, citations, and reproducibility information.

Reproducibility

DataDreamer has a strong focus on reproducibility. DataDreamer sessions are easily reproducible when the code and the session output folder are shared alongside any published datasets or models created by DataDreamer.

Important

Sharing the session output folder allows others to reproduce or extend your workflow by resuming the DataDreamer session and modifying it, while taking advantage of cached intermediate outputs to avoid expensive or slow re-computation where possible.

Datasets and models published with DataDreamer also have automatically generated data cards and model cards for reproducibility.

Advanced Usage

Now that you have a basic understanding of the DataDreamer library, you may be interested in the topics covered in the Advanced Usage section.