Caching and Saved Outputs
DataDreamer aggressively caches and saves its work at multiple levels to avoid re-computation wherever possible, which keeps runs time- and cost-efficient.
Step Outputs: DataDreamer caches the results of each step run within a session to the output folder. If a session is interrupted and re-run, DataDreamer will automatically load the results of previously completed steps from disk and resume where it left off.
Model Generations and Outputs: DataDreamer caches the results computed by an LLM, Embedder model, etc.
Training Checkpoints: DataDreamer will automatically save and resume from checkpoints when training a model with Trainer.
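The step-level behavior described above (load a finished step from disk instead of re-running it) can be sketched with the standard library alone. This is a minimal illustration of the idea, not DataDreamer's implementation: the helper `run_step_cached`, the `result.json` file name, and the `squares` step are all hypothetical names chosen for the example.

```python
import json
import tempfile
from pathlib import Path


def run_step_cached(output_dir, name, compute):
    """Run `compute()` once and reuse its saved result on later runs.

    Hypothetical helper for illustration only; not part of DataDreamer.
    Returns a pair: (result, loaded_from_disk).
    """
    step_dir = Path(output_dir) / name
    result_file = step_dir / "result.json"  # assumed file name
    if result_file.exists():
        # A previous run finished this step: load instead of recomputing.
        return json.loads(result_file.read_text()), True
    result = compute()
    step_dir.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first, then rename, so an interrupted run
    # never leaves a half-written result that looks like a finished step.
    tmp_file = step_dir / "result.json.tmp"
    tmp_file.write_text(json.dumps(result))
    tmp_file.replace(result_file)
    return result, False


calls = []  # records how many times the expensive work actually ran


def expensive_step():
    calls.append(1)
    return [x * x for x in range(5)]


output_dir = tempfile.mkdtemp()
first, first_from_cache = run_step_cached(output_dir, "squares", expensive_step)
second, second_from_cache = run_step_cached(output_dir, "squares", expensive_step)
```

The second call returns the saved result without invoking `expensive_step` again, which mirrors how a re-run session can pick up where an interrupted one left off.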
Output Folder File Structure
DataDreamer sessions write to an output folder where all outputs and caches are saved. Below is a brief description of the output folder structure.
Step Folders: Each Step will produce a named folder within the output folder. The name of the folder is the name of the step, and the folder contains the output dataset of the step within a _dataset folder. step.json contains metadata about the step. If a step is run within another step, its folder will be nested under the parent step’s folder.
Trainer Folders: Each Trainer will produce a named folder within the output folder. The name of the folder is the name of the trainer, and the folder contains checkpoints saved during training in a _checkpoints folder and the final trained model in a _model folder. Various JSON files inside the _model folder, like training_args.json, contain metadata about the training configuration.
Cache Folder: The .cache folder in the output folder holds the SQLite databases that are used to cache the generations and outputs produced by models like LLM or Embedder.
Backups Folder: The _backups folder in the output folder holds backups of step or trainer folders that have since been invalidated by a newer configuration of that step or trainer. They are kept in case a user reverts to a previous configuration of the step or trainer.
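As a rough aid for reading an output folder, the layout described above can be turned into a small classifier. This is a sketch under stated assumptions: it assumes step.json sits directly inside a step's folder, it inspects only the top level (nested step folders are not followed), and the function `classify_output_folder` and the folder names `generate_ideas` and `my_trainer` are hypothetical, used only to build a mock tree.

```python
import json
import tempfile
from pathlib import Path


def classify_output_folder(output_dir):
    """Group top-level entries of an output folder by the markers described above.

    Hypothetical helper for illustration; not part of DataDreamer.
    """
    summary = {"steps": [], "trainers": [], "cache": False, "backups": False}
    for entry in sorted(Path(output_dir).iterdir()):
        if not entry.is_dir():
            continue
        if entry.name == ".cache":
            summary["cache"] = True
        elif entry.name == "_backups":
            summary["backups"] = True
        elif (entry / "step.json").is_file():
            # Assumption: step metadata lives directly in the step folder.
            summary["steps"].append(entry.name)
        elif (entry / "_checkpoints").is_dir() or (entry / "_model").is_dir():
            summary["trainers"].append(entry.name)
    return summary


# Build a mock output folder with made-up step and trainer names.
root = Path(tempfile.mkdtemp())
(root / "generate_ideas" / "_dataset").mkdir(parents=True)
(root / "generate_ideas" / "step.json").write_text(json.dumps({}))
(root / "my_trainer" / "_checkpoints").mkdir(parents=True)
(root / "my_trainer" / "_model").mkdir()
(root / ".cache").mkdir()
(root / "_backups").mkdir()

summary = classify_output_folder(root)
```

Here `summary` lists one step folder and one trainer folder and reports that both the cache and backups folders are present.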