Caching and Saved Outputs
DataDreamer aggressively caches and saves its work at multiple levels so that it can avoid re-computation wherever possible and stay time- and cost-efficient.
Step Outputs: DataDreamer caches the results of each step run within a session to the output folder. If a session is interrupted and re-run, DataDreamer will automatically load the results of previously completed steps from disk and resume where it left off.
Model Generations and Outputs: DataDreamer caches the results computed by an LLM, an Embedder model, etc.
Training Checkpoints: DataDreamer will automatically save and resume from checkpoints when training a model with Trainer.
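The resume behavior described for step outputs can be illustrated with a small standalone sketch. This is not DataDreamer's implementation: the helper `run_step`, the `result.json` file name, and the step name used in the demo are all hypothetical, chosen only to show the general pattern of saving a finished step to disk and loading it on the next run instead of recomputing it.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Tuple


def run_step(output_dir: Path, name: str, compute: Callable[[], Any]) -> Tuple[Any, bool]:
    """Return (result, loaded_from_disk) for a named step.

    Hypothetical helper: the `result.json` file name is an assumption made
    for this sketch, not DataDreamer's on-disk format.
    """
    result_path = Path(output_dir) / name / "result.json"
    if result_path.exists():
        # A previous run already finished this step: load it instead of recomputing.
        return json.loads(result_path.read_text()), True

    result = compute()
    result_path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file and rename it afterwards, so an interrupted
    # run never leaves a half-written file that looks like a finished step.
    tmp_path = result_path.with_name(result_path.name + ".tmp")
    tmp_path.write_text(json.dumps(result))
    tmp_path.replace(result_path)
    return result, False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp)
        # First call computes and saves; second call loads the saved result.
        print(run_step(out, "square numbers", lambda: [n * n for n in range(5)]))
        print(run_step(out, "square numbers", lambda: [n * n for n in range(5)]))
```

The second call returns the same result with the flag set to `True`, which mirrors what happens when an interrupted session is re-run against the same output folder.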
Output Folder File Structure
DataDreamer sessions write to an output folder where all outputs and caches are saved. Below is a brief description of the output folder structure.
Step Folders: Each Step will produce a named folder within the output folder. The name of the folder is the name of the step, and the folder contains the output dataset of the step within a _dataset folder. step.json contains metadata about the step. If a step is run within another step, its folder will be nested under the parent step’s folder.
Trainer Folders: Each Trainer will produce a named folder within the output folder. The name of the folder is the name of the trainer. Checkpoints saved during training go in a _checkpoints folder, and the final trained model goes in a _model folder. Various JSON files inside the _model folder, like training_args.json, contain metadata about the training configuration.
Cache Folder: The .cache folder in the output folder holds the SQLite databases that are used to cache the generations and outputs produced by models like LLM or Embedder.
Backups Folder: The _backups folder in the output folder holds backups of step or trainer folders that have since been invalidated by a newer configuration of that step or trainer. They are kept in case a user reverts to a previous configuration of the step or trainer.
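As a rough aid for exploring an output folder, the layout above can be turned into a small inspection script. This is a sketch built only on the names given in this section (`step.json`, `_dataset`, `_checkpoints`, `_model`, `.cache`, `_backups`); the function `classify_output_folder` is hypothetical and not part of DataDreamer, it assumes `step.json` sits directly inside the step folder, and it looks only at top-level folders, so nested step folders are not reported.

```python
import tempfile
from pathlib import Path
from typing import Dict


def classify_output_folder(output_dir: Path) -> Dict[str, str]:
    """Label each top-level folder in an output folder by its apparent role."""
    labels: Dict[str, str] = {}
    for child in sorted(Path(output_dir).iterdir()):
        if not child.is_dir():
            continue
        if child.name == ".cache":
            labels[child.name] = "cache"
        elif child.name == "_backups":
            labels[child.name] = "backups"
        elif (child / "step.json").is_file() or (child / "_dataset").is_dir():
            labels[child.name] = "step"
        elif (child / "_checkpoints").is_dir() or (child / "_model").is_dir():
            labels[child.name] = "trainer"
        else:
            labels[child.name] = "unknown"
    return labels


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp)
        # Build a fake output folder; the step and trainer names are made up.
        (out / "Generate Questions" / "_dataset").mkdir(parents=True)
        (out / "Generate Questions" / "step.json").write_text("{}")
        (out / "Train Model" / "_checkpoints").mkdir(parents=True)
        (out / "Train Model" / "_model").mkdir()
        (out / ".cache").mkdir()
        (out / "_backups").mkdir()
        for name, role in classify_output_folder(out).items():
            print(f"{role:8} {name}")
```

Pointing the function at a real session's output folder gives a quick overview of which steps and trainers have saved results, without opening any of the datasets or databases.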