steps#

Step objects perform core operations within a DataDreamer session transforming input data into output data. They are run within a DataDreamer session and each step takes in data as a set of inputs and configuration options as a set of args and then returns output data under the output attribute. A help message that describes what inputs and args a particular step takes in and what output it returns can be found by accessing the help string. All steps derive from the Step base class.

You can create your own steps to implement custom logic or a new technique.

Constructing a `Step` object#

Step `inputs`#

Step inputs are a set of named inputs that are passed to a step via a key-value dictionary. The input data a step will operate on are given to a step via inputs and each input may be of type OutputDatasetColumn or OutputIterableDatasetColumn. The input columns passed to a step are typically output columns created by previous steps within a DataDreamer session.

Step `args`#

Step args are a set of named arguments that are passed to a step via a key-value dictionary. Arguments are non-input data parameters typically used to configure a step’s behavior. Arguments are typically of type bool, int, str, or other Python types. Some steps like Prompt or Embed take LLM objects Embedder objects as args.

Step `outputs`#

Steps produce output data in the form of an output dataset object like OutputDataset or OutputIterableDataset accessible by the output attribute. To access a column on these output dataset objects you can use the __getitem__ operator like so: step.output['column_name']. This will return a OutputDatasetColumn or OutputIterableDatasetColumn column object.

Renaming output columns#

If you want to rename the columns in the output dataset, at Step construction you can pass a outputs argument that is a dictionary mapping from the original output column names to the new column names.

Caching#

Steps are cached and their outputs are saved to disk after running. Upon resuming a DataDreamer session, previously completed steps are loaded from disk.

Types of Steps#

DataDreamer comes with a set of built-in steps that can be used to create a workflow to process data, create synthetic datasets, augment existing datasets, and train models. We catalogue diferent types of steps below.

Lazy Steps#

Some steps may run lazily. This means that the step’s output will be lazily computed as it is consumed by another step. This is useful for steps that work with large datasets or require expensive computation. Lazy steps will return a OutputIterableDataset under their output attribute. At any point, you can use the progress attribute to get how much of a lazy step’s output has been computed so far.

Lazy steps do not save their output to disk. If you want to explictly save a lazy step’s ouput to disk, you can call save() on the lazy step. If you have a series of lazy steps chained together, you can call save() on the last lazy step to save the final transformed data to disk after all transformations have been computed, skipping saving intermediate transformations to disk.

Convenience Methods#

Various convenience methods are available on Step objects to help with frequent tasks.

Previewing a Step’s `output`#

If you want to preview a step’s output, you can call the head() method to quickly retrieve a few rows from the step’s output dataset in the form of a pandas.DataFrame object.

Common Operations#

A variety of common data operations and data transformations are available as methods on Step objects:

Basic Operations	Column Operations	Sequence Operations	Save/Copy Operations	Training Dataset Operations
`select()` `take()` `skip()` `add_item()`	`select_columns()` `rename_column()` `rename_columns()` `remove_columns()`	`shuffle()` `sort()` `map()` `filter()` `reverse()`	`save()` `copy()` `shard()`	`splits()`

In addition, there are higher-order functions that operate on multiple Step objects that can be used to concatenate the output datasets of multiple steps together or to combine the output dataset columns of multiple steps together (similar to zip() in Python):

concat()
zipped()

Running Steps in Parallel#

See the Running Steps in Parallel page.

Data Card Generation#

An automatically generated data card can be viewed by calling data_card(). The data card can be helpful for reproducibility and for sharing your work with others when published alongside your code. When publishing datasets, the data card will be published alongside the dataset.

The data card traces what steps were run to produce the step’s output dataset, what models were used, what paper citations and software licenses may apply, among other useful information (see DataCardType). Reproducibility information such as the versions of packages used and a fingerprint hash (a signature of all steps chained together to produce the final step’s output dataset) is also included.

Exporting and Publishing Datasets#

You can export the output dataset produced by a step to disk by calling one of the various export methods available:

export_to_dict()
export_to_list()
export_to_json()
export_to_csv()
export_to_hf_dataset()

You can publish the output dataset produced by a step to the Hugging Face Hub by calling publish_to_hf_hub().

class datadreamer.steps.Step(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#

Bases: object

Base class for all steps.

Parameters:

name (str) – The name of the step.
inputs (Optional[dict[str, OutputDatasetColumn | OutputIterableDatasetColumn]], default: None) – The inputs to the step.
args (Optional[dict[str, Any]], default: None) – The args to the step.
outputs (Optional[dict[str, str]], default: None) – The name mapping to rename outputs produced by the step.
progress_interval (Optional[int], default: 60) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
verbose (Optional[bool], default: None) – Whether or not to print verbose logs.
log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).
save_num_proc (Optional[int], default: None) – The number of processes to use if saving to disk.
save_num_shards (Optional[int], default: None) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.

register_input(input_column_name, required=True, help=None)[source]#

Parameters:

input_column_name (str) – The name of the input column.
required (bool, default: True) – Whether the input is required.
help (Optional[str], default: None) – The help string for the input.

register_arg(arg_name, required=True, default=None, help=None, default_help=None)[source]#

Parameters:

arg_name (str) – The name of the argument.
required (bool, default: True) – Whether the argument is required.
default (Optional[Any], default: None) – The default value of the argument.
help (Optional[str], default: None) – The help string for the argument.
default_help (Optional[str], default: None) – The help string for the default value of the argument.

register_output(output_column_name, help=None)[source]#

Parameters:

output_column_name (str) – The name of the output column.
help (Optional[str], default: None) – The help string for the output.

register_data_card(data_card_type, data_card)[source]#

Parameters:

data_card_type (str) – The type of the data card.
data_card (Any) – The data card.

property args: dict[str, Any][source]#: The args of the step.

property inputs: dict[str, OutputDatasetColumn | OutputIterableDatasetColumn][source]#: The inputs of the step.

get_run_output_folder_path()[source]#

Get the run output folder path that can be used by the step for writing persistent data.

Return type:: str

pickle(value, *args, **kwargs)[source]#

Pickle a value so it can be stored in a row produced by this step. See create your own steps for more details.

Parameters:

value (Any) – The value to pickle.
*args (Any) – The args to pass to dumps().
**kwargs (Any) – The kwargs to pass to dumps().

Return type:

bytes

unpickle(value)[source]#

Unpickle a value that was stored in a row produced by this step with pickle(). See create your own steps for more details.

Parameters:: value (bytes) – The value to unpickle.
Return type:: Any

property progress: None | float[source]#: The progress of the step.

property output: OutputDataset | OutputIterableDataset[source]#: The output dataset of the step.

property dataset_path: str[source]#: The path to the step’s output dataset on disk in HuggingFace Dataset format if the step has been saved to disk.

head(n=5, shuffle=False, seed=None, buffer_size=1000)[source]#

Return the first n rows of the step’s output as a pandas DataFrame for easy viewing.

Parameters:

n (default: 5) – The number of rows to return.
shuffle (default: False) – Whether to shuffle the rows before taking the first n.
seed (default: None) – The seed to use if shuffling.
buffer_size (default: 1000) – The buffer size to use if shuffling and the step’s output is an iterable dataset.

Return type:

DataFrame

data_card()[source]#

Print the data card for the step.

Return type:: None

select(indices, name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#

Select rows from the step’s output by their indices. See select() for more details.

Parameters:

indices (Iterable) – The indices of the rows to select.
name (Optional[str], default: None) – The name of the operation.
lazy (bool, default: True) – Whether to run the operation lazily.
progress_interval (UnionType[None, int, Default], default: DEFAULT) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
writer_batch_size (Optional[int], default: 1000) – The batch size to use if saving to disk.
save_num_proc (UnionType[None, int, Default], default: DEFAULT) – The number of processes to use if saving to disk.
save_num_shards (UnionType[None, int, Default], default: DEFAULT) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.

Return type:

Step

Returns:

A new step with the operation applied.

select_columns(column_names, name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#

Select columns from the step’s output. See select_columns() for more details.

Parameters:

column_names (str | list[str]) – The names of the columns to select.
name (Optional[str], default: None) – The name of the operation.
lazy (bool, default: True) – Whether to run the operation lazily.
progress_interval (UnionType[None, int, Default], default: DEFAULT) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
writer_batch_size (Optional[int], default: 1000) – The batch size to use if saving to disk.
save_num_proc (UnionType[None, int, Default], default: DEFAULT) – The number of processes to use if saving to disk.
save_num_shards (UnionType[None, int, Default], default: DEFAULT) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.

Return type:

Step

Returns:

A new step with the operation applied.

take(n, name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#

Take the first n rows from the step’s output. See take() for more details.

Parameters:

n (int) – The number of rows to take.
name (Optional[str], default: None) – The name of the operation.
lazy (bool, default: True) – Whether to run the operation lazily.
progress_interval (UnionType[None, int, Default], default: DEFAULT) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
writer_batch_size (Optional[int], default: 1000) – The batch size to use if saving to disk.
save_num_proc (UnionType[None, int, Default], default: DEFAULT) – The number of processes to use if saving to disk.
save_num_shards (UnionType[None, int, Default], default: DEFAULT) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.

Return type:

Step

Returns:

A new step with the operation applied.

skip(n, name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#

Skip the first n rows from the step’s output. See skip() for more details.

Parameters:

n (int) – The number of rows to skip.
name (Optional[str], default: None) – The name of the operation.
lazy (bool, default: True) – Whether to run the operation lazily.
progress_interval (UnionType[None, int, Default], default: DEFAULT) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
writer_batch_size (Optional[int], default: 1000) – The batch size to use if saving to disk.
save_num_proc (UnionType[None, int, Default], default: DEFAULT) – The number of processes to use if saving to disk.
save_num_shards (UnionType[None, int, Default], default: DEFAULT) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.

Return type:

Step

Returns:

A new step with the operation applied.

shuffle(seed=None, buffer_size=1000, name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#

Shuffle the rows of the step’s output. See shuffle() for more details.

Parameters:

seed (Optional[int], default: None) – The random seed to use for shuffling the step’s output.
buffer_size (int, default: 1000) – The buffer size to use for shuffling the dataset, if the step’s output is an OutputIterableDataset.
name (Optional[str], default: None) – The name of the operation.
lazy (bool, default: True) – Whether to run the operation lazily.
progress_interval (UnionType[None, int, Default], default: DEFAULT) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
writer_batch_size (Optional[int], default: 1000) – The batch size to use if saving to disk.
save_num_proc (UnionType[None, int, Default], default: DEFAULT) – The number of processes to use if saving to disk.
save_num_shards (UnionType[None, int, Default], default: DEFAULT) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.

Return type:

Step

Returns:

A new step with the operation applied.

sort(column_names, reverse=False, null_placement='at_end', name=None, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#

Sort the rows of the step’s output. See sort() for more details.

Parameters:

column_names (Union[str, Sequence[str]]) – The names of the columns to sort by.
reverse (Union[bool, Sequence[bool]], default: False) – Whether to sort in reverse order.
null_placement (str, default: 'at_end') – Where to place null values in the sorted dataset.
name (Optional[str], default: None) – The name of the operation.
progress_interval (UnionType[None, int, Default], default: DEFAULT) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
writer_batch_size (Optional[int], default: 1000) – The batch size to use if saving to disk.
save_num_proc (UnionType[None, int, Default], default: DEFAULT) – The number of processes to use if saving to disk.
save_num_shards (UnionType[None, int, Default], default: DEFAULT) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.

Return type:

Step

Returns:

A new step with the operation applied.

add_item(item, name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#

Add a row to the step’s output. See add_item() for more details.

Parameters:

item (dict) – The item to add to the step’s output.
name (Optional[str], default: None) – The name of the operation.
lazy (bool, default: True) – Whether to run the operation lazily.
progress_interval (UnionType[None, int, Default], default: DEFAULT) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
writer_batch_size (Optional[int], default: 1000) – The batch size to use if saving to disk.
save_num_proc (UnionType[None, int, Default], default: DEFAULT) – The number of processes to use if saving to disk.
save_num_shards (UnionType[None, int, Default], default: DEFAULT) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.

Return type:

Step

Returns:

A new step with the operation applied.

map(function, with_indices=False, input_columns=None, batched=False, batch_size=1000, remove_columns=None, total_num_rows=None, auto_progress=True, name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#

Apply a function to the step’s output. See map() for more details.

Parameters:

function (Callable) – The function to apply to rows of the step’s output.
with_indices (bool, default: False) – Whether to pass the indices of the rows to the function.
input_columns (Union[None, str, list[str]], default: None) – The names of the columns to pass to the function.
batched (bool, default: False) – Whether to apply the function in batches.
batch_size (int, default: 1000) – The batch size to use if applying the function in batches.
remove_columns (Union[None, str, list[str]], default: None) – The names of the columns to remove from the output.
total_num_rows (Optional[int], default: None) – The total number of rows being processed (helps with displaying progress).
auto_progress (bool, default: True) – Whether to automatically update the progress % for this step.
name (Optional[str], default: None) – The name of the operation.
lazy (bool, default: True) – Whether to run the operation lazily.
progress_interval (UnionType[None, int, Default], default: DEFAULT) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
writer_batch_size (Optional[int], default: 1000) – The batch size to use if saving to disk.
save_num_proc (UnionType[None, int, Default], default: DEFAULT) – The number of processes to use if saving to disk.
save_num_shards (UnionType[None, int, Default], default: DEFAULT) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.

Return type:

Step

Returns:

A new step with the operation applied.

filter(function, with_indices=False, input_columns=None, batched=False, batch_size=1000, total_num_rows=None, auto_progress=True, name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#

Filter rows from the step’s output. See filter() for more details.

Parameters:

function (Callable) – The function to use for filtering rows of the step’s output.
with_indices (bool, default: False) – Whether to pass the indices of the rows to the function.
input_columns (Union[None, str, list[str]], default: None) – The names of the columns to pass to the function.
batched (bool, default: False) – Whether to apply the function in batches.
batch_size (int, default: 1000) – The batch size to use if applying the function in batches.
total_num_rows (Optional[int], default: None) – The total number of rows being processed (helps with displaying progress).
auto_progress (bool, default: True) – Whether to automatically update the progress % for this step.
name (Optional[str], default: None) – The name of the operation.
lazy (bool, default: True) – Whether to run the operation lazily.
progress_interval (UnionType[None, int, Default], default: DEFAULT) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
writer_batch_size (Optional[int], default: 1000) – The batch size to use if saving to disk.
save_num_proc (UnionType[None, int, Default], default: DEFAULT) – The number of processes to use if saving to disk.
save_num_shards (UnionType[None, int, Default], default: DEFAULT) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.

Return type:

Step

Returns:

A new step with the operation applied.

rename_column(original_column_name, new_column_name, name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#

Rename a column in the step’s output. See rename_column() for more details.

Parameters:

original_column_name (str) – The original name of the column.
new_column_name (str) – The new name of the column.
name (Optional[str], default: None) – The name of the operation.
lazy (bool, default: True) – Whether to run the operation lazily.
progress_interval (UnionType[None, int, Default], default: DEFAULT) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
writer_batch_size (Optional[int], default: 1000) – The batch size to use if saving to disk.
save_num_proc (UnionType[None, int, Default], default: DEFAULT) – The number of processes to use if saving to disk.
save_num_shards (UnionType[None, int, Default], default: DEFAULT) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.

Return type:

Step

Returns:

A new step with the operation applied.

rename_columns(column_mapping, name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#

Rename columns in the step’s output. See rename_columns() for more details.

Parameters:

column_mapping (dict[str, str]) – The mapping of original column names to new column names.
name (Optional[str], default: None) – The name of the operation.
lazy (bool, default: True) – Whether to run the operation lazily.
progress_interval (UnionType[None, int, Default], default: DEFAULT) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
writer_batch_size (Optional[int], default: 1000) – The batch size to use if saving to disk.
save_num_proc (UnionType[None, int, Default], default: DEFAULT) – The number of processes to use if saving to disk.
save_num_shards (UnionType[None, int, Default], default: DEFAULT) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.

Return type:

Step

Returns:

A new step with the operation applied.

remove_columns(column_names, name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#

Remove columns from the step’s output. See remove_columns() for more details.

Parameters:

column_names (str | list[str]) – The names of the columns to remove.
name (Optional[str], default: None) – The name of the operation.
lazy (bool, default: True) – Whether to run the operation lazily.
progress_interval (UnionType[None, int, Default], default: DEFAULT) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
writer_batch_size (Optional[int], default: 1000) – The batch size to use if saving to disk.
save_num_proc (UnionType[None, int, Default], default: DEFAULT) – The number of processes to use if saving to disk.
save_num_shards (UnionType[None, int, Default], default: DEFAULT) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.

Return type:

Step

Returns:

A new step with the operation applied.

splits(train_size=None, validation_size=None, test_size=None, stratify_by_column=None, name=None, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#

Split the step’s output into multiple splits for training, validation, and testing. If train_size or validation_size or test_size is not specified, the corresponding split will not be created.

Parameters:

train_size (Union[None, float, int], default: None) – The size of the training split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the training split. If an int, should be the number of rows to include in the training split.
validation_size (Union[None, float, int], default: None) – The size of the validation split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the validation split. If an int, should be the number of rows to include in the validation split.
test_size (Union[None, float, int], default: None) – The size of the test split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If an int, should be the number of rows to include in the test split.
stratify_by_column (Optional[str], default: None) – The name of the column to use to stratify equally between splits.
name (Optional[str], default: None) – The name of the operation.
progress_interval (UnionType[None, int, Default], default: DEFAULT) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
writer_batch_size (Optional[int], default: 1000) – The batch size to use if saving to disk.
save_num_proc (UnionType[None, int, Default], default: DEFAULT) – The number of processes to use if saving to disk.
save_num_shards (UnionType[None, int, Default], default: DEFAULT) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.

Return type:

dict[str, Step]

Returns:

A dictionary where the keys are the names of the splits and the values are new steps with the split applied.

shard(num_shards, index, contiguous=False, name=None, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#

Shard the step’s output into multiple shards. See shard() for more details.

Parameters:

num_shards (int) – The number of shards to split the dataset into.
index (int) – The index of the shard to select.
contiguous (bool, default: False) – Whether to select contiguous blocks of indicies for shards.
name (Optional[str], default: None) – The name of the operation.
progress_interval (UnionType[None, int, Default], default: DEFAULT) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
writer_batch_size (Optional[int], default: 1000) – The batch size to use if saving to disk.
save_num_proc (UnionType[None, int, Default], default: DEFAULT) – The number of processes to use if saving to disk.
save_num_shards (UnionType[None, int, Default], default: DEFAULT) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.

Return type:

Step

Returns:

A new step with the operation applied.

reverse(name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#

Reverse the rows of the step’s output.

Parameters:

name (Optional[str], default: None) – The name of the operation.
lazy (bool, default: True) – Whether to run the operation lazily.
progress_interval (UnionType[None, int, Default], default: DEFAULT) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
writer_batch_size (Optional[int], default: 1000) – The batch size to use if saving to disk.
save_num_proc (UnionType[None, int, Default], default: DEFAULT) – The number of processes to use if saving to disk.
save_num_shards (UnionType[None, int, Default], default: DEFAULT) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.

Return type:

Step

Returns:

A new step with the operation applied.

save(name=None, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#

Save the step’s output to disk.

Parameters:

name (Optional[str], default: None) – The name of the operation.
progress_interval (UnionType[None, int, Default], default: DEFAULT) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
writer_batch_size (Optional[int], default: 1000) – The batch size to use if saving to disk.
save_num_proc (UnionType[None, int, Default], default: DEFAULT) – The number of processes to use if saving to disk.
save_num_shards (UnionType[None, int, Default], default: DEFAULT) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.

Return type:

Step

Returns:

A new step with the operation applied.

copy(name=None, lazy=None, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#

Create a copy of the step’s output.

Parameters:

name (Optional[str], default: None) – The name of the operation.
lazy (Optional[bool], default: None) – Whether to run the operation lazily.
progress_interval (UnionType[None, int, Default], default: DEFAULT) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
writer_batch_size (Optional[int], default: 1000) – The batch size to use if saving to disk.
save_num_proc (UnionType[None, int, Default], default: DEFAULT) – The number of processes to use if saving to disk.
save_num_shards (UnionType[None, int, Default], default: DEFAULT) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.

Return type:

Step

Returns:

A new step with the operation applied.

export_to_dict(train_size=None, validation_size=None, test_size=None, stratify_by_column=None, writer_batch_size=1000, save_num_proc=None, save_num_shards=None)[source]#

Export the step’s output to a dictionary and optionally create splits. See splits() for more details on splits behavior.

Parameters:

train_size (Union[None, float, int], default: None) – The size of the training split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the training split. If an int, should be the number of rows to include in the training split.
validation_size (Union[None, float, int], default: None) – The size of the validation split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the validation split. If an int, should be the number of rows to include in the validation split.
test_size (Union[None, float, int], default: None) – The size of the test split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If an int, should be the number of rows to include in the test split.
stratify_by_column (Optional[str], default: None) – The name of the column to use to stratify equally between splits.
writer_batch_size (Optional[int], default: 1000) – The batch size to use if saving to disk.
save_num_proc (Optional[int], default: None) – The number of processes to use if saving to disk.
save_num_shards (Optional[int], default: None) – The number of shards on disk to save the dataset into.

Return type:

dict

Returns:

The step’s output as a dictionary.

export_to_list(train_size=None, validation_size=None, test_size=None, stratify_by_column=None, writer_batch_size=1000, save_num_proc=None, save_num_shards=None)[source]#

Export the step’s output to a list and optionally create splits. See splits() for more details on splits behavior.

Parameters:

train_size (Union[None, float, int], default: None) – The size of the training split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the training split. If an int, should be the number of rows to include in the training split.
validation_size (Union[None, float, int], default: None) – The size of the validation split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the validation split. If an int, should be the number of rows to include in the validation split.
test_size (Union[None, float, int], default: None) – The size of the test split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If an int, should be the number of rows to include in the test split.
stratify_by_column (Optional[str], default: None) – The name of the column to use to stratify equally between splits.
writer_batch_size (Optional[int], default: 1000) – The batch size to use if saving to disk.
save_num_proc (Optional[int], default: None) – The number of processes to use if saving to disk.
save_num_shards (Optional[int], default: None) – The number of shards on disk to save the dataset into.

Return type:

list | dict

export_to_json(path, train_size=None, validation_size=None, test_size=None, stratify_by_column=None, writer_batch_size=1000, save_num_proc=None, save_num_shards=None, **to_json_kwargs)[source]#

Export the step’s output to a JSON file and optionally create splits. See splits() for more details on splits behavior.

Parameters:

path (str) – The path to save the JSON file to.
train_size (Union[None, float, int], default: None) – The size of the training split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the training split. If an int, should be the number of rows to include in the training split.
validation_size (Union[None, float, int], default: None) – The size of the validation split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the validation split. If an int, should be the number of rows to include in the validation split.
test_size (Union[None, float, int], default: None) – The size of the test split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If an int, should be the number of rows to include in the test split.
stratify_by_column (Optional[str], default: None) – The name of the column to use to stratify equally between splits.
writer_batch_size (Optional[int], default: 1000) – The batch size to use if saving to disk.
save_num_proc (Optional[int], default: None) – The number of processes to use if saving to disk.
save_num_shards (Optional[int], default: None) – The number of shards on disk to save the dataset into.
to_json_kwargs – Additional keyword arguments to pass to to_json().

Return type:

str | dict

Returns:

The path to the JSON file or a dictionary of paths if creating splits.

export_to_csv(path, sep=',', train_size=None, validation_size=None, test_size=None, stratify_by_column=None, writer_batch_size=1000, save_num_proc=None, save_num_shards=None, **to_csv_kwargs)[source]#

Export the step’s output to a CSV file and optionally create splits. See splits() for more details on splits behavior.

Parameters:

path (str) – The path to save the CSV file to.
sep (default: ',') – The delimiter to use for the CSV file.
train_size (Union[None, float, int], default: None) – The size of the training split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the training split. If an int, should be the number of rows to include in the training split.
validation_size (Union[None, float, int], default: None) – The size of the validation split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the validation split. If an int, should be the number of rows to include in the validation split.
test_size (Union[None, float, int], default: None) – The size of the test split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If an int, should be the number of rows to include in the test split.
stratify_by_column (Optional[str], default: None) – The name of the column to use to stratify equally between splits.
writer_batch_size (Optional[int], default: 1000) – The batch size to use if saving to disk.
save_num_proc (Optional[int], default: None) – The number of processes to use if saving to disk.
save_num_shards (Optional[int], default: None) – The number of shards on disk to save the dataset into.
to_csv_kwargs – Additional keyword arguments to pass to to_csv().

Return type:

str | dict

Returns:

The path to the CSV file or a dictionary of paths if creating splits.

export_to_hf_dataset(path, train_size=None, validation_size=None, test_size=None, stratify_by_column=None, writer_batch_size=1000, save_num_proc=None, save_num_shards=None)[source]#

Export the step’s output to a Hugging Face Dataset and optionally create splits. See splits() for more details on splits behavior.

Parameters:

path (str) – The path to save the Hugging Face Dataset folder to.
train_size (Union[None, float, int], default: None) – The size of the training split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the training split. If an int, should be the number of rows to include in the training split.
validation_size (Union[None, float, int], default: None) – The size of the validation split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the validation split. If an int, should be the number of rows to include in the validation split.
test_size (Union[None, float, int], default: None) – The size of the test split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If an int, should be the number of rows to include in the test split.
stratify_by_column (Optional[str], default: None) – The name of the column to use to stratify equally between splits.
writer_batch_size (Optional[int], default: 1000) – The batch size to use if saving to disk.
save_num_proc (Optional[int], default: None) – The number of processes to use if saving to disk.
save_num_shards (Optional[int], default: None) – The number of shards on disk to save the dataset into.

Return type:

Dataset | DatasetDict

Returns:

The step’s output as a Hugging Face Dataset or DatasetDict if creating splits.

publish_to_hf_hub(repo_id, branch=None, private=False, token=None, train_size=None, validation_size=None, test_size=None, stratify_by_column=None, writer_batch_size=1000, save_num_proc=None, save_num_shards=None, is_synthetic=True, **kwargs)[source]#

Publish the step’s output to the Hugging Face Hub as a dataset and optionally create splits. See splits() for more details on splits behavior. See push_to_hub() for more details on publishing.

Parameters:

repo_id (str) – The repository ID to publish the dataset to.
branch (Optional[str], default: None) – The branch to push the dataset to.
private (bool, default: False) – Whether to make the dataset private.
token (Optional[str], default: None) – The Hugging Face API token to use for authentication.
train_size (Union[None, float, int], default: None) – The size of the training split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the training split. If an int, should be the number of rows to include in the training split.
validation_size (Union[None, float, int], default: None) – The size of the validation split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the validation split. If an int, should be the number of rows to include in the validation split.
test_size (Union[None, float, int], default: None) – The size of the test split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If an int, should be the number of rows to include in the test split.
stratify_by_column (Optional[str], default: None) – The name of the column to use to stratify equally between splits.
writer_batch_size (Optional[int], default: 1000) – The batch size to use if saving to disk.
save_num_proc (Optional[int], default: None) – The number of processes to use if saving to disk.
save_num_shards (Optional[int], default: None) – The number of shards on disk to save the dataset into.
is_synthetic (bool, default: True) – Whether the dataset is synthetic (applies certain metadata when publishing).
**kwargs – Additional keyword arguments to pass to push_to_hub().

Return type:

str

Returns:

The URL to the published dataset.

class datadreamer.steps.DataSource(name, data, total_num_rows=None, auto_progress=True, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False, **kwargs)[source]#

Bases: Step

Loads a dataset from a in-memory Python object.

Parameters:

name (str) – The name of the step.
data (Any) – The data to load as a dataset. See the valid return formats for more details on acceptable data formats.
total_num_rows (Optional[int], default: None) – The total number of rows being processed (helps with displaying progress, if the data is being lazily loaded).
auto_progress (bool, default: True) – Whether to automatically update the progress % for this step.
progress_interval (Optional[int], default: 60) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
verbose (Optional[bool], default: None) – Whether or not to print verbose logs.
log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).
save_num_proc (Optional[int], default: None) – The number of processes to use if saving to disk.
save_num_shards (Optional[int], default: None) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.
**config_kwargs – Additional keyword arguments to pass to datasets.load_dataset().

property output: OutputDataset | OutputIterableDataset[source]#: The output dataset of the step.

class datadreamer.steps.HFHubDataSource(name, path, config_name=None, split=None, revision=None, trust_remote_code=False, streaming=False, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False, **config_kwargs)[source]#

Bases: DataSource

Loads a dataset from the Hugging Face Hub. See datasets.load_dataset() for more details.

Parameters:

name (str) – The name of the step.
path (str) – The path to the dataset on the Hugging Face Hub.
config_name (Optional[str], default: None) – The name of the dataset configuration to load.
split (Union[None, str, Split], default: None) – The split to load.
revision (Union[None, str, Version], default: None) – The version (commit hash) of the dataset to load.
trust_remote_code (bool, default: False) – Whether to trust the remote code.
streaming (bool, default: False) – Whether to load the dataset in streaming mode.
progress_interval (Optional[int], default: 60) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
verbose (Optional[bool], default: None) – Whether or not to print verbose logs.
log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).
save_num_proc (Optional[int], default: None) – The number of processes to use if saving to disk.
save_num_shards (Optional[int], default: None) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.

property output: OutputDataset | OutputIterableDataset[source]#: The output dataset of the step.

class datadreamer.steps.HFDatasetDataSource(name, dataset_path, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#

Bases: DataSource

Loads a Hugging Face Dataset from a local path. See datasets.load_from_disk() for more details.

Parameters:

name (str) – The name of the step.
dataset_path (str) – The path to the datasets.Dataset folder.
progress_interval (Optional[int], default: 60) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
verbose (Optional[bool], default: None) – Whether or not to print verbose logs.
log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).
save_num_proc (Optional[int], default: None) – The number of processes to use if saving to disk.
save_num_shards (Optional[int], default: None) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.

Loads a dataset from a in-memory Python object.

Parameters:

name (str) – The name of the step.
data (Any) –
The data to load as a dataset. See the valid return formats for more details on acceptable data formats.
total_num_rows – The total number of rows being processed (helps with displaying progress, if the data is being lazily loaded).
auto_progress – Whether to automatically update the progress % for this step.
progress_interval (Optional[int], default: 60) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
verbose (Optional[bool], default: None) – Whether or not to print verbose logs.
log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).
save_num_proc (Optional[int], default: None) – The number of processes to use if saving to disk.
save_num_shards (Optional[int], default: None) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.
**config_kwargs – Additional keyword arguments to pass to datasets.load_dataset().

property output: OutputDataset | OutputIterableDataset[source]#: The output dataset of the step.

class datadreamer.steps.JSONDataSource(name, data_folder=None, data_files=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False, **config_kwargs)[source]#

Bases: DataSource

Loads a JSON dataset from a local path. See datasets.load_dataset() for more details.

Parameters:

name (str) – The name of the step.
data_folder (Optional[str], default: None) – The path to the dataset folder.
data_files (Union[None, str, Sequence[str]], default: None) – The name of files from the folder to load.
progress_interval (Optional[int], default: 60) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
verbose (Optional[bool], default: None) – Whether or not to print verbose logs.
log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).
save_num_proc (Optional[int], default: None) – The number of processes to use if saving to disk.
save_num_shards (Optional[int], default: None) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.
**config_kwargs – Additional keyword arguments to pass to datasets.load_dataset().

Loads a dataset from a in-memory Python object.

Parameters:

name (str) – The name of the step.
data (Any) –
The data to load as a dataset. See the valid return formats for more details on acceptable data formats.
total_num_rows – The total number of rows being processed (helps with displaying progress, if the data is being lazily loaded).
auto_progress – Whether to automatically update the progress % for this step.
progress_interval (Optional[int], default: 60) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
verbose (Optional[bool], default: None) – Whether or not to print verbose logs.
log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).
save_num_proc (Optional[int], default: None) – The number of processes to use if saving to disk.
save_num_shards (Optional[int], default: None) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.
**config_kwargs – Additional keyword arguments to pass to datasets.load_dataset().

property output: OutputDataset | OutputIterableDataset[source]#: The output dataset of the step.

class datadreamer.steps.CSVDataSource(name, data_folder=None, data_files=None, sep=',', progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False, **config_kwargs)[source]#

Bases: DataSource

Loads a CSV dataset from a local path. See datasets.load_dataset() for more details.

Parameters:

name (str) – The name of the step.
data_folder (Optional[str], default: None) – The path to the dataset folder.
data_files (Union[None, str, Sequence[str]], default: None) – The name of files from the folder to load.
progress_interval (Optional[int], default: 60) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
verbose (Optional[bool], default: None) – Whether or not to print verbose logs.
log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).
save_num_proc (Optional[int], default: None) – The number of processes to use if saving to disk.
save_num_shards (Optional[int], default: None) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.
**config_kwargs – Additional keyword arguments to pass to datasets.load_dataset().

Loads a dataset from a in-memory Python object.

Parameters:

name (str) – The name of the step.
data (Any) –
The data to load as a dataset. See the valid return formats for more details on acceptable data formats.
total_num_rows – The total number of rows being processed (helps with displaying progress, if the data is being lazily loaded).
auto_progress – Whether to automatically update the progress % for this step.
progress_interval (Optional[int], default: 60) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
verbose (Optional[bool], default: None) – Whether or not to print verbose logs.
log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).
save_num_proc (Optional[int], default: None) – The number of processes to use if saving to disk.
save_num_shards (Optional[int], default: None) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.
**config_kwargs – Additional keyword arguments to pass to datasets.load_dataset().

property output: OutputDataset | OutputIterableDataset[source]#: The output dataset of the step.

class datadreamer.steps.TextDataSource(name, data_folder=None, data_files=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False, **config_kwargs)[source]#

Bases: DataSource

Loads a text dataset from a local path. See datasets.load_dataset() for more details.

Parameters:

name (str) – The name of the step.
data_folder (Optional[str], default: None) – The path to the dataset folder.
data_files (Union[None, str, Sequence[str]], default: None) – The name of files from the folder to load.
progress_interval (Optional[int], default: 60) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
verbose (Optional[bool], default: None) – Whether or not to print verbose logs.
log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).
save_num_proc (Optional[int], default: None) – The number of processes to use if saving to disk.
save_num_shards (Optional[int], default: None) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.
**config_kwargs – Additional keyword arguments to pass to datasets.load_dataset().

Loads a dataset from a in-memory Python object.

Parameters:

name (str) – The name of the step.
data (Any) –
The data to load as a dataset. See the valid return formats for more details on acceptable data formats.
total_num_rows – The total number of rows being processed (helps with displaying progress, if the data is being lazily loaded).
auto_progress – Whether to automatically update the progress % for this step.
progress_interval (Optional[int], default: 60) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
verbose (Optional[bool], default: None) – Whether or not to print verbose logs.
log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).
save_num_proc (Optional[int], default: None) – The number of processes to use if saving to disk.
save_num_shards (Optional[int], default: None) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.
**config_kwargs – Additional keyword arguments to pass to datasets.load_dataset().

property output: OutputDataset | OutputIterableDataset[source]#: The output dataset of the step.

class datadreamer.steps.Prompt(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#

Bases: _PromptBase

Runs a set of prompts against a LLM.

Prompt.help#

Prompt(
	name = 'The name of the step.',
	inputs = {
		'prompts': 'The prompts to process with the LLM.'
	},
	args = {
		'llm': 'The LLM to use.',
		'post_process': 'A function to post-process the generations. (optional)',
		'lazy': 'Whether to run lazily or not. (optional, defaults to False)',
		'**kwargs': 'Any other arguments you want to pass to the .run() method of the LLM. (optional)'
	},
	outputs = {
		'prompts': 'The prompts processed with the LLM.',
		'generations': 'The generations by the LLM.'
	},
)

property output: OutputDataset | OutputIterableDataset[source]#: The output dataset of the step.

class datadreamer.steps.RAGPrompt(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#

Bases: _PromptBase, SuperStep

Processes a set of prompts using in-context texts with a LLM. The k most relevant texts are retrieved for each prompt using a Retriever.

RAGPrompt.help#

RAGPrompt(
	name = 'The name of the step.',
	inputs = {
		'prompts': 'The prompts to process with the LLM.'
	},
	args = {
		'llm': 'The LLM to use.',
		'retriever': 'The retriever to use to retrieve in-context texts.',
		'k': 'The number of texts to retrieve.',
		'retrieved_text_label': "The label to use for retrieved texts. (optional, defaults to 'Document:')",
		'prompt_label': "The label to use for the prompt. (optional, defaults to 'Question:')",
		'max_new_tokens': 'The maximum number of tokens to generate. (optional)',
		'sep': "The separator to use between in-context retrieved texts and each prompt. (optional, defaults to '\\n')",
		'min_in_context_retrieved_texts': 'The minimum number of in-context retrieved texts to include. (optional)',
		'max_in_context_retrieved_texts': 'The maximum number of in-context retrieved texts to include. (optional)',
		'post_process': 'A function to post-process the generations. (optional)',
		'lazy': 'Whether to run lazily or not. (optional, defaults to False)',
		'**kwargs': 'Any other arguments you want to pass to the .run() method of the LLM. (optional)'
	},
	outputs = {
		'prompts': 'The prompts processed with the LLM.',
		'generations': 'The generations by the LLM.'
	},
)

property output: OutputDataset | OutputIterableDataset[source]#: The output dataset of the step.

class datadreamer.steps.ProcessWithPrompt(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#

Bases: _PromptBase

Processes a set of inputs using an instruction with a LLM.

ProcessWithPrompt.help#

ProcessWithPrompt(
	name = 'The name of the step.',
	inputs = {
		'inputs': 'The inputs to process with the LLM.'
	},
	args = {
		'llm': 'The LLM to use.',
		'instruction': 'An instruction that describes how to process the input.',
		'input_label': "The label to use for inputs. (optional, defaults to 'Input:')",
		'instruction_label': "The label to use for the instruction. (optional, defaults to 'Instruction:')",
		'max_new_tokens': 'The maximum number of tokens to generate. (optional)',
		'sep': "The separator to use between instructions and the input. (optional, defaults to '\\n\\n')",
		'post_process': 'A function to post-process the generations. (optional)',
		'lazy': 'Whether to run lazily or not. (optional, defaults to False)',
		'**kwargs': 'Any other arguments you want to pass to the .run() method of the LLM. (optional)'
	},
	outputs = {
		'inputs': 'The inputs processed with the LLM.',
		'prompts': 'The prompts processed with the LLM.',
		'generations': 'The generations by the LLM.'
	},
)

property output: OutputDataset | OutputIterableDataset[source]#: The output dataset of the step.

class datadreamer.steps.FewShotPrompt(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#

Bases: _PromptBase

Processes a set of inputs using in-context examples and an instruction with a LLM.

FewShotPrompt.help#

FewShotPrompt(
	name = 'The name of the step.',
	inputs = {
		'input_examples': 'The in-context example inputs to include in the prompt.',
		'output_examples': 'The in-context example outputs to include in the prompt.',
		'inputs': 'The inputs to process with the LLM.'
	},
	args = {
		'llm': 'The LLM to use.',
		'input_label': "The label to use for inputs. (optional, defaults to 'Input:')",
		'output_label': "The label to use for outputs. (optional, defaults to 'Output:')",
		'max_new_tokens': 'The maximum number of tokens to generate. (optional)',
		'instruction': 'An instruction to include in the prompt. (optional)',
		'sep': "The separator to use between instructions and in-context examples. (optional, defaults to '\\n')",
		'min_in_context_examples': 'The minimum number of in-context examples to include. (optional)',
		'max_in_context_examples': 'The maximum number of in-context examples to include. (optional)',
		'post_process': 'A function to post-process the generations. (optional)',
		'lazy': 'Whether to run lazily or not. (optional, defaults to False)',
		'**kwargs': 'Any other arguments you want to pass to the .run() method of the LLM. (optional)'
	},
	outputs = {
		'inputs': 'The inputs processed with the LLM.',
		'prompts': 'The prompts processed with the LLM.',
		'generations': 'The generations by the LLM.'
	},
)

property output: OutputDataset | OutputIterableDataset[source]#: The output dataset of the step.

class datadreamer.steps.FewShotPromptWithRetrieval(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#

Bases: FewShotPrompt, SuperStep

Processes a set of inputs using in-context examples and an instruction with a LLM. The k most relevant in-context examples are retrieved for each input using an Embedder.

FewShotPromptWithRetrieval.help#

FewShotPromptWithRetrieval(
	name = 'The name of the step.',
	inputs = {
		'input_examples': 'The in-context example inputs to retrieve from to include in the prompt.',
		'output_examples': 'The in-context example outputs to retrieve from to include in the prompt.',
		'inputs': 'The inputs to process with the LLM.'
	},
	args = {
		'llm': 'The LLM to use.',
		'embedder': 'The embedder to use to retrieve in-context examples.',
		'k': 'The number of in-context examples to retrieve.',
		'input_label': "The label to use for inputs. (optional, defaults to 'Input:')",
		'output_label': "The label to use for outputs. (optional, defaults to 'Output:')",
		'max_new_tokens': 'The maximum number of tokens to generate. (optional)',
		'instruction': 'An instruction to include in the prompt. (optional)',
		'sep': "The separator to use between instructions and in-context examples. (optional, defaults to '\\n')",
		'min_in_context_examples': 'The minimum number of in-context examples to include. (optional)',
		'max_in_context_examples': 'The maximum number of in-context examples to include. (optional)',
		'post_process': 'A function to post-process the generations. (optional)',
		'lazy': 'Whether to run lazily or not. (optional, defaults to False)',
		'**kwargs': 'Any other arguments you want to pass to the .run() method of the LLM. (optional)'
	},
	outputs = {
		'inputs': 'The inputs processed with the LLM.',
		'prompts': 'The prompts processed with the LLM.',
		'generations': 'The generations by the LLM.'
	},
)

property output: OutputDataset | OutputIterableDataset[source]#: The output dataset of the step.

class datadreamer.steps.DataFromPrompt(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#

Bases: _PromptBase

Generates n rows of data using an instruction with a LLM.

DataFromPrompt.help#

DataFromPrompt(
	name = 'The name of the step.',
	args = {
		'llm': 'The LLM to use.',
		'instruction': 'The instruction to use to generate data.',
		'n': 'The number of rows to generate from the prompt.',
		'temperature': 'The temperature to use when generating data. (optional, defaults to 1.0)',
		'top_p': 'The top_p to use when generating data. (optional, defaults to 1.0)',
		'post_process': 'A function to post-process the generations. (optional)',
		'lazy': 'Whether to run lazily or not. (optional, defaults to False)',
		'**kwargs': 'Any other arguments you want to pass to the .run() method of the LLM. (optional)'
	},
	outputs = {
		'prompts': 'The prompts processed with the LLM.',
		'generations': 'The generations by the LLM.'
	},
)

property output: OutputDataset | OutputIterableDataset[source]#: The output dataset of the step.

class datadreamer.steps.DataFromAttributedPrompt(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#

Bases: _PromptBase

Generates n rows of data using an attributed instruction with a LLM. See the AttrPrompt paper for more information.

DataFromAttributedPrompt.help#

DataFromAttributedPrompt(
	name = 'The name of the step.',
	args = {
		'llm': 'The LLM to use.',
		'instruction': 'The attributed instruction to use to generate data.',
		'attributes': 'The attributes to use in the instruction.',
		'n': 'The number of rows to generate from the prompt.',
		'temperature': 'The temperature to use when generating data. (optional, defaults to 1.0)',
		'top_p': 'The top_p to use when generating data. (optional, defaults to 1.0)',
		'post_process': 'A function to post-process the generations. (optional)',
		'lazy': 'Whether to run lazily or not. (optional, defaults to False)',
		'**kwargs': 'Any other arguments you want to pass to the .run() method of the LLM. (optional)'
	},
	outputs = {
		'attributes': 'The attributes used to generate the data.',
		'prompts': 'The prompts processed with the LLM.',
		'generations': 'The generations by the LLM.'
	},
)

property output: OutputDataset | OutputIterableDataset[source]#: The output dataset of the step.

class datadreamer.steps.FilterWithPrompt(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#

Bases: _PromptBase, SuperStep

Filters a set of inputs using an instruction with a LLM and filter_func which takes in a generation produced by the instruction and returns a bool on whether to keep the row or not.

FilterWithPrompt.help#

FilterWithPrompt(
	name = 'The name of the step.',
	inputs = {
		'inputs': 'The inputs to process with the LLM.'
	},
	args = {
		'llm': 'The LLM to use.',
		'instruction': 'An instruction that describes how to process the input.',
		'filter_func': "A function to filter the generations with. (optional, defaults to filtering by parsing 'Yes' / 'No' from the generation)",
		'input_label': "The label to use for inputs. (optional, defaults to 'Input:')",
		'instruction_label': "The label to use for the instruction. (optional, defaults to 'Instruction:')",
		'max_new_tokens': 'The maximum number of tokens to generate. (optional)',
		'sep': "The separator to use between instructions and the input. (optional, defaults to '\\n\\n')",
		'post_process': 'A function to post-process the generations. (optional)',
		'lazy': 'Whether to run lazily or not. (optional, defaults to False)',
		'**kwargs': 'Any other arguments you want to pass to the .run() method of the LLM. (optional)'
	},
	outputs = {
		'*columns': "All of the columns of the step producing the 'inputs'."
	},
)

property output: OutputDataset | OutputIterableDataset[source]#: The output dataset of the step.

class datadreamer.steps.RankWithPrompt(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#

Bases: _PromptBase, SuperStep

Produces scores for a set of inputs using an instruction with a LLM. Scores are parsed from the generations produced by the instruction. Optionally, the input rows are sorted and filtered using the scores.

RankWithPrompt.help#

RankWithPrompt(
	name = 'The name of the step.',
	inputs = {
		'inputs': 'The inputs to process with the LLM.'
	},
	args = {
		'llm': 'The LLM to use.',
		'instruction': 'An instruction that describes how to process the input.',
		'sort': 'Whether or not to sort the inputs by the score produced by the instruction. (optional, defaults to True)',
		'reverse': 'Whether or not to reverse the sort direction. (optional, defaults to True)',
		'score_threshold': 'A score threshold. If specified, filter out input rows that scored below the threshold. (optional)',
		'input_label': "The label to use for inputs. (optional, defaults to 'Input:')",
		'instruction_label': "The label to use for the instruction. (optional, defaults to 'Instruction:')",
		'max_new_tokens': 'The maximum number of tokens to generate. (optional)',
		'sep': "The separator to use between instructions and the input. (optional, defaults to '\\n\\n')",
		'post_process': 'A function to post-process the generations. (optional)',
		'lazy': 'Whether to run lazily or not. (optional, defaults to False)',
		'**kwargs': 'Any other arguments you want to pass to the .run() method of the LLM. (optional)'
	},
	outputs = {
		'*columns': "All of the columns of the step producing the 'inputs'.",
		'scores': 'The scores produced by the instruction on the inputs.'
	},
)

property output: OutputDataset | OutputIterableDataset[source]#: The output dataset of the step.

class datadreamer.steps.JudgePairsWithPrompt(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#

Bases: _PromptBase

Judges between a of pair of inputs (a, b) using an instruction with a LLM.

JudgePairsWithPrompt.help#

JudgePairsWithPrompt(
	name = 'The name of the step.',
	inputs = {
		'a': 'A set of inputs to judge between.',
		'b': 'A set of inputs to judge between.'
	},
	args = {
		'llm': 'The LLM to use.',
		'instruction': 'An instruction that describes how to judge the input. (optional, defaults to "Which input is better? Respond with \'Input A\' or \'Input B\'.")',
		'judgement_func': "A function to get the judgement from the generation. (optional, defaults to judging by parsing 'a_label' / 'b_label' ('Input A' / 'Input B') from the generation)",
		'randomize_order': "Whether to randomly swap the order of 'a' and 'b' in the prompt to mitigate position bias. (optional, defaults to True)",
		'a_label': "The label to use for inputs 'a'. (optional, defaults to 'Input A:')",
		'b_label': "The label to use for inputs 'b'. (optional, defaults to 'Input B:')",
		'instruction_label': "The label to use for the instruction. (optional, defaults to 'Instruction:')",
		'max_new_tokens': 'The maximum number of tokens to generate. (optional)',
		'sep': "The separator to use between instructions and the input. (optional, defaults to '\\n')",
		'post_process': 'A function to post-process the generations. (optional)',
		'lazy': 'Whether to run lazily or not. (optional, defaults to False)',
		'**kwargs': 'Any other arguments you want to pass to the .run() method of the LLM. (optional)'
	},
	outputs = {
		'a': 'A set of inputs judged with the LLM.',
		'b': 'A set of inputs judged with the LLM.',
		'judge_prompts': 'The prompts processed with the LLM to judge inputs.',
		'judge_generations': 'The judgement generations by the LLM.',
		'judgements': 'The judgements by the LLM.'
	},
)

property output: OutputDataset | OutputIterableDataset[source]#: The output dataset of the step.

class datadreamer.steps.JudgeGenerationPairsWithPrompt(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#

Bases: _PromptBase

Judges between a of pair of generations (a, b) that were produced in response to prompts using an instruction with a LLM.

JudgeGenerationPairsWithPrompt.help#

JudgeGenerationPairsWithPrompt(
	name = 'The name of the step.',
	inputs = {
		'prompts': "A set of prompts used to generate 'a' and 'b'.",
		'a': 'A set of generations to judge between.',
		'b': 'A set of generations to judge between.'
	},
	args = {
		'llm': 'The LLM to use.',
		'instruction': 'An instruction that describes how to judge the input. (optional, defaults to "Which response is better? Respond with \'Response A\' or \'Response B\'.")',
		'judgement_func': "A function to get the judgement from the generation. (optional, defaults to judging by parsing 'a_label' / 'b_label' ('Response A' / 'Response B') from the generation)",
		'randomize_order': "Whether to randomly swap the order of 'a' and 'b' in the prompt to mitigate position bias. (optional, defaults to True)",
		'prompt_label': "The label to use for prompts. (optional, defaults to 'Prompt:')",
		'a_label': "The label to use for inputs 'a'. (optional, defaults to 'Response A:')",
		'b_label': "The label to use for inputs 'b'. (optional, defaults to 'Response B:')",
		'instruction_label': "The label to use for the instruction. (optional, defaults to 'Instruction:')",
		'max_new_tokens': 'The maximum number of tokens to generate. (optional)',
		'sep': "The separator to use between instructions and the input. (optional, defaults to '\\n')",
		'post_process': 'A function to post-process the generations. (optional)',
		'lazy': 'Whether to run lazily or not. (optional, defaults to False)',
		'**kwargs': 'Any other arguments you want to pass to the .run() method of the LLM. (optional)'
	},
	outputs = {
		'prompts': "The set of prompts used to generate 'a' and 'b'.",
		'a': 'A set of inputs judged with the LLM.',
		'b': 'A set of inputs judged with the LLM.',
		'judge_prompts': 'The prompts processed with the LLM to judge inputs.',
		'judge_generations': 'The judgement generations by the LLM.',
		'judgements': 'The judgements by the LLM.'
	},
)

property output: OutputDataset | OutputIterableDataset[source]#: The output dataset of the step.

class datadreamer.steps.RunTaskModel(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#

Bases: Step

Runs a set of texts against a TaskModel.

RunTaskModel.help#

RunTaskModel(
	name = 'The name of the step.',
	inputs = {
		'texts': 'The texts to process with the TaskModel.'
	},
	args = {
		'model': 'The TaskModel to use.',
		'truncate': 'Whether or not to truncate inputs. (optional, defaults to False)',
		'lazy': 'Whether to run lazily or not. (optional, defaults to False)',
		'**kwargs': 'Any other arguments you want to pass to the .run() method of the TaskModel. (optional)'
	},
	outputs = {
		'texts': 'The texts processed with the TaskModel.',
		'results': 'The results from the TaskModel.'
	},
)

property output: OutputDataset | OutputIterableDataset[source]#: The output dataset of the step.

class datadreamer.steps.Embed(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#

Bases: Step

Embeds a set of texts with an Embedder.

Embed.help#

Embed(
	name = 'The name of the step.',
	inputs = {
		'texts': 'The texts to embed.'
	},
	args = {
		'embedder': 'The Embedder to use.',
		'truncate': 'Whether or not to truncate inputs. (optional, defaults to False)',
		'instruction': 'The instruction to prefix inputs to the embedding model with. (optional)',
		'lazy': 'Whether to run lazily or not. (optional, defaults to False)',
		'**kwargs': 'Any other arguments you want to pass to the .run() method of the Embedder. (optional)'
	},
	outputs = {
		'texts': 'The texts that were embedded.',
		'embeddings': 'The embeddings by the Embedder.'
	},
)

property output: OutputDataset | OutputIterableDataset[source]#: The output dataset of the step.

class datadreamer.steps.CosineSimilarity(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#

Bases: SuperStep

Computes the cosine similarity between two sets of embeddings (a and b). If a and b are a set of texts, then they will be embedded using the provided Embedder.

CosineSimilarity.help#

CosineSimilarity(
	name = 'The name of the step.',
	inputs = {
		'a': 'The embeddings or texts to compute the cosine similarity of.',
		'b': 'The embeddings or texts to compute the cosine similarity of.'
	},
	args = {
		'similarity_batch_size': 'How many cosine similarities to compute at once. (optional, defaults to 100)',
		'device': 'The device or list of devices to compute the cosine similarities with. (optional)',
		'embedder': "The Embedder to use to embed 'a' and 'b' if they are a set of texts. (optional)",
		'lazy': 'Whether to run lazily or not. (optional, defaults to False)',
		'**kwargs': 'Any other arguments you want to pass to the .run() method of the Embedder if it is used. (optional)'
	},
	outputs = {
		'a': 'The embeddings or texts that the cosine similarities were computed for.',
		'b': 'The embeddings or texts that the cosine similarities were computed for.',
		'similarities': 'The similarities computed by the step.'
	},
)

property output: OutputDataset | OutputIterableDataset[source]#: The output dataset of the step.

class datadreamer.steps.Retrieve(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#

Bases: Step

Retrieves results for a set of queries with a Retriever.

Retrieve.help#

Retrieve(
	name = 'The name of the step.',
	inputs = {
		'queries': 'The queries to retrieve results for.'
	},
	args = {
		'retriever': 'The Retriever to use.',
		'k': 'How many results to retrieve. (optional, defaults to 5)',
		'lazy': 'Whether to run lazily or not. (optional, defaults to False)',
		'**kwargs': 'Any other arguments you want to pass to the .run() method of the Retriever. (optional)'
	},
	outputs = {
		'queries': 'The queries used to retrieve results.',
		'results': 'The results from the Retriever.'
	},
)

property output: OutputDataset | OutputIterableDataset[source]#: The output dataset of the step.

datadreamer.steps.concat(*steps, name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#

Concatenate the rows of the outputs of the input steps.

Parameters:

steps (Step) – The steps to concatenate.
name (Optional[str], default: None) – The name of the operation.
lazy (bool, default: True) – Whether to run the operation lazily.
progress_interval (UnionType[None, int, Default], default: DEFAULT) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the operation (ignore saved results).
writer_batch_size (Optional[int], default: 1000) – The batch size to use if saving to disk.
save_num_proc (UnionType[None, int, Default], default: DEFAULT) – The number of processes to use if saving to disk.
save_num_shards (UnionType[None, int, Default], default: DEFAULT) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.

Return type:

Step

Returns:

A new step with the operation applied.

datadreamer.steps.zipped(*steps, name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#

Zip the outputs of the input steps. This is similar to the built-in Python zip() function, essentially concatenating the columns of the outputs of the input steps.

Parameters:

steps (Step) – The steps to zip.
name (Optional[str], default: None) – The name of the operation.
lazy (bool, default: True) – Whether to run the operation lazily.
progress_interval (UnionType[None, int, Default], default: DEFAULT) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the operation (ignore saved results).
writer_batch_size (Optional[int], default: 1000) – The batch size to use if saving to disk.
save_num_proc (UnionType[None, int, Default], default: DEFAULT) – The number of processes to use if saving to disk.
save_num_shards (UnionType[None, int, Default], default: DEFAULT) – The number of shards on disk to save the dataset into.
background (bool, default: False) – Whether to run the operation in the background.

Return type:

Step

Returns:

A new step with the operation applied.

datadreamer.steps.wait(*steps, poll_interval=1.0)[source]#

Wait for all steps to complete if they are running in the background.

Parameters:: poll_interval (default: 1.0) – How often to poll in seconds.

datadreamer.steps.concurrent(*funcs)[source]#

Run a set of functions (which run steps) concurrently.

Parameters:: *funcs (Callable) – The functions to run concurrently.

class datadreamer.steps.DataCardType[source]#

Bases: object

The types of data card entries.

DATETIME = 'Date & Time'#

MODEL_NAME = 'Model Name'#

DATASET_NAME = 'Dataset Name'#

LICENSE = 'License Information'#

CITATION = 'Citation Information'#

DATASET_CARD = 'Dataset Card'#

MODEL_CARD = 'Model Card'#

URL = 'URL'#

class datadreamer.steps.LazyRows(value, total_num_rows=None, auto_progress=True, save=False, save_writer_batch_size=None)[source]#

Bases: object

A wrapper for lazy rows output from a step. See create your own steps for more details.

Parameters:

value (lazy output) – The lazy rows output.
total_num_rows (Optional[int], default: None) – The total number of rows being processed (helps with displaying progress).
auto_progress (bool, default: True) – Whether to automatically update the progress % for this step.
writer_batch_size – The batch size to use if saving to disk.

class datadreamer.steps.LazyRowBatches(value, total_num_rows=None, auto_progress=True, save=False, save_writer_batch_size=None)[source]#

Bases: object

A wrapper for lazy row batches output from a step. See create your own steps for more details.

Parameters:

value (lazy output) – The lazy row batches output.
total_num_rows (Optional[int], default: None) – The total number of rows being processed (helps with displaying progress).
auto_progress (bool, default: True) – Whether to automatically update the progress % for this step.
writer_batch_size – The batch size to use if saving to disk.

class datadreamer.steps.SuperStep(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#

Bases: Step

The class to subclass if you want to create a step that runs other steps. See create your own steps for more details.

property output: OutputDataset | OutputIterableDataset[source]#: The output dataset of the step.

steps#

Constructing a Step object#

Step inputs#

Step args#

Step outputs#

Renaming output columns#

Caching#

Types of Steps#

Standard Prompting

Data Generation Using Prompting

Validation & Judging with Prompting

Lazy Steps#

Convenience Methods#

Previewing a Step’s output#

Common Operations#

Running Steps in Parallel#

Data Card Generation#

Exporting and Publishing Datasets#

Constructing a `Step` object#

Step `inputs`#

Step `args`#

Step `outputs`#

Previewing a Step’s `output`#