steps#
Step
objects perform core operations within a DataDreamer session
transforming input data into output data. They are run within a DataDreamer session and
each step takes in data as a set of inputs
and configuration options as a set of
args
and then returns output data under the output
attribute. A
help message that describes what inputs
and args
a particular step takes in and
what output
it returns can be found by accessing the
help
string. All steps derive from the Step
base class.
You can create your own steps to implement custom logic or a new technique.
Constructing a Step
object#
Step inputs
#
Step inputs
are a set of named inputs that are passed to a step via a key-value
dictionary. The input data a step will operate on are given to a step via inputs
and
each input may be of type OutputDatasetColumn
or
OutputIterableDatasetColumn
. The input columns
passed to a step are typically output columns created by previous steps within a
DataDreamer session.
Step args
#
Step args
are a set of named arguments that are passed to a step via a key-value
dictionary. Arguments are non-input data parameters typically used to configure a stepβs
behavior. Arguments are typically of type bool
, int
, str
, or other Python
types. Some steps like Prompt
or Embed
take
LLM
objects Embedder
objects as args
.
Step outputs
#
Steps produce output data in the form of an output dataset object like
OutputDataset
or
OutputIterableDataset
accessible by the
output
attribute. To access a column on these output dataset objects
you can use the __getitem__
operator like so: step.output['column_name']
.
This will return a OutputDatasetColumn
or OutputIterableDatasetColumn
column object.
Renaming output columns#
If you want to rename the columns in the output dataset, at Step
construction you can pass a outputs
argument that is a dictionary mapping from
the original output column names to the new column names.
Caching#
Steps are cached and their outputs are saved to disk after running. Upon resuming a DataDreamer session, previously completed steps are loaded from disk.
Types of Steps#
DataDreamer comes with a set of built-in steps that can be used to create a workflow to process data, create synthetic datasets, augment existing datasets, and train models. We catalogue diferent types of steps below.
Data Source
Most steps take input data and produce output data. Data Source steps are special
in that they are an initial source of data that can be consumed as input by other
steps. This could be loading data from an online source, a file, or simply from an
in-memory list
or dict
.
Step Classes:
Prompting
There a few different types of prompting steps available. These steps all typically
take in a LLM
object as an argument in args
.
Standard Prompting
Standard prompting steps help you run prompts against an
LLM
.
Step Classes:
Data Generation Using Prompting
Some prompting steps are used to help generate completely synthetic data from prompts using an instruction.
Step Classes:
Validation & Judging with Prompting
Some prompting steps are used to help perform validation and judging tasks on a
set of inputs using an instruction with a LLM
.
Step Classes:
Other NLP Tasks
Training
See the trainers page for more information on training within a DataDreamer session.
Lazy Steps#
Some steps may run lazily. This means that the stepβs output will be lazily
computed as it is consumed by another step. This is useful for steps that work with
large datasets or require expensive computation. Lazy steps will return a
OutputIterableDataset
under their
output
attribute. At any point, you can use the
progress
attribute to get how much of a lazy stepβs output has been
computed so far.
Lazy steps do not save their output to disk. If you want to explictly save a lazy stepβs
ouput to disk, you can call save()
on the lazy step. If you have a series
of lazy steps chained together, you can call save()
on the last lazy
step to save the final transformed data to disk after all transformations have been
computed, skipping saving intermediate transformations to disk.
Convenience Methods#
Various convenience methods are available on Step
objects to help with
frequent tasks.
Previewing a Stepβs output
#
If you want to preview a stepβs output
, you can call the
head()
method to quickly retrieve a few rows from the stepβs output
dataset in the form of a pandas.DataFrame
object.
Common Operations#
A variety of common data operations and data transformations are available as methods on
Step
objects:
Basic Operations |
Column Operations |
Sequence Operations |
Save/Copy Operations |
Training Dataset Operations |
---|---|---|---|---|
In addition, there are higher-order functions that operate on multiple Step
objects that can be used to concatenate the output datasets of multiple steps together
or to combine the output dataset columns of multiple steps together (similar to
zip()
in Python):
Running Steps in Parallel#
See the Running Steps in Parallel page.
Data Card Generation#
An automatically generated data card can be viewed by calling
data_card()
. The data card can be helpful for reproducibility and for
sharing your work with others when published alongside your code. When
publishing datasets, the data card will be
published alongside the dataset.
The data card traces what steps were run to produce the stepβs output dataset, what
models were used, what paper citations and software licenses may apply, among other
useful information (see DataCardType
). Reproducibility
information such as the versions of packages used and a fingerprint hash (a signature of
all steps chained together to produce the final stepβs output dataset) is also included.
Exporting and Publishing Datasets#
You can export the output dataset produced by a step to disk by calling one of the various export methods available:
You can publish the output dataset produced by a step to the
Hugging Face Hub by calling
publish_to_hf_hub()
.
- class datadreamer.steps.Step(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#
Bases:
object
Base class for all steps.
- Parameters:
name (
str
) β The name of the step.inputs (
Optional
[dict
[str
,OutputDatasetColumn
|OutputIterableDatasetColumn
]], default:None
) β The inputs to the step.args (
Optional
[dict
[str
,Any
]], default:None
) β The args to the step.outputs (
Optional
[dict
[str
,str
]], default:None
) β The name mapping to rename outputs produced by the step.progress_interval (
Optional
[int
], default:60
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).verbose (
Optional
[bool
], default:None
) β Whether or not to print verbose logs.log_level (
Optional
[int
], default:None
) β The logging level to use (DEBUG
,INFO
, etc.).save_num_proc (
Optional
[int
], default:None
) β The number of processes to use if saving to disk.save_num_shards (
Optional
[int
], default:None
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.
- register_input(input_column_name, required=True, help=None)[source]#
Register an input for the step. See create your own steps for more details.
- register_arg(arg_name, required=True, default=None, help=None, default_help=None)[source]#
Register an argument for the step. See create your own steps for more details.
- Parameters:
arg_name (
str
) β The name of the argument.required (
bool
, default:True
) β Whether the argument is required.default (
Optional
[Any
], default:None
) β The default value of the argument.help (
Optional
[str
], default:None
) β The help string for the argument.default_help (
Optional
[str
], default:None
) β The help string for the default value of the argument.
- register_output(output_column_name, help=None)[source]#
Register an output for the step. See create your own steps for more details.
- register_data_card(data_card_type, data_card)[source]#
Register a data card for the step. See create your own steps for more details.
- property inputs: dict[str, OutputDatasetColumn | OutputIterableDatasetColumn][source]#
The inputs of the step.
- get_run_output_folder_path()[source]#
Get the run output folder path that can be used by the step for writing persistent data.
- Return type:
- pickle(value, *args, **kwargs)[source]#
Pickle a value so it can be stored in a row produced by this step. See create your own steps for more details.
- unpickle(value)[source]#
Unpickle a value that was stored in a row produced by this step with
pickle()
. See create your own steps for more details.
- property output: OutputDataset | OutputIterableDataset[source]#
The output dataset of the step.
- property dataset_path: str[source]#
The path to the stepβs output dataset on disk in HuggingFace
Dataset
format if the step has been saved to disk.
- head(n=5, shuffle=False, seed=None, buffer_size=1000)[source]#
Return the first
n
rows of the stepβs output as a pandasDataFrame
for easy viewing.- Parameters:
n (default:
5
) β The number of rows to return.shuffle (default:
False
) β Whether to shuffle the rows before taking the firstn
.seed (default:
None
) β The seed to use if shuffling.buffer_size (default:
1000
) β The buffer size to use if shuffling and the stepβs output is an iterable dataset.
- Return type:
- select(indices, name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#
Select rows from the stepβs output by their indices. See
select()
for more details.- Parameters:
indices (
Iterable
) β The indices of the rows to select.name (
Optional
[str
], default:None
) β The name of the operation.lazy (
bool
, default:True
) β Whether to run the operation lazily.progress_interval (
UnionType
[None
,int
,Default
], default:DEFAULT
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).writer_batch_size (
Optional
[int
], default:1000
) β The batch size to use if saving to disk.save_num_proc (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of processes to use if saving to disk.save_num_shards (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.
- Return type:
- Returns:
A new step with the operation applied.
- select_columns(column_names, name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#
Select columns from the stepβs output. See
select_columns()
for more details.- Parameters:
column_names (
str
|list
[str
]) β The names of the columns to select.name (
Optional
[str
], default:None
) β The name of the operation.lazy (
bool
, default:True
) β Whether to run the operation lazily.progress_interval (
UnionType
[None
,int
,Default
], default:DEFAULT
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).writer_batch_size (
Optional
[int
], default:1000
) β The batch size to use if saving to disk.save_num_proc (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of processes to use if saving to disk.save_num_shards (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.
- Return type:
- Returns:
A new step with the operation applied.
- take(n, name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#
Take the first
n
rows from the stepβs output. Seetake()
for more details.- Parameters:
n (
int
) β The number of rows to take.name (
Optional
[str
], default:None
) β The name of the operation.lazy (
bool
, default:True
) β Whether to run the operation lazily.progress_interval (
UnionType
[None
,int
,Default
], default:DEFAULT
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).writer_batch_size (
Optional
[int
], default:1000
) β The batch size to use if saving to disk.save_num_proc (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of processes to use if saving to disk.save_num_shards (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.
- Return type:
- Returns:
A new step with the operation applied.
- skip(n, name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#
Skip the first
n
rows from the stepβs output. Seeskip()
for more details.- Parameters:
n (
int
) β The number of rows to skip.name (
Optional
[str
], default:None
) β The name of the operation.lazy (
bool
, default:True
) β Whether to run the operation lazily.progress_interval (
UnionType
[None
,int
,Default
], default:DEFAULT
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).writer_batch_size (
Optional
[int
], default:1000
) β The batch size to use if saving to disk.save_num_proc (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of processes to use if saving to disk.save_num_shards (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.
- Return type:
- Returns:
A new step with the operation applied.
- shuffle(seed=None, buffer_size=1000, name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#
Shuffle the rows of the stepβs output. See
shuffle()
for more details.- Parameters:
seed (
Optional
[int
], default:None
) β The random seed to use for shuffling the stepβs output.buffer_size (
int
, default:1000
) β The buffer size to use for shuffling the dataset, if the stepβs output is anOutputIterableDataset
.name (
Optional
[str
], default:None
) β The name of the operation.lazy (
bool
, default:True
) β Whether to run the operation lazily.progress_interval (
UnionType
[None
,int
,Default
], default:DEFAULT
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).writer_batch_size (
Optional
[int
], default:1000
) β The batch size to use if saving to disk.save_num_proc (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of processes to use if saving to disk.save_num_shards (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.
- Return type:
- Returns:
A new step with the operation applied.
- sort(column_names, reverse=False, null_placement='at_end', name=None, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#
Sort the rows of the stepβs output. See
sort()
for more details.- Parameters:
column_names (
Union
[str
,Sequence
[str
]]) β The names of the columns to sort by.reverse (
Union
[bool
,Sequence
[bool
]], default:False
) β Whether to sort in reverse order.null_placement (
str
, default:'at_end'
) β Where to place null values in the sorted dataset.name (
Optional
[str
], default:None
) β The name of the operation.progress_interval (
UnionType
[None
,int
,Default
], default:DEFAULT
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).writer_batch_size (
Optional
[int
], default:1000
) β The batch size to use if saving to disk.save_num_proc (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of processes to use if saving to disk.save_num_shards (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.
- Return type:
- Returns:
A new step with the operation applied.
- add_item(item, name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#
Add a row to the stepβs output. See
add_item()
for more details.- Parameters:
item (
dict
) β The item to add to the stepβs output.name (
Optional
[str
], default:None
) β The name of the operation.lazy (
bool
, default:True
) β Whether to run the operation lazily.progress_interval (
UnionType
[None
,int
,Default
], default:DEFAULT
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).writer_batch_size (
Optional
[int
], default:1000
) β The batch size to use if saving to disk.save_num_proc (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of processes to use if saving to disk.save_num_shards (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.
- Return type:
- Returns:
A new step with the operation applied.
- map(function, with_indices=False, input_columns=None, batched=False, batch_size=1000, remove_columns=None, total_num_rows=None, auto_progress=True, name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#
Apply a function to the stepβs output. See
map()
for more details.- Parameters:
function (
Callable
) β The function to apply to rows of the stepβs output.with_indices (
bool
, default:False
) β Whether to pass the indices of the rows to the function.input_columns (
Union
[None
,str
,list
[str
]], default:None
) β The names of the columns to pass to the function.batched (
bool
, default:False
) β Whether to apply the function in batches.batch_size (
int
, default:1000
) β The batch size to use if applying the function in batches.remove_columns (
Union
[None
,str
,list
[str
]], default:None
) β The names of the columns to remove from the output.total_num_rows (
Optional
[int
], default:None
) β The total number of rows being processed (helps with displaying progress).auto_progress (
bool
, default:True
) β Whether to automatically update the progress % for this step.name (
Optional
[str
], default:None
) β The name of the operation.lazy (
bool
, default:True
) β Whether to run the operation lazily.progress_interval (
UnionType
[None
,int
,Default
], default:DEFAULT
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).writer_batch_size (
Optional
[int
], default:1000
) β The batch size to use if saving to disk.save_num_proc (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of processes to use if saving to disk.save_num_shards (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.
- Return type:
- Returns:
A new step with the operation applied.
- filter(function, with_indices=False, input_columns=None, batched=False, batch_size=1000, total_num_rows=None, auto_progress=True, name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#
Filter rows from the stepβs output. See
filter()
for more details.- Parameters:
function (
Callable
) β The function to use for filtering rows of the stepβs output.with_indices (
bool
, default:False
) β Whether to pass the indices of the rows to the function.input_columns (
Union
[None
,str
,list
[str
]], default:None
) β The names of the columns to pass to the function.batched (
bool
, default:False
) β Whether to apply the function in batches.batch_size (
int
, default:1000
) β The batch size to use if applying the function in batches.total_num_rows (
Optional
[int
], default:None
) β The total number of rows being processed (helps with displaying progress).auto_progress (
bool
, default:True
) β Whether to automatically update the progress % for this step.name (
Optional
[str
], default:None
) β The name of the operation.lazy (
bool
, default:True
) β Whether to run the operation lazily.progress_interval (
UnionType
[None
,int
,Default
], default:DEFAULT
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).writer_batch_size (
Optional
[int
], default:1000
) β The batch size to use if saving to disk.save_num_proc (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of processes to use if saving to disk.save_num_shards (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.
- Return type:
- Returns:
A new step with the operation applied.
- rename_column(original_column_name, new_column_name, name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#
Rename a column in the stepβs output. See
rename_column()
for more details.- Parameters:
original_column_name (
str
) β The original name of the column.new_column_name (
str
) β The new name of the column.name (
Optional
[str
], default:None
) β The name of the operation.lazy (
bool
, default:True
) β Whether to run the operation lazily.progress_interval (
UnionType
[None
,int
,Default
], default:DEFAULT
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).writer_batch_size (
Optional
[int
], default:1000
) β The batch size to use if saving to disk.save_num_proc (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of processes to use if saving to disk.save_num_shards (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.
- Return type:
- Returns:
A new step with the operation applied.
- rename_columns(column_mapping, name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#
Rename columns in the stepβs output. See
rename_columns()
for more details.- Parameters:
column_mapping (
dict
[str
,str
]) β The mapping of original column names to new column names.name (
Optional
[str
], default:None
) β The name of the operation.lazy (
bool
, default:True
) β Whether to run the operation lazily.progress_interval (
UnionType
[None
,int
,Default
], default:DEFAULT
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).writer_batch_size (
Optional
[int
], default:1000
) β The batch size to use if saving to disk.save_num_proc (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of processes to use if saving to disk.save_num_shards (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.
- Return type:
- Returns:
A new step with the operation applied.
- remove_columns(column_names, name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#
Remove columns from the stepβs output. See
remove_columns()
for more details.- Parameters:
column_names (
str
|list
[str
]) β The names of the columns to remove.name (
Optional
[str
], default:None
) β The name of the operation.lazy (
bool
, default:True
) β Whether to run the operation lazily.progress_interval (
UnionType
[None
,int
,Default
], default:DEFAULT
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).writer_batch_size (
Optional
[int
], default:1000
) β The batch size to use if saving to disk.save_num_proc (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of processes to use if saving to disk.save_num_shards (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.
- Return type:
- Returns:
A new step with the operation applied.
- splits(train_size=None, validation_size=None, test_size=None, stratify_by_column=None, name=None, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#
Split the stepβs output into multiple splits for training, validation, and testing. If
train_size
orvalidation_size
ortest_size
is not specified, the corresponding split will not be created.- Parameters:
train_size (
Union
[None
,float
,int
], default:None
) β The size of the training split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the training split. If an int, should be the number of rows to include in the training split.validation_size (
Union
[None
,float
,int
], default:None
) β The size of the validation split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the validation split. If an int, should be the number of rows to include in the validation split.test_size (
Union
[None
,float
,int
], default:None
) β The size of the test split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If an int, should be the number of rows to include in the test split.stratify_by_column (
Optional
[str
], default:None
) β The name of the column to use to stratify equally between splits.name (
Optional
[str
], default:None
) β The name of the operation.progress_interval (
UnionType
[None
,int
,Default
], default:DEFAULT
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).writer_batch_size (
Optional
[int
], default:1000
) β The batch size to use if saving to disk.save_num_proc (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of processes to use if saving to disk.save_num_shards (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.
- Return type:
- Returns:
A dictionary where the keys are the names of the splits and the values are new steps with the split applied.
- shard(num_shards, index, contiguous=False, name=None, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#
Shard the stepβs output into multiple shards. See
shard()
for more details.- Parameters:
num_shards (
int
) β The number of shards to split the dataset into.index (
int
) β The index of the shard to select.contiguous (
bool
, default:False
) β Whether to select contiguous blocks of indicies for shards.name (
Optional
[str
], default:None
) β The name of the operation.progress_interval (
UnionType
[None
,int
,Default
], default:DEFAULT
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).writer_batch_size (
Optional
[int
], default:1000
) β The batch size to use if saving to disk.save_num_proc (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of processes to use if saving to disk.save_num_shards (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.
- Return type:
- Returns:
A new step with the operation applied.
- reverse(name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#
Reverse the rows of the stepβs output.
- Parameters:
name (
Optional
[str
], default:None
) β The name of the operation.lazy (
bool
, default:True
) β Whether to run the operation lazily.progress_interval (
UnionType
[None
,int
,Default
], default:DEFAULT
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).writer_batch_size (
Optional
[int
], default:1000
) β The batch size to use if saving to disk.save_num_proc (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of processes to use if saving to disk.save_num_shards (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.
- Return type:
- Returns:
A new step with the operation applied.
- save(name=None, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#
Save the stepβs output to disk.
- Parameters:
name (
Optional
[str
], default:None
) β The name of the operation.progress_interval (
UnionType
[None
,int
,Default
], default:DEFAULT
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).writer_batch_size (
Optional
[int
], default:1000
) β The batch size to use if saving to disk.save_num_proc (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of processes to use if saving to disk.save_num_shards (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.
- Return type:
- Returns:
A new step with the operation applied.
- copy(name=None, lazy=None, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#
Create a copy of the stepβs output.
- Parameters:
name (
Optional
[str
], default:None
) β The name of the operation.lazy (
Optional
[bool
], default:None
) β Whether to run the operation lazily.progress_interval (
UnionType
[None
,int
,Default
], default:DEFAULT
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).writer_batch_size (
Optional
[int
], default:1000
) β The batch size to use if saving to disk.save_num_proc (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of processes to use if saving to disk.save_num_shards (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.
- Return type:
- Returns:
A new step with the operation applied.
- export_to_dict(train_size=None, validation_size=None, test_size=None, stratify_by_column=None, writer_batch_size=1000, save_num_proc=None, save_num_shards=None)[source]#
Export the stepβs output to a dictionary and optionally create splits. See
splits()
for more details on splits behavior.- Parameters:
train_size (
Union
[None
,float
,int
], default:None
) β The size of the training split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the training split. If an int, should be the number of rows to include in the training split.validation_size (
Union
[None
,float
,int
], default:None
) β The size of the validation split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the validation split. If an int, should be the number of rows to include in the validation split.test_size (
Union
[None
,float
,int
], default:None
) β The size of the test split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If an int, should be the number of rows to include in the test split.stratify_by_column (
Optional
[str
], default:None
) β The name of the column to use to stratify equally between splits.writer_batch_size (
Optional
[int
], default:1000
) β The batch size to use if saving to disk.save_num_proc (
Optional
[int
], default:None
) β The number of processes to use if saving to disk.save_num_shards (
Optional
[int
], default:None
) β The number of shards on disk to save the dataset into.
- Return type:
- Returns:
The stepβs output as a dictionary.
- export_to_list(train_size=None, validation_size=None, test_size=None, stratify_by_column=None, writer_batch_size=1000, save_num_proc=None, save_num_shards=None)[source]#
Export the stepβs output to a list and optionally create splits. See
splits()
for more details on splits behavior.- Parameters:
train_size (
Union
[None
,float
,int
], default:None
) β The size of the training split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the training split. If an int, should be the number of rows to include in the training split.validation_size (
Union
[None
,float
,int
], default:None
) β The size of the validation split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the validation split. If an int, should be the number of rows to include in the validation split.test_size (
Union
[None
,float
,int
], default:None
) β The size of the test split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If an int, should be the number of rows to include in the test split.stratify_by_column (
Optional
[str
], default:None
) β The name of the column to use to stratify equally between splits.writer_batch_size (
Optional
[int
], default:1000
) β The batch size to use if saving to disk.save_num_proc (
Optional
[int
], default:None
) β The number of processes to use if saving to disk.save_num_shards (
Optional
[int
], default:None
) β The number of shards on disk to save the dataset into.
- Return type:
- export_to_json(path, train_size=None, validation_size=None, test_size=None, stratify_by_column=None, writer_batch_size=1000, save_num_proc=None, save_num_shards=None, **to_json_kwargs)[source]#
Export the stepβs output to a JSON file and optionally create splits. See
splits()
for more details on splits behavior.- Parameters:
path (
str
) β The path to save the JSON file to.train_size (
Union
[None
,float
,int
], default:None
) β The size of the training split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the training split. If an int, should be the number of rows to include in the training split.validation_size (
Union
[None
,float
,int
], default:None
) β The size of the validation split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the validation split. If an int, should be the number of rows to include in the validation split.test_size (
Union
[None
,float
,int
], default:None
) β The size of the test split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If an int, should be the number of rows to include in the test split.stratify_by_column (
Optional
[str
], default:None
) β The name of the column to use to stratify equally between splits.writer_batch_size (
Optional
[int
], default:1000
) β The batch size to use if saving to disk.save_num_proc (
Optional
[int
], default:None
) β The number of processes to use if saving to disk.save_num_shards (
Optional
[int
], default:None
) β The number of shards on disk to save the dataset into.to_json_kwargs β Additional keyword arguments to pass to
to_json()
.
- Return type:
- Returns:
The path to the JSON file or a dictionary of paths if creating splits.
- export_to_csv(path, sep=',', train_size=None, validation_size=None, test_size=None, stratify_by_column=None, writer_batch_size=1000, save_num_proc=None, save_num_shards=None, **to_csv_kwargs)[source]#
Export the stepβs output to a CSV file and optionally create splits. See
splits()
for more details on splits behavior.- Parameters:
path (
str
) β The path to save the CSV file to.sep (default:
','
) β The delimiter to use for the CSV file.train_size (
Union
[None
,float
,int
], default:None
) β The size of the training split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the training split. If an int, should be the number of rows to include in the training split.validation_size (
Union
[None
,float
,int
], default:None
) β The size of the validation split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the validation split. If an int, should be the number of rows to include in the validation split.test_size (
Union
[None
,float
,int
], default:None
) β The size of the test split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If an int, should be the number of rows to include in the test split.stratify_by_column (
Optional
[str
], default:None
) β The name of the column to use to stratify equally between splits.writer_batch_size (
Optional
[int
], default:1000
) β The batch size to use if saving to disk.save_num_proc (
Optional
[int
], default:None
) β The number of processes to use if saving to disk.save_num_shards (
Optional
[int
], default:None
) β The number of shards on disk to save the dataset into.to_csv_kwargs β Additional keyword arguments to pass to
to_csv()
.
- Return type:
- Returns:
The path to the CSV file or a dictionary of paths if creating splits.
- export_to_hf_dataset(path, train_size=None, validation_size=None, test_size=None, stratify_by_column=None, writer_batch_size=1000, save_num_proc=None, save_num_shards=None)[source]#
Export the stepβs output to a Hugging Face
Dataset
and optionally create splits. Seesplits()
for more details on splits behavior.- Parameters:
path (
str
) β The path to save the Hugging FaceDataset
folder to.train_size (
Union
[None
,float
,int
], default:None
) β The size of the training split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the training split. If an int, should be the number of rows to include in the training split.validation_size (
Union
[None
,float
,int
], default:None
) β The size of the validation split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the validation split. If an int, should be the number of rows to include in the validation split.test_size (
Union
[None
,float
,int
], default:None
) β The size of the test split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If an int, should be the number of rows to include in the test split.stratify_by_column (
Optional
[str
], default:None
) β The name of the column to use to stratify equally between splits.writer_batch_size (
Optional
[int
], default:1000
) β The batch size to use if saving to disk.save_num_proc (
Optional
[int
], default:None
) β The number of processes to use if saving to disk.save_num_shards (
Optional
[int
], default:None
) β The number of shards on disk to save the dataset into.
- Return type:
Dataset
|DatasetDict
- Returns:
The stepβs output as a Hugging Face
Dataset
orDatasetDict
if creating splits.
- publish_to_hf_hub(repo_id, branch=None, private=False, token=None, train_size=None, validation_size=None, test_size=None, stratify_by_column=None, writer_batch_size=1000, save_num_proc=None, save_num_shards=None, is_synthetic=True, **kwargs)[source]#
Publish the stepβs output to the Hugging Face Hub as a dataset and optionally create splits. See
splits()
for more details on splits behavior. Seepush_to_hub()
for more details on publishing.- Parameters:
repo_id (
str
) β The repository ID to publish the dataset to.branch (
Optional
[str
], default:None
) β The branch to push the dataset to.private (
bool
, default:False
) β Whether to make the dataset private.token (
Optional
[str
], default:None
) β The Hugging Face API token to use for authentication.train_size (
Union
[None
,float
,int
], default:None
) β The size of the training split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the training split. If an int, should be the number of rows to include in the training split.validation_size (
Union
[None
,float
,int
], default:None
) β The size of the validation split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the validation split. If an int, should be the number of rows to include in the validation split.test_size (
Union
[None
,float
,int
], default:None
) β The size of the test split. If a float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If an int, should be the number of rows to include in the test split.stratify_by_column (
Optional
[str
], default:None
) β The name of the column to use to stratify equally between splits.writer_batch_size (
Optional
[int
], default:1000
) β The batch size to use if saving to disk.save_num_proc (
Optional
[int
], default:None
) β The number of processes to use if saving to disk.save_num_shards (
Optional
[int
], default:None
) β The number of shards on disk to save the dataset into.is_synthetic (
bool
, default:True
) β Whether the dataset is synthetic (applies certain metadata when publishing).**kwargs β Additional keyword arguments to pass to
push_to_hub()
.
- Return type:
- Returns:
The URL to the published dataset.
- class datadreamer.steps.DataSource(name, data, total_num_rows=None, auto_progress=True, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False, **kwargs)[source]#
Bases:
Step
Loads a dataset from a in-memory Python object.
- Parameters:
name (
str
) β The name of the step.data (Any) β The data to load as a dataset. See the valid return formats for more details on acceptable data formats.
total_num_rows (
Optional
[int
], default:None
) β The total number of rows being processed (helps with displaying progress, if the data is being lazily loaded).auto_progress (
bool
, default:True
) β Whether to automatically update the progress % for this step.progress_interval (
Optional
[int
], default:60
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).verbose (
Optional
[bool
], default:None
) β Whether or not to print verbose logs.log_level (
Optional
[int
], default:None
) β The logging level to use (DEBUG
,INFO
, etc.).save_num_proc (
Optional
[int
], default:None
) β The number of processes to use if saving to disk.save_num_shards (
Optional
[int
], default:None
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.**config_kwargs β Additional keyword arguments to pass to
datasets.load_dataset()
.
- property output: OutputDataset | OutputIterableDataset[source]#
The output dataset of the step.
- class datadreamer.steps.HFHubDataSource(name, path, config_name=None, split=None, revision=None, trust_remote_code=False, streaming=False, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False, **config_kwargs)[source]#
Bases:
DataSource
Loads a dataset from the Hugging Face Hub. See
datasets.load_dataset()
for more details.- Parameters:
name (
str
) β The name of the step.path (
str
) β The path to the dataset on the Hugging Face Hub.config_name (
Optional
[str
], default:None
) β The name of the dataset configuration to load.split (
Union
[None
,str
,Split
], default:None
) β The split to load.revision (
Union
[None
,str
,Version
], default:None
) β The version (commit hash) of the dataset to load.trust_remote_code (
bool
, default:False
) β Whether to trust the remote code.streaming (
bool
, default:False
) β Whether to load the dataset in streaming mode.progress_interval (
Optional
[int
], default:60
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).verbose (
Optional
[bool
], default:None
) β Whether or not to print verbose logs.log_level (
Optional
[int
], default:None
) β The logging level to use (DEBUG
,INFO
, etc.).save_num_proc (
Optional
[int
], default:None
) β The number of processes to use if saving to disk.save_num_shards (
Optional
[int
], default:None
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.
- property output: OutputDataset | OutputIterableDataset[source]#
The output dataset of the step.
- class datadreamer.steps.HFDatasetDataSource(name, dataset_path, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#
Bases:
DataSource
Loads a Hugging Face
Dataset
from a local path. Seedatasets.load_from_disk()
for more details.- Parameters:
name (
str
) β The name of the step.dataset_path (
str
) β The path to thedatasets.Dataset
folder.progress_interval (
Optional
[int
], default:60
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).verbose (
Optional
[bool
], default:None
) β Whether or not to print verbose logs.log_level (
Optional
[int
], default:None
) β The logging level to use (DEBUG
,INFO
, etc.).save_num_proc (
Optional
[int
], default:None
) β The number of processes to use if saving to disk.save_num_shards (
Optional
[int
], default:None
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.
Loads a dataset from a in-memory Python object.
- Parameters:
name (
str
) β The name of the step.data (Any) β
The data to load as a dataset. See the valid return formats for more details on acceptable data formats.
total_num_rows β The total number of rows being processed (helps with displaying progress, if the data is being lazily loaded).
auto_progress β Whether to automatically update the progress % for this step.
progress_interval (
Optional
[int
], default:60
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).verbose (
Optional
[bool
], default:None
) β Whether or not to print verbose logs.log_level (
Optional
[int
], default:None
) β The logging level to use (DEBUG
,INFO
, etc.).save_num_proc (
Optional
[int
], default:None
) β The number of processes to use if saving to disk.save_num_shards (
Optional
[int
], default:None
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.**config_kwargs β Additional keyword arguments to pass to
datasets.load_dataset()
.
- property output: OutputDataset | OutputIterableDataset[source]#
The output dataset of the step.
- class datadreamer.steps.JSONDataSource(name, data_folder=None, data_files=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False, **config_kwargs)[source]#
Bases:
DataSource
Loads a JSON dataset from a local path. See
datasets.load_dataset()
for more details.- Parameters:
name (
str
) β The name of the step.data_folder (
Optional
[str
], default:None
) β The path to the dataset folder.data_files (
Union
[None
,str
,Sequence
[str
]], default:None
) β The name of files from the folder to load.progress_interval (
Optional
[int
], default:60
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).verbose (
Optional
[bool
], default:None
) β Whether or not to print verbose logs.log_level (
Optional
[int
], default:None
) β The logging level to use (DEBUG
,INFO
, etc.).save_num_proc (
Optional
[int
], default:None
) β The number of processes to use if saving to disk.save_num_shards (
Optional
[int
], default:None
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.**config_kwargs β Additional keyword arguments to pass to
datasets.load_dataset()
.
Loads a dataset from a in-memory Python object.
- Parameters:
name (
str
) β The name of the step.data (Any) β
The data to load as a dataset. See the valid return formats for more details on acceptable data formats.
total_num_rows β The total number of rows being processed (helps with displaying progress, if the data is being lazily loaded).
auto_progress β Whether to automatically update the progress % for this step.
progress_interval (
Optional
[int
], default:60
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).verbose (
Optional
[bool
], default:None
) β Whether or not to print verbose logs.log_level (
Optional
[int
], default:None
) β The logging level to use (DEBUG
,INFO
, etc.).save_num_proc (
Optional
[int
], default:None
) β The number of processes to use if saving to disk.save_num_shards (
Optional
[int
], default:None
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.**config_kwargs β Additional keyword arguments to pass to
datasets.load_dataset()
.
- property output: OutputDataset | OutputIterableDataset[source]#
The output dataset of the step.
- class datadreamer.steps.CSVDataSource(name, data_folder=None, data_files=None, sep=',', progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False, **config_kwargs)[source]#
Bases:
DataSource
Loads a CSV dataset from a local path. See
datasets.load_dataset()
for more details.- Parameters:
name (
str
) β The name of the step.data_folder (
Optional
[str
], default:None
) β The path to the dataset folder.data_files (
Union
[None
,str
,Sequence
[str
]], default:None
) β The name of files from the folder to load.progress_interval (
Optional
[int
], default:60
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).verbose (
Optional
[bool
], default:None
) β Whether or not to print verbose logs.log_level (
Optional
[int
], default:None
) β The logging level to use (DEBUG
,INFO
, etc.).save_num_proc (
Optional
[int
], default:None
) β The number of processes to use if saving to disk.save_num_shards (
Optional
[int
], default:None
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.**config_kwargs β Additional keyword arguments to pass to
datasets.load_dataset()
.
Loads a dataset from a in-memory Python object.
- Parameters:
name (
str
) β The name of the step.data (Any) β
The data to load as a dataset. See the valid return formats for more details on acceptable data formats.
total_num_rows β The total number of rows being processed (helps with displaying progress, if the data is being lazily loaded).
auto_progress β Whether to automatically update the progress % for this step.
progress_interval (
Optional
[int
], default:60
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).verbose (
Optional
[bool
], default:None
) β Whether or not to print verbose logs.log_level (
Optional
[int
], default:None
) β The logging level to use (DEBUG
,INFO
, etc.).save_num_proc (
Optional
[int
], default:None
) β The number of processes to use if saving to disk.save_num_shards (
Optional
[int
], default:None
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.**config_kwargs β Additional keyword arguments to pass to
datasets.load_dataset()
.
- property output: OutputDataset | OutputIterableDataset[source]#
The output dataset of the step.
- class datadreamer.steps.TextDataSource(name, data_folder=None, data_files=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False, **config_kwargs)[source]#
Bases:
DataSource
Loads a text dataset from a local path. See
datasets.load_dataset()
for more details.- Parameters:
name (
str
) β The name of the step.data_folder (
Optional
[str
], default:None
) β The path to the dataset folder.data_files (
Union
[None
,str
,Sequence
[str
]], default:None
) β The name of files from the folder to load.progress_interval (
Optional
[int
], default:60
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).verbose (
Optional
[bool
], default:None
) β Whether or not to print verbose logs.log_level (
Optional
[int
], default:None
) β The logging level to use (DEBUG
,INFO
, etc.).save_num_proc (
Optional
[int
], default:None
) β The number of processes to use if saving to disk.save_num_shards (
Optional
[int
], default:None
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.**config_kwargs β Additional keyword arguments to pass to
datasets.load_dataset()
.
Loads a dataset from a in-memory Python object.
- Parameters:
name (
str
) β The name of the step.data (Any) β
The data to load as a dataset. See the valid return formats for more details on acceptable data formats.
total_num_rows β The total number of rows being processed (helps with displaying progress, if the data is being lazily loaded).
auto_progress β Whether to automatically update the progress % for this step.
progress_interval (
Optional
[int
], default:60
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the step (ignore saved results).verbose (
Optional
[bool
], default:None
) β Whether or not to print verbose logs.log_level (
Optional
[int
], default:None
) β The logging level to use (DEBUG
,INFO
, etc.).save_num_proc (
Optional
[int
], default:None
) β The number of processes to use if saving to disk.save_num_shards (
Optional
[int
], default:None
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.**config_kwargs β Additional keyword arguments to pass to
datasets.load_dataset()
.
- property output: OutputDataset | OutputIterableDataset[source]#
The output dataset of the step.
- class datadreamer.steps.Prompt(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#
Bases:
_PromptBase
Runs a set of prompts against a
LLM
.Prompt( name = 'The name of the step.', inputs = { 'prompts': 'The prompts to process with the LLM.' }, args = { 'llm': 'The LLM to use.', 'post_process': 'A function to post-process the generations. (optional)', 'lazy': 'Whether to run lazily or not. (optional, defaults to False)', '**kwargs': 'Any other arguments you want to pass to the .run() method of the LLM. (optional)' }, outputs = { 'prompts': 'The prompts processed with the LLM.', 'generations': 'The generations by the LLM.' }, )
- property output: OutputDataset | OutputIterableDataset[source]#
The output dataset of the step.
- class datadreamer.steps.RAGPrompt(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#
Bases:
_PromptBase
,SuperStep
Processes a set of prompts using in-context texts with a
LLM
. Thek
most relevant texts are retrieved for each prompt using aRetriever
.RAGPrompt( name = 'The name of the step.', inputs = { 'prompts': 'The prompts to process with the LLM.' }, args = { 'llm': 'The LLM to use.', 'retriever': 'The retriever to use to retrieve in-context texts.', 'k': 'The number of texts to retrieve.', 'retrieved_text_label': "The label to use for retrieved texts. (optional, defaults to 'Document:')", 'prompt_label': "The label to use for the prompt. (optional, defaults to 'Question:')", 'max_new_tokens': 'The maximum number of tokens to generate. (optional)', 'sep': "The separator to use between in-context retrieved texts and each prompt. (optional, defaults to '\\n')", 'min_in_context_retrieved_texts': 'The minimum number of in-context retrieved texts to include. (optional)', 'max_in_context_retrieved_texts': 'The maximum number of in-context retrieved texts to include. (optional)', 'post_process': 'A function to post-process the generations. (optional)', 'lazy': 'Whether to run lazily or not. (optional, defaults to False)', '**kwargs': 'Any other arguments you want to pass to the .run() method of the LLM. (optional)' }, outputs = { 'prompts': 'The prompts processed with the LLM.', 'generations': 'The generations by the LLM.' }, )
- property output: OutputDataset | OutputIterableDataset[source]#
The output dataset of the step.
- class datadreamer.steps.ProcessWithPrompt(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#
Bases:
_PromptBase
Processes a set of inputs using an instruction with a
LLM
.ProcessWithPrompt( name = 'The name of the step.', inputs = { 'inputs': 'The inputs to process with the LLM.' }, args = { 'llm': 'The LLM to use.', 'instruction': 'An instruction that describes how to process the input.', 'input_label': "The label to use for inputs. (optional, defaults to 'Input:')", 'instruction_label': "The label to use for the instruction. (optional, defaults to 'Instruction:')", 'max_new_tokens': 'The maximum number of tokens to generate. (optional)', 'sep': "The separator to use between instructions and the input. (optional, defaults to '\\n\\n')", 'post_process': 'A function to post-process the generations. (optional)', 'lazy': 'Whether to run lazily or not. (optional, defaults to False)', '**kwargs': 'Any other arguments you want to pass to the .run() method of the LLM. (optional)' }, outputs = { 'inputs': 'The inputs processed with the LLM.', 'prompts': 'The prompts processed with the LLM.', 'generations': 'The generations by the LLM.' }, )
- property output: OutputDataset | OutputIterableDataset[source]#
The output dataset of the step.
- class datadreamer.steps.FewShotPrompt(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#
Bases:
_PromptBase
Processes a set of inputs using in-context examples and an instruction with a
LLM
.FewShotPrompt( name = 'The name of the step.', inputs = { 'input_examples': 'The in-context example inputs to include in the prompt.', 'output_examples': 'The in-context example outputs to include in the prompt.', 'inputs': 'The inputs to process with the LLM.' }, args = { 'llm': 'The LLM to use.', 'input_label': "The label to use for inputs. (optional, defaults to 'Input:')", 'output_label': "The label to use for outputs. (optional, defaults to 'Output:')", 'max_new_tokens': 'The maximum number of tokens to generate. (optional)', 'instruction': 'An instruction to include in the prompt. (optional)', 'sep': "The separator to use between instructions and in-context examples. (optional, defaults to '\\n')", 'min_in_context_examples': 'The minimum number of in-context examples to include. (optional)', 'max_in_context_examples': 'The maximum number of in-context examples to include. (optional)', 'post_process': 'A function to post-process the generations. (optional)', 'lazy': 'Whether to run lazily or not. (optional, defaults to False)', '**kwargs': 'Any other arguments you want to pass to the .run() method of the LLM. (optional)' }, outputs = { 'inputs': 'The inputs processed with the LLM.', 'prompts': 'The prompts processed with the LLM.', 'generations': 'The generations by the LLM.' }, )
- property output: OutputDataset | OutputIterableDataset[source]#
The output dataset of the step.
- class datadreamer.steps.FewShotPromptWithRetrieval(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#
Bases:
FewShotPrompt
,SuperStep
Processes a set of inputs using in-context examples and an instruction with a
LLM
. Thek
most relevant in-context examples are retrieved for each input using anEmbedder
.FewShotPromptWithRetrieval( name = 'The name of the step.', inputs = { 'input_examples': 'The in-context example inputs to retrieve from to include in the prompt.', 'output_examples': 'The in-context example outputs to retrieve from to include in the prompt.', 'inputs': 'The inputs to process with the LLM.' }, args = { 'llm': 'The LLM to use.', 'embedder': 'The embedder to use to retrieve in-context examples.', 'k': 'The number of in-context examples to retrieve.', 'input_label': "The label to use for inputs. (optional, defaults to 'Input:')", 'output_label': "The label to use for outputs. (optional, defaults to 'Output:')", 'max_new_tokens': 'The maximum number of tokens to generate. (optional)', 'instruction': 'An instruction to include in the prompt. (optional)', 'sep': "The separator to use between instructions and in-context examples. (optional, defaults to '\\n')", 'min_in_context_examples': 'The minimum number of in-context examples to include. (optional)', 'max_in_context_examples': 'The maximum number of in-context examples to include. (optional)', 'post_process': 'A function to post-process the generations. (optional)', 'lazy': 'Whether to run lazily or not. (optional, defaults to False)', '**kwargs': 'Any other arguments you want to pass to the .run() method of the LLM. (optional)' }, outputs = { 'inputs': 'The inputs processed with the LLM.', 'prompts': 'The prompts processed with the LLM.', 'generations': 'The generations by the LLM.' }, )
- property output: OutputDataset | OutputIterableDataset[source]#
The output dataset of the step.
- class datadreamer.steps.DataFromPrompt(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#
Bases:
_PromptBase
Generates
n
rows of data using an instruction with aLLM
.DataFromPrompt( name = 'The name of the step.', args = { 'llm': 'The LLM to use.', 'instruction': 'The instruction to use to generate data.', 'n': 'The number of rows to generate from the prompt.', 'temperature': 'The temperature to use when generating data. (optional, defaults to 1.0)', 'top_p': 'The top_p to use when generating data. (optional, defaults to 1.0)', 'post_process': 'A function to post-process the generations. (optional)', 'lazy': 'Whether to run lazily or not. (optional, defaults to False)', '**kwargs': 'Any other arguments you want to pass to the .run() method of the LLM. (optional)' }, outputs = { 'prompts': 'The prompts processed with the LLM.', 'generations': 'The generations by the LLM.' }, )
- property output: OutputDataset | OutputIterableDataset[source]#
The output dataset of the step.
- class datadreamer.steps.DataFromAttributedPrompt(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#
Bases:
_PromptBase
Generates
n
rows of data using an attributed instruction with aLLM
. See the AttrPrompt paper for more information.Format of the
instruction
andattributes
argumentsThe
instruction
argument is a string with templated variables representing attributes, for example:instruction = "Generate a {adjective} sentence that is {length}."
Then, the
attributes
argument is a dictionary of lists, where the keys are the attribute names and the values are the possible values for the attribute, for example:attributes = { "adjective": ["serious", "funny"], "length": ["short", "long"], }
So all combinations of attributes will be used to generate data, by replacing the templated variables in the instruction with the attribute values to create 4 distinct attributed prompts to use:
βGenerate a serious sentence that is short.β
βGenerate a serious sentence that is long.β
βGenerate a funny sentence that is short.β
βGenerate a funny sentence that is long.β
If you want to directly specify the combinations of attributes, without automatically using all possible combinations, you can pass in a list of dictionaries to the
attributes
argument instead. Then you can directly specify which combinations should be used to create the attributed prompts, for example:attributes = [ {"adjective": "serious", "length": "short"}, {"adjective": "funny", "length": "short"}, {"adjective": "funny", "length": "long"}, ]
With this specification of
attributes
, only 3 attributed prompts will be used instead of 4.DataFromAttributedPrompt( name = 'The name of the step.', args = { 'llm': 'The LLM to use.', 'instruction': 'The attributed instruction to use to generate data.', 'attributes': 'The attributes to use in the instruction.', 'n': 'The number of rows to generate from the prompt.', 'temperature': 'The temperature to use when generating data. (optional, defaults to 1.0)', 'top_p': 'The top_p to use when generating data. (optional, defaults to 1.0)', 'post_process': 'A function to post-process the generations. (optional)', 'lazy': 'Whether to run lazily or not. (optional, defaults to False)', '**kwargs': 'Any other arguments you want to pass to the .run() method of the LLM. (optional)' }, outputs = { 'attributes': 'The attributes used to generate the data.', 'prompts': 'The prompts processed with the LLM.', 'generations': 'The generations by the LLM.' }, )
- property output: OutputDataset | OutputIterableDataset[source]#
The output dataset of the step.
- class datadreamer.steps.FilterWithPrompt(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#
Bases:
_PromptBase
,SuperStep
Filters a set of inputs using an instruction with a
LLM
andfilter_func
which takes in a generation produced by the instruction and returns abool
on whether to keep the row or not.FilterWithPrompt( name = 'The name of the step.', inputs = { 'inputs': 'The inputs to process with the LLM.' }, args = { 'llm': 'The LLM to use.', 'instruction': 'An instruction that describes how to process the input.', 'filter_func': "A function to filter the generations with. (optional, defaults to filtering by parsing 'Yes' / 'No' from the generation)", 'input_label': "The label to use for inputs. (optional, defaults to 'Input:')", 'instruction_label': "The label to use for the instruction. (optional, defaults to 'Instruction:')", 'max_new_tokens': 'The maximum number of tokens to generate. (optional)', 'sep': "The separator to use between instructions and the input. (optional, defaults to '\\n\\n')", 'post_process': 'A function to post-process the generations. (optional)', 'lazy': 'Whether to run lazily or not. (optional, defaults to False)', '**kwargs': 'Any other arguments you want to pass to the .run() method of the LLM. (optional)' }, outputs = { '*columns': "All of the columns of the step producing the 'inputs'." }, )
- property output: OutputDataset | OutputIterableDataset[source]#
The output dataset of the step.
- class datadreamer.steps.RankWithPrompt(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#
Bases:
_PromptBase
,SuperStep
Produces scores for a set of inputs using an instruction with a
LLM
. Scores are parsed from the generations produced by the instruction. Optionally, the input rows are sorted and filtered using the scores.RankWithPrompt( name = 'The name of the step.', inputs = { 'inputs': 'The inputs to process with the LLM.' }, args = { 'llm': 'The LLM to use.', 'instruction': 'An instruction that describes how to process the input.', 'sort': 'Whether or not to sort the inputs by the score produced by the instruction. (optional, defaults to True)', 'reverse': 'Whether or not to reverse the sort direction. (optional, defaults to True)', 'score_threshold': 'A score threshold. If specified, filter out input rows that scored below the threshold. (optional)', 'input_label': "The label to use for inputs. (optional, defaults to 'Input:')", 'instruction_label': "The label to use for the instruction. (optional, defaults to 'Instruction:')", 'max_new_tokens': 'The maximum number of tokens to generate. (optional)', 'sep': "The separator to use between instructions and the input. (optional, defaults to '\\n\\n')", 'post_process': 'A function to post-process the generations. (optional)', 'lazy': 'Whether to run lazily or not. (optional, defaults to False)', '**kwargs': 'Any other arguments you want to pass to the .run() method of the LLM. (optional)' }, outputs = { '*columns': "All of the columns of the step producing the 'inputs'.", 'scores': 'The scores produced by the instruction on the inputs.' }, )
- property output: OutputDataset | OutputIterableDataset[source]#
The output dataset of the step.
- class datadreamer.steps.JudgePairsWithPrompt(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#
Bases:
_PromptBase
Judges between a of pair of inputs (
a
,b
) using an instruction with aLLM
.JudgePairsWithPrompt( name = 'The name of the step.', inputs = { 'a': 'A set of inputs to judge between.', 'b': 'A set of inputs to judge between.' }, args = { 'llm': 'The LLM to use.', 'instruction': 'An instruction that describes how to judge the input. (optional, defaults to "Which input is better? Respond with \'Input A\' or \'Input B\'.")', 'judgement_func': "A function to get the judgement from the generation. (optional, defaults to judging by parsing 'a_label' / 'b_label' ('Input A' / 'Input B') from the generation)", 'randomize_order': "Whether to randomly swap the order of 'a' and 'b' in the prompt to mitigate position bias. (optional, defaults to True)", 'a_label': "The label to use for inputs 'a'. (optional, defaults to 'Input A:')", 'b_label': "The label to use for inputs 'b'. (optional, defaults to 'Input B:')", 'instruction_label': "The label to use for the instruction. (optional, defaults to 'Instruction:')", 'max_new_tokens': 'The maximum number of tokens to generate. (optional)', 'sep': "The separator to use between instructions and the input. (optional, defaults to '\\n')", 'post_process': 'A function to post-process the generations. (optional)', 'lazy': 'Whether to run lazily or not. (optional, defaults to False)', '**kwargs': 'Any other arguments you want to pass to the .run() method of the LLM. (optional)' }, outputs = { 'a': 'A set of inputs judged with the LLM.', 'b': 'A set of inputs judged with the LLM.', 'judge_prompts': 'The prompts processed with the LLM to judge inputs.', 'judge_generations': 'The judgement generations by the LLM.', 'judgements': 'The judgements by the LLM.' }, )
- property output: OutputDataset | OutputIterableDataset[source]#
The output dataset of the step.
- class datadreamer.steps.JudgeGenerationPairsWithPrompt(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#
Bases:
_PromptBase
Judges between a of pair of generations (
a
,b
) that were produced in response toprompts
using an instruction with aLLM
.JudgeGenerationPairsWithPrompt( name = 'The name of the step.', inputs = { 'prompts': "A set of prompts used to generate 'a' and 'b'.", 'a': 'A set of generations to judge between.', 'b': 'A set of generations to judge between.' }, args = { 'llm': 'The LLM to use.', 'instruction': 'An instruction that describes how to judge the input. (optional, defaults to "Which response is better? Respond with \'Response A\' or \'Response B\'.")', 'judgement_func': "A function to get the judgement from the generation. (optional, defaults to judging by parsing 'a_label' / 'b_label' ('Response A' / 'Response B') from the generation)", 'randomize_order': "Whether to randomly swap the order of 'a' and 'b' in the prompt to mitigate position bias. (optional, defaults to True)", 'prompt_label': "The label to use for prompts. (optional, defaults to 'Prompt:')", 'a_label': "The label to use for inputs 'a'. (optional, defaults to 'Response A:')", 'b_label': "The label to use for inputs 'b'. (optional, defaults to 'Response B:')", 'instruction_label': "The label to use for the instruction. (optional, defaults to 'Instruction:')", 'max_new_tokens': 'The maximum number of tokens to generate. (optional)', 'sep': "The separator to use between instructions and the input. (optional, defaults to '\\n')", 'post_process': 'A function to post-process the generations. (optional)', 'lazy': 'Whether to run lazily or not. (optional, defaults to False)', '**kwargs': 'Any other arguments you want to pass to the .run() method of the LLM. (optional)' }, outputs = { 'prompts': "The set of prompts used to generate 'a' and 'b'.", 'a': 'A set of inputs judged with the LLM.', 'b': 'A set of inputs judged with the LLM.', 'judge_prompts': 'The prompts processed with the LLM to judge inputs.', 'judge_generations': 'The judgement generations by the LLM.', 'judgements': 'The judgements by the LLM.' }, )
- property output: OutputDataset | OutputIterableDataset[source]#
The output dataset of the step.
- class datadreamer.steps.RunTaskModel(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#
Bases:
Step
Runs a set of texts against a
TaskModel
.RunTaskModel( name = 'The name of the step.', inputs = { 'texts': 'The texts to process with the TaskModel.' }, args = { 'model': 'The TaskModel to use.', 'truncate': 'Whether or not to truncate inputs. (optional, defaults to False)', 'lazy': 'Whether to run lazily or not. (optional, defaults to False)', '**kwargs': 'Any other arguments you want to pass to the .run() method of the TaskModel. (optional)' }, outputs = { 'texts': 'The texts processed with the TaskModel.', 'results': 'The results from the TaskModel.' }, )
- property output: OutputDataset | OutputIterableDataset[source]#
The output dataset of the step.
- class datadreamer.steps.Embed(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#
Bases:
Step
Embeds a set of texts with an
Embedder
.Embed( name = 'The name of the step.', inputs = { 'texts': 'The texts to embed.' }, args = { 'embedder': 'The Embedder to use.', 'truncate': 'Whether or not to truncate inputs. (optional, defaults to False)', 'instruction': 'The instruction to prefix inputs to the embedding model with. (optional)', 'lazy': 'Whether to run lazily or not. (optional, defaults to False)', '**kwargs': 'Any other arguments you want to pass to the .run() method of the Embedder. (optional)' }, outputs = { 'texts': 'The texts that were embedded.', 'embeddings': 'The embeddings by the Embedder.' }, )
- property output: OutputDataset | OutputIterableDataset[source]#
The output dataset of the step.
- class datadreamer.steps.CosineSimilarity(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#
Bases:
SuperStep
Computes the cosine similarity between two sets of embeddings (
a
andb
). Ifa
andb
are a set of texts, then they will be embedded using the providedEmbedder
.CosineSimilarity( name = 'The name of the step.', inputs = { 'a': 'The embeddings or texts to compute the cosine similarity of.', 'b': 'The embeddings or texts to compute the cosine similarity of.' }, args = { 'similarity_batch_size': 'How many cosine similarities to compute at once. (optional, defaults to 100)', 'device': 'The device or list of devices to compute the cosine similarities with. (optional)', 'embedder': "The Embedder to use to embed 'a' and 'b' if they are a set of texts. (optional)", 'lazy': 'Whether to run lazily or not. (optional, defaults to False)', '**kwargs': 'Any other arguments you want to pass to the .run() method of the Embedder if it is used. (optional)' }, outputs = { 'a': 'The embeddings or texts that the cosine similarities were computed for.', 'b': 'The embeddings or texts that the cosine similarities were computed for.', 'similarities': 'The similarities computed by the step.' }, )
- property output: OutputDataset | OutputIterableDataset[source]#
The output dataset of the step.
- class datadreamer.steps.Retrieve(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#
Bases:
Step
Retrieves results for a set of queries with a
Retriever
.Retrieve( name = 'The name of the step.', inputs = { 'queries': 'The queries to retrieve results for.' }, args = { 'retriever': 'The Retriever to use.', 'k': 'How many results to retrieve. (optional, defaults to 5)', 'lazy': 'Whether to run lazily or not. (optional, defaults to False)', '**kwargs': 'Any other arguments you want to pass to the .run() method of the Retriever. (optional)' }, outputs = { 'queries': 'The queries used to retrieve results.', 'results': 'The results from the Retriever.' }, )
- property output: OutputDataset | OutputIterableDataset[source]#
The output dataset of the step.
- datadreamer.steps.concat(*steps, name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#
Concatenate the rows of the outputs of the input steps.
- Parameters:
steps (
Step
) β The steps to concatenate.name (
Optional
[str
], default:None
) β The name of the operation.lazy (
bool
, default:True
) β Whether to run the operation lazily.progress_interval (
UnionType
[None
,int
,Default
], default:DEFAULT
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the operation (ignore saved results).writer_batch_size (
Optional
[int
], default:1000
) β The batch size to use if saving to disk.save_num_proc (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of processes to use if saving to disk.save_num_shards (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.
- Return type:
- Returns:
A new step with the operation applied.
- datadreamer.steps.zipped(*steps, name=None, lazy=True, progress_interval=DEFAULT, force=False, writer_batch_size=1000, save_num_proc=DEFAULT, save_num_shards=DEFAULT, background=False)[source]#
Zip the outputs of the input steps. This is similar to the built-in Python
zip()
function, essentially concatenating the columns of the outputs of the input steps.- Parameters:
steps (
Step
) β The steps to zip.name (
Optional
[str
], default:None
) β The name of the operation.lazy (
bool
, default:True
) β Whether to run the operation lazily.progress_interval (
UnionType
[None
,int
,Default
], default:DEFAULT
) β How often to log progress in seconds.force (
bool
, default:False
) β Whether to force run the operation (ignore saved results).writer_batch_size (
Optional
[int
], default:1000
) β The batch size to use if saving to disk.save_num_proc (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of processes to use if saving to disk.save_num_shards (
UnionType
[None
,int
,Default
], default:DEFAULT
) β The number of shards on disk to save the dataset into.background (
bool
, default:False
) β Whether to run the operation in the background.
- Return type:
- Returns:
A new step with the operation applied.
- datadreamer.steps.wait(*steps, poll_interval=1.0)[source]#
Wait for all steps to complete if they are running in the background.
- Parameters:
poll_interval (default:
1.0
) β How often to poll in seconds.
- datadreamer.steps.concurrent(*funcs)[source]#
Run a set of functions (which run steps) concurrently.
- Parameters:
*funcs (
Callable
) β The functions to run concurrently.
- class datadreamer.steps.DataCardType[source]#
Bases:
object
The types of data card entries.
- DATETIME = 'Date & Time'#
- MODEL_NAME = 'Model Name'#
- DATASET_NAME = 'Dataset Name'#
- LICENSE = 'License Information'#
- CITATION = 'Citation Information'#
- DATASET_CARD = 'Dataset Card'#
- MODEL_CARD = 'Model Card'#
- URL = 'URL'#
- class datadreamer.steps.LazyRows(value, total_num_rows=None, auto_progress=True, save=False, save_writer_batch_size=None)[source]#
Bases:
object
A wrapper for lazy rows output from a step. See create your own steps for more details.
- Parameters:
value (lazy output) β The lazy rows output.
total_num_rows (
Optional
[int
], default:None
) β The total number of rows being processed (helps with displaying progress).auto_progress (
bool
, default:True
) β Whether to automatically update the progress % for this step.writer_batch_size β The batch size to use if saving to disk.
- class datadreamer.steps.LazyRowBatches(value, total_num_rows=None, auto_progress=True, save=False, save_writer_batch_size=None)[source]#
Bases:
object
A wrapper for lazy row batches output from a step. See create your own steps for more details.
- Parameters:
value (lazy output) β The lazy row batches output.
total_num_rows (
Optional
[int
], default:None
) β The total number of rows being processed (helps with displaying progress).auto_progress (
bool
, default:True
) β Whether to automatically update the progress % for this step.writer_batch_size β The batch size to use if saving to disk.
- class datadreamer.steps.SuperStep(name, inputs=None, args=None, outputs=None, progress_interval=60, force=False, verbose=None, log_level=None, save_num_proc=None, save_num_shards=None, background=False)[source]#
Bases:
Step
The class to subclass if you want to create a step that runs other steps. See create your own steps for more details.
- property output: OutputDataset | OutputIterableDataset[source]#
The output dataset of the step.