Creating a new Step#

Creating a new step in DataDreamer is useful for creating custom logic to transform or process input data into some new output data. It is something you will commonly want to do in a workflow to implement a new or custom technique.

When you create and run a new step in a DataDreamer session, you immediately get the benefits of the DataDreamer library such as reusability, caching of outputs, enhanced logging, parallelizability, easy interoperability with the other steps and trainers available in DataDreamer, and automatic data card generation.

You can create a new step by subclassing the Step class and implementing the setup() and run() methods:

from datadreamer.steps import Step

class MyNewStep(Step):
    def setup(self):
        # Register inputs, arguments, outputs, and data card information here

    def run(self):
        # Implement your custom data processing / transformation logic here

Implementing setup()#

The setup() method registers what inputs and arguments the step will accept, and what outputs it will return. It also allows you to register data card information for the step, that ultimately is used by DataDreamer to automatically generate data cards.

Registering inputs, arguments, and outputs#

You can use self.register_input() to register the name of each input that the step will accept. Inputs will be provided as OutputDatasetColumn or OutputIterableDatasetColumn objects.

You can use self.register_arg() to register the name of each argument that the step will accept. Arguments can be provided of any type.

You can use self.register_output() to register the name of each output that the step will produce. These outputs must be returned by your run() method implementation.

Registering data card information#

You can use self.register_data_card() to register various data card information where data_card_type can be a one of the DataCardType types, and data_card_value can be then information you wish to add to the data card for that DataCardType.

Implementing run()#

The run() method is where you implement your custom data processing / transformation logic using the input data and arguments requested in setup(). Your implementation of run() must also return outputs that correspond to the outputs registered in setup().

Accessing inputs and arguments#

You can access the inputs and arguments provided to the step by accessing the self.inputs and self.args dictionaries, respectively.

Storing persistent data#

If you need a folder to store persistent data during run(), you can use the self.get_run_output_folder_path() method.

Returning outputs#

You can return a dataset of outputs from run() corresponding with the output column names registered in setup(). DataDreamer will automatically convert the return value to an OutputDataset object and make it available on the output attribute of the step.

Valid Return Formats

DataDreamer is very flexible in what you can return as outputs, and you can return an output dataset in any of the following ways:

  • You can return a dictionary of lists, where each key is the name of an output column, and each value is a list of values for that output column.

  • You can return a list of dictionaries, where each list item is a row of data, and each dictionary key is the name of an output column, and each dictionary value is the value for that output column.

  • You can return a list of tuples, where each list item is a row of data, and each tuple item is the value for each output column in the order they were registered.

  • You can return a Hugging Face Dataset object or IterableDataset object.

  • … other data structures are also supported, DataDreamer will try to understand what you are returning and convert it to an appropriate dataset of outputs.

Note

If any of your output columns contain a value that is not a primitive Python type (bool, str, float, int, list, etc.) you may get a type error stating that the value is not valid since it cannot be serialized. If this happens, you can pickle the values of the column by using self.pickle(). This will allow you to return arbitrary Python types. DataDreamer will automatically unpickle the values when the output dataset is accessed. You can also use self.unpickle() to manually unpickle values if needed.

Returning outputs lazily#

If you want to return outputs lazily to make your step run as a lazy step, you can return a generator function that will yield a single row of data at a time instead and wrap the function with LazyRows before returning it. If you want your generator function to yield a batch of rows at a time, you can wrap the function with LazyRowBatches instead.

Updating the progress indicator#

DataDreamer can keep the user updated on the progress of your step if you periodically update the progress by setting the self.progress attribute to a value between 0 and 1. If you are returning outputs lazily, DataDreamer will automatically update the progress based on the number of rows yielded so far.

Running steps within steps#

If you want to run other steps inside run(), then you must subclass the SuperStep class instead of the Step class.

Contributing#

You may want to contribute the new step class you created to DataDreamer for others to use, especially if it is a reusable technique. See the Contributing page for how to contribute your step that others may benefit from using. If applicable, please ensure your implementation includes data card metadata, such as a link to the model/data cards used, any license information, and any citation information.