Creating a new Step#

Creating a new step in DataDreamer lets you implement custom logic that transforms or processes input data into new output data. You will commonly do this in a workflow to implement a new or custom technique.

When you create and run a new step in a DataDreamer session, you immediately get the benefits of the DataDreamer library such as reusability, caching of outputs, enhanced logging, parallelizability, easy interoperability with the other steps and trainers available in DataDreamer, and automatic data card generation.

You can create a new step by subclassing the Step class and implementing the setup() and run() methods:

from datadreamer.steps import Step

class MyNewStep(Step):
    def setup(self):
        # Register inputs, arguments, outputs, and data card information here
        pass  # placeholder so the method body is valid Python

    def run(self):
        # Implement your custom data processing / transformation logic here
        pass  # placeholder so the method body is valid Python

Implementing `setup()`#

The setup() method registers which inputs and arguments the step will accept and which outputs it will return. It also lets you register data card information for the step, which DataDreamer ultimately uses to automatically generate data cards.

Registering inputs, arguments, and outputs#

You can use self.register_input() to register the name of each input that the step will accept. Inputs will be provided as OutputDatasetColumn or OutputIterableDatasetColumn objects.

You can use self.register_arg() to register the name of each argument that the step will accept. Arguments can be of any type.

You can use self.register_output() to register the name of each output that the step will produce. These outputs must be returned by your run() method implementation.
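As a rough sketch of what these registration calls might look like inside setup(), the example below registers one input, one argument, and one output. So that it runs without DataDreamer installed, it subclasses a small stand-in class (`StandInStep`, hypothetical) that only records the registered names; in a real workflow you would subclass Step from datadreamer.steps instead. The names "text", "suffix", and "text_with_suffix" are illustrative assumptions.

```python
class StandInStep:
    """Hypothetical stand-in for datadreamer.steps.Step (illustration only).

    It merely records the registered names so this sketch runs on its own.
    """

    def __init__(self):
        self.registered_inputs = []
        self.registered_args = []
        self.registered_outputs = []
        # Call setup() so the registrations below are recorded
        self.setup()

    def register_input(self, name):
        self.registered_inputs.append(name)

    def register_arg(self, name):
        self.registered_args.append(name)

    def register_output(self, name):
        self.registered_outputs.append(name)


class AppendSuffix(StandInStep):  # with DataDreamer: class AppendSuffix(Step)
    def setup(self):
        self.register_input("text")               # an input column the step accepts
        self.register_arg("suffix")               # an argument of any type
        self.register_output("text_with_suffix")  # a column that run() must return


step = AppendSuffix()
print(step.registered_inputs, step.registered_args, step.registered_outputs)
```

Only the name of each input, argument, and output is passed here, since that is all the text above describes; any further parameters the real methods accept are not shown.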

Registering data card information#

You can use self.register_data_card() to register data card information, where data_card_type is one of the DataCardType types and data_card_value is the information you wish to add to the data card for that DataCardType.

Implementing `run()`#

The run() method is where you implement your custom data processing / transformation logic using the input data and arguments requested in setup(). Your implementation of run() must also return outputs that correspond to the outputs registered in setup().

Accessing inputs and arguments#

You can access the inputs and arguments provided to the step through the self.inputs and self.args dictionaries, respectively.

Storing persistent data#

If you need a folder to store persistent data during run(), you can use the self.get_run_output_folder_path() method.
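A minimal sketch of how such a folder might be used: the hypothetical helper `save_checkpoint` writes intermediate records to a JSON file inside a given folder. It assumes the value returned by self.get_run_output_folder_path() can be treated as a filesystem path; a temporary directory stands in for it so the example runs on its own.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(folder, records):
    """Write intermediate records to a JSON file inside `folder`."""
    path = Path(folder) / "intermediate.json"
    path.write_text(json.dumps(records))
    return path


# Inside run(), the folder would come from:
#     folder = self.get_run_output_folder_path()
# A temporary directory stands in for it here.
with tempfile.TemporaryDirectory() as folder:
    saved = save_checkpoint(folder, [{"text": "hello"}])
    restored = json.loads(saved.read_text())
    print(restored)
```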

Returning outputs#

You can return a dataset of outputs from run() corresponding with the output column names registered in setup(). DataDreamer will automatically convert the return value to an OutputDataset object and make it available on the output attribute of the step.
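The following sketch shows how a run() implementation might read from self.inputs and self.args and return a value keyed by the registered output name. It uses a hypothetical stand-in base class (`StandInStep`) so it runs without DataDreamer installed, models the input column as a plain list, and assumes that a dictionary mapping each output column name to its values is an accepted return format.

```python
class StandInStep:
    """Hypothetical stand-in for datadreamer.steps.Step (illustration only)."""

    def __init__(self, inputs, args):
        # In DataDreamer the library populates these dictionaries;
        # input columns are modeled here as plain lists.
        self.inputs = inputs
        self.args = args


class AppendSuffix(StandInStep):  # with DataDreamer: class AppendSuffix(Step)
    def run(self):
        texts = self.inputs["text"]    # input registered in setup()
        suffix = self.args["suffix"]   # argument registered in setup()
        # Assumption: a dict of column name -> list of values is a valid return
        return {"text_with_suffix": [text + suffix for text in texts]}


step = AppendSuffix(inputs={"text": ["a", "b"]}, args={"suffix": "!"})
result = step.run()
print(result)
```

The key in the returned dictionary matches the output column name that setup() would register, which is the correspondence the text above requires.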

Returning outputs lazily#

If you want your step to run as a lazy step, write a generator function that yields a single row of data at a time and wrap that function with LazyRows before returning it from run(). If you would rather have the generator function yield a batch of rows at a time, wrap it with LazyRowBatches instead.
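The generator function itself is plain Python. The sketch below builds one with a hypothetical helper, `make_row_generator`, and shows that rows are produced only as they are consumed. It assumes each row is a dictionary keyed by the registered output name; the LazyRows wrapping appears only as a comment so the example runs without DataDreamer installed.

```python
def make_row_generator(texts, suffix):
    """Return a zero-argument generator function that yields one row at a time."""

    def generate_rows():
        for text in texts:
            # Assumption: each row is a dict keyed by the registered output name
            yield {"text_with_suffix": text + suffix}

    return generate_rows


generate_rows = make_row_generator(["a", "b", "c"], "!")

# Inside run(), the generator function would be wrapped before returning:
#     return LazyRows(generate_rows)

rows = generate_rows()   # calling the function creates the generator; no rows yet
first_row = next(rows)   # the first row is computed only now
remaining = list(rows)   # the rest are computed as they are consumed
print(first_row, remaining)
```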

Updating the progress indicator#

DataDreamer can keep the user updated on the progress of your step if you periodically update the progress by setting the self.progress attribute to a value between 0 and 1. If you are returning outputs lazily, DataDreamer will automatically update the progress based on the number of rows yielded so far.
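For a step that does not return outputs lazily, a manual progress update might look like the sketch below. The helper `process_items` is hypothetical, and a `SimpleNamespace` object stands in for the step so the example runs on its own; inside run() you would set self.progress directly.

```python
from types import SimpleNamespace


def process_items(step, items):
    """Upper-case each item, reporting fractional progress on `step`."""
    results = []
    total = len(items)
    for index, item in enumerate(items, start=1):
        results.append(item.upper())
        step.progress = index / total  # a value between 0 and 1
    return results


# SimpleNamespace stands in for the step; inside run() you would use `self`
step = SimpleNamespace(progress=0.0)
results = process_items(step, ["a", "b", "c", "d"])
print(results, step.progress)
```

Updating after each item keeps the indicator accurate; for very large inputs you may prefer to update every N items instead.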

Running steps within steps#

If you want to run other steps inside run(), then you must subclass the SuperStep class instead of the Step class.

Contributing#

You may want to contribute the new step class you created to DataDreamer for others to use, especially if it implements a reusable technique. See the Contributing page for how to contribute your step so that others can benefit from it. If applicable, please ensure your implementation includes data card metadata, such as a link to the model/data cards used, any license information, and any citation information.