Training Models on Multiple GPUs#

DataDreamer makes training on multiple GPUs with Trainer objects extremely simple and straightforward. All you need to do is pass in a list of devices to the device parameter of Trainer at construction instead of a single device. That’s it.
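For example, a minimal sketch of multi-GPU training (the TrainHFFineTune trainer, dataset, model name, and column names below are illustrative; the only essential part is the list of devices passed to device):

from datadreamer import DataDreamer
from datadreamer.steps import HFHubDataSource
from datadreamer.trainers import TrainHFFineTune

with DataDreamer("./output"):
    # Illustrative dataset with "dialogue" and "summary" columns
    train_data = HFHubDataSource("Get train data", "samsum", split="train")
    val_data = HFHubDataSource("Get val data", "samsum", split="validation")

    trainer = TrainHFFineTune(
        "Train on multiple GPUs",
        model_name="google/flan-t5-base",
        # Passing a list of devices (instead of a single device) enables
        # multi-GPU training; FSDP is used by default.
        device=["cuda:0", "cuda:1"],
    )
    trainer.train(
        train_input=train_data.output["dialogue"],
        train_output=train_data.output["summary"],
        validation_input=val_data.output["dialogue"],
        validation_output=val_data.output["summary"],
        epochs=1,
        batch_size=8,
    )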

Distributed Training Modes#

There are two distributed training modes supported by DataDreamer:

  1. FSDP (default)

  2. DDP

FSDP (Fully-Sharded Data Parallel)

By default, when training on multiple GPUs, DataDreamer uses PyTorch’s FSDP (Fully-Sharded Data Parallel) implementation. FSDP shards the model parameters across all GPUs and only loads a partial slice of the model on each GPU, which lets you train models that are too large to fit in the memory of a single GPU.

Your effective batch size is the batch_size you supply multiplied by the number of GPUs as each GPU will process a batch independently and perform a synchronized weight update at the end.
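For example, with illustrative numbers:

batch_size = 8                                 # per-device batch size passed to the trainer
num_gpus = 4                                   # len(device) passed to Trainer at construction
effective_batch_size = batch_size * num_gpus   # 32 examples per synchronized weight update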

DDP (Distributed Data Parallel)

In Distributed Data Parallel, your model is not sharded across multiple GPUs. Instead, each GPU holds a full copy of the model, so the model must fit on a single GPU. DDP is primarily useful for scaling up the effective batch size and training faster, not for training models that are too large for a single GPU. DataDreamer does not use DDP by default; to use DDP instead of FSDP, pass fsdp=False to Trainer at construction (see the sketch below).

As with FSDP, your effective batch size is the batch_size you supply multiplied by the number of GPUs, since each GPU processes a batch independently and performs a synchronized weight update at the end.
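A minimal sketch of switching to DDP (reusing the illustrative TrainHFFineTune trainer from above):

trainer = TrainHFFineTune(
    "Train with DDP",
    model_name="google/flan-t5-base",
    device=["cuda:0", "cuda:1"],
    fsdp=False,  # disable FSDP; each GPU holds a full copy of the model (DDP)
)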

Monitoring GPU Memory Usage#

You can easily monitor the GPU memory usage of your final multi-GPU training setup by passing the verbose parameter to Trainer at construction. This will log device memory usage at the beginning and end of training and after each epoch.
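For example (reusing the illustrative trainer from above, and assuming the flag is enabled with verbose=True):

trainer = TrainHFFineTune(
    "Train on multiple GPUs",
    model_name="google/flan-t5-base",
    device=["cuda:0", "cuda:1"],
    verbose=True,  # log device memory usage at the start/end of training and after each epoch
)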

Multi-Node Multi-GPU Training#

If you want to train on multiple nodes (servers or machines) each with multiple GPUs, you can do so by running the same DataDreamer training script on each node and passing a dictionary to the distributed_config parameter of Trainer. It should look like:

{
    "master_addr": "<IP address of master node>",
    "master_port": "<free and accessible port on master node>",
    "nnodes": total_number_of_nodes_in_cluster,
    "node_rank": the_rank_of_this_node_in_the_cluster,
}
Setting up distributed_config and device on each node

Important

You should run the same DataDreamer training script on each node. The only differences should be the node_rank in the distributed_config parameter and the device parameter passed to Trainer at construction.

The master_addr should be the IP address of the master node amongst all of your nodes. The master node is also the only node that should log, save the model to disk, and continue execution after training is complete; the rest of the nodes will exit once training finishes. The master_port should be a free and accessible port on the master node that will be used for communication between the nodes. The nnodes value should be the total number of nodes in the cluster. On each node, the node_rank in the distributed_config should be different, specifying which node it is out of the nnodes total nodes.

On each node, pass to the device parameter of Trainer at construction a list containing all of the GPUs on that node that you wish to utilize.
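Putting this together, a sketch of the configuration for a two-node cluster with two GPUs per node (the IP address, port, trainer class, and the assumption that node ranks are 0-indexed with the master node as rank 0 are all illustrative):

# On the master node (node_rank 0):
trainer = TrainHFFineTune(
    "Train across nodes",
    model_name="google/flan-t5-base",
    device=["cuda:0", "cuda:1"],          # all GPUs on this node that you wish to utilize
    distributed_config={
        "master_addr": "192.168.1.10",    # illustrative IP address of the master node
        "master_port": "29500",           # illustrative free port on the master node
        "nnodes": 2,
        "node_rank": 0,
    },
)

# On the second node, run the same script with the only changes being
# node_rank (and the device list, if that node's GPUs differ):
#     "node_rank": 1,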

Tip

If you are training on a dataset that is the output of steps in your DataDreamer session, all of those steps must execute on each node. If you want to only compute these steps once, you can compute them on the master node and then copy the output folder of the DataDreamer session to the other nodes before training. This will ensure the other nodes don’t recompute these steps and instead load the cached outputs from disk.