Training Models on Multiple GPUs#
DataDreamer makes training on multiple GPUs with Trainer objects extremely simple and straightforward. All you need to do is pass a list of devices to the device parameter of Trainer at construction instead of a single device. That's it.
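For example, here is a minimal sketch using one of DataDreamer's built-in trainers (the trainer class, tiny dataset, model, and train() arguments below are purely illustrative; the only multi-GPU change is the list passed to device):

from datadreamer import DataDreamer
from datadreamer.steps import DataSource
from datadreamer.trainers import TrainHFFineTune

with DataDreamer("./output"):
    # A tiny in-session dataset so the sketch is self-contained
    dataset = DataSource(
        "Example Dataset",
        data={
            "inputs": ["Translate to French: hello", "Translate to French: goodbye"],
            "outputs": ["bonjour", "au revoir"],
        },
    )

    trainer = TrainHFFineTune(
        "Train on Two GPUs",
        model_name="google/flan-t5-small",
        # Pass a list of devices instead of a single device to train on
        # multiple GPUs (FSDP is used by default)
        device=["cuda:0", "cuda:1"],
    )
    trainer.train(
        train_input=dataset.output["inputs"],
        train_output=dataset.output["outputs"],
        # Re-using the training data as validation data only to keep the example short
        validation_input=dataset.output["inputs"],
        validation_output=dataset.output["outputs"],
        epochs=1,
        batch_size=8,
    )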
Distributed Training Modes#
There are two distributed training modes supported by DataDreamer:
FSDP (default)
DDP
FSDP (Fully-Sharded Data Parallel)
By default, when training on multiple GPUs, DataDreamer uses PyTorch's FSDP (Fully-Sharded Data Parallel) implementation. FSDP shards the model parameters across all GPUs and only loads a partial slice of the model on each GPU, which allows you to train models that are too large to fit in the memory of a single GPU.
Your effective batch size is the batch_size you supply multiplied by the number of GPUs, as each GPU will process a batch independently and perform a synchronized weight update at the end. For example, batch_size=8 on 4 GPUs gives an effective batch size of 32.
DDP (Distributed Data Parallel)
In Distributed Data Parallel, your model is not sharded across multiple GPUs. Instead, each GPU has a full copy of the model. This means that your model must fit on a single GPU. DDP is primarily useful for scaling up the effective batch size during training and can help train models faster, but it is not useful for training models that don't fit on a single GPU. DDP is not used by default by DataDreamer. If you want to use DDP instead, you can pass fsdp=False to Trainer at construction. This will disable FSDP and switch to DDP.
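For example, a sketch of switching a trainer to DDP (again using TrainHFFineTune purely for illustration; the relevant arguments are the device list and fsdp=False):

trainer = TrainHFFineTune(
    "Train with DDP",
    model_name="google/flan-t5-small",
    device=["cuda:0", "cuda:1"],  # still a list of devices
    fsdp=False,  # disable FSDP and use DDP instead
)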
As with FSDP, your effective batch size is the batch_size you supply multiplied by the number of GPUs, as each GPU will process a batch independently and perform a synchronized weight update at the end.
Monitoring GPU Memory Usage#
You can easily monitor the GPU memory usage of your final multi-GPU training setup by passing the verbose parameter to Trainer at construction. This will log device memory usage at the beginning and end of training and after each epoch.
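For example, a sketch (assuming verbose takes a boolean here, as it does elsewhere in the library; the trainer class and model are illustrative):

trainer = TrainHFFineTune(
    "Train with Memory Logging",
    model_name="google/flan-t5-small",
    device=["cuda:0", "cuda:1"],
    verbose=True,  # log per-device memory usage at the start/end of training and after each epoch
)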
Multi-Node Multi-GPU Training#
If you want to train on multiple nodes (servers or machines), each with multiple GPUs, you can do so by running the same DataDreamer training script on each node and passing a dictionary to the distributed_config parameter of Trainer. It should look like:
{
    "master_addr": "<IP address of master node>",
    "master_port": "<free and accessible port on master node>",
    "nnodes": total_number_of_nodes_in_cluster,
    "node_rank": the_rank_of_this_node_in_the_cluster,
}
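For example, on the second node (node_rank 1) of a two-node cluster, the dictionary might look like this (the address and port below are placeholders):

{
    "master_addr": "192.168.1.10",   # IP address of the master node (placeholder)
    "master_port": "29500",          # a free, accessible port on the master node (placeholder)
    "nnodes": 2,                     # two nodes total in the cluster
    "node_rank": 1,                  # this is the second node (the master node uses 0)
}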
Setting up distributed_config and device on each node
Important
You should run the same DataDreamer training script on each node. The only difference should be the node_rank in the distributed_config parameter and the device parameter passed to Trainer at construction.
The master_addr should be the IP address of the master node amongst all of your nodes. This is also the only node that should log, save the model to disk, and continue execution after training is complete; the rest of the nodes will exit once training completes.
The master_port should be a free port on the master node that will be used for communication between the nodes and the master node.
The nnodes should be the total number of nodes in the cluster. On each node, the node_rank in the distributed_config should be different (to specify which node it is out of the nnodes total nodes).
For each node, the device parameter of Trainer at construction should be a list of all of the GPUs on the current node that you wish to utilize.
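Putting this together for a hypothetical two-node cluster (a master node with 4 GPUs and a second node with 2 GPUs; the trainer class, model, address, and port are illustrative), the trainer construction on each node would look like:

# On the master node (4 GPUs):
trainer = TrainHFFineTune(
    "Multi-Node Training",
    model_name="google/flan-t5-small",
    device=["cuda:0", "cuda:1", "cuda:2", "cuda:3"],
    distributed_config={
        "master_addr": "192.168.1.10",
        "master_port": "29500",
        "nnodes": 2,
        "node_rank": 0,
    },
)

# On the second node (2 GPUs):
trainer = TrainHFFineTune(
    "Multi-Node Training",
    model_name="google/flan-t5-small",
    device=["cuda:0", "cuda:1"],
    distributed_config={
        "master_addr": "192.168.1.10",
        "master_port": "29500",
        "nnodes": 2,
        "node_rank": 1,
    },
)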
Tip
If you are training on a dataset that is the output of steps in your DataDreamer session, all of those steps must execute on each node. If you want to compute these steps only once, you can compute them on the master node and then copy the output folder of the DataDreamer session to the other nodes before training. This ensures the other nodes don't recompute these steps and instead load the cached outputs from disk.