Distributed training with PyTorch on Linux mainly involves the following steps:
Environment preparation: install PyTorch with CUDA/NCCL support on every node and make sure the nodes can reach each other over the network.
Initialize the process group: call torch.distributed.init_process_group() to set up the distributed environment. It takes several arguments, including the backend (e.g. nccl or gloo), the init method (e.g. tcp://), the master node's IP address and port, the world size, and the rank of the current process.
Data parallelism: wrap your model with torch.nn.parallel.DistributedDataParallel (DDP). DDP automatically synchronizes gradients across processes during the backward pass.
Data loading: use torch.utils.data.distributed.DistributedSampler so that each process works on a different shard of the dataset.
Training loop: call sampler.set_epoch(epoch) at the start of each epoch so the shuffling differs between epochs; otherwise the loop looks like ordinary single-GPU training.
Save the model: save only from the main process (rank 0) so that several processes do not write the same file.
Each process must be started with its own rank; a minimal launch sketch follows this list.
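One common way to start one process per GPU on a single machine is torch.multiprocessing.spawn (torchrun is another option). The following is a minimal sketch under that single-machine assumption; the address 127.0.0.1:12345 and the empty worker body are placeholders, not part of the full example further below.
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Each spawned process joins the group with its own rank.
    dist.init_process_group(
        backend='nccl',
        init_method='tcp://127.0.0.1:12345',  # placeholder master address and port
        world_size=world_size,
        rank=rank,
    )
    torch.cuda.set_device(rank)  # bind this process to one GPU
    # ... build the model, wrap it in DDP, run the training loop ...
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()  # one process per visible GPU
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)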
The following is a simple example of distributed training with PyTorch on Linux; each process should be started with its own rank (the code reads it from the RANK environment variable, which launchers such as torchrun set automatically):
import os
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
import torchvision.datasets as datasets
import torchvision.transforms as transforms
# Initialize the distributed environment
world_size = 4  # total number of processes (assume 4 GPUs here)
rank = int(os.environ.get('RANK', 0))  # rank of this process; must be unique per process (torchrun sets RANK)
master_ip = '192.168.1.1'  # IP address of the master node
master_port = '12345'  # port on the master node
torch.distributed.init_process_group(
    backend='nccl',
    init_method=f'tcp://{master_ip}:{master_port}',
    world_size=world_size,
    rank=rank
)
# Define the model
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc = nn.Linear(28 * 28, 10)  # MNIST images are 1x28x28, 10 classes

    def forward(self, x):
        return self.fc(x.view(x.size(0), -1))  # flatten the images before the linear layer
model = SimpleModel().to(rank)  # on a single machine the rank doubles as the local GPU index
ddp_model = DDP(model, device_ids=[rank])  # wrap the model so gradients are synchronized across processes
# Data loading
transform = transforms.Compose([transforms.ToTensor()])
dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
sampler = DistributedSampler(dataset)  # gives each process a different shard of the dataset
dataloader = DataLoader(dataset, batch_size=64, sampler=sampler)
# Optimizer
optimizer = optim.SGD(ddp_model.parameters(), lr=0.01)
# Training loop
for epoch in range(5):
    sampler.set_epoch(epoch)  # reshuffle differently every epoch
    running_loss = 0.0
    for data, target in dataloader:
        data, target = data.to(rank), target.to(rank)
        optimizer.zero_grad()
        output = ddp_model(data)
        loss = nn.functional.cross_entropy(output, target)
        loss.backward()  # DDP all-reduces the gradients here
        optimizer.step()
        running_loss += loss.item()
    print(f'Epoch {epoch+1}, Loss: {running_loss/len(dataloader)}')
# Save the model (only on the main process)
if rank == 0:
    torch.save(ddp_model.module.state_dict(), 'model.pth')  # save the underlying module so keys have no 'module.' prefix
# Clean up the distributed environment
torch.distributed.destroy_process_group()
With these steps you can run distributed training with PyTorch on a Linux machine or cluster.
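As a follow-up to the save step, here is a minimal sketch of loading the saved weights for single-device inference; the module name train_ddp is a hypothetical placeholder for the training script above.
import torch

from train_ddp import SimpleModel  # hypothetical import of the model class defined in the training script

model = SimpleModel()
state_dict = torch.load('model.pth', map_location='cpu')  # load the checkpoint onto the CPU
model.load_state_dict(state_dict)  # loads directly because module.state_dict() was saved (no 'module.' prefix)
model.eval()  # switch to inference mode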