Training Infrastructure
A team trains a ResNet-50 image classifier on a single GPU with a batch size of 32. Training takes 8 hours for one epoch on 10 million images. They want to reduce epoch time to under 1 hour. They switch to data parallelism with 8 GPUs. What is the expected communication bottleneck that prevents perfect 8x speedup, and what does the actual speedup look like in practice?
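For context, here is a minimal sketch of what the switch to data parallelism might look like using PyTorch's DistributedDataParallel (the question names no framework, so DDP, the script name, and the launch command are illustrative assumptions). The gradient all-reduce that DDP runs during backward() is the communication step the question is probing.

```python
# Minimal data-parallel training sketch with PyTorch DDP (illustrative;
# the question does not specify a framework). Launch one process per GPU:
#   torchrun --nproc_per_node=8 train_ddp.py
import os

import torch
import torch.distributed as dist
import torchvision.models as models
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    model = models.resnet50().to(device)
    # DDP keeps one replica per GPU and all-reduces gradients in buckets
    # during backward(); that synchronization is the communication cost.
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()

    # Dummy batch standing in for a DistributedSampler-backed DataLoader;
    # per-GPU batch size 32 gives a global batch of 256 across 8 GPUs.
    images = torch.randn(32, 3, 224, 224, device=device)
    labels = torch.randint(0, 1000, (32,), device=device)

    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()  # gradient all-reduce happens here, overlapped with compute
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

As a rough back-of-envelope: ResNet-50 has roughly 25.6M parameters, so fp32 gradients come to about 100 MB per step, and a ring all-reduce moves roughly 2(p-1)/p of that volume through each GPU every iteration. How much of that cost hides behind backward-pass compute depends on the interconnect (PCIe vs. NVLink) and on gradient bucketing and overlap, which is why measured speedups on 8 GPUs typically land below the ideal 8x.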