Training Infrastructure
A team trains a ResNet-50 image classifier on a single GPU with a batch size of 32. Training takes 8 hours for one epoch on 10 million images. They want to reduce epoch time to under 1 hour. They switch to data parallelism with 8 GPUs. What is the expected communication bottleneck that prevents perfect 8x speedup, and what does the actual speedup look like in practice?
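For context, here is a minimal sketch of what the switch to data parallelism might look like using PyTorch's DistributedDataParallel (the question names no framework, so DDP, the script name, and the launch command are illustrative assumptions). The gradient all-reduce that DDP runs during backward() is the communication step the question is probing.

```python
# Minimal data-parallel training sketch with PyTorch DDP (illustrative;
# the question does not specify a framework). Launch one process per GPU:
#   torchrun --nproc_per_node=8 train_ddp.py
import os

import torch
import torch.distributed as dist
import torchvision.models as models
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    model = models.resnet50().to(device)
    # DDP keeps one replica per GPU and all-reduces gradients in buckets
    # during backward(); that synchronization is the communication cost.
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()

    # Dummy batch standing in for a DistributedSampler-backed DataLoader;
    # per-GPU batch size 32 gives a global batch of 256 across 8 GPUs.
    images = torch.randn(32, 3, 224, 224, device=device)
    labels = torch.randint(0, 1000, (32,), device=device)

    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()  # gradient all-reduce happens here, overlapped with compute
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

As a rough back-of-envelope: ResNet-50 has roughly 25.6M parameters, so fp32 gradients come to about 100 MB per step, and a ring all-reduce moves roughly 2(p-1)/p of that volume through each GPU every iteration. How much of that cost hides behind backward-pass compute depends on the interconnect (PCIe vs. NVLink) and on gradient bucketing and overlap, which is why measured speedups on 8 GPUs typically land below the ideal 8x.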