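EfficientNet uses compound scaling: depth d = α^φ, width w = β^φ, resolution r = γ^φ, subject to α·β²·γ² ≈ 2 with α, β, γ ≥ 1, so that total FLOPs grow by roughly 2^φ. The paper fixes φ=1 while grid-searching α, β, γ on the EfficientNet-B0 baseline, then holds those constants fixed and raises φ to obtain the larger models. If you set φ=2 (double the φ=1 budget, roughly 4× the baseline FLOPs), how do the three dimensions scale relative to the baseline, and why is compound scaling preferred over scaling only width or only depth?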
A Compound scaling just adds more layers; width and resolution are fixed
B With φ=2 and EfficientNet's found constants (α=1.2, β=1.1, γ=1.15): depth factor = 1.2² = 1.44, width factor = 1.1² = 1.21, resolution factor = 1.15² ≈ 1.32. Compound scaling is preferred because: (1) depth alone hits diminishing returns (very deep networks are harder to train, e.g. due to vanishing gradients); (2) width alone saturates quickly (wide but shallow networks struggle to capture higher-level, hierarchical features); (3) resolution alone increases compute quadratically without the extra depth and width needed to process the added spatial information. Balancing all three keeps the receptive field, the capacity to learn hierarchical features, and the level of input detail growing together.
C Compound scaling is identical to neural architecture search; the scaling rule finds the best architecture
D Compound scaling always requires α=β=γ; using different values causes performance degradation
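The scaling arithmetic in this question can be checked with a minimal Python sketch. The function name `compound_scale` and its signature are hypothetical, chosen only for illustration; the default constants are the ones quoted in the question, and the FLOPs estimate assumes cost grows as d·w²·r².

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Return (depth, width, resolution, flops) multipliers for a given phi.

    Hypothetical helper for illustration; not taken from any library.
    All multipliers are relative to the unscaled baseline (phi = 0).
    """
    depth = alpha ** phi
    width = beta ** phi
    resolution = gamma ** phi
    # Assumption: FLOPs scale linearly with depth and quadratically
    # with width and resolution, i.e. as d * w^2 * r^2.
    flops = (alpha * beta ** 2 * gamma ** 2) ** phi
    return depth, width, resolution, flops


depth, width, resolution, flops = compound_scale(phi=2)
print(f"depth x{depth:.2f}, width x{width:.2f}, "
      f"resolution x{resolution:.2f}, FLOPs x{flops:.2f}")
```

With these constants α·β²·γ² is about 1.92 rather than exactly 2, so the estimated FLOPs multiplier at φ=2 comes out near 3.69 instead of 4.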