Characterizing Inter-Communication Patterns for Parallelism Strategies
Distributed training frameworks hide communication behind high-level abstractions (e.g., DDP/FSDP, pipeline schedules, tensor parallelism), but real performance is often dominated by network behavior: bursty traffic, synchronization points, and sensitivity to bandwidth/latency. This project uses controlled CloudLab experiments to measure and compare communication patterns across parallelism strategies.
Motivation
Different parallelism strategies partition parameters, activations, and gradients differently, leading to qualitatively different network stress patterns. Total communicated bytes alone is an insufficient predictor of performance: training steps often contain tight synchronization and micro-bursts that interact poorly with oversubscription, bandwidth caps, and heterogeneous links.
Goals
- Compare communication volume, frequency, and temporal distribution within a training step across strategies.
- Quantify synchronization structure (collective vs. point-to-point) and identify bursty vs. streaming behavior.
- Measure sensitivity to network constraints by varying bandwidth limits and node counts on CloudLab.
Methodology
- Implement representative workloads in PyTorch Distributed: Data Parallel (DDP), Tensor/Model Parallel (TP/MP), Pipeline Parallel (PP), and selected hybrid configs.
- Instrument communication at two levels:
  - Framework-level: timing and sizes of collective and point-to-point operations.
  - System-level: per-interface throughput over time (e.g., ss, ifstat, tc statistics; optionally eBPF).
- Run controlled sweeps on CloudLab: nodes (2–8), bandwidth caps (tc), and model/sequence sizes to expose different regimes.
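The framework-level instrumentation above can be sketched as a small logger that records per-operation timing and sizes. `CommLogger`, `CommEvent`, and their fields are our own names (not a torch.distributed API), and the wrapped collective is stubbed with a no-op so the sketch stays framework-agnostic; in the harness the same wrapper would be applied to real collective calls.

```python
import time
from dataclasses import dataclass, field

@dataclass
class CommEvent:
    op: str          # e.g., "all_reduce", "send"
    nbytes: int      # payload size in bytes
    start: float     # seconds since logger creation
    duration: float  # seconds spent in the op

@dataclass
class CommLogger:
    """Records per-op timing and sizes; names here are ours, not a torch API."""
    events: list = field(default_factory=list)
    t0: float = field(default_factory=time.perf_counter)

    def wrap(self, op_name, fn):
        """Return a wrapper that logs every call to fn(nbytes, ...)."""
        def wrapped(nbytes, *args, **kwargs):
            start = time.perf_counter()
            result = fn(nbytes, *args, **kwargs)
            end = time.perf_counter()
            self.events.append(
                CommEvent(op_name, nbytes, start - self.t0, end - start))
            return result
        return wrapped

    def summary(self):
        """Total bytes plus per-op (call count, bytes) breakdown."""
        by_op = {}
        for e in self.events:
            count, nbytes = by_op.get(e.op, (0, 0))
            by_op[e.op] = (count + 1, nbytes + e.nbytes)
        return {"total_bytes": sum(e.nbytes for e in self.events),
                "by_op": by_op}

# Example: wrap a stand-in for an all-reduce and log two gradient buckets.
log = CommLogger()
fake_all_reduce = log.wrap("all_reduce", lambda nbytes: None)
fake_all_reduce(4 * 1024 * 1024)  # 4 MiB bucket
fake_all_reduce(2 * 1024 * 1024)  # 2 MiB bucket
print(log.summary())  # total_bytes == 6291456, two all_reduce calls
```

The per-event timeline (`events`) is what feeds the op-timeline and burstiness analyses; `summary()` gives the volume comparison across strategies.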
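The bandwidth-cap sweep can be driven by a thin wrapper around tc. `shape_egress` and the HTB handle/class values are our choices for illustration; with `DRY_RUN=1` the commands are printed instead of executed (actually applying them requires root, as noted in the resource request).

```shell
#!/bin/sh
# Sketch: cap egress bandwidth on one interface with tc (HTB).
# shape_egress IFACE RATE_MBIT -- helper name is ours, not a standard tool.
# DRY_RUN=1 prints the tc commands instead of running them.
shape_egress() {
    iface="$1"; rate="$2"
    for cmd in \
        "tc qdisc del dev $iface root" \
        "tc qdisc add dev $iface root handle 1: htb default 10" \
        "tc class add dev $iface parent 1: classid 1:10 htb rate ${rate}mbit"
    do
        if [ "${DRY_RUN:-0}" = "1" ]; then echo "$cmd"; else $cmd; fi
    done
}

# Example (print only): one point of the sweep, a 500 Mbit/s cap on eth0.
DRY_RUN=1 shape_egress eth0 500
```

A sweep then just loops `shape_egress` over the cap values between training runs; deleting the root qdisc first makes the helper idempotent across sweep points.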
Key Metrics
- Communication volume: bytes per step, total and per operation, plus message-size distributions.
- Operation structure: counts and timing of collective vs. point-to-point operations within a step.
- Temporal distribution: throughput over time within a step, including burstiness (peak-to-mean throughput ratio).
- Sensitivity: step time and communication stall time as bandwidth caps and node counts vary.
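The bursty-vs-streaming distinction drawn in the goals can be quantified as the peak-to-mean ratio of per-interval throughput samples; the function name and the choice of this particular ratio are ours.

```python
def peak_to_mean(samples):
    """Peak-to-mean ratio of per-interval throughput samples (bytes/s).
    Values near 1 indicate streaming traffic; large values indicate bursts."""
    if not samples or sum(samples) == 0:
        return 0.0
    return max(samples) / (sum(samples) / len(samples))

# A steady flow vs. a single large burst over ten sampling intervals.
streaming = [100] * 10
bursty = [1000] + [0] * 9
print(peak_to_mean(streaming))  # 1.0
print(peak_to_mean(bursty))     # 10.0
```

Both traces move the same total bytes, which is exactly why volume alone is an insufficient metric: the bursty trace demands 10x the instantaneous bandwidth.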
Expected Outputs
- A reproducible benchmark harness (configs + scripts) for running DP/TP/PP and collecting comm traces.
- Comparative plots: bandwidth-over-time, op timelines, and message-size distributions per strategy.
- A short report summarizing observed patterns and implications for network-aware strategy selection and scheduling.
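For the message-size distribution plots, binning logged payload sizes into power-of-two buckets is enough to expose strategy differences (e.g., a few large gradient buckets under DDP vs. many smaller activation transfers under PP); this particular binning scheme is one plausible choice, not part of the harness spec.

```python
from collections import Counter

def size_histogram(nbytes_list):
    """Bin message sizes into power-of-two buckets, keyed by lower bound (bytes)."""
    bins = Counter()
    for n in nbytes_list:
        if n <= 0:
            continue
        bins[1 << (n.bit_length() - 1)] += 1  # largest power of two <= n
    return dict(sorted(bins.items()))

# Example: two 4 MiB gradient buckets plus a few small control messages.
sizes = [4 * 1024 * 1024, 4 * 1024 * 1024, 64, 100, 2000]
print(size_histogram(sizes))  # {64: 2, 1024: 1, 4194304: 2}
```

Plotting one such histogram per strategy (log-scaled x-axis) gives the per-strategy message-size comparison listed above.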
Resource Request (CloudLab)
Multi-node cluster (2–8 nodes). GPU nodes preferred if available, but CPU-only runs are acceptable for pattern characterization. Root access is required for network shaping (tc) and system-level instrumentation.