Systems · Distributed ML · Network Measurement · CloudLab

Characterizing Inter-Node Communication Patterns for Parallelism Strategies

Distributed training frameworks hide communication behind high-level abstractions (e.g., DDP/FSDP, pipeline schedules, tensor parallelism), but real performance is often dominated by network behavior: bursty traffic, synchronization points, and sensitivity to bandwidth/latency. This project uses controlled CloudLab experiments to measure and compare communication patterns across parallelism strategies.

Status: Experiment design
Stack: PyTorch Distributed · tc · iperf · eBPF/ss
Focus: DP · TP/MP · PP · Hybrid

Motivation

Different parallelism strategies partition parameters, activations, and gradients differently, producing qualitatively different network stress patterns. Total communicated bytes alone is an insufficient predictor of performance: training steps often contain tight synchronization points and micro-bursts that interact poorly with oversubscription, bandwidth caps, and heterogeneous links.

Goals

  • Compare communication volume, frequency, and temporal distribution within a training step across strategies.
  • Quantify synchronization structure (collective vs. point-to-point) and identify bursty vs. streaming behavior.
  • Measure sensitivity to network constraints by varying bandwidth limits and node counts on CloudLab.

Methodology

  1. Implement representative workloads in PyTorch Distributed: Data Parallel (DDP), Tensor/Model Parallel (TP/MP), Pipeline Parallel (PP), and selected hybrid configs.
  2. Instrument comms at two levels:
    • Framework-level: collectives/P2P op timing and sizes.
    • System-level: per-interface throughput over time (e.g., ss, ifstat, tc stats; optionally eBPF).
  3. Run controlled sweeps on CloudLab: nodes (2–8), bandwidth caps (tc), and model/sequence sizes to expose different regimes.
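For the framework-level instrumentation in step 2, one lightweight approach is a generic tracer that wraps each communication call and records its name, payload size, start offset, and duration. The sketch below is a minimal illustration with hypothetical names (`CommEvent`, `CommTracer` are not part of PyTorch); in practice the wrapped callable would be `torch.distributed.all_reduce`, `send`/`recv`, etc.

```python
import time
from dataclasses import dataclass, field

@dataclass
class CommEvent:
    op: str          # e.g. "all_reduce", "send"
    nbytes: int      # payload size in bytes
    t_start: float   # seconds since trace start
    duration: float  # seconds spent inside the op

@dataclass
class CommTracer:
    """Records one CommEvent per instrumented communication op."""
    events: list = field(default_factory=list)
    _t0: float = field(default_factory=time.perf_counter)

    def record(self, op, nbytes, fn, *args, **kwargs):
        # Time the wrapped communication call and log an event.
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        end = time.perf_counter()
        self.events.append(
            CommEvent(op, nbytes, start - self._t0, end - start)
        )
        return result
```

Wrapping a real collective would then look like `tracer.record("all_reduce", t.numel() * t.element_size(), dist.all_reduce, t)`; the resulting event list is the per-step trace the comparative plots are built from. Note that on GPU this naive wall-clock timing measures launch-to-return, not on-wire time, so CUDA events or NCCL-level tracing would be needed for finer resolution.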

Key Metrics

  • Traffic shape: burstiness, peak/avg ratio, step-level time series
  • Synchronization: number of collectives, barriers, critical path
  • Message stats: size distribution, count, cadence
  • Sensitivity: slowdown vs. bandwidth/latency caps
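To make the traffic-shape metrics concrete, here is a minimal sketch (function name `traffic_shape` is my own, not from any existing harness) computing the peak/avg ratio and a simple burstiness proxy from a binned throughput series:

```python
def traffic_shape(samples):
    """Shape metrics for a per-interval throughput series.

    samples: bytes observed per fixed-width interval (e.g. 10 ms bins).
    Returns (peak_to_avg, burst_fraction), where burst_fraction is the
    share of intervals carrying above-average traffic -- low values
    indicate traffic concentrated in a few bursts.
    """
    avg = sum(samples) / len(samples)
    peak = max(samples)
    peak_to_avg = peak / avg if avg > 0 else float("inf")
    burst_fraction = sum(1 for s in samples if s > avg) / len(samples)
    return peak_to_avg, burst_fraction
```

A DDP gradient all-reduce at step boundaries would show a high peak/avg ratio and a small burst fraction, while a steady pipeline-parallel activation stream would sit closer to 1.0 and 0.5 respectively.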

Expected Outputs

  • A reproducible benchmark harness (configs + scripts) for running DP/TP/PP and collecting comm traces.
  • Comparative plots: bandwidth-over-time, op timelines, and message-size distributions per strategy.
  • A short report summarizing observed patterns and implications for network-aware strategy selection and scheduling.

Resource Request (CloudLab)

Multi-node cluster (2–8 nodes). GPUs are preferred if available, but CPU-only runs suffice for pattern characterization. Root access is required for network shaping (tc) and system-level instrumentation.
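The bandwidth caps would be applied per node with a standard `tc tbf` egress shaper. A small helper like the one below (name `tc_cap_cmd` is hypothetical) builds the command for the sweep scripts; it must be run as root on each node, which is why root access is part of the request:

```python
def tc_cap_cmd(iface, rate_mbit, burst_kbit=32, latency_ms=400):
    """Build a `tc` command capping egress bandwidth on iface.

    Uses the token-bucket filter (tbf) qdisc; `replace` makes the
    call idempotent across sweep iterations. Execute the returned
    argv with subprocess.run(cmd, check=True) as root.
    """
    return [
        "tc", "qdisc", "replace", "dev", iface, "root", "tbf",
        "rate", f"{rate_mbit}mbit",
        "burst", f"{burst_kbit}kbit",
        "latency", f"{latency_ms}ms",
    ]
```

For example, `tc_cap_cmd("eth0", 1000)` caps the experiment interface at 1 Gbit/s; sweeping `rate_mbit` across runs exposes the slowdown-vs-bandwidth curves listed under Key Metrics.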

Want to collaborate or have pointers on CloudLab profiles/topologies?
Contact me