Characterizing Inter-Communication Patterns for Parallelism Strategies
Distributed training frameworks hide communication behind high-level abstractions (e.g., DDP/FSDP, pipeline schedules, tensor parallelism), but real performance is often dominated by network behavior: bursty traffic, synchronization points, and sensitivity to bandwidth/latency. This project uses controlled CloudLab experiments to measure and compare communication patterns across parallelism strategies.
Motivation
Different parallelism strategies partition parameters, activations, and gradients differently, leading to qualitatively different network stress patterns. Total communicated bytes alone is an insufficient predictor of performance: training steps often contain tight synchronization and micro-bursts that interact poorly with oversubscription, bandwidth caps, and heterogeneous links.
Goals
- Compare communication volume, frequency, and temporal distribution within a training step across strategies.
- Quantify synchronization structure (collective vs. point-to-point) and identify bursty vs. streaming behavior.
- Measure sensitivity to network constraints by varying bandwidth limits and node counts on CloudLab.
Methodology
- Implement representative workloads in PyTorch Distributed: Data Parallel (DDP), Tensor/Model Parallel (TP/MP), Pipeline Parallel (PP), and selected hybrid configs.
- Instrument communication at two levels:
  - Framework-level: timing and sizes of collective and point-to-point operations.
  - System-level: per-interface throughput over time (e.g., ss, ifstat, tc statistics; optionally eBPF).
- Run controlled sweeps on CloudLab: nodes (2–8), bandwidth caps (tc), and model/sequence sizes to expose different regimes.
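The framework-level instrumentation above can be sketched as a small logger that records per-operation timing and sizes. `CommLogger`, `CommEvent`, and their fields are our own names (not a torch.distributed API), and the wrapped collective is stubbed with a no-op so the sketch stays framework-agnostic; in the harness the same wrapper would be applied to real collective calls.

```python
import time
from dataclasses import dataclass, field

@dataclass
class CommEvent:
    op: str          # e.g., "all_reduce", "send"
    nbytes: int      # payload size in bytes
    start: float     # seconds since logger creation
    duration: float  # seconds spent in the op

@dataclass
class CommLogger:
    """Records per-op timing and sizes; names here are ours, not a torch API."""
    events: list = field(default_factory=list)
    t0: float = field(default_factory=time.perf_counter)

    def wrap(self, op_name, fn):
        """Return a wrapper that logs every call to fn(nbytes, ...)."""
        def wrapped(nbytes, *args, **kwargs):
            start = time.perf_counter()
            result = fn(nbytes, *args, **kwargs)
            end = time.perf_counter()
            self.events.append(
                CommEvent(op_name, nbytes, start - self.t0, end - start))
            return result
        return wrapped

    def summary(self):
        """Total bytes plus per-op (call count, bytes) breakdown."""
        by_op = {}
        for e in self.events:
            count, nbytes = by_op.get(e.op, (0, 0))
            by_op[e.op] = (count + 1, nbytes + e.nbytes)
        return {"total_bytes": sum(e.nbytes for e in self.events),
                "by_op": by_op}

# Example: wrap a stand-in for an all-reduce and log two gradient buckets.
log = CommLogger()
fake_all_reduce = log.wrap("all_reduce", lambda nbytes: None)
fake_all_reduce(4 * 1024 * 1024)  # 4 MiB bucket
fake_all_reduce(2 * 1024 * 1024)  # 2 MiB bucket
print(log.summary())  # total_bytes == 6291456, two all_reduce calls
```

The per-event timeline (`events`) is what feeds the op-timeline and burstiness analyses; `summary()` gives the volume comparison across strategies.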
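The bandwidth-cap sweep can be driven by a thin wrapper around tc. `shape_egress` and the HTB handle/class values are our choices for illustration; with `DRY_RUN=1` the commands are printed instead of executed (actually applying them requires root, as noted in the resource request).

```shell
#!/bin/sh
# Sketch: cap egress bandwidth on one interface with tc (HTB).
# shape_egress IFACE RATE_MBIT -- helper name is ours, not a standard tool.
# DRY_RUN=1 prints the tc commands instead of running them.
shape_egress() {
    iface="$1"; rate="$2"
    for cmd in \
        "tc qdisc del dev $iface root" \
        "tc qdisc add dev $iface root handle 1: htb default 10" \
        "tc class add dev $iface parent 1: classid 1:10 htb rate ${rate}mbit"
    do
        if [ "${DRY_RUN:-0}" = "1" ]; then echo "$cmd"; else $cmd; fi
    done
}

# Example (print only): one point of the sweep, a 500 Mbit/s cap on eth0.
DRY_RUN=1 shape_egress eth0 500
```

A sweep then just loops `shape_egress` over the cap values between training runs; deleting the root qdisc first makes the helper idempotent across sweep points.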
Key Metrics
- Communication volume: bytes per step, total and per operation, plus message-size distributions.
- Operation structure: counts and timing of collective vs. point-to-point operations within a step.
- Temporal distribution: throughput over time within a step, including burstiness (peak-to-mean throughput ratio).
- Sensitivity: step time and communication stall time as bandwidth caps and node counts vary.
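The bursty-vs-streaming distinction drawn in the goals can be quantified as the peak-to-mean ratio of per-interval throughput samples; the function name and the choice of this particular ratio are ours.

```python
def peak_to_mean(samples):
    """Peak-to-mean ratio of per-interval throughput samples (bytes/s).
    Values near 1 indicate streaming traffic; large values indicate bursts."""
    if not samples or sum(samples) == 0:
        return 0.0
    return max(samples) / (sum(samples) / len(samples))

# A steady flow vs. a single large burst over ten sampling intervals.
streaming = [100] * 10
bursty = [1000] + [0] * 9
print(peak_to_mean(streaming))  # 1.0
print(peak_to_mean(bursty))     # 10.0
```

Both traces move the same total bytes, which is exactly why volume alone is an insufficient metric: the bursty trace demands 10x the instantaneous bandwidth.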
Expected Outputs
- A reproducible benchmark harness (configs + scripts) for running DP/TP/PP and collecting comm traces.
- Comparative plots: bandwidth-over-time, op timelines, and message-size distributions per strategy.
- A short report summarizing observed patterns and implications for network-aware strategy selection and scheduling.
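For the message-size distribution plots, binning logged payload sizes into power-of-two buckets is enough to expose strategy differences (e.g., a few large gradient buckets under DDP vs. many smaller activation transfers under PP); this particular binning scheme is one plausible choice, not part of the harness spec.

```python
from collections import Counter

def size_histogram(nbytes_list):
    """Bin message sizes into power-of-two buckets, keyed by lower bound (bytes)."""
    bins = Counter()
    for n in nbytes_list:
        if n <= 0:
            continue
        bins[1 << (n.bit_length() - 1)] += 1  # largest power of two <= n
    return dict(sorted(bins.items()))

# Example: two 4 MiB gradient buckets plus a few small control messages.
sizes = [4 * 1024 * 1024, 4 * 1024 * 1024, 64, 100, 2000]
print(size_histogram(sizes))  # {64: 2, 1024: 1, 4194304: 2}
```

Plotting one such histogram per strategy (log-scaled x-axis) gives the per-strategy message-size comparison listed above.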
Resource Request (CloudLab)
Multi-node cluster (2–8 nodes). GPU nodes preferred if available, but CPU-only runs are acceptable for pattern characterization. Root access is required for network shaping (tc) and system-level instrumentation.