Glossary
A
Activation Checkpointing [Memory]
Drop and recompute intermediate activations during the backward pass to save memory at the cost of extra compute.
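A minimal sketch using PyTorch's checkpoint utility; the Block module here is illustrative, not from the glossary:

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        # Activations inside self.ff are not stored; they are recomputed
        # during the backward pass, trading extra compute for memory.
        return checkpoint(self.ff, x, use_reentrant=False)
```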
Related: recompute, trade-off
all_gather_object [Collectives]
Gather Python objects from all ranks to all ranks. Complements tensor-based all_gather for small control data.
Related: objects, gather, serialization
All-Gather [Communication]
Collective that gathers shards from all processes so each process ends up with the concatenated full tensor.
Related: tensor parallel, parameters
All-Reduce [Communication]
Collective that reduces values (e.g., sum of gradients) across all processes and distributes the result back to all.
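A minimal sketch, assuming a process group has already been initialized and each process has set its CUDA device:

```python
import torch
import torch.distributed as dist

# Each rank contributes its own values; after all_reduce every rank holds the sum.
grads = torch.ones(4, device="cuda") * dist.get_rank()
dist.all_reduce(grads, op=dist.ReduceOp.SUM)
grads /= dist.get_world_size()  # average, as data-parallel gradient sync typically does
```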
Related: gradients, synchronization
B
Backend (torch.distributed) [PyTorch Distributed]
Communication backend for process groups. 'nccl' (GPU-only, fastest on NVIDIA), 'gloo' (CPU/multi-arch), 'mpi' (when compiled with MPI). Choose NCCL for CUDA tensors; Gloo for CPU tensors.
Related: NCCL, Gloo, MPI
barrier (torch.distributed) [Collectives]
Collective that blocks until all ranks enter, ensuring a synchronization point in the program.
Related: synchronization, rendezvous
BF16 [Precision]
bfloat16 format with 8-bit exponent and 7-bit mantissa; similar dynamic range to FP32, often more stable than FP16.
Related: bfloat16, AMP
Broadcast [Communication]
Collective that sends a tensor from one process (the root) to all others.
Related: parameters, init
broadcast_object_list [Collectives]
Broadcast Python objects by serializing them across ranks. Useful for small configs/metadata that are not tensors.
Related: objects, broadcast, serialization
C
Checkpoint [Reliability]
Saved model, optimizer, and training state enabling resume or evaluation without retraining from scratch.
Related: state, resume
Checkpoint Sharding [Memory]
Saving model state across multiple files/processes so no single device needs to materialize the full state at once.
Related: FSDP, DeepSpeed
Collective Communication [Communication]
Operations involving multiple processes such as all-reduce, reduce-scatter, all-gather, and broadcast used to synchronize training state.
Related: NCCL, synchronization
CUDA Graphs [Performance]
Capture and replay sequences of GPU operations to reduce launch overhead and improve performance.
Related: launch overhead, PyTorch 2
CUDA Streams [Performance]
Queues for asynchronous GPU work submission; used to overlap kernels and communication with compute.
Related: asynchrony, overlap
CUDA_VISIBLE_DEVICES [Environment]
Comma-separated list of GPU IDs to expose to the process; visible devices are re-enumerated from 0, which determines the LOCAL_RANK→CUDA device mapping.
Related: CUDA, device mapping
D
Data Parallelism (DP) [Distributed Training]
Replicate the full model on each worker and split the batch across workers; gradients are synchronized (e.g., via all-reduce) to keep weights in sync.
Related: DDP, scaling, replicas
DeepSpeed [Frameworks]
Microsoft's library for scalable training featuring ZeRO, pipeline parallelism, offloading, and many optimizations.
Related: ZeRO, offload
Determinism [Reproducibility]
Producing bitwise or numerically stable results across runs by controlling seeds, algorithms, and parallelism sources.
Related: seed, cudnn, algorithms
DeviceMesh [PyTorch Distributed]
An n‑dimensional logical arrangement of devices used by DTensor and collective libraries to express partitioning and communication groups across multiple axes (e.g., data, tensor, pipeline).
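A minimal sketch, assuming PyTorch 2.2+ and 8 GPUs, building a 2-D mesh (2-way data parallel by 4-way tensor parallel):

```python
from torch.distributed.device_mesh import init_device_mesh

# Requires a launched distributed environment (e.g., via torchrun).
mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))
dp_group = mesh["dp"].get_group()  # process group along the data-parallel axis
tp_group = mesh["tp"].get_group()  # process group along the tensor-parallel axis
```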
Related: DTensor, parallelism, groups
Direct Preference Optimization (DPO) [Fine-tuning]
Preference optimization method that learns from chosen vs. rejected outputs without explicit reward modeling.
Related: preference, alignment
Distributed Data Parallel (DDP) [Frameworks]
PyTorch's process-based data parallel training where each process holds a full model replica and synchronizes gradients using collective communications.
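A minimal sketch of wrapping a model, assuming a torchrun launch and an initialized NCCL process group; MyModel is a placeholder module:

```python
import os
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
torch.cuda.set_device(local_rank)

model = MyModel().to(local_rank)             # MyModel is a placeholder
ddp_model = DDP(model, device_ids=[local_rank])
# Gradients are all-reduced across ranks automatically during backward().
```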
Related: PyTorch, all-reduce, NCCL
Distributed Sampler [Input Pipeline]
Data loader component that partitions a dataset across processes to avoid sample duplication and maintain shuffling guarantees.
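A minimal sketch with torch.utils.data; dataset and num_epochs are placeholders:

```python
from torch.utils.data import DataLoader, DistributedSampler

sampler = DistributedSampler(dataset, shuffle=True)   # partitions indices across ranks
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffles consistently across ranks each epoch
    for batch in loader:
        ...
```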
Related: sampler, sharding
E
Elastic Training [Distributed Systems]
Allow workers to join or leave during training (e.g., due to preemption), with automatic re-rendezvous and state recovery.
Related: fault tolerance, preemption
F
FileStore [PyTorch Distributed]
Filesystem-backed Store that uses a shared directory to coordinate ranks. Suitable when nodes share a filesystem (e.g., NFS).
Related: Store, filesystem
FP16 [Precision]
IEEE half-precision floating point with 5-bit exponent and 10-bit mantissa; higher throughput than FP32 but narrower dynamic range than BF16.
Related: half precision
Fully Sharded Data Parallel (FSDP) [Frameworks]
PyTorch sharded training that partitions parameters, gradients, and optimizer states across workers, optionally with activation checkpointing, to reduce memory.
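A minimal sketch, assuming an initialized process group; real setups usually add an auto-wrap policy and mixed-precision settings, and MyTransformer is a placeholder:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = MyTransformer().cuda()   # placeholder module
fsdp_model = FSDP(model)         # parameters, gradients, and optimizer state are sharded
optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)
```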
Related: PyTorch, sharding, memory
Fused Kernels [Performance]
Kernels combining multiple ops into one to reduce memory reads/writes and launch overhead, improving throughput.
Related: throughput, optimization
G
Global Batch Size [Optimization]
Total number of samples processed per optimizer step across all workers and micro-batches.
Related: batch size, scaling
GLOO_SOCKET_IFNAME [Environment]
Network interface name for the Gloo backend to bind/listen on. Mirrors NCCL_SOCKET_IFNAME for Gloo.
Related: Gloo, network, interface
GPipe [Algorithms]
Pipeline parallel training method that splits mini-batches into micro-batches to keep pipeline stages utilized.
Related: pipeline, micro-batches
Gradient Accumulation [Optimization]
Accumulate gradients over multiple micro-batches before an optimizer step to simulate larger effective batch sizes.
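A minimal sketch; model, loader, optimizer, and loss_fn are placeholders:

```python
accum_steps = 8  # effective batch = micro-batch size * accum_steps * world size
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    # Scale the loss so the accumulated gradients average rather than sum.
    loss = loss_fn(model(inputs), targets) / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```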
Related: memory, batch size
Gradient Bucketing [Optimization]
Grouping gradients into buckets to reduce overhead and enable overlapping communication with computation.
Related: overlap, communication
Gradient Clipping [Optimization]
Limit the norm or value of gradients to prevent exploding gradients and improve training stability at large batch sizes.
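A minimal sketch with PyTorch's built-in norm clipping; loss, model, and optimizer are placeholders:

```python
import torch

loss.backward()
# Rescale gradients in place if their global L2 norm exceeds 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```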
Related: stability, norm
H
Hostfile [Infrastructure]
File listing nodes and slots (GPUs) for multi-node training; used by launchers to allocate resources.
Related: multi-node, launcher
Hybrid Parallelism [Distributed Training]
Combine multiple forms of parallelism (data, tensor, pipeline) to scale very large models across many devices.
Related: 3D parallelism, scaling
I
InfiniBand [Hardware]
Low-latency, high-throughput network commonly used for multi-node training, often with RDMA support.
Related: network, RDMA
K
Kubernetes [Infrastructure]
Container orchestration system; commonly used to run distributed training jobs with operators or custom controllers.
Related: orchestration, cloud
L
Latency [Performance]
Time it takes to complete a single training step; can increase with synchronization or stragglers.
Related: step time, synchronization
Learning Rate Schedule [Optimization]
Rule for changing the learning rate over time (e.g., cosine decay, step decay) to improve convergence.
Related: convergence, schedule
Local Rank (LOCAL_RANK, LOCAL_WORLD_SIZE) [Distributed Systems]
Rank of a process relative to its node; used to select the local GPU device. Provided by launchers through LOCAL_RANK (and LOCAL_WORLD_SIZE).
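A minimal sketch of mapping a process to its GPU under torchrun:

```python
import os
import torch

local_rank = int(os.environ.get("LOCAL_RANK", 0))  # exported by torchrun
torch.cuda.set_device(local_rank)                  # bind this process to its local GPU
device = torch.device("cuda", local_rank)
```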
Related: GPU, device
LoRA [Fine-tuning]
Low-Rank Adaptation; inject low-rank trainable matrices into existing weights to adapt models with small memory overhead.
Related: PEFT, adapters
Loss Scaling [Precision]
Multiply the loss to shift small gradients into representable range when using FP16, then unscale before the optimizer step.
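A minimal sketch with PyTorch's GradScaler; model, inputs, and optimizer are placeholders, and BF16 usually needs no loss scaling:

```python
import torch

scaler = torch.cuda.amp.GradScaler()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(inputs).sum()      # placeholder forward pass and loss
scaler.scale(loss).backward()       # backward on the scaled loss
scaler.step(optimizer)              # unscales gradients; skips the step on inf/NaN
scaler.update()
```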
Related: AMP, stability
M
Master Address and Port (MASTER_ADDR, MASTER_PORT) [Distributed Systems]
Network coordinates used by workers to rendezvous and form process groups. In PyTorch, set via environment variables MASTER_ADDR and MASTER_PORT.
Related: rendezvous, env vars
Megatron-LM [Frameworks]
Framework for large-scale Transformer training with tensor and pipeline parallelism and fused kernels.
Related: tensor parallel, pipeline
Micro-batch [Optimization]
A small batch that fits in device memory; multiple micro-batches can be accumulated into a larger global batch.
Related: pipeline, accumulation
Mixed Precision Training [Precision]
Use reduced precision (e.g., FP16 or BF16) for most ops to reduce memory and increase throughput, with care to maintain numerical stability.
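A minimal sketch with autocast in BF16; model, inputs, targets, and loss_fn are placeholders:

```python
import torch

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    outputs = model(inputs)          # matmuls run in BF16; many reductions stay in FP32
    loss = loss_fn(outputs, targets)
loss.backward()                      # with BF16, loss scaling is usually unnecessary
```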
Related: AMP, bf16, fp16
Model Parallelism [Distributed Training]
Split a single model's parameters across multiple devices so each device holds only a shard of the model.
Related: sharding, scaling
monitored_barrier [Collectives]
Barrier variant that detects ranks that fail to reach the barrier by a deadline and logs or raises for easier deadlock diagnosis.
Related: debugging, deadlock
N
NCCL [Communication]
NVIDIA Collective Communications Library optimized for high-performance multi-GPU and multi-node communication.
Related: collectives, GPU
NCCL_ALGO [Environment]
Algorithm preference hint for NCCL collectives (e.g., 'Tree', 'Ring'). Can influence latency vs. bandwidth trade-offs.
Related: NCCL, algorithm
NCCL_DEBUG [Environment]
Controls NCCL logging verbosity (e.g., 'WARN', 'INFO', 'TRACE'). Useful for diagnosing topology, algorithm selection, and transport issues.
Related: NCCL, logging
NCCL_IB_GID_INDEX [Environment]
InfiniBand GID index used by NCCL to select RoCE/IB addressing (common values: 0 for IB, 3 for RoCEv2).
Related: NCCL, GID, RoCE
NCCL_IB_HCA [Environment]
InfiniBand device allowlist for NCCL (e.g., 'mlx5_0'). Helps constrain which HCAs are used on multi-HCA nodes.
Related: NCCL, InfiniBand, HCA
NCCL_P2P_DISABLE [Environment]
Disable peer-to-peer (P2P) direct GPU communication in NCCL when set to '1'; forces traffic through other transports.
Related: NCCL, P2P
NCCL_PROTO [Environment]
Protocol hint for NCCL ('LL', 'LL128', 'Simple') controlling chunk sizes/latency behavior.
Related: NCCL, protocol
NCCL_SHM_DISABLE [Environment]
Disable NCCL shared-memory transport on a node (set to '1') to work around SHM limitations or container constraints.
Related: NCCL, SHM
NCCL_SOCKET_IFNAME [Environment]
Comma-separated list of network interfaces NCCL is allowed to use (e.g., 'ib0,eno1'). Set to choose between InfiniBand/Ethernet or to avoid docker/lo interfaces.
Related: NCCL, network, interface
NVLink [Hardware]
High-bandwidth interconnect between NVIDIA GPUs that provides faster device-to-device communication than PCIe.
Related: interconnect, GPU
O
Offloading [Memory]
Moving tensors (parameters, gradients, optimizer states, activations) to CPU or NVMe to fit larger models than GPU memory allows.
Related: CPU, NVMe, throughput
Optimizer State Sharding [Memory]
Partitioning optimizer states (e.g., Adam moments) across workers to reduce memory footprint.
Related: ZeRO, FSDP
Overlap Compute and Communication [Optimization]
Scheduling communication (e.g., gradient reductions) concurrently with compute to hide latency and improve throughput.
Related: performance, latency
P
Parameter Sharding [Memory]
Partitioning model parameters across devices so each process stores only a shard, gathering on-the-fly when needed.
Related: ZeRO, FSDP
Parameter-Efficient Fine-Tuning (PEFT) [Fine-tuning]
Fine-tune a small subset of parameters or add modules (e.g., LoRA) to adapt large models efficiently.
Related: LoRA, adapters
PCIe [Hardware]
Peripheral Component Interconnect Express; general-purpose high-speed bus used for GPU-host and GPU-GPU communication.
Related: interconnect
PipeDream [Algorithms]
Pipeline parallel approach with asynchronous weight updates and schedule optimizations to reduce bubbles.
Related: pipeline, scheduling
Pipeline Parallelism [Distributed Training]
Split the model by layers into stages placed on different devices; process micro-batches through stages like an assembly line to keep devices busy.
Related: GPipe, PipeDream, stages
Process Group [Distributed Systems]
A set of distributed processes that can communicate via collectives; used to scope data, tensor, or pipeline parallel communication.
Related: PyTorch, NCCL, groups
Proximal Policy Optimization (PPO) [Fine-tuning]
Policy-gradient RL algorithm commonly used in RLHF to optimize language models from scalar rewards.
Related: RLHF, policy gradient
Q
QLoRA [Fine-tuning]
PEFT method combining 4-bit quantization of base weights with LoRA adapters to reduce memory during fine-tuning.
Related: quantization, LoRA
R
Rank (RANK) [Distributed Systems]
Unique identifier of a process within a distributed job, in the range [0, WORLD_SIZE-1]. Often provided via the environment variable RANK. Rank 0 is typically used for logging/checkpoint coordination.
Related: DDP, process
Ray [Frameworks]
Distributed execution framework that provides high-level APIs for scaling Python and ML workloads, including training.
Related: distributed, python
RDMA [Hardware]
Remote Direct Memory Access; allows one machine to access another's memory without involving the remote CPU, reducing latency.
Related: network, latency
reduce_scatter_tensor [Collectives]
Reduce a full-size input tensor across ranks and scatter equal-sized shards of the reduced result to each rank; commonly used in sharded training to overlap communication with compute.
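A minimal sketch, assuming an initialized process group and an input length divisible by the world size:

```python
import torch
import torch.distributed as dist

world_size = dist.get_world_size()
full = torch.ones(4 * world_size, device="cuda")  # each rank's full-length input
shard = torch.empty(4, device="cuda")             # output shard owned by this rank
dist.reduce_scatter_tensor(shard, full, op=dist.ReduceOp.SUM)
# shard now holds this rank's slice of the element-wise sum across all ranks.
```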
Related: sharding, overlap
Reduce-Scatter [Communication]
Collective that reduces values across processes and scatters disjoint shards of the result to each process; useful for sharded training.
Related: sharding, collective
Reinforcement Learning from Human Feedback (RLHF) [Fine-tuning]
Pipeline that trains a reward model from human preferences and optimizes the policy (e.g., via PPO) to maximize the reward.
Related: preference, alignment
Rendezvous [Distributed Systems]
The mechanism by which distributed processes discover each other and form process groups (e.g., via master address/port or an elastic agent).
Related: init, elastic
Ring All-Reduce [Communication]
All-reduce algorithm arranging processes in a ring, passing and reducing chunks to optimize bandwidth utilization.
Related: algorithm, bandwidth
RLAIF [Fine-tuning]
Reinforcement Learning from AI Feedback; uses AI-generated preferences to scale preference optimization.
Related: preference, scaling
S
Seed [Reproducibility]
Initial value for random number generators; setting consistent seeds across processes aids reproducibility.
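A minimal sketch of a seeding helper; the per-rank offset is a common choice, not a requirement:

```python
import random
import numpy as np
import torch

def set_seed(seed: int, rank: int = 0) -> None:
    # Offset by rank where per-rank divergence (e.g., dropout, data order) is desired;
    # use the same seed on every rank when identical initialization is required.
    random.seed(seed + rank)
    np.random.seed(seed + rank)
    torch.manual_seed(seed + rank)
    torch.cuda.manual_seed_all(seed + rank)
```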
Related: randomness, RNG
Sequence Packing [Input Pipeline]
Concatenate multiple sequences into a fixed-length window to reduce padding and increase token throughput.
Related: packing, tokens
SLURM [Infrastructure]
Cluster workload manager used to schedule and launch distributed training jobs on HPC systems.
Related: scheduler, HPC
Store (torch.distributed) [PyTorch Distributed]
A key-value service used by processes to share small pieces of state during rendezvous and beyond. Implementations include TCPStore, FileStore, and HashStore. Supports operations such as set(), get(), wait(), and timeouts.
Related: rendezvous, state, Store
Straggler [Distributed Systems]
A slow worker that delays synchronous steps; mitigation includes better placement, pipelining, or asynchronous techniques.
Related: performance, synchronization
Supervised Fine-Tuning (SFT) [Fine-tuning]
Train a model on input-output pairs to follow instructions before preference optimization or RLHF.
Related: instruction tuning, preference
T
TCPStore [PyTorch Distributed]
Networked Store backed by a TCP server (typically on rank 0). Other ranks connect via MASTER_ADDR/MASTER_PORT. Useful for multi-node rendezvous and sharing runtime state.
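A minimal sketch; the host, port, world_size, and rank values are illustrative placeholders:

```python
from datetime import timedelta
import torch.distributed as dist

# Rank 0 hosts the store; the other ranks connect to it.
store = dist.TCPStore(
    "10.0.0.1", 29500, world_size=4,
    is_master=(rank == 0), timeout=timedelta(seconds=300),
)
store.set("status", "ready")   # small key-value state shared across ranks
print(store.get("status"))
```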
Related: Store, TCP
Tensor Parallelism [Distributed Training]
Type of model parallelism that partitions individual tensors (e.g., attention or MLP weights) across devices to parallelize intra-layer compute.
Related: Megatron-LM, intra-layer, sharding
Throughput [Performance]
Number of samples or tokens processed per unit time; key metric for distributed training efficiency.
Related: tokens/s, samples/s
Topology-aware Placement [Infrastructure]
Scheduling processes with awareness of hardware links (NVLink, PCIe, IB) to maximize bandwidth and minimize latency.
Related: placement, bandwidth
TORCH_DISTRIBUTED_DEBUG [Environment]
Debug level for torch.distributed ('OFF', 'INFO', 'DETAIL'). Increases runtime checks and logging to help diagnose collectives and rendezvous issues.
Related: debug, logging
torch.distributed.init_process_group [PyTorch Distributed]
Initialize the default process group for collectives. Must be called once per process before using torch.distributed APIs. Key args: backend ('nccl', 'gloo', 'mpi'), rank, world_size, and init_method/store. Call destroy_process_group() at shutdown.
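A minimal sketch, assuming the process was launched by torchrun so the rendezvous environment variables are already set:

```python
import os
import torch
import torch.distributed as dist

def setup() -> None:
    # RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT come from the launcher (env:// init).
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

def cleanup() -> None:
    dist.destroy_process_group()
```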
Related: initialization, process group, backend
torchrun [Launchers]
Recommended PyTorch launcher. Spawns the requested number of processes per node (typically one per GPU), sets RANK, WORLD_SIZE, and LOCAL_RANK/LOCAL_WORLD_SIZE, and passes rendezvous information (MASTER_ADDR/MASTER_PORT). Supports elastic/etcd rendezvous.
Related: elastic, spawn, multiprocess
W
Warmup [Optimization]
Initial phase increasing the learning rate gradually to stabilize optimization, especially with large batch sizes.
Related: lr schedule, stability
World Size (WORLD_SIZE) [Distributed Systems]
Total number of distributed processes participating in training. Commonly exported as the environment variable WORLD_SIZE and consumed by launchers (e.g., torchrun) and frameworks to size process groups.
Related: DDP, FSDP, DeepSpeed
Z
ZeRO [Frameworks]
DeepSpeed's Zero Redundancy Optimizer that removes memory redundancy by sharding optimizer states, gradients, and parameters across data-parallel workers.
Related: DeepSpeed, sharding, memory
ZeRO Stage 1 [Frameworks]
Shard optimizer states across data-parallel workers to reduce memory without changing gradients or parameters.
Related: DeepSpeed, optimizer state
ZeRO Stage 2 [Frameworks]
Shard optimizer states and gradients across data-parallel workers for further memory savings.
Related: DeepSpeed, gradients
ZeRO Stage 3 [Frameworks]
Shard optimizer states, gradients, and parameters across workers, achieving maximal memory savings with on-the-fly parameter gathering.
Related: DeepSpeed, parameters