Glossary

A

  • Activation Checkpointing [Memory]

    Drop and recompute intermediate activations during the backward pass to save memory at the cost of extra compute.

    Tags: recompute, trade-off
  • all_gather_object [Collectives]

    Gather Python objects from all ranks to all ranks. Complements tensor-based all_gather for small control data (see the sketch after this section).

    Tags: objects, gather, serialization
  • All-Gather [Communication]

    Collective that gathers shards from all processes so each process ends up with the concatenated full tensor.

    Tags: tensor parallel, parameters
  • All-Reduce [Communication]

    Collective that reduces values (e.g., sum of gradients) across all processes and distributes the result back to all.

    Tags: gradients, synchronization
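  A minimal sketch of the two collectives above, assuming an already-initialized default process group (the helper name sync_step is illustrative):

      import torch
      import torch.distributed as dist

      def sync_step(local_grad: torch.Tensor, local_stats: dict):
          # All-Reduce: sum the tensor across ranks, then average it.
          dist.all_reduce(local_grad, op=dist.ReduceOp.SUM)
          local_grad /= dist.get_world_size()

          # all_gather_object: collect small, picklable control data from every rank.
          gathered = [None] * dist.get_world_size()
          dist.all_gather_object(gathered, local_stats)
          return local_grad, gathered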

B

  • Backend (torch.distributed) (nccl, gloo, mpi) [PyTorch Distributed]

    Communication backend for process groups. 'nccl' (GPU-only, fastest on NVIDIA), 'gloo' (CPU/multi-arch), 'mpi' (when compiled with MPI). Choose NCCL for CUDA tensors; Gloo for CPU tensors.

    Tags: NCCL, Gloo, MPI
  • barrier (torch.distributed) [Collectives]

    Collective that blocks until all ranks enter, ensuring a synchronization point in the program.

    Tags: synchronization, rendezvous
  • BF16 [Precision]

    bfloat16 format with 8-bit exponent and 7-bit mantissa; similar dynamic range to FP32, often more stable than FP16.

    Tags: bfloat16, AMP
  • Broadcast [Communication]

    Collective that sends a tensor from one process (the root) to all others.

    Tags: parameters, init
  • broadcast_object_list [Collectives]

    Broadcast Python objects by serializing them across ranks. Useful for small configs/metadata that are not tensors (see the sketch after this section).

    Tags: objects, broadcast, serialization
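  A minimal sketch showing backend selection, a broadcast of a small config dict, and a barrier, assuming RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT were set by a launcher (the config values are illustrative):

      import torch
      import torch.distributed as dist

      backend = "nccl" if torch.cuda.is_available() else "gloo"
      dist.init_process_group(backend=backend)

      config = [{"lr": 3e-4, "seed": 1234}] if dist.get_rank() == 0 else [None]
      dist.broadcast_object_list(config, src=0)  # every rank now holds rank 0's dict
      dist.barrier()                             # synchronization point before training
      dist.destroy_process_group()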

C

  • Checkpoint [Reliability]

    Saved model, optimizer, and training state enabling resume or evaluation without retraining from scratch (see the sketch after this section).

    Tags: state, resume
  • Checkpoint Sharding [Memory]

    Saving model state across multiple files/processes so no single device needs to materialize the full state at once.

    Tags: FSDP, DeepSpeed
  • Collective Communication [Communication]

    Operations involving multiple processes, such as all-reduce, reduce-scatter, all-gather, and broadcast, used to synchronize training state.

    Tags: NCCL, synchronization
  • CUDA Graphs [Performance]

    Capture and replay sequences of GPU operations to reduce launch overhead and improve performance.

    Tags: launch overhead, PyTorch 2
  • CUDA Streams [Performance]

    Queues for asynchronous GPU work submission; used to overlap kernels and communication with compute.

    Tags: asynchrony, overlap
  • CUDA_VISIBLE_DEVICES [Environment]

    Comma-separated list of GPU IDs the process is allowed to see; visible devices are re-enumerated from 0 inside the process, which determines the LOCAL_RANK→CUDA device mapping.

    Tags: CUDA, device mapping
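  A minimal sketch of selecting the local CUDA device and writing a resumable checkpoint from rank 0 only, assuming an initialized process group; model, optimizer, and step are assumed to exist and the path is illustrative:

      import os
      import torch
      import torch.distributed as dist

      local_rank = int(os.environ.get("LOCAL_RANK", 0))
      torch.cuda.set_device(local_rank)       # IDs are 0..N-1 among the visible GPUs

      def save_checkpoint(model, optimizer, step, path="checkpoint.pt"):
          if dist.get_rank() == 0:            # write once, not once per rank
              torch.save(
                  {"model": model.state_dict(),
                   "optimizer": optimizer.state_dict(),
                   "step": step},
                  path,
              )
          dist.barrier()                      # other ranks wait until the file exists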

D

  • Data Parallelism (DP) [Distributed Training]

    Replicate the full model on each worker and split the batch across workers; gradients are synchronized (e.g., via all-reduce) to keep weights in sync.

    Tags: DDP, scaling, replicas
  • DeepSpeed [Frameworks]

    Microsoft's library for scalable training featuring ZeRO, pipeline parallelism, offloading, and many optimizations.

    Tags: ZeRO, offload
  • Determinism [Reproducibility]

    Producing bitwise-identical or numerically stable results across runs by controlling seeds, algorithms, and sources of parallelism.

    Tags: seed, cudnn, algorithms
  • DeviceMesh [PyTorch Distributed]

    An n-dimensional logical arrangement of devices used by DTensor and collective libraries to express partitioning and communication groups across multiple axes (e.g., data, tensor, pipeline).

    Tags: DTensor, parallelism, groups
  • Direct Preference Optimization (DPO) [Fine-tuning]

    Preference optimization method that learns from chosen vs. rejected outputs without explicit reward modeling.

    Tags: preference, alignment
  • Distributed Data Parallel (DDP) [Frameworks]

    PyTorch's process-based data parallel training where each process holds a full model replica and synchronizes gradients using collective communications (see the sketch after this section).

    Tags: PyTorch, all-reduce, NCCL
  • Distributed Sampler [Input Pipeline]

    Data loader component that partitions a dataset across processes to avoid sample duplication and maintain shuffling guarantees.

    Tags: sampler, sharding
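  A minimal sketch of DDP with a DistributedSampler, assuming an initialized process group and existing model and dataset objects:

      import torch
      from torch.nn.parallel import DistributedDataParallel as DDP
      from torch.utils.data import DataLoader
      from torch.utils.data.distributed import DistributedSampler

      def make_dataloader(dataset, batch_size, epoch):
          sampler = DistributedSampler(dataset, shuffle=True)
          sampler.set_epoch(epoch)        # reshuffle with a different ordering each epoch
          return DataLoader(dataset, batch_size=batch_size, sampler=sampler)

      def wrap_model(model, local_rank):
          model = model.cuda(local_rank)
          # Gradients are all-reduced across replicas during backward().
          return DDP(model, device_ids=[local_rank])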

E

  • Elastic Training [Distributed Systems]

    Allow workers to join or leave during training (e.g., due to preemption), with automatic re-rendezvous and state recovery.

    Tags: fault tolerance, preemption

F

  • FileStore [PyTorch Distributed]

    Filesystem-backed Store that uses a shared directory to coordinate ranks. Suitable when nodes share a filesystem (e.g., NFS).

    Tags: Store, filesystem
  • FP16 [Precision]

    IEEE half-precision floating point with 5-bit exponent and 10-bit mantissa; higher throughput than FP32 but a narrower dynamic range than BF16.

    Tags: half precision
  • Fully Sharded Data Parallel (FSDP) [Frameworks]

    PyTorch sharded training that partitions parameters, gradients, and optimizer states across workers, optionally with activation checkpointing, to reduce memory (see the sketch after this section).

    Tags: PyTorch, sharding, memory
  • Fused Kernels [Performance]

    Kernels combining multiple ops into one to reduce memory reads/writes and launch overhead, improving throughput.

    Tags: throughput, optimization
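  A minimal sketch of wrapping a model with FSDP, assuming an initialized NCCL process group and the current CUDA device already set; real configurations usually add an auto-wrap policy and mixed-precision settings:

      import torch
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

      def shard_model(model: torch.nn.Module) -> FSDP:
          # Parameters, gradients, and optimizer state are sharded across ranks;
          # full parameters are gathered on the fly for each forward/backward pass.
          return FSDP(model.cuda())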

G

  • Global Batch Size [Optimization]

    Total number of samples processed per optimizer step across all workers and micro-batches.

    Tags: batch size, scaling
  • GLOO_SOCKET_IFNAME [Environment]

    Network interface name for the Gloo backend to bind/listen on. Mirrors NCCL_SOCKET_IFNAME for Gloo.

    Tags: Gloo, network, interface
  • GPipe [Algorithms]

    Pipeline parallel training method that splits mini-batches into micro-batches to keep pipeline stages utilized.

    Tags: pipeline, micro-batches
  • Gradient Accumulation [Optimization]

    Accumulate gradients over multiple micro-batches before an optimizer step to simulate larger effective batch sizes (see the sketch after this section).

    Tags: memory, batch size
  • Gradient Bucketing [Optimization]

    Grouping gradients into buckets to reduce overhead and enable overlapping communication with computation.

    Tags: overlap, communication
  • Gradient Clipping [Optimization]

    Limit the norm or value of gradients to prevent exploding gradients and improve training stability at large batch sizes.

    Tags: stability, norm
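  A minimal sketch of gradient accumulation with clipping, assuming existing model, optimizer, and loader objects; the global batch size works out to micro_batch * accum_steps * world_size:

      import torch

      def train_epoch(model, optimizer, loader, accum_steps=4, max_norm=1.0):
          optimizer.zero_grad()
          for i, (x, y) in enumerate(loader):
              loss = torch.nn.functional.cross_entropy(model(x), y)
              (loss / accum_steps).backward()    # scale so the accumulated sum matches one large batch
              if (i + 1) % accum_steps == 0:
                  torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
                  optimizer.step()
                  optimizer.zero_grad()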

H

  • Hostfile [Infrastructure]

    File listing nodes and slots (GPUs) for multi-node training; used by launchers to allocate resources.

    Tags: multi-node, launcher
  • Hybrid Parallelism [Distributed Training]

    Combine multiple forms of parallelism (data, tensor, pipeline) to scale very large models across many devices.

    Tags: 3D parallelism, scaling

I

  • InfiniBand [Hardware]

    Low-latency, high-throughput network commonly used for multi-node training, often with RDMA support.

    Tags: network, RDMA

K

  • Kubernetes [Infrastructure]

    Container orchestration system; commonly used to run distributed training jobs with operators or custom controllers.

    Tags: orchestration, cloud

L

  • Latency [Performance]

    Time it takes to complete a single training step; can increase with synchronization or stragglers.

    Tags: step time, synchronization
  • Learning Rate Schedule [Optimization]

    Rule for changing the learning rate over time (e.g., cosine decay, step decay) to improve convergence.

    Tags: convergence, schedule
  • Local Rank (LOCAL_RANK, LOCAL_WORLD_SIZE) [Distributed Systems]

    Rank of a process relative to its node; used to select the local GPU device. Provided by launchers through LOCAL_RANK (and LOCAL_WORLD_SIZE).

    Tags: GPU, device
  • LoRA [Fine-tuning]

    Low-Rank Adaptation; inject low-rank trainable matrices into existing weights to adapt models with small memory overhead.

    Tags: PEFT, adapters
  • Loss Scaling [Precision]

    Multiply the loss to shift small gradients into the representable range when using FP16, then unscale before the optimizer step (see the sketch after this section).

    Tags: AMP, stability
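  A minimal sketch of FP16 loss scaling with torch.cuda.amp, assuming existing model and optimizer objects and GPU batches (x, y):

      import torch

      scaler = torch.cuda.amp.GradScaler()

      def fp16_step(model, optimizer, x, y):
          with torch.cuda.amp.autocast(dtype=torch.float16):
              loss = torch.nn.functional.cross_entropy(model(x), y)
          scaler.scale(loss).backward()   # scaled loss keeps tiny gradients representable
          scaler.step(optimizer)          # unscales gradients; skips the step on inf/NaN
          scaler.update()
          optimizer.zero_grad()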

M

  • Master Address and Port (MASTER_ADDR, MASTER_PORT) [Distributed Systems]

    Network coordinates used by workers to rendezvous and form process groups. In PyTorch, set via environment variables MASTER_ADDR and MASTER_PORT.

    Tags: rendezvous, env vars
  • Megatron-LM [Frameworks]

    Framework for large-scale Transformer training with tensor and pipeline parallelism and fused kernels.

    Tags: tensor parallel, pipeline
  • Micro-batch [Optimization]

    A small batch that fits in device memory; multiple micro-batches can be accumulated into a larger global batch.

    Tags: pipeline, accumulation
  • Mixed Precision Training [Precision]

    Use reduced precision (e.g., FP16 or BF16) for most ops to reduce memory and increase throughput, with care to maintain numerical stability (see the sketch after this section).

    Tags: AMP, bf16, fp16
  • Model Parallelism [Distributed Training]

    Split a single model's parameters across multiple devices so each device holds only a shard of the model.

    Tags: sharding, scaling
  • monitored_barrier [Collectives]

    Barrier variant that detects ranks that fail to reach the barrier by a deadline and logs or raises for easier deadlock diagnosis.

    Tags: debugging, deadlock
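  A minimal sketch of BF16 mixed precision with autocast (no loss scaling needed, since BF16 keeps an FP32-like dynamic range); model, optimizer, and GPU batches (x, y) are assumed:

      import torch

      def bf16_step(model, optimizer, x, y):
          with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
              loss = torch.nn.functional.cross_entropy(model(x), y)
          loss.backward()
          optimizer.step()
          optimizer.zero_grad()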

N

  • NCCL [Communication]

    NVIDIA Collective Communications Library optimized for high-performance multi-GPU and multi-node communication.

    Tags: collectives, GPU
  • NCCL_ALGO [Environment]

    Algorithm preference hint for NCCL collectives (e.g., 'Tree', 'Ring'). Can influence latency vs. bandwidth trade-offs.

    Tags: NCCL, algorithm
  • NCCL_DEBUG [Environment]

    Controls NCCL logging verbosity (e.g., 'WARN', 'INFO', 'TRACE'). Useful for diagnosing topology, algorithm selection, and transport issues.

    Tags: NCCL, logging
  • NCCL_IB_GID_INDEX [Environment]

    InfiniBand GID index used by NCCL to select RoCE/IB addressing (common values: 0 for IB, 3 for RoCEv2).

    Tags: NCCL, GID, RoCE
  • NCCL_IB_HCA [Environment]

    InfiniBand device allowlist for NCCL (e.g., 'mlx5_0'). Helps constrain which HCAs are used on multi-HCA nodes.

    Tags: NCCL, InfiniBand, HCA
  • NCCL_P2P_DISABLE [Environment]

    Disable peer-to-peer (P2P) direct GPU communication in NCCL when set to '1'; forces traffic through other transports.

    Tags: NCCL, P2P
  • NCCL_PROTO [Environment]

    Protocol hint for NCCL ('LL', 'LL128', 'Simple') controlling chunk sizes/latency behavior.

    Tags: NCCL, protocol
  • NCCL_SHM_DISABLE [Environment]

    Disable NCCL shared-memory transport on a node (set to '1') to work around SHM limitations or container constraints.

    Tags: NCCL, SHM
  • NCCL_SOCKET_IFNAME [Environment]

    Comma-separated list of network interfaces NCCL is allowed to use (e.g., 'ib0,eno1'). Set to choose between InfiniBand/Ethernet or to avoid docker/lo interfaces (see the sketch after this section).

    Tags: NCCL, network, interface
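  A minimal sketch of setting NCCL environment variables before process group initialization; the interface name 'ib0' is an example and should match an interface that actually exists on your nodes:

      import os
      import torch.distributed as dist

      os.environ.setdefault("NCCL_DEBUG", "INFO")          # log topology and transport choices
      os.environ.setdefault("NCCL_SOCKET_IFNAME", "ib0")   # keep NCCL off docker0/lo
      # os.environ["NCCL_P2P_DISABLE"] = "1"               # uncomment only to rule out P2P issues

      dist.init_process_group(backend="nccl")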

O

  • Offloading [Memory]

    Moving tensors (parameters, gradients, optimizer states, activations) to CPU or NVMe to fit larger models than GPU memory allows.

    Tags: CPU, NVMe, throughput
  • Optimizer State Sharding [Memory]

    Partitioning optimizer states (e.g., Adam moments) across workers to reduce memory footprint.

    Tags: ZeRO, FSDP
  • Overlap Compute and Communication [Optimization]

    Scheduling communication (e.g., gradient reductions) concurrently with compute to hide latency and improve throughput (see the sketch after this section).

    Tags: performance, latency
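  A minimal sketch of overlapping an all-reduce with unrelated compute by issuing the collective asynchronously; an initialized NCCL group is assumed and other_work stands in for computation that does not depend on the result:

      import torch
      import torch.distributed as dist

      def overlapped_reduce(grad_bucket: torch.Tensor, other_work):
          handle = dist.all_reduce(grad_bucket, async_op=True)  # returns immediately
          other_work()                                          # independent compute runs meanwhile
          handle.wait()                                         # block only when the result is needed
          grad_bucket /= dist.get_world_size()
          return grad_bucket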

P

  • Parameter Sharding [Memory]

    Partitioning model parameters across devices so each process stores only a shard, gathering on-the-fly when needed.

    Tags: ZeRO, FSDP
  • Parameter-Efficient Fine-Tuning (PEFT) [Fine-tuning]

    Fine-tune a small subset of parameters or add modules (e.g., LoRA) to adapt large models efficiently.

    Tags: LoRA, adapters
  • PCIe [Hardware]

    Peripheral Component Interconnect Express; general-purpose high-speed bus used for GPU-host and GPU-GPU communication.

    Tags: interconnect
  • PipeDream [Algorithms]

    Pipeline parallel approach with asynchronous weight updates and schedule optimizations to reduce bubbles.

    Tags: pipeline, scheduling
  • Pipeline Parallelism [Distributed Training]

    Split the model by layers into stages placed on different devices; process micro-batches through stages like an assembly line to keep devices busy.

    Tags: GPipe, PipeDream, stages
  • Process Group [Distributed Systems]

    A set of distributed processes that can communicate via collectives; used to scope data, tensor, or pipeline parallel communication (see the sketch after this section).

    Tags: PyTorch, NCCL, groups
  • Proximal Policy Optimization (PPO) [Fine-tuning]

    Policy-gradient RL algorithm commonly used in RLHF to optimize language models from scalar rewards.

    Tags: RLHF, policy gradient
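  A minimal sketch of scoping collectives with sub-groups, here one tensor-parallel group per node; gpus_per_node is an illustrative assumption and the default process group is expected to be initialized:

      import torch.distributed as dist

      world_size = dist.get_world_size()
      rank = dist.get_rank()
      gpus_per_node = 8   # assumed node size, for illustration only

      # Every rank must call new_group() for every group, even groups it does not join.
      tp_groups = [
          dist.new_group(list(range(n * gpus_per_node, (n + 1) * gpus_per_node)))
          for n in range(world_size // gpus_per_node)
      ]
      my_tp_group = tp_groups[rank // gpus_per_node]

      # Collectives can then be scoped to the sub-group:
      # dist.all_reduce(tensor, group=my_tp_group)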

Q

  • QLoRA [Fine-tuning]

    PEFT method combining 4-bit quantization of base weights with LoRA adapters to reduce memory during fine-tuning.

    Tags: quantization, LoRA

R

  • Rank (RANK) [Distributed Systems]

    Unique identifier of a process within a distributed job, in the range [0, WORLD_SIZE-1]. Often provided via the environment variable RANK. Rank 0 is typically used for logging/checkpoint coordination.

    Tags: DDP, process
  • Ray [Frameworks]

    Distributed execution framework that provides high-level APIs for scaling Python and ML workloads, including training.

    Tags: distributed, python
  • RDMA [Hardware]

    Remote Direct Memory Access; allows one machine to access another's memory without involving the CPU, reducing latency.

    Tags: network, latency
  • reduce_scatter_tensor [Collectives]

    Reduce a single input tensor across ranks and scatter equal-sized shards of the result, one to each rank; the tensor-based counterpart of reduce_scatter, useful for sharded optimizers and for overlapping communication with compute (see the sketch after this section).

    Tags: sharding, overlap
  • Reduce-Scatter [Communication]

    Collective that reduces values across processes and scatters disjoint shards of the result to each process; useful for sharded training.

    Tags: sharding, collective
  • Reinforcement Learning from Human Feedback (RLHF) [Fine-tuning]

    Pipeline that trains a reward model from human preferences and optimizes the policy (e.g., via PPO) to maximize the reward.

    Tags: preference, alignment
  • Rendezvous [Distributed Systems]

    The mechanism by which distributed processes discover each other and form process groups (e.g., via master address/port or an elastic agent).

    Tags: init, elastic
  • Ring All-Reduce [Communication]

    All-reduce algorithm arranging processes in a ring, passing and reducing chunks to optimize bandwidth utilization.

    Tags: algorithm, bandwidth
  • RLAIF [Fine-tuning]

    Reinforcement Learning from AI Feedback; uses AI-generated preferences to scale preference optimization.

    Tags: preference, scaling
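  A minimal sketch of reduce_scatter_tensor summing a flat gradient buffer so each rank keeps only its reduced shard, as sharded optimizers do; an initialized process group and a buffer length divisible by the world size are assumed:

      import torch
      import torch.distributed as dist

      def reduce_scatter_grads(flat_grads: torch.Tensor) -> torch.Tensor:
          world_size = dist.get_world_size()
          shard = torch.empty(flat_grads.numel() // world_size,
                              dtype=flat_grads.dtype, device=flat_grads.device)
          dist.reduce_scatter_tensor(shard, flat_grads, op=dist.ReduceOp.SUM)
          return shard   # this rank's reduced shard; other ranks hold the other shards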

S

  • Seed [Reproducibility]

    Initial value for random number generators; setting consistent seeds across processes aids reproducibility (see the sketch after this section).

    Tags: randomness, RNG
  • Sequence Packing [Input Pipeline]

    Concatenate multiple sequences into a fixed-length window to reduce padding and increase token throughput.

    Tags: packing, tokens
  • SLURM [Infrastructure]

    Cluster workload manager used to schedule and launch distributed training jobs on HPC systems.

    Tags: scheduler, HPC
  • Store (torch.distributed) [PyTorch Distributed]

    A key-value service used by processes to share small pieces of state during rendezvous and beyond. Implementations include TCPStore, FileStore, and HashStore. Supports operations such as set(), get(), wait(), and timeouts.

    Tags: rendezvous, state, Store
  • Straggler [Distributed Systems]

    A slow worker that delays synchronous steps; mitigation includes better placement, pipelining, or asynchronous techniques.

    Tags: performance, synchronization
  • Supervised Fine-Tuning (SFT) [Fine-tuning]

    Train a model on input-output pairs to follow instructions before preference optimization or RLHF.

    Tags: instruction tuning, preference
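  A minimal sketch of per-process seeding; the rank offset keeps dropout and data augmentation decorrelated across ranks while staying reproducible, and can be dropped if identical RNG streams are wanted:

      import random
      import numpy as np
      import torch
      import torch.distributed as dist

      def set_seed(base_seed: int = 1234, offset_by_rank: bool = True):
          rank = dist.get_rank() if dist.is_initialized() else 0
          seed = base_seed + (rank if offset_by_rank else 0)
          random.seed(seed)
          np.random.seed(seed)
          torch.manual_seed(seed)   # also seeds all CUDA devices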

T

  • TCPStore [PyTorch Distributed]

    Networked Store backed by a TCP server (typically on rank 0). Other ranks connect via MASTER_ADDR/MASTER_PORT. Useful for multi-node rendezvous and sharing runtime state.

    Tags: Store, TCP
  • Tensor Parallelism [Distributed Training]

    Type of model parallelism that partitions individual tensors (e.g., attention or MLP weights) across devices to parallelize intra-layer compute.

    Tags: Megatron-LM, intra-layer, sharding
  • Throughput [Performance]

    Number of samples or tokens processed per unit time; key metric for distributed training efficiency.

    Tags: tokens/s, samples/s
  • Topology-aware Placement [Infrastructure]

    Scheduling processes with awareness of hardware links (NVLink, PCIe, IB) to maximize bandwidth and minimize latency.

    Tags: placement, bandwidth
  • TORCH_DISTRIBUTED_DEBUG [Environment]

    Debug level for torch.distributed ('OFF', 'INFO', 'DETAIL'). Increases runtime checks and logging to help diagnose collectives and rendezvous issues.

    Tags: debug, logging
  • torch.distributed.init_process_group [PyTorch Distributed]

    Initialize the default process group for collectives. Must be called once per process before using torch.distributed APIs. Key args: backend ('nccl', 'gloo', 'mpi'), rank, world_size, and init_method/store. Call destroy_process_group() at shutdown.

    Tags: initialization, process group, backend
  • torchrun [Launchers]

    Recommended PyTorch launcher. Spawns the requested number of processes per node (typically one per GPU), sets RANK, WORLD_SIZE, LOCAL_RANK/LOCAL_WORLD_SIZE, and passes rendezvous information (MASTER_ADDR/MASTER_PORT). Supports elastic/etcd rendezvous (see the sketch after this section).

    Tags: elastic, spawn, multiprocess
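  A minimal sketch of a torchrun entry point, assuming it is saved as train.py and launched with something like torchrun --nproc-per-node=8 train.py; torchrun provides the environment variables that init_process_group reads:

      import os
      import torch
      import torch.distributed as dist

      def main():
          dist.init_process_group(backend="nccl")   # uses the env:// rendezvous set by torchrun
          local_rank = int(os.environ["LOCAL_RANK"])
          torch.cuda.set_device(local_rank)
          print(f"rank {dist.get_rank()} / {dist.get_world_size()} on GPU {local_rank}")
          dist.destroy_process_group()

      if __name__ == "__main__":
          main()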

W

  • Warmup [Optimization]

    Initial phase increasing the learning rate gradually to stabilize optimization, especially with large batch sizes (see the sketch after this section).

    Tags: lr schedule, stability
  • World Size (WORLD_SIZE) [Distributed Systems]

    Total number of distributed processes participating in training. Commonly exported as the environment variable WORLD_SIZE and consumed by launchers (e.g., torchrun) and frameworks to size process groups.

    Tags: DDP, FSDP, DeepSpeed
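  A minimal sketch of linear warmup followed by cosine decay, written as a LambdaLR multiplier on the base learning rate; the optimizer object is assumed to exist:

      import math
      import torch

      def warmup_cosine(optimizer, warmup_steps: int, total_steps: int):
          def lr_lambda(step):
              if step < warmup_steps:
                  return (step + 1) / warmup_steps                 # linear ramp-up
              progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
              return 0.5 * (1.0 + math.cos(math.pi * progress))    # cosine decay to zero
          return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)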

Z

  • ZeRO [Frameworks]

    DeepSpeed's Zero Redundancy Optimizer that removes memory redundancy by sharding optimizer states, gradients, and parameters across data-parallel workers.

    Tags: DeepSpeed, sharding, memory
  • ZeRO Stage 1 [Frameworks]

    Shard optimizer states across data-parallel workers to reduce memory without changing gradients or parameters.

    Tags: DeepSpeed, optimizer state
  • ZeRO Stage 2 [Frameworks]

    Shard optimizer states and gradients across data-parallel workers for further memory savings (see the sketch after this section).

    Tags: DeepSpeed, gradients
  • ZeRO Stage 3 [Frameworks]

    Shard optimizer states, gradients, and parameters across workers, achieving maximal memory savings with on-the-fly parameter gathering.

    Tags: DeepSpeed, parameters
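  A minimal sketch of a ZeRO Stage 2 configuration expressed as the Python dict DeepSpeed reads (typically stored as a ds_config.json passed to the launcher); the numeric values are illustrative, not tuned recommendations:

      ds_config = {
          "train_micro_batch_size_per_gpu": 4,
          "gradient_accumulation_steps": 8,
          "bf16": {"enabled": True},
          "zero_optimization": {
              "stage": 2,                    # shard optimizer states and gradients
              "overlap_comm": True,          # overlap reduce-scatter with the backward pass
              "contiguous_gradients": True,
          },
      }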