Glossary
A
Activation Checkpointing [Memory]
Drop and recompute intermediate activations during the backward pass to save memory at the cost of extra compute.
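A minimal sketch using PyTorch's checkpoint utility; the Block module here is illustrative, not from the glossary:

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        # Activations inside self.ff are not stored; they are recomputed
        # during the backward pass, trading extra compute for memory.
        return checkpoint(self.ff, x, use_reentrant=False)
```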
Related: recompute, trade-off
all_gather_object [Collectives]
Gather Python objects from all ranks to all ranks. Complements tensor-based all_gather for small control data.
Related: objects, gather, serialization
All-Gather [Communication]
Collective that gathers shards from all processes so each process ends up with the concatenated full tensor.
Related: tensor parallel, parameters
All-Reduce [Communication]
Collective that reduces values (e.g., sum of gradients) across all processes and distributes the result back to all.
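A minimal sketch, assuming a process group has already been initialized and each process has set its CUDA device:

```python
import torch
import torch.distributed as dist

# Each rank contributes its own values; after all_reduce every rank holds the sum.
grads = torch.ones(4, device="cuda") * dist.get_rank()
dist.all_reduce(grads, op=dist.ReduceOp.SUM)
grads /= dist.get_world_size()  # average, as data-parallel gradient sync typically does
```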
Related: gradients, synchronization
B
Backend (torch.distributed) [PyTorch Distributed]
Communication backend for process groups. 'nccl' (GPU-only, fastest on NVIDIA), 'gloo' (CPU/multi-arch), 'mpi' (when compiled with MPI). Choose NCCL for CUDA tensors; Gloo for CPU tensors.
Related: NCCL, Gloo, MPI
barrier (torch.distributed) [Collectives]
Collective that blocks until all ranks enter, ensuring a synchronization point in the program.
Related: synchronization, rendezvous
BF16 [Precision]
bfloat16 format with 8-bit exponent and 7-bit mantissa; similar dynamic range to FP32, often more stable than FP16.
Related: bfloat16, AMP
Broadcast [Communication]
Collective that sends a tensor from one process (the root) to all others.
Related: parameters, init
broadcast_object_list [Collectives]
Broadcast Python objects by serializing them across ranks. Useful for small configs/metadata that are not tensors.
Related: objects, broadcast, serialization
C
Checkpoint [Reliability]
Saved model, optimizer, and training state enabling resume or evaluation without retraining from scratch.
Related: state, resume
Checkpoint Sharding [Memory]
Saving model state across multiple files/processes so no single device needs to materialize the full state at once.
Related: FSDP, DeepSpeed
Collective Communication [Communication]
Operations involving multiple processes such as all-reduce, reduce-scatter, all-gather, and broadcast used to synchronize training state.
Related: NCCL, synchronization
CUDA Graphs [Performance]
Capture and replay sequences of GPU operations to reduce launch overhead and improve performance.
Related: launch overhead, PyTorch 2
CUDA Streams [Performance]
Queues for asynchronous GPU work submission; used to overlap kernels and communication with compute.
Related: asynchrony, overlap
CUDA_VISIBLE_DEVICES [Environment]
Comma-separated list of GPU IDs to expose to the process; visible devices are re-enumerated from 0, which determines the LOCAL_RANK→CUDA device mapping.
Related: CUDA, device mapping
D
Data Parallelism (DP) [Distributed Training]
Replicate the full model on each worker and split the batch across workers; gradients are synchronized (e.g., via all-reduce) to keep weights in sync.
Related: DDP, scaling, replicas
DeepSpeed [Frameworks]
Microsoft's library for scalable training featuring ZeRO, pipeline parallelism, offloading, and many optimizations.
Related: ZeRO, offload
Determinism [Reproducibility]
Producing bitwise or numerically stable results across runs by controlling seeds, algorithms, and parallelism sources.
Related: seed, cudnn, algorithms
DeviceMesh [PyTorch Distributed]
An n‑dimensional logical arrangement of devices used by DTensor and collective libraries to express partitioning and communication groups across multiple axes (e.g., data, tensor, pipeline).
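A minimal sketch, assuming PyTorch 2.2+ and 8 GPUs, building a 2-D mesh (2-way data parallel by 4-way tensor parallel):

```python
from torch.distributed.device_mesh import init_device_mesh

# Requires a launched distributed environment (e.g., via torchrun).
mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))
dp_group = mesh["dp"].get_group()  # process group along the data-parallel axis
tp_group = mesh["tp"].get_group()  # process group along the tensor-parallel axis
```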
Related: DTensor, parallelism, groups
Direct Preference Optimization (DPO) [Fine-tuning]
Preference optimization method that learns from chosen vs. rejected outputs without explicit reward modeling.
Related: preference, alignment
Distributed Data Parallel (DDP) [Frameworks]
PyTorch's process-based data parallel training where each process holds a full model replica and synchronizes gradients using collective communications.
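A minimal sketch of wrapping a model, assuming a torchrun launch and an initialized NCCL process group; MyModel is a placeholder module:

```python
import os
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
torch.cuda.set_device(local_rank)

model = MyModel().to(local_rank)             # MyModel is a placeholder
ddp_model = DDP(model, device_ids=[local_rank])
# Gradients are all-reduced across ranks automatically during backward().
```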
Related: PyTorch, all-reduce, NCCL
Distributed Sampler [Input Pipeline]
Data loader component that partitions a dataset across processes to avoid sample duplication and maintain shuffling guarantees.
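A minimal sketch with torch.utils.data; dataset and num_epochs are placeholders:

```python
from torch.utils.data import DataLoader, DistributedSampler

sampler = DistributedSampler(dataset, shuffle=True)   # partitions indices across ranks
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffles consistently across ranks each epoch
    for batch in loader:
        ...
```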
Related: sampler, sharding
E
Elastic Training [Distributed Systems]
Allow workers to join or leave during training (e.g., due to preemption), with automatic re-rendezvous and state recovery.
Related: fault tolerance, preemption
F
FileStore [PyTorch Distributed]
Filesystem-backed Store that uses a shared directory to coordinate ranks. Suitable when nodes share a filesystem (e.g., NFS).
Related: Store, filesystem
FP16 [Precision]
IEEE half-precision floating point with 5-bit exponent and 10-bit mantissa; higher throughput than FP32 but narrower dynamic range than BF16.
Related: half precision
Fully Sharded Data Parallel (FSDP) [Frameworks]
PyTorch sharded training that partitions parameters, gradients, and optimizer states across workers, optionally with activation checkpointing, to reduce memory.
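A minimal sketch, assuming an initialized process group; real setups usually add an auto-wrap policy and mixed-precision settings, and MyTransformer is a placeholder:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = MyTransformer().cuda()   # placeholder module
fsdp_model = FSDP(model)         # parameters, gradients, and optimizer state are sharded
optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)
```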
Related: PyTorch, sharding, memory
Fused Kernels [Performance]
Kernels combining multiple ops into one to reduce memory reads/writes and launch overhead, improving throughput.
Related: throughput, optimization
G
Global Batch Size [Optimization]
Total number of samples processed per optimizer step across all workers and micro-batches.
Related: batch size, scaling
GLOO_SOCKET_IFNAME [Environment]
Network interface name for the Gloo backend to bind/listen on. Mirrors NCCL_SOCKET_IFNAME for Gloo.
Related: Gloo, network, interface
GPipe [Algorithms]
Pipeline parallel training method that splits mini-batches into micro-batches to keep pipeline stages utilized.
Related: pipeline, micro-batches
Gradient Accumulation [Optimization]
Accumulate gradients over multiple micro-batches before an optimizer step to simulate larger effective batch sizes.
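A minimal sketch; model, loader, optimizer, and loss_fn are placeholders:

```python
accum_steps = 8  # effective batch = micro-batch size * accum_steps * world size
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    # Scale the loss so the accumulated gradients average rather than sum.
    loss = loss_fn(model(inputs), targets) / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```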
Related: memory, batch size
Gradient Bucketing [Optimization]
Grouping gradients into buckets to reduce overhead and enable overlapping communication with computation.
Related: overlap, communication
Gradient Clipping [Optimization]
Limit the norm or value of gradients to prevent exploding gradients and improve training stability at large batch sizes.
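A minimal sketch with PyTorch's built-in norm clipping; loss, model, and optimizer are placeholders:

```python
import torch

loss.backward()
# Rescale gradients in place if their global L2 norm exceeds 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```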
Related: stability, norm
H
Hostfile [Infrastructure]
File listing nodes and slots (GPUs) for multi-node training; used by launchers to allocate resources.
Related: multi-node, launcher
Hybrid Parallelism [Distributed Training]
Combine multiple forms of parallelism (data, tensor, pipeline) to scale very large models across many devices.
Related: 3D parallelism, scaling
I
InfiniBand [Hardware]
Low-latency, high-throughput network commonly used for multi-node training, often with RDMA support.
Related: network, RDMA
K
Kubernetes [Infrastructure]
Container orchestration system; commonly used to run distributed training jobs with operators or custom controllers.
Related: orchestration, cloud
L
Latency [Performance]
Time it takes to complete a single training step; can increase with synchronization or stragglers.
Related: step time, synchronization
Learning Rate Schedule [Optimization]
Rule for changing the learning rate over time (e.g., cosine decay, step decay) to improve convergence.
Related: convergence, schedule
Local Rank (LOCAL_RANK, LOCAL_WORLD_SIZE) [Distributed Systems]
Rank of a process relative to its node; used to select the local GPU device. Provided by launchers through LOCAL_RANK (and LOCAL_WORLD_SIZE).
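A minimal sketch of mapping a process to its GPU under torchrun:

```python
import os
import torch

local_rank = int(os.environ.get("LOCAL_RANK", 0))  # exported by torchrun
torch.cuda.set_device(local_rank)                  # bind this process to its local GPU
device = torch.device("cuda", local_rank)
```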
Related: GPU, device
LoRA [Fine-tuning]
Low-Rank Adaptation; inject low-rank trainable matrices into existing weights to adapt models with small memory overhead.
Related: PEFT, adapters
Loss Scaling [Precision]
Multiply the loss to shift small gradients into representable range when using FP16, then unscale before the optimizer step.
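A minimal sketch with PyTorch's GradScaler; model, inputs, and optimizer are placeholders, and BF16 usually needs no loss scaling:

```python
import torch

scaler = torch.cuda.amp.GradScaler()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(inputs).sum()      # placeholder forward pass and loss
scaler.scale(loss).backward()       # backward on the scaled loss
scaler.step(optimizer)              # unscales gradients; skips the step on inf/NaN
scaler.update()
```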
Related: AMP, stability
M
Master Address and Port (MASTER_ADDR, MASTER_PORT) [Distributed Systems]
Network coordinates used by workers to rendezvous and form process groups. In PyTorch, set via environment variables MASTER_ADDR and MASTER_PORT.
Related: rendezvous, env vars
Megatron-LM [Frameworks]
Framework for large-scale Transformer training with tensor and pipeline parallelism and fused kernels.
Related: tensor parallel, pipeline
Micro-batch [Optimization]
A small batch that fits in device memory; multiple micro-batches can be accumulated into a larger global batch.
Related: pipeline, accumulation
Mixed Precision Training [Precision]
Use reduced precision (e.g., FP16 or BF16) for most ops to reduce memory and increase throughput, with care to maintain numerical stability.
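A minimal sketch with autocast in BF16; model, inputs, targets, and loss_fn are placeholders:

```python
import torch

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    outputs = model(inputs)          # matmuls run in BF16; many reductions stay in FP32
    loss = loss_fn(outputs, targets)
loss.backward()                      # with BF16, loss scaling is usually unnecessary
```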
Related: AMP, bf16, fp16
Model Parallelism [Distributed Training]
Split a single model's parameters across multiple devices so each device holds only a shard of the model.
Related: sharding, scaling
monitored_barrier [Collectives]
Barrier variant that detects ranks that fail to reach the barrier by a deadline and logs or raises for easier deadlock diagnosis.
Related: debugging, deadlock
N
NCCL [Communication]
NVIDIA Collective Communications Library optimized for high-performance multi-GPU and multi-node communication.
Related: collectives, GPU
NCCL_ALGO [Environment]
Algorithm preference hint for NCCL collectives (e.g., 'Tree', 'Ring'). Can influence latency vs. bandwidth trade-offs.
Related: NCCL, algorithm
NCCL_DEBUG [Environment]
Controls NCCL logging verbosity (e.g., 'WARN', 'INFO', 'TRACE'). Useful for diagnosing topology, algorithm selection, and transport issues.
Related: NCCL, logging
NCCL_IB_GID_INDEX [Environment]
InfiniBand GID index used by NCCL to select RoCE/IB addressing (common values: 0 for IB, 3 for RoCEv2).
Related: NCCL, GID, RoCE
NCCL_IB_HCA [Environment]
InfiniBand device allowlist for NCCL (e.g., 'mlx5_0'). Helps constrain which HCAs are used on multi-HCA nodes.
Related: NCCL, InfiniBand, HCA
NCCL_P2P_DISABLE [Environment]
Disable peer-to-peer (P2P) direct GPU communication in NCCL when set to '1'; forces traffic through other transports.
Related: NCCL, P2P
NCCL_PROTO [Environment]
Protocol hint for NCCL ('LL', 'LL128', 'Simple') controlling chunk sizes/latency behavior.
Related: NCCL, protocol
NCCL_SHM_DISABLE [Environment]
Disable NCCL shared-memory transport on a node (set to '1') to work around SHM limitations or container constraints.
Related: NCCL, SHM
NCCL_SOCKET_IFNAME [Environment]
Comma-separated list of network interfaces NCCL is allowed to use (e.g., 'ib0,eno1'). Set to choose between InfiniBand/Ethernet or to avoid docker/lo interfaces.
Related: NCCL, network, interface
NVLink [Hardware]
High-bandwidth interconnect between NVIDIA GPUs that provides faster device-to-device communication than PCIe.
Related: interconnect, GPU
O
Offloading [Memory]
Moving tensors (parameters, gradients, optimizer states, activations) to CPU or NVMe to fit larger models than GPU memory allows.
Related: CPU, NVMe, throughput
Optimizer State Sharding [Memory]
Partitioning optimizer states (e.g., Adam moments) across workers to reduce memory footprint.
Related: ZeRO, FSDP
Overlap Compute and Communication [Optimization]
Scheduling communication (e.g., gradient reductions) concurrently with compute to hide latency and improve throughput.
Related: performance, latency
P
Parameter Sharding [Memory]
Partitioning model parameters across devices so each process stores only a shard, gathering on-the-fly when needed.
Related: ZeRO, FSDP
Parameter-Efficient Fine-Tuning (PEFT) [Fine-tuning]
Fine-tune a small subset of parameters or add modules (e.g., LoRA) to adapt large models efficiently.
Related: LoRA, adapters
PCIe [Hardware]
Peripheral Component Interconnect Express; general-purpose high-speed bus used for GPU-host and GPU-GPU communication.
Related: interconnect
PipeDream [Algorithms]
Pipeline parallel approach with asynchronous weight updates and schedule optimizations to reduce bubbles.
Related: pipeline, scheduling
Pipeline Parallelism [Distributed Training]
Split the model by layers into stages placed on different devices; process micro-batches through stages like an assembly line to keep devices busy.
Related: GPipe, PipeDream, stages
Process Group [Distributed Systems]
A set of distributed processes that can communicate via collectives; used to scope data, tensor, or pipeline parallel communication.
Related: PyTorch, NCCL, groups
Proximal Policy Optimization (PPO) [Fine-tuning]
Policy-gradient RL algorithm commonly used in RLHF to optimize language models from scalar rewards.
Related: RLHF, policy gradient
Q
QLoRA [Fine-tuning]
PEFT method combining 4-bit quantization of base weights with LoRA adapters to reduce memory during fine-tuning.
Related: quantization, LoRA
R
Rank (RANK) [Distributed Systems]
Unique identifier of a process within a distributed job, in the range [0, WORLD_SIZE-1]. Often provided via the environment variable RANK. Rank 0 is typically used for logging/checkpoint coordination.
Related: DDP, process
Ray [Frameworks]
Distributed execution framework that provides high-level APIs for scaling Python and ML workloads, including training.
Related: distributed, python
RDMA [Hardware]
Remote Direct Memory Access; allows one machine to access another's memory without involving the remote CPU, reducing latency.
Related: network, latency
reduce_scatter_tensor [Collectives]
Reduce a full-size input tensor across ranks and scatter equal-sized shards of the reduced result to each rank; commonly used in sharded training to overlap communication with compute.
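A minimal sketch, assuming an initialized process group and an input length divisible by the world size:

```python
import torch
import torch.distributed as dist

world_size = dist.get_world_size()
full = torch.ones(4 * world_size, device="cuda")  # each rank's full-length input
shard = torch.empty(4, device="cuda")             # output shard owned by this rank
dist.reduce_scatter_tensor(shard, full, op=dist.ReduceOp.SUM)
# shard now holds this rank's slice of the element-wise sum across all ranks.
```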
Related: sharding, overlap
Reduce-Scatter [Communication]
Collective that reduces values across processes and scatters disjoint shards of the result to each process; useful for sharded training.
Related: sharding, collective
Reinforcement Learning from Human Feedback (RLHF) [Fine-tuning]
Pipeline that trains a reward model from human preferences and optimizes the policy (e.g., via PPO) to maximize the reward.
Related: preference, alignment
Rendezvous [Distributed Systems]
The mechanism by which distributed processes discover each other and form process groups (e.g., via master address/port or an elastic agent).
Related: init, elastic
Ring All-Reduce [Communication]
All-reduce algorithm arranging processes in a ring, passing and reducing chunks to optimize bandwidth utilization.
Related: algorithm, bandwidth
RLAIF [Fine-tuning]
Reinforcement Learning from AI Feedback; uses AI-generated preferences to scale preference optimization.
Related: preference, scaling
S
Seed [Reproducibility]
Initial value for random number generators; setting consistent seeds across processes aids reproducibility.
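A minimal sketch of a seeding helper; the per-rank offset is a common choice, not a requirement:

```python
import random
import numpy as np
import torch

def set_seed(seed: int, rank: int = 0) -> None:
    # Offset by rank where per-rank divergence (e.g., dropout, data order) is desired;
    # use the same seed on every rank when identical initialization is required.
    random.seed(seed + rank)
    np.random.seed(seed + rank)
    torch.manual_seed(seed + rank)
    torch.cuda.manual_seed_all(seed + rank)
```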
Related: randomness, RNG
Sequence Packing [Input Pipeline]
Concatenate multiple sequences into a fixed-length window to reduce padding and increase token throughput.
Related: packing, tokens
SLURM [Infrastructure]
Cluster workload manager used to schedule and launch distributed training jobs on HPC systems.
Related: scheduler, HPC
Store (torch.distributed) [PyTorch Distributed]
A key-value service used by processes to share small pieces of state during rendezvous and beyond. Implementations include TCPStore, FileStore, and HashStore. Supports operations such as set(), get(), wait(), and timeouts.
Related: rendezvous, state, Store
Straggler [Distributed Systems]
A slow worker that delays synchronous steps; mitigation includes better placement, pipelining, or asynchronous techniques.
Related: performance, synchronization
Supervised Fine-Tuning (SFT) [Fine-tuning]
Train a model on input-output pairs to follow instructions before preference optimization or RLHF.
Related: instruction tuning, preference
T
TCPStore [PyTorch Distributed]
Networked Store backed by a TCP server (typically on rank 0). Other ranks connect via MASTER_ADDR/MASTER_PORT. Useful for multi-node rendezvous and sharing runtime state.
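A minimal sketch; the host, port, world_size, and rank values are illustrative placeholders:

```python
from datetime import timedelta
import torch.distributed as dist

# Rank 0 hosts the store; the other ranks connect to it.
store = dist.TCPStore(
    "10.0.0.1", 29500, world_size=4,
    is_master=(rank == 0), timeout=timedelta(seconds=300),
)
store.set("status", "ready")   # small key-value state shared across ranks
print(store.get("status"))
```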
Related: Store, TCP
Tensor Parallelism [Distributed Training]
Type of model parallelism that partitions individual tensors (e.g., attention or MLP weights) across devices to parallelize intra-layer compute.
Related: Megatron-LM, intra-layer, sharding
Throughput [Performance]
Number of samples or tokens processed per unit time; key metric for distributed training efficiency.
Related: tokens/s, samples/s
Topology-aware Placement [Infrastructure]
Scheduling processes with awareness of hardware links (NVLink, PCIe, IB) to maximize bandwidth and minimize latency.
Related: placement, bandwidth
TORCH_DISTRIBUTED_DEBUG [Environment]
Debug level for torch.distributed ('OFF', 'INFO', 'DETAIL'). Increases runtime checks and logging to help diagnose collectives and rendezvous issues.
Related: debug, logging
torch.distributed.init_process_group [PyTorch Distributed]
Initialize the default process group for collectives. Must be called once per process before using torch.distributed APIs. Key args: backend ('nccl', 'gloo', 'mpi'), rank, world_size, and init_method/store. Call destroy_process_group() at shutdown.
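A minimal sketch, assuming the process was launched by torchrun so the rendezvous environment variables are already set:

```python
import os
import torch
import torch.distributed as dist

def setup() -> None:
    # RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT come from the launcher (env:// init).
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

def cleanup() -> None:
    dist.destroy_process_group()
```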
Related: initialization, process group, backend
torchrun [Launchers]
Recommended PyTorch launcher. Spawns the requested number of processes per node (typically one per GPU), sets RANK, WORLD_SIZE, and LOCAL_RANK/LOCAL_WORLD_SIZE, and passes rendezvous information (MASTER_ADDR/MASTER_PORT). Supports elastic/etcd rendezvous.
Related: elastic, spawn, multiprocess
W
Warmup [Optimization]
Initial phase increasing the learning rate gradually to stabilize optimization, especially with large batch sizes.
Related: lr schedule, stability
World Size (WORLD_SIZE) [Distributed Systems]
Total number of distributed processes participating in training. Commonly exported as the environment variable WORLD_SIZE and consumed by launchers (e.g., torchrun) and frameworks to size process groups.
Related: DDP, FSDP, DeepSpeed
Z
ZeRO [Frameworks]
DeepSpeed's Zero Redundancy Optimizer that removes memory redundancy by sharding optimizer states, gradients, and parameters across data-parallel workers.
Related: DeepSpeed, sharding, memory
ZeRO Stage 1 [Frameworks]
Shard optimizer states across data-parallel workers to reduce memory without changing gradients or parameters.
Related: DeepSpeed, optimizer state
ZeRO Stage 2 [Frameworks]
Shard optimizer states and gradients across data-parallel workers for further memory savings.
Related: DeepSpeed, gradients
ZeRO Stage 3 [Frameworks]
Shard optimizer states, gradients, and parameters across workers, achieving maximal memory savings with on-the-fly parameter gathering.
Related: DeepSpeed, parameters