Misc LLM training posts
Debugging CUDA Driver Chaos: A Deep Dive into Kernel Module Version Mismatches
A detailed walkthrough of debugging CUDA driver and kernel module version mismatches across multiple nodes in a distributed DeepSeek V3 training cluster.
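For context on what the walkthrough covers, here is a minimal sketch of the kind of per-node sanity check involved: compare the driver version reported by the userspace stack against the kernel module actually loaded on that node. The helper names and the idea of fanning this out over every node with your scheduler (srun, pdsh, an ssh loop) are assumptions, not details taken from the post.

```python
"""Per-node check for CUDA driver / kernel-module version skew (illustrative sketch)."""
import socket
import subprocess
from pathlib import Path


def userspace_driver_version():
    """Driver version reported by the userspace stack (nvidia-smi / libnvidia-ml)."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.splitlines()[0].strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        # A classic mismatch symptom: nvidia-smi itself fails with
        # "Failed to initialize NVML: Driver/library version mismatch".
        return None


def kernel_module_version():
    """Version of the nvidia kernel module currently loaded on this node."""
    path = Path("/sys/module/nvidia/version")
    return path.read_text().strip() if path.exists() else None


if __name__ == "__main__":
    user, kern = userspace_driver_version(), kernel_module_version()
    status = "OK" if user and user == kern else "MISMATCH"
    print(f"{socket.gethostname()}: userspace={user} kernel_module={kern} [{status}]")
```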
Fine-Tuning Qwen 2.5-72B for YouTube Content Generation: A Two-Step Training Approach
Large Language Models (LLMs) have revolutionized content generation, but getting them to perform well on domain-specific tasks often requires fine-tuning. Over the past few months, I've been fine-tuning Qwen 2.5-72B for YouTube content generation, comparing two training approaches: a two-step process that combines Supervised Fine-Tuning (SFT) with Direct Preference Optimization (DPO), and a simpler single-step SFT approach.
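As a rough sketch of what the two-step pipeline looks like in code, assuming Hugging Face TRL: the model ID, dataset files, and hyperparameters below are placeholders, exact trainer arguments drift a little between TRL releases, and a 72B model would in practice be trained through a distributed or parameter-efficient setup that is omitted here.

```python
"""Two-step fine-tuning sketch: SFT first, then DPO from the SFT checkpoint.

Assumes Hugging Face TRL. Dataset files and hyperparameters are placeholders,
and exact trainer arguments vary a little between TRL releases.
"""
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer, SFTConfig, SFTTrainer

BASE_MODEL = "Qwen/Qwen2.5-72B-Instruct"  # placeholder model ID

# --- Step 1: supervised fine-tuning on formatted examples ---------------------
sft_data = load_dataset("json", data_files="sft_train.jsonl", split="train")
sft_trainer = SFTTrainer(
    model=BASE_MODEL,
    train_dataset=sft_data,  # e.g. a "messages" or "text" column per example
    args=SFTConfig(output_dir="qwen72b-sft", num_train_epochs=1),
)
sft_trainer.train()
sft_trainer.save_model("qwen72b-sft")

# --- Step 2: preference optimization on (prompt, chosen, rejected) pairs ------
dpo_data = load_dataset("json", data_files="dpo_pairs.jsonl", split="train")
dpo_trainer = DPOTrainer(
    model="qwen72b-sft",  # continue from the SFT checkpoint
    train_dataset=dpo_data,  # columns: prompt, chosen, rejected
    args=DPOConfig(output_dir="qwen72b-dpo", beta=0.1, num_train_epochs=1),
)
dpo_trainer.train()
dpo_trainer.save_model("qwen72b-dpo")
```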
Scaling the Summit: Distributed Inference with Meta-Llama-3.1-405B using vLLM
This post details the technical approach, configuration, and key insights from deploying Meta-Llama-3.1-405B, one of the largest openly available language models, using distributed inference with vLLM.
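At the API level, the shape of such a deployment looks roughly like the following, using vLLM's offline LLM interface. The parallelism degrees and model ID are illustrative rather than the post's actual configuration, and a multi-node run additionally needs a Ray cluster spanning the nodes.

```python
"""Minimal vLLM distributed-inference sketch for a 405B-class model.

Parallelism degrees below are illustrative, not the post's actual configuration;
a multi-node setup additionally requires a Ray cluster spanning all nodes.
"""
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",
    tensor_parallel_size=8,            # shard each layer across the GPUs of one node
    pipeline_parallel_size=2,          # split layers across nodes
    distributed_executor_backend="ray",
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```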
Fine-Tuning Llama 3.1 8B with Direct Preference Optimization: A Distributed Training Approach
As part of our deep learning research initiatives, I recently fine-tuned the Meta Llama 3.1 8B model with Direct Preference Optimization (DPO) in a distributed training setup.
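For reference, the objective a DPO trainer optimizes can be written compactly. The sketch below is the standard formulation, with variable names of my own choosing rather than anything taken from the post.

```python
"""The DPO objective in isolation (standard formulation; names are mine, not the post's).

Each *_logps tensor holds per-example sequence log-probabilities, shape (batch,).
"""
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    # Implicit reward of each response = beta * (policy logp - reference logp).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```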