Fine-tuning SDXL for Specialized Thumbnail Generation: A Technical Deep Dive

As a machine learning engineer specializing in generative AI, I recently undertook a project to fine-tune Stability AI's SDXL model for creating custom thumbnails in a specific visual style. This blog post details the technical approach, methodologies, and lessons learned throughout this process.

Project Overview

The goal was clear: develop a specialized thumbnail generation model that could produce high-quality, black and white storyboard-style illustrations in a 16:9 aspect ratio (specifically 1365×768 pixels). The fine-tuned model needed to understand and accurately represent specific visual concepts from input prompts while maintaining a consistent artistic style.

Dataset Preparation

The foundation of any successful fine-tuning project is high-quality training data. I assembled a carefully curated dataset of low-resolution storyboard frames, each maintaining the target 16:9 aspect ratio. This dataset was crucial for teaching the model the specific visual language I wanted it to learn.

Some key considerations in dataset preparation:

  • Consistent artistic style across all images
  • Black and white color scheme only (explicitly avoiding color)
  • Careful curation to ensure quality and style consistency
  • Proper aspect ratio maintenance (16:9)

The dataset was structured as an image folder compatible with Hugging Face's dataset loading utilities, enabling seamless integration with the training pipeline.
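
For reference, this is a minimal sketch of loading such a folder with Hugging Face's datasets library; the directory name is a placeholder:

```python
from datasets import load_dataset

# Load a local folder of storyboard frames with the "imagefolder" builder.
# "storyboard_frames/" is a placeholder; an optional metadata.jsonl in the
# folder can map each image file to its caption.
dataset = load_dataset("imagefolder", data_dir="storyboard_frames", split="train")

print(len(dataset))              # number of training frames
print(dataset[0]["image"].size)  # a PIL image, e.g. (1365, 768)
```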

Fine-tuning Approach

I employed Replicate's infrastructure for the fine-tuning process, which significantly simplified GPU resource management. The approach involved:

  1. Base Model Selection: SDXL was chosen as the foundation due to its exceptional generation capabilities and ability to understand complex prompts.

  2. Training Parameters:

    • Learning rate: 1e-5
    • Mixed precision: fp16
    • Resolution: 1024px
    • Training steps: 15,000
    • Batch size: 1 with gradient accumulation steps of 4

  3. LoRA Fine-tuning: Rather than training the entire model, I used Low-Rank Adaptation (LoRA), applying the resulting weights at a scale of 0.4 during inference. This allowed efficient specialization while preserving much of the base model's knowledge. A sketch of an equivalent training request follows this list.
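
Because training ran on Replicate, the hyperparameters above map onto a training request roughly like the following. This is a hedged sketch, not the exact job I submitted: the trainer version, input field names, and destination are placeholders that follow the general shape of replicate.trainings.create().

```python
import replicate

# Hypothetical sketch of launching the fine-tune via Replicate's Python
# client. The version ID, input field names, and destination below are
# placeholders; the exact fields depend on the trainer being targeted.
training = replicate.trainings.create(
    version="stability-ai/sdxl:<version-id>",               # placeholder trainer version
    input={
        "input_images": "https://example.com/frames.zip",   # placeholder dataset URL
        "resolution": 1024,
        "max_train_steps": 15000,
        "learning_rate": 1e-5,
        "mixed_precision": "fp16",
        "train_batch_size": 1,
        "gradient_accumulation_steps": 4,
    },
    destination="my-username/sdxl-storyboard",              # placeholder model name
)
print(training.status)
```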

Generation Pipeline

The generation pipeline incorporated several key components, combined in the sketch that follows this list:

  1. Prompt Engineering: I developed a prompt template that enforced the black and white storyboard style:

    (((black and white))) in the style of storyboard and cartoonish

  2. Negative Prompting: Extensive negative prompts were used to avoid common diffusion model artifacts and to explicitly prevent color in the outputs:

    (((colors))), (((color))), (((colorful))), (((colourful))), mature content, double body, double face, 
    double features, incorrect posture...
    
  3. Generation Parameters:

    • Seed: 42 (for reproducibility during testing)
    • Inference steps: 25
    • Guidance scale: 12
    • Width/height: 1365×768 (16:9 aspect ratio)
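
Putting the template, negative prompt, and parameters together, a minimal diffusers sketch of the pipeline looks like this. The LoRA path and example prompt are placeholders, and the width is rounded to 1360 because diffusers requires dimensions divisible by 8:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the SDXL base model and apply the fine-tuned LoRA weights.
# The LoRA path below is a placeholder for the trained adapter.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")
pipe.load_lora_weights("path/to/storyboard-lora")  # placeholder path

STYLE_SUFFIX = "(((black and white))) in the style of storyboard and cartoonish"
NEGATIVE_PROMPT = (
    "(((colors))), (((color))), (((colorful))), (((colourful))), "
    "mature content, double body, double face, double features, incorrect posture"
)

image = pipe(
    prompt=f"a detective examining a clue, {STYLE_SUFFIX}",  # example prompt
    negative_prompt=NEGATIVE_PROMPT,
    width=1360,   # diffusers needs multiples of 8, so 1365 is rounded down
    height=768,
    num_inference_steps=25,
    guidance_scale=12,
    generator=torch.Generator("cuda").manual_seed(42),  # reproducible test seed
    cross_attention_kwargs={"scale": 0.4},              # LoRA scale from above
).images[0]
image.save("thumbnail.png")
```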

Evaluation Framework

Evaluating generative models is notoriously challenging, requiring both quantitative metrics and qualitative assessment. I built a comprehensive evaluation suite that included the following (a minimal CLIP-score sketch appears after the list):

  1. CLIP Score: Measuring semantic similarity between input prompts and generated images using OpenAI's CLIP model.

  2. Object Detection: Using YOLOv8 to verify the presence of objects mentioned in the prompts.

  3. Multi-modal Evaluation:

    • BLIP caption generation to create descriptions of generated images
    • CLIP interrogator to understand what the model "sees" in its own outputs
    • Semantic similarity scoring between prompts and generated captions

  4. LLM-based Evaluation: Leveraging GPT-4 to perform nuanced semantic evaluation of image-prompt alignment, using a 1-10 scoring system.

  5. Facial Analysis: Using DeepFace to evaluate emotional consistency and facial quality when human subjects were present.
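
As a concrete example, the CLIP score check can be written as below; the other metrics follow the same load-and-score pattern. The specific CLIP checkpoint is an assumption, since any CLIP variant can be substituted:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Score semantic alignment between a prompt and a generated image as the
# cosine similarity of their CLIP embeddings.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize the projected embeddings before taking the dot product.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

print(clip_score(Image.open("thumbnail.png"), "a detective examining a clue"))
```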

Results and Insights

The fine-tuned model demonstrated several key strengths:

  1. Style Consistency: Maintained the black and white storyboard aesthetic across diverse prompts.

  2. Prompt Adherence: The model showed strong semantic alignment with input prompts, verified by CLIP scores averaging above 0.75.

  3. Object Accuracy: Properly generated requested objects in appropriate contexts.

  4. Compositional Understanding: Successfully maintained scene composition appropriate for thumbnail imagery.

However, challenges remained, particularly with:

  • Certain complex action sequences
  • Very specific spatial relationships
  • Text rendering within images
  • Consistent facial expressions across different prompts

Technical Optimizations

Several technical optimizations proved beneficial:

  1. Gradient Accumulation: Using gradient accumulation steps of 4 gave an effective batch size of 4 while managing memory constraints; a minimal sketch follows this list.

  2. Resolution Management: Training at 1024px and generating at 1365×768, which keeps roughly the same total pixel count as 1024×1024 while matching the 16:9 target, provided a good balance of quality and efficiency.

  3. LoRA Parameter Tuning: Finding the optimal LoRA scale (0.4) that preserved base model capabilities while incorporating new stylistic elements.
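
The accumulation pattern itself is compact. This toy, self-contained sketch uses a stand-in linear model rather than the SDXL UNet:

```python
import torch
from torch import nn

# Illustration of gradient accumulation: micro-batches of size 1, with one
# optimizer step every 4 backward passes, so the effective batch size is 4.
ACCUM_STEPS = 4
model = nn.Linear(8, 1)  # stand-in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
micro_batches = [torch.randn(1, 8) for _ in range(16)]

for step, batch in enumerate(micro_batches):
    loss = model(batch).pow(2).mean() / ACCUM_STEPS  # scale so gradients average
    loss.backward()                                  # gradients accumulate in .grad
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()        # one update per effective batch
        optimizer.zero_grad()
```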

Future Directions

This project opens several paths for further exploration:

  1. Style-specific Negative Embedding: Developing dedicated textual inversion embeddings for more precise style control.

  2. Controlled Generation: Implementing ControlNet adaptations for more precise compositional control.

  3. Prompt Optimization: Developing an automated prompt optimization system to maximize model performance for specific use cases.

  4. Additional Fine-tuning: Further specialization on narrow subdomains for improved results in specific scenarios.

Conclusion

Fine-tuning SDXL for specialized thumbnail generation demonstrated both the power and limitations of current diffusion model customization approaches. The technical workflows established during this project can be adapted for other domain-specific image generation tasks, providing a foundation for specialized visual content creation pipelines.

While challenges remain, the ability to create consistent, high-quality thumbnail imagery from text prompts represents a significant productivity enhancement for content creators, marketers, and designers.