GraphRAG Assisted Ideation with a YouTube Knowledge Graph

As a developer working on content ideation tools, I've been building a system that helps creators generate better video ideas using a combination of Retrieval-Augmented Generation (RAG) with PostgreSQL's pg_vector and a Neo4j knowledge graph. In this post, I'll walk through the architecture, implementation challenges, and insights gained from building this agentic ideation system.

The Core Challenge

Content creators constantly need fresh, engaging ideas. But manually browsing through trending topics, analyzing past successful videos, and keeping track of audience interests is time-consuming. I wanted to build a system that could:

  1. Automatically process YouTube content to understand what works
  2. Create a knowledge graph of content relationships
  3. Generate personalized content ideas based on a creator's unique style and audience

System Architecture Overview

The system consists of several interconnected components:

  1. YouTube Data Collection Layer - For retrieving video metadata and transcripts
  2. Transcript Processing Pipeline - For converting raw transcripts into semantic data
  3. Vector Database (PostgreSQL/pg_vector) - For efficient semantic search
  4. Knowledge Graph (Neo4j) - For content relationship mapping
  5. Ideation API - For generating personalized content ideas

Let me walk through how each piece works together.

Processing YouTube Transcripts

The first challenge was obtaining and processing YouTube transcripts at scale. I implemented a pipeline that:

  1. Downloads transcripts using a Python library that fetches available transcripts for given video IDs (see the sketch after this list).
  2. Formats the raw transcript data, which often comes as a list of text segments with timestamps, into a human-readable format. Both raw and processed formats are preserved for future use.
  3. Summarizes them using various Large Language Models (LLMs). After experimenting with multiple models, I found certain models were particularly effective for generating detailed summaries, bullet-point lists, loglines, concise labels, and content categorization.
  4. Generates embeddings using appropriate embedding models for semantic search functionality.
  5. Stores all data (raw transcripts, generated summaries, and embeddings) in a properly structured PostgreSQL database with pg_vector extension.
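
To make steps 1 and 2 concrete, here's a minimal sketch. I'm assuming the youtube-transcript-api package (one such library, via its classic get_transcript interface) and an illustrative "[mm:ss]" output format; the helper functions are placeholders, not a fixed API:

```python
from youtube_transcript_api import YouTubeTranscriptApi


def fetch_transcript(video_id: str) -> list[dict]:
    """Fetch the raw transcript: a list of {text, start, duration} segments."""
    return YouTubeTranscriptApi.get_transcript(video_id)


def format_transcript(segments: list[dict]) -> str:
    """Render timestamped segments as human-readable '[mm:ss] text' lines."""
    lines = []
    for seg in segments:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg['text']}")
    return "\n".join(lines)


raw = fetch_transcript("VIDEO_ID")     # raw form, preserved as-is
readable = format_transcript(raw)      # processed form for the LLM steps
```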

The pipeline includes robust error handling and optional asynchronous processing with rate limiting, which keeps throughput high while staying within API quotas when processing large volumes of content.
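
The production pipeline adjusts its limits dynamically from token usage (more on that in the learnings below); the sketch here shows only the core concurrency-cap pattern, with summarize standing in for any rate-limited API call:

```python
import asyncio

MAX_CONCURRENT = 5       # concurrent in-flight API calls; tune to your quota
PACING_SECONDS = 1.0     # minimum delay after each call

semaphore = asyncio.Semaphore(MAX_CONCURRENT)


async def rate_limited(call, *args):
    """Run one API call under the concurrency cap, with simple pacing."""
    async with semaphore:
        result = await call(*args)
        await asyncio.sleep(PACING_SECONDS)
        return result


async def summarize_all(video_ids, summarize):
    """Fan out many summarization calls without exceeding the cap."""
    return await asyncio.gather(*(rate_limited(summarize, v) for v in video_ids))
```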

Vector Storage with PostgreSQL

For efficient semantic search, I leveraged PostgreSQL with the pg_vector extension. With proper indexing, this setup supports fast approximate nearest-neighbor search even as the corpus grows. The database schema includes:

  • A main table for storing video transcripts and their embeddings
  • A table for storing smaller segments (chunks) of the transcripts and their corresponding embeddings, crucial for more granular RAG operations
  • A table for various summary types generated by different LLMs and their embeddings

This multi-table approach provides flexibility in querying based on different content granularities, summary types, or embedding models. The system creates embeddings for full transcripts, summaries, and transcript chunks, which provides versatility in search operations:

  1. Full transcript embeddings allow for detailed content matching
  2. Summary embeddings focus on the core themes and concepts derived from different LLMs
  3. Chunk embeddings enable more focused RAG by retrieving specific relevant parts of a long transcript
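
For illustration, here's roughly what that schema looks like, created from Python with psycopg2. The table and column names and the 1536-dimension embedding columns are placeholder choices, not the exact production schema:

```python
import psycopg2

# Hypothetical DDL for the three-table layout described above.
DDL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS transcripts (
    video_id   TEXT PRIMARY KEY,
    transcript TEXT NOT NULL,
    embedding  vector(1536)
);

CREATE TABLE IF NOT EXISTS transcript_chunks (
    id        BIGSERIAL PRIMARY KEY,
    video_id  TEXT REFERENCES transcripts(video_id),
    chunk     TEXT NOT NULL,
    embedding vector(1536)
);

CREATE TABLE IF NOT EXISTS summaries (
    id           BIGSERIAL PRIMARY KEY,
    video_id     TEXT REFERENCES transcripts(video_id),
    summary_type TEXT NOT NULL,  -- e.g. 'detailed', 'bullets', 'logline'
    model        TEXT NOT NULL,  -- which LLM produced it
    summary      TEXT NOT NULL,
    embedding    vector(1536)
);
"""

with psycopg2.connect("dbname=ideation") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```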

Neo4j Knowledge Graph Implementation

What makes this system particularly powerful is the Neo4j knowledge graph that connects related content. Videos are clustered based on semantic similarity of their summaries or transcripts, creating a network of relationships. The process involves:

  1. Fetching processed content from the PostgreSQL database
  2. Grouping content by embedding similarity using clustering algorithms
  3. Creating relationship structures in Neo4j based on content, creator, theme, and other metadata
  4. Establishing weighted relationships between clusters based on semantic similarity metrics
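
Here's a condensed sketch of those four steps, assuming KMeans for the clustering. The node labels, relationship types, cluster count, and similarity threshold are illustrative choices rather than the exact schema:

```python
import numpy as np
from sklearn.cluster import KMeans
from neo4j import GraphDatabase


def build_graph(videos, n_clusters=25):
    """videos: dicts with 'video_id', 'creator', and a summary 'embedding'."""
    embeddings = np.array([v["embedding"] for v in videos])
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    with driver.session() as session:
        # Step 3: video, creator, and cluster nodes plus membership edges.
        for video, label in zip(videos, km.labels_):
            session.run(
                """
                MERGE (v:Video {id: $id})
                MERGE (c:Creator {name: $creator})
                MERGE (cl:Cluster {id: $cluster})
                MERGE (c)-[:PUBLISHED]->(v)
                MERGE (v)-[:IN_CLUSTER]->(cl)
                """,
                id=video["video_id"], creator=video["creator"], cluster=int(label),
            )
        # Step 4: weighted cluster-to-cluster edges from centroid cosine similarity.
        unit = km.cluster_centers_ / np.linalg.norm(
            km.cluster_centers_, axis=1, keepdims=True
        )
        sims = unit @ unit.T
        for i in range(n_clusters):
            for j in range(i + 1, n_clusters):
                if sims[i, j] > 0.8:  # illustrative threshold
                    session.run(
                        "MATCH (a:Cluster {id: $i}), (b:Cluster {id: $j}) "
                        "MERGE (a)-[r:SIMILAR_TO]->(b) SET r.weight = $w",
                        i=i, j=j, w=float(sims[i, j]),
                    )
    driver.close()
```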

The knowledge graph structure enables multi-hop reasoning, which traditional vector databases can't do effectively on their own. For instance, the system can identify:

  • Content clusters popular with specific creators
  • Relationships between content themes across different creators
  • "Bridge clusters" that connect different creators or topics
  • Content gaps and opportunities for new content

The graph database allows for complex queries that can reveal nuanced relationships and potential collaboration opportunities or thematic overlaps that wouldn't be obvious from simple similarity searches.
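
As one example, the query below finds "bridge clusters" shared by two creators, i.e. clusters that both creators' videos land in, using the illustrative labels from the graph-construction sketch above:

```python
from neo4j import GraphDatabase

# Hypothetical multi-hop query over the sketched schema.
BRIDGE_QUERY = """
MATCH (a:Creator {name: $creator_a})-[:PUBLISHED]->(:Video)-[:IN_CLUSTER]->(cl:Cluster),
      (b:Creator {name: $creator_b})-[:PUBLISHED]->(:Video)-[:IN_CLUSTER]->(cl)
RETURN cl.id AS cluster, count(*) AS overlap
ORDER BY overlap DESC
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for row in session.run(BRIDGE_QUERY, creator_a="CreatorOne", creator_b="CreatorTwo"):
        print(row["cluster"], row["overlap"])
driver.close()
```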

Idea Synthesis Process

The real magic happens in the ideation components that combine:

  1. RAG from PostgreSQL vector search
  2. Multi-hop queries from the Neo4j knowledge graph
  3. LLM generation using appropriate prompting techniques

Here's the general approach to idea synthesis:

  1. Retrieve a creator's recent content and style patterns from the PostgreSQL database
  2. Identify relevant content clusters from the Neo4j knowledge graph
  3. Find related clusters through graph traversal, potentially discovering non-obvious connections
  4. Prepare comprehensive context that includes creator style, audience preferences, content trends, and related successful content
  5. Use carefully engineered prompts with appropriate LLMs to generate contextually relevant ideas
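
Stitched together, the synthesis loop looks roughly like this. Every helper here (fetch_creator_profile, similar_chunks, related_clusters, call_llm) is a hypothetical stand-in for the components described above, and the prompt is deliberately simplified:

```python
def generate_ideas(creator: str, topic_hint: str, k: int = 5) -> str:
    """Sketch of the end-to-end ideation flow; helpers are placeholders."""
    profile = fetch_creator_profile(creator)       # step 1: style + recent content (PostgreSQL)
    chunks = similar_chunks(topic_hint, limit=10)  # pg_vector nearest neighbors
    clusters = related_clusters(creator, hops=2)   # steps 2-3: Neo4j graph traversal

    # Step 4: assemble the context; step 5: prompt the LLM.
    prompt = (
        "You are an ideation assistant for a YouTube creator.\n"
        f"Creator style and recent content: {profile}\n"
        f"Relevant transcript excerpts: {chunks}\n"
        f"Related content clusters: {clusters}\n\n"
        f"Suggest {k} video concepts, each with a hook, target audience, and format."
    )
    return call_llm(prompt)
```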

The system can generate multiple types of content suggestions:

  1. Concept ideas - Detailed video concepts with hooks, target audience information, and format suggestions
  2. Title ideas - Engaging, high-CTR title suggestions based on proven patterns
  3. Trend opportunities - Emerging topics within a creator's niche
  4. Cross-audience expansion ideas - Content that might help reach adjacent audiences

Performance Optimization with Vector Indexing

One challenge was ensuring fast query performance as the database grew, especially for the RAG system that relies on quick retrieval of relevant transcript chunks. I implemented vector indexing in PostgreSQL for the various embedding types using IVFFlat indexes with carefully tuned parameters.

These indexes significantly speed up nearest-neighbor searches by reducing the search space, making the RAG system work efficiently even with thousands of transcripts and their chunks. The indexing strategy required careful tuning based on the dataset characteristics, query patterns, and performance requirements.
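
As a concrete example, here's how an IVFFlat index might be created on the chunk table from the earlier schema sketch. The lists value and probe count below are starting points rather than tuned values; pgvector's rule of thumb is roughly rows/1000 for lists and sqrt(lists) for probes:

```python
import psycopg2

with psycopg2.connect("dbname=ideation") as conn:
    with conn.cursor() as cur:
        # Build the approximate index over chunk embeddings (cosine distance).
        cur.execute("""
            CREATE INDEX IF NOT EXISTS idx_chunks_embedding
            ON transcript_chunks
            USING ivfflat (embedding vector_cosine_ops)
            WITH (lists = 100);
        """)
        # At query time, trade recall for speed by tuning how many lists to scan.
        cur.execute("SET ivfflat.probes = 10;")
```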

Challenges and Learnings

Building this system taught me several valuable lessons:

  1. LLM Quality Matters - Different models produced dramatically different summary quality. Some models offered an excellent balance of detailed output and cost efficiency for the summarization tasks. Prompt engineering was crucial for extracting specific types of information consistently.

  2. Chunking Strategies - Finding the right chunking strategy for RAG was essential. While fixed-size chunking worked initially, semantic chunking often yielded better contextual relevance. Using more sophisticated LLMs to analyze transcripts and break them into semantically meaningful sections proved more effective than relying solely on fixed token counts (see the sketch after this list).

  3. Rate Limiting and Asynchronous Processing - Managing API rate limits for both the YouTube Data API and various LLM services required careful implementation. Asynchronous processing with dynamic rate limiting based on token usage patterns proved effective for maximizing throughput while staying within API constraints.

  4. Graph Schema Evolution - As the project evolved, the Neo4j schema needed updates to capture new relationships and properties. Designing a more explicit and versioned schema management approach from the start could have streamlined this process.
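
Here's a minimal sketch of the LLM-assisted semantic chunking from learning #2. call_llm is the same hypothetical wrapper as in the synthesis sketch, and the delimiter convention is illustrative:

```python
def semantic_chunks(transcript: str) -> list[str]:
    """Ask an LLM to mark section boundaries, then split on the marker."""
    prompt = (
        "Split the following transcript into semantically coherent sections. "
        "Insert the marker <<<SPLIT>>> between sections and change nothing else.\n\n"
        + transcript
    )
    marked = call_llm(prompt)  # hypothetical LLM wrapper
    return [chunk.strip() for chunk in marked.split("<<<SPLIT>>>") if chunk.strip()]
```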

Results and Future Work

The system now helps creators generate ideas that blend proven formats with novel angles. Future enhancements include:

  1. Audience awareness - Analyzing audience engagement metrics alongside content clusters
  2. Trend intelligence - Tracking temporal patterns in content performance
  3. Multi-platform insights - Expanding beyond YouTube to other platforms
  4. A/B testing integration - To validate idea performance predictions

Conclusion

Combining RAG with a knowledge graph creates a powerful ideation system that goes beyond simple semantic search. The multi-hop reasoning enabled by Neo4j, coupled with the efficient vector search of pg_vector, allows for nuanced content recommendations that respect both the creator's unique style and audience preferences.

This technology stack demonstrates that AI-assisted content creation doesn't mean standardization—it can instead help amplify a creator's unique voice while ensuring their content resonates with their audience.