$ man how-to/build-sqlite-content-index

CLI Toolsadvanced

How to Build a SQLite Content Index for Your Repo

Turn your file system into a queryable database with zero external dependencies


Why a SQLite Index for Content

A repo with 100+ content files across 6 platforms becomes opaque. The file system organizes files by path but cannot answer questions like: how many posts went final this week, which content has cross-platform siblings, or what is the total word count for February. A SQLite database sitting alongside the repo gives you SQL queries over your content without changing the source of truth. The index is derived data — rebuilt from git-tracked files on every run. Delete the database, run the script, get the same result. Zero external dependencies. Python stdlib only: json, sqlite3, pathlib, re.
CODE

Schema Design — Nine Tables

The schema covers every content type in the repo. The content table holds every draft and final across all platforms with fields for platform, stage, title, slug, date, pillar, arc, series, and word count. daily_logs tracks performance metrics. sessions is append-only and survives index rebuilds — historical records, not derived data. skills indexes the Claude and Cursor skill registry. content_links stores the relationship graph with two link types: series_sibling (auto-detected) and cross_platform_note (parsed from content). assets catalogs visual assets with structured metadata parsed from filenames. videos and thumbnails track the video pipeline. The schema design principle: every table maps to a content type, every row is derived from a file, every rebuild is idempotent.
PATTERN

Metadata Parsing — Two Formats

Content files use two metadata formats. Most platforms use blockquote syntax: > **Key**: Value at the top of the file. Website posts use YAML frontmatter between --- delimiters. The parser detects the format automatically and extracts structured fields. Title is extracted from metadata or falls back to the first # heading. Word count strips frontmatter before counting. The dual-format parser means you do not need to standardize your entire content repo to one format. Meet the content where it is.
PATTERN

Cross-Platform Link Detection

The index discovers relationships between content files automatically. Implicit sibling detection matches files with identical (date, slug) across platforms and creates series_sibling links. If you have linkedin/final/2026-02-17_build-your-own-os.md and substack/final/2026-02-17_build-your-own-os.md, the index links them without manual annotation. Explicit cross-reference detection parses Cross-Platform Notes sections, looks for platform keywords with aliases, and matches to existing content by date and platform. This creates a queryable content graph showing how pieces relate across your publishing pipeline.
CODE

Asset and Video Inventory

Filename patterns encode structured metadata. The asset parser extracts type, tier, class, variant, and size from filenames like tier-3-idle-256.gif or class-alchemist-static.png. The video parser extracts brand, aspect ratio, and format from filenames like contentos-landscape or gtmos-linkedin-4x5. Brand aliases handle variations — lead-magnet maps to shawnos, for example. The result is a queryable inventory of every visual asset in the repo without manual cataloging. Run a query to find all tier-3 idle animations for shawnos. Run another to find deployed videos missing from a specific site.
PRO TIP

Dead Page Detection and Content Gaps

Query the content table for files with zero inbound links from content_links — those are orphans, content that exists but nothing points to. Query for files with zero outbound links — those are dead ends that do not connect forward. The most powerful use is gap detection: query for expected topics that have zero coverage. This is how the index revealed its own gap. Three major systems shipped with no blog coverage. The tool that finds content gaps found content gaps about the tool. Use the index as a content audit instrument, not just a catalog. Run it weekly. Let the queries tell you what to write next.

related guides
How to Build Your Own Content Engineering SystemHow to Build a React Video Rendering System with RemotionHow to Design a Content Cluster Strategy Across Multiple Sites
← how-to wikiknowledge guide →
ShawnOS.ai|theGTMOS.ai|theContentOS.ai
built with Next.js · Tailwind · Claude · Remotion