Querying Your Repo Like a Database
tl;dr: 100+ content files across 6 platforms were impossible to query from the file system. So I built a SQLite index that makes the entire repo queryable from the command line. One script, 9 tables, zero external dependencies. It even found its own content gap.
what problem does a content index solve?
the repo had over 100 content files. LinkedIn drafts, X threads, Substack newsletters, Reddit posts, website blog entries, TikTok scripts. spread across 6 platforms, split between draft and final stages. the directory structure was clean, but the data was opaque. I couldn't answer basic questions without manually scanning files.
how many LinkedIn posts went final this week? which content has cross-platform siblings? what's the total word count for February? which assets are deployed and which are sitting in source?
the file system is great for organizing. it's terrible for querying.
how is the SQLite index built?
scripts/build_index.py walks the repo, parses every content file, and loads the results into a 9-table SQLite database at data/index.db. zero external dependencies. stdlib only. json, sqlite3, pathlib, re. that's it.
the index is derived data. it's rebuilt from git-tracked files every time you run it. delete the database, run the script, get the same result. the source of truth is always the repo. the database is just a query layer on top.
$ python3 scripts/build_index.py
Building index: data/index.db
Content: 80 files indexed
Daily logs: 11 days indexed
Skills: 54 indexed
Assets: 522 visual assets indexed
Videos: 3 video files indexed
Content links: 75 series-sibling pairs detected
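the core pattern is small enough to sketch. here's a hedged, minimal version of the rebuild loop — the real build_index.py handles more fields, more platforms, and more tables, and the column names and directory layout below are assumptions:

```python
import re
import sqlite3
from pathlib import Path

# minimal sketch: parse "> **Key**: Value" blockquote metadata and rebuild
# a derived table from scratch. column names and layout are illustrative.
META_RE = re.compile(r"^>\s*\*\*(?P<key>[^*]+)\*\*:\s*(?P<value>.+)$")

def parse_blockquote_meta(text: str) -> dict:
    """collect '> **Key**: Value' lines into a lowercase-keyed dict."""
    meta = {}
    for line in text.splitlines():
        m = META_RE.match(line.strip())
        if m:
            meta[m.group("key").strip().lower()] = m.group("value").strip()
    return meta

def rebuild(db_path="data/index.db", root="content"):
    con = sqlite3.connect(db_path)
    # derived tables get dropped and recreated on every run
    con.execute("DROP TABLE IF EXISTS content")
    con.execute("""CREATE TABLE content (
        platform TEXT, stage TEXT, date TEXT, slug TEXT,
        title TEXT, word_count INTEGER, path TEXT)""")
    for path in Path(root).rglob("*.md"):
        text = path.read_text(encoding="utf-8")
        meta = parse_blockquote_meta(text)
        platform, stage = path.parts[-3], path.parts[-2]  # assumes e.g. linkedin/final/...
        date, _, slug = path.stem.partition("_")          # e.g. 2026-02-17_build-your-own-os
        con.execute(
            "INSERT INTO content VALUES (?, ?, ?, ?, ?, ?, ?)",
            (platform, stage, date, slug, meta.get("title"), len(text.split()), str(path)),
        )
    con.commit()
    con.close()

if __name__ == "__main__":
    rebuild()
```

delete data/index.db, run it again, and you get the same database back. that's the derived-data property in practice.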
how is the SQLite schema structured?
nine tables. each one indexes a different content type.
content: the core table. every draft and final across all platforms. fields include platform, stage, title, slug, date, pillar, arc, series, word count. metadata is parsed from two formats: blockquote syntax (> **Key**: Value) for most platforms, YAML frontmatter for website posts.
daily_logs: performance metrics from the daily tracker. output score, letter grade, word count, shipped count, agent cost, ROI multiplier, commits.
sessions: context handoff history. this table is append-only and survives index rebuilds. every other table gets dropped and recreated. sessions persist because they're historical records, not derived data.
skills: the Claude and Cursor skill registry. 54 skills indexed with name, description, file path, category.
content_links: the relationship graph. two link types: series_sibling (same date and slug across platforms, detected automatically) and cross_platform_note (explicit references parsed from Cross-Platform Notes sections).
assets: 522 visual assets across the progression system. tier avatars, class badges, tool icons, Nio variants, sprite sheets. filename patterns get parsed into structured data: tier-3-idle-256.gif becomes asset_type=tier, tier=3, variant=idle, size_px=256 (see the parsing sketch after this list).
videos: video file catalog with brand, aspect ratio, format, deployment status.
thumbnails: thumbnail inventory by brand and variant.
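the asset filename convention is regular enough for one regex. a minimal sketch of that parsing, built from the tier example above — the actual pattern in build_index.py covers more asset types and may differ:

```python
import re

# hypothetical sketch of the asset filename convention:
# tier-3-idle-256.gif -> asset_type=tier, tier=3, variant=idle, size_px=256
ASSET_RE = re.compile(
    r"^(?P<asset_type>[a-z]+)-(?P<tier>\d+)-(?P<variant>[a-z]+)-(?P<size_px>\d+)\.(gif|png)$"
)

def parse_asset_name(filename: str) -> dict | None:
    m = ASSET_RE.match(filename)
    if not m:
        return None
    return {
        "asset_type": m.group("asset_type"),
        "tier": int(m.group("tier")),
        "variant": m.group("variant"),
        "size_px": int(m.group("size_px")),
    }

print(parse_asset_name("tier-3-idle-256.gif"))
# {'asset_type': 'tier', 'tier': 3, 'variant': 'idle', 'size_px': 256}
```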
how do you query the index?
scripts/query_index.py is the read-only interface. eight subcommands with filtering.
$ python3 scripts/query_index.py content --platform linkedin --since 2026-02-15
$ python3 scripts/query_index.py content --platform website --stage final
$ python3 scripts/query_index.py stats --latest 3
$ python3 scripts/query_index.py skills --category claude
$ python3 scripts/query_index.py links --date 2026-02-17
$ python3 scripts/query_index.py assets --site shawnos --type tier --tier 3
$ python3 scripts/query_index.py videos --brand gtmos --source-only
output modes: table (default), JSON (--json), row count (--count). the table output is human-readable. the JSON output pipes into other scripts.
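that's what makes the JSON mode useful. a small example of consuming it from another script — the field names in the output are an assumption about its shape:

```python
import json
import subprocess

# run a query in JSON mode and consume the rows downstream.
# "word_count" is an assumed field name in the JSON output.
raw = subprocess.run(
    ["python3", "scripts/query_index.py", "content",
     "--platform", "linkedin", "--stage", "final", "--json"],
    capture_output=True, text=True, check=True,
).stdout
rows = json.loads(raw)
print(sum(row.get("word_count", 0) for row in rows), "words across", len(rows), "finals")
```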
how does cross-platform link detection work?
this is where it gets interesting. the index doesn't just catalog files. it discovers relationships between them.
implicit sibling detection: files with identical (date, slug) across platforms get linked as series_sibling. if you have linkedin/final/2026-02-17_build-your-own-os.md and substack/final/2026-02-17_build-your-own-os.md, the index knows they're related without you telling it.
explicit cross-reference detection: the script parses ## Cross-Platform Notes sections, looks for platform keywords with aliases (LinkedIn, X/Twitter, Reddit), and matches them to existing content by date and platform.
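the implicit rule is a single self-join. a sketch of that query using the content columns listed earlier:

```python
import sqlite3

# hedged sketch of implicit sibling detection: pair up rows that share
# (date, slug) but live on different platforms. the platform < platform
# condition keeps each pair once and skips self-matches.
con = sqlite3.connect("data/index.db")
pairs = con.execute(
    """SELECT c1.platform, c2.platform, c1.date, c1.slug
       FROM content c1
       JOIN content c2
         ON c1.date = c2.date
        AND c1.slug = c2.slug
        AND c1.platform < c2.platform"""
).fetchall()
for left, right, date, slug in pairs:
    print(f"{date} {slug}: {left} <-> {right}")
```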
75 sibling pairs and 4 explicit cross-links in the current index. the content graph is real and queryable.
what is the dead page detector?
this is why this post exists. the index revealed its own gap.
query the content table for files with zero inbound links from content_links. those are orphans. content that exists but nothing points to. query for files with zero outbound links. those are dead ends. content that doesn't connect to anything else.
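both checks are one query each. a sketch that assumes content_links stores a source path and a target path per row — the real column names may differ:

```python
import sqlite3

# hedged sketch of the dead page detector. assumes content_links has
# source_path / target_path columns that point back at content.path.
con = sqlite3.connect("data/index.db")

# orphans: nothing points to them (zero inbound links)
orphans = con.execute(
    """SELECT path FROM content
       WHERE path NOT IN (SELECT target_path FROM content_links)"""
).fetchall()

# dead ends: they point at nothing (zero outbound links)
dead_ends = con.execute(
    """SELECT path FROM content
       WHERE path NOT IN (SELECT source_path FROM content_links)"""
).fetchall()

print(len(orphans), "orphans,", len(dead_ends), "dead ends")
```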
I ran those queries and found that three major systems had shipped with zero blog coverage: the Remotion video system, this SQLite index itself, and the content cluster topology. the tool that finds content gaps revealed content gaps about the tool.
how does the system document itself?
this post is a direct result of the system it describes. the SQLite index made the repo queryable. a query revealed that no content covered the index. so I wrote this post. when I rebuild the index, this post appears in it. the system documents itself.
that's not a gimmick. that's the content engineering thesis. the system produces the content that describes the system. every new capability becomes a new piece of content. every new piece of content strengthens the knowledge graph. the loop compounds.
frequently asked questions
why use SQLite for a content index? SQLite is zero-dependency, file-based, and fast enough for thousands of content files. it ships with Python's standard library so you don't need any external packages. the database is a single file you can delete and rebuild from git-tracked source files at any time.
can you query a blog like a database? yes. once your content is indexed, you can run SQL queries against it. filter by platform, date range, word count, stage, pillar, series. the query CLI supports table output for humans and JSON output for piping into other scripts.
what is a dead page detector? it's a query pattern that finds orphaned content. pages with zero inbound links (nothing points to them) and pages with zero outbound links (they don't connect to anything else). these are the posts that exist but don't participate in your content graph.
keep reading
- content cluster breadcrumbs: the topology system
- recursive drift: thinking with AI without losing your voice