Generative Engine Optimization: Auditing Websites for AI-Readiness
What GEO is, why it matters for the AI search era, and how I built GeoKit to audit any website's AI-readiness with a 0-100 score.
Gagan Deep Singh
Founder | GLINR Studios
Search is changing faster than most developers realize. ChatGPT can now browse the web. Perplexity indexes and summarizes pages in real time. Google's AI Overviews answer questions directly, sometimes without the user ever clicking a result. The implicit contract between websites and search engines, the one that SEO has been optimizing for thirty years, is being renegotiated by AI systems that care about very different things than Googlebot did.
This is the problem space that Generative Engine Optimization (GEO) addresses. I've been thinking about it seriously enough to build a toolchain around it: the @glincker/geokit ecosystem for auditing and improving AI-readiness. Here's what I've learned.
What GEO Actually Is
SEO was about signals: keywords, backlinks, page speed, Core Web Vitals. The audience was a crawler that indexed text and ranked pages by relevance to queries.
GEO is about comprehension. The audience is now a language model that reads your page, extracts information, and synthesizes an answer. The model doesn't care about your keyword density. It cares whether it can accurately understand what your page is about, who wrote it, what claims it makes, and whether those claims are trustworthy enough to cite.
A site that scores well for GEO is one that an AI model can read, understand, and confidently reference. Structured data tells the model what kind of content this is. Semantic HTML gives it document structure. An llms.txt file tells it how you want your content used. Clean prose without ambiguity gives it accurate facts to pull from.
The overlap with good SEO is real, but the specifics differ. A site can have strong SEO signals and be nearly useless to an AI model, and a site can be almost invisible to traditional search while being highly legible to AI systems.
Why This Matters Right Now
The numbers are hard to ignore. Perplexity's monthly active users roughly doubled through 2025. ChatGPT search launched as a default for Plus subscribers and immediately started pulling traffic from traditional search. Google's AI Overviews are now shown on a significant percentage of informational queries.
What this means practically: users are getting answers without visiting pages. If your content is AI-readable and gets cited, you get a mention in the AI response and possibly a click. If your content is AI-unreadable, you get nothing. The page might rank fine in traditional search and still be invisible to the AI layer sitting on top of it.
The developers and teams that figure this out early have a genuine advantage. GEO is roughly where SEO was in 2005: the concepts are real, the tools are immature, and the practitioners who take it seriously first will set the patterns everyone else follows.
The Shift from SEO to GEO
Traditional SEO checklist: title tags, meta descriptions, canonical URLs, sitemap.xml, robots.txt, page speed, backlinks, keyword targeting.
GEO checklist: all of that, plus structured data with schema.org vocabularies, semantic HTML (not just div soup), an llms.txt file, citation-friendly content structure, factual precision, author attribution, date accuracy, and clear provenance signals.
The structural differences are worth examining:
Structured data moves from nice-to-have to essential. JSON-LD with schema.org types lets AI models know they're looking at an Article vs. a Product vs. a FAQPage. Without it, the model has to guess from context and it often guesses wrong or hedges its answer.
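For illustration, a minimal Article block in JSON-LD looks like the following (the field values are placeholders; embed it in a `<script type="application/ld+json">` tag in the page head):

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Building Type-Safe APIs with Hono",
  "author": { "@type": "Person", "name": "Jane Doe", "url": "https://example.com/about" },
  "datePublished": "2026-01-15",
  "dateModified": "2026-01-20",
  "description": "A deep dive into building fully type-safe HTTP APIs with Hono and TypeScript."
}
```

With this in place, a model reading the page knows it's an article, who wrote it, and how fresh it is, without having to infer any of that from prose.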
Semantic HTML matters again. Years of div-heavy React apps trained developers to think about HTML as just a rendering target. AI models reading your page for content treat <article>, <section>, <h1>-<h6>, and <nav> as real signals about document structure. A page with one <div> inside another <div> inside another <div> gives the model nothing to anchor to.
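The difference is easy to see side by side. This is a sketch of the structure a model can anchor to, with real landmark elements instead of nested divs:

```html
<!-- Landmarks and heading hierarchy give a parser a document outline -->
<article>
  <h1>Building Type-Safe APIs</h1>
  <section>
    <h2>Why type safety matters</h2>
    <p>One idea per short paragraph, under a descriptive heading.</p>
  </section>
</article>
<nav aria-label="Related posts">
  <a href="/blog/hono-middleware">Hono middleware patterns</a>
</nav>
```

The same content rendered as anonymous divs looks identical in a browser but gives an extraction model no structural signal at all.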
llms.txt is new and increasingly important. It's a plain text file at the root of your domain that tells AI systems how your content should be used. Think of it as a robots.txt but for language models instead of crawlers. Which content is freely usable for context? Which is behind a paywall and should not be cited as freely accessible? What's the preferred format for referencing your site?
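The format is still settling, so treat this as a hypothetical sketch rather than a finalized spec, but an llms.txt along the lines of the emerging convention might look like:

```text
# Example Blog

> Technical articles on TypeScript, distributed systems, and developer tooling.

## Allowed usage
- Retrieval and citation permitted for all public content
- Paywalled content under /members should not be cited as freely accessible

## Content
- /blog: articles, CC BY 4.0

## Attribution
- Preferred citation format: https://example.com/blog/{slug}
```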
Content structure changes. AI models extract information by proximity and heading hierarchy. A claim buried in paragraph seven with no heading context is easy to misattribute or miss entirely. Short paragraphs, descriptive headings, and explicit topic sentences make content more extractable.
How I Built the GeoKit Ecosystem
The problem I kept running into: I wanted to know whether a given site would show up accurately in AI responses, and there was no tooling for it. Traditional SEO audit tools check the wrong things. So I built @glincker/geokit, a suite of three packages with a clear pipeline: audit -> generate -> convert.
@glincker/geo-audit
This is the scorer. It fetches a URL, analyzes the page, and returns a 0-100 AI-readiness score with itemized results for each check.
import { audit } from '@glincker/geo-audit';
const result = await audit('https://example.com');
console.log(result.score); // 62
console.log(result.checks);
// [
// { id: 'structured-data', score: 15, max: 20, status: 'partial', detail: 'JSON-LD present but missing author field' },
// { id: 'llms-txt', score: 0, max: 15, status: 'missing', detail: 'No llms.txt found at /.well-known/llms.txt or /llms.txt' },
// { id: 'semantic-html', score: 12, max: 15, status: 'pass', detail: 'Heading hierarchy valid, article element present' },
// { id: 'meta-tags', score: 8, max: 10, status: 'pass', detail: 'Title and description present, OG tags complete' },
// { id: 'schema-types', score: 10, max: 15, status: 'partial', detail: 'Article schema found, missing dateModified and author' },
// { id: 'content-structure', score: 10, max: 15, status: 'partial', detail: 'Average paragraph length high (>120 words)' },
// { id: 'robots-txt', score: 7, max: 10, status: 'pass', detail: 'Sitemap declared, no aggressive blocking' },
// ]

The scoring system has seven primary checks, each weighted by how much that factor affects AI comprehension in practice. Structured data carries the most weight because it's the most direct signal. A missing llms.txt is a significant deduction because it's a simple fix with real impact.
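To make the weighting concrete, here is a sketch of how itemized checks can roll up into a 0-100 score. The field names mirror the audit output above, but the rollup logic is an illustration, not geo-audit's actual implementation:

```typescript
// Each check earns `score` points out of a possible `max`; the weights are
// expressed through the max values (structured data: 20, llms.txt: 15, ...).
interface Check {
  id: string;
  score: number; // points earned for this check
  max: number;   // maximum points the check can contribute
}

function overallScore(checks: Check[]): number {
  const earned = checks.reduce((sum, c) => sum + c.score, 0);
  const possible = checks.reduce((sum, c) => sum + c.max, 0);
  return Math.round((earned / possible) * 100);
}

// The itemized results from the example audit roll up to the same 62:
const checks: Check[] = [
  { id: 'structured-data', score: 15, max: 20 },
  { id: 'llms-txt', score: 0, max: 15 },
  { id: 'semantic-html', score: 12, max: 15 },
  { id: 'meta-tags', score: 8, max: 10 },
  { id: 'schema-types', score: 10, max: 15 },
  { id: 'content-structure', score: 10, max: 15 },
  { id: 'robots-txt', score: 7, max: 10 },
];
console.log(overallScore(checks)); // 62
```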
@glincker/geo-seo
This is the generator. Given site metadata and content description, it produces the files you need: llms.txt, JSON-LD structured data blocks, and an updated robots.txt that's AI-aware.
import { generateLlmsTxt, generateJsonLd, generateRobotsTxt } from '@glincker/geo-seo';
const llmsTxt = generateLlmsTxt({
siteName: 'Example Blog',
siteUrl: 'https://example.com',
description: 'Technical articles on TypeScript, distributed systems, and developer tooling.',
allowedUsage: ['training', 'retrieval', 'citation'],
contentTypes: [{ path: '/blog', type: 'article', license: 'CC BY 4.0' }],
preferredCitation: 'https://example.com/blog/{slug}',
});
const articleSchema = generateJsonLd({
type: 'Article',
headline: 'Building Type-Safe APIs with Hono',
author: { name: 'G', url: 'https://thegdsks.com' },
datePublished: '2026-01-15',
dateModified: '2026-01-20',
description: 'A deep dive into building fully type-safe HTTP APIs with Hono and TypeScript.',
});

The llms.txt format isn't fully standardized yet, but there's an emerging convention that I've tried to align with. The key sections are: site description, allowed uses, content index with paths and types, and attribution preferences.
@glincker/geomark
This is the converter. It takes a URL and returns clean markdown stripped of navigation, ads, cookie banners, and other noise. The output is what you'd want to paste into an AI context window for research or summarization.
import { toMarkdown } from '@glincker/geomark';
const md = await toMarkdown('https://example.com/blog/some-article', {
includeImages: false,
includeTables: true,
cleanNavigation: true,
preserveCodeBlocks: true,
});
// Returns clean article text in markdown, ready for LLM context

This one is useful in two directions. Developers use it to build retrieval pipelines. Site owners use it to check what an AI model would actually extract from their pages, which is often different from what they expect.
What the Audit Catches in Practice
I've run the audit against a sample of popular developer-facing sites. The pattern is consistent: technical content sites score well on meta tags and HTML structure, but fall down on structured data and llms.txt. Marketing sites often have the inverse problem: polished structured data from their CMS but div-heavy content that's hard to extract.
Common failures I see:
JSON-LD without author attribution. An Article schema without an author field means AI models have to guess or report the source as unknown. For content you want cited, this matters.
Missing dateModified. Models use modification dates to judge freshness. A tutorial written in 2018 and never updated should probably declare that. A page that's regularly updated and doesn't say so loses freshness signals.
Heading hierarchies that skip levels. Going from h1 to h3 with no h2 in between confuses document parsing. It's a minor issue in a browser but a meaningful signal problem for a model building a content outline.
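A skipped-level detector is only a few lines. This is an illustrative sketch, not geo-audit's internal check: given the sequence of heading levels as they appear in the document, flag any downward jump of more than one level.

```typescript
// Takes heading levels in document order (1 for h1, 2 for h2, ...) and
// reports each place where the hierarchy skips a level, e.g. h1 -> h3.
function findSkippedLevels(levels: number[]): string[] {
  const problems: string[] = [];
  for (let i = 1; i < levels.length; i++) {
    if (levels[i] > levels[i - 1] + 1) {
      problems.push(`h${levels[i - 1]} -> h${levels[i]} skips h${levels[i - 1] + 1}`);
    }
  }
  return problems;
}

console.log(findSkippedLevels([1, 2, 3, 2, 3])); // [] — moving back up is fine
console.log(findSkippedLevels([1, 3, 4]));       // ['h1 -> h3 skips h2']
```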
Robots.txt blocking AI crawlers without intent. Some sites accidentally block Perplexity or other AI user agents through overly broad Disallow rules. If you want AI visibility, your robots.txt needs to explicitly allow the crawlers you care about, or use a permissive wildcard.
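A robots.txt that names the AI crawlers you want, instead of relying on a broad wildcard Disallow, avoids the accidental-blocking problem. GPTBot and PerplexityBot are real user agents; the paths here are placeholders:

```text
# Explicitly allow the AI crawlers you care about
User-agent: GPTBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Everyone else: block only what you actually mean to block
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
```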
llms.txt simply not existing. The majority of sites I've audited don't have one. It's a fifteen-minute addition that immediately improves how AI systems understand your content preferences.
A Practical GEO Checklist
What I'd tell a developer who wants to improve their site's AI-readiness:
- Add JSON-LD with a schema.org `Article` type to every blog post. Include `author`, `datePublished`, `dateModified`, and `description`. Make it machine-readable, not just human-readable.
- Create `/llms.txt` at your domain root. Describe what the site is, what AI systems are allowed to do with the content, and how you want it cited.
- Audit your semantic HTML. Every article should live in an `<article>` element. Headings should form a valid hierarchy. Navigation and footer should be in `<nav>` and `<footer>` respectively, not generic divs.
- Keep paragraphs under 100 words on average. Long paragraphs are harder to extract cleanly. Short, precise paragraphs with one idea each are friendlier to summarization models.
- Run `geo-audit` against your own pages. The itemized results show exactly what to fix and roughly how much each fix improves the score. Prioritize by weight, not by ease.
- Add `<link rel="canonical">` to every page if you haven't already. Duplicate content is as confusing for AI models as for traditional search.
- Check your robots.txt for unintentional AI crawler blocking. Add explicit `Allow` rules for crawlers you want indexed.
- Update stale content with `dateModified`. Even if the content is still accurate, an old `datePublished` and no `dateModified` sends a staleness signal that hurts citation likelihood.
What's Next for GeoKit
The audit has seven checks now. The next round will add: reading level scoring (content should match the stated audience), citation density analysis (does the page reference external sources that an AI could verify?), and detection of malformed structured data, not just its presence or absence.
A browser extension is in early design. The idea is a one-click audit panel that runs the geo-audit checks against whatever page you're viewing and shows the score inline. Useful for quick competitive research and for developers checking their own pages during development.
CI integration is the longer-term goal. A GitHub Action that runs geo-audit on changed pages in a pull request and comments the delta score. The same way Lighthouse CI gates on performance regressions, GEO CI would gate on AI-readiness regressions.
The llms.txt space is moving fast and the format will probably stabilize into something more structured in the next year. GeoKit's generator will track whatever becomes the consensus.
Why This Is Worth Your Time Now
Most sites are not taking GEO seriously yet. That means the gap between AI-readable and AI-opaque sites is large and the cost of closing it is still low. Adding structured data, an llms.txt, and cleaning up semantic HTML is hours of work, not weeks. The sites that do it now will have a meaningful advantage as AI-mediated search continues to grow.
The underlying principle is the same one that made well-structured HTML good for accessibility: when you describe your content clearly and precisely, the machines that process it do a better job of representing it accurately. For accessibility, that machine is a screen reader. For GEO, it's a language model deciding whether to cite you. The investment is the same; the audience is just different.