How We Built Glimpse's Audit Engine
Building a site audit tool that actually catches real problems is harder than it looks. Early versions of Glimpse would give sites near-perfect scores while professional tools flagged hundreds of issues. We had to completely rethink our approach.
Here's how we built an audit engine that catches what matters.
The Problem with Simple Audits
Our first implementation was naive: fetch pages, check for missing meta tags, calculate a score. A site could have:
- 45 redirect chains
- 39 canonicals pointing to redirects
- Orphan pages with no internal links
- Broken redirects
...and still score 90+. That's useless.
Real SEO problems aren't just missing <title> tags. They're structural issues in how pages link together, how redirects chain, and whether search engines can actually index your content.
The Architecture
Glimpse now uses a multi-phase audit pipeline:
Discovery → Crawl → Link Analysis → Issue Detection → Scoring
Each phase builds on the previous one, and the final score reflects actual site health.
Phase 1: Discovery
Before crawling, we map out the site:
robots.txt Analysis
GET https://example.com/robots.txt
We extract:
- Sitemap locations
- Disallow rules (what we shouldn't crawl)
- Crawl-delay directives
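A minimal sketch of this parsing step (the fetchRobots and RobotsRules names are illustrative; a production parser should scope Disallow and Crawl-delay to the matching User-agent group and handle wildcard rules):

```typescript
// Simplified robots.txt parsing: collect Sitemap, Disallow, and Crawl-delay lines.
// NOTE: a real parser must scope Disallow/Crawl-delay to the matching User-agent group.
interface RobotsRules {
  sitemaps: string[];
  disallow: string[];
  crawlDelay?: number;
}

async function fetchRobots(domain: string): Promise<RobotsRules> {
  const rules: RobotsRules = { sitemaps: [], disallow: [] };
  const response = await fetch(`https://${domain}/robots.txt`);
  if (!response.ok) return rules; // no robots.txt: nothing is disallowed

  for (const rawLine of (await response.text()).split('\n')) {
    const line = rawLine.split('#')[0].trim(); // strip comments
    const [key, ...rest] = line.split(':');
    const value = rest.join(':').trim();
    if (!value) continue;

    switch (key.toLowerCase()) {
      case 'sitemap':     rules.sitemaps.push(value); break;
      case 'disallow':    rules.disallow.push(value); break;
      case 'crawl-delay': rules.crawlDelay = Number(value); break;
    }
  }
  return rules;
}
```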
Sitemap Parsing
Sitemaps can be nested. A sitemap index points to other sitemaps, which might point to more sitemaps. We follow the entire tree:
async function discoverUrls(domain: string): Promise<string[]> {
  const sitemapUrls = await findSitemaps(domain);
  const allUrls = new Set<string>();

  for (const sitemapUrl of sitemapUrls) {
    const urls = await parseSitemap(sitemapUrl);
    urls.forEach(url => allUrls.add(url));
  }

  return filterCrawlableUrls([...allUrls]);
}
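The nesting is handled inside parseSitemap. Here's a minimal sketch of one way to do it, using regex extraction rather than a full XML parser (a real implementation should also handle gzipped sitemaps and cap the total URL count):

```typescript
// Recursively expand a sitemap: a <sitemapindex> lists child sitemaps,
// a <urlset> lists page URLs. Regex keeps the sketch dependency-free.
async function parseSitemap(sitemapUrl: string, depth = 0): Promise<string[]> {
  if (depth > 5) return []; // guard against pathological nesting

  const response = await fetch(sitemapUrl);
  if (!response.ok) return [];
  const xml = await response.text();

  const locs = [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map(m => m[1].trim());

  if (xml.includes('<sitemapindex')) {
    // Each <loc> points at another sitemap: recurse and flatten.
    const nested = await Promise.all(locs.map(loc => parseSitemap(loc, depth + 1)));
    return nested.flat();
  }
  return locs; // plain <urlset>: these are page URLs
}
```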
Phase 2: Crawling with Redirect Tracking
This is where things get interesting. We don't just fetch pages—we track the entire redirect journey.
Manual Redirect Following
Instead of letting fetch() auto-follow redirects, we handle them manually:
async function crawlWithRedirects(url: string): Promise<CrawlResult> {
  const chain: RedirectHop[] = [];
  let currentUrl = url;
  let response: Response;

  const MAX_HOPS = 10; // stop runaway redirect loops

  while (true) {
    response = await fetch(currentUrl, { redirect: 'manual' });

    if (response.status >= 300 && response.status < 400) {
      const location = response.headers.get('location');
      if (!location || chain.length >= MAX_HOPS) break; // malformed or looping redirect

      // Location may be relative; resolve it against the current URL
      const nextUrl = new URL(location, currentUrl).toString();

      chain.push({
        from: currentUrl,
        to: nextUrl,
        status: response.status,
        isHttpToHttps: isHttpToHttpsRedirect(currentUrl, nextUrl)
      });

      currentUrl = nextUrl;
      continue;
    }

    break;
  }

  return {
    originalUrl: url,
    finalUrl: currentUrl,
    redirectChain: chain,
    finalStatus: response.status,
    html: await response.text()
  };
}
This captures:
- Redirect chains: Multiple hops (A → B → C)
- Redirect types: 301 vs 302 vs 307
- HTTP→HTTPS redirects: Protocol upgrades
- Broken redirects: Chains that end in 4xx/5xx
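From those crawl results, the redirect checks fall out naturally. Here's a sketch of the kind of classification involved (the issue names are illustrative; CrawlResult and RedirectHop follow the shapes used in crawlWithRedirects above):

```typescript
// Classify redirect problems for a single crawled URL.
interface RedirectIssue {
  url: string;
  type: 'redirect_chain' | 'broken_redirect' | 'temporary_redirect';
  detail: string;
}

function detectRedirectIssues(result: CrawlResult): RedirectIssue[] {
  const issues: RedirectIssue[] = [];
  const hops = result.redirectChain;
  if (hops.length === 0) return issues;

  if (hops.length > 1) {
    issues.push({
      url: result.originalUrl,
      type: 'redirect_chain',
      detail: `${hops.length} hops before reaching ${result.finalUrl}`,
    });
  }

  if (result.finalStatus >= 400) {
    issues.push({
      url: result.originalUrl,
      type: 'broken_redirect',
      detail: `chain ends in HTTP ${result.finalStatus}`,
    });
  }

  // Flag temporary redirects that aren't simple HTTP→HTTPS upgrades
  for (const hop of hops) {
    if ((hop.status === 302 || hop.status === 307) && !hop.isHttpToHttps) {
      issues.push({
        url: hop.from,
        type: 'temporary_redirect',
        detail: `uses ${hop.status}; a permanent move should be a 301`,
      });
    }
  }

  return issues;
}
```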
What We Extract
For each page, we parse:
SEO Essentials
- Title, description, canonical
- H1/H2 structure
- Robots directives (meta and X-Robots-Tag header)
Open Graph & Twitter Cards
- All og:* and twitter:* tags
- Image URLs for validation
Links
- Every <a href> on the page
- Classified as internal or external
- Used to build the link graph
Page Assets
- HTML size, stylesheet count, script count
- Image and font references
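As a sketch of what that extraction looks like in practice, assuming a DOM-style parser such as cheerio (Glimpse's actual parser and field names may differ):

```typescript
import * as cheerio from 'cheerio'; // assumed dependency; any DOM parser works

interface ExtractedPage {
  title?: string;
  description?: string;
  canonical?: string;
  robotsMeta?: string;
  h1Count: number;
  ogTags: Record<string, string>;
  links: { href: string; isExternal: boolean }[];
}

function extractPage(html: string, pageUrl: string): ExtractedPage {
  const $ = cheerio.load(html);
  const origin = new URL(pageUrl).origin;

  // Open Graph and Twitter Card tags
  const ogTags: Record<string, string> = {};
  $('meta[property^="og:"], meta[name^="twitter:"]').each((_, el) => {
    const key = $(el).attr('property') ?? $(el).attr('name');
    const content = $(el).attr('content');
    if (key && content) ogTags[key] = content;
  });

  // Links, resolved to absolute URLs and classified as internal/external
  const links = $('a[href]').toArray().flatMap(el => {
    try {
      const resolved = new URL($(el).attr('href')!, pageUrl);
      return [{ href: resolved.toString(), isExternal: resolved.origin !== origin }];
    } catch {
      return []; // skip malformed hrefs
    }
  });

  return {
    title: $('title').first().text() || undefined,
    description: $('meta[name="description"]').attr('content'),
    canonical: $('link[rel="canonical"]').attr('href'),
    robotsMeta: $('meta[name="robots"]').attr('content'),
    h1Count: $('h1').length,
    ogTags,
    links,
  };
}
```

(The X-Robots-Tag directive comes from the HTTP response headers, not the HTML, so it's captured during the crawl rather than here.)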
Phase 3: Link Graph Analysis
After crawling all pages, we build a complete picture of how they connect.
Building the Graph
interface LinkGraph {
  // URL → URLs that link TO it
  incomingLinks: Map<string, Set<string>>;
  // URL → URLs it links TO
  outgoingLinks: Map<string, Set<string>>;
}

function buildLinkGraph(pages: PageData[]): LinkGraph {
  const graph: LinkGraph = {
    incomingLinks: new Map(),
    outgoingLinks: new Map()
  };

  for (const page of pages) {
    const pageUrl = normalizeUrl(page.url);

    for (const link of page.links) {
      if (link.isExternal) continue;
      const targetUrl = normalizeUrl(link.href);

      // Track outgoing
      if (!graph.outgoingLinks.has(pageUrl)) {
        graph.outgoingLinks.set(pageUrl, new Set());
      }
      graph.outgoingLinks.get(pageUrl)!.add(targetUrl);

      // Track incoming
      if (!graph.incomingLinks.has(targetUrl)) {
        graph.incomingLinks.set(targetUrl, new Set());
      }
      graph.incomingLinks.get(targetUrl)!.add(pageUrl);
    }
  }

  return graph;
}
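The graph only works if both ends of an edge agree on URL form, which is what normalizeUrl is for. A minimal sketch (the exact rules, such as query-string handling and trailing slashes, are per-site judgment calls):

```typescript
// Collapse equivalent URLs onto one graph node:
// "/about", "/about/", and "/about?utm_source=x#team" all normalize to the same key.
function normalizeUrl(href: string, base?: string): string {
  const url = new URL(href, base); // base resolves relative links
  url.hash = '';                   // fragments never reach the server
  url.search = '';                 // drop query params (or whitelist ones that matter)
  url.hostname = url.hostname.toLowerCase();
  if (url.pathname.length > 1 && url.pathname.endsWith('/')) {
    url.pathname = url.pathname.slice(0, -1); // treat /about/ and /about as one page
  }
  return url.toString();
}
```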
Detecting Link Issues
With the graph built, we identify problems:
Orphan Pages: No incoming internal links
const orphanPages = pages.filter(page => {
  // Look up with the same normalization used for the graph keys
  const incoming = graph.incomingLinks.get(normalizeUrl(page.url));
  return !incoming || incoming.size === 0;
});
Dead Ends: No outgoing links
const deadEnds = pages.filter(page => {
  const outgoing = graph.outgoingLinks.get(normalizeUrl(page.url));
  return !outgoing || outgoing.size === 0;
});
Links to Redirects: Internal links pointing to 3xx pages
Links to Errors: Internal links pointing to 4xx/5xx pages
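Both are lookups against the crawl results. A sketch, assuming a statusByUrl map and a redirected set built during Phase 2 (those names are ours, not Glimpse's API):

```typescript
// For every internal link, check how its target responded during the crawl.
// statusByUrl: normalized URL → final HTTP status; redirected: URLs that answered 3xx.
interface LinkTargetIssue {
  from: string;
  to: string;
  kind: 'links_to_redirect' | 'links_to_error';
}

function findBadLinkTargets(
  graph: LinkGraph,
  statusByUrl: Map<string, number>,
  redirected: Set<string>
): LinkTargetIssue[] {
  const issues: LinkTargetIssue[] = [];
  for (const [from, targets] of graph.outgoingLinks) {
    for (const to of targets) {
      if (redirected.has(to)) {
        issues.push({ from, to, kind: 'links_to_redirect' });
      } else if ((statusByUrl.get(to) ?? 200) >= 400) {
        issues.push({ from, to, kind: 'links_to_error' });
      }
    }
  }
  return issues;
}
```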
Phase 4: Canonical & Indexability
These checks require additional validation beyond just parsing HTML.
Canonical Validation
We don't trust that a canonical URL is valid—we verify it:
async function validateCanonical(page: PageData): Promise<CanonicalIssue | null> {
  if (!page.canonical) {
    return { type: 'missing' };
  }

  // Canonical hrefs can be relative; resolve against the page URL before fetching
  const canonicalUrl = new URL(page.canonical, page.url).toString();

  // Fetch the canonical URL
  const response = await fetch(canonicalUrl, { redirect: 'manual' });

  if (response.status >= 300 && response.status < 400) {
    return { type: 'points_to_redirect', canonical: canonicalUrl };
  }
  if (response.status >= 400) {
    return { type: 'points_to_error', canonical: canonicalUrl };
  }

  return null;
}
Indexability Analysis
A page might be blocked from indexing in multiple ways:
- robots.txt: Disallow rule matches the URL
- Meta robots: <meta name="robots" content="noindex">
- X-Robots-Tag: HTTP header with a noindex directive
- Canonical mismatch: Page canonicalizes to a different URL
We check all four and report which mechanism blocks the page.
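A sketch of the combined check (the robots.txt matching here is deliberately simplified; real Disallow rules support wildcards and $ anchors):

```typescript
// Return the first mechanism that blocks indexing, or null if the page is indexable.
type IndexBlock = 'robots_txt' | 'meta_noindex' | 'x_robots_noindex' | 'canonical_mismatch';

function checkIndexability(
  page: { url: string; robotsMeta?: string; xRobotsTag?: string; canonical?: string },
  disallowRules: string[]
): IndexBlock | null {
  const path = new URL(page.url).pathname;

  if (disallowRules.some(rule => rule !== '' && path.startsWith(rule))) {
    return 'robots_txt';
  }
  if (page.robotsMeta?.toLowerCase().includes('noindex')) {
    return 'meta_noindex';
  }
  if (page.xRobotsTag?.toLowerCase().includes('noindex')) {
    return 'x_robots_noindex';
  }
  if (page.canonical && normalizeUrl(page.canonical, page.url) !== normalizeUrl(page.url)) {
    return 'canonical_mismatch';
  }
  return null;
}
```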
Phase 5: Weighted Scoring
Here's where we diverged from naive implementations.
The Wrong Way
// DON'T DO THIS
if (!page.title) score -= 10 / totalPages;
This normalizes by page count, which means more pages = higher score. A 100-page site with 10 problems scores better than a 10-page site with the same 10 problems. That's backwards.
The Right Way
We use weighted issue scoring where each problem type has a fixed cost:
const issueWeights = {
  // Critical
  httpError4xx: 20,
  httpError5xx: 25,
  brokenRedirect: 15,
  redirectChain: 10,

  // High
  missingTitle: 8,
  canonicalPointsToRedirect: 10,
  orphanPage: 8,

  // Medium
  missingDescription: 3,
  missingCanonical: 4,

  // Low
  longTitle: 2,
  missingOgImage: 2,
};
function calculateScore(issues: DetectedIssue[], totalPages: number): number {
  const totalPenalty = issues.reduce((sum, issue) => {
    return sum + (issue.count * issueWeights[issue.type]);
  }, 0);

  // Scale relative to site size, but with diminishing returns
  const maxPenalty = Math.sqrt(totalPages) * 50;
  const score = Math.max(0, 100 - (totalPenalty / maxPenalty * 100));

  return Math.round(score);
}
Now a site with real problems scores poorly, regardless of size.
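To make that concrete with the weights above: a 25-page site has a maxPenalty of √25 × 50 = 250, so ten missing titles (80) plus four redirect chains (40) add up to a 120-point penalty and a score of 100 − (120 / 250 × 100) = 52. The same issues on a 400-page site (maxPenalty 1,000) still cost 12 points; they shrink, but they never disappear into the page count.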
Lessons Learned
1. Redirects Are Everywhere
We initially ignored redirects. Big mistake. Most sites have redirect chains from:
- HTTP→HTTPS migrations
- Trailing slash normalization
- Old URL structures
These chain together and cause real problems.
2. The Link Graph Reveals Hidden Issues
Orphan pages are invisible unless you map the entire site. A page might exist in your sitemap but have zero internal links pointing to it. Search engines will struggle to discover it.
3. Scoring Must Hurt
If everything scores 90+, the score is meaningless. Real sites have real problems. Our scoring now produces 30s and 40s for sites with structural issues—because that's accurate.
What We Detect Now
| Category | Issues |
|----------|--------|
| Internal Pages | 404s, 4xx errors, 5xx errors |
| Redirects | Chains, broken redirects, 302s (should be 301), HTTP→HTTPS |
| Indexability | Canonical issues, noindex conflicts, robots.txt blocks |
| Links | Orphan pages, dead ends, links to redirects/errors |
| Content | Missing titles, descriptions, H1s, duplicates |
| Social | Missing OG tags, Twitter cards |
Try It
Want to see how your site actually scores? Run an audit at get-glimpse.com. You might be surprised—and that's the point.
Questions about the implementation? Email ashish.so@redon.ai