
How We Built Glimpse's Audit Engine

by Ashish Sontakke

Building a site audit tool that actually catches real problems is harder than it looks. Early versions of Glimpse would give sites near-perfect scores while professional tools flagged hundreds of issues. We had to completely rethink our approach.

Here's how we built an audit engine that catches what matters.

The Problem with Simple Audits

Our first implementation was naive: fetch pages, check for missing meta tags, calculate a score. A site could have:

  • 45 redirect chains
  • 39 canonicals pointing to redirects
  • Orphan pages with no internal links
  • Broken redirects

...and still score 90+. That's useless.

Real SEO problems aren't just missing <title> tags. They're structural issues in how pages link together, how redirects chain, and whether search engines can actually index your content.

The Architecture

Glimpse now uses a multi-phase audit pipeline:

Discovery → Crawl → Link Analysis → Issue Detection → Scoring

Each phase builds on the previous one, and the final score reflects actual site health.
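
In code, the pipeline reads roughly like this. It's an illustrative sketch rather than Glimpse's actual orchestrator: parsePage and detectIssues are placeholder names, while the other functions appear later in this post.

async function runAudit(domain: string) {
  const urls = await discoverUrls(domain);            // Phase 1: Discovery
  const crawled: CrawlResult[] = [];
  for (const url of urls) {                           // Phase 2: Crawl (sequential for clarity)
    crawled.push(await crawlWithRedirects(url));
  }
  const pages = crawled.map(parsePage);               // hypothetical helper: CrawlResult → PageData
  const graph = buildLinkGraph(pages);                // Phase 3: Link analysis
  const issues = detectIssues(pages, graph);          // Phase 4: Issue detection (hypothetical helper)
  const score = calculateScore(issues, pages.length); // Phase 5: Scoring
  return { domain, pages, issues, score };
}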

Phase 1: Discovery

Before crawling, we map out the site:

robots.txt Analysis

GET https://example.com/robots.txt

We extract (a minimal parser is sketched after this list):

  • Sitemap locations
  • Disallow rules (what we shouldn't crawl)
  • Crawl-delay directives
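
A minimal robots.txt parser might look like the following. Real-world files have edge cases (user-agent groups, wildcards, BOMs) that this sketch ignores:

interface RobotsInfo {
  sitemaps: string[];
  disallow: string[];
  crawlDelay?: number;
}

async function fetchRobots(domain: string): Promise<RobotsInfo> {
  const info: RobotsInfo = { sitemaps: [], disallow: [] };
  const res = await fetch(`https://${domain}/robots.txt`);
  if (!res.ok) return info; // no robots.txt: nothing is disallowed

  for (const rawLine of (await res.text()).split('\n')) {
    const line = rawLine.split('#')[0].trim(); // strip comments
    const [key, ...rest] = line.split(':');
    const value = rest.join(':').trim();
    if (!value) continue;

    switch (key.trim().toLowerCase()) {
      case 'sitemap':     info.sitemaps.push(value); break;
      case 'disallow':    info.disallow.push(value); break;
      case 'crawl-delay': info.crawlDelay = Number(value); break;
    }
  }
  return info;
}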

Sitemap Parsing

Sitemaps can be nested. A sitemap index points to other sitemaps, which might point to more sitemaps. We follow the entire tree:

async function discoverUrls(domain: string): Promise<string[]> {
  const sitemapUrls = await findSitemaps(domain);
  const allUrls = new Set<string>();

  for (const sitemapUrl of sitemapUrls) {
    const urls = await parseSitemap(sitemapUrl);
    urls.forEach(url => allUrls.add(url));
  }

  return filterCrawlableUrls([...allUrls]);
}
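
parseSitemap is where the recursion lives: it has to handle both regular sitemaps and sitemap indexes. One way to do it, sketched here with naive regex-based XML extraction (a real implementation would use a proper XML parser):

async function parseSitemap(sitemapUrl: string, seen = new Set<string>()): Promise<string[]> {
  if (seen.has(sitemapUrl)) return []; // guard against circular references
  seen.add(sitemapUrl);

  const xml = await (await fetch(sitemapUrl)).text();
  const locs = [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map(m => m[1]);

  // A sitemap index wraps child sitemaps in <sitemap> elements; recurse into them.
  if (xml.includes('<sitemapindex')) {
    const nested = await Promise.all(locs.map(loc => parseSitemap(loc, seen)));
    return nested.flat();
  }

  // A regular sitemap lists page URLs directly.
  return locs;
}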

Phase 2: Crawling with Redirect Tracking

This is where things get interesting. We don't just fetch pages—we track the entire redirect journey.

Manual Redirect Following

Instead of letting fetch() auto-follow redirects, we handle them manually:

const MAX_REDIRECTS = 10;

async function crawlWithRedirects(url: string): Promise<CrawlResult> {
  const chain: RedirectHop[] = [];
  let currentUrl = url;
  let response: Response;

  while (true) {
    response = await fetch(currentUrl, { redirect: 'manual' });

    if (response.status >= 300 && response.status < 400) {
      const location = response.headers.get('location');
      // A 3xx with no Location header, or a runaway loop, ends the chain here.
      if (!location || chain.length >= MAX_REDIRECTS) break;

      // Location may be relative; resolve it against the current URL.
      const nextUrl = new URL(location, currentUrl).toString();
      chain.push({
        from: currentUrl,
        to: nextUrl,
        status: response.status,
        isHttpToHttps: isHttpToHttpsRedirect(currentUrl, nextUrl)
      });
      currentUrl = nextUrl;
      continue;
    }

    break;
  }

  return {
    originalUrl: url,
    finalUrl: currentUrl,
    redirectChain: chain,
    finalStatus: response.status,
    html: await response.text()
  };
}

This captures (a classification sketch follows the list):

  • Redirect chains: Multiple hops (A → B → C)
  • Redirect types: 301 vs 302 vs 307
  • HTTP→HTTPS redirects: Protocol upgrades
  • Broken redirects: Chains that end in 4xx/5xx
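
Turning a crawl result into redirect issues is then mostly bookkeeping. A rough sketch (the function and issue names here are illustrative):

function classifyRedirects(result: CrawlResult): string[] {
  const issues: string[] = [];

  if (result.redirectChain.length > 1) issues.push('redirect_chain');
  if (result.redirectChain.some(hop => hop.status === 302)) issues.push('temporary_redirect');
  if (result.redirectChain.some(hop => hop.isHttpToHttps)) issues.push('http_to_https_redirect');
  if (result.redirectChain.length > 0 && result.finalStatus >= 400) issues.push('broken_redirect');

  return issues;
}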

What We Extract

For each page, we parse (the combined shape is sketched after these lists):

SEO Essentials

  • Title, description, canonical
  • H1/H2 structure
  • Robots directives (meta and X-Robots-Tag header)

Open Graph & Twitter Cards

  • All og:* and twitter:* tags
  • Image URLs for validation

Links

  • Every <a href> on the page
  • Classified as internal or external
  • Used to build the link graph

Page Assets

  • HTML size, stylesheet count, script count
  • Image and font references
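
Put together, the extraction for each page produces something like the following shape. The field names are an illustration, not Glimpse's exact schema:

interface PageData {
  url: string;
  // SEO essentials
  title?: string;
  description?: string;
  canonical?: string;
  h1s: string[];
  h2s: string[];
  robotsDirectives: string[];           // from <meta name="robots"> and X-Robots-Tag
  // Social
  openGraph: Record<string, string>;    // og:* tags
  twitterCard: Record<string, string>;  // twitter:* tags
  // Links
  links: { href: string; isExternal: boolean }[];
  // Assets
  htmlBytes: number;
  stylesheetCount: number;
  scriptCount: number;
  imageUrls: string[];
  fontUrls: string[];
}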

Phase 3: Link Graph Analysis

After crawling all pages, we build a complete picture of how they connect.

Building the Graph

interface LinkGraph {
  // URL → URLs that link TO it
  incomingLinks: Map<string, Set<string>>;
  // URL → URLs it links TO
  outgoingLinks: Map<string, Set<string>>;
}

function buildLinkGraph(pages: PageData[]): LinkGraph {
  const graph: LinkGraph = {
    incomingLinks: new Map(),
    outgoingLinks: new Map()
  };

  for (const page of pages) {
    const pageUrl = normalizeUrl(page.url);

    for (const link of page.links) {
      if (link.isExternal) continue;

      const targetUrl = normalizeUrl(link.href);

      // Track outgoing
      if (!graph.outgoingLinks.has(pageUrl)) {
        graph.outgoingLinks.set(pageUrl, new Set());
      }
      graph.outgoingLinks.get(pageUrl)!.add(targetUrl);

      // Track incoming
      if (!graph.incomingLinks.has(targetUrl)) {
        graph.incomingLinks.set(targetUrl, new Set());
      }
      graph.incomingLinks.get(targetUrl)!.add(pageUrl);
    }
  }

  return graph;
}
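
The graph is only as good as the URL normalization behind it: https://example.com/page, https://example.com/page/, and https://example.com/page#top should all collapse to one node. A minimal normalizeUrl might look like this (exactly which normalizations to apply is a judgment call; this sketch strips fragments and trailing slashes):

function normalizeUrl(raw: string, base?: string): string {
  const url = new URL(raw, base);   // base lets us resolve relative hrefs
  url.hash = '';                    // fragments never reach the server
  url.hostname = url.hostname.toLowerCase();
  // Treat /page and /page/ as the same node (except the root path).
  if (url.pathname.length > 1 && url.pathname.endsWith('/')) {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.toString();
}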

Detecting Link Issues

With the graph built, we identify problems:

Orphan Pages: No incoming internal links

const orphanPages = pages.filter(page => {
  // Look up by the same normalized form used as graph keys
  const incoming = graph.incomingLinks.get(normalizeUrl(page.url));
  return !incoming || incoming.size === 0;
});

Dead Ends: No outgoing links

const deadEnds = pages.filter(page => {
  const outgoing = graph.outgoingLinks.get(normalizeUrl(page.url));
  return !outgoing || outgoing.size === 0;
});

Links to Redirects: Internal links pointing to 3xx pages

Links to Errors: Internal links pointing to 4xx/5xx pages
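
Detecting those last two means joining the link graph against the crawl results. A sketch, assuming a statusByUrl map built during the crawl that records the status of the first response for each URL (so a URL that itself redirects shows up as 3xx):

function findBadLinkTargets(graph: LinkGraph, statusByUrl: Map<string, number>) {
  const linksToRedirects: Array<{ from: string; to: string }> = [];
  const linksToErrors: Array<{ from: string; to: string }> = [];

  for (const [target, sources] of graph.incomingLinks) {
    const status = statusByUrl.get(target);
    if (status === undefined) continue; // target wasn't crawled
    for (const from of sources) {
      if (status >= 300 && status < 400) linksToRedirects.push({ from, to: target });
      if (status >= 400) linksToErrors.push({ from, to: target });
    }
  }

  return { linksToRedirects, linksToErrors };
}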

Phase 4: Canonical & Indexability

These checks require additional validation beyond just parsing HTML.

Canonical Validation

We don't trust that a canonical URL is valid—we verify it:

async function validateCanonical(page: PageData): Promise<CanonicalIssue | null> {
  if (!page.canonical) {
    return { type: 'missing' };
  }

  // Fetch the canonical URL
  const response = await fetch(page.canonical, { redirect: 'manual' });

  if (response.status >= 300 && response.status < 400) {
    return { type: 'points_to_redirect', canonical: page.canonical };
  }

  if (response.status >= 400) {
    return { type: 'points_to_error', canonical: page.canonical };
  }

  return null;
}

Indexability Analysis

A page might be blocked from indexing in multiple ways:

  1. robots.txt: Disallow rule matches the URL
  2. Meta robots: <meta name="robots" content="noindex">
  3. X-Robots-Tag: HTTP header with noindex directive
  4. Canonical mismatch: Page canonicalizes to a different URL

We check all four and report which mechanism blocks the page; a simplified version of the check is sketched below.
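
A simplified version of that check, reusing the PageData shape sketched earlier (robotsDisallows comes from the robots.txt phase; real robots.txt matching also handles wildcards and Allow rules, which this skips):

type IndexabilityBlock = 'robots_txt' | 'noindex' | 'canonical_mismatch';

function checkIndexability(page: PageData, robotsDisallows: string[]): IndexabilityBlock[] {
  const blocks: IndexabilityBlock[] = [];
  const path = new URL(page.url).pathname;

  // 1. robots.txt: a Disallow rule prefix-matches the URL path.
  if (robotsDisallows.some(rule => rule !== '' && path.startsWith(rule))) {
    blocks.push('robots_txt');
  }

  // 2 & 3. noindex, whether it came from <meta name="robots"> or an X-Robots-Tag header
  // (both are collected into robotsDirectives in the parsing sketch above).
  if (page.robotsDirectives.some(d => d.toLowerCase().includes('noindex'))) {
    blocks.push('noindex');
  }

  // 4. The page canonicalizes to a different URL.
  if (page.canonical && normalizeUrl(page.canonical) !== normalizeUrl(page.url)) {
    blocks.push('canonical_mismatch');
  }

  return blocks;
}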

Phase 5: Weighted Scoring

Here's where we diverged from naive implementations.

The Wrong Way

// DON'T DO THIS
if (!page.title) score -= 10 / totalPages;

This normalizes the penalty by page count, so for the same number of issues, a larger site scores higher. Ten missing titles cost 10 points on a 10-page site but only 1 point on a 100-page site, even though the problems are identical. That's backwards.

The Right Way

We use weighted issue scoring where each problem type has a fixed cost:

const issueWeights = {
  // Critical
  httpError4xx: 20,
  httpError5xx: 25,
  brokenRedirect: 15,
  redirectChain: 10,

  // High
  missingTitle: 8,
  canonicalPointsToRedirect: 10,
  orphanPage: 8,

  // Medium
  missingDescription: 3,
  missingCanonical: 4,

  // Low
  longTitle: 2,
  missingOgImage: 2,
};

function calculateScore(issues: DetectedIssue[], totalPages: number): number {
  const totalPenalty = issues.reduce((sum, issue) => {
    return sum + (issue.count * (issueWeights[issue.type] ?? 0));
  }, 0);

  // Scale relative to site size, but with diminishing returns
  const maxPenalty = Math.sqrt(totalPages) * 50;
  const score = Math.max(0, 100 - (totalPenalty / maxPenalty * 100));

  return Math.round(score);
}
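
With illustrative numbers (assuming DetectedIssue carries a type and a count), a 50-page site with a handful of structural problems comes out around 27:

const score = calculateScore(
  [
    { type: 'redirectChain', count: 12 },
    { type: 'canonicalPointsToRedirect', count: 9 },
    { type: 'orphanPage', count: 6 },
  ],
  50 // totalPages
);
// totalPenalty = 12*10 + 9*10 + 6*8 = 258
// maxPenalty   = sqrt(50) * 50 ≈ 354
// score        ≈ 100 - 258/354 * 100 ≈ 27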

Now a site with real problems scores poorly, regardless of size.

Lessons Learned

1. Redirects Are Everywhere

We initially ignored redirects. Big mistake. Most sites have redirect chains from:

  • HTTP→HTTPS migrations
  • Trailing slash normalization
  • Old URL structures

These chain together and cause real problems.

2. The Link Graph Reveals Hidden Issues

Orphan pages are invisible unless you map the entire site. A page might exist in your sitemap but have zero internal links pointing to it. Search engines will struggle to discover it.

3. Scoring Must Hurt

If everything scores 90+, the score is meaningless. Real sites have real problems. Our scoring now produces 30s and 40s for sites with structural issues—because that's accurate.

What We Detect Now

| Category | Issues |
|----------|--------|
| Internal Pages | 404s, 4xx errors, 5xx errors |
| Redirects | Chains, broken redirects, 302s (should be 301), HTTP→HTTPS |
| Indexability | Canonical issues, noindex conflicts, robots.txt blocks |
| Links | Orphan pages, dead ends, links to redirects/errors |
| Content | Missing titles, descriptions, H1s, duplicates |
| Social | Missing OG tags, Twitter cards |

Try It

Want to see how your site actually scores? Run an audit at get-glimpse.com. You might be surprised—and that's the point.


Questions about the implementation? Email ashish.so@redon.ai