
From URL to Insights: Inside Glimpse's Crawling Process

by Ashish Sontakke

You enter a domain and click "Audit." Seconds later, you have a complete analysis with health score, categorized issues, and drill-down to affected pages. Here's what happens in between.

Phase 1: Domain Validation

First, we normalize your input:

yoursite.com        → https://yoursite.com
http://yoursite.com → https://yoursite.com
www.yoursite.com    → https://www.yoursite.com
yoursite.com/page   → https://yoursite.com (domain extracted)

A quick HEAD request confirms the domain is reachable. No point crawling a site that's down.
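
A minimal sketch of this step, using hypothetical helpers (the real normalization rules may differ):

// Sketch: normalize input to an HTTPS origin, then confirm the server responds.
function normalizeDomain(input) {
  const withProtocol = /^https?:\/\//i.test(input) ? input : `https://${input}`;
  const url = new URL(withProtocol);
  url.protocol = 'https:';   // upgrade http:// input
  return url.origin;         // drops any path, e.g. /page
}

async function isReachable(origin) {
  try {
    // Any HTTP response means the server is up; a network error means it isn't.
    await fetch(origin, { method: 'HEAD' });
    return true;
  } catch {
    return false;
  }
}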

Phase 2: Discovery

Before crawling pages, we need to know what exists.

robots.txt Fetch

GET https://yoursite.com/robots.txt

We parse:

  • Sitemap directives: Lines like Sitemap: https://yoursite.com/sitemap.xml
  • Disallow rules: Paths our crawler should respect
  • Crawl-delay: Rate limiting hints
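
A rough sketch of pulling those directives out of the file (real-world parsing also has to handle user-agent groups, which is skipped here):

// Sketch: extract sitemap, disallow, and crawl-delay directives from robots.txt.
function parseRobotsTxt(text) {
  const sitemaps = [];
  const disallow = [];
  let crawlDelay = null;

  for (const rawLine of text.split('\n')) {
    const line = rawLine.split('#')[0];          // strip comments
    const [key, ...rest] = line.split(':');
    const value = rest.join(':').trim();         // sitemap URLs contain ':' themselves
    if (!value) continue;

    switch (key.trim().toLowerCase()) {
      case 'sitemap':     sitemaps.push(value);        break;
      case 'disallow':    disallow.push(value);        break;
      case 'crawl-delay': crawlDelay = Number(value);  break;
    }
  }
  return { sitemaps, disallow, crawlDelay };
}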

Sitemap Discovery

If robots.txt doesn't list sitemaps, we check common locations:

  • /sitemap.xml
  • /sitemap_index.xml
  • /sitemap/sitemap.xml

Sitemaps can be nested. A sitemap index points to other sitemaps:

<sitemapindex>
  <sitemap>
    <loc>https://yoursite.com/sitemap-posts.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://yoursite.com/sitemap-pages.xml</loc>
  </sitemap>
</sitemapindex>

We follow the entire tree, recursively fetching until we have every URL.
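
The traversal is a straightforward recursion. A sketch (using a regex to pull out <loc> values, where a production implementation would use an XML parser):

// Sketch: recursively expand sitemap indexes into a flat list of page URLs.
async function collectSitemapUrls(sitemapUrl, seen = new Set()) {
  if (seen.has(sitemapUrl)) return [];       // guard against circular references
  seen.add(sitemapUrl);

  const xml = await (await fetch(sitemapUrl)).text();
  const locs = [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map(m => m[1]);

  // A <sitemapindex> points at child sitemaps; recurse into each of them.
  if (/<sitemapindex/i.test(xml)) {
    const nested = await Promise.all(locs.map(loc => collectSitemapUrls(loc, seen)));
    return nested.flat();
  }
  return locs;                               // a regular <urlset> lists page URLs
}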

URL Filtering

Not every URL should be audited:

  • Assets skipped: .jpg, .png, .pdf, .css, .js
  • Non-HTML skipped: API endpoints, data files
  • User patterns applied: Your ignore patterns like /admin/*

The result: a clean list of HTML pages to crawl.
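
A sketch of that filter (the extension list and the /api/ rule are illustrative, not the exact set Glimpse uses):

// Sketch: keep only URLs that look like crawlable HTML pages.
const SKIPPED_EXTENSIONS = /\.(jpe?g|png|gif|svg|webp|pdf|css|js|json|zip)$/i;

function filterUrls(urls, ignorePatterns = []) {
  return urls.filter(url => {
    const { pathname } = new URL(url);
    if (SKIPPED_EXTENSIONS.test(pathname)) return false;   // assets
    if (pathname.startsWith('/api/')) return false;        // non-HTML endpoints
    // User ignore patterns like '/admin/*', with '*' as a wildcard.
    return !ignorePatterns.some(pattern =>
      new RegExp('^' + pattern.replace(/\*/g, '.*') + '$').test(pathname));
  });
}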

Phase 3: The Crawl

Most tools stop here: fetch the page, parse the HTML. We go deeper.

Redirect Tracking

We don't let fetch() auto-follow redirects. Instead, we manually follow each hop:

// Simplified flow
let currentUrl = startUrl;
const chain = [];
const MAX_HOPS = 10;   // guard against infinite redirect loops

while (chain.length < MAX_HOPS) {
  const response = await fetch(currentUrl, { redirect: 'manual' });

  if (isRedirect(response.status)) {
    // The Location header may be relative; resolve it against the current URL.
    const nextUrl = new URL(response.headers.get('location'), currentUrl).href;
    chain.push({
      from: currentUrl,
      to: nextUrl,
      status: response.status
    });
    currentUrl = nextUrl;
    continue;
  }

  // Final destination reached
  break;
}

For each page, we now know:

  • Original URL: What's in your sitemap/links
  • Final URL: Where it actually resolves
  • Redirect chain: Every hop in between
  • Redirect types: 301, 302, 307, 308
  • HTTP→HTTPS: Protocol upgrade detection
  • Broken chains: When redirects lead to errors
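
As a sketch, classifying a finished chain into those buckets only needs the recorded hops and the final status code (field names here are illustrative):

// Sketch: summarize a redirect chain once the final response is known.
function classifyChain(chain, finalStatus) {
  return {
    hops: chain.length,
    types: chain.map(hop => hop.status),       // 301, 302, 307, 308
    httpToHttps: chain.some(hop =>
      hop.from.startsWith('http://') && hop.to.startsWith('https://')),
    broken: finalStatus >= 400                 // the chain ends in an error
  };
}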

Rate Limiting

We don't hammer your server:

  • 3-5 concurrent requests maximum
  • 1-second delay between batches
  • Exponential backoff on rate limits (429 responses)

A 50-page site takes 15-30 seconds. A 200-page site might take 1-2 minutes.
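
The batching loop looks roughly like this (the numbers mirror the figures above; fetchPage stands in for the per-page crawl and is assumed to return an object with a status field):

// Sketch: crawl in small batches, pause between batches, back off on 429s.
const CONCURRENCY = 4;
const BATCH_DELAY_MS = 1000;
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function crawlAll(urls, fetchPage) {
  const results = [];
  for (let i = 0; i < urls.length; i += CONCURRENCY) {
    const batch = urls.slice(i, i + CONCURRENCY);
    results.push(...await Promise.all(batch.map(url => fetchWithBackoff(url, fetchPage))));
    await sleep(BATCH_DELAY_MS);
  }
  return results;
}

async function fetchWithBackoff(url, fetchPage, attempt = 0) {
  const result = await fetchPage(url);
  if (result.status === 429 && attempt < 3) {
    await sleep(2 ** attempt * 1000);          // exponential backoff: 1s, 2s, 4s
    return fetchWithBackoff(url, fetchPage, attempt + 1);
  }
  return result;
}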

HTML Extraction

For each successfully fetched page:

SEO Metadata

  • <title> tag
  • <meta name="description">
  • <link rel="canonical">
  • <meta name="robots"> (noindex/nofollow)

Response Headers

  • X-Robots-Tag (server-side noindex)
  • Status code and response time

Content Structure

  • H1 and H2 headings
  • Word count
  • All internal and external links

Social Tags

  • Open Graph: og:title, og:description, og:image, etc.
  • Twitter Cards: twitter:card, twitter:title, etc.

Page Assets

  • HTML size in bytes
  • Stylesheet count
  • Script count
  • Image count
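
All of this comes from a single parse of the fetched HTML. A sketch using cheerio (not necessarily the parser Glimpse uses; the field names are illustrative):

// Sketch: pull the core fields out of a fetched page in one pass.
import * as cheerio from 'cheerio';

function extractPageData(html, response) {
  const $ = cheerio.load(html);
  return {
    title: $('title').first().text().trim(),
    description: $('meta[name="description"]').attr('content') ?? null,
    canonical: $('link[rel="canonical"]').attr('href') ?? null,
    metaRobots: $('meta[name="robots"]').attr('content') ?? null,
    xRobotsTag: response.headers.get('x-robots-tag'),           // server-side noindex
    h1: $('h1').map((_, el) => $(el).text().trim()).get(),
    ogTitle: $('meta[property="og:title"]').attr('content') ?? null,
    twitterCard: $('meta[name="twitter:card"]').attr('content') ?? null,
    links: $('a[href]').map((_, el) => $(el).attr('href')).get(),
    htmlBytes: Buffer.byteLength(html),
    stylesheetCount: $('link[rel="stylesheet"]').length,
    scriptCount: $('script').length,
    imageCount: $('img').length
  };
}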

Phase 4: Link Graph Analysis

After all pages are crawled, we build the complete link graph.

Building the Graph

Every internal link creates two relationships:

  • Outgoing: Page A links to Page B
  • Incoming: Page B receives a link from Page A

Homepage
├─ links to → /about
├─ links to → /products
└─ links to → /blog

/about
├─ links to → /contact
└─ links to → Homepage

/products
├─ links to → /products/item-1
└─ links to → /products/item-2

/hidden-page
└─ (no incoming links - ORPHAN)
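
In code the graph is just two maps keyed by URL, and the orphan and dead-end checks described next fall out of it directly. A sketch, assuming each crawled page record carries its url and a list of resolved internal links:

// Sketch: build outgoing/incoming link sets for every crawled page.
function buildLinkGraph(pages) {
  const outgoing = new Map(pages.map(p => [p.url, new Set()]));
  const incoming = new Map(pages.map(p => [p.url, new Set()]));

  for (const page of pages) {
    for (const link of page.internalLinks) {
      if (!incoming.has(link)) continue;       // target wasn't in the crawl set
      outgoing.get(page.url).add(link);
      incoming.get(link).add(page.url);
    }
  }
  return { outgoing, incoming };
}

const { outgoing, incoming } = buildLinkGraph(pages);
const orphans  = pages.filter(p => incoming.get(p.url).size === 0);   // no links in
const deadEnds = pages.filter(p => outgoing.get(p.url).size === 0);   // no links out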

Detecting Issues

Orphan Pages: Pages with zero incoming internal links

  • They exist (in sitemap or were crawled)
  • But nothing on your site links to them
  • Search engines struggle to discover them

Dead Ends: Pages with zero outgoing links

  • They don't link anywhere
  • Users hit a wall
  • Link equity doesn't flow onward

Links to Redirects: Internal links pointing to URLs that redirect

  • Creates unnecessary redirect hops
  • Update links to point to final destinations

Links to Errors: Internal links pointing to 4xx/5xx pages

  • Broken internal navigation
  • Wastes crawl budget

Phase 5: Canonical Validation

We don't trust that canonical URLs are valid—we verify them.

The Check

For each page with a canonical tag:

  1. Fetch the canonical URL
  2. Check the response:
    • Does it redirect? → Issue: "canonical points to redirect"
    • Does it return 4xx/5xx? → Issue: "canonical points to error"
    • Does it match the page's own URL? → Self-referencing check
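
In code, the check reuses the same manual-redirect fetch from Phase 3. A sketch:

// Sketch: classify a page's canonical target with a single fetch.
async function checkCanonical(pageUrl, canonicalUrl) {
  const response = await fetch(canonicalUrl, { redirect: 'manual' });

  if (response.status >= 300 && response.status < 400) {
    return { issue: 'canonical points to redirect' };
  }
  if (response.status >= 400) {
    return { issue: 'canonical points to error' };
  }
  const selfReferencing = new URL(canonicalUrl).href === new URL(pageUrl).href;
  return { issue: null, selfReferencing };
}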

Why This Matters

A common post-migration problem:

<!-- On https://yoursite.com/new-page -->
<link rel="canonical" href="https://yoursite.com/old-page">

If /old-page now redirects to /new-page, you have a circular canonical → redirect → canonical loop. Search engines get confused.

Phase 6: Indexability Analysis

A page might be blocked from indexing in multiple ways:

What We Check

  1. robots.txt: Does a Disallow rule match this URL?
  2. Meta robots: <meta name="robots" content="noindex">
  3. X-Robots-Tag header: HTTP header with noindex
  4. Canonical mismatch: Page canonicalizes elsewhere

Conflict Detection

Sometimes pages have conflicting signals:

  • Canonical says "index me at this URL"
  • Meta robots says "noindex"

We flag these conflicts so you can resolve the intent.
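
Put together, the verdict is a conjunction of the four signals, plus a flag for the conflicting case above. A sketch (the robots.txt match here is a simplified prefix check; real matching honors wildcards and user-agent groups):

// Sketch: combine the four indexability signals into one verdict.
function analyzeIndexability(page, robotsRules) {
  const path = new URL(page.url).pathname;
  const blockedByRobotsTxt = robotsRules.disallow.some(rule => rule && path.startsWith(rule));
  const noindexMeta   = /noindex/i.test(page.metaRobots ?? '');
  const noindexHeader = /noindex/i.test(page.xRobotsTag ?? '');
  const canonicalElsewhere =
    !!page.canonical && new URL(page.canonical).href !== new URL(page.url).href;

  const indexable =
    !blockedByRobotsTxt && !noindexMeta && !noindexHeader && !canonicalElsewhere;

  // Conflict: the page canonicalizes to itself but also carries a noindex signal.
  const conflict = !!page.canonical && !canonicalElsewhere && (noindexMeta || noindexHeader);

  return { indexable, conflict };
}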

Phase 7: Issue Aggregation

All detected problems are categorized:

Internal Pages
├── 404 page (1) [Error]
└── 4XX page (1) [Error]

Indexability
├── Canonical points to redirect (39) [Error]
└── Indexable became non-indexable (40) [Warning]

Links
├── Orphan pages (5) [Warning]
├── Pages with no outgoing links (2) [Warning]
└── Links to redirect (15) [Info]

Redirects
├── Broken redirect (1) [Error]
├── Redirect chain (12) [Warning]
├── 302 redirect (2) [Warning]
└── HTTP to HTTPS redirect (8) [Info]

Each issue links to affected pages—click to see exactly which URLs have the problem.

Phase 8: Scoring

With all issues detected, we calculate the health score using weighted penalties. Critical issues like broken redirects cost more than minor issues like long titles.

The final score reflects actual site health, not just meta tag presence.
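
The exact weights aren't spelled out here, but the shape of the calculation is roughly this (the numbers are placeholders, not Glimpse's real weights):

// Sketch: subtract weighted, prevalence-scaled penalties from a perfect score.
const WEIGHTS = { error: 5, warning: 2, info: 0.5 };   // placeholder weights

function healthScore(issues, totalPages) {
  const penalty = issues.reduce((sum, issue) =>
    sum + WEIGHTS[issue.severity] * (issue.affectedPages / totalPages), 0);
  return Math.max(0, Math.round(100 - penalty * 10));
}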

What You See

Overview Dashboard

  • Health score with category breakdown
  • Issue counts by severity
  • Quick stats (pages scanned, errors, warnings)

Issue Drill-Down

  • Click any issue to see affected pages
  • Navigate directly to page details
  • Filter pages by issue type

Page Details

  • Complete metadata dump
  • Social preview mockups (how links appear when shared)
  • Link analysis (incoming and outgoing)
  • All detected issues for that page

Timing

For a typical 50-page site:

| Phase           | Time          |
|-----------------|---------------|
| Validation      | ~100ms        |
| Discovery       | 1-3 seconds   |
| Crawling        | 15-30 seconds |
| Link Analysis   | ~500ms        |
| Issue Detection | ~500ms        |
| Scoring         | ~100ms        |
| Total           | 20-35 seconds |

For larger sites, crawl time scales roughly linearly with page count.

Try It

Understanding the process is one thing. Seeing your own results is another.

Run an audit at get-glimpse.com. You'll see exactly what's happening at each step, and the issues might surprise you—even for sites you thought were healthy.


Questions about the crawling process? Email ashish.so@redon.ai