
From URL to Insights: Inside Glimpse's Crawling Process

by Ashish Sontakke

You enter a domain and click "Audit." Seconds later, you have a complete analysis with health score, categorized issues, and drill-down to affected pages. Here's what happens in between.

Phase 1: Domain Validation

First, we normalize your input:

yoursite.com        → https://yoursite.com
http://yoursite.com → https://yoursite.com
www.yoursite.com    → https://www.yoursite.com
yoursite.com/page   → https://yoursite.com (domain extracted)

A quick HEAD request confirms the domain is reachable. No point crawling a site that's down.
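
A minimal sketch of this step, using hypothetical helpers (the real normalization rules may differ):

// Sketch: normalize input to an HTTPS origin, then confirm the server responds.
function normalizeDomain(input) {
  const withProtocol = /^https?:\/\//i.test(input) ? input : `https://${input}`;
  const url = new URL(withProtocol);
  url.protocol = 'https:';   // upgrade http:// input
  return url.origin;         // drops any path, e.g. /page
}

async function isReachable(origin) {
  try {
    // Any HTTP response means the server is up; a network error means it isn't.
    await fetch(origin, { method: 'HEAD' });
    return true;
  } catch {
    return false;
  }
}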

Phase 2: Discovery

Before crawling pages, we need to know what exists.

robots.txt Fetch

GET https://yoursite.com/robots.txt

We parse:

  • Sitemap directives: Lines like Sitemap: https://yoursite.com/sitemap.xml
  • Disallow rules: Paths our crawler should respect
  • Crawl-delay: Rate limiting hints
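
A rough sketch of pulling those directives out of the file (real-world parsing also has to handle user-agent groups, which is skipped here):

// Sketch: extract sitemap, disallow, and crawl-delay directives from robots.txt.
function parseRobotsTxt(text) {
  const sitemaps = [];
  const disallow = [];
  let crawlDelay = null;

  for (const rawLine of text.split('\n')) {
    const line = rawLine.split('#')[0];          // strip comments
    const [key, ...rest] = line.split(':');
    const value = rest.join(':').trim();         // sitemap URLs contain ':' themselves
    if (!value) continue;

    switch (key.trim().toLowerCase()) {
      case 'sitemap':     sitemaps.push(value);        break;
      case 'disallow':    disallow.push(value);        break;
      case 'crawl-delay': crawlDelay = Number(value);  break;
    }
  }
  return { sitemaps, disallow, crawlDelay };
}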

Sitemap Discovery

If robots.txt doesn't list sitemaps, we check common locations:

  • /sitemap.xml
  • /sitemap_index.xml
  • /sitemap/sitemap.xml

Sitemaps can be nested. A sitemap index points to other sitemaps:

<sitemapindex>
  <sitemap>
    <loc>https://yoursite.com/sitemap-posts.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://yoursite.com/sitemap-pages.xml</loc>
  </sitemap>
</sitemapindex>

We follow the entire tree, recursively fetching until we have every URL.
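
The traversal is a straightforward recursion. A sketch (using a regex to pull out <loc> values, where a production implementation would use an XML parser):

// Sketch: recursively expand sitemap indexes into a flat list of page URLs.
async function collectSitemapUrls(sitemapUrl, seen = new Set()) {
  if (seen.has(sitemapUrl)) return [];       // guard against circular references
  seen.add(sitemapUrl);

  const xml = await (await fetch(sitemapUrl)).text();
  const locs = [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map(m => m[1]);

  // A <sitemapindex> points at child sitemaps; recurse into each of them.
  if (/<sitemapindex/i.test(xml)) {
    const nested = await Promise.all(locs.map(loc => collectSitemapUrls(loc, seen)));
    return nested.flat();
  }
  return locs;                               // a regular <urlset> lists page URLs
}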

URL Filtering

Not every URL should be audited:

  • Assets skipped: .jpg, .png, .pdf, .css, .js
  • Non-HTML skipped: API endpoints, data files
  • User patterns applied: Your ignore patterns like /admin/*

The result: a clean list of HTML pages to crawl.
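
A sketch of that filter (the extension list and the /api/ rule are illustrative, not the exact set Glimpse uses):

// Sketch: keep only URLs that look like crawlable HTML pages.
const SKIPPED_EXTENSIONS = /\.(jpe?g|png|gif|svg|webp|pdf|css|js|json|zip)$/i;

function filterUrls(urls, ignorePatterns = []) {
  return urls.filter(url => {
    const { pathname } = new URL(url);
    if (SKIPPED_EXTENSIONS.test(pathname)) return false;   // assets
    if (pathname.startsWith('/api/')) return false;        // non-HTML endpoints
    // User ignore patterns like '/admin/*', with '*' as a wildcard.
    return !ignorePatterns.some(pattern =>
      new RegExp('^' + pattern.replace(/\*/g, '.*') + '$').test(pathname));
  });
}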

Phase 3: The Crawl

Most tools stop here: fetch the page, parse the HTML. We go deeper.

Redirect Tracking

We don't let fetch() auto-follow redirects. Instead, we manually follow each hop:

// Simplified flow
let currentUrl = startUrl;
const chain = [];
const MAX_HOPS = 10;   // guard against infinite redirect loops

while (chain.length < MAX_HOPS) {
  const response = await fetch(currentUrl, { redirect: 'manual' });

  if (isRedirect(response.status)) {
    // The Location header may be relative; resolve it against the current URL.
    const nextUrl = new URL(response.headers.get('location'), currentUrl).href;
    chain.push({
      from: currentUrl,
      to: nextUrl,
      status: response.status
    });
    currentUrl = nextUrl;
    continue;
  }

  // Final destination reached
  break;
}

For each page, we now know:

  • Original URL: What's in your sitemap/links
  • Final URL: Where it actually resolves
  • Redirect chain: Every hop in between
  • Redirect types: 301, 302, 307, 308
  • HTTP→HTTPS: Protocol upgrade detection
  • Broken chains: When redirects lead to errors
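
As a sketch, classifying a finished chain into those buckets only needs the recorded hops and the final status code (field names here are illustrative):

// Sketch: summarize a redirect chain once the final response is known.
function classifyChain(chain, finalStatus) {
  return {
    hops: chain.length,
    types: chain.map(hop => hop.status),       // 301, 302, 307, 308
    httpToHttps: chain.some(hop =>
      hop.from.startsWith('http://') && hop.to.startsWith('https://')),
    broken: finalStatus >= 400                 // the chain ends in an error
  };
}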

Rate Limiting

We don't hammer your server:

  • 3-5 concurrent requests maximum
  • 1-second delay between batches
  • Exponential backoff on rate limits (429 responses)

A 50-page site takes 15-30 seconds. A 200-page site might take 1-2 minutes.
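
The batching loop looks roughly like this (the numbers mirror the figures above; fetchPage stands in for the per-page crawl and is assumed to return an object with a status field):

// Sketch: crawl in small batches, pause between batches, back off on 429s.
const CONCURRENCY = 4;
const BATCH_DELAY_MS = 1000;
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function crawlAll(urls, fetchPage) {
  const results = [];
  for (let i = 0; i < urls.length; i += CONCURRENCY) {
    const batch = urls.slice(i, i + CONCURRENCY);
    results.push(...await Promise.all(batch.map(url => fetchWithBackoff(url, fetchPage))));
    await sleep(BATCH_DELAY_MS);
  }
  return results;
}

async function fetchWithBackoff(url, fetchPage, attempt = 0) {
  const result = await fetchPage(url);
  if (result.status === 429 && attempt < 3) {
    await sleep(2 ** attempt * 1000);          // exponential backoff: 1s, 2s, 4s
    return fetchWithBackoff(url, fetchPage, attempt + 1);
  }
  return result;
}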

HTML Extraction

For each successfully fetched page:

SEO Metadata

  • <title> tag
  • <meta name="description">
  • <link rel="canonical">
  • <meta name="robots"> (noindex/nofollow)

Response Headers

  • X-Robots-Tag (server-side noindex)
  • Status code and response time

Content Structure

  • H1 and H2 headings
  • Word count
  • All internal and external links

Social Tags

  • Open Graph: og:title, og:description, og:image, etc.
  • Twitter Cards: twitter:card, twitter:title, etc.

Page Assets

  • HTML size in bytes
  • Stylesheet count
  • Script count
  • Image count
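
All of this comes from a single parse of the fetched HTML. A sketch using cheerio (not necessarily the parser Glimpse uses; the field names are illustrative):

// Sketch: pull the core fields out of a fetched page in one pass.
import * as cheerio from 'cheerio';

function extractPageData(html, response) {
  const $ = cheerio.load(html);
  return {
    title: $('title').first().text().trim(),
    description: $('meta[name="description"]').attr('content') ?? null,
    canonical: $('link[rel="canonical"]').attr('href') ?? null,
    metaRobots: $('meta[name="robots"]').attr('content') ?? null,
    xRobotsTag: response.headers.get('x-robots-tag'),           // server-side noindex
    h1: $('h1').map((_, el) => $(el).text().trim()).get(),
    ogTitle: $('meta[property="og:title"]').attr('content') ?? null,
    twitterCard: $('meta[name="twitter:card"]').attr('content') ?? null,
    links: $('a[href]').map((_, el) => $(el).attr('href')).get(),
    htmlBytes: Buffer.byteLength(html),
    stylesheetCount: $('link[rel="stylesheet"]').length,
    scriptCount: $('script').length,
    imageCount: $('img').length
  };
}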

Phase 4: Link Graph Analysis

After all pages are crawled, we build the complete link graph.

Building the Graph

Every internal link creates two relationships:

  • Outgoing: Page A links to Page B
  • Incoming: Page B receives a link from Page A

Homepage
├─ links to → /about
├─ links to → /products
└─ links to → /blog

/about
├─ links to → /contact
└─ links to → Homepage

/products
├─ links to → /products/item-1
└─ links to → /products/item-2

/hidden-page
└─ (no incoming links - ORPHAN)
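
In code the graph is just two maps keyed by URL, and the orphan and dead-end checks described next fall out of it directly. A sketch, assuming each crawled page record carries its url and a list of resolved internal links:

// Sketch: build outgoing/incoming link sets for every crawled page.
function buildLinkGraph(pages) {
  const outgoing = new Map(pages.map(p => [p.url, new Set()]));
  const incoming = new Map(pages.map(p => [p.url, new Set()]));

  for (const page of pages) {
    for (const link of page.internalLinks) {
      if (!incoming.has(link)) continue;       // target wasn't in the crawl set
      outgoing.get(page.url).add(link);
      incoming.get(link).add(page.url);
    }
  }
  return { outgoing, incoming };
}

const { outgoing, incoming } = buildLinkGraph(pages);
const orphans  = pages.filter(p => incoming.get(p.url).size === 0);   // no links in
const deadEnds = pages.filter(p => outgoing.get(p.url).size === 0);   // no links out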

Detecting Issues

Orphan Pages: Pages with zero incoming internal links

  • They exist (in sitemap or were crawled)
  • But nothing on your site links to them
  • Search engines struggle to discover them

Dead Ends: Pages with zero outgoing links

  • They don't link anywhere
  • Users hit a wall
  • Link equity doesn't flow onward

Links to Redirects: Internal links pointing to URLs that redirect

  • Creates unnecessary redirect hops
  • Update links to point to final destinations

Links to Errors: Internal links pointing to 4xx/5xx pages

  • Broken internal navigation
  • Wastes crawl budget

Phase 5: Canonical Validation

We don't trust that canonical URLs are valid—we verify them.

The Check

For each page with a canonical tag:

  1. Fetch the canonical URL
  2. Check the response:
    • Does it redirect? → Issue: "canonical points to redirect"
    • Does it return 4xx/5xx? → Issue: "canonical points to error"
    • Does it match the page's own URL? → Self-referencing check
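
In code, the check reuses the same manual-redirect fetch from Phase 3. A sketch:

// Sketch: classify a page's canonical target with a single fetch.
async function checkCanonical(pageUrl, canonicalUrl) {
  const response = await fetch(canonicalUrl, { redirect: 'manual' });

  if (response.status >= 300 && response.status < 400) {
    return { issue: 'canonical points to redirect' };
  }
  if (response.status >= 400) {
    return { issue: 'canonical points to error' };
  }
  const selfReferencing = new URL(canonicalUrl).href === new URL(pageUrl).href;
  return { issue: null, selfReferencing };
}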

Why This Matters

A common post-migration problem:

<!-- On https://yoursite.com/new-page -->
<link rel="canonical" href="https://yoursite.com/old-page">

If /old-page now redirects to /new-page, you have a circular canonical → redirect → canonical loop. Search engines get confused.

Phase 6: Indexability Analysis

A page might be blocked from indexing in multiple ways:

What We Check

  1. robots.txt: Does a Disallow rule match this URL?
  2. Meta robots: <meta name="robots" content="noindex">
  3. X-Robots-Tag header: HTTP header with noindex
  4. Canonical mismatch: Page canonicalizes elsewhere

Conflict Detection

Sometimes pages have conflicting signals:

  • Canonical says "index me at this URL"
  • Meta robots says "noindex"

We flag these conflicts so you can resolve the intent.
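
Put together, the verdict is a conjunction of the four signals, plus a flag for the conflicting case above. A sketch (the robots.txt match here is a simplified prefix check; real matching honors wildcards and user-agent groups):

// Sketch: combine the four indexability signals into one verdict.
function analyzeIndexability(page, robotsRules) {
  const path = new URL(page.url).pathname;
  const blockedByRobotsTxt = robotsRules.disallow.some(rule => rule && path.startsWith(rule));
  const noindexMeta   = /noindex/i.test(page.metaRobots ?? '');
  const noindexHeader = /noindex/i.test(page.xRobotsTag ?? '');
  const canonicalElsewhere =
    !!page.canonical && new URL(page.canonical).href !== new URL(page.url).href;

  const indexable =
    !blockedByRobotsTxt && !noindexMeta && !noindexHeader && !canonicalElsewhere;

  // Conflict: the page canonicalizes to itself but also carries a noindex signal.
  const conflict = !!page.canonical && !canonicalElsewhere && (noindexMeta || noindexHeader);

  return { indexable, conflict };
}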

Phase 7: Issue Aggregation

All detected problems are categorized:

Internal Pages
├── 404 page (1) [Error]
└── 4XX page (1) [Error]

Indexability
├── Canonical points to redirect (39) [Error]
└── Indexable became non-indexable (40) [Warning]

Links
├── Orphan pages (5) [Warning]
├── Pages with no outgoing links (2) [Warning]
└── Links to redirect (15) [Info]

Redirects
├── Broken redirect (1) [Error]
├── Redirect chain (12) [Warning]
├── 302 redirect (2) [Warning]
└── HTTP to HTTPS redirect (8) [Info]

Each issue links to affected pages—click to see exactly which URLs have the problem.

Phase 8: Scoring

With all issues detected, we calculate the health score using weighted penalties. Critical issues like broken redirects cost more than minor issues like long titles.

The final score reflects actual site health, not just meta tag presence.
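
The exact weights aren't spelled out here, but the shape of the calculation is roughly this (the numbers are placeholders, not Glimpse's real weights):

// Sketch: subtract weighted, prevalence-scaled penalties from a perfect score.
const WEIGHTS = { error: 5, warning: 2, info: 0.5 };   // placeholder weights

function healthScore(issues, totalPages) {
  const penalty = issues.reduce((sum, issue) =>
    sum + WEIGHTS[issue.severity] * (issue.affectedPages / totalPages), 0);
  return Math.max(0, Math.round(100 - penalty * 10));
}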

What You See

Overview Dashboard

  • Health score with category breakdown
  • Issue counts by severity
  • Quick stats (pages scanned, errors, warnings)

Issue Drill-Down

  • Click any issue to see affected pages
  • Navigate directly to page details
  • Filter pages by issue type

Page Details

  • Complete metadata dump
  • Social preview mockups (how links appear when shared)
  • Link analysis (incoming and outgoing)
  • All detected issues for that page

Timing

For a typical 50-page site:

| Phase           | Time          |
|-----------------|---------------|
| Validation      | ~100ms        |
| Discovery       | 1-3 seconds   |
| Crawling        | 15-30 seconds |
| Link Analysis   | ~500ms        |
| Issue Detection | ~500ms        |
| Scoring         | ~100ms        |
| Total           | 20-35 seconds |

For larger sites, crawl time scales roughly linearly with page count.

Try It

Understanding the process is one thing. Seeing your own results is another.

Run an audit at get-glimpse.com. You'll see exactly what's happening at each step, and the issues might surprise you—even for sites you thought were healthy.


Questions about the crawling process? Email ashish.so@redon.ai