From URL to Insights: Inside Glimpse's Crawling Process
You enter a domain and click "Audit." Seconds later, you have a complete analysis: a health score, categorized issues, and a drill-down to affected pages. Here's what happens in between.
Phase 1: Domain Validation
First, we normalize your input:
yoursite.com → https://yoursite.com
http://yoursite.com → https://yoursite.com
www.yoursite.com → https://www.yoursite.com
yoursite.com/page → https://yoursite.com (domain extracted)
A quick HEAD request confirms the domain is reachable. No point crawling a site that's down.
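Under the hood, this is a few lines of URL manipulation plus one request. A minimal sketch (the helper names are illustrative, not Glimpse's actual internals):
// Illustrative sketch: normalize input to an https:// origin, then check reachability
function normalizeDomain(input) {
  const withProtocol = /^https?:\/\//i.test(input) ? input : `https://${input}`;
  const { host } = new URL(withProtocol);
  return `https://${host}`;                  // force https, keep www, drop path/query/hash
}

async function isReachable(origin) {
  try {
    await fetch(origin, { method: 'HEAD' }); // any HTTP response means the host is up
    return true;
  } catch {
    return false;                            // DNS failure, refused connection, timeout
  }
}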
Phase 2: Discovery
Before crawling pages, we need to know what exists.
robots.txt Fetch
GET https://yoursite.com/robots.txt
We parse:
- Sitemap directives: Lines like Sitemap: https://yoursite.com/sitemap.xml
- Disallow rules: Paths our crawler should respect
- Crawl-delay: Rate limiting hints
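That parsing step is simple enough to sketch with a hand-rolled parser (user-agent grouping omitted for brevity; the real parser may differ):
// Illustrative sketch: extract the directives we care about from robots.txt
function parseRobotsTxt(text) {
  const rules = { sitemaps: [], disallow: [], crawlDelay: null };
  for (const rawLine of text.split('\n')) {
    const line = rawLine.split('#')[0].trim();      // strip comments
    const [key, ...rest] = line.split(':');
    const value = rest.join(':').trim();            // sitemap URLs contain ':' themselves
    if (!key || !value) continue;
    switch (key.trim().toLowerCase()) {
      case 'sitemap':     rules.sitemaps.push(value); break;
      case 'disallow':    rules.disallow.push(value); break;
      case 'crawl-delay': rules.crawlDelay = Number(value); break;
    }
  }
  return rules;
}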
Sitemap Discovery
If robots.txt doesn't list sitemaps, we check common locations:
- /sitemap.xml
- /sitemap_index.xml
- /sitemap/sitemap.xml
Sitemaps can be nested. A sitemap index points to other sitemaps:
<sitemapindex>
  <sitemap>
    <loc>https://yoursite.com/sitemap-posts.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://yoursite.com/sitemap-pages.xml</loc>
  </sitemap>
</sitemapindex>
We follow the entire tree, recursively fetching until we have every URL.
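Conceptually, that recursion is a small function. A sketch that pulls <loc> values with a regex and fans out on sitemap indexes (a production crawler would use a proper XML parser and error handling):
// Illustrative sketch: recursively expand sitemap indexes into a flat list of page URLs
async function collectSitemapUrls(sitemapUrl, seen = new Set()) {
  if (seen.has(sitemapUrl)) return [];              // guard against circular references
  seen.add(sitemapUrl);

  const xml = await (await fetch(sitemapUrl)).text();
  const locs = [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map(m => m[1]);

  if (xml.includes('<sitemapindex')) {
    // Index file: every <loc> is another sitemap to fetch
    const nested = await Promise.all(locs.map(loc => collectSitemapUrls(loc, seen)));
    return nested.flat();
  }
  return locs;                                      // regular sitemap: <loc>s are page URLs
}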
URL Filtering
Not every URL should be audited:
- Assets skipped: .jpg, .png, .pdf, .css, .js
- Non-HTML skipped: API endpoints, data files
- User patterns applied: Your ignore patterns like /admin/*
The result: a clean list of HTML pages to crawl.
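The filter itself is mostly string matching. A simplified sketch, treating user ignore patterns as prefix globs (Glimpse's actual pattern matching may be richer):
// Illustrative sketch: keep only HTML pages worth auditing
const SKIPPED_EXTENSIONS = ['.jpg', '.png', '.pdf', '.css', '.js']; // plus other asset types

function shouldAudit(url, ignorePatterns = []) {
  const { pathname } = new URL(url);
  if (SKIPPED_EXTENSIONS.some(ext => pathname.toLowerCase().endsWith(ext))) return false;
  // Patterns like "/admin/*" are treated as simple prefix globs here
  return !ignorePatterns.some(pattern => pathname.startsWith(pattern.replace(/\*$/, '')));
}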
Phase 3: The Crawl
This is where most tools stop at "fetch page, parse HTML." We go deeper.
Redirect Tracking
We don't let fetch() auto-follow redirects. Instead, we manually follow each hop:
// Simplified flow
let currentUrl = startUrl;
const chain = [];
const MAX_HOPS = 10;                       // guard against redirect loops

while (true) {
  const response = await fetch(currentUrl, { redirect: 'manual' });

  if (isRedirect(response.status) && chain.length < MAX_HOPS) {
    // Location may be relative, so resolve it against the current URL
    const nextUrl = new URL(response.headers.get('location'), currentUrl).href;
    chain.push({
      from: currentUrl,
      to: nextUrl,
      status: response.status
    });
    currentUrl = nextUrl;
    continue;
  }

  // Final destination reached (or hop limit hit)
  break;
}
For each page, we now know:
- Original URL: What's in your sitemap/links
- Final URL: Where it actually resolves
- Redirect chain: Every hop in between
- Redirect types: 301, 302, 307, 308
- HTTP→HTTPS: Protocol upgrade detection
- Broken chains: When redirects lead to errors
Rate Limiting
We don't hammer your server:
- 3-5 concurrent requests maximum
- 1-second delay between batches
- Exponential backoff on rate limits (429 responses)
A 50-page site takes 15-30 seconds. A 200-page site might take 1-2 minutes.
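The throttling behind those numbers fits in one loop. A minimal sketch, assuming a crawlPage helper that returns at least the response status; the batch size, delay, and backoff mirror the figures above:
// Illustrative sketch: small batches, a pause between them, exponential backoff on 429s
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function crawlAll(urls, crawlPage, concurrency = 4) {
  const results = [];
  for (let i = 0; i < urls.length; i += concurrency) {
    const batch = urls.slice(i, i + concurrency);
    const batchResults = await Promise.all(batch.map(async url => {
      for (let attempt = 0, delay = 1000; ; attempt++, delay *= 2) {
        const result = await crawlPage(url);              // assumed helper: fetch + extract
        if (result.status !== 429 || attempt >= 3) return result;
        await sleep(delay);                               // back off before retrying
      }
    }));
    results.push(...batchResults);
    await sleep(1000);                                    // breathe between batches
  }
  return results;
}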
HTML Extraction
For each successfully fetched page:
SEO Metadata
- <title> tag
- <meta name="description">
- <link rel="canonical">
- <meta name="robots"> (noindex/nofollow)
Response Headers
- X-Robots-Tag (server-side noindex)
- Status code and response time
Content Structure
- H1 and H2 headings
- Word count
- All internal and external links
Social Tags
- Open Graph: og:title, og:description, og:image, etc.
- Twitter Cards: twitter:card, twitter:title, etc.
Page Assets
- HTML size in bytes
- Stylesheet count
- Script count
- Image count
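All of this comes from standard DOM querying. A sketch using cheerio as the parser (cheerio is one option, not necessarily what Glimpse uses internally, and the field list is abbreviated):
// Illustrative sketch of metadata extraction; any HTML parser works
import * as cheerio from 'cheerio';

function extractPageData(html, url) {
  const $ = cheerio.load(html);
  return {
    url,
    title: $('title').first().text().trim(),
    description: $('meta[name="description"]').attr('content') ?? null,
    canonical: $('link[rel="canonical"]').attr('href') ?? null,
    metaRobots: $('meta[name="robots"]').attr('content') ?? null,
    h1: $('h1').map((_, el) => $(el).text().trim()).get(),
    wordCount: $('body').text().split(/\s+/).filter(Boolean).length,
    links: $('a[href]').map((_, el) => $(el).attr('href')).get(),
    ogTitle: $('meta[property="og:title"]').attr('content') ?? null,
    htmlBytes: Buffer.byteLength(html),
    stylesheets: $('link[rel="stylesheet"]').length,
    scripts: $('script[src]').length,
    images: $('img').length,
  };
}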
Phase 4: Link Graph Analysis
After all pages are crawled, we build the complete link graph.
Building the Graph
Every internal link creates two relationships:
- Outgoing: Page A links to Page B
- Incoming: Page B receives a link from Page A
Homepage
├─ links to → /about
├─ links to → /products
└─ links to → /blog
/about
├─ links to → /contact
└─ links to → Homepage
/products
├─ links to → /products/item-1
└─ links to → /products/item-2
/hidden-page
└─ (no incoming links - ORPHAN)
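Building the graph amounts to two lookups per link. A sketch that derives incoming and outgoing counts from the crawled pages; those counts feed the orphan and dead-end checks below (assumes each page object carries its internal links):
// Illustrative sketch: derive incoming/outgoing counts from crawled pages
function buildLinkGraph(pages) {
  // pages: [{ url, internalLinks: [url, ...] }, ...]
  const incoming = new Map(pages.map(page => [page.url, new Set()]));

  for (const page of pages) {
    for (const target of page.internalLinks) {
      if (incoming.has(target)) incoming.get(target).add(page.url);
    }
  }

  return pages.map(page => ({
    url: page.url,
    outgoing: page.internalLinks.length,
    incoming: incoming.get(page.url).size,
    isOrphan: incoming.get(page.url).size === 0,   // nothing links to it (homepage exempted in practice)
    isDeadEnd: page.internalLinks.length === 0,    // it links to nothing
  }));
}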
Detecting Issues
Orphan Pages: Pages with zero incoming internal links
- They exist (in sitemap or were crawled)
- But nothing on your site links to them
- Search engines struggle to discover them
Dead Ends: Pages with zero outgoing links
- They don't link anywhere
- Users hit a wall
- Link equity doesn't flow onward
Links to Redirects: Internal links pointing to URLs that redirect
- Creates unnecessary redirect hops
- Update links to point to final destinations
Links to Errors: Internal links pointing to 4xx/5xx pages
- Broken internal navigation
- Wastes crawl budget
Phase 5: Canonical Validation
We don't trust that canonical URLs are valid—we verify them.
The Check
For each page with a canonical tag:
- Fetch the canonical URL
- Check the response:
  - Does it redirect? → Issue: "canonical points to redirect"
  - Does it return 4xx/5xx? → Issue: "canonical points to error"
  - Does it match the page's own URL? → Self-referencing check
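In code, the verification is one guarded fetch per canonical URL. A sketch with deduplication and error handling omitted:
// Illustrative sketch: verify that a canonical target resolves cleanly
async function checkCanonical(pageUrl, canonicalUrl) {
  const response = await fetch(canonicalUrl, { redirect: 'manual' });

  if (response.status >= 300 && response.status < 400) {
    return { issue: 'canonical points to redirect' };
  }
  if (response.status >= 400) {
    return { issue: 'canonical points to error' };
  }
  return { ok: true, selfReferencing: canonicalUrl === pageUrl };
}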
Why This Matters
A common post-migration problem:
<!-- On https://yoursite.com/new-page -->
<link rel="canonical" href="https://yoursite.com/old-page">
If /old-page now redirects to /new-page, you have a circular canonical → redirect → canonical loop. Search engines get confused.
Phase 6: Indexability Analysis
A page might be blocked from indexing in multiple ways:
What We Check
- robots.txt: Does a Disallow rule match this URL?
- Meta robots: <meta name="robots" content="noindex">
- X-Robots-Tag header: HTTP header with noindex
- Canonical mismatch: Page canonicalizes elsewhere
Conflict Detection
Sometimes pages have conflicting signals:
- Canonical says "index me at this URL"
- Meta robots says "noindex"
We flag these conflicts so you can resolve the intent.
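Combining those signals is a small decision function. A sketch that reduces robots.txt matching to a prefix test and flags the canonical-vs-noindex conflict described above (field names are illustrative):
// Illustrative sketch: combine indexability signals and flag contradictions
function analyzeIndexability(page, disallowRules) {
  const noindexMeta = /noindex/i.test(page.metaRobots ?? '');
  const noindexHeader = /noindex/i.test(page.xRobotsTag ?? '');
  const blockedByRobots = disallowRules.some(rule =>
    rule && new URL(page.url).pathname.startsWith(rule)
  );
  const canonicalElsewhere = !!page.canonical && page.canonical !== page.url;

  return {
    indexable: !noindexMeta && !noindexHeader && !blockedByRobots && !canonicalElsewhere,
    // Self-referencing canonical ("index me here") combined with noindex is contradictory
    conflict: !!page.canonical && !canonicalElsewhere && (noindexMeta || noindexHeader),
  };
}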
Phase 7: Issue Aggregation
All detected problems are categorized:
Internal Pages
├── 404 page (1) [Error]
└── 4XX page (1) [Error]
Indexability
├── Canonical points to redirect (39) [Error]
└── Indexable became non-indexable (40) [Warning]
Links
├── Orphan pages (5) [Warning]
├── Pages with no outgoing links (2) [Warning]
└── Links to redirect (15) [Info]
Redirects
├── Broken redirect (1) [Error]
├── Redirect chain (12) [Warning]
├── 302 redirect (2) [Warning]
└── HTTP to HTTPS redirect (8) [Info]
Each issue links to affected pages—click to see exactly which URLs have the problem.
Phase 8: Scoring
With all issues detected, we calculate the health score using weighted penalties. Critical issues like broken redirects cost more than minor issues like long titles.
The final score reflects actual site health, not just meta tag presence.
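A weighted-penalty score can be as simple as summing severity weights and clamping to a 0-100 range. A sketch with made-up weights; Glimpse's actual weights and formula are more nuanced:
// Illustrative sketch: weighted penalties clamped to a 0-100 score
const PENALTY = { error: 5, warning: 2, info: 0.5 };   // illustrative weights only

function healthScore(issues) {
  // issues: [{ severity: 'error' | 'warning' | 'info', affectedPages: number }, ...]
  const totalPenalty = issues.reduce(
    (sum, issue) => sum + PENALTY[issue.severity] * issue.affectedPages, 0
  );
  return Math.max(0, Math.round(100 - totalPenalty));
}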
What You See
Overview Dashboard
- Health score with category breakdown
- Issue counts by severity
- Quick stats (pages scanned, errors, warnings)
Issue Drill-Down
- Click any issue to see affected pages
- Navigate directly to page details
- Filter pages by issue type
Page Details
- Complete metadata dump
- Social preview mockups (how links appear when shared)
- Link analysis (incoming and outgoing)
- All detected issues for that page
Timing
For a typical 50-page site:
| Phase | Time |
|-------|------|
| Validation | ~100ms |
| Discovery | 1-3 seconds |
| Crawling | 15-30 seconds |
| Link Analysis | ~500ms |
| Issue Detection | ~500ms |
| Scoring | ~100ms |
| Total | 20-35 seconds |
For larger sites, crawl time scales roughly linearly with page count.
Try It
Understanding the process is one thing. Seeing your own results is another.
Run an audit at get-glimpse.com. You'll see exactly what's happening at each step, and the issues might surprise you—even for sites you thought were healthy.
Questions about the crawling process? Email ashish.so@redon.ai