Tested February 2026 using `curl` for raw HTML and the markdown.new proxy for markdown conversion.
| Site Type | Raw HTML | Markdown | Token Savings | HTML Tags Stripped |
|---|---|---|---|---|
| Stack Overflow (Q&A) | ~287K tokens | ~36K tokens | 87% | 7,723 |
| GitHub Repo Page | ~92K tokens | ~5K tokens | 94% | 2,348 |
| React Docs | ~71K tokens | ~5K tokens | 93% | 2,890 |
| MDN Web Docs | ~47K tokens | ~8K tokens | 82% | 1,188 |
| Wikipedia Article | ~69K tokens | ~38K tokens | 44% | 3,678 |
| Paul Graham Essay | ~20K tokens | ~34K tokens | -68% | 819 |
Token estimates assume roughly 4 characters (bytes) per token.
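As a rough illustration of the arithmetic behind the token and savings columns, the sketch below applies the 4-bytes-per-token approximation to the GitHub and Paul Graham byte counts reported in this document. The function names `estimate_tokens` and `savings_percent` are hypothetical helpers, not part of any tool used in the test.

```python
def estimate_tokens(byte_count: int, bytes_per_token: int = 4) -> int:
    """Approximate token count from a byte count (about 4 bytes per token)."""
    return round(byte_count / bytes_per_token)


def savings_percent(raw_bytes: int, markdown_bytes: int) -> float:
    """Percentage reduction going from raw HTML to markdown.

    A negative value means the markdown output was larger than the raw HTML.
    """
    return (1 - markdown_bytes / raw_bytes) * 100


# Byte counts taken from the GitHub (facebook/react) row.
github_raw, github_md = 367_484, 20_417
print(estimate_tokens(github_raw))   # about 92K tokens
print(estimate_tokens(github_md))    # about 5K tokens
print(f"{savings_percent(github_raw, github_md):.0f}%")

# Byte counts taken from the Paul Graham row, where markdown was larger.
pg_raw, pg_md = 79_765, 134_173
print(f"{savings_percent(pg_raw, pg_md):.0f}%")
```

Because the estimate is a fixed ratio of bytes, the token savings and byte reduction percentages are identical by construction.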
| Site | Raw Bytes | Markdown Bytes | Reduction | Signal Ratio |
|---|---|---|---|---|
| Stack Overflow | 1,149,256 | 143,031 | 87% | 43% signal in raw |
| GitHub (facebook/react) | 367,484 | 20,417 | 94% | 14% signal in raw |
| React Docs (Thinking in React) | 283,120 | 19,027 | 93% | 21% signal in raw |
| MDN (Promise reference) | 188,489 | 33,065 | 82% | 56% signal in raw |
| Wikipedia (Claude language model) | 274,359 | 153,608 | 44% | 21% signal in raw |
| Paul Graham (Great Work) | 79,765 | 134,173 | -68% | 86% signal in raw |
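"Signal ratio" = percentage of raw HTML bytes that are actual text content once tags are stripped. A lower signal ratio generally means more bloat and bigger savings from markdown.new, though the relationship is not strict (Wikipedia has a low signal ratio but only 44% savings).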
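A minimal sketch of how a signal ratio like this could be computed, assuming tags are removed with a simple regex and sizes are measured in UTF-8 bytes. The exact stripping rules used for the table are not stated (for example, whether script or style contents count as text), so `signal_ratio` and its `<[^>]+>` pattern are assumptions for illustration only.

```python
import re

# Assumed stripping rule: remove anything that looks like a tag,
# including closing tags and comments. Script/style bodies are kept as text.
ANY_TAG_RE = re.compile(r"<[^>]+>")


def signal_ratio(html: str) -> float:
    """Return text-only bytes divided by total bytes for an HTML string."""
    total_bytes = len(html.encode("utf-8"))
    if total_bytes == 0:
        return 0.0
    text_only = ANY_TAG_RE.sub("", html)
    return len(text_only.encode("utf-8")) / total_bytes


sample = '<div class="post"><p>Hello, world</p></div>'
print(f"{signal_ratio(sample):.0%} signal in raw")
```

A regex is a crude way to strip tags; a real HTML parser would handle edge cases such as `>` inside attribute values more reliably.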
markdown.new uses a headless browser, which means it can fetch content from sites that block plain curl requests. These sites returned bot-blocking pages to raw curl, but markdown.new returned real content:
| Site | Raw curl result | markdown.new result | Why curl failed |
|---|---|---|---|
| Amazon Product Page | 2.3 KB (bot redirect) | 59.7 KB (full content) | Bot detection redirect |
| AllRecipes | 612 B (access denied) | 24.9 KB (full recipe) | IP/bot blocking |
| Medium Blog Post | 7.2 KB (Cloudflare challenge) | 20.4 KB (full article) | Cloudflare JS challenge |
| NPM Package Page | 7.2 KB (Cloudflare challenge) | 11.2 KB (package info) | Cloudflare JS challenge |
Some sites blocked both methods:
| Site | Raw curl | markdown.new | Issue |
|---|---|---|---|
| CNN | 1.07 MB (JS shell) | 119 B (error) | Heavy JS rendering + anti-bot |
| BBC News | 183 KB (JS shell) | 119 B (error) | JS rendering + anti-bot |
| Reuters | 773 B (CAPTCHA) | 119 B (error) | CAPTCHA on both methods |
Paul Graham's blog uses extremely minimal HTML with almost no boilerplate, ads, or JavaScript. The raw HTML is already 86% meaningful content. In this case, markdown.new actually increased the size by 68% due to conversion overhead and metadata.
Takeaway: markdown.new provides the most value on modern, JavaScript-heavy sites with lots of ads and framework boilerplate. On minimal/static HTML sites, raw fetch may be smaller.
Ranked by token savings:
- App-like pages (GitHub, React docs): 93-94% savings — heavy JS bundles, SVGs, navigation
- Ad-heavy pages (Stack Overflow): 87% savings — ads, sidebars, related content widgets
- Reference docs (MDN): 82% savings — good content but wrapped in framework boilerplate
- Content-focused sites (Wikipedia): 44% savings — relatively clean HTML but still has nav/metadata
- Minimal sites (Paul Graham): -68% — already lean, conversion adds overhead
- All fetches performed via `curl -sL --max-time 30`
- Token estimates: `byte_count / 4` (standard English approximation)
- Signal ratio: text-only bytes (HTML tags stripped) / total bytes
- HTML tag count: regex match on `<[a-zA-Z][^>]*>` patterns
- Tests run from a residential IP, no VPN or proxy
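The fetch and tag-count steps above could be reproduced roughly as follows. The `curl` flags and the tag regex come from the methodology list; wrapping `curl` in Python's `subprocess` and the helper names `fetch_raw_html` and `count_html_tags` are assumptions made for this sketch, not the script actually used.

```python
import re
import subprocess

# Pattern from the methodology: matches opening (and self-closing) tags only,
# since closing tags start with "</" and fail the [a-zA-Z] check.
OPEN_TAG_RE = re.compile(r"<[a-zA-Z][^>]*>")


def fetch_raw_html(url: str) -> bytes:
    """Fetch a page with the same curl flags as the test (-sL --max-time 30)."""
    result = subprocess.run(
        ["curl", "-sL", "--max-time", "30", url],
        capture_output=True,
        check=True,  # raise if curl exits non-zero (e.g. on timeout)
    )
    return result.stdout


def count_html_tags(html: str) -> int:
    """Count tags matching the methodology's regex."""
    return len(OPEN_TAG_RE.findall(html))


snippet = "<div><p class='x'>hi</p><br/></div>"
print(count_html_tags(snippet))  # 3: <div>, <p class='x'>, <br/>
```

Note that a blocked request can still exit successfully, so a small response body (as in the bot-blocking tables) is a better failure signal than curl's exit code.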