Benchmark: Raw HTML vs markdown.new

Tested in February 2026, using curl to fetch raw HTML and the markdown.new proxy for markdown conversion.

Results Summary

| Site Type | Raw HTML | Markdown | Token Savings | HTML Tags Stripped |
|---|---|---|---|---|
| Stack Overflow (Q&A) | ~287K tokens | ~36K tokens | 87% | 7,723 |
| GitHub Repo Page | ~92K tokens | ~5K tokens | 94% | 2,348 |
| React Docs | ~71K tokens | ~5K tokens | 93% | 2,890 |
| MDN Web Docs | ~47K tokens | ~8K tokens | 82% | 1,188 |
| Wikipedia Article | ~69K tokens | ~38K tokens | 44% | 3,678 |
| Paul Graham Essay | ~20K tokens | ~34K tokens | -68% | 819 |

Token estimates use ~4 characters per token approximation.
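As a rough illustration of how the token and savings columns can be derived, here is a minimal sketch using byte counts from the Detailed Size Comparison table. The function names are hypothetical, and rounding to the nearest token is an assumption; the report only states the ~4 characters per token rule.

```python
def estimate_tokens(byte_count: int) -> int:
    """Approximate token count using the ~4 characters per token rule."""
    return round(byte_count / 4)


def savings_percent(raw_bytes: int, markdown_bytes: int) -> float:
    """Percentage reduction from raw HTML to markdown.

    A negative value means the markdown output is larger than the raw HTML.
    """
    return (1 - markdown_bytes / raw_bytes) * 100


# Byte counts from the Stack Overflow and Paul Graham rows.
print(estimate_tokens(1_149_256))           # raw HTML tokens, Stack Overflow
print(savings_percent(1_149_256, 143_031))  # positive: markdown is smaller
print(savings_percent(79_765, 134_173))     # negative: markdown is larger
```

The sign convention matches the table: the Paul Graham row comes out negative because the markdown output is larger than the raw HTML.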

Detailed Size Comparison

| Site | Raw Bytes | Markdown Bytes | Reduction | Signal Ratio |
|---|---|---|---|---|
| Stack Overflow | 1,149,256 | 143,031 | 87% | 43% signal in raw |
| GitHub (facebook/react) | 367,484 | 20,417 | 94% | 14% signal in raw |
| React Docs (Thinking in React) | 283,120 | 19,027 | 93% | 21% signal in raw |
| MDN (Promise reference) | 188,489 | 33,065 | 82% | 56% signal in raw |
| Wikipedia (Claude language model) | 274,359 | 153,608 | 44% | 21% signal in raw |
| Paul Graham (Great Work) | 79,765 | 134,173 | -68% | 86% signal in raw |

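"Signal ratio" is the percentage of raw HTML bytes that remain as text content once tags are stripped. A lower signal ratio generally means more markup bloat and larger savings from markdown.new, though the relationship is not strict: Wikipedia has a 21% signal ratio but only a 44% reduction.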
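The signal ratio could be computed along these lines. This is a sketch only: the tag-stripping regex and the function name are assumptions, since the report does not say exactly how tags were stripped.

```python
import re

# Assumed stripping rule: remove anything that looks like <...>.
TAG_RE = re.compile(r"<[^>]+>")


def signal_ratio(html: str) -> float:
    """Fraction of the document's bytes that remain after stripping tags."""
    total_bytes = len(html.encode("utf-8"))
    if total_bytes == 0:
        return 0.0
    text_only = TAG_RE.sub("", html)
    return len(text_only.encode("utf-8")) / total_bytes


# 5 bytes of text ("Hello") out of 12 bytes total.
print(signal_ratio("<p>Hello</p>"))
```

A plain tag strip like this leaves the bodies of inline script and style elements in place, so under this assumption they count as signal even though they are not readable content.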

Bot Protection Bypass

markdown.new uses a headless browser, which means it can fetch content from sites that block plain curl requests. These sites returned bot-blocking pages to raw curl, but markdown.new returned real content:

| Site | Raw curl result | markdown.new result | Why curl failed |
|---|---|---|---|
| Amazon Product Page | 2.3 KB (bot redirect) | 59.7 KB (full content) | Bot detection redirect |
| AllRecipes | 612 B (access denied) | 24.9 KB (full recipe) | IP/bot blocking |
| Medium Blog Post | 7.2 KB (Cloudflare challenge) | 20.4 KB (full article) | Cloudflare JS challenge |
| NPM Package Page | 7.2 KB (Cloudflare challenge) | 11.2 KB (package info) | Cloudflare JS challenge |

Failure Cases

Neither method returned usable content for these sites:

| Site | Raw curl | markdown.new | Issue |
|---|---|---|---|
| CNN | 1.07 MB (JS shell) | 119 B (error) | Heavy JS rendering + anti-bot |
| BBC News | 183 KB (JS shell) | 119 B (error) | JS rendering + anti-bot |
| Reuters | 773 B (CAPTCHA) | 119 B (error) | CAPTCHA on both methods |

Edge Case: Minimal HTML Sites

Paul Graham's blog uses extremely minimal HTML with almost no boilerplate, ads, or JavaScript. The raw HTML is already 86% meaningful content. In this case, markdown.new actually increased the size by 68% due to conversion overhead and metadata.

Takeaway: markdown.new provides the most value on modern, JavaScript-heavy sites with lots of ads and framework boilerplate. On minimal/static HTML sites, raw fetch may be smaller.

When markdown.new Helps Most

Ranked by token savings:

  1. App-like pages (GitHub, React docs): 93-94% savings — heavy JS bundles, SVGs, navigation
  2. Ad-heavy pages (Stack Overflow): 87% savings — ads, sidebars, related content widgets
  3. Reference docs (MDN): 82% savings — good content but wrapped in framework boilerplate
  4. Content-focused sites (Wikipedia): 44% savings — relatively clean HTML but still has nav/metadata
  5. Minimal sites (Paul Graham): -68% — already lean, conversion adds overhead

Methodology

  • All fetches performed via curl -sL --max-time 30
  • Token estimates: byte_count / 4 (standard English approximation)
  • Signal ratio: text-only bytes (HTML tags stripped) / total bytes
  • HTML tag count: regex match on <[a-zA-Z][^>]*> patterns
  • Tests run from a residential IP, no VPN or proxy
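The tag-count step can be sketched with the regex quoted in the methodology. The function name and the sample string are hypothetical; only the pattern comes from the report.

```python
import re

# Pattern from the methodology: "<" followed by a letter, up to the next ">".
OPEN_TAG_RE = re.compile(r"<[a-zA-Z][^>]*>")


def count_html_tags(html: str) -> int:
    """Count matches of the methodology's tag pattern in an HTML string."""
    return len(OPEN_TAG_RE.findall(html))


sample = '<div class="a"><p>Hi</p><br/></div><!-- note -->'
print(count_html_tags(sample))  # matches <div ...>, <p>, and <br/>
```

Because the pattern requires a letter right after `<`, it matches opening and self-closing tags only. Closing tags such as `</p>` and comments such as `<!-- note -->` are not counted.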