Benchmark: Raw HTML vs markdown.new

Tested in February 2026, using curl for raw HTML and the markdown.new proxy for markdown conversion.

Results Summary

Site Type	Raw HTML	Markdown	Token Savings	HTML Tags Stripped
Stack Overflow (Q&A)	~287K tokens	~36K tokens	87%	7,723
GitHub Repo Page	~92K tokens	~5K tokens	94%	2,348
React Docs	~71K tokens	~5K tokens	93%	2,890
MDN Web Docs	~47K tokens	~8K tokens	82%	1,188
Wikipedia Article	~69K tokens	~38K tokens	44%	3,678
Paul Graham Essay	~20K tokens	~34K tokens	-68%	819

Token estimates are approximate: byte count divided by 4 (roughly 4 characters per token).
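The arithmetic behind the summary table can be reproduced with a short sketch. The function names below are illustrative, not part of the benchmark tooling; the byte counts come from the Detailed Size Comparison table.

```python
def estimate_tokens(byte_count: int, chars_per_token: int = 4) -> int:
    """Rough token estimate: byte count divided by ~4 characters per token."""
    return round(byte_count / chars_per_token)


def savings_percent(raw_bytes: int, markdown_bytes: int) -> float:
    """Percentage reduction from raw HTML to markdown (negative means growth)."""
    return (1 - markdown_bytes / raw_bytes) * 100


# Byte counts for the GitHub and Paul Graham rows of the size table.
github_raw, github_md = 367_484, 20_417
pg_raw, pg_md = 79_765, 134_173

print(estimate_tokens(github_raw))                    # 91871, i.e. ~92K
print(estimate_tokens(github_md))                     # 5104, i.e. ~5K
print(round(savings_percent(github_raw, github_md)))  # 94
print(round(savings_percent(pg_raw, pg_md)))          # -68
```

A negative result, as in the Paul Graham row, means the markdown output is larger than the raw HTML.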

Detailed Size Comparison

Site	Raw Bytes	Markdown Bytes	Reduction	Signal Ratio
Stack Overflow	1,149,256	143,031	87%	43% signal in raw
GitHub (facebook/react)	367,484	20,417	94%	14% signal in raw
React Docs (Thinking in React)	283,120	19,027	93%	21% signal in raw
MDN (Promise reference)	188,489	33,065	82%	56% signal in raw
Wikipedia (Claude language model)	274,359	153,608	44%	21% signal in raw
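Paul Graham (Great Work)	79,765	134,173	-68%	86% signal in raw

"Signal ratio" is the percentage of raw HTML bytes that remain as text once tags are stripped. A lower signal ratio generally means more markup bloat and larger savings from markdown.new, though the relationship is loose in this data: Wikipedia has a 21% signal ratio but only 44% savings.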

Bot Protection Bypass

markdown.new uses a headless browser, so it can fetch content from sites that block plain curl requests. The sites below returned bot-blocking pages to raw curl, while markdown.new returned the real content:

Site	Raw curl result	markdown.new result	Why curl failed
Amazon Product Page	2.3 KB (bot redirect)	59.7 KB (full content)	Bot detection redirect
AllRecipes	612 B (access denied)	24.9 KB (full recipe)	IP/bot blocking
Medium Blog Post	7.2 KB (Cloudflare challenge)	20.4 KB (full article)	Cloudflare JS challenge
NPM Package Page	7.2 KB (Cloudflare challenge)	11.2 KB (package info)	Cloudflare JS challenge

Failure Cases

Some sites yielded no usable content with either method:

Site	Raw curl	markdown.new	Issue
CNN	1.07 MB (JS shell)	119 B (error)	Heavy JS rendering + anti-bot
BBC News	183 KB (JS shell)	119 B (error)	JS rendering + anti-bot
Reuters	773 B (CAPTCHA)	119 B (error)	CAPTCHA on both methods

Edge Case: Minimal HTML Sites

Paul Graham's blog uses extremely minimal HTML with almost no boilerplate, ads, or JavaScript. The raw HTML is already 86% meaningful content. In this case, markdown.new actually increased the size by 68% due to conversion overhead and metadata.

Takeaway: markdown.new provides the most value on modern, JavaScript-heavy sites with lots of ads and framework boilerplate. On minimal/static HTML sites, raw fetch may be smaller.
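One way to act on this takeaway is to compare the two payloads and keep the smaller one. The following is a minimal sketch under that assumption; `pick_smaller` is a hypothetical helper, not something the benchmark itself used.

```python
def pick_smaller(raw_html: str, markdown: str) -> tuple[str, str]:
    """Return (label, payload) for whichever representation has fewer bytes.

    Ties go to the raw HTML, since it needs no conversion step.
    """
    raw_size = len(raw_html.encode("utf-8"))
    md_size = len(markdown.encode("utf-8"))
    if md_size < raw_size:
        return "markdown", markdown
    return "raw", raw_html


# Bloated page: the markdown is smaller, so it wins.
print(pick_smaller("<div><span><p>hi</p></span></div>", "hi")[0])
# Minimal page: conversion overhead makes the markdown larger.
print(pick_smaller("hi", "# Title\n\nhi\n")[0])
```

Size is only one criterion. Raw HTML that is small because it is a bot-block page or a JavaScript shell would still need separate handling.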

When markdown.new Helps Most

Ranked by token savings:

App-like pages (GitHub, React docs): 93-94% savings — heavy JS bundles, SVGs, navigation
Ad-heavy pages (Stack Overflow): 87% savings — ads, sidebars, related content widgets
Reference docs (MDN): 82% savings — good content but wrapped in framework boilerplate
Content-focused sites (Wikipedia): 44% savings — relatively clean HTML but still has nav/metadata
Minimal sites (Paul Graham): -68% — already lean, conversion adds overhead

Methodology

All fetches performed via curl -sL --max-time 30
Token estimates: byte_count / 4 (a common approximation for English text)
Signal ratio: text-only bytes (HTML tags stripped) / total bytes
HTML tag count: regex match on <[a-zA-Z][^>]*> patterns
Tests run from a residential IP, no VPN or proxy
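The measurements listed above could be scripted roughly as follows. This is a sketch, not the benchmark's actual tooling. Two things are assumptions: that "tags stripped" removes closing tags as well as opening ones, and that the proxy is addressed by prefixing the target URL with `https://markdown.new/`. The helper names are hypothetical.

```python
import re
import subprocess

# Tag pattern quoted from the methodology; it matches opening tags only.
TAG_PATTERN = re.compile(rb"<[a-zA-Z][^>]*>")
# Assumption: stripping tags removes opening and closing tags alike.
STRIP_PATTERN = re.compile(rb"</?[a-zA-Z][^>]*>")
# Assumption: the proxy is reached by prefixing the target URL with this string.
PROXY_PREFIX = "https://markdown.new/"


def curl_command(url: str, max_time: int = 30) -> list[str]:
    """Build the argv for the curl flags listed above (-sL --max-time 30)."""
    return ["curl", "-sL", "--max-time", str(max_time), url]


def fetch_bytes(url: str) -> bytes:
    """Run curl and return the response body; requires curl on PATH."""
    result = subprocess.run(curl_command(url), capture_output=True, check=False)
    return result.stdout


def count_tags(html: bytes) -> int:
    """Count matches of the methodology's tag regex."""
    return len(TAG_PATTERN.findall(html))


def signal_ratio(html: bytes) -> float:
    """Bytes left after stripping tags, divided by total bytes."""
    if not html:
        return 0.0
    return len(STRIP_PATTERN.sub(b"", html)) / len(html)


def measure(url: str) -> dict[str, float]:
    """Fetch a page both ways and report the benchmark's metrics."""
    raw = fetch_bytes(url)
    markdown = fetch_bytes(PROXY_PREFIX + url)
    return {
        "raw_bytes": len(raw),
        "markdown_bytes": len(markdown),
        "tag_count": count_tags(raw),
        "signal_ratio": signal_ratio(raw),
    }


# Offline check on a tiny document: 3 opening tags, 12 of 45 bytes are text.
sample = b"<html><body><p>Hello, world</p></body></html>"
print(count_tags(sample))                 # 3
print(round(signal_ratio(sample) * 100))  # 27
```

Calling `measure(url)` performs two network fetches, so it is left out of the demo lines. Because this sketch strips only the tags themselves, inline script and style contents and whitespace still count as "text".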
