What Is Web Content Extraction?
Web content extraction is the task of separating the main content of a web page from its surrounding boilerplate — the navigation menus, cookie banners, ads, sidebar widgets, footer links, and social sharing buttons that make up a modern web page. The main content is whatever the page was actually created to deliver.
Anyone who processes web pages at scale needs content extraction. Search engines use it for indexing. RAG systems use it to feed clean context to large language models. NLP researchers use it to build training corpora. And SEO practitioners use it to approximate what Google actually sees when it evaluates a page.
The difficulty is that there is no standard way to mark the main content of a web page. It is just HTML elements — some containing the article text, most containing chrome around it. A typical page is roughly 80% boilerplate. An extraction system has to figure out which 20% matters, and it has to do it reliably across millions of pages with wildly different structures.
How Web Content Extraction Works
There are broadly three approaches to web content extraction: rule-based heuristics, neural classification, and hybrid systems that combine both.
Rule-based heuristic systems dominate production use. They score DOM elements by text density, paragraph count, link ratio, and CSS class names, then select the highest-scoring element as main content. They are fast — typically under 100ms per page — and need no GPU. The ecosystem here is strong. Trafilatura, created by Adrien Barbaresi, is the most widely used Python extraction library and powers large-scale pipelines like FineWeb and RefinedWeb. Mozilla's Readability is the engine behind Firefox's Reader View. jusText classifies text blocks by stopword frequency, and BoilerPy3 models extraction as sequence labeling over text blocks.
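The scoring idea behind these heuristics can be sketched in a few lines. This is a deliberately toy illustration, not how Trafilatura or Readability actually score nodes — the function names and the weight on paragraph count are invented for the example — but it shows why a long, paragraph-heavy, low-link block outranks a navigation menu:

```python
import re

def link_density(block_html: str) -> float:
    """Fraction of a block's words that sit inside <a> tags."""
    anchor_text = " ".join(re.findall(r"<a\b[^>]*>(.*?)</a>", block_html, re.S | re.I))
    all_text = re.sub(r"<[^>]+>", " ", block_html)
    total = len(all_text.split())
    return len(anchor_text.split()) / total if total else 1.0

def score_block(block_html: str) -> float:
    """Toy content score: reward word count and paragraphs, punish link-heavy blocks."""
    text = re.sub(r"<[^>]+>", " ", block_html)
    n_words = len(text.split())
    n_paragraphs = block_html.lower().count("<p>")
    return (n_words + 20 * n_paragraphs) * (1.0 - link_density(block_html))
```

A `<div>` of article paragraphs scores high; a `<ul>` of anchor-only menu items has link density near 1.0 and scores near zero. Real extractors layer many more signals (CSS class heuristics, DOM depth, cross-block smoothing) on top of this core idea.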
Neural extraction systems are a more recent development. MinerU-HTML from OpenDataLab fine-tunes a Qwen3-0.6B model to label each HTML element as main content or boilerplate — a neat approach that avoids hallucination since the model classifies rather than generates. ReaderLM-v2 from Jina AI takes a different route, using a 1.5B parameter model to convert raw HTML directly into Markdown. Both are impressive, but they need GPU inference and run 36 to 237 times slower than heuristic systems.
Hybrid systems try to get the best of both worlds. A heuristic extractor handles every page first. An ML confidence predictor flags the ones it is unsure about, and those get routed to a neural model for a second pass. You get heuristic speed on the 90%+ of pages where it works well, and neural quality on the hard cases.
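The routing logic itself is simple. A minimal sketch, assuming three hypothetical callables — a heuristic extractor, a confidence predictor, and a neural extractor (none of these names come from a real library):

```python
from typing import Callable

def route_extraction(
    html: str,
    heuristic: Callable[[str], str],
    confidence: Callable[[str, str], float],
    neural: Callable[[str], str],
    threshold: float = 0.8,
) -> str:
    """Hybrid routing: heuristic pass first, neural second pass only
    when the confidence predictor flags the result as unreliable."""
    draft = heuristic(html)
    if confidence(html, draft) >= threshold:
        return draft      # fast path: most pages stop here
    return neural(html)   # slow path: hard cases only
```

The threshold is the knob that trades cost for quality: raise it and more pages pay for GPU inference, lower it and more borderline extractions ship as-is.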
The Problem WCEB Solves
The Web Content Extraction Benchmark solves the problem of invisible extraction failures on non-article page types by providing page-type-stratified evaluation across seven structurally distinct categories.
Existing benchmarks measure performance almost exclusively on news articles and blog posts. On articles, the top systems all score above F1 = 0.90. That problem is largely solved. But a real web crawler hits far more than articles.
Product pages, discussion forums, documentation, SaaS marketing pages, category listings, content indexes — each has fundamentally different HTML structure, and heuristics tuned for articles break on them in predictable ways:
- Forum pages — user posts are wrapped in CSS classes like `comment` and `reply`, which article extractors have learned to strip as boilerplate. On a forum, that removes the actual content.
- Product pages — descriptions often live in JSON-LD structured data, not in the visible DOM. Extractors that only parse what they can see in the HTML miss them entirely.
- Service pages — content is spread across 5 to 15 separate `<section>` elements. An extractor that picks a single best node captures the hero section and throws away the rest.
- Documentation pages — the actual docs sit alongside sidebar nav, version pickers, and table-of-contents panels. Extractors often include the sidebar as if it were content.
- Collection pages — product grids are interleaved with filter panels and pagination. Hard to tell which elements are navigational and which are descriptions.
- Listing pages — repeated card elements where single-node extraction grabs one card instead of the full list.
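The product-page failure mode is worth making concrete. A sketch of pulling a description out of schema.org JSON-LD — the `product_description` helper is invented for this example, but the `application/ld+json` script-tag convention is standard — showing content that a DOM-only extractor never sees:

```python
import json
import re

LDJSON = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.S | re.I,
)

def product_description(html: str):
    """Return the description from a schema.org Product JSON-LD block, if any."""
    for match in LDJSON.finditer(html):
        try:
            data = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict) and item.get("@type") == "Product":
                return item.get("description")
    return None
```

On many e-commerce templates the visible DOM holds only a truncated teaser, while the full description lives in this structured-data block — so an extractor that skips `<script>` tags silently loses the main content.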
These are not tuning problems. They are architectural gaps. An article extractor cannot be tweaked into a good forum extractor — it needs a different strategy. Before WCEB, no benchmark made these gaps visible.
Why WCEB Was Created
WCEB was created to address a gap in how web content extraction systems are evaluated. The motivation came from practical work in SEO analysis.
In SEO, content extraction is used to approximate what search engines see when they evaluate a web page. Google strips away boilerplate to isolate the text that matters for ranking. If you want to understand how Google evaluates content, you need an extraction tool that produces similar output.
When you crawl a single website, removing boilerplate is straightforward. You know the template, you can write targeted selectors, and the HTML structure is consistent across pages. The real difficulty is analysing pages from the SERPs — competitor pages, ranking pages across an entire keyword set — where every page comes from a different domain with a different HTML structure, a different template, and a different way of embedding content. At that scale, you need a general-purpose extraction tool that works reliably across thousands of unknown page structures.
The problem I kept running into was that general-purpose tools worked well on articles but fell apart on the page types that matter most for commercial SEO: product pages, category pages, service pages, and documentation. Nobody was measuring performance on these page types. Every benchmark was article-only or mixed-but-unlabelled. Without page-type-specific evaluation, developers had no signal that their tools were failing on a substantial fraction of the web.
WCEB was built to provide that signal.
What WCEB Contains
The Web Content Extraction Benchmark contains 2,008 web pages from 1,613 unique domains. Each page is annotated with ground truth content across one of seven page types: articles (793 pages), service pages (165 pages), products (119 pages), collections (117 pages), forums (113 pages), listings (99 pages), and documentation (91 pages).
WCEB is split into a 1,497-page development set and a 511-page held-out test set with matched page type distributions. The held-out set was constructed from a separate pool of pages, reviewed independently, and never used during any extraction system's development. This split allows researchers to develop against the development set while validating that results generalise on the test set.
Each ground truth annotation includes the full main content as plain text, page title, author name, publication date, and two sets of evaluation snippets: `with[]` snippets that must appear in a correct extraction (testing completeness) and `without[]` snippets from boilerplate that must not appear (testing precision). Evaluation uses word-level set-overlap F1, consistent with prior extraction benchmarks.
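The two evaluation mechanisms can be sketched as follows — a minimal implementation of word-level set-overlap F1 plus the snippet checks, with function names chosen for the example rather than taken from the benchmark's actual harness:

```python
def word_f1(extracted: str, gold: str) -> float:
    """Word-level set-overlap F1 between an extraction and the ground truth."""
    pred, ref = set(extracted.lower().split()), set(gold.lower().split())
    if not pred or not ref:
        return 1.0 if pred == ref else 0.0
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def snippet_checks(extracted: str, with_snippets, without_snippets):
    """Completeness and precision checks against the annotation's snippet lists."""
    missing = [s for s in with_snippets if s not in extracted]   # should be empty
    leaked = [s for s in without_snippets if s in extracted]     # should be empty
    return missing, leaked
```

Set overlap deliberately ignores word order and repetition, which keeps the metric robust to formatting differences between extractors at the cost of not penalising scrambled output.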
WCEB is released under CC-BY-4.0 and is available on GitHub, Zenodo (with DOI for citation), and HuggingFace Datasets.
What the Leaderboard Reveals
The overall F1 numbers are interesting. The per-page-type breakdown is where it gets revealing.
On articles, every top system lands between F1 = 0.88 and 0.93. The differences are marginal — pick any of the top five and you will get good results on news articles and blog posts.
On other page types, the picture falls apart. Forums show a 26.4-point spread between the best and worst system. Collections: 29.6 points. Products: 20.7 points. That is the difference between extracting most of the content and extracting almost none of it.