Web Content Extraction Benchmark

The Web Content Extraction Benchmark (WCEB) is an open dataset for evaluating how well software extracts the main content from web pages. WCEB contains 2,008 annotated pages (1,497 development + 511 held-out test) and measures extraction quality across 7 structurally distinct page types, revealing performance gaps that article-only benchmarks cannot detect.

2,008 pages
7 page types
1,613 domains
12 extractors baselined

Web Content Extraction Leaderboard

This leaderboard is maintained as a hobby side project. Results are updated periodically as new extractors are evaluated and existing systems release updates. If you would like your extractor included, please get in touch. Last updated: April 2026.

Development Set Results

Word-level F1 on the WCEB development set (1,497 pages across 7 page types). F1 is the harmonic mean of precision (P) and recall (R), measured by word-level overlap between extracted and ground truth content.

# System Type F1 P R ms/page
1 rs-trafilatura Rule+ML 0.859 0.863 0.890 44
2 MinerU-HTML (0.6B) Neural 0.827 0.845 0.840 1,570
3 Trafilatura Rule 0.791 0.852 0.793 94
4 dom-smoothie Rule 0.762 0.806 0.768 27
5 ReaderLM-v2 (1.5B) Neural 0.741 0.741 0.790 10,410
6 dom-content-extraction Rule 0.731 0.757 0.789 ---
7 Newspaper4k Rule 0.720 0.838 0.683 869
8 magic-html Rule 0.719 0.813 0.713 ---
9 jusText Rule 0.707 0.771 0.695 ---
10 BoilerPy3 Rule 0.687 0.795 0.661 ---
11 Readability Rule 0.675 0.685 0.713 751
12 Goose3 Rule 0.651 0.845 0.593 ---

What Is Web Content Extraction?

Web content extraction is the task of separating the main content of a web page from its surrounding boilerplate — the navigation menus, cookie banners, ads, sidebar widgets, footer links, and social sharing buttons that make up a modern web page. The main content is whatever the page was actually created to deliver.

Anyone who processes web pages at scale needs content extraction. Search engines use it for indexing. RAG systems use it to feed clean context to large language models. NLP researchers use it to build training corpora. And SEO practitioners use it to approximate what Google actually sees when it evaluates a page.

The difficulty is that there is no standard way to mark the main content of a web page. It is just HTML elements — some containing the article text, most containing chrome around it. A typical page is roughly 80% boilerplate. An extraction system has to figure out which 20% matters, and it has to do it reliably across millions of pages with wildly different structures.

How Web Content Extraction Works

There are broadly three approaches to web content extraction: rule-based heuristics, neural classification, and hybrid systems that combine both.

Rule-based heuristic systems dominate production use. They score DOM elements by text density, paragraph count, link ratio, and CSS class names, then select the highest-scoring element as main content. They are fast — typically under 100ms per page — and need no GPU. The ecosystem here is strong. Trafilatura, created by Adrien Barbaresi, is the most widely used Python extraction library and powers large-scale pipelines like FineWeb and RefinedWeb. Mozilla's Readability is the engine behind Firefox's Reader View. jusText classifies text blocks by stopword frequency, and BoilerPy3 models extraction as sequence labeling over text blocks.
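
To make the scoring idea concrete, here is a minimal sketch in pure-stdlib Python. It is not any library's actual algorithm — real extractors like Trafilatura and Readability combine many more signals — but it shows the core move: credit each container element with the text it holds, penalise link-heavy containers, and pick the winner.

```python
from html.parser import HTMLParser

class DensityScorer(HTMLParser):
    """Toy density scorer: rank container elements by how much text
    they hold and how much of that text sits inside links."""

    CONTAINERS = ("div", "article", "main", "section")

    def __init__(self):
        super().__init__()
        self.open = []      # frames: [id, chars, link_chars, text_parts]
        self.in_link = 0
        self.results = {}   # id -> (score, text)
        self._next_id = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1
        if tag in self.CONTAINERS:
            self._next_id += 1
            self.open.append([self._next_id, 0, 0, []])

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1
        if tag in self.CONTAINERS and self.open:
            cid, chars, link_chars, parts = self.open.pop()
            link_ratio = link_chars / chars if chars else 1.0
            # Text-heavy, link-light containers score highest;
            # pure-navigation blocks (link_ratio near 1) score near zero.
            self.results[cid] = (chars * (1.0 - link_ratio), " ".join(parts))

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        for frame in self.open:  # credit text to every open ancestor
            frame[1] += len(text)
            if self.in_link:
                frame[2] += len(text)
            frame[3].append(text)

def extract_main(html: str) -> str:
    scorer = DensityScorer()
    scorer.feed(html)
    if not scorer.results:
        return ""
    return max(scorer.results.values())[1]
```

On a page with a link-only nav block and a text-heavy article body, the article container wins even though the nav comes first in the DOM — the link ratio zeroes out its score.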

Neural extraction systems are a more recent development. MinerU-HTML from OpenDataLab fine-tunes a Qwen3-0.6B model to label each HTML element as main content or boilerplate — a neat approach that avoids hallucination since the model classifies rather than generates. ReaderLM-v2 from Jina AI takes a different route, using a 1.5B parameter model to convert raw HTML directly into Markdown. Both are impressive, but they need GPU inference and run 36 to 237 times slower than heuristic systems.
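
The classify-rather-than-generate point can be sketched in a few lines. The `classify` stub below is a hypothetical stand-in for the fine-tuned model (a crude word-count rule, purely for illustration); what matters is the shape of the pipeline — the output is assembled only from input elements, so no word can be hallucinated.

```python
def classify(element_text: str) -> str:
    """Hypothetical stand-in for a fine-tuned element classifier;
    the real system scores each element with a neural network."""
    return "main" if len(element_text.split()) > 5 else "boilerplate"

def extract_by_labeling(elements: list[str]) -> str:
    # The model labels elements, it never generates text, so every
    # word in the output is guaranteed to appear on the page.
    return "\n".join(e for e in elements if classify(e) == "main")
```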

Hybrid systems try to get the best of both worlds. A heuristic extractor handles every page first. An ML confidence predictor flags the ones it is unsure about, and those get routed to a neural model for a second pass. You get heuristic speed on the 90%+ of pages where it works well, and neural quality on the hard cases.
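
A minimal sketch of that routing logic, with the three components as pluggable callables (the names and threshold here are illustrative, not taken from any shipped pipeline):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class HybridExtractor:
    """Confidence-gated routing: heuristic first, neural fallback only
    when the confidence model doubts the heuristic's output."""
    heuristic: Callable[[str], str]
    neural: Callable[[str], str]
    confidence: Callable[[str, str], float]  # (html, extracted) -> 0..1
    threshold: float = 0.5

    def extract(self, html: str) -> tuple[str, str]:
        text = self.heuristic(html)
        if self.confidence(html, text) >= self.threshold:
            return text, "heuristic"        # fast path: most pages stop here
        return self.neural(html), "neural"  # slow path: hard cases only
```

Because the neural model only ever sees the low-confidence slice, its latency cost is amortised over the whole crawl rather than paid per page.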

The Problem WCEB Solves

The Web Content Extraction Benchmark solves the problem of invisible extraction failures on non-article page types by providing page-type-stratified evaluation across 7 structurally distinct categories.

Every existing benchmark measures performance almost exclusively on news articles and blog posts. On articles, the top systems all score above F1 = 0.90. That problem is largely solved. But a real web crawler hits far more than articles.

Product pages, discussion forums, documentation, SaaS marketing pages, category listings, content indexes — each has fundamentally different HTML structure, and heuristics tuned for articles break on them in predictable ways.

These are not tuning problems. They are architectural gaps. An article extractor cannot be tweaked into a good forum extractor — it needs a different strategy. Before WCEB, no benchmark made these gaps visible.

Why WCEB Was Created

WCEB was created to address a gap in how web content extraction systems are evaluated. The motivation came from practical work in SEO analysis.

In SEO, content extraction is used to approximate what search engines see when they evaluate a web page. Google strips away boilerplate to isolate the text that matters for ranking. If you want to understand how Google evaluates content, you need an extraction tool that produces similar output.

When you crawl a single website, removing boilerplate is straightforward. You know the template, you can write targeted selectors, and the HTML structure is consistent across pages. The real difficulty is analysing pages from the SERPs (competitor pages, ranking pages across an entire keyword set), where every page comes from a different domain with a different HTML structure, a different template, and a different way of embedding content. At that scale, you need a general-purpose extraction tool that works reliably across thousands of unknown page structures.

The problem I kept running into was that general-purpose tools worked well on articles but fell apart on the page types that matter most for commercial SEO: product pages, category pages, service pages, and documentation. Nobody was measuring performance on these page types. Every benchmark was article-only or mixed-but-unlabelled. Without page-type-specific evaluation, developers had no signal that their tools were failing on a substantial fraction of the web.

WCEB was built to provide that signal.

What WCEB Contains

The Web Content Extraction Benchmark contains 2,008 web pages from 1,613 unique domains. Each page is annotated with ground truth content and labelled with one of seven page types: articles (793 pages), service pages (165 pages), products (119 pages), collections (117 pages), forums (113 pages), listings (99 pages), and documentation (91 pages).

WCEB is split into a 1,497-page development set and a 511-page held-out test set with matched page type distributions. The held-out set was constructed from a separate pool of pages, reviewed independently, and never used during any extraction system's development. This split allows researchers to develop against the development set while validating that results generalise on the test set.

Each ground truth annotation includes the full main content as plain text, page title, author name, publication date, and two sets of evaluation snippets: with[] snippets that must appear in a correct extraction (testing completeness) and without[] snippets from boilerplate that must not appear (testing precision). Evaluation uses word-level set-overlap F1, consistent with prior extraction benchmarks.
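
A simplified sketch of both checks, consistent with the description above (WCEB's actual scorer may tokenise or normalise differently):

```python
import re

def word_set(text: str) -> set[str]:
    """Lowercased word set; set overlap ignores order and repetition."""
    return set(re.findall(r"\w+", text.lower()))

def word_f1(extracted: str, gold: str) -> float:
    """Word-level set-overlap F1: harmonic mean of precision and recall."""
    e, g = word_set(extracted), word_set(gold)
    overlap = len(e & g)
    if not overlap:
        return 0.0
    precision = overlap / len(e)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def snippet_checks(extracted, with_snips, without_snips):
    """Completeness: every with[] snippet must appear.
    Precision: no without[] snippet may appear."""
    complete = all(s in extracted for s in with_snips)
    precise = not any(s in extracted for s in without_snips)
    return complete, precise
```

For example, an extraction that recovers two of three ground-truth words with no spurious words gets precision 1.0, recall 2/3, and F1 0.8.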

WCEB is released under CC-BY-4.0 and is available on GitHub, Zenodo (with DOI for citation), and HuggingFace Datasets.

What the Leaderboard Reveals

The overall F1 numbers are interesting. The per-page-type breakdown is where it gets revealing.

On articles, every top system lands between F1 = 0.88 and 0.93. The differences are marginal — pick any of the top five and you will get good results on news articles and blog posts.

On other page types, the picture falls apart. Forums show a 26.4-point spread between the best and worst system. Collections: 29.6 points. Products: 20.7 points. That is the difference between extracting most of the content and extracting almost none of it.

Per-Page-Type Breakdown

F1 on the development set by page type. The gap between article and non-article extraction reveals the blind spot in article-only benchmarks.

Page Type N rs-trafilatura MinerU-HTML Trafilatura ReaderLM-v2
Article 793 0.932 0.928 0.926 0.878
Documentation 91 0.931 0.838 0.888 0.776
Service 165 0.843 0.824 0.763 0.703
Forum 113 0.792 0.794 0.585 0.589
Collection 117 0.713 0.506 0.553 0.417
Listing 99 0.704 0.710 0.589 0.559
Product 119 0.670 0.619 0.567 0.463

This is the core finding. On articles, everyone converges. On everything else, they diverge. If you only test on articles, every system looks roughly the same. Test across page types, and you see 20 to 30 point gaps that were previously invisible.

Article-only benchmarks have hit their ceiling. The next wave of improvement in web content extraction has to come from the page types that existing benchmarks — and existing extractors — have been ignoring.

Neural Extraction Systems Do Not Solve the Problem

You might expect that throwing a language model at the problem would fix it. It does not. MinerU-HTML trails the best heuristic on 5 of 7 page types (it narrowly leads on forums and listings). ReaderLM-v2, despite being 2.5 times larger, underperforms three heuristic systems overall. On collections, both neural systems score below every top heuristic.

The reason is straightforward: these models were trained mostly on article-like content. They inherited the same article bias, just through a different mechanism. A bigger model trained on the same data distribution does not close a distributional gap. What is needed is type-aware training data or type-aware architectures — and WCEB gives you a way to measure whether either approach is actually working.

Hybrid Extraction Pipelines

The idea is simple: run the heuristic extractor on everything, check how confident it is in its own output, and send the low-confidence pages to a neural model for a second opinion. On WCEB, about 8% of pages get routed this way, and overall F1 improves without adding meaningful latency to the other 92%.

The catch is that routing has to be page-type-aware. MinerU-HTML helps on articles, forums, and service pages, but actually makes things worse on collections and products. A naive "send everything uncertain to the LLM" strategy backfires. The pipeline needs to know which types benefit from neural fallback and which are better left alone.
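
The type-aware gate is a one-line change to the routing condition. The set of "safe" fallback types and the threshold below are read off the WCEB results described above, purely for illustration — they are not taken from any shipped code:

```python
# Page types where the neural fallback improved F1 on WCEB (articles,
# forums, service pages) vs. types where it hurt (collections, products).
NEURAL_HELPS = {"article", "forum", "service"}

def route_to_neural(page_type: str, confidence: float,
                    threshold: float = 0.5) -> bool:
    """Fall back to the neural model only when the heuristic is unsure
    AND the page type is one where the fallback is known to help."""
    return confidence < threshold and page_type in NEURAL_HELPS
```

An uncertain forum page gets routed; an equally uncertain product page is left with the heuristic output, since the fallback would likely make it worse.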

Hybrid Pipeline

rs-trafilatura + MinerU-HTML fallback for low-confidence pages (~8% routed to LLM).

System Dev F1 Test F1 Dev Delta Test Delta Routed
Hybrid (rs-traf + MinerU) 0.862 0.910 +0.003 +0.017 ~8%
rs-trafilatura (alone) 0.859 0.893 baseline baseline 0%

Held-Out Test Set: Generalisation

WCEB includes a 511-page held-out test set that was never touched during any system's development. Every system scores higher on it than on the dev set (likely because the held-out annotations went through a more refined pipeline), but the ranking stays the same. The best system on dev is the best system on test. That consistency is what makes the benchmark useful — you can trust that improvements measured on the dev set will hold up.

Held-Out Test Set

Generalization results on the 511-page held-out set (never used during development).

# System F1
1 rs-trafilatura 0.893
2 Trafilatura 0.833
3 dom-smoothie 0.808
4 Newspaper4k 0.753
5 Readability 0.726
6 Goose3 0.687

Practical Applications

If you work in SEO, the per-page-type leaderboard tells you which extractor to trust for product pages, which one to use for service pages, and where every tool still falls short. That did not exist before.

If you build data pipelines, the confidence predictor behind the hybrid pipeline lets you run confidence-gated processing at crawl scale. Instead of manually checking outputs, you get a score that flags suspect pages for review or routes them to a heavier extraction method.

Citation

@article{foley2026wceb,
  title={{WCEB}: A Multi-Type Web Content Extraction Benchmark},
  author={Foley, Murrough},
  year={2026},
  url={https://github.com/Murrough-Foley/web-content-extraction-benchmark},
  doi={10.5281/zenodo.19316874}
}

Paper & Submissions

The WCEB dataset paper has been written and is ready for submission to arXiv. However, arXiv requires a first-time author in the relevant subject area to obtain an endorsement from an existing arXiv author before they can submit.

If you are an arXiv-endorsed author in cs.IR (Information Retrieval), cs.CL (Computation and Language), or a related area and would be willing to provide an endorsement, I would be very grateful. The paper describes the benchmark methodology, page type taxonomy, annotation pipeline, and baseline results presented on this page.

If you have built a web content extraction system and want to add it to the leaderboard, I would also welcome your results. The evaluation scripts and instructions are available in the GitHub repository.

Reach out on LinkedIn

Murrough Foley

Technical SEO consultant and researcher building tools at the intersection of search engine optimization, high-performance web data extraction, and applied machine learning.

I've spent 15+ years in SEO — from affiliate sites and local SEO to enterprise product management and large-scale content operations. These days I focus on technical SEO, programmatic data pipelines, and building the tools I wish existed when I was running audits across thousands of pages.