What Is Web Content Extraction?
Web content extraction is the task of separating the main content of a web page from its surrounding boilerplate — the navigation menus, cookie banners, ads, sidebar widgets, footer links, and social sharing buttons that make up a modern web page. The main content is whatever the page was actually created to deliver.
Anyone who processes web pages at scale needs content extraction. Search engines use it for indexing. RAG systems use it to feed clean context to large language models. NLP researchers use it to build training corpora. And SEO practitioners use it to approximate what Google actually sees when it evaluates a page.
The difficulty is that there is no standard way to mark the main content of a web page. It is just HTML elements — some containing the article text, most containing chrome around it. A typical page is roughly 80% boilerplate. An extraction system has to figure out which 20% matters, and it has to do it reliably across millions of pages with wildly different structures.
How Web Content Extraction Works
There are broadly three approaches to web content extraction: rule-based heuristics, neural classification, and hybrid systems that combine both.
Rule-based heuristic systems dominate production use. They score DOM elements by text density, paragraph count, link ratio, and CSS class names, then select the highest-scoring element as main content. They are fast — typically under 100ms per page — and need no GPU. The ecosystem here is strong. Trafilatura, created by Adrien Barbaresi, is the most widely used Python extraction library and powers large-scale pipelines like FineWeb and RefinedWeb. Mozilla's Readability is the engine behind Firefox's Reader View. jusText classifies text blocks by stopword frequency, and BoilerPy3 models extraction as sequence labeling over text blocks.
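The scoring idea behind these heuristics can be sketched in a few lines. This is a deliberately toy illustration, not how Trafilatura or Readability actually score nodes — the function names and the weight on paragraph count are invented for the example — but it shows why a long, paragraph-heavy, low-link block outranks a navigation menu:

```python
import re

def link_density(block_html: str) -> float:
    """Fraction of a block's words that sit inside <a> tags."""
    anchor_text = " ".join(re.findall(r"<a\b[^>]*>(.*?)</a>", block_html, re.S | re.I))
    all_text = re.sub(r"<[^>]+>", " ", block_html)
    total = len(all_text.split())
    return len(anchor_text.split()) / total if total else 1.0

def score_block(block_html: str) -> float:
    """Toy content score: reward word count and paragraphs, punish link-heavy blocks."""
    text = re.sub(r"<[^>]+>", " ", block_html)
    n_words = len(text.split())
    n_paragraphs = block_html.lower().count("<p>")
    return (n_words + 20 * n_paragraphs) * (1.0 - link_density(block_html))
```

A `<div>` of article paragraphs scores high; a `<ul>` of anchor-only menu items has link density near 1.0 and scores near zero. Real extractors layer many more signals (CSS class heuristics, DOM depth, cross-block smoothing) on top of this core idea.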
Neural extraction systems are a more recent development. MinerU-HTML from OpenDataLab fine-tunes a Qwen3-0.6B model to label each HTML element as main content or boilerplate — a neat approach that avoids hallucination since the model classifies rather than generates. ReaderLM-v2 from Jina AI takes a different route, using a 1.5B parameter model to convert raw HTML directly into Markdown. Both are impressive, but they need GPU inference and run 36 to 237 times slower than heuristic systems.
Hybrid systems try to get the best of both worlds. A heuristic extractor handles every page first. An ML confidence predictor flags the ones it is unsure about, and those get routed to a neural model for a second pass. You get heuristic speed on the 90%+ of pages where it works well, and neural quality on the hard cases.
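The routing logic itself is simple. A minimal sketch, assuming three hypothetical callables — a heuristic extractor, a confidence predictor, and a neural extractor (none of these names come from a real library):

```python
from typing import Callable

def route_extraction(
    html: str,
    heuristic: Callable[[str], str],
    confidence: Callable[[str, str], float],
    neural: Callable[[str], str],
    threshold: float = 0.8,
) -> str:
    """Hybrid routing: heuristic pass first, neural second pass only
    when the confidence predictor flags the result as unreliable."""
    draft = heuristic(html)
    if confidence(html, draft) >= threshold:
        return draft      # fast path: most pages stop here
    return neural(html)   # slow path: hard cases only
```

The threshold is the knob that trades cost for quality: raise it and more pages pay for GPU inference, lower it and more borderline extractions ship as-is.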
The Problem WCEB Solves
The Web Content Extraction Benchmark solves the problem of invisible extraction failures on non-article page types by providing page-type-stratified evaluation across seven structurally distinct categories.
Existing benchmarks measure performance almost exclusively on news articles and blog posts. On articles, the top systems all score above F1 = 0.90. That problem is largely solved. But a real web crawler hits far more than articles.
Product pages, discussion forums, documentation, SaaS marketing pages, category listings, content indexes — each has fundamentally different HTML structure, and heuristics tuned for articles break on them in predictable ways:
- Forum pages — user posts are wrapped in CSS classes like `comment` and `reply`, which article extractors have learned to strip as boilerplate. On a forum, that removes the actual content.
- Product pages — descriptions often live in JSON-LD structured data, not in the visible DOM. Extractors that only parse what they can see in the HTML miss them entirely.
- Service pages — content is spread across 5 to 15 separate `<section>` elements. An extractor that picks a single best node captures the hero section and throws away the rest.
- Documentation pages — the actual docs sit alongside sidebar nav, version pickers, and table-of-contents panels. Extractors often include the sidebar as if it were content.
- Collection pages — product grids are interleaved with filter panels and pagination. Hard to tell which elements are navigational and which are descriptions.
- Listing pages — repeated card elements where single-node extraction grabs one card instead of the full list.
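The product-page failure mode is worth making concrete. A sketch of pulling a description out of schema.org JSON-LD — the `product_description` helper is invented for this example, but the `application/ld+json` script-tag convention is standard — showing content that a DOM-only extractor never sees:

```python
import json
import re

LDJSON = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.S | re.I,
)

def product_description(html: str):
    """Return the description from a schema.org Product JSON-LD block, if any."""
    for match in LDJSON.finditer(html):
        try:
            data = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict) and item.get("@type") == "Product":
                return item.get("description")
    return None
```

On many e-commerce templates the visible DOM holds only a truncated teaser, while the full description lives in this structured-data block — so an extractor that skips `<script>` tags silently loses the main content.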
These are not tuning problems. They are architectural gaps. An article extractor cannot be tweaked into a good forum extractor — it needs a different strategy. Before WCEB, no benchmark made these gaps visible.
Why WCEB Was Created
WCEB was created to address a gap in how web content extraction systems are evaluated. The motivation came from practical work in SEO analysis.
In SEO, content extraction is used to approximate what search engines see when they evaluate a web page. Google strips away boilerplate to isolate the text that matters for ranking. If you want to understand how Google evaluates content, you need an extraction tool that produces similar output.
When you crawl a single website, removing boilerplate is straightforward. You know the template, you can write targeted selectors, and the HTML structure is consistent across pages. The real difficulty is analysing pages from the SERPs — competitor pages, ranking pages across an entire keyword set — where every page comes from a different domain with a different HTML structure, a different template, and a different way of embedding content. At that scale, you need a general-purpose extraction tool that works reliably across thousands of unknown page structures.
The problem I kept running into was that general-purpose tools worked well on articles but fell apart on the page types that matter most for commercial SEO: product pages, category pages, service pages, and documentation. Nobody was measuring performance on these page types. Every benchmark was article-only or mixed-but-unlabelled. Without page-type-specific evaluation, developers had no signal that their tools were failing on a substantial fraction of the web.
WCEB was built to provide that signal.
What WCEB Contains
The Web Content Extraction Benchmark contains 2,008 web pages from 1,613 unique domains. Each page is annotated with ground truth content across one of seven page types: articles (793 pages), service pages (165 pages), products (119 pages), collections (117 pages), forums (113 pages), listings (99 pages), and documentation (91 pages).
WCEB is split into a 1,497-page development set and a 511-page held-out test set with matched page type distributions. The held-out set was constructed from a separate pool of pages, reviewed independently, and never used during any extraction system's development. This split allows researchers to develop against the development set while validating that results generalise on the test set.
Each ground truth annotation includes the full main content as plain text, page title, author name, publication date, and two sets of evaluation snippets: `with[]` snippets that must appear in a correct extraction (testing completeness) and `without[]` snippets from boilerplate that must not appear (testing precision). Evaluation uses word-level set-overlap F1, consistent with prior extraction benchmarks.
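The two evaluation mechanisms can be sketched as follows — a minimal implementation of word-level set-overlap F1 plus the snippet checks, with function names chosen for the example rather than taken from the benchmark's actual harness:

```python
def word_f1(extracted: str, gold: str) -> float:
    """Word-level set-overlap F1 between an extraction and the ground truth."""
    pred, ref = set(extracted.lower().split()), set(gold.lower().split())
    if not pred or not ref:
        return 1.0 if pred == ref else 0.0
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def snippet_checks(extracted: str, with_snippets, without_snippets):
    """Completeness and precision checks against the annotation's snippet lists."""
    missing = [s for s in with_snippets if s not in extracted]   # should be empty
    leaked = [s for s in without_snippets if s in extracted]     # should be empty
    return missing, leaked
```

Set overlap deliberately ignores word order and repetition, which keeps the metric robust to formatting differences between extractors at the cost of not penalising scrambled output.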
WCEB is released under CC-BY-4.0 and is available on GitHub, Zenodo (with DOI for citation), and HuggingFace Datasets.
What the Leaderboard Reveals
The overall F1 numbers are interesting. The per-page-type breakdown is where it gets revealing.
On articles, every top system lands between F1 = 0.88 and 0.93. The differences are marginal — pick any of the top five and you will get good results on news articles and blog posts.
On other page types, the picture falls apart. Forums show a 26.4-point spread between the best and worst system. Collections: 29.6 points. Products: 20.7 points. That is the difference between extracting most of the content and extracting almost none of it.