Why WCEB?
Existing content extraction benchmarks focus almost exclusively on news articles. On articles, the top extractors all converge above F1 = 0.90 --- the problem is largely solved. But the web is not just articles.
A production web crawler encounters product pages, forums, documentation, marketing pages, and category listings. Each has fundamentally different HTML structure, and heuristics tuned for articles fail on them in predictable ways. WCEB measures what article-only benchmarks cannot.
Page Types
Article (793 pages)
Blog posts, news articles, editorials. Single content container with sequential paragraphs. The well-studied case.
Forum (113 pages)
Discussion threads with multiple user posts. Extractors that treat comment CSS classes as boilerplate lose the primary content.
Service (165 pages)
SaaS marketing pages with content distributed across hero, features, testimonials, pricing, and FAQ sections.
Product (119 pages)
E-commerce product pages where descriptions are often in JSON-LD structured data rather than visible DOM elements.
Collection (117 pages)
Category and collection pages with product grids, filter panels, and pagination mixed with content descriptions.
Listing (99 pages)
Content indexes with repeated card elements --- article lists, course catalogues, review roundups.
Documentation (91 pages)
Technical docs with code blocks, sidebar navigation, and version selectors. Spans Sphinx, Rustdoc, MDN, and custom platforms.
Development and Test Splits
The Web Content Extraction Benchmark is split into a 1,497-page development set and a 511-page held-out test set. The test set was constructed from a separate pool of pages, reviewed independently, and never used during any extraction system's development or evaluation-driven tuning. Page type distributions are matched within 4 percentage points between splits.
Annotation Methodology
1 LLM-Assisted Drafting
Draft annotations are generated from the HTML source using Claude, which extracts the title, author, date, and main content text.
2 Automated Verification
Scripts verify the generated annotations against the source HTML --- checking encoding, content bounds, JSON schema validity, and structural consistency.
3 4-Pass Frontier Model Review
Four independent review passes using Claude Opus check content completeness, content boundaries, metadata accuracy, and internal consistency. Each pass compares annotations against the source HTML and applies fixes.
4 Snippet Verification
Automated scripts verify that all with[] snippets appear verbatim in the main content and no without[] snippets are present. A 21-point quality scan checks encoding, boilerplate contamination, metadata plausibility, and cross-reference integrity.
5 Human Review
The human annotator reviews all changes from the preceding stages, adjudicates disagreements, and performs final quality assurance on the complete dataset.
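The snippet-verification stage above amounts to a verbatim containment check in both directions. A minimal sketch of that check (the function name and violation messages are illustrative, not the benchmark's actual tooling):

```python
def verify_snippets(main_content: str, with_snippets: list[str],
                    without_snippets: list[str]) -> list[str]:
    """Return a list of violations; an empty list means the annotation passes."""
    violations = []
    for s in with_snippets:
        # Every with[] snippet must appear verbatim in the main content.
        if s not in main_content:
            violations.append(f"missing with[] snippet: {s[:40]!r}")
    for s in without_snippets:
        # No without[] (boilerplate) snippet may appear in the main content.
        if s in main_content:
            violations.append(f"boilerplate snippet present: {s[:40]!r}")
    return violations
```

Because the check is exact substring matching, any normalisation (encoding fixes, whitespace cleanup) must happen before snippets are drafted, which is what the earlier automated-verification stage enforces.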
Ground Truth Format
Each ground truth annotation includes the full main content as plain text, page title, author name, publication date, and two sets of evaluation snippets: with[] snippets that must appear in a correct extraction (testing completeness) and without[] snippets from boilerplate that must not appear (testing precision). Content is never truncated. The longest annotation is 49,617 characters.
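The fields described above can be pictured as a record like the following. This is an illustrative sketch only; the field names and values are hypothetical and do not reproduce the benchmark's actual schema:

```python
# Hypothetical annotation record -- field names are illustrative,
# not the benchmark's actual JSON schema.
annotation = {
    "title": "Example Product Review",
    "author": "Jane Doe",
    "date": "2025-06-01",
    # Full plain-text main content; never truncated.
    "main_content": "Full review text as extracted from the page...",
    # Snippets that must appear verbatim in a correct extraction (completeness).
    "with": ["Full review text as extracted"],
    # Boilerplate snippets that must not appear (precision).
    "without": ["Subscribe to our newsletter"],
}
```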
Evaluation Metric
Extraction quality is measured using word-level set-overlap F1. Extracted text and ground truth are tokenised into word bags (lowercased, splitting on non-alphanumeric characters), and precision, recall, and F1 are computed from word overlap. This metric is insensitive to whitespace and formatting differences while penalising missing or extraneous content. The word-bag approach is consistent with prior extraction benchmarks.
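The metric above can be sketched in a few lines. This version counts tokens as a multiset (a "bag"), which is one reasonable reading of the description; a set-based variant would simply drop the counts:

```python
import re
from collections import Counter

def tokenize(text: str) -> Counter:
    # Lowercase, then split on runs of non-alphanumeric characters.
    return Counter(t for t in re.split(r"[^a-z0-9]+", text.lower()) if t)

def word_bag_f1(extracted: str, ground_truth: str) -> tuple[float, float, float]:
    """Precision, recall, and F1 over bags of word tokens."""
    pred, gold = tokenize(extracted), tokenize(ground_truth)
    overlap = sum((pred & gold).values())  # multiset intersection size
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Note how punctuation and casing vanish under this tokenisation: "Hello, world!" and "hello world" score a perfect 1.0, while every missing or extraneous word lowers recall or precision respectively.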
Citation
@article{foley2026wceb,
  title={{WCEB}: A Multi-Type Web Content Extraction Benchmark},
  author={Foley, Murrough},
  year={2026},
  url={https://github.com/Murrough-Foley/web-content-extraction-benchmark},
  doi={10.5281/zenodo.19316874}
}
Murrough Foley
Technical SEO consultant and researcher building tools at the intersection of search engine optimization, high-performance web data extraction, and applied machine learning.
I've spent 15+ years in SEO — from affiliate sites and local SEO to enterprise product management and large-scale content operations. These days I focus on technical SEO, programmatic data pipelines, and building the tools I wish existed when I was running audits across thousands of pages.
Released under CC-BY-4.0. Free to use for research and commercial purposes with attribution.