Why WCEB?
Existing content extraction benchmarks focus almost exclusively on news articles. On articles, the top extractors all converge above F1 = 0.90 --- the problem is largely solved. But the web is not just articles.
A production web crawler encounters product pages, forums, documentation, marketing pages, and category listings. Each has fundamentally different HTML structure, and heuristics tuned for articles fail on them in predictable ways. WCEB measures what article-only benchmarks cannot.
Page Types
Article (793 pages)
Blog posts, news articles, editorials. Single content container with sequential paragraphs. The well-studied case.
Forum (113 pages)
Discussion threads with multiple user posts. Extractors that treat comment CSS classes as boilerplate lose the primary content.
Service (165 pages)
SaaS marketing pages with content distributed across hero, features, testimonials, pricing, and FAQ sections.
Product (119 pages)
E-commerce product pages where descriptions are often in JSON-LD structured data rather than visible DOM elements.
Collection (117 pages)
Category and collection pages with product grids, filter panels, and pagination mixed with content descriptions.
Listing (99 pages)
Content indexes with repeated card elements --- article lists, course catalogues, review roundups.
Documentation (91 pages)
Technical docs with code blocks, sidebar navigation, and version selectors. Spans Sphinx, Rustdoc, MDN, and custom platforms.
Development and Test Splits
The Web Content Extraction Benchmark is split into a 1,497-page development set and a 511-page held-out test set. The test set was constructed from a separate pool of pages, reviewed independently, and never used during any extraction system's development or evaluation-driven tuning. Page type distributions are matched within 4 percentage points between splits.
Annotation Methodology
1 LLM-Assisted Drafting
Draft annotations are generated from the HTML source using Claude, which extracts the title, author, date, and main content text.
2 Automated Verification
Scripts verify the generated annotations against the source HTML --- checking encoding, content bounds, JSON schema validity, and structural consistency.
3 4-Pass Frontier Model Review
Four independent review passes using Claude Opus check content completeness, content boundaries, metadata accuracy, and internal consistency. Each pass compares annotations against the source HTML and applies fixes.
4 Snippet Verification
Automated scripts verify that all with[] snippets appear verbatim in the main content and no without[] snippets are present. A 21-point quality scan checks encoding, boilerplate contamination, metadata plausibility, and cross-reference integrity.
5 Human Review
The human annotator reviews all changes from the preceding stages, adjudicates disagreements, and performs final quality assurance on the complete dataset.
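The snippet-verification stage above amounts to a verbatim containment check in both directions. A minimal sketch of that check (the function name and violation messages are illustrative, not the benchmark's actual tooling):

```python
def verify_snippets(main_content: str, with_snippets: list[str],
                    without_snippets: list[str]) -> list[str]:
    """Return a list of violations; an empty list means the annotation passes."""
    violations = []
    for s in with_snippets:
        # Every with[] snippet must appear verbatim in the main content.
        if s not in main_content:
            violations.append(f"missing with[] snippet: {s[:40]!r}")
    for s in without_snippets:
        # No without[] (boilerplate) snippet may appear in the main content.
        if s in main_content:
            violations.append(f"boilerplate snippet present: {s[:40]!r}")
    return violations
```

Because the check is exact substring matching, any normalisation (encoding fixes, whitespace cleanup) must happen before snippets are drafted, which is what the earlier automated-verification stage enforces.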
Ground Truth Format
Each ground truth annotation includes the full main content as plain text, page title, author name, publication date, and two sets of evaluation snippets: with[] snippets that must appear in a correct extraction (testing completeness) and without[] snippets from boilerplate that must not appear (testing precision). Content is never truncated. The longest annotation is 49,617 characters.
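The fields described above can be pictured as a record like the following. This is an illustrative sketch only; the field names and values are hypothetical and do not reproduce the benchmark's actual schema:

```python
# Hypothetical annotation record -- field names are illustrative,
# not the benchmark's actual JSON schema.
annotation = {
    "title": "Example Product Review",
    "author": "Jane Doe",
    "date": "2025-06-01",
    # Full plain-text main content; never truncated.
    "main_content": "Full review text as extracted from the page...",
    # Snippets that must appear verbatim in a correct extraction (completeness).
    "with": ["Full review text as extracted"],
    # Boilerplate snippets that must not appear (precision).
    "without": ["Subscribe to our newsletter"],
}
```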
Evaluation Metric
Extraction quality is measured using word-level set-overlap F1. Extracted text and ground truth are tokenised into word bags (lowercased, splitting on non-alphanumeric characters), and precision, recall, and F1 are computed from word overlap. This metric is insensitive to whitespace and formatting differences while penalising missing or extraneous content. The word-bag approach is consistent with prior extraction benchmarks.
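The metric above can be sketched in a few lines. This version counts tokens as a multiset (a "bag"), which is one reasonable reading of the description; a set-based variant would simply drop the counts:

```python
import re
from collections import Counter

def tokenize(text: str) -> Counter:
    # Lowercase, then split on runs of non-alphanumeric characters.
    return Counter(t for t in re.split(r"[^a-z0-9]+", text.lower()) if t)

def word_bag_f1(extracted: str, ground_truth: str) -> tuple[float, float, float]:
    """Precision, recall, and F1 over bags of word tokens."""
    pred, gold = tokenize(extracted), tokenize(ground_truth)
    overlap = sum((pred & gold).values())  # multiset intersection size
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Note how punctuation and casing vanish under this tokenisation: "Hello, world!" and "hello world" score a perfect 1.0, while every missing or extraneous word lowers recall or precision respectively.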
Citation
@article{foley2026wceb,
  title={{WCEB}: A Multi-Type Web Content Extraction Benchmark},
  author={Foley, Murrough},
  year={2026},
  url={https://github.com/Murrough-Foley/web-content-extraction-benchmark},
  doi={10.5281/zenodo.19316874}
}
Murrough Foley
Technical SEO consultant and researcher building tools at the intersection of search engine optimization, high-performance web data extraction, and applied machine learning.
I've spent 15+ years in SEO — from affiliate sites and local SEO to enterprise product management and large-scale content operations. These days I focus on technical SEO, programmatic data pipelines, and building the tools I wish existed when I was running audits across thousands of pages.
Released under CC-BY-4.0. Free to use for research and commercial purposes with attribution.