This document describes the HTTP Archive reports system, which automatically generates standardized reports from HTTP Archive crawl data.
The reports system generates Dataform operations that:
- Calculate metrics from HTTP Archive crawl data
- Store results in BigQuery tables partitioned by date and clustered by metric/lens/client
- Export data to Cloud Storage as JSON files for consumption by external systems
Key files:
- `includes/reports.js` - Defines metrics and lenses
- `definitions/output/reports/reports_dynamic.js` - Generates Dataform operations
- `includes/constants.js` - Provides shared constants and the `DataformTemplateBuilder`
The system supports two types of SQL queries:
**Histogram**
- Purpose: Distribution analysis with binned data
- Output: Contains `bin`, `volume`, `pdf`, `cdf` columns
- Use case: Page weight distributions, performance metric distributions
- Export path: `reports/{date_folder}/{metric_id}_test.json`

⚠️ Do NOT use for: boolean/binary metrics (present/not present) - only two states don't create a meaningful distribution
**Timeseries**
- Purpose: Trend analysis over time
- Output: Contains percentile data (p10, p25, p50, p75, p90) with timestamps
- Use case: Performance trends, adoption over time, boolean/adoption metrics
- Export path: `reports/{metric_id}_test.json`
| Metric Type | Use Timeseries | Use Histogram | Use Both |
|---|---|---|---|
| Boolean/Adoption (present/not present) | ✅ Always | ❌ Never | ❌ |
| Percentage/Rate | ✅ Yes | ❌ Rarely useful | ❌ |
| Continuous values (bytes, time, count) | ✅ For percentiles | ✅ For distribution | ✅ Often |
Key Rule: Always use timeseries for boolean/adoption metrics; histogram only for continuous distributions.
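The decision table above can be captured in a small helper. This is an illustrative sketch only; the function name and metric-kind labels are not part of the actual codebase, where each metric simply lists its SQL entries:

```javascript
// Hypothetical helper illustrating the SQL-type selection rule.
// The real config just declares one SQL entry per type under each metric.
function sqlTypesFor (metricKind) {
  switch (metricKind) {
    case 'boolean': // adoption / present-or-not metrics
    case 'rate': // percentages
      return ['timeseries']
    case 'continuous': // bytes, time, counts
      return ['histogram', 'timeseries']
    default:
      throw new Error(`Unknown metric kind: ${metricKind}`)
  }
}

console.log(sqlTypesFor('boolean')) // → [ 'timeseries' ]
console.log(sqlTypesFor('continuous')) // → [ 'histogram', 'timeseries' ]
```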
Lenses allow filtering data by different criteria:
- `all` - No filter, all pages
- `top1k` - Top 1,000 ranked sites
- `top10k` - Top 10,000 ranked sites
- `top100k` - Top 100,000 ranked sites
- `top1m` - Top 1,000,000 ranked sites
- `drupal` - Sites using Drupal
- `magento` - Sites using Magento
- `wordpress` - Sites using WordPress
- Configurable start and end dates
- Processes data month by month using `constants.fnPastMonth()`
- Supports retrospective report generation
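The month-by-month iteration can be sketched as follows. This is a stand-in for illustration, assuming the `DATE_RANGE` shape shown later in this document; the actual stepping logic lives in `constants.fnPastMonth()`:

```javascript
// Illustrative month-by-month date generation between a start and end date
// (inclusive), stepping one calendar month at a time in UTC.
function monthsBetween (startDate, endDate) {
  const dates = []
  let current = new Date(startDate + 'T00:00:00Z')
  const end = new Date(endDate + 'T00:00:00Z')
  while (current <= end) {
    dates.push(current.toISOString().slice(0, 10))
    // Advance to the first day of the next month.
    current = new Date(Date.UTC(current.getUTCFullYear(), current.getUTCMonth() + 1, 1))
  }
  return dates
}

console.log(monthsBetween('2025-01-01', '2025-04-01'))
// → [ '2025-01-01', '2025-02-01', '2025-03-01', '2025-04-01' ]
```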
Determine which SQL type(s) to use based on your metric:
- Boolean/Adoption metrics (e.g., feature presence, file exists): Use timeseries only
- Continuous metrics (e.g., page weight, load time): Use both histogram and timeseries
- Percentages/Rates: Use timeseries only
Add your metric to the `_metrics` object in `includes/reports.js`:
const config = {
_metrics: {
// Existing metrics...
myNewMetric: {
SQL: [
{
type: 'histogram', // or 'timeseries'
query: DataformTemplateBuilder.create((ctx, params) => `
WITH pages AS (
SELECT
date,
client,
-- Your binning logic for histogram
CAST(FLOOR(your_metric_value / bin_size) * bin_size AS INT64) AS bin
FROM ${ctx.ref('crawl', 'pages')}
WHERE
date = '${params.date}'
${params.devRankFilter}
${params.lens.sql}
AND is_root_page
AND your_metric_value > 0
)
-- Your aggregation logic here
SELECT
*,
SUM(pdf) OVER (PARTITION BY client ORDER BY bin) AS cdf
FROM (
-- Calculate probability density function
SELECT
*,
volume / SUM(volume) OVER (PARTITION BY client) AS pdf
FROM (
SELECT
*,
COUNT(0) AS volume
FROM pages
WHERE bin IS NOT NULL
GROUP BY date, client, bin
)
)
ORDER BY bin, client
`)
}
]
}
}
}

The metric will be automatically included in the next run of `reports_dynamic.js`. The system will generate operations for all combinations of:
- Your new metric
- All available lenses
- All SQL types you defined
- The configured date range
Check that the generated operations:
- Create the expected BigQuery tables
- Populate data correctly
- Export to Cloud Storage in the expected format
Your SQL template receives these parameters:
{
date: '2025-07-01', // Current processing date
devRankFilter: 'AND rank <= 10000', // Development filter
lens: {
name: 'top1k', // Lens name
sql: 'AND rank <= 1000' // Lens SQL filter
},
metric: { id: 'myMetric', ... }, // Metric configuration
sql: { type: 'histogram', ... } // SQL type configuration
}

Histogram output columns:
- `date` - Processing date
- `client` - 'desktop' or 'mobile'
- `bin` - Numeric bin value
- `volume` - Count of pages in this bin
- `pdf` - Probability density function value
- `cdf` - Cumulative distribution function value

Timeseries output columns:
- `date` - Processing date
- `client` - 'desktop' or 'mobile'
- `timestamp` - Unix timestamp in milliseconds
- `p10`, `p25`, `p50`, `p75`, `p90` - Percentile values
Every metric query MUST include these patterns:
WHERE
date = '${params.date}' -- Date filter
AND is_root_page -- Root page filter
${params.lens.sql} -- Lens filtering
${params.devRankFilter} -- Dev environment sampling
-- Use:
${ctx.ref('crawl', 'pages')} -- Proper table reference
GROUP BY client
ORDER BY client

Percentage/adoption pattern:

ROUND(SAFE_DIVIDE(
  COUNTIF(condition),
  COUNT(0)
) * 100, 2) AS pct_pages

Percentile pattern:

ROUND(APPROX_QUANTILES(FLOAT64(metric), 1001)[OFFSET(101)] / 1024, 2) AS p10,
ROUND(APPROX_QUANTILES(FLOAT64(metric), 1001)[OFFSET(251)] / 1024, 2) AS p25,
ROUND(APPROX_QUANTILES(FLOAT64(metric), 1001)[OFFSET(501)] / 1024, 2) AS p50,
ROUND(APPROX_QUANTILES(FLOAT64(metric), 1001)[OFFSET(751)] / 1024, 2) AS p75,
ROUND(APPROX_QUANTILES(FLOAT64(metric), 1001)[OFFSET(901)] / 1024, 2) AS p90
-- Important: for continuous metrics, add the WHERE condition AND FLOAT64(metric) > 0

Histogram binning pattern:

-- Innermost subquery:
CAST(FLOOR(FLOAT64(metric) / bin_size) * bin_size AS INT64) AS bin,
COUNT(0) AS volume
-- Wrap with: volume / SUM(volume) OVER (PARTITION BY client) AS pdf
-- Wrap with: SUM(pdf) OVER (PARTITION BY client ORDER BY bin) AS cdf

Best practices:
- Filter root pages: Always include `AND is_root_page` unless you specifically need all pages
- Handle null values: Use appropriate null checks and filtering
- Use consistent binning: For histograms, use logical bin sizes (e.g., 100KB increments for page weight)
- Optimize performance: Use appropriate WHERE clauses and avoid expensive operations
- Test with dev filters: Your queries should work with the development rank filter
- Use safe functions: `SAFE.BOOL()` for custom metrics, `SAFE_DIVIDE()` for percentages
Each lens's SQL is a valid BigQuery WHERE clause condition that can be appended to the main query.
For each combination of date, metric, SQL type, and lens, the system:
- Creates a unique operation name: `{metricId}_{sqlType}_{date}_{lensName}`
- Generates BigQuery SQL that:
  - Deletes existing data for the date/metric/lens combination
  - Inserts new calculated data
  - Exports results to Cloud Storage
- Tags operations with the `crawl_complete` tag so they are triggered on crawl completion
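The naming scheme above can be sketched as a tiny helper. This is illustrative only; the exact formatting (in particular, how hyphens in the date are encoded) is defined in `reports_dynamic.js`, and the date encoding here is an assumption:

```javascript
// Illustrative operation-name builder following the documented
// {metricId}_{sqlType}_{date}_{lensName} scheme. Encoding the date's
// hyphens as underscores is an assumption for this sketch.
function operationName (metricId, sqlType, date, lensName) {
  const dateKey = date.split('-').join('_')
  return [metricId, sqlType, dateKey, lensName].join('_')
}

console.log(operationName('bytesTotal', 'histogram', '2025-07-01', 'top1k'))
// → bytesTotal_histogram_2025_07_01_top1k
```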
Reports are stored in BigQuery tables with this structure:
- Partitioned by: `date`
- Clustered by: `metric`, `lens`, `client`
- Dataset: `reports`
- Naming: `{metricId}_{sqlType}` (e.g., `bytesTotal_histogram`)
- Data is calculated and stored in BigQuery
- A `run_export_job` function exports filtered data to Cloud Storage
- Export paths follow the pattern:
  - Histogram: `reports/[{lens}/]{date_underscore}/{metric_id}.json`
  - Timeseries: `reports/[{lens}/]{metric_id}.json`
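The path patterns above can be sketched as a small helper. This is a sketch under two assumptions that are not confirmed by the source: the lens segment is omitted for the default `all` lens, and `{date_underscore}` means the date with hyphens replaced by underscores:

```javascript
// Illustrative export-path builder. Assumptions (not confirmed by the
// docs): the 'all' lens omits the lens segment, and the histogram date
// folder uses underscores (e.g. 2025_07_01).
function exportPath (sqlType, metricId, date, lens) {
  const lensSegment = lens && lens !== 'all' ? `${lens}/` : ''
  const dateFolder = date.split('-').join('_')
  return sqlType === 'histogram'
    ? `reports/${lensSegment}${dateFolder}/${metricId}.json`
    : `reports/${lensSegment}${metricId}.json`
}

console.log(exportPath('histogram', 'bytesJs', '2025-07-01', 'top1k'))
// → reports/top1k/2025_07_01/bytesJs.json
console.log(exportPath('timeseries', 'bytesJs', '2025-07-01', 'all'))
// → reports/bytesJs.json
```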
- Development: Uses `TABLESAMPLE` and rank filters for faster processing
- Production: Processes full datasets
- Environment detection: Automatic based on `dataform.projectConfig.vars.environment`
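The environment switch can be sketched as below. The `'dev'` value and the exact filter string are illustrative assumptions (the filter matches the `devRankFilter` example shown earlier in this document):

```javascript
// Illustrative environment-based sampling filter. The environment value
// 'dev' and the rank threshold are assumptions for this sketch; the real
// value comes from dataform.projectConfig.vars.environment.
function devRankFilter (environment) {
  // In development, sample down to highly ranked pages for speed;
  // in production, apply no extra filter.
  return environment === 'dev' ? 'AND rank <= 10000' : ''
}

console.log(devRankFilter('dev')) // → AND rank <= 10000
console.log(devRankFilter('production')) // → (empty string)
```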
Modify the `DATE_RANGE` object in `reports_dynamic.js`:
const DATE_RANGE = {
startDate: '2025-01-01', // Start processing from this date
endDate: '2025-07-01' // Process up to this date
}

Modify the `EXPORT_CONFIG` object:
const EXPORT_CONFIG = {
bucket: 'your-storage-bucket',
storagePath: 'reports/',
dataset: 'reports',
testSuffix: '.json'
}

- Check operation logs in Dataform for SQL errors
- Verify table creation in BigQuery console
- Check export logs in Cloud Run for export errors
- Verify Cloud Storage paths for exported files
- Test SQL templates individually before adding
- Use development environment with smaller datasets for testing
For metrics that track whether a feature/file exists (present or not present), use timeseries only:
llmsTxtAdoption: {
SQL: [
{
type: 'timeseries',
query: DataformTemplateBuilder.create((ctx, params) => `
SELECT
client,
ROUND(SAFE_DIVIDE(
COUNTIF(SAFE.BOOL(custom_metrics.other.llms_txt_validation.valid)),
COUNT(0)
) * 100, 2) AS pct_pages
FROM ${ctx.ref('crawl', 'pages')}
WHERE
date = '${params.date}'
AND is_root_page
${params.lens.sql}
${params.devRankFilter}
GROUP BY client
ORDER BY client
`)
}
]
}

Key points:
- Uses `SAFE_DIVIDE()` to avoid division by zero
- Uses `SAFE.BOOL()` for accessing custom_metrics that may not exist
- Returns `pct_pages` as the adoption percentage
- No histogram - boolean metrics don't have meaningful distributions
For metrics with continuous values (bytes, time, count), use both histogram and timeseries:
jsBytes: {
SQL: [
{
type: 'histogram',
query: DataformTemplateBuilder.create((ctx, params) => `
WITH pages AS (
SELECT
date,
client,
CAST(FLOOR(FLOAT64(summary.bytesJS) / 1024 / 50) * 50 AS INT64) AS bin
FROM ${ctx.ref('crawl', 'pages')}
WHERE
date = '${params.date}'
${params.devRankFilter}
${params.lens.sql}
AND is_root_page
AND INT64(summary.bytesJS) > 0
)
SELECT
*,
SUM(pdf) OVER (PARTITION BY client ORDER BY bin) AS cdf
FROM (
SELECT
*,
volume / SUM(volume) OVER (PARTITION BY client) AS pdf
FROM (
SELECT
*,
COUNT(0) AS volume
FROM pages
WHERE bin IS NOT NULL
GROUP BY date, client, bin
)
)
ORDER BY bin, client
`)
},
{
type: 'timeseries',
query: DataformTemplateBuilder.create((ctx, params) => `
WITH pages AS (
SELECT
date,
client,
FLOAT64(summary.bytesJS) AS bytesJS
FROM ${ctx.ref('crawl', 'pages')}
WHERE
date = '${params.date}'
${params.devRankFilter}
${params.lens.sql}
AND is_root_page
AND INT64(summary.bytesJS) > 0
)
SELECT
date,
client,
UNIX_DATE(date) * 1000 * 60 * 60 * 24 AS timestamp,
ROUND(APPROX_QUANTILES(bytesJS, 1001)[OFFSET(101)] / 1024, 2) AS p10,
ROUND(APPROX_QUANTILES(bytesJS, 1001)[OFFSET(251)] / 1024, 2) AS p25,
ROUND(APPROX_QUANTILES(bytesJS, 1001)[OFFSET(501)] / 1024, 2) AS p50,
ROUND(APPROX_QUANTILES(bytesJS, 1001)[OFFSET(751)] / 1024, 2) AS p75,
ROUND(APPROX_QUANTILES(bytesJS, 1001)[OFFSET(901)] / 1024, 2) AS p90
FROM pages
GROUP BY date, client, timestamp
ORDER BY date, client
`)
}
]
}

Key points:
- Histogram shows distribution across bins (50KB increments)
- Timeseries shows percentiles over time (p10, p25, p50, p75, p90)
- Both queries filter out zero values: `AND INT64(summary.bytesJS) > 0`
- Uses a CTE plus nested subqueries for clear structure
- Automatically generates reports for JavaScript bundle sizes across all lenses and the configured date range