A/B Testing for Ad Creatives: Traffic Splitting & Statistical Significance

The Difference Between Guessing and Knowing

Weighted rotation, which we built in Part 3, distributes traffic across creatives based on manually assigned weights. A/B testing goes further: it systematically compares creative variants to determine which one performs best, then shifts traffic toward the winner. This is the difference between guessing and knowing.

A proper A/B testing system needs four things: deterministic assignment (the same user always sees the same variant), independent metric tracking per variant, a statistical test to determine when a difference is real versus random noise, and automation to promote winners and retire losers.

This article implements all four. By the end you will have a complete experimentation layer that plugs into the ad resolver from Part 3 and the metrics system from Part 4.

How A/B Testing Works for Ads

The fundamental idea is simple: split users into groups, show each group a different creative, measure which group performs better.

The critical property is determinism: user X must always be assigned to the same variant for the duration of the experiment. If they see Variant A on Monday and Variant B on Tuesday, the results are contaminated.

The Experiment Data Model

An experiment is scoped to an ad group rather than a campaign or a creative directly. This is a deliberate design decision. A campaign spans multiple ad groups with different targeting strategies; testing at the campaign level would mix audiences and invalidate the comparison. A creative is the leaf node - too granular to hold experiment state. The ad group is the right level: it has a consistent audience definition and a pool of creatives to compare.

A few field choices are worth explaining before reading the code. isControl lives on ExperimentVariant, not on Experiment itself, because in a multi-variant test any variant can be designated the control - the entity that receives no treatment and provides the performance baseline. minSampleSize is the per-variant minimum, not the total: an experiment with two variants needs at least minSampleSize impressions on each before analysis begins, ensuring both sides have statistically comparable data. confidenceLevel is stored as an integer percentage (95, not 0.95) in a plain integer column: integer comparison avoids floating-point representation noise in a config field that rarely changes, and the analyzer divides by 100 when it needs the fraction.

// src/lib/types/ads.ts (additions)

export type ExperimentStatus = 'draft' | 'running' | 'paused' | 'completed'

export interface Experiment {
	id: string
	adGroupId: string
	name: string
	status: ExperimentStatus
	/** The metric to optimize: 'ctr' or 'cvr' */
	targetMetric: 'ctr' | 'cvr'
	/** Minimum sample size per variant before analysis */
	minSampleSize: number
	/** Confidence level for statistical test, as an integer percentage (e.g. 95 for 95%) */
	confidenceLevel: number
	/** The variant that won (set when experiment completes) */
	winnerId: string | null
	variants: ExperimentVariant[]
	createdAt: Date
	updatedAt: Date
}

export interface ExperimentVariant {
	id: string
	experimentId: string
	creativeId: string
	/** Traffic allocation percentage (0–100). All variants must sum to 100. */
	trafficPercent: number
	/** Bucket range start (inclusive, 0–99) */
	bucketStart: number
	/** Bucket range end (inclusive, 0–99) */
	bucketEnd: number
	isControl: boolean
}

Database Schema Addition

// Add to src/lib/server/db/schema.ts

export const experimentStatusEnum = pgEnum('experiment_status', [
	'draft',
	'running',
	'paused',
	'completed'
])

export const experiments = pgTable('experiments', {
	id: text('id').primaryKey(),
	adGroupId: text('ad_group_id')
		.notNull()
		.references(() => adGroups.id, { onDelete: 'cascade' }),
	name: text('name').notNull(),
	status: experimentStatusEnum('status').notNull().default('draft'),
	targetMetric: text('target_metric').notNull().default('ctr'),
	minSampleSize: integer('min_sample_size').notNull().default(1000),
	confidenceLevel: integer('confidence_level').notNull().default(95), // stored as 95 not 0.95
	winnerId: text('winner_id'),
	createdAt: timestamp('created_at').defaultNow().notNull(),
	updatedAt: timestamp('updated_at').defaultNow().notNull()
})

export const experimentVariants = pgTable('experiment_variants', {
	id: text('id').primaryKey(),
	experimentId: text('experiment_id')
		.notNull()
		.references(() => experiments.id, { onDelete: 'cascade' }),
	creativeId: text('creative_id')
		.notNull()
		.references(() => creatives.id),
	trafficPercent: integer('traffic_percent').notNull(),
	bucketStart: integer('bucket_start').notNull(),
	bucketEnd: integer('bucket_end').notNull(),
	isControl: boolean('is_control').notNull().default(false)
})

Deterministic Bucket Assignment

The assignment algorithm must be:

Deterministic - same input always produces the same bucket
Uniform - buckets are evenly distributed
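Stable - a user's bucket depends only on the session and experiment IDs, never on the variant list; users move between variants only if the bucket ranges themselves are reallocated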

We use a simple hash function based on the session ID and experiment ID:

// src/lib/server/ads/experiment-hash.ts

/**
 * Assign a user to a bucket (0–99) based on their session ID
 * and the experiment ID. This is deterministic - the same
 * session + experiment always maps to the same bucket.
 *
 * Uses a simple FNV-1a hash for speed and uniformity.
 */
export function assignBucket(sessionId: string, experimentId: string): number {
	const input = `${sessionId}:${experimentId}`
	let hash = 2166136261 // FNV offset basis

	for (let i = 0; i < input.length; i++) {
		hash ^= input.charCodeAt(i)
		hash = Math.imul(hash, 16777619) // FNV prime
	}

	// Convert to unsigned 32-bit integer, then mod 100
	return Math.abs(hash >>> 0) % 100
}

/**
 * Given a bucket number and a list of variants with bucket ranges,
 * find which variant the bucket falls into.
 */
export function assignVariant(
	bucket: number,
	variants: Array<{ id: string; bucketStart: number; bucketEnd: number }>
): string | null {
	for (const variant of variants) {
		if (bucket >= variant.bucketStart && bucket <= variant.bucketEnd) {
			return variant.id
		}
	}
	return null
}

Why Not Math.random()?

Math.random() is non-deterministic; calling it twice gives different results. If a user refreshes the page, they might see a different variant. A hash-based approach guarantees consistency. The FNV-1a hash is fast (no crypto overhead) and produces a sufficiently uniform distribution for 100 buckets.
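
To see the contrast concretely, the sketch below repeats the FNV-1a bucket logic from above under a hypothetical name (bucketFor) and checks two things: repeated calls with the same session and experiment return the same bucket, and a batch of synthetic sessions spreads across the buckets. The session ID format (session-N) and the experiment ID (exp-1) are made up for illustration.

```typescript
// Same FNV-1a logic as assignBucket above, repeated so the sketch is self-contained.
function bucketFor(sessionId: string, experimentId: string): number {
	const input = `${sessionId}:${experimentId}`
	let hash = 2166136261 // FNV offset basis
	for (let i = 0; i < input.length; i++) {
		hash ^= input.charCodeAt(i)
		hash = Math.imul(hash, 16777619) // FNV prime
	}
	return (hash >>> 0) % 100
}

// Determinism: a page refresh (same session, same experiment) keeps the bucket.
const first = bucketFor('session-42', 'exp-1')
const second = bucketFor('session-42', 'exp-1')
console.log(first === second) // true

// Rough uniformity check over hypothetical session IDs.
const counts = new Array<number>(100).fill(0)
for (let i = 0; i < 10000; i++) {
	const bucket = bucketFor(`session-${i}`, 'exp-1')
	counts[bucket] = (counts[bucket] ?? 0) + 1
}
// With 10,000 sessions and 100 buckets, each count should ideally land near 100.
console.log(Math.min(...counts), Math.max(...counts))
```

Swapping bucketFor for Math.floor(Math.random() * 100) would keep the spread roughly even but break the first check, which is the property the experiment depends on.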

Bucket Allocation

Bucket allocation converts a percentage split into concrete integer ranges. The scheme uses exactly 100 buckets (numbered 0–99), which makes the relationship between trafficPercent and bucket count direct: a 50% allocation gets exactly 50 buckets, a 30% allocation gets 30 buckets, and so on. Ranges are contiguous rather than scattered - Variant A owns buckets 0–49, not every other bucket from 0 to 98. Contiguous ranges are simpler to reason about and equally unbiased, since the hash function distributes sessions uniformly across 0–99 regardless of where the boundaries fall. The validation that percentages must sum to exactly 100 ensures every bucket is claimed by exactly one variant; a user cannot fall into an unassigned gap.

// src/lib/server/ads/experiment-setup.ts

import type { ExperimentVariant } from '$lib/types/ads'

interface VariantConfig {
	creativeId: string
	trafficPercent: number
	isControl: boolean
}

/**
 * Allocate bucket ranges for experiment variants.
 * Each variant gets a contiguous range of buckets proportional
 * to its traffic percentage.
 *
 * Example with 3 variants at 50/30/20:
 * - Variant A: buckets 0–49 (50 buckets)
 * - Variant B: buckets 50–79 (30 buckets)
 * - Variant C: buckets 80–99 (20 buckets)
 */
export function allocateBuckets(
	experimentId: string,
	variants: VariantConfig[]
): Omit<ExperimentVariant, 'id'>[] {
	// Validate total traffic is 100%
	const totalTraffic = variants.reduce((sum, v) => sum + v.trafficPercent, 0)
	if (totalTraffic !== 100) {
		throw new Error(`Traffic percentages must sum to 100, got ${totalTraffic}`)
	}

	let currentBucket = 0
	return variants.map((variant) => {
		const bucketCount = variant.trafficPercent
		const bucketStart = currentBucket
		const bucketEnd = currentBucket + bucketCount - 1
		currentBucket += bucketCount

		return {
			experimentId,
			creativeId: variant.creativeId,
			trafficPercent: variant.trafficPercent,
			bucketStart,
			bucketEnd,
			isControl: variant.isControl
		}
	})
}
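
A quick usage sketch for the 50/30/20 example from the doc comment. To stay runnable on its own it uses a trimmed-down copy of the allocation loop (allocate, a hypothetical name) rather than importing allocateBuckets, and it looks up a bucket with the same range check that assignVariant performs. The creative IDs are placeholders.

```typescript
interface BucketRange {
	creativeId: string
	bucketStart: number
	bucketEnd: number
}

// Trimmed-down copy of the allocation loop above (illustrative only).
function allocate(variants: Array<{ creativeId: string; trafficPercent: number }>): BucketRange[] {
	const total = variants.reduce((sum, v) => sum + v.trafficPercent, 0)
	if (total !== 100) {
		throw new Error(`Traffic percentages must sum to 100, got ${total}`)
	}
	let current = 0
	return variants.map((v) => {
		const range = {
			creativeId: v.creativeId,
			bucketStart: current,
			bucketEnd: current + v.trafficPercent - 1
		}
		current += v.trafficPercent
		return range
	})
}

const ranges = allocate([
	{ creativeId: 'creative-a', trafficPercent: 50 },
	{ creativeId: 'creative-b', trafficPercent: 30 },
	{ creativeId: 'creative-c', trafficPercent: 20 }
])
// ranges cover 0-49, 50-79 and 80-99

// Same inclusive range check as assignVariant.
const ownerOf = (bucket: number): string | null =>
	ranges.find((r) => bucket >= r.bucketStart && bucket <= r.bucketEnd)?.creativeId ?? null

// Bucket 79 is the last one owned by creative-b; bucket 80 belongs to creative-c.
console.log(ownerOf(79), ownerOf(80)) // creative-b creative-c
```

Because both ends of each range are inclusive, the boundary buckets (49, 79, 99) are the ones worth checking when changing the allocation code.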

Integrating with the Ad Resolver

The experiment system plugs into the resolver’s selection stage. When an ad group has a running experiment, the resolver uses experiment assignment instead of weighted random selection:

// src/lib/server/ads/experiment-resolver.ts

import { db } from '$lib/server/db/client'
import { experiments, experimentVariants } from '$lib/server/db/schema'
import { and, eq } from 'drizzle-orm'
import type { Creative, AdGroup, AdRequestContext } from '$lib/types/ads'
import { assignBucket, assignVariant } from './experiment-hash'

/**
 * Check if an ad group has a running experiment, and if so,
 * select the creative based on experiment assignment.
 *
 * Returns the selected creative, or null if no experiment
 * is running (fall back to normal weighted selection).
 */
export async function resolveExperimentCreative(
	adGroup: AdGroup,
	availableCreatives: Creative[],
	context: AdRequestContext
): Promise<Creative | null> {
	// Find a running experiment for this ad group. The status filter belongs in
	// the query: an ad group may also have draft or completed experiments, and
	// findFirst could otherwise return one of those and hide the running one.
	const experiment = await db.query.experiments.findFirst({
		where: and(eq(experiments.adGroupId, adGroup.id), eq(experiments.status, 'running')),
		with: {
			variants: true
		}
	})

	if (!experiment) {
		return null // No running experiment - use normal selection
	}

	// Assign the user to a bucket
	const bucket = assignBucket(context.sessionId, experiment.id)

	// Find the variant for this bucket
	const variantId = assignVariant(bucket, experiment.variants)
	if (!variantId) return null

	// Find the variant's creative
	const variant = experiment.variants.find((v) => v.id === variantId)
	if (!variant) return null

	// Return the creative if it's in the available set
	return availableCreatives.find((c) => c.id === variant.creativeId) ?? null
}

Update the resolver’s Stage 5 to check for experiments first:

// In resolver.ts - update the selection stage:

import { resolveExperimentCreative } from './experiment-resolver'

// ─── Stage 5: Select creative ──────────────────────────
// Check for experiments first
for (const group of topGroups) {
	const experimentCreative = await resolveExperimentCreative(
		group.adGroup,
		group.creatives,
		context
	)
	if (experimentCreative) {
		return {
			creative: experimentCreative,
			campaign: group.campaign,
			adGroup: group.adGroup,
			trackingId: crypto.randomUUID()
		}
	}
}

// No experiment - fall back to weighted selection
const allCreatives = topGroups.flatMap((g) => g.creatives)
const winner = selectCreative(allCreatives)

Per-Variant Metrics

The events table records a creativeId with each impression and click, but it knows nothing about experiments or variants - those are higher-level abstractions that did not exist when the events were recorded. getExperimentMetrics bridges the two systems by fetching the experiment’s variants, then querying events filtered by each variant’s creativeId. This works because each variant maps to a distinct creative: Variant A’s impressions are all events where creativeId = A.creativeId, Variant B’s are all events where creativeId = B.creativeId. The isolation is guaranteed at the resolver level - when an experiment is running, resolveExperimentCreative assigns each session to exactly one creative, so the event counts never overlap.

The implementation runs three queries per variant, the same pattern noted in the metrics article. With two variants that is six database queries per analysis call. Consolidating into a single conditional aggregation query (COUNT(*) FILTER (WHERE type = 'impression')) would reduce this to one, but the sequential version is retained here for readability.

// src/lib/server/ads/experiment-metrics.ts

import { db } from '$lib/server/db/client'
import { adEvents, experimentVariants } from '$lib/server/db/schema'
import { eq, and, sql } from 'drizzle-orm'

export interface VariantMetrics {
	variantId: string
	creativeId: string
	isControl: boolean
	impressions: number
	clicks: number
	conversions: number
	ctr: number
	cvr: number
}

/**
 * Get metrics for each variant in an experiment.
 */
export async function getExperimentMetrics(experimentId: string): Promise<VariantMetrics[]> {
	// Get variants
	const variants = await db.query.experimentVariants.findMany({
		where: eq(experimentVariants.experimentId, experimentId)
	})

	const results: VariantMetrics[] = []

	for (const variant of variants) {
		const impressions = await db
			.select({ count: sql<number>`count(*)` })
			.from(adEvents)
			.where(and(eq(adEvents.creativeId, variant.creativeId), eq(adEvents.type, 'impression')))

		const clicks = await db
			.select({ count: sql<number>`count(*)` })
			.from(adEvents)
			.where(and(eq(adEvents.creativeId, variant.creativeId), eq(adEvents.type, 'click')))

		const conversions = await db
			.select({ count: sql<number>`count(*)` })
			.from(adEvents)
			.where(and(eq(adEvents.creativeId, variant.creativeId), eq(adEvents.type, 'conversion')))

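		// count(*) is a bigint in PostgreSQL and some drivers return it as a string,
		// so coerce explicitly before doing arithmetic
		const impCount = Number(impressions[0]?.count ?? 0)
		const clickCount = Number(clicks[0]?.count ?? 0)
		const convCount = Number(conversions[0]?.count ?? 0)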

		results.push({
			variantId: variant.id,
			creativeId: variant.creativeId,
			isControl: variant.isControl,
			impressions: impCount,
			clicks: clickCount,
			conversions: convCount,
			ctr: impCount > 0 ? clickCount / impCount : 0,
			cvr: clickCount > 0 ? convCount / clickCount : 0
		})
	}

	return results
}

Statistical Significance

The most important question in A/B testing: is the difference between variants real, or just random noise? We use the chi-squared test for this.

The Chi-Squared Test

Before reaching for the formula, it helps to hold the core question in mind: if these two variants performed identically in reality, how often would random sampling alone produce a CTR gap at least as large as the one we are seeing? The chi-squared test gives that probability as a p-value. A p-value of 0.03 means that, if the variants were truly identical, a gap at least this large would still show up about 3% of the time - roughly once in every 33 runs of the same experiment. It is not the probability that the observed gap is noise. The conventional threshold is 5%: if identical variants would produce a gap this large less than one time in twenty, most practitioners consider the result trustworthy enough to act on.

That convention is not a law of nature, and a low p-value does not tell you how large the difference is - only that it is unlikely to be zero. The lift figure (percentage improvement over the control) answers the “how much” question. Both numbers matter: a statistically significant lift of 0.01% is almost certainly not worth changing your creative for, even if the p-value is 0.001.

With that framing in place, here is how the test works mechanically:

Implementation

The chi-squared statistic measures the total deviation between what was observed and what would be expected if both variants had identical click rates. Each cell in a 2x2 table (clicks and non-clicks for each variant) contributes (observed - expected)^2 / expected. Squaring removes sign so over-performance and under-performance both add to the statistic. Dividing by the expected value normalises for cell size: a deviation of 5 clicks in a cell that expected 10 is far more surprising than the same deviation in a cell that expected 1000. The larger the total statistic, the less likely the data is consistent with the null hypothesis that both variants perform the same.

The test uses 1 degree of freedom because a 2x2 contingency table has exactly one free cell: once you fix the row and column totals and fill in one cell, all others are determined. Degrees of freedom for a contingency table is (rows - 1) x (columns - 1), which for a 2x2 table gives 1 x 1 = 1. The chi-squared CDF converts the statistic into a probability using the regularized incomplete gamma function. At 1 degree of freedom this simplifies to the error function erf - the same function underlying the normal distribution - which is why the code has a fast-path for that case. The p-value is 1 - CDF(statistic): the probability that a chi-squared variable with 1 degree of freedom would exceed the observed statistic by chance alone.
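
A worked example may make the arithmetic easier to follow. The numbers are invented: a control with 50 clicks from 1,000 impressions (5% CTR) and a variant with 70 clicks from 1,000 impressions (7% CTR). Rather than computing a p-value, this sketch compares the statistic with 3.841, the standard critical value for 1 degree of freedom at the 5% level.

```typescript
// Hypothetical counts, chosen so the expected values are round numbers.
const control = { impressions: 1000, clicks: 50 }
const variant = { impressions: 1000, clicks: 70 }

const totalImpressions = control.impressions + variant.impressions // 2000
const totalClicks = control.clicks + variant.clicks // 120
const pooledCtr = totalClicks / totalImpressions // 0.06 under the null hypothesis

// Contribution of one cell: (observed - expected)^2 / expected
const cell = (observed: number, expected: number): number =>
	Math.pow(observed - expected, 2) / expected

// Clicks and non-clicks for one variant (two of the four cells).
const contribution = (v: { impressions: number; clicks: number }): number =>
	cell(v.clicks, v.impressions * pooledCtr) +
	cell(v.impressions - v.clicks, v.impressions * (1 - pooledCtr))

const chiSquared = contribution(control) + contribution(variant)

// Critical value of the chi-squared distribution, 1 degree of freedom, alpha = 0.05.
const CRITICAL_5_PERCENT = 3.841

// Lift of the variant over the control, in percent.
const controlCtr = control.clicks / control.impressions
const variantCtr = variant.clicks / variant.impressions
const lift = ((variantCtr - controlCtr) / controlCtr) * 100

console.log(chiSquared.toFixed(3)) // 3.546
console.log(chiSquared > CRITICAL_5_PERCENT) // false
console.log(lift.toFixed(1)) // 40.0
```

A 40% lift looks large, yet the statistic falls just short of 3.841, so at this sample size the test would not call the difference significant at the 5% level. The same CTRs measured over more impressions would cross the threshold.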

// src/lib/server/ads/statistics.ts

/**
 * Perform a chi-squared test on two variants.
 * Returns the p-value - the probability of seeing a difference at
 * least this large if both variants truly performed the same.
 *
 * Lower p-value = stronger evidence that the difference is real.
 * Typically, p < 0.05 (5%) is considered significant.
 */
export function chiSquaredTest(
	impressionsA: number,
	clicksA: number,
	impressionsB: number,
	clicksB: number
): { chiSquared: number; pValue: number; significant: boolean } {
	const totalImpressions = impressionsA + impressionsB
	const totalClicks = clicksA + clicksB
	const totalNonClicks = totalImpressions - totalClicks

	if (totalImpressions === 0 || totalClicks === 0 || totalNonClicks === 0) {
		return { chiSquared: 0, pValue: 1, significant: false }
	}

	// Expected values under the null hypothesis (no difference)
	const expectedClicksA = (impressionsA * totalClicks) / totalImpressions
	const expectedClicksB = (impressionsB * totalClicks) / totalImpressions
	const expectedNonClicksA = (impressionsA * totalNonClicks) / totalImpressions
	const expectedNonClicksB = (impressionsB * totalNonClicks) / totalImpressions

	const nonClicksA = impressionsA - clicksA
	const nonClicksB = impressionsB - clicksB

	// Chi-squared statistic
	const chiSquared =
		Math.pow(clicksA - expectedClicksA, 2) / expectedClicksA +
		Math.pow(clicksB - expectedClicksB, 2) / expectedClicksB +
		Math.pow(nonClicksA - expectedNonClicksA, 2) / expectedNonClicksA +
		Math.pow(nonClicksB - expectedNonClicksB, 2) / expectedNonClicksB

	// Convert to p-value using the chi-squared CDF with 1 degree of freedom
	const pValue = 1 - chiSquaredCDF(chiSquared, 1)

	return {
		chiSquared,
		pValue,
		significant: pValue < 0.05,
	}
}

/**
 * Chi-squared cumulative distribution function.
 * Uses the regularized lower incomplete gamma function.
 * For 1 degree of freedom: CDF(x) = erf(sqrt(x/2))
 */
function chiSquaredCDF(x: number, degreesOfFreedom: number): number {
	if (x <= 0) return 0

	// For 1 degree of freedom, use the error function
	if (degreesOfFreedom === 1) {
		return erf(Math.sqrt(x / 2))
	}

	// General case using the regularized gamma function
	return regularizedGammaP(degreesOfFreedom / 2, x / 2)
}

/**
 * Error function approximation (Abramowitz & Stegun).
 * Accurate to ~1.5e-7.
 */
function erf(x: number): number {
	const sign = x >= 0 ? 1 : -1
	x = Math.abs(x)

	const a1 = 0.254829592
	const a2 = -0.284496736
	const a3 = 1.421413741
	const a4 = -1.453152027
	const a5 = 1.061405429
	const p = 0.3275911

	const t = 1 / (1 + p * x)
	const y = 1 - (((((a5 * t + a4) * t) + a3) * t + a2) * t + a1) * t * Math.exp(-x * x)

	return sign * y
}

/**
 * Regularized lower incomplete gamma function P(a, x).
 * Uses a series expansion.
 */
function regularizedGammaP(a: number, x: number): number {
	if (x === 0) return 0

	let sum = 1 / a
	let term = 1 / a

	for (let n = 1; n < 200; n++) {
		term *= x / (a + n)
		sum += term
		if (Math.abs(term) < 1e-10) break
	}

	return sum * Math.exp(-x + a * Math.log(x) - logGamma(a))
}

/**
 * Natural log of the gamma function (Lanczos approximation).
 */
function logGamma(x: number): number {
	const coefficients = [
		76.18009172947146, -86.50532032941677, 24.01409824083091,
		-1.231739572450155, 0.1208650973866179e-2, -0.5395239384953e-5,
	]

	let y = x
	let tmp = x + 5.5
	tmp -= (x + 0.5) * Math.log(tmp)
	let sum = 1.000000000190015

	for (let j = 0; j < 6; j++) {
		y += 1
		sum += coefficients[j] / y
	}

	return -tmp + Math.log((2.5066282746310005 * sum) / x)
}

// ─── Confidence Interval ───────────────────────────────

// A confidence interval expresses the range within which the true click-through
// rate probably lies, given the observed data. A 95% CI means that if you ran
// the same experiment many times, the interval would contain the true CTR 95%
// of the time. Wider intervals indicate more uncertainty (fewer observations);
// narrower intervals indicate more certainty (more data). When the CI of Variant A
// and Variant B do not overlap, you have strong visual evidence the difference is real.
//
// The Wilson score interval is used instead of the naive normal approximation
// (p ± z * sqrt(p*(1-p)/n)) because the naive formula breaks down near 0% and
// 100% CTR, and for small sample sizes. Wilson's formula includes a correction
// that keeps the interval within [0, 1] and remains accurate even when the
// observed proportion is very small - exactly the conditions you see with low-CTR
// ad creatives early in an experiment.

/**
 * Calculate a confidence interval for a proportion (e.g. CTR).
 * Uses the Wilson score interval, which is more accurate than
 * the normal approximation for small sample sizes.
 */
export function wilsonConfidenceInterval(
	successes: number,
	total: number,
	confidence: number = 0.95
): { lower: number; upper: number; center: number } {
	if (total === 0) return { lower: 0, upper: 0, center: 0 }

	const z = normalQuantile((1 + confidence) / 2)
	const p = successes / total
	const denominator = 1 + z * z / total
	const center = (p + z * z / (2 * total)) / denominator
	const margin = (z * Math.sqrt((p * (1 - p) + z * z / (4 * total)) / total)) / denominator

	return {
		lower: Math.max(0, center - margin),
		upper: Math.min(1, center + margin),
		center,
	}
}

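/**
 * Inverse normal CDF (quantile function).
 * Acklam's rational approximation, relative error ~1.15e-9.
 * One rational function covers the central region, another covers the tails.
 */
function normalQuantile(p: number): number {
	if (p <= 0) return -Infinity
	if (p >= 1) return Infinity
	if (p === 0.5) return 0

	// Coefficients for the central region
	const a = [
		-3.969683028665376e1, 2.209460984245205e2,
		-2.759285104469687e2, 1.383577518672690e2,
		-3.066479806614716e1, 2.506628277459239e0,
	]
	const b = [
		-5.447609879822406e1, 1.615858368580409e2,
		-1.556989798598866e2, 6.680131188771972e1,
		-1.328068155288572e1,
	]
	// Coefficients for the tails
	const c = [
		-7.784894002430293e-3, -3.223964580411365e-1,
		-2.400758277161838e0, -2.549732539343734e0,
		4.374664141464968e0, 2.938163982698783e0,
	]
	const d = [
		7.784695709041462e-3, 3.224671290700398e-1,
		2.445134137142996e0, 3.754408661907416e0,
	]

	const pLow = 0.02425
	const pHigh = 1 - pLow

	// Tail formula; q = sqrt(-2 * ln(tail probability)). Negative for the lower tail.
	const tail = (q: number) =>
		(((((c[0] * q + c[1]) * q + c[2]) * q + c[3]) * q + c[4]) * q + c[5]) /
		((((d[0] * q + d[1]) * q + d[2]) * q + d[3]) * q + 1)

	if (p < pLow) return tail(Math.sqrt(-2 * Math.log(p)))
	if (p > pHigh) return -tail(Math.sqrt(-2 * Math.log(1 - p)))

	// Central region: polynomial in r = q^2, where q is the distance from the median
	const q = p - 0.5
	const r = q * q
	const numerator = (((((a[0] * r + a[1]) * r + a[2]) * r + a[3]) * r + a[4]) * r + a[5]) * q
	const denominator = ((((b[0] * r + b[1]) * r + b[2]) * r + b[3]) * r + b[4]) * r + 1

	return numerator / denominator
}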
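
The difference between the Wilson interval and the naive normal approximation is easiest to see at the edges. The sketch below hard-codes z = 1.96 (the usual two-sided 95% value) instead of calling normalQuantile, and compares both formulas for a made-up creative with zero clicks in 50 impressions. The function names naiveInterval and wilsonInterval are local to this example.

```typescript
const Z_95 = 1.96 // two-sided 95% critical value of the standard normal

// Naive normal approximation: p ± z * sqrt(p(1-p)/n)
function naiveInterval(successes: number, total: number): { lower: number; upper: number } {
	const p = successes / total
	const margin = Z_95 * Math.sqrt((p * (1 - p)) / total)
	return { lower: p - margin, upper: p + margin }
}

// Wilson score interval, same formula as wilsonConfidenceInterval above.
function wilsonInterval(successes: number, total: number): { lower: number; upper: number } {
	const p = successes / total
	const z2 = Z_95 * Z_95
	const denominator = 1 + z2 / total
	const center = (p + z2 / (2 * total)) / denominator
	const margin = (Z_95 * Math.sqrt((p * (1 - p) + z2 / (4 * total)) / total)) / denominator
	return { lower: Math.max(0, center - margin), upper: Math.min(1, center + margin) }
}

// Zero clicks in 50 impressions (hypothetical early-experiment data).
const naive = naiveInterval(0, 50)
const wilson = wilsonInterval(0, 50)

// The naive interval collapses to a single point: it claims certainty after 50 impressions.
console.log(naive.lower, naive.upper) // 0 0

// Wilson still allows for a true CTR of up to about 7%.
console.log(wilson.upper.toFixed(3)) // 0.071
```

With more data the two formulas converge: for 50 clicks in 1,000 impressions the Wilson interval is roughly 3.8% to 6.5%, close to the naive 3.6% to 6.4%.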

Why Implement Stats From Scratch?

We implement the chi-squared test and confidence intervals from scratch for two reasons: (1) it keeps the dependency count at zero - no need for heavy statistics libraries just for one test, and (2) it is educational; understanding these formulas demystifies A/B testing. In production, you could use a library like simple-statistics for more complex analyses.

Experiment Analysis & Auto-Completion

The analyzer is the decision layer that sits above the statistics functions. It handles three distinct states that an experiment can be in, and the right response is different for each. When an experiment has insufficient_data, the statistical test would be meaningless - chi-squared on 15 impressions per variant is noise, not signal - so the analyzer returns early with the raw counts and skips the significance calculation. When it has enough data but the difference is not significant, it returns the full analysis including p-value, inviting the viewer to wait for more data. When significance is reached, it declares the winner.

A design choice worth noting: the analysis compares the control against the single best-performing variant rather than running all pairwise comparisons. This limits the multiple comparisons problem, where running N independent tests at 5% significance gives you roughly a 1 - 0.95^N probability of at least one false positive by chance. In a three-variant experiment, three pairwise tests would give you a ~14% false positive rate overall. Running a single control-vs-best test keeps the rate much closer to the chosen level, although with more than two variants it is not exact: picking the best variant after looking at the data is itself a form of selection, so the true false positive rate sits somewhat above the nominal one.
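
The growth of the false positive rate is easy to tabulate. This small sketch evaluates 1 - (1 - alpha)^N under the simplifying assumption that the N tests are independent, which pairwise tests sharing a control are not exactly; treat the figures as rough guides. The function name familywiseErrorRate is introduced here for illustration.

```typescript
// Probability of at least one false positive across `tests` independent tests,
// each run at significance level `alpha`.
function familywiseErrorRate(tests: number, alpha: number = 0.05): number {
	return 1 - Math.pow(1 - alpha, tests)
}

// 1 test (two variants), 3 pairwise tests (three variants), 6 pairwise tests (four variants)
for (const n of [1, 3, 6]) {
	console.log(n, familywiseErrorRate(n).toFixed(3))
}
// 1 0.050
// 3 0.143
// 6 0.265
```

Four variants already produce six pairwise comparisons and a better than one-in-four chance of a spurious winner, which is why the analyzer avoids testing every pair.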

After autoCompleteExperiment sets the status to completed, the resolver stops routing traffic through the experiment. resolveExperimentCreative returns null for any experiment that is not running, and the resolver falls back to weighted selection among the ad group’s creatives. The winning creative is recorded on the experiment, but it does not automatically adjust the ad group’s creative weights - that step is left to the campaign manager, who can inspect the result in the dashboard and manually increase the winner’s weight or deactivate the losing creatives.

// src/lib/server/ads/experiment-analyzer.ts

import { db } from '$lib/server/db/client'
import { experiments } from '$lib/server/db/schema'
import { eq } from 'drizzle-orm'
import { getExperimentMetrics } from './experiment-metrics'
import { chiSquaredTest, wilsonConfidenceInterval } from './statistics'

export interface ExperimentAnalysis {
	experimentId: string
	status: 'insufficient_data' | 'no_significance' | 'significant'
	variants: Array<{
		variantId: string
		creativeId: string
		isControl: boolean
		impressions: number
		clicks: number
		ctr: number
		confidenceInterval: { lower: number; upper: number }
	}>
	comparison: {
		chiSquared: number
		pValue: number
		significant: boolean
		lift: number // percentage improvement of best over control
		bestVariantId: string
	} | null
}

/**
 * Analyze an experiment and return the current state of significance.
 */
export async function analyzeExperiment(experimentId: string): Promise<ExperimentAnalysis> {
	const experiment = await db.query.experiments.findFirst({
		where: eq(experiments.id, experimentId)
	})

	if (!experiment) throw new Error('Experiment not found')

	const metrics = await getExperimentMetrics(experimentId)

	// Check minimum sample size
	const hasEnoughData = metrics.every((m) => m.impressions >= experiment.minSampleSize)

	if (!hasEnoughData) {
		return {
			experimentId,
			status: 'insufficient_data',
			variants: metrics.map((m) => ({
				...m,
				confidenceInterval: wilsonConfidenceInterval(
					m.clicks,
					m.impressions,
					experiment.confidenceLevel / 100
				)
			})),
			comparison: null
		}
	}

	// Find control and best-performing variant
	const control = metrics.find((m) => m.isControl) ?? metrics[0]
	const best = metrics.reduce((a, b) => (a.ctr > b.ctr ? a : b))

	// Run chi-squared test between control and best
	const test = chiSquaredTest(control.impressions, control.clicks, best.impressions, best.clicks)

	// Calculate lift (percentage improvement)
	const lift = control.ctr > 0 ? ((best.ctr - control.ctr) / control.ctr) * 100 : 0

	const confidenceThreshold = experiment.confidenceLevel / 100
	const isSignificant = test.pValue < 1 - confidenceThreshold

	return {
		experimentId,
		status: isSignificant ? 'significant' : 'no_significance',
		variants: metrics.map((m) => ({
			...m,
			confidenceInterval: wilsonConfidenceInterval(m.clicks, m.impressions, confidenceThreshold)
		})),
		comparison: {
			chiSquared: test.chiSquared,
			pValue: test.pValue,
			significant: isSignificant,
			lift,
			bestVariantId: best.variantId
		}
	}
}

/**
 * Auto-complete an experiment if it has reached significance.
 * Sets the winning variant and updates the experiment status.
 */
export async function autoCompleteExperiment(experimentId: string): Promise<boolean> {
	const analysis = await analyzeExperiment(experimentId)

	if (analysis.status !== 'significant' || !analysis.comparison) {
		return false
	}

	// Update experiment with winner
	await db
		.update(experiments)
		.set({
			status: 'completed',
			winnerId: analysis.comparison.bestVariantId,
			updatedAt: new Date()
		})
		.where(eq(experiments.id, experimentId))

	return true
}

The Experiment Dashboard Component

The ABTestPanel component renders the current state of an experiment for a campaign manager reviewing their dashboard. It reflects all three analyzer states visually: a muted colour and “Collecting Data” label when sample size is insufficient, a yellow border and “No Significant Difference Yet” when the test is running but inconclusive, and a green highlight with a winner badge when significance is reached.

The confidence interval bars below each variant’s metrics deserve particular attention. Each bar shows the probable range for the variant’s true CTR given the observed data. The shaded region is the interval; the dot marks the observed point estimate. A wide shaded region means the experiment has few impressions and the true CTR could be anywhere in that range. As impressions accumulate, the shaded region narrows and the dot becomes more trustworthy. When the shaded regions of two variants stop overlapping, the difference between them is very likely real. Treat this as a conservative visual check rather than an exact equivalent of the formal p-value test: the intervals can still overlap slightly when the test already reports significance.

<!-- src/lib/components/ads/ABTestPanel.svelte -->

<script lang="ts">
	import type { ExperimentAnalysis } from '$lib/server/ads/experiment-analyzer'
	import { formatPercent } from '$lib/utils/format'

	interface Props {
		analysis: ExperimentAnalysis
	}

	let { analysis }: Props = $props()

	let statusLabel = $derived(
		analysis.status === 'insufficient_data'
			? 'Collecting Data'
			: analysis.status === 'no_significance'
				? 'No Significant Difference Yet'
				: 'Winner Found!'
	)

	let statusColor = $derived(
		analysis.status === 'significant'
			? 'var(--accent-green-base)'
			: analysis.status === 'no_significance'
				? 'var(--accent-yellow-base)'
				: 'var(--text-muted)'
	)
</script>

<div class="ab-test-panel">
	<div class="experiment-status" style="border-color: {statusColor}">
		<span class="status-text" style="color: {statusColor}">{statusLabel}</span>
		{#if analysis.comparison}
			<span class="p-value">
				p-value: {analysis.comparison.pValue.toFixed(4)}
			</span>
		{/if}
	</div>

	<!-- Variant Comparison -->
	<div class="variants">
		{#each analysis.variants as variant (variant.variantId)}
			{@const isWinner =
				analysis.comparison?.bestVariantId === variant.variantId &&
				analysis.status === 'significant'}
			<div class="variant-card" class:winner={isWinner} class:control={variant.isControl}>
				<div class="variant-header">
					<span class="variant-label">
						{variant.isControl ? 'Control' : 'Variant'}
					</span>
					{#if isWinner}
						<span class="winner-badge">Winner</span>
					{/if}
				</div>

				<div class="variant-metrics">
					<div class="metric">
						<span class="metric-label">Impressions</span>
						<span class="metric-value">{variant.impressions.toLocaleString()}</span>
					</div>
					<div class="metric">
						<span class="metric-label">Clicks</span>
						<span class="metric-value">{variant.clicks.toLocaleString()}</span>
					</div>
					<div class="metric">
						<span class="metric-label">CTR</span>
						<span class="metric-value">{formatPercent(variant.ctr)}</span>
					</div>
				</div>

				<!-- Confidence Interval Bar -->
				<div class="ci-bar">
					<div class="ci-label">
						{formatPercent(variant.confidenceInterval.lower)} -
						{formatPercent(variant.confidenceInterval.upper)}
					</div>
					<div class="ci-track">
						<div
							class="ci-range"
							style="
								left: {variant.confidenceInterval.lower * 100}%;
								width: {(variant.confidenceInterval.upper - variant.confidenceInterval.lower) * 100}%;
							"
						>
							<div
								class="ci-point"
								style="left: {((variant.ctr - variant.confidenceInterval.lower) /
									(variant.confidenceInterval.upper - variant.confidenceInterval.lower)) *
									100}%"
							></div>
						</div>
					</div>
				</div>
			</div>
		{/each}
	</div>

	<!-- Lift Indicator -->
	{#if analysis.comparison && analysis.comparison.lift !== 0}
		<div class="lift-indicator">
			<span class="lift-value" class:positive={analysis.comparison.lift > 0}>
				{analysis.comparison.lift > 0 ? '+' : ''}{analysis.comparison.lift.toFixed(1)}%
			</span>
			<span class="lift-label">lift over control</span>
		</div>
	{/if}
</div>

<style>
	.ab-test-panel {
		background: var(--surface-1, #fff);
		border: 1px solid var(--border-color, #e5e7eb);
		border-radius: 0.5rem;
		padding: 1.5rem;
	}

	.experiment-status {
		display: flex;
		justify-content: space-between;
		align-items: center;
		padding-bottom: 1rem;
		margin-bottom: 1rem;
		border-bottom: 2px solid;
	}

	.status-text {
		font-weight: 600;
		font-size: 1.1rem;
	}

	.p-value {
		font-family: monospace;
		font-size: 0.85rem;
		color: var(--text-muted);
	}

	.variants {
		display: grid;
		grid-template-columns: repeat(auto-fit, minmax(280px, 1fr));
		gap: 1rem;
		margin-bottom: 1rem;
	}

	.variant-card {
		border: 1px solid var(--border-color, #e5e7eb);
		border-radius: 0.5rem;
		padding: 1rem;
	}

	.variant-card.winner {
		border-color: var(--accent-green-base, #10b981);
		background: rgba(16, 185, 129, 0.05);
	}

	.variant-card.control {
		border-color: var(--accent-blue-base, #3b82f6);
	}

	.variant-header {
		display: flex;
		justify-content: space-between;
		align-items: center;
		margin-bottom: 0.75rem;
	}

	.variant-label {
		font-weight: 600;
		text-transform: uppercase;
		font-size: 0.75rem;
		letter-spacing: 0.05em;
	}

	.winner-badge {
		background: var(--accent-green-base, #10b981);
		color: white;
		padding: 0.15rem 0.5rem;
		border-radius: 0.25rem;
		font-size: 0.75rem;
		font-weight: 600;
	}

	.variant-metrics {
		display: flex;
		gap: 1rem;
		margin-bottom: 1rem;
	}

	.metric {
		flex: 1;
	}

	.metric-label {
		display: block;
		font-size: 0.7rem;
		color: var(--text-muted);
		text-transform: uppercase;
	}

	.metric-value {
		font-size: 1.1rem;
		font-weight: 600;
	}

	.ci-bar {
		margin-top: 0.5rem;
	}

	.ci-label {
		font-size: 0.75rem;
		color: var(--text-muted);
		margin-bottom: 0.25rem;
	}

	.ci-track {
		height: 8px;
		background: var(--surface-2, #f3f4f6);
		border-radius: 4px;
		position: relative;
		overflow: visible;
	}

	.ci-range {
		position: absolute;
		top: 0;
		height: 100%;
		background: var(--accent-blue-base, #3b82f6);
		opacity: 0.3;
		border-radius: 4px;
	}

	.ci-point {
		position: absolute;
		top: -2px;
		width: 12px;
		height: 12px;
		background: var(--accent-blue-base, #3b82f6);
		border-radius: 50%;
		transform: translateX(-50%);
	}

	.lift-indicator {
		text-align: center;
		padding-top: 1rem;
		border-top: 1px solid var(--border-color, #e5e7eb);
	}

	.lift-value {
		font-size: 1.5rem;
		font-weight: 700;
		color: var(--accent-red-base, #dc2626);
	}

	.lift-value.positive {
		color: var(--accent-green-base, #10b981);
	}

	.lift-label {
		display: block;
		font-size: 0.8rem;
		color: var(--text-muted);
	}
</style>

Common A/B Testing Pitfalls

Peeking and early stopping. The most common mistake in A/B testing is checking results while the experiment runs and stopping as soon as significance is reached. A fixed-sample test such as the chi-squared test used here assumes the sample size was chosen in advance and the data is analyzed once. Checking repeatedly and stopping at the first significant result inflates the false positive rate well above the chosen threshold - each look is another chance for random noise to cross the 5% line, and you keep only the look that fires. The minimum sample size guard in analyzeExperiment mitigates this by refusing to declare significance until enough data has accumulated, but it does not eliminate the problem: repeated checks after the minimum are still peeks. A more rigorous approach is sequential testing (e.g. the sequential probability ratio test), which is designed for continuous monitoring.
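To see how much peeking inflates false positives, it helps to simulate A/A tests, where both variants share the same true CTR and every "significant" result is a false positive by construction. The sketch below is illustrative and not part of the series codebase: simulateFalsePositiveRates, isSignificant, and mulberry32 are hypothetical names, and it uses a two-proportion z-test, which for two variants is equivalent to the chi-squared test on a 2x2 table without continuity correction.

```typescript
// Small seeded PRNG (mulberry32) so the simulation is reproducible.
function mulberry32(seed: number): () => number {
	let a = seed >>> 0;
	return () => {
		a = (a + 0x6d2b79f5) >>> 0;
		let t = a;
		t = Math.imul(t ^ (t >>> 15), t | 1);
		t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
		return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
	};
}

// Two-sided two-proportion z-test at 95% confidence, equal sample sizes.
function isSignificant(clicksA: number, clicksB: number, n: number): boolean {
	const pooled = (clicksA + clicksB) / (2 * n);
	const se = Math.sqrt(pooled * (1 - pooled) * (2 / n));
	if (se === 0) return false;
	const z = (clicksA / n - clicksB / n) / se;
	return Math.abs(z) > 1.96;
}

// Runs A/A experiments (identical true CTR) and compares two stopping rules:
// "peeking" counts an experiment as a false positive if ANY checkpoint is
// significant; "singleLook" tests only once, at the planned sample size.
function simulateFalsePositiveRates(
	experiments = 500,
	perVariant = 5000,
	peekEvery = 250,
	ctr = 0.1,
	seed = 42
): { peeking: number; singleLook: number } {
	const rand = mulberry32(seed);
	let peekingHits = 0;
	let singleLookHits = 0;

	for (let e = 0; e < experiments; e++) {
		let clicksA = 0;
		let clicksB = 0;
		let firedAtAnyPeek = false;

		for (let i = 1; i <= perVariant; i++) {
			if (rand() < ctr) clicksA++;
			if (rand() < ctr) clicksB++;
			if (i % peekEvery === 0 && isSignificant(clicksA, clicksB, i)) {
				firedAtAnyPeek = true;
			}
		}

		if (firedAtAnyPeek) peekingHits++;
		if (isSignificant(clicksA, clicksB, perVariant)) singleLookHits++;
	}

	return {
		peeking: peekingHits / experiments,
		singleLook: singleLookHits / experiments
	};
}

console.log(simulateFalsePositiveRates());
```

With these defaults there are 20 checkpoints per experiment. The single-look rate should stay close to the nominal 5%, while the peeking rate typically comes out several times higher, even though no real difference exists between the variants.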

Simultaneous experiments on the same ad group. This system assumes at most one running experiment per ad group - resolveExperimentCreative takes the first running experiment it finds and never looks at a second one. Running multiple experiments simultaneously on the same traffic would contaminate both: a user assigned to Variant A of Experiment 1 might also qualify for Experiment 2, and the two creative changes interact in ways the statistics cannot untangle. Keep one experiment per ad group at a time.
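Because the resolver only picks the first running experiment, the one-per-ad-group rule is easier to rely on if it is also checked when an experiment is started. A minimal sketch of such a guard follows. ExperimentSummary, findRunningConflict, and assertCanStart are hypothetical names, and the status values are an assumption about the data model rather than its actual definition.

```typescript
// Assumed minimal shape; the real Experiment model carries more fields.
interface ExperimentSummary {
	id: string;
	adGroupId: string;
	status: 'draft' | 'running' | 'completed';
}

// Returns the experiment already running on the ad group, if any.
function findRunningConflict(
	existing: ExperimentSummary[],
	adGroupId: string
): ExperimentSummary | undefined {
	return existing.find((e) => e.adGroupId === adGroupId && e.status === 'running');
}

// Throws if starting `candidate` would put two running experiments
// on the same ad group. The candidate itself is excluded from the check
// so that re-starting an already running experiment is not a conflict.
function assertCanStart(existing: ExperimentSummary[], candidate: ExperimentSummary): void {
	const others = existing.filter((e) => e.id !== candidate.id);
	const conflict = findRunningConflict(others, candidate.adGroupId);
	if (conflict) {
		throw new Error(
			`Ad group ${candidate.adGroupId} already has a running experiment (${conflict.id})`
		);
	}
}
```

In a real deployment the same rule is better enforced in the database as well (for example with a partial unique index on running experiments per ad group), since an application-level check alone can race under concurrent requests.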

Sample size intuition. The default minSampleSize of 1000 per variant is a starting point, not a universal rule. The required sample size depends on three factors: the baseline CTR, the minimum lift you care about detecting, and the desired confidence. For the same relative lift, a low baseline needs far more data: detecting a 10% relative improvement on a 1% baseline CTR (1.0% to 1.1%) takes many more impressions than detecting it on a 10% baseline (10% to 11%), because the absolute difference is ten times smaller. If you are running on low-traffic placements, expect experiments to take longer: either raise the minimum and wait for the data, or accept a lower confidence level, understanding the trade-off.
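The relationship between baseline CTR, lift, and sample size can be made concrete with the standard two-proportion sample size formula. The helper below is a sketch under stated assumptions, not part of the series codebase: requiredSampleSizePerVariant is a hypothetical name, the z-scores are hardcoded for a two-sided test at common confidence levels, and statistical power is fixed at 80%, a parameter the article's system does not model.

```typescript
// Two-sided critical values for common confidence levels.
const Z_ALPHA = { 90: 1.6449, 95: 1.96, 99: 2.5758 } as const;
// One-sided z-score for 80% power (assumption: power is not configurable here).
const Z_POWER_80 = 0.8416;

// Approximate impressions needed PER VARIANT to detect a given relative
// lift in CTR, using the normal approximation for two proportions.
function requiredSampleSizePerVariant(
	baselineCtr: number,
	relativeLift: number,
	confidenceLevel: keyof typeof Z_ALPHA = 95
): number {
	const p1 = baselineCtr;
	const p2 = baselineCtr * (1 + relativeLift);
	const pBar = (p1 + p2) / 2;

	const numerator =
		Z_ALPHA[confidenceLevel] * Math.sqrt(2 * pBar * (1 - pBar)) +
		Z_POWER_80 * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2));

	return Math.ceil((numerator * numerator) / ((p2 - p1) ** 2));
}

// Same 10% relative lift, two different baselines.
console.log(requiredSampleSizePerVariant(0.01, 0.1)); // 1% baseline
console.log(requiredSampleSizePerVariant(0.1, 0.1)); // 10% baseline
```

Under these assumptions the 1% baseline needs roughly 163,000 impressions per variant and the 10% baseline roughly 15,000, both far above the default of 1000. Treat the default as a floor that prevents analysis on trivially small samples, and set minSampleSize per experiment from a calculation like this one.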

What’s Next

The experimentation layer is now complete:

Deterministic bucket assignment using FNV-1a hashing for consistent user experiences
Experiment data model with variants, bucket ranges, and traffic allocation
Resolver integration that prioritizes experiment assignment over weighted selection
Per-variant metric tracking for independent performance measurement
Chi-squared statistical test implemented from scratch with confidence intervals
Auto-completion that detects significance and declares winners
A/B test panel component with visual confidence interval comparison