from statsmodels.stats.proportion import proportions_ztest, confint_proportions_2indep

# Observed counts from the experiment.
control_conversions = 920
control_users = 12_500
treatment_conversions = 1_015
treatment_users = 12_430

# Two-sample z-test for a difference in conversion rates.
z_stat, p_value = proportions_ztest(
    count=[treatment_conversions, control_conversions],
    nobs=[treatment_users, control_users],
)

# Wald confidence interval for the rate difference (treatment minus control).
# statsmodels also offers score-based methods such as 'newcomb', which tend
# to have better coverage at small sample sizes.
ci_low, ci_high = confint_proportions_2indep(
    count1=treatment_conversions,
    nobs1=treatment_users,
    count2=control_conversions,
    nobs2=control_users,
    method='wald',
)

print({'z_stat': float(z_stat), 'p_value': float(p_value),
       'ci': [float(ci_low), float(ci_high)]})
Experiment analysis should not stop at a binary win-or-lose label. Before recommending a rollout I calculate uplift, confidence intervals, and guardrail metrics such as latency or refund rate. The point of the analysis is decision quality, not statistical theater.
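To make the uplift concrete: relative uplift is the change in conversion rate expressed as a fraction of the control rate. A minimal sketch, using the counts from the test above (the helper name `relative_uplift` is my own, not part of any library):

```python
def relative_uplift(control_conversions, control_users,
                    treatment_conversions, treatment_users):
    """Relative change in conversion rate, as a fraction of the control rate."""
    control_rate = control_conversions / control_users
    treatment_rate = treatment_conversions / treatment_users
    return (treatment_rate - control_rate) / control_rate

# Same counts as the z-test above: roughly an 11% relative lift.
uplift = relative_uplift(920, 12_500, 1_015, 12_430)
print(f"relative uplift: {uplift:.1%}")
```

Reporting the relative number alongside the absolute confidence interval keeps the stakeholder conversation grounded: a 0.8-point absolute difference reads very differently depending on whether the baseline is 7% or 70%.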