Sequential Probability Ratio Test for AI Products

Introduction
It’s critical for AI teams to measure the success of new AI product launches, but it can be especially challenging when the product has very few users. Enterprises are actively rolling out new agentic systems to production, but no one we know is considering running A/B tests. This is because traditional A/B testing requires a certain sample size and a fixed test duration, which may be impossible to meet with low user volume in the early days. Of course, you also can’t wait weeks or months until you have enough data. So how do you iterate on your feature set with limited data and under time constraints?
This is where the Sequential Probability Ratio Test (SPRT) can be extremely useful. SPRT is a sequential hypothesis testing method that lets you analyze results dynamically as data arrives, helping you reach conclusions sooner without sacrificing statistical rigor. We will walk through how SPRT is particularly useful for new AI products with limited distribution.
Basic Intuition
At its core, SPRT continually computes a likelihood ratio comparing how likely the observed data is under 2 competing hypotheses:
- Null hypothesis H0: the new AI feature has no improvement
- Alternative hypothesis H1: the new AI feature has a meaningful improvement
After each new data point (or small batch of data points), SPRT updates the likelihood ratio (LR), defined as:
Λn = P(data | H1) / P(data | H0)
where P(data | Hi) is the probability of the accumulated data under hypothesis Hi. So intuitively, Λn tells us which hypothesis is better supported by the data so far. A high ratio favors H1, a low ratio favors H0.
We can then compare Λn to 2 thresholds – an upper threshold A and a lower threshold B to decide whether to stop:
- If Λn >= A: we have enough evidence for H1 (the alternative), so accept H1
- If Λn <= B: we have enough evidence for H0, so accept H0
- If B < Λn < A: the evidence is inconclusive, so continue and collect more data
This process is repeated sequentially until one of the thresholds is crossed, at which point a decision is made. In practice, A and B are chosen based on the desired error rates, with A > 1 and B < 1. Usually:
A ≈ (1 − β) / ɑ and B ≈ β / (1 − ɑ)
Remember that ɑ is the probability of making a Type I error (usually set to 0.05), and β is the probability of making a Type II error (set to 0.2 if you want 80% power).
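As a quick sanity check, here’s what those thresholds look like in a few lines of Python (a minimal sketch; the variable names are just for illustration):

```python
# Wald's approximate SPRT thresholds, computed from the desired error rates
alpha = 0.05   # Type I error rate
beta = 0.20    # Type II error rate (1 - power)

A = (1 - beta) / alpha   # upper threshold: stop and accept H1 once Λn >= A
B = beta / (1 - alpha)   # lower threshold: stop and accept H0 once Λn <= B

print(A, B)  # 16.0 and roughly 0.21
```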
The key idea is that SPRT allows you to “stop when you’re confident”, as soon as sufficient evidence has accumulated. You don’t have to wait for a fixed sample size!
SPRT vs. Traditional Fixed-Sample Testing
How does SPRT differ from traditional A/B testing, and why does it matter?
- Sample size and efficiency: in a traditional A/B test, you calculate an upfront sample size (say, via power analysis) and run the experiment until that many observations are collected, regardless of whether the difference was obvious earlier or not. This can lead to collecting more data than necessary. SPRT, by contrast, often stops early when results are clear, avoiding unnecessary samples. On average, SPRT uses fewer samples to reach a conclusion than a fixed test with the same error rates.
- Decision timing: A fixed test only makes a decision at the end of the experiment (after N samples). Until then, you typically shouldn’t peek at results, or you risk the “peeking problem” of inflating false positives. SPRT, on the other hand, enables continuous monitoring of results and allows stopping as soon as criteria are met. You can look at your experiment dashboard every day and not worry that you're invalidating the test – if it meets the SPRT stop condition, you can trust the result 😊
- Type I and II Errors: Fixed-sample tests guarantee error rates exactly at N samples (under the planned effect size). SPRT guarantees its error bounds regardless of the sample size at which it stops. In essence, SPRT does not increase false positives even though you’re checking continuously.
- Outcome Interpretation: A fixed-sample test yields a p-value and a reject or fail-to-reject conclusion after N samples. SPRT yields a decision at a sample count that is itself random – it depends on when a boundary is crossed. One additional benefit of SPRT is that it can affirmatively accept H0 (with confidence) if evidence for no difference accumulates (i.e., when Λn falls below B). This can be useful information for AI teams (for instance, knowing a new AI feature is genuinely not outperforming the old one, so you might stop investing in that approach).
Why SPRT is Especially Useful for AI Product Experimentation
For AI products or features in their early stages or limited rollout, data can be a precious commodity. Here’s why SPRT is particularly useful in this context:
- Limited Traffic / Users: New AI features (say a beta version of an agent or AI search feature) can receive limited traffic in the early days – perhaps only a few hundred users try it in a week. A traditional A/B test might tell you that you need, for example, 5k users to detect a moderate improvement with statistical significance. Waiting to accumulate that many users could take a long time. SPRT offers a smarter approach: you start the experiment and let it run sequentially. If the new feature truly has a big impact (positive or negative), SPRT may detect it after 500 or 1000 users, thereby shortening the experiment time dramatically. 😃 Conversely, if there’s no effect, SPRT might also figure that out sooner and allow you to stop or adjust the feature without wasting time.
- Faster Iteration Cycles: Early-stage products undergo rapid iteration. AI teams often want to try a tweak, ship to a small set of users, measure, and iterate. SPRT aligns with iteration speed. As soon as your new AI feature is clearly outperforming the old, you can roll it out wider; if it’s clearly underperforming, you can roll back immediately.
- Risk Mitigation (Catch Regressions Early): New AI features can have unexpected behavior like RAG hallucinations (leading to a regression in a KPI). With early users, you want to minimize harm, of course. SPRT will flag a significant drop as soon as enough evidence accumulates. SPRT might detect the significant harm after, say, 200 user interactions and you can roll back immediately, rather than discovering only after reaching 1000 users in a fixed test that you had a serious issue.
- Small Effect Detection and Futility: If the improvement from your new AI feature is very small (or non-existent), a fixed-sample test might grind out the full sample and still give you a “not statistically significant” result… 😆 With SPRT, you can set a “futility boundary”, effectively the lower threshold B, which will terminate the test early if it’s clear that the results are not trending toward statistical significance. For a low-traffic AI product, deciding early that “we’re not seeing a big enough effect” is great because it frees you to pivot to the next idea instead of running a long inconclusive test.
Real World Example: Testing a new feature for a chatbot with limited users
Imagine you’ve developed a new AI chatbot and added a feature where it proactively suggests solutions. You want to test if this new feature improves user satisfaction (measured by a thumbs-up feedback rate) compared to the old version. However, your chatbot is only in beta with ~2000 users total, and only 10% of them might use the new feature in a week – so getting even a few hundred samples could take weeks.
Using a traditional A/B test, you estimate needing ~800 feedback responses in each group to have 80% power to detect, say, a 5% increase in the positive feedback rate at 95% confidence. That might mean running the experiment for 8+ weeks given your traffic. That’s a long time…
Let’s say you use SPRT with ɑ=0.05, β=0.2 (power 0.8) and set H1 as a 5% increase scenario. You might set up the test with boundaries A≈16 and B≈0.21, based on the equations mentioned earlier. The experiment goes live, and you monitor results daily. After each user interaction with the chatbot, the likelihood ratio is updated. A few scenarios could happen:
- By week 3, suppose the new feature has gotten 300 users and the data shows a noticeably higher satisfaction rate. The SPRT likelihood ratio crosses above A – meaning the odds of seeing such data if H1 (the improvement) is true vs. if H0 (no improvement) is true are over 16:1. The test stops early and concludes the new feature significantly improves satisfaction. You can confidently roll out the feature to all users a month sooner than planned.
- Alternatively, by week 3, maybe the satisfaction rates are about the same or slightly worse with the new feature. The likelihood ratio dips below B, say at 0.15, indicating strong evidence against the new feature delivering the hoped improvement. In fact, this would correspond to statistically detecting a likely regression or no gain. The SPRT would stop and essentially “accept H0,” meaning you decide the feature isn’t worth pursuing (at least not at the effect size you cared about). This saves you from continuing the test for 5 more weeks only to find a null result; you can go back to the drawing board sooner.
- If the results are borderline (neither clearly good nor clearly bad), SPRT will continue to gather more data until it reaches a decision or perhaps until a maximum sample or time limit is hit. In the worst case, you might run the full length similar to a fixed test. But in no case will SPRT perform worse than a fixed test in terms of error rates, and in many cases it will have reached a decision much faster.
This example shows you how SPRT is great for the low-data regime of new AI features: it maximizes the info gained from each user interaction and allows you to reach conclusions as soon as the data is there. 🚀
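If you want to see these dynamics for yourself, here’s a rough Python simulation of the chatbot scenario. The 30% baseline thumbs-up rate and the 36% “true” rate are made-up numbers for illustration, not figures from the example:

```python
import math
import random

random.seed(7)

p0 = 0.30                # assumed baseline thumbs-up rate (illustrative)
p1 = p0 + 0.05           # H1: a 5 percentage point increase
alpha, beta = 0.05, 0.20
log_A = math.log((1 - beta) / alpha)   # ≈ log(16)
log_B = math.log(beta / (1 - alpha))   # ≈ log(0.21)

true_rate = 0.36         # pretend the feature really does help (illustrative)
log_lr = 0.0

for n in range(1, 2001):                     # cap at 2000 interactions
    thumbs_up = random.random() < true_rate  # simulate one user's feedback
    # Bernoulli log-likelihood-ratio increment for this observation
    log_lr += math.log(p1 / p0) if thumbs_up else math.log((1 - p1) / (1 - p0))

    if log_lr >= log_A:
        print(f"Stopped after {n} users: accept H1 (feature improves satisfaction)")
        break
    if log_lr <= log_B:
        print(f"Stopped after {n} users: accept H0 (no meaningful improvement)")
        break
else:
    print("Hit the cap without crossing a boundary: treat as inconclusive")
```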
How to use SPRT, as an AI Engineer 🚀
- Define hypotheses and minimum effect size. You need to specify H0 and H1. For example, if you expect a 10% lift from a new AI feature, set H1 to a 10% lift. If you want to cover a range of possible effects, you could use a mixture approach (see the note on mSPRT below).
- Choose error rates (ɑ, β). As mentioned earlier, usually ɑ is 0.05, and you could choose β = 0.2 if you want 80% power.
- Compute SPRT thresholds. Use the formulas mentioned earlier in the blog post.
- Track your data sequentially in the Logs feature on Patronus. Implement code that updates the likelihood ratio after each new data point or batch, and check it against A and B at each step – see the sketch after this list. Continuous monitoring is key!
- Automate experiment stopping rules. This could be done with cron jobs or CI/CD. This is so you stop the experiment once A or B is crossed. You could define a maximum sample size and experiment duration too.
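To make the tracking and stopping-rule steps above concrete, here’s a minimal sketch for a binary thumbs-up metric. The sprt_decision helper and the idea of passing it cumulative counts pulled from your logs are illustrative assumptions; adapt them to however you store feedback events:

```python
import math

def sprt_decision(successes: int, trials: int,
                  p0: float, p1: float,
                  alpha: float = 0.05, beta: float = 0.20) -> str:
    """Check Wald's SPRT stopping rule for a binary success metric.

    successes/trials are the cumulative counts observed so far on the new
    feature; p0 and p1 are the success rates assumed under H0 and H1.
    """
    log_A = math.log((1 - beta) / alpha)   # upper boundary (log scale)
    log_B = math.log(beta / (1 - alpha))   # lower boundary (log scale)

    # Cumulative log-likelihood ratio for Bernoulli observations
    log_lr = (successes * math.log(p1 / p0)
              + (trials - successes) * math.log((1 - p1) / (1 - p0)))

    if log_lr >= log_A:
        return "accept_H1"   # enough evidence for the improvement
    if log_lr <= log_B:
        return "accept_H0"   # enough evidence of no meaningful improvement
    return "continue"        # evidence still inconclusive, keep collecting

# Example: 96 thumbs-up out of 300 interactions, testing 30% vs. 35%
print(sprt_decision(successes=96, trials=300, p0=0.30, p1=0.35))  # likely "continue"
```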
Advanced: Mathematical Background of SPRT
Let’s formalize the setup with some notation and math. Suppose we want to test:
- H0: Θ = Θ0 (e.g. conversion rate p = p0 for the control or old model),
- H1: Θ = Θ1 (e.g. improved conversion rate p = p1 for the new AI-driven feature),
where Θ is some metric or parameter of interest (it could be a click-through rate, conversion probability, mean user rating, etc.). For concreteness, imagine Θ is a conversion probability. Under H0, p=p0; under H1, p=p1. Each user interaction (conversion or no conversion) provides evidence. The likelihood ratio after n interactions (with x successes/conversions out of n) would be:
Λn = P(data | H1) / P(data | H0) = [p1^x (1 − p1)^(n−x)] / [p0^x (1 − p0)^(n−x)].
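In practice, it’s often easier (and numerically safer) to track the log-likelihood ratio, which is just a running sum:
log Λn = x·log(p1 / p0) + (n − x)·log((1 − p1) / (1 − p0))
You add log(p1/p0) for every success and log((1 − p1)/(1 − p0)) for every failure, then compare the running total against log A and log B.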
Decision thresholds: We choose thresholds A (upper) and B (lower) based on our tolerances for Type I and Type II errors (false positives and false negatives). Specifically, if we want a significance level ɑ (maximal Type I error rate) and power 1 − β (so β is the Type II error rate), one convenient choice is:
A ≈ (1 − β) / ɑ and B ≈ β / (1 − ɑ)
How it works: As data comes in, if H1 is true, the likelihood ratio will tend to grow and eventually exceed A; if H0 is true, Λn will tend to shrink below B. The test stops at whichever boundary is hit first. Notably, SPRT can also end by accepting H0 (not just rejecting it); if evidence strongly favors no effect, SPRT explicitly makes that call. This differs from fixed-sample tests where one typically either rejects H0 or “fails to reject” (without a firm conclusion in favor of H0). The thresholds A, B effectively control the Type I/II error rates of these decisions by design.
Optimality: One of the cool things about SPRT is that it is optimal (in the Wald–Wolfowitz sense) for testing simple hypotheses: Wald proved that, among all tests with the same error rates ɑ and β, no other test uses fewer samples on average. In other words, SPRT minimizes the expected sample size required to reach a decision. This means that if an effect truly exists, SPRT will detect it as efficiently as possible, and if there is no effect, SPRT will likely recognize that sooner than a fixed-length test. This makes SPRT highly sample-efficient – great when data is scarce or costly.
It’s worth noting that the classic SPRT assumes simple hypotheses (exact H0 and H1 parameter values). In practice, if H1 is composite or not a single specified value (e.g. “the new feature could have some positive impact, but we don’t know how much”), you can use variants like mixture SPRT (mSPRT), which average over a range of possible effects.
If you’d like a free SPRT experimentation setup for your AI product, fill out this form below: ____