Modeling Statistical Risk in AI Products

Introduction
As enterprises prepare to launch new AI experiences into production in 2025, we at Patronus AI frequently hear that leaders are concerned about AI “going off the rails”. They understand that hallucinations and other unexpected behavior can cause serious reputational damage and financial risk to the company. The quintessential example is Air Canada’s chatbot providing incorrect refund information, which resulted in unhappy customers and even legal damages (link here).
In recent months, strategy and analytics teams across enterprises have been tasked with developing ROI analyses for new AI initiatives. A common question we’ve gotten from them is: “How do we quantify the true impact of AI hallucinations? We understand we need guardrails, but is there a specific number I can give my C-Suite on the estimated risk level if we ship without guardrails?”.
We decided to put together a comprehensive guide to help enterprises model statistical risk in new AI products. Let’s dive in!
Setup
This new statistical risk model will quantify how AI errors impact key business metrics like Average Revenue Per User (ARPU). A company can input its own baseline metrics (e.g. current ARPU, user volume) and parameters (e.g. error frequency, churn sensitivity) to simulate outcomes.
For the purpose of this technical guide, we differentiate between two types of AI products:
- LLM-based chatbots: single-step evaluation of an output with respect to its conversational context
- Autonomous agents: multi-step evaluation where errors can accumulate and surface at any point in the trajectory, due to long-context planning and reasoning issues
We will capture one-shot error risk in chatbots, and cumulative error risk in multi-step autonomous agents. We will also use Bayesian inference for uncertain parameters.
Goal: assess potential revenue impact due to AI errors.
Model Parameters and Assumptions
Model Parameters
- A (ARPU Baseline): average revenue per user (ARPU) over a given period, in dollars.
- U (Total Users): total users of the product – this could refer to total active users during the period. A × U = total revenue in the period, before considering any AI-related churn.
- p_h (hallucination frequency per interaction): this can be estimated through evals. We assume each interaction’s outcome is independent.
- For chatbots: p_c is the per-query hallucination probability, and n_c is the number of queries per user in the period
- For autonomous agents: p_a is the per-step error probability, k is the average number of steps or actions the agent takes to complete a task, and n_a is the number of tasks per user in the period
- α (user churn sensitivity to AI errors): among users who experience a significant AI failure, this is the fraction who decide to leave. This parameter encapsulates user tolerance.
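To make these inputs concrete, here is a minimal sketch of a parameter container in Python. The class name and the illustrative values at the bottom are our own, not a prescribed configuration:

```python
from dataclasses import dataclass

@dataclass
class RiskParams:
    A: float          # ARPU baseline, dollars per user per period
    U: int            # total active users in the period
    alpha: float      # churn sensitivity: fraction of error-affected users who leave
    # Chatbot parameters
    p_c: float = 0.0  # per-query hallucination probability
    n_c: int = 0      # queries per user per period
    # Agent parameters
    p_a: float = 0.0  # per-step error probability
    k: int = 0        # average steps per task
    n_a: int = 0      # tasks per user per period

# Illustrative values only (hypothetical product)
params = RiskParams(A=10.0, U=100_000, alpha=0.1, p_c=0.02, n_c=50)
```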
Assumptions
We assume the AI error has a noticeable negative impact on the user experience. We also assume that churn due to an AI error happens quickly, affecting that period’s revenue; we then also lose those users’ ARPU in subsequent periods. Finally, we assume a user who sees multiple errors is more likely to churn.
We will focus on immediate period impacts only, but this statistical model can be extended to lifetime impact as well!
Modeling Statistical Risk in Chatbots
p_c = probability that the chatbot produces a hallucination in a single response.
A user may make multiple queries to the chatbot in a period, so let’s denote the number of queries as n_c. (1 - p_c)^{n_c} is the probability that a user sees zero errors across all their queries. Then, the probability that the user experiences at least one hallucination is:

P_{error,user} = 1 - (1 - p_c)^{n_c}

User Churn Probability:

Churn_{AI} = α × P_{error,user} = α × (1 - (1 - p_c)^{n_c})

This is the incremental churn rate attributable to AI errors: α converts the probability of experiencing an error into the probability of leaving.
With chatbots, errors are isolated events. So even one bad experience can translate to a reasonably high amount of churn, according to this model.
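Here is a minimal Python sketch of these two formulas (the function names are ours, and the example values are illustrative):

```python
def p_error_user_chatbot(p_c: float, n_c: int) -> float:
    """Probability a user sees at least one hallucination across n_c queries,
    assuming independent outcomes per query."""
    return 1.0 - (1.0 - p_c) ** n_c

def churn_rate(alpha: float, p_error_user: float) -> float:
    """Incremental churn rate attributable to AI errors."""
    return alpha * p_error_user

# Example: 2% per-query hallucination rate, 20 queries in the period
p_err = p_error_user_chatbot(0.02, 20)   # ~0.332
print(f"P(at least one error) = {p_err:.3f}, churn = {churn_rate(0.1, p_err):.3f}")
```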
Modeling Statistical Risk in Autonomous Agents
The sequential decision making with agents introduces a compounding risk: an error at any step can derail the entire task. The more steps involved, the higher the chance something goes wrong by the end.
We can extend the chatbot model by accounting for cumulative error probability per task, and then per user across multiple tasks.
p_a = probability of AI error in a single step of the agent flow.
Assume a task has k sequential steps. The probability that the entire task completes with no errors is (1 - p_a)^k. Therefore, the probability that at least one error occurs within the agent flow is:

P_{error,task} = 1 - (1 - p_a)^k
Note the compounding effect here – a 1% error rate per step compounds to a 63% chance of at least one error by the 100th step in an agent flow.
Now, if the user uses the autonomous agent for n_a tasks in the period:

P_{error,user} = 1 - (1 - P_{error,task})^{n_a} = 1 - (1 - p_a)^{k·n_a}
So if a user runs many tasks, each with many steps, the likelihood that at least one step goes wrong somewhere is very high…
User Churn Probability:

Churn_{AI} = α × P_{error,user} = α × (1 - (1 - p_a)^{k·n_a})
We could argue that an agent failure is more severe than a chatbot failure, because the agent can continue acting on wrong information, and potentially waste a lot of time in the process.
Example: Let’s say an agent has p_a = 0.005 (0.5% error per step) and performs k = 20 steps per task. Then P_{error,task} = 1 - (0.995)^{20} ≈ 9.5%: the chance the agent fails on a given task. If a user uses the agent for n_a = 5 tasks, the chance the user sees a failure is 1 - (1 - 0.095)^5 ≈ 39.3%. If α = 0.2, then 0.2 × 0.393 ≈ 7.9% of users churn due to the agent’s issues.
Note: even with a low per-step error rate, the compounding effect across many steps and tasks results in a non-trivial churn risk!
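A short Python sketch that reproduces the example above (the function names are ours):

```python
def p_error_task(p_a: float, k: int) -> float:
    """Probability that at least one step fails in a k-step task."""
    return 1.0 - (1.0 - p_a) ** k

def p_error_user_agent(p_a: float, k: int, n_a: int) -> float:
    """Probability a user sees at least one failed task across n_a tasks."""
    return 1.0 - (1.0 - p_error_task(p_a, k)) ** n_a

# Reproduce the example: p_a = 0.005, k = 20, n_a = 5, alpha = 0.2
task = p_error_task(0.005, 20)            # ~0.095
user = p_error_user_agent(0.005, 20, 5)   # ~0.394
print(f"task failure = {task:.3f}, user-level = {user:.3f}, churn = {0.2 * user:.3f}")
```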
Extension: If the agent is operating continuously or in very long sessions, we could scale up n_a or treat it as a time variable.
Note: Overall, this is a simplified approach for the purpose of this blog post. In reality, α (churn sensitivity) is highly dependent on the task, so it should be defined as α_task, which would then be multiplied by the individual error probabilities. This makes more sense in cases where, for example, we want our agent to use a DB tool accurately (high risk), but we don’t mind if it makes a few web search errors (low risk).
Revenue Impact
Total Revenue Impact
Revenue loss (chatbots):

ΔRevenue_{chatbot} = A × U × α × (1 - (1 - p_c)^{n_c})

Revenue loss (autonomous agents):

ΔRevenue_{agent} = A × U × α × (1 - (1 - p_a)^{k·n_a})
Total ARPU Impact
If we consider ARPU over the initial user base, then after losing some users the effective ARPU drops, because revenue dropped but we’re averaging over the original count. Here, ARPU would be

ARPU_{effective} = A × (1 - Churn_{AI})

Using the chatbot case, effective ARPU becomes

ARPU_{effective} = A × (1 - α × (1 - (1 - p_c)^{n_c}))

For example, if churn is 7.9%, then effective ARPU is 92.1% of baseline (for that original cohort): with a $10 baseline, $10 × 0.921 = $9.21 effective ARPU versus $10 originally.
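In code, the revenue formulas are one-liners. A minimal sketch (the function names are ours; P_error,user comes from the chatbot or agent formulas above):

```python
def revenue_loss(A: float, U: int, alpha: float, p_error_user: float) -> float:
    """Revenue lost in the period: churned users (U * alpha * P_error,user) times ARPU."""
    return A * U * alpha * p_error_user

def effective_arpu(A: float, alpha: float, p_error_user: float) -> float:
    """ARPU averaged over the original user base after AI-driven churn."""
    return A * (1.0 - alpha * p_error_user)

# Example matching the ARPU illustration above: ~7.9% churn on a $10 baseline
print(f"${effective_arpu(10.0, 0.2, 0.393):.2f}")   # -> $9.21
```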
Preventing Revenue Loss
We can use all of this to determine what an “acceptable” error rate is. For example, if we want less than 1% revenue loss due to AI, we can set

A × U × α × P_{error,user} ≤ 0.01 × A × U, i.e., α × P_{error,user} ≤ 0.01,

and solve for p_h (given n_c or {k, n_a}).
It also shows the value of powerful guardrail and optimization tools like Patronus: e.g., if we can set guardrails on an autonomous agent or reduce the number of steps k needed, the risk goes down nonlinearly. Halving k, for example, drastically lowers P_{error,task} for the agent – small improvements in p_h or reductions in the number of agent steps yield outsized reductions in error probability, due to the exponential nature of the (1 - p)^n terms.
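The constraint above can be inverted in closed form. A sketch (the function name and the 1% default budget are ours; if the budget exceeds α, the constraint is satisfied at any error rate, which the code handles explicitly):

```python
def max_error_rate(alpha: float, n: int, loss_budget: float = 0.01) -> float:
    """Largest per-interaction error rate p_h such that expected revenue loss
    stays under `loss_budget` (as a fraction of total revenue).
    Inverts: alpha * (1 - (1 - p_h)**n) <= loss_budget.
    For agents, pass n = k * n_a (total steps per user in the period)."""
    if loss_budget >= alpha:
        return 1.0  # churn can never exceed the budget, any error rate passes
    return 1.0 - (1.0 - loss_budget / alpha) ** (1.0 / n)

# Example: alpha = 0.1, 100 queries per user, 1% revenue-loss budget
print(f"max p_c = {max_error_rate(0.1, 100):.5f}")   # ~0.00105 per query
```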
A Real World Example
Let’s assume that:
- Baseline ARPU A = $15/month, user base U = 50,000.
- Chatbot with p_c = 0.02 per query, and heavy users averaging n_c = 100 queries per month.
- α = 0.1 (10% churn if an error is seen).
First, the probability a user sees an error: 1 - (1 - 0.02)^{100} ≈ 86.7% (!).
Even though 2% per query is low, at 100 queries most users will hit a mistake at some point. Churn due to AI = 0.1 × 0.867 = 0.0867 (8.67% of users). So that is ~4,335 users churning.
Revenue loss = $15 × 50,000 × 0.0867 = $65,025 for that month.
ARPU (original-base) drops to $15 × (1 - 0.0867) = $13.70.
The large impact here is because usage was high. This implies that for frequently-used chatbots, even a small hallucination rate can have big consequences; improving the model’s accuracy (lowering p_c) or limiting exposure can have significant financial benefit.
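For completeness, the same numbers in Python (small differences from the figures above are rounding):

```python
# The worked example: A = $15, U = 50,000, p_c = 0.02, n_c = 100, alpha = 0.1
A, U, alpha, p_c, n_c = 15.0, 50_000, 0.1, 0.02, 100

p_err = 1.0 - (1.0 - p_c) ** n_c   # probability a user sees >= 1 error, ~0.867
churn = alpha * p_err              # incremental churn, ~0.0867

print(f"users churned  ~ {churn * U:,.0f}")
print(f"revenue loss   ~ ${A * U * churn:,.0f}")
print(f"effective ARPU ~ ${A * (1 - churn):.2f}")
```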
Advanced: Bayesian Inference
Bayesian inference is especially important to this framework because an enterprise may not have launched their AI product yet. This means that all parameters, like error frequencies and churn sensitivity, are subject to uncertainty. We can use Bayesian inference to incorporate prior knowledge and update our beliefs about the parameters as new data comes in.
1. First, assign a prior distribution to the uncertain parameters.
a. For example, we might believe the chatbot’s error rate p_c is around 2%, but we’re not certain. We could use a Beta prior, p_c ~ Beta(a, b) – choosing a and b so the prior has mean a/(a+b) = 0.02 and a concentration reflecting our confidence.
b. Similarly, for churn sensitivity α, if we think 10% of users would churn on a bad outcome, but it could reasonably range from, say, 5–20%, we could give α a Beta prior, α ~ Beta(a′, b′), centered at 0.1.
Beta distributions are nice because they’re conjugate priors for binomial observations (success/failure data). This makes sense in this context, since each interaction either contains an error or not, each user either churns or not, etc.
2. Then, do a Bayesian update.
a. After product launch, gather data on the number of queries, the number of AI errors, the number of users who churned, etc. Then update the distributions of those parameters using Bayes’ rule.
b. For example, let’s say we observed x hallucinations out of N queries. The posterior distribution for p_c would be Beta(a + x, b + N - x), combining the prior and the likelihood of the data. This posterior reflects a refined estimate of p_c after seeing evidence. Likewise, if we tracked M users who encountered an error and saw that y of them churned, we could update α to Beta(a′ + y, b′ + M - y).
The neat thing now is that we have distribution ranges for each parameter rather than single point estimates. This is especially useful for parameters like p_c.
3. Then, construct credible intervals for outcomes (see the sketch after this list).
a. Instead of a single point estimate of revenue loss (like the ~$65k in our example), we can create a probabilistic range. By sampling from the posterior distributions of the parameters (e.g. via Monte Carlo simulation), we can simulate many scenarios of (p_c, α) and compute churn and revenue loss for each. This yields a distribution of possible financial impacts.
b. For example, we might conclude that there’s a 90% probability that monthly revenue loss will be between $50k and $100k, and a 10% chance it exceeds $100k.
c. This helps us understand the tail risks (low-probability but high-impact outcomes).
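Here is a minimal end-to-end sketch of steps 1–3 in Python with NumPy. The specific prior parameters and observed counts are hypothetical, chosen only to match the means discussed above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: priors (illustrative Beta choices: mean 0.02 for p_c, mean 0.10 for alpha)
a_pc, b_pc = 2.0, 98.0
a_al, b_al = 10.0, 90.0

# Step 2: hypothetical post-launch observations
x, N = 180, 10_000   # hallucinations observed out of N queries
y, M = 9, 120        # churned users out of M who saw an error

# Conjugate Beta-Binomial updates, sampled for Monte Carlo
pc_post = rng.beta(a_pc + x, b_pc + N - x, size=100_000)
al_post = rng.beta(a_al + y, b_al + M - y, size=100_000)

# Step 3: propagate parameter uncertainty to revenue loss
A, U, n_c = 15.0, 50_000, 100
p_err = 1.0 - (1.0 - pc_post) ** n_c
loss = A * U * al_post * p_err

lo, hi = np.percentile(loss, [5, 95])
print(f"90% credible interval for monthly revenue loss: ${lo:,.0f} - ${hi:,.0f}")
```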
Extension: we can consider using Bayesian hierarchical models, in case users across segments have different churn sensitivities.
Main takeaway: Bayesian inference is useful because there is uncertainty around what will happen with the AI product launch. We don’t need to be overly confident in single point estimates, and we can continue to refine outcome estimates over time.
How to Use this Statistical Model
1. Put in company data for the parameters
2. Calculate base risk
3. Scenario analysis (a minimal sweep is sketched in code after this list)
a. What if usage doubles?
b. What if the error rate doubles?
c. What if the error rate goes down by 50% because of guardrail intervention?
4. Mitigation planning
a. How much can pre-launch testing reduce the hallucination probability?
b. How much does a guardrail intervention, or a reduction in the number of agent steps k, lower the compounded error probability?
5. Bayesian updates
a. After launch, feed real metrics back in: update churn sensitivity and hallucination probability accordingly
b. Recompute to get new projections
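A minimal scenario sweep in Python, using the baseline from the real-world example (the scenario values are illustrative):

```python
# Scenario analysis over the chatbot model (baseline from the worked example)
scenarios = {
    "baseline":           dict(p_c=0.02, n_c=100),
    "usage doubles":      dict(p_c=0.02, n_c=200),
    "error rate doubles": dict(p_c=0.04, n_c=100),
    "guardrails (-50%)":  dict(p_c=0.01, n_c=100),
}

A, U, alpha = 15.0, 50_000, 0.1
for name, s in scenarios.items():
    p_err = 1.0 - (1.0 - s["p_c"]) ** s["n_c"]  # P(user sees >= 1 error)
    print(f"{name:>20}: churn = {alpha * p_err:.3f}, "
          f"loss = ${A * U * alpha * p_err:,.0f}")
```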
Conclusion
If you’d like a free statistical risk assessment of your AI product, fill out this form below: