Gamma: Scaling AI Performance with Automated Evals and Rigorous Experimentation

Key Results

  • 1K+ hours saved on manual evaluation per month through Patronus Judges
  • 15+ LLMs benchmarked with Patronus Experiments
  • 10K+ real-world samples distilled into one coherent ground truth dataset
"Patronus helped our AI team find signals and patterns of error in our datasets. Their LLM Judges enabled us to triage errors and optimize our AI outputs in production settings. Patronus up-leveled our evaluation process and was an invaluable part of our workflow." - Jon Noronha, Co-Founder of Gamma

Introduction 

Gamma is an AI-powered design and presentation platform. 50M users rely on Gamma to create presentations, websites, and social media content.

Problem

Gamma’s AI team approached Patronus in search of a tool for automated AI evaluation. As Gamma’s app scaled, the team collected user feedback that quickly grew past 10K samples. The team faced a question common to many AI teams: “How do we extract patterns that will help us build a more performant AI system, as measured by user behaviors?”

“How do we parse our large volume of data for high signal events?”

The team was interested in identifying patterns that could help them answer questions like:

  • How do we define the anatomy of a highly-rated slide deck? 
  • How do we select the best LLM to increase user satisfaction? 

Automated Failure Detection with Patronus Judges

Gamma’s slide deck generation task produced long, open-ended outputs that were expensive to inspect and annotate manually. Patronus Judges excel at exactly this kind of challenge, where teams need a scalable heuristic to grade thousands of samples. The Judges automated human labeling by localizing errors.

The team selected one common error class from user feedback, missing content, and curated 4 Patronus Judges to thoroughly cover its edge cases:

  • factual-completeness
  • content-length-coverage
  • structural-completeness
  • followed-instructions

Patronus Judges for Gamma’s missing content error class

A Patronus Judge can be built and refined on the platform, via the API, or through the SDK.
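
As a rough illustration of the SDK path, the sketch below calls one of these Judges from Python. The RemoteEvaluator interface and patronus.init() entry point reflect a recent version of the Patronus SDK, and the inputs are invented examples; exact class names, parameters, and result fields may differ, so treat this as a sketch rather than the exact setup Gamma used.

```python
# Minimal sketch: calling a custom Judge via the Patronus Python SDK.
# Class and parameter names follow a recent SDK version and may differ in yours;
# the task_input/task_output values are invented examples.
import patronus
from patronus.evals import RemoteEvaluator

patronus.init()  # reads PATRONUS_API_KEY from the environment

# The "judge" evaluator family, paired with a custom criterion defined on the platform
factual_completeness = RemoteEvaluator("judge", "factual-completeness")

result = factual_completeness.evaluate(
    task_input="Create a 10-slide deck covering Q3 revenue, churn, and hiring plans.",
    task_output="...generated slide deck content...",
)
print(result)  # inspect the pass/fail verdict, score, and explanation
```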

Building Ground Truth Datasets with Patronus Experiments 

Gamma’s open-ended, creative text generation task made it challenging to build golden datasets, which are often touted as the standard for evaluations. The Patronus team worked with Gamma to build ground truth datasets as an alternative, in which samples were graded as “negative” or “positive” using results from the 4 Judges built above. Ground truth and golden datasets help teams benchmark models and optimize prompts.

“Golden datasets are challenging to build for open-ended long-form text generation tasks. Patronus worked with Gamma to build ground truth datasets as an alternative.”


An Experiment is a set of evaluators run over a dataset for a specific task, like evaluating missing content. Experiments can be set up quickly through the Patronus SDK, as shown below. 

Running a Patronus Experiment
Patronus Experiment run with 4 Patronus Judges 
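
In code, an Experiment like the one above can be expressed roughly as follows. This sketch assumes the SDK’s run_experiment helper and RemoteEvaluator class from a recent version; the project name and inline dataset are illustrative stand-ins, and in practice the dataset would be loaded from exported logs.

```python
# Sketch of a Patronus Experiment over the 4 Judges; names follow a recent SDK
# version and may differ. The project name and inline dataset are illustrative.
import patronus
from patronus.evals import RemoteEvaluator
from patronus.experiments import run_experiment

patronus.init()

judges = [
    RemoteEvaluator("judge", "factual-completeness"),
    RemoteEvaluator("judge", "content-length-coverage"),
    RemoteEvaluator("judge", "structural-completeness"),
    RemoteEvaluator("judge", "followed-instructions"),
]

run_experiment(
    project_name="gamma-missing-content",  # illustrative name
    dataset=[
        {
            "task_input": "Outline a 5-slide pitch deck for a climate startup.",
            "task_output": "...model-generated deck content...",
        },
    ],
    evaluators=judges,
)
```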

Patronus Experiments achieve faster error localization. Experiment result summaries show aggregate failures across multiple Judges, guiding teams toward error hotspots. In the Experiment shown here, the factual-completeness Judge scored 54%, while the others hovered above 80%, hinting that one failure case was more prevalent than the rest.

On the platform, teams can annotate a sample to indicate their agreement with a Judge, export the results, and build a ground truth dataset. For each run of an Experiment on Gamma’s data, the result summaries shown above were used to localize an error class, and Annotations were used to label salient samples. The filtered collection was run through additional rounds of Experiments with Evaluators until the 10K+ samples were distilled into a ground truth dataset of 50 samples with both human and Judge scores.
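
The distillation itself is ordinary data wrangling once Experiment results and Annotations are exported. The sketch below shows one way to filter annotated results into a small labeled set; the column names (human_agrees, judge_pass, task_input) are hypothetical placeholders rather than the actual export schema.

```python
# Rough sketch of distilling exported Experiment results into a ground truth set.
# Column names (human_agrees, judge_pass, task_input) are hypothetical; adapt them
# to the real export schema.
import pandas as pd

results = pd.read_csv("experiment_results_export.csv")

# Keep only samples where a human Annotation confirmed the Judge's verdict.
confirmed = results[results["human_agrees"]]

# Label each sample from the Judge outcome: passing samples become "positive",
# failing samples become "negative".
confirmed = confirmed.assign(
    label=confirmed["judge_pass"].map({True: "positive", False: "negative"})
)

# Deduplicate and keep a small, balanced set (e.g. 25 per label for ~50 total).
ground_truth = (
    confirmed.drop_duplicates(subset="task_input")
    .groupby("label")
    .head(25)
)
ground_truth.to_csv("ground_truth_dataset.csv", index=False)
```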

Patronus Annotations allow teams to label their agreement with Evaluators

Validating Judges through Evaluation Metrics 

Gamma’s team, like many AI teams, wanted to assess the validity of their Judge scores. Judges are excellent heuristic tools, but iterating on system performance requires a north star metric. To help the Gamma team ground their Experiments in a metric, the Patronus team labeled ~30 samples using specific criteria for missing content. The team then computed human-Judge alignment using Cohen’s Kappa, an agreement statistic traditionally used in machine learning to assess inter-rater alignment.
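
As a quick illustration of this alignment check (with invented labels, not Gamma’s data), Cohen’s Kappa can be computed with scikit-learn:

```python
# Illustrative human-vs-Judge alignment check using Cohen's Kappa (scikit-learn).
# The labels below are invented: 1 = content complete, 0 = content missing.
from sklearn.metrics import cohen_kappa_score

human_labels = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]   # in practice, ~30 human-labeled samples
judge_labels = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]   # factual-completeness Judge verdicts

kappa = cohen_kappa_score(human_labels, judge_labels)
print(f"Cohen's Kappa: {kappa:.2f}")  # ~0.58 here, conventionally read as moderate agreement
```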

According to Cohen’s Kappa, there was moderate alignment between the human and the factual-completeness Judge. In some cases, the Judge Explanations revealed errors not captured by the human evaluator, who was tasked with reading long-context samples, as shown in the example below. The Tags show that both human labels (original_rating and human_score) indicated the sample contained no missing content, while the LLM reasoned otherwise.

Patronus log showing human labels and Evaluator reasoning 

Because Judges can surface errors that humans miss, teams that agree with the human scoring can choose to fine-tune Judges to increase human-Judge alignment. Alternatively, teams can hill-climb on the existing Judge score and compare the performance of future runs against the current baseline.

Additional metrics in the Tags above, like fact_coverage and Rouge, grounded the Judges: fact_coverage was computed using BERTScore (embedding-based similarity), and the weighted_score was computed from BERTScore and ROUGE-L (longest common subsequence overlap). These traditional NLP metrics can be used as a crosscheck.
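
A rough sketch of such a crosscheck, using the open-source bert_score and rouge_score packages with an illustrative 50/50 weighting (the actual weighting behind weighted_score is not specified here):

```python
# Traditional-NLP crosscheck: BERTScore (embedding similarity) plus ROUGE-L
# (longest common subsequence). The texts and 50/50 weighting are illustrative.
from bert_score import score as bert_score
from rouge_score import rouge_scorer

reference = "The deck should cover Q3 revenue, churn, and hiring plans."
candidate = "This deck covers Q3 revenue and hiring plans."

# BERTScore F1 as a proxy for fact coverage
_, _, f1 = bert_score([candidate], [reference], lang="en")
bert_f1 = f1.item()

# ROUGE-L F-measure for surface-level overlap
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

weighted = 0.5 * bert_f1 + 0.5 * rouge_l  # illustrative weights
print(f"BERTScore F1: {bert_f1:.2f}  ROUGE-L: {rouge_l:.2f}  weighted: {weighted:.2f}")
```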

“Ground Judges in a quantifiable metric like Cohen’s Kappa, which assesses human-Judge alignment, and iterate on system performance using this metric.” 

Achieving Faster Time to Optimization (TTO)

Once errors were localized and the Judges were validated, the Gamma team was interested in identifying the next action for optimizing system performance.

The Patronus team helped Gamma extract key features of their user inputs and map correlations with Judge scores using Patronus Comparisons. Example features included instruction complexity, instruction length, and formatting structure. Shown below is evaluator performance across a range of content complexity:

Patronus Comparisons showing correlations between user inputs and Judge scores  

As shown above, errors are more likely when user instructions contain less detail. These correlations help determine the next step to take, whether that is prompting users differently, prompting the models with a class of error feedback, or fine-tuning models with negative samples. Patronus Experiments help measure these changes, and metrics like Cohen’s Kappa confirm that system performance is trending upward over time.
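
The underlying analysis can be approximated with a simple correlation between input features and Judge scores. The feature names and values below are invented for illustration and are not Gamma’s data.

```python
# Toy version of the Comparisons-style analysis: correlate simple input features
# with a Judge score using pandas. Feature names and values are invented.
import pandas as pd

df = pd.DataFrame({
    "instruction_length": [12, 85, 40, 7, 150, 22],              # tokens in the user prompt
    "num_requirements":   [1, 6, 3, 1, 9, 2],                    # explicit asks in the prompt
    "judge_score":        [0.40, 0.90, 0.70, 0.30, 0.95, 0.60],  # e.g. factual-completeness
})

# Pearson correlation of each feature with the Judge score
print(df.corr(numeric_only=True)["judge_score"].drop("judge_score"))
```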

The Gamma team continues to measure the performance of their models using Patronus Experiments on various datasets, including the ground truth dataset produced by the Patronus team. Shown below are benchmarks on Gamma datasets across the latest multimodal models.

Benchmarking model performance with Patronus Experiments