AI LLM Test Prompts: Best Practices for AI Evaluation
Prompt testing refers to evaluating a prompt to determine how well a large language model (LLM) response aligns with the desired output. With generative AI and LLMs gaining traction, prompt testing has become critically important for ensuring the quality, reliability, and accuracy of LLMs across different use cases. However, prompt testing presents significant challenges.
Since LLM outputs are sensitive to prompt texts, a slight change in the prompt can significantly impact the response. Moreover, LLM outputs are nondeterministic, meaning that repeated tests with the same prompts may produce different outputs. A prompt that shows good performance in one test may show degraded performance when run multiple times.
Additionally, the absence of standardized evaluation metrics, sensitivity to contextual variations and conversation history, and the complexity of handling edge cases make it challenging to ensure consistent and reliable prompt performance.
This article explores techniques and methodologies for efficiently testing LLM prompts. You will learn about various datasets and tools to help select prompts that best suit your generative AI application requirements. After reading this article, you will understand how to test and evaluate prompts for different use cases, choose the appropriate prompt and LLM model for your needs, and implement an efficient prompt testing system using emerging LLM testing tools.
Summary of AI LLM test prompt concepts
Prompt types
Prompts can be broadly categorized into several types, each designed to elicit specific responses from large language models. Here are some common categories.
Zero-shot prompting
In zero-shot prompting, you ask an LLM to generate a response without providing any explicit examples. The model relies purely on its pre-trained knowledge to generate a response.
The following is an example of zero-shot prompting, in which the user asks the model to predict the sentiment of a tweet without providing any examples.

Zero-shot prompting is used for the most basic tasks, where a model’s pre-trained knowledge is enough to generate an appropriate response.
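For illustration (the tweet text below is invented), a zero-shot sentiment prompt written as a Python string, in the same style as the scripts later in this article, might look like this:
zero_shot_prompt = """What is the sentiment expressed in the following tweet?
Tweet: I finally watched the movie everyone is talking about, and it was worth the hype."""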
Few-shot prompting
Few-shot prompting is one of the most straightforward prompting approaches. In few-shot prompting, you provide examples of the tasks you want to perform. Here is an example:

Few-shot prompting is particularly useful when you have limited labeled data for specialized domains.
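For illustration, a few-shot version of the same sentiment task supplies a handful of labeled examples (invented here) before the new input:
few_shot_prompt = """Classify the sentiment of each tweet as positive, negative, or mixed.

Tweet: The battery life on this phone is amazing. -> positive
Tweet: The app crashes every time I open it. -> negative
Tweet: Great camera, but the screen scratches easily. -> mixed

Tweet: I liked the movie but it was a bit too long. ->"""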
Chain-of-thought (CoT) prompting
Chain-of-thought prompting guides an LLM by providing a step-by-step description of how to perform a specific task. As a result, an LLM breaks down the problem into smaller steps and returns a more accurate response than a simple prompt.

The figure above illustrates an example of CoT prompting. A simple prompt is used on the left, where the model generates a direct answer to a question. However, when asked a follow-up question, the model attempts to provide a direct answer again, which is incorrect.
On the right, CoT prompting is applied, guiding the model to provide a step-by-step explanation. As a result, the model breaks down the follow-up question into smaller steps and arrives at the correct answer. This demonstrates how CoT prompting enhances logical reasoning and improves accuracy in multi-step problems.
CoT is a go-to prompt engineering technique for solving complex problems requiring step-by-step logical reasoning.
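For illustration, a minimal CoT prompt simply asks the model to reason through the steps before answering; the word problem below is an invented example:
cot_prompt = """A cafe sold 23 sandwiches in the morning and twice as many in the afternoon.
How many sandwiches did it sell in total?
Think step by step: first work out the afternoon sales, then add the morning sales,
and state the final answer on the last line."""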
Tree-of-thought (ToT) prompting
A tree-of-thought prompt structures the reasoning process like a decision tree and explores multiple potential solutions simultaneously before reaching a conclusion.

The figure above demonstrates the difference between simple input-output prompting, chain-of-thought prompting, and tree-of-thought prompting. On the right, you can see tree-of-thought prompting, where the model explores multiple reasoning paths in parallel and selects the most suitable answer.
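Full ToT implementations require orchestration code to branch, score, and prune intermediate thoughts. A popular single-prompt approximation, drawn from community experiments rather than the original paper, asks the model to simulate several experts exploring paths in parallel:
tot_prompt = """Imagine three different experts are answering this question.
Each expert writes down one step of their thinking and shares it with the group.
Then all experts move on to the next step.
If any expert realizes their path is wrong, they drop out.
Question: Using a 3-liter jug and a 5-liter jug, how can you measure exactly 4 liters of water?"""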
Datasets for prompt testing
Once you know the different types of prompts, the next step in prompt testing is to find datasets containing examples similar to your problem that you can use to evaluate your prompts. The choice of dataset depends heavily on the task you want to perform, but a few general guidelines apply.
Selecting a dataset for prompt testing
The following are some recommendations for consideration in selecting a dataset to test your prompts:
- Relevance: The dataset you select must align with the specific task you want to perform. For example, if you are testing a prompt for sentiment classification, the dataset must contain examples relevant to your specific sentiment classification task, not just any sentiment-labeled text. Furthermore, the dataset must come from the same domain as your problem.
- Data quality: Your dataset must contain high-quality examples relevant to your task. For example, the SQuAD dataset contains high-quality answers sourced directly from Wikipedia, ensuring high accuracy. However, it is important to remember that academic datasets may not generalize well to real-world problems. The best approach is to craft a dataset unique to your problem that captures the nuances of your problem domain.
- Dataset size: A dataset must be large enough to yield statistically significant results and cover the range of query-response scenarios your LLM may encounter. Overly small datasets may omit important use cases that arise in production. That said, the quality of examples matters more than their quantity; a few basic sanity checks, like the sketch after this list, can help you verify both.
- Ethical considerations: Always analyze the dataset for biases that could affect prompt performance and lead to unfair outcomes. Ensure the dataset complies with privacy regulations and does not contain sensitive information.
- Licensing and accessibility: Finally, ensure you have permission to use datasets in any way you wish. The dataset must be accessible and, ideally, in a form you can directly use to test your prompts.
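As a quick illustration of the size and quality recommendations above, here is a minimal sketch of sanity checks you might run on a candidate test dataset with pandas. The file and column names (`candidate_test_set.csv`, `text`, `label`) are hypothetical; adjust them to your own data.
import pandas as pd

# Hypothetical file and column names; replace them with your own dataset
df = pd.read_csv("candidate_test_set.csv")

# Size: is the dataset large enough to give statistically meaningful results?
print(f"Number of examples: {len(df)}")

# Quality: duplicates and missing labels inflate scores and hide gaps
print(f"Duplicate texts: {df['text'].duplicated().sum()}")
print(f"Missing labels: {df['label'].isna().sum()}")

# Coverage: a heavily skewed label distribution may under-represent important cases
print(df['label'].value_counts(normalize=True))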
Finding datasets for prompt testing
For prompt testing, you can use readily available community datasets or create custom ones based on your problem.
You can find open license community datasets on the following platforms:
- Kaggle: Kaggle is a community platform for data science and machine learning. You can find datasets for various NLP problems such as sentiment classification, text summarization, and question answering. You will also find datasets for retrieval-augmented generation (RAG) on Kaggle.
- Papers with Code: Contains a comprehensive list of NLP datasets linked to academic papers. You can also explore the data collection process.
- Google Datasets: Google provides a tool for searching for datasets on third-party websites, such as Kaggle, Papers with Code, etc.
- Hugging Face: Contains one of the most extensive collections of off-the-shelf datasets for various machine learning and data science tasks, including datasets for prompt evaluation.
- Patronus AI Datasets: Third-party LLM evaluation tools like Patronus AI also provide off-the-shelf datasets for prompt testing and evaluation.
In addition to the above sources, you can find task-specific datasets on academic sites such as arXiv, as well as resources like the Prompt Engineering Guide, synthetic multilingual prompt datasets, and PromptSet, which contain ready-to-go prompts for various tasks.
Ideally, create a dataset specific to your problem domain. A tailored dataset provides the most accurate evaluation results and enhances your LLM applications' robustness, maintainability, and accuracy. While building a custom dataset takes time and effort, it outperforms generic academic or open-source datasets.
Prompt testing purposes
Prompt testing is a broad concept with more than one objective. You may have already selected an LLM and want to determine which prompt best solves your problem using that specific model. Alternatively, you might test a single prompt across multiple LLMs to identify the best-performing one. You can even take things further by selecting both the prompt and the LLM based on prompt testing results. This section explores a few dimensions of prompt testing.
Application-based prompt testing
In application-based prompt testing, a prompt is evaluated based on how well it generates a response for a specific application. For example, if you are testing a RAG prompt, you would evaluate the retrieval accuracy and the groundedness of the response in the retrieved context. On the other hand, for chatbots and personal assistants that do not involve RAG, you do not care about the retrieval part; you only focus on response relevance, coherence, and fluency.
Prompt testing for prompt selection
In this scenario, you have already selected an LLM and want to choose a prompt that generates the best response from the LLM. Selecting the best prompt for an LLM involves iteratively refining prompts with different phrases, structures, and levels of detail. Few-shot, zero-shot, and chain-of-thought prompting are standard techniques for testing prompts for a specific LLM.
Prompt testing for LLM selection
In this case, you have not selected an LLM for your problem. Instead, you are trying the same prompts for different LLMs to see which LLM response best fits your application requirements. Prompt refinement is generally not recommended in this situation, as the goal is first to select an LLM that gives a better response even if a prompt is not well-refined.
Prompt testing for prompt and LLM selection
This involves combining the two approaches above. Ideally, you would first select an LLM that generates the best response to your problem. However, you may also have to refine prompts during selection, because one LLM may respond better to a particular prompt while another LLM responds better to a differently phrased prompt.
Once you have selected an LLM, you iteratively refine your prompt and select the prompt that generates the best response for that particular LLM. For example, if you want to select the best LLM and prompt to solve a coding problem, you can use a general-purpose prompt to select an LLM that is strong at coding. You can then tweak the prompt to address its weaknesses and get a better response from the LLM.
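To make this concrete, the sketch below loops over a few candidate models and prompts and scores each combination with a simple label check against a tiny labeled set. The model IDs, prompts, and example tweets are illustrative assumptions, and the token is read from an environment variable; the `InferenceClient` usage mirrors the scripts in the next section.
import os
from huggingface_hub import InferenceClient

# Illustrative candidates; substitute the models and prompts you actually want to compare
candidate_models = ["deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
                    "meta-llama/Llama-3.3-70B-Instruct"]
candidate_prompts = {
    "plain": "What is the sentiment of this tweet? Answer with one word (positive, negative, or mixed): ",
    "instructed": "You are a sentiment classifier. Respond with exactly one word (positive, negative, or mixed) for this tweet: ",
}

# A tiny labeled set for illustration; in practice, use a dataset from your own domain
examples = [("I liked the movie but it was a bit too long.", "mixed"),
            ("The service was fantastic from start to finish.", "positive")]

hf_token = os.environ["HF_API_TOKEN"]  # assumes your Hugging Face token is set as an environment variable

for model_id in candidate_models:
    client = InferenceClient(model_id, token=hf_token)
    for prompt_name, prefix in candidate_prompts.items():
        correct = 0
        for tweet, label in examples:
            output = client.chat_completion(
                messages=[{"role": "user", "content": prefix + tweet}],
                max_tokens=500,
                temperature=0.1,
            ).choices[0].message.content
            # Strip any reasoning block (e.g., DeepSeek R1's <think> section) before checking the label
            answer = output.split("</think>")[-1].strip().lower()
            correct += int(label in answer)
        print(f"{model_id} | {prompt_name}: {correct}/{len(examples)} correct")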
Prompt testing techniques
To explain different prompt techniques, we will use a distilled version of the DeepSeek R1 model from the Hugging Face Inference API.
Install the following Python libraries to run the scripts in this article.
!pip install huggingface_hub==0.24.7
!pip install rouge-score
!pip install transformers
Import the following modules in your application.
from huggingface_hub import InferenceClient
from transformers import pipeline
import torch
import os
import pandas as pd
import seaborn as sns
sns.set_style("darkgrid")
import matplotlib.pyplot as plt
from rouge_score import rouge_scorer
from google.colab import userdata
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.tokenize import word_tokenize
nltk.download('punkt', quiet=True)
nltk.download('wordnet')
Next, create a client object for the DeepSeek model to call the Hugging Face Inference API.
hf_token = userdata.get('HF_API_TOKEN')
deepseek_model_client = InferenceClient(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    token=hf_token
)
Finally, define the generate_response() function, which accepts a model client, the system and user prompts, and the model's temperature and top-p values, and returns the model response.
def generate_response(model, system_prompt, user_query, temperature=0.5, top_p=0.5):
    response = model.chat_completion(
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": user_query}],
        max_tokens=4000,
        temperature=temperature,
        top_p=top_p
    )
    return response.choices[0].message.content
Pointwise vs. pairwise testing
Pointwise Testing
Pointwise prompt testing involves evaluating a prompt independently based on predefined criteria such as accuracy, recall, precision, coherence, or relevance. A prompt's response is analyzed in isolation, without direct comparison to other prompts' responses.
For example, the following prompt asks for the sentiment of a tweet. Pointwise testing means analyzing this prompt's response on its own rather than comparing it to another prompt's response.
system_prompt = "You are an expert tweet sentiment analyzer."
user_query = f"""What is the sentiment expressed in the following tweet:
I like the movie but it was a bit too long."""
output = generate_response(deepseek_model_client,
system_prompt,
user_query)
## only retrieve the response not the thought process
response = output.strip().split("</think>")[-1].strip()
response
Here’s the output:

You should use pointwise testing in scenarios where you have a clear metric or ground truth and want to evaluate a prompt for absolute quality.
Pairwise Testing
In pairwise testing, you compare two different prompts and responses side by side and analyze which one is better for your use case.
For example, in the script below, you have two prompts predicting the tweet's sentiment side by side.
You compare the responses of the two prompts and decide which prompt better suits your requirements.
system_prompt = "You are an expert tweet sentiment analyzer."
user_query1 = f"""What is the sentiment expressed in the following tweet:
I liked the movie but it was a bit too long."""
user_query2 = f"""What is the sentiment expressed in the following tweet.
Your response must be one word: positive, negative, or mixed.
I liked the movie but it was a bit too long."""
prompts = {"Prompt 1": user_query1,
"Prompt 2": user_query2}
for prompt, user_query in prompts.items():
output = generate_response(deepseek_model_client,
system_prompt,
user_query)
response = output.strip().split("</think>")[-1].strip()
print(f"Response from {prompt}: {response}")
Output:

Reference-free vs. reference-based testing
Another common scenario when evaluating prompts is comparing reference-free and reference-based prompt testing. Both reference-based and reference-free approaches may follow pointwise or pairwise testing strategies.
Reference-based Testing
In reference-based testing, you have ground truth response labels, and a model’s output is evaluated against these ground truth labels. For example, the script below has a target label of “mixed.” This code compares the model response with the target label to evaluate the prompt.
target_label = "mixed"
system_prompt = "You are an expert tweet sentiment analyzer."
user_query = f"""What is the sentiment expressed in the following tweet.
Your response must be one word: positive, negative, or mixed.
I liked the movie but it was a bit too long."""
output = generate_response(deepseek_model_client,
system_prompt,
user_query)
response = output.strip().split("</think>")[-1].strip()
print(response)
if response == target_label:
print("Correct")
else:
print("Incorrect")
Output:

Evaluation approaches such as exact match and fuzzy match are commonly used in reference-based prompt testing.
Reference-free Testing
In reference-free approaches, you don't have any ground truth value. For example, in retrieval augmented generation, you don't usually have ground truth values to compare model responses. In such cases, you will typically use human annotators or other LLMs to evaluate a model’s response.
For example, in the following script, we pass the user query and response from our last example to an LLM and ask if the response is correct.
# Llama 3.3 endpoint
# https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct
llama_model_client = InferenceClient(
    "meta-llama/Llama-3.3-70B-Instruct",
    token=hf_token
)
system_prompt = "You are an expert LLM response evaluator."
user_query = f"""Given the following input to an LLM: {user_query},
and the following response: {response}. Do you think the response is accurate?"""
output = generate_response(llama_model_client,
                           system_prompt,
                           user_query)
output
Output:

Factors affecting prompt response
In addition to the prompt text and structure, several other factors affect a prompt's response; the major ones are system instructions, temperature, and the top-p value.
System instructions
System instructions are global instructions that tell the model how to respond to every prompt. They are typically passed via the system prompt (or system message).
For example, the following script returns a full sentence because neither the system nor the user instructions specify that we want a single-word response.
system_prompt = "You are an expert tweet sentiment analyzer."
user_query = f"""What is the sentiment expressed in the following tweet.
Your response must be positive, negative, or mixed.
I liked the movie but it was a bit too long."""
output = generate_response(deepseek_model_client,
system_prompt,
user_query)
response = output.strip().split("</think>")[-1].strip()
response
Output:

In contrast, in the script below, we add a system instruction specifying that we want a single-word response.
system_prompt = "You are an expert tweet sentiment analyzer. You respond in a single word."
user_query = f"""What is the sentiment expressed in the following tweet.
Your response must be positive, negative, or mixed.
I liked the movie but it was a bit too long."""
output = generate_response(deepseek_model_client,
system_prompt,
user_query)
response = output.strip().split("</think>")[-1].strip()
response
Output:

Ideally, all global instructions should be specified in the system prompt, and user prompts should contain only the instructions specific to each individual request.
Temperature settings
Temperature is another critical setting that affects a prompt's response. It controls the randomness of the output and, depending on the framework, ranges from 0 to 1 or from 0 to 2. A lower temperature produces a more deterministic response.
In the following example, we set the temperature value to 0.
system_prompt = "You are an expert tweet sentiment analyzer."
user_query = f"""What is the sentiment expressed in the following tweet:
I liked the movie but it was a bit too long."""
output = generate_response(deepseek_model_client,
system_prompt,
user_query,
temperature = 0)
response = output.strip().split("</think>")[-1].strip()
response
Output:

If you run the above script again with the temperature set to 0, you will receive a response with words similar to those in the above output.
Increasing the temperature will generate a more creative response.
For example, running the script multiple times with a temperature value of 0.9 is likely to produce more varied responses with different vocabulary each time.
output = generate_response(deepseek_model_client,
                           system_prompt,
                           user_query,
                           temperature=0.9)
response = output.strip().split("</think>")[-1].strip()
response
Output:

Ideally, you should use lower temperature values for reference-based prompt testing since you want a more deterministic output for recurrent inputs.
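As a rough sanity check, you can rerun the same prompt several times at each temperature and compare how much the responses vary. The sketch below reuses `generate_response` from earlier and simply counts distinct outputs; it is an illustration, not a formal consistency metric.
for temp in [0.0, 0.9]:
    responses = []
    for _ in range(3):
        output = generate_response(deepseek_model_client,
                                   system_prompt,
                                   user_query,
                                   temperature=temp)
        responses.append(output.strip().split("</think>")[-1].strip())
    # Fewer distinct responses indicates more deterministic behavior
    print(f"Temperature {temp}: {len(set(responses))} distinct responses out of {len(responses)}")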
Top-p settings
Top-p sampling, also known as nucleus sampling, is another way to control the randomness of a prompt's response. The top-p setting defines a probability threshold for sampling the next word in the response sequence: the next word is chosen from the smallest set of candidate words whose cumulative probability reaches the top-p value. A higher top-p value therefore allows the model to sample from a larger set of words than a lower top-p value.
Like the temperature value, a lower top-p value returns a similar response if you run the following example multiple times.
system_prompt = "You are an expert tweet sentiment analyze."
user_query = f"""What is the sentiment expressed in the following tweet:
I liked the movie but it was a bit too long."""
output = generate_response(deepseek_model_client,
system_prompt,
user_query,
temperature = 0.5,
top_p = 0.1)
response = output.strip().split("</think>")[-1].strip()
response
Output:

Increasing the top-p value to 0.9 produces more varied and creative responses across runs.
system_prompt = "You are an expert tweet sentiment analyzer."
user_query = f"""What is the sentiment expressed in the following tweet:
I liked the movie but it was a bit too long."""
output = generate_response(deepseek_model_client,
                           system_prompt,
                           user_query,
                           temperature=0.5,
                           top_p=0.9)
response = output.strip().split("</think>")[-1].strip()
response
Output:

Prompt evaluation criteria
Prompt evaluation criteria highly depend upon the task and the prompt testing strategy. For example, you would ideally choose deterministic criteria such as accuracy or exact match for a sentiment classification task with a reference-based prompt testing strategy. On the other hand, fuzzy match criteria are better suited to summarization and translation tasks.
You can also use other LLMs to evaluate an LLM's response to a prompt, an approach known as LLM-as-a-judge.
In this section, we discuss some commonly used evaluation criteria; the next section covers LLM-as-a-judge.
Exact match
Exact match is the strictest form of accuracy: it returns true only if the LLM-generated text exactly matches the ground truth. It is commonly used for tasks with short, highly deterministic outputs, such as sentiment classification, where the model must predict one of a few labels (positive, negative, neutral, and so on).
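A minimal exact-match check can be a single comparison; in practice, you usually normalize case, whitespace, and trailing punctuation first so that trivial formatting differences do not count as failures. The helper below is an illustrative sketch, not a standard library function.
def exact_match(prediction: str, target: str) -> bool:
    # Normalize case, surrounding whitespace, and a trailing period before comparing
    normalize = lambda text: text.strip().lower().rstrip(".")
    return normalize(prediction) == normalize(target)

print(exact_match("Mixed.", "mixed"))    # True
print(exact_match("positive", "mixed"))  # False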
BLEU and ROUGE scores
Both BLEU and ROUGE scores evaluate machine translation, text summarization, and QA applications, but they differ in focus.
BLEU emphasizes precision: in its simplest unigram form, it measures the fraction of words in the generated response that also appear in the ground truth.
For example:
- Words in ground truth: ["The", "cat", "is", "on", "the", "mat"]
- Words in LLM-generated response: ["The", "cat", "is", "sitting", "on", "the", "mat"]
- BLEU score: 6/7 ≈ 0.86
Conversely, ROUGE focuses on recall, counting how many words from the ground truth appear in the generated response. In the same example, the ROUGE score is 1, as all ground truth words are present.
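To compute these scores in code, you can use the `nltk` and `rouge-score` packages imported at the start of this article. The snippet below is a minimal sketch for the example above; note that `sentence_bleu` uses up to 4-gram precision by default, so its value will differ from the simplified unigram calculation shown here.
reference = "The cat is on the mat"
candidate = "The cat is sitting on the mat"

# BLEU (precision-oriented); smoothing avoids zero scores on very short texts
bleu = sentence_bleu([word_tokenize(reference.lower())],
                     word_tokenize(candidate.lower()),
                     smoothing_function=SmoothingFunction().method1)

# ROUGE-1 (unigram overlap); its recall component matches the calculation above
scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
rouge1 = scorer.score(reference, candidate)["rouge1"]

print(f"BLEU: {bleu:.2f}")
print(f"ROUGE-1 recall: {rouge1.recall:.2f}")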
BLEU and ROUGE struggle with evaluating LLM outputs as they rely on surface-level n-gram overlap, missing valid responses with different phrasing. Their fixed-reference approach fails to account for the open-ended nature of LLMs, often penalizing creative yet accurate answers. Moreover, they correlate poorly with human judgment, as they do not measure coherence, relevance, or factual accuracy.
Fuzzy match
Fuzzy matching measures similarity between texts even when they are not identical, making it useful for evaluating LLM-generated responses that differ in phrasing but retain the same meaning.
Unlike BLEU or ROUGE scores, which rely on exact word overlaps, fuzzy match techniques, such as token-based similarity, evaluate how closely the generated text resembles the reference. This approach is particularly useful for evaluating open-ended responses where multiple valid answers exist, reducing unfair penalization for minor variations in wording.
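As a simple illustration, the standard library's `difflib.SequenceMatcher` returns a character-level similarity ratio between 0 and 1. Dedicated fuzzy-match evaluators (such as the Patronus AI judge used later in this article) go further and judge semantic equivalence rather than surface similarity; the example texts below are invented.
from difflib import SequenceMatcher

reference = "The movie was enjoyable but felt too long."
candidate = "I enjoyed the movie, though it dragged on a bit."

# A ratio close to 1 means near-identical text; semantically equivalent
# rewordings like these still score relatively low at the character level
similarity = SequenceMatcher(None, reference.lower(), candidate.lower()).ratio()
print(f"Fuzzy similarity: {similarity:.2f}")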
While deterministic evaluation criteria are helpful for reference-based prompt testing, they are unsuitable when you do not have a ground-truth value. In such a case, you either have to evaluate the results manually, or you can use LLM as a judge.
Note: LLM-as-a-judge is not limited to reference-free testing. You can also use it in reference-based prompt testing and combine its results with deterministic metrics.
Additionally, prompt testing can be cumbersome. You must select or develop your evaluation criteria, select an LLM, and then run experiments. Large datasets and result visualization further complicate this process.
If you are facing these problems, consider using an LLM evaluation platform such as Patronus AI that provides access to state-of-the-art LLM-as-a-judge evaluators.
Patronus AI platform
Patronus AI offers end-to-end LLM evaluation and prompt testing functionalities where you can use a built-in LLM evaluator or define your own evaluator to test your prompts. You can review a list of all LLM evaluators with various criteria and see which LLM best suits your needs. You can also access Patronus AI evaluators via its Python Library.
LLM as a judge
You have already seen an example of LLM-as-a-judge in the reference-free prompt testing example above. This section shows more extensive examples using judge evaluators from Patronus AI.
Let's see how to use Patronus AI judge evaluators to calculate exact match and fuzzy match scores for LLM-generated summaries.
To access Patronus AI judge evaluators, you must install the Patronus AI Python library and set your Patronus AI API key in your application.
!pip install -q -U patronus
from patronus import read_csv, read_jsonl
from patronus import Client
PATRONUS_API_KEY = userdata.get('PATRONUS_API_KEY')
We will summarize an article from the News Articles Dataset and calculate the exact match and fuzzy match using the Patronus AI evaluators. The following script imports the dataset and displays its header.
# dataset download link: https://github.com/reddzzz/DataScience_FP/blob/main/dataset.xlsx
summaries = pd.read_excel(r'/content/summary_dataset.xlsx')
summaries.head()
Output:

Next, we use the distilled DeepSeek model to generate a summary of an article from the dataset.
content = summaries["content"].iloc[10]
summary = summaries["human_summary"].iloc[10]
system_role = "You are an expert in text summarization. Summarize the articles like human."
user_query = f"""Generate a summary of the following article in 1000 characters:\n{content}"""
output = generate_response(deepseek_model_client,
system_role,
user_query)
## only retrieve the response not the thought process
response = output.strip().split("</think>")[-1].strip()
response
Output:

To use the judge evaluators, we define the `evaluate_summarization_patronus` function, which accepts the reference (human-written) and candidate (LLM-generated) summaries as parameters and calculates the exact match and fuzzy match scores.
The function uses the `exact-match` evaluator and the `patronus:fuzzy-match` judge criteria to evaluate LLM-generated summaries against human summaries.
To use Patronus AI evaluators, you create an object of the `Client` class from Patronus AI and call its `evaluate` method, passing the `evaluator` as a parameter. Depending on your problem, you can also pass the `evaluated_model_input`, `evaluated_model_output`, `evaluated_model_gold_answer`, and `evaluated_model_retrieved_context` fields.
For evaluating summaries, we only need the `evaluated_model_output`, which stores the LLM-generated summary, and the `evaluated_model_gold_answer`, which corresponds to the human-written summary.
The following script defines the function and calls it to evaluate the `response` we retrieved in the previous script using the `exact-match` and `fuzzy-match` evaluators.
client = Client(api_key=PATRONUS_API_KEY)

def evaluate_summarization_patronus(reference, candidate):
    # reference: human-written summary; candidate: LLM-generated summary
    exact_match = client.evaluate(
        evaluator="exact-match",
        evaluated_model_output=candidate,
        evaluated_model_gold_answer=reference
    )
    fuzzy_score = client.evaluate(
        evaluator="judge",
        criteria="patronus:fuzzy-match",
        evaluated_model_output=candidate,
        evaluated_model_gold_answer=reference
    )
    results = {
        'Exact Match': exact_match,
        'Fuzzy Match': fuzzy_score
    }
    return results

result = evaluate_summarization_patronus(summary, response)
result
Output:

The output shows various evaluation details for both the `exact-match` and `fuzzy-match` evaluators. The following script demonstrates how to retrieve individual output fields such as `pass_`, `score_raw`, and `explanation`.
for key, value in result.items():
    print(f"Results for {key}")
    print(f"Pass: {value.pass_}")
    print(f"Score Raw: {value.score_raw}")
    print(f"Explanation: {value.explanation}")
    print("=================================")
Output:

The output shows that the `exact-match` evaluator marked the evaluation as failed, since the LLM-generated and human-written summaries contain different text. In contrast, the `fuzzy-match` evaluator passed the evaluation. The `explanation` field shows the `fuzzy-match` evaluator's reasoning behind its pass/fail decision.
You can check the Patronus AI logs for more information about your evaluations.

Patronus AI Experiments with GLIDER LLM
Patronus AI's Experiments feature allows you to experiment on large datasets. With experiments, you can run batched evaluations and compare performance across different configurations, such as prompts, models, and datasets.
This section shows an example of running a Patronus experiment to compare two prompts using LLM-as-a-judge.
You can use any LLM “as a judge,” but researchers have explicitly trained certain LLMs to perform “LLM-as-a-judge” tasks. GLIDER from Patronus AI is one such LLM that outperforms other LLMs for evaluation tasks.
In the following section, we create a Patronus AI Experiment that uses a GLIDER-based judge evaluator to detect hallucinations in LLM responses in a dataset.
To create your own judge evaluator, go to the Patronus AI Evaluators page and click “Configure an Evaluator” in the top right corner.

You will see the following page. Select the evaluator type. We will select “Glider” for our experiment type.

On the following page, enter your criteria name and description. You can enter anything; it does not affect the evaluator’s performance.
The critical part here is the "Pass Criteria" value. Follow the instructions in the Patronus AI GLIDER documentation to see how to set this criterion. For RAG applications, you should specify the MODEL OUTPUT and RETRIEVED CONTEXT values.
In the following judge evaluator, we set criteria for hallucination detection.

Click the “Validate Pass Criteria” button to validate the pass criteria. As the above figure demonstrates, you will see three green notifications on the right side if your criteria are validated.
Finally, specify rubric criteria that set the output scores. In the following, we set three rubric score values and the corresponding instructions for the rubric.

Once you save your evaluator, you should see it in the list of evaluators on your dashboard.

We are ready to create an experiment using our judge evaluator to detect hallucinations.
We import a dataset containing user inputs and contexts: the question column contains the user input, and the context column contains the corresponding context.
For the experiment, we use only 50 randomly sampled records.
dataset = pd.read_csv("/content/validation-squad.csv")
random_records = dataset.sample(n=50)
random_records.to_csv("qa_records.csv", index=False)
print(random_records.shape)
random_records.head()
Output:

Next, we convert the dataset into a Patronus AI dataset. In the Patronus AI dataset, we must specify the `evaluated_model_input_field`, which contains the user input, and the `evaluated_model_retrieved_context_field`, which contains the context.
dataset = read_csv(
    "/content/qa_records.csv",
    evaluated_model_input_field="question",
    evaluated_model_retrieved_context_field="context",
)
Check out the official documentation to see how to create a Patronus AI dataset.
We will compare two RAG prompts and see which prompt results in less hallucination using the judge evaluator we created.
The following script defines a Patronus task that uses our first prompt and the `gpt-4o-mini` model to answer the user's question based on the given context, which is exactly what RAG applications do. It assumes you have an OpenAI API key available.
The task's output is stored in the `evaluated_model_output` field, which corresponds to the model output. Our evaluator will use this field to detect hallucinations in the model's responses.
from openai import OpenAI
from patronus import task, Row, TaskResult

OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

oai_client = OpenAI(
    api_key=OPENAI_API_KEY
)

@task
def gpt_4o_mini_basic(row: Row) -> TaskResult:
    """Answer the question with a basic prompt."""
    system_prompt = "Based on the context, answer the user's question."
    query = f"""
    Answer the following question based on the context.
    Question: {row.evaluated_model_input}
    Context: {row.evaluated_model_retrieved_context}
    """
    evaluated_model_output = (
        oai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": query},
            ],
            temperature=0.0
        )
        .choices[0]
        .message.content
    )
    return TaskResult(
        evaluated_model_output=evaluated_model_output,
        evaluated_model_provider="openai",
    )
The following script defines the task that runs our second prompt, which uses a chain-of-thought prompting approach.
@task
def gpt_4o_mini_cot(row: Row) -> TaskResult:
    """Answer the question with a chain-of-thought prompt."""
    system_prompt = """You will receive a user's question and the context.
    Based on the context, answer the user's question.
    Only include information from the context and do not generate text inconsistent with the context.
    Think step by step to generate your final response."""
    query = f"""
    Answer the following question based on the context.
    Question: {row.evaluated_model_input}
    Context: {row.evaluated_model_retrieved_context}
    """
    evaluated_model_output = (
        oai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": query},
            ],
            temperature=0.0
        )
        .choices[0]
        .message.content
    )
    return TaskResult(
        evaluated_model_output=evaluated_model_output,
        evaluated_model_provider="openai",
    )
Next, as shown in the script below, we define a Patronus AI client object and the evaluator we will use to create a Patronus AI experiment.
Finally, we run the experiment.
client = Client(api_key=PATRONUS_API_KEY)

small_hallucination_evaluator = client.remote_evaluator("glider", "small-hallucination-check")

assistants = [
    (gpt_4o_mini_basic, "gpt_4o_mini_basic"),
    (gpt_4o_mini_cot, "gpt_4o_mini_cot"),
]

async def run_experiment():
    results = {}
    for assistant_func, assistant_name in assistants:
        results[assistant_name] = await client.experiment(
            "Compare RAG prompts",
            data=dataset,
            task=assistant_func,
            evaluators=[small_hallucination_evaluator],
            tags={"dataset_type": "qa RAG", "model": "gpt-4o-mini"},
            experiment_name=assistant_name
        )
    return results

# Use await to run the async function (works in notebooks with top-level await)
experiment_results = await run_experiment()
Output:

The above output shows the results for the `gpt_4o_mini_cot` experiment, which uses chain-of-thought prompting.
The results show that our judge evaluator passed 68% of the records.
You can also see a histogram of the distribution of rubric scores.
Once you run the experiment, click the link at the bottom of the results window to view the details. Alternatively, go to your dashboard, click Experiments, select the experiments you want to compare, and click the "Compare" button at the top right to see how the two prompts compare.

You will see results like this:

The above output shows that the judge evaluator assigns a higher pass rate to the responses generated with the chain-of-thought prompt.
Prompt testing infrastructure
A robust prompt testing pipeline is essential for consistent, reliable, and efficient evaluation. A well-structured prompt-testing infrastructure enables systematic testing, versioning, and analysis of prompts across various use cases.
This section discusses key components of prompt testing infrastructure.
Prompt storage and management
Since prompt testing involves extensive modification, you need a structured storage system and version control so that prompts stay organized and their changes and refinements remain trackable.
Ideally, you need a version control system, such as GitHub or GitLab, to track changes in your prompts over time.
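For example, a lightweight approach is to keep prompts in a Git-tracked file and load them by name and version at test time. The file name and structure below are hypothetical:
import json

# prompts.json is a hypothetical Git-tracked file, for example:
# {"sentiment_classifier": {"v1": "What is the sentiment of this tweet: ...",
#                           "v2": "Respond with one word (positive, negative, or mixed): ..."}}
with open("prompts.json") as f:
    prompts = json.load(f)

def get_prompt(name: str, version: str) -> str:
    # Raises KeyError if the prompt or version is missing, surfacing stale references early
    return prompts[name][version]

print(get_prompt("sentiment_classifier", "v2"))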
Reporting and feedback loops
Reporting and feedback are crucial to prompt testing because they streamline the refinement process and ensure continuous improvement. Ideally, you should have an automated reporting mechanism, implemented with custom scripts or with tools like Patronus AI experiments.
Based on the automated reports, you can implement human-in-the-loop review systems, or even use LLMs to evaluate the reports and feedback, and improve your prompts or the overall infrastructure to suit your needs.
Considerations for selecting a prompt testing infrastructure
Finally, there are a few factors that you must consider before selecting a prompt testing infrastructure:
- In-house vs. third-party solutions: Developing a custom pipeline offers flexibility but requires more resources. Alternatively, third-party solutions like Patronus AI, LangChain evaluation frameworks, or Hugging Face's transformers library provide ready-to-use evaluation functionality. Among these, Patronus is the only platform dedicated to evaluating AI applications, with a library of out-of-the-box evaluators and the ability to generate test datasets and run experiments.
- Cost considerations: Running LLM evaluations at scale can be expensive. Choosing cost-effective solutions (e.g., using open-source models or optimizing API calls) can help manage expenses.
- Latency and scalability: If real-time prompt evaluation is required, you must optimize your infrastructure for low-latency responses. Cloud-based solutions with scalable architecture (e.g., serverless functions or distributed processing) can handle large-scale evaluations.
- Model size and training time: The choice of LLM impacts infrastructure needs. Smaller, distilled models allow for rapid testing, while larger models require substantial computational resources and time.
A well-structured prompt testing infrastructure ensures that AI applications produce high-quality, reliable, consistent responses. It optimizes performance while maintaining control over prompt iterations and evaluations.
Last thoughts
Prompt testing is crucial for developing robust LLM applications. Your AI applications may suffer from inconsistencies, hallucinations, and irrelevant responses without rigorous prompt testing strategies.
However, testing LLM prompts is challenging. Furthermore, managing in-house prompt testing pipelines can be resource-intensive, expensive, and time-consuming.
To address these challenges, leveraging third-party platforms like Patronus AI can streamline prompt evaluation and refinement. Patronus AI provides end-to-end prompt evaluation solutions, from prompt testing and result analysis to prompt refinement, feedback, and reporting.
If you want to improve your LLM applications, consider integrating Patronus AI into your prompt testing workflow. Visit the Patronus AI website to explore available resources, demo pages, and evaluation tools.