Exa vs Bing API: A Search Performance Comparison Case Study
Overview
The rise of AI applications has made the quality of search and retrieval systems increasingly critical. We conducted a detailed evaluation comparing Exa's neural search against the Bing Web Search API, focusing on each API's ability to return relevant results for real-world, highly semantic queries. We used the Patronus AI automated evaluation suite to run the comparison, generating aggregate metrics and handy visualizations along the way.
Methodology
We chose a highly semantic query set and tested whether each API's results semantically matched the search query. Our methodology is described below.
Data Collection
We first constructed a representative evaluation dataset. Our dataset consisted of the following attributes:
- 150 highly semantic queries
- 5 results retrieved per query from each API
- Full text, highlights, and summaries captured for each result
To keep the comparison fair, we augmented the data with Exa-fetched contents for Bing's search results, since the Bing API returns only URLs and snippets rather than full page text. This ensures the comparison focuses solely on the relevance of results.
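As a sketch of this augmentation step, page text can be fetched for the Bing URLs (for example via Exa's contents endpoint) and merged back into the result records. The helper below is illustrative only; the `url` and `text` field names, and the stub data, are assumptions rather than our exact pipeline:

```python
from typing import Dict, List

def merge_contents(bing_results: List[Dict], contents_by_url: Dict[str, str]) -> List[Dict]:
    """Attach separately fetched page text to each Bing result, keyed by URL."""
    merged = []
    for result in bing_results:
        merged.append({**result, "text": contents_by_url.get(result["url"], "")})
    return merged

# Stub data; in the real pipeline, contents_by_url would be built from a
# contents lookup (e.g. Exa) over the Bing result URLs.
bing_stub = [{"url": "https://example.com/a", "name": "Example page"}]
contents_stub = {"https://example.com/a": "Full page text..."}
print(merge_contents(bing_stub, contents_stub))
```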
Our code to query Exa and Bing Search is shown below:
# Example implementation
from exa_py import Exa
from azure.cognitiveservices.search.websearch import WebSearchClient
from msrest.authentication import CognitiveServicesCredentials

exa_client = Exa(api_key="TODO")
bing_client = WebSearchClient(CognitiveServicesCredentials("TODO"))

# `query` is one of the 150 evaluation queries
exa_results = exa_client.search_and_contents(
    query,
    type="neural",
    use_autoprompt=True,
    num_results=5,
    text=False,
    highlights=True,
    summary=True,
)

bing_results = bing_client.web.search(
    query=query,
    count=5,
    text_decorations=True,
    text_format="HTML",
)
Evaluation
Results were evaluated using an independent judge evaluator on the Patronus platform, assessing both summary quality and result relevance. This evaluator allowed us to obtain reliable evaluation results at scale, ensuring high human-AI alignment in the process. Results were evaluated on a PASS/FAIL basis, based on the following judge definition:
"Given a search query in USER INPUT, a summary of the content from the returned search result in MODEL OUTPUT, and highlights (or snippets) from the returned search results, determine whether the MODEL OUTPUT or RETRIEVED CONTEXT provide useful and relevant information related to the USER INPUT."
We ran the following code to kick off an evaluation with the Patronus experiments framework:
# Example implementation
from patronus import Client

patronus_client = Client(api_key="TODO")

query_result_relevance = patronus_client.remote_evaluator(
    evaluator_id_or_alias="judge",
    criteria="is-search-query-result-relevant",
)

patronus_client.experiment(
    project_name="web-search-comparison",
    data=exa_results,
    evaluators=[query_result_relevance],
    experiment_name="exa",
)

patronus_client.experiment(
    project_name="web-search-comparison",
    data=bing_results,
    evaluators=[query_result_relevance],
    experiment_name="bing",
)
Performance Analysis
Exa outperformed Bing Search in search result relevance: the Comparisons view shows a pass rate of 60% for Exa versus 38% for Bing.
Let's dig into some example queries to understand the performance differences!
Example Queries
Query: “best online language learning apps with proven effectiveness for native english speakers learning mandarin chinese”
Example Result: Exa
Exa's result recommended Ninchanese, an app aimed at native English speakers learning Mandarin Chinese. Patronus scored the result as PASS because it is directly relevant to the user query.
Example Result: Bing
Bing's result provided general examples of language learning apps for 2024. Patronus scored the result as FAIL. To understand why, we can look at the Patronus evaluator's explanation: the results were general in scope and not specific to native English speakers learning Mandarin Chinese.
Key Findings
1. Semantic Understanding
- Exa's neural search showed superior performance in understanding complex technical queries
- Particularly strong in cases requiring deep domain understanding
2. Result Relevance
- Higher precision in technical and specialized searches
3. Content Depth
- Exa consistently returned more technically relevant content
- Better at finding specific, detailed information rather than general overviews
Implications for Developers
The results demonstrate clear advantages for applications requiring:
- Complex query understanding
- Accurate, relevant retrieval of the full content behind a URL
Conclusion
Our evaluation reveals that Exa's neural search capabilities provide significantly more relevant results for technical and complex queries compared to traditional search APIs. This makes it particularly valuable for applications requiring deep semantic understanding and technical content retrieval.