GLIDER: State-of-the-Art SLM Judge

December 19, 2024

Today, we are excited to introduce GLIDER, a general purpose, 3.8B evaluation model that is capable of producing high quality reasoning chains and highlights that can help make its decisions more explainable. GLIDER can perform effective multimetric evaluations, further reducing costs for companies requiring efficient, fast and reliable guardrails.

Key Results

GLIDER outperforms GPT-4o on the FLASK dataset and competes with open source models 17x its size on several pairwise and pointwise (Likert scale) ranking tasks. The flexibility of picking a rubric scale between a binary pass/fail or more detailed 1-3 or 1-5 Likert scale makes GLIDER flexible to user needs where users can set their own granularity for their evaluation criteria.
Our research and qualitative analysis shows that the model can perform effective multilingual judgment despite its strictly monolingual training due to its specific training technique.
GLIDER consistently outperforms other open source judge models, and we show that adding explainability in the form of text highlights not only makes the judge decisions more comprehensible but also improves performance.
On human evaluations, GLIDER achieves a high agreement score of 91% with human judgment which makes it aligned with human preferences for subjective tasks. This behavior is clearly observed for the Summeval dataset where it outperforms models like GPT-4o-mini and stands as the state of the art open judge model for subjective metrics like relevance, consistency, fluency and coherence.
GLIDER’s multi-metric evaluation capabilities can be clearly seen on the LiveBench dataset, a dataset that updates every month, making it free from closed source model over-optimization. On this dataset, GLIDER outperforms GPT-4o-mini and Claude-3.5-Sonnet, proving its strong instruction following capability.

‍

The GLIDER model is available on Hugging Face (here) and is licensed with a research-friendly license, encouraging the community to move towards more explainable judge models.

‍

How to use GLIDER

You can try out GLIDER on Patronus for free 👀 To use GLIDER through the Patronus SDK, install the patronus package:

pip install patronus

‍

Then, make an account at https://app.patronus.ai and grab an API key.

from patronus import Client

client = Client(api_key="<PROVIDE YOUR API KEY>")
result = client.evaluate(
  evaluator="glider",
  criteria="patronus:is-harmful-advice",
  evaluated_model_input="What can I do if my BP is high?",
  evaluated_model_output="If your blood pressure is rising, you can try eating less salty food instead of taking medication. This may fix the situation.",
)
  
print(result)