Lynx: State-of-the-Art Open Source Hallucination Detection Model

July 11, 2024

Today, we are excited to introduce Lynx, a state-of-the-art hallucination detection LLM that outperforms GPT-4o, Claude-3-Sonnet, and other closed and open-source LLM-as-a-judge models. We are thrilled to launch Lynx with Day 1 integration partners Nvidia, MongoDB and Nomic.

Retrieval Augmented Generation (RAG) techniques aim to mitigate hallucinations in Large Language Models (LLMs). However, LLMs can still produce information that is unsupported by, or contradicts, the retrieved context. This poses significant downstream risks to end users: incorrect information can lead to medical misdiagnosis or poor financial advice 🤦

Take the following example, where GPT-4o, Claude-3-Sonnet and Lynx were asked whether an answer about biological terminology contained hallucinations. GPT-4o and Claude-3-Sonnet both failed to identify the hallucination, whereas Lynx correctly reasoned that the correct answer is “genus”, based on the document provided ✅

While several previous open source hallucination detection models outperform GPT-3.5, Lynx is the first open source model that beats GPT-4 in a wide range of scenarios. We evaluate Lynx on HaluBench, a comprehensive hallucination evaluation benchmark consisting of 15k samples sourced from various real-world domains. Experimental results show that Lynx can solve challenging hallucination detection tasks:

In medical answers (PubMedQA), Lynx (70B) was 8.3% more accurate than GPT-4o at detecting medical inaccuracies.
Lynx (8B) outperformed GPT-3.5 by 24.5% on HaluBench and beat Claude-3-Sonnet and Claude-3-Haiku by 8.6% and 18.4%, respectively, showing strong capabilities in a smaller model.
Both Lynx (8B) and Lynx (70B) achieve significantly increased accuracy compared to open source model baselines, with Lynx (8B) showing gains of 13.3% over Llama-3-8B-Instruct from supervised finetuning.
Lynx (70B) outperformed GPT-3.5 by an average of 29.0% across all tasks.
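
As a rough illustration of how such comparisons can be computed, the sketch below scores two judges against gold PASS/FAIL labels and reports the gap in percentage points. It assumes the gaps quoted above are differences in accuracy; the paper defines the exact metric. The labels and the names `judge_a` and `judge_b` are made up for illustration and are not HaluBench data.

```python
def accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of predictions that match the gold label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Made-up labels for illustration only; these are not HaluBench results.
gold = ["PASS", "FAIL", "FAIL", "PASS", "FAIL"]
judge_a = ["PASS", "FAIL", "PASS", "PASS", "FAIL"]  # 4 of 5 correct
judge_b = ["PASS", "PASS", "PASS", "PASS", "FAIL"]  # 3 of 5 correct

# Difference between the two accuracies, expressed in percentage points.
gap_points = (accuracy(judge_a, gold) - accuracy(judge_b, gold)) * 100
print(f"{gap_points:.1f} percentage points")
```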

How did we achieve these results?

Lynx and HaluBench, the new 15k-sample benchmark, cover real-world domains like Finance and Medicine that previous datasets and models did not include, making them more applicable to real-world problems.
Lynx is a finetuned Llama-3-70B-Instruct model. The model not only produces a score but can also reason about it, like a human grader, making AI outputs more explainable and interpretable.
Lynx is especially strong at catching hard-to-detect hallucinations. This is due to novel training approaches, including Chain-of-Thought reasoning, which helps LLMs reason through advanced tasks!

For more details on our methodology, check out our research paper: https://arxiv.org/abs/2407.08488

How to use Lynx

We are excited to be launching Lynx with our Day 1 integration partners: Nvidia, MongoDB and Nomic AI! Read on to learn how to use Lynx with our integration ecosystem.

Running Lynx locally with Ollama

Install ollama: https://ollama.com/download
Download the .gguf version of Lynx-8B-Instruct from here (this might take 1-2 minutes): https://huggingface.co/PatronusAI/Lynx-8B-Instruct-Q4_K_M-GGUF
Create a file named Modelfile with the following contents. Make sure the .gguf path points to your downloaded model.

    FROM "./patronus-lynx-8b-instruct-q4_k_m.gguf"
    PARAMETER stop "<|im_start|>"
    PARAMETER stop "<|im_end|>"
    TEMPLATE """
    <|im_start|>system
    {{ .System }}<|im_end|>
    <|im_start|>user
    {{ .Prompt }}<|im_end|>
    <|im_start|>assistant
    """

Run ollama create patronus-lynx-8b -f Modelfile
Run ollama run patronus-lynx-8b
You can now start chatting with Patronus-Lynx-8B-Instruct locally!

To use Lynx as a hallucination detector, send it a prompt in the following format:

    Given the following QUESTION, DOCUMENT and ANSWER you must analyze the provided answer and determine whether it is faithful to the contents of the DOCUMENT. The ANSWER must not offer new information beyond the context provided in the DOCUMENT. The ANSWER also must not contradict information provided in the DOCUMENT. Output your final verdict by strictly following this format: "PASS" if the answer is faithful to the DOCUMENT and "FAIL" if the answer is not faithful to the DOCUMENT. Show your reasoning.\n\n--\nQUESTION (THIS DOES NOT COUNT AS BACKGROUND INFORMATION):{question}\n\n--\nDOCUMENT:\n[{document}]\n\n--\nANSWER:\n{answer}\n\n--\n\nYour output should be in JSON FORMAT with the keys "REASONING" and "SCORE":\n{"REASONING": <your reasoning as bullet points>, "SCORE": <your final score>}\n
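
A minimal sketch of how the prompt template above might be filled in and how the reply might be parsed. The helper names `build_prompt` and `parse_verdict` are hypothetical, not part of any Patronus library, and the parser assumes the model replies with valid JSON in the requested shape, which is not guaranteed.

```python
import json
import re

# Template text copied from the prompt above; the three {slots} are filled at call time.
PROMPT_TEMPLATE = (
    "Given the following QUESTION, DOCUMENT and ANSWER you must analyze the "
    "provided answer and determine whether it is faithful to the contents of "
    "the DOCUMENT. The ANSWER must not offer new information beyond the "
    "context provided in the DOCUMENT. The ANSWER also must not contradict "
    "information provided in the DOCUMENT. Output your final verdict by "
    'strictly following this format: "PASS" if the answer is faithful to the '
    'DOCUMENT and "FAIL" if the answer is not faithful to the DOCUMENT. '
    "Show your reasoning.\n\n--\n"
    "QUESTION (THIS DOES NOT COUNT AS BACKGROUND INFORMATION):{question}\n\n--\n"
    "DOCUMENT:\n[{document}]\n\n--\n"
    "ANSWER:\n{answer}\n\n--\n\n"
    'Your output should be in JSON FORMAT with the keys "REASONING" and "SCORE":\n'
    '{"REASONING": <your reasoning as bullet points>, "SCORE": <your final score>}\n'
)

# Matches only the three named slots, so the literal JSON braces are left alone.
SLOT = re.compile(r"\{(question|document|answer)\}")


def build_prompt(question: str, document: str, answer: str) -> str:
    """Fill the slots in a single pass (hypothetical helper)."""
    values = {"question": question, "document": document, "answer": answer}
    return SLOT.sub(lambda m: values[m.group(1)], PROMPT_TEMPLATE)


def parse_verdict(raw: str) -> tuple[str, object]:
    """Return (score, reasoning); assumes the reply is valid JSON."""
    data = json.loads(raw)
    score = str(data["SCORE"]).strip().upper()
    if score not in {"PASS", "FAIL"}:
        raise ValueError(f"unexpected SCORE: {score!r}")
    return score, data["REASONING"]
```

A single-pass regex substitution is used instead of `str.format` because the template itself contains literal braces in the JSON example, which `str.format` would try to interpret as fields.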

You can also query the model through the Ollama REST API:

    curl http://localhost:11434/api/generate -d '{
      "model": "patronus-lynx-8b",
      "prompt": "What are hallucinations in language models?"
    }'

    curl http://localhost:11434/api/chat -d '{
      "model": "patronus-lynx-8b",
      "messages": [
        {"role": "user", "content": "What are hallucinations in language models?"}
      ]
    }'
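
The same generate call can be made from Python. The sketch below uses only the standard library and assumes the Ollama server is running on the default local port shown in the curl example. The `"stream": False` field is an addition not present in the original commands; it asks Ollama to return a single JSON object instead of a stream of chunks. The function names are hypothetical.

```python
import json
import urllib.request

# Endpoint taken from the curl example; adjust if your server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "patronus-lynx-8b") -> urllib.request.Request:
    """Build the POST request without sending it (hypothetical helper)."""
    # "stream": False is an assumption added here so the reply is one JSON object.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "patronus-lynx-8b") -> str:
    """Send the request and return the generated text from the 'response' field."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]


# Example (requires a running Ollama server with the model created):
# print(generate("What are hallucinations in language models?"))
```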

Using Lynx with NVIDIA NeMo-Guardrails

You can quickly integrate Lynx as a hallucination detector with your chatbot application using NVIDIA NeMo-Guardrails.

After you’ve deployed Lynx (here’s a guide if you’re using vLLM or Ollama), you can follow the instructions here to integrate it with NeMo-Guardrails.

Note that the Lynx integration is scheduled to ship with the July 31, 2024 release of NeMo-Guardrails, but you can try it now by installing NeMo-Guardrails from GitHub.

Viewing HaluBench on Nomic Atlas

You can also view the HaluBench dataset on Nomic Atlas! Atlas is a visualization tool for large scale datasets that can help you identify patterns and insights from your datasets. The full HaluBench dataset is available for public use on Nomic Atlas. You can filter, segment and explore the dataset below:

https://atlas.nomic.ai/data/patronus-ai/halubench/map