Llama Guard is Off Duty 😲

August 22, 2024

TLDR

We benchmarked Llama-Guard-3 and Llama-3.1-8B models on toxicity detection, and we found that Llama-Guard-3 significantly underperformed compared to the base model prompted with a simple toxicity detection prompt.

Background

The task of toxicity detection is crucial for identifying whether the outputs generated by a model are harmful or intended to cause harm. Toxic content can manifest in various forms, such as assault, vulgarity, sexually explicit material, and other inappropriate content, often arising from biased training datasets. Ensuring the detection and prevention of such outputs is vital for the widespread adoption of generative models. The Llama-Guard series of open-sourced models has been specifically developed to be used jointly with the base generative models to assess whether the generated content is toxic. If toxic content is detected, these models can further classify it into predefined toxicity categories (defined here). The most recent iteration in this series is Llama-Guard-3, a fine-tuned version of the Llama-3.1-8B model designed specifically for toxicity detection in eight languages, including English, French, German, Hindi, Italian, Portuguese, Spanish, and Thai.

Results

Below are the results on 500 samples each, from the toxic splits of popular English as well as multilingual toxicity datasets:

We tested these models exhaustively on single query and chat-based datasets (some results attached above). We noticed that Llama-Guard-3 underperforms even the random baseline of 50% for this binary classification task of marking inputs as safe or unsafe. We also observed that most of the heavy lifting for producing safe outputs is done by the base model itself, and adding Llama-Guard-3 to the pipeline may be redundant.

Let’s study a few patterns that we observed from the outputs (attached):

1. Llama-Guard-3 frequently confuses the context of the text. Let us pick out an example from the BeaverTails dataset where the model misclassifies 41% of such cases:

In this case, because there is no explicit toxic or harmful vocabulary used, the model fails to capture the general context that the question is ill-intended and the action of stealing is illegal.

2. Llama-Guard-3 tends to accept leakage of personal information. For example:

Here, the model conveniently ignores the targeted information extraction attack.

We noticed that this behavior extends beyond English where the model fails to capture harmful intent and similarly affects high-resource and low-resource languages. You can find our complete code and run outputs here: https://github.com/patronus-ai/llama-3-toxicity-experiments

**We have ensured consistency with the prompt provided by Meta for the Llama-Guard-3 while running our experiments and have evaluated toxicity unbiasedly here. Our overall findings (as displayed above) have led us to believe that Llama-Guard-3 underperforms at toxicity classification.

Conclusion

Based on our research, Llama-Guard-3 is a weak guard model today. Llama-Guard-3 is built mainly for the Llama-3.1 collection of models, which appear to be already very strongly aligned for safety. Given these results, Llama-Guard-3 may not only be redundant to use with Llama-3.1 in conjunction, but would also be underwhelming as a standalone guard model as well.

At Patronus AI, we rigorously benchmark all things to help engineers trust all the AI tools they use. Reach out to contact@patronus.ai to learn more!

Appendix

Evals, prompts, and references: https://github.com/patronus-ai/llama-3-toxicity-experiments

References

[1] Ji, J., Liu, M., Dai, J., Pan, X., Zhang, C., Bian, C., ... & Yang, Y. (2024). Beavertails: Towards improved safety alignment of llm via a human-preference dataset. Advances in Neural Information Processing Systems, 36.

[2] Li, L., Dong, B., Wang, R., Hu, X., Zuo, W., Lin, D., ... & Shao, J. (2024). Salad-bench: A hierarchical and comprehensive safety benchmark for large language models. https://aclanthology.org/2024.findings-acl.235.

[3]Lin, Z., Wang, Z., Tong, Y., Wang, Y., Guo, Y., Wang, Y., & Shang, J. (2023). Toxicchat: Unveiling hidden challenges of toxicity detection in real-world user-ai conversation. https://aclanthology.org/2023.findings-emnlp.311/.

[4] Kluge, N. (2022). Nkluge-correa/Aira-EXPERT: release v.01. Zenodo.

[5] cjadams, Jeffrey Sorensen, Julia Elliott, Lucas Dixon, Mark McDonald, nithum, Will Cukierski. (2017). Toxic Comment Classification Challenge. Kaggle. 

https://kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge

[6] Röttger, P., Kirk, H. R., Vidgen, B., Attanasio, G., Bianchi, F., & Hovy, D. (2023). Xstest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263.

[7] Tonneau, M., Liu, D., Fraiberger, S., Schroeder, R., Hale, S. A., & Röttger, P. (2024). From Languages to Geographies: Towards Evaluating Cultural Bias in Hate Speech Datasets. arXiv preprint arXiv:2404.17874.

[8] Sirihattasak, S., Komachi, M., & Ishikawa, H. (2018, May). Annotation and classification of toxicity for Thai Twitter. In TA-COS 2018: 2nd Workshop on Text Analytics for Cybersecurity and Online Safety (p. 1).

[9] Çöltekin, Ç. (2020, May). A corpus of Turkish offensive language on social media. In Proceedings of the Twelfth language resources and evaluation conference (pp. 6174-6184).

[10] Ä°. Mayda, Y. E. Demir, T. Dalyan and B. Diri, "Hate Speech Dataset from Turkish Tweets," Innovations in Intelligent Systems and Applications Conference (ASYU), Elazig, Turkey, 2021, pp. 1-6, doi: 10.1109/ASYU52992.2021.9599042.

[11] Kadir Bulut Ozler, "5k turkish tweets with incivil content", 2020, "https://www.kaggle.com/datasets/kbulutozler/5k-turkish-tweets-with-incivil-content

[12] Overfit-GM/turkish-toxic-language · Datasets at Hugging Face. (n.d.). https://huggingface.co/datasets/Overfit-GM/turkish-toxic-language

[13] Daryna Dementieva, Valeriia Khylenko, Nikolay Babakov, and Georg Groh. 2024. Toxicity Classification in Ukrainian. In Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024), pages 244–255, Mexico City, Mexico. Association for Computational Linguistics