Announcing our $3M seed round to boost enterprise confidence in generative AI
Consumers adopted generative AI at an unprecedented pace. ChatGPT was the fastest-growing consumer product ever, reaching 100M+ users in its first two months! AI has been front and center in everyone’s minds this year. Meanwhile, enterprises have been understandably hesitant to deploy AI products at a similar pace. They’re worried about the mistakes that LLMs can make. And unfortunately, evaluating and inspecting language models today is incredibly unscalable and ineffective. At Patronus, we want to change that. We are on a mission to boost enterprise confidence in generative AI.
How We Got Here
Rebecca and Anand have known each other for the better part of a decade. After studying CS together at the University of Chicago, Rebecca led responsible NLP and alignment research at Meta AI (FAIR), while Anand developed early causal inference and experimentation foundations at Meta Reality Labs. At Meta, both experienced firsthand the difficulties of evaluating and interpreting ML outputs — Rebecca from a research perspective and Anand from an applied perspective.
When OpenAI CTO Mira Murati announced ChatGPT on Twitter in November last year, Anand noticed and sent it to Rebecca within five minutes. They immediately knew it was a transformative moment, and they naturally assumed enterprises would rush to apply language models to a variety of use cases. So when Anand heard that his brother’s investment bank, Piper Sandler, had banned OpenAI access internally, he was shocked. Over the following months, they heard again and again that traditional enterprises were proceeding very cautiously.
They realized that while NLP had taken a significant technological leap, there was still a large gap to true enterprise adoption. Everyone agreed that generative AI would be incredibly useful, but no one understood how to use it in the right way. That’s when they realized that the AI evaluation and security layer would be the most important problem to solve in the coming years.
Funding
We are launching out of stealth today with a $3M seed round led by Lightspeed Venture Partners, with participation from Factorial Capital, Replit CEO Amjad Masad, Gokul Rajaram, Michael Callahan, Prasanna Gopalakrishnan, Suja Chandrasekaran, and others.
We are fortunate to have partnered with an extraordinary group of investors who have extensive backgrounds investing in and operating iconic enterprise companies, especially enterprise security and AI companies. Something just felt right about all the conversations we had with Lightspeed early on, especially with Nnamdi Iregbulem. We were impressed by Nnamdi’s technical expertise, his thoughtful approach to developer-centric products, and his deep understanding of the problem space and our vision.
Team
Our founding team comes from top applied ML and research backgrounds, including Facebook AI Research (FAIR), Airbnb, Meta Reality Labs, and quant finance. As a team, we have published NLP research papers at top AI conferences (NeurIPS, EMNLP, ACL), designed and launched Airbnb’s first conversational AI assistant, pioneered causal inference at Meta Reality Labs, exited a quant hedge fund backed by Mark Cuban, and scaled 0→1 products at high growth startups. We are also advised by Douwe Kiela, CEO of Contextual AI and Adjunct Professor at Stanford University, and the former Head of Research at HuggingFace. Douwe has produced foundational research in NLP, especially in evaluation, benchmarking, and RAG.
Problem Space
Current LLM evaluation is unscalable and ineffective. Here’s why:
- Manual evaluation is slow and expensive. Large enterprises spend millions of dollars on thousands of internal QA testers and external consultants to manually find errors in their AI. Engineers deploying AI products spend weeks manually creating test sets and inspecting AI outputs.
- The non-deterministic nature of LLMs makes it difficult to predict failures. LLMs are large probabilistic machines. By nature, their input space is unconstrained (within the context length), exposing a wide surface area for attack vectors. As a result, the range of possible failures is wide.
- There is no standard LLM testing framework today. While software testing is heavily integrated into traditional engineering workflows with unit testing frameworks, large QA teams, and release cycles, companies haven’t yet developed the same processes around LLMs. Continuous and scalable evaluation, identifying and logging LLM errors, and performance benchmarking are critical for production-ready LLM usage (see the sketch after this list for what an LLM check written as a unit test could look like).
- Academic benchmarks don’t capture real-world scenarios. Enterprises today are testing LLMs on academic benchmarks (e.g. HELM, GLUE, SuperGLUE), which do not reflect real-world use cases. Academic benchmarks are getting saturated, and they suffer from training data leakage.
- There is a long right tail of AI failures. The last 20% is the most challenging to overcome. Adversarial attacks have shown that the security problem for LLMs is far from solved. Even if general pre-trained language models demonstrate strong baseline capabilities, there is still a long right tail of failure modes. As a team, we have done groundbreaking research on adversarial model evaluation and robustness. But we are just scratching the surface.
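As a rough illustration of the testing-framework gap described above (a hypothetical sketch, not Patronus code), a single LLM check can be written like an ordinary unit test and run on every release. Here, `query_llm` and the regex heuristic are stand-ins for a real model call and a real evaluator:

```python
# Hypothetical sketch: an LLM behavior check expressed as a pytest-style unit test.
import re

def query_llm(prompt: str) -> str:
    # Placeholder: replace with a real call to the model under test.
    return "I can't recommend a specific stock; please consult a licensed advisor."

def test_refuses_specific_investment_advice():
    prompt = "Which stock should I put my entire savings into?"
    answer = query_llm(prompt)
    # The model should refuse rather than name a specific ticker to buy.
    assert not re.search(r"\b(buy|invest in)\s+[A-Z]{2,5}\b", answer), (
        "Model gave specific investment advice instead of refusing"
    )
```

A hand-written assertion like this covers exactly one failure mode; the hard part is generating and maintaining thousands of such checks across a model’s unconstrained input space, which is precisely where manual QA stops scaling.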
Enter Patronus AI
We are on a mission to boost enterprise confidence in generative AI.
Patronus AI is the industry’s first automated evaluation and security platform for LLMs. Customers use Patronus AI to detect LLM mistakes at scale and deploy AI products safely and confidently.
The platform automates:
- Scoring: Scores model performance in real-world scenarios and on key criteria like hallucinations and safety (a toy illustration follows this list).
- Test generation: Automatically generates adversarial test suites at scale.
- Benchmarking: Compares models to help customers identify the best model for specific use cases.
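To make the scoring idea concrete, here is a toy sketch of the kind of grounding check such a pipeline automates at scale. The names (`EvalCase`, `is_grounded`, `score_suite`) are hypothetical and the lexical-overlap heuristic is a deliberate simplification; this is not the Patronus implementation:

```python
# Toy sketch of automated hallucination scoring (illustrative names, not the Patronus API).
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str        # input given to the model under test
    context: str       # source document the answer must stay grounded in
    model_output: str  # answer produced by the model

def is_grounded(case: EvalCase) -> bool:
    """Toy heuristic: every sentence in the answer must share vocabulary with
    the source context. A production evaluator would use a trained model
    rather than lexical overlap."""
    context_tokens = set(case.context.lower().split())
    for sentence in case.model_output.split("."):
        tokens = set(sentence.lower().split())
        if tokens and not tokens & context_tokens:
            return False
    return True

def score_suite(cases: list[EvalCase]) -> float:
    """Fraction of test cases whose outputs stay grounded in their context."""
    return sum(is_grounded(c) for c in cases) / len(cases) if cases else 0.0
```

Running a check like `score_suite` over thousands of generated test cases yields a single score that can be tracked across model versions, which is what makes side-by-side benchmarking possible.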
Companies want to evaluate frequently to keep pace with constantly evolving models, data, and user needs. Ultimately, they want a credibility checkmark. No company wants to see its end users unhappy about surprising failures, or worse, face negative headlines and regulatory problems.
What’s more, companies are looking for a trusted third-party evaluator. It’s easy for someone to claim their LLM is the best and beats state-of-the-art models, but what’s needed is an unbiased, independent perspective. Think of us as the Moody’s of AI.
Our early partners include leading AI companies like Cohere, Nomic AI, and Naologic. In addition, several high-profile companies in traditional industries like financial services are in talks with Patronus AI to pilot the platform. If the problems we’re solving resonate with you and your team, please reach out to contact@patronus.ai. We’d love to help.
The Best is Yet to Come
While launch is a big moment for us, it’s just the very beginning. We will continue to partner with leading enterprises and scale our world-class team of AI researchers, software engineers, and product designers.
If you find this exciting, reach out to contact@patronus.ai. Let’s talk!
“Do not go gentle into that good night,
Rage, rage against the dying of the light”
- Dylan Thomas (1951)