3 key metrics for reliable LLM performance

What makes AI observability different from DevOps?

AI observability often lags behind innovation, and that gap can create big risks. In his lightning talk at All Things Open, Richard Shan from CTS shares why monitoring generative AI systems matters and which metrics and tools can keep your deployments reliable.


Richard opens with a critical question: How do we know if generative AI systems are performing the way we expect? While chatbots on retail sites might not need strict oversight, mission-critical use cases like healthcare or finance demand accuracy and speed. Observability is what ensures that a model isn’t just running, but producing outputs you can trust.

He highlights three core metrics developers should track: output coherence, accuracy, and latency. Coherence ensures the model's response is logical and relevant, accuracy checks whether the answer aligns with ground truth, and latency measures response time for time-sensitive scenarios. Richard also points out the difficulty of defining accuracy for generative outputs, which has led to tools like RAGAS for evaluating retrieval-augmented generation systems.
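To make the three metrics concrete, here is a minimal sketch of how a team might instrument a single model call. Everything here is illustrative: `evaluate_call`, the callable `model` interface, and the keyword-overlap coherence proxy are assumptions for the example, not part of any tool mentioned in the talk. Real coherence and accuracy scoring would use far more robust methods.

```python
import time
from dataclasses import dataclass


@dataclass
class EvalResult:
    coherence: float   # crude lexical-overlap proxy, in [0, 1]
    accuracy: float    # 1.0 if the ground-truth string appears in the answer
    latency_ms: float  # wall-clock time of the model call


def keyword_overlap(question: str, answer: str) -> float:
    """Toy coherence proxy: share of substantive question words echoed in the answer."""
    q_words = {w.lower() for w in question.split() if len(w) > 3}
    if not q_words:
        return 0.0
    a_words = {w.lower() for w in answer.split()}
    return len(q_words & a_words) / len(q_words)


def evaluate_call(model, question: str, ground_truth: str) -> EvalResult:
    """Wrap one model call and score it on the three metrics."""
    start = time.perf_counter()
    answer = model(question)  # `model` is any callable str -> str
    latency_ms = (time.perf_counter() - start) * 1000
    return EvalResult(
        coherence=keyword_overlap(question, answer),
        accuracy=1.0 if ground_truth.lower() in answer.lower() else 0.0,
        latency_ms=latency_ms,
    )


# Stubbed model standing in for a real LLM client:
def stub(question: str) -> str:
    return "The capital of France is Paris."


result = evaluate_call(stub, "What is the capital of France?", "Paris")
```

In production the stub would be replaced by an actual LLM client, and the scores would be logged per request so regressions show up over time rather than only in spot checks.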


Finally, Richard stresses the limitations of applying traditional DevOps monitoring to language models. LLM operations require frameworks tailored to tracing inputs, measuring contextual correctness, and preventing issues like hallucinations. His advice? If possible, build custom observability solutions to fit your use case, using Python libraries like RAGAS or ARES for dynamic evaluation. A solid observability framework should offer dashboards for both technical and non-technical users and deliver insights that guide system improvements.

Key takeaways

  • Observability is essential for generative AI, especially in high-stakes domains like finance or healthcare.
  • Tracking the right metrics, like coherence, accuracy, and latency, helps teams identify and resolve performance issues.
  • Custom LLM observability frameworks, built around your organization’s priorities, provide better results than generic DevOps tools.

Conclusion

Building bigger models isn’t enough. Richard’s talk reminds us that observability is what keeps generative AI systems safe, efficient, and aligned with user expectations.

About the Author

The ATO Team is a small but skilled team of talented professionals, bringing you the best open source content possible.

The opinions expressed on this website are those of each author, not of the author's employer or All Things Open/We Love Open Source.