aiTech Trend Interview with Xander Song, a Developer Advocate and Machine Learning Engineer at Arize AI

Xander Song

Introduction

Welcome, ladies and gentlemen, to an exciting session with Xander Song, a Developer Advocate and Machine Learning Engineer at Arize AI, a leading company in the world of technology and AI. 

Song has been instrumental in shaping Arize AI’s success in ML observability and pushing the boundaries of AI monitoring, model performance, and transparency in the evolving generative AI space. 

Most recently, Song – along with a team consisting of Mikyo King, Francisco Castillo Carrasco, and Roger Yang – helped develop Phoenix, an open-source library offering ML observability in a notebook to better monitor and fine-tune generative LLM, computer vision, and tabular models. 

We caught up with Song on the thinking behind Phoenix and Arize’s strategy more generally. 

Research breakthroughs are always fascinating! Can you provide insights into any recent advancements or methodologies developed at Arize AI that have been instrumental in advancing the field of model monitoring and debugging?

One big breakthrough that I’ve been focused on for the past six months is Arize Phoenix.

Phoenix is open-source software that enables evaluation and risk management for LLM, computer vision, and tabular models. Phoenix’s main users are the people building applications on top of LLMs.

For example, a data scientist might be building an application that uses an LLM like OpenAI’s ChatGPT to generate legal advice in a virtual lawyer product, or a startup working with medical providers might be trying to accurately summarize doctor-patient meetings for an electronic medical record. 

As the industry re-tools around LLMs and data scientists apply large foundation models to new use cases like these – supplanting traditional approaches – they lack ways to reliably evaluate whether the LLM applications they build are ready for production. And once those applications are in production, data scientists have no way of knowing when models fail, make wrong decisions, give poor responses, or generalize incorrectly. That’s dangerous in a world where we have known issues around bias and hallucinations for major models like GPT-4. 

The risk of deploying LLMs in high-risk environments (e.g., those handling medical or legal data) is immense, and running blind without tools such as Phoenix should give pause to businesses that depend on LLM technology. Phoenix can help teams visualize complex LLM decision-making, monitor LLMs when they produce false or misleading results, and home in on fixes to improve outcomes. Phoenix also supports computer vision, other language model use cases, and traditional ML. 

Can you walk me through a typical scenario? 

Phoenix finds where LLMs go wrong. Let me give you an example. Say you’re building a health insurance customer care chatbot. Users can ask this chatbot about their coverage plans from the health insurance provider. This is an application that demands a high degree of trust in the LLM’s output, since users depend on the answers to decide which specialists to see or which procedures to pursue. We want to find where the chatbot gives inaccurate or hallucinatory responses. 

Phoenix runs in a notebook locally, and the library leverages clustering of embeddings for debugging. 

Embeddings are vector representations of data. They are everywhere in modern deep learning: transformers, recommendation engines, the layers of deep neural networks, encoders, and decoders. They preserve relationships within your data.
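
To make that concrete, here is a toy illustration (not Phoenix code, and with made-up vectors) of what “preserving relationships” means: semantically similar inputs land close together in embedding space, as measured by something like cosine similarity.

```python
# Toy illustration with made-up 3-dimensional vectors; real embeddings
# typically have hundreds or thousands of dimensions.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 means same direction (very similar); near 0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_copay = np.array([0.9, 0.1, 0.2])       # "What is my copay?"
emb_deductible = np.array([0.8, 0.2, 0.3])  # "How much is my deductible?"
emb_weather = np.array([0.1, 0.9, 0.1])     # "Will it rain tomorrow?"

print(cosine_similarity(emb_copay, emb_deductible))  # high: related questions
print(cosine_similarity(emb_copay, emb_weather))     # low: unrelated
```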

In order to use Phoenix, users take the following steps (sketched in code after the list):

  • Load their data (example: chatbot conversations, including prompts & responses), leveraging embeddings and LLM-assisted evaluation to generate scores for responses
  • Start Phoenix 
  • Investigate groups of responses that are problematic (example: questions from Spanish-speaking patients where the LLM responded incorrectly) 
  • Download bad responses to use for LLM fine-tuning & improvement

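As a rough illustration of what the first two steps look like, here is a minimal sketch based on the open-source arize-phoenix notebook API as of its early releases. The dataframe, file name, and column names are hypothetical placeholders; the exact schema fields are documented in the Phoenix docs.

```python
import pandas as pd
import phoenix as px

# Hypothetical table of chatbot conversations with precomputed embeddings.
df = pd.read_parquet("chatbot_conversations.parquet")

# Tell Phoenix which columns hold the raw text and the embedding vectors.
schema = px.Schema(
    prompt_column_names=px.EmbeddingColumnNames(
        vector_column_name="prompt_vector",
        raw_data_column_name="prompt",
    ),
    response_column_names=px.EmbeddingColumnNames(
        vector_column_name="response_vector",
        raw_data_column_name="response",
    ),
)

# Launch the Phoenix app from the notebook and open the UI in a browser.
session = px.launch_app(px.Dataset(dataframe=df, schema=schema))
print(session.url)
```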
Step 1: Users load their data and embeddings into Phoenix. They can see clusters where the LLM gave good responses and clusters where it gave bad ones. 

Step 2: Users can grab groups of responses (clusters) that represent a problem.

Step 3: Troubleshoot and grab prompt & response pairs. 
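
Continuing the sketch above, clusters selected and exported in the Phoenix UI can be pulled back into the notebook. The exports accessor below reflects the early open-source API, and the column names remain hypothetical.

```python
# After selecting a problematic cluster in the Phoenix UI and exporting it,
# retrieve the exported rows in the notebook (most recent export last).
bad_responses = px.active_session().exports[-1]

# Save the prompt & response pairs for fine-tuning or prompt iteration.
bad_responses[["prompt", "response"]].to_json(
    "bad_responses.jsonl", orient="records", lines=True
)
```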

In short, Phoenix provides ML insights at lightning speed with zero-config observability for model drift, performance, and data quality.

What other solutions exist around LLM observability? 

We haven’t seen many. LLMOps is a rapidly emerging discipline, with new players appearing seemingly daily, so it’s an exciting space to contribute to and watch!

Modern models are built on latent structure, with embeddings as the foundation of how they work. Embeddings are the core building blocks of transformers. Phoenix maps out how the embeddings connect, how they relate to each other, and how they progress as sentences are generated by LLMs. 

Embeddings can either be extracted from the LLM itself as it generates text, generated using services such as OpenAI’s embedding service, or generated locally on your data by another model. Once extracted, the latent structure gives an idea of what the model has learned, what it is “thinking,” and how that thinking progresses. 
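For instance, here is a minimal sketch of the second option, generating embeddings with OpenAI’s embeddings endpoint (using the pre-1.0 openai Python client that was current at the time of writing). The example responses are made up, and any local encoder could be substituted.

```python
import openai

openai.api_key = "sk-..."  # your API key

# Hypothetical chatbot responses to embed.
responses = [
    "Your plan covers in-network specialist visits with a $30 copay.",
    "I'm sorry, I don't have information about that procedure.",
]

result = openai.Embedding.create(
    model="text-embedding-ada-002",
    input=responses,
)
vectors = [record["embedding"] for record in result["data"]]
print(len(vectors), len(vectors[0]))  # 2 vectors of 1536 dimensions each
```

These vectors could then populate the hypothetical "response_vector" column used in the Phoenix schema sketched earlier.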

Phoenix is the first observability solution we’ve seen built with embeddings as the core foundation, but we are certain it won’t be the last. 

Has anyone tested Phoenix? 

Anyone can try out Arize Phoenix now, and we’ve been fortunate to get feedback from over 100 users and researchers across different companies and organizations, who were generous with their time in advising us on the development of Phoenix and related embedding technology.

Phoenix is still relatively new, but reception has been positive. Here are a few quotes from folks on the technology: 

  • “A huge barrier in getting LLMs and Generative Agents to be deployed into production is because of the lack of observability into these systems. With Phoenix, Arize is offering an open source way to visualize complex LLM decision-making.” –  Harrison Chase, Co-Founder of LangChain
  • “This is something that I was wanting to build at some point in the future, so I’m really happy to not have to build it. This is amazing.” – Tom Matthews, Machine Learning Engineer at Unitary.ai

How does this fit into Arize AI overall? 

Phoenix is designed to be a standalone offering delivering ML observability in a data science notebook environment where data scientists build models. 

The team designed Phoenix so that data scientists can quickly evaluate their models’ decisions, augment and iterate on data, and identify patterns or clusters, enabling production workflows such as prompt iteration and model analysis without relying on engineering teams for implementation. This is key to empowering enterprise data science teams and anyone building on top of foundation models, giving them the right tools to improve performance and model outcomes.

It is the Arize AI team’s vision that these notebook-based ML observability tools (personal tools for the data scientist) connect to larger platforms such as Arize. The ability to download datasets, iterate locally, and upload clusters of data or discoveries into larger platforms will become the standard operational workflow for fixing and improving AI systems.

Why are tools like this important? 

There is probably nothing more important in the tech world right now than tools that help teams understand what AI is doing, where it is going wrong and why.

According to a University of Pennsylvania study, 80% of the U.S. workforce and over 300 million people globally will have their jobs impacted by GPTs. Generative AI is already reshaping industries in ways we’re barely starting to understand. As new applications get built, Phoenix is here to provide the right guardrails to experiment and innovate with this new technology safely. By remaining open source, Phoenix gives the implementers of AI the ability to evaluate LLMs and generative models in an unbiased environment. 

How does Arize AI foster a culture of innovation and collaboration to encourage research-driven advancements in the field of AI monitoring and explainability?

Arize combines diversity with unique passion and expertise to continue to encourage research-driven advancement, and I’m proud to be a part of such a talented group. Perhaps nowhere is that culture of innovation more evident than in Phoenix, which realizes a vision of contributing to the open source community.

Conclusion

We would like to express our heartfelt gratitude to Xander Song for sharing his experience, knowledge, and perspective. 

As we conclude this interview, let us remember that the future of AI lies in the hands of visionaries like those at Arize AI. Together, we can continue to unlock the untapped potential of AI, shaping a future where technology serves as a catalyst for positive change in all aspects of our lives.

Thank you for joining us on this enlightening journey, and we look forward to witnessing the continued success and groundbreaking advancements from Arize AI and its exceptional team.