At Olvy we built a tool that extracted user feedback from all customer interactions, ran it through an analysis and enrichment pipeline, and generated beautiful reports for the user. At least, that’s what our marketing focused on. The overall product did many things (one of the reasons we failed; I’ll discuss this in later posts), and we had AI integrated into every piece of it. This was before agents became all the rage, when AI workflows meant call chaining, RAG, fine-tuning, and prompt engineering. We processed millions of data points and consumed billions of tokens every month. At one point we were one of the few SaaS companies from India that had built a solid production use case for AI in the enterprise, with paying customers who loved the product.
Over that journey, we learned a lot about running AI apps in production: the kinds of issues you run into, the infrastructure you need to set up, and the experiences you need to build to solve those problems. If you’re working at a startup and building a product using LLMs, this should give you a starting point on best practices to follow.
Context
We started as a changelog tool, and one of our features let users share feedback on changelog updates. We ran that feedback through a sentiment analysis pipeline to give you a sense of “Did my users like my release?” – it was a gimmicky feature to attract attention. We didn’t expect the response it received. The customer requests around that feature are what led us to build the feedback management and product lifecycle vision of Olvy.
The first versions of the system used Google’s Cloud NLP APIs for sentiment and entity extraction. We later moved to OpenAI. We had fine-tuned models on OpenAI running classification use cases (“Is this piece of text genuine feedback, or a random support ticket?”). We also stored embeddings for all the feedback landing in the system. We ran clustering on the feedback to identify themes (common issues people were complaining about). We generated text summaries across the thousands of pieces of user feedback a company had. We automated the generation of Jira/Linear tickets from user feedback. We built a tool to generate PRDs.
Basically, we ran a lot of experiments, and we didn’t work on building the core models that powered the workflows. Our entire AI thesis in 2021, when we started, was that the hard technical challenges were going to be solved by everyone else, and our small team wouldn’t do a good job at them; what we should focus on instead was building a great experience on top of these technologies for a good business use case.
Let’s walk through how our systems evolved, and the lessons we learned.
Make Everything Configurable
We started with just feedback classification using OpenAI, where we used a fine-tuned GPT-3 model. We also built WriteMyPRD (which went viral), which let product managers generate their PRDs. It also ran on GPT-3.
Then OpenAI launched GPT-3.5, and overnight we switched all our API calls to gpt-3.5. We had to change the request structure and params to make it work.
Then GPT-4 came, and we also needed to keep GPT-3.5 around for some use cases because it was cheaper to run. Then OpenAI began deprecating the older fine-tuned GPT-3 models, so we had to switch our classification models. Then we solved the feedback summarization workflow for reports, and we played around with Claude. Things just kept changing significantly every few weeks or months.
When we started, the system wasn’t designed to handle these changes, so with every change in the early days we refactored our analysis pipeline to make it easier to plug things in and out later.
Eventually, we created a separate microservice for all AI workflows (which was our first service, as Olvy was a monolith backend).
If I were starting today, I’d keep the entire journey in mind and do two things.
- Break the AI workflows into a separate service, and write it in Python. Python, because as you scale you might need to run a few small models yourself or expand beyond the LLM APIs, and Python has the best ecosystem for that.
- Treat your abstractions for LLM API calls as black boxes: you give them a value and expect another value back. The model, how the prompt is structured, and everything else are implementation details. Don’t marry the API implementation either; everything can change tomorrow. The real value is in the application. (A minimal sketch of what this can look like follows below.)
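Here’s a minimal sketch of what that black-box shape can look like in Python. The names and the provider call are illustrative, not the code we actually ran:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class ClassificationResult:
    is_feedback: bool
    confidence: float


class FeedbackClassifier(Protocol):
    """The application only sees this interface; the model, the prompt,
    and the provider behind it are implementation details."""

    def classify(self, text: str) -> ClassificationResult: ...


class OpenAIFeedbackClassifier:
    """One possible implementation. Swapping to another provider or a
    self-hosted model just means writing another class with the same method."""

    def __init__(self, client, model: str, prompt_template: str):
        self._client = client          # any LLM client wrapper you've built (hypothetical)
        self._model = model            # loaded from config, not hardcoded
        self._prompt = prompt_template

    def classify(self, text: str) -> ClassificationResult:
        completion = self._client.complete(
            model=self._model,
            prompt=self._prompt.format(text=text),
        )
        # Parsing the completion into a structured result is also an
        # implementation detail hidden from the rest of the application.
        return ClassificationResult(
            is_feedback=completion.strip().lower() == "feedback",
            confidence=1.0,
        )
```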
var model; var prompt; type Business;
The response of your LLM depends purely on two things: the model you’re using and the prompt you’ve provided. Your job as a developer is to keep both of those easy to change.
The model and the prompt should be configurable by business users. They shouldn’t be tied to code, and they shouldn’t be treated as code. The prompts, the model, and the parameters should live outside your codebase, or be exposed via some admin UI, so business people can configure them.
Once you go to production, you’ll come across a lot of cases you didn’t think of, and you’ll keep introducing new rules for your LLM to consider when generating the response (some of those rules are going to be in ALL CAPS, iykyk).
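To make that kind of change without a deploy, the prompt, model, and params can live in a store your admin UI writes to and your service reads at call time. A minimal sketch, with a hypothetical DB client and table:

```python
import json


def load_llm_config(db, workflow: str) -> dict:
    """Fetch the current prompt/model/params for a workflow from a config
    store (a DB table, a KV store, or an AI gateway). Nothing here is baked
    into the code, so business users can change it without a release.
    `db.fetch_one` is a stand-in for whatever DB client you use."""
    row = db.fetch_one(
        "SELECT config FROM llm_configs WHERE workflow = %s", (workflow,)
    )
    return json.loads(row["config"])


# Example of what a stored config might look like:
# {
#   "model": "gpt-4o-mini",
#   "temperature": 0.2,
#   "prompt": "Classify the following text as FEEDBACK or NOT_FEEDBACK. ..."
# }
```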
Ideally, if you’re using an AI gateway, it already provides this. At Olvy we used Portkey: we configured our models, prompts, and params there, and Portkey gave us an API we could call. This just simplified everything.
Another advantage is that you can run tests against your configuration and improve its accuracy. If it lives inside your codebase, you’ll have to do that testing outside. Some newer AI gateways also let you run accuracy tests on your prompts and models. Trust me, as you start selling to larger companies, the accuracy number is going to come up a lot.
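Even without a gateway, a basic accuracy check is just a loop over a hand-labelled set. A minimal sketch, reusing the hypothetical classifier interface from the earlier sketch:

```python
def measure_accuracy(classifier, labelled_examples) -> float:
    """labelled_examples: a list of (text, expected_is_feedback) pairs,
    e.g. pulled from data your team has manually tagged."""
    correct = 0
    for text, expected in labelled_examples:
        result = classifier.classify(text)
        if result.is_feedback == expected:
            correct += 1
    return correct / len(labelled_examples)


# Run this every time the prompt or model config changes, and track the
# number over time -- it's the figure enterprise buyers will ask about.
```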
Monitoring from Day One
When we started, we monitored our AI calls like we would any other API calls, using the same tools: log the incoming requests, the response status, the error message, the time taken, the user ID, and so on.
Very soon after the systems went live, we started getting complaints that the AI was giving users wrong answers, that it was hallucinating, or that the output they received didn’t make sense to them (even when it must have been correct).
Your LLM API calls need to be monitored separately. Instead of only monitoring the request-response status, latencies, and errors, you also need traces of the prompt and the output so you can dive deeper.
There should be a place you can go to and see which API calls with what prompts returned what completions. This will become very critical to making sure your system is performing as you intended.
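In practice that means logging a structured trace for every LLM call, prompt and completion included, somewhere you can query later. A rough sketch, assuming a generic `llm.complete` wrapper (the field names are illustrative):

```python
import logging
import time
import uuid

logger = logging.getLogger("llm_traces")


def traced_llm_call(llm, *, workflow: str, user_id: str, prompt: str, **params):
    """Wrap every LLM call so the prompt, completion, latency, and token
    usage all end up in one queryable place."""
    trace_id = str(uuid.uuid4())
    started = time.time()
    try:
        response = llm.complete(prompt=prompt, **params)  # hypothetical wrapper
        logger.info(
            "llm_call",
            extra={
                "trace_id": trace_id,
                "workflow": workflow,
                "user_id": user_id,
                "model": params.get("model"),
                "prompt": prompt,
                "completion": response.text,
                "prompt_tokens": response.prompt_tokens,
                "completion_tokens": response.completion_tokens,
                "latency_ms": int((time.time() - started) * 1000),
                "status": "ok",
            },
        )
        return response
    except Exception as exc:
        logger.error(
            "llm_call",
            extra={
                "trace_id": trace_id,
                "workflow": workflow,
                "user_id": user_id,
                "model": params.get("model"),
                "prompt": prompt,
                "error": str(exc),
                "latency_ms": int((time.time() - started) * 1000),
                "status": "error",
            },
        )
        raise
```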
Another facet of monitoring is cost. Your monitoring setup should also give you consumption numbers: the tokens you’ve consumed, which user or customer is consuming the most tokens, and how much each customer is costing you.
So, all in all, one suggestion here: choose a good AI gateway that gives you these things right away.
Controlling Costs
Continuing from the monitoring piece: initially, our OpenAI bill came to around $100 a month – low scale, fewer customers, fewer demos, etc.
One fine day, everything stopped working. We had a hard limit set at $500, and API calls started failing. When we went looking, we were surprised at how much consumption had risen. We had recently launched a few free tools, which could have caused it, but we didn’t know anything more. No problem: we raised the limit, set up soft and hard limits again, and went back to work.
A few months later, the bill crossed $1500 – WHAT! What happened? Token costs were going down with every model, yet our bills kept increasing. By this time we had gotten rid of the other APIs and everything was on OpenAI, so it made sense why the spike happened, but we found ourselves asking more questions. Which user consumed it? Was it us importing all those data points during customer demos, an enterprise POC we were running, or a free customer abusing our system? None of it was clear, and OpenAI’s usage monitoring was pretty bad. It eventually improved, but you’ll still get a far better experience switching to a dedicated AI gateway/monitoring solution.
My suggestion here is the same as above: get a good monitoring solution that doesn’t just give you an overall idea of costs, but breaks them down per user/account. Beyond your engineering systems, you need this to make the simple calculation of whether you’re making or losing money on a customer.
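Once token usage is attributed per customer (as in the trace sketch above), that calculation is simple arithmetic. A sketch with illustrative prices – check your provider’s current pricing:

```python
# Illustrative per-1K-token prices -- check your provider's current pricing.
PRICE_PER_1K = {
    "gpt-4o-mini": {"prompt": 0.00015, "completion": 0.0006},
}


def customer_llm_cost(usage_rows) -> float:
    """usage_rows: iterable of dicts like those logged in the trace sketch,
    filtered to a single customer and billing period."""
    total = 0.0
    for row in usage_rows:
        prices = PRICE_PER_1K[row["model"]]
        total += row["prompt_tokens"] / 1000 * prices["prompt"]
        total += row["completion_tokens"] / 1000 * prices["completion"]
    return total


def margin(monthly_plan_price: float, llm_cost: float) -> float:
    """Rough check: is this customer's plan covering what they cost us?"""
    return monthly_plan_price - llm_cost
```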
Data -> Models
Remember the classification example above, where we fine-tuned a model to identify whether a data point was feedback or not? To do that fine-tuning we needed data. If our data was wrong, our classification was going to be wrong too. We also didn’t have much customer-shared data in our systems that we could fine-tune the model on.
Around this time, my co-founder had an idea for a side project to get more eyeballs. We built TwitterFeedback.app – a simple tool where you entered a Twitter handle, we extracted all tweets and helped you analyze your Twitter mentions like Olvy analyzed your user feedback. I loved this tool because it worked in two ways. One, people discovered it, entered their Twitter handle, and got a taste of what Olvy could give them. Two, in the UI we had built a way to manually tag a tweet as feedback or not feedback. People were just scrolling by and didn’t use it a lot, but we did. It became a way for us to extract all the data, tag it, and then fine-tune the model based on that.
Marketing advantages aside, the day after we launched we extracted more than a million data points, and our classification model’s accuracy went from 80% to 97%.
My suggestion here: you’ll need data to fine-tune and improve the accuracy of your LLMs, and you’ll have to figure out where to get it. A quick Google search will turn up a lot of public data for you to play with. Once you have the data, spend time either on something off the shelf or on building your own tagging system. For a classification system, your tag can just be a boolean flag, nothing complicated.
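For what it’s worth, turning that boolean-tagged data into a fine-tuning dataset can be as simple as writing a JSONL file. This sketch uses OpenAI’s chat-style fine-tuning format; the shape is similar for most providers:

```python
import json


def write_finetune_file(tagged_rows, path: str) -> None:
    """tagged_rows: iterable of (text, is_feedback) pairs from your tagging UI.
    Writes one JSON object per line in the chat fine-tuning format."""
    with open(path, "w") as f:
        for text, is_feedback in tagged_rows:
            example = {
                "messages": [
                    {"role": "system",
                     "content": "Classify the text as FEEDBACK or NOT_FEEDBACK."},
                    {"role": "user", "content": text},
                    {"role": "assistant",
                     "content": "FEEDBACK" if is_feedback else "NOT_FEEDBACK"},
                ]
            }
            f.write(json.dumps(example) + "\n")
```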
Solve for Trust on Day One
The biggest challenge with responses generated by AI is that they are non-deterministic. Given an input, you have no guarantee of what the output will look like; run the same prompt twice, and you can get two different answers.
When AI was very early, it was a novelty; people were just amazed to see outputs that at least read correctly. Eventually, as the novelty wore off, everybody focused on “How is this going to help me?” – and once that was answered, the skepticism set in. People had been using AI a lot for their work, and they knew the limitations of the system. So the reactions went from “This is amazing!” to “How do I know it is correct?”
This is the current stage. As the models improve, this limitation might fade, but it will never fully go away. As someone who’s been writing software since high school, I’ll say it plainly: never trust a computer.
This is not a technology problem; it’s an experience problem. You solve it by building a user experience that earns the user’s trust in the output you’ve given them. One core principle for that experience is transparency: the model is already a black box, and if your UX is also a black box, the trust will go away. Tell the user exactly what you’re doing internally for trust-sensitive results, and show the exact data you used to generate their response. This is the best way I’ve found; there may be better approaches.
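One concrete way to build that transparency is to make every trust-sensitive result carry the exact inputs it was generated from, so the UI can show them alongside the output. A minimal sketch (the shapes and names are illustrative, not what we shipped):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class SourcedSummary:
    """A generated result that carries the exact data it was built from,
    so the UI can show the sources instead of presenting a black box."""
    summary: str
    source_feedback_ids: List[str] = field(default_factory=list)
    model: str = ""
    generated_at: str = ""


def summarize_feedback(llm, feedback_items, model: str) -> SourcedSummary:
    """feedback_items: list of dicts with "id" and "text" (illustrative shape).
    `llm.complete` is the same hypothetical wrapper used in earlier sketches."""
    prompt = "Summarize the following user feedback:\n" + "\n".join(
        f"- {item['text']}" for item in feedback_items
    )
    completion = llm.complete(model=model, prompt=prompt)
    return SourcedSummary(
        summary=completion.text,
        source_feedback_ids=[item["id"] for item in feedback_items],
        model=model,
        generated_at=datetime.now(timezone.utc).isoformat(),
    )
```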
Running AI in production isn’t just about picking the right models. It’s about building systems that can adapt, designing experiences that users can trust, and constantly iterating as everything changes around you. We learned a lot of these lessons the hard way, and if you’re building something with LLMs, hopefully, this gives you a few things to think about. AI is going to keep evolving, but the fundamentals of building great software won’t change.
