Artificial Intelligence

The Dos and Don’ts of Production-Grade GenAI Software

As businesses race to adopt artificial intelligence, generative AI stands out as a game changer, reshaping how we interact with technology and data. From crafting personalized customer experiences to powering intelligent virtual assistants, GenAI is not just a buzzword—it's a catalyst for innovation. Yet, the journey to creating robust GenAI software is not a straightforward one. There are no easy wins; simply plugging in solutions like ChatGPT won't solve complex business problems. Instead, a more nuanced approach is required to develop tailored AI solutions that truly address specific business needs. This article distills Brightly’s hands-on experience into essential dos and don'ts when developing GenAI applications for production use. From automated test generation to the critical importance of AI design, we provide actionable insights to help you navigate the GenAI landscape. Dive in to ensure your GenAI applications are not just cutting-edge but also reliable, user-friendly, and ready to exceed expectations.

Generative AI (GenAI) is a transformative force across industries, enabling advanced applications from automated brand analysis to interactive car service manuals, and enhancing productivity with AI assistants like Microsoft’s Copilots and ChatGPT. However, for business leaders and AI experts who are at the forefront of this technological revolution, the journey to develop production-ready GenAI software is fraught with complexity. It demands not only a strategic mindset but also a deep understanding of the unique challenges inherent to GenAI applications.

In this post, we'll equip you with the dos and don'ts of developing GenAI software for production use (moving beyond the proof-of-concept stage) that we've figured out working together with our customers. We’ll focus on text-based models, particularly large language models (LLMs), as the realm of GenAI for images and videos presents a distinct set of challenges.

No Shortcuts to Success in GenAI

Products like ChatGPT and Microsoft’s Copilots have proven to be versatile tools capable of performing a wide array of tasks that can benefit many businesses. They offer a glimpse into the potential of GenAI, handling everything from drafting PowerPoint presentations to assisting software developers with programming.

However, while these generalist AI products are useful, they often serve as a one-size-fits-all solution and not a shortcut to success. Many businesses find that bespoke AI applications tailored to their specific needs can provide a significant advantage. By addressing the unique pain points of a business or unlocking new opportunities, these bespoke AI solutions can drive business growth in ways that generalist AI products simply can’t.

Building on this, the development of a bespoke GenAI application requires more than picking an AI model and slapping a user interface on it. A truly effective GenAI application also requires data processing and storage solutions, smooth integration with existing business systems, and specialized logic to use the AI model effectively. For a deeper understanding of the full AI stack needed for a custom solution, check out our blog post Today’s GenAI Stack. It highlights the necessity of going beyond the AI model itself to create a system that truly improves business operations and provides a competitive edge in the market.

Automated Testing of GenAI Components

GenAI components are notoriously difficult to test due to their vast array of use cases and possible user interactions. Manually covering all possible scenarios and edge cases is nearly impossible. Moreover, the variability of free-form text output from GenAI components adds another layer of complexity to their evaluation.

To tackle these challenges, we can implement two main strategies using GenAI models:

  1. Automatic Test Generation: GenAI models can be utilized to automatically generate a broad spectrum of test cases, which mimic a variety of use cases and user inputs that your application may face. This method ensures extensive test coverage and reduces the need for labor-intensive manual test creation.
  2. Evaluating GenAI Outputs: Traditional unit-test frameworks struggle with analyzing GenAI outputs (in this case, free-form text). An alternative is to use a separate GenAI model to verify the generated text. By measuring against established guidelines, this evaluator AI can check whether the component's outputs satisfy the predefined criteria the developer has set.
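
The second strategy can be sketched in a few lines of Python. Here, `judge_llm` stands in for whatever chat-completion API you use, and the criteria and prompt wording are illustrative rather than a fixed recipe:

```python
def build_judge_prompt(criteria: list[str], output_text: str) -> str:
    """Ask the judge model to grade an output PASS/FAIL per criterion."""
    rules = "\n".join(f"- {c}" for c in criteria)
    return (
        "You are a strict QA reviewer. For each criterion below, answer "
        "PASS or FAIL for the given text.\n"
        f"Criteria:\n{rules}\n\nText:\n{output_text}"
    )

def evaluate_output(judge_llm, criteria, output_text) -> bool:
    """Return True only if the judge marks every criterion as PASS."""
    verdict = judge_llm(build_judge_prompt(criteria, output_text))
    return "FAIL" not in verdict.upper()
```

Wired into a regular test suite, a check like this runs on every commit, so regressions in tone, factuality, or formatting surface before they reach users.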

Tools like Giskard play a crucial role in this process. They automatically generate test suites specifically designed for your GenAI components, addressing several critical areas:

  • Check Your Use Case: Tests to confirm that your GenAI component functions correctly, producing the appropriate output for the provided inputs.
  • Check Edge Cases: Giskard’s ability to generate a variety of unexpected inputs—inputs that developers might not anticipate—is invaluable. By passing these tests, we can be relatively confident that our GenAI component is sufficiently robust for real-world use, where end-users may interact with the application in unforeseen ways.
  • Look for Common Problems: Checking for typical issues is essential for the ethical development of AI applications. This includes detecting instances where the AI model generates incorrect or fabricated information (hallucinations), verifying its capability to manage diverse scenarios (robustness), ensuring it does not exhibit unfair biases (racial, political, etc.), and safeguarding against user inputs that could lead to security vulnerabilities.

Employing automatic test generation tools like Giskard enables you to ensure your GenAI components are not only high-performing and robust, but also ethically responsible. Comprehensive testing safeguards against harmful biases, unfairness, or misuse of your application. This approach is a vital part of responsible AI development, i.e., aligning technical expertise with ethical standards.

Update Model Version with Caution

GenAI models are constantly evolving. For instance, OpenAI's GPT-4 has rolled out three updates in the past year alone. Midjourney’s image generation model isn't far behind, with several new versions since early 2023, and Anthropic's chat-AI Claude has also been updated multiple times.

While it's tempting to jump on the latest model version to benefit from the freshest knowledge, enhanced performance, and new features, there's a catch. We've observed that even minor updates can alter the model's behavior. For example, when we moved from the March 14th version of GPT-4 to the June 13th version for a project, we noticed a significant drop in the model's ability to reason within our specific area of interest.

Our advice? Stick to a fixed version of the model. When you do decide to upgrade, take the time to thoroughly test the new version for any setbacks in performance. We managed to fix our client's issue by tweaking the prompts given to the AI, but if we had set the model to update automatically, it would have caused a noticeable dip in the quality of our client’s application.

By carefully managing updates and thoroughly testing new versions, you can maintain a consistent and reliable user experience for your application.
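
In practice, pinning can be as simple as a guard in your configuration. The alias and snapshot names below are illustrative; use whatever dated snapshots your provider actually publishes:

```python
# Floating aliases that a provider may silently repoint to newer models.
# The entries here are examples, not an exhaustive list.
FLOATING_ALIASES = {"gpt-4", "gpt-4-turbo", "claude-3-opus"}

def pinned_model(name: str) -> str:
    """Reject floating aliases; require an explicit dated snapshot."""
    if name in FLOATING_ALIASES:
        raise ValueError(
            f"'{name}' is a floating alias; pin a dated snapshot "
            "and re-run your evaluation suite before upgrading."
        )
    return name

# The snapshot our tests were run against (illustrative version string).
MODEL = pinned_model("gpt-4-0613")
```

A guard like this turns an accidental "latest model" deployment into a loud configuration error instead of a quiet quality regression.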

Bigger Isn't Always Better in GenAI

While the allure of the latest and largest GenAI models is strong, it's important to consider their impact on latency and response times. In scenarios where an application requires frequent interaction with the AI, opting for a larger model could significantly affect the user experience, potentially leading to decreased satisfaction due to slower responses.

It's worth noting that the speed of GenAI models is not fixed; it improves over time as developers optimize the software and as advancements in hardware are made. For applications where speed and low latency are crucial, smaller models may be the more prudent choice.
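
A quick way to ground this decision is to measure latency against your own prompts before committing to a model size. A minimal sketch, where `model_call` is any callable wrapping an endpoint you want to benchmark:

```python
import statistics
import time

def latency_profile(model_call, prompts, runs=3):
    """Return median and worst-case latency in seconds over sample prompts."""
    samples = []
    for prompt in prompts:
        for _ in range(runs):
            start = time.perf_counter()
            model_call(prompt)
            samples.append(time.perf_counter() - start)
    return {"median": statistics.median(samples), "max": max(samples)}
```

Running this against both a large and a small candidate model on realistic prompts makes the latency-versus-quality trade-off concrete rather than anecdotal.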

Moreover, the landscape of model providers is diverse. Competing architectures are emerging, such as Mistral’s mixture-of-experts (MoE) model Mixtral, which has quickly become popular. Despite its smaller size, which allows for rapid responses, it competes well in quality with OpenAI’s models like GPT-3.5-turbo.


Maximize Uptime by Load Balancing Across AI Service Instances

We're big fans of Azure OpenAI Service and Amazon Bedrock. They're like a dream come true, offering us the power of cutting-edge AI models without the hassle of managing heavy-duty infrastructure ourselves. In fact, nearly 90% of our customers' GenAI projects are built using these platforms. But there's a catch: both services have rate limits that can put the brakes on scaling up GenAI applications for lots of simultaneous users.

The smart move? Spread your bets. Use multiple service instances across various regions and set up a system for load balancing and keeping an eye on AI service health. By spreading out across different AI service instances, you're not just opening the door to more users. You're also building a safety net for your business. If one region’s AI service gets overwhelmed—which we've seen a few times over the past year, thanks to the growing demand for cloud AI services—you'll keep running without a hitch.
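
A minimal sketch of this pattern in Python, assuming each instance is wrapped in a callable (real endpoint clients, retry policies, and health monitoring are up to you):

```python
import itertools

class LoadBalancedClient:
    """Rotate requests across AI service instances; fail over on errors."""

    def __init__(self, instances):
        self._instances = list(instances)
        self._rotation = itertools.cycle(self._instances)

    def call(self, prompt):
        last_error = None
        # Try each instance at most once per request.
        for _ in range(len(self._instances)):
            instance = next(self._rotation)
            try:
                return instance(prompt)
            except Exception as err:  # rate limit, outage, timeout, ...
                last_error = err
        raise RuntimeError("all AI service instances failed") from last_error
```

Because the rotation persists across requests, traffic is spread evenly in the happy path, and a throttled region is simply skipped rather than taking the application down.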

Never Skip AI Design

When it comes to generative AI applications, the possibilities can seem endless. With a natural language interface, like the one in ChatGPT, many users will think they can ask the system to do just about anything.

That's why it's crucial to invest time in the design process. You need to map out which use cases your application should support and, just as importantly, which ones it can't. For instance, we built a Retrieval Augmented Generation (RAG) system for a client that allowed their users to converse with the AI about the contents of their documents. The system was really good at pulling relevant answers from the documents and blending them with the AI's own insights. But it wasn’t long before we noticed some users asking specific quantitative questions about the documents—queries about word counts or the number of mentioned entities.

Given that a RAG system typically has access to only portions of a document or document collection, it was not equipped to provide accurate answers to these types of questions, despite its attempts to do so. Our solution? We augmented the AI model with a clear understanding of its capabilities. Now, it can accurately inform users when a request is out of its reach. Clarity is essential for maintaining user trust and ensuring a positive experience with your GenAI application.
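
One way to give the model that understanding is to state its capabilities and limits explicitly in the system prompt. The wording below is illustrative, not the prompt we shipped; the real text is tuned per application:

```python
CAPABILITIES = [
    "answer questions about the content of the retrieved document passages",
    "summarize and compare the retrieved passages",
]
OUT_OF_SCOPE = [
    "exact word counts or other whole-document statistics",
    "counting every mention of an entity across the collection",
]

def rag_system_prompt() -> str:
    """Build a system prompt that tells the model what it can and cannot do."""
    can = "\n".join(f"- {c}" for c in CAPABILITIES)
    cannot = "\n".join(f"- {c}" for c in OUT_OF_SCOPE)
    return (
        "You answer using only the retrieved passages.\n"
        f"You CAN:\n{can}\n"
        f"You CANNOT (you only see parts of each document):\n{cannot}\n"
        "If a request falls in the CANNOT list, say so instead of guessing."
    )
```

An explicit refusal for out-of-scope questions is far better for user trust than a confidently wrong word count.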

Educating Your Users Is Key

When it comes to your GenAI application, you'll encounter users with differing levels of experience with AI. There are the power users who test the boundaries of the app and often come up with innovative uses that even the designers didn't foresee. Alongside them, you have users with a moderate grasp of AI technology and those who are just beginning their journey.

Some users new to GenAI might interact with a chat interface as though it were a simple search engine, typing in queries like "problem with car cold start" or "clicking noise in LaundryBasket washing machine." This is where the importance of user education shines. By embedding educational prompts and interactive tutorials within your app, you can help users of all levels quickly become adept at conversing with the AI while also recognizing the system's boundaries.

For a seamless educational experience, consider also including resources like Brightly's AI Horizon package. One of its key modules focuses on leveling up your employees’ AI skills, making it an ideal solution for bringing your entire organization up to speed with the ins and outs of GenAI.

Final Notes

The field of GenAI is expanding rapidly, and the guidance we've shared here is designed to help you build strong, user-friendly, and scalable GenAI applications. By following these practical dos and don'ts, you'll be better equipped to avoid common issues and make the most of GenAI technology.

In summary, remember to:

  • Acknowledge No Shortcuts to Success: Generalist AI products like ChatGPT are convenient, but transformative applications often demand bespoke GenAI solutions.
  • Automate Your Testing: Use automated tools to test your GenAI components efficiently.
  • Be Cautious with AI Model Updates: Carefully evaluate new versions of GenAI models before using them.
  • Choose the Right AI Model Size: Consider the impact of model size on performance and user experience.
  • Build a Reliable Infrastructure: Use multiple AI service instances to keep your services running smoothly.
  • Commit to AI Design: Prioritize the AI design phase as a non-negotiable step in your development process to fully explore and define the scope of your GenAI application.
  • Educate Your Users: Help users get the best out of your GenAI application with clear instructions and support.

By taking these steps, you'll be well on your way to creating GenAI solutions that not only meet but exceed user expectations. The future of AI is generative, and with thoughtful development practices, your software will lead the charge in this exciting new era.

If you're looking to dive deeper into GenAI or want to start implementing these strategies in your projects, we're here to help. At Brightly, we believe in the power of collaboration to bring about the best in technology. Reach out to us, and let's explore how GenAI can make a difference in your work.

Authors

Janne Solanpää

Armed with a tech PhD and a strong background in AI, data science, data engineering, and software engineering, Janne Solanpää is a seasoned specialist in the field. His expertise lies in designing and implementing scalable data infrastructure on leading cloud platforms and in developing innovative software solutions. Leveraging advanced analytics and cutting-edge AI solutions, Janne has been instrumental in helping businesses unlock the potential of their data, driving growth and success.