Artificial Intelligence (AI) continues to evolve at a breakneck pace, making it nearly impossible for organizations to keep up with the latest industry trends and technological advancements.
As the frequency of new and updated models increases, organizations of all sizes are looking to better understand the business applications of this complex and massively disruptive technology field and improve upon their existing implementations. Innovative experimentation and practical research are crucial to these endeavors.
Boomi innovation experts recently designed an experiment to create a method to evaluate leading AI models side by side. The goal: help AI practitioners and line of business users evaluate the rapidly growing menu of AI tools, models, and technologies.
An AI Performance Measurement Experiment
The team leveraged the Boomi Enterprise Platform to measure the performance of multiple AI models simultaneously, comparing generative AI models that leverage Retrieval-Augmented Generation (RAG) technologies against “vanilla” (standard, unmodified) Large Language Models (LLMs) while also exploring cutting-edge implementations released by generative AI’s industry leaders.
Principal Solutions Architect Chris Cappetta has been working with AI for years. In 2019, he filed a patent (awarded in 2023) for an invention focused on extending Natural Language Processing AI capabilities with business-specific logic, orchestration, and connectivity. Now, he’s an AI expert within the Boomi Innovation Group, leading research and exploring new technologies as well as building new solutions.
Cappetta initiated this particular experiment to investigate and better understand the ever-increasing maze of newly emerging AI capabilities and techniques.
Through analysis of the experiment and its results, Boomi innovators aim to determine which tools and designs might provide the most value to customers, and therefore be the most effective for future Boomi implementations of AI use cases. However, because the available pool of AI tools and designs continues to grow so quickly, the team’s primary goal was to separate speculation from evidence-based strategies.
Retrieval-Augmented Generation vs. Vanilla LLMs
Cappetta developed a side-by-side analysis of several prominent AI designs to see which model yielded the most consistent and highest-quality responses. He built a Boomi process designed to leverage multiple AI models simultaneously and created a scoring system to evaluate each model’s individual performance. This allowed Boomi innovators to create a quantified analysis of AI performance in real-world scenarios, using real-world data.
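The exact Boomi process isn’t reproduced here, but a rough sense of such a harness can be conveyed with a short sketch: the same question is fanned out to several models, and each answer is captured for scoring. Everything in the sketch (the openai Python client, the model names, the placeholder scoring rubric, and the sample question) is an assumption for illustration, not a detail of the actual experiment.

```python
# Hypothetical harness sketch: fan one question out to several models and
# score each answer. Model names, the rubric, and the question are
# illustrative assumptions, not details of the actual Boomi process.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODELS = ["gpt-3.5-turbo", "gpt-4"]  # assumed candidates under test


def ask(model: str, question: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content


def score(answer: str) -> float:
    # Placeholder rubric: a real harness might use human review or an
    # LLM-as-judge prompt; here we only reward non-trivial length.
    return min(len(answer.split()) / 100.0, 1.0)


question = "How do I configure a connector to a REST API?"  # sample question
results = {model: score(ask(model, question)) for model in MODELS}
print(results)
```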
One of the experiment’s central hypotheses was that RAG designs would outperform standalone vanilla LLMs by providing more accurate and high-quality results. Spoiler alert: they did.
Cappetta said that RAG models, which integrate external data retrieval into their response generation, demonstrated superior performance over their base-LLM counterparts. The improvement in response quality from vanilla GPT-4 to the GPT-4 RAG design even outweighed the quality jump from OpenAI’s GPT-3.5 to GPT-4.
This surprised the Boomi team, as one of GPT-4’s most significant advantages over GPT-3.5 is its ability to access more recent information. The noticeable leap in response quality using the GPT-4 RAG design compared to standard GPT-4 is a strong indicator of RAG’s potential in applications requiring up-to-date, specific information.
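To make the distinction concrete for readers less familiar with the pattern: a “vanilla” call sends the question to the model on its own, while a RAG call first retrieves relevant material and supplies it as context. The sketch below is a generic illustration of that difference, not the design used in the experiment; the toy retrieve() function and document snippets are placeholders for a real semantic search step.

```python
# Generic illustration of "vanilla" vs. RAG-style calls (not Boomi's design).
# retrieve() is a toy stand-in for a real semantic search over a knowledge base.
from openai import OpenAI

client = OpenAI()

DOCS = [
    "Connectors are configured on the Build tab under Connections.",
    "Processes are scheduled and deployed from the Deploy tab.",
]  # assumed knowledge-base snippets


def retrieve(question: str) -> str:
    # Toy retrieval: pick the snippet sharing the most words with the question.
    words = set(question.lower().split())
    return max(DOCS, key=lambda d: len(words & set(d.lower().split())))


def vanilla_answer(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content


def rag_answer(question: str) -> str:
    context = retrieve(question)
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": f"Use this reference material as context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```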
Not All AI Is Alike
The experiment also uncovered some other interesting data. When assessing different technologies built on top of GPT-4, the Chat Completions API strongly outperformed the OpenAI Assistants API, even when both leveraged RAG to reference external documentation to guide a response.
“It appears that the Assistants version of the GPT-4 model focused too strongly on the reference material it was provided, creating a summary of that material instead of using it as context to build and support its answer to the user’s original question,” Cappetta said. “The Chat Completion model, on the other hand, was more effective at using the referenced material to build a meaningful answer.”
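One way to illustrate the behavior Cappetta describes is in how the retrieved material is framed in a Chat Completions request: positioned as supporting context, with an explicit instruction to answer the question rather than summarize. The prompt wording below is purely illustrative and is not the prompt used in the experiment.

```python
# Illustrative prompt framing for a Chat-Completions-style RAG call: the
# retrieved text is positioned as supporting context, and the model is told
# to answer the question rather than summarize the material. The wording is
# an assumption for illustration, not the experiment's actual prompt.
from openai import OpenAI

client = OpenAI()


def answer_with_context(question: str, retrieved_text: str) -> str:
    messages = [
        {
            "role": "system",
            "content": (
                "Use the reference material below only as supporting context. "
                "Answer the user's question directly; do not summarize the material.\n\n"
                f"Reference material:\n{retrieved_text}"
            ),
        },
        {"role": "user", "content": question},
    ]
    resp = client.chat.completions.create(model="gpt-4", messages=messages)
    return resp.choices[0].message.content
```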
Semantic Search: Synthetic Data vs. Raw Data
The team got another surprise when comparing synthetic data augmentation with direct semantic searching against raw data. Contrary to initial expectations, searching against raw data chunks yielded comparable, and sometimes better, results than using synthetically generated questions.
“We knew rich data was useful,” Cappetta said. “The revelation was that the specific semantic search design we were using saw better performance from searching larger chunks of information even if the information was intuitively less of a directly close comparison to the structure of the question.”
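A hedged sketch of that comparison: one index embeds the raw chunks themselves, the other embeds synthetic questions generated about each chunk, and both are searched with the same query via cosine similarity. The embedding model, corpus, and query are illustrative assumptions, not the experiment’s data.

```python
# Hedged sketch: compare two indexing strategies for semantic search.
# Index A embeds the raw chunks themselves; Index B embeds synthetic
# questions generated about each chunk. Corpus, query, and embedding
# model are illustrative assumptions.
import math
from openai import OpenAI

client = OpenAI()


def embed(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in resp.data]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


raw_chunks = [
    "Connectors are configured on the Build tab under Connections.",
    "Processes are scheduled and deployed from the Deploy tab.",
]
synthetic_questions = [
    "How do I configure a connector?",  # generated from raw_chunks[0]
    "How do I schedule a process?",     # generated from raw_chunks[1]
]

query_vector = embed(["Where do I set up a new connector?"])[0]

for label, corpus in (("raw chunks", raw_chunks),
                      ("synthetic questions", synthetic_questions)):
    scores = [cosine(query_vector, vec) for vec in embed(corpus)]
    best = max(range(len(scores)), key=scores.__getitem__)
    print(f"{label}: best match is item {best} with score {scores[best]:.3f}")
```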
The Boomi team also conducted a second layer of testing that examined which designs retrieved the highest quality or “best” content, using both raw data sets and synthetically enhanced data with various chunking and tagging methods.
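As a simple example of what “chunking and tagging” can mean in practice, the sketch below splits raw text into overlapping word-based chunks and attaches basic metadata tags. The chunk size, overlap, and tag fields are arbitrary choices for illustration, not the methods evaluated in the experiment.

```python
# Simple example of one possible chunk-and-tag step: split raw text into
# overlapping word-based chunks and attach basic metadata. Chunk size,
# overlap, and tag fields are arbitrary illustrative choices.
def chunk_with_tags(text: str, source: str, size: int = 120, overlap: int = 20):
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = " ".join(words[start:start + size])
        if piece:
            chunks.append({"text": piece,
                           "tags": {"source": source, "word_offset": start}})
    return chunks


sample_text = "Connectors are configured on the Build tab. " * 50  # stand-in document
for chunk in chunk_with_tags(sample_text, source="sample_doc")[:2]:
    print(chunk["tags"], chunk["text"][:60], "...")
```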
“What really struck me from this whole project was that the leading models are fairly well established, but the leading methods of loading context are a vast and variable landscape,” Cappetta said.
While the list of fully established and trusted LLMs is growing, they could still largely be counted on two hands. At the same time, there are countless methods to prepare, chunk, tag, enhance, search, and retrieve relevant context to provide to those LLMs, and any of them could end up playing a role in an AI implementation.
Future Directions and Applications
These findings open up numerous possibilities for organizations striving to find practical applications of AI. The insights gained from this experiment provide a roadmap for leveraging AI more effectively across various industries and orchestrating more meaningful AI activities, from enhancing chatbot interactions to refining search algorithms.
While industry giants like OpenAI and Google will continue to lead the way in developing future LLM models, Boomi is uniquely positioned to use its platform to orchestrate larger and more practical use cases.
A longer version of this post first appeared on Observing.AI in February 2024. For more insights into the experiments, read the Boomi Community article.