Beyond GenAI Chatbots: The Bedrock AI Advantage
Nov. 7, 2023, Repost from Bedrock AI Blog -- Generative AI is exceptional at imitating human language patterns. GenAI, however, is not a search engine or database that can be queried. Some of the limitations of GenAI:
Generative models hallucinate: they make up very plausible-sounding false information.
A lack of reliability/repeatability: if you want a guarantee of the same results every time, quality has to be sacrificed.
Reasoning errors: while language models often appear to reason like a human, they are in fact doing advanced pattern matching, which can fail in unpredictable ways.
GenAI models are even more likely to hallucinate (invent “facts”) in areas where there is limited existing information on the web. Finance is a blind spot for most generalist language models, i.e. models like GPT-4 that have been trained primarily on web data.
GenAI models predict the next word to output by determining what the most likely continuation would be, given your input and the words they have written so far. This process ensures grammatical cohesion and means the response almost always sounds right. However, it also means that generative models are not designed to provide factual information, especially in the context of a chatbot. Chatbots often do provide factual information because it is also the “most likely” response, but that is not what they were designed to do. As shown in our experiment below, factuality rates decrease sharply for more niche or complex topics like finance.
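The next-word mechanism described above can be pictured with a toy example. This is a minimal sketch, not how any production model is implemented: the prompt, the candidate words and their scores are invented for illustration, and the function names are hypothetical. It shows that always taking the most likely word is repeatable, that sampling is not, and that nothing in either procedure checks whether the chosen word is true.

```python
import math
import random

# Invented scores for the next word after the prompt
# "The company reported revenue growth of ___".
# These numbers are illustrative only; they come from no real model or filing.
NEXT_WORD_LOGITS = {"5%": 2.0, "12%": 1.8, "8%": 1.5, "flat": 0.5}


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = {word: score / temperature for word, score in logits.items()}
    peak = max(scaled.values())
    exps = {word: math.exp(score - peak) for word, score in scaled.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}


def greedy_next_word(logits):
    """Deterministic decoding: always return the single most likely word."""
    return max(logits, key=logits.get)


def sampled_next_word(logits, temperature, rng):
    """Stochastic decoding: draw a word in proportion to its probability."""
    probs = softmax(logits, temperature)
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]


# Greedy decoding gives the same word on every run.
greedy_runs = {greedy_next_word(NEXT_WORD_LOGITS) for _ in range(5)}

# Sampling gives different words across runs (the seed is fixed only so the demo is reproducible).
rng = random.Random(0)
sampled_runs = [sampled_next_word(NEXT_WORD_LOGITS, 1.0, rng) for _ in range(200)]

print("distinct greedy answers:", len(greedy_runs))
print("distinct sampled answers:", len(set(sampled_runs)))
```

Every candidate in the toy distribution reads as a fluent completion, which is why a wrong figure can sound just as right as a correct one.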
At Bedrock AI, we approach these limitations by using the right language model for the right job. We ensure factuality by making sure we never put the model in a position where it will fail. This is one of the reasons we do not provide a general-purpose chatbot interface. We use generative models to do what they do best: wordsmithing. We use in-house finance-specific models for information retrieval, topic tagging, noise reduction and more. We control the process from beginning to end in order to ensure accurate, reliable results.
Our in-house language models are trained on more than 8 million pages of financial disclosure for better domain understanding. Some of our summarization workflows involve six different language models designed to perform specific tasks. Learn more about our approach to language modelling and AI research here: Financial NLP and Large Language Models - The Bedrock AI Advantage.
Our experiment: Bedrock AI vs. other GenAI tools
To illustrate the difference between Bedrock and other generative AI tools, we tested it against two chatbots, ChatGPT and a finance-specific bot, Hila.ai, on a randomly selected group of U.S. public companies:
Euronet Worldwide (EEFT),
Hamilton Lane (HLNE),
Flowers Foods (FLO),
Extra Space Storage (EXR), and
Domino’s Pizza (DPZ).
We chose midsize companies that are moderately well-known to better illustrate chatbot failure modes. Chatbots are more likely to provide correct answers for companies like Alphabet, Meta and Tesla where there is extensive information on the web.
We selected three different aspects of qualitative business analysis to test each modelling platform: understanding and retrieval accuracy on general business operations, business segments, and seasonality. Each requires some degree of financial or corporate “knowledge”. You may be surprised by the results.
Note that Hila.ai is a generative chatbot designed specifically to pull info from SEC filings and earnings transcripts. Unlike ChatGPT, each question put to Hila requires the user to specify a filing in which the bot will search for the answer. While this adds complexity to the information search process, it is set up that way to limit the opportunity for hallucination. As shown below, however, hallucination persists.
Our comparison to ChatGPT is not apples-to-apples. For the purposes of this test, we did not provide the relevant filing to ChatGPT within the prompt. We structured it this way because the majority of users we speak with interact with chatbots without providing a source file. The comparison to Hila.ai (which we believe is using the GPT-4 API) is, however, a more appropriate comparison.
The comparison to Hila.ai demonstrates our superior results when compared to a generative model-based application that uses best practices regarding retrieval etc., while the comparison to ChatGPT demonstrates our results compared to the most common usage pattern. Both comparisons cover seasonality, business segments and business operations.
Comparison with other GenAI chatbots on business operations
We asked ChatGPT and Hila.ai specific questions about the details of each company’s business operations rather than open-ended ones. Note that getting the info you want from a chatbot this way assumes you already know what to ask. The following table summarizes whether each tool answered the prompts correctly for each company:
ChatGPT - Score: 1.5/5. ChatGPT provided a good list of Flowers Foods Inc.’s brands. For Domino’s, it was able to provide the number of locations, but the figure was from two years ago, so we awarded it half a point. For the other three companies, ChatGPT could not retrieve material information disclosed in each company’s 10-K.
Hila.ai - Score: 0.5/2. Hila did not have coverage for three companies (Euronet, Flowers Foods, Hamilton Lane). It did retrieve the number of stores for Extra Space Storage, but from the company’s previous 10-K filing rather than its most recent one, so we granted it half a point. For Domino’s, Hila incorrectly reported that information on the number of locations was unavailable in the filing.
Bedrock AI - Score: 5/5. Bedrock AI proactively identifies the most relevant information about a company’s operations, so you do not have to know what to ask. For instance, here are a few detailed tidbits you will find in Bedrock’s AI-generated background memos, amongst other relevant information about the business operations:
Euronet: “The EFT Processing Segment processes transactions for a network of 45,009 ATMs and approximately 613,000 POS terminals across Europe, Africa, the Middle East, Asia Pacific, and the United States.”
Hamilton Lane: “As of March 31, 2023, the company manages approximately $112 billion of assets under management (AUM) and approximately $745 billion of assets under advisement (AUA).”
Domino’s: “Domino’s is the largest pizza company globally, operating more than 19,800 locations in over 90 markets.”
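For readers who want to check the arithmetic, the three scores above can be tallied as follows. This is only a sketch: the per-company points are read off the descriptions above (Bedrock AI’s individual points are an assumption implied by its 5/5 total), and the `tally` helper is a hypothetical name introduced here. A company a tool does not cover is left out of that tool’s denominator, which is why Hila.ai is scored out of 2.

```python
# Points per company, read off the written descriptions above.
# None means the tool had no coverage for that company.
POINTS = {
    "ChatGPT": {
        "Euronet": 0,
        "Hamilton Lane": 0,
        "Flowers Foods": 1,        # good list of brands
        "Extra Space Storage": 0,
        "Domino's": 0.5,           # location count was two years old
    },
    "Hila.ai": {
        "Euronet": None,
        "Hamilton Lane": None,
        "Flowers Foods": None,
        "Extra Space Storage": 0.5,  # store count came from the previous 10-K
        "Domino's": 0,               # wrongly reported the figure as unavailable
    },
    "Bedrock AI": {
        # Assumption: one full point per company, implied by the 5/5 total.
        "Euronet": 1,
        "Hamilton Lane": 1,
        "Flowers Foods": 1,
        "Extra Space Storage": 1,
        "Domino's": 1,
    },
}


def tally(points_by_company):
    """Return (points earned, companies covered), skipping uncovered companies."""
    covered = [p for p in points_by_company.values() if p is not None]
    return sum(covered), len(covered)


for tool, points_by_company in POINTS.items():
    earned, out_of = tally(points_by_company)
    print(f"{tool}: {earned:g}/{out_of}")
```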
The Bedrock AI advantage
As you can see, Bedrock AI’s results speak for themselves. Most large language models are trained on text from the web. Corporate disclosure, meanwhile, is linguistically and semantically very different, so a general model will struggle to handle its nuance, boilerplate, and legalese effectively.
Here is a summary of the Bedrock AI advantage in financial AI and LLMs:
Domain adaptation: Adapting open source LLMs to securities filings and financial text.
Boilerplate model: Our boilerplate identification model can correctly classify more than 99 percent of sentences as being boilerplate or not. Less boilerplate reduces noise and improves the quality of our input data (and therefore the quality of the output).
Representation learning: For data-oriented applications, the adage is “garbage in, garbage out.” We have extensive processes to ensure we feed high-quality inputs to our models.
Sample-efficient fine-tuning: Our few-shot learning algorithms (models that learn to solve select tasks from a few examples) allow us to extract 328 different types of red flags with just 1,625 labelled sentences.
Text ranking: Not only can our models locate hard-to-find information, but they can also rank it according to relevance and importance.
Training set selection: We use our in-house algorithms for selecting training sets that reduce the chances of shortcut learning. Our algorithms select training examples that give the best value in terms of the number of real-world examples that they could help the model learn to classify correctly.
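One way to picture the training set selection idea in the last point is as a coverage problem. The sketch below is a generic greedy max-coverage heuristic, offered as an illustration under stated assumptions and not as Bedrock AI’s in-house algorithm. The candidate names, the coverage sets and the `select_training_examples` function are all hypothetical, and the sketch does not model shortcut learning at all.

```python
def select_training_examples(coverage, budget):
    """Greedy max-coverage selection.

    coverage maps each candidate training example to the set of real-world
    examples it could plausibly help a model classify (hypothetical input).
    At each step, pick the candidate that adds the most not-yet-covered
    examples, stopping when the budget is spent or nothing new is gained.
    """
    covered = set()
    chosen = []
    remaining = dict(coverage)
    while remaining and len(chosen) < budget:
        # sorted() makes ties deterministic: max() keeps the first best candidate.
        best = max(sorted(remaining), key=lambda c: len(remaining[c] - covered))
        gain = remaining[best] - covered
        if not gain:
            break
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered


# Hypothetical candidates (labelled sentences) and the real-world sentence IDs each one covers.
candidates = {
    "s1": {1, 2, 3, 4},
    "s2": {3, 4, 5},
    "s3": {5, 6},
    "s4": {1, 2},
}

chosen, covered = select_training_examples(candidates, budget=2)
print(chosen, sorted(covered))
```

In this toy data, the second pick is `s3` rather than `s2` even though `s2` covers more examples on its own, because most of what `s2` covers is already handled by `s1`. That is the sense in which each labelled example is chosen for the value it adds.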