RAG-Enabled AI Stops Hallucinations, Adds Sources

New GenAI method enables better answers and performs more functions.

Many EDA companies have taken the first steps to incorporate generative AI into their tools, and in such tightly controlled environments GenAI appears to have great benefits. But its broader adoption has been delayed by its notorious inaccuracy, giving results that are often out of date, untrue, and unsourced.

That’s starting to change. GenAI is evolving so rapidly that these kinds of problems are well on the way to being solved. And with new approaches, GenAI is becoming much more fit-for-purpose in the semiconductor industry. Through offerings from both EDA incumbents and startups, GenAI is a strong contender to enable a new era of efficiency, automating many tedious tasks and reducing the time to complete more complex operations.

Making effective use of a generative AI model is generally a two- or three-step process. First, a developer selects a pre-trained foundation model. These are the enormous models offered by companies like Meta and OpenAI. (MIT had a recent paper on how to assess reliability in foundation models.) Second, the foundation model is then customized with domain-specific knowledge in a procedure called fine-tuning.

“In the basic approach you can use a foundation model, which is already well-trained on natural language understanding,” said Dan Yu, AI/ML solution manager for digital verification at Siemens EDA. “Then you add some fine-tuning on top of that. For example, our customers are looking for verification solutions, so we add information from our digital verification team and fine-tune it.”

For many use cases, fine-tuning is the last step in turning a foundation model into a specialist resource. However, to ensure accurate, up-to-date information, new frameworks such as Retrieval Augmented Generation (RAG) are becoming increasingly popular as a third step.

Many users are already seeing the effects in AI-created summaries, which are showing up everywhere from Google searches to Otter transcripts. These summaries are powered by RAG, the “secret sauce” behind chatbots that now helpfully offer to summarize and interrogate Adobe PDFs and other documents.

According to the AI Alliance,[1] the AI landscape in the semiconductor industry is evolving beyond simple chatbots toward diverse, domain-specific systems that combine models of varying sizes with specialized tools to address complex challenges like chip design optimization and yield improvement. These AI solutions are specialized for semiconductor-specific tasks, often integrating foundation models with industry knowledge.

Foundation frameworks, models
In 2021, as generative AI expanded beyond text queries to graphics, audio, and video, the Stanford Institute for Human-Centered Artificial Intelligence (HAI) said the term “language model” was too narrow. As a replacement, it coined the term “foundation model” to underscore these models’ “critically central yet incomplete character.” In practice, a foundation model is the base model, trained on large, undifferentiated datasets. It then can be modified for specific purposes, often including being made smaller for less power-hungry processing. For example, Meta offers its flagship foundation model, Llama 3.1, with 405 billion parameters, as well as lighter versions at 70B and 8B.

For all their power, foundation models have been dogged by persistent problems. HAI further said that the models’ effectiveness across so many tasks incentivizes homogenization. While homogenization provides powerful leverage, it also demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. This is partly what underlies the allegations of racism against foundation models. Not only can mistakes be repeated, but long-tail data can be averaged out of the model, eliminating critical outliers that contradict more widespread information.

Training sets may run a year or more behind current knowledge, resulting in out-of-date answers. Also, models can “hallucinate,” returning entirely made-up answers. Worst of all, AI foundation models don’t link directly to sources, so there’s no way to double-check if their answers are obsolete, hallucinated, or line-by-line plagiarism, rather than paraphrases.

There are workarounds for the sourcing problem, such as using academic plagiarism checkers or, more simply, pasting a quoted phrase into Google to see whether it turns up in an original article. A model also can be prompted to include sources, but a human will need to double-check whether the cited sources actually exist.

Those approaches still don’t solve the out-of-date information problem, and they add tedious double-checking, which makes these models less than optimal for enterprise-level productivity. Worse, AI checkers often give false positives, resulting in many innocent students and professionals being accused of cheating. In one journalist’s experience, “A copyeditor used a tool on an article I submitted and found it to be 49% written by AI. She then ran it through a different tool that said it was 15% written by AI.”

RAG underlies GenAI agents
To respond to these issues and create more reliable models, researchers have developed “agent frameworks,” which modify foundation models to be more precise and reliable. These new approaches also expand the actions that models can perform. To the user, they can appear as more human-friendly front ends that perform such actions as summarizing data while automatically including citations and links to genuine, non-hallucinated sources.

For an example of what’s possible, Tavily.com gives well-written answers to queries with genuine references. Tavily’s navigation is admittedly a bit confusing. From its homepage, look for “Dashboard,” on the top right; on the next page, click on the menu icon (triple lines); and then click on “Research Assistant” in the left-hand column. An app with a similar output, but a cleaner interface, is Perplexity. Because it cites sources, it’s easy to discover that an answer has been taken directly from another site. Of note, they both cited different sources when answering the same prompt.

Both sites are powered by RAG, a technique first described in a 2020 paper. For coding examples of RAG, check Hugging Face.

“When you’re fine-tuning, what you’re teaching a model is a language. That might be the language of semiconductors, or you’re teaching it the skills to respond in a specific way,” said Amanda Saunders, director of generative AI at NVIDIA. “RAG connects that model that’s now been trained to real-time live data sources, or corpuses of data, so that it can pull information from those databases directly where they are. This means you don’t have to keep training; it keeps the models smaller, more compact, but it allows them to have access to all that data. RAG takes a question from the user or an application, and there’s a planning agent that sends it to the right database. It will look for the related data and information based on the question, so it’s retrieving that data. It then augments that data to the question. Essentially RAG says, ‘This is the question that came in, and here’s the relevant data source that I’ll hand to the model.’ The application now says, ‘Model, I want you to translate this question, this data, into a response.’ That response could be summarizing text, generating new code, creating new images, any number of things. The retrieval of data, augmentation of that data, and then generation of the new response, is the essence of RAG.”
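
To make that retrieve-augment-generate loop concrete, here is a minimal sketch in Python. It assumes nothing about any vendor’s stack: embed() and llm_generate() are hypothetical stand-ins for a real embedding model and a real foundation model, and the cosine-similarity retrieval runs over a toy in-memory document list rather than a production vector database.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: real systems call an embedding model; a seeded random vector stands in here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def llm_generate(prompt: str) -> str:
    """Placeholder for a call to a foundation model."""
    return f"[model response to a {len(prompt)}-character prompt]"

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# 1. Retrieval: find the documents whose embeddings sit closest to the question.
def retrieve(question: str, docs: list[str], top_k: int = 3) -> list[str]:
    q = embed(question)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:top_k]

# 2. Augmentation: prepend the retrieved context to the user's question.
# 3. Generation: hand the combined prompt to the foundation model.
def rag_answer(question: str, docs: list[str]) -> str:
    context = "\n".join(retrieve(question, docs))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm_generate(prompt)
```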

Given Saunders’ projections about an agentic future, it’s likely that one day an agent could review Tavily’s interface, for example, suggest revamps, and undertake the revisions on request. At present, even summaries with authentic sources are a welcome advance.

“The classic search engines, like Google, retrieve all the data with a crawler,” Siemens’ Yu explained. “They retrieve every web page on the internet, and then they apply smart algorithms to find the similarity between text. Then they store it in their internal database. When you have a question, they try to relate your question to the nearest answer to the nearest web page, using a vector database. So when you have a question, they find the similarity between your question and the nearest web page. RAG is just one step further beyond that. After you get the most relevant top 10 answers, it gives you the summary from all those 10 answers so you don’t have to go through 10 web pages or more. You get everything immediately in the style you are looking for. The retrieval part is nothing new. We do that every day with a Google search. The beauty part is the summarization, and that can be tuned to different levels.”
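
Yu’s summarization step can be sketched in the same spirit. The snippet below only builds the prompt that asks a model to summarize the top hits at a chosen audience level; retrieval of the hits and the model call itself are assumed to happen elsewhere, and the function and variable names are illustrative rather than any product’s API.

```python
# Sketch of the "summarize the top hits" step, with a tunable audience level.
# The resulting prompt string would be handed to a foundation model.

def build_summary_prompt(question: str, hits: list[str], level: str = "expert") -> str:
    """Ask the model to summarize retrieved pages at a given level, citing sources by number."""
    numbered = "\n".join(f"[{i + 1}] {text}" for i, text in enumerate(hits))
    return (
        f"Summarize the following {len(hits)} search results for a {level} reader, "
        f"citing sources by their [number].\n\n"
        f"Question: {question}\n\nResults:\n{numbered}"
    )

top_hits = ["Page on UVM coverage closure ...", "App note on formal verification ..."]
print(build_summary_prompt("How do I improve coverage closure?", top_hits, level="undergraduate"))
```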

In essence, a pre-trained foundation model is a generalist. It knows a little bit about everything but doesn’t have the depth of specialist knowledge. “If I’m looking to do place-and-route and achieve incredibly low power, I don’t need a general-purpose agent that also can answer trivia about Doctor Who,” said Rob Knoth, product management director in the Digital and Signoff Group at Cadence. “There is a cost that you pay for having a large general-purpose model that you may not need to pay if you’ve got a very tailored, very specific model.”

RAG can then retrieve specialist knowledge, augment it with additional and up-to-date information, and summarize it at an expert level. However, if the underlying foundation model is not fine-tuned with domain-specific data, Yu noted, a RAG summary on chiplets could sound like what a perplexed English major would make of an engineering class.

“You can think of RAG as a post-filter that is not going to mess with the guts of the foundational model, like fine-tuning would, but will look at the query, look at the answer, and look at the knowledge base that it has access to, and use that to improve the answer,” said Knoth. “RAG is useful because, regardless of which foundation model you choose, you’re likely going to have some very specific use case data, emails, spreadsheets, and other documents that you either don’t want or can’t incorporate into the training. With RAG, you can have your foundation model, you can get answers out of it, and then you can improve the accuracy of those answers.”

There is a lot of discussion about RAG versus fine-tuning, with the consensus being that fine-tuning alone is best for domain-specific datasets that are not frequently updated. For example, if a company has a datastore of important company knowledge that it isn’t going to update, fine-tuning would be enough. But if old information is constantly being updated and new information is constantly being added, the way to keep returns accurate would be to use RAG. There is now a potentially superior technique that combines both, called RAFT (Retrieval Augmented Fine-Tuning).

Explainpaper.com is another example of a RAG implementation that may be of value to working engineers. Upload an academic paper and it will summarize and explain it, with a slider bar that lets the user set the level of understanding, from five-year-old to expert.

“It’s hard for us to really consume a paper quickly, but sometimes we just want to ask one or two quick questions,” said Yu. “And that’s the beauty of that system. You can quickly go down to the knowledge you are looking for.”

Explainpaper’s functionality is an example of how agent-makers hope AI will help professionals. Rather than eliminating jobs, it should save time by streamlining tedious tasks. It makes professional papers easier to digest, and the “high school” and “undergraduate” summaries can help engineers and physicists quickly communicate to non-specialists.

“Training involves self-supervised learning on vast amounts of unlabeled data, typically occurring in two stages — initial training by large players, followed by crucial fine-tuning in secure, private clouds to protect proprietary information,” said the AI Alliance team. “RAG is one component in this ecosystem, providing access to up-to-date information. But it’s just one piece of a larger toolkit that includes agent frameworks like OpenSSA (Small-Specialist Agents) and LangChain [a language model integration framework], and ReAct [Reason + Act, a framework that combines chain-of-thought prompting and acting]. These frameworks allow AI to reason through challenges and manage multi-step processes in semiconductor operations. When combined with industry-specific models like SemiKong, they create powerful AI problem-solvers that leverage human expertise to speed up cycle times.”

SemiKong was created by the AI Alliance by training a model on semiconductor knowledge. It is a foundation model for the semiconductor industry that companies can use as a base for their own proprietary models. The Alliance members endorse it and similar approaches for data security. “Companies can fine-tune these models with their specific data, owning the resulting artifact. Combined with on-premises computation and federated learning, this allows companies to leverage AI’s power while safeguarding sensitive information.”

Knoth agrees that RAG serves the semiconductor industry by allowing a safe, walled look at proprietary data. “Whenever we’re talking about training data, you can’t separate that from intellectual property and worries about security. Each company has its own set of data that has made it successful and unique. That’s where RAG is indispensable. We’re working with many different partners. For example, we have a great relationship with NVIDIA, and we use their NeMo service to be able to help use RAG. So our joint partners are able to use their own designs and whatever else they have on their servers that never leaves their network, that doesn’t get uploaded into the cloud, or accidentally brought into one of their competitors’ models. It allows for a very good chain of custody so that you can use this knowledgebase, which you as a company own, to make sure that the results coming out of the foundation model that you’re using are accurate and tailored to the applications and intellectual history that you have. RAG is a great way to not have to spend the time or the resources to train or fine-tune a foundation model and get better accuracy without compromising security.”

Under the hood
RAG retrievals are accomplished through a series of steps that involve other models and agents. “The foundation model understands how to speak, understands how to do words,” said Saunders. “Embedding models understand how to put data into vector databases and embed them with the right code so that the planning agent can go get the right data. There are re-ranking models. These models take the data that was retrieved and sort it by relevancy. So when RAG gets information, it goes through these models, finding the right data, re-ranking that data, and then returning that to the foundation model.”
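
A rough sketch of that retrieve-then-re-rank flow might look like the following, where vector_search() and rerank_score() are hypothetical placeholders for a first-pass vector database query and a re-ranking model.

```python
# Sketch of retrieve-then-re-rank: a broad vector search produces candidates,
# then a re-ranking model scores each (query, candidate) pair and the best few
# are handed back to the foundation model.

def vector_search(query: str, corpus: list[str], top_k: int = 20) -> list[str]:
    # Placeholder: a real system queries a vector database here.
    return corpus[:top_k]

def rerank_score(query: str, candidate: str) -> float:
    # Placeholder: a real re-ranking model judges relevance; crude word overlap stands in.
    return float(len(set(query.lower().split()) & set(candidate.lower().split())))

def retrieve_and_rerank(query: str, corpus: list[str], final_k: int = 5) -> list[str]:
    candidates = vector_search(query, corpus)                        # broad first pass
    ranked = sorted(candidates, key=lambda c: rerank_score(query, c), reverse=True)
    return ranked[:final_k]                                          # best few go to the LLM
```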

Embedding models are pivotal to successful RAG implementations. Unlike keyword searching, where a search on “baseball” just returns documents containing the keyword baseball, vector search goes wider and deeper because it captures the semantic relationships between words. As a result, it returns documents that broaden into more associations with baseball, such as competitive sports, ball games, pitchers, and catchers. Picture a scatter plot in high-dimensional space, with each return as a point on the plot. Using nearest-neighbor algorithms, the points are clustered based on their nearness to each other and to the original query. Thus, basketball might appear because it’s also a competitive team sport with “ball” in its name, but it would sit far from more directly related terms, such as “World Series.”
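
As a toy illustration of that nearest-neighbor picture, the sketch below hand-places a few terms in a 2-D “semantic space.” Real embeddings have hundreds or thousands of dimensions, and the coordinates here are invented purely for illustration.

```python
# Toy nearest-neighbor example: hand-placed 2-D points stand in for real embeddings.
import numpy as np

points = {
    "baseball":     np.array([1.0, 1.0]),
    "World Series": np.array([1.1, 0.9]),
    "pitcher":      np.array([0.9, 1.2]),
    "basketball":   np.array([2.0, 1.5]),   # same team-sports cluster, but farther out
    "transistor":   np.array([8.0, 7.5]),   # unrelated concept, far away
}

def nearest(query: str, k: int = 3) -> list[str]:
    q = points[query]
    others = [(w, float(np.linalg.norm(q - v))) for w, v in points.items() if w != query]
    return [w for w, _ in sorted(others, key=lambda t: t[1])[:k]]

print(nearest("baseball"))   # ['World Series', 'pitcher', 'basketball']
```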

Fig. 1 OpenAI’s DALL-E illustrates the concept of vector-based search, showing how query vectors and database vectors interact in a high-dimensional space. Some of its current limitations are evident in the spelling errors. Source: OpenAI

In order to find its place in vector space, a RAG system first cuts the documents in its knowledge base into “chunks,” which is the actual, technical term for the operation. Chunking the data into smaller, relevant sections improves the odds of precise answers, with the added benefit of faster processing. There are many methods for chunking, depending on data type and desired outcome.
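
As one simple example of what chunking can look like, here is a minimal sketch of fixed-size chunking with overlap. The 500-character window and 50-character overlap are arbitrary illustrative values; production pipelines often chunk by sentence, section, or token count instead.

```python
# Minimal fixed-size chunking with overlap, one of the simplest chunking strategies.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows so content at a boundary isn't lost."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "..."          # e.g., a datasheet or app note loaded as plain text
pieces = chunk_text(doc)
```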

Generative AI does not understand individual words. Instead, it divides data into “tokens,” which can be thought of as phonemes: manageable parts that do not in themselves have inherent meaning.

In the process of embedding, the tokens are converted to numbers that anchor them in vector space. “When you go through the process of embedding, you’re encoding a lot of real meaning in context within that list of numbers. That long list of numbers is essentially the word, but also all the associations that we have created based on the training. That leads to the big breakthrough of RAG, where you can search by meaning,” said Elik Eizenberg, CEO of Scroll.ai, a transcription, translation, and summarization startup. The canonical example, he explained, is from the field of computational linguistics: king – man + woman = queen.
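
That canonical example can be sketched with made-up vectors. The 3-D values below are chosen only so the arithmetic works out; real embeddings have hundreds or thousands of dimensions learned from data.

```python
# The king - man + woman ~= queen example, with invented 3-D vectors for illustration.
import numpy as np

vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

result = vectors["king"] - vectors["man"] + vectors["woman"]

def closest(v: np.ndarray) -> str:
    """Return the vocabulary word whose vector is nearest to v."""
    return min(vectors, key=lambda w: float(np.linalg.norm(vectors[w] - v)))

print(closest(result))  # prints: queen
```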

And while embeddings have found a practical home in RAG, they have a long and distinguished history in academic research. There is still much work to be done, as researchers continue to hunt for faster, more reliable returns, including more efficient ways to do embeddings. Successful implementations could give a company a competitive edge, because its tools could produce returns both more quickly and more accurately.

Fig. 2 Google’s Gemini imagines vector space, with semantic clusters. Source: Google Gemini. A simpler illustration of the concept, with embedding formulas, is here. A more precise technical visualization is here.

Still, RAG has issues to overcome. Anthropic, inventor of the Claude AI assistant, wrote recently in a blog post that, “Traditional RAG solutions remove context when encoding information, which often results in the system failing to retrieve the relevant information from the knowledge base.” Anthropic claims its proposed solution, “Contextual Retrieval,” will solve the problem by prepending chunk-specific explanatory context to each chunk before embedding. A prompt engineer is already offering a tutorial on YouTube.
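
As a rough illustration of that idea, not Anthropic’s implementation, the sketch below prepends a short, chunk-specific context line to each chunk before embedding it. Both generate_context() and embed() are hypothetical stubs; in practice the context line would be written by a model prompted with the full source document.

```python
# Sketch of contextual retrieval: situate each chunk in its document before embedding.
# generate_context() and embed() are placeholders, not Anthropic's actual API.

def generate_context(chunk: str, full_document: str) -> str:
    # Placeholder: an LLM would be shown the full document and the chunk and asked to
    # write one or two sentences situating the chunk. The string below is illustrative.
    return "Context: this chunk comes from a quarterly yield report for one fab line."

def embed(text: str) -> list[float]:
    # Placeholder for a real embedding model call.
    return [float(len(text))]

def embed_with_context(chunks: list[str], full_document: str) -> list[list[float]]:
    contextualized = [generate_context(c, full_document) + "\n" + c for c in chunks]
    return [embed(c) for c in contextualized]
```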

Chain-of-thought
Another way to improve GenAI model accuracy is chain-of-thought, introduced at NeurIPS in 2022. This method pushes a generative model to slow down and work through a problem in sequential steps, much as humans reason through math problems.

It’s a featured part of OpenAI’s preview of its o1 update. Optimally, the process keeps models from making mistakes, such as the now canonical example of miscounting the number of Rs in strawberry.

It could be powerful when combined with RAG, as described in a paper published last March. In this method, dubbed RAT, for Retrieval Augmented Thoughts, each step in the chain-of-thought is revised with retrieved information as the reasoning proceeds, creating a feedback loop that yields increasingly accurate and relevant answers.
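
A very rough sketch of that idea follows, with llm() and retrieve() as hypothetical stubs: a draft chain of thought is produced first, then each step is revised against retrieved evidence before the final answer is generated.

```python
# Rough sketch of retrieval-augmented thoughts: draft reasoning steps, then revise each
# one with retrieved evidence before producing the final answer.

def llm(prompt: str) -> str:
    return f"[model output for: {prompt[:40]}...]"          # placeholder foundation-model call

def retrieve(query: str) -> str:
    return "[relevant passages from the knowledge base]"    # placeholder retrieval call

def rat_answer(question: str, num_steps: int = 3) -> str:
    draft_steps = [llm(f"Step {i + 1} of reasoning about: {question}") for i in range(num_steps)]
    revised = []
    for step in draft_steps:
        evidence = retrieve(step)                            # each thought triggers a retrieval
        revised.append(llm(f"Revise this reasoning step using the evidence.\n"
                           f"Step: {step}\nEvidence: {evidence}"))
    return llm("Give the final answer based on these steps:\n" + "\n".join(revised))
```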

Hardware demands
For all the advantages of RAG and associated approaches, the amount of compute power involved might tempt a company to take the old-fashioned route and rely on keyword searches of an existing corporate database. But that won’t be enough for contemporary purposes.

“In semiconductors, the difference between relational database keyword searching and neural networks for generative AI is significant,” the AI Alliance team wrote. “Traditional database searches are efficient for retrieving specific information from structured data, like inventory records or test results. They’re computationally lightweight and can be cost-effective for straightforward queries. Generative AI, however, understands natural language and can leverage deep industry knowledge. It can analyze complex chip designs, predict yield issues, or suggest novel manufacturing processes based on accumulated expertise. While using ‘the old-fashioned way’ might seem to save on hardware costs, this view is short-sighted in our industry. The ROI for investing in AI often significantly outweighs the costs. Companies like EnCharge AI are developing novel low-power circuitry for AI inference, potentially reducing the energy costs associated with AI deployment.”

The AI Alliance team strongly advises against undertaking the massive expense of foundation-model pre-training. “Instead, focus on fine-tuning existing models with your proprietary knowledge. Companies like Lepton AI are specializing in efficient fine-tuning processes, making this approach more accessible.”

While costs weigh heavily on everyone’s mind, there is hope, said Tony Chan Carusone, CTO of Alphawave Semi. “Reducing hardware cost is a key factor in scaling AI. Once the cost of AI hardware is reduced, industry-specific chatbots built on foundation models will improve decision-making and automate more complex tasks, offering an ROI that outweighs the initial cost of hardware. Chiplet-based designs are key to unlocking the required performance/cost ratio. Chiplets are the only way to develop new, more powerful, bespoke AI hardware every 12 months, tailored to workloads, and with a cost profile that will support AI scaling.”

Reference

  1. The AI Alliance’s comments were jointly provided by members Dean Wampler, IBM’s chief technical representative to the AI Alliance, Christopher Nguyen, Aitomatic CEO, co-lead of foundation models for the AI Alliance, and William Poucher, Aitomatic AI for semiconductors lead.

Related Reading
Dealing With AI/ML Uncertainty
How neural network-based AI systems perform under the hood is currently unknown, but the industry is finding ways to live with a black box.
Can Models Created With AI Be Trusted?
Evaluating the true cost and benefit of AI can be difficult, especially within the semiconductor industry.


