It’s been almost one year since a new breed of artificial intelligence took the world by storm. The capabilities of these new generative AI tools, most of which are powered by large language models (LLMs), forced every company and employee to rethink how they work. Was this new technology a threat to their job or a tool that would amplify their productivity? If you don’t figure out how to make the most of GenAI, are you going to get outclassed by your peers?

This paradigm shift placed a dual burden on engineering and technical leaders. First, there’s the internal demand to understand how your organization is going to adopt these new tools and what you need to do to avoid falling behind your competitors. Second, if you're selling software and services to other companies, you're going to find that many have paused spending on new tools while they sort out exactly what their approach should be in the GenAI era.

There is a ton of hype, and it can be exhausting trying to figure out where to direct your resources. Before you can dive into the details of what to do with the answers or art your GenAI is creating, you need a robust foundation to ensure it’s operating well. To help, we’ve come up with four key areas you’ll need to understand to make the most of the time and resources you invest.

These are almost certain to be fundamental pieces of your AI stack, so read on to learn more about the four pillars needed for effectively adding GenAI to your organization.

Vector Databases

To make use of a large language model, you’re going to need to vectorize your data. That means the text you feed into the model is reduced to arrays of numbers, and those numbers are plotted as vectors on a map, albeit one with thousands of dimensions. Finding similar text is reduced to finding the distance between two vectors.
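As a toy illustration of that idea, here is a short sketch comparing cosine similarity between vectors. The three-dimensional vectors and the texts attached to them are made up for the example; a real embedding model would emit hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors: closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: nearby texts get nearby vectors.
vectors = {
    "How do I sort a list in Python?": [0.9, 0.1, 0.2],
    "Sorting arrays with Python's sort()": [0.85, 0.15, 0.25],
    "Caring for a pet ball python": [0.1, 0.9, 0.3],
}

# Pretend embedding of the query "python sort a list".
query = [0.88, 0.12, 0.22]

# "Search" is just: which stored vector is closest to the query vector?
best = max(vectors, key=lambda text: cosine_similarity(query, vectors[text]))
print(best)  # → How do I sort a list in Python?
```

A vector database does essentially this comparison, but at scale, using indexes that make nearest-neighbor lookups fast across millions of vectors.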
This allows you to move from the old-fashioned approach of lexical keyword search—typing a few terms and getting back results that share those keywords—to semantic search: typing a query in natural language and getting back a response that understands a coding question about Python is probably referring to the programming language and not the large snake.

“Traditional data structures, typically organized in structured tables, often fall short of capturing the complexity of the real world,” says Weaviate’s Philip Vollet. “Enter vector embeddings. These embeddings capture features and representations of data, enabling machines to understand, abstract, and compute on that data in sophisticated ways.”

How do you choose the right vector database? In some cases, it may depend on the tech stack your team is already using. Stack Overflow went with Weaviate in part because it allowed us to continue using PySpark, which was the initial choice for our OverflowAI efforts. On the other hand, you may have a database provider, like MongoDB, that has been serving you well. Mongo now includes vectors as part of its OLTP database, making it easy to integrate with your existing deployments. Expect this to become standard for database providers in the future. As Louis Brady, VP of Engineering at Rockset, explained, most companies will find that a hybrid approach combining a vector database with your existing system offers the most flexibility and the best results.

Embedding Models

How do you get your data into the vector database in a way that accurately organizes it by content? For that, you’ll need an embedding model. This is the software system that takes your text and converts it into the array of numbers you store in the vector database. There are a lot to choose from, and they vary greatly in cost and complexity.
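To make the job of an embedding model concrete, here is a deliberately simplified sketch: map arbitrary text to a fixed-length, unit-normalized array of numbers. The hashing scheme here is a toy stand-in; a production system would call a pretrained model, which learns its dimensions from data rather than hashing words into buckets:

```python
import hashlib
import math

def toy_embed(text: str, dims: int = 8) -> list:
    """Toy embedding: hash each word into one of `dims` buckets, count hits,
    then normalize to unit length. Real models learn these dimensions."""
    vec = [0.0] * dims
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

embedding = toy_embed("How do I sort a list in Python?")
print(len(embedding))  # → 8: a fixed-length vector, ready for a vector database
```

Whatever model you choose, the contract is the same: text in, fixed-length vector out, with the same text always producing the same vector.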
For this article, we’ll focus on embedding models that work with text, although embedding models can also be used to organize information about other types of media, like images or songs.

As Dale Markowitz wrote on the Google Cloud blog, “If you’d like to embed text–i.e. to do text search or similarity search on text–you’re in luck. There are tons and tons of pre-trained text embeddings free and easily available.” One example is the Universal Sentence Encoder, which “encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks.” With just a few lines of Python code, you can prepare your data for a GenAI chatbot-style interface. If you want to take things a step further, Dale also has a great tutorial on how to prototype a language-powered app using nothing more than Google Sheets and a plugin called Semantic Reactor.

You’ll need to evaluate the tradeoffs between the time and cost of putting huge amounts of text through your embedding model and how thinly you slice the text, which is usually chunked into sections like chapters, pages, paragraphs, sentences, or even individual words. The other tradeoff is the precision of the embeddings: how many decimal places to store for each vector, as each additional decimal place takes up more space. Over thousands of vectors for millions of tokens, this adds up. You can use techniques like quantization to shrink that footprint, but it’s best to consider the amount of data and the degree of detail you’re looking for before you choose which embedding method is right for you.

Retrieval Augmented Generation (RAG)

Big AI models read the internet to gain knowledge. That means they know the earth is round…and they also know that it’s flat.

One of the main problems with large language models like ChatGPT is that they were trained on a massive set of text from across the internet.
That means they’ve read a lot about how the earth is round, and also a lot about how the earth is flat. The model isn’t trained to understand which of these assertions is correct, only the probability that a certain response will be a good match for the query the user enters. It also mixes those inputs into a statistically probable new one, which is where hallucinations can occur. The response it produces may match neither source, which is why it’s worth checking a model’s citations.

With RAG, you can first limit the dataset the model searches, meaning the model hopefully won’t be drawing on inaccurate data. Second, you can ask the model to cite its sources, allowing you to verify its answer against the ground truth. At Stack Overflow, that might mean constraining queries to just the questions on our site with an accepted answer. When a user asks a question, the system first searches for Q&A posts that are a good match. That’s the retrieval part of this equation. A hidden prompt then instructs the model to do the following: synthesize a short answer for the user based on the answers you found that were validated by our community, then provide that short summary along with links to the three posts that were the best match for the user’s search.

A third benefit of RAG is that it allows you to keep the data the model is using fresh. Training a large model is costly. Many of the popular models available today are based on training data that ended months, or even years, ago. Ask one a question about something after that cutoff, and it will happily hallucinate a convincing response, but it doesn’t have actual information to work with. RAG allows you to point the model at a specific dataset, one that you can keep up to date without having to retrain the entire model.

RAG means the user still gets the benefit of working with an LLM.
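The retrieve-then-generate flow described above can be sketched in a few lines. Everything here is illustrative: the tiny in-memory corpus and word-overlap scoring stand in for a real vector database, and the assembled prompt would be sent to whatever LLM API you use:

```python
# Minimal RAG sketch with hypothetical data. A real system would embed the
# query and run a vector search, then pass the prompt to an actual LLM.
CORPUS = [
    {"question": "How do I sort a list in Python?",
     "answer": "Use sorted() for a new list or list.sort() in place.",
     "url": "https://example.com/q/1"},
    {"question": "How do I reverse a string in Python?",
     "answer": "Use slicing: s[::-1].",
     "url": "https://example.com/q/2"},
    {"question": "What does a ball python eat?",
     "answer": "Mostly small rodents.",
     "url": "https://example.com/q/3"},
]

def retrieve(query: str, top_k: int = 2) -> list:
    """Retrieval step: rank posts by word overlap with the query.
    (A real system would compare embedding vectors instead.)"""
    words = set(query.lower().split())
    scored = sorted(
        CORPUS,
        key=lambda post: len(words & set(post["question"].lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(query: str, posts: list) -> str:
    """Augmentation step: the hidden prompt wrapping the retrieved posts."""
    sources = "\n".join(
        f"- {p['question']} -> {p['answer']} ({p['url']})" for p in posts
    )
    return (f"Synthesize a short answer to: {query}\n"
            f"Use only these community-validated posts, and cite them:\n{sources}")

prompt = build_prompt("sort a list in python",
                      retrieve("sort a list in python"))
print(prompt)
```

The generation step then runs this prompt through the model, so the summary the user sees is grounded in the retrieved posts rather than in everything the model ever read.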
They can ask questions using natural language and get back a summary that synthesizes the most relevant information from a vast data store. At the same time, drawing on a predefined dataset helps to reduce hallucinations and gives the user links to the ground truth, so they can easily check the model’s output against something generated by humans.

Knowledge Base

As mentioned in the previous section, RAG can constrain the text your model draws on when generating its response. Ideally, that means you’re giving it accurate data, not just a random sampling of things it’s read on the internet. One of the most important laws of training an AI model is that data quality matters. Garbage in, garbage out, as the old saying goes, holds very true for your LLM. Feed it low-quality or poorly organized text, and the results will be equally uninspiring.

At Stack Overflow, we kind of lucked out on the data quality issue. Question and answer is the format being adopted by most LLMs used inside organizations, and our dataset was already built that way. Our Q&A pairs can show us which information is accurate and which still lacks a sufficient confidence signal by analyzing the number of votes or whether a question has an accepted answer. Votes can also be used to determine which of three similar answers might be the most widely utilized and thus the most valuable. Last but not least, tags allow the system to better understand how different pieces of information in your dataset are related.

Learn more about how Stack Overflow for Teams helps the world’s top companies share knowledge and build their foundation for an AI future.