Alex Ratner is the co-founder and CEO at Snorkel AI, and an Affiliate Assistant Professor of Computer Science at the University of Washington.
The age of large language models (LLMs) and generative AI has sparked excitement among business leaders. But those who want to launch their own LLM face many hurdles: much stands between wanting a production generative AI tool and building one that delivers real business value and sustained advantage.
Foundation models themselves have quickly become commoditized. Any developer can build on the Google Bard or OpenAI APIs. More mature organizations can deploy models like Llama 2 in their own walled gardens. But if their competitors also use Llama 2, what advantage do they have? Proprietary data—and the knowledge of how to develop and use it—provides the only sustainable enterprise AI moat.
As generative AI crests the Peak of Inflated Expectations on the Gartner AI Hype Cycle, enterprises are learning that off-the-shelf LLMs can’t solve every problem, particularly not unique, high-value problems. Proprietary data can close the gap, but only when properly curated and developed.
Your Data, Your Moat
Off-the-shelf LLMs yield fun experiments and demos, but in a business setting they rarely achieve the accuracy needed to deliver real value. Businesses don’t need chatbots that can discuss poetry as competently as they explain computer code; they need highly accurate specialists.
Private data is a moat—a potential competitive advantage. By leveraging your proprietary data and subject matter expertise, you can build generative models that work better for your domain, your chosen tasks and your customers.
Enterprises can gain these advantages in three ways:
1. Retrieval augmentation.
2. Fine-tuning with prompts and responses.
3. Self-supervised pre-training.
Let’s briefly look at each.
Retrieval Augmentation
Retrieval-augmented generation, better known as RAG, allows your generative AI pipeline to enrich prompts with query-specific knowledge from your company’s proprietary databases or document archives. This generally yields better, more accurate answers, even from a standard LLM.
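To make the pattern concrete, here is a minimal sketch. It uses TF-IDF similarity from scikit-learn as a stand-in for a production vector database, and the document snippets, prompt template and `call_llm` client are all hypothetical placeholders rather than any particular vendor’s API.

```python
# Minimal RAG sketch: retrieve the most relevant internal document,
# then prepend it to the user's prompt before calling the model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in for a proprietary document archive (hypothetical content).
documents = [
    "Product X list price: $499 as of Q3.",
    "Support policy: enterprise tickets are answered within 4 hours.",
    "Product Y was discontinued in 2022; recommend Product X instead.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k archive documents most similar to the query."""
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    top = scores.argsort()[::-1][:k]
    return [documents[i] for i in top]

def answer(query: str) -> str:
    # Enrich the prompt with retrieved context before calling the LLM.
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)  # call_llm is a placeholder for any LLM API client

print(retrieve("How much does Product X cost?"))
```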
But this is similar to giving an intern access to your intranet: Even with all the information before them, the intern may misunderstand or miscommunicate. To get better performance, your data team needs a customized model.
Fine-Tuning With Prompts And Responses
Data-driven organizations can fine-tune LLMs with curated prompts and responses. This sharpens and improves the model’s output on the organization’s most important tasks. To use a metaphor, a doctor needs access to a patient’s medical chart (retrieval augmentation) and specialty training (fine-tuning) to render an accurate diagnosis. Data scientists can carefully choose the prompts and responses used to fine-tune the LLM to improve performance on a wide variety of tasks or greatly boost performance on a very narrow set of them.
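As a rough sketch of what that fine-tuning step can look like in code, the snippet below uses the Hugging Face Trainer with a small GPT-2 base model as a stand-in for your LLM; the prompt/response pair, output directory and hyperparameters are hypothetical.

```python
# Sketch of supervised fine-tuning on curated prompt/response pairs.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments

pairs = [  # hypothetical examples curated by domain experts
    ("Summarize this adjuster note: ...", "Water damage claim; section 4 of the policy applies ..."),
]

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for your base LLM
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

class PromptResponseDataset(torch.utils.data.Dataset):
    def __init__(self, pairs):
        # Concatenate prompt and response; the model learns to produce
        # the curated response when shown the prompt.
        texts = [f"{p}\n{r}{tokenizer.eos_token}" for p, r in pairs]
        self.enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")

    def __len__(self):
        return self.enc["input_ids"].shape[0]

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = item["input_ids"].clone()  # causal LM loss over the sequence
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=3),
    train_dataset=PromptResponseDataset(pairs),
)
trainer.train()
```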
Self-Supervised Pre-Training
Some organizations may want to take their LLM customization further and build a model from scratch. However, this can demand more effort than it’s worth. Firms with business vocabularies well-represented in the embedding spaces of off-the-shelf LLMs can often achieve the necessary performance gains through fine-tuning alone.
If an organization feels that it needs a model custom-built from the ground up, its data team first selects a model architecture and then trains it on unstructured text—initially on a large, generalized corpus, then on proprietary data. This teaches the model to understand the relationships between words in a way that’s specific to the company’s domain, history, positioning and products. The data team can then further train the model on prompts and responses to make it not just knowledgeable but task-oriented.
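Here is a compressed sketch of that ground-up path, again with Hugging Face tooling. The architecture, corpus snippets and hyperparameters are hypothetical stand-ins, and a real pre-training run would involve vastly more data and compute.

```python
# Sketch of self-supervised pre-training: a causal LM initialized from
# scratch and trained on raw text, general corpus first, proprietary second.
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          GPT2Config, GPT2LMHeadModel, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# 1. Select an architecture (here, a small GPT-2 variant) with random weights.
config = GPT2Config(n_layer=6, n_head=8, n_embd=512)
model = GPT2LMHeadModel(config)

# 2. Curated unstructured text (hypothetical snippets).
corpus = [
    "General text about the wider world ...",
    "Internal engineering wikis, product manuals and support tickets ...",
]
encodings = tokenizer(corpus, truncation=True)
dataset = [{"input_ids": ids} for ids in encodings["input_ids"]]

# mlm=False selects the next-token (causal) objective; the collator pads
# batches and derives the labels from the inputs.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="pretrained", num_train_epochs=1),
    train_dataset=dataset,
    data_collator=collator,
).train()
```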
The Data Lift
The ideal deployment would incorporate all three of the above approaches, but that represents a heavy data labeling load. Studies from McKinsey and Appen show that a lack of high-quality labeled data blocks enterprise AI projects more often than any other factor—and, to be clear, all three of these approaches require labeled data.
Fine-tuning with prompts and responses requires data teams to identify and label prompts according to the task and then determine high-quality responses. Pre-training with self-supervised learning requires companies to carefully curate the unstructured data they feed the model. Training on lunch orders and payroll could degrade performance or cause sensitive data to leak internally.
Even retrieval augmentation benefits from data labeling. Although vector databases efficiently handle relevance metrics, they won’t know if a retrieved document is accurate and up to date. No company wants its internal chatbot to return out-of-date prices or recommend discontinued products.
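One lightweight way to enforce that freshness is labeled metadata on each document, which the retrieval layer can filter on before the model ever sees the text. A sketch, with a hypothetical schema and cutoff:

```python
# Labeled metadata lets the retriever exclude stale or retired content.
from datetime import date

documents = [  # hypothetical archive entries with curation labels
    {"text": "Product X list price: $499.", "reviewed": date(2024, 1, 5), "status": "current"},
    {"text": "Product Y list price: $299.", "reviewed": date(2021, 6, 1), "status": "discontinued"},
]

def is_servable(doc, max_age_days=365):
    """Keep only documents reviewed recently and still in effect."""
    fresh = (date.today() - doc["reviewed"]).days <= max_age_days
    return fresh and doc["status"] == "current"

servable = [d for d in documents if is_servable(d)]
```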
Data Is Essential To Delivering Generative AI Value
Using your proprietary data to build your AI moat requires work, and that work rests heavily on data-centric approaches, including data labeling and curation. Firms can outsource some labeling to crowd workers, but much of it will be too complex, specialized or sensitive for gig workers to handle. Even when outsourcing is possible, it remains time-consuming and expensive, much like relying on internal experts for data labeling.
Data science teams can use semi-supervised techniques such as label spreading to amplify the impact of internal labelers, as sketched below. Programmatic labeling is another option. Our researchers used those tools to build a better LLM.
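For illustration, here is what label spreading looks like with scikit-learn: a handful of expert labels are propagated across a larger pool of unlabeled examples. The features and labels below are synthetic stand-ins, not a production workflow.

```python
# Label spreading: propagate a few expert labels to many unlabeled examples.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

X = np.random.RandomState(0).rand(100, 8)  # e.g., document embeddings
y = np.full(100, -1)                       # -1 marks unlabeled examples
y[:5] = [0, 1, 0, 1, 1]                    # a handful of expert labels

model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y)
pseudo_labels = model.transduction_        # inferred labels for all 100 examples
```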
Your data—properly prepared—is the most important thing your organization brings to AI and where your organization should spend the most time to extract the most value. Your data is your moat. Use it.