CEO and cofounder of one of the first data intelligence platforms, BigID, and a privacy, security and identity expert.
In today’s data-driven business landscape, the role of artificial intelligence (AI) and machine learning (ML) has never been more prominent. While these technologies offer unprecedented opportunities for innovation and efficiency, they also introduce a host of new challenges, including increased risk and the need to govern and manage unstructured data. This is especially pertinent when dealing with large language models (LLMs), such as those that power generative AI.
Unstructured Data: A Double-Edged Sword
For years, businesses have amassed vast repositories of unstructured data—email archives, chat logs, PDF files and more—that sit in systems like Microsoft Office 365 or Slack. While this unstructured data is increasingly becoming a powerful input for AI-driven solutions, it also represents a potential Achilles’ heel.
Unlike structured data, which resides in well-defined formats and databases, unstructured data often exists in a more chaotic state, making it harder to govern and secure. This is the data that often contains some of the most sensitive information: personal and customer information, credit card numbers, Social Security numbers, intellectual property, trade secrets and more.
The Risk Vector: Data Misuse And Exposure
LLMs are trained on large sets of unstructured data—and that amplifies risks. Training generative models on sensitive or regulated data, whether it be client-specific, customer-related or otherwise, introduces the risk of violating data privacy regulations.
Failure to properly govern this data could result in data leaks, financial penalties and reputational damage.
Moreover, incorporating confidential intellectual property into the training sets for these models might inadvertently expose the organization to data breaches or unauthorized dissemination of proprietary information.
Simply put, the training process of LLMs could become a liability if the data isn’t carefully curated and managed.
Data Governance: A Proactive Solution
Before deploying generative AI in any business application—be it customer service automation, analytics, marketing, general efficiency or cyber threat detection—it’s imperative to establish robust data governance procedures.
The first step involves identifying, classifying and tagging datasets that contain sensitive or regulated information. Categories to consider include personal information (PI), personally identifiable information (PII), confidential business strategies and intellectual property.
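As a rough illustration of this identify-classify-tag step, the sketch below scans documents for a few common sensitive-data patterns and attaches tags to each one. The pattern names and regexes here are hypothetical and deliberately simplified; production-grade classifiers rely on validation logic, contextual signals and ML models rather than bare regular expressions.

```python
import re

# Hypothetical, simplified detection patterns -- real classification
# tools use far richer techniques than regex matching alone.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def classify(text: str) -> set[str]:
    """Return the set of sensitive-data tags detected in one document."""
    return {tag for tag, pattern in PATTERNS.items() if pattern.search(text)}

def tag_documents(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each document ID to its detected sensitivity tags."""
    return {doc_id: classify(text) for doc_id, text in docs.items()}
```

A document that matches no pattern ends up with an empty tag set, which downstream policy can treat as a candidate for further review rather than automatically safe.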
When evaluating solutions to accelerate AI governance, companies need to make sure they’re built to scale in their environment, incorporate enterprise-grade security and can cover the full range of their data by type, sensitivity and location.
When implementing these solutions, companies should start by categorizing what data is safe for use, what’s regulated and what data requires additional controls. They should avoid using data to fuel AI training sets until they’ve verified that it has been marked as safe for use and doesn’t contain any potentially compromising information, regulated data or customer data.
Ensuring The Right Data For The Right Use
Once datasets have been classified, they can be partitioned accordingly for specific applications. For instance, organizations might choose to exclude sensitive human resources data from the training sets for customer service LLMs. Likewise, they could guide the models to rely on publicly available, nonconfidential data, thus further mitigating the risks associated with data misuse.
When choosing the right data to fuel AI adoption, make sure it doesn’t include customer or employee information, intellectual property, secrets or credentials, or anything else that may expose your company to unwanted data breaches.
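Building on the tagging step above, the partitioning-and-exclusion idea can be sketched as a simple gate: documents that carry any blocked sensitivity tag never make it into the training corpus. The tag names and block list here are assumptions for illustration; a real deployment would derive them from the organization’s own classification scheme and regulatory obligations.

```python
# Hypothetical block list -- in practice this comes from the company's
# own data classification policy, not a hard-coded set.
BLOCKED_TAGS = {"pii", "customer_data", "intellectual_property", "credentials"}

def build_training_set(tagged_docs: dict[str, set[str]]) -> list[str]:
    """Return the IDs of documents with no blocked tags, i.e. those
    admissible to an AI training set under this (illustrative) policy."""
    return [
        doc_id
        for doc_id, tags in tagged_docs.items()
        if not (tags & BLOCKED_TAGS)  # exclude any doc carrying a blocked tag
    ]
```

The same gate can be run per application, with a different block list for, say, a customer service model than for an internal analytics model, which mirrors the partitioning described above.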
A Future-Ready Approach
As AI and ML technologies continue to evolve and integrate more deeply into organizational processes, proactive data governance will become a nonnegotiable facet of responsible business operation.
As we grow increasingly reliant on complex AI models like LLMs for a wide array of applications, the unstructured data that powers these models must be managed with an equal, if not greater, level of scrutiny and care. While the capabilities of generative AI offer a plethora of opportunities for business innovation, they should not be deployed without a rigorous data governance strategy in place.
Failure to manage the risks associated with unstructured data can have serious legal and financial repercussions, rendering the promising advantages of AI null and void. As we tread further into this uncharted territory, it’s more critical than ever to know your data and control your data—wherever it lives.
Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives.