- As contract professionals, it is important to be aware of some of these key data usage issues since they will feed into the wider contractual negotiation on price and expected service levels.
- The quality of the Training Data determines the quality of the AI product, so AI Vendors want Training Data from their customers.
- It is important to distinguish Training Data from Customer Data or Personal Data.
A new wave of contractual terms is surfacing: terms governing data that a customer provides to a service provider and that the service provider wants to use to train its AI, known as Training Data. As contract professionals, it is important to be aware of some of these key data usage issues since they will feed into the wider contractual negotiation on price and expected service levels. Further, understanding the importance of data and how it is used to develop products is key to protecting your company's data from gratis usage and, at the extreme end, from fines caused by privacy violations.
This article explains what Training Data is, how to distinguish it from Personal Data and Customer Data, and how to draft and negotiate terms related to these types of data in agreements with AI Vendors.
Definitions of Different Data Types
First, let us define a few key terms:
“Training Data” is the data used to train an algorithm or machine learning model to predict the outcomes the model is designed to produce.
“Customer Data” is the data provided by customers while interacting with your service(s).
“Personal Data” is any information that identifies an individual.
Training Data Usage
The quality of the Training Data determines the quality of the AI product. So, what do artificial intelligence service providers (“AI Vendors”) want more than anything right now? They want quality Training Data.
Training Data can be used to: 1) improve an AI Vendor’s products and tools; and 2) train the AI Vendor’s products and services so that it can offer enhanced versions of them to customers.
In addition, Training Data could be in the form of dummy data or real data. Customers should have a clear understanding of what type of data their business considers to be Training Data. For example, a business might be okay with testing the product using dummy data. In this case, the exercise would not involve Customer Data. But if a business requires that the AI Vendor use real data, then the customer’s confidential information, sensitive information, and Personal Data may be considered Training Data.
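Where a business is comfortable testing with dummy data, the trial never needs to touch Customer Data at all. As a minimal sketch of what that means in practice, the snippet below generates fully synthetic records using only Python's standard library; the field names (`user_id`, `email`, `signup_year`) are illustrative assumptions, not a real schema.

```python
import random
import string

random.seed(0)  # reproducible synthetic data for a repeatable trial


def dummy_record():
    """Generate a synthetic record containing no real Customer Data."""
    name = "".join(random.choices(string.ascii_lowercase, k=8))
    return {
        "user_id": random.randint(100000, 999999),
        # .invalid is a reserved TLD, so these addresses can never be real.
        "email": f"{name}@example.invalid",
        "signup_year": random.randint(2015, 2024),
    }


# A small dummy dataset a vendor could use for product testing.
dummy_dataset = [dummy_record() for _ in range(5)]
```

Because every value is generated rather than collected, a dataset like this falls outside both Customer Data and Personal Data, which is exactly why the dummy-versus-real distinction matters when scoping what the AI Vendor may use.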
Contract professionals should scrutinize the contractual terms that accompany the use of AI tools to ensure that: 1) Training Data does not include the customer’s Personal Data without explicit permission from the customer; and 2) ownership rights and permissions to Customer Data are clearly stated and identified. The contractual terms that address the use and treatment of Customer Data are often found in the trial agreement, master agreement, or data processing agreement.
The two main issues that arise from using data for training purposes are: 1) how the customer’s Personal Data is treated upon onward transfer to service providers; and 2) who owns the Training Data.
Personal Data vs. Training Data
With data privacy laws proliferating around the globe, businesses that determine the means and purposes of data processing (controllers) are increasingly being held accountable for how their service providers handle their Personal Data.
As a result of increased privacy regulation, Data Processing Agreements (“DPAs”), which outline how Personal Data will be handled by the parties, have become a standard part of negotiating the purchase of software.
In addition to ensuring that DPAs address the requirements imposed by applicable privacy laws, contract professionals must ensure that the customer’s Personal Data is not used to train large language models (LLMs), because of the complexities introduced when data subjects seek to exercise their rights. For example, if a user of a service or product eventually wanted their Personal Data deleted from the model, this would be difficult to operationalize, because Personal Data, once used as input, cannot normally be unlearned. Further, use of Personal Data to train LLMs could also lead to security breaches, such as unintended leakages of Personal Data.
When reviewing and negotiating agreements with AI Vendors, it is important to distinguish between Customer Data and a customer’s Personal Data. Customer Data encompasses all the data provided by the customer on the platform and includes a customer’s Personal Data. A customer’s Personal Data is Customer Data that identifies a natural person, such as users of the service or product.
Contract professionals should ensure that service contracts have terms that either: 1) require the AI Vendor to apply anonymization or masking techniques to prevent the customer’s Personal Data from being used to train AI Vendor LLMs; or 2) prevent AI Vendors from using a customer’s Personal Data to train their LLMs. Data masking can be achieved by ensuring that the contract has language that requires the AI Vendor to aggregate, de-identify, or permanently anonymize the data used by the AI Vendor to train the LLM in such a way that your Customer Data cannot ever again be used to identify an individual.
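To make the contractual language concrete, here is a minimal sketch of the kind of masking and aggregation techniques such a clause might require, using only Python's standard library. The record fields and salt are illustrative assumptions, and note an important caveat: salted hashing as shown is pseudonymization, which regulators may still treat as Personal Data, so a clause demanding permanent anonymization sets a higher bar than this sketch.

```python
import hashlib
import statistics

SALT = "example-salt"  # illustrative; real deployments manage salts as secrets


def mask_record(record):
    """Replace direct identifiers with truncated salted hashes (pseudonymization)."""
    masked = dict(record)
    for field in ("name", "email"):
        if field in masked:
            digest = hashlib.sha256((SALT + str(masked[field])).encode()).hexdigest()
            masked[field] = digest[:12]
    return masked


records = [
    {"name": "Ada Lovelace", "email": "ada@example.com", "purchase_total": 120.0},
    {"name": "Alan Turing", "email": "alan@example.com", "purchase_total": 80.0},
]

# Masking: identifiers are irreversibly transformed before any vendor use.
masked = [mask_record(r) for r in records]

# Aggregation: only a summary statistic, not row-level data, leaves the
# customer environment.
average_spend = statistics.mean(r["purchase_total"] for r in records)
```

A contract clause along these lines would typically require the AI Vendor, not the customer, to apply such transformations before any training use, and to warrant that the transformed data cannot be re-identified.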
Ownership and Rights to Training Data
Usually, the customer owns Customer Data. The customer should therefore ensure that the contract addresses its ownership of the data by obtaining the AI Vendor’s acknowledgement that the data provided under the agreement is the customer’s sole and exclusive property.
Rights holders can carve out their data rights in contracts by adopting either a broad or a narrow definition of Customer Data.
By broadening the definition of Customer Data to capture all the data the AI Vendor collects or receives directly or indirectly from the customer (including its derivatives), contract professionals can ensure that there is no ambiguity about how Customer Data is to be used. Customers should specify that: 1) the AI Vendor can only use Customer Data to provide the services to the customer; and 2) the AI Vendor promises that it will not use or attempt to use Customer Data for any other purpose, including the development of LLMs and other similar products.
The customer could also opt to exert more control over the AI Vendor’s data usage by: 1) ensuring that the definition of licensed data is narrow; and 2) reserving the right to charge the AI Vendor additional fees for any additional uses of Customer Data, including the development of LLMs and other similar products. Such agreements aim to ensure that the AI Vendor cannot generate any derivative data without paying the customer for it.
Tips for Reviewing AI Vendor Agreements
Here are the top three contract review tips to consider when working with an AI Vendor that wants Training Data from the customer:
- Customer Data should be clearly defined with the aim of protecting the customer’s data from gratis usage.
- Ensure Training Data does not contain the customer’s Personal Data. Specifically, ensure that the customer’s Personal Data is not used to train the AI Vendor’s LLMs.
- Clarify the terms of the agreement regarding the use and ownership of customer input data (data fed into the AI by users) and customer output data (data produced using the AI in response to inputs or prompts by users).
As more and more AI Vendors seek out free Training Data from their customers, contract professionals representing the customer should understand the difference between Training Data and other types of data and ensure appropriate contractual protections.