
Using ChatGPT in Product Development Poses Unresolved Risks for Trade Secrets and Copyright

What is ChatGPT, and how does it learn?

ChatGPT is an AI-powered chatbot developed by the software company OpenAI. The chatbot uses a neural network to generate responses to user questions and learns from feedback and new information to improve the accuracy of its responses. While OpenAI’s ChatGPT is one of the more notable generative AI chatbots because OpenAI provides a free version easily accessible to consumers, a host of AI chatbot services are available on the commercial market, such as Google Bard and Microsoft Bing AI.

ChatGPT, like other generative AI chatbots, uses neural networks to learn and generate responses to user inquiries. Neural networks learn how to create responses by studying a dataset, referred to as “training data,” for patterns, and then generating responses based on what they have learned from that data. To simplify, neural network training is similar to learning math by checking the answers in the back of the textbook.
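To make the “checking the answers” analogy concrete, the toy sketch below shows a model with a single learnable parameter adjusting itself by comparing its predictions against the known answers in its training data. This is purely illustrative and hypothetical, not OpenAI’s actual training code; real chatbots apply the same basic feedback loop across billions of parameters.

```python
# Toy illustration of neural-network-style training:
# a one-parameter model learns the rule y = 2x by repeatedly
# comparing its predictions to known answers and adjusting.
# (Illustrative only -- not how ChatGPT is actually trained.)

training_data = [(x, 2 * x) for x in range(1, 6)]  # (question, answer) pairs

weight = 0.0          # the model's single learnable parameter
learning_rate = 0.01  # how big each correction step is

for epoch in range(200):
    for x, target in training_data:
        prediction = weight * x
        error = prediction - target          # how far off the "answer" was
        weight -= learning_rate * error * x  # nudge the parameter toward the answer

print(round(weight, 2))  # converges near 2.0, the rule hidden in the data
```

The key point for the legal discussion that follows is visible even in this toy: whatever ends up in the training data, including user-submitted input, directly shapes what the model later produces.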

A generative AI chatbot like ChatGPT needs to learn far more than basic algebra, so it must learn from more than a simple question-and-answer set. The training data needs to include a wide array of information so that the chatbot can produce responses on a broad range of topics. An AI chatbot’s answers are typically created from a variety of sources rather than one single point of data, especially if the question asked is rather broad. The more the chatbot does, the larger the training data needs to be. For a complex chatbot like ChatGPT, that practically means learning from the entire internet.

To get such vast swaths of information for ChatGPT to learn from, OpenAI sources training data from Common Crawl. Common Crawl is a Section 501(c)(3) non-profit that provides “a free, open repository of web crawl data that can be used by anyone.” Common Crawl regularly scrapes the open internet and collects the data into one large digital dataset available to the public. Its mission is to provide this data for free rather than leave it accessible only to large companies with the means and technology to collect this kind of data on their own. OpenAI uses Common Crawl’s publicly available dataset to train ChatGPT. However, this is only a portion of the data that ChatGPT is given to learn from.

How is ChatGPT allowed to use input data from its users according to OpenAI’s Terms of Service?

While Common Crawl provides a large portion of ChatGPT’s training data, OpenAI also uses its users’ input to further train ChatGPT. When a user asks ChatGPT a question, OpenAI can add that question to ChatGPT’s training data so that ChatGPT can learn from it. According to OpenAI’s Terms of Service, the company is allowed to treat user input this way, and depending on the user’s tier of service, this may prove difficult to stop.

By default, OpenAI collects user input as training data for ChatGPT. Users who access ChatGPT for free can submit a request to opt out of having their input used for training, but the request applies only to data collected after it is submitted and processed. Users have no way to stop their previously submitted data from being used to train ChatGPT. Paid users can opt out by changing a setting in their account, but data collection is turned on by default. In March, OpenAI changed its API, which is separate from its standard paid chatbot, to an opt-in model.

While OpenAI does offer options for users to opt out of having their data used to train ChatGPT, the opt-out processes can be tedious and do not apply to data collected before the user opted out. OpenAI is also not upfront about how it uses user data, and users may not be aware of how their data is being collected and used.

Could using ChatGPT risk trade secret protection?

OpenAI collects user input to be used in ChatGPT’s training data, meaning that it becomes part of the large body of knowledge that ChatGPT can pull from when generating answers. While ChatGPT’s answers are often developed using a variety of different sources within its training data, it can only use sources that are applicable to the question. This could mean ChatGPT has hundreds of thousands of sources to pull from when answering a high school math question or writing a paper on the Gettysburg Address, but far fewer sources to reference when it comes to more niche or complex topics such as developing fluid dynamics simulation models using Rust. ChatGPT is trained largely on the internet, so its training data will have more information available covering topics that are more commonly asked about online.

When a user gets into more niche and specific topics, ChatGPT will have fewer data points to reference, resulting in potentially less original responses. ChatGPT’s response to a question about Jane Austen’s Pride & Prejudice will be produced from a combination of many different sources across the internet, while ChatGPT may have only a handful of sources on the biography of a small-town mayor. Having fewer sources greatly increases the chances that ChatGPT will copy larger portions of a source to create its response. If the question is niche enough, ChatGPT may simply plagiarize its response without meaning to. While spitting out responses from the open internet may not seem terribly problematic, it is important to remember that ChatGPT is also trained on user input. If ChatGPT finds a relevant answer to a niche question in training data provided by user input, a user could find that they have received someone else’s input as their output.

The possibility of receiving a previous user’s input as output poses a serious risk to trade secret protection. Under the Uniform Trade Secrets Act and the Defend Trade Secrets Act, reasonable measures must be taken to keep information secret in order to maintain trade secret protection. Trade secrets also tend to cover more specialized subject areas, meaning there may not be many publicly available sources on the topic for ChatGPT to reference. This can cause issues for companies that use ChatGPT in their product development.

If a company’s employee has not opted out of having their data used as training data and enters their company’s trade secret information into ChatGPT, it could become part of ChatGPT’s training data. That trade secret information is likely to be specialized and possibly one of the only sources ChatGPT has available on the subject. If that information is used as training data, a competitor to that company may be able to coax ChatGPT into producing the first company’s input as output, inadvertently sharing the trade secrets with the competitor. Because the competitor gets the information from ChatGPT, it may not be considered improper means of obtaining the information, meaning the original company may not be able to use trade secret protections against the competitor.

Further, even if the competitor already has the information, by getting ChatGPT to produce the original company’s trade secrets as output, the competitor could show that the original company gave the trade secret information to ChatGPT and did not take reasonable measures to keep the information secret. This would also prevent the original company from utilizing trade secret protections.

ChatGPT is a new tool that is changing frequently, so this issue has yet to be litigated. However, OpenAI’s handling of training data poses serious risks, and companies considering using ChatGPT as part of product development should think critically before implementing it.

Who owns the copyright to AI generated content?

Trade secret protections are not the only IP concerns that generative AI poses to the legal world. With companies like OpenAI using large amounts of available information to train their generative AI models, who really owns the copyright to AI-generated content?

Massive amounts of data are required to train generative models like ChatGPT. However, courts, lawmakers, and IP rights holders are calling into question where the data is coming from and whether these AI companies have the right to use it. Getty Images, a large online catalog of stock photos, filed suit against Stability AI, claiming that Stability AI’s model was trained on over 12 million of its copyrighted photos without permission.

If Getty is successful in proving copyright infringement, an injunction could be issued, shutting down all versions of Stable Diffusion associated with the copyrighted material. OpenAI now faces a similar legal battle in Tremblay v. OpenAI, Inc., in which authors allege that OpenAI’s language model was trained on illegally published books from a “shadow library.” The owners of AI models already face a shortage of data to train their models on, and with these decisions looming, the accuracy of AI models and the ownership of their outputs remain uncertain.

The other facet of this issue pertains to the outputs generated by models like ChatGPT and Stable Diffusion. Under OpenAI’s terms of service, users own all of their inputs. OpenAI also assigns ownership of outputs, the content generated by ChatGPT, to the user, including the ability to use outputs for commercial purposes.

However, this IP assignment clause only extends as far as the law allows. Courts are hesitant to acknowledge IP rights in non-human-generated content. In 2018, the Ninth Circuit rejected PETA’s claim seeking to uphold a monkey’s IP rights in a selfie he took. The court decided the case on the ground that PETA lacked standing, and, more importantly, it did not address whether a human holds a copyright in content generated by animals or AI models.

Business Implications

This unsettled area of the law has implications beyond individual users. More businesses and enterprises are using AI to generate content, and many assume they hold ownership rights over those outputs. Depending on the results of the aforementioned cases, there is a looming risk that the content these organizations rely on may be non-copyrightable or may even infringe the copyrights of others. As AI seeks to upend the world of content creation, existing legal precedents and undecided cases threaten the legitimacy of AI-generated content.

While OpenAI’s terms of service allow users to opt out of having their data used for training, the opt-out process remains cumbersome and does not protect data submitted before the user opts out. This not only poses risks to user privacy and trust, but also opens the door to exposing business queries, a concern Microsoft warned its employees about earlier this year.

As the courts muddy the waters of AI’s future, the booming AI investment space faces a roadblock. Currently, 108 VC-backed start-ups are on track to fail by the end of 2023, more than the 95 that failed in 2010. However, continued interest in AI is evidenced by the nearly $40 billion invested in AI start-ups during the first half of the year. Adding another layer of uncertainty to an already uncertain area poses risks to existing VC-backed ventures and to new start-ups’ ability to garner funding.

While investors, founders, companies, and lawyers continue to hurry up and wait, one of the only guarantees that remains is that more litigation is sure to follow.

*The views expressed in this article do not represent the views of Santa Clara University.

