March 3, 2025

Navigating the Challenges of Data Collection: Insights from a Legal Perspective

The legal challenges surrounding data collection for AI training have recently been raising alarms across the AI technology landscape. This article emphasizes the critical role of data in AI development and highlights the complexities involved in data collection. It places particular focus on the rising number of AI-related lawsuits, which underscores the growing importance of understanding and navigating the legal landscape of AI data collection.

AI data collection 

Importance of data to AI training

AI is at the heart of today's technological evolution, and machine learning requires large datasets. Data is integral to AI training because it enables models to make predictions and extract useful information. Hence, high-quality data is paramount: an AI model can only be as good as the data it is trained on.

Definition of data collection

Data collection is a complex process of gathering, organizing, and curating raw information from multiple sources to train, validate, and test AI models for a specific purpose. 

A brief explanation of the process of AI data collection 

The process of collecting AI data requires selecting a method that aligns with the type of data needed, such as crowdsourcing, off-the-shelf solutions, in-house collection, automation, or generative AI. Key steps in this process include identifying the objectives of the model, ensuring quality assurance, and organizing data through proper storage and annotation.

AI Lawsuits 

What are they? 

AI lawsuits refer to legal actions taken against individuals or companies regarding the use of Artificial Intelligence in various sectors. 

Why are people suing? 

Lawsuits against individuals or companies are filed for numerous reasons, ranging from concerns about copyright infringement to harmful AI-driven outcomes. These include:

Intellectual Property

Disputes over copyrights, trademarks, or the ownership of data used in AI model training

Liability

Disputes over who is responsible for damages caused by AI systems, such as inaccurate information that leads to risky actions.

Discrimination

Claims that AI algorithms produce biased outcomes, resulting in discrimination against specific groups based on race, gender, or other characteristics.

Privacy Violations

Legal issues surrounding the collection and use of personal data without consent, especially in the context of data-driven AI applications.

Contractual Issues

Disputes over agreements related to AI development, deployment, or licensing.

Regulatory Compliance

Disputes over compliance with laws and regulations governing AI use, including those concerning data protection and consumer rights.

Recent Case studies 

The immense growth, capabilities, and adoption of AI have prompted individuals, organizations, and companies to leverage AI technologies, which require large datasets for training. Recently, AI companies have reported exhausting the available supply of human knowledge and data for training their models. While this has had an incredible impact on the world, it has also led to an increase in legal disputes over breaches of legal procedures in data collection and use by various companies and organizations. Below are some of the lawsuits that have made headlines.

GitHub, Microsoft, and OpenAI

In November 2022, a class-action lawsuit was filed on behalf of authors of code available on GitHub against GitHub, Microsoft, and OpenAI on the basis of intellectual property violations. The plaintiffs claim that GitHub Copilot, a coding assistant powered by OpenAI's technology, copied their code without complying with the requirements of their open-source licenses. The suit alleges that the AI programs violate the requirement to display copyright information when reproducing original code, as Copilot may generate code identical to the developers'. It also alleges violations of the Digital Millennium Copyright Act (DMCA), the Lanham Act, and the California Consumer Privacy Act (CCPA), along with mishandling of personal data and fraud. A federal judge dismissed most of the claims, but two allegations, open-source license violation and breach of contract, were allowed to proceed.

Stability AI, Midjourney, and DeviantArt

A copyright lawsuit was filed against the AI image generator companies Stability AI, Midjourney, DeviantArt, and others, alleging misuse of visual artists' work to train their AI image generation systems, whereby the models could generate work in a style similar to the artists'. The plaintiffs argue that the companies unlawfully copied their visual work to company servers and used it without permission. At some point, users of the companies' services could directly reproduce copies of the artists' work. The judge dismissed some of the claims but allowed the allegations concerning trademark violations and the companies' false implication that the artists endorsed the systems.

Authors Paul Tremblay and Mona Awad vs OpenAI

Authors Mona Awad and Paul Tremblay filed a lawsuit against OpenAI, the company behind the generative AI tool ChatGPT, alleging copyright infringement. The authors claim that OpenAI used their novels to train the ChatGPT model without permission, as evidenced by the tool generating highly accurate summaries of the novels. Beyond the alleged breach of copyright law, the authors claim that OpenAI profits unfairly from the 'stolen' writing, and they are demanding monetary compensation.

The New York Times vs OpenAI and Microsoft

The New York Times filed a lawsuit in 2023 against OpenAI, the creator of ChatGPT, and Microsoft based on copyright infringement. The lawsuit claims that millions of New York Times articles were used without authorization to train OpenAI models, which now compete with the Times as a reliable news source. Moreover, the case states that the chatbots mimic the New York Times' writing style and recite its content. The plaintiffs demand that the defendants be held responsible for monetary damages for copying the Times' "uniquely valuable work." The New York Times also calls for the AI companies to destroy chatbot models and training datasets that used its copyrighted material.

Google

Google is facing a class-action lawsuit filed in 2023 over the misuse of personal information and copyright infringement. The case cites data from TikTok, dating websites, Spotify playlists, and books that were allegedly used as training data for Bard AI. Damages could reach $5 billion. The case was dismissed due to the inadequacy of the legal claims, but the judge allowed it to be refiled after the plaintiffs amended the complaint. This is just one of several recent lawsuits over Google's data collection practices.

Best practices for obtaining customer consent for AI training

Clear and Conspicuous Notice

Provide clear and easy-to-understand notice to customers regarding AI training practices. Avoid burying the notice in lengthy terms of service or privacy policies; instead, highlight it and make it noticeable to customers.

Specific and Informed Consent 

Obtain specific and informed consent from customers on how their data will be used to train AI. Customers should be given detailed information on how their data will be used and processed, who will have access to it, and the potential risks and benefits of the process.

Consent practices should also align with data privacy and security laws such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act of 2018 (CCPA).

Granular Opt-In, Opt-Out, and Withdrawal Choices

Customers should be presented with opt-in options that let them decide what types of data they want to share for AI model training. Additionally, customers should have flexible and easy ways to opt out or withdraw their consent at any time, as part of exercising their rights.
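To make the idea of granular consent concrete, here is a minimal sketch of how per-category opt-in choices and withdrawal might be modeled in code. The category names, class, and method names are illustrative assumptions, not a prescribed standard or any particular vendor's API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative data categories; a real application would define its own.
CATEGORIES = {"usage_analytics", "chat_logs", "uploaded_files"}

@dataclass
class ConsentRecord:
    """Tracks a customer's per-category opt-in choices for AI training."""
    customer_id: str
    granted: set = field(default_factory=set)
    history: list = field(default_factory=list)  # audit trail of changes

    def _log(self, action: str, category: str) -> None:
        self.history.append((datetime.now(timezone.utc), action, category))

    def opt_in(self, category: str) -> None:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        self.granted.add(category)
        self._log("opt_in", category)

    def opt_out(self, category: str) -> None:
        self.granted.discard(category)
        self._log("opt_out", category)

    def withdraw_all(self) -> None:
        """Withdraw consent entirely, e.g. on a customer's request."""
        for category in sorted(self.granted):
            self._log("withdraw", category)
        self.granted.clear()

    def may_use(self, category: str) -> bool:
        return category in self.granted
```

The key design point is that consent is recorded per category rather than as a single yes/no flag, and every change is timestamped, which also supports the auditing and recordkeeping practices discussed later.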

Regular Reminders and Updates

Customers should receive regular reminders about the status of their data in the AI development process. This ensures that customers can re-evaluate their consent and stay informed about the current use of their shared information.

Data Retention and Deletion

Policies and procedures for data retention and deletion are essential to ensure that customer data is not retained longer than necessary and is securely deleted when consent is withdrawn or expires. 
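The retention principle above can be sketched as a simple policy check. The one-year retention window and the record shape below are assumptions chosen for illustration; actual retention periods depend on the organization's policy and applicable law:

```python
from datetime import datetime, timedelta, timezone

# Illustrative retention window; actual periods depend on policy and law.
RETENTION_PERIOD = timedelta(days=365)

def purge_expired(records: list[dict], now: datetime) -> list[dict]:
    """Keep only records within the retention window whose consent has
    not been withdrawn; everything else is dropped for deletion."""
    return [
        r for r in records
        if not r["consent_withdrawn"]
        and now - r["collected_at"] <= RETENTION_PERIOD
    ]

now = datetime.now(timezone.utc)
records = [
    {"id": 1, "collected_at": now - timedelta(days=30), "consent_withdrawn": False},
    {"id": 2, "collected_at": now - timedelta(days=400), "consent_withdrawn": False},  # too old
    {"id": 3, "collected_at": now - timedelta(days=10), "consent_withdrawn": True},    # withdrawn
]
kept = purge_expired(records, now)  # only record 1 survives the purge
```

In practice the purge would run on a schedule and the dropped records would be securely erased, not merely filtered out of a list.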

Third-Party Consent Management

If a third party is involved in AI training, ensure that it has appropriate consent management practices. Manage the risks associated with third-party involvement through transparent data-sharing agreements and contracts.

Consent Auditing and Recordkeeping

Maintain up-to-date records of customer consent, conduct regular audits of consent practices (for example, re-obtaining consent when practices change), and improve practices based on feedback. Moreover, companies and organizations should stay current with the legal standards of regulatory bodies to avoid potential violations and lawsuits.

Data security assurance 

Ensure customers' data is well secured by implementing strong security measures to prevent unauthorized access and breaches. This will also improve customer trust. 

One recommendation for secure data storage is OORT Decentralized Storage.

Conclusion

AI data collection is paramount to the development of AI models, but it faces a rise in legal issues that call for more stringent measures by AI regulatory agencies to ensure customer rights are protected. Companies and organizations need to implement the practices outlined above to ensure they can obtain the necessary data without breaching the regulations in place. This helps them avoid potential legal problems while leveraging the immense benefits AI has to offer.