Legal challenges surrounding data collection for AI training have recently been raising alarms across the AI technology landscape. This article emphasizes the critical role of data in AI development and highlights the complexities involved in data collection. It places particular focus on the rising number of AI-related lawsuits, which underscores the growing importance of understanding and navigating the legal landscape of AI data collection.
AI is at the heart of today's technological evolution, and machine learning requires large datasets. Data is integral to AI training because it enables models to make predictions and extract useful information. High-quality data is therefore paramount: an AI model is only as good as the quality of its training data.
Data collection is a complex process of gathering, organizing, and curating raw information from multiple sources to train, validate, and test AI models for a specific purpose.
The process of collecting AI data requires selecting a method that aligns with the type of data needed, such as crowdsourcing, off-the-shelf solutions, in-house collection, automation, or generative AI. Key steps in this process include identifying the objectives of the model, ensuring quality assurance, and organizing data through proper storage and annotation.
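For a simple text-data scenario, the key steps above (gather raw records, apply quality assurance, keep the survivors for annotation and storage) can be sketched roughly as follows. All names, fields, and the length threshold here are illustrative assumptions, not part of any specific tool:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative sketch of a minimal collection pipeline: gather raw
# records, apply a quality-assurance filter, and keep the survivors
# for later annotation. Names and thresholds are assumptions.

@dataclass
class Sample:
    text: str
    label: Optional[str] = None  # filled in during annotation

def passes_qa(sample: Sample) -> bool:
    """Toy QA rule: reject empty or very short records."""
    return len(sample.text.strip()) >= 10

def collect_dataset(raw_texts: list[str]) -> list[Sample]:
    samples = [Sample(text=t) for t in raw_texts]
    return [s for s in samples if passes_qa(s)]

dataset = collect_dataset(["short", "A usable training example sentence."])
print(len(dataset))  # only the record passing QA is kept
```

In practice the QA step would enforce project-specific rules (deduplication, language filtering, consent checks) rather than a simple length cutoff.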
AI lawsuits refer to legal actions taken against individuals or companies regarding the use of Artificial Intelligence in various sectors.
Lawsuits against individuals or companies may be filed for numerous reasons, from copyright infringement to harmful AI-driven outcomes. These include:
Disputes over copyrights, trademarks, or the ownership of data used in AI model training
Disputes over who is responsible for damages caused by AI systems that provide inaccurate information leading to risky actions.
Claims that AI algorithms produce biased outcomes, resulting in discrimination against specific groups based on race, gender, or other characteristics.
Legal issues surrounding the collection and use of personal data without consent, especially in the context of data-driven AI applications.
Disputes over agreements related to AI development, deployment, or licensing.
Non-compliance with laws and regulations governing AI use, including those concerning data protection and consumer rights.
The immense growth, capabilities, and adoption of AI have prompted individuals, organizations, and companies to leverage AI technologies, which require large datasets for training their models. Recently, AI companies have reportedly begun to exhaust the available supply of human knowledge and data for training. While this growth has had an incredible impact on the world, it has also led to an increase in legal disputes over breaches of legal procedure in how companies and organizations collect and use data. Below are some of the lawsuits that have made headlines:
In this case, a class-action lawsuit on behalf of authors of code available on GitHub was filed against GitHub, Microsoft, and OpenAI in November 2022 on the basis of intellectual property violations. The plaintiffs claim that GitHub Copilot, a coding assistant powered by OpenAI’s technology, copied their code without complying with the requirements of their open-source licenses. They allege that the AI programs violate the requirement to display copyright information when using the original code, as Copilot may generate code identical to the developers’. The suit also alleges violations of the Digital Millennium Copyright Act (DMCA), the Lanham Act, and the California Consumer Privacy Act (CCPA), along with mishandling of personal data and fraud. The federal judge dismissed most of the claims but allowed two allegations, open-source license violation and breach of contract, to proceed.
A copyright lawsuit against AI image generator companies Stability AI, Midjourney, DeviantArt, and others alleges the misuse of visual artists’ work to train AI image generation systems capable of producing work in styles similar to the artists’. The plaintiffs argue that the companies unlawfully copied the visual works, stored them on their servers, and used them without permission, and that at some point users of the companies’ services could directly reproduce copies of the artists’ work. The judge dismissed some of the claims but allowed the allegations concerning trademark rights violations and the false implication that the artists endorsed the systems to proceed.
Authors Mona Awad and Paul Tremblay filed a lawsuit against OpenAI, the company behind the generative AI tool ChatGPT, alleging copyright infringement. The authors claim that OpenAI used their novels to train the ChatGPT model without permission, pointing out that the tool generated very accurate summaries of the novels. Beyond breaching copyright law, the authors allege that OpenAI profits unfairly from the ‘stolen’ writing, and they are demanding monetary compensation.
The New York Times filed a lawsuit against OpenAI, the creator of ChatGPT, in 2023 based on copyright infringement. The lawsuit claims that millions of New York Times articles were used without authorization to train OpenAI models, which now compete with the Times as a reliable news source. The case also states that the chatbots mimic the Times’s writing style and recite its content. The plaintiff demands that the defendants be held responsible for monetary damages for copying the Times’s “uniquely valuable work”, and further calls for the AI company to destroy chatbot models and training data built on copyrighted material from the news company.
Google is facing a class-action lawsuit, filed in 2023, over the misuse of personal information and copyright infringement. The case cites data from TikTok, dating websites, Spotify playlists, and books that were used as training data for Bard AI, with damages sought of up to $5 billion. The case was dismissed for inadequate legal claims, but the judge allowed it to be refiled after the plaintiffs amended the complaint. This is just one of several recent lawsuits against Google over its data collection practices.
Provide clear and easy-to-understand notice to customers regarding AI training practices. Avoid burying the notices in lengthy terms of service or privacy policies; instead, make them prominent and easy for customers to find.
Obtain specific and informed consent from customers on how their data will be used to train AI. Customers should be given detailed information on data utilization, how it will be processed, who will have access to the data, and the potential risks and benefits of the process.
Consent practices should also align with the legal policies and requirements of data privacy and security laws such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act of 2018 (CCPA).
Customers should be presented with opt-in options that let them decide what types of data they want to share for AI model training. Additionally, customers should have flexible and easy options to opt out or withdraw their consent at any time, as part of exercising their rights.
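As a minimal sketch of per-category opt-in with withdrawal, the logic might look like the following. The class, method names, and data categories are hypothetical, not a real consent-management API:

```python
from datetime import datetime, timezone

# Minimal sketch of per-category opt-in consent with withdrawal.
# Class, method names, and categories are illustrative assumptions.

class ConsentManager:
    def __init__(self):
        # (customer_id, category) -> timestamp of explicit opt-in
        self._consents = {}

    def opt_in(self, customer_id: str, category: str) -> None:
        """Record explicit opt-in for one data category (e.g. 'chat_logs')."""
        self._consents[(customer_id, category)] = datetime.now(timezone.utc)

    def opt_out(self, customer_id: str, category: str) -> None:
        """Withdraw consent at any time; use of this category must stop."""
        self._consents.pop((customer_id, category), None)

    def may_use(self, customer_id: str, category: str) -> bool:
        """Data may only be used while an opt-in record exists."""
        return (customer_id, category) in self._consents

mgr = ConsentManager()
mgr.opt_in("user-42", "chat_logs")
print(mgr.may_use("user-42", "chat_logs"))   # True
mgr.opt_out("user-42", "chat_logs")
print(mgr.may_use("user-42", "chat_logs"))   # False
```

The key design point is that absence of a record means no consent: the default is opted out, matching the opt-in requirement described above.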
Customers should receive regular reminders about the status of their data in the AI development process. This ensures that customers can re-evaluate their consent and remain informed about the current status of their shared information.
Policies and procedures for data retention and deletion are essential to ensure that customer data is not retained longer than necessary and is securely deleted when consent is withdrawn or expires.
If a third party is involved in AI training, ensure that they have appropriate consent management practices, manage the risks associated with third-party involvement, and put transparent data-sharing agreements and contracts in place.
Maintain up-to-date records of customer consent, conduct regular audits for consent practices (such as re-obtaining consent if practices change), and improve practices based on feedback. Moreover, companies or organizations should ensure they are up to date with the legal standards of regulatory bodies to avoid potential violations and lawsuits.
Ensure customers' data is well secured by implementing strong security measures to prevent unauthorized access and breaches. This will also improve customer trust.
A great recommendation for secure data storage would be OORT decentralized storage.
AI data collection is paramount to the development of AI models, but it faces rising legal challenges that call for more stringent measures by AI regulatory agencies to ensure customer rights are protected. Companies and organizations need to implement the practices mentioned above so they can obtain the necessary data without breaching the regulations in place. This helps them avoid potential legal problems while leveraging the immense benefits AI has to offer.