"Data uprising" breaks out in the United States: Hollywood literature, journalism and social media rebel against AI

Author: Intern Chen Xiaorui; Reporter Fang Xiao

Source: The Paper

Eric Goldman, a law professor at Santa Clara University in the United States, believes the wave of litigation has only just begun, and that "second and third waves" are coming that will define the future of artificial intelligence.

AI companies argue that it is reasonable to use copyrighted works to train AI, invoking the concept of "transformative use" in U.S. copyright law, under which an exception may apply when material is changed in a "transformative" way.

Image source: Generated by Unbounded AI tool

The Writers Guild of America has been on strike for more than 70 days, demanding wage increases, a larger share of streaming revenue, and oversight of artificial intelligence.

A “data uprising” is breaking out in America, with Hollywood, artists, writers, social media companies and news organizations among the rebels.

All the blame points to generative artificial intelligence tools such as ChatGPT and Stable Diffusion, which are accused of illegally using the work of content creators to train large language models without permission or compensation.

At the heart of this "data uprising" is a new recognition that online information -- stories, artwork, news articles, web posts and photos -- can have significant untapped value. The practice of scraping public content on the internet has a long history, and most companies and nonprofits that do so publicly disclose it. But before ChatGPT was released, data owners didn't know much about it, nor did they see it as a particularly serious problem. Now, that has changed as the public has learned more about the basics of AI training.

"This is a fundamental reshaping of the value of data." Brandon Duderstadt, founder and CEO of Nomic, said in an interview with the media. You can access data and run ads to get value from it. Now, people think they have to protect their data.”

Wave after wave

In recent months, social media companies like Reddit and Twitter, news organizations like The New York Times and NBC, science fiction author Paul Tremblay, actress Sarah Silverman and others have taken action to oppose the unauthorized collection of their works and data by artificial intelligence companies. This series of moves has been dubbed a "data revolt" by the American media.

Last week, Silverman filed a lawsuit against OpenAI and Meta, accusing them of using pirated copies of her book in their training data, because the companies' chatbots can accurately summarize its contents. Additionally, more than 5,000 authors, including Jodi Picoult, Margaret Atwood, and Viet Thanh Nguyen, have signed a petition calling on tech companies to ask their permission, and to give them attribution and compensation, when using their books as training data.

To protect their work, writers and artists have resorted to different forms of protest. Some choose to lock works and prevent artificial intelligence from obtaining them; some choose to boycott websites that publish artificial intelligence-generated content; some choose to write subversive content to interfere with artificial intelligence learning.

On July 13, SAG-AFTRA, one of the three major Hollywood unions with 160,000 members, announced a strike. Before that, the Writers Guild of America had been on strike for more than 70 days. According to The New York Times, the joint strike has brought the $134 billion U.S. film and television industry to a standstill. Among the unions' demands is a guarantee that actors will not be replaced by AI-generated or computer-generated faces and voices.

Meanwhile, some news organizations are resisting AI. In June, in an internal memo on the use of generative AI, The New York Times said that "AI companies should respect our intellectual property." In a statement, online publishers including The New York Times and The Washington Post argued that using copyrighted news articles as AI training data carries potential risks and legal issues, and called on artificial intelligence companies to respect publishers' intellectual property rights and creative labor.

Social media companies have also taken a stand. In April, social news site Reddit said it would charge third parties for access to its application programming interface (API). Reddit CEO Steve Huffman said his company "doesn't need to give all of that value to some of the largest companies in the world for free." In July, Twitter owner Elon Musk likewise stated that some companies and organizations were "illegally" scraping large amounts of Twitter data. In response to "extreme data scraping and system manipulation," Twitter decided to limit the number of tweets individual accounts can view.

Reddit co-founder and CEO Steve Huffman's plan to charge third parties for access to the site's application programming interface (API) sparked a massive outcry among users.

This "data uprising" also includes a wave of lawsuits, with some AI companies sued multiple times over data and privacy concerns. In November, a group of programmers filed a class-action lawsuit against Microsoft and OpenAI, alleging that the companies violated their copyrights by using their code to train AI programming assistants. In June of this year, the Los Angeles-based Clarkson law firm filed a 151-page class-action lawsuit against OpenAI and Microsoft, detailing how OpenAI collected data from minors and arguing that its web scraping violated copyright law and constituted "theft." The firm has since filed a similar lawsuit against Google.

Santa Clara University School of Law professor Eric Goldman said in a media interview that the lawsuit's arguments are too broad and unlikely to be accepted by the courts. But he argues that the wave of litigation is just beginning, with a "second and third wave" coming that will define the future of artificial intelligence.

Legal Controversy

OpenAI's ChatGPT and Dall-E, Google's Bard, Stability AI's Stable Diffusion and other generative AI systems are all trained on massive quantities of news articles, books, pictures, videos and blog posts scraped from the internet, much of which is publicly accessible but protected by copyright.

In March of this year, OpenAI released an analysis of its main language model showing that the text portion of its training data drew on news websites, Wikipedia, and a pirated-book database (LibGen) that has since been shut down after being seized by the U.S. Department of Justice.

On July 13, the U.S. Federal Trade Commission (FTC) sent a 20-page document to OpenAI requesting records on its AI models' risk management, data security, and information review, to investigate whether the company has violated consumer protection laws.

On July 12, a U.S. Senate subcommittee held a hearing on artificial intelligence, intellectual property and copyright, at which witnesses testified under oath. The hearing heard from representatives of the music industry, Photoshop maker Adobe, AI company Stability AI, and illustrator Karla Ortiz.

But in public appearances and in response to lawsuits, AI companies have argued that it is reasonable to use copyrighted works to train AI, invoking the concept of "transformative use" in U.S. copyright law, under which an exception may apply when material is changed in a "transformative" way.

"The AI model is basically learning from all the information. It's like a student reading in a library and then learning how to write and read," Kent Walker, Google's president of global affairs, said in an interview. "At the same time, you have to make sure you're not copying someone else's work or doing something that violates copyright."

Halimah DeLaine Prado, Google's general counsel, told the media: "It's been clear to everyone for years that we use information collected from public sources, such as content posted to the open web and public datasets, to train the AI models behind services like Google Translate." She added, "U.S. law supports creating new and beneficial uses of public information, and we look forward to refuting these baseless claims."

Andres Sawicki, a professor at the University of Miami who studies intellectual property law, said in an interview that some precedent could favor the tech companies, such as a 1992 U.S. Court of Appeals ruling that allowed companies to reverse-engineer other firms' software code in order to design competing products. But many feel it is intuitively unfair for large corporations to use creators' work to build new money-making tools. "The question about generative AI is really hard to answer," he said.

Jessica D. Litman, a professor of copyright law at the University of Michigan, said the fair use doctrine is a powerful defense for AI companies because, given the scale of AI models, much of their output does not unambiguously resemble the work of any particular person. But she argues that if creators suing AI companies can show enough examples of AI output that closely resemble their work, they will have a strong basis for claiming copyright infringement.

AI companies begin to respond

AI companies could avoid this by installing filters in their products to ensure they don't generate anything too similar to existing works, Sawicki said. For example, the video site YouTube already uses technology to detect and automatically remove copyrighted works uploaded to its site. In theory, AI companies could likewise build algorithms that flag outputs closely resembling existing works of art, music, or writing.

This "data uprising" may not make waves in the long run. Tech giants like Google and Microsoft already have vast amounts of proprietary data and have the ability to acquire more. But start-ups and nonprofits looking to take on the bigger players may not get enough data to train their systems as content becomes more difficult to obtain.

In early July, Stuart Russell, a professor of computer science at the University of California, Berkeley and author of "Artificial Intelligence: A Modern Approach," warned that AI-driven bots such as ChatGPT could soon "run out of text in the universe," and that techniques for training bots by collecting vast amounts of text are "starting to struggle."

Some companies are also responding with a cooperative attitude. In a statement, OpenAI said, "We respect the rights of creatives and authors and look forward to continuing to work with them to protect their interests." On July 14, the Associated Press agreed to license its archive of news stories dating back to 1985 to OpenAI, while also making use of OpenAI's technology and products.

Google also said in a statement that it was involved in negotiations over how publishers will manage their content in the future. "We believe that everyone can benefit from a vibrant content ecosystem," the company said.

Margaret Mitchell, chief ethics scientist at the AI company Hugging Face, said in a media interview, "The entire data collection system needs to be changed, and unfortunately it will take litigation to achieve that. This is often the way to push tech companies to change." She said she would not be surprised if OpenAI pulled one of its products entirely by the end of the year because of lawsuits or new regulations.
