Nvidia leak reveals they scraped “80 years worth” of YouTube videos a day to train AI

According to leaked internal Slack chats, emails, and documents, Nvidia utilized videos from YouTube, Netflix, and other sources to train its AI products.

An investigation by 404 Media has uncovered a large leak from Nvidia. According to the publication, Nvidia has been scraping videos from various sources for its Omniverse 3D world generator, self-driving car systems, and “digital human” products.

The employees responsible for scraping videos frequently raised concerns about its ethics and legality, but were silenced by their managers. These managers also claimed to have obtained permission from the highest levels of the company to use the content.

The majority of the videos were obtained from YouTube, with additional material taken from platforms such as Netflix and GitHub.

During a Slack conversation, an employee from Nvidia proposed the idea of scraping movies. The rationale behind this suggestion was that “movies can provide high-quality data with gaming-like 3D consistency and fictional content.”

Ming-Yu Liu, Vice President of Research at Nvidia, responded, “We require a volunteer to download all of the movies.”

Screenshot of an internal Nvidia Slack chat. (404 Media)

According to emails obtained by 404 Media, project managers were considering using 20 to 30 virtual machines on Amazon Web Services to download 80 years’ worth of videos per day.

In an email sent in May, Liu stated that “we are in the process of completing the v1 data pipeline and securing the required computing resources to establish a video data factory capable of producing a daily yield of training data equivalent to a lifetime of human visual experience.”

In the Slack channels, employees also deliberated over which YouTube channels’ videos to gather for AI training. A research scientist shared multiple links to YouTube channels and added, “In case you are still seeking suggestions for YouTube channels to download, here are a few that could be worth considering.”

The links pointed to a variety of YouTube channels, including the official channels of well-known brands such as Expedia and Architectural Digest, as well as individual content creators like Marques Brownlee (MKBHD). Next to the link to his YouTube video, the scientist commented on the high quality of MKBHD’s tech product reviews.

Asked by 404 Media about the legal and ethical considerations of using copyrighted material for AI training, Nvidia responded that its methods are fully compliant with both the letter and the intent of copyright law.

In July, Nvidia was also alleged to have used data from a third-party company to train its AI models. That company had acquired the data by scraping content creators’ YouTube videos without authorization.