According to a recent report from 404 Media’s Samantha Cole, leaked internal communications from Nvidia show the continuation of what appears to be the industry trend of big tech companies taking the ‘ask for forgiveness instead of permission’ approach regarding the data they use to train AI models.
Even when employees raised legal and ethical concerns, managers reportedly brushed them aside. In one instance they described the company’s practice of scraping millions of hours of video from YouTube, Netflix, and other data sets as “an executive decision”, and in another as “an open legal issue”.
If you were still on the fence regarding the ongoing debate about the legal and ethical aspects of where AI companies get their training data, this might be enough to make you pick a side.
Won’t somebody please think of the creators?
cool cool cool cool cool cool now leaked NVIDIA slack messages discussing which YouTube channels to scrape videos from. MKBHD videos? Yeah grab those too. https://t.co/0XczvTNVBH
— Marques Brownlee (@MKBHD) August 5, 2024
Nvidia has opted to stick to its guns regarding its unscrupulous scraping. As Cole writes, “When asked about legal and ethical aspects of using copyrighted content to train an AI model, Nvidia defended its practice as being ‘in full compliance with the letter and the spirit of copyright law.’”
Well, the leaked Slack conversations and emails from the team working on a project codenamed ‘Cosmos’ tell a different story.
So does YouTube CEO Neal Mohan, who said in April that using YouTube videos to train AI models is a “clear violation” of the platform’s terms. Back then, he was responding to reports that OpenAI had used YouTube videos to train its Sora text-to-video generator.
Just last month, AI startup Runway came under similar fire when another 404 Media report alleged that it had used YouTube videos and pirated content as training data without permission. Can you see the pattern yet?
Training an AI model on video content isn’t inherently bad. HD-VG-130M is a data set of 130 million YouTube videos compiled specifically for training AI models by researchers at Peking University in China. The important differences are that a) this data set’s videos are in the open domain and b) it is governed by a licence that restricts its use to training for academic research.
“Any content from HD-VG-130M dataset is available for academic research purposes only. You agree not to reproduce, duplicate, copy, trade, or exploit for any commercial purposes,” states the licence agreement.
While Nvidia does contribute to academic research of that kind, the leaked communications clearly indicate that ‘Cosmos’ is not such a project.
“Emails from the project’s leadership to employees show that the goal of Cosmos was to build a state-of-the-art video foundation model ‘that encapsulates simulation of light transport, physics, and intelligence in one place to unlock various downstream applications critical to NVIDIA,’” writes Cole.
The full 404 Media report includes the leaked screenshots.