Here’s a question that can embarrass a generative AI company: “What content was used to train your models?” While some companies deflect the question and others refuse to answer it outright, whether an AI company scraped content without permission for its own business purposes is a thorny issue.
At best, you’ll probably get a vague reference to “selected data sets”; at worst, a polemic about how everything on the internet is fundamentally fair game.
Now, a document obtained by 404 Media suggests that some of the data used to train Runway’s latest AI video-generation tool, Gen-3, may have come from the YouTube channels of thousands of popular media companies, including Pixar, Netflix, Disney, and Sony.
While 404 Media doesn’t say how it obtained the document, and can’t verify that every video referenced in it was used to train Gen-3, the document offers insight into the kinds of practices an AI company might use to acquire copyrighted material for training its models.
A former Runway employee spoke to 404 Media about the methodology used. The leaked document allegedly includes 14 spreadsheets containing terms like “beach” or “rain,” with the names of Runway employees listed next to them.
According to the source, the people named were allegedly employees tasked with finding videos or channels matching those keywords. They would then run a YouTube video downloader through a proxy server to pull videos from the site without being blocked by Google.
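For illustration only, a keyword-driven download workflow of the kind described could be sketched with the publicly available yt-dlp library, which supports routing requests through a proxy. The keywords, proxy address, and output path below are hypothetical placeholders, not details from the leaked document, and this is not presented as Runway’s actual tooling.

```python
# Illustrative sketch only: keywords, proxy URL, and output template are
# hypothetical placeholders, not values taken from the leaked document.
import yt_dlp

KEYWORDS = ["beach", "rain"]  # example terms like those reportedly in the spreadsheets
PROXY = "http://proxy.example.com:8080"  # hypothetical proxy server

ydl_opts = {
    "proxy": PROXY,                             # route traffic through the proxy
    "format": "mp4",                            # request an mp4 download
    "outtmpl": "downloads/%(title)s.%(ext)s",   # where to save each clip
    "quiet": True,
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    for keyword in KEYWORDS:
        # "ytsearch5:<term>" tells yt-dlp to fetch the top 5 search results
        ydl.download([f"ytsearch5:{keyword}"])
```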
It appears that it wasn’t just YouTube content that was scraped. One spreadsheet contains 14 links to non-YouTube sources, including a link to a website dedicated to streaming popular cartoons and animated movies that has drawn thousands of copyright complaints.
In general, pirated media seems to be at least considered in the creation of training data, if not directly collected and used.
404 Media took it a step further and used Gen-3 to generate videos from prompts based on keywords found in the spreadsheets, producing clips that looked strikingly similar to the content associated with those terms.
Runway itself is partially funded by Google, among others, so scraping content from Google’s platforms without creators’ permission, if true, would likely land the company in serious trouble, to say nothing of the potential for wider legal repercussions.
Still, thorny as the issue of AI content theft is, the model itself has problems of its own. Ars Technica recently tried making videos with Gen-3 Alpha, and it gave a cat a pair of human hands. I’m not sure what content was used to train this particular version of the model, but whatever the methodology, it clearly still needs work.