
AI data collection is too hard to do ethically


All AI models require training data, and most developers get their data unethically. Everyone seems to be doing it: OpenAI was recently hit with multiple lawsuits claiming the company stole internet data to train its popular ChatGPT tool. And Stability AI, the creator of the well-known Stable Diffusion tool, is being sued by Getty Images for copyright infringement for using millions of images without the proper licensing.

These cases are still in the courts, but one thing is already certain: the amount of data required to train a model like ChatGPT is huge, much larger than the amount of data available to most AI developers. On top of that, most publicly available datasets are licensed for non-commercial use only. That means developers can’t (ethically) use that data to train an AI model if the final product is part of a commercial endeavor.

Now, there are some commercial-use datasets available. But they are often too small, or they lack the quality or variety needed to properly train an AI model. And the cost of building a large one-off training dataset is too high for most AI developers, if they can gather enough data at all.

Does that mean that commercial entities shouldn’t use non-commercial data to train their AI models? That’s a tough question to answer. Take the example of Casia Africa. The stated goal of this dataset of face images from Africa was to let AI developers use more inclusive images in their training data. It was a great idea with incredibly good, ethical goals.

However, because the data is licensed for non-commercial use only, companies can’t use it for any project with a commercial purpose, which defeats the dataset’s stated goal of getting diverse images into widespread use.

Another issue is that even images available for commercial use can have restrictions. Just because somebody posts their images as publicly available doesn’t mean that you actually have the right to use them.

One common way to share images is through sites such as Flickr. However, Flickr forbids scraping the site for images, no matter what the images themselves are licensed as. So, using software to extract large numbers of files for training data is against Flickr’s terms and conditions, even if an image owner gives permission to use the file for any purpose.

You can see how this gets confusing. Now, here’s another wrinkle.

There’s actually a loophole in copyright law that might apply to AI training data. If a developer’s work is transformative in some way, then they might have a fair use right to publicly available images. The U.S. Copyright Office says the fair use exception to copyright applies when the end result is completely new, has a further purpose or different character than the original work, and isn’t a substitute for the original material.

With fair use, there’s a good argument to be made that if an operation is making something limited and unique (in other words, it’s not recreating or altering existing photos, just using those images to train a model to identify faces and images), it should be okay to use non-commercial training data. However, there’s an equally good argument that this practice is unethical because the original owner of the imagery didn’t consent.

In an effort to get enough training data, developers may ignore licensing agreements and resort to using non-commercially licensed datasets to train models for commercial use. One example of this is the Stable Diffusion application, which was trained on a huge number of images from all over the internet. This is a highly unethical practice, but it still happens in the industry.

Some developers may argue, so what? Is anyone really ever going to know? Probably not. But that doesn’t make it okay. And even if it were legal, that wouldn’t mean it’s ethical.

There may not be an easy solution for this, but one thing that would help is if more datasets were available. At DeepMake, we feel that when dataset creators limit use, they make the environment more hostile to entities that want to work ethically.

We think developers should have access to more data, and we’re not the only ones who feel that way. Just recently, the Japanese minister of education, culture, sports, science, and technology said in a committee meeting that AI companies are not violating copyright when they train on data, because the output of the model isn’t intrinsically a copy of the training data.

The problem with fair use is that it’s a defense against a copyright infringement claim, not a blanket exception. That means that the only way to know if you’re in the right is to go to court and argue your case before a judge. That isn’t feasible for small AI developers such as open source projects, and it offers no legal certainty until the court has ruled in your favor.

In addition to having the money to argue for fair use, private companies have more tricks up their sleeves. Google just updated its Privacy Policy to lay claim to publicly available information in Google so it can train its AI products, including Bard. This means that if you use any Google services (and who doesn’t?), the company has claimed the right to take anything you put on the internet for itself.

Giant tech companies will always find a way to get training data — ethically or not. The way to give all developers equal footing would be to have more datasets released publicly and licensed for everyone to use. This would make the data useful and democratic, encouraging future growth and improvements from that data.