
Training your AI with images is missing a dimension: video is better

DeepMake · 4 min read

Deepfake videos are becoming easier than ever to make these days, thanks to software advancements that produce better images. Technology is constantly improving, so we can reasonably expect even more mind-blowing advancements in the near future.

But you don't have to wait around for the next big breakthrough to occur. You can make better deepfakes today with the tools that are already available to you. It's pretty simple, really: Instead of relying on still photos for training data, use the higher-quality data that videos provide to train your machine learning (ML) model.

So what makes video a superior training tool for AI? 

The reason is that video provides a larger volume of diverse images. You need both a large number of images and images that show your subject in diverse views: different angles, different lighting, and different expressions. Video is simply the best way to meet both of these criteria.
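One simple way to capture that diversity, sketched below as a hypothetical helper (not part of any DeepMake or Faceswap tooling), is to spread your sample points evenly across a clip instead of pulling frames from a single scene:

```python
def sample_timestamps(duration_s, n_samples):
    """Evenly spaced timestamps (in seconds) across a clip, so sampled
    frames span different moments -- and therefore different angles,
    lighting, and expressions -- instead of clustering in one scene."""
    step = duration_s / n_samples
    return [round(i * step, 2) for i in range(n_samples)]

# A 60-second clip sampled at 6 points covers the whole clip:
# sample_timestamps(60, 6) -> [0.0, 10.0, 20.0, 30.0, 40.0, 50.0]
```

You could feed these timestamps to any frame-extraction tool; the point is that even spacing, not the specific tool, is what buys you varied views.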

The developers of Faceswap, the open source software that DeepMake grew from, provide guides to Faceswap and best practices for training a Faceswap model. There are several models and many options, but they all recommend using a large amount of training data, preferably several different high-quality videos. As the guide says, "The more data, the more varied, the better." 

The Faceswap training guide recommends 1,000 to 10,000 quality images --- about the same amount a few short videos can provide. Now, let's get real. No one wants to handle 10,000 individual photos. But a few videos? No problem.
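The arithmetic behind that claim is easy to sketch. The frame rate and sampling interval below are illustrative assumptions, not Faceswap recommendations:

```python
def frames_available(duration_s, fps=24, sample_every=4):
    """Training images one clip can yield, keeping every 4th frame so
    consecutive near-duplicate frames don't inflate the count."""
    return (duration_s * fps) // sample_every

# Three two-minute clips at 24 fps, sampled every 4th frame:
# 3 * frames_available(120) -> 3 * 720 = 2160 images, already past
# the low end of the 1,000-to-10,000 range.
```

Even with aggressive subsampling to avoid near-duplicates, a handful of short clips clears the target that would take ages to assemble from individual photos.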

But suppose you do have access to thousands of high-resolution photos. What then? Video is still going to provide superior training data, whether you're mapping people, pets, or landscapes. Video is simply going to provide more complete data for a better end result.

One reason video contains better data is that it includes microexpressions. Microexpressions are short, mostly imperceptible facial expressions that last from about one-fifty-fifth to one-twenty-fourth of a second. If you're trying to create authentic-looking deepfakes, you need to be able to recreate a wide variety of expressions, and video does a good job capturing even the tiniest ones, while photographs typically capture only a very limited range.

A short video clip can provide more than just a wide range of expressions to train your ML model. Videos are far more likely to capture multiple angles, lighting, various motions, and context about your subject, making them a veritable explosion of training data.

Even non-face data benefits from using videos instead of photographs. Would you rather ride in a car that had been trained on a few images of a stretch of road, or one that had been trained with several videos of that same road, taken at different times of day, that showed other cars, bicycles, and pedestrians? Video simply contains a far more expansive and accurate view of our world that provides the AI being trained with what it craves the most --- data on every conceivable variation.

Now, obviously, there are plenty of ways to train ML models using still images and tons of tools out there that use photos as training data. A picture may be worth a thousand words, but it's actually a pretty limited example compared to a video. 

Photos only capture a brief moment in time. Photos are often posed, taken from limited angles and lighting, and only catch a few of the millions of expressions that a person may have. Even with multiple photos available, the sample size remains woefully small. 

The situation gets worse if you're collecting your training images online. Most people who post photos online will cherry-pick images and only post their "best" shots, giving you limited options. This is especially true for celebrities. You'll be hard-pressed to find a photo of a well-known person scratching their nose or frowning.

No matter what your ML model is, video is going to give you better training data and a better result for your AI project.

As you use more convincing training data, you'll end up with a more convincing deepfake. And as deepfakes get more realistic, we want to remind everyone to look at our Ethical Manifesto on the responsible use of AI. We're putting our software out there to help everyone learn and advance AI, and we support development and techniques that further the use of AI in a responsible manner.

AI has a growing importance in society, with more applications from entertainment to education. Yet far too many datasets only use a single image, when a video would be better. The future of AI training lies in using video training data.

Follow us for more information on AI, how it benefits from video, and how it can help your videos.