There are so many applications of generative AI dropping at the moment that it is hard to keep up. One that fascinates me is the latest wave of generative AI that turns still images into video.
Perhaps you’ve seen the Mona Lisa reading Eminem lyrics or Taylor Swift explaining relativity. There are many platforms doing this right now, and many different approaches, but often it’s a two-step process consisting of:
1. Encoding visual features to create a face
2. Managing noise and movement to generate detailed, accurate facial motion
Check out, for example, the below deepfake of Leonardo DiCaprio singing - in Chinese (!) - courtesy of the Loopy platform launched recently by ByteDance (owners of TikTok).
Encoding visual features to create a face
The process would normally start with one main image of a face, sometimes called the ‘reference image’. Ideally the reference image sits alongside multiple other images of that same face, taken at different angles, smiling, frowning etc.
These multiple images give the system important clues as to how to construct a moving version of the face, with eyes, nose, mouth etc. that look like the original face.
Some systems can do a decent job of this with just a single reference image and no alternative angles of the face. Though obviously, the more shots of the face, the better.
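As a rough intuition for how multiple reference shots get combined, here is a toy sketch in Python. It is purely illustrative: real systems use learned deep face encoders, and every function name here is invented. The idea is simply that each view is mapped to an embedding and the embeddings are pooled into one stable identity code.

```python
import numpy as np

def encode_view(image: np.ndarray) -> np.ndarray:
    """Stand-in for a learned face encoder: flattens an image and
    projects it into a small embedding vector. Real systems use
    deep networks trained on millions of faces, not a fixed
    random projection like this."""
    rng = np.random.default_rng(0)  # fixed projection so all views share it
    projection = rng.standard_normal((image.size, 64))
    return image.flatten() @ projection

def identity_embedding(views: list[np.ndarray]) -> np.ndarray:
    """Pool embeddings from several reference views into one identity
    code. Averaging is the simplest possible pooling; the point is that
    more views give a more stable estimate of the face."""
    embeddings = np.stack([encode_view(v) for v in views])
    return embeddings.mean(axis=0)

# Three hypothetical 32x32 grayscale reference shots of the same face
views = [np.random.rand(32, 32) for _ in range(3)]
identity = identity_embedding(views)
print(identity.shape)  # one 64-dimensional identity code
```

The averaging step is why extra angles help: noise specific to any single shot (lighting, expression) washes out, while features common to every shot of that face survive.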
Integrating sound and facial movements
Once a flexible, moving ‘base face’ is created, the next stage involves integrating sound and facial movements.
A selection of audio or a script is provided, and the system subtly manipulates the face to make it seem like it is producing that audio, complete with believable facial reactions to the words being spoken.
These calculations are highly detailed. Even sitting on one expression for a tenth of a second too long can make a face seem unrealistic.
Friend or foe?
The implications of these types of Gen AI technology are not entirely positive. The ability to create deepfakes of politicians, public figures - or even the unpopular kid at school, who can now be cyberbullied in a whole new way - is of real concern.
At the same time, we might not be far from a world where every content creator can imagine and produce their own studio-quality video output - from home.
The results of many of these current platforms are, in my opinion, amazing. And when you look at how far we’ve come in just 12 months, who knows where we will be by the end of 2025.
That’s all from me for now. If you'd like more geeky fun, please check out my other newsletters below, or connect with me on LinkedIn and/or X.
Yours in numbers,
Adam