
Talking Video

Written by LX

Talking Video turns a face and a voice into a speaking video. It combines two tools in one workstation tab — Talking avatar (make a still image speak) and Lip sync (re-sync a video's mouth to new audio) — so you can pick whichever fits the footage you already have.

What is Talking Video?

It's the workstation mode for making someone (or something with a face) appear to talk. You bring a face and an audio track, and the app drives the mouth, expression, and timing to match the voice. Depending on what you start with, you use one of two sub-modes:

  • Talking avatar — start from a single still image and an audio track. The app animates the still so it speaks.

  • Lip sync — start from an existing video and an audio track. The app keeps your video but replaces the mouth movements so they match the new audio.

How do I choose between Talking avatar and Lip sync?

It comes down to what footage you have:

  • If you only have one image of the face, use Talking avatar.

  • If you already have a video of the person (or character) and just want the lips to match a different voice-over, use Lip sync.

The app nudges you toward the right one: in Talking avatar you'll see a prompt that says "For video-driven talking avatars, use Lip sync," and in Lip sync you'll see "If you only have 1 image, use Talking avatar instead." Both prompts double as one-tap switches between the two.

How do I switch between the two modes?

When both modes are available, a small toggle appears at the top of the workstation with Talking avatar and Lip sync chips. Tap the one you want and the inputs below change to match. You can also switch using the inline "use Lip sync" / "use Talking avatar" link shown under the image or video area. Switching keeps you in the same workstation; only the required inputs change.

What do I need to provide?

In both modes you provide a face source and an audio source:

  • Talking avatar needs a reference image plus audio.

  • Lip sync needs a reference video plus audio.

For the audio, you can either upload your own file or generate a voice with text-to-speech right inside the uploader. See Generate audio (text-to-speech) for how the voice generator works.

Does the voice's language matter?

Yes — match the voice to the language of your script. When the spoken words and the voice are in the same language, the mouth movements and accent come out more natural; a mismatch can make the lip motion look off. If you're generating the voice with text-to-speech, pick a voice that fits the language you wrote in. See Generate audio (text-to-speech).

How much does it cost?

Both modes are priced per second of the audio you provide, so a longer voice track costs more. The per-second rate is shown next to the Generate button, along with the total estimate based on your audio length, before you commit. The exact rate depends on your plan, the mode, and the quality/resolution you pick.
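As a rough illustration of per-second pricing, the estimate is simply the audio length multiplied by the rate. The sketch below is not the app's actual pricing code: the function name, the "credits" unit, and the rate of 2 are made-up assumptions, so always check the real figure shown next to the Generate button.

```python
def estimate_cost(audio_seconds: float, rate_per_second: float) -> float:
    """Estimate the total as audio length times the per-second rate.

    Both the rate and its unit are hypothetical; the real rate depends on
    your plan, the mode, and the quality/resolution you pick.
    """
    if audio_seconds < 0 or rate_per_second < 0:
        raise ValueError("audio length and rate must be non-negative")
    return audio_seconds * rate_per_second


# A 45-second voice track at a hypothetical rate of 2 credits per second.
print(estimate_cost(45, 2))  # 90

# Doubling the audio length doubles the estimate.
print(estimate_cost(90, 2))  # 180
```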

Why is the Generate button greyed out?

Generate stays disabled until every required input is ready. Common reasons:

  • You haven't added an audio source yet, or the audio is still uploading or failed to upload.

  • (Talking avatar) You haven't added a reference image, or the image is still uploading or failed to upload.

  • (Lip sync) You haven't added a reference video, or the video is still uploading or failed to upload.

  • Your audio is longer than the mode's limit (3 minutes for Talking avatar, 15 minutes for Lip sync).

Once the face source and audio are both uploaded and within limits, Generate lights up.
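The checklist above can be read as a simple rule: Generate is enabled only when nothing blocks it. The sketch below is one way to express that rule, not the app's actual implementation. The mode keys, the status strings ("missing", "uploading", "failed", "ready"), and the function name are hypothetical; only the 3-minute and 15-minute limits come from this article.

```python
from typing import List, Optional

# Audio limits from the list above, in seconds.
MAX_AUDIO_SECONDS = {"talking_avatar": 3 * 60, "lip_sync": 15 * 60}

# The face source each mode requires.
FACE_SOURCE = {"talking_avatar": "reference image", "lip_sync": "reference video"}


def blocking_reasons(mode: str, face_status: str, audio_status: str,
                     audio_seconds: Optional[float] = None) -> List[str]:
    """Return why Generate would stay disabled; an empty list means enabled.

    Status strings are hypothetical stand-ins for the upload states.
    """
    if mode not in MAX_AUDIO_SECONDS:
        raise ValueError(f"unknown mode: {mode}")
    reasons = []
    # Both the face source and the audio must have finished uploading.
    for label, status in ((FACE_SOURCE[mode], face_status), ("audio", audio_status)):
        if status != "ready":
            reasons.append(f"{label} is {status}")
    # The audio must also fit within the mode's length limit.
    limit = MAX_AUDIO_SECONDS[mode]
    if audio_seconds is not None and audio_seconds > limit:
        reasons.append(f"audio is longer than {limit // 60} minutes")
    return reasons


# A 200-second track is too long for Talking avatar but fine for Lip sync.
print(blocking_reasons("talking_avatar", "ready", "ready", 200))
print(blocking_reasons("lip_sync", "ready", "ready", 200))
```

If your audio is over the Talking avatar limit, the same track may still be within the Lip sync limit, provided you have a video of the face to start from.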

On mobile

Both modes work on a phone. The mode toggle sits in a thin scrollable bar at the top of the workstation, and the image/video and audio inputs stack vertically below it. The quality (Talking avatar) and resolution (Lip sync) controls live in the bottom bar rather than inline. Everything else — choosing a face source, adding audio, and the per-second pricing — behaves the same as on desktop.
