Bringing Multiple AI Voices to Life with MultiTalk + WAN 2.1

Wan 2.1 Lip Sync in Promptus ComfyUI Just Got a Massive Upgrade!

In this video, we explore the latest evolution of AI-driven talking avatars: the MultiTalk lip-sync model, now fully integrated into the WAN 2.1 video framework, delivering multi-character AI conversations that feel impressively natural.

🚀 What’s New?

  • Multi-audio support for up to four unique AI voices in one video.
  • A sleek workflow combining:
    • Ollama (or other LLMs) to structure conversational scripts (see the request sketch after this list).
    • Chatterbox Dialog TTS, including voice cloning demos with both short and longer samples.
    • Updated WAN Video Wrapper for proper video lip-sync and face animation.
    • Segment Anything + Prepare Mask/Image Groups to isolate characters for accurate visual control.
    • Speaker diarization to identify which AI character is speaking when (a diarization sketch also follows the list).
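
For the script-structuring step, a rough sketch of the Ollama call is shown below. It assumes a local Ollama server on its default port and a pulled llama3 model; the system prompt and JSON shape are illustrative choices for this example, not part of the Promptus workflow itself.

```python
# Hypothetical sketch: ask a local Ollama model to structure a two-host
# dialogue as JSON so each line can be routed to its own TTS voice.
# Assumes an Ollama server on localhost:11434 and a pulled "llama3" model.
import json
import requests

SYSTEM = (
    "Write a short podcast dialogue between HOST_A and HOST_B. "
    "Return only JSON: a list of objects with 'speaker' and 'text' keys."
)

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3",   # assumption: any local chat model will do
        "stream": False,
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": "Topic: open-source AI video tools."},
        ],
    },
    timeout=120,
)
resp.raise_for_status()

# The assistant reply should be a JSON list of dialogue turns; real LLM output
# sometimes needs cleanup before it parses.
turns = json.loads(resp.json()["message"]["content"])
for turn in turns:
    print(f'{turn["speaker"]}: {turn["text"]}')
```

Each `text` entry can then be sent to its own TTS voice (Chatterbox in the video), and each `speaker` label maps back to one character in the scene.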
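
If you are working from a single mixed recording instead of separate tracks, the diarization step labels who speaks when. A minimal sketch with pyannote.audio follows; the exact node used in the video may differ, and the model name plus the Hugging Face token requirement are assumptions about your setup.

```python
# Hypothetical sketch: speaker diarization with pyannote.audio, producing
# (start, end, speaker) turns that tell you which avatar should be animated
# during each stretch of audio.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",  # gated model: accept its terms and pass a HF token
    use_auth_token="hf_xxx",             # placeholder token
)

diarization = pipeline("two_hosts.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:6.1f}s - {turn.end:6.1f}s  {speaker}")
```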

This pipeline runs seamlessly, either locally or on cloud GPUs, empowering creators to build their own AI-hosted podcasts, interviews, and multi-character storytelling right from ComfyUI.

🔍 Real-World Testing & Performance

We don’t just demo; we test. You’ll see a 3‑minute demo using two voices, split cleanly into separate audio tracks. Using the WAN 2.1 framework, the video renders 1,500 frames in chunks of roughly 81 frames each, taking around 19 minutes. You’ll observe the smooth flow of animation, realistic background details, and areas that still need fine-tuning, such as occasional mouth flicker or the idle avatar’s body moving while the other one speaks.
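
For reference, a quick back-of-the-envelope check of those render numbers:

```python
# Sanity-check the quoted figures: 1,500 frames in ~81-frame chunks,
# ~19 minutes total render time.
import math

total_frames = 1500
chunk_size = 81
total_minutes = 19

chunks = math.ceil(total_frames / chunk_size)   # 19 chunks
minutes_per_chunk = total_minutes / chunks      # ~1 minute per chunk
print(f"{chunks} chunks, ~{minutes_per_chunk:.1f} min per chunk")
```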

🎯 Where It Shines (and Where It’s Still Rough)

✅ Pros:

  • Engaging, podcast-style conversational flow
  • Clean separation of voices → cleaner lip-sync
  • Background scene presence
  • Runs offline with GGUF quantization support

⚠️ Cons:

  • Occasionally both avatars move while only one voice is speaking
  • Facial sync not always perfect
  • The system doesn’t always assign mask-to-audio correctly

One clear takeaway? There’s a strong need for future improvements such as explicit mask-to-audio assignment in multi-character setups, similar to the approach taken by Kling AI or Runway.
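
To make that concrete, here is a purely hypothetical sketch of what explicit mask-to-audio assignment could look like; none of these names come from MultiTalk or the WAN Video Wrapper, they simply illustrate declaring the mapping up front instead of letting the sampler guess.

```python
# Hypothetical illustration of explicit mask-to-audio assignment: each
# character mask is bound to exactly one audio track before rendering.
from dataclasses import dataclass

@dataclass
class CharacterTrack:
    name: str        # character label from the segmentation step
    mask_index: int  # which Segment Anything mask isolates this character
    audio_path: str  # the TTS track that should drive this character's mouth

assignments = [
    CharacterTrack("host_a", mask_index=0, audio_path="host_a.wav"),
    CharacterTrack("host_b", mask_index=1, audio_path="host_b.wav"),
]

# Fail early if two voices point at the same mask, rather than rendering a
# clip where both avatars mouth the same line.
mask_ids = [t.mask_index for t in assignments]
assert len(mask_ids) == len(set(mask_ids)), "each mask needs exactly one audio track"

for t in assignments:
    print(f"mask {t.mask_index} ({t.name}) <- {t.audio_path}")
```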

🛒 Interested? Try Promptus with a One-Time Lifetime License

This full pipeline is built into the Promptus platform, which offers powerful ComfyUI-based tools for AI image, video, and audio generation. Best of all, there's a one-time purchase option ($49 lifetime license) without monthly fees. That single payment grants:

  • Full offline access to WAN 2.1 and all Cosyflows
  • 10,000 bonus credits
  • ComfyUI interface
  • Lifetime updates and support

Or, there are optional subscription tiers for creators needing ongoing cloud/GPU access: http://promptus.ai/pricing

🔧 Who Should Watch This?

  • AI developers building multi-character avatars
  • Content creators exploring AI-hosted podcasts or interviews
  • Technologists mastering Ollama, TTS, segmentation, and lip-sync models

This tutorial isn't just a walkthrough — it's a realistic look at the pipeline's strengths and weaknesses, giving you the tools to decide if this is right for your next project.

🔗 Resources & Links

Dive in, experiment, and let us know how it goes!

Written by:
Eden
A trained artist who once feared AI art might end her career, Eden has since embraced it as a powerful ally. Now, she confidently creates with AI, blending tradition and technology in her work.
Try Promptus Cosy UI today for free and create your next AI workflow with Promptus ➜