HappyHorse 1.0: the first video model that nails sync audio

Alibaba’s HappyHorse 1.0 shipped on Thursday. By the end of the day it was at the top of the public video leaderboards, and by the end of the week the only conversation in our group chat was about how the audio actually works. We’re calling it: this is the first generally available text-to-video model where the sound matches the picture by design, not by retrofitted lip-sync. That changes the lineup. It also changes the kind of work you can credibly ship out of a generative pipeline.

The audio-sync problem, briefly

Most “text-to-video with audio” up to now has been two pipelines stapled together. The visual model generates a clip. A separate audio model generates sound. A third stage tries to align them — match the dog bark to the dog’s mouth, the door slam to the door closing, the music swell to the camera move. The seams show, especially on dialogue. You see a person speaking. Their lip shapes don’t match the syllables. The brain notices instantly.

The fix isn’t better lip-sync software. It’s training the visual and audio streams together, so the model reasons about sound and picture as one signal. That’s hard for both practical reasons (the training data is messier and more expensive to curate) and architectural reasons (you need cross-modal attention that doesn’t degrade either stream). HappyHorse 1.0 is the first model to ship that joint approach at production quality.

What it actually does well

A week into using it on real briefs, the wins are concrete:

Dialogue. Lip shapes match the actual syllables, not just generic mouth movement. Stress on the right word lines up with an eyebrow lift. Eye-contact patterns match the cadence of the line. This is the part that felt like science fiction in our first test prompt and is now the part we lean on hardest.

Foley. Footsteps land on the frames where the foot lands. Doors slam on the frame where they close. Pour a glass of water in a clip and the water sound starts when the lip of the pitcher tips, not 200 ms late.

Score. When the prompt asks for music, the music has structure that matches the visual structure — swells on motion, hits on cuts. It’s not at “professional composer” level, but it’s at “competent indie scene-setter” level, which is far better than any other generative audio we’ve integrated.
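
To put the 200 ms figure from the Foley note in perspective, a lag can be converted into frames. This is a minimal sketch: the frame rates are assumptions chosen for illustration (we haven’t stated what HappyHorse renders at), and `offset_in_frames` is a hypothetical helper, not part of any tool.

```python
def offset_in_frames(offset_ms: float, fps: float) -> float:
    """Convert an audio lag in milliseconds to a (fractional) frame count."""
    return offset_ms / 1000.0 * fps


# Frame rates below are illustrative assumptions, not HappyHorse specifics.
for fps in (24, 30):
    frames = offset_in_frames(200, fps)
    print(f"{fps} fps: 200 ms late = {frames:.1f} frames")
```

At either rate, a 200 ms lag works out to roughly five or six frames between the picture and its sound.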

What still needs another generation

The model isn’t magic. Three areas where we’re still patching by hand:

Music identity. It can compose, but it can’t reproduce a specific artist or track. If your brand has a sonic signature, you’re still laying that in post.
Sound-effects libraries. For very specific named effects (a Star Wars-style blaster, a sitcom laugh track), you’re better off generating silent video and dropping the effect on the timeline.
Long clips. Like every video model in 2026, HappyHorse starts to drift once a clip runs past the 15-second mark. For multi-shot sequences, you still cut.
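
For the long-clip limit, planning the cuts is simple arithmetic. Here is a minimal sketch, assuming you want the fewest equal-length shots that each stay at or under 15 seconds. `plan_shots` and `MAX_SHOT_SECONDS` are hypothetical names for illustration, not a studio feature.

```python
import math
from typing import List

MAX_SHOT_SECONDS = 15.0  # the drift threshold quoted above


def plan_shots(total_seconds: float, max_shot: float = MAX_SHOT_SECONDS) -> List[float]:
    """Split a sequence into the fewest equal-length shots of at most max_shot seconds."""
    if total_seconds <= 0:
        return []
    count = math.ceil(total_seconds / max_shot)
    return [total_seconds / count] * count


print(plan_shots(30))  # two 15-second shots
print(plan_shots(40))  # three shots of about 13.3 seconds each
```

Equal-length shots are only one choice. In practice you would more likely cut on natural scene boundaries and just check that no shot exceeds the limit.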

What it makes shippable

The bigger story is the kind of work that’s now feasible without a post-production stage. A few examples from real use this week:

A 12-second product explainer with voice-over lip-synced to a person holding the product. Previously: two days, three tools, one freelance animator. With HappyHorse: forty minutes, one prompt with reference images, one round of edits.
A 30-second ad cut with diegetic sound (footsteps, ambient room noise, a phone notification on cue). Previously: video tool, audio tool, alignment pass, mixing pass. With HappyHorse: render twice, pick the better take.
A character-talking-to-camera reel for a brand spokesperson. Previously: this didn’t work, because the lip-sync from prior models was uncanny enough to kill the take. With HappyHorse: usable on the first or second generation about 70% of the time.

Where it sits in the recommendation logic

Effective today in the studio: HappyHorse 1.0 is the default for video generations when your prompt requires audio. The picker reads “audio implied” from prompts that include words like dialogue, says, talks, speaks, music, sound of, foley, or VO, and routes there unless you override.
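
The “audio implied” check can be pictured as a keyword match. This is a rough sketch, not the picker’s actual code: the keyword list comes from the paragraph above, but the whole-word, case-insensitive matching, the function names, and the `override` parameter are all assumptions for illustration.

```python
import re
from typing import Optional

# Trigger words listed above; the matching rules are an assumption.
AUDIO_KEYWORDS = ["dialogue", "says", "talks", "speaks", "music", "sound of", "foley", "VO"]

# Whole-word, case-insensitive match so "essays" does not trigger on "says".
_AUDIO_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in AUDIO_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def implies_audio(prompt: str) -> bool:
    """Return True if the prompt contains any audio trigger word."""
    return _AUDIO_PATTERN.search(prompt) is not None


def pick_model(prompt: str, override: Optional[str] = None) -> Optional[str]:
    """Hypothetical picker: an explicit override wins, then the audio check."""
    if override:
        return override
    # None means "no audio implied"; other routing rules would take over.
    return "HappyHorse 1.0" if implies_audio(prompt) else None


print(pick_model("A barista says good morning to the camera"))
print(pick_model("A barista says good morning", override="Veo 3.1"))
print(pick_model("Slow aerial shot over sand dunes"))
```

A plain keyword list will miss prompts that imply sound without naming it, which is one reason the manual override matters.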

For cinematic 4K with audio added in post: Veo 3.1 still wins on visuals. For social drafts: Grok Imagine still wins on speed and feel. For reference-driven sequences: Seedance 2.0 still wins on consistency. HappyHorse owns the joint audio-video slot specifically, and that slot is suddenly the most useful one in the lineup, because so much production work happens right where audio and video meet.
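
Taken together, the current recommendations amount to a small lookup table. The sketch below restates them. The `UseCase` names and the `recommend` function are hypothetical labels for illustration, not the studio’s internal API.

```python
from enum import Enum


class UseCase(Enum):
    JOINT_AUDIO_VIDEO = "joint audio-video"
    CINEMATIC_4K = "cinematic 4K, audio added in post"
    SOCIAL_DRAFT = "social draft"
    REFERENCE_DRIVEN = "reference-driven sequence"


# Model names are the ones recommended above; the mapping structure is assumed.
RECOMMENDED = {
    UseCase.JOINT_AUDIO_VIDEO: "HappyHorse 1.0",
    UseCase.CINEMATIC_4K: "Veo 3.1",
    UseCase.SOCIAL_DRAFT: "Grok Imagine",
    UseCase.REFERENCE_DRIVEN: "Seedance 2.0",
}


def recommend(use_case: UseCase) -> str:
    """Look up the currently recommended model for a use case."""
    return RECOMMENDED[use_case]


for case in UseCase:
    print(f"{case.value}: {recommend(case)}")
```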

A note on the leaderboard

HappyHorse 1.0 also leads the visual-only benchmarks this week, which is real but not the headline. The top spot on the visual leaderboards changes hands roughly every six weeks. The audio-video integration is the structural shift, the thing the rest of the field will spend the next year catching up to. We’re glad it’s already in the studio at the same shared credit cost as anything else.