IOAI 2024 Practical · Short music video

Contest: IOAI 2024 (Bulgaria) · Round: Practical, on-site (4 h) · Category: Text/image-to-video generation, editing.

Official source: ioai-official.org/2024-tasks.

1. Problem restatement

Produce a short music video (typically 30–60 seconds) for the assigned track using existing generative-AI tools. The visual style must match the album cover from the same task set. The audio must be the actual song, muxed into the deliverable. The video has to cut to the music: shot changes that ignore the beat draw an immediate jury deduction.

Source. Paraphrased from the IOAI 2024 Practical Round description on ioai-official.org. Exact length and encoding requirements are [verify against on-site materials].

2. What's being tested

Image-to-video pipeline literacy. Generating keyframes is the easy part. Animating them coherently is hard.
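Beat alignment. Cutting on downbeats. At 120 bpm a beat lasts 0.5 s, so a 30-second video spans 60 beats: six equal shots means a cut every 10 beats (5 s), while a cut every 4 beats (one bar of 4/4) gives fifteen 2-second shots. It is math you do once and lock in.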
Style continuity with the cover. Same palette, same grade, same era.
Render-time management. Video diffusion at 1080p costs minutes per second. You will not iterate as you do on images.
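The beat arithmetic is worth sanity-checking once before storyboarding. A minimal sketch follows; the function name `shot_plan` and its parameters are illustrative, not part of any library.

```python
def shot_plan(duration_s: float, bpm: float, beats_per_shot: int) -> dict:
    """Relate tempo, video length and cut spacing (pure arithmetic)."""
    beat_s = 60.0 / bpm                  # seconds per beat
    total_beats = duration_s / beat_s    # beats in the whole video
    shot_s = beats_per_shot * beat_s     # length of one shot in seconds
    return {
        "beat_s": beat_s,
        "total_beats": total_beats,
        "shot_s": shot_s,
        "n_shots": int(total_beats // beats_per_shot),
    }


per_bar = shot_plan(30, 120, 4)     # one cut per 4/4 bar
six_shots = shot_plan(30, 120, 10)  # six equal shots
```

At 120 bpm a cut on every bar gives 15 two-second shots in 30 seconds, while six equal shots need 10 beats (5 s) each.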

3. Data exploration / setup

Pull the beat grid first — every cut decision flows from this:

import librosa
y, sr = librosa.load("audio.mp3", sr=22050)
tempo, beats = librosa.beat.beat_track(y=y, sr=sr, units="time")
print("tempo:", tempo, "beat 0..5:", beats[:6])

# decide cut points: every 4 beats (assumes 4/4 and that beats[0] falls on a downbeat)
cuts = beats[::4]
print("cuts:", cuts[:8])

Set up image-to-video. Stable Video Diffusion (SVD) and AnimateDiff are the open options in 2024:

from diffusers import StableVideoDiffusionPipeline
import torch

svd = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16).to("cuda")

4. Baseline approach

Generate 6 keyframe images from your storyboard (SDXL, same seed, same prompt template, same palette fragments), pass each through SVD for a roughly 3-second motion clip (25 frames at 8 fps), concatenate with ffmpeg, mux audio. No tricks: this is your fallback submission.

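from diffusers.utils import export_to_video
import subprocess
import torch

# Assumed to exist already: `pipe` (an SDXL text-to-image pipeline), `svd` (loaded
# in section 3) and `storyboard_prompts` (a list of 6 prompt strings).
SEED = 0  # one seed for every keyframe, as the recipe above calls for

# 1. one keyframe per storyboard frame, at 1024x576 to match the SVD input size
keyframes = [pipe(p, height=576, width=1024,
                  generator=torch.Generator("cuda").manual_seed(SEED)).images[0]
             for p in storyboard_prompts]

# 2. animate each: 25 frames at 8 fps is about 3.1 s per clip
clips = []
for i, kf in enumerate(keyframes):
    frames = svd(kf, num_frames=25, motion_bucket_id=120).frames[0]
    out = f"clip_{i}.mp4"
    export_to_video(frames, out, fps=8)
    clips.append(out)

# 3. assemble with ffmpeg + audio
with open("list.txt", "w") as f:
    for c in clips:
        f.write(f"file '{c}'\n")
subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", "list.txt",
                "-i", "audio.mp3",
                "-map", "0:v:0", "-map", "1:a:0",  # video from the clips, audio from the song
                "-c:v", "libx264", "-pix_fmt", "yuv420p", "-c:a", "aac",
                "-shortest", "video.mp4"], check=True)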

Baseline result: roughly 19-second video (6 clips × 25 frames at 8 fps), 6 shots, no beat alignment. Submittable but mid-tier. [illustrative]

5. Improvements that move the needle

5.1 · Trim each clip to a beat-aligned cut point

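SVD produces fixed-length clips; the trim is where rhythm lives. Decide cut times from librosa.beat.beat_track, then in ffmpeg use -ss and -t to extract exactly (cut[i+1] − cut[i]) seconds of motion from each clip. Re-encode rather than stream-copy when trimming, since -c copy can only cut on keyframes, and check that each interval is no longer than the clip you generated. The hardest cut of the video should land on the song's biggest dynamic change.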
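The trim step can be sketched as a small command builder. This is only an illustration: `trim_commands`, the `trim_{i}.mp4` output names and the demo cut times are made up for the example, and the commands are returned as lists so you can inspect them before passing each one to `subprocess.run`.

```python
def trim_commands(cuts, clips, clip_len_s):
    """Build one ffmpeg command per shot; shot i lasts cuts[i+1] - cuts[i] seconds."""
    if len(cuts) < len(clips) + 1:
        raise ValueError("need one more cut time than clips")
    cmds = []
    for i, clip in enumerate(clips):
        dur = float(cuts[i + 1] - cuts[i])
        if dur > clip_len_s:
            raise ValueError(
                f"shot {i} needs {dur:.2f}s but the clip is only {clip_len_s:.2f}s")
        cmds.append([
            "ffmpeg", "-y", "-i", clip,
            "-ss", "0", "-t", f"{dur:.3f}",  # keep the first `dur` seconds
            "-c:v", "libx264", "-an",         # re-encode video, drop any audio
            f"trim_{i}.mp4",
        ])
    return cmds


# Demo values: 120 bpm, a cut every 4 beats, 25-frame clips exported at 8 fps.
demo_cuts = [0.0, 2.0, 4.0, 6.0]
demo_clips = ["clip_0.mp4", "clip_1.mp4", "clip_2.mp4"]
demo_cmds = trim_commands(demo_cuts, demo_clips, clip_len_s=25 / 8)
```

Raising an error when a shot is longer than its source clip is a deliberate choice here: it surfaces the problem while there is still time to regenerate or re-plan the cut.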

5.2 · Use motion_bucket_id to vary energy

SVD has a motion_bucket_id parameter (1–255). Use ~80 for slow-burn intro shots and ~180 for high-energy chorus shots. Mismatched motion energy between song and video reads as amateurism.
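One way to pick the value per shot is to let the audio decide. The sketch below maps each shot's RMS loudness linearly onto the 80–180 range mentioned above. Using loudness as a proxy for "energy" is an assumption of this example, not something the task prescribes, and `motion_buckets` is a hypothetical helper name.

```python
import numpy as np


def motion_buckets(y, sr, cuts, lo=80, hi=180):
    """Map the RMS loudness of each shot (between consecutive cuts) onto [lo, hi]."""
    rms = []
    for start, end in zip(cuts[:-1], cuts[1:]):
        seg = y[int(start * sr):int(end * sr)]
        rms.append(float(np.sqrt(np.mean(seg ** 2))) if len(seg) else 0.0)
    rms = np.asarray(rms)
    span = rms.max() - rms.min()
    if span == 0:  # flat loudness: use the midpoint for every shot
        return [int(round((lo + hi) / 2))] * len(rms)
    scaled = lo + (rms - rms.min()) / span * (hi - lo)
    return [int(round(v)) for v in scaled]


# Synthetic check: a quiet second, a loud second, then a medium second.
demo_sr = 100
demo_y = np.concatenate([0.1 * np.ones(100), 1.0 * np.ones(100), 0.55 * np.ones(100)])
demo_buckets = motion_buckets(demo_y, demo_sr, cuts=[0.0, 1.0, 2.0, 3.0])
```

With real audio, pass the `y` and `sr` returned by `librosa.load` and your beat-aligned cut times; each resulting integer goes into the `motion_bucket_id` argument for the corresponding shot.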

5.3 · Add a single grade pass across the whole video

After concatenation, apply an ffmpeg curve / LUT that matches the cover's grade. Even a one-line -vf "curves=preset=medium_contrast,colorchannelmixer..." filter forces consistency across clips generated with different seeds. The curves filter's built-in presets include medium_contrast, strong_contrast and vintage.
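A single grade pass can be expressed as one ffmpeg command. The sketch below is a hypothetical builder (`grade_command` and the file names are illustrative) that uses either a `curves` preset plus a saturation tweak via `eq`, or a 3D LUT via `lut3d` if you have a .cube file matching the cover. The specific preset and saturation values are placeholders to tune by eye.

```python
def grade_command(src, dst, preset="medium_contrast", saturation=0.9, lut_path=None):
    """Build an ffmpeg command that applies one grade to the whole video."""
    if lut_path is not None:
        vf = f"lut3d=file={lut_path}"  # grade from a .cube LUT
    else:
        vf = f"curves=preset={preset},eq=saturation={saturation}"
    return [
        "ffmpeg", "-y", "-i", src,
        "-vf", vf,
        "-c:v", "libx264",  # filtering requires re-encoding the video
        "-c:a", "copy",     # audio is untouched
        dst,
    ]


demo_grade = grade_command("video.mp4", "video_graded.mp4")
demo_lut = grade_command("video.mp4", "video_graded.mp4", lut_path="cover.cube")
```

Grading after concatenation, rather than per clip, guarantees that every shot passes through the identical filter chain.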

5.4 · Hold the first frame for one beat at the start

Hold your hero image as a freeze frame for about one beat (60 / bpm seconds, so 0.5 s at 120 bpm) starting at the song's first downbeat, before the first cut. It establishes the visual world, mirrors how real music videos open, and costs nothing.

5.5 · End on the cover

The final frame of the video should be the album cover (or a near-identical recomposition). This closes the loop on jury scoring — they see the two deliverables converge.
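Both the opening hold (5.4) and the closing cover frame (5.5) come down to turning a still image into a short clip. A minimal sketch, assuming hypothetical file names (`hero.png`, `cover.png`) and a 1920x1080, 24 fps target; `still_clip_command` and `beat_seconds` are illustrative helpers.

```python
def beat_seconds(bpm: float) -> float:
    """Duration of one beat in seconds."""
    return 60.0 / bpm


def still_clip_command(image, out, seconds, fps=24, size="1920x1080"):
    """Build an ffmpeg command that loops one image into a clip of `seconds` length."""
    w, h = size.split("x")
    return [
        "ffmpeg", "-y",
        "-loop", "1", "-framerate", str(fps), "-t", f"{seconds:.3f}", "-i", image,
        "-vf", f"scale={w}:{h},format=yuv420p",
        "-c:v", "libx264", "-r", str(fps),
        out,
    ]


# One-beat hold at 120 bpm for the opening, a two-beat hold on the cover to close.
intro_cmd = still_clip_command("hero.png", "intro_hold.mp4", beat_seconds(120))
outro_cmd = still_clip_command("cover.png", "outro_cover.mp4", 2 * beat_seconds(120))
```

If these stills are joined to the motion clips with the concat demuxer, all clips need the same resolution, frame rate and codec settings, so encode the motion clips to the same `size` and `fps` first.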

6. Submission format & gotchas

H.264 in MP4, 1920×1080 or 1080×1920 (vertical), 24 fps minimum, audio AAC stereo at 192 kbps.
Mux audio first, then watch it back in a real browser — bad audio sync is the most common jury-reported issue.
Don't submit a ProRes / mov master; jury machines may not have the codec.
SVD watermarks (a small Stability AI logo) are sometimes embedded — crop or paint them out before submission. [verify on current SVD release]
Render time math: SVD at 1024×576 ≈ 30 s per 25-frame clip on an A100. On an L4 it doubles. Budget accordingly.
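The format checklist above can be verified mechanically before you hand in the file. The sketch below reads stream metadata with ffprobe and compares it against the values listed in this section. The thresholds mirror that list and should be adjusted to whatever the on-site materials actually require; `probe` and `spec_problems` are hypothetical helper names.

```python
import json
import subprocess

ALLOWED_SIZES = {(1920, 1080), (1080, 1920)}


def probe(path):
    """Run ffprobe and return its JSON stream description as a dict."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_streams", "-of", "json", path],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def spec_problems(info, min_fps=24.0):
    """Return a list of human-readable spec violations (empty if none found)."""
    problems = []
    streams = info.get("streams", [])
    video = [s for s in streams if s.get("codec_type") == "video"]
    audio = [s for s in streams if s.get("codec_type") == "audio"]
    if not video:
        problems.append("no video stream")
    else:
        v = video[0]
        if v.get("codec_name") != "h264":
            problems.append(f"video codec is {v.get('codec_name')}, expected h264")
        if (v.get("width"), v.get("height")) not in ALLOWED_SIZES:
            problems.append(f"resolution is {v.get('width')}x{v.get('height')}")
        num, den = (int(x) for x in v.get("avg_frame_rate", "0/1").split("/"))
        fps = num / den if den else 0.0
        if fps < min_fps:
            problems.append(f"frame rate is {fps:.2f} fps, below {min_fps}")
    if not audio:
        problems.append("no audio stream")
    else:
        a = audio[0]
        if a.get("codec_name") != "aac" or a.get("channels") != 2:
            problems.append("audio is not stereo AAC")
    return problems
```

Typical use is `spec_problems(probe("video.mp4"))`; an empty list means the file passes these particular checks, though it does not replace watching the video back for audio sync.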

7. What top solutions did

Among the screened IOAI 2024 final videos, the strongest had: (1) sub-30-second runtime, no padding; (2) cuts landing exclusively on downbeats; (3) consistent character across shots (achieved via fixed-seed SDXL keyframes); (4) a clear narrative arc — three acts, even at 30-second scale; (5) final frame echoing the cover. Tool diversity was high — some teams used Runway Gen-2, others SVD, others AnimateDiff — but discipline was uniform. [illustrative]

8. Drill

D1 · Your video has 6 shots but feels random. Beat-align it.

Run librosa.beat.beat_track to get all beat times. In a 4/4 song downbeats fall every 4 beats, but beat_track does not say which beat is beat 1, so pick the offset (0–3) by ear or from the first strong kick. Find the song's longest gap between drum hits: that's where one of your cuts should land for the dramatic pause. The cut after a chorus should land on the first beat of the next section, not somewhere in the middle. Map: shot 1 = intro (long), shots 2–3 = verse, shot 4 = chorus hit (1 beat hold + motion), shot 5 = verse 2, shot 6 = outro freeze to cover. Then trim every clip in ffmpeg to the precise beat-aligned duration.
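The mapping from beats to cut times can be sketched with plain NumPy. Everything here is illustrative: the helper names, the `offset` you would choose by ear, and the bars-per-shot plan are assumptions for the example, and the demo uses a synthetic 120 bpm grid instead of real beat_track output.

```python
import numpy as np


def downbeats(beats, offset=0, beats_per_bar=4):
    """Keep every `beats_per_bar`-th beat, starting at index `offset`."""
    return np.asarray(beats)[offset::beats_per_bar]


def cuts_from_bars(bar_times, bars_per_shot):
    """Turn a list of shot lengths in bars into cut times (one more than shots)."""
    idx = np.concatenate([[0], np.cumsum(bars_per_shot)])
    if idx[-1] >= len(bar_times):
        raise ValueError("shot plan is longer than the available bars")
    return np.asarray(bar_times)[idx]


def longest_gap(onsets):
    """Return (start, end) of the longest silence between consecutive onsets."""
    onsets = np.asarray(onsets)
    i = int(np.argmax(np.diff(onsets)))
    return float(onsets[i]), float(onsets[i + 1])


# Synthetic 120 bpm beat grid: one beat every 0.5 s for 40 s.
demo_beats = np.arange(0, 40, 0.5)
demo_bars = downbeats(demo_beats, offset=0)
# Hypothetical plan: intro, verse, verse, chorus hit, verse 2, outro.
demo_cuts = cuts_from_bars(demo_bars, bars_per_shot=[4, 2, 2, 1, 2, 1])
demo_gap = longest_gap([0.0, 0.5, 1.0, 2.5, 3.0])
```

With real audio, `demo_beats` would come from beat_track with `units="time"`, and the onset list for `longest_gap` from an onset detector such as `librosa.onset.onset_detect`.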

D2 · Your hero character changes face between shots. How do you keep them consistent?

Three options, in order of speed: (a) use the same SDXL seed across all keyframe prompts — cheap but only works for similar framings; (b) generate one strong "character sheet" image and use img2img with strength=0.5 from that reference for every keyframe; (c) train a quick DreamBooth / LoRA on the character (~10 minutes) and use it across all generations. For a 4-hour round, option (b) is the right speed-quality trade-off.
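Option (b) can be sketched as a loop over prompts that always starts from the same reference image. The sketch assumes `pipe` behaves like diffusers' StableDiffusionXLImg2ImgPipeline (called with `prompt`, `image`, `strength` and `generator`, returning an object with `.images`) and that it is loaded elsewhere; `keyframes_from_reference` is a hypothetical helper name.

```python
def keyframes_from_reference(pipe, reference, prompts, strength=0.5, seed=None):
    """Generate one keyframe per prompt, each via img2img from the same reference.

    `pipe` is assumed to follow the diffusers img2img calling convention.
    """
    images = []
    for prompt in prompts:
        generator = None
        if seed is not None:
            import torch  # only needed when a fixed seed is requested
            # Fresh generator per prompt so every keyframe starts from the same noise.
            generator = torch.Generator("cpu").manual_seed(seed)
        result = pipe(prompt=prompt, image=reference,
                      strength=strength, generator=generator)
        images.append(result.images[0])
    return images
```

Lower `strength` keeps the keyframe closer to the character sheet, while higher values give the prompt more freedom; 0.5 is the starting point the drill suggests.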
