IOAI 2024 Practical · Short music video
Contest: IOAI 2024 (Bulgaria) · Round: Practical, on-site (4 h) · Category: Text/image-to-video generation, editing.
Official source: ioai-official.org/2024-tasks.
1. Problem restatement
Produce a short music video (typically 30–60 seconds) for the assigned track using existing generative-AI tools. Visual style must match the album cover (same set). Audio must be the actual song, muxed into the deliverable. The video has to cut to the music — random shot changes that ignore the beat are an immediate jury deduction.
2. What's being tested
- Image-to-video pipeline literacy. Generating keyframes is the easy part. Animating them coherently is hard.
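- Beat alignment. Cutting on downbeats. A 30-second video at 120 bpm spans 60 beats (15 bars of 4/4), so six shots average 10 beats each, while a cut every 4 beats (one bar, 2 s) would give 15 shots. This is math you do once and lock in.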
- Style continuity with the cover. Same palette, same grade, same era.
- Render-time management. Video diffusion at 1080p can cost minutes of render time per second of footage. You will not iterate as you do on images.
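The beat arithmetic is worth checking in code before committing to a shot count. The following is a minimal sketch assuming a constant tempo; `beats_per_shot` and `shot_seconds` are hypothetical helper names, not part of any library.

```python
def beats_per_shot(duration_s: float, bpm: float, shots: int) -> float:
    """Average number of beats each shot lasts at a constant tempo."""
    total_beats = duration_s * bpm / 60.0
    return total_beats / shots


def shot_seconds(beats: float, bpm: float) -> float:
    """Length in seconds of a shot that lasts `beats` beats."""
    return beats * 60.0 / bpm


print(beats_per_shot(30, 120, 6))  # 10.0 beats per shot on average
print(shot_seconds(4, 120))        # 2.0 seconds for a one-bar shot in 4/4
```

If the average is not a whole number of bars, mixing shot lengths (for example 2-bar and 3-bar shots) keeps every cut on a downbeat.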
3. Data exploration / setup
Pull the beat grid first — every cut decision flows from this:
import librosa
y, sr = librosa.load("audio.mp3", sr=22050)
tempo, beats = librosa.beat.beat_track(y=y, sr=sr, units="time")
print("tempo:", tempo, "beat 0..5:", beats[:6])
# decide cut points: every 4 beats (assumes beats[0] falls on a downbeat)
cuts = beats[::4]
print("cuts:", cuts[:8])
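The beat tracker does not say which beat is the downbeat, so the slice may need a phase offset. A hedged sketch of that selection follows; `pick_cuts` and its `first_downbeat` parameter are hypothetical, and the offset is assumed to be found by ear. The synthetic grid stands in for the `beats` array returned by librosa.

```python
def pick_cuts(beat_times, beats_per_cut=4, first_downbeat=0, max_cuts=None):
    """Return every `beats_per_cut`-th beat time, starting at index `first_downbeat`."""
    cuts = list(beat_times[first_downbeat::beats_per_cut])
    return cuts if max_cuts is None else cuts[:max_cuts]


# Synthetic 120 bpm grid (one beat every 0.5 s) standing in for librosa output.
beats = [0.25 + 0.5 * i for i in range(32)]

# Suppose listening shows the second detected beat is the real downbeat.
print(pick_cuts(beats, first_downbeat=1, max_cuts=4))  # [0.75, 2.75, 4.75, 6.75]
```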
Set up image-to-video. Stable Video Diffusion (SVD) and AnimateDiff are the open options in 2024:
from diffusers import StableVideoDiffusionPipeline
import torch
svd = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16).to("cuda")
4. Baseline approach
Generate 6 keyframe images from your storyboard (SDXL, same seed, same prompt template, same palette fragments), pass each through SVD for a 4-second motion clip, concatenate with ffmpeg, mux audio. No tricks — this is your fallback submission.
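import subprocess
import torch
from diffusers import StableDiffusionXLPipeline
from diffusers.utils import export_to_video
# SDXL keyframe pipeline (assumed base checkpoint); storyboard_prompts is your list of 6 prompts
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16).to("cuda")
SEED = 0  # one seed for every keyframe, matching the "same seed" rule above
# 1. one keyframe per storyboard frame
keyframes = [pipe(p, height=576, width=1024,
                  generator=torch.Generator("cuda").manual_seed(SEED)).images[0]
             for p in storyboard_prompts]
# 2. animate each: 25 frames at 6 fps is about 4 s of motion per clip
clips = []
for i, kf in enumerate(keyframes):
    frames = svd(kf, num_frames=25, motion_bucket_id=120).frames[0]
    out = f"clip_{i}.mp4"
    export_to_video(frames, out, fps=6)
    clips.append(out)
# 3. assemble with ffmpeg + audio (scaled to 1080p, 24 fps, AAC stereo track)
with open("list.txt", "w") as f:
    for c in clips:
        f.write(f"file '{c}'\n")
subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", "list.txt", "-i", "audio.mp3",
                "-map", "0:v:0", "-map", "1:a:0", "-vf", "scale=1920:1080", "-r", "24",
                "-c:v", "libx264", "-pix_fmt", "yuv420p", "-c:a", "aac", "-b:a", "192k",
                "-shortest", "video.mp4"], check=True)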
Baseline result: 24-second video, 6 shots, no beat alignment. Submittable but mid-tier. [illustrative]
5. Improvements that move the needle
5.1 · Trim each clip to a beat-aligned cut point
SVD produces fixed-length clips; the trim is where rhythm lives. Decide cut times from librosa.beat.beat_track, then in ffmpeg use -ss (start offset) and -t (duration) to extract exactly (cut[i+1] − cut[i]) seconds of motion from each clip. Each interval must be no longer than the generated clip (about 4 s here). The hardest cut of the video should land on the song's biggest dynamic change.
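The trim step can be sketched as a small command builder. This is an illustrative sketch, not a fixed recipe: `trim_commands` and the `trim_{i}.mp4` output names are hypothetical, and it assumes `cuts` holds one more entry than there are clips. Re-encoding (rather than stream copy) is used so the cut does not snap to a keyframe.

```python
import subprocess


def trim_commands(clips, cuts, fps=24):
    """Build one ffmpeg command per clip so clip i lasts cuts[i+1] - cuts[i] seconds."""
    cmds = []
    for i, clip in enumerate(clips):
        duration = cuts[i + 1] - cuts[i]
        cmds.append([
            "ffmpeg", "-y",
            "-ss", "0",                # start at the beginning of the generated motion
            "-i", clip,
            "-t", f"{duration:.3f}",   # keep exactly one cut interval
            "-r", str(fps),
            "-c:v", "libx264", "-pix_fmt", "yuv420p",
            f"trim_{i}.mp4",
        ])
    return cmds


def run_all(cmds):
    """Execute the commands; requires ffmpeg on PATH."""
    for cmd in cmds:
        subprocess.run(cmd, check=True)


cmds = trim_commands(["clip_0.mp4", "clip_1.mp4"], [0.5, 2.5, 4.5])
print(cmds[0])
```

Raising the `-ss` value picks a later window of the clip, which helps when the first frames of SVD motion are the weakest.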
5.2 · Use motion_bucket_id to vary energy
SVD has a motion_bucket_id parameter (1–255). Use ~80 for slow-burn intro shots and
~180 for high-energy chorus shots. Mismatched motion energy between song and video reads as
amateurism.
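One way to pick the value per shot is to scale it with how loud the song is during that shot. The sketch below assumes you already have one energy number per shot (for instance the mean RMS between consecutive cuts); `motion_buckets` is a hypothetical helper and the 80 to 180 range is the one suggested above.

```python
def motion_buckets(energies, low=80, high=180):
    """Linearly map per-shot energy values to a motion_bucket_id in [low, high]."""
    lo, hi = min(energies), max(energies)
    if hi == lo:  # flat dynamics: use the midpoint for every shot
        return [round((low + high) / 2)] * len(energies)
    return [round(low + (e - lo) / (hi - lo) * (high - low)) for e in energies]


# Hypothetical per-shot loudness values for a 6-shot video.
shot_energy = [0.02, 0.05, 0.05, 0.12, 0.06, 0.03]
buckets = motion_buckets(shot_energy)
print(buckets)
```

The quietest shot gets the low end of the range and the loudest gets the high end; each value is then passed as `motion_bucket_id` when animating the matching keyframe.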
5.3 · Add a single grade pass across the whole video
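After concatenation, apply an ffmpeg curve / LUT that matches the cover's grade. Even a one-line -vf "curves=preset=vintage,colorchannelmixer..." filter forces consistency across independently generated clips. ffmpeg's built-in curves presets include vintage, medium_contrast and strong_contrast; pick the one closest to the cover.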
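A single grade pass can be sketched as one ffmpeg call. The preset and the `eq` values below are placeholder assumptions to tune against the cover, and `grade_command` is a hypothetical helper; `curves` and `eq` are standard ffmpeg video filters.

```python
def grade_command(src, dst, preset="vintage", saturation=0.9, contrast=1.05):
    """Build an ffmpeg command that applies one grade to the whole video."""
    vf = f"curves=preset={preset},eq=saturation={saturation}:contrast={contrast}"
    return [
        "ffmpeg", "-y", "-i", src,
        "-vf", vf,
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "copy",  # leave the muxed audio untouched
        dst,
    ]


cmd = grade_command("video.mp4", "video_graded.mp4")
print(" ".join(cmd))
```

Grading after concatenation, rather than per clip, is what guarantees every shot receives the identical transform.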
5.4 · Hold the first frame for one beat at the start
Build a one-beat freeze of your hero image (0.5 s at 120 bpm) at the song's first downbeat before the first cut. It establishes the visual world, mirrors how real music videos open, and costs nothing.
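The freeze can be produced by padding the first clip with copies of its own first frame. This sketch uses ffmpeg's `tpad` filter; `hold_first_frame` is a hypothetical helper, and it is meant to run on the video-only first clip before concatenation so the audio is not shifted.

```python
def hold_first_frame(src, dst, bpm, beats=1):
    """Build an ffmpeg command that repeats the first frame for `beats` beats."""
    hold_s = beats * 60.0 / bpm
    vf = f"tpad=start_mode=clone:start_duration={hold_s:.3f}"
    return [
        "ffmpeg", "-y", "-i", src,
        "-vf", vf,
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-an",  # the clip carries no audio at this stage
        dst,
    ]


cmd = hold_first_frame("clip_0.mp4", "clip_0_hold.mp4", bpm=120)
print(cmd)
```

The first shot becomes longer by the hold duration, so its trim length should be reduced by the same amount to keep the next cut on the beat.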
5.5 · End on the cover
The final frame of the video should be the album cover (or a near-identical recomposition). This closes the loop on jury scoring — they see the two deliverables converge.
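A still of the cover can be turned into a short closing clip and appended to the concat list. The sketch below assumes a square cover image that is scaled and letterboxed into a 1920×1080 frame; `cover_outro`, the file names, and the 2-second default are hypothetical choices.

```python
def cover_outro(cover_path, dst, seconds=2.0, fps=24):
    """Build an ffmpeg command that turns a still cover into a short video clip."""
    # Scale the (assumed square) cover to 1080x1080, then centre it on a 1920x1080 canvas.
    vf = "scale=1080:1080,pad=1920:1080:(ow-iw)/2:(oh-ih)/2,format=yuv420p"
    return [
        "ffmpeg", "-y",
        "-loop", "1", "-i", cover_path,  # repeat the single image as video input
        "-t", f"{seconds:.3f}",
        "-vf", vf,
        "-r", str(fps),
        "-c:v", "libx264",
        dst,
    ]


cmd = cover_outro("cover.png", "outro.mp4")
print(cmd)
```

Choosing `seconds` as a whole number of beats keeps the final cut into the cover on the grid like every other cut.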
6. Submission format & gotchas
- H.264 in MP4, 1920×1080 or 1080×1920 (vertical), 24 fps minimum, audio AAC stereo at 192 kbps.
- Mux audio first, then watch it back in a real browser — bad audio sync is the most common jury-reported issue.
- Don't submit a ProRes / mov master; jury machines may not have the codec.
- SVD watermarks (a small Stability AI logo) are sometimes embedded — crop or paint them out before submission. [verify on current SVD release]
- Render time math: SVD at 1024×576 ≈ 30 s per 4-second clip on an A100. On an L4 it doubles. Budget accordingly.
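The format rules above can be checked mechanically before upload. This is a hedged sketch that reads ffprobe's JSON stream listing; `probe` and `check_streams` are hypothetical helper names, and the checks mirror only the codec, resolution, frame-rate, and channel rules listed in this section.

```python
import json
import subprocess
from fractions import Fraction


def probe(path):
    """Return ffprobe's stream info for `path` as a dict (requires ffprobe on PATH)."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_streams", "-of", "json", path],
        capture_output=True, text=True, check=True).stdout
    return json.loads(out)


def check_streams(info):
    """Return a list of format problems; an empty list means the file passes."""
    problems = []
    video = [s for s in info["streams"] if s["codec_type"] == "video"]
    audio = [s for s in info["streams"] if s["codec_type"] == "audio"]
    if not video:
        problems.append("no video stream")
    else:
        v = video[0]
        if v["codec_name"] != "h264":
            problems.append("video codec is not H.264")
        if (v["width"], v["height"]) not in {(1920, 1080), (1080, 1920)}:
            problems.append("resolution is not 1920x1080 or 1080x1920")
        if Fraction(v["r_frame_rate"]) < 24:
            problems.append("frame rate below 24 fps")
    if not audio:
        problems.append("no audio stream")
    else:
        a = audio[0]
        if a["codec_name"] != "aac":
            problems.append("audio codec is not AAC")
        if a.get("channels") != 2:
            problems.append("audio is not stereo")
    return problems


# Example with a hand-written stream listing instead of a real file.
sample = {"streams": [
    {"codec_type": "video", "codec_name": "h264", "width": 1024, "height": 576,
     "r_frame_rate": "8/1"},
]}
print(check_streams(sample))
```

On a real deliverable the call would be `check_streams(probe("video.mp4"))`.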
7. What top solutions did
Among the screened IOAI 2024 final videos, the strongest had: (1) sub-30-second runtime, no padding; (2) cuts landing exclusively on downbeats; (3) consistent character across shots (achieved via fixed-seed SDXL keyframes); (4) a clear narrative arc — three acts, even at 30-second scale; (5) final frame echoing the cover. Tool diversity was high — some teams used Runway Gen-2, others SVD, others AnimateDiff — but discipline was uniform. [illustrative]
8. Drill
D · Your video has 6 shots but feels random. Beat-align it.
Run librosa.beat.beat_track to get all beat times; in a 4/4 song the downbeats are every fourth beat, counted from the first real downbeat (which is not necessarily the first detected beat). Find the song's longest gap between drum hits: that's where one of your cuts should land for the dramatic pause. The cut after a chorus should land on the first beat of the next section, not somewhere in the middle. Map: shot 1 = intro (long), shots 2-3 = verse, shot 4 = chorus hit (1 beat hold + motion), shot 5 = verse 2, shot 6 = outro freeze to cover. Then trim every clip in ffmpeg to the precise beat-aligned duration.
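The shot map can be turned into cut times by counting bars per shot against the beat grid. The sketch below is illustrative: `plan_cuts` is a hypothetical helper, the bar counts are one possible reading of the map above, and the synthetic 120 bpm grid stands in for real librosa output.

```python
def plan_cuts(beat_times, bars_per_shot, beats_per_bar=4, first_downbeat=0):
    """Return the start time of each shot plus the end time of the last one."""
    cuts = []
    idx = first_downbeat
    for bars in bars_per_shot:
        cuts.append(beat_times[idx])
        idx += bars * beats_per_bar
    cuts.append(beat_times[idx])  # end of the final shot
    return cuts


# Synthetic 120 bpm grid: one beat every 0.5 s, 61 beats = 30 s.
grid = [0.5 * i for i in range(61)]

# intro (4 bars), verse (2 + 2), chorus hit (1), verse 2 (2), outro (4) = 15 bars.
plan = plan_cuts(grid, [4, 2, 2, 1, 2, 4])
print(plan)  # [0.0, 8.0, 12.0, 16.0, 18.0, 22.0, 30.0]
```

Consecutive differences of the returned list are the per-clip durations to trim to. Note that shots longer than the generated clip length need either a slower export frame rate or a second generated clip.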
D2 · Your hero character changes face between shots. How do you keep them consistent?
Three options, in order of speed: (a) use the same SDXL seed across all keyframe prompts —
cheap but only works for similar framings; (b) generate one strong "character sheet" image and use
img2img with strength=0.5 from that reference for every keyframe; (c) train a quick
DreamBooth / LoRA on the character (~10 minutes) and use it across all generations. For a 4-hour
round, option (b) is the right speed-quality trade-off.
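Option (b) can be sketched with the SDXL img2img pipeline from diffusers. This is a minimal sketch under stated assumptions: the checkpoint name, the 1024×576 working size, the fixed seed, and the helper names `keyframe_prompt` and `keyframes_from_sheet` are all illustrative choices, and the generation function needs a CUDA GPU with diffusers installed.

```python
def keyframe_prompt(character: str, shot: str, style: str) -> str:
    """Keep the character and style fragments identical across shots; vary only the shot."""
    return f"{character}, {shot}, {style}"


def keyframes_from_sheet(sheet_path, prompts, strength=0.5, seed=0):
    """Generate one keyframe per prompt, each starting from the same reference image."""
    # Heavy imports are kept local so the prompt helper works without a GPU stack.
    import torch
    from diffusers import StableDiffusionXLImg2ImgPipeline
    from diffusers.utils import load_image

    pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",  # assumed checkpoint
        torch_dtype=torch.float16).to("cuda")
    ref = load_image(sheet_path).resize((1024, 576))
    images = []
    for prompt in prompts:
        gen = torch.Generator("cuda").manual_seed(seed)  # same seed for every shot
        images.append(pipe(prompt=prompt, image=ref, strength=strength,
                           generator=gen).images[0])
    return images


prompts = [keyframe_prompt("woman in a red coat", shot, "35mm film, teal and amber grade")
           for shot in ("wide shot on a rooftop", "close-up in the rain")]
print(prompts[0])
```

Lower `strength` values stay closer to the reference (more consistent face, less new composition); higher values allow more change per shot.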