IOAI 2024 Practical · Album cover (text-to-image)
Contest: IOAI 2024 (Bulgaria) · Round: Practical, on-site (4 h) · Category: Generative imaging.
Official source: ioai-official.org/2024-tasks.
1. Problem restatement
Produce a single square album cover (≥ 1024×1024) for the assigned song, using any text-to-image tool. The cover must clearly visualise the song's mood, fit the team's locked palette, and read legibly at thumbnail (Spotify-sized, 200 px) scale. The title and artist name are composited in post — diffusion-rendered text is unreliable in 2024-era SDXL.
2. What's being tested
- Prompt-engineering fluency. Knowing which prompt fragments reliably control composition, light, lens, and grade. Weak prompts produce the "generic AI cover" look.
- Thumbnail readability. A cover that looks great at 1024 but mushy at 200 px is a fail. The composition must survive a 5× downsample.
- Palette discipline. Sticking to the brief's three colours.
- Post-production. Compositing title type, fixing minor artifacts, removing extra fingers — basic image editing literacy.
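Palette discipline can be checked numerically before the jury judges it by eye. The sketch below assumes the cover is already loaded as an RGB array and measures the share of pixels that sit within a tolerance of the nearest brief colour. The function name `palette_coverage`, the tolerance of 60, and the example colours are illustrative assumptions, not part of the task statement. Euclidean distance in RGB is only a crude stand-in for perceptual distance, so treat the number as a rough signal.

```python
import numpy as np

def palette_coverage(rgb, palette, tol=60.0):
    """Fraction of pixels within `tol` (RGB units, 0-255) of the nearest palette colour.

    rgb     : array of shape (H, W, 3), values 0-255
    palette : iterable of (r, g, b) tuples taken from the brief
    """
    px = np.asarray(rgb, dtype=np.float32).reshape(-1, 3)   # (N, 3)
    pal = np.asarray(palette, dtype=np.float32)             # (K, 3)
    # Distance from every pixel to every palette colour: shape (N, K)
    dist = np.linalg.norm(px[:, None, :] - pal[None, :, :], axis=2)
    return float((dist.min(axis=1) <= tol).mean())

# Hypothetical three-colour brief: deep blue, amber, near-black.
BRIEF = [(20, 40, 120), (255, 176, 0), (10, 10, 14)]

# Synthetic 4x4 "cover": left half on-palette blue, right half off-palette green.
demo = np.zeros((4, 4, 3), dtype=np.uint8)
demo[:, :2] = (20, 40, 120)
demo[:, 2:] = (0, 255, 0)
coverage = palette_coverage(demo, BRIEF)
print(f"on-palette share: {coverage:.2f}")
```

On a real cover it is cheaper to run this on the 200 px thumbnail than on the full 1024×1024 image, since the distance matrix grows with the pixel count.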
3. Data exploration / setup
You start from the team brief. Pick one generation tool: SDXL (open, runs locally on the supplied GPU), Midjourney-style (paid web), or a hosted SDXL service. SDXL via diffusers is the most reproducible:
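import torch
from diffusers import StableDiffusionXLPipeline

# Load the SDXL base weights in half precision so they fit on the supplied GPU.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")
# Optional: needs the xformers package. With PyTorch 2.x, diffusers already uses
# a memory-efficient attention kernel by default, so this line can be dropped.
pipe.enable_xformers_memory_efficient_attention()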
Stash a "negative prompt" template the moment you set up the pipeline: "ugly, blurry, extra fingers, distorted face, text, watermark, low quality, jpeg artifacts". As a rule of thumb it catches around 80% of SDXL's common failure modes.
4. Baseline approach
Generate one batch of 4 images at 1024×1024 using your brief sentence + style fragments.
Eyeball, pick the strongest, run img2img with strength=0.4 for refinement.
prompt = ("a single empty train platform at dusk, deep blue and amber neon, "
          "rain on the camera lens, 35mm, shallow depth of field, "
          "cinematic, Wong Kar-wai inspired, moody, soft volumetric light")
negative = ("ugly, blurry, extra fingers, distorted face, text, watermark, "
            "low quality, jpeg artifacts")
# A fixed seed makes the batch reproducible.
imgs = pipe(prompt, negative_prompt=negative,
            num_inference_steps=40, guidance_scale=7.0,
            height=1024, width=1024,
            num_images_per_prompt=4,
            generator=torch.Generator("cuda").manual_seed(42)).images
# Save every candidate so the strongest one can be picked by eye.
for i, im in enumerate(imgs):
    im.save(f"cover_v1_{i}.png")
Baseline score: a competent SDXL cover with negative-prompt hygiene typically scores in the middle of the leaderboard — clean but generic. [illustrative]
5. Improvements that move the needle
5.1 · Refiner pass at high guidance, low strength
Once you've picked your hero image, run it through SDXL Refiner with strength ≈ 0.3,
guidance_scale ≈ 10. The refiner specialises in late-denoising detail and sharpens texture
without changing composition. Reliable +1 jury point.
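A minimal sketch of the refiner pass, assuming the public `stabilityai/stable-diffusion-xl-refiner-1.0` checkpoint and a hero image already loaded as a PIL image. The function names `effective_steps` and `refine` are illustrative. The small helper shows why a low strength preserves composition: the diffusers img2img pipelines only run the last `strength` share of the schedule, so most of the denoising trajectory is skipped.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Denoising steps an img2img pass actually runs.

    diffusers skips the first (1 - strength) share of the schedule, so a
    low strength means only the final steps touch the image.
    """
    return min(int(num_inference_steps * strength), num_inference_steps)


def refine(hero, prompt, negative, strength=0.3, guidance_scale=10.0, steps=40):
    """Run the hero image through the SDXL refiner (needs a CUDA GPU).

    Imports are local so the helper above works without diffusers installed.
    """
    import torch
    from diffusers import StableDiffusionXLImg2ImgPipeline

    refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-refiner-1.0",
        torch_dtype=torch.float16,
        variant="fp16",
    ).to("cuda")
    return refiner(
        prompt=prompt,
        negative_prompt=negative,
        image=hero,                      # PIL image picked from the baseline batch
        strength=strength,
        guidance_scale=guidance_scale,
        num_inference_steps=steps,
    ).images[0]


# Number of steps that really run with the settings suggested above.
print(effective_steps(40, 0.3))
```

With 40 scheduled steps and strength 0.3, only about a dozen steps run, which is why the pass is fast enough to try on more than one candidate.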
5.2 · ControlNet for composition lock
Sketch the composition (a 1-minute Figma layout marking subject silhouette + horizon line) and feed it as a ControlNet "scribble" or "canny" conditioning. This guarantees the cover composition matches your brief storyboard, instead of being whatever SDXL felt like producing.
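import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel

cn = ControlNetModel.from_pretrained("diffusers/controlnet-canny-sdxl-1.0",
                                     torch_dtype=torch.float16).to("cuda")
pipe_cn = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", controlnet=cn,
    torch_dtype=torch.float16).to("cuda")

# my_layout: the exported layout sketch as a PIL image ("layout.png" is a placeholder name).
my_layout = Image.open("layout.png").convert("RGB").resize((1024, 1024))
edges = cv2.Canny(np.array(my_layout), 100, 200)           # single-channel edge map
sketch = Image.fromarray(np.stack([edges] * 3, axis=-1))   # the pipeline expects 3 channels
img = pipe_cn(prompt, negative_prompt=negative, image=sketch,
              controlnet_conditioning_scale=0.7).images[0]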
5.3 · Inpaint problem regions
Sometimes the cover is 90% there but one element is wrong (a distorted hand, an extra moon, ugly text on a sign). Mask that region and inpaint with the original prompt; this fixes the defect without re-rolling the whole image.
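The masking step can be sketched as follows, assuming a rectangular problem region and the diffusers `AutoPipelineForInpainting` loader pointed at the SDXL base weights. The helper names, the box coordinates, and the strength of 0.9 are illustrative assumptions; a dedicated inpainting checkpoint may blend better than the base model.

```python
from PIL import Image, ImageDraw

def rect_mask(size, box):
    """White-on-black mask: white pixels get repainted, black pixels are kept."""
    mask = Image.new("L", size, 0)
    ImageDraw.Draw(mask).rectangle(box, fill=255)
    return mask


def inpaint_region(cover, mask, prompt, negative, seed=42, strength=0.9):
    """Repaint only the masked region with the original prompt (needs a CUDA GPU)."""
    import torch
    from diffusers import AutoPipelineForInpainting

    pipe = AutoPipelineForInpainting.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16,
        variant="fp16",
    ).to("cuda")
    return pipe(
        prompt=prompt,
        negative_prompt=negative,
        image=cover,
        mask_image=mask,
        strength=strength,   # below 1.0 so some of the original pixels guide the fill
        generator=torch.Generator("cuda").manual_seed(seed),
    ).images[0]


# Hypothetical region: a distorted hand in the lower-right quadrant.
mask = rect_mask((1024, 1024), (560, 640, 800, 900))
```

Keep the box a little larger than the defect so the repainted area has room to blend into its surroundings.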
5.4 · Post-grade in Python or any colour tool
Pull a curves adjustment toward your two locked colours. cv2 or PIL is enough — the
point is to nudge the cover's blacks toward your blue and the highlights toward your amber so that
cover + video frames share an unmistakable grade.
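One way to sketch this grade is a simple split-tone: blend each pixel toward the shadow colour or the highlight colour in proportion to its luminance. The function name `split_tone`, the blend amount, and the two RGB values are illustrative assumptions; substitute the colours from your own brief. Luminance is computed directly on gamma-encoded values here, which is an approximation that is usually good enough for a grade.

```python
import numpy as np

def split_tone(rgb, shadow_rgb, highlight_rgb, amount=0.25):
    """Nudge shadows toward `shadow_rgb` and highlights toward `highlight_rgb`.

    rgb    : float array of shape (H, W, 3) with values in [0, 1]
    amount : 0 leaves the image untouched, 1 replaces it with the tint
    """
    img = np.asarray(rgb, dtype=np.float32)
    # Rec. 709 luma weights, applied to gamma-encoded values (approximation).
    luma = img @ np.array([0.2126, 0.7152, 0.0722], dtype=np.float32)
    w_high = luma[..., None]            # close to 1 in highlights
    w_shadow = 1.0 - w_high             # close to 1 in shadows
    tint = (w_shadow * np.asarray(shadow_rgb, dtype=np.float32)
            + w_high * np.asarray(highlight_rgb, dtype=np.float32))
    return np.clip((1.0 - amount) * img + amount * tint, 0.0, 1.0)

# Hypothetical locked colours: deep blue for the blacks, amber for the highlights.
SHADOW_BLUE = (0.05, 0.10, 0.35)
HIGHLIGHT_AMBER = (1.00, 0.75, 0.20)

# Two-pixel demo: one pure black, one pure white.
demo = np.array([[[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]], dtype=np.float32)
graded = split_tone(demo, SHADOW_BLUE, HIGHLIGHT_AMBER)
print(graded)
```

To apply it to a real cover, convert the PIL image with `np.asarray(img, dtype=np.float32) / 255`, grade it, and convert back with `Image.fromarray((graded * 255).round().astype(np.uint8))`. Running the same function over the video frames gives both assets the same grade.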
5.5 · Composite the type, don't generate it
Open the cover in Figma / Inkscape / Photopea. Place title in a single serif weight at the bottom third, artist name above title at 60% size. Don't rotate, don't outline, don't add a drop shadow on the first pass. Restraint reads professional.
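If no design tool is at hand, the same layout can be approximated in Pillow. This is a sketch under assumptions: the font file name is a placeholder, the title size and margin are arbitrary, and the text is left-aligned rather than optically set. It follows the rules above: one typeface, title in the bottom third, artist name above it at 60% size, no rotation, outline, or shadow.

```python
from PIL import Image, ImageDraw, ImageFont

def load_serif(size, path="DejaVuSerif.ttf"):
    """Try a serif TrueType font; fall back to Pillow's built-in font.

    The path is a placeholder: point it at the typeface your brief calls for.
    """
    try:
        return ImageFont.truetype(path, size)
    except (OSError, ImportError):
        return ImageFont.load_default()


def composite_type(cover, title, artist, title_px=96, margin=64, fill=(245, 240, 230)):
    """Return a copy of the cover with artist and title set in the bottom third."""
    out = cover.convert("RGB")            # convert() returns a new image
    _, height = out.size
    draw = ImageDraw.Draw(out)
    artist_px = int(title_px * 0.6)       # artist name at 60% of the title size
    title_y = height - margin - title_px  # title sits just above the bottom margin
    artist_y = title_y - artist_px - 16   # artist line directly above the title
    draw.text((margin, artist_y), artist, font=load_serif(artist_px), fill=fill)
    draw.text((margin, title_y), title, font=load_serif(title_px), fill=fill)
    return out


# Stand-in for the raw cover; the title and artist strings are made up.
base = Image.new("RGB", (1024, 1024), (10, 10, 40))
final = composite_type(base, "Night Platform", "The Example Band")
```

Because the function returns a new image, the raw cover stays untouched and can be saved separately.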
6. Submission format & gotchas
- Final cover: 1024×1024 minimum, PNG, sRGB colour profile, < 10 MB.
- Keep an un-composited cover_raw.png alongside the final; the jury sometimes asks for it.
- Don't include alpha channels; some viewers render alpha differently and your blacks shift.
- Save your seed and prompt next to the file. If you're asked to defend the cover in jury Q&A, being able to say "seed 42, prompt X, ControlNet sketch Y" wins points.
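The checklist above can be turned into a short pre-submission script. This is a sketch, not an official validator: the function names are illustrative, "< 10 MB" is read as 10 MiB, and the sRGB profile is not verified here. It also writes the seed and prompt to a JSON file next to the cover.

```python
import json
import os
import tempfile
from PIL import Image

MAX_BYTES = 10 * 1024 * 1024   # "< 10 MB", read here as 10 MiB (assumption)

def check_cover(path, min_side=1024):
    """Return a list of problems; an empty list means the file passes these checks."""
    problems = []
    if os.path.getsize(path) >= MAX_BYTES:
        problems.append("file is 10 MB or larger")
    with Image.open(path) as im:
        if im.format != "PNG":
            problems.append(f"format is {im.format}, expected PNG")
        w, h = im.size
        if w != h or w < min_side:
            problems.append(f"size {w}x{h} is not a square of at least {min_side} px")
        if im.mode in ("RGBA", "LA", "PA") or "transparency" in im.info:
            problems.append("image carries an alpha channel or transparency")
    return problems


def write_sidecar(cover_path, seed, prompt, negative, extra=None):
    """Save generation settings as <cover name>.json next to the cover."""
    meta = {"seed": seed, "prompt": prompt, "negative_prompt": negative, **(extra or {})}
    sidecar = os.path.splitext(cover_path)[0] + ".json"
    with open(sidecar, "w", encoding="utf-8") as fh:
        json.dump(meta, fh, indent=2, ensure_ascii=False)
    return sidecar


# Demo on throwaway files: one deliberately wrong, one compliant.
with tempfile.TemporaryDirectory() as tmp:
    bad = os.path.join(tmp, "cover_bad.png")
    Image.new("RGBA", (512, 512)).save(bad)
    bad_report = check_cover(bad)

    good = os.path.join(tmp, "cover_final.png")
    Image.new("RGB", (1024, 1024), (10, 10, 40)).save(good)
    good_report = check_cover(good)

    sidecar = write_sidecar(good, seed=42, prompt="empty train platform at dusk",
                            negative="text, watermark",
                            extra={"controlnet_sketch": "layout.png"})
    with open(sidecar, encoding="utf-8") as fh:
        saved = json.load(fh)

print(bad_report)
print(good_report)
```

A cover that fails the alpha check can usually be fixed with `Image.open(path).convert("RGB").save(path)`.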
7. What top solutions did
Screening the IOAI 2024 final presentations, the strongest covers had three traits in common: (1) a single dominant silhouette readable at thumbnail scale; (2) a tightly limited palette (often only two colours plus paper); (3) typography that looked designed by a graphic designer, not pasted on. Tools were a mix of SDXL, Midjourney, and DALL-E 3 — model choice mattered less than the brief discipline. [illustrative]
8. Drill
D · Your SDXL cover renders a great image, but at thumbnail (200 px) it becomes a blur. What's wrong?
Too much fine detail and no dominant silhouette. SDXL tends to fill every pixel, and at thumbnail scale that detail blurs into noise. Fix it by (a) increasing the share of negative space, for example by prompting for "minimalist composition, large empty sky"; (b) using a ControlNet sketch with a clear silhouette cutout; (c) post-grading to crush the midtones so the subject reads against the background. Quick test: downsample to 200 px and squint. If the subject doesn't pop, the cover fails.
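The squint test can be roughly automated. The sketch below downsamples a cover to 200 px and reports the standard deviation of its luminance as a crude proxy for how much large-scale contrast survives. The function name `thumbnail_contrast` and the two synthetic test images are illustrative assumptions; the metric is useful for ranking candidates against each other, not as an absolute pass mark.

```python
import numpy as np
from PIL import Image, ImageDraw, ImageStat

def thumbnail_contrast(cover, side=200):
    """Luminance standard deviation (0-1 scale) of the cover at thumbnail size."""
    thumb = cover.convert("L").resize((side, side), Image.LANCZOS)
    return ImageStat.Stat(thumb).stddev[0] / 255.0

# Synthetic cover A: one large dark silhouette on a light background.
silhouette = Image.new("L", (1024, 1024), 255)
ImageDraw.Draw(silhouette).rectangle((256, 256, 767, 767), fill=0)

# Synthetic cover B: pure fine detail (1 px checkerboard), no dominant shape.
checker = (np.indices((1024, 1024)).sum(axis=0) % 2 * 255).astype(np.uint8)
fine_detail = Image.fromarray(checker)

score_silhouette = thumbnail_contrast(silhouette)
score_detail = thumbnail_contrast(fine_detail)
print(f"silhouette: {score_silhouette:.3f}  fine detail: {score_detail:.3f}")
```

The silhouette keeps most of its contrast after downsampling, while the fine-detail image collapses toward flat grey, which is the failure described above.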
D2 · Title type rendered by SDXL looks wrong. Why is hand-compositing better than re-prompting?
SDXL's CLIP text encoders split the prompt into subword tokens rather than individual letters, so the model has no explicit notion of spelling and reproduces letterforms only statistically. As a rough rule of thumb for 2024-era SDXL, words longer than 4 characters come out garbled about half the time, especially in non-English titles. Re-prompting wastes generation passes. Compositing in a vector tool gives you pixel-perfect type in seconds and lets you choose a typeface that matches the brief.