
IOAI 2024 Practical · Album cover (text-to-image)

Contest: IOAI 2024 (Bulgaria) · Round: Practical, on-site (4 h) · Category: Generative imaging.

Official source: ioai-official.org/2024-tasks.

1. Problem restatement

Produce a single square album cover (≥ 1024×1024) for the assigned song, using any text-to-image tool. The cover must clearly visualise the song's mood, fit the team's locked palette, and read legibly at thumbnail (Spotify-sized, 200 px) scale. The title and artist name are composited in post — diffusion-rendered text is unreliable in 2024-era SDXL.

Source. Paraphrased from the IOAI 2024 Practical Round brief. Specific deliverable constraints (resolution, file format) are [verify against on-site materials].

2. What's being tested

3. Data exploration / setup

You start from the team brief. Pick one generation tool: SDXL (open, runs locally on the supplied GPU), Midjourney-style (paid web), or a hosted SDXL service. SDXL via diffusers is the most reproducible:

from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")
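# Optional: requires the xformers package. On PyTorch 2.x, diffusers already
# uses memory-efficient attention by default, so this line can be dropped.
pipe.enable_xformers_memory_efficient_attention()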

Stash a "negative prompt" template the moment you set up the pipeline: "ugly, blurry, extra fingers, distorted face, text, watermark, low quality, jpeg artifacts". It catches roughly 80% of SDXL's failure modes. [illustrative]

4. Baseline approach

Generate one batch of 4 images at 1024×1024 using your brief sentence + style fragments. Eyeball, pick the strongest, run img2img with strength=0.4 for refinement.

prompt = ("a single empty train platform at dusk, deep blue and amber neon, "
          "rain on the camera lens, 35mm, shallow depth of field, "
          "cinematic, Wong Kar-wai inspired, moody, soft volumetric light")
negative = ("ugly, blurry, extra fingers, distorted face, text, watermark, "
            "low quality, jpeg artifacts")

imgs = pipe(prompt, negative_prompt=negative,
            num_inference_steps=40, guidance_scale=7.0,
            height=1024, width=1024,
            num_images_per_prompt=4,
            generator=torch.Generator("cuda").manual_seed(42)).images

for i, im in enumerate(imgs):
    im.save(f"cover_v1_{i}.png")
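The img2img refinement step can be sketched as follows. This is a minimal sketch, assuming a recent diffusers release that provides `AutoPipelineForImage2Image.from_pipe`; `refine_img2img` and `effective_steps` are hypothetical helper names, and the step arithmetic reflects how diffusers' img2img pipelines are generally understood to shorten the schedule, so check it against your installed version.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Approximate number of denoising steps img2img actually runs.

    Assumption: the pipeline skips the first (1 - strength) share of the
    schedule, so only about steps * strength steps touch the image.
    """
    return min(int(num_inference_steps * strength), num_inference_steps)


def refine_img2img(pipe, hero, prompt, negative, strength=0.4, steps=40, seed=42):
    """Lightly re-denoise the chosen image; low strength keeps the composition."""
    # Imported lazily so effective_steps() works without the GPU stack installed.
    import torch
    from diffusers import AutoPipelineForImage2Image

    # Reuses the already-loaded SDXL weights instead of loading a second copy.
    img2img = AutoPipelineForImage2Image.from_pipe(pipe)
    return img2img(
        prompt=prompt,
        negative_prompt=negative,
        image=hero,
        strength=strength,
        guidance_scale=7.0,
        num_inference_steps=steps,
        generator=torch.Generator("cuda").manual_seed(seed),
    ).images[0]
```

A call would look like `refined = refine_img2img(pipe, imgs[2], prompt, negative)`, where the index is whichever image you picked. Because strength scales the schedule, raising `steps` is the way to get more refinement work at a fixed strength.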

Baseline score: a competent SDXL cover with negative-prompt hygiene typically scores in the middle of the leaderboard — clean but generic. [illustrative]

5. Improvements that move the needle

5.1 · Refiner pass at high guidance, low strength

Once you've picked your hero image, run it through the SDXL Refiner with strength ≈ 0.3, guidance_scale ≈ 10. The refiner specialises in late-denoising detail and sharpens texture without changing composition. Typically worth about one jury point. [illustrative]
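A refiner pass along these lines might look like the sketch below. Only the strength and guidance values come from the text; the step count, the helper name `refine_with_sdxl_refiner`, and the choice to load the public `stabilityai/stable-diffusion-xl-refiner-1.0` checkpoint through `StableDiffusionXLImg2ImgPipeline` are assumptions.

```python
# Strength and guidance follow the text; the step count is an assumption.
REFINER_SETTINGS = {
    "strength": 0.3,
    "guidance_scale": 10.0,
    "num_inference_steps": 40,
}


def refine_with_sdxl_refiner(hero, prompt, negative, settings=REFINER_SETTINGS):
    """Run the hero image (a PIL image) through the SDXL refiner checkpoint."""
    # Imported lazily so the settings can be inspected without a GPU stack.
    import torch
    from diffusers import StableDiffusionXLImg2ImgPipeline

    refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-refiner-1.0",
        torch_dtype=torch.float16,
        variant="fp16",
    ).to("cuda")
    return refiner(
        prompt=prompt,
        negative_prompt=negative,
        image=hero,
        **settings,
    ).images[0]
```

Loading the refiner next to the base model costs extra VRAM, so on a tight GPU it may be necessary to free the base pipeline first.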

5.2 · ControlNet for composition lock

Sketch the composition (a 1-minute Figma layout marking subject silhouette + horizon line) and feed it as a ControlNet "scribble" or "canny" conditioning. This guarantees the cover composition matches your brief storyboard, instead of being whatever SDXL felt like producing.

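import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel

cn = ControlNetModel.from_pretrained("diffusers/controlnet-canny-sdxl-1.0",
                                     torch_dtype=torch.float16).to("cuda")
pipe_cn = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", controlnet=cn,
    torch_dtype=torch.float16).to("cuda")

# my_layout: your layout sketch loaded as a PIL image (e.g. a 1024x1024 PNG export)
edges = cv2.Canny(np.array(my_layout.convert("L")), 100, 200)
# Canny returns a single-channel array; the pipeline expects a 3-channel image
sketch = Image.fromarray(np.stack([edges] * 3, axis=-1))
img = pipe_cn(prompt, image=sketch, controlnet_conditioning_scale=0.7).images[0]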

5.3 · Inpaint problem regions

When the cover is 90% there but one element is wrong (a distorted hand, an extra moon, ugly text on a sign), mask that region and inpaint with the original prompt. This fixes the defect without re-rolling the whole image.
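One way to script the mask-and-inpaint step is sketched below. The helper names `padded_box` and `inpaint_region`, the 32 px padding, and the use of the public `diffusers/stable-diffusion-xl-1.0-inpainting-0.1` checkpoint are all assumptions; a mask painted by hand in an image editor works just as well.

```python
def padded_box(box, pad, size):
    """Grow (left, top, right, bottom) by pad pixels, clamped to the image."""
    left, top, right, bottom = box
    width, height = size
    return (max(left - pad, 0), max(top - pad, 0),
            min(right + pad, width), min(bottom + pad, height))


def inpaint_region(cover, box, prompt, negative, pad=32, seed=42):
    """Repaint only the boxed region of a PIL cover image."""
    # Imported lazily so padded_box() works without the GPU stack installed.
    import torch
    from PIL import Image, ImageDraw
    from diffusers import AutoPipelineForInpainting

    # White marks pixels to repaint, black marks pixels to keep.
    mask = Image.new("L", cover.size, 0)
    ImageDraw.Draw(mask).rectangle(padded_box(box, pad, cover.size), fill=255)

    inpaint = AutoPipelineForInpainting.from_pretrained(
        "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
        torch_dtype=torch.float16,
        variant="fp16",
    ).to("cuda")
    return inpaint(
        prompt=prompt,
        negative_prompt=negative,
        image=cover,
        mask_image=mask,
        guidance_scale=7.0,
        num_inference_steps=40,
        generator=torch.Generator("cuda").manual_seed(seed),
    ).images[0]
```

Padding the box gives the model some surrounding context to blend into, which tends to hide the seam better than a tight mask.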

5.4 · Post-grade in Python or any colour tool

Pull a curves adjustment toward your two locked colours. cv2 or PIL is enough — the point is to nudge the cover's blacks toward your blue and the highlights toward your amber so that cover + video frames share an unmistakable grade.
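A curves adjustment of this kind can be approximated with a per-channel lookup table applied through Pillow's `Image.point`. The sketch below is one possible formulation, not a standard grade: the quadratic weighting, the 0.25 amount, the helper names, and the example blue and amber RGB values are all assumptions to be replaced with your locked palette.

```python
def split_tone_lut(shadow_rgb, highlight_rgb, amount=0.25):
    """Build a 768-entry lookup table (256 values per channel) for Image.point()."""
    lut = []
    for shadow, highlight in zip(shadow_rgb, highlight_rgb):
        for value in range(256):
            t = value / 255.0
            # Dark values are pulled toward the shadow colour and bright values
            # toward the highlight colour; the combined weight is smallest in
            # the midtones.
            pull = (1 - t) ** 2 * (shadow - value) + t ** 2 * (highlight - value)
            graded = value + amount * pull
            lut.append(max(0, min(255, int(graded + 0.5))))
    return lut


def grade_cover(path_in, path_out,
                shadow_rgb=(20, 40, 90),        # hypothetical deep blue
                highlight_rgb=(255, 176, 60)):  # hypothetical amber
    """Apply the split-tone table to an image file and save the result."""
    from PIL import Image  # imported lazily; only needed when grading a file

    cover = Image.open(path_in).convert("RGB")
    cover.point(split_tone_lut(shadow_rgb, highlight_rgb)).save(path_out)
```

Running the same `grade_cover` settings over exported video frames is what makes the cover and the video share a grade; with `amount=0` the table is the identity, so you can dial the effect in gradually.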

5.5 · Composite the type, don't generate it

Open the cover in Figma / Inkscape / Photopea. Place title in a single serif weight at the bottom third, artist name above title at 60% size. Don't rotate, don't outline, don't add a drop shadow on the first pass. Restraint reads professional.
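If you would rather script the placement than open a design tool, the same layout rules can be approximated with Pillow. This is a sketch under assumptions: the helper names, the horizontal centring, the exact baseline positions, and the off-white fill are choices made for illustration, and `font_path` must point at a serif font file you actually have.

```python
def type_layout(width, height, title_px):
    """Baselines and sizes: title in the bottom third, artist above at 60% size."""
    artist_px = title_px * 6 // 10            # 60% of the title size
    title_y = height * 5 // 6                 # middle of the bottom third
    artist_y = title_y - title_px * 5 // 4    # one generous line above the title
    return {"x": width // 2, "title_y": title_y, "title_px": title_px,
            "artist_y": artist_y, "artist_px": artist_px}


def composite_type(cover, title, artist, font_path, title_px=80,
                   fill=(245, 240, 230)):
    """Draw artist and title onto a copy of a PIL cover; no rotation or effects."""
    from PIL import ImageDraw, ImageFont  # imported lazily

    out = cover.convert("RGB")  # convert() returns a new image, so cover is untouched
    lay = type_layout(out.width, out.height, title_px)
    draw = ImageDraw.Draw(out)
    title_font = ImageFont.truetype(font_path, lay["title_px"])
    artist_font = ImageFont.truetype(font_path, lay["artist_px"])
    # anchor="ms": x is the horizontal middle of the text, y is its baseline.
    draw.text((lay["x"], lay["artist_y"]), artist, font=artist_font,
              fill=fill, anchor="ms")
    draw.text((lay["x"], lay["title_y"]), title, font=title_font,
              fill=fill, anchor="ms")
    return out
```

For a 1024 px cover with an 80 px title, this puts the title baseline at y = 853 and the 48 px artist line at y = 753, both inside the bottom third.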

6. Submission format & gotchas

7. What top solutions did

Screening the IOAI 2024 final presentations, the strongest covers had three traits in common: (1) a single dominant silhouette readable at thumbnail scale; (2) a tightly limited palette (often only two colours plus paper); (3) typography that looked designed by a graphic designer, not pasted on. Tools were a mix of SDXL, Midjourney, and DALL-E 3 — model choice mattered less than the brief discipline. [illustrative]

8. Drill

D1 · Your SDXL cover renders a great image, but at thumbnail (200 px) it becomes a blur. What's wrong?

Too much fine detail, no dominant silhouette. SDXL loves to fill every pixel; at thumbnail scale that detail blurs into noise. Fix by (a) increasing the negative-space ratio by prompting for "minimalist composition, large empty sky"; (b) using a ControlNet sketch with a clear silhouette cutout; (c) post-grading to crush the midtones so the subject reads against the background. Quick test: downsample to 200 px and squint. If the subject doesn't pop, the cover fails.
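The downsample half of the quick test is easy to script. The sketch below also returns an RMS contrast figure as a crude numeric companion to squinting; that metric and the helper names are assumptions, no pass/fail threshold comes from the brief, and your own eyes on the saved thumbnail remain the real test.

```python
import statistics


def rms_contrast(luma_values):
    """Standard deviation of 8-bit luminance values, scaled to the 0..1 range."""
    return statistics.pstdev([v / 255.0 for v in luma_values])


def thumbnail_check(path, size=200):
    """Return the greyscale thumbnail and its RMS contrast."""
    from PIL import Image  # imported lazily; only needed when reading a file

    thumb = Image.open(path).convert("L").resize(
        (size, size), Image.Resampling.LANCZOS)
    # For mode "L" images, each byte is one pixel's luminance.
    return thumb, rms_contrast(thumb.tobytes())
```

Typical use would be `thumb, score = thumbnail_check("cover_v1_0.png")` followed by `thumb.save("thumb.png")`. Comparing scores across your candidate covers is more meaningful than reading any single number in isolation.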

D2 · Title type rendered by SDXL looks wrong. Why is hand-compositing better than re-prompting?

SDXL's text encoders split the prompt into subword tokens, so the model never sees individual letters; it only learns statistically what lettering tends to look like. In 2024-era SDXL, words longer than 4 characters are garbled ~50% of the time, especially on non-English titles. [illustrative] Re-prompting wastes generation passes. Compositing in a vector tool gives you pixel-perfect type in seconds, and lets you choose a typeface that matches the brief.
