IOAI 2024 · CV on-site · Add a hydrant to cow images (without losing the cow)

Contest: IOAI 2024 (Bulgaria) · Round: Scientific, on-site stage (8 h) · Category: Computer Vision / compositional model editing.

Official sources: ioai-official.org/2024-tasks · On-Site-Round folder (this task is the Madarian_Cow directory) · open-cu/awesome-ioai-tasks.

1. Problem restatement

Same SDXL-mini backbone as the at-home CV task. New twist: instead of swapping two concepts, you must add a new concept. Specifically, when the prompt asks for a cow, the model must produce an image containing both a cow and a fire hydrant. The cow must remain recognisable; the hydrant must be clearly visible. Prompts that do not mention cows must not gain a hydrant. You have 8 hours and a single L4 GPU.

The graded metric combines (a) a CLIP-based detector for "is there a cow?" and "is there a hydrant?" on cow prompts (you want both yes), and (b) a control-prompt collateral score on non-cow prompts (you want hydrant=no there). [verify thresholds against the on-site notebook]

Source. Paraphrased from the IOAI 2024 On-Site-Round repo (folder Madarian_Cow). Detailed scoring weights are encoded in the official notebook; treat them as [verify against source].
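
To make the two-part metric concrete, here is a minimal sketch of how such a score could be combined. The function name `combined_score`, the equal weighting `w_edit=0.5`, and the boolean detector outputs are all assumptions for illustration; the real thresholds and weights are whatever the official notebook encodes.

```python
# Hypothetical scoring sketch, NOT the official grader.
def combined_score(cow_results, control_results, w_edit=0.5):
    """cow_results: list of (has_cow, has_hydrant) booleans for cow prompts.
    control_results: list of has_hydrant booleans for non-cow prompts.
    w_edit: assumed weight of the edit-success term (official weight unknown)."""
    # Edit success: a cow prompt counts only if BOTH objects are detected.
    edit = sum(c and h for c, h in cow_results) / len(cow_results)
    # Locality: a control prompt counts only if NO hydrant leaked in.
    locality = sum(not h for h in control_results) / len(control_results)
    return w_edit * edit + (1.0 - w_edit) * locality

cow_results = [(True, True), (True, False), (False, True), (True, True)]
control_results = [False, False, True, False]
score = combined_score(cow_results, control_results)
print(score)  # 0.625 (edit = 0.5, locality = 0.75)
```

Note that a cow prompt that gains a hydrant but loses the cow earns nothing on the edit term, which is why the task title stresses keeping the cow.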

2. What's being tested

Compositional editing. Adding a concept is harder than swapping — there's no pre-existing "key" inside the UNet that means "the hydrant goes here". You need to graft new attention behaviour.
Time pressure. 8 hours is enough for one good idea, not three. Pick a recipe and iterate on its hyperparameters.
Textual inversion + ROME (Rank-One Model Editing) hybrids. The highest-scoring solutions combine a learned embedding for "hydrant context" with a rank-1 cross-attention edit.
Sample-efficient training. The on-site environment ships no large image set — you need to either generate your own training pairs or use a small handful of reference hydrant images.

3. Data exploration / setup

Setup is similar to the at-home CV task:

The frozen SDXL-mini checkpoint.
prompts.json partitioned into "cow prompts", "other prompts", and "evaluation prompts".
A small folder of reference hydrant images (~10–20), to serve as supervision for textual inversion.
The CLIP-based grading script.

import json
from pathlib import Path

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "path/to/sdxl-mini", torch_dtype=torch.float16).to("cuda")
with open("prompts.json") as f:
    prompts = json.load(f)
# list whichever reference images actually exist instead of assuming exactly 20
ref_hydrants = sorted(Path(".").glob("ref_hydrant_*.png"))

# baseline cow image: confirm the grader sees a cow but no hydrant
img = pipe(prompts["cow"][0], generator=torch.Generator("cuda").manual_seed(0)).images[0]
img.save("baseline_cow.png")

4. Baseline approach

The naive baseline: mutate the prompt itself by appending ", with a fire hydrant" to every cow prompt before tokenisation. This violates the on-site rule that forbids prompt mutation, but it's the cheapest sanity check that the model can produce a cow + hydrant scene at all. If it can't do so even with prompt help, no weight edit will save you.

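# DIAGNOSTIC ONLY: not a legal submission
gen = torch.Generator("cuda").manual_seed(0)   # same seed as the baseline, for a fair comparison
img = pipe(prompts["cow"][0] + ", with a fire hydrant", generator=gen).images[0]
img.save("diagnostic_cow_hydrant.png")
# eyeball: does SDXL-mini even know what a fire hydrant looks like?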

Once you've confirmed the model can compose the two concepts when explicitly told to, your job becomes "make the model do this on its own for cow prompts only".

5. Improvements that move the needle

5.1 · Textual inversion on the cow token

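Optimise a small embedding-shift vector δ, added to the text-encoder output at the "cow" token position, so that e_cow + δ reliably produces cow + hydrant images. Train δ for 30–60 minutes with the diffusion loss, using cow + hydrant scenes as targets. If the reference images show hydrants alone, you have to build those scenes yourself, for example by generating them with the diagnostic prompt from section 4 or by compositing the reference hydrants into cow images.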

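# embed_dim, sample_train_pair, encode, cow_mask and diffusion_loss are
# placeholders for your own code, not diffusers APIs.
delta = torch.zeros(1, 1, embed_dim, device="cuda", requires_grad=True)  # fp32 shift
opt   = torch.optim.AdamW([delta], lr=1e-3)

for step in range(2000):
    img, prompt = sample_train_pair()        # cow+hydrant training image and its cow prompt
    emb  = encode(prompt) + cow_mask * delta # shift only the "cow" positions
    loss = diffusion_loss(pipe.unet, emb, img)
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                    # keep ‖δ‖ < 1.0 to avoid fp16 NaNs (section 6)
        delta.mul_(torch.clamp(1.0 / (delta.norm() + 1e-8), max=1.0))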
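
The loop above uses a `cow_mask` without defining it. A minimal sketch of how such a mask could be built is shown here; the helper name `build_cow_mask` and the token strings are hypothetical stand-ins for real tokenizer output, and the mask has to be rebuilt for each prompt because the cow token can sit at a different position each time.

```python
# Hypothetical helper: plain token strings stand in for real tokenizer output.
def build_cow_mask(tokens, cow_tokens=frozenset({"cow", "cows"})):
    """Return a 0/1 list, one entry per token position, marking cow tokens."""
    return [1.0 if t.lower() in cow_tokens else 0.0 for t in tokens]

tokens = ["<start>", "a", "cow", "in", "a", "field", "<end>"]
cow_mask = build_cow_mask(tokens)
print(cow_mask)  # [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```

In the training loop the list would become a tensor of shape (1, seq_len, 1), so that it broadcasts against a δ of shape (1, 1, embed_dim) and leaves every non-cow position untouched.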

5.2 · Rank-1 cross-attention edit (ROME variant)

Identify the UNet block where the cow token attention peaks and apply a rank-1 update to its to_v projection so the value at "cow" positions includes a hydrant-shaped feature vector. Combines well with 5.1: textual inversion shifts the encoder side, ROME shifts the UNet side.
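
The core of the rank-1 update can be sketched on a toy matrix. This is a simplified form, W' = W + (v* − W k) kᵀ / (kᵀ k), which forces the key k to map to the target value v* while leaving keys orthogonal to k unchanged. Full ROME additionally whitens k with an estimated key covariance, which is omitted here; the names `matvec` and `rank1_edit` and the 2×2 numbers are illustrative assumptions.

```python
# Simplified rank-1 edit: W' = W + (v_star - W k) k^T / (k^T k).
# Full ROME also whitens k with a key covariance estimate; omitted in this sketch.
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def rank1_edit(W, k, v_star):
    """Return W' with W' k == v_star, changing W only along direction k."""
    residual = [v - wk for v, wk in zip(v_star, matvec(W, k))]
    kk = sum(x * x for x in k)
    return [[w_ij + r_i * k_j / kk for w_ij, k_j in zip(row, k)]
            for row, r_i in zip(W, residual)]

W = [[1.0, 0.0], [0.0, 1.0]]      # toy 2x2 stand-in for a to_v weight
k = [1.0, 0.0]                    # key: text embedding at the "cow" position
v_star = [1.0, 2.0]               # target value: cow feature plus a hydrant feature
W_new = rank1_edit(W, k, v_star)
print(matvec(W_new, k))           # [1.0, 2.0]: the cow key now maps to the target
print(matvec(W_new, [0.0, 1.0]))  # [0.0, 1.0]: an orthogonal key is untouched
```

The second print is the locality argument in miniature: prompts whose keys do not overlap with the cow key see the same values as before the edit.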

5.3 · Locality via control-prompt regularisation

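Generate ~50 control images from non-cow prompts using both the original and edited models, and penalise the L2 difference of their CLIP embeddings during δ training. This is the single biggest collateral saver, but budget for its cost: at about 30 s per evaluation, applying it on every one of the 2000 steps from 5.1 would take more than 16 hours, so evaluate the penalty only every few dozen steps or on a small rotating subset of the control prompts.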
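
As a rough sketch of the regulariser, the penalty is the mean squared L2 distance between paired control embeddings, added to the diffusion loss with a trade-off weight. The function names, the weight `lam`, and the plain-list embeddings are assumptions for illustration; in practice the vectors would come from the CLIP image encoder.

```python
# Hypothetical sketch: plain lists stand in for CLIP embedding vectors.
def locality_penalty(orig_embs, edited_embs):
    """Mean squared L2 distance between paired control embeddings."""
    total = 0.0
    for a, b in zip(orig_embs, edited_embs):
        total += sum((x - y) ** 2 for x, y in zip(a, b))
    return total / len(orig_embs)

def total_loss(diffusion_loss, orig_embs, edited_embs, lam):
    # lam is an assumed trade-off weight; tune it against the collateral score.
    return diffusion_loss + lam * locality_penalty(orig_embs, edited_embs)

orig = [[1.0, 0.0], [0.0, 1.0]]      # control embeddings from the original model
edited = [[1.0, 0.0], [0.0, 3.0]]    # same prompts through the edited model
print(locality_penalty(orig, edited))         # 2.0
print(total_loss(0.5, orig, edited, lam=0.5)) # 1.5
```

The original-model embeddings never change, so they can be computed once and cached; only the edited-model side needs refreshing.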

5.4 · Time-boxed iteration: ship every 90 minutes

On-site, you cannot afford a "big bang" submission at hour 7:55. Get a stripped-down version of your pipeline running and submitted within the first 30 minutes, check its score, then iterate on a roughly 90-minute cycle. The official scoring script is fast, so every iteration tells you whether you're going up or down.
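
The cadence can be written down as a simple schedule. This sketch assumes a first submission at 30 minutes, a 90-minute cycle afterwards, and a 15-minute safety reserve before the deadline; the reserve and the function name `submission_times` are assumptions, not contest rules.

```python
def submission_times(total_min=480, first_min=30, cycle_min=90, reserve_min=15):
    """Minutes from the start at which to submit, keeping a final safety reserve."""
    deadline = total_min - reserve_min
    times = list(range(first_min, deadline + 1, cycle_min))
    if not times or times[-1] != deadline:
        times.append(deadline)   # always plan one last submission before the reserve
    return times

schedule = submission_times()
print([f"{t // 60}:{t % 60:02d}" for t in schedule])
# ['0:30', '2:00', '3:30', '5:00', '6:30', '7:45']
```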

5.5 · Cherry-pick the best seed slice

If the grader fixes seeds, you don't control them. But during development, sweep 10 seeds and look at which seeds your edit handles well. If 8/10 are good and 2/10 produce no hydrant, your edit isn't strong enough — increase the textual-inversion learning rate or add a second UNet block to the ROME edit.
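
A seed sweep is easier to act on when failures are split by type. The sketch below tallies hypothetical detector outcomes per seed; the function name `seed_report` and the result format are assumptions. Seeds with a cow but no hydrant point to an edit that is too weak, as described above, while seeds that lose the cow plausibly point to an edit that is too strong.

```python
def seed_report(results):
    """results: dict seed -> (has_cow, has_hydrant) from the grader's detector."""
    ok = sorted(s for s, (cow, hyd) in results.items() if cow and hyd)
    lost_cow = sorted(s for s, (cow, _) in results.items() if not cow)
    no_hydrant = sorted(s for s, (cow, hyd) in results.items() if cow and not hyd)
    return {"success_rate": len(ok) / len(results),
            "lost_cow": lost_cow, "no_hydrant": no_hydrant}

results = {seed: (True, True) for seed in range(10)}
results[3] = (True, False)    # cow present, no hydrant
results[7] = (True, False)
report = seed_report(results)
print(report)  # {'success_rate': 0.8, 'lost_cow': [], 'no_hydrant': [3, 7]}
```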

6. Submission format & gotchas

Submit the modified SDXL-mini directory in from_pretrained layout — same as the at-home task. The grader reloads it cold.
Don't save your training optimiser state into the checkpoint directory — it'll bloat the zip and the grader rejects unknown files.
If your textual inversion δ lives outside the standard checkpoint structure, you must inject it back into the saved text_encoder weights before submission; the grader won't load loose tensors.
Half-precision NaNs are common with large δ values — clip the embedding shift to ‖δ‖ < 1.0.
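
Since the grader rejects unknown files, a quick audit of the checkpoint directory before zipping is cheap insurance. The allow-list below is an assumption based on the usual diffusers SDXL layout; compare it against the directory the organisers actually ship before relying on it.

```python
import os
import tempfile

# Assumed allow-list based on the usual diffusers SDXL layout.
ALLOWED_TOP_LEVEL = {"model_index.json", "unet", "vae", "text_encoder",
                     "text_encoder_2", "tokenizer", "tokenizer_2", "scheduler"}

def unexpected_entries(checkpoint_dir):
    """Return top-level entries that are not part of the expected layout."""
    return sorted(e for e in os.listdir(checkpoint_dir)
                  if e not in ALLOWED_TOP_LEVEL)

# Toy demonstration on a temporary directory.
with tempfile.TemporaryDirectory() as d:
    os.mkdir(os.path.join(d, "unet"))
    open(os.path.join(d, "model_index.json"), "w").close()
    open(os.path.join(d, "optimizer.pt"), "w").close()   # stray optimiser state
    stray = unexpected_entries(d)
print(stray)  # ['optimizer.pt']
```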

7. What top solutions did

Two recipes appear repeatedly in the on-site solutions archive: (1) textual inversion only, taking ~3 hours of the 8-hour budget, which is simpler and more reliable; (2) textual inversion + a single rank-1 UNet edit applied to a mid-resolution block, which reaches a higher peak score but has more failure modes. Top teams also spent ~30 minutes calibrating the CLIP grader on their own outputs to avoid the "image looks perfect but score is low" failure mode. [verify against source]
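
One way to run that calibration is to compare the detector's decision with your own judgement on a handful of outputs. The scores, the threshold of 0.25, and the function name `calibration_report` are made up for illustration; the real threshold has to be read from the official notebook.

```python
def calibration_report(scores, human_labels, threshold):
    """scores: detector scores for 'hydrant present' on your own images.
    human_labels: your own yes/no judgement for the same images.
    threshold: the grader's decision threshold (assumed value in this sketch)."""
    detector = [s >= threshold for s in scores]
    disagree = [i for i, (d, h) in enumerate(zip(detector, human_labels)) if d != h]
    return {"agreement": 1 - len(disagree) / len(scores), "disagree_idx": disagree}

scores = [0.31, 0.18, 0.27, 0.22]
human_labels = [True, True, True, False]
report = calibration_report(scores, human_labels, threshold=0.25)
print(report)  # {'agreement': 0.75, 'disagree_idx': [1]}
```

Image 1 is the case to study: it looks right to you but falls below the threshold, so the hydrant is probably too small or too stylised for the detector.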

8. Drill

D1 · Why does textual inversion alone work for the on-site task but ROME alone struggles?

The on-site task is compositional, not substitutional. ROME's rank-1 update writes a new value into a single attention slot at a single layer: useful for swapping one concept for another, less useful for "cow stays AND hydrant arrives". Textual inversion modifies the text-encoder output, which every cross-attention layer in the UNet consumes, so the change reaches all layers at once; this is how compositional concepts naturally enter diffusion models. A hybrid wins because TI handles "what new thing exists" and ROME handles "where the new thing should appear".

D2 · Your edit works on training cow prompts but fails on "Holstein cow grazing". Why?

"Holstein" likely tokenises into multiple subwords that dilute the cow-attention signal, and your δ was trained only on prompts with the bare word "cow". Two fixes: (a) include diverse cow-related prompts in TI training (Jersey, calf, dairy cow); (b) apply δ to all cow-meaning tokens (for example via their mean embedding) rather than only to the literal token id of "cow". More generally, distributional coverage of the training prompts is the single most undervalued knob.

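
Fix (a) from D2 can be sketched as a small prompt generator that crosses cow-related terms with scene templates. The term list and templates are illustrative assumptions; extend them with whatever the evaluation prompts suggest.

```python
from itertools import product

# Illustrative vocabulary only, not taken from the official prompt set.
COW_TERMS = ["cow", "Holstein cow", "Jersey cow", "calf", "dairy cow"]
TEMPLATES = ["a photo of a {} grazing",
             "a {} standing in a field",
             "a painting of a {}"]

def training_prompts(terms=COW_TERMS, templates=TEMPLATES):
    """Cross every template with every cow term to widen prompt coverage."""
    return [t.format(c) for t, c in product(templates, terms)]

prompts_ti = training_prompts()
print(len(prompts_ti))   # 15
print(prompts_ti[1])     # a photo of a Holstein cow grazing
```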