IOAI 2024 · CV on-site · Add a hydrant to cow images (without losing the cow)
Contest: IOAI 2024 (Bulgaria) · Round: Scientific, on-site stage (8 h) · Category: Computer Vision / compositional model editing.
Official sources: ioai-official.org/2024-tasks · On-Site-Round folder (this task is the Madarian_Cow directory) · open-cu/awesome-ioai-tasks.
1. Problem restatement
Same SDXL-mini backbone as the at-home CV task. New twist: instead of swapping two concepts, you must add a new concept. Specifically, when the prompt asks for a cow, the model must produce an image containing both a cow and a fire hydrant. The cow must remain recognisable; the hydrant must be clearly visible. Prompts that do not mention cows must not gain a hydrant. You have 8 hours and a single L4 GPU.
The graded metric combines (a) a CLIP-based detector for "is there a cow?" and "is there a hydrant?" on cow prompts (you want both yes), and (b) a control-prompt collateral score on non-cow prompts (you want hydrant=no there). [verify thresholds against the on-site notebook in Madarian_Cow] Detailed scoring weights are encoded in the official notebook; treat them as [verify against source].
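For fast local iteration, the two-part metric can be mirrored with a small stand-in. The sketch below is not the official scorer: it assumes the boolean detections come from the provided CLIP-based grading script, and `collateral_weight` is a hypothetical placeholder for the real weights in the notebook.

```python
from typing import Sequence, Tuple


def local_score(
    cow_results: Sequence[Tuple[bool, bool]],  # (cow_detected, hydrant_detected) per cow prompt
    control_hydrant: Sequence[bool],           # hydrant_detected per non-cow prompt
    collateral_weight: float = 0.5,            # HYPOTHETICAL; real weights are in the official notebook
) -> float:
    """Blend edit success on cow prompts with cleanliness on control prompts."""
    if not cow_results or not control_hydrant:
        raise ValueError("need at least one cow prompt and one control prompt")
    # (a) fraction of cow prompts where BOTH a cow and a hydrant are detected
    success = sum(1 for cow, hyd in cow_results if cow and hyd) / len(cow_results)
    # (b) fraction of control prompts that stayed hydrant-free
    clean = sum(1 for hyd in control_hydrant if not hyd) / len(control_hydrant)
    return (1.0 - collateral_weight) * success + collateral_weight * clean
```

Tracking `success` and `clean` separately as well as the blend makes it obvious whether a change helped the edit or merely traded it against collateral damage.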
2. What's being tested
- Compositional editing. Adding a concept is harder than swapping — there's no pre-existing "key" inside the UNet that means "the hydrant goes here". You need to graft new attention behaviour.
- Time pressure. 8 hours is enough for one good idea, not three. Pick a recipe and iterate on its hyperparameters.
- Textual inversion + ROME hybrids. The cleanest solution combines a learned embedding for "hydrant context" with a rank-1 cross-attention edit.
- Sample-efficient training. The on-site environment ships no large image set — you need to either generate your own training pairs or use a small handful of reference hydrant images.
3. Data exploration / setup
Setup is similar to the at-home CV task:
- The frozen SDXL-mini checkpoint.
- prompts.json partitioned into "cow prompts", "other prompts", and "evaluation prompts".
- A small folder of reference hydrant images (~10–20), to serve as supervision for textual inversion.
- The CLIP-based grading script.
import glob
import json
import torch
from diffusers import StableDiffusionXLPipeline

# Load the frozen checkpoint in half precision.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "path/to/sdxl-mini", torch_dtype=torch.float16).to("cuda")
with open("prompts.json") as f:
    prompts = json.load(f)
# Pick up however many reference images are present (~10–20).
ref_hydrants = sorted(glob.glob("ref_hydrant_*.png"))
# baseline cow image: confirm the grader sees a cow but no hydrant
gen = torch.Generator("cuda").manual_seed(0)
img = pipe(prompts["cow"][0], generator=gen).images[0]
img.save("baseline_cow.png")
4. Baseline approach
The naive baseline: append "a fire hydrant" to every cow prompt before tokenisation. This violates the on-site rule that forbids prompt mutation, but it's the cheapest sanity check that "the model can produce a cow + hydrant scene at all". If it can't even with prompt help, no weight edit will save you.
# DIAGNOSTIC ONLY — not a legal submission
img = pipe(prompts["cow"][0] + ", with a fire hydrant").images[0]
# eyeball: does SDXL-mini even know what a fire hydrant looks like?
Once you've confirmed the model can compose the two concepts when explicitly told to, your job becomes "make the model do this on its own for cow prompts only".
5. Improvements that move the needle
5.1 · Textual inversion on the cow token
Optimise a small embedding-shift vector δ applied at the "cow" token position so that the encoder
output e_cow + δ reliably produces cow + hydrant images. Train δ for 30–60 minutes with the
diffusion loss, using cow + hydrant scenes as targets (self-generated training pairs or the reference hydrant images, as noted in section 2).
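# sample_train_pair, encode, diffusion_loss, cow_mask and embed_dim are
# placeholders you must supply; they are not defined here.
delta = torch.zeros(1, 1, embed_dim, device="cuda", requires_grad=True)
opt = torch.optim.AdamW([delta], lr=1e-3)
for step in range(2000):
    img, prompt = sample_train_pair()        # cow+hydrant ref image and its prompt
    emb = encode(prompt) + cow_mask * delta  # shift only the "cow" token positions
    loss = diffusion_loss(pipe.unet, emb, img)
    opt.zero_grad()
    loss.backward()
    opt.step()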
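The loop above relies on a `cow_mask` that selects the "cow" positions in the token sequence. One possible way to build it is sketched below with NumPy; the helper name `build_token_mask` and the toy token ids are assumptions for illustration, not real tokenizer output or part of any library.

```python
import numpy as np


def build_token_mask(token_ids, target_ids):
    """Return a (1, seq_len, 1) float mask that is 1.0 wherever token_ids hits a target id."""
    ids = np.asarray(token_ids)
    mask = np.isin(ids, list(target_ids)).astype(np.float32)
    return mask[None, :, None]


# Toy ids, NOT real tokenizer output: pretend 7 is "cow" and 8 is "calf".
toy_ids = [1, 5, 7, 3, 8, 2]
cow_mask = build_token_mask(toy_ids, {7, 8})

# A (1, 1, embed_dim) shift broadcasts against the mask to (1, seq_len, embed_dim),
# so only the selected positions receive the shift.
delta = np.full((1, 1, 4), 0.1, dtype=np.float32)
shift = cow_mask * delta
```

Passing a set of ids rather than a single id lets the same mask cover several cow-related tokens. For training, the array would need converting to a tensor on the same device as the embeddings.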
5.2 · Rank-1 cross-attention edit (ROME variant)
Identify the UNet block where the cow token attention peaks and apply a rank-1 update to its
to_v projection so the value at "cow" positions includes a hydrant-shaped feature
vector. Combines well with 5.1: textual inversion shifts the encoder side, ROME shifts the UNet side.
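The core of a rank-1 edit can be sketched on a plain matrix. The example below is the simplest least-norm variant, assuming `W` stands in for a `to_v` weight (out_dim × in_dim), `k` for the encoder state at the "cow" position, and `v_target` for the desired value vector. Full ROME additionally weights the update by a key covariance estimate, which is omitted here; all names and the random data are illustrative.

```python
import numpy as np


def rank_one_edit(W, k, v_target):
    """Smallest (Frobenius-norm) rank-1 change to W such that W_new @ k == v_target."""
    k = np.asarray(k, dtype=np.float64)
    # What W currently gets "wrong" at key k, relative to the target value.
    residual = np.asarray(v_target, dtype=np.float64) - W @ k
    return W + np.outer(residual, k) / (k @ k)


rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))          # stand-in for a to_v weight
k = rng.normal(size=4)               # stand-in for the "cow" key
v_star = W @ k + rng.normal(size=6)  # stand-in for "cow value plus hydrant feature"
W_new = rank_one_edit(W, k, v_star)
```

Because the update is proportional to `k` on the input side, any input orthogonal to `k` is mapped exactly as before, which is the property that keeps the edit local in this simplified form.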
5.3 · Locality via control-prompt regularisation
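Generate ~50 control images from non-cow prompts using both the original and edited models. Penalise the L2 difference of their CLIP embeddings during δ training. This is the single biggest collateral saver. Budget for it, though: at about 30 s per evaluation, running it on every one of the 2000 steps in 5.1 would take roughly 16.7 hours, so apply it every few hundred steps or on a small rotating subset of the control prompts.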
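The penalty itself is a few lines once the embeddings exist. The sketch below assumes you already have paired CLIP image embeddings (one row per control prompt) from the original and edited models. L2-normalising them first is an assumption made here so the penalty is bounded, and the function name is hypothetical.

```python
import numpy as np


def locality_penalty(emb_original, emb_edited):
    """Mean squared L2 distance between paired, L2-normalised embeddings (shape: n x d)."""
    a = np.asarray(emb_original, dtype=np.float64)
    b = np.asarray(emb_edited, dtype=np.float64)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(np.sum((a - b) ** 2, axis=1)))
```

With normalised rows the value lies between 0 (identical directions) and 4 (opposite directions). It would typically be added to the diffusion loss with a tunable coefficient, which is another hyperparameter to watch under the time budget.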
5.4 · Time-boxed iteration: ship every 90 minutes
On-site, you cannot afford a "big bang" submission at hour 7:55. Get a stripped-down version of your pipeline running within the first 30 minutes, submit, score, then iterate on a roughly 90-minute cadence. The official scoring script is fast, so every iteration tells you whether you're going up or down.
5.5 · Cherry-pick the best seed slice
If the grader fixes seeds, you don't control them. But during development, sweep 10 seeds and look at which seeds your edit handles well. If 8/10 are good and 2/10 produce no hydrant, your edit isn't strong enough — increase the textual-inversion learning rate or add a second UNet block to the ROME edit.
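A seed sweep like the one described can be wrapped in a small helper. In the sketch below, `generate_and_check` is a hypothetical callback that would generate an image for a given seed and return whether both a cow and a hydrant are detected. A toy lambda stands in for it here so the bookkeeping can be run without a GPU.

```python
from typing import Callable, Dict, Iterable, List


def sweep_seeds(
    generate_and_check: Callable[[int], bool],  # HYPOTHETICAL: seed -> cow AND hydrant detected
    seeds: Iterable[int],
) -> Dict[str, object]:
    """Run the edit on each seed and report the success rate and the failing seeds."""
    seeds = list(seeds)
    if not seeds:
        raise ValueError("need at least one seed")
    failed: List[int] = [s for s in seeds if not generate_and_check(s)]
    return {"rate": 1.0 - len(failed) / len(seeds), "failed": failed}


# Toy stand-in for the real generate-then-detect call: seeds 3 and 7 "fail".
report = sweep_seeds(lambda seed: seed not in (3, 7), range(10))
```

Keeping the list of failing seeds, not just the rate, makes it easy to re-render exactly those cases after strengthening the edit.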
6. Submission format & gotchas
- Submit the modified SDXL-mini directory in from_pretrained layout, same as the at-home task. The grader reloads it cold.
- Don't save your training optimiser state into the checkpoint directory: it'll bloat the zip and the grader rejects unknown files.
- If your textual inversion δ lives outside the standard checkpoint structure, you must inject it back into the saved text_encoder weights before submission; the grader won't load loose tensors.
- Half-precision NaNs are common with large δ values; clip the embedding shift to ‖δ‖ < 1.0.
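The norm clip on δ can be applied after each optimiser step or once before saving. A minimal NumPy sketch follows, assuming δ is available as a flat or broadcastable array; the helper name `clip_delta` is hypothetical.

```python
import numpy as np


def clip_delta(delta, max_norm=1.0):
    """Rescale delta so its L2 norm does not exceed max_norm; smaller deltas pass through."""
    delta = np.asarray(delta, dtype=np.float32)
    norm = float(np.linalg.norm(delta))
    if norm <= max_norm:
        return delta
    return delta * (max_norm / norm)
```

Rescaling preserves the direction of the shift and only shrinks its magnitude. If the bound must hold strictly, pass a `max_norm` slightly below the limit.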
7. What top solutions did
Two recipes appear repeatedly in the on-site solutions archive: (1) textual inversion only, taking ~3 hours of the 8-hour budget — simpler and more reliable; (2) textual inversion + a single rank-1 UNet edit applied to a mid-resolution block — higher peak score but more failure modes. Top teams also spent ~30 minutes calibrating the CLIP grader on their own outputs to avoid the "swap looks perfect but score is low" failure mode. [verify against source]
8. Drill
D1 · Why does textual inversion alone work for the on-site task but ROME alone struggles?
The on-site task is compositional, not substitutional. ROME's rank-1 update writes a new value into a single attention slot at a single layer — useful for swapping one concept for another, less useful for "cow stays AND hydrant arrives". Textual inversion modifies the encoder output across all attention layers at once, which is how compositional concepts naturally enter diffusion models. A hybrid wins because TI handles "what new thing exists" and ROME handles "where the new thing should appear".
D2 · Your edit works on training cow prompts but fails on "Holstein cow grazing". Why?
"Holstein" likely tokenises into multiple subwords that dilute the cow-attention signal, and your δ was trained only on prompts with the bare word "cow". Two fixes: (a) include diverse cow-related prompts in TI training (Jersey, calf, dairy cow); (b) apply δ to the mean of all cow-meaning tokens rather than only the literal id of "cow". Generally, distributional coverage of the training prompts is the single most undervalued knob.