IOAI 2025 · CV · Chicken counting via density estimation

Contest: IOAI 2025 (Beijing, China) · Round: Individual contest, Day 1 · Category: Computer vision · density estimation / object counting.

Official sources: IOAI-official/IOAI-2025 · Chicken_Counting task folder · HuggingFace dataset.

1. Problem restatement

You are working with Silkie chicken farmers. Free-range flocks need to be counted accurately for both inventory and insurance purposes, but a human cannot reliably count hundreds of birds in a single aerial photo. The organisers provide a dataset of overhead photos of Silkie chicken flocks and ask you to predict the density map of chickens per pixel; the integrated density over the image equals the chicken count.

A pre-trained feature extractor (a small VGG-style convolutional stack, weights in base.pth) is provided. You must design the decoder — i.e. the network that turns the spatial feature map into a single-channel density prediction — and the training recipe. The full model class skeleton is given; you fill in the decoder.

Each image comes with a Gaussian-blurred density target in which every chicken contributes a small Gaussian bump. The official baseline's loss is per-pixel MAE between predicted and ground-truth density, with the density values pre-scaled by 100.
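
To make the "density integrates to count" property concrete, here is a minimal NumPy sketch that builds a density map from point annotations. The function name make_density, the kernel width sigma=4.0, and the point coordinates are assumptions for illustration; the organisers' exact kernel is not stated in this write-up.

```python
import numpy as np

def make_density(points, shape, sigma=4.0):
    """Place one unit-mass Gaussian bump per (row, col) point annotation.

    sigma is a hypothetical kernel width, not the organisers' value.
    """
    h, w = shape
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    dens = np.zeros(shape, dtype=np.float64)
    for r, c in points:
        bump = np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2.0 * sigma ** 2))
        dens += bump / bump.sum()  # normalise so each bird contributes exactly 1
    return dens

points = [(20, 30), (22, 34), (60, 10)]   # hypothetical bird centres
dens = make_density(points, shape=(90, 90))
count = dens.sum()                         # integrates to the number of birds
scaled = dens * 100.0                      # the x100 target used during training
```

Normalising each bump over the image grid keeps the sum exact even for birds near the border, where part of the Gaussian would otherwise fall outside the image.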

Source. Paraphrased from the official problem notebook on the IOAI-official repo (CC-BY-4.0).

2. What's being tested

Density estimation sits between segmentation and regression. The test asks:

Do you understand that counting via density is more robust than detection-then-count when objects are dense, overlapping, and small (the textbook reason CSRNet beat Faster-RCNN on crowd counting)?
Can you design a small decoder that upsamples a low-resolution feature map back to (close to) the input resolution while preserving spatial extent of the density bumps?
Can you keep the loss numerically stable when targets are scaled (×100) and predictions need to stay non-negative?
Can you train inside the Colab L4 budget?

Maps onto Deep Learning (Conv/ConvTranspose, dilated convolutions, MAE vs MSE), plus the parts of Python that handle HuggingFace datasets and PyTorch DataLoader.

3. Data exploration / setup

The training set ships via HuggingFace as ioaihsc/Task2_Chicken_Counting_Train2. Each sample is a dict with keys:

image — a PIL image, typically ~720×720 of an overhead chicken flock photo.
density — a 2-D NumPy array (downsampled 4× relative to the image, due to the two max-pools in the feature extractor) where each pixel holds the local chicken density. Sum of the density map = ground-truth count.

The provided FeatureExtraction module is 4 dilated 3×3 conv layers (3 → 64 → 64 → 128 → 128) with two 2×2 max-pools, so its output is 128 channels at H/4 × W/4. You build a decoder that takes (128, H/4, W/4) → (1, H/4, W/4) with non-negative output.
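
The H/4 × W/4 claim can be checked with the standard convolution and pooling output-size formulas. This is a pure-Python sketch; the helper names conv_out, pool_out and feature_hw are made up for illustration.

```python
def conv_out(size, kernel=3, padding=2, dilation=2, stride=1):
    """Output length of a conv layer along one spatial axis."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Output length of a max-pool layer (no padding, floor mode)."""
    return (size - kernel) // stride + 1

def feature_hw(h, w):
    """Spatial size after two (conv, conv, pool) stages."""
    for _ in range(2):
        h, w = conv_out(conv_out(h)), conv_out(conv_out(w))
        h, w = pool_out(h), pool_out(w)
    return h, w

print(feature_hw(720, 720))
```

With padding=2 and dilation=2 a 3×3 conv keeps the spatial size unchanged, so only the two pools shrink the map. For sizes that are not multiples of 4, the pools round down.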

Metric (from the provided metrics.py): mean absolute error of the integrated count, plus a relative-error rate = |pred − true| / true. The public leaderboard combines them; lower is better.
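
The two quantities can be sketched as follows. This is not the contents of metrics.py, only an illustration of the definitions quoted above; the function name count_metrics is hypothetical, true counts are assumed to be positive, and how the leaderboard combines the two numbers is not reproduced here.

```python
import numpy as np

def count_metrics(pred_maps, true_maps):
    """Return (count MAE, mean relative error) for arrays of shape (N, ...)."""
    pred_counts = pred_maps.reshape(len(pred_maps), -1).sum(axis=1)
    true_counts = true_maps.reshape(len(true_maps), -1).sum(axis=1)
    abs_err = np.abs(pred_counts - true_counts)
    mae = abs_err.mean()
    rate = (abs_err / true_counts).mean()   # assumes every true count > 0
    return mae, rate

# Two toy images with true counts 100 and 50, predicted as 90 and 55.
true_maps = np.stack([np.full((1, 10, 10), 1.0), np.full((1, 10, 10), 0.5)])
pred_maps = np.stack([np.full((1, 10, 10), 0.9), np.full((1, 10, 10), 0.55)])
mae, rate = count_metrics(pred_maps, true_maps)
```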

Hardware: trainable inside Colab L4 (24 GB) at batch size 8 in 20-40 minutes for 20 epochs.

4. Baseline approach

The official baseline ("decoder = one 3×3 conv to 1 channel") is intentionally weak. Here is a slightly better minimal decoder that still trains in < 10 minutes on L4 and gives a real score.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Provided by the contest. Outputs 128-ch feature map at H/4 x W/4.
class FeatureExtraction(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3,   64,  3, padding=2, dilation=2)
        self.conv2 = nn.Conv2d(64,  64,  3, padding=2, dilation=2)
        self.pool2 = nn.MaxPool2d(2, 2)
        self.conv3 = nn.Conv2d(64,  128, 3, padding=2, dilation=2)
        self.conv4 = nn.Conv2d(128, 128, 3, padding=2, dilation=2)
        self.pool4 = nn.MaxPool2d(2, 2)
    def forward(self, x):
        x = F.relu(self.conv1(x)); x = F.relu(self.conv2(x)); x = self.pool2(x)
        x = F.relu(self.conv3(x)); x = F.relu(self.conv4(x)); x = self.pool4(x)
        return x

class DensityDecoder(nn.Module):
    """Minimal CSRNet-style back-end: dilated convs to keep receptive field large
    without further downsampling, then a 1x1 to a single non-negative channel."""
    def __init__(self, in_ch=128):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, 3, padding=2, dilation=2)
        self.b2 = nn.Conv2d(64,    64, 3, padding=2, dilation=2)
        self.b3 = nn.Conv2d(64,    32, 3, padding=2, dilation=2)
        self.out = nn.Conv2d(32, 1, 1)
    def forward(self, x):
        x = F.relu(self.b1(x)); x = F.relu(self.b2(x)); x = F.relu(self.b3(x))
        return F.relu(self.out(x))   # density must be >= 0

class ChickenCounting(nn.Module):
    def __init__(self):
        super().__init__()
        self.feature_extraction = FeatureExtraction()
        self.decoder = DensityDecoder()
    def forward(self, x):
        return self.decoder(self.feature_extraction(x))

# --- training loop (abridged) ---
model = ChickenCounting().cuda()
model.feature_extraction.load_state_dict(
    {k.split(".", 1)[1]: v
     for k, v in torch.load("base.pth").items()
     if k.startswith("feature_extraction.")},
    strict=False,
)
opt   = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
crit  = nn.L1Loss(reduction="sum")           # summed per-pixel absolute error; same L1 family as the count metric
SCALE = 100.0

for epoch in range(20):
    for batch in train_loader:
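        # dens must be (B, 1, H/4, W/4) to match pred; if the loader yields
        # (B, H/4, W/4), add .unsqueeze(1) or the loss will broadcast silently.
        img, dens = batch["image"].cuda(), (batch["density"] * SCALE).cuda()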
        pred = model(img)
        loss = crit(pred, dens)
        opt.zero_grad(); loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 2.0)
        opt.step()

Expected score band on the public val split: ~3-6 chickens MAE per image [illustrative]. The official baseline notebook routinely reports rates between 0.05 and 0.20 (relative error) on individual val samples in the printed log.

5. Improvements that move the needle

5.1 · Data augmentation that respects density

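Random horizontal flips and crops are fine, but the density map must be flipped/cropped in lockstep with the image (after a crop, the target count is the sum of the cropped density, not the original count). Random brightness/contrast jitter on the image alone is also safe. Do not rotate by arbitrary angles unless you also rotate the density map with the same transform and renormalise it: resampling a Gaussian density blob, especially with nearest-neighbour interpolation, does not preserve its sum, which breaks the integration property (∫density ≠ count).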

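import random

import numpy as np
import torchvision.transforms.functional as TF

def joint_aug(img, dens):
    """img: PIL image; dens: 2-D NumPy density map at H/4 x W/4."""
    if random.random() < 0.5:
        img = TF.hflip(img)
        # TF.hflip does not accept NumPy arrays, so flip the density map directly;
        # copy() removes the negative stride that torch.from_numpy would reject.
        dens = np.fliplr(dens).copy()
    # colour jitter on the image only: brightness factor in [0.8, 1.2)
    img = TF.adjust_brightness(img, 0.8 + 0.4 * random.random())
    return img, dens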
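
Crops need the same lockstep treatment as flips. The sketch below assumes the image has already been converted to an (H, W, 3) array and the density map sits at 1/4 resolution; the function name joint_crop and the requirement that offsets be multiples of 4 are choices made for this illustration, not part of the official starter.

```python
import numpy as np

def joint_crop(img, dens, top, left, size, ratio=4):
    """Crop an (H, W, 3) image and its (H/ratio, W/ratio) density map together."""
    if top % ratio or left % ratio or size % ratio:
        raise ValueError("top, left and size must be multiples of ratio")
    img_c = img[top:top + size, left:left + size]
    t, l, s = top // ratio, left // ratio, size // ratio
    dens_c = dens[t:t + s, l:l + s]
    return img_c, dens_c

rng = np.random.default_rng(0)
img = np.zeros((720, 720, 3), dtype=np.uint8)
dens = rng.random((180, 180))
img_c, dens_c = joint_crop(img, dens, top=80, left=160, size=256)
crop_count = dens_c.sum()   # the training target count for this crop
```

Snapping the crop to multiples of 4 keeps the image window and the density window aligned pixel for pixel after the two max-pools.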

5.2 · Train with a count-aware auxiliary loss

The leaderboard cares about the integrated count, but per-pixel MAE doesn't directly minimise integration error: a slightly biased per-pixel prediction summed over tens of thousands of pixels (180×180 = 32,400 for a 720×720 image) gives a large count error. Add a count-MAE auxiliary term:

def total_loss(pred, dens, scale=100.0, alpha=0.1):
    px  = F.l1_loss(pred, dens, reduction="sum")
    # count: sum over spatial dims, then average over batch
    count_pred = pred.sum(dim=(1,2,3)) / scale
    count_true = dens.sum(dim=(1,2,3)) / scale
    cnt = F.l1_loss(count_pred, count_true)
    return px + alpha * cnt * pred.shape[0]
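
A small numeric illustration of why the count term matters: two error patterns with identical per-pixel MAE can give very different count errors. The bias value 1e-4 (in unscaled density units) and the 180×180 map size are assumptions chosen for the example.

```python
import numpy as np

h = w = 720 // 4                       # 180 x 180 output map for a 720 x 720 image
true = np.zeros((h, w))
bias = 1e-4                            # hypothetical per-pixel error magnitude

# Case 1: every pixel is over-predicted by the same small amount.
pred_biased = true + bias
# Case 2: same error magnitude, but alternating sign in a checkerboard pattern.
sign = (-1.0) ** np.add.outer(np.arange(h), np.arange(w))
pred_checker = true + bias * sign

px_mae_biased = np.abs(pred_biased - true).mean()
px_mae_checker = np.abs(pred_checker - true).mean()
count_err_biased = abs(pred_biased.sum() - true.sum())     # about 3.24 birds
count_err_checker = abs(pred_checker.sum() - true.sum())   # errors cancel
```

Per-pixel L1 cannot tell these two cases apart, while the count term penalises only the first.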

5.3 · Better decoder: CSRNet back-end with progressive upsample

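The H/4 × W/4 output is coarse. Two ConvTranspose2d(stride=2) layers bring the prediction back to the input resolution so the loss can compare density bumps at a finer scale. Because the provided targets and the submission arrays are both at H/4 × W/4, this requires a full-resolution target for training (for example, upsampling the provided map while preserving its sum) and sum-pooling the prediction back to H/4 × W/4 before saving. This usually buys 0.5-1 MAE.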
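
One possible shape for such a decoder is sketched below. The class name UpsampleDecoder, the channel widths, and the helpers sum_pool_4x and upsample_target_4x are assumptions for illustration, not the official solution. The helpers show one way to move between full resolution and the H/4 × W/4 grid without changing the integrated count.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampleDecoder(nn.Module):
    """Dilated convs at H/4, then two stride-2 transposed convs back to H x W."""
    def __init__(self, in_ch=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
        )
        # kernel_size=2, stride=2 exactly doubles each spatial dimension
        self.up1 = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.up2 = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        self.out = nn.Conv2d(16, 1, 1)

    def forward(self, x):
        x = self.body(x)
        x = F.relu(self.up1(x))
        x = F.relu(self.up2(x))
        return F.relu(self.out(x))   # density must be >= 0

def sum_pool_4x(d):
    """Collapse a full-resolution map to H/4 x W/4, preserving its sum."""
    return F.avg_pool2d(d, 4) * 16.0

def upsample_target_4x(d):
    """Expand an H/4 x W/4 target to full resolution, preserving its sum."""
    return F.interpolate(d, scale_factor=4, mode="nearest") / 16.0
```

Nearest-neighbour upsampling of the target adds no new detail, so a sharper full-resolution target would have to be regenerated from point annotations if those are available.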

5.4 · Test-time augmentation

Predict on the image and its horizontal flip, flip the second prediction back, then average the two density maps before integrating. A cheap improvement that costs only 2× inference time.

@torch.no_grad()
def tta_count(model, img, scale=100.0):
    p1 = model(img)
    p2 = TF.hflip(model(TF.hflip(img)))
    pred = (p1 + p2) / 2 / scale
    return pred.sum(dim=(1,2,3))
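
Two properties of flip TTA are easy to verify with a toy stand-in for the network: summing is flip-invariant, so the TTA count equals the mean of the two individual counts, and the averaged map is flip-consistent. The function fake_model below is purely hypothetical and exists only so the example runs without trained weights.

```python
import numpy as np

def fake_model(img):
    """Hypothetical stand-in for a network: a position-dependent response."""
    h, w = img.shape
    weights = np.linspace(0.5, 1.5, w)[None, :]
    return img * weights

def tta(img):
    """Average the prediction with the flipped-back prediction of the flipped input."""
    p1 = fake_model(img)
    p2 = fake_model(img[:, ::-1])[:, ::-1]
    return (p1 + p2) / 2

rng = np.random.default_rng(0)
img = rng.random((8, 12))
avg_map = tta(img)
count_tta = avg_map.sum()
count_mean = (fake_model(img).sum() + fake_model(img[:, ::-1]).sum()) / 2
```

Flipping the second prediction back does not change the count, but it does matter for the density maps themselves if those are what gets submitted.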

5.5 · Cosine LR + longer training inside the time budget

The baseline uses a tiny LR-decay schedule. Cosine annealing from 1e-4 to 1e-6 across 40 epochs converges to a noticeably lower validation MAE without overfitting; because the dataset is small, augmentation does most of the regularisation work.
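
The schedule itself is a one-line formula. The sketch below computes it in plain Python so the per-epoch values can be inspected; the function name cosine_lr is made up for illustration. In a PyTorch training loop, torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=40, eta_min=1e-6), stepped once per epoch, is meant to follow the same curve.

```python
import math

def cosine_lr(epoch, total_epochs=40, lr_max=1e-4, lr_min=1e-6):
    """Cosine annealing without restarts: lr_max at epoch 0, lr_min at the end."""
    cos = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos)

schedule = [cosine_lr(e) for e in range(41)]
```

The learning rate stays near 1e-4 for the first few epochs, passes through the midpoint (5.05e-5) at epoch 20, and flattens out near 1e-6 at the end.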

6. Submission format & gotchas

Submit submission.npz with two arrays: pred_a (validation predictions) and pred_b (final test predictions). Each is an (N, 1, H/4, W/4) density map array, after dividing by the SCALE factor.
The shapes must match exactly — the grader does np.load(...)["pred_a"] and integrates per-image.
Output is divided by SCALE in inference; if you forget, your counts are 100× too high and your score is catastrophic.
ReLU on the final output is essential — negative densities sum to give counts that can be near zero or negative.
Set torch.backends.cudnn.deterministic = True (the official cell does this) so a grader rerun gives the same number.
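
The checklist above can be folded into a small save routine. This is a sketch assuming all images share one size so the predictions stack into a single array; the helper names to_submission and save_submission are hypothetical, while the file name and the pred_a / pred_b keys come from the task description. The demo writes to a temporary directory with made-up constant predictions.

```python
import os
import tempfile

import numpy as np

SCALE = 100.0

def to_submission(scaled_preds):
    """Undo the x100 scaling, clamp negatives, and check the (N, 1, h, w) layout."""
    arr = np.asarray(scaled_preds, dtype=np.float32) / SCALE
    arr = np.clip(arr, 0.0, None)
    if arr.ndim != 4 or arr.shape[1] != 1:
        raise ValueError(f"expected (N, 1, H/4, W/4), got {arr.shape}")
    return arr

def save_submission(path, val_scaled, test_scaled):
    """Write validation and test predictions under the required keys."""
    np.savez(path, pred_a=to_submission(val_scaled), pred_b=to_submission(test_scaled))

# Demo with made-up predictions in the scaled (x100) space.
out_path = os.path.join(tempfile.mkdtemp(), "submission.npz")
save_submission(out_path, np.full((3, 1, 180, 180), 0.5), np.full((2, 1, 180, 180), 1.0))
loaded = np.load(out_path)
first_test_count = float(loaded["pred_b"][0].sum())   # 32400 * 0.01, about 324
```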

7. What top solutions did

The unofficial write-up site ioai-writeup.github.io documents that top scorers used: (i) a CSRNet-style dilated-conv decoder of 6-8 layers, (ii) a count-aware auxiliary loss, (iii) horizontal-flip TTA, (iv) a small ensemble of 2-3 seeds. The official solution notebook in the IOAI-2025 repo (Chicken_Counting_Solution.ipynb) shows a richer decoder than the problem starter and uses the SCALE=100 trick to keep gradients sane. Public score values for individual contestants are not reproduced outside the official scoreboard, so improvements above are quoted as qualitative; absolute deltas are [illustrative].

8. Drill

D1 · Why density estimation, not detection?

On overhead photos of a tightly packed flock, individual chickens overlap, occlude each other, and cover ~20-50 pixels each. Detection-then-count fails because (1) NMS suppresses neighbouring true positives in dense regions and (2) annotating bounding boxes for every bird is far more expensive than dropping a point per bird and convolving with a Gaussian. Density estimation sidesteps both issues: the model learns "how much chicken-ness is at this pixel" and the integral recovers the count even when no single bird is cleanly delineated.

Follow-up: under what regime would detection beat density? When chickens are sparse and big — a handful of birds at high resolution. Then YOLO-style detection has clean boxes and density's Gaussian-bump approximation becomes overkill.
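
The NMS failure mode is easy to reproduce with a toy example. Below is a minimal greedy NMS in pure Python with made-up boxes and scores: two genuinely different birds whose boxes overlap heavily, plus one isolated bird. The 0.5 IoU threshold is a common default, used here as an assumption.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep a box only if it overlaps no kept box by more than thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (2, 0, 12, 10), (30, 30, 40, 40)]   # three real birds
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)   # the second bird is suppressed by the first
```

Three birds go in and two detections come out, so the count is already off by one before any detector error is considered.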

D2 · Why scale the targets by 100?

The raw density values are tiny (each Gaussian bump integrates to 1 spread over hundreds of pixels, so per-pixel values are ~1e-3 to 1e-4). The L1 gradient has the same magnitude whatever the target scale, and Adam's updates are roughly lr-sized regardless, so a step that is modest for O(1) outputs is large relative to targets this small; training tends to be unstable or to collapse to an all-zero map behind the final ReLU. Multiplying the targets by 100 (and dividing predictions by 100 at inference) moves the values into a more comfortable numerical range and lets you use the standard lr=1e-4 without retuning.
