What object detection is
Object detection is one of the foundational tasks of computer vision. Given an image, the model has to do two things at once: find the objects (where in the frame are they?) and recognise them (what is each one?). The output is a list of bounding boxes, each with a class label like person, car, dog, chair, and a confidence score between 0 and 1.
It is a different task from classification (which only labels the dominant subject) and from segmentation (which produces a per-pixel mask). Detection sits in the middle: looser than segmentation, more informative than classification.
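Concretely, a detection result can be modelled as a small record per box. A minimal sketch in Python — the field names and sample values here are illustrative, not any particular library's API:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str    # one of the model's known classes, e.g. "dog"
    score: float  # confidence between 0 and 1
    box: tuple    # (x_min, y_min, x_max, y_max) in pixels

# A detector returns a list of these per image (sample values made up):
detections = [
    Detection("person", 0.97, (12, 30, 210, 480)),
    Detection("dog", 0.88, (220, 310, 390, 470)),
    Detection("chair", 0.41, (400, 200, 500, 430)),
]

def above(dets, threshold):
    """Keep only the boxes the model is reasonably confident about."""
    return [d for d in dets if d.score >= threshold]
```

Filtering with `above(detections, 0.6)` keeps the person and the dog but drops the low-confidence chair.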
The model we use
The detector is a Faster R-CNN with a MobileNet V3 backbone. To unpack that:
- Faster R-CNN is a two-stage detection architecture. The first stage proposes regions of the image likely to contain something. The second stage classifies what is in each proposed region and refines its bounding box.
- MobileNet V3 is the feature extractor backbone — a smaller, more efficient network than alternatives like ResNet-50. It trades some accuracy for speed, which makes it a good fit for free public infrastructure.
- The model was trained on COCO (Common Objects in Context), a dataset of around 330,000 images covering 80 categories of everyday objects.
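This exact architecture ships with torchvision, so you can poke at the same family of model yourself. A hedged loading sketch, assuming torchvision 0.13 or newer is installed (the pretrained weights download on first use):

```python
# Sketch: loading torchvision's Faster R-CNN + MobileNet V3 detector.
# Assumes torchvision >= 0.13; nothing heavy runs until you call this.

def load_detector():
    from torchvision.models.detection import (
        fasterrcnn_mobilenet_v3_large_fpn,
        FasterRCNN_MobileNet_V3_Large_FPN_Weights,
    )
    weights = FasterRCNN_MobileNet_V3_Large_FPN_Weights.DEFAULT
    model = fasterrcnn_mobilenet_v3_large_fpn(weights=weights)
    model.eval()  # inference mode: no gradients, fixed batch-norm statistics
    # weights.meta["categories"] is the label list the predicted indices map into
    return model, weights.meta["categories"]
```

At inference, the model takes a list of float image tensors and returns, per image, a dict of "boxes", "labels", and "scores" — the same box/label/score triple described above.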
What categories it knows
The 80 COCO categories include people, animals (cat, dog, horse, elephant, bear, etc.), vehicles (car, bicycle, bus, train, airplane, boat), kitchen items (cup, fork, knife, bowl, banana, apple, sandwich), furniture (chair, couch, bed, dining table), electronics (TV, laptop, mouse, keyboard, cell phone), sports equipment (frisbee, skateboard, surfboard, tennis racket), and a handful of others. It does not know about specific brands, individual people, fictional or cartoon characters, fine-grained species (it knows "bird," not "robin"), or anything outside the 80 trained classes.
What it gets right
- Common indoor and outdoor scenes with multiple recognisable objects
- People in natural poses
- Vehicles, pets, and food, even partially obscured
- Small objects, as long as they remain large enough to resolve after the image is downscaled to 512px
Where it confuses itself
- Cluttered scenes. If many objects overlap or pile up, individual bounding boxes can merge or get missed.
- Unusual angles. Top-down views, extreme close-ups, or fisheye distortion are outside the training distribution.
- Stylised or artistic images. Drawings, paintings, anime, low-poly renders — the detector was trained on photographs.
- Objects similar to other classes. A wolf can be detected as a dog. A skateboard ramp can be detected as a couch. A statue can be detected as a person.
- Tiny objects in wide shots. The 512px downscale leaves very few pixels for distant objects.
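The tiny-object point is just arithmetic. Assuming the long side of the image is scaled down to 512px, a back-of-the-envelope sketch:

```python
def downscaled_size(obj_px, image_px, target_px=512):
    """Pixels an object spans after the image's long side is scaled to target_px."""
    return obj_px * target_px / image_px

# A 60px-wide dog in a 4000px-wide photo shrinks to under 8px:
print(downscaled_size(60, 4000))    # 7.68 — a handful of pixels, easy to miss
# The same dog filling 1200px of the frame keeps plenty of detail:
print(downscaled_size(1200, 4000))  # 153.6
```

The example widths are made up, but the ratio is the whole story: whatever fraction of the frame an object occupies, that is the fraction of 512 pixels it gets.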
Practical uses
- Cataloguing a photo collection. What is in this picture? Run a few thousand images through a detector and you have machine-readable tags for free.
- Accessibility. A list of detected objects is a starting point for an alt-text description.
- Education. Seeing where the model finds and misses things is a great introduction to where computer vision is in 2025.
- Curiosity. Sometimes you just want to know what an AI sees in your photo.
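The cataloguing idea above is only a few lines once you have per-image detections. A sketch with hypothetical results inlined — in practice the label lists would come from running the detector:

```python
from collections import defaultdict

# Hypothetical per-image results: filename -> labels detected above threshold.
results = {
    "beach.jpg": ["person", "person", "dog", "surfboard"],
    "kitchen.jpg": ["cup", "bowl", "banana"],
    "park.jpg": ["dog", "bench", "person"],
}

def build_tag_index(results):
    """Invert detections into a searchable tag -> filenames index."""
    index = defaultdict(set)
    for filename, labels in results.items():
        for label in labels:
            index[label].add(filename)
    return index

index = build_tag_index(results)
print(sorted(index["dog"]))  # ['beach.jpg', 'park.jpg']
```

Every photo containing a dog, a person, or a banana is now one dictionary lookup away — the "machine-readable tags for free" mentioned above.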
A note on accuracy
Object detection models report a confidence score with each detection. We currently filter to a sensible threshold and show all boxes above it, so you may notice some labels you disagree with. That is the model being honest — it is showing you its best guess, even when its confidence is moderate. The score in the label is between 0 and 1; treat anything below 0.6 as a maybe, and anything above 0.9 as a near-certainty.
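That rule of thumb can be written down directly. A sketch — the 0.6 and 0.9 cutoffs are the ones above, while the middle "probably" band is our own wording:

```python
def interpret_score(score):
    """Translate a 0-1 confidence score using the rule of thumb above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("detection scores are always between 0 and 1")
    if score < 0.6:
        return "maybe"        # low confidence: take with a grain of salt
    if score > 0.9:
        return "almost certainly"
    return "probably"         # assumed wording for the middle band

print(interpret_score(0.45))  # maybe
print(interpret_score(0.95))  # almost certainly
```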