Guide

How AI Object Detection Works: Faster R-CNN Explained

A plain-language walkthrough of one of computer vision's foundational tasks: finding and labelling objects in a photograph.

If you have ever used your phone's camera and watched it draw little boxes around faces before you tapped the shutter, you have seen object detection in action. The technology has become quietly ubiquitous: it powers content moderation systems, retail self-checkout cameras, autonomous vehicle perception, accessibility tools, photo organisation, and a thousand other things you probably do not think about. This guide explains what is actually happening inside an object detection model, why it sometimes fails in oddly specific ways, and what to expect from the open-source models powering free tools like ours.

Three problems, not one

Computer vision splits the broad question "what is in this image?" into a few related but distinct sub-problems:

- Image classification assigns one label to the whole image: "this is a photo of a dog." It says what, but not where.
- Object detection finds each object, labels it, and draws a rectangular bounding box around it.
- Segmentation labels the image pixel by pixel, tracing each object's exact outline.

Object detection sits in the middle. It tells you what is there and where it is, with enough precision to be useful but at lower computational cost than full segmentation.

The two-stage approach

Most modern detectors fall into one of two design families: two-stage detectors and single-stage detectors. Faster R-CNN is the canonical two-stage model.

The first stage is called the region proposal network (RPN). Its job is to look at the image and output a list of candidate regions that probably contain something interesting. It does not yet say what; it just says "there's a thing here, and here, and here." The RPN is trained to be high-recall: it would rather propose too many regions than miss anything. Modern RPNs typically output a few hundred or a few thousand candidate regions per image.
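In spirit, the proposal-selection step can be sketched in a few lines of Python. This is an illustration, not Faster R-CNN's actual implementation (the real RPN scores dense grids of anchor boxes as tensors, and `top_k_proposals` is a hypothetical name):

```python
# Simplified sketch: the RPN scores many candidate boxes for "objectness"
# and keeps the top k, erring on the side of too many rather than too few.

def top_k_proposals(boxes, objectness, k=1000):
    """Keep the k candidate boxes most likely to contain *something*.

    boxes      -- list of (x1, y1, x2, y2) tuples
    objectness -- one score per box: "is there an object here at all?"
    """
    ranked = sorted(zip(objectness, boxes), key=lambda pair: pair[0], reverse=True)
    return [box for _, box in ranked[:k]]

# Toy example: four candidates, keep the best two.
boxes = [(0, 0, 10, 10), (5, 5, 20, 20), (1, 1, 2, 2), (8, 8, 30, 30)]
scores = [0.9, 0.2, 0.05, 0.7]
print(top_k_proposals(boxes, scores, k=2))
# [(0, 0, 10, 10), (8, 8, 30, 30)]
```

Note what is absent: no class labels at all. That question is deferred entirely to the second stage.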

The second stage takes each candidate region and runs a classification network on it. For each region the classifier outputs a probability distribution over all the object categories the model knows about, plus a "background" class for "actually, nothing of interest is here." Regions that classify as background get discarded; regions that classify as a real object are kept, with their bounding boxes refined slightly by a regression head that nudges the box to fit the object more tightly.
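The second stage's decision logic looks something like the following sketch. It is deliberately simplified: the label set and function name are invented for illustration, and the box-refinement regression head is omitted, so each box passes through unchanged.

```python
# Illustrative sketch of second-stage filtering: pick each region's most
# likely class, and discard regions whose best guess is "background".
CLASSES = ["background", "person", "dog", "car"]  # toy label set

def classify_regions(region_probs, boxes):
    detections = []
    for probs, box in zip(region_probs, boxes):
        best = max(range(len(probs)), key=lambda i: probs[i])
        if CLASSES[best] != "background":
            detections.append({"label": CLASSES[best],
                               "score": probs[best],
                               "box": box})
    return detections

# Two proposals: one is clearly a dog, the other is empty background.
probs = [[0.05, 0.10, 0.80, 0.05],
         [0.90, 0.04, 0.03, 0.03]]
boxes = [(12, 30, 140, 200), (0, 0, 20, 20)]
print(classify_regions(probs, boxes))
# [{'label': 'dog', 'score': 0.8, 'box': (12, 30, 140, 200)}]
```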

The final output is a list of detections, each with a class, a refined bounding box, and a confidence score (the probability the classifier assigned to the chosen class).

Why two stages

Single-stage detectors like YOLO and SSD do classification and localisation in one pass, which is faster. Two-stage detectors like Faster R-CNN run the classifier separately on each candidate region, which is slower but generally more accurate. The trade-off depends on your use case. For real-time on-device inference (autonomous vehicles, security cameras), single-stage is often the right choice. For free server-side processing where a couple of seconds is acceptable, two-stage gives better results.

The role of the backbone

Both stages share a feature extractor — the backbone. This is a deep convolutional neural network that processes the image once and produces a rich feature map: a representation in which each spatial location is described by a high-dimensional vector encoding what the network has learned about that part of the image. The RPN and the classifier both operate on this shared feature map rather than re-processing the raw pixels.

The choice of backbone is the main lever for trading speed against accuracy. ResNet-50 and ResNet-101 are common high-accuracy choices. MobileNet V3 is a lighter alternative designed for efficiency: smaller, faster, and only modestly less accurate. We use MobileNet V3 in our object detector because it produces good results in 2–3 seconds on a single CPU thread, with no GPU required.

What COCO is and why it matters

The dataset a detector is trained on shapes everything about what it can do. The dominant detection benchmark is COCO — Common Objects in Context — released by Microsoft Research in 2014 and expanded in subsequent releases. COCO contains around 330,000 images, with 80 object categories labelled across them.

The 80 categories are heavily skewed toward Western, urban, contemporary scenes: people, common pets, vehicles, kitchen items, furniture, sports equipment, electronics. There is no bicycle helmet, no rice cooker, no traditional clothing from outside Western Europe and North America, no laboratory equipment, no construction tool. The model literally cannot recognise things it has not seen — and it has only seen 80 categories, biased toward what was being photographed and shared on Western photo-sharing sites around 2014.

This is one of the most important caveats about any pre-trained object detector: the model's worldview is the dataset's worldview. If you upload a photo from a context the dataset under-represents, expect mediocre results.

Confidence scores: what they mean and what they don't

Every detection comes with a number between 0 and 1, often called a confidence score. It is tempting to interpret this as the probability that the detection is correct. It isn't, quite.

The score reflects the classifier's output probability for the predicted class — a number that summarises how strongly the model preferred that class over all the others. A high score means the model was very sure of its choice given the features it extracted. It does not account for the possibility that the features themselves are misleading. A model can be 95% confident that a wolf is a dog, because at the feature level, wolves and dogs are extremely similar.

That is also why models can be confidently wrong. The classifier's confidence is calibrated against the categories it was trained on, not against the open world of all possible images.
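A worked example makes this concrete. The logits below are invented, and the label set is a toy one, but the mechanics are exactly what a classifier's softmax layer does: the probability mass must be divided among the classes the model knows, so an out-of-vocabulary wolf with dog-like features produces a very high "dog" score.

```python
import math

def softmax(logits):
    """Turn raw class scores into a probability distribution."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for a wolf photo, from a model that only knows
# these four classes. "Wolf" is not an option, so the probability has to
# land somewhere -- and dog-like features push it onto "dog".
classes = ["dog", "cat", "car", "background"]
probs = softmax([5.0, 1.5, -1.0, 0.5])

best = max(range(len(probs)), key=lambda i: probs[i])
print(classes[best], round(probs[best], 2))  # dog 0.96
```

The score is high not because the model is right, but because "dog" beat the three alternatives it was allowed to consider.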

Common failure modes

A few recur often enough to be worth knowing:

- Confusing visually similar categories. At the feature level a wolf looks very like a dog, so the model will confidently call it one.
- Mishandling out-of-vocabulary objects. Anything outside the 80 trained categories is either invisible to the model or forced into the nearest category it does know.
- Degrading on under-represented contexts. Unusual angles, drawings, old prints, and scenes from cultures the training data under-samples all produce weaker results.
- Missing small, occluded, or crowded objects. Partial views and overlapping instances give both the proposal and classification stages less to work with.

When to trust a detector and when not to

Object detection is a useful technology with well-understood limitations. It is right for tasks that tolerate some error, where a human is reviewing results before action is taken, or where the output is one signal among many.

It is dangerous for high-stakes use cases — medical, legal, security — where a confident wrong answer can cause harm. Modern detection models are remarkable in many ways, but their failure modes are not uniformly distributed: they fail more often on images of certain demographics, contexts, and cultural settings, because their training data was collected from a non-representative slice of the world. Anyone deploying a detector at scale should think hard about whose images the model has seen, and whose it has not.

Try it on your own photos

Our free object detector runs the model directly. Upload a photo and you'll get the original back with bounding boxes overlaid. It is interesting to try it on photos with unusual content — old prints, drawings, photos taken from unusual angles, photos of objects the model probably hasn't seen — and see how it responds. The failures are often more illuminating than the successes.

← Back to all guides