Guide

How Software Portrait Mode Works: Bokeh from a Single Lens

A look at how phones and software simulate the shallow depth-of-field of a fast prime lens — and where the simulation falls short.

The first time you saw it on a phone, it probably felt like magic. You point a camera at a friend, the background goes soft and creamy, the subject pops, and suddenly your phone produces a photo that looks like it came from a $2,000 camera with a $1,500 lens. It is the kind of feature you take for granted now — every flagship phone has it, every messaging app offers a version, and entire generations of social-media photography depend on it. But the technology underneath is more interesting than it appears, and the limitations more telling.

Why optical bokeh exists in the first place

Real bokeh — the out-of-focus blur in a photograph taken with a wide aperture — is a consequence of physics. Light from a single point in the world is bent by the lens into a cone that converges toward a point. If that cone converges exactly on the camera's sensor, the point appears sharp. If it converges in front of or behind the sensor, the cone hits the sensor as a circle of light called the circle of confusion. Wider apertures produce wider cones, so out-of-focus points spread over larger circles and the image looks blurrier away from the focal plane.

The strength of the effect depends on three things: the aperture (wider = more blur), the focal length (longer = more blur), and the distance from camera to subject (closer = more blur). On a small-sensor phone camera with a wide-angle equivalent focal length, the optical effect is barely visible even at the maximum aperture. To get the photographic look people associate with portraits — sharp subject, creamy background — you need either a much bigger sensor or a much longer lens than a phone has room for. Or you fake it.
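To get a feel for the scale of the difference, here is a minimal sketch of the thin-lens blur-disc calculation. The example numbers — an 85 mm f/1.8 portrait lens versus a 6 mm f/1.8 phone-style module, subject at 2 m, background at 5 m — are illustrative assumptions, not measurements of any particular device.

```python
def blur_disc_diameter_mm(focal_mm, f_number, subject_mm, background_mm):
    """Approximate diameter of the circle of confusion on the sensor for a
    background point, using the thin-lens model: wider apertures, longer
    focal lengths, and closer subjects all enlarge the disc."""
    aperture_mm = focal_mm / f_number                   # physical aperture diameter
    magnification = focal_mm / (subject_mm - focal_mm)  # image magnification at the subject
    return aperture_mm * magnification * abs(background_mm - subject_mm) / background_mm

# Hypothetical numbers: subject at 2 m, background at 5 m.
print(blur_disc_diameter_mm(85, 1.8, 2000, 5000))  # ~1.3 mm: a large fraction of a full-frame sensor's width
print(blur_disc_diameter_mm(6, 1.8, 2000, 5000))   # ~0.006 mm: a tiny fraction of a phone sensor's width
```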

The basic recipe

Software portrait mode, in any of its forms, follows the same recipe at a high level:

  1. Identify the subject. Determine which pixels belong to the foreground and which belong to the background. The output is some kind of mask or depth estimate.
  2. Blur the background. Apply a blur kernel to the regions identified as background, while leaving the foreground untouched.
  3. Composite. Combine the sharp foreground with the blurred background using the mask, with smooth blending at the boundary.

The interesting part is step 1, because it is what determines whether the result looks convincing or not. There are several different ways to do it, with very different quality and cost trade-offs.
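Steps 2 and 3, by contrast, are mostly mechanical. Here is a minimal sketch with NumPy and OpenCV, assuming step 1 has already produced a soft 0-to-1 mask; the file names are placeholders.

```python
import cv2
import numpy as np

image = cv2.imread("portrait.jpg").astype(np.float32) / 255.0                    # H x W x 3
mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255.0   # H x W, 1 = subject

# Step 2: blur the whole frame (the mask decides where the blur shows through).
blurred = cv2.GaussianBlur(image, ksize=(0, 0), sigmaX=15)

# Step 3: composite. Feather the mask slightly so the boundary blends smoothly.
alpha = cv2.GaussianBlur(mask, ksize=(0, 0), sigmaX=3)[..., None]                # H x W x 1
result = alpha * image + (1.0 - alpha) * blurred

cv2.imwrite("portrait_bokeh.jpg", (result * 255).astype(np.uint8))
```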

Approach 1: salient object segmentation

This is the simplest approach and the one our tool uses. A neural network is trained to predict, for every pixel, whether it belongs to the main subject. The mask is a 2D matrix the same size as the image, with values between 0 and 1. There is no explicit notion of depth — only "subject" and "not subject."

The advantage is simplicity: a single network produces the mask in one forward pass. The disadvantage is that everything that is not the subject is treated identically: a wall right behind your subject and a tree fifty metres beyond it are blurred by the same amount. Real bokeh, by contrast, blurs more as you move away from the focal plane, so the simulation looks subtly off, especially in scenes with lots of depth behind the subject.
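Dedicated salient-object networks (U2-Net and similar) are the usual choice for this step. As a rough stand-in, here is a sketch that extracts a subject mask for people from torchvision's pretrained DeepLabV3 — assuming a recent torchvision where `weights="DEFAULT"` is available — which can then feed the composite sketch above.

```python
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50

model = deeplabv3_resnet50(weights="DEFAULT").eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics the model expects
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("portrait.jpg").convert("RGB")
with torch.no_grad():
    logits = model(preprocess(image).unsqueeze(0))["out"][0]  # (21, H, W) class scores

# Class 15 is "person" in the Pascal VOC label set; its softmax probability
# gives a soft 0-to-1 subject mask with no notion of depth.
mask = logits.softmax(dim=0)[15].numpy()
```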

Approach 2: monocular depth estimation

A more sophisticated approach uses a neural network to estimate per-pixel depth from a single 2D image. This is a much harder problem: depth is not directly visible in a 2D image, so the network has to infer it from cues like perspective, occlusion, lighting, atmospheric haze, and learned priors about how the world looks. But modern monocular depth networks (MiDaS, DPT, ZoeDepth, and others) have become surprisingly good.
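As a sketch of what running one of these models looks like, here is the standard torch.hub route for the small MiDaS variant; the entry-point names are the ones published by the intel-isl/MiDaS repository, and the image path is a placeholder.

```python
import cv2
import torch

# Load the small MiDaS model and its matching preprocessing transform from torch.hub.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small").eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.small_transform

img = cv2.cvtColor(cv2.imread("portrait.jpg"), cv2.COLOR_BGR2RGB)

with torch.no_grad():
    prediction = midas(transform(img))            # low-resolution inverse depth
    depth = torch.nn.functional.interpolate(      # resize back to the input resolution
        prediction.unsqueeze(1),
        size=img.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze().numpy()

# MiDaS predicts relative inverse depth: larger values mean closer to the camera.
```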

With a depth map, you can apply a depth-dependent blur: pixels at the focal distance stay sharp, and pixels further from it (whether nearer to the camera or beyond the subject) get progressively blurrier. This produces a much more convincing optical-bokeh look, with the depth-dependent gradient that real lenses produce.
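One simple way to turn a depth map into this kind of blur is to pre-render a handful of progressively blurrier copies of the frame and blend between them per pixel, according to how far each pixel sits from the chosen focal depth. A sketch in NumPy and OpenCV: the layer count and maximum blur strength are arbitrary assumptions, and the depth map can be whatever the monocular network produces, as long as the focal depth is picked on the same scale.

```python
import cv2
import numpy as np

def depth_blur(image, depth, focal_depth, max_sigma=15.0, layers=6):
    """Blend between pre-blurred copies of `image` (H x W x 3 float array) so that
    blur grows with a pixel's distance from `focal_depth`, roughly mimicking the
    falloff of a real lens. `depth` is an H x W map on the same scale as focal_depth."""
    # 0 = in focus, 1 = maximally out of focus.
    defocus = np.abs(depth - focal_depth)
    defocus = defocus / (defocus.max() + 1e-6)

    # A small stack of progressively blurrier renderings of the frame.
    sigmas = np.linspace(0.0, max_sigma, layers)
    stack = np.stack([image] + [cv2.GaussianBlur(image, (0, 0), s) for s in sigmas[1:]])

    # Map each pixel's defocus to a fractional layer index and blend the two
    # nearest layers so the blur varies smoothly across the frame.
    idx = defocus * (layers - 1)
    lo = np.floor(idx).astype(int)
    hi = np.minimum(lo + 1, layers - 1)
    frac = (idx - lo)[..., None]

    rows, cols = np.indices(defocus.shape)
    return (1 - frac) * stack[lo, rows, cols] + frac * stack[hi, rows, cols]
```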

Approach 3: dual-camera or depth-sensor stereo

The flagship-phone approach. Capture the scene from two slightly different viewpoints (either with two camera lenses or with a dedicated depth sensor) and use the parallax between them to compute a real depth map. This is the same technique your eyes use to perceive depth, and it produces excellent results — provided your subject is at the right distance for the parallax baseline. Too close or too far and the depth estimate breaks down.
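The parallax-to-depth step itself is classical geometry: for a calibrated pair, depth = focal length × baseline / disparity. A sketch with OpenCV's basic block matcher; the calibration numbers are placeholder assumptions, and real phones use far more sophisticated matching than this.

```python
import cv2
import numpy as np

left = cv2.imread("left.jpg", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.jpg", cv2.IMREAD_GRAYSCALE)

# Classical block matching: for each pixel, measure how far it shifted between the views.
matcher = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0  # StereoBM returns fixed-point values

focal_px = 1500.0   # focal length in pixels (placeholder, comes from calibration)
baseline_m = 0.012  # distance between the two lenses in metres (placeholder)

# depth = f * B / d: a larger shift (disparity) means the point is closer.
valid = disparity > 0
depth_m = np.zeros_like(disparity)
depth_m[valid] = focal_px * baseline_m / disparity[valid]
```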

Phones combine this hardware approach with neural refinement: the dual-camera depth map is often noisy, especially around edges, and a network is used to clean it up using cues from the colour image. The combination is what produces the very clean cutouts you see in modern phone portrait modes.

Approach 4: end-to-end neural rendering

The most recent approach, used in some research and high-end commercial products, skips the explicit "make a mask, then blur" pipeline entirely. A network is trained directly to take a sharp photo as input and produce a portrait-mode photo as output. The training data is paired: sharp photos and corresponding "ground truth" shallow-depth-of-field photos taken with a real camera. The network learns the entire transformation as a single function.
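In outline, the training setup is ordinary supervised image-to-image learning. Here is a heavily simplified PyTorch sketch with a toy network, random stand-in data, and a plain L1 loss — a sketch of the idea, not any particular product's recipe.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in network: real systems use U-Net-style image-to-image architectures;
# this tiny stack just keeps the sketch self-contained.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)

# Stand-in data: pairs of (sharp, ground-truth shallow depth-of-field) images.
# In a real system these come from capturing both versions of each scene.
sharp_images = torch.rand(16, 3, 128, 128)
shallow_dof_images = torch.rand(16, 3, 128, 128)
loader = DataLoader(TensorDataset(sharp_images, shallow_dof_images), batch_size=8, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for sharp, shallow_dof in loader:
    predicted = model(sharp)                                   # the whole pipeline is one function
    loss = torch.nn.functional.l1_loss(predicted, shallow_dof)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```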

Done well, this can capture nuances the explicit pipeline misses: how partially-occluded edges should be rendered, how out-of-focus highlights should turn into bokeh discs, how skin should be subtly retouched, and so on. Done poorly, it produces uncanny artefacts that are hard to debug because the entire process is opaque.

Why simulated bokeh almost, but not quite, looks like the real thing

A few specific things are hard to fake:

  - Out-of-focus highlights. Through a real lens, bright points spread into distinct discs shaped by the aperture; a software blur, especially a plain Gaussian, smears them into a soft haze instead.
  - The depth gradient. Real blur grows steadily with distance from the focal plane, which a single foreground/background mask cannot reproduce.
  - Boundary pixels. At the subject's edge, each pixel mixes light from the subject and the background; a real lens renders those partially-occluded edges in a way a hard cut-and-blur composite cannot.

These are not flaws so much as fundamental limits. Without a real cone of light spreading through a physical aperture, software cannot fully reproduce the appearance of light passing through a lens.

Where the segmentation typically fails

Even leaving the optical limitations aside, the segmentation itself produces visible failures:

  - Hair and other fine detail. Individual strands are thinner than the mask can resolve, so they either get blurred away or drag a halo of sharp background along with them.
  - Busy backgrounds. When the background is cluttered or similar in colour to the subject, the network struggles to decide where the subject ends.
  - Edges in general. Even when the mask is roughly right, its boundary is often slightly noisy, which shows up as a visible seam between the sharp and blurred regions.

When to use software portrait mode and when not to

Software portrait mode is the right choice when:

  - You only have an ordinary single-lens photo to work with and no depth data.
  - The photo is headed for a screen, a chat, or a social feed, where small edge flaws won't be noticed.
  - Speed and cost matter more than perfect rendering.

It's the wrong choice when:

  - The subject has lots of loose hair or other fine edge detail, or the background is busy.
  - The scene has real depth behind the subject and you want the gradual, depth-dependent falloff of an actual lens.
  - You have the option of shooting with a large sensor and a fast lens in the first place.

Try it

Our free portrait mode tool uses the simplest of the approaches above — segmentation followed by Gaussian blur — because it produces good-enough results in 3 seconds on free infrastructure. If you compare the output to the same photo on a recent flagship phone, the phone will probably look better, especially around hair and on busy backgrounds. But for many use cases the difference is small enough not to matter.
