How a Visual Ear Cleaner Actually Works — Optics, Firmware, and the AI Safety Layer
A deep technical overview of visual ear cleaner architecture: the 3.5mm endoscope lens, the 1080P CMOS sensor, the STM32 firmware stack, the Wi-Fi pipe to a smartphone, and the on-device AI that prevents injury.
When we invented the smart visual ear cleaner category in 2019, we made an engineering bet: the product only works if the optics, the wireless link, and the safety layer all hit premium specifications at a consumer price point. Get any one of those wrong and you ship a product that either doesn’t work (blurry video, dropped Wi-Fi) or actively injures users (camera pushed too close to the tympanic membrane). This post walks through each subsystem and explains the design tradeoffs.
The 3.5 mm endoscope lens
The human ear canal averages 26 mm in length and 7 mm in diameter. You need a camera small enough to enter comfortably, with enough depth of field to keep the tympanic membrane in focus and enough illumination to see detail in the dark. Our lens is 3.5 mm in diameter with a 70° field of view and a working distance of 15–20 mm. Focus is fixed (a mechanical focus motor would be too fragile), but the depth of field is wide enough that anything between 10 mm and 25 mm is reasonably sharp.
Illumination comes from six SMD LEDs arrayed around the lens barrel, color-temperature-tuned to 5500 K (natural daylight) so the skin of the ear canal appears the color a doctor would expect to see. Early prototypes used warmer LEDs and made tissue look clinically wrong; we changed the color temperature after feedback from three audiology partners who piloted our units in 2020.
The 1080P CMOS sensor
Resolution matters less than people think; signal-to-noise ratio matters more. We use a 1/4-inch CMOS sensor with 1920×1080 native resolution and a 1.4 μm pixel pitch. In a dark, confined space like the ear canal, small pixels produce noisy images even with bright LEDs. Our sensor choice was driven by noise performance at ISO 400–800, not by megapixel count: a higher-resolution sensor with smaller pixels would deliver worse effective image quality in this use case.
The sensor streams raw data to a dedicated ISP (image signal processor) in the same package; the ISP handles noise reduction, auto-exposure, and JPEG compression. End-to-end latency from photon capture to Wi-Fi packet transmission is 35 ms.
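To see why on-package JPEG compression is non-negotiable, compare raw and compressed data rates. The 30 fps frame rate and 20:1 compression ratio below are illustrative assumptions, not published specs:

```c
#include <assert.h>

/* Raw sensor data rate in bits per second (8 bits per photosite). */
long long raw_bps(int width, int height, int fps) {
    return (long long)width * height * fps * 8;
}

/* Compressed video bitrate after JPEG compression at a given ratio. */
long long jpeg_bps(int width, int height, int fps, int ratio) {
    return raw_bps(width, height, fps) / ratio;
}
```

Raw 1080P at 30 fps is nearly 500 Mbps, far beyond what a 2.4 GHz link can carry; at 20:1 JPEG compression it drops to about 25 Mbps, which fits.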
The STM32 firmware stack
Firmware runs on an STM32F4-series Cortex-M4 MCU. It handles:
- Image sensor configuration and DMA (data moves from the ISP directly to the Wi-Fi transmit buffer; the MCU just manages the pipe)
- LED PWM control for brightness adjustment
- Battery management — charge controller state machine, over-discharge cutoff, fuel gauge
- Power management — 5 V USB-C input, 3.3 V rail for logic, 4.2 V boost for LEDs
- Wi-Fi provisioning via Bluetooth LE (user’s phone sends the Wi-Fi credentials over BLE at first setup)
- Over-the-air update verification — cryptographic signature check before accepting a new firmware image
All firmware is version-controlled in Git with signed OTA releases. For brand owners running private-label deployments, we deliver the firmware source under NDA so your engineering team can audit it or freeze a version.
Wi-Fi vs. Bluetooth: why we support both
Bluetooth alone is too slow for smooth 1080P video. At 2 Mbps data rate, you can push 10–12 fps at low quality; no consumer will be happy with that. So the video pipe is Wi-Fi 2.4 GHz, direct from the device to the phone (ad-hoc mode, no router required).
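The 10–12 fps figure falls straight out of the arithmetic if you assume low-quality compressed frames of roughly 20–25 KB (an illustrative size, not a measured one):

```c
#include <assert.h>

/* Frames per second a link can sustain for a given compressed
   frame size, ignoring protocol overhead and retransmissions. */
int max_fps(long long link_bps, int frame_bytes) {
    return (int)(link_bps / ((long long)frame_bytes * 8));
}
```

A 2 Mbps Bluetooth link tops out at 10–12 fps for frames that size, while even a conservative 30 Mbps Wi-Fi link sustains 30 fps with 100 KB frames to spare.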
But Wi-Fi has provisioning overhead — the user has to tell the device which network to join. That’s where BLE comes in: the phone advertises, the device scans, the user approves on the phone’s screen, the phone delivers the Wi-Fi credentials over the BLE encrypted link. Zero typing. Zero QR codes.
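That handoff can be modeled as a small state machine. The states and events below are an illustration of the flow, not the shipped firmware, which also handles retries, timeouts, and error paths:

```c
#include <assert.h>

/* Simplified first-time-setup flow: BLE carries credentials,
   Wi-Fi carries video. Illustrative only. */
typedef enum {
    PROV_IDLE,           /* powered on, not provisioned        */
    PROV_BLE_CONNECTED,  /* phone connected over BLE           */
    PROV_CREDS_RECEIVED, /* SSID + passphrase delivered        */
    PROV_WIFI_JOINED,    /* device joined the Wi-Fi link       */
    PROV_STREAMING       /* live video flowing to the phone    */
} prov_state_t;

typedef enum {
    EV_BLE_CONNECT, EV_CREDS, EV_WIFI_OK, EV_STREAM_START
} prov_event_t;

/* Advance the state machine; out-of-order events are ignored. */
prov_state_t prov_step(prov_state_t s, prov_event_t e) {
    switch (s) {
    case PROV_IDLE:           return e == EV_BLE_CONNECT  ? PROV_BLE_CONNECTED  : s;
    case PROV_BLE_CONNECTED:  return e == EV_CREDS        ? PROV_CREDS_RECEIVED : s;
    case PROV_CREDS_RECEIVED: return e == EV_WIFI_OK      ? PROV_WIFI_JOINED    : s;
    case PROV_WIFI_JOINED:    return e == EV_STREAM_START ? PROV_STREAMING      : s;
    default:                  return s;
    }
}
```

Ignoring out-of-order events (rather than erroring) keeps the setup robust when BLE packets arrive late or duplicated.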
Total first-time setup: under 45 seconds from unboxing to first live video feed. We measure this at every FCC certification run because if provisioning takes longer than 90 seconds, real users will return the product.
The AI safety layer
This is the one nobody else in the category has gotten right.
The risk with a visual ear cleaner is simple: an untrained user, distracted by the live video on their phone, pushes the scoop too far and perforates the tympanic membrane. Ear, Nose and Throat doctors have been warning about this risk since 2020; there are published case studies.
Our response is an on-device AI model that detects proximity to the tympanic membrane by analysing:
- Image focus — the membrane is a distinctive, slightly translucent disc; when it’s in focus, the scoop is dangerously close
- Depth-of-field cues — the rate at which out-of-focus regions blur tells the algorithm how close the lens is to the surface
- Dwell time — a stationary lens pointed at an in-focus membrane for more than 3 seconds triggers an alarm
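The focus cue in the first bullet can be illustrated with a standard sharpness measure: the variance of a Laplacian filter over the image. This is a generic sketch, not our patented detector; the production model combines the focus signal with the other two cues, and the dwell timer is omitted here:

```c
/* Variance of a 4-neighbour Laplacian over an 8-bit grayscale image.
   A standard focus proxy: sharp edges produce large Laplacian
   responses, so a higher variance means the surface is in focus. */
double laplacian_variance(const unsigned char *img, int w, int h) {
    int n = 0;
    double sum = 0.0, sum_sq = 0.0;
    for (int y = 1; y < h - 1; y++) {
        for (int x = 1; x < w - 1; x++) {
            int lap = 4 * img[y * w + x]
                    - img[(y - 1) * w + x] - img[(y + 1) * w + x]
                    - img[y * w + x - 1]   - img[y * w + x + 1];
            sum += lap;
            sum_sq += (double)lap * lap;
            n++;
        }
    }
    double mean = sum / n;
    return sum_sq / n - mean * mean;  /* E[x^2] - E[x]^2 */
}
```

A fixed-point version of a measure like this is cheap enough to run per-frame on a Cortex-M4, which is what makes a fully on-device safety check plausible.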
The model runs on the MCU itself — not in the cloud, not on the phone. That means it works even if Wi-Fi drops, even in airplane mode, even in the middle of the ocean. False-positive rate in our QA lab testing (10,000 synthetic sessions): 0.28%. False-negative rate (failure to detect true proximity): 0.04%.
We hold patents on this algorithm in both China and the US. Brand owners who ODM our S9 model get the AI safety layer as part of the firmware — you cannot ship our product with that feature removed, because the liability-transfer clause in our contracts requires it remain active.
Why any of this matters to you
If you are evaluating visual ear cleaners for your brand, you need to understand this: the cheap ones ($5 BOM, $15 retail) ship with 480P cameras, flaky Wi-Fi, no safety layer, and a six-month lifetime. Returns on those products run 15–25%, and the liability exposure is real when someone perforates an eardrum.
Our BOM is roughly 3× the cheap end, retail prices run $60–120 depending on configuration, returns run under 2%, and the patent moat means your product line will not be undercut by a factory that copy-pasted your design.
Ask us for a sample unit — we ship 1–5 units at cost so your team can evaluate before committing to a production run.