Photo-in-Chat: Grounded Multimodal Part Identification

Abstract

Conversational hardware-design tools benefit enormously from multimodal input — users can identify the motor they already own by dragging a photo into the chat, instead of typing model numbers. Naive vision-language-model (VLM) identification produces fluent but unreliable answers. We describe DRONA's two-stage pipeline: (1) a VLM generates a candidate set with per-candidate category and attribute predictions; (2) a catalog-grounded retrieval step over a pgvector index of verified parts constrains the VLM's output to real SKUs. We formalize a trust budget — a candidate is only auto-committed to the build when VLM top-1 confidence exceeds a threshold and it agrees with the catalog-top-1; otherwise the system surfaces a confirmation prompt. On a held-out evaluation set of 240 crowd-sourced images across seven part categories, the pipeline achieves 83% top-1 SKU accuracy, 96% top-3 SKU accuracy among in-catalog parts, and zero catastrophic silent errors (a hallucinated SKU committed to the build with no user confirmation). We discuss the evidence-tier system that tracks whether an identification is grounded, estimated, or conceptual, and argue that visible uncertainty is a feature, not a bug, in AI-assisted hardware design.

1 · Problem

A drone builder opens the design chat and says: "I have this motor, use it in my build." They drag a photo of an EMAX motor into the chat input. The chat's job is to (a) identify the motor, (b) decide what to do with the identification, and (c) keep the build physics-valid when the new part replaces whatever was there. The hard problem is (a) and (b); (c) is downstream and already covered by our compatibility engine ^[1].

Unconstrained VLM identification fails in three characteristic ways. The model invents plausible-looking SKUs that don't exist. The model is confidently wrong in ways users cannot easily detect. The model hedges in a way that is technically honest but useless for design progression. None of these failure modes are acceptable in a tool whose output is a bill-of-materials a user will actually purchase.

2 · Pipeline

We take the classical retrieval-augmented generation pattern and adapt it to hardware identification. The pipeline is two stages.

2.1 Stage A: VLM candidate generation

Given an image and an optional user hint (e.g., "category=motor"), we prompt a vision-language model to produce a structured JSON output:

{
  "category": "motor" | "esc" | "fc" | "propeller" | "battery" | "camera" | "frame" | "unknown",
  "brand_guess": "EMAX" | "iFlight" | ... | null,
  "model_family_guess": "ECO II 2207" | "XING2 2207" | ... | null,
  "visible_attributes": {
    "kv_label_visible": 2750,
    "stator_size_label_visible": "2207",
    "color": "black",
    "mounting_pattern_visible": "M3_16x16" | null
  },
  "top_1_confidence": 0.87
}

This is a constrained-output call with schema validation. If the VLM fails the schema, the pipeline retries once with a stricter prompt; on a second failure it returns a "could not identify" response to the user rather than improvising.

2.2 Stage B: Catalog-grounded retrieval

The Stage A output is converted to a query vector and a set of attribute filters. We retrieve the top-k (k=8) matching parts from a pgvector index over the 834-SKU production catalog, filtered by VLM-predicted category. The retrieval embeds the VLM's brand_guess + model_family_guess + visible_attributes into a semantic-search vector and ranks catalog SKUs by cosine similarity, re-ranked by attribute agreement count.

The final output to the chat is the list of top-k catalog matches, their grounded-match scores, and the original VLM candidate. The decision about what to show the user is governed by the trust budget (§3).

3 · Trust budget

We define a trust budget as the conjunction of two signals: the VLM's confidence and the catalog match's agreement score. Formally, let $c_v \in [0,1]$ be the VLM top-1 confidence and $s_c \in [0,1]$ be the cosine similarity of the top-1 catalog match. The identification is auto-committable iff:

$$c_v \geq \tau_v \;\; \wedge \;\; s_c \geq \tau_c \;\; \wedge \;\; \text{agree}(\text{VLM}_{1}, \text{Catalog}_{1})$$

Eq. 1 — Trust-budget predicate. Currently $\tau_v = 0.80$, $\tau_c = 0.70$.

When the trust budget is met, the system silently slots the part into the build and runs the compatibility engine. When it is not met, the system shows the user the top-3 candidates and asks for explicit confirmation. The user-facing question is always of the form: "Looks like X. Is that right?" — never "I'm uncertain; please help."

The uncomfortable asymmetry. A silent wrong commit is a catastrophic failure — the user pays for a part that doesn't belong in their build. A visible wrong guess is a minor inconvenience — the user clicks "no, it's actually this one" and moves on. The trust budget is tuned to eliminate the first failure mode at the cost of slightly more user confirmations.

4 · Evidence levels

Every part entering a DRONA build carries one of three evidence tags:

grounded — identified with high trust from the catalog, or manually typed by the user and matched to a verified SKU.
estimated — the photo did not match catalog with high confidence, but the user confirmed the VLM's best guess; specs are derived from family-typical values and marked as such throughout the UI.
conceptual — used only during early design exploration; represents a "hypothetical part" with no real SKU. Never ships to a BOM export without user acknowledgement.

The lowest evidence tier among parts in a build wins at the build level: a build with one estimated part is an estimated build. This surfaces in the UI as an amber-rather-than-green status badge on the BOM panel. The user sees exactly what they are buying and what they are approximating.

5 · Evaluation

We evaluated the pipeline on 240 crowd-sourced photos across seven categories (motor, ESC, FC, propeller, battery, camera, frame), split roughly evenly. Each image was labeled with its ground-truth SKU by two independent reviewers; we retained 231 images where reviewers agreed. Of the 231, 184 were of parts present in our catalog; 47 were of parts not in the catalog (deliberately included as distractors).

5.1 Results on in-catalog images (n=184)

Metric	Result	Target
Category (top-1) accuracy	94%	≥ 90%
SKU (top-1) accuracy	83%	≥ 70%
SKU (top-3) accuracy	96%	≥ 90%
Silent wrong commit (auto-committed w/ trust budget)	0 of 184	0
Median latency	2.8 s	< 6 s

5.2 Results on out-of-catalog images (n=47)

On distractor images — parts that are not in our catalog — the pipeline correctly returned a "no high-confidence match, here are the nearest candidates" response in 43 of 47 cases. The four incorrect cases were visually near-identical parts from different manufacturers (e.g., a rebadged motor) where Stage B returned a plausible SKU with borderline confidence. None of those reached the trust-budget threshold for auto-commit; all four surfaced a user-confirmation prompt, which we treat as correct behavior.

6 · Discussion

Visible uncertainty is a feature. A hardware-design tool that always commits is a tool that sometimes commits wrong. Our users are about to hand over money for real parts — the correct default is to surface ambiguity rather than hide it.

The catalog is the authority. The VLM proposes; the catalog disposes. A hallucinated SKU cannot reach the BOM because the retrieval stage physically cannot return a SKU that doesn't exist. This is an architectural invariant, not a training-time hope.

Evidence tiers compound. A build with ten grounded parts and one estimated part is flagged amber in a way that correctly reflects the underlying uncertainty. Users who see an amber build and want it green know exactly where to click: the estimated-tier part, which opens a sourcing-verification flow.

7 · Limitations

Novel parts. When the catalog doesn't contain the part and the VLM happens to guess something visually near-identical, the confidence can cross the threshold. Mitigation: manually-curated distractor list for frequently-confused pairs (e.g., EMAX ECO II 2207 vs. iFlight XING2 2207).
Low-resolution images. Below ≈ 600×600 px, identification degrades noticeably. We upscale client-side and reject images below 400×400 px before calling the VLM.
Privacy in multi-tenant setups. Photos are retained by default for provenance (see Evidence Chain discussion in TR-2026-01). Enterprise customers can opt into zero-retention for images.

8 · Conclusion

A two-stage, catalog-grounded pipeline produces identifications that are good enough to auto-commit when confident, honest enough to surface uncertainty when not, and fast enough to feel instant in chat. The architectural choice — put the catalog outside the model — is the same choice we make in the compatibility engine ^[1], and for the same reason: we do not want the language model to be the authority on anything the physical world will later punish us for getting wrong.

References

[1] DRONA Labs, "A Physics-Grounded Compatibility and Specs Engine for Multirotor Drone Design," Technical Report TR-2026-01, April 2026.
[2] P. Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," NeurIPS, 2020.
[3] J. Li, D. Li, S. Savarese, S. Hoi, "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models," ICML, 2023.
[4] pgvector contributors, "pgvector: Open-source vector similarity search for Postgres," github.com/pgvector/pgvector, 2021–present.

Cite this
DRONA Labs, Photo-in-Chat: Grounded Multimodal Part Identification for Hardware Design, Technical Report TR-2026-02, April 2026.
BibTeX and plain-text citation blocks available on request to research@drona.design.