Building a Vision Pipeline That Sees What Cameras Can't
How depth sensing, instance segmentation, and a lot of failed prototypes produced a dataset generator for transparent laboratory tubes — including multi-tube mixed-class scenes.
The Problem
LABOKLIN, a veterinary diagnostic laboratory in Germany, processes thousands of biological sample tubes every day. Each tube contains a different additive — anticoagulants, clot activators, separating gels — and the type of additive determines exactly how a sample must be handled downstream. A misclassified tube doesn't just slow things down. It produces invalid diagnostic results.
The classification is currently done manually. At the scale LABOKLIN operates, that means human attention applied to thousands of nearly-identical objects under time pressure. The case for automating it is obvious.
But automation needs data. Specifically, a large annotated dataset of tube images — bounding boxes, segmentation masks, class labels — that a machine learning classifier can train on. That dataset didn't exist. My job was to build the pipeline that generates it.
Why Cameras Alone Fail
Before writing a line of code, I visited LABOKLIN with the team. The observation that shaped everything came immediately: the tubes are transparent.
Standard RGB segmentation depends on edge detection — the model looks for pixel intensity boundaries that correspond to object surfaces. Transparent, reflective, cylindrical objects don't give you clean edges. The tube blends into whatever is behind it. Labels wrap around the surface at arbitrary angles. The liquid fill level changes the internal appearance. And across 16 confirmed classes spanning three manufacturer families (VACUETTE, SARSTEDT Regular, SARSTEDT Small 1.3ml), several pairs share identical cap colours — the only feature most RGB-based systems would use to tell them apart.
Pure colour-based imaging was a dead end. The system needed geometry.
Depth sensing gives you a per-pixel distance map. It measures physical distance from the sensor rather than colour or intensity edges, so it depends far less on whether a surface looks transparent or opaque. Even when the RGB image shows a tube barely distinguishable from its background, the depth map shows a vertical cylinder standing in front of a flat platform. That geometric signal is what makes reliable segmentation possible.
This was the founding insight: depth isn't just an extra stored output. It's an input to the segmentation process itself.
What I Built
I designed and built the full software pipeline solo. The system is organised into five functional layers:
Acquisition — Streams aligned RGB and depth frames from an Intel RealSense D435i. A depth-stability detector monitors the capture zone and triggers only when the scene has been stationary for N consecutive frames at the correct distance. No manual trigger.
Annotation Engine — Takes the captured frame pair, extracts depth-bounded regions of interest per tube, runs MobileSAM segmentation once per ROI, and writes out per-instance masks, bounding boxes, and structured metadata. This is the core of the system.
Storage Manager — Writes every sample into a consistent folder hierarchy by class. Each captured instance produces five files: RGB image, depth array, segmentation mask, bounding box annotation, and a JSON metadata file — all sharing a common frame ID and instance ID so the dataset stays fully traceable.
Cleaning Pipeline — Filters the raw dataset using Laplacian blur detection, perceptual hashing for duplicate removal, and bounding box quality checks. Anything below threshold gets logged and rejected.
Orchestrator — Coordinates startup validation (config, camera connection, model weights, CUDA availability), capture mode routing, error handling, and final export to COCO and YOLO formats.
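
The hands-free trigger in the Acquisition layer can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the class name `DepthStabilityTrigger`, the tolerances, and the use of the median depth of the capture zone are all assumptions. Only the idea of "stationary for N consecutive frames at the correct distance" and the ~35 cm working distance come from the text.

```python
from collections import deque  # not required; kept minimal below
import numpy as np


class DepthStabilityTrigger:
    """Fires once the capture zone has been still, at the target distance,
    for n_frames consecutive frame-to-frame comparisons (hypothetical helper)."""

    def __init__(self, n_frames=10, target_m=0.35,
                 distance_tol_m=0.05, motion_tol_m=0.003):
        self.n_frames = n_frames
        self.target_m = target_m              # assumed working distance in metres
        self.distance_tol_m = distance_tol_m  # assumed tolerance around the target
        self.motion_tol_m = motion_tol_m      # assumed mean per-pixel change allowed
        self._prev = None
        self._stable_count = 0

    def update(self, depth_m: np.ndarray) -> bool:
        """Feed one depth frame (metres, cropped to the capture zone)."""
        valid = depth_m > 0  # a depth of 0 marks pixels with no measurement
        if not valid.any():
            self._prev, self._stable_count = None, 0
            return False

        median = float(np.median(depth_m[valid]))
        at_distance = abs(median - self.target_m) <= self.distance_tol_m

        still = False
        if self._prev is not None:
            both = valid & (self._prev > 0)
            if both.any():
                change = np.abs(depth_m[both] - self._prev[both]).mean()
                still = float(change) <= self.motion_tol_m

        self._prev = depth_m.copy()
        self._stable_count = self._stable_count + 1 if (at_distance and still) else 0

        if self._stable_count >= self.n_frames:
            self._stable_count = 0  # re-arm so one stable scene yields one capture
            return True
        return False
```

Resetting the counter after a trigger is one possible design choice: it avoids capturing the same motionless scene on every subsequent frame.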

The full source is on GitHub: Yadav108/Dataset_Pipeline
Capture Modes
The pipeline supports four capture modes, each routing through the same annotation engine but with different ROI detection and operator input paths:
single_side / single_top — One tube per frame. The operator declares class, volume, and fill level before capture. ROI extraction finds exactly one vertical object. This was the prototype mode and is where the IoU benchmarks were established.
multi_side — Two to six tubes per frame, operator-declared before capture. The operator specifies how many slots are occupied, and whether they're the same or different classes, then enters class, volume, and fill level per slot. The pipeline expects the detected ROI count to match the declared slot count exactly — if it doesn't, the frame is skipped, not guessed. Each detected ROI is mapped left-to-right to the declared slot, segmented independently by MobileSAM, and stored with its own instance ID and class folder.
multi_top — Multi-tube top-down capture. Bounding box extraction runs across multiple objects, but per-slot operator declaration is not yet wired for this mode.
The Decisions That Mattered
1. MobileSAM over SAM — a VRAM constraint, not a preference
SAM ViT-H needs more than 4 GB of VRAM for a single inference pass. My hardware is an RTX 3050 with exactly 4 GB. MobileSAM replaces the heavy encoder with a lightweight TinyViT image encoder, fits comfortably in 4 GB, and runs once per detected ROI, so a multi_side frame with four tubes means four sequential inference passes. Early prototype results: average IoU 0.84, minimum 0.66. After tuning ROI extraction parameters and improving the physical setup, the final single-tube numbers landed at average IoU 0.95, minimum 0.92, maximum 0.99.
The honest version: MobileSAM was a constraint-driven choice. It also turned out to be the right one for this environment. But if you have more VRAM, SAM ViT-H will give you better masks on edge cases.
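
For readers unfamiliar with the metric, IoU (intersection over union) between a predicted mask and a reference mask can be computed as in the sketch below. The text does not say which reference masks the benchmark used, so treating `truth` as a hand-checked mask is an assumption, and `mask_iou` is an illustrative name.

```python
import numpy as np


def mask_iou(pred, truth) -> float:
    """Intersection over union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred, truth).sum() / union)
```

An IoU of 0.95 means the overlap covers 95% of the combined area of the two masks, which leaves little room for a stray holder or reflection region.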
2. Depth as a geometric prompt, not just stored output
Before MobileSAM sees any image, the pipeline runs depth ROI extraction — it analyses the depth frame to locate each vertical object, then crops both frames to each ROI independently. MobileSAM receives a pre-cropped input centred on one tube at a time, not the full frame. In multi_side mode this is what makes per-instance separation possible: the ROI extractor spatially isolates each tube using depth geometry before any segmentation happens.
This eliminates the holder, platform, and neighbouring tubes from each segmentation problem. The result: zero false-positive ROI detections in the prototype single-tube run, and correct per-instance mask separation in multi_side captures.
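
One simple way to locate vertical objects in a depth frame is a column-wise scan, sketched below. This is an assumed approach for illustration, not the repository's ROI extractor: the depth band (`near_m`, `far_m`), the minimum height and width, and the padding are all hypothetical parameters. Each returned box would then be used to crop the RGB and depth frames before the segmenter is called.

```python
import numpy as np


def extract_depth_rois(depth_m, near_m=0.25, far_m=0.40,
                       min_height_px=40, min_width_px=5, pad_px=4):
    """Return (x0, y0, x1, y1) boxes, left to right, for tall foreground objects.

    depth_m: 2-D array of distances in metres; 0 means no measurement.
    """
    h, w = depth_m.shape
    # Foreground: pixels inside the assumed depth band in front of the platform.
    fg = (depth_m > near_m) & (depth_m < far_m)
    # A column belongs to a vertical object if enough of its pixels are foreground.
    tall = fg.sum(axis=0) >= min_height_px

    rois = []
    x = 0
    while x < w:
        if not tall[x]:
            x += 1
            continue
        start = x
        while x < w and tall[x]:
            x += 1
        if x - start < min_width_px:
            continue  # too narrow to be a tube, likely noise
        rows = np.flatnonzero(fg[:, start:x].any(axis=1))
        rois.append((
            max(start - pad_px, 0),
            max(int(rows[0]) - pad_px, 0),
            min(x + pad_px, w),
            min(int(rows[-1]) + 1 + pad_px, h),
        ))
    return rois
```

Because the scan runs left to right, the boxes come back already ordered by horizontal position, which is convenient for slot mapping in multi_side mode.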
3. Slot-count matching over best-effort segmentation
In multi_side mode, when the detected ROI count doesn't match the operator's declared slot count, the pipeline skips the frame entirely rather than proceeding with a partial or misaligned annotation. This was a deliberate decision. A training dataset with silently mis-labelled samples is worse than a smaller clean dataset. The skip is logged with the frame ID and mismatch count so it can be reviewed.
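
The skip-on-mismatch rule and the left-to-right mapping can be sketched like this. The names `SlotDeclaration` and `assign_slots`, the logger name, and the log wording are illustrative assumptions; only the behaviour (exact count match, otherwise skip and log) follows the text.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("capture")  # hypothetical logger name


@dataclass(frozen=True)
class SlotDeclaration:
    class_name: str
    volume_ml: float
    fill_level: str


def assign_slots(frame_id, rois, declared):
    """Pair ROIs with declared slots left to right, or return None to skip.

    rois: list of (x0, y0, x1, y1) boxes; declared: list of SlotDeclaration.
    """
    if len(rois) != len(declared):
        logger.warning(
            "frame %s skipped: detected %d ROIs, declared %d slots",
            frame_id, len(rois), len(declared),
        )
        return None  # never guess which tube is missing or extra
    ordered = sorted(rois, key=lambda box: box[0])  # sort by left edge
    return list(zip(ordered, declared))
```

Returning `None` rather than a partial list keeps the caller from accidentally writing a misaligned annotation.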
4. Softbox lighting over a ring light
Ring lights produce specular reflections — concentrated bright spots on curved transparent surfaces. Switching to an FGen octagonal softbox eliminated the artefact. Combined with a black background (transparent tubes against white have near-zero contrast), RGB quality improved enough that masks stopped including reflection artefacts.
5. Hardcoding camera settings into the pipeline
The RealSense camera would reset its exposure and white balance between runs — fine for photography, catastrophic for a training dataset. Settings that produced good results (exposure 156, gain 30, brightness 1, contrast 55, saturation 46, sharpness 82, white balance 5200) are applied programmatically at startup, overriding auto-adjustment.
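
A hedged sketch of applying these values at startup is shown below. The numbers are the ones listed above; the option names are intended to match members of pyrealsense2's `rs.option` enum, and the helper `apply_fixed_settings` is a hypothetical name. Auto exposure and auto white balance are switched off first so the manual values are not overridden.

```python
# Values from the text; keys are expected to match pyrealsense2 rs.option names.
FIXED_COLOR_SETTINGS = {
    "enable_auto_exposure": 0,       # disable before setting exposure manually
    "enable_auto_white_balance": 0,  # disable before setting white balance
    "exposure": 156,
    "gain": 30,
    "brightness": 1,
    "contrast": 55,
    "saturation": 46,
    "sharpness": 82,
    "white_balance": 5200,
}


def apply_fixed_settings(sensor, option_enum, settings=FIXED_COLOR_SETTINGS):
    """Apply each setting the sensor supports; return what was applied.

    sensor: object with supports(option) and set_option(option, value),
            as a RealSense sensor provides.
    option_enum: the enum holding option members (rs.option in pyrealsense2).
    """
    applied = {}
    for name, value in settings.items():  # dicts keep insertion order
        option = getattr(option_enum, name)
        if sensor.supports(option):
            sensor.set_option(option, float(value))
            applied[name] = value
    return applied


# Possible wiring (needs a connected camera, so it is left as a comment):
#   import pyrealsense2 as rs
#   profile = rs.pipeline().start()
#   color_sensor = profile.get_device().first_color_sensor()
#   apply_fixed_settings(color_sensor, rs.option)
```

Returning the applied subset makes it easy to record the effective camera configuration in each sample's metadata.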
Results — Single Tube
The prototype acquisition run captured 148 frames across a subset of tube classes.
| Metric | Value | Interpretation |
|---|---|---|
| Frames captured | 148 | Initial prototype volume |
| After ROI + MobileSAM | 148 | Zero loss at segmentation entry |
| After blur filtering | 148 | Zero blurry frames — mechanical stability held |
| After deduplication | 118 | 30 near-duplicates removed by perceptual hash |
| Final exported set | 98 | 20 more removed after deduplication; end-to-end yield 66.2% (98/148) |
| Average IoU | 0.95 | Up from 0.84 in early prototype |
| Minimum IoU | 0.92 | Up from 0.66 — no severe segmentation failures |
| Maximum IoU | 0.99 | Near-perfect under good conditions |
| MobileSAM inference | <200ms / frame | On RTX 3050 GPU |
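
The blur and deduplication stages in the table can be illustrated with the sketch below. It is an approximation under stated assumptions: sharpness is scored as the variance of a 4-neighbour Laplacian, and duplicates are found with a simple average hash, which is one basic form of perceptual hashing and not necessarily the variant the pipeline uses. The thresholds (`blur_threshold`, `max_hamming`) and function names are hypothetical.

```python
import numpy as np


def laplacian_variance(gray) -> float:
    """Variance of the 4-neighbour Laplacian; low values suggest blur."""
    g = np.asarray(gray, dtype=np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())


def average_hash(gray, size=8):
    """size*size boolean hash: block means compared with their overall mean."""
    g = np.asarray(gray, dtype=np.float64)
    h, w = g.shape
    g = g[: h - h % size, : w - w % size]  # trim so blocks divide evenly
    blocks = g.reshape(size, g.shape[0] // size,
                       size, g.shape[1] // size).mean(axis=(1, 3))
    return (blocks > blocks.mean()).ravel()


def clean(frames, blur_threshold=100.0, max_hamming=4):
    """frames: iterable of (frame_id, grayscale array). Returns (kept, rejected)."""
    kept, hashes, rejected = [], [], []
    for frame_id, gray in frames:
        if laplacian_variance(gray) < blur_threshold:
            rejected.append((frame_id, "blur"))
            continue
        h = average_hash(gray)
        # Near-duplicate if the hash differs from a kept one in only a few bits.
        if any(np.count_nonzero(h != k) <= max_hamming for k in hashes):
            rejected.append((frame_id, "duplicate"))
            continue
        kept.append(frame_id)
        hashes.append(h)
    return kept, rejected
```

Logging the rejection reason alongside the frame ID, as the tuple in `rejected` does here, keeps the filtering step auditable.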
RGB samples (cap colour, tube body, and label all visible): VAC_PURPLE · SARSTEDT Small; SAR_SM_ORANGE · small tube; VAC_PURPLE · VACUETTE; VAC_LIGHTBLUE · VACUETTE.
Depth maps (tube geometry visible regardless of surface transparency): tube separated from background; shadow artefact (IR + transparent plastic); tube above holder with clean vertical separation; platform response, filtered by ROI extractor.
Segmentation masks (tube-only pixels, holder excluded): small tube, isolated; after depth ROI crop; full vertical profile; cap + body region; paired capture sequence.
Bounding boxes (tight localisation across tube types and sizes): small blue/grey tube; red-cap tube; VAC_PURPLE; narrow profile; blue-cap tube.
Results — Multi-Tube (multi_side)
The multi_side mode extends the pipeline to handle two to six tubes in a single frame. The operator declares slot count and per-slot class/volume/fill before capture. The pipeline detects each tube independently via depth ROI extraction, runs MobileSAM once per ROI, and routes each instance to its declared class folder — all from one frame.

From a single two-tube multi_side frame (here, a green-cap and a blue-cap tube), the annotation engine produces two independent annotation sets, one per instance. Each gets its own mask, bounding box, instance ID, and metadata file, stored in its respective class folder.
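
To make the per-instance layout concrete, here is a sketch of how the five files for one instance could be named. The file suffixes, the zero-padded instance ID, and the function name `instance_paths` are all hypothetical; the text only specifies a class folder and five files that share a frame ID and an instance ID.

```python
from pathlib import Path

# Hypothetical suffixes for the five per-instance files described in the text.
SUFFIXES = {
    "rgb": "_rgb.png",
    "depth": "_depth.npy",
    "mask": "_mask.png",
    "bbox": "_bbox.txt",
    "meta": "_meta.json",
}


def instance_paths(root, class_name, frame_id, instance_id):
    """Return the five output paths for one tube instance."""
    stem = f"{frame_id}_{instance_id:02d}"  # shared by all five files
    class_dir = Path(root) / class_name
    return {kind: class_dir / f"{stem}{suffix}" for kind, suffix in SUFFIXES.items()}
```

With a scheme like this, two tubes from the same frame share the frame ID in their file names but land in different class folders, so any annotation can be traced back to the frame it came from.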
Per-instance segmentation masks from one multi_side frame: Slot 0 · green-cap tube, isolated from neighbour; Slot 1 · blue-cap tube, isolated from neighbour.
Multi-tube performance figures (multi_side mode)
| Metric | Value |
|---|---|
| Tubes per frame | 2–6 |
| Throughput | 40–50 frames/min |
| Processing time | 1.5–2.5 s / image |
| Blur rejection | 15–25% |
| Segmentation success rate | 85–95% |
| Quality score | 8.2–8.8 / 10 |

A frame is skipped if the detected ROI count does not equal the declared slot count.
The 15–25% blur rejection in multi_side is higher than the zero-blur result in single-tube mode. This is expected — with multiple tubes in frame, the probability of at least one ROI failing the blur threshold increases, and the current implementation rejects the full frame if any instance fails, not just the failing slot.
What Didn't Work
Fill-level detection via depth was the one feature I couldn't make reliable. At a capture distance of ~35cm, the RealSense D435i doesn't produce enough depth contrast inside a narrow transparent tube to locate the liquid surface with confidence. The signal exists but it's within the noise margin of the sensor at that distance. Fill level is currently operator-declared. A classifier trained on this dataset will have fill-level labels that are operator-entered, not sensor-derived. That's worth knowing.
Known failure modes in multi_side: ROI-count mismatch (frame skipped), no ROI detected, SAM inference failure, a blur, coverage, or IoU score falling below its threshold, duplicate rejection, and background-removal fallback. All are logged with frame ID and reason.
What's Next
The immediate next step is completing full dataset acquisition — 16 classes, ≥500 images each, across both single and multi-tube modes. The pipeline is ready; it's a capture logistics problem now.
Per-slot operator declaration for multi_top mode is designed but not yet wired. The longer-term goal is training the actual classifier on the generated dataset. This pipeline exists to produce the data. The model that consumes it doesn't yet exist.
If you've worked on RGB-D pipelines, transparent object segmentation, or have thoughts on fill-level detection at close range — I'd be interested in the conversation. aryan.yadav@study.thws.de