Vision Runtime
This page covers captioning, inspection, grounding, segmentation, tracking, and OCR.
Public surface
- mere.run vision caption
- mere.run vision inspect
- mere.run vision ground
- mere.run vision segment
- mere.run vision track
- mere.run vision track-live
- mere.run vision ocr
Model family
- vision-ocr-lighton
- vision-ground-falcon-perception
- vision-segment-sam31
Captioning and inspect flows also depend on vision-language support code in MereRunCore.
Grounding runs natively through the Swift/MLX Falcon Perception stack in MereRunCore.
Segmentation and tracking run natively through the Swift/MLX SAM 3.1 stack in MereRunCore.
Current SAM 3.1 scope
- vision segment supports text prompts plus geometry prompting with boxes and points
- vision track supports text, box, and point prompts on the init frame, then propagates tracked objects through later frames
- vision track-live records a camera clip, searches a short warm-up window for seed objects, and then runs the same native tracking path over the saved recording
- the managed model package vision-segment-sam31 is the single SAM 3.1 package for segmentation and tracking
Current implementation notes
- still-image text prompting uses the native detector path
- still-image box and point prompting use the native interactive SAM prompt path
- offline video tracking currently uses native prompt propagation built on top of the image segmenter rather than a full SAM memory-bank tracker
- live capture is text-prompt seeded only in the current CLI surface
Typical workflows
Caption an image
bash
swift run mere.run vision caption ./image.png
Inspect an image with a question
bash
swift run mere.run vision inspect ./image.png "What objects are visible?"
Segment an image with SAM 3.1
bash
swift run mere.run model pull vision-segment-sam31
swift run mere.run vision segment ./image.png --prompt "a person"
Ground objects with Falcon Perception
bash
swift run mere.run model pull vision-ground-falcon-perception
swift run mere.run vision ground ./image.png --query "a person"
Track objects through a video
bash
swift run mere.run model pull vision-segment-sam31
swift run mere.run vision track ./clip.mp4 --prompt "a dog" --init-frame 12
Track a recorded live camera session
bash
swift run mere.run vision track-live --output ./live.mp4 --prompt "a person"
Output artifacts
vision ground
- annotated image written to <stem>_grounded.<ext> unless --output is provided
- JSON metadata written to <stem>_grounded.json unless --json-output is provided
- optional mask PNGs written to --mask-output-dir
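As a rough illustration of the default naming above, the bash sketch below derives the default file names from an input path. The function name default_artifact_names is hypothetical, and splitting <stem> and <ext> at the last dot is an assumption. This page does not say which directory the files land in, so the sketch produces file names only.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the mere.run CLI.
# Prints the default annotated-image name and JSON name for a given input
# path and suffix ("grounded" or "segmented").
# Assumes the input file name has an extension and that <stem> is the
# file name up to the last dot.
default_artifact_names() {
  local input="$1" suffix="$2"
  local base stem ext
  base="$(basename -- "$input")"   # drop any leading directories
  stem="${base%.*}"                # file name without its last extension
  ext="${base##*.}"                # last extension, without the dot
  printf '%s\n' "${stem}_${suffix}.${ext}" "${stem}_${suffix}.json"
}

default_artifact_names ./photos/cat.png grounded
```

Passing segmented as the suffix yields the corresponding vision segment names.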
The JSON includes:
- schemaVersion
- model and input/output paths
- query list
- detections with query, normalized xy, normalized hw, derived box, optional score, and optional maskPath
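To make the relationship between the normalized fields and the derived box concrete, here is a minimal bash sketch using awk. It assumes that xy is the box center as (x, y), that hw is (height, width), and that both are fractions of the image size. The corner layout of the output and the function name box_from_center are illustrative assumptions, since this page does not document the exact box format.

```bash
#!/usr/bin/env bash
# Hypothetical helper: convert a normalized center/size pair to pixel corners.
# Usage: box_from_center <cx> <cy> <h> <w> <image_width> <image_height>
# Prints "x_min y_min x_max y_max" in pixels, clamped to the image bounds.
box_from_center() {
  LC_ALL=C awk -v cx="$1" -v cy="$2" -v h="$3" -v w="$4" \
               -v iw="$5" -v ih="$6" '
    function clamp(v, lo, hi) { return v < lo ? lo : (v > hi ? hi : v) }
    BEGIN {
      # Half the width/height on each side of the center, scaled to pixels.
      x1 = clamp((cx - w / 2) * iw, 0, iw)
      y1 = clamp((cy - h / 2) * ih, 0, ih)
      x2 = clamp((cx + w / 2) * iw, 0, iw)
      y2 = clamp((cy + h / 2) * ih, 0, ih)
      printf "%.1f %.1f %.1f %.1f\n", x1, y1, x2, y2
    }'
}

# A centered box half the image tall and a quarter of it wide, on 640x480.
box_from_center 0.5 0.5 0.5 0.25 640 480   # prints "240.0 120.0 400.0 360.0"
```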
vision segment
- annotated image written to <stem>_segmented.<ext> unless --output is provided
- JSON metadata written to <stem>_segmented.json unless --json-output is provided
- optional mask PNGs written to --mask-output-dir
The JSON includes:
- schemaVersion
- model and input/output paths
- prompts, threshold, and resolution
- detections with label, score, box, maskAreaPixels, and optional objectID, promptKind, maskPath, and candidateIndex
vision track and vision track-live
- annotated video written to <stem>_tracked.mp4 for vision track
- JSON metadata written to <stem>_tracked.json for vision track
- vision track-live requires an explicit output video path
- vision track-live defaults to frame 0 but searches a short warm-up window when that frame yields no seed objects
- optional per-frame mask PNGs written under frame-named subdirectories when --mask-output-dir is set on vision track
The tracking JSON includes:
- schemaVersion
- model and input/output paths
- fps, frame size, init frame, and dropped frame count
- stable tracked object metadata
- per-frame detections with objectID, label, score, visible, box, maskAreaPixels, and optional maskPath
OCR
bash
swift run mere.run vision ocr ./page.png --backend lighton
Runtime entrypoints
CLI
- Sources/MereRunCLI/Commands/VisionCaptionCommand.swift
- Sources/MereRunCLI/Commands/VisionInspectCommand.swift
- Sources/MereRunCLI/Commands/VisionGroundCommand.swift
- Sources/MereRunCLI/Commands/VisionSegmentCommand.swift
- Sources/MereRunCLI/Commands/VisionTrackCommand.swift
- Sources/MereRunCLI/Commands/VisionTrackLiveCommand.swift
- Sources/MereRunCLI/Commands/VisionOCRCommand.swift
OCR runtime
- Sources/MereRunCore/LightOnOCR/LightOnOCRGenerator.swift
- Sources/MereRunCore/LightOnOCR/LightOnOCRGenerator+Loading.swift
- Sources/MereRunCore/LightOnOCR/LightOnOCRGenerator+Inference.swift
- Sources/MereRunCore/LightOnOCR/LightOnOCRSupport.swift
Vision-language support
- Sources/MereRunCore/VLM/
- Sources/MereRunCore/QwenVLCaptioner.swift
- Sources/MereRunCore/Qwen25VLEncoder.swift
- Sources/MereRunCore/QwenVisionAttention.swift
Falcon grounding runtime
- Sources/MereRunCore/FalconPerception/FalconPerceptionConfig.swift
- Sources/MereRunCore/FalconPerception/FalconPerceptionResources.swift
- Sources/MereRunCore/FalconPerception/FalconPerceptionTokenizer.swift
- Sources/MereRunCore/FalconPerception/FalconPerceptionProcessor.swift
- Sources/MereRunCore/FalconPerception/FalconPerceptionModel.swift
- Sources/MereRunCore/FalconPerception/FalconPerceptionAnyUp.swift
- Sources/MereRunCore/FalconPerception/FalconPerceptionGrounder.swift
SAM 3.1 runtime
- Sources/MereRunCore/SAM3/SAM31Config.swift
- Sources/MereRunCore/SAM3/SAM31Resources.swift
- Sources/MereRunCore/SAM3/SAM31Tokenizer.swift
- Sources/MereRunCore/SAM3/SAM31Model.swift
- Sources/MereRunCore/SAM3/SAM31InteractiveSAM.swift
- Sources/MereRunCore/SAM3/SAM31Prompts.swift
- Sources/MereRunCore/SAM3/SAM31ImageSegmenter.swift
- Sources/MereRunCore/SAM3/SAM31VideoIO.swift
- Sources/MereRunCore/SAM3/SAM31VideoTracker.swift
- Sources/MereRunCore/SAM3/SAM31CameraCapture.swift
How the OCR path works
- the CLI resolves the OCR model
- the OCR runtime loads the required components
- the input image is normalized into the expected tensor form
- OCR inference runs
- the recognized text is emitted on stdout, with no internal bring-up logs mixed in
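Because stdout carries only the recognized text, the OCR command can be redirected straight into a file. The sketch below is one way to do that. The function name ocr_to_file and the MERE_RUN override variable are hypothetical conveniences, and the idea that any diagnostics go to stderr is an assumption this page does not confirm.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: save OCR text to a file and keep anything written to
# stderr in a sibling .log file.
# Usage: ocr_to_file <image> <text-output-path>
ocr_to_file() {
  local image="$1" out="$2"
  # MERE_RUN is an illustrative override for the launcher command. It is left
  # unquoted on purpose so "swift run mere.run" splits into separate words.
  ${MERE_RUN:-swift run mere.run} vision ocr "$image" --backend lighton \
    >"$out" 2>"${out%.*}.log"
}

# Example (requires the project checkout and the OCR model):
# ocr_to_file ./page.png ./page.txt
```

Keeping stderr in a separate file means the text output stays clean even if a future build starts logging during model load.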
How segmentation and tracking work
- the CLI resolves vision-segment-sam31 from the model store or uses the local root passed with --model
- the native SAM 3.1 runtime validates the root, loads tokenizer/config/weights, and preprocesses the input image or video frames
- still-image text prompts run the detector once, then text + DETR + mask decode per prompt
- geometry prompts use the interactive SAM path, and video tracking reuses those prompts frame to frame after the seed frame
- native postprocessing applies thresholding, mask resize, score ordering, NMS, and optional mask export
- the runtime writes annotated media plus structured JSON metadata
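The thresholding, score ordering, and NMS steps named above follow a well-known generic pattern, sketched here in bash and awk for illustration only. The real postprocessing runs natively in Swift/MLX inside MereRunCore. The input layout ("score x1 y1 x2 y2" per line), the function name nms, and both thresholds are assumptions made for this example.

```bash
#!/usr/bin/env bash
# Illustrative greedy non-maximum suppression, not the runtime's own code.
# Reads lines of "score x1 y1 x2 y2" on stdin.
# Usage: nms <min_score> <max_iou>
nms() {
  local min_score="$1" max_iou="$2"
  # 1) drop low-scoring boxes, 2) order by score (highest first),
  # 3) keep a box only if it does not overlap an already kept box too much.
  LC_ALL=C awk -v t="$min_score" '$1 >= t' |
    LC_ALL=C sort -k1,1nr |
    LC_ALL=C awk -v max_iou="$max_iou" '
      function max(a, b) { return a > b ? a : b }
      function min(a, b) { return a < b ? a : b }
      # IoU between the current line and the i-th kept box.
      function iou(i,   w, h, inter, a, b) {
        w = min($4, kx2[i]) - max($2, kx1[i])
        h = min($5, ky2[i]) - max($3, ky1[i])
        if (w <= 0 || h <= 0) return 0
        inter = w * h
        a = ($4 - $2) * ($5 - $3)
        b = (kx2[i] - kx1[i]) * (ky2[i] - ky1[i])
        return inter / (a + b - inter)
      }
      {
        for (i = 1; i <= n; i++) if (iou(i) > max_iou) next
        n++; kx1[n] = $2; ky1[n] = $3; kx2[n] = $4; ky2[n] = $5
        print
      }'
}

printf '%s\n' '0.8 1 1 11 11' '0.2 0 0 5 5' '0.9 0 0 10 10' '0.7 20 20 30 30' |
  nms 0.5 0.5
```

With these inputs the 0.2 box fails the score threshold and the 0.8 box is suppressed by the overlapping 0.9 box (IoU about 0.68), so the 0.9 and 0.7 boxes are printed in that order.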
How grounding works
- the CLI resolves vision-ground-falcon-perception from the model store or uses the local root passed with --model
- the native Falcon runtime validates the root, loads config/tokenizer/weights, and preprocesses the image plus text query
- the model autoregressively emits grounded detections, including coordinate and size tokens, and decodes optional segmentation masks
- native postprocessing derives normalized centers, sizes, bounding boxes, and optional exported mask artifacts
- the runtime writes an annotated image plus structured JSON metadata designed for downstream agent use
How caption and inspect differ
- caption is a direct descriptive task
- inspect is a question-driven vision-language path
They share some of the same underlying vision support code, but they are presented as separate public tasks because the user intent differs.