brain-squishers/mind-two

Mind Two

This repository is a customized live object-finding runtime built on top of the original Streaming Grounded SAM 2 repo.

The current live pipeline combines:

  • OpenAI GPT-based query extraction
  • Grounding DINO for text-conditioned detection
  • SAM 2 for mask initialization and tracking
  • Depth Anything V2 Metric for distance estimation
  • Optional audio query input with OpenAI speech-to-text

The main entrypoint is run_live.py.

Toolchain Setup

Figure: High-level architecture for the current live perception pipeline.

Demo video: https://www.youtube.com/watch?v=kxXIVC9BiPU

The runtime is split into a few concrete layers:

Model Pipeline

The live path in run_live.py works like this:

  1. Input enters as either:
    • a text query from --query, or
    • an audio command captured after the wake phrase hello
  2. The query is queued and sent to the LLM extraction step.
    • The extraction produces:
      • targets
      • anchors
      • support_surfaces
  3. Anchor selection is resolved.
    • Default: fixed anchors
    • Optional: LLM-derived anchors
  4. Grounding DINO runs on the current frame for the active target phrase.
    • This produces candidate boxes for target initialization or re-detection.
  5. SAM 2 loads the frame and initializes tracked objects from those boxes.
    • After initialization, SAM 2 handles intermediate tracking updates between re-detections.
  6. Depth Anything V2 Metric runs on the full frame in a background worker.
    • The runtime samples depth values only inside the tracked SAM masks.
    • Median object depth is used for the distance overlay and scene reasoning.
  7. Context detections can run in parallel.
    • Anchor detections
    • Support-surface detections
    • Optional hand detections
  8. Scene reasoning and memory update from the tracked target state.
    • Spatial relations are computed from target, anchors, supports, and hand context.
    • Stable target observations can be written into scene memory.
  9. The UI overlay renders the current state:
    • query summary
    • tracking labels
    • depth estimates
    • spatial relations
    • memory fallback text when tracking is lost
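
The masked depth sampling in step 6 can be illustrated with a small NumPy sketch. It is not the code in run_live.py: the function name median_masked_depth and the rule for discarding invalid depth values are assumptions made for illustration. It only shows the idea of taking the median of a metric depth map restricted to a tracked mask.

```python
from typing import Optional

import numpy as np


def median_masked_depth(depth_m: np.ndarray, mask: np.ndarray) -> Optional[float]:
    """Median depth in metres over the pixels where mask is True.

    Hypothetical helper: returns None when the mask selects no valid pixels.
    """
    if depth_m.shape != mask.shape:
        raise ValueError("depth map and mask must have the same shape")
    values = depth_m[mask.astype(bool)]
    # Assumption: NaN, infinite, and non-positive depths are treated as invalid.
    values = values[np.isfinite(values) & (values > 0)]
    if values.size == 0:
        return None
    return float(np.median(values))


# Toy frame: background at 3.0 m, a 2x2 object at 1.2 m.
depth = np.full((4, 4), 3.0)
depth[1:3, 1:3] = 1.2
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

print(median_masked_depth(depth, mask))
```

Using the median rather than the mean keeps a few stray background pixels at the mask border from pulling the distance estimate away from the object.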

In short:

query/audio -> LLM extraction -> Grounding DINO boxes -> SAM 2 tracking -> Depth Anything masked distance -> scene reasoning + memory -> overlay
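
As a rough illustration of the first hop in that flow, the sketch below parses an extraction result into the three fields named above. The JSON wire format, the Extraction dataclass, and parse_extraction are assumptions for illustration; the actual prompt and response handling in this repository may differ.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Extraction:
    """Hypothetical container for the three lists the extraction step produces."""
    targets: List[str] = field(default_factory=list)
    anchors: List[str] = field(default_factory=list)
    support_surfaces: List[str] = field(default_factory=list)


def parse_extraction(raw: str) -> Extraction:
    """Parse a JSON object with targets, anchors, and support_surfaces keys."""
    data = json.loads(raw)

    def clean(key: str) -> List[str]:
        # Missing or null keys become empty lists; phrases are normalized.
        items = data.get(key) or []
        return [str(s).strip().lower() for s in items if str(s).strip()]

    return Extraction(clean("targets"), clean("anchors"), clean("support_surfaces"))


example = '{"targets": ["Phone"], "anchors": ["water bottle"], "support_surfaces": ["desk"]}'
result = parse_extraction(example)
print(result)
```

Lower-casing the phrases here is a design choice for the sketch: it keeps the target phrase passed to the detector consistent regardless of how the model capitalizes its output.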

Setup

Create an environment and install:

conda create -n sam2 python=3.10 -y
conda activate sam2
pip install -e .

If you use GPT models, set your API key in llm/.env:

API_KEY="..."
API_BASE=""

API_BASE is optional and only needed for Azure-style routing in the existing wrapper.
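
For readers who want to check their llm/.env before a run, here is a minimal sketch of reading the two keys shown above. The helper names read_env_file and api_settings are hypothetical, and the repository's existing wrapper may load the file differently (for example through a dotenv library).

```python
from pathlib import Path
from typing import Dict, Optional, Tuple


def read_env_file(path) -> Dict[str, str]:
    """Parse simple KEY="value" lines; blank lines and # comments are skipped."""
    values: Dict[str, str] = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def api_settings(path="llm/.env") -> Tuple[str, Optional[str]]:
    """Return (api_key, api_base); an empty API_BASE is reported as None."""
    env = read_env_file(path)
    api_key = env.get("API_KEY", "")
    if not api_key:
        raise RuntimeError(f"API_KEY is missing or empty in {path}")
    return api_key, (env.get("API_BASE") or None)
```

Calling api_settings() from the repository root would then either return the key pair or fail early with a clear message, instead of failing later inside an API call.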

Download the required checkpoints:

cd checkpoints
./download_ckpts.sh
cd gdino_checkpoints
hf download IDEA-Research/grounding-dino-tiny --local-dir grounding-dino-tiny
cd depth_anything_checkpoints
./download_metric_indoor_ckpts.sh

Main Run Commands

Webcam + text query

This is the simplest local run:

python run_live.py --model gpt-4o-2024-05-13

Pass an explicit text query:

python run_live.py --model gpt-4o-2024-05-13 --query "I am trying to find my phone"

Use a different camera:

python run_live.py --model gpt-4o-2024-05-13 --camera-index 1

Webcam + audio query input

Audio query mode listens for the wake phrase hello, then records the spoken command and transcribes it with gpt-4o-transcribe.

python run_live.py --model gpt-4o-2024-05-13 --query-input audio

If your microphone is not the default input device:

python run_live.py --model gpt-4o-2024-05-13 --query-input audio --audio-input-device-index 1

To set the wake phrase explicitly (the default is hello; substitute your own phrase to override it):

python run_live.py --model gpt-4o-2024-05-13 --query-input audio --wake-phrase "hello"
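
Because wake detection works on transcribed text, matching the wake phrase reduces to a string search. The sketch below shows one way that could look, assuming a single transcript that contains both the wake phrase and the command; the function command_after_wake_phrase is hypothetical and the real runtime may instead record the command as a separate clip.

```python
import re
from typing import Optional


def command_after_wake_phrase(transcript: str, wake_phrase: str = "hello") -> Optional[str]:
    """Return the text following the wake phrase.

    Returns None if the wake phrase is absent, and an empty string if it is
    present but nothing follows it.
    """
    pattern = r"\b" + re.escape(wake_phrase.strip()) + r"\b"
    match = re.search(pattern, transcript, flags=re.IGNORECASE)
    if match is None:
        return None
    # Trim whitespace and the punctuation a transcriber tends to add.
    return transcript[match.end():].strip(" \t\n,.!?;:-")


print(command_after_wake_phrase("Hello, find my phone."))
print(command_after_wake_phrase("where is my phone"))
```

The word-boundary match keeps a wake phrase such as hello from firing inside a longer word, and the case-insensitive flag absorbs the capitalization that speech-to-text models usually add at the start of a sentence.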

Server frame source

The live runner can also read frames from the local FastAPI server endpoint instead of a directly attached webcam.

Start the server:

python rtc_client_server/server.py

Then run the live pipeline against the server stream:

python run_live.py --model gpt-4o-2024-05-13 --frame-source server --stream-url http://127.0.0.1:5000/stream/latest-frame

If you also want audio query input while using server frames:

python run_live.py --model gpt-4o-2024-05-13 --frame-source server --stream-url http://127.0.0.1:5000/stream/latest-frame --query-input audio

Raspberry Pi client for the server

If you are using the included WebRTC client stream path, the client entrypoint is:

python rtc_client_server/client.py

The client's server_url is currently hardcoded in rtc_client_server/client.py; edit it there to point at your server.

Query Input Modes

Text input

Default mode:

python run_live.py --model gpt-4o-2024-05-13

Audio input

Enabled with:

python run_live.py --model gpt-4o-2024-05-13 --query-input audio

Useful audio flags:

--wake-phrase hello
--transcription-model gpt-4o-transcribe
--audio-input-device-index 1
--audio-silence-threshold 550
--min-silence-duration-s 1.0
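
To give a feel for how the last two flags might interact, here is a sketch of end-of-speech detection. It assumes the silence threshold is compared against the RMS amplitude of 16-bit PCM chunks and that recording stops once silence has lasted for the minimum duration; both are assumptions, and the names rms and end_of_speech_index are hypothetical rather than taken from the repository.

```python
from typing import Optional, Sequence

import numpy as np


def rms(chunk: np.ndarray) -> float:
    """Root-mean-square amplitude of one audio chunk."""
    samples = chunk.astype(np.float64)
    return float(np.sqrt(np.mean(samples ** 2))) if samples.size else 0.0


def end_of_speech_index(
    chunks: Sequence[np.ndarray],
    chunk_duration_s: float,
    silence_threshold: float = 550.0,
    min_silence_duration_s: float = 1.0,
) -> Optional[int]:
    """Index of the chunk at which recording would stop, or None if it never does."""
    heard_speech = False
    silent_chunks = 0
    for i, chunk in enumerate(chunks):
        if rms(chunk) >= silence_threshold:
            heard_speech = True
            silent_chunks = 0  # any loud chunk resets the silence timer
        elif heard_speech:
            silent_chunks += 1
            if silent_chunks * chunk_duration_s >= min_silence_duration_s:
                return i
    return None


loud = np.full(4000, 2000, dtype=np.int16)
quiet = np.full(4000, 50, dtype=np.int16)
chunks = [quiet, loud, loud] + [quiet] * 6
stop = end_of_speech_index(chunks, chunk_duration_s=0.25)
print(stop)
```

Under these assumptions, raising --audio-silence-threshold helps in noisy rooms where background sound would otherwise count as speech, while raising --min-silence-duration-s tolerates longer pauses inside a command.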

Anchor Modes

Anchors can come from either:

  • a fixed configured list
  • the LLM extraction output

Fixed anchors

This is the default behavior.

Current fixed anchor list:

  • water bottle
  • rubber duck
  • marker
  • usb
  • towel
  • snack

Run with default fixed anchors:

python run_live.py --model gpt-4o-2024-05-13

Run with an explicit custom fixed anchor list:

python run_live.py --model gpt-4o-2024-05-13 --anchor-source fixed --fixed-anchors "water bottle,rubber duck,marker,usb,towel,snack"

LLM-derived anchors

Switch to anchors from the LLM extraction:

python run_live.py --model gpt-4o-2024-05-13 --anchor-source llm
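
The choice between the two anchor modes can be sketched with argparse. The flag names and the default list come from this README; the helpers parse_anchor_list and resolve_anchors, and the normalization they apply, are assumptions for illustration rather than the code in run_live.py.

```python
import argparse
from typing import List, Sequence

DEFAULT_FIXED_ANCHORS = "water bottle,rubber duck,marker,usb,towel,snack"


def parse_anchor_list(raw: str) -> List[str]:
    """Split a comma-separated string, trimming, lower-casing, and de-duplicating."""
    seen = set()
    anchors: List[str] = []
    for item in raw.split(","):
        name = item.strip().lower()
        if name and name not in seen:
            seen.add(name)
            anchors.append(name)
    return anchors


def resolve_anchors(anchor_source: str, fixed_anchors: str, llm_anchors: Sequence[str]) -> List[str]:
    """Pick anchors from the LLM extraction or from the fixed list."""
    if anchor_source == "llm":
        return [a.strip().lower() for a in llm_anchors if a.strip()]
    return parse_anchor_list(fixed_anchors)


parser = argparse.ArgumentParser()
parser.add_argument("--anchor-source", choices=["fixed", "llm"], default="fixed")
parser.add_argument("--fixed-anchors", default=DEFAULT_FIXED_ANCHORS)

args = parser.parse_args(["--fixed-anchors", "Marker, usb ,marker"])
print(resolve_anchors(args.anchor_source, args.fixed_anchors, llm_anchors=["desk lamp"]))
```

With --anchor-source fixed (the default) the LLM-proposed anchors are ignored entirely, which keeps the set of context detections stable across queries.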

Useful Runtime Flags

Disable depth:

python run_live.py --model gpt-4o-2024-05-13 --disable-depth

Use the server frame source:

python run_live.py --model gpt-4o-2024-05-13 --frame-source server

Lower target detection thresholds are now the defaults:

  • --target-box-threshold 0.30
  • --target-text-threshold 0.25
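
The effect of the box threshold can be shown with a toy filter. This is a simplified sketch, not the Grounding DINO post-processing itself: the Detection dataclass and keep_confident are hypothetical, and only the box threshold is modeled. In Grounding DINO the text threshold governs which prompt tokens are attached to each box as its label, which this sketch leaves out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """Hypothetical detection record: phrase, confidence, and xyxy pixel box."""
    label: str
    score: float
    box_xyxy: Tuple[float, float, float, float]


def keep_confident(detections: List[Detection], box_threshold: float = 0.30) -> List[Detection]:
    """Keep boxes scoring at or above the threshold, best first."""
    kept = [d for d in detections if d.score >= box_threshold]
    return sorted(kept, key=lambda d: d.score, reverse=True)


candidates = [
    Detection("phone", 0.62, (120, 80, 260, 300)),
    Detection("phone", 0.28, (400, 90, 470, 210)),
    Detection("phone", 0.31, (10, 10, 60, 90)),
]

for det in keep_confident(candidates):
    print(det.label, det.score, det.box_xyxy)
```

Lower thresholds admit more candidate boxes for target initialization at the cost of more false positives, so it can be worth raising --target-box-threshold in cluttered scenes.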

Notes

  • The audio wake detection in this version is transcription-based, not a local keyword spotter.
  • In audio mode, say hello, then briefly pause, then say the command.
  • The live pipeline entrypoint is run_live.py.
  • The stream server entrypoint is rtc_client_server/server.py.

Citations

If you use this repository, cite the upstream projects and papers it builds on.

Original Streaming Grounded SAM 2 repo

SAM 2

Grounding DINO

Depth Anything V2

Segment Anything 2 codebase

Grounded-SAM-2

License

Apache-2.0 (LICENSE) and BSD-3-Clause (LICENSE_cctorch).
