How It Works

The system runs as two parallel processes. Python captures the RGB stream from the RealSense camera, runs MediaPipe gesture recognition, and sends hand data to Godot over UDP. Godot receives that data, samples the depth stream from the same camera independently, and combines the two — gesture type determines what happens, depth determines scale. All rendering happens in Godot.

For a full technical breakdown, see the thesis.

👁 Seeing your hands

An Intel RealSense SR305 depth camera sits at the front of the interaction zone at a height of about 1 m. Its RGB stream feeds into Python, where MediaPipe tracks 21 landmarks on each detected hand in real time.

MediaPipe supports eight built-in gesture categories. Three are used here:

Closed_Fist ✊
Open_Palm 🖐
Victory ✌️

Gestures with a confidence score below 0.5 are treated as unrecognised. Up to ten hands can be tracked simultaneously, although in practice the setup is tuned for three users.
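The filtering rule can be sketched in a few lines of Python. This is a minimal illustration and not the project's code: `filter_gesture`, `USED_GESTURES`, and `MIN_CONFIDENCE` are hypothetical names. It also assumes that categories outside the three listed above are handled the same way as low-confidence results.

```python
from typing import Optional

# The three MediaPipe categories this installation reacts to.
USED_GESTURES = {"Closed_Fist", "Open_Palm", "Victory"}

# Scores below this value are treated as unrecognised.
MIN_CONFIDENCE = 0.5


def filter_gesture(raw_gesture: str, confidence: float) -> Optional[str]:
    """Return the gesture name if it is usable, otherwise None."""
    if confidence < MIN_CONFIDENCE:
        return None
    # Assumption: unused categories are dropped like unrecognised ones.
    if raw_gesture not in USED_GESTURES:
        return None
    return raw_gesture
```

With these values, a `Closed_Fist` at 0.54 passes through, while the same gesture at 0.49 is discarded.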

📡 Sending data to Godot

Each detected hand is serialised into a JSON object, and all hands in a frame are sent together as a single array over UDP to Godot on port 9000.

A single hand entry from a real session packet looks like this:

{
  "timestamp": "08-05-2026 14:41:28.081",
  "gesture": "rock",
  "raw_gesture": "Closed_Fist",
  "confidence": 0.54,
  "hand_x": 0.274,
  "hand_y": 0.273
}

UDP was chosen for minimal latency and zero dependencies. A dropped packet is simply superseded by the next frame — acceptable for real-time interaction.
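A minimal sketch of the sending side, using only the Python standard library. The function names and the `127.0.0.1` host are assumptions made for illustration; only port 9000 and the packet fields come from the description above. The timestamp helper assumes a day-first date order, which the sample packet does not confirm.

```python
import json
import socket
from datetime import datetime

# Port 9000 is from the text; the loopback host is an assumption.
GODOT_ADDRESS = ("127.0.0.1", 9000)


def format_timestamp(moment: datetime) -> str:
    """Format like "08-05-2026 14:41:28.081" (day-first order is assumed)."""
    millis = moment.microsecond // 1000
    return moment.strftime("%d-%m-%Y %H:%M:%S.") + f"{millis:03d}"


def encode_frame(hands: list) -> bytes:
    """Serialise all hands of one frame as a single JSON array."""
    return json.dumps(hands).encode("utf-8")


def send_frame(sock: socket.socket, hands: list, address=GODOT_ADDRESS) -> None:
    """Send one datagram per frame, with no retry or acknowledgement."""
    sock.sendto(encode_frame(hands), address)
```

A caller would create the socket once with `socket.socket(socket.AF_INET, socket.SOCK_DGRAM)` and call `send_frame` after every processed camera frame. An empty list encodes to `[]`, so a frame without hands is still a valid packet.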

On the Godot side, a global receiver node listens each frame and emits a hands_updated signal carrying all currently detected hands. If no packet arrives for 500ms, the signal fires with an empty array — so artworks respond correctly when users step away.
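The receiver itself lives in Godot. The timeout rule is illustrated here in Python only to keep the examples on this page in one language. `HandsReceiver`, `on_packet`, and `on_frame` are hypothetical names, and the caller supplies the clock value so the logic stays easy to test.

```python
import json

TIMEOUT_SECONDS = 0.5  # from the text: 500 ms without a packet


class HandsReceiver:
    """Keeps the latest hands and clears them after a period of silence."""

    def __init__(self, timeout: float = TIMEOUT_SECONDS):
        self.timeout = timeout
        self.hands = []
        self.last_packet_time = None

    def on_packet(self, payload: bytes, now: float) -> list:
        """Decode a datagram (a JSON array of hands) and note when it arrived."""
        self.hands = json.loads(payload.decode("utf-8"))
        self.last_packet_time = now
        return self.hands

    def on_frame(self, now: float) -> list:
        """Called once per rendered frame; returns the hands to report."""
        silent = (
            self.last_packet_time is None
            or now - self.last_packet_time >= self.timeout
        )
        if silent:
            self.hands = []  # nobody in view: report an empty array
        return self.hands
```

In a Godot implementation, the value returned by `on_frame` would correspond to the payload of the `hands_updated` signal.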

📏 Distance as input

The RealSense also captures depth. The distance of each hand from the camera scales objects on screen — closer hands spawn bigger objects, further hands spawn smaller ones.
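One way to express the relationship between distance and size is a clamped linear mapping. This is a sketch under stated assumptions: the function name, the working range (`near_m`, `far_m`), and the scale limits are placeholder values, and the installation may use a different curve.

```python
def depth_to_scale(depth_m, near_m=0.3, far_m=1.5, min_scale=0.5, max_scale=2.0):
    """Map hand distance to object scale: nearer hands give larger objects."""
    # Clamp so readings outside the working range do not produce extreme sizes.
    clamped = max(near_m, min(far_m, depth_m))
    # t is 0.0 at the near limit and 1.0 at the far limit.
    t = (clamped - near_m) / (far_m - near_m)
    # Interpolate from max_scale (near) down to min_scale (far).
    return max_scale + t * (min_scale - max_scale)
```

With the placeholder defaults, a hand at 0.3 m yields a scale of 2.0 and a hand at 1.5 m yields 0.5.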

Depth is sampled using a small 5-point cross-shaped pixel neighbourhood at the wrist position, with the closest valid values taken to reduce sensor noise.
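The sampling pattern can be sketched as follows. The names, the `step` spacing, and the use of 0 as the marker for a missing reading are assumptions. The sketch returns the single nearest valid reading, which is only one possible interpretation of "closest valid values".

```python
from typing import Optional, Sequence

# Centre pixel plus its four axis-aligned neighbours.
CROSS_OFFSETS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]


def sample_depth_cross(depth: Sequence[Sequence[float]], x: int, y: int,
                       step: int = 2) -> Optional[float]:
    """Return the nearest valid depth around (x, y), or None if none is valid."""
    height = len(depth)
    width = len(depth[0]) if height else 0
    valid = []
    for dx, dy in CROSS_OFFSETS:
        px, py = x + dx * step, y + dy * step
        # Skip sample points that fall outside the depth image.
        if 0 <= px < width and 0 <= py < height:
            value = depth[py][px]
            if value > 0:  # assumed: 0 marks a pixel with no depth reading
                valid.append(value)
    return min(valid) if valid else None
```

Sampling several pixels means a single hole in the depth image at the wrist does not leave the hand without a distance value.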

Moving your hand toward or away from the screen is itself a form of interaction.

🎮 The display

Each artwork is a separate scene in the same Godot project. When a gesture packet arrives, Godot maps the normalised hand coordinates to screen space and samples the depth stream at the wrist position — a small 5-point cross-shaped pixel neighbourhood — to get the hand's distance from the camera. Gesture type determines what spawns or moves; depth determines its scale. The two streams meet here, in the rendering layer.
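The coordinate mapping mentioned above amounts to scaling the normalised values by the screen size. The sketch below is written in Python for consistency with the other examples, although this step runs in Godot. `to_screen` and the optional `mirror` flag are hypothetical, and whether the installation flips the image horizontally is not stated here.

```python
def to_screen(hand_x, hand_y, screen_width, screen_height, mirror=False):
    """Convert normalised [0, 1] hand coordinates to pixel coordinates."""
    # Clamp first: landmark estimates may fall slightly outside the image.
    nx = min(1.0, max(0.0, hand_x))
    ny = min(1.0, max(0.0, hand_y))
    if mirror:
        nx = 1.0 - nx  # optional horizontal flip for a mirror-like display
    return nx * screen_width, ny * screen_height
```

On a 1920 by 1080 screen, the sample packet's `hand_x` of 0.274 and `hand_y` of 0.273 land in the upper-left quarter of the display.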

Read more about the artworks on the Artworks page.

The full system runs on a standard laptop.

Gesture recognition: ~25 fps
Rendering: 60 fps
Display: almost any horizontal screen

Spatial arrangement from one of the early testing sessions of Lavalamp. The user (1) stood about 1.5 m (5) from the display (2) and around 1.5 m (4) from the RealSense camera (3), which was placed at a height of 1 m.

🌐 Project website | 📄 Thesis | 🖥 Original Reflexions | 🔬 CGVR Study Lab | 🎓 Institute of Computer Science

Reflexions – Gestural Play as an Interactive Art Installation
