Reflexions – Gestural Play as an Interactive Art Installation

Completed
computer-graphics · computer-vision · human-computer-interaction · interaction-design · user-experience
Master · Project start: 30.09.2025 · by: Elina Meier

Implementation

How It Works

architecture.png

The system runs as two parallel processes. Python captures the RGB stream from the RealSense camera, runs MediaPipe gesture recognition, and sends hand data to Godot over UDP. Godot receives that data, samples the depth stream from the same camera independently, and combines the two — gesture type determines what happens, depth determines scale. All rendering happens in Godot.

For a full technical breakdown, see the thesis.


👁 Seeing your hands

realsense.jpg

An Intel RealSense SR305 depth camera sits at the front of the interaction zone at a height of about 1 m. Its RGB stream feeds into Python, where MediaPipe tracks 21 landmarks on each detected hand in real time.

handLandmarks.png

MediaPipe supports eight built-in gesture categories. Three are used here:

  • Closed_Fist ✊
  • Open_Palm 🖐
  • Victory ✌️

detectedHands.png

Gestures with a confidence score below 0.5 are treated as unrecognised. Up to ten hands can be tracked simultaneously, although in practice the setup is tuned for three users.
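The confidence rule can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function name `classify` and the `GESTURE_LABELS` table are hypothetical, and only the Closed_Fist to "rock" pairing is taken from the sample packet shown on this page. The other two labels are assumed.

```python
from typing import Optional

# Only Closed_Fist -> "rock" appears in the sample packet on this page;
# the other two labels are assumptions made for illustration.
GESTURE_LABELS = {
    "Closed_Fist": "rock",
    "Open_Palm": "paper",    # assumed label
    "Victory": "scissors",   # assumed label
}
MIN_CONFIDENCE = 0.5  # threshold stated in the text


def classify(raw_gesture: str, confidence: float) -> Optional[str]:
    """Return the app-level label, or None when the gesture is unrecognised."""
    if confidence < MIN_CONFIDENCE:
        return None
    # Any MediaPipe category outside the three used ones maps to None.
    return GESTURE_LABELS.get(raw_gesture)
```

With this sketch, a Closed_Fist at confidence 0.54 passes the threshold, while the same gesture at 0.49 is dropped.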


📡 Sending data to Godot

Each detected hand is serialised into a JSON object and all hands in a frame are sent together as a single array over UDP to Godot on port 9000.

A single hand entry from a real session's packet looks like this:

{
  "timestamp": "08-05-2026 14:41:28.081",
  "gesture": "rock",
  "raw_gesture": "Closed_Fist",
  "confidence": 0.54,
  "hand_x": 0.274,
  "hand_y": 0.273
}

UDP was chosen for minimal latency and zero dependencies. A dropped packet is simply superseded by the next frame — acceptable for real-time interaction.
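A minimal sketch of the sending side, using only the Python standard library. The helper names `encode_hands` and `send_hands` are hypothetical, and the host address is an assumption (the text only states port 9000).

```python
import json
import socket
from typing import Dict, List, Tuple

GODOT_ADDR = ("127.0.0.1", 9000)  # port from the text; host is an assumption


def encode_hands(hands: List[Dict]) -> bytes:
    """Serialise all hands of one frame as a single JSON array."""
    return json.dumps(hands).encode("utf-8")


def send_hands(sock: socket.socket, hands: List[Dict],
               addr: Tuple[str, int] = GODOT_ADDR) -> None:
    """Fire and forget: one datagram per frame, no acknowledgement."""
    sock.sendto(encode_hands(hands), addr)
```

A caller would create one socket with `socket.socket(socket.AF_INET, socket.SOCK_DGRAM)` at startup and call `send_hands` once per processed frame. Nothing is retried, which matches the idea that the next frame supersedes a lost one.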

On the Godot side, a global receiver node listens each frame and emits a hands_updated signal carrying all currently detected hands. If no packet arrives for 500 ms, the signal fires with an empty array, so artworks respond correctly when users step away.
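The timeout behaviour can be modelled independently of the engine. The sketch below is written in Python rather than GDScript and is not the project's receiver node: the class name `HandsWatchdog` and its `update` method are hypothetical, socket reading is left out, and it assumes the empty array is emitted once per absence rather than on every frame.

```python
import json
from typing import Callable, List, Optional

TIMEOUT_S = 0.5  # 500 ms, as stated in the text


class HandsWatchdog:
    """Call update() once per frame with the newest datagram, or None."""

    def __init__(self, on_hands_updated: Callable[[List[dict]], None],
                 timeout_s: float = TIMEOUT_S) -> None:
        self._emit = on_hands_updated
        self._timeout_s = timeout_s
        self._last_seen = 0.0
        self._empty_sent = True  # start in the "no hands" state

    def update(self, packet: Optional[bytes], now: float) -> None:
        if packet is not None:
            self._last_seen = now
            self._empty_sent = False
            self._emit(json.loads(packet))
        elif not self._empty_sent and now - self._last_seen >= self._timeout_s:
            # Assumption: signal the empty state once, not every frame.
            self._empty_sent = True
            self._emit([])
```

Passing the clock value in as `now` keeps the logic easy to test. In a real loop it would come from a monotonic timer.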


📏 Distance as input

rgbAndDepth.png

The RealSense also captures depth. The distance of each hand from the camera scales objects on screen: closer hands spawn bigger objects, and farther hands spawn smaller ones.

Depth is sampled using a small 5-point cross-shaped pixel neighbourhood at the wrist position, with the closest valid values taken to reduce sensor noise.
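One way to read the sampling step is sketched below. It makes several assumptions that the text does not confirm: the cross has one-pixel arms, a depth value of 0 marks an invalid reading, and "closest valid values" is interpreted as the single nearest (smallest) valid reading. The function name `sample_depth` is hypothetical.

```python
from typing import List, Optional, Sequence

# Centre pixel plus its four direct neighbours, as (dx, dy) offsets.
CROSS_OFFSETS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]


def sample_depth(depth: Sequence[Sequence[float]], x: int, y: int,
                 step: int = 1) -> Optional[float]:
    """Return the nearest valid depth around (x, y), or None if none is valid."""
    height = len(depth)
    width = len(depth[0]) if height else 0
    samples: List[float] = []
    for dx, dy in CROSS_OFFSETS:
        px, py = x + dx * step, y + dy * step
        if 0 <= px < width and 0 <= py < height:
            value = depth[py][px]
            if value > 0:  # assumption: 0 means "no depth data"
                samples.append(value)
    return min(samples) if samples else None
```

Sampling five pixels instead of one means a single hole in the depth image at the wrist does not leave the hand without a distance.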

Moving your hand toward or away from the screen is itself a form of interaction.


🎮 The display

Each artwork is a separate scene in the same Godot project. When a gesture packet arrives, Godot maps the normalised hand coordinates to screen space and samples the depth stream around the wrist position to get the hand's distance from the camera. Gesture type determines what spawns or moves; depth determines its scale. The two streams meet here, in the rendering layer.
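The two mappings can be sketched as plain functions. This is an illustration in Python, not the Godot code: the names `to_screen` and `depth_to_scale` are hypothetical, the near/far distances and scale limits are made-up defaults, and the linear falloff is an assumption. Any horizontal mirroring between camera image and display is left out.

```python
from typing import Tuple


def to_screen(hand_x: float, hand_y: float,
              width: int, height: int) -> Tuple[float, float]:
    """Map normalised [0, 1] hand coordinates to pixel coordinates."""
    return hand_x * width, hand_y * height


def depth_to_scale(depth_m: float, near_m: float = 0.3, far_m: float = 1.5,
                   max_scale: float = 2.0, min_scale: float = 0.5) -> float:
    """Closer hands give larger scales; all default values are assumptions."""
    clamped = min(max(depth_m, near_m), far_m)
    t = (clamped - near_m) / (far_m - near_m)  # 0 at near, 1 at far
    return max_scale + t * (min_scale - max_scale)
```

Clamping the depth keeps a hand that is very close to the lens or far outside the interaction zone from producing extreme object sizes.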

Read more about the artworks on the Artworks page.

The full system runs on a standard laptop.

  • Gesture recognition: ~25 fps
  • Rendering: 60 fps
  • Display: almost any horizontal screen

testSetup.png

Spatial arrangement from one of the early testing sessions of Lavalamp. The user (1) was about 1.5 m (5) from the display (2) and around 1.5 m (4) from the RealSense camera (3), which was placed at a height of 1 m.


🌐 Project website  |  📄 Thesis  |  🖥 Original Reflexions  |  🔬 CGVR Study Lab  |  🎓 Institute of Computer Science