Fruit Ninja (Computer Vision v.) Case Study
this was the project that finally made all the "probability & statistics + linear algebra + stochastic calculus" (hmu if you want the whole Obsidian note i used) stuff click together into something cool. after doing this, i am DEFINITELY gonna be dabbling some more in computer vision & ML.
it was also my first project where the feedback loop was this immediate. no build step, no deployment. it was just change code, run, see it work (or not), repeat.
Context
this project started as a basic hand-tracking tool: MediaPipe + OpenCV + PyAutoGUI to move the cursor w/ my index finger & use a pinch as a left-click. that alone was cool, but you know me. gotta gamify it.
so i booted up Perplexity and asked it for "simple game ideas using hand tracking." it suggested this Fruit Ninja clone, which sounded ok (i just chose it cause the rest of the ideas were trash). so i set out to build it.
Approach
webcam capture w/ cv2.VideoCapture, MediaPipe’s HandLandmarker for 3D hand landmarks, & OpenCV for all rendering, text, & collision visualization. all inside a single loop.
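here's roughly what that single loop looks like. this is a minimal sketch, not the actual code — it uses mediapipe's classic `mp.solutions.hands` api to keep it short (the real project uses `HandLandmarker` from the tasks api), & the game-specific steps are left as comments:

```python
import cv2
import mediapipe as mp

# minimal sketch of the single-loop structure: capture -> landmarks -> game update -> render
hands = mp.solutions.hands.Hands(max_num_hands=1, min_detection_confidence=0.6)
cap = cv2.VideoCapture(0)

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    frame = cv2.flip(frame, 1)  # mirror the frame so moving your hand right moves right on screen
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    results = hands.process(rgb)  # 21 landmarks per detected hand

    if results.multi_hand_landmarks:
        landmarks = results.multi_hand_landmarks[0].landmark
        # ... update blade trail, check pinch, test slices against fruits ...

    # ... update fruit physics, then draw fruits / trail / score with cv2 calls ...
    cv2.imshow("fruit ninja", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```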
btw the “blade” is just a trail of recent index finger positions stored in a list & rendered as a line across frames. the sole reason i included it was because i needed the reassurance that it was actually detecting my finger movement, since there’s no physical controller (first time ok).
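a rough sketch of how that trail can work — the 15-point cap & the drawing style here are assumptions, not my exact values:

```python
from collections import deque

import cv2

TRAIL_LEN = 15  # assumption: how many recent points the blade keeps
trail = deque(maxlen=TRAIL_LEN)

def update_trail(index_tip_px):
    """append the latest index-fingertip position as an (x, y) pixel tuple of ints."""
    trail.append(index_tip_px)

def draw_trail(frame):
    """draw the blade as line segments between consecutive trail points."""
    for p1, p2 in zip(trail, list(trail)[1:]):
        cv2.line(frame, p1, p2, (255, 255, 255), thickness=3)
```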
slicing is a distance check b/w those trail points & each fruit’s center; when within radius, the fruit marks itself sliced & switches to a split animation.
game state lives in a few scalars + lists:
- fruits
- score
- lives
- paused
- spawn_timer
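for concreteness, that state is basically just this (starting values below are placeholders, not necessarily what i shipped):

```python
# game state — a few scalars + one list
fruits = []        # live fruit objects currently on screen
score = 0
lives = 3
paused = False
spawn_timer = 0    # frames until the next fruit spawns
```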
fruits spawn w/ a randomized \( x \), an initial upward velocity, & (i'm not god, but i tried my best) an approximation of gravity. their states update each frame until they're sliced or fall off-screen. and lastly, a thumb–index pinch, detected via a landmark distance threshold, toggles pause / play.
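the spawn step, as a sketch — the frame size & velocity ranges here are illustrative, not the actual tuning:

```python
import random

FRAME_W, FRAME_H = 640, 480  # assumed webcam resolution

def spawn_fruit():
    """launch a fruit from below the frame with a random x and an upward velocity."""
    return {
        "x": random.uniform(50, FRAME_W - 50),
        "y": FRAME_H + 20,               # start just below the visible frame
        "vx": random.uniform(-3, 3),     # slight horizontal drift
        "vy": random.uniform(-18, -12),  # negative = upward, since y grows downward
        "radius": 30,
        "sliced": False,
    }
```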
The Math
under the hood this project is basically: take 3D hand landmarks, convert to 2D pixel coordinates, then use Euclidean geometry for the rest.
MediaPipe gives each hand as 21 landmarks w/ \( (x, y, z) \), where \( x, y \) are normalized to \([0, 1]\) in image coords & \( z \) is a relative depth value.
to actually get a fingertip in pixels, you do:
\( x_{px} = \text{landmark.x} \times \text{frame\_width} \)
\( y_{px} = \text{landmark.y} \times \text{frame\_height} \)
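in code, that mapping is just this (assuming `landmarks` is the 21-element list MediaPipe returns for a hand):

```python
def to_pixels(landmark, frame_width, frame_height):
    """map a normalized MediaPipe landmark to integer pixel coordinates."""
    return int(landmark.x * frame_width), int(landmark.y * frame_height)

# e.g. the index fingertip (landmark 8):
# index_px = to_pixels(landmarks[8], frame.shape[1], frame.shape[0])
```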
once you have that, you need to detect pinching. the way i did it was simple distance thresholding b/w the thumb tip (landmark 4) & index tip (landmark 8) in normalized landmark space:
in essence, you're computing the 3D Euclidean distance:
\( d = \sqrt{(x_4 - x_8)^2 + (y_4 - y_8)^2 + (z_4 - z_8)^2} \) and treating pinch as when \( d < 0.05 \).
conceptually, landmarks closer together in camera space = a smaller \( d \).
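the whole pinch check fits in a few lines (same 0.05 threshold as above; the helper name is mine):

```python
import math

PINCH_THRESHOLD = 0.05  # normalized-space distance below which thumb + index count as pinched

def is_pinching(landmarks):
    """3D euclidean distance between thumb tip (4) and index tip (8) in normalized space."""
    thumb, index = landmarks[4], landmarks[8]
    d = math.sqrt((thumb.x - index.x) ** 2
                  + (thumb.y - index.y) ** 2
                  + (thumb.z - index.z) ** 2)
    return d < PINCH_THRESHOLD
```

one thing to watch: toggle pause only on the frame the pinch starts (an edge, not a level), otherwise holding the pinch flips pause on & off every frame.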
now the fruit itself. each fruit is a point mass w/ basic kinematics in 2D, & it really only has 2 components:
1. position: \( (x, y) \)
2. velocity: \( (v_x, v_y) \)
and then you just update each frame:
\( v_y \leftarrow v_y + g \) (gravity term, here \( g = 0.5 \) pixels/frame²)
\( x \leftarrow x + v_x \)
\( y \leftarrow y + v_y \)
so they follow discrete parabolic arcs until sliced or off-screen (bottom). and the off-screen check is just if \( y > \text{frame\_height} + \text{margin} \).
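the per-frame update, assuming the fruit dicts from the spawn sketch above (and the \( g = 0.5 \) value from above):

```python
GRAVITY = 0.5  # pixels/frame²

def update_fruits(fruits, frame_height, margin=50):
    """advance every fruit by one frame, then drop the ones that fell past the bottom."""
    for fruit in fruits:
        fruit["vy"] += GRAVITY   # gravity pulls down (y grows downward in image coords)
        fruit["x"] += fruit["vx"]
        fruit["y"] += fruit["vy"]
    return [f for f in fruits if f["y"] <= frame_height + margin]
```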
lastly, slicing detection. each fruit is treated as basically a filled circle w/ a center \( (x_f, y_f) \) & radius \( r \). the slicing test is a point-circle distance b/w each trail point \( (x_i, y_i) \) & the fruit center:
\( d = \sqrt{(x_i - x_f)^2 + (y_i - y_f)^2} \)
if any point in the recent trail has \( d < r \), the fruit is counted as sliced.
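and the slicing test in code, same assumptions as the sketches above (a trail of pixel points, fruit dicts with a center & radius):

```python
import math

def check_slices(trail, fruits, on_slice):
    """mark a fruit sliced if any recent trail point lands inside its circle."""
    for fruit in fruits:
        if fruit["sliced"]:
            continue
        for (x, y) in trail:
            d = math.hypot(x - fruit["x"], y - fruit["y"])
            if d < fruit["radius"]:
                fruit["sliced"] = True
                on_slice(fruit)   # e.g. bump the score & start the split animation
                break
```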
and that’s basically it! the whole game is essentially a loop over time where you repeatedly (1) map normalized landmark coordinates to pixels, (2) update fruit state w/ simple kinematics, & (3) apply distance-based tests for both gestures (pinch) + slicing (trail vs circle) in a consistent 2D coordinate frame.
Outcome
this project was like scratching an itch i didn't know was there. web dev is fun & all, but actually seeing code translate to real-time physical interaction? mind-blowing for a dev who's only seen like MERN.
anyways, now i have a basic hand-tracking framework set up in Python if i ever wanna do anything more with CV.
next up: maybe some ACTUAL gesture recognition, or even a full-on sign language interpreter. who knows. hit me up maybe we can collab.