ai · May 9, 2026 · 4 min read · SONICHAOS editorial

Video-to-music: how the match runs from a 60-second clip to 30 candidates

Frame extraction, CLIP embeddings, a learned projection into CLAP space, ANN search across 40,000 vectors, and tempo-keyed re-ranking. The honest scope statement is at the end.

The /video-to-music route takes a 60-second clip and returns 30 candidate tracks ranked by visual fit. This post explains every step on the way, including the parts where the model is still wrong about a third of the time.

Frame extraction

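Step one is FFmpeg, run as a single subprocess per upload. We pull one frame per second of the clip at 384 × 384 into a temporary directory keyed by the upload ID. Frames are written as JPEG at a nominal quality of 85; the command below sets this with -q:v 3, which is FFmpeg's own quantiser scale (lower values mean higher quality), not a 0-100 percentage. A 60-second clip yields 60 frames; a 90-second clip yields 90. The hard cap is 120 frames per upload, enforced at the API edge.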

ffmpeg -i upload.mp4 -vf fps=1,scale=384:384 \
       -q:v 3 frame_%03d.jpg

The frame rate is fixed at 1 fps because the visual scene rarely changes faster than that for the kind of footage producers send to us (interview B-roll, product shots, drone footage, narrative cuts). Higher rates burn GPU time on near-duplicate frames.
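
For readers who want to script the same call, here is a minimal sketch that builds the argument list in Python. The function name, the output directory layout, and the extra -frames:v cap are assumptions for illustration; the post only says the 120-frame cap is enforced at the API edge.

```python
import os

MAX_FRAMES = 120  # hard cap per upload, taken from the text above


def build_ffmpeg_cmd(upload, out_dir, fps=1, size=384, max_frames=MAX_FRAMES):
    """Return the argv list for one FFmpeg frame-extraction run.

    Hypothetical helper: mirrors the command shown above and adds
    -frames:v as a second safety net (an assumption, not the
    production setup).
    """
    pattern = os.path.join(out_dir, "frame_%03d.jpg")
    return [
        "ffmpeg",
        "-i", upload,
        "-vf", f"fps={fps},scale={size}:{size}",  # 1 frame/s, square resize
        "-q:v", "3",                              # FFmpeg quantiser scale, lower is better
        "-frames:v", str(max_frames),             # stop after the cap
        pattern,
    ]
```

Passing the list to subprocess.run(cmd, check=True) keeps it to one subprocess per upload and avoids shell quoting issues with user-supplied filenames.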

The CLIP image embedding

Every frame runs through OpenAI's CLIP ViT-L/14 at 336 px. The output is a 768-dimensional vector per frame. We average the per-frame vectors into a single clip-level embedding, then L2-normalise the result. The average is a deliberate choice: a learned temporal aggregator added two points of accuracy on our internal eval but tripled the inference cost, which did not pencil out for a feature most users hit twice and move on from.
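
The pooling step is simple enough to show in full. This is a pure-Python sketch on toy vectors, assuming each frame vector is a plain list of floats; a production version would more likely do the same arithmetic on batched tensors of width 768.

```python
import math


def pool_frames(frame_vectors):
    """Average per-frame vectors, then L2-normalise the mean."""
    if not frame_vectors:
        raise ValueError("need at least one frame vector")
    n = len(frame_vectors)
    dim = len(frame_vectors[0])
    # Element-wise mean across frames.
    mean = [sum(vec[i] for vec in frame_vectors) / n for i in range(dim)]
    # Scale the mean to unit length so cosine similarity is a dot product.
    norm = math.sqrt(sum(x * x for x in mean))
    if norm == 0.0:
        raise ValueError("mean vector has zero norm")
    return [x / norm for x in mean]


clip_vector = pool_frames([[3.0, 0.0], [0.0, 4.0]])  # two toy "frames"
```

Note the order: the normalisation happens after the mean, so bright or high-contrast frames with larger raw norms pull the average harder than they would if each frame were normalised first.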

CLIP is run on a single A10G via TorchServe, batched at 32 frames per forward pass. End-to-end latency for a 60-frame clip lands at 1.4 seconds, dominated by the JPEG decode rather than the GPU.

The learned projector

CLIP embeds images. CLAP embeds audio. The two models share a contrastive training objective but their embedding spaces are not aligned. The projector is a small MLP (two hidden layers of 1024 units, GELU activations, a residual connection from input to output) trained on roughly 80,000 (frame-strip, music-clip) pairs scraped from licensed production footage with their sync-licensed scores attached. It maps the 768-dimensional CLIP vector into the 512-dimensional CLAP audio space.

Training ran on a single A100 over four days. The held-out validation loss bottomed out at 0.34 cosine distance, which sounds tight on paper but corresponds to "right vibe, sometimes wrong tempo" in subjective listening.

The ANN search

The projected vector hits an HNSW index built over 40,000 catalog tracks. Every track is represented by a single CLAP vector computed against its 30-second preview, the same window the player uses. The index is built with M=32, efConstruction=200, served by hnswlib inside the model server. Query-time efSearch is 64.

The top 200 nearest neighbours come back inside 8 ms on a single CPU core. We over-fetch on purpose because the next stage prunes hard.
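
An approximate index is only as good as its recall against exact search, so it helps to have the exact version on hand. The sketch below is a brute-force cosine search, not HNSW and not the production path; the function names and the dict-of-vectors catalog are assumptions for illustration.

```python
import heapq


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are L2-normalised."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def exact_top_k(query, catalog, k=200):
    """Return [(track_id, distance)] for the k closest catalog vectors.

    catalog is a hypothetical {track_id: vector} mapping. This scans
    every vector, which is exactly the work the HNSW index avoids.
    """
    scored = ((tid, cosine_distance(query, vec)) for tid, vec in catalog.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])
```

Running a sample of queries through both paths and counting the overlap in the returned IDs gives a recall figure for a given efSearch setting.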

The re-rank

The re-ranker scores the 200 candidates on three signals beyond visual similarity:

Tempo compatibility. The video pacing estimate (a one-liner that counts how often FFmpeg's scene-change detector reports a cut) is cross-referenced against each track's BPM. Tracks within ±10% of the pacing target gain a score boost; tracks more than 30% off are dropped.
Key compatibility. If the upload includes reference audio (a scratch dialogue track, for example), the analyser estimates a key, and tracks within ±1 step of it on the Camelot wheel gain a small boost.
Editorial tag overlap. If the projector lands a frame near "warehouse, industrial, neon," the re-ranker prefers tracks already tagged for that context over the closest cosine match.
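
The tempo rule has two thresholds, which is easiest to read as code. This is a sketch under assumptions: the post does not say how cut frequency becomes a target BPM, so target_bpm is taken as a given input, and the function name and the boost value of 1.0 are hypothetical.

```python
from typing import Optional


def tempo_score(track_bpm: float, target_bpm: float,
                boost: float = 1.0) -> Optional[float]:
    """Score one track against the pacing target.

    Returns None when the track should be dropped (more than 30% off),
    `boost` when it is within 10% of the target, and 0.0 in between.
    """
    if target_bpm <= 0:
        raise ValueError("target_bpm must be positive")
    deviation = abs(track_bpm - target_bpm) / target_bpm
    if deviation > 0.30:
        return None   # hard drop
    if deviation <= 0.10:
        return boost  # close enough to earn the boost
    return 0.0        # kept, but no tempo credit
```

With a 120 BPM target, a 126 BPM track earns the boost, a 140 BPM track stays in the pool without it, and a 160 BPM track is removed before the final sort.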

The final 30 are sorted by a weighted sum: 0.6 visual similarity, 0.25 tempo, 0.10 editorial tags, 0.05 key.
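
As a sketch of that final sort, assuming each signal has already been scaled to the range 0 to 1 (the post does not specify the scaling) and using hypothetical names throughout:

```python
# Weights as stated above; they sum to 1.0.
WEIGHTS = {"visual": 0.60, "tempo": 0.25, "tags": 0.10, "key": 0.05}


def final_score(signals):
    """Weighted sum of the four signals; a missing signal counts as 0."""
    return sum(weight * signals.get(name, 0.0) for name, weight in WEIGHTS.items())


def top_candidates(candidates, n=30):
    """Return the n best track IDs, highest score first.

    candidates is a hypothetical {track_id: {signal_name: value}} mapping.
    """
    ranked = sorted(candidates, key=lambda tid: final_score(candidates[tid]),
                    reverse=True)
    return ranked[:n]
```

The weights mean a track that is slightly further away visually can still outrank a closer one if it matches the pacing: 0.8 visual with full tempo credit scores 0.73, against 0.54 for 0.9 visual with none.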

Honest scope statement

This is the part the marketing copy elsewhere will not say. The projector is trained, not zero-shot, and on our internal eval (200 held-out clips, three human raters per clip) the top-3 results contain the right vibe about 71% of the time. Top-1 is closer to 38%. The model is good at "this is a tense night-time montage; show me dark synth music"; it is worse at "this is a lighthearted product reveal; show me something quirky but not grating." Quirky-but-not-grating is a hard target for any audio embedding.

We log every "play" and "save" event from the candidate list back into the training data, gated on user consent, and re-train the projector quarterly. The eval number has moved from 64% top-3 to 71% top-3 across the last three retrains.

Try the route and you'll see the 30 candidates ranked, with the BPM and key on every row.
