Frame extraction, CLIP embeddings, a learned projection into CLAP space, ANN search across 40,000 vectors, and tempo-keyed re-ranking. The honest scope statement is at the end.
The /video-to-music route takes a 60-second clip and returns 30
candidate tracks ranked by visual fit. This post explains every step on
the way, including the parts where the model is still wrong about a
third of the time.
Step one is FFmpeg, run as a single subprocess per upload. We pull one frame per second of the clip at 384 x 384, JPEG quality 85, into a temporary directory keyed by the upload ID. A 60-second clip yields 60 frames; a 90-second clip yields 90. The hard cap is 120 frames per upload, enforced at the API edge.
ffmpeg -i upload.mp4 -vf fps=1,scale=384:384 \
-q:v 3 frame_%03d.jpg
The frame rate is fixed at 1 fps because the visual scene rarely changes faster than that for the kind of footage producers send to us (interview B-roll, product shots, drone footage, narrative cuts). Higher rates burn GPU time on near-duplicate frames.
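For readers who want to reproduce the extraction step from Python, here is a minimal sketch of a wrapper around the command above. The function names, the temporary-directory layout, and the use of FFmpeg's `-frames:v` option as a second guard on the 120-frame cap are assumptions for illustration; the post only says the cap is enforced at the API edge.

```python
import subprocess
import tempfile
from pathlib import Path

MAX_FRAMES = 120  # hard cap per upload, as stated in the text


def build_ffmpeg_cmd(src: str, out_dir: Path, max_frames: int = MAX_FRAMES) -> list[str]:
    """Build the argv for a 1 fps, 384x384 JPEG frame dump."""
    return [
        "ffmpeg", "-i", src,
        "-vf", "fps=1,scale=384:384",
        "-q:v", "3",
        # Assumption: also stop FFmpeg itself at the cap, in addition to
        # whatever check the API edge performs.
        "-frames:v", str(max_frames),
        str(out_dir / "frame_%03d.jpg"),
    ]


def extract_frames(src: str, upload_id: str) -> list[Path]:
    """Run FFmpeg once per upload and return the frame paths in order."""
    # Hypothetical layout: one directory per upload ID under the system temp dir.
    out_dir = Path(tempfile.gettempdir()) / f"frames_{upload_id}"
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_cmd(src, out_dir), check=True, capture_output=True)
    return sorted(out_dir.glob("frame_*.jpg"))
```

Passing the command as a list rather than a shell string avoids quoting problems with user-supplied file names.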
Every frame runs through OpenAI's CLIP ViT-L/14 at 336 px. The output is a 768-dimensional vector per frame. We average the per-frame vectors into a single clip-level embedding, then L2-normalise the result. The average is a deliberate choice: a learned temporal aggregator added two points of accuracy on our internal eval but tripled the inference cost, which did not pencil out for a feature most users hit twice and move on from.
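The pooling step is simple enough to show in full. The sketch below uses plain Python lists so it runs without dependencies; a production version would more likely use NumPy or PyTorch tensors. The function names are illustrative, not taken from the actual codebase.

```python
import math


def mean_pool(frame_vectors: list[list[float]]) -> list[float]:
    """Average per-frame vectors element-wise into one clip-level vector."""
    if not frame_vectors:
        raise ValueError("need at least one frame vector")
    n = len(frame_vectors)
    return [sum(column) / n for column in zip(*frame_vectors)]


def l2_normalise(vec: list[float]) -> list[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def clip_level_embedding(frame_vectors: list[list[float]]) -> list[float]:
    """Mean-pool the frame embeddings, then L2-normalise the result."""
    return l2_normalise(mean_pool(frame_vectors))
```

With two toy 2-dimensional "frames" `[3, 0]` and `[3, 8]`, the mean is `[3, 4]` and the normalised result is `[0.6, 0.8]`.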
CLIP is run on a single A10G via TorchServe, batched at 32 frames per forward pass. End-to-end latency for a 60-frame clip lands at 1.4 seconds, dominated by the JPEG decode rather than the GPU.
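To make the batching concrete: at 32 frames per forward pass, a 60-frame clip needs two passes (32 + 28) and a clip at the 120-frame cap needs four. A minimal sketch of the chunking, with a hypothetical helper name:

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")
BATCH_SIZE = 32  # frames per forward pass, as stated in the text


def batches(items: Sequence[T], size: int = BATCH_SIZE) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items, preserving order."""
    if size <= 0:
        raise ValueError("batch size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```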
CLIP embeds images. CLAP embeds audio. The two models share a contrastive training objective, but their embedding spaces are not aligned, so a CLIP vector cannot be compared with a CLAP vector directly. A learned projector bridges the gap. It is a small MLP (two hidden layers of 1024 units, GELU activations, residual connection from input to output) trained on roughly 80,000 (frame-strip, music-clip) pairs scraped from licensed production footage with their sync-licensed scores attached. The projector maps a 768-dimensional CLIP vector into the 512-dimensional CLAP audio space.
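The shape of the projector can be sketched as a forward pass in dependency-free Python. This is an illustration of the architecture as described, not the trained model: weights are random, and real training would use a tensor framework. One detail is an assumption: because the input (768) and output (512) widths differ, the residual path here goes through its own linear layer, which the post does not specify.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def init_linear(rng: random.Random, n_in: int, n_out: int):
    """Return (weights, bias) with small uniform random weights."""
    bound = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    """Compute W @ x + b for one input vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


class Projector:
    """CLIP-to-CLAP projector: two GELU hidden layers plus a skip path."""

    def __init__(self, d_in=768, d_hidden=1024, d_out=512, seed=0):
        rng = random.Random(seed)
        self.fc1 = init_linear(rng, d_in, d_hidden)
        self.fc2 = init_linear(rng, d_hidden, d_hidden)
        self.out = init_linear(rng, d_hidden, d_out)
        # ASSUMPTION: a linear skip so the residual can bridge d_in -> d_out.
        self.skip = init_linear(rng, d_in, d_out)

    def forward(self, x):
        h = [gelu(v) for v in linear(x, self.fc1)]
        h = [gelu(v) for v in linear(h, self.fc2)]
        main = linear(h, self.out)
        residual = linear(x, self.skip)
        return [a + b for a, b in zip(main, residual)]
```

Instantiating with small widths, for example `Projector(d_in=8, d_hidden=16, d_out=4)`, is enough to check the wiring; the full-size default builds a few million weights in Python lists and is slow.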
Training ran on a single A100 over four days. The held-out validation loss bottomed out at 0.34 cosine distance, which sounds tight on paper but corresponds to "right vibe, sometimes wrong tempo" in subjective listening.
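For reference, a small sketch of the metric, assuming the usual definition of cosine distance as one minus cosine similarity (the post does not spell the definition out). Under that definition a distance of 0.34 corresponds to a cosine similarity of 0.66 between the projected vector and the target audio embedding.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cos(a, b): 0 for same direction, 1 for orthogonal, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```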
The projected vector hits an HNSW index built over 40,000 catalog
tracks. Every track is represented by a single CLAP vector computed
against its 30-second preview, the same window the player uses. The
index is built with M=32, efConstruction=200, served by hnswlib
inside the model server. Query-time efSearch is 64.
The top 200 nearest neighbours come back inside 8 ms on a single CPU core. We over-fetch on purpose because the next stage prunes hard.
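An approximate index is usually sanity-checked against an exact search. The sketch below is that exact brute-force reference plus a recall measure, not the hnswlib index itself; the function names and the dictionary-shaped catalog are assumptions for illustration, and vectors are assumed to be unit-normalised so that the dot product equals cosine similarity.

```python
import heapq


def exact_top_k(query: list[float], catalog: dict[str, list[float]], k: int = 200) -> list[str]:
    """Return the k track IDs with the highest dot product against the query."""
    scored = (
        (sum(q * v for q, v in zip(query, vec)), track_id)
        for track_id, vec in catalog.items()
    )
    return [track_id for _, track_id in heapq.nlargest(k, scored)]


def recall_at_k(approx_ids: list[str], exact_ids: list[str]) -> float:
    """Fraction of the exact neighbours that the approximate search also returned."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact.intersection(approx_ids)) / len(exact)
```

Comparing the index's 200 results against `exact_top_k` on a sample of queries gives a direct read on how much the approximation costs before the re-ranker runs.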
The re-ranker scores the 200 candidates on three signals beyond visual similarity: tempo, editorial tags, and key.
The final 30 are sorted by a weighted sum: 0.6 visual similarity, 0.25 tempo, 0.10 editorial tags, 0.05 key.
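The weighted sum can be sketched as follows. The weights come from the text; the `Candidate` fields and the assumption that each signal is already scored in the range 0 to 1 are illustrative, since the post does not describe how the individual tempo, tag, and key scores are computed.

```python
from dataclasses import dataclass

# Weights as stated in the text; they sum to 1.0.
WEIGHTS = {"visual": 0.60, "tempo": 0.25, "tags": 0.10, "key": 0.05}


@dataclass(frozen=True)
class Candidate:
    track_id: str
    visual: float  # visual similarity from the ANN stage (assumed 0..1)
    tempo: float   # tempo match score (assumed 0..1)
    tags: float    # editorial tag match score (assumed 0..1)
    key: float     # key match score (assumed 0..1)


def score(c: Candidate) -> float:
    """Weighted sum of the four signals."""
    return (
        WEIGHTS["visual"] * c.visual
        + WEIGHTS["tempo"] * c.tempo
        + WEIGHTS["tags"] * c.tags
        + WEIGHTS["key"] * c.key
    )


def rerank(candidates: list[Candidate], keep: int = 30) -> list[Candidate]:
    """Sort by descending score and keep the top `keep` tracks."""
    return sorted(candidates, key=score, reverse=True)[:keep]
```

A track with visual similarity 0.7 and perfect tempo, tag, and key matches scores 0.82 and outranks a track with visual similarity 0.9 and a tempo score of 0.2 (0.59), which is how the secondary signals can reorder the visual ranking.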
This is the part the marketing copy elsewhere will not say. The projector is trained, not zero-shot, and on our internal eval (200 held-out clips, three human raters per clip) the top-3 results contain the right vibe about 71% of the time. Top-1 is closer to 38%. The model is good at "this is a tense night-time montage; show me dark synth music"; it is worse at "this is a lighthearted product reveal; show me something quirky but not grating." Quirky-but-not-grating is a hard target for any audio embedding.
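The top-k figures can be computed as a simple hit rate. The sketch below assumes the three raters' votes have already been reduced to one boolean per candidate; how that reduction is done is not stated in the post. On 200 clips, 71% top-3 corresponds to roughly 142 clips with at least one acceptable track in the first three results.

```python
def top_k_hit_rate(judgements: list[list[bool]], k: int) -> float:
    """Fraction of clips with at least one 'right vibe' track in the top k.

    judgements[i] holds the per-candidate verdicts for clip i, in rank order.
    """
    if not judgements:
        raise ValueError("need at least one evaluated clip")
    hits = sum(1 for ranked in judgements if any(ranked[:k]))
    return hits / len(judgements)
```

For four toy clips judged `[T, F, F]`, `[F, F, T]`, `[F, F, F]`, and `[F, T, F]`, the top-1 rate is 0.25 and the top-3 rate is 0.75.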
We log every "play" and "save" event from the candidate list back into the training data, gated on user consent, and re-train the projector quarterly. The eval number has moved from 64% top-3 to 71% top-3 across the last three retrains.
Try the route and you'll see the 30 candidates ranked, with the BPM and key on every row.