Phase 3 · Voice lab

Train the voice.

Build training datasets from your vocal stems. Register checkpoints you trained on a GPU pod. Render new performances in your voice. Score how close you got.

Training happens off-platform — see ml/voice/README.md for the RunPod recipe.

Phase 3 · step 1

Dataset builder

Selects vocal stems by quality, slices them, exports a training manifest.

Source tags (all must match)
Min vocal quality ≥ 60
Segment length (seconds)
Overlap fraction (0–0.9)
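
The selection and slicing described above can be sketched roughly as follows. This is a minimal illustration, not the builder's actual code: the `Stem` record, the function names, the manifest fields, and the choice to drop a trailing remainder shorter than one segment are all assumptions. It also assumes the overlap setting is a fraction of the segment length, so the hop between segment starts is `segment_seconds * (1 - overlap)`.

```python
import json
from dataclasses import dataclass


@dataclass
class Stem:
    """Hypothetical record for one vocal stem; field names are assumptions."""
    path: str
    duration: float   # length in seconds
    quality: float    # vocal quality score, assumed to be on a 0-100 scale
    tags: frozenset   # source tags attached to the stem


def select_stems(stems, required_tags, min_quality=60):
    """Keep stems that carry every required tag and meet the quality floor."""
    required = frozenset(required_tags)
    return [s for s in stems if required <= s.tags and s.quality >= min_quality]


def slice_windows(duration, segment_seconds, overlap):
    """Return (start, end) pairs of fixed-length windows within the stem."""
    if segment_seconds <= 0:
        raise ValueError("segment_seconds must be positive")
    if not 0 <= overlap <= 0.9:
        raise ValueError("overlap must be between 0 and 0.9")
    hop = segment_seconds * (1 - overlap)
    windows = []
    i = 0
    # Assumption: a trailing remainder shorter than one segment is dropped.
    while i * hop + segment_seconds <= duration + 1e-9:
        start = i * hop
        windows.append((round(start, 3), round(start + segment_seconds, 3)))
        i += 1
    return windows


def build_manifest(stems, required_tags, min_quality, segment_seconds, overlap):
    """Produce one manifest row per segment of every selected stem."""
    rows = []
    for stem in select_stems(stems, required_tags, min_quality):
        for start, end in slice_windows(stem.duration, segment_seconds, overlap):
            rows.append({"path": stem.path, "start": start, "end": end})
    return rows


if __name__ == "__main__":
    demo = [Stem("a.wav", 8.0, 72, frozenset({"lead", "dry"}))]
    print(json.dumps(build_manifest(demo, {"lead", "dry"}, 60, 4, 0.5), indent=2))
```

With a 4-second segment and an overlap of 0.5, consecutive segments start 2 seconds apart, so higher overlap yields more (and more redundant) training segments from the same audio.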

Phase 3 · step 2

Off-platform training

Download a dataset manifest, ship it to a GPU pod, run RVC training, bring the .pth + .index files back. Detailed steps in ml/voice/README.md.

  1. RunPod RTX 3090 (~$0.30/hr) — clone the RVC repo, install requirements.
  2. Sync your storage/datasets/{id} folder onto the pod.
  3. Process → extract features (RMVPE) → one-click train.
  4. Train for 200–400 epochs when you have about 30 min of clean vocal data.
  5. scp the .pth + .index back, upload below.

Phase 3 · step 3

Checkpoint registry

Upload the .pth (and optional .index) you trained on the GPU pod. See ml/voice/README.md.

.pth weights (required)
.index (optional, RVC)

Phase 3 · step 4

Render console

Convert a guide vocal / hum / acapella into the trained voice. Requires a working inference backend — set SM_VOICE_BACKEND in .env.

Input audio
Checkpoint
Transpose +0 semitones
Dryness 0.75
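
For reference, a transpose value in semitones corresponds to a frequency ratio under 12-tone equal temperament. The sketch below shows that arithmetic only; the function names are hypothetical and it does not describe how the inference backend applies the setting internally.

```python
def transpose_ratio(semitones: float) -> float:
    """Frequency ratio for a shift of `semitones` in 12-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)


def transposed_hz(f0_hz: float, semitones: float) -> float:
    """Pitch in Hz after shifting `f0_hz` by the given number of semitones."""
    return f0_hz * transpose_ratio(semitones)
```

A shift of +12 semitones doubles the pitch (one octave up), -12 halves it, and +0 leaves it unchanged.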

Phase 3 · scoring

Voice similarity

Compare two clips, e.g. a real reference vocal vs. a rendered one. The score is the cosine similarity between the Resemblyzer speaker embeddings of the two clips. A score of ≥ 0.75 indicates a strong match.
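
The scoring step can be illustrated with a small sketch of cosine similarity between two embedding vectors. It assumes the embeddings have already been computed (for example with Resemblyzer's `VoiceEncoder.embed_utterance`) and are passed in as plain sequences of floats; the function names and the `strongly_matches` helper are hypothetical, and only the 0.75 threshold comes from the text above.

```python
import math
from typing import Sequence

STRONG_MATCH_THRESHOLD = 0.75  # threshold quoted in the section text


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def strongly_matches(a: Sequence[float], b: Sequence[float]) -> bool:
    """True when the similarity meets the strong-match threshold."""
    return cosine_similarity(a, b) >= STRONG_MATCH_THRESHOLD
```

Because cosine similarity depends only on the direction of the vectors, two clips at different loudness levels can still score close to 1.0 if their embeddings point the same way.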

Clip A (e.g. real reference)
Clip B (e.g. rendered)