
Yuvraj Sanghai

I build the systems behind the intelligence.

Explore My Work
// scroll to explore
2+ years building AI infrastructure · 4 production systems shipped · systems thinking
// experience

The Pipeline

// projects

The Engine Room

Orpheus TTS

A language model where the 'language' is sound

Llama 3.2 3B · LoRA · Unsloth · SNAC Codec · llama.cpp · GGUF · FastAPI · Lightning AI

2025

Why train a language model to output sound?

Orpheus TTS treats audio generation as a language modeling problem. Audio is encoded into discrete SNAC tokens, and the LLM learns to predict token sequences that decode back into speech. The 'language' the model speaks is sound — same transformer architecture, completely different modality.

Why LoRA instead of full fine-tuning?

Fully fine-tuning a 3B-parameter model requires significant GPU memory and risks catastrophic forgetting. LoRA trains only low-rank adapter matrices (a small fraction of the parameters), making it possible to fine-tune on a single GPU in 19 minutes while preserving the base model's learned representations.
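The parameter savings are easy to sanity-check in pure Python. This sketch assumes Llama 3.2 3B's hidden size of 3072 and a LoRA rank of 16 (the rank actually used here isn't stated above), for a single square projection matrix:

```python
# Rough LoRA parameter count for one square projection matrix.
# d = 3072 is Llama 3.2 3B's hidden size; r = 16 is a typical
# LoRA rank and an assumption, not the project's stated config.
d = 3072
r = 16

full_params = d * d          # weights touched by full fine-tuning
lora_params = r * d + d * r  # low-rank factors A (r x d) and B (d x r)

fraction = lora_params / full_params
print(f"LoRA trains {fraction:.2%} of this matrix's parameters")
```

At rank 16 the adapter is about 1% of that matrix's weights, which is why a single GPU (and 19 minutes) suffices.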

Why GGUF quantization?

The full model is 6.3GB. GGUF Q4_K_M quantization compresses it to ~2GB with minimal quality loss, enabling inference on consumer hardware. Combined with llama.cpp's optimized C++ runtime, this made CPU inference viable (slow, but functional) and GPU inference fast.
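The compression arithmetic can be sketched in a few lines. The ~4.85 effective bits per weight for Q4_K_M is an approximation of llama.cpp's mixed 4/6-bit block format, and 3.21B is the published parameter count for Llama 3.2 3B:

```python
# Back-of-envelope GGUF size estimate.
params = 3.21e9  # Llama 3.2 3B published parameter count

fp16_gb = params * 16 / 8 / 1e9      # 2 bytes per weight
q4_k_m_gb = params * 4.85 / 8 / 1e9  # ~4.85 effective bits per weight

print(f"fp16: {fp16_gb:.1f} GB, Q4_K_M: {q4_k_m_gb:.1f} GB")
```

That lands close to the observed 6.3GB full model and ~2GB quantized file.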

Why the SNAC codec specifically?

SNAC produces hierarchical discrete tokens from audio — multiple codebook levels at different resolutions. Each token position requires subtracting a different codebook offset (0, 4096, 8192, 12288, 16384, 20480, 24576). Without this offset correction, the model generates perfect silence. This was the hardest bug to find.
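The fix amounts to subtracting a position-dependent offset from every generated token before handing the stream to the SNAC decoder. A minimal pure-Python sketch, assuming the 7-slot frame layout implied by the offsets above (the function name is illustrative, not the project's actual code):

```python
CODEBOOK_SIZE = 4096
TOKENS_PER_FRAME = 7  # offsets 0, 4096, ..., 24576 imply 7 slots per frame

def remove_offsets(tokens):
    """Map each generated token back into the 0..4095 codebook range."""
    return [t - (i % TOKENS_PER_FRAME) * CODEBOOK_SIZE
            for i, t in enumerate(tokens)]

# Example: one frame of raw codes, shifted the way the model emits them.
raw = [5, 100, 7, 42, 9, 11, 13]
shifted = [c + i * CODEBOOK_SIZE for i, c in enumerate(raw)]
assert remove_offsets(shifted) == raw  # skip this step and the codes are garbage
```

Feed the shifted values straight to the decoder and every code lands outside its codebook, which is why the symptom was perfect silence rather than a crash.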

Why move from CPU to GPU?

CPU inference on a 3B model is a memory bandwidth problem, not a compute problem. Even with AVX2+FMA optimizations and GGUF quantization, CPU topped out at ~7.5 tokens/sec (~95 seconds per sentence). The T4 GPU on Lightning AI hit ~65 tokens/sec, nearly 9x the raw throughput; end to end, generation for short prompts dropped from ~95 seconds to about 3, roughly 30x faster.

Architecture

Elise Dataset (~1,200 pairs) → Audio Input → SNAC Codec → Llama 3.2 3B (Orpheus) → LoRA Adapters (16-bit) → Adapter Merge → GGUF Q4_K_M Export → llama.cpp Inference → T4 GPU (Lightning AI) → FastAPI Backend → Audio Output

// audio sample

Generated by fine-tuned Orpheus TTS

// skills

The Stack

Not a checklist. A system.

Languages
Frameworks
Databases
Concepts

// lab

The Lab

Where I think out loud about systems, infrastructure, and AI.

// first post · ~12 min read

I trained a language model to speak — here's what broke

A walk through fine-tuning Orpheus TTS on a single GPU: SNAC codec token offsets, why CPU inference is a memory bandwidth problem, and how 30× faster inference on a T4 changes what's worth shipping.

TTS · Fine-tuning · llama.cpp · Voice AI

// writing queue

Why I migrated from ChromaDB to Qdrant — and what broke

Hybrid dense-sparse search, production migration pain points

RAG · Vector DBs · VoiceraCX · Coming Soon

Designing a semantic cache that actually works

Similarity thresholds, cache invalidation, and a 60% reduction in API calls

Caching · Embeddings · Latency · Coming Soon

Building for offline-first in Tier-2/3 India

Why CRDT sync beats PostgreSQL replication for schools with spotty connectivity

Architecture · Sync · Smart Pathshala · Coming Soon
Notify me when published →

// signal

Signal Strength

Competition · Academics · Research · Building · Infra
3 hackathon wins
99.35% MHT-CET percentile
1 published paper
4+ projects shipped

// achievements

CompetitionWinner — Devclash 2025, DY Patil Pimpri (ShetNiyojan)
CompetitionWinner — Synapse 2.0, MKSSS CCOEW (Legify)
CompetitionWinner — L&T NeuroHack, COEP Mindspark 24 (WarCast)
AcademicsMHT-CET 2022 — 99.35 %ile, Rank 971 / 400,000+
AcademicsB.E. Computer Engineering, Honors in Data Science — PICT Pune
ResearchCo-authored paper on evidence-verified answer extraction for automated exam grading
Open SourcePublished fine-tuned TTS model + LoRA adapters on Hugging Face (yuv008)

// connect

The Handshake

GitHub · LinkedIn · Email · HuggingFace

// currently open to opportunities

Send a message →