I Built an AI That Turns Your Reading List Into a Music Track — Here's Everything I Learned

I wrote this piece as my entry for the Google Live Agent Challenge hackathon. #GeminiLiveAgentChallenge

I have a problem with information. I consume a lot of it — articles, YouTube deep-dives, research papers, Twitter threads — and most of it evaporates within 48 hours. Not because the content wasn't good. Because I read it passively, closed the tab, and moved on.

Music is different. A song you heard once in 2015 can reconstruct an entire afternoon in detail. The melody is a carrier signal for memory.

So during the Google Live Agent Challenge, I built Echo: an AI that takes whatever you've been reading and turns it into a personalized music track — generated album art, an original instrumental, and AI-written verses that distill what you learned. Your reading list becomes a song.

Here's exactly how I built it, what broke, and what I wish I'd known on day one.

The Architecture: A 3-Agent Pipeline Driven by Voice


The entry point is a Gemini Live voice session. You open /live and Echo greets you by voice. It asks for three things: what you're learning, what genre you're in the mood for, and which URLs it should read. Once it has all three, it calls a trigger_generation() function tool, and the pipeline fires.

This is the core "Live Agent" pattern the hackathon is built around. The conversation is the agent, and the tool call is the handoff to the worker pipeline.
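For context, a function tool in Gemini Live is declared as part of the session setup. A minimal sketch of what Echo's declaration could look like — the field names (goal, genre, urls) are my own guesses, not Echo's actual schema:

```javascript
// Hypothetical tool declaration for the Live session setup.
// Parameter names are illustrative; Echo's real schema may differ.
const triggerGenerationTool = {
  functionDeclarations: [{
    name: 'trigger_generation',
    description: 'Start the 3-agent pipeline once goal, genre, and URLs are confirmed.',
    parameters: {
      type: 'OBJECT',
      properties: {
        goal:  { type: 'STRING', description: 'What the user is learning' },
        genre: { type: 'STRING', description: 'Requested music genre' },
        urls:  { type: 'ARRAY', items: { type: 'STRING' }, description: 'Reading-list URLs' }
      },
      required: ['goal', 'genre', 'urls']
    }
  }]
};
```

When the model decides all three parameters are filled, it emits a tool call with those arguments, and the server hands them off to the pipeline.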

Building the Voice Layer (Gemini Live)

Gemini Live uses a bidirectional WebSocket protocol. The browser captures 16kHz PCM via an AudioWorklet, sends it to my Node.js server as binary chunks, and the server forwards it to Gemini's BidiGenerateContent endpoint. Audio responses come back as 24kHz PCM and are queued for playback.
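One small but essential piece of that capture path: the AudioWorklet hands over Float32 samples in the range [-1, 1], which have to be converted to 16-bit signed PCM before going over the wire. A minimal sketch of that conversion (my own helper, not Echo's actual code):

```javascript
// Convert Float32 samples from an AudioWorklet into 16-bit signed PCM.
// Assumes input values are nominally in [-1, 1]; out-of-range values are clamped.
function floatTo16BitPCM(float32) {
  const pcm = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i]));  // clamp to valid range
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;         // scale to signed 16-bit
  }
  return pcm;
}
```

The resulting Int16Array's underlying buffer is what gets sent to the server as a binary WebSocket frame.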

The hardest part wasn't the audio — it was making the tool call reliable.

Gemini is conversational. If you give it a system prompt that says "call trigger_generation() when you have goal, genre, and links", it will sometimes call the tool too early (after hearing a genre but no URLs) or loop on redundant clarifying questions. The fix was a strict step-gate system prompt:

STEP 1: Ask for learning goal. Wait.
STEP 2: Ask for genre. Wait.
STEP 3: Ask for URLs. Wait.
STEP 4: Read back all three and ask "Ready to generate?"
STEP 5: ONLY call trigger_generation() after yes.
NEVER skip steps. NEVER assume values.

Five steps felt over-engineered, but it completely eliminated premature tool calls.

The Musical DNA System

The second agent — the Creative Director — is where the creative magic happens. It takes the content analysis and produces two things: learning verses (16 lines of original writing that distill what you learned) and a Musical DNA object that drives Lyria:

{
  "bpm": 92,
  "mood": "Focused",
  "key": "A Minor",
  "density": 0.6,
  "brightness": 0.4,
  "music_direction": "Warm Rhodes piano, subtle bass, 92 BPM, late-night jazz feel"
}

The insight here: Lyria responds to parameters more than it responds to text. A jazz track about LLM attention mechanisms feels completely different from a lo-fi track about the same topic — not because of the genre label, but because of the BPM, key, and density values.

I got Gemini 2.5 Pro to reason about these values explicitly: "given that this content is about complex technical architecture and the user's mood is 'focused late-night study', what BPM and brightness values fit?" This produces noticeably better musical results than passing a bare genre string.
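As a sketch of how that request could be framed: the generateContent API supports JSON-mode output via generationConfig.responseMimeType, so the DNA comes back as a parseable object. The helper name and prompt text below are illustrative, not Echo's actual code:

```javascript
// Hypothetical request builder for the Creative Director step.
// Assumes the generateContent REST API with JSON-mode output.
function buildDnaRequest(analysis, mood) {
  return {
    contents: [{ role: 'user', parts: [{ text:
      `Content analysis: ${analysis}\n` +
      `User mood: ${mood}\n` +
      'Reason about which BPM, key, density, and brightness values fit, then ' +
      'return a Musical DNA JSON object with those fields plus music_direction.'
    }] }],
    generationConfig: { responseMimeType: 'application/json' }  // forces parseable JSON output
  };
}
```

JSON mode matters here: the DNA object is consumed programmatically by the Lyria stage, so a free-text answer would need brittle parsing.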


The Hardest Part: Lyria's Undocumented WebSocket Protocol

Lyria RealTime (lyria-realtime-exp) is experimental. The documentation exists but leaves significant gaps. Here's what I had to figure out from trial and error:

The setup message must include the model

ws.send(JSON.stringify({
  setup: {
    model: 'models/lyria-realtime-exp'  // ← REQUIRED. Missing this = silent rejection
  }
}));

Omitting model from setup causes the connection to silently close with code 1000. No error. No message. Just gone.

Prompts are wrapped, not plain text

After setup completes (setupComplete message), you send prompts like this:

ws.send(JSON.stringify({
  clientContent: {
    weightedPrompts: [{ text: "jazz, focused, warm piano", weight: 1.0 }]
  }
}));

Plain strings don't work. The weightedPrompts wrapper is required.

Music config is a separate message

ws.send(JSON.stringify({
  musicGenerationConfig: {
    bpm: 92,
    scale: "SCALE_A_MINOR",
    density: 0.6,
    brightness: 0.4,
    temperature: 1.0
  }
}));

This goes after the prompt message, before the PLAY control. Order matters.
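Putting it together, the startup sequence I converged on looks roughly like this. Note that 'playbackControl' is my reading of the experimental protocol; the exact field name may differ:

```javascript
// Hedged sketch of the full Lyria startup order after setupComplete.
// Each message is JSON.stringify'd and sent with ws.send() in this exact order.
function lyriaStartupMessages(dna) {
  return [
    { clientContent: { weightedPrompts: [{ text: dna.music_direction, weight: 1.0 }] } }, // 1. prompts
    { musicGenerationConfig: { bpm: dna.bpm, scale: dna.scale,
                               density: dna.density, brightness: dna.brightness } },      // 2. config
    { playbackControl: 'PLAY' }                                                           // 3. start streaming
  ].map((m) => JSON.stringify(m));
}
```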

Audio is base64-encoded raw PCM

Lyria sends back chunks like this:

serverContent.audioChunks[].data  // base64 PCM

Sample rate: 48,000 Hz. Channels: 2 (stereo). Bit depth: 16-bit signed. I accumulate chunks until I hit ~11.5 MB (60 seconds of audio in that format), then build the WAV header manually and upload to Cloud Storage.
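The accumulation loop is simple once you know the message shape. A sketch, assuming the serverContent.audioChunks[] structure above (handler and variable names are mine):

```javascript
// Accumulate base64-decoded PCM chunks until we have ~60 seconds of audio.
const TARGET_BYTES = 48000 * 2 * 2 * 60;  // 48 kHz * 2 channels * 2 bytes * 60 s ≈ 11.5 MB
let chunks = [];
let total = 0;

function onLyriaMessage(msg) {
  for (const chunk of msg.serverContent?.audioChunks ?? []) {
    const pcm = Buffer.from(chunk.data, 'base64');  // decode base64 → raw PCM bytes
    chunks.push(pcm);
    total += pcm.length;
  }
  if (total >= TARGET_BYTES) {
    const allPcm = Buffer.concat(chunks);
    // ...prepend the 44-byte WAV header and upload to Cloud Storage...
  }
}
```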

Building the WAV header is straightforward — it's a 44-byte RIFF structure — but it took me embarrassingly long to realize the chunk size values in the header need to be correct before streaming. I buffer all PCM first, then prepend the header:

function buildWavBuffer(pcmBuffer, sampleRate = 48000, channels = 2, bitDepth = 16) {
  const byteRate = (sampleRate * channels * bitDepth) / 8;  // = 192,000
  const blockAlign = (channels * bitDepth) / 8;             // = 4
  const header = Buffer.alloc(44);
  header.write('RIFF', 0);
  header.writeUInt32LE(36 + pcmBuffer.length, 4);  // file size - 8
  header.write('WAVEfmt ', 8);                     // 'WAVE' format tag + 'fmt ' chunk id
  header.writeUInt32LE(16, 16);                    // fmt chunk size (16 for PCM)
  header.writeUInt16LE(1, 20);                     // audio format: 1 = uncompressed PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitDepth, 34);
  header.write('data', 36);
  header.writeUInt32LE(pcmBuffer.length, 40);      // data chunk size
  return Buffer.concat([header, pcmBuffer]);
}

The Pivot: Instrumental Music + Learning Verses

About halfway through, I hit a product problem. Lyria generates instrumental music — no vocals, no singing. But I'd designed a "karaoke-style" lyric sync where verses highlighted as the music played.

The result was jarring: text was animating on screen while the music had no vocals. It looked like something was broken.

The fix was a framing pivot. Instead of positioning Echo as "AI karaoke", I reframed the verses as learning companions — text you read while you listen. The subtitle on the result page changed to: "Written by Gemini 2.5 Pro · Read along while Lyria plays."

This is actually a stronger product story. It's a multimedia learning digest: the instrumental sets your mental state, and the verses encode what you learned into language your brain can hold. Spotify + study notes in one.

The lesson: product framing is as important as the technology. A "bug" (music has no vocals) became a feature (ambient listening + reading) with a single reframe.


Real-Time Progress with SSE (Not Polling)

Generation takes 60–90 seconds. The browser needs to show which agent is running without the user wondering if it crashed.

I almost reached for WebSockets, but Server-Sent Events are so much simpler for this use case — one-directional, built into the browser, and reconnects automatically:

// Server
app.get('/api/pipeline-status/:chatId', (req, res) => {
  const { chatId } = req.params;
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.flushHeaders();

  const send = (event) => res.write(`data: ${JSON.stringify(event)}\n\n`);
  pipelineEvents.on(`progress:${chatId}`, send);
  req.on('close', () => pipelineEvents.off(`progress:${chatId}`, send));
});

// Browser
const evtSource = new EventSource(`/api/pipeline-status/${chatId}`);
evtSource.onmessage = (e) => updateProgressUI(JSON.parse(e.data));

One edge case: if the user refreshes mid-generation, they rejoin with no history. Fix: an in-memory pipelineState Map stores the last event per chatId, and new SSE connections replay it immediately.


The Result

A user opens /live, has a 30-second voice conversation with Echo, watches a real-time 3-agent progress stepper for 60 seconds, then lands on a result page with:

  • An Imagen 4 album cover styled around their content's themes

  • A Lyria instrumental composed with their exact BPM, mood, and key

  • AI-written verses from Gemini 2.5 Pro that distill what they read

  • Musical DNA showing BPM, mood, and key — proof that Lyria was steered precisely

Eight Google APIs chained together from a single voice conversation. That's the "Live Agent" pattern: voice → intent → structured data → parallel AI generation → result.

Demo page: click the orb and the voice assistant will greet you.

Song cover from Imagen 4, lyrics from Gemini 2.5 Pro.

Lyrics animating in sync with the music.

What I'd Do Differently

Start with Lyria earlier. I underestimated how experimental it is. The WebSocket protocol investigation cost a full day.

Design for instrumental from the start. The "learning verses" framing is better than karaoke anyway — but I should have started with it rather than arriving at it as a pivot.


What's Next: Live Lyria Steering

The feature I didn't have time to build: keeping the Lyria WebSocket open during the Gemini Live voice session and relaying voice commands as real-time parameter updates.

"Make it more upbeat" → setWeights({ brightness: 0.9, bpm: 120 })
"Go darker" → setWeights({ brightness: 0.2, key: "SCALE_D_MINOR" })

This would be the definitive Live Agent demo: AI music that responds to your voice in real time. It's technically achievable with the current stack; it just needs time.


Try Echo

  • Live app: https://echo-535359416008.us-central1.run.app

  • Voice interface: https://echo-535359416008.us-central1.run.app/live

  • Web form (no mic): https://echo-535359416008.us-central1.run.app/demo

  • GitHub: (your repo URL)

Built with Gemini Live, Gemini 2.5 Pro, Imagen 4, Lyria RealTime, Cloud Firestore, Cloud Storage, and Cloud Run.

Created for the Google Live Agent Challenge 2026. #GeminiLiveAgentChallenge

SzeHao's Sharing
