Building an On-Device Voice Assistant with Microsoft Foundry Local

Many tutorials on “voice AI” assume that your audio is processed off-device. For example, you might send a WAV file to the Whisper API, pass the transcript to GPT-4, and then synthesize the reply. This works, but it takes three separate network round trips, incurs a cost at each step, and means the user’s voice may be logged at several points along the way.

However, a new generation of compact, hardware-optimised models is shifting this balance. Take NVIDIA’s Nemotron Speech Streaming En 0.6B, a 600M-parameter streaming ASR model that’s now available in the Microsoft Foundry Local catalogue. Paired with a small chat model like qwen2.5-0.5b or phi-4-mini, it lets you run the entire process (capturing, transcribing, reasoning, and responding) directly on your developer laptop. This setup requires no API keys and sends no audio over the network.

This article will guide you through how the fl-nemotron sample achieves this, highlight the issues we encountered, and explain the design choices that ensured a reliable pipeline.

The sample is a browser-based assistant served by FastAPI at http://127.0.0.1:8000. It captures microphone audio, sends it to /api/transcribe, and streams the chat reply back using Server-Sent Events from /api/chat. All inference runs locally through two Foundry Local models loaded in the same Python process.



Here’s how the pipeline works:

Microphone (browser MediaRecorder)
   │  WebM/Opus blob
   ▼
Client-side WAV encoder (16 kHz, mono, PCM-16)
   │  multipart/form-data
   ▼
FastAPI /api/transcribe
   │
   ▼
Nemotron Speech Streaming En 0.6B  (Foundry Local audio client)
   │  transcript text
   ▼
Chat LLM e.g. qwen2.5-0.5b         (Foundry Local chat client)
   │  streamed tokens
   ▼
FastAPI /api/chat → SSE → browser bubble

Before diving into the code, there’s something crucial to note:

The Nemotron Speech Streaming model is exclusively available in the Foundry Local 1.1.x catalogue. Older SDKs (0.5.x/0.6.x) can’t access it and will return a model not found error.

Also, the module name changed in version 1.1.0: it’s now foundry_local_sdk (with the _sdk suffix) instead of foundry_local. The foundry-local-core runtime is bundled with the package, so you don’t need a separate MSI or winget installation.

It’s essential to install the required package:

pip install --upgrade "foundry-local-sdk>=1.1.0,<2"

And confirm your installation:

python -c "import importlib.metadata as m; print('sdk', m.version('foundry-local-sdk'))"
# expected output: sdk 1.1.0 (or a later 1.x release)
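
If you would rather fail fast at startup than hit a model-not-found error later, a small guard can compare the installed version against the 1.1.0 minimum. This is a minimal sketch: parse_version, meets_minimum, and installed_sdk_version are hypothetical helper names (not part of the SDK), and the parser only looks at the leading digits of each version component.

```python
import importlib.metadata as metadata
import re

MINIMUM = (1, 1, 0)  # first SDK line that can see the Nemotron model, per the note above


def parse_version(text):
    """Turn '1.1.0' (or '1.1.0rc1') into a 3-tuple of the leading integers."""
    numbers = []
    for part in text.split(".")[:3]:
        match = re.match(r"\d+", part)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def meets_minimum(version_text, minimum=MINIMUM):
    """True when the version is at least the required minimum."""
    return parse_version(version_text) >= minimum


def installed_sdk_version(dist_name="foundry-local-sdk"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
```

At startup you could call installed_sdk_version(), pass the result to meets_minimum(), and exit with a clear message when the check fails.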

The 1.1.x SDK introduces a single FoundryLocalManager that manages the runtime. For each loaded model, you receive an OpenAI-compatible client—get_chat_client() for text models and get_audio_client() for ASR. There’s no need to bring your own openai Python package since the SDK includes a lightweight client.

The wrapper used in the repository (src/foundry_client.py) does the following:

from foundry_local_sdk import Configuration, FoundryLocalManager

FoundryLocalManager.initialize(Configuration(app_name="fl-nemotron"))
manager = FoundryLocalManager.instance

chat_model = manager.load_model("qwen2.5-0.5b")
stt_model = manager.load_model("nemotron-speech-streaming-en-0.6b")

chat_client = chat_model.get_chat_client()
audio_client = stt_model.get_audio_client()

Both models download on first use into the Foundry Local cache and remain available for the lifetime of the process. If you have a laptop with 16GB RAM, the total memory usage stays comfortably under 4GB.


Initially, we tried a straightforward approach:

with open(wav_path, "rb") as f:
    result = audio_client.transcribe(file=f, model="nemotron-speech-streaming-en-0.6b")

This call doesn’t work with the Nemotron model. The bundled ONNX Runtime GenAI in foundry-local-core does not recognize the nemotron_speech multi-modal model type expected by the standard AudioClient.transcribe() method. The error appears as a confusing model-type registration failure deep within the native runtime.

The solution is to use the streaming session API instead—this calls a different native entry point (core_interop.start_audio_stream) that the streaming model does support. You can find this approach detailed in the repository at src/_nemotron_live.py:

import threading
import wave


def transcribe_wav_live(audio_client, wav_path, *, language="en"):
    # Read the WAV header fields and the raw PCM frames
    with wave.open(str(wav_path), "rb") as w:
        sample_rate  = w.getframerate()
        channels     = w.getnchannels()
        sample_width = w.getsampwidth()
        pcm          = w.readframes(w.getnframes())

    session = audio_client.create_live_transcription_session()
    session.settings.sample_rate     = sample_rate
    session.settings.channels        = channels
    session.settings.bits_per_sample = sample_width * 8
    session.settings.language        = language
    session.start()

    # Feed PCM data in ~100 ms chunks using a worker thread
    bytes_per_sec = sample_rate * channels * sample_width
    chunk_bytes   = max(bytes_per_sec // 10, 1024)

    def _pusher():
        try:
            for offset in range(0, len(pcm), chunk_bytes):
                session.append(pcm[offset:offset + chunk_bytes])
        finally:
            session.stop()

    threading.Thread(target=_pusher, daemon=True).start()

    # Collect transcript fragments as the session emits them
    parts = []
    for resp in session.get_stream():
        for cp in getattr(resp, "content", []) or []:
            text = getattr(cp, "text", "") or getattr(cp, "transcript", "") or ""
            if text:
                parts.append(text)
    return " ".join(p.strip() for p in parts if p.strip()).strip()

There are three key things to note:

  • Push from a worker thread while reading on the calling thread: session.append() is a blocking write into the native stream, and session.get_stream() is a blocking generator. Running the writes in a worker thread lets both sides make progress at once, which avoids a deadlock.
  • Chunk data in ~100 ms: Smaller chunks (e.g., 10 ms) take more time traversing the FFI boundary than actually transcribing, while larger chunks (e.g., 1 s) delay the retrieval of partial results, which can increase perceived latency.
  • Ensure you call session.stop(): Neglecting this results in the generator never terminating, leading to a request hang.
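
The three points above can be reproduced without the SDK. The sketch below swaps the real session for a hypothetical FakeSession built on a bounded queue. It is only a stand-in with the same append/stop/get_stream shape, not the Foundry Local API. For 16 kHz mono PCM-16 the chunk size works out to 16000 × 1 × 2 / 10 = 3200 bytes, or 100 ms of audio.

```python
import queue
import threading


def chunk_size_bytes(sample_rate, channels, sample_width, chunks_per_sec=10):
    """Bytes in one chunk (~100 ms by default), with the same 1024-byte floor as above."""
    bytes_per_sec = sample_rate * channels * sample_width
    return max(bytes_per_sec // chunks_per_sec, 1024)


class FakeSession:
    """Stand-in for a live transcription session: a bounded queue plus a stop marker."""

    _STOP = object()

    def __init__(self, capacity=4):
        self._queue = queue.Queue(maxsize=capacity)  # append() blocks when full

    def append(self, chunk):
        self._queue.put(chunk)

    def stop(self):
        self._queue.put(self._STOP)

    def get_stream(self):
        while True:
            item = self._queue.get()  # blocks until data or the stop marker arrives
            if item is self._STOP:
                return
            yield len(item)  # a real session would yield transcript fragments


def push_and_read(pcm, chunk_bytes):
    """Push chunks from a worker thread and read results on the calling thread."""
    session = FakeSession()

    def pusher():
        try:
            for offset in range(0, len(pcm), chunk_bytes):
                session.append(pcm[offset:offset + chunk_bytes])
        finally:
            session.stop()  # without this, get_stream() would never return

    threading.Thread(target=pusher, daemon=True).start()
    return list(session.get_stream())
```

With a queue capacity of 4, calling append() for all ten chunks on the reading thread would block on the fifth chunk, because nothing would be draining the queue. That is the deadlock the worker thread avoids.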

MediaRecorder typically records audio/webm; codecs=opus. That format is efficient, but our speech-to-text (STT) model expects a 16-bit mono PCM WAV file at a specific sample rate. Decoding WebM/Opus server-side would require ffmpeg as a runtime dependency, which is something we’re trying to avoid.

A better approach is to encode the WAV directly on the client. AudioContext.decodeAudioData can handle WebM/Opus, so the page can decode the recording, resample it to 16 kHz, mix it down to mono, and produce a PCM-16 WAV blob in about 30 lines of JavaScript:

// Inside src/static/index.html
async function webmToWav(blob) {
  // Decoding in a 16 kHz context resamples the audio to 16 kHz
  const ctx = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 16000 });
  const buf = await ctx.decodeAudioData(await blob.arrayBuffer());
  await ctx.close(); // release the context so repeated recordings don't accumulate them
  // Mix to mono
  const ch  = buf.numberOfChannels;
  const mono = new Float32Array(buf.length);
  for (let c = 0; c < ch; c++) {
    const data = buf.getChannelData(c);
    for (let i = 0; i < data.length; i++) mono[i] += data[i] / ch;
  }
  return encodeWav(mono, 16000);
}

function encodeWav(samples, sampleRate) {
  const buffer = new ArrayBuffer(44 + samples.length * 2);
  const view = new DataView(buffer);
  // RIFF header
  writeStr(view, 0, "RIFF");
  view.setUint32(4, 36 + samples.length * 2, true);
  writeStr(view, 8, "WAVE");
  // Format chunk
  writeStr(view, 12, "fmt ");
  view.setUint32(16, 16, true);             // PCM chunk size
  view.setUint16(20, 1, true);              // PCM format
  view.setUint16(22, 1, true);              // mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate
  view.setUint16(32, 2, true);              // block align
  view.setUint16(34, 16, true);             // bits per sample
  // Data chunk
  writeStr(view, 36, "data");
  view.setUint32(40, samples.length * 2, true);
  // PCM-16 samples
  let o = 44;
  for (let i = 0; i < samples.length; i++, o += 2) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(o, s < 0 ? s * 0x8000 : s * 0x7FFF, true);
  }
  return new Blob([view], { type: "audio/wav" });
}

// Helper used by encodeWav (not shown in the excerpt): writes ASCII characters as bytes.
function writeStr(view, offset, str) {
  for (let i = 0; i < str.length; i++) view.setUint8(offset + i, str.charCodeAt(i));
}
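
For exercising /api/transcribe from a script rather than a browser, the same 44-byte header can be packed in Python with struct. This is a sketch that mirrors the field order of the JavaScript encoder. The encode_wav name is hypothetical and is not part of the sample repository.

```python
import struct


def encode_wav(samples, sample_rate=16000):
    """Pack floats in [-1, 1] into a mono PCM-16 WAV with a 44-byte header."""
    data_size = len(samples) * 2  # 2 bytes per 16-bit sample
    header = struct.pack(
        "<4sI4s4sIHHIIHH4sI",     # little-endian, 44 bytes in total
        b"RIFF", 36 + data_size, b"WAVE",
        b"fmt ", 16,              # fmt chunk size for PCM
        1,                        # audio format: PCM
        1,                        # channels: mono
        sample_rate,
        sample_rate * 2,          # byte rate = rate * channels * bytes per sample
        2,                        # block align = channels * bytes per sample
        16,                       # bits per sample
        b"data", data_size,
    )
    pcm = bytearray()
    for value in samples:
        s = max(-1.0, min(1.0, value))  # clamp, as the browser encoder does
        pcm += struct.pack("<h", int(s * 0x8000) if s < 0 else int(s * 0x7FFF))
    return header + bytes(pcm)
```

The asymmetric scaling (0x8000 for negative values, 0x7FFF for positive ones) uses the full signed 16-bit range without overflowing at +1.0.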

Now, when the server’s /api/transcribe endpoint receives the audio, it simply writes the bytes to a temporary file and processes it through transcribe_wav_live()—no additional audio decoding libraries needed on the Python side.
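
Because the server trusts the browser to have produced the right format, a cheap check with the standard-library wave module can reject malformed uploads before they reach the model. The sketch below assumes the 16 kHz, mono, 16-bit format the client encoder produces. check_wav is a hypothetical helper, not something taken from the sample.

```python
import io
import wave

# Format the client-side encoder is expected to produce
EXPECTED = {"sample_rate": 16000, "channels": 1, "sample_width": 2}


def check_wav(data, expected=EXPECTED):
    """Return the clip duration in seconds, or raise ValueError if the format is off."""
    try:
        with wave.open(io.BytesIO(data), "rb") as w:
            actual = {
                "sample_rate": w.getframerate(),
                "channels": w.getnchannels(),
                "sample_width": w.getsampwidth(),
            }
            frames = w.getnframes()
    except (wave.Error, EOFError) as exc:
        raise ValueError(f"not a readable WAV upload: {exc}") from exc
    if actual != expected:
        raise ValueError(f"unexpected WAV format: {actual}")
    return frames / actual["sample_rate"]
```

In an endpoint, a ValueError from this check would map naturally onto an HTTP 400 response.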

The server (located in src/app.py) is intentionally lightweight. A critical aspect is that the same process keeps both Foundry Local model handles active for its entire run, eliminating any warm-up cost per request:

@app.post("/api/transcribe")
async def transcribe(audio: UploadFile = File(...)):
    data = await audio.read()
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(data)
        path = f.name
    try:
        text = _ai_client.transcribe(path)
    finally:
        os.remove(path)  # never leave raw audio on disk (requires `import os`)
    return {"text": text}

@app.post("/api/chat")
async def chat(req: ChatRequest):
    if req.stream:
        return StreamingResponse(
            _sse(_ai_client.stream_completion(req.messages)),
            media_type="text/event-stream",
        )
    return {"text": _ai_client.chat_completion(req.messages)}

We use Server-Sent Events for streaming because they work with plain fetch() on the client and a StreamingResponse in FastAPI, with no WebSocket upgrade to negotiate through any proxy that might sit between the client and localhost.
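
The excerpt does not show the _sse helper, so the following is only a sketch of what SSE framing generally looks like, not the sample's actual implementation. Each event is a data: line followed by a blank line. The JSON payload shape ({"token": ...}) and the [DONE] end marker are assumptions chosen for illustration.

```python
import json

DONE = "[DONE]"  # assumed end-of-stream marker; the sample may signal completion differently


def sse_frames(tokens):
    """Wrap each token in a `data:` line followed by the blank line that ends an event."""
    for token in tokens:
        # JSON-encoding escapes newlines inside a token, so they cannot break the framing
        yield f"data: {json.dumps({'token': token})}\n\n"
    yield f"data: {DONE}\n\n"


def parse_sse(body):
    """Recover the tokens from a complete SSE body produced by sse_frames()."""
    tokens = []
    for event in body.split("\n\n"):
        if not event.startswith("data: "):
            continue
        payload = event[len("data: "):]
        if payload == DONE:
            break
        tokens.append(json.loads(payload)["token"])
    return tokens
```

A generator like sse_frames can be handed straight to StreamingResponse with media_type="text/event-stream". The parser shows the matching client-side logic in the same language.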

The repository also includes screenshots showcasing the user interface: a welcome screen with both models loaded, a displayed haiku response, a code block that can be copied, and the microphone recording state.



With small models, this setup runs comfortably on CPU. For instance, we’ve tested it on an Arm64 Surface running the x64 SDK under emulation:

  • The first model load (cold cache) takes tens of seconds, downloading around 600 MB for Nemotron and about 400 MB for qwen2.5-0.5b.
  • Subsequent loads (warm cache) drop to a few seconds per model.
  • End-to-end transcription of a 5-second utterance completes in under a second after warm-up.
  • Getting the first chat token from qwen2.5-0.5b typically takes about 200–500 ms, with full replies arriving within 1–2 seconds.
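
Figures like time-to-first-token are easy to collect on your own hardware by wrapping whatever token iterator the chat client returns. A minimal sketch follows. measure_stream is a hypothetical helper, and the injectable clock parameter is there only to make the function easy to test.

```python
import time


def measure_stream(tokens, clock=time.perf_counter):
    """Consume a token iterator; return (seconds to first token, total seconds, text)."""
    start = clock()
    first = None
    parts = []
    for token in tokens:
        if first is None:
            first = clock() - start  # latency until the first token arrived
        parts.append(token)
    total = clock() - start
    return first, total, "".join(parts)
```

An empty stream yields None for the first-token latency, which distinguishes "no reply" from "instant reply".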

On native x64 hardware with a recent processor, performance improves significantly, and the SDK automatically selects the best available execution provider (CPU / DirectML / CUDA) for each model.

A few limitations are worth keeping in mind:

  • Model quality: qwen2.5-0.5b is quick and compact enough for laptops, but it’s not as capable as GPT-4. If you have enough RAM and need better reasoning, consider switching to phi-4-mini or mistral-nemo-12b-instruct.
  • STT is limited to English: At present, the only Nemotron streaming model available is ...-en-0.6b. Multilingual versions should arrive soon.
  • Browser microphone requires a real browser: Automated browsers (like Playwright or Puppeteer) deny getUserMedia by default. Open the page in Edge, Chrome, or Firefox to grant access and record audio properly.
  • No agent framework yet: This example is a simple one-turn interaction with a chat client. It isn't designed for tool calling, planning, or multi-agent orchestration. Integrating the Microsoft Agent Framework would be a logical extension.

Running everything locally addresses privacy issues related to cloud data transmission, but it doesn't eliminate accountability:

  • Ensure recording is disclosed: The browser prompts for microphone permission; your user interface should clearly indicate when recording is active. The example uses a red button and a "Recording…" banner to convey this.
  • Avoid logging raw audio: The sample application writes audio to a NamedTemporaryFile and deletes it after transcription. Always treat WAV files as sensitive data, even when they never leave the device.
  • Beware of small model hallucinations: While a 0.5B chat model can deliver quick responses, it may not always provide accurate information. For critical answers, consider using retrieval processes or escalate to a larger model for increased accuracy.
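
One way to make the "avoid logging raw audio" rule hard to forget is to wrap the temporary file in a context manager that always deletes it, including when transcription raises. This is a sketch with a hypothetical temp_wav helper. It uses mkstemp and closes the handle before yielding so that the path can be reopened on Windows.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_wav(data):
    """Write uploaded bytes to a temp .wav file and delete it on exit, even on error."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)          # handle is closed before the caller sees the path
        yield path
    finally:
        if os.path.exists(path):
            os.remove(path)        # raw audio never outlives the request
```

Usage would look like `with temp_wav(data) as path: text = transcribe(path)`, where transcribe stands for whichever transcription function you call.
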

To try the sample yourself:

  1. Clone the repository from github.com/leestott/fl-nemotron.
  2. Run ./setup.ps1 (or ./setup.sh) to create a virtual environment and install the SDK.
  3. Execute python scripts/prefetch.py nemotron-speech-streaming-en-0.6b qwen2.5-0.5b to download both models.
  4. Start the application with .venv\Scripts\uvicorn.exe app:app --app-dir src --port 8000 (on macOS or Linux, a virtual environment typically places the executable at .venv/bin/uvicorn instead).
  5. Open http://127.0.0.1:8000 in a browser and click the microphone button.

Key takeaways:

  • Ensure that foundry-local-sdk >= 1.1.0 is installed, as earlier SDKs cannot access the Nemotron Speech Streaming model.
  • Use the LiveAudioTranscriptionSession API for Nemotron instead of AudioClient.transcribe().
  • Encode WAV in the browser to avoid a heavy server-side dependency like ffmpeg.
  • Push audio chunks from a worker thread and read results on the calling thread to prevent deadlocks.
  • A compact Foundry Local chat model combined with Nemotron STT gives you a reliable local voice system within one Python process: no cloud connection, no keys, and no data egress.
