Building a hands-free voice concierge with Microsoft Foundry Voice Live and a Hosted Agent

This article provides an accessible guide on connecting your browser’s microphone to Azure AI Speech Voice Live. It details how to bind a real-time session to a Foundry hosted agent, allowing the agent to respond to travel-related queries using tool calls. You can find the complete source code, infrastructure details, and labs at the repository link given at the end.

Building effective voice user interfaces has been quite a challenge. Traditionally, tasks like streaming audio, generating partial transcripts, handling interruptions, detecting voice activity, dispatching tools, and playing audio required combining several services. However, by integrating Voice Live with a Foundry hosted agent, we can simplify this process into a single real-time WebSocket session with just one binding field.

  • Voice Live manages the complete audio loop, including speech-to-text conversion, neural text-to-speech, semantic turn detection, noise reduction, and echo cancellation.
  • The Foundry hosted agent is responsible for logic, memory management, model selection, evaluators, and making tool calls.
  • Communication between the two occurs via one query parameter in the WebSocket URL.

Practically, this means the browser never has to deal with a model API key, nor does it instantiate any tools or manage the agent prompt. Instead, the browser focuses on capturing audio and playing it back, while everything else is handled on the server side.

This example is named Contoso Travel Concierge. Users, often busy during their journey, might want to ask questions like:

  • What’s the weather like in Tokyo this weekend?
  • Is flight BA005 from Heathrow on time?
  • What time is check-in at the Marriott Marquis?

Each query triggers a tool call on the hosted agent, which returns a concise answer that Voice Live speaks back to the user in under a second over a single persistent connection.

There are four main components in this setup, three of which are Azure services. Only the broker requires your custom code.

  1. Browser client – Records PCM16 audio at 24 kHz and streams it to the broker via WebSocket. It also plays back audio segments forwarded from Voice Live.
  2. Session broker (FastAPI) – Authenticates with Azure using DefaultAzureCredential, generates the Voice Live WebSocket URL with a short-lived bearer token, and relays audio frames between the browser and Voice Live.
  3. Voice Live – The Azure AI Speech real-time endpoint, responsible for transcribing user input, sending text to the bound agent, and synthesising the agent’s responses.
  4. Foundry hosted agent – A prompt-kind agent in Azure AI Foundry with the necessary tool definitions and microsoft.voice-live.enabled metadata set to true.
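
To make the audio format in step 1 concrete, here is a minimal sketch of what "PCM16 at 24 kHz" means in bytes. The sample's browser client does this in the browser, not in Python; the function names, the mono assumption, and the 20 ms frame length are illustrative choices and do not come from the repository.

```python
import struct

SAMPLE_RATE_HZ = 24_000  # from the article: PCM16 audio at 24 kHz
BYTES_PER_SAMPLE = 2     # 16-bit signed samples; mono is an assumption


def float_to_pcm16(samples: list[float]) -> bytes:
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    ints = []
    for sample in samples:
        clamped = max(-1.0, min(1.0, sample))  # clip out-of-range input
        ints.append(int(round(clamped * 32767)))
    return struct.pack(f"<{len(ints)}h", *ints)


def frame_size_bytes(frame_ms: int) -> int:
    """Number of bytes in a mono PCM16 frame of the given duration."""
    return SAMPLE_RATE_HZ * frame_ms // 1000 * BYTES_PER_SAMPLE


if __name__ == "__main__":
    print(frame_size_bytes(20))          # bytes in a hypothetical 20 ms frame
    print(float_to_pcm16([0.0, 0.5]).hex())
```

Because the broker relays frames unchanged, whatever frame size the browser picks is what reaches Voice Live.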

Let’s highlight two important design decisions.

The broker’s functionality is intentionally minimal. It handles authentication, URL creation, and WebSocket relay but does not transcode audio, implement business logic, or track conversation state. Voice Live and the agent efficiently manage those tasks.

The agent binding occurs as a URL query parameter, not through an SDK call. As a result, there’s no separate HTTP request for each interaction with the agent runtime. Voice Live establishes a session with the agent only once, streaming turns throughout the WebSocket’s duration. This setup significantly reduces latency.

Getting the URL format right is the most error-prone part of the setup. The public sample from Microsoft, available at liupeirong/ai-foundry-voice-agent, uses a different URL structure (services.ai.azure.com host, agent-id and agent-access-token parameters, and an Authorization header). That format does not work with Foundry resources that use voice-live-enabled agents. The format described here is the one the Foundry portal itself uses, and this sample follows it.

The following three details often lead to errors:

  • The host needs to be .cognitiveservices.azure.com, not services.ai.azure.com. The broker adjusts this automatically from VOICE_LIVE_ENDPOINT.
  • The bearer token must be passed in the authorization query parameter, URL-encoded, as Bearer followed by a + (or %20) and then the token. Do not use an Authorization header.
  • Both agent-name and model should equal the agent’s display name, while agent-version can be left blank if you want to use the latest version.

Before running the sample, make sure the following prerequisites are in place:

  • Install Python 3.11 or later (the sample was developed with 3.13).
  • Sign in to the Azure CLI with az login --tenant <tenant-id>.
  • Create an Azure AI Foundry project in a Voice Live region (such as eastus2, swedencentral, or westus2).
  • Deploy a prompt-kind agent within that project, ensuring Enable Voice Live is activated.
  • Assign the Cognitive Services User role on the Foundry resource to the identity that the broker will use.

Next, duplicate .env.sample as .env and fill in the following values:

AZURE_AI_PROJECT_ENDPOINT=https://<your-resource>.services.ai.azure.com
AZURE_AI_PROJECT_NAME=<your-project-name>
VOICE_LIVE_ENDPOINT=wss://<your-resource>.services.ai.azure.com/voice-live/realtime
VOICE_LIVE_API_VERSION=2025-10-01
FOUNDRY_AGENT_ID=<your-agent-name>

The agent name is displayed on the agent card in the Foundry portal, and the broker will use this for both agent-name and model query parameters.
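
A broker has to read these values at start-up and fail fast if any are missing. The following is a minimal sketch of that idea using only the standard library. The BrokerSettings class and load_settings function are hypothetical names for illustration and are not taken from the sample, which may load its configuration differently.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Optional

# The five keys listed in .env.sample above.
REQUIRED_KEYS = (
    "AZURE_AI_PROJECT_ENDPOINT",
    "AZURE_AI_PROJECT_NAME",
    "VOICE_LIVE_ENDPOINT",
    "VOICE_LIVE_API_VERSION",
    "FOUNDRY_AGENT_ID",
)


@dataclass(frozen=True)
class BrokerSettings:
    project_endpoint: str
    project_name: str
    voice_live_endpoint: str
    api_version: str
    agent_id: str  # the agent's display name, used for agent-name and model


def load_settings(env: Optional[Mapping[str, str]] = None) -> BrokerSettings:
    """Read the settings from env (defaults to os.environ) or raise RuntimeError."""
    source = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not source.get(key, "").strip()]
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))
    return BrokerSettings(
        project_endpoint=source["AZURE_AI_PROJECT_ENDPOINT"].strip(),
        project_name=source["AZURE_AI_PROJECT_NAME"].strip(),
        voice_live_endpoint=source["VOICE_LIVE_ENDPOINT"].strip(),
        api_version=source["VOICE_LIVE_API_VERSION"].strip(),
        agent_id=source["FOUNDRY_AGENT_ID"].strip(),
    )
```

Passing a plain dictionary instead of relying on os.environ keeps the loader easy to exercise in tests.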

python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
.\scripts\start-local.ps1

The broker provides three endpoints:

  • GET /healthz – For liveness checks.
  • GET /config – Returns the session.update that the browser sends as its initial frame.
  • WS /ws – The bi-directional relay to Voice Live.

To verify the setup end to end, run the session test script:

.\scripts\test-session.ps1

If everything works correctly, the output will be:

[OK] /ws upgraded
   -> sent session.update
   <- {"type":"session.created",…}
   <- {"type":"session.updated",…}
[OK] session.updated received -- E2E works

This confirms that the entire process—broker, DefaultAzureCredential token, correct Foundry Portal URL format, Voice Live handshake, and the bound agent acknowledging the session—functions properly.
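
The check that the test script performs can be approximated in a few lines: read the JSON frames that come back and look for session.updated. This is a sketch of the idea, not the script's actual logic. The function name is hypothetical, and treating a frame of type "error" as a failure is an assumption about the event format.

```python
import json


def handshake_succeeded(frames: list[str]) -> bool:
    """Return True once a session.updated event appears in the received frames.

    Assumption: a frame whose type is "error" means the handshake failed.
    """
    for raw in frames:
        event = json.loads(raw)
        kind = event.get("type")
        if kind == "session.updated":
            return True
        if kind == "error":
            return False
    return False  # frames ran out before the session was acknowledged


if __name__ == "__main__":
    received = ['{"type":"session.created"}', '{"type":"session.updated"}']
    print("E2E works" if handshake_succeeded(received) else "handshake failed")
```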

Visit http://localhost:8000/, hit Start talking, and try asking one of the sample questions. You’ll see real-time transcripts, and the spoken responses will play through your audio system.

The relay logic is compact; most of the work lies in building the URL. The following function serves as the standard reference; you can adapt it for use in other programming languages.

import uuid
from urllib.parse import quote, urlencode


def build_voice_live_ws_url(agent_access_token: str) -> str:
    """
    Build a WebSocket URL for Voice Live in the Foundry Portal style.

    Authentication is handled in the query string. No Authorization header is required.
    """
    # Derive the .cognitiveservices.azure.com host from the configured endpoint.
    host = _ws_host_from_endpoint(VOICE_LIVE_ENDPOINT)
    qs = urlencode(
        {
            "trafficType": "FoundryPortal",
            # agent-name and model both carry the agent's display name.
            "agent-name": FOUNDRY_AGENT_ID,
            "agent-version": "",  # blank selects the latest version
            "agent-project-name": AZURE_AI_PROJECT_NAME,
            "api-version": VOICE_LIVE_API_VERSION,
            "model": FOUNDRY_AGENT_ID,
            "client-request-id": str(uuid.uuid4()),
            "authorization": f"Bearer {agent_access_token}",
        },
        quote_via=quote,  # encodes the space after "Bearer" as %20
    )
    return f"wss://{host}/voice-live/realtime?{qs}"
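
The function above relies on a host helper that is not shown in this article. The following is a minimal sketch of what such a helper might do, assuming it only needs to swap the services.ai.azure.com suffix for cognitiveservices.azure.com. The names ws_host_from_endpoint and authorization_query are hypothetical stand-ins, and the sample's own helper may differ.

```python
from urllib.parse import quote, urlencode, urlparse

PORTAL_SUFFIX = ".services.ai.azure.com"
VOICE_LIVE_SUFFIX = ".cognitiveservices.azure.com"


def ws_host_from_endpoint(endpoint: str) -> str:
    """Return the Voice Live host for an endpoint URL or a bare host name."""
    candidate = endpoint if "://" in endpoint else f"wss://{endpoint}"
    host = urlparse(candidate).hostname
    if not host:
        raise ValueError(f"Cannot extract a host from {endpoint!r}")
    if host.endswith(PORTAL_SUFFIX):
        # Swap the portal-style suffix for the one Voice Live expects.
        host = host[: -len(PORTAL_SUFFIX)] + VOICE_LIVE_SUFFIX
    return host


def authorization_query(token: str) -> str:
    """Encode the bearer token as a query fragment, using %20 for the space."""
    return urlencode({"authorization": f"Bearer {token}"}, quote_via=quote)


if __name__ == "__main__":
    # "contoso" is a made-up resource name.
    print(ws_host_from_endpoint("wss://contoso.services.ai.azure.com/voice-live/realtime"))
    print(authorization_query("abc.def"))
```

Leaving out quote_via=quote would make urlencode fall back to quote_plus, which encodes the space as + instead of %20.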

The relay consists of two asyncio tasks: one for forwarding browser frames upstream and another for relaying Voice Live frames back. Audio data remains unaltered—the broker doesn’t decode it.
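
The two-task relay pattern can be sketched with the standard library alone. This is an illustration under stated assumptions, not the sample's code. Channel is a hypothetical interface standing in for whichever WebSocket objects the broker uses, and ChannelClosed is a made-up exception that signals a closed peer.

```python
import asyncio
from typing import Protocol, Union

Frame = Union[bytes, str]


class ChannelClosed(Exception):
    """Hypothetical signal that the peer has closed the connection."""


class Channel(Protocol):
    async def recv(self) -> Frame: ...
    async def send(self, frame: Frame) -> None: ...


async def pump(src: Channel, dst: Channel) -> None:
    """Forward frames from src to dst untouched until src closes."""
    try:
        while True:
            await dst.send(await src.recv())
    except ChannelClosed:
        pass


async def relay(browser: Channel, voice_live: Channel) -> None:
    """Run both directions concurrently; stop as soon as either side closes."""
    upstream = asyncio.create_task(pump(browser, voice_live))
    downstream = asyncio.create_task(pump(voice_live, browser))
    done, pending = await asyncio.wait(
        {upstream, downstream}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # tear down the other direction
    await asyncio.gather(*pending, return_exceptions=True)
    for task in done:
        task.result()  # surface any unexpected error from the finished pump
```

Stopping both tasks when either side closes avoids leaving a half-open session behind if the browser tab goes away.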

The most reliable method to create a voice-live-enabled agent is via the Foundry portal. Agents created through the Assistants v2 SDK typically lack the necessary metadata and will be rejected by the Voice Live URL.

To set up an agent in the portal, follow these steps:

  1. Open your Foundry project, navigate to Agents, and click New agent.
  2. Select Prompt agent, provide a name (like travel-concierge), and choose a model deployment.
  3. Copy the contents from agent/src/prompts/system.txt into the instructions box.
  4. On the Voice tab, ensure Enable Voice Live is activated. This sets the microsoft.voice-live.enabled = true metadata.
  5. Add the three tools (get_weather, get_flight_status, get_hotel_info) specified in agent/agent.yaml on the Tools tab.
  6. Publish the version, then set FOUNDRY_AGENT_ID in .env to the agent name.

A comprehensive deployment guide, including instructions on hosting the broker using Azure Container Apps with a managed identity, can be found in docs/deployment.md in the repository.

Foundry agents often format their answers in markdown, including citations such as ([data.jma.go.jp](https://...)). When Voice Live speaks these responses, users hear URLs read aloud, letter by letter. To avoid this, instruct the agent to keep URLs, markdown, and symbols out of its responses. You might append a short section like this to the end of the instructions:

Voice output guidelines
- As this output is read aloud by TTS, avoid including URLs, domain names, or citation markers like "(source.com)" in responses. Instead, cite by the name of the source only.
- Do not use markdown for formatting. Avoid asterisks, brackets, backticks, bullets, or hashes. Write in clear spoken sentences.
- Keep numbers easy to understand: say “thirty degrees Celsius” instead of “30C / 86F”.
- Limit responses to around 40 words unless the user asks for more detail.

The browser can still present markdown for visual clarity. The sample includes a simple markdown renderer that allows only certain formats: bold, italic, code, and http(s) links. This way, the screen displays a well-polished version of the agent’s response, while the spoken version remains uncomplicated.
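
The escape-first, allowlist-second approach can be illustrated in Python, although the sample's renderer runs in the browser. This is a minimal sketch of the technique, not a port of the sample's code. The function name, the regular expressions, and the rel attribute are assumptions.

```python
import html
import re

# Only http(s) links are recognised; other schemes stay as plain text.
_LINK = re.compile(r"\[([^\]]+)\]\((https?://[^\s)*`]+)\)")
_CODE = re.compile(r"`([^`]+)`")
_BOLD = re.compile(r"\*\*([^*]+)\*\*")
_ITALIC = re.compile(r"\*([^*]+)\*")


def render_safe_markdown(text: str) -> str:
    """Escape all HTML first, then re-introduce a small allowlist of formats."""
    out = html.escape(text, quote=True)  # neutralise any tags in the reply
    out = _LINK.sub(r'<a href="\2" rel="noopener noreferrer">\1</a>', out)
    out = _CODE.sub(r"<code>\1</code>", out)
    out = _BOLD.sub(r"<strong>\1</strong>", out)  # bold before italic
    out = _ITALIC.sub(r"<em>\1</em>", out)
    return out


if __name__ == "__main__":
    print(render_safe_markdown("**Tokyo** is sunny, per [JMA](https://data.jma.go.jp)"))
```

Escaping before any markdown pass means the only tags that can appear in the output are the ones the renderer emits itself.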

The broker uses DefaultAzureCredential and requests the https://ai.azure.com/.default scope. Locally, this resolves to your az login credentials. In Azure Container Apps, it resolves to the user-assigned managed identity. In either case, the identity only needs the Cognitive Services User role on the Foundry account. The working URL format has no provision for API keys; authentication is by bearer token only.
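
Token acquisition can be sketched without depending on the Azure SDK at import time. The function below accepts any object with a get_token method, which is the shape DefaultAzureCredential from the azure-identity package exposes. The function name and the Protocol classes are illustrative and are not taken from the sample.

```python
from typing import Protocol

FOUNDRY_SCOPE = "https://ai.azure.com/.default"  # scope named in the article


class AccessTokenLike(Protocol):
    token: str
    expires_on: int


class TokenCredentialLike(Protocol):
    def get_token(self, *scopes: str) -> AccessTokenLike: ...


def fetch_bearer_token(credential: TokenCredentialLike) -> str:
    """Request a token for the Foundry scope and return the raw token string."""
    access = credential.get_token(FOUNDRY_SCOPE)
    if not access.token:
        raise RuntimeError("Credential returned an empty token")
    return access.token
```

With azure-identity installed, the call would look like fetch_bearer_token(DefaultAzureCredential()). The result is the short-lived token that goes into the authorization query parameter.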

Be cautious when pointing code from the public liupeirong/ai-foundry-voice-agent repository at a voice-live agent created in the portal. You might encounter HTTP 400 errors or a silent disconnection with close code 1006. These failures come from the URL shape, not from a bug in your code. The reference probe in scripts/probe_portal_shape.py is the only reliable source for the working format, so keep it for reference.

Security and responsible AI considerations:

  • Credentials are never exposed to the browser. Tokens are generated server-side and only travel through the upstream Voice Live URL.
  • No secrets in source code. The .env file is gitignored, and .env.sample only contains placeholders.
  • Markdown rendering is locked down. The browser HTML-escapes the agent’s reply before applying a small allowlist of markdown formats, and links are only rendered for http(s) URLs.
  • Tool calls are traceable. Every turn logs a run in the Foundry portal under the agent, showing the prompts, model outputs, and tool inputs and outputs for review.
  • Consider voice biometric verification. If your application verifies accounts by voice, use dedicated speaker recognition technology rather than relying on the conversational model.

Key takeaways:

  • Voice Live with a Foundry hosted agent is a session-level integration rather than a per-request API integration. It involves a single URL, one binding field, and one WebSocket connection.
  • The browser acts as a lightweight client. All authentication, URL creation, and relaying live in a compact FastAPI broker.
  • Pay attention to the URL structure. The host should be cognitiveservices.azure.com, the token belongs in the query string, and agent-name and model must both match the agent’s display name.
  • Always create agents via the Foundry portal to ensure the voice-live metadata is set appropriately.
  • Write agent instructions for the ear rather than the eye, and layer visual formatting on top in the browser.

If you build upon this approach, please consider opening an issue or submitting a pull request to the repository. The sample is intentionally kept concise to facilitate easy forking.
