Building a hands-free voice concierge with Microsoft Foundry Voice Live and a Hosted Agent
This article provides an accessible guide on connecting your browser’s microphone to Azure AI Speech Voice Live. It details how to bind a real-time session to a Foundry hosted agent, allowing the agent to respond to travel-related queries using tool calls. You can find the complete source code, infrastructure details, and labs at the repository link given at the end.
Building effective voice user interfaces has been quite a challenge. Traditionally, tasks like streaming audio, generating partial transcripts, handling interruptions, detecting voice activity, dispatching tools, and playing audio required combining several services. However, by integrating Voice Live with a Foundry hosted agent, we can simplify this process into a single real-time WebSocket session with just one binding field.
- Voice Live manages the complete audio loop, including speech-to-text conversion, neural text-to-speech, semantic turn detection, noise reduction, and echo cancellation.
- The Foundry hosted agent is responsible for logic, memory management, model selection, evaluators, and making tool calls.
- Communication between the two occurs via one query parameter in the WebSocket URL.
Practically, this means the browser never has to deal with a model API key, nor does it instantiate any tools or manage the agent prompt. Instead, the browser focuses on capturing audio and playing it back, while everything else is handled on the server side.
This example is named Contoso Travel Concierge. Users, often busy during their journey, might want to ask questions like:
- What’s the weather like in Tokyo this weekend?
- Is flight BA005 from Heathrow on time?
- What time is check-in at the Marriott Marquis?
Each query triggers a tool call on the hosted agent, which returns a concise, synthesised response to the user in under a second over a single persistent connection.
There are four main components in this setup. Voice Live and the hosted agent are managed Azure services; on the server side, only the broker is custom code.
- Browser client – Records PCM16 audio at 24 kHz and streams it to the broker via WebSocket. It also plays back audio segments forwarded from Voice Live.
- Session broker (FastAPI) – Authenticates with Azure using DefaultAzureCredential, generates the Voice Live WebSocket URL with a short-lived bearer token, and relays audio frames between the browser and Voice Live.
- Voice Live – The Azure AI Speech real-time endpoint, responsible for transcribing user input, sending text to the bound agent, and synthesising the agent’s responses.
- Foundry hosted agent – A prompt-kind agent in Azure AI Foundry with the necessary tool definitions and microsoft.voice-live.enabled metadata set to true.
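To make the audio format in the component list concrete, here is a minimal sketch of what "PCM16 at 24 kHz" means in bytes. In the sample this work happens in the browser; Python is used here only for illustration. Mono audio, little-endian byte order, and the 20 ms frame length in the usage note are assumptions, not values taken from the sample.

```python
import struct

SAMPLE_RATE_HZ = 24_000  # sample rate stated in the article
BYTES_PER_SAMPLE = 2     # PCM16 means 16-bit signed integers


def float_to_pcm16(samples: list[float]) -> bytes:
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM, clamping overflow."""
    ints = [max(-32768, min(32767, round(s * 32767))) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def frame_bytes(duration_ms: int) -> int:
    """Size in bytes of a mono PCM16 frame of the given duration (assumed mono)."""
    return SAMPLE_RATE_HZ * duration_ms // 1000 * BYTES_PER_SAMPLE
```

Under these assumptions, `frame_bytes(20)` gives 960 bytes per frame, which is the size of payload the broker would forward untouched.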
Let’s highlight two important design decisions.
The broker’s functionality is intentionally minimal. It handles authentication, URL creation, and WebSocket relay but does not transcode audio, implement business logic, or track conversation state. Voice Live and the agent efficiently manage those tasks.
The agent binding occurs as a URL query parameter, not through an SDK call. As a result, there’s no separate HTTP request for each interaction with the agent runtime. Voice Live establishes a session with the agent only once, streaming turns throughout the WebSocket’s duration. This setup significantly reduces latency.
The URL format is the critical detail. The public sample from Microsoft, available at liupeirong/ai-foundry-voice-agent, uses a different URL structure (services.ai.azure.com host, agent-id + agent-access-token parameters, and an Authorization header). That format isn’t compatible with Foundry resources that use voice-live-enabled agents. The format described below is the one the Foundry portal itself uses, and this sample adheres to it.
The following three details often lead to errors:
- The host needs to be .cognitiveservices.azure.com, not services.ai.azure.com. The broker adjusts this automatically from VOICE_LIVE_ENDPOINT.
- The bearer token must be included in the authorization query parameter, URL-encoded, with the prefix Bearer and a + (or %20) preceding the token. Avoid using an Authorization header.
- Both agent-name and model should equal the agent’s display name, while agent-version can be left blank if you want to use the latest version.
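The first two details can be sketched with the standard library alone. The helper names `portal_host` and `auth_query` are hypothetical and are not the broker's own functions; the resource name `contoso-res` in the usage note is a placeholder as well.

```python
from urllib.parse import quote, urlencode, urlsplit

PORTAL_SUFFIX = ".cognitiveservices.azure.com"
PROJECT_SUFFIX = ".services.ai.azure.com"


def portal_host(endpoint: str) -> str:
    """Hypothetical helper: map a configured endpoint to the portal-style host."""
    host = urlsplit(endpoint).hostname or ""
    if host.endswith(PROJECT_SUFFIX):
        # Keep the resource name, swap only the domain suffix.
        host = host[: -len(PROJECT_SUFFIX)] + PORTAL_SUFFIX
    return host


def auth_query(token: str) -> str:
    """Encode the bearer token as the authorization query parameter."""
    # quote_via=quote encodes the space as %20; the default would emit '+'.
    return urlencode({"authorization": f"Bearer {token}"}, quote_via=quote)
```

For example, `portal_host("wss://contoso-res.services.ai.azure.com/voice-live/realtime")` yields `contoso-res.cognitiveservices.azure.com`, and `auth_query("abc")` yields `authorization=Bearer%20abc`.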
- Ensure you have Python 3.11 or later (the sample was developed with 3.13).
- Sign in to Azure CLI with az login --tenant.
- Create an Azure AI Foundry project in a Voice Live region (such as eastus2, swedencentral, or westus2).
- Deploy a prompt-kind agent within that project, ensuring Enable Voice Live is activated.
- Assign the Cognitive Services User role to the identity that the broker will use on the Foundry resource.
Next, duplicate .env.sample as .env and fill in the following values:
AZURE_AI_PROJECT_ENDPOINT=https://.services.ai.azure.com
AZURE_AI_PROJECT_NAME=
VOICE_LIVE_ENDPOINT=wss://.services.ai.azure.com/voice-live/realtime
VOICE_LIVE_API_VERSION=2025-10-01
FOUNDRY_AGENT_ID=
The agent name is displayed on the agent card in the Foundry portal, and the broker will use this for both agent-name and model query parameters.
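A blank value in .env tends to surface later as a confusing handshake failure, so it can help to fail fast at startup. The sketch below is illustrative and is not the broker's actual configuration code; `load_settings` is a hypothetical helper, and it assumes the .env values have already been loaded into the process environment.

```python
import os
from typing import Mapping

REQUIRED_KEYS = (
    "AZURE_AI_PROJECT_ENDPOINT",
    "AZURE_AI_PROJECT_NAME",
    "VOICE_LIVE_ENDPOINT",
    "VOICE_LIVE_API_VERSION",
    "FOUNDRY_AGENT_ID",
)


def load_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the required settings, raising if any is missing or blank."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))
    return {key: env[key].strip() for key in REQUIRED_KEYS}
```

Passing a plain dict instead of `os.environ` makes the check easy to exercise in a unit test.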
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
.\scripts\start-local.ps1
The broker provides three endpoints:
- GET /healthz – For liveness checks.
- GET /config – Returns the session.update that the browser sends as its initial frame.
- WS /ws – The bi-directional relay to Voice Live.
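For orientation, a session.update frame of the kind /config might return could look roughly like the sketch below. The article does not show the sample's actual payload, so every field under `session` is an assumption based on the public Voice Live documentation and should be checked against the API version you target.

```python
import json

# Assumed field names and values; confirm them against the Voice Live
# API reference before relying on them.
session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {"type": "azure_semantic_vad"},
        "input_audio_noise_reduction": {"type": "azure_deep_noise_suppression"},
        "input_audio_echo_cancellation": {"type": "server_echo_cancellation"},
    },
}

# The browser would send this JSON text as the first WebSocket frame.
first_frame = json.dumps(session_update)
```

Serving the frame from the broker keeps session settings in one place, so the browser does not need to know which audio features are enabled.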
.\scripts\test-session.ps1
If everything works correctly, the output will be:
[OK] /ws upgraded
-> sent session.update
<- {"type":"session.created",…}
<- {"type":"session.updated",…}
[OK] session.updated received -- E2E works
This confirms that the entire process—broker, DefaultAzureCredential token, correct Foundry Portal URL format, Voice Live handshake, and the bound agent acknowledging the session—functions properly.
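The check the smoke test performs can be approximated as follows: read frames until session.created and then session.updated have both arrived, in that order. This is a sketch and not the contents of test-session.ps1; `handshake_ok` is a hypothetical name, and treating an event of type `error` as a failure is an assumption about the event stream.

```python
import json
from typing import Iterable


def handshake_ok(frames: Iterable[str]) -> bool:
    """True once session.created and then session.updated are seen, in order."""
    expected = ["session.created", "session.updated"]
    for raw in frames:
        event_type = json.loads(raw).get("type")
        if event_type == "error":  # assumed failure event
            return False
        if expected and event_type == expected[0]:
            expected.pop(0)
        if not expected:
            return True
    return False
```

A session.updated that arrives without a preceding session.created is treated as a failed handshake.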
Visit http://localhost:8000/, hit Start talking, and try asking one of the sample questions. You’ll see real-time transcripts, and the spoken responses will play through your audio system.
The relay logic is compact; most of the work lies in building the URL. The following function serves as the standard reference; you can adapt it for use in other programming languages.
    import uuid
    from urllib.parse import quote, urlencode


    def build_voice_live_ws_url(agent_access_token: str) -> str:
        """
        Build a WebSocket URL for Voice Live in the Foundry Portal style.

        Authentication is handled in the query string. No Authorization header is required.
        """
        # Derive the portal-style host from the configured endpoint.
        host = _ws_host_from_endpoint(VOICE_LIVE_ENDPOINT)
        qs = urlencode(
            {
                "trafficType": "FoundryPortal",
                "agent-name": FOUNDRY_AGENT_ID,
                "agent-version": "",  # blank selects the latest version
                "agent-project-name": AZURE_AI_PROJECT_NAME,
                "api-version": VOICE_LIVE_API_VERSION,
                "model": FOUNDRY_AGENT_ID,  # must match agent-name
                "client-request-id": str(uuid.uuid4()),
                "authorization": f"Bearer {agent_access_token}",
            },
            quote_via=quote,  # encodes the space after "Bearer" as %20
        )
        return f"wss://{host}/voice-live/realtime?{qs}"
The relay consists of two asyncio tasks: one forwards browser frames upstream and the other relays Voice Live frames back. Audio data remains unaltered; the broker doesn’t decode it.
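The two-task relay pattern can be sketched without any WebSocket library by letting asyncio queues stand in for the two sockets. This is a simulation of the pattern and not the broker's code: `pump`, `relay`, and the `CLOSE` sentinel are hypothetical names, and a real broker would read from and write to WebSocket objects instead.

```python
import asyncio

CLOSE = object()  # sentinel standing in for a closed socket


async def pump(source: asyncio.Queue, sink: asyncio.Queue) -> None:
    """Forward frames unchanged from source to sink until the source closes."""
    while True:
        frame = await source.get()
        await sink.put(frame)  # CLOSE is forwarded too, so the peer can stop
        if frame is CLOSE:
            return


async def relay(browser_in, upstream_out, upstream_in, browser_out) -> None:
    """Run both directions concurrently; stop when either side closes."""
    tasks = [
        asyncio.create_task(pump(browser_in, upstream_out)),
        asyncio.create_task(pump(upstream_in, browser_out)),
    ]
    _, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # tear down the other direction
    await asyncio.gather(*pending, return_exceptions=True)
```

Frames pass through as opaque values, which mirrors the point that the broker never decodes audio. Stopping both directions as soon as one side closes is a design choice of this sketch; it avoids leaving a half-open relay behind.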
The most reliable method to create a voice-live-enabled agent is via the Foundry portal. Agents created through the Assistants v2 SDK typically lack the necessary metadata and will be rejected by the Voice Live endpoint.
To set up an agent in the portal, follow these steps:
- Open your Foundry project, navigate to Agents, and click New agent.
- Select Prompt agent, provide a name (like travel-concierge), and choose a model deployment.
- Copy the contents from agent/src/prompts/system.txt into the instructions box.
- On the Voice tab, ensure Enable Voice Live is activated. This sets the microsoft.voice-live.enabled = true metadata.
- Add the three tools (get_weather, get_flight_status, get_hotel_info) specified in agent/agent.yaml on the Tools tab.
- Publish the version and set FOUNDRY_AGENT_ID in .env to the agent name.
A comprehensive deployment guide, including instructions on hosting the broker using Azure Container Apps with a managed identity, can be found in docs/deployment.md in the repository.
Foundry agents often format their answers in markdown, including citations such as ([data.jma.go.jp](https://...)). When Voice Live speaks these responses, users hear URLs read aloud, letter by letter. To avoid this, write the agent instructions so that its responses contain no URLs, markdown, or symbols. You might append a brief section at the end of the instructions like this:
Voice output guidelines
- As this output is read aloud by TTS, avoid including URLs, domain names, or citation markers like "(source.com)" in responses. Instead, cite by the name of the source only.
- Do not use markdown for formatting. Avoid asterisks, brackets, backticks, bullets, or hashes. Write in clear spoken sentences.
- Keep numbers easy to understand: say “thirty degrees Celsius” instead of “30C / 86F”.
- Limit responses to around 40 words unless the user asks for more detail.
The browser can still present markdown for visual clarity. The sample includes a simple markdown renderer that allows only certain formats: bold, italic, code, and http(s) links. This way, the screen displays a well-polished version of the agent’s response, while the spoken version remains uncomplicated.
The broker employs DefaultAzureCredential and requests the https://ai.azure.com/.default scope. Locally, this resolves using your az login credentials. In Azure Container Apps, it resolves to the user-assigned managed identity. In either scenario, the identity only requires the Cognitive Services User role on the Foundry account. The working URL structure does not include a path for API keys; it relies solely on bearer tokens.
Be cautious when using the public liupeirong/ai-foundry-voice-agent repository with a voice-live agent provided by the portal. You might encounter HTTP 400 errors or a silent disconnection with code 1006. These issues arise from using an incorrect URL shape rather than coding mistakes. The reference probe in scripts/probe_portal_shape.py serves as the source of truth for this format, so keep it for reference.
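When a connection fails with HTTP 400 or close code 1006, checking the URL shape before debugging anything else can save time. The checker below is a hedged sketch built only from the rules stated in this article; it is not the repository's probe script, and `portal_shape_problems` is a hypothetical name.

```python
from urllib.parse import parse_qs, urlsplit


def portal_shape_problems(url: str) -> list[str]:
    """List deviations from the portal-style URL shape described in the article."""
    parts = urlsplit(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    problems = []
    if not (parts.hostname or "").endswith(".cognitiveservices.azure.com"):
        problems.append("host is not *.cognitiveservices.azure.com")
    auth = query.get("authorization", [""])[0]
    if not auth.startswith("Bearer "):
        problems.append("authorization query parameter missing or malformed")
    if "agent-id" in query or "agent-access-token" in query:
        problems.append("uses agent-id/agent-access-token parameters")
    name = query.get("agent-name", [""])[0]
    if not name or name != query.get("model", [""])[0]:
        problems.append("agent-name and model must both be the agent name")
    return problems
```

An empty list means the URL matches the shape described here; it does not prove the token or agent name is valid. Avoid logging the full URL in production, because it carries the bearer token.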
- Credentials are never exposed to the browser. Tokens are generated server-side and only travel through the upstream Voice Live URL.
- No secrets in source code. The .env file is gitignored, and .env.sample only contains placeholders.
- Markdown rendering follows a strict process. The browser HTML-escapes the agent’s reply before applying a small set of permissible markdown formats, ensuring links only use http(s) URLs to avoid security risks.
- Tool calls are traceable. Every turn logs a run in the Foundry portal under the agent, which includes visible prompts, model outputs, and the inputs and outputs of tools for review.
- Treat voice-based identity checks separately. If your application verifies accounts by voice, use dedicated speaker-verification technology rather than relying on the conversational model.
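The escape-then-allow-list approach from the markdown bullet above can be sketched as follows. The sample's renderer runs in the browser; this Python version is only an illustration of the ordering, and the regular expressions are deliberately simple assumptions and not a full markdown parser.

```python
import html
import re

# Links must start with http:// or https://; anything else stays as plain text.
_LINK = re.compile(r"\[([^\]]+)\]\((https?://[^\s)*`]+)\)")
_CODE = re.compile(r"`([^`]+)`")
_BOLD = re.compile(r"\*\*([^*]+)\*\*")
_ITALIC = re.compile(r"\*([^*]+)\*")


def render_reply(text: str) -> str:
    """Escape everything first, then re-introduce only allow-listed markup."""
    out = html.escape(text, quote=True)
    out = _LINK.sub(r'<a href="\2" rel="noopener noreferrer">\1</a>', out)
    out = _CODE.sub(r"<code>\1</code>", out)
    out = _BOLD.sub(r"<strong>\1</strong>", out)
    out = _ITALIC.sub(r"<em>\1</em>", out)
    return out
```

Because escaping happens before any markup is added, raw HTML in the agent's reply is displayed as text, and a link with a `javascript:` target is left unrendered.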
- Utilising Voice Live alongside a Foundry hosted agent offers a session-level integration, unlike traditional API integrations. It involves a single URL, one binding field, and one WebSocket connection.
- The browser acts as a lightweight client. All authentication, URL creation, and relaying functionalities rest within a compact FastAPI broker.
- Pay attention to the URL structure (the host should be cognitiveservices.azure.com). Ensure the token is in the query string, with agent-name being identical to model and matching the agent’s display name. Getting this right streamlines the entire process.
- Always create agents via the Foundry portal to ensure the voice-live metadata is set appropriately.
- Frame agent instructions for auditory clarity rather than visual presentation, with visual formatting layered on top in the browser.
If you build upon this approach, please consider opening an issue or submitting a pull request to the repository. The sample is intentionally kept concise to facilitate easy forking.