
Building an Enterprise HR Chatbot with Multi-Strategy RAG and Live Agent Handoff on Azure

HR teams face a multitude of employee inquiries daily, ranging from policy clarifications and leave balances to sensitive issues such as harassment and misconduct. AI chatbots can tackle the routine questions, allowing HR advisors to focus on more complex matters. However, many chatbot initiatives stall at basic FAQ responses, struggling with multi-country policies, employee jargon, and seamless handoffs to human support.

This article outlines how we developed Eva, an operational HR chatbot using the Microsoft Bot Framework and Semantic Kernel on Azure. I’ll highlight three key challenges we faced and how we overcame them:

  1. Finding accurate responses when employee language differs from policy documents
  2. Transferring queries to a live advisor in real-time
  3. Automatically detecting lapses in answer quality

Eva is built on Retrieval-Augmented Generation (RAG): retrieve the documents relevant to a question and feed them to a large language model (LLM) as grounding context. However, standard RAG often falls short in HR for several reasons:

  • Vocabulary differences. For example, if an employee asks, “How does misconduct impact my ACB?” but the policy states, “Annual Cash Bonus eligibility criteria,” the search could fail to connect the two.
  • Multi-country discrepancies. The same question might yield different answers based on the employee’s country, role, or grade.
  • Delicate subjects. Queries related to harassment, disability, or whistleblowing should be directed to a human and not answered by AI.
  • Irrelevant results. Search outcomes often include documents that are globally relevant but not applicable locally.

Eva addresses these issues through a structured pipeline of query augmentation → multi-index search → LLM reranking → answer generation → citation handling.
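As a rough sketch, that pipeline's overall shape might look like the following. The `PipelineResult` model, stage names, and toy stand-ins are illustrative assumptions, not Eva's actual interfaces:

```python
from dataclasses import dataclass, field

@dataclass
class PipelineResult:
    query: str
    documents: list[str] = field(default_factory=list)
    answer: str = ""
    citations: list[str] = field(default_factory=list)

def run_pipeline(query, augment, search, rerank, generate):
    """Augment the query, search, rerank the hits, then generate an answer."""
    augmented = augment(query)
    docs = rerank(query, search(augmented))
    answer, citations = generate(query, docs)
    return PipelineResult(query=query, documents=docs,
                          answer=answer, citations=citations)

# Toy stand-ins for each stage, wired together:
result = run_pipeline(
    "How does misconduct impact my ACB?",
    augment=lambda q: q.replace("ACB", "Annual Cash Bonus"),
    search=lambda q: ["conduct-policy.pdf", "bonus-policy.pdf"],
    rerank=lambda q, docs: sorted(docs),
    generate=lambda q, docs: ("Misconduct can reduce bonus eligibility.", docs[:1]),
)
```

Each stage is a plain callable here; in the real system they would be async calls into Azure AI Search and Azure OpenAI.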

Eva's stack, layer by layer:

| Layer | Technology |
| --- | --- |
| Bot framework | Microsoft Agents SDK (aiohttp) |
| LLM orchestration | Semantic Kernel |
| Primary LLM | Azure OpenAI Service (GPT-4.1 / GPT-4o) |
| Knowledge search | Azure AI Search (hybrid + vector) |
| Live agent chat | Salesforce MIAW via server-sent events |
| Evaluation | Azure AI Evaluation SDK + custom LLM judge |
| Config | Pydantic-settings + Azure App Configuration + Key Vault |

Eva employs four distinct search techniques, activated via feature flags, allowing for A/B testing across countries without altering code. These operate in a priority sequence:

  1. HyDE (Hypothetical Document Embeddings)
    Instead of directly searching with the employee’s question, the LLM first creates a hypothetical policy document that would provide an answer. This synthetic document is then embedded and used as a search query, thereby effectively bridging vocabulary gaps.
  2. Step-back prompting
    Here, the LLM broadens the question, transforming “How does misconduct affect my ACB?” into “What is the Annual Cash Bonus policy and what factors influence eligibility?” This approach helps when answers are found within broader sections of policy documents.
  3. Query rewrite
    The LLM expands abbreviations and adds context specific to HR before conducting a hybrid (text + vector) search.
  4. Standard search (fallback)
Basic intent classification combined with hybrid search, used when no augmentation strategy is enabled.

All four approaches return the same Pydantic model, ensuring the rest of the pipeline functions independently of which method was used. The team can implement HyDE globally, apply step-back prompting in selected regions, or revert modifications swiftly if necessary.
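A minimal sketch of that flag-driven selection, with a shared result model (the `SearchResult` dataclass stands in for the actual Pydantic model; the flag names and stub strategies are assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class SearchResult:
    strategy: str                 # which technique produced these hits
    documents: list[str] = field(default_factory=list)

# Stub strategies — all return the same model, so downstream code
# never cares which one ran.
def hyde_search(query):     return SearchResult("hyde", ["doc-a"])
def stepback_search(query): return SearchResult("step_back", ["doc-b"])
def rewrite_search(query):  return SearchResult("rewrite", ["doc-c"])
def standard_search(query): return SearchResult("standard", ["doc-d"])

# Priority order: the first enabled flag wins; standard search is the fallback.
STRATEGIES = [
    ("hyde", hyde_search),
    ("step_back", stepback_search),
    ("rewrite", rewrite_search),
]

def search(query, flags):
    for name, fn in STRATEGIES:
        if flags.get(name):
            return fn(query)
    return standard_search(query)
```

Because the flags come from configuration rather than code, flipping a strategy on for one country is a config change, not a deployment.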

After retrieving results from both a country-specific index and a global index, Eva can optionally reorder them with a RankGPT-style step, in which the LLM ranks documents by relevance with a preference for local content. If reranking fails for any reason, Eva falls back to the original retrieval order so the pipeline keeps moving.
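That fail-open behaviour could be sketched like so (`llm_rank` is a hypothetical callable returning a permutation of document indices; the permutation check is an assumption about how malformed LLM output would be caught):

```python
def rerank(query, docs, llm_rank):
    """Ask the LLM for a relevance ordering; keep the original order on failure."""
    try:
        order = llm_rank(query, docs)              # e.g. [2, 0, 1]
        if sorted(order) != list(range(len(docs))):
            raise ValueError("LLM returned an invalid permutation")
        return [docs[i] for i in order]
    except Exception:
        return docs                                # fail open: keep retrieval order
```

Both a malformed ranking and an outright exception degrade gracefully to the retriever's ordering.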

The response stage separates collected documents into local context (country-specific) and global context (company-wide), each injected as unique sections in the prompt. The LLM generates a structured reply that includes reasoning, the actual answer, citations, and a classification of how well the question was answered (full, partial, or denial).
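The structured reply and the local/global split might be modelled roughly like this (a stdlib dataclass stands in for Eva's Pydantic model; the field names and section headers are assumptions):

```python
from dataclasses import dataclass, field
from enum import Enum

class AnswerStatus(Enum):
    FULL = "full"          # question fully answered from the documents
    PARTIAL = "partial"    # only partially covered
    DENIAL = "denial"      # cannot or should not be answered

@dataclass
class StructuredReply:
    reasoning: str
    answer: str
    citations: list[str] = field(default_factory=list)
    status: AnswerStatus = AnswerStatus.FULL

def build_context(local_docs, global_docs):
    """Render local and global documents as distinct prompt sections."""
    return (
        "## Local context (country-specific)\n" + "\n".join(local_docs) +
        "\n\n## Global context (company-wide)\n" + "\n".join(global_docs)
    )
```

Keeping the two contexts in separate sections means the prompt, not the model, decides which country's policy is in scope.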

These prompts are stored in version-controlled .txt files with model variants (e.g., gpt-4o.txt, gpt-4.1.txt), making them easy to review and deploy with no code changes required.
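A model-variant-aware prompt loader can be very small. The filename convention below matches the article; falling back to a `default.txt` is an assumption about the layout:

```python
import tempfile
from pathlib import Path

def resolve_prompt(prompt_dir: Path, model: str, default: str = "default") -> Path:
    """Prefer <model>.txt; fall back to <default>.txt."""
    candidate = prompt_dir / f"{model}.txt"
    return candidate if candidate.exists() else prompt_dir / f"{default}.txt"

def load_prompt(prompt_dir: Path, model: str) -> str:
    return resolve_prompt(prompt_dir, model).read_text(encoding="utf-8")

# Demo: a prompt folder with a default and a gpt-4o-specific variant.
with tempfile.TemporaryDirectory() as d:
    folder = Path(d)
    (folder / "default.txt").write_text("You are Eva.", encoding="utf-8")
    (folder / "gpt-4o.txt").write_text("You are Eva (4o tuning).", encoding="utf-8")
    variant = load_prompt(folder, "gpt-4o")    # picks gpt-4o.txt
    fallback = load_prompt(folder, "gpt-4.1")  # no gpt-4.1.txt, uses default.txt
```

Because the files live in version control, a prompt tweak is a reviewable diff rather than a code change.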

When Eva identifies that a question requires a human touch—be it a complex scenario or a sensitive topic—it smoothly transitions to a Salesforce advisor in real-time.

  • SSE streaming. Eva maintains a stable HTTP connection to Salesforce for instant messaging, typing indicators, and session end notifications.
  • Session resilience. The session state is preserved through three layers: in-memory cache, Azure Cosmos DB, and Bot Framework turn state, ensuring continuity even during restarts.
  • Message delivery workers. Each session benefits from a dedicated asynchronous worker with exponential backoff, ensuring that overflow messages are not discarded but rather saved for later retrieval.
  • Queue position updates. While employees wait for assistance, Eva checks with Salesforce for their position in the queue and sends timely updates.
  • Context handoff. At the start of a new session, Eva sends the full conversation history to the advisor to avoid repetitive questions.
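The per-session delivery worker might be sketched as follows. The queue sentinel, retry cap, and the `parked` list for undeliverable messages are illustrative assumptions, not Eva's actual code:

```python
import asyncio

async def delivery_worker(queue, send, parked, max_retries=3, base_delay=0.01):
    """Drain one session's outbound queue, retrying transient send failures
    with exponential backoff and parking undeliverable messages."""
    while True:
        msg = await queue.get()
        if msg is None:                          # sentinel: session ended
            return
        for attempt in range(max_retries):
            try:
                await send(msg)
                break
            except ConnectionError:
                await asyncio.sleep(base_delay * 2 ** attempt)
        else:
            parked.append(msg)                   # retries exhausted: save, don't drop

async def demo():
    queue, parked = asyncio.Queue(), []
    failures = {"remaining": 2}
    async def flaky_send(msg):                   # fails twice, then succeeds
        if failures["remaining"]:
            failures["remaining"] -= 1
            raise ConnectionError
    await queue.put("Your advisor will join shortly.")
    await queue.put(None)
    await delivery_worker(queue, flaky_send, parked)
    return parked

parked = asyncio.run(demo())
```

One worker task per session keeps delivery order intact while isolating a slow or flaky Salesforce connection to that session alone.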

Eva includes a robust evaluation framework that operates independently, testing responses against established Q&A pairs stored in CSV files.

Factual questions are assessed using Azure AI’s SimilarityEvaluator on a scale from 1 to 5, with appropriate checks for relevance and grounding.

Sensitive inquiries (such as those relating to harassment or whistleblowing) are supervised by a custom LLM judge, ensuring that the responses respect the sensitivity of the issue and direct the employee to appropriate resources.

A deviation detector identifies any drops in score between evaluations. Results are stored in SQLite for analysis, with Application Insights providing dashboard capabilities. Long evaluation processes can be resumed, with the framework automatically skipping finished test cases upon restart.
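Resumable evaluation and the deviation check can be sketched with a plain dictionary standing in for the SQLite store (the function names and score-drop threshold are assumptions):

```python
def run_evaluation(cases, evaluate, store):
    """Score each (case_id, question, expected) case, skipping any case_id
    already present in the store (e.g. from an interrupted run)."""
    for case_id, question, expected in cases:
        if case_id in store:
            continue
        store[case_id] = evaluate(question, expected)
    return store

def deviations(previous, current, threshold=1):
    """Case IDs whose score dropped by at least `threshold` between runs."""
    return {cid for cid, score in current.items()
            if cid in previous and previous[cid] - score >= threshold}

# Demo: resume after an interrupted run that already scored "c1".
calls = []
def fake_eval(question, expected):
    calls.append(question)
    return 4                                     # pretend similarity score

store = {"c1": 5}                                # persisted from the first run
run_evaluation([("c1", "q1", "a1"), ("c2", "q2", "a2")], fake_eval, store)
regressions = deviations({"c1": 5, "c2": 5}, store)
```

Persisting per-case results is what makes both features cheap: resumption is a key lookup, and regression detection is a diff of two score maps.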

Key lessons from building Eva:

  • Make retrieval strategies flexible. Feature flags enable A/B testing without needing to redeploy.
  • Clearly separate local and global knowledge. Don’t depend on the LLM to decipher which country’s policy is applicable.
  • Prioritise evaluation from the outset. Ground-truth datasets with factual and behavioural assessments can reveal regressions missed during manual testing.
  • Enhance resilience in live agent handoff. Multi-tier session recovery and retry protocols minimize the risk of lost conversations.
  • Treat prompts like code. Using file-based, model-variant-aware prompts simplifies maintenance versus inline strings.
  • Implement Pydantic for structured LLM outputs. Typed models help catch erroneous outputs at the validation stage rather than allowing them to propagate.
