
How Drasi used GitHub Copilot to find documentation bugs

For many early-stage open-source projects, the "Getting Started" guide is a developer's first touchpoint. If a command fails, the output doesn't match, or a step is unclear, most users simply abandon the guide rather than report an issue.

Drasi is a project within the CNCF sandbox that detects changes in your data and triggers instant responses. It is maintained by a dedicated team of four engineers in Microsoft Azure's Office of the Chief Technology Officer, and we invest in thorough tutorials. Even so, we ship code faster than we can manually test the documentation that goes with it.

It wasn’t until late 2025, following an update to GitHub’s Dev Container infrastructure, that the severity of our situation became clear. This update required a newer minimum Docker version and disrupted the Docker daemon connection, rendering all existing tutorials non-functional. Due to our reliance on manual testing, the full extent of the breakdown wasn’t immediately evident. Any developer attempting to use Drasi during that timeframe would have faced significant obstacles.

This incident drove home a crucial realisation: with advanced AI coding assistants, we can treat documentation testing as a monitoring problem.

The problem: Why does documentation break?

Documentation typically fails for two primary reasons:

1. The curse of knowledge

Experienced developers often write documentation assuming readers share their background knowledge. For instance, when we say "wait for the query to bootstrap," we know that means running `drasi list query` and watching for a `Running` status or, even better, executing the `drasi wait` command. A newcomer has no such context, and neither does an AI. Both read the instructions literally and get stuck on the "how" when the documentation only states the "what."
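That "wait for the query to bootstrap" step, for instance, hides a polling loop. A minimal sketch of what we actually mean, where the wrapper function is our own illustration and only `drasi list query` and `drasi wait` are real commands:

```shell
# Poll a status command until it reports "Running", or time out.
# The wrapper is illustrative; in practice you would pass it
# "drasi list query" (or simply use the built-in `drasi wait`).
wait_for_running() {
  cmd="$1"; timeout="${2:-60}"; elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if $cmd 2>/dev/null | grep -q "Running"; then
      echo "query is Running"
      return 0
    fi
    sleep 2
    elapsed=$((elapsed + 2))
  done
  echo "timed out waiting for Running" >&2
  return 1
}

# Real usage (assumes the Drasi CLI is installed):
#   wait_for_running "drasi list query"
```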

2. Silent drift

Documentation does not fail as loudly as code does. When a configuration file is renamed, the build breaks immediately; if your documentation still references the old file name, nobody notices until a user gets confused. The problem is magnified for tutorials, which rely on components like Docker, k3d, and sample databases. Whenever any upstream dependency changes (a deprecated flag, a bumped version, a new default), a tutorial can break without anyone noticing.

The solution: Agents as synthetic users

To address this challenge, we approached tutorial testing as a simulation problem and developed an AI agent that functions as a “synthetic new user.”

This agent exhibits three essential traits:

  1. Naïveté: It has no prior knowledge of Drasi; it knows only what the tutorial says.
  2. Literal execution: It runs each command exactly as written. If a step is missing, it fails.
  3. Strictness: It checks every expected output. If the documentation states "You should see 'Success'" and the command returns nothing, the agent flags the issue and fails fast.

The stack: GitHub Copilot CLI and Dev Containers

We devised a solution leveraging GitHub Actions, Dev Containers, Playwright, and the GitHub Copilot CLI.

The tutorials rely on substantial infrastructure, including Docker, k3d clusters, and sample databases.

We needed an environment that precisely mirrors what our users experience. If users operate within a specific Dev Container on GitHub Codespaces, our tests must run in that same environment.

The architecture

Inside the container, we utilise the Copilot CLI with a tailored system prompt.

This prompt uses the Copilot CLI's prompt mode (-p) and grants the agent the capabilities a genuine user would have: executing terminal commands, writing files, and running browser scripts, just as a human developer would.
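A sketch of how such a session might be launched. The prompt contents, file path, and wording below are illustrative assumptions; only the `-p` flag is taken from the setup described above:

```shell
# Write a heavily abbreviated, hypothetical system prompt for the
# synthetic user, then hand it to the Copilot CLI in prompt mode.
cat > /tmp/synthetic-user-prompt.md <<'EOF'
You are a brand-new user with no prior knowledge of Drasi.
Follow the tutorial literally, step by step.
If any expected output is missing, stop and report a failure.
EOF

# In CI, this line runs inside the Dev Container:
#   copilot -p "$(cat /tmp/synthetic-user-prompt.md)"
echo "prompt ready"
```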

We also installed Playwright in the Dev Container so the agent can open and interact with webpages, mimicking how a user would follow the tutorial steps. The agent also takes screenshots for comparison against those in the documentation.

Security model

Our security approach focuses on one key principle: the container is the boundary.

Rather than limiting individual commands (futile, since the agent must run arbitrary Node scripts for Playwright), we treat the entire Dev Container as an isolated environment and control what crosses its boundary:

  - Outbound network access is restricted to localhost.
  - The Personal Access Token (PAT) is scoped to "Copilot Requests" only.
  - Containers are ephemeral and discarded after each run.
  - Workflow triggers require maintainer approval.

Dealing with non-determinism

One significant challenge of AI testing is non-determinism. Large language models (LLMs) can be unpredictable. Occasionally, the agent retries a command; other times, it may abandon the task.

We managed this by implementing a three-stage retry system with model escalation (starting with Gemini-Pro and upgrading to Claude Opus upon failure), employing semantic comparison for screenshots instead of mere pixel-matching, and verifying core data fields rather than volatile outputs.
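The escalation logic can be sketched as a small wrapper. The model identifiers, the exact retry schedule, and the `run_agent` helper are illustrative assumptions, not Drasi's actual configuration:

```shell
# Try the cheaper model first; escalate to a stronger one only
# when a run fails. `run_agent <model>` is a stand-in for
# launching a full Copilot CLI session with that model.
run_with_escalation() {
  for model in gemini-pro gemini-pro claude-opus; do
    if run_agent "$model"; then
      echo "passed with $model"
      return 0
    fi
    echo "attempt with $model failed, escalating" >&2
  done
  echo "all attempts failed" >&2
  return 1
}
```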

Our prompts also include strict constraints that keep the agent from wandering into unnecessary debugging journeys, directives that shape the structure of its report, and skip directives that tell it to bypass optional sections, such as setting up external services.

Artifacts for debugging

When a test run fails, we need to understand the cause. Since the agent operates within a temporary container, we cannot just SSH in and investigate.

Thus, our agent documents every run, preserving screenshots of web UIs, terminal output of critical commands, and a final markdown report explaining its reasoning.

These artifacts are uploaded to the GitHub Actions run summary, enabling us to “time travel” back to the precise point of failure and see the agent’s findings.

Parsing the agent’s report

With LLMs, achieving a clear “Pass/Fail” status that a machine can comprehend is tricky. An agent might produce a lengthy, nuanced conclusion.

To make this actionable within a CI/CD pipeline, we refined our prompts, explicitly instructing the agent to end its report with a single, fixed verdict string.

In our GitHub Action, we simply grep for this specific string to determine the exit code of the workflow.
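A sketch of that final gate. The verdict line `FINAL RESULT: PASS` is a hypothetical marker, not necessarily the exact string Drasi's prompts specify:

```shell
# Simulate the agent's report, then gate the workflow on a
# single machine-readable verdict line.
cat > /tmp/agent-report.md <<'EOF'
## Tutorial: Getting Started
All steps completed; screenshots matched the documentation.
FINAL RESULT: PASS
EOF

if grep -q '^FINAL RESULT: PASS$' /tmp/agent-report.md; then
  echo "workflow: success"
else
  echo "workflow: failure"
  exit 1
fi
```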

These simple techniques bridge the gap between the AI’s fuzzy outputs and the binary expectations of CI.

Automation

We now operate an automated workflow that runs weekly. It evaluates all our tutorials in parallel, each in its own sandbox container with a fresh agent acting as a synthetic user. If any tutorial fails, the workflow automatically opens an issue in our GitHub repository.
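The workflow shape is roughly as follows. Tutorial names, paths, and the helper script are illustrative; only the weekly schedule, per-tutorial isolation, and issue-on-failure behaviour come from the description above:

```yaml
on:
  schedule:
    - cron: "0 6 * * 1"   # once a week
  workflow_dispatch: {}

jobs:
  test-tutorial:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false      # let every tutorial finish, even if one fails
      matrix:
        tutorial: [getting-started, change-detection]  # illustrative names
    steps:
      - uses: actions/checkout@v4
      # Hypothetical helper that boots the Dev Container and runs the agent
      - run: ./ci/run-synthetic-user.sh "${{ matrix.tutorial }}"
      - name: File an issue on failure
        if: failure()
        run: >
          gh issue create
          --title "Tutorial failed: ${{ matrix.tutorial }}"
          --body "See run ${{ github.run_id }} for artifacts."
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```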

This workflow can also be triggered for pull requests, but to safeguard against potential attacks, we’ve introduced maintainer-approval requirements and a `pull_request_target` trigger to ensure that even on external contributors’ pull requests, the executed workflow is the one from our main branch.

Running the Copilot CLI requires a PAT, which is securely stored as an environment secret within our repository. Each run requires maintainer approval to prevent leaks, except for the automated weekly runs, which execute strictly from our repo's `main` branch.

What we found: Bugs that matter

Since putting this system in place, we’ve conducted over 200 “synthetic user” sessions. The agent uncovered 18 distinct problems, including serious environmental issues and various documentation discrepancies. Rectifying these errors has enhanced the documentation for all users, not just the AI.

AI as a force multiplier

In discussions about AI, we often encounter assertions that it will replace human jobs. Yet in this instance, AI has provided us with a workforce we never had.

Reproducing what our system accomplishes, validating six tutorials across fresh environments each week, would require a dedicated QA resource or a substantial budget for manual testing. That is simply infeasible for a four-person team. By implementing these Synthetic Users, we've effectively engaged a relentless QA engineer who operates around the clock.

Our tutorials are now verified weekly by synthetic users. Try the Getting Started guide for yourself to see the outcomes firsthand. If you're grappling with similar documentation drift in your own project, consider utilising the GitHub Copilot CLI not merely as a coding aid but as a fully-fledged agent: give it a prompt, a container, and a goal, and let it accomplish what manual effort can't.
