Multi-Tenant Architecture: Real Challenges and an Azure Design Walkthrough

Let’s dive into a commonly used reference design for Azure-based systems.

A typical setup may look like this:

  • Microsoft Entra External ID (or its predecessor, Azure AD B2C) for authentication
  • Azure API Management as the entry layer
  • App Service or Azure Functions for compute
  • Cosmos DB or Azure SQL for data storage
  • Redis for caching
  • Service Bus for asynchronous processing
  • Application Insights for monitoring performance

If you’ve worked with Azure before, none of this should come as a surprise.
On paper, this architecture is tidy, scalable, and designed for multiple tenants.

However, once traffic grows and tenants start behaving differently from one another, problems appear that the diagram doesn’t show.

Here’s what I frequently observe:

  • The tenant ID is present in the API but missing in asynchronous processes
  • Background jobs process data without knowing which tenant it belongs to
  • Logs become ineffective as you struggle to link actions back to a tenant

The solution seems straightforward but is often overlooked during implementation:

Every message should carry the tenant context, without exception.

If you think “it’ll be available somewhere,” chances are it won’t be, especially in distributed systems.

Explicitly include tenant context everywhere:

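// Envelope shared by every queue message and event.
public class TenantMessage
{
    // Tenant that owns the payload; set by the producer, never inferred by the consumer.
    public string TenantId { get; set; } = string.Empty;
    public string Payload { get; set; } = string.Empty;
}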

Every message, event, and asynchronous task must contain the tenant scope.
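
To make "without exception" enforceable rather than a convention, one option is an envelope that cannot be constructed without a tenant ID, plus an ambient context that consumers open while handling a message. The sketch below is only an illustration: `TenantEnvelope` and `TenantContext` are hypothetical names, not part of any Azure SDK, and it assumes handlers wrap their work in a `using` block.

```csharp
#nullable enable
using System;
using System.Threading;

// Hypothetical envelope: construction fails if the tenant ID is missing.
public sealed class TenantEnvelope
{
    public string TenantId { get; }
    public string Payload { get; }

    public TenantEnvelope(string tenantId, string payload)
    {
        if (string.IsNullOrWhiteSpace(tenantId))
            throw new ArgumentException("Every message must carry a tenant ID.", nameof(tenantId));
        TenantId = tenantId;
        Payload = payload ?? throw new ArgumentNullException(nameof(payload));
    }
}

// Hypothetical ambient context for consumer-side code (background jobs, handlers).
public static class TenantContext
{
    private static readonly AsyncLocal<string?> CurrentTenant = new();

    // Throws instead of returning null, so a missing tenant fails loudly.
    public static string Current =>
        CurrentTenant.Value ?? throw new InvalidOperationException("No tenant in scope.");

    // Sets the tenant while a message is handled; disposing restores the previous value.
    public static IDisposable Begin(TenantEnvelope message)
    {
        var previous = CurrentTenant.Value;
        CurrentTenant.Value = message.TenantId;
        return new Scope(previous);
    }

    private sealed class Scope : IDisposable
    {
        private readonly string? _previous;
        public Scope(string? previous) => _previous = previous;
        public void Dispose() => CurrentTenant.Value = _previous;
    }
}
```

A handler would then run inside `using (TenantContext.Begin(message)) { ... }`, and any code that reads `TenantContext.Current` outside such a block throws instead of silently processing data with no tenant attached.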

Many teams start with a shared database model that features tenant-based partitioning.
This approach works well initially.

However, over time, problems can start to emerge:

    • A tenant filter is forgotten in a query
    • A query unexpectedly scans across multiple partitions
    • A large tenant begins to hinder performance for others

A simple query like this becomes essential:

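// Assumes the container is partitioned by tenantId; replace dynamic with your document type.
var query = container.GetItemQueryIterator<dynamic>(
    new QueryDefinition("SELECT * FROM c WHERE c.tenantId = @tenantId")
        .WithParameter("@tenantId", tenantId),
    // Pinning the partition key keeps the query from fanning out across partitions.
    requestOptions: new QueryRequestOptions { PartitionKey = new PartitionKey(tenantId) }
);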

The challenge lies not just in writing it once, but in ensuring it’s applied everywhere, every time.
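
One way to make the filter hard to forget is to apply it in a single place and hand callers only the already-filtered view. The following is a minimal sketch over `IQueryable<T>`, not a Cosmos DB-specific implementation; `ITenantOwned`, `TenantScopedQuery<T>`, and the sample `Order` record are hypothetical names introduced for illustration.

```csharp
using System;
using System.Linq;

// Hypothetical marker interface for anything stored per tenant.
public interface ITenantOwned
{
    string TenantId { get; }
}

// Hypothetical sample document type.
public sealed record Order(string TenantId, string Id) : ITenantOwned;

// Hypothetical wrapper: data can only be read through a tenant-bound query.
public sealed class TenantScopedQuery<T> where T : ITenantOwned
{
    private readonly IQueryable<T> _source;
    private readonly string _tenantId;

    public TenantScopedQuery(IQueryable<T> source, string tenantId)
    {
        if (string.IsNullOrWhiteSpace(tenantId))
            throw new ArgumentException("A tenant ID is required.", nameof(tenantId));
        _source = source ?? throw new ArgumentNullException(nameof(source));
        _tenantId = tenantId;
    }

    // The tenant filter is applied here once; callers add their own predicates on top.
    public IQueryable<T> Items
    {
        get
        {
            var tenantId = _tenantId; // local copy keeps the expression tree simple
            return _source.Where(item => item.TenantId == tenantId);
        }
    }
}
```

Application code would receive a `TenantScopedQuery<Order>` rather than the raw source, so a query written in a hurry still cannot see another tenant's rows.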

At the start, access control seems easy:

“Users can access data tied to their own tenant.”

But as requirements expand:

  • Admin access becomes necessary
  • Cross-tenant visibility is requested
  • Reporting across various firms or regions is needed

This is when things often become complicated.

Different services may begin to implement their own rules, leading to inconsistent behaviours over time.

A simple check like this:

public bool CanAccess(string userTenant, string resourceTenant, bool isGlobalAdmin)
{
    if (isGlobalAdmin) return true;
    return userTenant == resourceTenant;
}

…becomes much harder to manage when it’s duplicated across several services.

One effective strategy is to centralise your authorization logic from the start.
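
As a rough sketch of what "centralised" can mean in code, the rules can live in one shared policy class that every service calls, so that admin access and cross-tenant grants are defined exactly once. `Caller`, `DelegatedTenantIds`, and `TenantAccessPolicy` are hypothetical names, and the delegation rule is an assumption about how cross-tenant visibility might be modelled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical description of the authenticated caller.
public sealed record Caller(
    string TenantId,
    bool IsGlobalAdmin,
    IReadOnlyCollection<string> DelegatedTenantIds);

// Hypothetical single source of truth for tenant access decisions.
public static class TenantAccessPolicy
{
    public static bool CanAccess(Caller caller, string resourceTenantId)
    {
        if (caller is null) throw new ArgumentNullException(nameof(caller));

        // Deny by default when the resource's tenant is unknown.
        if (string.IsNullOrEmpty(resourceTenantId)) return false;

        if (caller.IsGlobalAdmin) return true;

        if (string.Equals(caller.TenantId, resourceTenantId, StringComparison.Ordinal))
            return true;

        // Explicit cross-tenant grants (for example, reporting across firms).
        return caller.DelegatedTenantIds.Contains(resourceTenantId);
    }
}
```

Keeping the cross-tenant rule as data on the caller, rather than as extra `if` branches in each service, means a new reporting requirement changes who holds a grant, not how every service checks it.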

Caching often gets integrated later to boost performance.

This is when risks can arise.

I’ve seen cached data from one tenant served to another because the cache key didn’t include the tenant ID.

Addressing this is simple:

public string BuildCacheKey(string tenantId, string key)
{
    return $"{tenantId}:{key}";
}

Always ensure cache keys include tenant identifiers.
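
Rather than relying on every caller to build the key correctly, the prefixing can be hidden behind a tenant-bound wrapper. The sketch below uses an in-memory `ConcurrentDictionary` as a stand-in for the shared cache; `TenantCache<TValue>` is a hypothetical name. It also length-prefixes the tenant ID, because a plain `{tenantId}:{key}` format can collide if a tenant ID itself contains a colon.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical tenant-bound view over a cache that all tenants share.
public sealed class TenantCache<TValue>
{
    private readonly ConcurrentDictionary<string, TValue> _sharedStore;
    private readonly string _prefix;

    public TenantCache(ConcurrentDictionary<string, TValue> sharedStore, string tenantId)
    {
        if (string.IsNullOrWhiteSpace(tenantId))
            throw new ArgumentException("A tenant ID is required.", nameof(tenantId));
        _sharedStore = sharedStore ?? throw new ArgumentNullException(nameof(sharedStore));

        // Length prefix keeps tenant "a:b" + key "c" distinct from tenant "a" + key "b:c".
        _prefix = $"{tenantId.Length}:{tenantId}:";
    }

    // Callers pass only the logical key; the tenant prefix is added here.
    public TValue GetOrAdd(string key, Func<TValue> factory) =>
        _sharedStore.GetOrAdd(_prefix + key, _ => factory());
}
```

With this shape, code that holds a `TenantCache<T>` for one tenant has no way to name another tenant's entries.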

All tenants share various resources:

  • Computational power
  • Database capacity
  • Messaging services

In practice, this leads to:

  • One heavily loaded tenant affecting the performance of others
  • Unpredictable latency
  • Behaviour that differs from one tenant to the next

You might start implementing controls like this:

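// RequestsPerTenant: per-tenant counter for the current window (reset elsewhere).
if (RequestsPerTenant.TryGetValue(tenantId, out var count) && count > 100)
{
    // 429 Too Many Requests tells the caller to back off.
    return StatusCode(429);
}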

Gradually, you may develop:

  • Throttling mechanisms
  • Workload isolation strategies
  • Resource prioritisation
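
The threshold check above leaves open where the counter comes from and when it resets. A minimal sketch of one possible answer is a fixed-window limiter keyed by tenant. `TenantRateLimiter` is a hypothetical class, the counts live in process memory (so each instance of the service would enforce its own limit), and the injectable clock exists only to make the behaviour testable.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;

// Hypothetical in-memory fixed-window limiter, one counter per tenant.
public sealed class TenantRateLimiter
{
    private readonly int _limit;
    private readonly TimeSpan _window;
    private readonly Func<DateTimeOffset> _clock;
    private readonly ConcurrentDictionary<string, (DateTimeOffset WindowStart, int Count)> _counters = new();

    public TenantRateLimiter(int limit, TimeSpan window, Func<DateTimeOffset>? clock = null)
    {
        if (limit <= 0) throw new ArgumentOutOfRangeException(nameof(limit));
        if (window <= TimeSpan.Zero) throw new ArgumentOutOfRangeException(nameof(window));
        _limit = limit;
        _window = window;
        _clock = clock ?? (() => DateTimeOffset.UtcNow);
    }

    // Returns false once the tenant has used up its allowance for the current window.
    public bool TryAcquire(string tenantId)
    {
        var now = _clock();
        var entry = _counters.AddOrUpdate(
            tenantId,
            _ => (now, 1),
            (_, current) => now - current.WindowStart >= _window
                ? (now, 1)                                    // window expired: start a new one
                : (current.WindowStart, current.Count + 1));  // same window: count the request
        return entry.Count <= _limit;
    }
}
```

A controller could call `TryAcquire(tenantId)` and return `StatusCode(429)` when it yields false. A fixed window allows short bursts at window boundaries, which is a known trade-off of this simple approach.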

This challenge is less about design and more about operational realities.

Logging generally functions well until you scale.

Then, you might find:

  • Logs from all tenants become jumbled
  • Debugging slows down significantly
  • Answering basic questions like “which tenant encountered issues?” becomes difficult

A minor adjustment can make a significant difference:

_logger.LogInformation(
    "Tenant={TenantId} Action=ProcessOrder OrderId={OrderId}",
    tenantId,
    orderId
);

While this approach seems obvious, it’s often applied inconsistently across services.

Taking backups is straightforward.

However, restoring a single tenant can be challenging.

In many shared database setups:

  • Restores occur at the database level
  • This impacts all tenants

If one tenant experiences a problem, recovery isn’t simple.

This highlights how decisions made early on can have lasting impacts.

Designing a multi-tenant system isn’t solely about selecting Azure services.

The real challenges can be boiled down to:

  • How tenant context is managed
  • How isolation is enforced
  • How systems operate under uneven loads

Most issues won’t surface immediately.
They typically emerge as tenants grow and exhibit different behaviours.

 

If you’re interested in exploring these concepts further, here are some useful official resources:
