
Friends Don’t Let Friends Run Loops Sequentially

Tamara Gaidar, Data Scientist, Defender for Cloud Research

Maor Nissan, Principal Data Scientist, Defender for Cloud Research

This story might resonate with many of you working in data science. We received a request to run a quick comparative analysis and share our findings with stakeholders. However, the dataset we were given lacked crucial information, so we had to enrich it first.

Since the data was mostly textual, we opted for LLM-based classifiers to supplement the missing details. Initially, the dataset seemed relatively small at around 3.5K rows. But when we started using our first classifier, we ran into a hurdle: each inference call (using o3-mini with a prompt for every row) took about 2.5 seconds. That meant processing the whole dataset would take nearly 146 minutes.

This was far too slow for an analysis expected to wrap up within a few working days. We also needed to run the classification multiple times to extract different fields, and anyone who has worked with prompts knows the first attempt is rarely perfect, so even more runs would be required.

It became clear that we needed to rethink our strategy. This is where parallel processing can transform what appears to be a straightforward task. Let’s dive deeper into the problem we encountered.

When managing relatively small datasets, data scientists often overlook the need for scale or parallel inference—simply because it doesn’t seem relevant at first. However, this raises an important question: why are we enduring such significant inference delays?

There are a few factors contributing to the inference latency per row:

  1. Model choice
    We selected o3-mini because it’s effective at reasoning and has proven reliable for classification tasks involving security-related text. We chose it knowing it employs internal chain-of-thought reasoning before generating a response. Switching to a quicker model like gpt-4o or gpt-4.1 could indeed reduce latency, but potentially at the cost of quality, which was vital for our needs.
  2. Network round-trip time
    Each row required a complete request cycle: connect → send request → wait for inference → receive response. This adds a typical latency of 200-500 ms due to network overhead, which we couldn’t optimize significantly.
  3. Sequential token generation
LLMs generate tokens one at a time, so longer responses take longer to produce. In our case, however, outputs were brief (typically 1-5 words, such as cloud service names), so this wasn't a major contributor to delay and left little room to optimize.
  4. Server-side load (Azure OpenAI)
    The response time is also influenced by the system load. While this variability exists, it’s outside of our control, so we have to accept it as a limitation.

There are additional considerations, like temperature (higher values introduce more variability and may extend reasoning time) and prompt length (more tokens = more processing time). But the main takeaway is this: even with these constraints accepted, we can still significantly enhance performance.

How? By employing parallel processing.

Let’s break it down. Imagine we’re working with a single-core CPU. The same concepts apply in multi-core environments, but starting simply helps us grasp the basics.

So, what is parallel processing?
In basic terms, it’s executing multiple tasks simultaneously rather than sequentially.

Before we proceed, let’s clarify some essential concepts.

  1. Concurrency vs. Parallelism

Concurrency involves managing multiple tasks at once—think of it as task juggling. Tasks may overlap in their execution timeframe but don’t necessarily run at the exact same moment.

For instance, on a single CPU:

Thread 1: [====work====]                [====work====]
Thread 2:                [====work====]                [====work====]

The CPU rapidly alternates between tasks. They appear to execute at the same time, but only one is running at any given moment.

Parallelism, on the other hand, is about doing multiple tasks at the same time. This requires multiple CPU cores.

  2. CPU-bound vs. I/O-bound

A task is CPU-bound when computation is the limiting factor—the CPU is maximally utilized (close to 100%). In such cases, using multiple cores for true parallelism is the best way to enhance performance.

A task is I/O-bound when it mostly waits on external systems (like networks, disks, or APIs). Here, the CPU often sits idle.

[send]……….waiting for network/disk……….[receive] done

Now, let’s circle back to our original issue: LLM-based inference over a dataset.

This is a classic I/O-bound workload. Each row involved:

  • formatting the prompt
  • sending the request
  • waiting for the model response
  • receiving the output
  • writing the result back

Most of the time was spent waiting rather than computing, meaning the CPU sat underutilized while we waited for the external API to respond.

And this is where concurrency becomes beneficial.

Instead of waiting for each request to finish before sending the next, we can send multiple requests at once—keeping the system engaged and dramatically cutting the overall runtime.
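The difference is easy to demonstrate with simulated I/O. In the sketch below, time.sleep stands in for the network and inference wait (the durations and call count are made up for illustration):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_api_call(i: int) -> int:
    """Stand-in for one LLM request: all waiting, no computation."""
    time.sleep(0.1)  # simulated network + inference latency
    return i

start = time.perf_counter()
results = [fake_api_call(i) for i in range(5)]         # one at a time: ~0.5 s
sequential = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(fake_api_call, range(5)))  # overlapped: ~0.1 s
concurrent = time.perf_counter() - start
```

The concurrent version finishes in roughly the duration of a single call, because all the waits overlap instead of stacking up.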

To visualize the differences:

In a basic implementation, requests are made sequentially. Each request waits for a response before the next can be sent. This creates a distinct flow:

Request 1 → wait → Response 1
Request 2 → wait → Response 2
Request 3 → wait → Response 3

However, when we introduce concurrency, multiple requests go out nearly at the same time. From the client side, it looks like a series of outgoing calls rather than a queue.

But here’s the interesting part: while requests originate from a single machine, they’re processed on the Azure side by a distributed system capable of handling multiple requests simultaneously.

This creates a mismatch:

  • Client side: can send many requests at once
  • Server side (Azure OpenAI): processes them in parallel, but within strict limits

A new bottleneck appears—not latency per request, but rate limits.

Azure enforces limits like:

  • Tokens per minute (TPM)
  • Requests per minute (RPM)

This means even though multiple requests can be processed together, there’s a cap on total throughput.

In short, the system shifts from being latency-bound to being throughput-bound.

Figure: Parallelism doesn’t remove limits – it helps you hit them faster.

Next, let’s explore how we applied this method in Python.

We used ThreadPoolExecutor, which is well suited to I/O-bound tasks like API calls to LLMs.

We started by defining a prompt template for our classification task:
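The exact prompt wasn't preserved in this copy of the post, but a minimal sketch of such a template might look like the following (the wording and the {row_text} placeholder are our own illustration):

```python
# Illustrative prompt template; the real prompt was tuned over several
# iterations. {row_text} is filled in per DataFrame row.
PROMPT_TEMPLATE = """You are a security analyst. Identify the cloud service
referenced in the text below. Answer with the service name only (1-5 words).

Text:
{row_text}
"""

def build_prompt(row_text: str) -> str:
    """Fill the template with one row's text."""
    return PROMPT_TEMPLATE.format(row_text=row_text)
```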

We then set up a function to process a single row. This function would run in parallel across several threads:
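A sketch of such a per-row worker, assuming the openai package's chat-completions interface (client.chat.completions.create); the retry/backoff policy and the (index, label) return shape are our own illustration:

```python
import time

def process_row(index, row_text, client, model="o3-mini", max_retries=3):
    """Classify one row; returns (index, label) so results can be matched
    back to the DataFrame regardless of completion order."""
    prompt = f"Identify the cloud service in the text below (1-5 words):\n\n{row_text}"
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return index, response.choices[0].message.content.strip()
        except Exception:
            if attempt == max_retries - 1:
                return index, None      # give up on this row; keep the rest
            time.sleep(2 ** attempt)    # simple exponential backoff before retrying
```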

Making the Pipeline Resilient

Before running our parallel jobs, we introduced a crucial improvement: checkpointing.

If the process encounters an error (which is quite common when working with APIs), we want to ensure that we don’t lose our progress.
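A minimal sketch of that checkpointing, using a JSON file keyed by row index (the file name and format are illustrative; appending to a CSV would work just as well):

```python
import json
import os

CHECKPOINT_PATH = "classification_checkpoint.json"  # illustrative file name

def load_checkpoint(path=CHECKPOINT_PATH):
    """Return {row_index: label} for rows already classified, if any."""
    if os.path.exists(path):
        with open(path) as f:
            return {int(k): v for k, v in json.load(f).items()}
    return {}

def save_checkpoint(results, path=CHECKPOINT_PATH):
    """Persist partial results so an API failure mid-run doesn't lose progress."""
    with open(path, "w") as f:
        json.dump(results, f)
```

On a restart, we load the checkpoint first and only submit the rows that aren't already in it.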

Running in Parallel

Now, let’s focus on the main part: executing multiple requests at the same time.
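A sketch of the fan-out loop with concurrent.futures.ThreadPoolExecutor. It assumes a classify_fn(index, text) -> (index, label) worker and an optional on_checkpoint callback; those names and the skip-already-done logic are our own illustration:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def classify_all(rows, classify_fn, max_workers=100,
                 results=None, on_checkpoint=None, checkpoint_every=50):
    """Fan rows out across worker threads.

    rows: iterable of (index, text); classify_fn(index, text) -> (index, label).
    results: labels already recovered from a checkpoint; those rows are skipped.
    """
    results = {} if results is None else dict(results)
    pending = [(i, t) for i, t in rows if i not in results]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(classify_fn, i, t) for i, t in pending]
        for done, future in enumerate(as_completed(futures), start=1):
            i, label = future.result()
            results[i] = label
            if on_checkpoint and done % checkpoint_every == 0:
                on_checkpoint(results)  # periodically persist partial results
    return results
```

With max_workers=100 the client can keep roughly a hundred requests in flight at once, which is exactly where the rate-limit wall discussed above starts to matter.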

Finally, we save the completed results:
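A stdlib-only sketch of that final save (the column names and file name are illustrative; in practice this is just a DataFrame column assignment plus to_csv):

```python
import csv

def save_results(rows, results, out_path="classified_dataset.csv"):
    """Write each (index, text) row together with its predicted label."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["index", "text", "label"])
        for i, text in rows:
            writer.writerow([i, text, results.get(i, "")])
```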

Let’s evaluate the performance improvements we achieved.

For demonstration purposes, we ran the classification task on a sample of 50 rows to highlight time savings.

The overall runtime for 50 samples, averaging 2.53 seconds per API call, was 126.4 seconds. With parallel processing using 100 worker threads, the total time was only 6.2 seconds.

SPEEDUP: 20.4x faster through parallel processing!

Time saved: 120.2 seconds (95% reduction)

EXTRAPOLATION TO FULL DATASET (3412 rows):

Estimated sequential time: 143.7 minutes

Estimated parallel time:   7.1 minutes

Estimated time saved:      136.7 minutes

We suggest experimenting with the number of parallel workers/threads. It can be advantageous to push until you hit the “rate-limit wall” where adding more threads ceases to help, then scale back.

A common question arises: what are the implications for cloud costs?

The reality is: the bill doesn’t change. Whether we process the DataFrame rows sequentially or in parallel, we make the same 3,412 API calls and consume the same number of tokens, and billing is based on tokens, not on time or connections. Parallel processing saves time, but it doesn’t save money.

Conclusions:

  • For per-row LLM inference, which is generally I/O-bound, concurrency (not additional CPU) is the most effective way to achieve shorter end-to-end runtimes.
  • Use a thread pool (or async I/O) together with checkpointing to speed up lengthy enrichment pipelines while making them restart-safe.
  • Adjust parallelism to align with your Azure OpenAI RPM/TPM limits—once you reach the “rate-limit wall,” additional workers won’t assist and may increase errors.
  • Parallel processing affects throughput and time-to-result but not token-based expenses: you pay for tokens, not wall-clock minutes.
