
The Highs and Lows of Bug-hunting with LLMs

Azure Functions Does What With Dates??

After ten months of development, my financial data analysis application entered its production pilot. The system processes CSV files of P&L data for every location in the company, rolls up the results, and uses LLMs to generate site‑specific financial insights.  

In production, the very first run produced an unexpected problem: the analysis window for the roll‑up data was shifted one month earlier than intended. The application had explicit logic to keep the last 24 months of data and to shift the window back one month only if the run occurred before the 15th of the current month. This run was on the 15th at 8 AM EDT, so the shift shouldn't have happened.
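The intended behavior reduces to a small amount of date arithmetic. Here is a minimal sketch with illustrative names, not the application's actual code:

```python
from datetime import date

def month_shift(d: date, months: int) -> date:
    """First day of the month `months` before the month containing `d`."""
    total = d.year * 12 + (d.month - 1) - months
    return date(total // 12, total % 12 + 1, 1)

def analysis_window(run_date: date, months: int = 24) -> tuple[date, date]:
    """Boundaries for the roll-up: the last `months` months of data,
    shifted back one extra month when the run happens before the 15th."""
    end = run_date.replace(day=1)   # exclusive upper bound: first day of the current month
    if run_date.day < 15:           # before the 15th: shift the window back one month
        end = month_shift(end, 1)
    start = month_shift(end, months)
    return start, end

# On the 15th the extra shift should not apply:
# analysis_window(date(2025, 9, 15)) -> (date(2023, 9, 1), date(2025, 9, 1))
```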

On the surface, this looked like the kind of classic off‑by‑one bug you’d expect to quickly identify in the roll‑up function. In reality, it turned out to be a far more esoteric problem — one I had never encountered in decades of Python development.  

Why This Escaped Testing  

Prior to deployment, the majority of testing for this code was done using a CLI‑based test harness. This harness was designed to run individual pieces of the overall data flow with controlled inputs, making it extremely convenient for development.  

Production execution, however, is completely different. The system runs as a daemon process, with various parts of the data ingestion and analysis pipeline triggered asynchronously by file uploads into different Azure blob storage paths. That difference, interactive CLI invocations versus a long‑running daemon process, ended up being a critical factor.  

The Problem  

One particular module in the execution flow calculates the time period boundaries used to filter input data before aggregation. Part of that calculation is determining which month and year count as “last month,” based on the current date at runtime.

In the pilot-production environment, because the application process runs continually inside Azure Functions, this module was imported once when the function host started, and its calculation logic was executed immediately during import. That meant the “current date” was captured at module import time, not at the moment the blob‑triggered job ran.  
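That failure mode boils down to something like the following contrived sketch (not the production module): the date is captured at module level, so it is evaluated once when the Functions host imports the file.

```python
# period_bounds.py -- contrived sketch of the failure mode, not the production module
from datetime import date, timedelta

TODAY = date.today()                   # BUG: captured once, when the Functions host imports the module
SHIFT_BACK_ONE_MONTH = TODAY.day < 15  # frozen for the life of the worker process

def month_cutoff() -> date:
    """Called on every blob-triggered run, but built from the stale import-time values."""
    cutoff = TODAY.replace(day=1)                             # first day of the "current" month
    if SHIFT_BACK_ONE_MONTH:
        cutoff = (cutoff - timedelta(days=1)).replace(day=1)  # back one more month
    return cutoff
```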

[Diagrams: module import and date capture in the CLI harness vs. the long‑running daemon process]

If the function host first imported this module on, say, the 14th, the time period boundaries would include a “shift back one month” adjustment, and that stale value would persist for all subsequent runs in that process.  

In the CLI harness, the module was imported fresh each time it executed, so the date logic always reflected the actual run date. That’s why this bug never showed up pre‑deployment.  

The Role of the LLM in Finding the Solution

I used an LLM tool (Cursor) to help identify the root cause. Initially, it wasn’t much help; it focused heavily on time zone conversions, building an elaborate theory about UTC versus EDT. That path turned out to be completely unrelated to the problem.  


After several iterations, however, the LLM examined the import‑time calculation more closely and surfaced the environment‑specific behavior: Azure Functions’ reuse of Python processes between invocations can cause modules to retain state set at import time. This was exactly what was tripping us up.  

While the solution, explicitly recalculating the time periods at the start of each job run, was straightforward once identified, the path to it would have been much longer without the LLM pointing me in the right direction.
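In sketch form, against the Azure Functions v2 Python programming model (the blob path, connection setting, and period_logic module name are illustrative, not the application's actual names), the date‑dependent work now lives inside the triggered function:

```python
import logging
from datetime import date

import azure.functions as func

# Hypothetical module holding the per-run version of analysis_window() sketched earlier.
from period_logic import analysis_window

app = func.FunctionApp()

@app.blob_trigger(arg_name="blob", path="pnl-uploads/{name}",
                  connection="AzureWebJobsStorage")
def process_pnl(blob: func.InputStream):
    # Recompute the window on every invocation; nothing date-dependent sits at
    # module level, so worker-process reuse can no longer serve a stale value.
    start, end = analysis_window(date.today())
    logging.info("Filtering %s to window %s .. %s", blob.name, start, end)
    # ... CSV parsing, roll-up, and LLM analysis continue from here ...
```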

Lessons Learned  

  1. LLMs in debugging: They can be indispensable for locating unusual, environment‑specific bugs in complex software. They can also waste a lot of time chasing tangents in areas they don’t fully understand, such as date and time handling.
  2. Testing harnesses vs production reality: Non‑production interfaces, like CLI harnesses, are excellent for iterating on individual functions, but they can also mask inconsistencies that only surface under true production conditions. A harness doesn’t fully replicate a daemon’s lifecycle, process reuse, or trigger mechanisms. All of these factors can subtly alter program behavior.

Dénouement

While the hallucinated and sometimes incomprehensible output from an LLM can be frustrating, perseverance often produces surprisingly cogent and insightful responses. After reproducing this problem in a lab setting, I told a colleague that I doubted I would have been able to locate the bug’s source without the LLM pointing me to it.

Looking back a day later, I probably would have found the source eventually, but how many additional hours would it have taken? Even with its meandering, illogical early responses, the LLM helped me identify this bug within about an hour, including the time I spent poking around between prompt attempts.

This isn’t an advertisement for Cursor specifically; you could probably get similar results from any of the other popular AI‑driven IDEs or console‑based AI development tools. But given the time savings in this one instance alone, a subscription to one of these services is probably worth the cost.

Bugs happen. The faster we can identify and resolve them, the better we look to our customers.

 

Jared Sutton

Senior Engineer - Automation and Security
