Skip to main content

Airflow PostgresHook Multi-Statement SQL Silently Drops Results? Split by Semicolon and Execute One by One

· 6 min read

When an Airflow DAG reads a .sql template file as a single string and passes it to PostgresHook.get_pandas_df(), prior SELECT results are silently dropped — the DAG reports "SQL query returned no results", but copying the same SQL into psql returns data normally.

Encountered this while building AI Analytics — an LLM-powered analytics pipeline where an Airflow DAG reads multi-query report templates from .sql files and executes them.

TL;DR

PostgresHook.get_pandas_df(sql) internally calls pandas.read_sql(sql, conn)psycopg2 cursor.execute(sql). When sql is a single string with multiple ;-separated SELECTs, the DBAPI only exposes the cursor of the last result set — prior query results are silently dropped with no error. Fix: split by top-level semicolons into a list[str] and call get_pandas_df per statement, or pass the list directly so DbApiHook runs them sequentially.

Symptom

The DAG task executing shop_monthly_overview.sql reports "SQL query returned no results":

sql_count = 1   ← template clearly contains 4 queries
result = "❌ SQL query returned no results"

But the same SQL pasted into psql against the same database with the same parameters returns data for all 4 SELECTs.

Reproduction

Verify get_pandas_df behavior with multi-statement SQL directly inside the Airflow container:

from airflow.providers.postgres.hooks.postgres import PostgresHook

hook = PostgresHook(postgres_conn_id="postgres_default")

# Three SELECTs concatenated into one string
sql = "SELECT 1 AS a; SELECT 2 AS b; SELECT 99 AS c WHERE 1=0;"

df = hook.get_pandas_df(sql)
print(df.columns.tolist()) # ['c'] ← only got the last statement's columns
print(df) # Empty ← the last statement itself returns 0 rows

Expected three result sets, got only the last one (SELECT 99 ... WHERE 1=0, 0 rows). The first two completely disappear with no error or warning.

Root Cause

The call chain is PostgresHook.get_pandas_dfDbApiHook.get_pandas_dfpandas.io.sql.read_sqlpsycopg2 cursor.execute(sql).

The DBAPI protocol (PEP 249) allows execute to accept a string with multiple ;-separated statements. PostgreSQL executes all of them, but the cursor only exposes the last result set — this is inherent PostgreSQL wire protocol behavior, not an Airflow or pandas bug.

┌────────────────────────────────────────────────────┐
│ SELECT 1; ← executed, result set 1 dropped at once │
│ SELECT 2; ← executed, result set 2 dropped at once │
│ SELECT 99 WHERE 1=0; ← executed, result set 3 exposed │
└────────────────────────────────────────────────────┘

pandas.read_sql only fetches result set 3

The root cause in our code is task_execute_sql reading the entire .sql file as a single string and passing it to get_pandas_df:

# ❌ Problematic code
sql_text = open(sql_path).read() # full text with 4 SELECTs
df = pg_hook.get_pandas_df(sql_text) # only gets the last result

Why does psql return data? Because the psql frontend actively iterates through all result sets and prints them one by one, while a DBAPI cursor does not.

Solution

Suitable for .sql template files — they contain comments, quotes, and multiple queries that need robust splitting.

def split_sql_statements(sql: str) -> list:
"""
Split SQL by top-level semicolons, correctly handling:
- Semicolons inside single-quoted strings ('a;b' is not split)
- SQL-standard '' escape ('it''s' is not split)
- Semicolons inside -- line comments (-- note; not split is not split)
"""
statements = []
buf = []
i, n = 0, len(sql)
in_quote = False

while i < n:
ch = sql[i]

# Inside a single-quoted string
if in_quote:
buf.append(ch)
if ch == "'":
# '' = literal single quote, does not end the string
if i + 1 < n and sql[i + 1] == "'":
buf.append(sql[i + 1])
i += 2
continue
in_quote = False
i += 1
continue

# Top level
if ch == "'":
in_quote = True
buf.append(ch)
elif ch == '-' and i + 1 < n and sql[i + 1] == '-':
# Line comment, swallow to end of line (; inside is not a split point)
while i < n and sql[i] != '\n':
buf.append(sql[i])
i += 1
continue
elif ch == ';':
stmt = ''.join(buf).strip()
if stmt:
statements.append(stmt)
buf = []
i += 1
continue
else:
buf.append(ch)
i += 1

# Trailing block without a final semicolon
stmt = ''.join(buf).strip()
if stmt:
statements.append(stmt)

return statements


# Caller
sql_text = open(sql_path).read()
statements = split_sql_statements(sql_text)

# Execute one by one, collect all results
all_results = []
for idx, stmt in enumerate(statements, start=1):
df = pg_hook.get_pandas_df(stmt)
if not df.empty:
all_results.append({
"sql_index": idx,
"sql": stmt,
"data": df.to_dict("records"),
"columns": df.columns.tolist(),
"row_count": len(df),
})

Option B: Pass a list directly to DbApiHook

Airflow's DbApiHook.run and get_records accept list[str] and execute sequentially — but get_pandas_df return behavior in list mode is inconsistent across providers. For production, Option A gives you full control.

Why not sqlparse.split?

Community answers often recommend sqlparse.split(sqlparse.format(sql, strip_comments=True)), but strip_comments=True discards comments. If your downstream processor depends on metadata in comments (e.g. -- dimension: shop), you lose context. A hand-rolled splitter preserves the original comment text and gives you control.

Caveats

Caveats

  • Do not use sql.split(';') — it will mis-cut semicolons inside quoted strings like WHERE name = 'a;b', and inside -- comment; line comments
  • split_sql_statements only handles single-quoted strings and -- line comments; if your SQL uses /* block comments */ or dollar-quoted strings ($$...$$), extend the splitter
  • After the fix, the semantics of sql_index for downstream processors change (1-based sequential index); audit all df.iloc[sql_index] style usages
  • If your SQL is program-generated rather than file-read, the safer pattern is to build a list at generation time rather than split later
  • A related trap: if you've also hit issues with SQL expressions being silently parameterized in Drizzle ORM, see Drizzle sql template mixing parameterized values with SQL expressions — same family of "the framework did a transformation you didn't expect" bugs

FAQ

How do I execute multiple SQL statements in Airflow PostgresHook?

Pass list[str] instead of a single string. DbApiHook.get_pandas_df and run accept sql as a list and execute sequentially; a single string with semicolon-separated statements causes psycopg2 to return only the last result set. For production, split yourself and call per-statement so you control result aggregation and sql_index.

Why does get_pandas_df only return the last result for multi-statement SQL?

pandas.io.sql.read_sql calls psycopg2 cursor.execute with the full string; the DBAPI protocol only exposes the cursor of the last result set for multi-statement execution, and prior SELECT results are dropped by the server immediately, with no error or warning. psql returns data because its frontend actively iterates all result sets, while a DBAPI cursor does not.

How do I split SQL by semicolon safely with comments and quotes?

Scan character by character and split only at top-level semicolons outside single-quoted strings and -- line comments. Single-quote literals use the SQL-standard '' escape; do not use str.split(';'), it will mis-cut semicolons inside comments and strings. If you use sqlparse.split, note that strip_comments=True discards the original comment text.


CCLEE

Independent developer, 24 years in e-commerce, focused on grounding AI in real business scenarios.

Work with me

dotenv silently truncates values at #? Wrap .env values in double quotes

· 4 min read

Encountered this while building AI Ops — LLM-powered analytics that surfaces market trends, user behavior, and sales data for precise operational strategy.

TL;DR

dotenv treats # in unquoted values as an inline comment. KEY=value#hash is actually loaded as value, with #hash dropped — no warning, no error. Fix: wrap any .env value containing #, spaces, or special characters in double quotesKEY="value#hash".

Symptom

Backend calls to an upstream service keep returning 401 Invalid credentials:

POST /api/v1/dag/trigger → 500
Stack: Airflow JWT auth failed (401): {"detail":"Invalid credentials"}
at getJwtToken (airflow-client.ts)

Investigation shows the password written in .env is 24 chars and contains # and &:

AIRFLOW_PASSWORD=ooGR0^kThVI&ag#RyCpUmbIr

But the value loaded into process.env.AIRFLOW_PASSWORD is only 10 chars long — #RyCpUmbIr is gone. Calling the upstream auth endpoint with the full password from CLAUDE.md returns 201; calling it with the truncated value from .env returns 401. The credentials are fine; the value loaded from .env is truncated.

Root Cause

dotenv follows shell convention: anything after # in an unquoted value is treated as an inline comment.

# .env
AIRFLOW_PASSWORD=ooGR0^kThVI&ag#RyCpUmbIr
# dotenv actually parses:
# AIRFLOW_PASSWORD = "ooGR0^kThVI&ag"
# #RyCpUmbIr ← dropped

This behavior is documented, but there is no warning or log. What the runtime gets is a silently truncated string. Combined with shell-escaping semantics for &, spaces, and $, the bug is even more hidden:

CharacterBehavior when unquoted
#Everything after is treated as inline comment, truncated
(space)Everything after is dropped
$VARTriggers variable expansion (may resolve to empty string)
&Shell background operator; dotenv usually preserves it but it bites again when joined into shell commands

Strong-random strings like JWT_SECRET, API_KEY, and DATABASE_URL frequently contain # — high-risk territory.

Solution

Wrap any value with special characters in double quotes in .env:

# .env
AIRFLOW_PASSWORD="ooGR0^kThVI&ag#RyCpUmbIr"
JWT_SECRET="abc#def$ghi jkl"
DATABASE_URL="postgres://user:p@ss#word@host:5432/db"

Restart the service to apply:

# pm2
pm2 restart analytics-api --update-env

# docker compose
docker compose restart api

# systemd
sudo systemctl restart api

Why it works: when dotenv sees double quotes, it reads the value literally up to the closing quote. #, spaces, and $ are not special-cased (unless you explicitly enable expand). Verify the loaded value immediately after the fix:

// Validate critical env vars at startup to catch truncation early
const required = ['AIRFLOW_PASSWORD', 'JWT_SECRET', 'DATABASE_URL'] as const;
for (const key of required) {
const v = process.env[key];
if (!v || v.length < 16) {
throw new Error(`${key} not loaded correctly (length ${v?.length ?? 0}); check .env quoting`);
}
}

This turns dotenv's silent failure into a startup failure, exposing the bug immediately the next time.

Caveats

  • Single quotes also work, but dotenv does not expand $VAR inside single quotes — it does inside double quotes. For passwords you usually want literal values: prefer double quotes + avoid writing ${...}.
  • dotenv versions: v15+ behaves as described; earlier versions (pre-v8) handle # slightly differently. Check the CHANGELOG before upgrading.
  • Docker / Kubernetes Secrets: variables injected via environment: don't go through dotenv and aren't affected. Only .env files and dotenv.config() paths are.
  • CI environments: GitHub Actions and GitLab CI inject secrets into the env context, also bypassing dotenv.

FAQ

Why does a password with # in .env get shorter?

dotenv treats everything after # as an inline comment by default and drops it. Unquoted KEY=value#hash is loaded as just value, with no error or log. Wrap the value in double quotes — KEY="value#hash" — to preserve the full content.

How do I debug dotenv not working?

Three steps: first confirm dotenv.config() runs before all imports (ES Module imports are hoisted statically — see debugging silent JWT signature failures); then verify .env values have no unescaped # or spaces; finally print process.env.XXX length and characters at startup and diff them char-by-char against the .env source file.

CCLEE

Independent developer, 24 years in e-commerce, focused on grounding AI in real business scenarios.

Work with me

Docker Compose service won't come back after restart? Check the restart policy

· 5 min read

Debugging a Milvus-dependent service that failed to start in a RAG knowledge base project — full writeup below.

TL;DR

After a host reboot (or a container crash), a group of services didn't come back: the app port had no listener and docker ps -a showed everything Exited. The root cause: docker-compose.yml had no restart policy (default no), so once a container died it stayed dead. Fix: set restart: always on every production service so the infrastructure self-heals after a crash or reboot.

Docusaurus scripts Build Fails with inline Script? Only src Is Accepted, Not content

· 5 min read

Dropping a { content: '...' } entry into the scripts array of docusaurus.config.ts to inject an inline script (Baidu Tongji's IIFE, for example) fails npm run build immediately.

Encountered this while building CCLEE Docusaurus Theme — an advanced docs theme built on Docusaurus 3.x, with a purple theme, dark mode, and Tailwind typography enhancements, a production-ready docs site template out of the box.

TL;DR

Docusaurus's scripts config only accepts src, not inline content. Move the inline script into static/js/ and reference it via { src: '/js/xxx.js', async: true }. If the script loads an external domain, also update your CSP.

Symptom

Following Baidu Tongji's official snippet, the instinct is to inline it straight into scripts:

// docusaurus.config.ts
const config: Config = {
scripts: [
{
content: `var _hmt=_hmt||[];(function(){var hm=document.createElement("script");hm.src="https://hm.baidu.com/hm.js?XXXX";var s=document.getElementsByTagName("script")[0];s.parentNode.insertBefore(hm,s);})();`,
},
],
};

The build fails right away:

[ERROR] Error: "scripts[1]" is invalid.
A script must be a plain string (the src), or an object with at least a "src" property.
at validateConfig (.../configValidation.js:397:15)

Root Cause

The scripts config is validated at build time by validateScripts, entry by entry. Only two shapes are allowed:

  1. A plain string: treated as src
  2. An object: must contain a src property; async, defer, data-* are optional

By design, scripts only emits tags of the form <script src="..." /> and reserves no content / innerHTML field for inline scripts. So no matter how short your inline content is, validation throws as soon as src is missing. Local builds and Vercel remote builds fail identically.

To inject an inline script that depends on build-time variables, use the top-level headTags config (tagName: 'script' + innerHTML) instead of scripts.

Solution

1. Put the inline script under static/js/

// static/js/baidu-tongji.js
var _hmt = _hmt || [];
(function () {
var hm = document.createElement('script');
hm.src = 'https://hm.baidu.com/hm.js?XXXX';
var s = document.getElementsByTagName('script')[0];
s.parentNode.insertBefore(hm, s);
})();

Files under static/ are copied as-is to the site root, so the final URL is /js/baidu-tongji.js.

2. Reference it from scripts via src

// docusaurus.config.ts
scripts: [
// Existing Umami setup (see /blog/docusaurus-umami-analytics)
{
src: 'https://tj.ccleeai.com/script.js',
async: true,
'data-website-id': 'xxxx',
},
// Baidu Tongji: via static file
{
src: '/js/baidu-tongji.js',
async: true,
},
],

3. Update CSP accordingly

If your site sets Content-Security-Policy (recommended — see our earlier Umami integration and CSP walkthrough), every newly added external domain must be allowlisted or the script will be blocked by the browser:

themeConfig: {
metadata: [
{
'http-equiv': 'Content-Security-Policy',
// Add https://hm.baidu.com to script-src
// Also allow it in connect-src and img-src (hm.js sends image pixels and fetch reports)
content: "default-src 'self'; " +
"script-src 'self' 'unsafe-inline' 'unsafe-eval' https://tj.ccleeai.com https://hm.baidu.com; " +
"connect-src 'self' https://tj.ccleeai.com https://hm.baidu.com; " +
"img-src 'self' data: https://hm.baidu.com; " +
"style-src 'self' 'unsafe-inline'; " +
"object-src 'none'; base-uri 'self'",
},
],
},

Baidu Tongji's IIFE internally calls document.createElement('script') to inject <script src="hm.baidu.com/...">, so 'unsafe-inline' alone is not enough — hm.baidu.com must be in script-src.

Notes

  • A plain string and src are equivalent in the scripts array: 'https://x/a.js' and { src: 'https://x/a.js' } behave the same
  • A path starting with / is relative to the site root (the copy output of static/), not the filesystem root
  • If the script depends on runtime variables and must be inline, use the top-level headTags, not scripts; headTags supports innerHTML
  • Multiple analytics setups (Umami + Baidu + GA) can coexist, but each external domain must be added to the CSP individually or its reports will be silently blocked

FAQ

How do I configure custom scripts in Docusaurus?

Add entries to the top-level scripts array in docusaurus.config.ts. Each entry is either a string (treated as src) or an object with a src property. Inline content cannot go in a content field — put the script under static/js/ and reference it via src: '/js/xxx.js'. If you also need async, defer, or custom data-* attributes, place them in the same object as src.

Why does Docusaurus scripts not support inline content?

The scripts config emits <script src="..."> tags at build time and has no field reserved for inline scripts. The validator (validateScripts in configValidation.ts) walks each entry and throws A script must be a plain string, or an object with at least a "src" property whenever src is missing, no matter how complete your content is. This is a deliberate API boundary — inline injection is delegated to headTags.

How do I inject inline JavaScript (like Baidu Tongji) in Docusaurus?

The most robust approach is "static file + internal dynamic injection": save the official IIFE snippet to static/js/baidu-tongji.js, reference it from scripts via { src: '/js/baidu-tongji.js', async: true }, and let the script itself document.createElement('script') to load hm.baidu.com/hm.js. Finally, add https://hm.baidu.com to script-src, connect-src, and img-src of your CSP, or reports will be blocked by the browser.

CCLEE

Independent developer, 24 years in e-commerce, focused on grounding AI in real business scenarios.

Work with me

Milvus collection name returns 500? UUID hyphens break the naming rules

· 4 min read

Debugging multi-tenant collection naming in a RAG knowledge base project — full writeup below.

TL;DR

Concatenating f"{tenant_id}_{collection}" to name a Milvus collection breaks when tenant_id is a UUID: the result starts with a digit and contains hyphens -, both of which violate Milvus naming rules and throw code=1100. The rule in one line: first character must be a letter or underscore, only [a-zA-Z0-9_] allowed, no hyphens. Never concatenate a UUID raw — use the original collection name, or convert it to a legal identifier.

Python task marked failed but no error? try/except swallowed the exception

· 5 min read

Debugging a silent failure where a document sync task marked everything failed in a RAG knowledge base project — full writeup below.

TL;DR

A shared method was refactored with a new parameter signature, but one caller was missed. The caller passed arguments under the old contract and threw TypeError — except the call sat inside a try/except that quietly funneled the exception into a failed counter. No crash, no ERROR in the logs, just a number ticking up. These "silent failures" are the hardest bugs to track down. Two fixes: grep all callers after a signature refactor; and make except blocks log or re-raise, never swallow silently.

Node.js AsyncLocalStorage Returns undefined in a Callback? EventEmitter Escapes Its Context

· 5 min read

The request-logging middleware reads AsyncLocalStorage's traceId inside the res.on('finish') callback, and getStore() returns undefined — every response log ends up missing its traceId.

Encountered this while building the ecommerce data collection tool for a client — the server uses ALS to carry each request's traceId through the entire handling chain, but response logs stubbornly refused to correlate. The culprit turned out to be "late callbacks" losing the context.

TL;DR

EventEmitter callbacks like res.on('finish') run outside the async context they were registered in, so als.getStore() can't find the request's store. The most reliable fix is to capture the value into a closure variable during the synchronous segment and use that closure inside the callback; when you need the full store, rebuild the context with als.run(store, fn) inside the callback.

The Problem

An innocent-looking request-logging middleware:

// middleware/requestLog.js
import { als } from '../utils/als.js';

app.use((req, res, next) => {
res.on('finish', () => {
const store = als.getStore();
logger.info({
traceId: store?.traceId, // always undefined in the response log
statusCode: res.statusCode,
}, 'request');
});
next();
});

The middleware order is fine, the traceId is readable everywhere else in the request chain (routes, business logic), but not inside res.on('finish'). The really confusing part: move als.getStore() into the synchronous segment before next(), and it has a value.

Root Cause

AsyncLocalStorage relies on Node's async_hooks to bind the store to the currently active async context and propagate it down the async call chain. The semantics of als.run(store, fn) are: during fn's execution (and any async tasks it spawns), getStore() returns this store.

The problem is EventEmitter. res.on('finish', cb) registers cb as a listener, to be fired by EventEmitter's dispatch loop after the response is sent. The async context that fires cb is the one active where dispatch happens — not the request's context. And since the response is usually sent after the request-handling chain, the als.run scope for that request may already have exited.

So als.getStore() inside cb reads the store of "whatever context is active right now," which doesn't belong to this request — the result is undefined (or worse, a different request's store).

Any callback with "registered in one context, fired in another" has this trap: res.on('finish'), once, some setTimeout/setInterval, chrome.alarms listeners, and so on.

Solution

Two patterns, depending on what you need.

If your callback only needs a couple of values from the store (most often just traceId), the simplest and most robust approach is to capture them during the synchronous segment — when the store is guaranteed alive — into a closure, and use that closure in the callback without ever touching ALS:

app.use((req, res, next) => {
// Synchronous segment: we're inside the als.run scope, getStore() always has a value
const traceId = als.getStore()?.traceId;
const start = Date.now();

res.on('finish', () => {
// Use the closure's traceId, never touch ALS
logger.info({
traceId, // reliably present
statusCode: res.statusCode,
durationMs: Date.now() - start,
}, 'request');
});

next();
});

This swaps the uncertainty of "is the async context still alive" for a deterministic closure reference. When the callback fires no longer matters — the value is already captured.

Pattern B: rebuild the context with als.run

When the callback invokes a blob of code that internally depends on getStore() (a logger mixin, Sentry scope injection), rewriting each call to use a closure is impractical. Instead, rebuild the context at the callback's entry:

res.on('finish', () => {
const traceId = capturedTraceId; // value captured in the synchronous segment
if (traceId) {
// Re-establish the ALS context inside the callback so record()'s internal getStore() works
als.run({ traceId }, () => record(res, start));
} else {
record(res, start);
}
});

als.run(store, fn) creates a new, independent async context, binds the store to it, and makes it visible to fn and every async task it spawns. It's safer than als.enterWith — which mutates the "current shared context" and causes cross-talk under concurrency. That's a separate trap, covered in AsyncLocalStorage reads the wrong value under concurrency? Replace enterWith with run.

Caveats

  • To tell whether a callback will lose context, ask whether registration and firing are separated. res.on('finish'), once, and cross-tick setTimeout are suspect; await and fetch().then() inherit naturally along the async chain and need no handling.
  • Prefer pattern A. It reduces the problem to an ordinary closure — best readability, no implicit "context rebuild" behavior. Only reach for pattern B when the callback wraps a lot of existing code that depends on getStore().
  • Don't patch this with als.enterWith in the callback — it mutates the shared parent context under concurrency and causes cross-talk, a far harder bug to diagnose than a lost context.

FAQ

Why can't I read AsyncLocalStorage inside the res.on('finish') callback?

res.on('finish', cb) registers cb as an EventEmitter listener that only fires after the response is sent. The async context active when it fires is the dispatch context, not the request's, and the request's als.run scope may have already exited — so getStore() returns undefined.

How do I make an EventEmitter callback see the AsyncLocalStorage context again?

The simplest way is to capture the value into a closure variable during the synchronous segment and use the closure in the callback. If the callback wraps a lot of code that depends on getStore(), rebuild the context at the callback's entry with als.run(store, fn). Prefer the former; reserve the latter for retrofitting existing logic.

CCLEE

Independent developer, 24 years in e-commerce, focused on grounding AI in real business scenarios.

Work with me

Node.js AsyncLocalStorage Reads the Wrong Value Under Concurrency? Replace enterWith with run

· 6 min read

A BullMQ worker with concurrency: 3 goes live, and the logs and Sentry reports of concurrent jobs are all crossed — job A's error stack lands under job B's traceId, and you spend ages staring at the wrong trace.

Encountered this while building the AI Analytics platform — intelligent analysis of market trends, user behavior, and sales data for ecommerce operations. The backend runs analysis jobs concurrently on BullMQ, each stamping a traceId via AsyncLocalStorage for log correlation, and the moment concurrency ramped up the traceIds started crossing.

TL;DR

als.enterWith(store) mutates the currently active shared parent context, so concurrent tasks overwrite each other when they interleave at an await — the last write wins, and every interleaved task reads the same wrong value. The fix is to switch to als.run(store, fn) and wrap the entire processor in it: it creates a fresh, independent context per call and restores the previous one on exit, so no amount of concurrency causes cross-talk.

The Problem

Each job stamps its own traceId into ALS on entry, and the processor (which contains awaits) reads that traceId for logging and Sentry reporting:

// worker.js — cross-talk version
new Worker('analytics', (job) => {
als.enterWith({ traceId: job.data.executionId }); // stamp on entry
return processJob(job); // internally: multiple awaits + logger.info({ traceId: als.getStore().traceId })
}, { concurrency: 3 });

It works in isolation, but turn on concurrency: 3 and the weirdness begins:

# job A (executionId: aaa) and job B (executionId: bbb) enter almost simultaneously
[worker] job A start traceId=aaa
[worker] job B start traceId=bbb
# A hits an await and yields; B calls enterWith({bbb}); when A resumes:
[worker] job A step2 traceId=bbb ← crossed into B
[worker] job A error traceId=bbb ← reported under B's trace in Sentry

Not intermittent — it reproduces deterministically whenever there's concurrency, and the traceId always equals "the value of the most recent enterWith."

Root Cause

The key is that enterWith doesn't write to a "this-call-only" context — it writes to the currently active shared parent context.

AsyncLocalStorage contexts form a tree: one async context can be shared by multiple child tasks. The semantics of als.enterWith(store) are "write this store onto the context I'm currently in." When the worker runs with concurrency: 3, the three job processors share the same parent context (the worker loop's context), so:

  • job A calls enterWith({aaa}) → the shared context is written to aaa;
  • job A awaits and yields;
  • job B calls enterWith({bbb}) → the same shared context is overwritten to bbb;
  • job A resumes and reads getStore() → it gets bbb.

That's classic last-write-wins cross-talk. The more await points and the higher the concurrency, the more frequent the overwrites and the worse the corruption. Under concurrency: 1 it looks fine simply because there's no interleaving — which is exactly what makes it so dangerous: single-threaded debugging during development never surfaces it.

The Node docs are explicit about this: enterWith() can have unintended side effects, and recommends run() instead.

Solution

Swap enterWith for run, and wrap the entire processor (not just one segment) with it:

// worker.js — isolated version
new Worker('analytics', (job) => {
return als.run(
{ traceId: job.data.executionId },
() => processJob(job), // the whole processor runs in its own context
);
}, { concurrency: 3 });

The semantics of als.run(store, fn): create a brand-new, independent async context, bind the store to it, and make it visible to fn and every async task it spawns; when fn returns, the context restores to what it was before the call.

Because each run call establishes a fresh context scoped to that invocation, concurrent tasks are isolated by construction — job A's context always holds aaa, job B's always holds bbb, no matter how they interleave at await points.

The payoff is direct:

  • Per-call snapshot: the traceId is bound on job entry, and the entire handling chain (every await, sub-function, Sentry scope) reads this job's own value;
  • Auto-restore on exit: when the job ends the context resets, with no leakage into the next job or the worker's main loop;
  • Concurrency-safe: crank concurrency as high as you like — behavior stays identical to single-threaded.

If the processor is an extracted function (say processWorkflowJob, processAtomicJob), wrap it once at the Worker construction site — no need to touch the processor internals:

new Worker(queue, (job) => als.run({ traceId: job.data.id }, () => processWorkflowJob(job)), { concurrency });

Caveats

  • Whenever there's concurrency (worker concurrency > 1, concurrent HTTP requests, Promise.all batching), don't use enterWith. It's designed for "set once, single-threaded, sequential" use and will always cross-talk under concurrency.
  • run must wrap the entire processor, not just the synchronous entry — otherwise the code after an await inside the processor falls back to the shared context and you've fixed nothing.
  • concurrency: 1 hides this bug. Always load-test with the target concurrency during development, or it only surfaces in production.
  • The other frequent AsyncLocalStorage trap is reading undefined inside a callback (a lost context) — see Node.js AsyncLocalStorage returns undefined in a callback? EventEmitter escapes its context.

FAQ

What's the difference between als.enterWith and als.run?

enterWith writes the store onto the currently active shared parent context, so concurrent async tasks overwrite each other; run creates a fresh, independent context for the callback, binds the store to it, and restores the previous context on exit, so each call is isolated. Node officially recommends run over enterWith.

Why do concurrent tasks read the wrong traceId and cross into another request?

When concurrent tasks interleave at an await, the value written by enterWith is overwritten by the most recent call, so every interleaved task reads the same wrong traceId. Switching to als.run gives each call its own isolated context, so no amount of concurrency causes cross-talk.

CCLEE

Independent developer, 24 years in e-commerce, focused on grounding AI in real business scenarios.

Work with me

Chrome Extension chrome.alarms Fires at the Wrong Interval? MV3 Enforces a ~1 Minute Minimum

· 5 min read

An MV3 extension uses chrome.alarms with a 10-second period to flush logs, but in production it turns out to fire only once a minute — the schedule is silently wrong.

Encountered this while building the ecommerce data collection tool for a client — the extension's background service worker needs to periodically batch-upload accumulated client logs to the server. A 10-second cadence was meant to keep things near-real-time, but in production the worst case was a full minute of latency.

TL;DR

The MV3 service worker sleeps, so scheduled tasks must use chrome.alarms (setInterval is unreliable); and Chrome enforces a minimum period of about 1 minute on chrome.alarms in production, silently clamping periodInMinutes < 1 up to 1. The fix is to treat 1 minute as your floor and add a "flush immediately when the buffer fills" trigger to cover high-throughput periods.

The Problem

The log relay is written to flush every 10 seconds:

// background.js (MV3 service worker)
chrome.alarms.create('log-flush', { periodInMinutes: 0.16 }); // aiming for ~10s

chrome.alarms.onAlarm.addListener((alarm) => {
if (alarm.name === 'log-flush') {
flushLogs();
}
});

It seems fine locally (unpacked), but after packaging and publishing to the store, the listener fires only once a minuteperiodInMinutes: 0.16 is ignored by Chrome. No error, just a stretched schedule.

Root Cause

Two constraints stack.

First: setInterval doesn't work under MV3. The Manifest V3 background is a service worker, which Chrome suspends after roughly 30 seconds of idleness to save power. When it suspends, setInterval stops, and on wake-up it doesn't run the missed ticks. So any task that must run "even when the page or extension is idle" has to use chrome.alarms — Chrome's native scheduler that can wake the service worker.

Second: chrome.alarms has a minimum period. For performance and battery, Chrome has long enforced a ~1-minute minimum on alarms: periodInMinutes < 1 is clamped to 1. Dev mode (unpacked / Dev channel) is more permissive and runs shorter periods, so local tests pass; but once packaged into a release build, Chrome snaps it back to 1 minute. That's the root of "works locally, stretches in production."

Together: you must use chrome.alarms, and you can't rely on it firing faster than 1 minute.

Solution

Since 1 minute is a hard floor, treat it as the worst-case backstop and add an event-driven immediate trigger for real-time needs — belt and suspenders:

// 1. Backstop timer: once a minute, guarantees a flush even if the worker was suspended
const FLUSH_THRESHOLD = 50;
chrome.alarms.create('log-flush', { periodInMinutes: 1 }); // stop fighting < 1

chrome.alarms.onAlarm.addListener((alarm) => {
if (alarm.name === 'log-flush') {
flushLogs().catch(() => {});
}
});

// 2. Immediate trigger: check on every log entering the buffer; flush when the threshold is reached
messageBus.on('log', (entry) => {
pushBuffer([entry]);
if (memBuffer.length >= FLUSH_THRESHOLD) {
flushLogs().catch(() => {}); // high-throughput periods flush within seconds
}
});

This combination absorbs both constraints:

  • The 1-minute floor answers "does the timer still run after the worker suspends" — chrome.alarms wakes the worker on schedule, so worst-case latency is capped at 1 minute and logs never pile up indefinitely while the extension is idle;
  • The buffer-full trigger answers "do we have to wait a full minute during bursts" — once the threshold accumulates within a short window, it flushes right away, bypassing the alarm. Low throughput leans on the alarm, high throughput leans on events, neither end stalls.

The migration cost is tiny: wherever you expected "every 10 seconds," switch to "buffer hits 50 entries OR 1 minute, whichever comes first." Batch-friendly workloads like logs are essentially free; for latency-sensitive single-item tasks, you should redesign them to be event-driven rather than polled.

Caveats

  • Don't use setInterval for critical MV3 service-worker scheduling — it stops when the worker suspends and doesn't catch up on wake, the sneakiest source of "intermittent missed tasks" in production. chrome.alarms is the only reliable persistent scheduler under MV3.
  • Treat periodInMinutes as 1 minute in production. Dev mode's shorter periods will fool you — always re-test the cadence with the packaged build in a real environment, don't trust dev mode alone.
  • If your feature genuinely needs "exactly every N seconds" precision (a precise countdown), alarms can't deliver — they're coarse-grained "no sooner than 1 minute" scheduling that Chrome may delay further. In that case, run the timer with setInterval inside an active page, and let the worker only backstop it.
  • Another frequent service-worker trap is losing the logged-in state — see Chrome Extension Service Worker can't read the login state? A cross-context token sync solution.

FAQ

Why doesn't my chrome.alarms period take effect and gets stretched to 1 minute?

For performance and battery, Chrome enforces a roughly 1-minute minimum on alarms, so a periodInMinutes below 1 is clamped to 1. Dev mode (unpacked) usually allows shorter periods, but the production build published to the store is snapped back to 1 minute — which is why it works locally but stretches online.

Can I use setInterval for scheduled tasks in an MV3 service worker?

Not reliably. An MV3 service worker is suspended by Chrome after about 30 seconds of idleness, and setInterval stops with it, without running the missed ticks on wake. For persistent scheduling you must use chrome.alarms (which can wake the worker), or persist state to chrome.storage and catch up based on elapsed time when the worker wakes.

CCLEE

Independent developer, 24 years in e-commerce, focused on grounding AI in real business scenarios.

Work with me

Node.js require('nanoid') Throws ERR_REQUIRE_ESM? Alternatives After v5 Went ESM-Only

· 5 min read

In a CommonJS project, require('nanoid') to generate a unique ID throws ERR_REQUIRE_ESM the moment the process starts, and it exits immediately.

Encountered this while building the ecommerce data collection tool for a client — a browser-side scraper that captures product images, SKUs, prices, and reviews in real time, then cleans and exports them as structured files. The server needed a stable traceId per request for cross-service log correlation.

TL;DR

From v5 onward, nanoid is an ESM-only package, and CommonJS require() cannot load it — it always throws ERR_REQUIRE_ESM. If your project is still CJS, the simplest replacement is Node's built-in crypto.randomUUID(): zero dependencies, works in both CJS and ESM, and produces a standard UUID.

The Problem

A perfectly ordinary import in a CJS project:

// server.js (CommonJS)
const { nanoid } = require('nanoid');

const traceId = nanoid();

It crashes on startup, the stack pointing at nanoid's entry file:

node server.js

internal/modules/cjs/loader.js:905
Error [ERR_REQUIRE_ESM]: require() of ES Module
/node_modules/nanoid/index.js from server.js not supported.

Instead change the require of index.js in server.js to a CommonJS module,
or use a dynamic import() call.

Note that this isn't an intermittent or environment-specific error — it's a deterministic crash. Once you're on v5, the CJS path simply does not work.

Root Cause

In v5, nanoid completed its ESM-only migration: its package.json no longer ships a CommonJS entry, only ESM. Node's CommonJS loader, require(), is synchronous and cannot load an ESM module, so it throws ERR_REQUIRE_ESM.

This isn't a nanoid bug — it's the ecosystem's module-format evolution. More and more packages ship ESM-only (got v12+, node-fetch v3, uuid v7+ all do the same). As long as your host project is CommonJS, you'll hit the same wall with every one of them.

If you've also hit "module not found" with dynamic import(), that's the same ESM resolution rules at work — see Node.js ESM dynamic import can't find the module? Check the file extension.

Solution

Three options, ordered by how little they cost to adopt.

When you're generating unique IDs, nanoid's core value is "short and unique." But as long as the ID doesn't need to fit in a URL or be aggressively shortened, a standard UUID is more than enough — and it's built into Node 14.17+, with zero dependencies:

// Works identically in CommonJS and ESM
const { randomUUID } = require('node:crypto');

const traceId = randomUUID();
// => '1b9d6bcd-bbfd-4b2d-9b5d-ab8dfbbd4bed'

This single change solves three problems at once:

  • Zero dependencies: no more third-party package, so its module format can never hold you hostage;
  • Format alignment: UUID is a universal cross-language, cross-service format, handy for log correlation and database primary keys;
  • CJS/ESM agnostic: node:crypto is built into Node and behaves the same under both module systems.

The only tradeoff is length — a UUID is 36 characters versus nanoid's default 21. For traceIds and primary keys that cost is negligible; only if you need it in a short link should you keep reading.

Option 2: pin nanoid v3

nanoid's v3.x is the last major version that supports CommonJS, and require works directly:

// package.json — explicitly pin v3
{
"dependencies": {
"nanoid": "^3.3.7"
}
}
const { nanoid } = require('nanoid');
const id = nanoid(); // 21-char short ID

Good for when you genuinely want short IDs but can't migrate the project to ESM yet. The cost is staying on an old version and missing v5's later updates.

Option 3: async dynamic import

If you must use v5, the only way in is ESM's async loader:

// In CommonJS, load the ESM package with dynamic import()
async function makeId() {
const { nanoid } = await import('nanoid');
return nanoid();
}

// The call site itself has to be async
const id = await makeId();

It works, but nanoid is fundamentally a synchronous ID generator — wrapping it in async/await forces async to propagate up the entire call chain, which is rarely worth it.

Caveats

  • This trap isn't unique to nanoid: uuid v7+, node-fetch v3, and got v12+ are all ESM-only, and require-ing them in a CJS project throws the identical ERR_REQUIRE_ESM. The way to tell is to check the target package's package.json for "type": "module" or an "import"-only entry.
  • crypto.randomUUID() requires Node 14.17+; on older runtimes, assemble one yourself with crypto.randomBytes(16).toString('hex').
  • Don't require('nanoid') in a CJS project while also import-ing nanoid in an ESM one — mixing them leaves both old and new copies in the dependency tree, making behavior much harder to predict.

FAQ

Why does require('nanoid') throw ERR_REQUIRE_ESM in Node.js?

Because nanoid has shipped only ESM artifacts since v5, and Node's CommonJS require() loads synchronously and cannot load an ESM module — it throws ERR_REQUIRE_ESM the moment it hits nanoid's entry. This is a hard boundary between the CJS and ESM module systems, not a configuration issue.

Can I still use nanoid v5 in a CommonJS project?

Yes, but either load it asynchronously with await import('nanoid') (which forces the whole call chain async) or pin the version to v3.x, which is still CJS-compatible. If you only need a unique ID, Node's built-in crypto.randomUUID() is the simplest path — zero dependencies and supported under both module systems.

CCLEE

Independent developer, 24 years in e-commerce, focused on grounding AI in real business scenarios.

Work with me