ValidateCommandSequence error message exceeds gRPC metadata limit on large command lists by goingforstudying-ctrl · Pull Request #10810 · temporalio/temporal

goingforstudying-ctrl · 2026-06-23T08:00:41Z

Was looking through the shutdown path after a crash report (NETDATA-AGENT-43R) and realized the thread cleanup has a nasty race.

The problem is that the nd_thread_join_threads() auto-drain and direct nd_thread_join() callers can both touch the same exited thread. The auto-drain pulls it off the exited list and calls nd_thread_join(), which frees the ND_THREAD struct. But if another code path (like cancel_main_threads()) still holds the same pointer and calls nd_thread_join() a moment later, it reads freed memory. That is exactly what the crash trace shows: freez(nti) inside nd_thread_join() aborts because the chunk header is already corrupted.

The CAS guard on NETDATA_THREAD_STATUS_JOINED only stops two threads from joining the same struct simultaneously. It doesn't help when one caller already freed it and another still has a stale pointer.

Fix: add a refcount to ND_THREAD.

The refcount starts at 1 on creation.
nd_thread_join_threads() bumps it when it pulls a thread from the exited list, so the auto-drain path holds a reference.
nd_thread_join() decrements it. Whoever reaches zero frees the struct.

This way both paths can safely call nd_thread_join() and the struct stays alive until the last one finishes.

Also added a cmocka test covering:

refcount starts at 1
join frees when refcount hits zero
auto-drain path works correctly
concurrent join paths don't crash
double-join from same path is still safely rejected by the JOINED flag

Fixes #22716

CLAassistant · 2026-06-23T08:05:21Z

All committers have signed the CLA.

…ta overflow

When ValidateCommandSequence reports an invalid command order, the error message includes the full command list. For workflows with hundreds of commands, this can exceed gRPC's default HTTP/2 header limit (~8 KB), causing the server to send RST_STREAM or a HeaderListSizeException instead of a useful error.

Add a truncatedCommandTypes helper that caps the serialized command list at 2048 characters, cutting on a comma boundary so individual command names are never broken. If the list is short, the output is unchanged.
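A minimal sketch of the comma-boundary truncation described in the commit message, under stated assumptions: the diff is not shown here, so `truncateCommandList`, its `[]string` input, and the `", "` separator are hypothetical and may differ from the real truncatedCommandTypes helper. Only the 2048 cap and the cut-on-a-comma rule come from the description.

```go
package main

import (
	"fmt"
	"strings"
)

// maxCommandListLen is the cap mentioned in the commit message.
const maxCommandListLen = 2048

// truncateCommandList is a hypothetical stand-in for truncatedCommandTypes.
// It joins the names with ", " and, if the result is longer than limit,
// cuts it back to the last comma so that no name is split in half.
// Length is measured in bytes, which matches characters for ASCII names.
func truncateCommandList(names []string, limit int) string {
	joined := strings.Join(names, ", ")
	if limit < 0 {
		limit = 0
	}
	if len(joined) <= limit {
		return joined // short lists pass through unchanged
	}
	// A comma at index i means joined[:i] ends on a whole name.
	// Searching joined[:limit+1] keeps the result at most limit bytes long.
	cut := strings.LastIndex(joined[:limit+1], ",")
	if cut < 0 {
		return "" // even the first name does not fit
	}
	return joined[:cut]
}

func main() {
	// Assumed input: a long list of repeated command type names.
	names := make([]string, 500)
	for i := range names {
		names[i] = "ScheduleActivityTask"
	}
	out := truncateCommandList(names, maxCommandListLen)
	fmt.Println(len(out) <= maxCommandListLen, strings.HasSuffix(out, "ScheduleActivityTask"))
}
```

This sketch returns the truncated list without any marker; a real helper might append a suffix to signal that names were dropped, in which case the suffix length would need to be counted against the cap.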

goingforstudying-ctrl requested review from a team as code owners June 23, 2026 08:00

goingforstudying-ctrl force-pushed the fix/truncate-command-sequence-error branch 23 times, most recently from 9a11487 to 5d064e8 Compare June 26, 2026 22:30

goingforstudying-ctrl force-pushed the fix/truncate-command-sequence-error branch from 5d064e8 to f904a15 Compare June 27, 2026 00:44

goingforstudying-ctrl wants to merge 1 commit into temporalio:main from goingforstudying-ctrl:fix/truncate-command-sequence-error
