
ValidateCommandSequence error message exceeds gRPC metadata limit on large command lists #10810

Open
goingforstudying-ctrl wants to merge 1 commit into temporalio:main from goingforstudying-ctrl:fix/truncate-command-sequence-error

@goingforstudying-ctrl

Was looking through the shutdown path after a crash report (NETDATA-AGENT-43R) and realized the thread cleanup has a nasty race.

The problem is that the nd_thread_join_threads() auto-drain and direct nd_thread_join() callers can both touch the same exited thread. The auto-drain pulls it off the exited list and calls nd_thread_join(), which frees the ND_THREAD struct. But if another code path (like cancel_main_threads()) still holds the same pointer and calls nd_thread_join() a moment later, it reads freed memory. That's exactly what the crash trace shows: freez(nti) inside nd_thread_join() aborts because the chunk header is already corrupted.

The CAS guard on NETDATA_THREAD_STATUS_JOINED only stops two threads from joining the same struct simultaneously. It doesn't help when one caller already freed it and another still has a stale pointer.

Fix: add a refcount to ND_THREAD.

  • Starts at 1 on creation.
  • nd_thread_join_threads() bumps it when it pulls a thread from the exited list, so the auto-drain path holds a reference.
  • nd_thread_join() decrements it. Whoever reaches zero frees the struct.

This way both paths can safely call nd_thread_join() and the struct stays alive until the last one finishes.
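The counting rule above can be modelled in a few lines. This is only a rough sketch in Go, not the C code from the change: the thread type, newThread, acquire, and release are hypothetical stand-ins for ND_THREAD and the join paths, and the freed flag stands in for the freez() call.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// thread is a hypothetical stand-in for ND_THREAD; only the
// reference count and a "freed" marker are modelled here.
type thread struct {
	refs  atomic.Int32
	freed atomic.Bool
}

// newThread returns a thread whose refcount starts at 1.
func newThread() *thread {
	th := &thread{}
	th.refs.Store(1)
	return th
}

// acquire models the auto-drain path taking its own reference.
func (th *thread) acquire() { th.refs.Add(1) }

// release models the decrement in the join path. It reports true
// only for the caller that drops the count to zero, which is the
// one caller allowed to free the struct.
func (th *thread) release() bool {
	if th.refs.Add(-1) == 0 {
		th.freed.Store(true) // stands in for freeing the struct
		return true
	}
	return false
}

func main() {
	th := newThread()
	th.acquire() // auto-drain pulled the thread off the exited list

	var wg sync.WaitGroup
	var frees atomic.Int32
	for i := 0; i < 2; i++ { // two join paths racing on the same thread
		wg.Add(1)
		go func() {
			defer wg.Done()
			if th.release() {
				frees.Add(1)
			}
		}()
	}
	wg.Wait()
	fmt.Println("frees:", frees.Load(), "freed:", th.freed.Load())
}
```

Because the decrement is a single atomic operation, exactly one of the two racing callers observes zero, so the free happens once regardless of which path finishes last.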

Also added a cmocka test covering:

  • refcount starts at 1
  • join frees when refcount hits zero
  • auto-drain path works correctly
  • concurrent join paths don't crash
  • double-join from same path is still safely rejected by the JOINED flag

Fixes #22716

goingforstudying-ctrl requested review from a team as code owners June 23, 2026 08:00
@CLAassistant

CLAassistant commented Jun 23, 2026

CLA assistant check
All committers have signed the CLA.

goingforstudying-ctrl force-pushed the fix/truncate-command-sequence-error branch 23 times, most recently from 9a11487 to 5d064e8 June 26, 2026 22:30
…ta overflow

When ValidateCommandSequence reports an invalid command order,
the error message includes the full command list. For workflows with
hundreds of commands, this can exceed gRPC's default HTTP/2 header
limit (~8 KB), causing the server to send RST_STREAM or a
HeaderListSizeException instead of a useful error.

Add a truncatedCommandTypes helper that caps the serialized command
list at 2048 characters, cutting on a comma boundary so individual
command names are never broken. If the list is short, the output is
unchanged.
goingforstudying-ctrl force-pushed the fix/truncate-command-sequence-error branch from 5d064e8 to f904a15 June 27, 2026 00:44