🤖 Add optional disk-spill buffer for large result sets in Custom Search Commands#804
Open
Ickerday wants to merge 2 commits into
Open
🤖 Add optional disk-spill buffer for large result sets in Custom Search Commands#804Ickerday wants to merge 2 commits into
Ickerday wants to merge 2 commits into
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Investigates and addresses the OOM issue from #687.
Why PR #800 breaks tests:
list(records)inwrite_records()andrecords = []inGeneratingCommand._execute_chunk_v2are load-bearing for thecustom_fieldsheader mechanism — the CSV header is frozen on the first_write_recordcall and must include all fields upfront. Removing these causes fields from later records to be silently dropped. PR #800's removals are reverted here.Why the OOM is protocol-constrained: The CEXC protocol is strictly request-response — chunk header requires
body_lengthbefore the body, so the full CSV reply must be buffered before writing.RecordWriterV2.flush(partial=True)is a deliberate no-op as the protocol does not support partial chunking.What this PR does instead: Opt-in
RecordWriterV3spills the CSV reply buffer to aSpooledTemporaryFileinstead ofStringIO, bounding peak RAM tospool_size(default 4 MB) regardless of result set size.Can we remove
list()entirely?No general solution exists.
list()is load-bearing becauseadd_field()/gen_record()populatecustom_fieldslazily during iteration — by the time the first record is written, the CSV header must already include all fields. Replacinglist()would require knowing the full field schema before iteration begins.Attempted a fast path (skip
list()whencustom_fieldsempty) — breakstest_all_fieldnames_present_for_generated_recordsandtest_field_preservation_positivefor exactly this reason.This PR instead adds an opt-in
declare_fields()API for users who know their schema upfront:declare_fields()pre-populatescustom_fieldsand setsfields_declared = Trueon the writer, which bypasses thelist()materialisation path. Without it, behaviour is 100% unchanged.Usage
Benchmark
All four combinations measured to isolate each fix's contribution (2 GB payload, 50k-row chunks):
list()+ StringIO)declare_fieldsonly (nolist(), StringIO)list()kept, SpoolFile)declare_fields+ SpoolFile)Spool (V3) dominates — 94.7% heap reduction alone, because the StringIO CSV buffer is the dominant cost, not the dict list.
declare_fieldsadds an extra 3.8% saving and a ~3 s CPU win by eliminating the two-pass iteration.At 10 GB / 10.7M records:
RecordWriterV2(StringIO)RecordWriterV3(SpoolFile, 4 MB spool)V3 heap stays flat at ~spool_size across the entire 10 GB run. RSS delta is zero — the OS reuses spool file pages freely; StringIO leaves 1.4 GB of dirty pages behind.
spool_sizesweet spot is 4 MB:Below 4 MB: frequent small spills increase I/O overhead. Above 4 MB: spool stays in RAM, heap climbs back with no speed gain.
What would fully fix the OOM?
Implementing partial chunk support in Splunk core —
flush(partial=True)would become real, eliminating all buffering. Both V3 anddeclare_fieldswould become unnecessary. Until that ships, ~1× CSV payload buffering is unavoidable regardless of SDK changes.Notes
disk_bufferis SDK-only, never sent to Splunk in the CEXC getinfo response.declare_fields()is incompatible withadd_field()/gen_record()on the same invocation — those APIs populate fields lazily and require the full-materialisation path.AI Tool Assistance Usage Statement