
🤖 Add optional disk-spill buffer for large result sets in Custom Search Commands #804

Open: Ickerday wants to merge 2 commits into develop from feature/disk-buffer-spooled-file

Ickerday (Collaborator) commented Jun 9, 2026

Summary

Investigates and addresses the OOM issue from #687.

Why PR #800 breaks tests: list(records) in write_records() and records = [] in GeneratingCommand._execute_chunk_v2 are load-bearing for the custom_fields header mechanism — the CSV header is frozen on the first _write_record call and must include all fields upfront. Removing these causes fields from later records to be silently dropped. PR #800's removals are reverted here.

Why the OOM is protocol-constrained: The CEXC protocol is strictly request-response. Each chunk header must state body_length before the body, so the full CSV reply has to be buffered before anything is written. RecordWriterV2.flush(partial=True) is a deliberate no-op because the protocol does not support partial chunking.
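
To illustrate why the length-prefixed header forces buffering, here is a minimal sketch of chunk framing. It assumes the commonly documented `chunked 1.0,<metadata_length>,<body_length>` header layout; `encode_chunk` is a hypothetical helper, not the SDK's own function.

```python
import json


def encode_chunk(metadata: dict, body: str) -> bytes:
    """Frame one reply chunk (hypothetical helper, assumed header layout)."""
    meta_bytes = json.dumps(metadata, separators=(",", ":")).encode("utf-8")
    body_bytes = body.encode("utf-8")
    # Both lengths go into the header, so the whole body must exist
    # (in RAM or on disk) before the first byte of the chunk is sent.
    header = b"chunked 1.0,%d,%d\n" % (len(meta_bytes), len(body_bytes))
    return header + meta_bytes + body_bytes


chunk = encode_chunk({"finished": True}, "a,b\r\n1,2\r\n")
```

Because `len(body_bytes)` appears ahead of the body, a writer cannot emit rows as they are produced; it can only choose where the buffered body lives.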

What this PR does instead: An opt-in RecordWriterV3 spills the CSV reply buffer to a SpooledTemporaryFile instead of holding it in a StringIO, bounding the buffer's in-memory footprint to roughly spool_size (default 4 MB) regardless of result set size.
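
The spooling idea can be sketched with the standard library alone. This is not the RecordWriterV3 implementation; `Utf8Sink` and `spool_csv` are hypothetical names, and the sketch only shows how a spooled file keeps the body length available while capping RAM use.

```python
import csv
import io
import shutil
import tempfile


class Utf8Sink:
    """Adapter that gives csv.writer a text write() over a binary file."""

    def __init__(self, raw):
        self._raw = raw

    def write(self, text):
        return self._raw.write(text.encode("utf-8"))


def spool_csv(rows, out, spool_size=4 * 1024 * 1024):
    """Buffer CSV rows in a spooled file, copy them to `out`, return the byte length."""
    with tempfile.SpooledTemporaryFile(max_size=spool_size, mode="w+b") as spool:
        writer = csv.writer(Utf8Sink(spool))
        for row in rows:
            writer.writerow(row)   # rolls over to disk once past spool_size
        body_length = spool.tell() # known only after the last row is written
        spool.seek(0)
        shutil.copyfileobj(spool, out)
    return body_length


out = io.BytesIO()
length = spool_csv(([i, "x" * 50] for i in range(1000)), out, spool_size=1024)
```

A real writer would emit the chunk header using `body_length` before copying the spooled body to the output stream.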

Can we remove list() entirely?

No general solution exists. list() is load-bearing because add_field()/gen_record() populate custom_fields lazily during iteration — by the time the first record is written, the CSV header must already include all fields. Replacing list() would require knowing the full field schema before iteration begins.
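
The failure mode can be reproduced with a plain `csv.DictWriter`, independent of the SDK. In this sketch (hypothetical function names), a single-pass writer freezes the header on the first record and silently drops a field that only appears later, while the materialising writer sees every field before writing the header.

```python
import csv
import io


def write_streaming(records):
    """Single pass: the header is frozen on the first record."""
    out = io.StringIO()
    writer = None
    for record in records:
        if writer is None:
            # Later records cannot extend this header.
            writer = csv.DictWriter(out, fieldnames=list(record), extrasaction="ignore")
            writer.writeheader()
        writer.writerow(record)
    return out.getvalue()


def write_materialised(records):
    """Two passes: collect every field name first, then write."""
    rows = list(records)
    fieldnames = []
    for row in rows:
        for name in row:
            if name not in fieldnames:
                fieldnames.append(name)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def records():
    yield {"host": "a"}
    yield {"host": "b", "extra_field": "late"}


streamed = write_streaming(records())
materialised = write_materialised(records())
```

Here `streamed` contains only the `host` column, whereas `materialised` carries `extra_field` as well, at the cost of holding every record in memory.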

A fast path that skips list() when custom_fields is empty was attempted, but it breaks test_all_fieldnames_present_for_generated_records and test_field_preservation_positive for exactly this reason.

This PR instead adds an opt-in declare_fields() API for users who know their schema upfront:

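from splunklib.searchcommands import Configuration, GeneratingCommand

@Configuration()
class MyCommand(GeneratingCommand):
    def generate(self):
        self.declare_fields('extra_field')   # pre-declare → list() skipped
        # huge_dataset() and compute() are placeholders for user code
        for row in huge_dataset():
            row['extra_field'] = compute(row)
            yield row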

declare_fields() pre-populates custom_fields and sets fields_declared = True on the writer, which bypasses the list() materialisation path. Without it, behaviour is 100% unchanged.
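
Why a declared schema removes the need to materialise can be shown with a small standalone sketch. It is not the PR's writer; `iter_csv_lines` and `_drain` are hypothetical helpers that emit CSV lazily once the field names are fixed up front.

```python
import csv
import io


def _drain(buf):
    """Return the buffered text and reset the buffer."""
    text = buf.getvalue()
    buf.seek(0)
    buf.truncate(0)
    return text


def iter_csv_lines(records, fieldnames):
    """Yield the header, then one CSV line per record, without list()."""
    buf = io.StringIO()
    # extrasaction="raise" (the default) rejects any undeclared field.
    writer = csv.DictWriter(buf, fieldnames=fieldnames, restval="")
    writer.writeheader()
    yield _drain(buf)
    for record in records:
        writer.writerow(record)
        yield _drain(buf)


produced = []


def source():
    for i in range(3):
        produced.append(i)
        yield {"n": i, "extra_field": i * 2}


lines = iter_csv_lines(source(), ["n", "extra_field"])
header = next(lines)
produced_after_header = list(produced)
first = next(lines)
produced_after_first = list(produced)
```

The header is emitted before any record is pulled from the generator, and each subsequent line pulls exactly one record, so only one row is held at a time.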

Usage

@Configuration(disk_buffer=DiskBufferSettings())
class MyCommand(GeneratingCommand):
    def generate(self):
        for record in very_large_dataset():
            yield record

Benchmark

All four combinations measured to isolate each fix's contribution (2 GB payload, 50k-row chunks):

Variant                                        Wall (s)   Heap peak   RSS delta   Heap saved
─────────────────────────────────────────────────────────────────────────────────────────────
A: baseline (list() + StringIO)                32.83      304 MB      +1281 MB
B: declare_fields only (no list(), StringIO)   29.90      293 MB      0 MB        −3.8%
C: spool only (list() kept, SpoolFile)         32.53      16 MB       0 MB        −94.7%
D: both (declare_fields + SpoolFile)           30.18      4.7 MB      0 MB        −98.5%

The spool (V3) dominates: it delivers a 94.7% heap reduction on its own, because the StringIO CSV buffer is the dominant cost, not the dict list. declare_fields adds a further 3.8% saving and a ~3 s wall-time win by eliminating the two-pass iteration.

At 10 GB / 10.7M records:

Writer                                    Wall (s)   Heap peak   RSS delta
───────────────────────────────────────────────────────────────────────────
RecordWriterV2 (StringIO)                 155.6      293 MB      +1428 MB
RecordWriterV3 (SpoolFile, 4 MB spool)    152.4      4.7 MB      0 MB

V3 heap stays flat at ~spool_size across the entire 10 GB run. RSS delta is zero — the OS reuses spool file pages freely; StringIO leaves 1.4 GB of dirty pages behind.
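
The PR does not say how heap peak was measured. One way to reproduce the shape of this comparison, assuming `tracemalloc` as the measuring tool and a much smaller payload, is sketched below; `peak_heap_bytes` and `make_spool` are hypothetical helpers, and the absolute numbers will differ from the tables above.

```python
import io
import tempfile
import tracemalloc


def peak_heap_bytes(make_buffer, total_bytes=4 * 1024 * 1024):
    """Write ~total_bytes into a fresh buffer and return the traced heap peak."""
    line = b"x" * 99 + b"\n"
    tracemalloc.start()
    try:
        with make_buffer() as buf:
            for _ in range(total_bytes // len(line)):
                buf.write(line)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def make_spool():
    # Small spool so the rollover to disk happens early in the run.
    return tempfile.SpooledTemporaryFile(max_size=64 * 1024)


in_memory_peak = peak_heap_bytes(io.BytesIO)
spooled_peak = peak_heap_bytes(make_spool)
```

The in-memory buffer's peak grows with the payload, while the spooled buffer's peak stays near its `max_size` once it has rolled over to disk.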

spool_size sweet spot is 4 MB:

spool_size       Wall      Heap
──────────────────────────────────
0 (always disk)  12.80s    62 MB   ← TextIOWrapper buffering overhead
64 KB            14.04s    10 MB   ← too many small spills
256 KB           13.08s    10 MB
1 MB             13.51s    10 MB
4 MB  ← default  12.30s    14 MB   ✓ as fast as V2, near-minimum heap
32 MB            12.77s    42 MB
64 MB            12.88s    62 MB
128 MB           12.91s    62 MB   ← no further heap savings, RAM floor rises
256 MB           12.84s    62 MB

Below 4 MB: frequent small spills increase I/O overhead. Above 4 MB: spool stays in RAM, heap climbs back with no speed gain.

What would fully fix the OOM?

Implementing partial chunk support in Splunk core — flush(partial=True) would become real, eliminating all buffering. Both V3 and declare_fields would become unnecessary. Until that ships, ~1× CSV payload buffering is unavoidable regardless of SDK changes.

Notes

  • Opt-in only — existing commands unchanged.
  • disk_buffer is SDK-only, never sent to Splunk in the CEXC getinfo response.
  • declare_fields() is incompatible with add_field()/gen_record() on the same invocation — those APIs populate fields lazily and require the full-materialisation path.

AI Tool Assistance Usage Statement

  • AI assistance was used to draft parts of the implementation, which were subsequently modified and extended.
  • AI assistance was used in generating tests/documentation/comments for this change.
  • This PR has been deslopified.

Ickerday changed the base branch from master to develop June 9, 2026 14:59
Ickerday changed the title from "feat: opt-in disk-spill buffer for large result sets (RecordWriterV3)" to "feat: 🤖 opt-in disk-spill buffer for large result sets (RecordWriterV3)" Jun 9, 2026
Ickerday changed the title from "feat: 🤖 opt-in disk-spill buffer for large result sets (RecordWriterV3)" to "🤖 feat: opt-in disk-spill buffer for large result sets (RecordWriterV3)" Jun 9, 2026
Ickerday changed the title from "🤖 feat: opt-in disk-spill buffer for large result sets (RecordWriterV3)" to "🤖 feat: opt-in disk-spill buffer for large result sets" Jun 9, 2026
Ickerday changed the title from "🤖 feat: opt-in disk-spill buffer for large result sets" to "🤖 Add optional disk-spill buffer for large result sets for Custom Search Commands" Jun 17, 2026
Ickerday changed the title from "🤖 Add optional disk-spill buffer for large result sets for Custom Search Commands" to "🤖 Add optional disk-spill buffer for large result sets in Custom Search Commands" Jun 17, 2026