:robot: Add optional disk-spill buffer for large result sets in Custom Search Commands by Ickerday · Pull Request #804 · splunk/splunk-sdk-python

Ickerday · 2026-06-09T13:21:04Z

Summary

Investigates and addresses the OOM issue from #687.

Why PR #800 breaks tests: list(records) in write_records() and records = [] in GeneratingCommand._execute_chunk_v2 are load-bearing for the custom_fields header mechanism — the CSV header is frozen on the first _write_record call and must include all fields upfront. Removing these causes fields from later records to be silently dropped. PR #800's removals are reverted here.

Why the OOM is protocol-constrained: The CEXC protocol is strictly request-response. Each chunk header must state body_length before the body is sent, so the full CSV reply has to be buffered before anything is written. RecordWriterV2.flush(partial=True) is a deliberate no-op because the protocol does not support partial chunks.
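The framing constraint can be sketched as follows. This is a minimal illustration, not SDK code: `frame_chunk` is a hypothetical helper, and the `chunked 1.0,<metadata_length>,<body_length>` header line is assumed from the chunked protocol's framing. The point is that the header cannot be emitted until the whole body has been serialised and measured.

```python
import csv
import io
import json


def frame_chunk(metadata, records, fieldnames):
    """Build one reply chunk: header line, JSON metadata, then CSV body.

    Hypothetical helper for illustration only.
    """
    body_buf = io.StringIO()
    writer = csv.DictWriter(body_buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)  # every record must be serialised first...
    body = body_buf.getvalue().encode("utf-8")
    meta = json.dumps(metadata, separators=(",", ":")).encode("utf-8")
    # ...because the header line needs the final byte lengths up front.
    header = f"chunked 1.0,{len(meta)},{len(body)}\n".encode("ascii")
    return header + meta + body


chunk = frame_chunk({"finished": True}, [{"a": "1"}, {"a": "2"}], ["a"])
```

Since `len(body)` is only known after the last record is written, a writer cannot start sending the body while records are still arriving.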

What this PR does instead: The opt-in RecordWriterV3 buffers the CSV reply in a SpooledTemporaryFile instead of a StringIO, so the buffer spills to disk once it exceeds spool_size (default 4 MB). This bounds peak RAM to roughly spool_size regardless of result set size.
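A rough sketch of the spill idea, using only the standard library. `write_spooled_body` is a hypothetical function, not the PR's RecordWriterV3, and the single length line stands in for the real chunk header. The body length is read from the file position rather than from an in-memory string, and the body is then copied out in blocks.

```python
import io
import shutil
import tempfile


def write_spooled_body(rows, out, spool_size=4 * 1024 * 1024):
    """Buffer encoded rows in a spooled file, then emit length + body to `out`.

    Hypothetical sketch: data stays in RAM up to `spool_size` bytes and
    rolls over to a temporary file on disk beyond that.
    """
    with tempfile.SpooledTemporaryFile(max_size=spool_size, mode="w+b") as buf:
        for row in rows:
            buf.write(row)
        body_length = buf.tell()          # size is known without loading the body
        buf.seek(0)
        out.write(f"{body_length}\n".encode("ascii"))  # simplified header
        shutil.copyfileobj(buf, out)      # streamed copy in fixed-size blocks
    return body_length


out = io.BytesIO()
n = write_spooled_body((b"x" * 100 for _ in range(1000)), out, spool_size=1024)
```

Here the 100,000-byte body exceeds the 1 KB spool limit, so most of it lives on disk until it is copied to `out`.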

Can we remove `list()` entirely?

No general solution exists. list() is load-bearing because add_field()/gen_record() populate custom_fields lazily during iteration — by the time the first record is written, the CSV header must already include all fields. Replacing list() would require knowing the full field schema before iteration begins.

A fast path that skips list() when custom_fields is empty was attempted, but it breaks test_all_fieldnames_present_for_generated_records and test_field_preservation_positive for exactly this reason.
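The failure mode can be reproduced outside the SDK with plain `csv`. The sketch below is an analogy, not the SDK's writer: both function names are hypothetical, and a field that first appears in a later record stands in for one registered lazily via add_field()/gen_record().

```python
import csv
import io


def write_frozen_header(records):
    """Stream records, fixing the CSV header from the first record only."""
    out = io.StringIO()
    writer = None
    for record in records:
        if writer is None:
            # The header is frozen here; later fields have nowhere to go.
            writer = csv.DictWriter(out, fieldnames=list(record), extrasaction="ignore")
            writer.writeheader()
        writer.writerow(record)
    return out.getvalue()


def write_materialised(records):
    """Materialise first, so the header can cover every field seen."""
    records = list(records)
    fieldnames = []
    for record in records:
        for name in record:
            if name not in fieldnames:
                fieldnames.append(name)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


rows = [{"a": "1"}, {"a": "2", "late": "x"}]
streamed = write_frozen_header(iter(rows))   # the "late" column is silently dropped
buffered = write_materialised(iter(rows))    # header is "a,late"
```

The streaming version never raises an error; it simply loses the column, which is why the regression only shows up in field-preservation tests.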

This PR instead adds an opt-in declare_fields() API for users who know their schema upfront:

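from splunklib.searchcommands import Configuration, GeneratingCommand

@Configuration()
class MyCommand(GeneratingCommand):
    def generate(self):
        # Pre-declare the extra field so the writer can skip list()
        self.declare_fields('extra_field')
        # huge_dataset() and compute() are placeholders for user code
        for row in huge_dataset():
            row['extra_field'] = compute(row)
            yield row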

declare_fields() pre-populates custom_fields and sets fields_declared = True on the writer, which bypasses the list() materialisation path. Without it, behaviour is 100% unchanged.
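A toy model of the two paths may help. `ToyWriter` and `rows` are hypothetical stand-ins, not the SDK classes; only the names custom_fields, fields_declared and declare_fields come from this PR. In the toy, the field is declared before `write_records` is called, which sidesteps any question of when the real writer checks the flag.

```python
import csv
import io


class ToyWriter:
    """Stand-in for the record writer; illustration only."""

    def __init__(self):
        self.custom_fields = []        # extra header fields
        self.fields_declared = False
        self.out = io.StringIO()
        self.materialised = 0          # records held by list(); 0 when streaming

    def declare_fields(self, *names):
        for name in names:
            if name not in self.custom_fields:
                self.custom_fields.append(name)
        self.fields_declared = True

    def write_records(self, records):
        if not self.fields_declared:
            # Drain the generator so lazily registered fields are known
            # before the header is written.
            records = list(records)
            self.materialised = len(records)
        writer = None
        for record in records:
            if writer is None:
                extras = [f for f in self.custom_fields if f not in record]
                writer = csv.DictWriter(
                    self.out, fieldnames=list(record) + extras, restval=""
                )
                writer.writeheader()
            writer.writerow(record)


def rows(writer, lazy):
    """Yield three rows; the extra field first appears on the last one."""
    for i in range(3):
        row = {"n": str(i)}
        if i == 2:
            if lazy and "extra_field" not in writer.custom_fields:
                writer.custom_fields.append("extra_field")  # mimics a late add_field()
            row["extra_field"] = "x"
        yield row


lazy_writer = ToyWriter()
lazy_writer.write_records(rows(lazy_writer, lazy=True))

declared_writer = ToyWriter()
declared_writer.declare_fields("extra_field")
declared_writer.write_records(rows(declared_writer, lazy=False))
```

Both writers produce the same CSV, but only the lazy one had to hold every record in a list first.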

Usage

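from splunklib.searchcommands import Configuration, GeneratingCommand
# DiskBufferSettings is added by this PR; its import is omitted here

@Configuration(disk_buffer=DiskBufferSettings())
class MyCommand(GeneratingCommand):
    def generate(self):
        # very_large_dataset() is a placeholder for user code
        for record in very_large_dataset():
            yield record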

Benchmark

All four combinations measured to isolate each fix's contribution (2 GB payload, 50k-row chunks):

| Variant | Wall (s) | Heap peak | RSS delta | Heap saved |
|---|---|---|---|---|
| A: baseline (`list()` + StringIO) | 32.83 | 304 MB | +1281 MB | n/a |
| B: `declare_fields` only (no `list()`, StringIO) | 29.90 | 293 MB | 0 MB | −3.8% |
| C: spool only (`list()` kept, SpoolFile) | 32.53 | 16 MB | 0 MB | −94.7% |
| D: both (`declare_fields` + SpoolFile) | 30.18 | 4.7 MB | 0 MB | −98.5% |

Spool (V3) dominates: it gives a 94.7% heap reduction on its own, because the StringIO CSV buffer is the dominant cost, not the dict list. declare_fields adds a further 3.8% saving and a ~3 s wall-time win by eliminating the two-pass iteration.

At 10 GB / 10.7M records:

| Writer | Wall (s) | Heap peak | RSS delta |
|---|---|---|---|
| `RecordWriterV2` (StringIO) | 155.6 | 293 MB | +1428 MB |
| `RecordWriterV3` (SpoolFile, 4 MB spool) | 152.4 | 4.7 MB | 0 MB |

V3 heap stays flat at ~spool_size across the entire 10 GB run. RSS delta is zero — the OS reuses spool file pages freely; StringIO leaves 1.4 GB of dirty pages behind.
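For readers who want to reproduce the heap comparison on a small scale, one possible approach is `tracemalloc`. This is a sketch under assumptions, not the harness used for the numbers above: `peak_heap` is a hypothetical helper, `io.BytesIO` stands in for the StringIO buffer, and the payload is about 2 MB rather than gigabytes.

```python
import io
import tempfile
import tracemalloc


def peak_heap(make_buffer, rows=20_000, row=b"x" * 99 + b"\n"):
    """Return the peak traced heap (bytes) while writing rows into a buffer."""
    tracemalloc.start()
    try:
        with make_buffer() as buf:
            for _ in range(rows):
                buf.write(row)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


# Fully in-memory buffer: peak grows with the payload size.
in_memory_peak = peak_heap(io.BytesIO)

# Spooled buffer with a 64 KB limit: peak stays near the spool size.
spooled_peak = peak_heap(lambda: tempfile.SpooledTemporaryFile(max_size=64 * 1024))
```

`tracemalloc` only sees allocations made through Python's allocators, so it approximates the "Heap peak" column; RSS would need an OS-level measurement instead.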

spool_size sweet spot is 4 MB:

spool_size       Wall      Heap
──────────────────────────────────
0 (always disk)  12.80s    62 MB   ← TextIOWrapper buffering overhead
64 KB            14.04s    10 MB   ← too many small spills
256 KB           13.08s    10 MB
1 MB             13.51s    10 MB
4 MB  ← default  12.30s    14 MB   ✓ fast as V2, near-minimum heap
32 MB            12.77s    42 MB
64 MB            12.88s    62 MB
128 MB           12.91s    62 MB   ← no further heap savings, RAM floor rises
256 MB           12.84s    62 MB

Below 4 MB: frequent small spills increase I/O overhead. Above 4 MB: spool stays in RAM, heap climbs back with no speed gain.

What would fully fix the OOM?

Implementing partial chunk support in Splunk core — flush(partial=True) would become real, eliminating all buffering. Both V3 and declare_fields would become unnecessary. Until that ships, ~1× CSV payload buffering is unavoidable regardless of SDK changes.

Notes

Opt-in only — existing commands unchanged.
disk_buffer is SDK-only, never sent to Splunk in the CEXC getinfo response.
declare_fields() is incompatible with add_field()/gen_record() in the same invocation, because those APIs populate fields lazily and require the full-materialisation path.

AI Tool Assistance Usage Statement

AI assistance was used to draft parts of the implementation, which were subsequently modified and extended.
AI assistance was used in generating tests/documentation/comments for this change.
This PR has been deslopified.

Add RecordWriterV3, perf tests

c64a592

Ickerday mentioned this pull request Jun 9, 2026

fix: stream records lazily in write_records and _execute_chunk_v2 #800

Open

Ickerday changed the base branch from master to develop June 9, 2026 14:59

Ickerday changed the title ~~feat: opt-in disk-spill buffer for large result sets (RecordWriterV3)~~ feat: 🤖 opt-in disk-spill buffer for large result sets (RecordWriterV3) Jun 9, 2026

Ickerday changed the title ~~feat: 🤖 opt-in disk-spill buffer for large result sets (RecordWriterV3)~~ 🤖 feat: opt-in disk-spill buffer for large result sets (RecordWriterV3) Jun 9, 2026

Ickerday changed the title ~~🤖 feat: opt-in disk-spill buffer for large result sets (RecordWriterV3)~~ 🤖 feat: opt-in disk-spill buffer for large result sets Jun 9, 2026

Add more perf tests

908bce7

Ickerday changed the title ~~🤖 feat: opt-in disk-spill buffer for large result sets~~ 🤖 Add optional disk-spill buffer for large result sets for Custom Search Commands Jun 17, 2026

Ickerday changed the title ~~🤖 Add optional disk-spill buffer for large result sets for Custom Search Commands~~ 🤖 Add optional disk-spill buffer for large result sets in Custom Search Commands Jun 17, 2026

Ickerday wants to merge 2 commits into develop from feature/disk-buffer-spooled-file
