Block-2 Joint Top-K Decoding

A training-free decoder that emits two tokens per step by jointly maximizing P(w1) * max_{w2} P(w2 | w1) over the top-K candidates for w1.

On GSM8K with Qwen2.5-1.5B-Instruct, Block-2 with K=3 gives:

Decoder	Accuracy	Sec / problem
Greedy	55.5% (111/200)	11.88
Block-2 K=3	63.0% (126/200)	11.47
Δ	+7.5 pp	0.97× (slightly faster)

McNemar exact test: discordant pairs 36 vs 21, p ≈ 0.063 (borderline at α=0.05). Hardware: single RTX 4080 SUPER, bfloat16. n=200 from GSM8K test split.

The accompanying paper, including method, analysis, and limitations, is at paper/block2_paper.pdf.

Why this might matter

Greedy is myopic. It commits to the highest-probability w1 even when w1 constrains w2 poorly. Block-2 looks one step ahead.
GPU-friendly. The K-way candidate forward batches 3 inputs into a single GEMM. On a 4080 SUPER this is cheaper than two serial single-token forwards, so Block-2 is slightly faster than greedy in wall-clock time.
Training-free. Drop-in replacement for greedy, no model changes.
Reduces to greedy at K=1.

Algorithm

At each step:

Take the top-K candidates c_1, ..., c_K for the next token from P_1.
In one K-batched forward pass, compute P_2^{(i)} for each candidate.
Pick i* = argmax_i P_1(c_i) * max_{w2} P_2^{(i)}(w2).
Emit (c_{i*}, argmax P_2^{(i*)}) — two tokens per step.

Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from src.block2 import block2_generate

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct", dtype=torch.bfloat16
).to("cuda")

text = tok.apply_chat_template(
    [{"role": "user", "content": "What is 23 multiplied by 17?"}],
    tokenize=False, add_generation_prompt=True,
)
ids = tok(text, return_tensors="pt").input_ids.to(model.device)

out = block2_generate(model, ids, max_new_tokens=128, k=3,
                      eos_token_id=tok.eos_token_id)
print(tok.decode(out, skip_special_tokens=True))

Reproduce GSM8K

pip install -r requirements.txt
python src/eval_gsm8k.py \
    --ckpt Qwen/Qwen2.5-1.5B-Instruct \
    --dataset-path gsm8k_test.jsonl \
    --n 200 \
    --k 3 \
    --out gsm8k_results.md

The full per-problem output for n=200 is in results/gsm8k_results.md.

Repo layout

paper/
  block2_paper.tex       LaTeX source
  block2_paper.pdf       Compiled PDF (7 pages)
src/
  block2.py              Standalone Block-2 decoder (clean module)
  eval_gsm8k.py          GSM8K evaluation with auto-judging + McNemar test
  compare_interactive.py Interactive Greedy vs Block-2 chat
  eval_100.py            100-prompt qualitative benchmark
  eval_40.py             40-prompt math/list/MC benchmark
results/
  gsm8k_results.md       200-problem GSM8K outputs (both decoders)
  eval_100_results.md    100-prompt outputs
  eval_40_results.md     40-prompt outputs

Limitations

Sample size: 200 problems gives borderline significance (p=0.063). The full 1319-problem GSM8K test is the natural confirmatory experiment.
Single model: only Qwen2.5-1.5B-Instruct is tested.
Single benchmark: GSM8K's structured numeric answers favor decoders that commit cleanly to digit strings; the result may not transfer to open-ended generation.
Single K: only K=3 is reported.

See the paper's Limitations section for full discussion.

Citation

@misc{douya2026block2,
  title  = {Block-2 Joint Top-K Decoding: A 7.5\% Accuracy Gain on GSM8K
            with Zero GPU Overhead},
  author = {Douya, M.},
  year   = {2026},
  note   = {\url{https://github.com/douyamv/block2-decoding}},
}

License

MIT. See LICENSE.

Contact

douyamv@gmail.com

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Block-2 Joint Top-K Decoding

Why this might matter

Algorithm

Quick start

Reproduce GSM8K

Repo layout

Limitations

Citation

License

Contact

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
paper		paper
results		results
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Block-2 Joint Top-K Decoding

Why this might matter

Algorithm

Quick start

Reproduce GSM8K

Repo layout

Limitations

Citation

License

Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages