A training-free decoder that emits two tokens per step by jointly maximizing
P(w1) * max_{w2} P(w2 | w1) over the top-K candidates for w1.
On GSM8K with Qwen2.5-1.5B-Instruct, Block-2 with K=3 gives:
| Decoder | Accuracy | Sec / problem |
|---|---|---|
| Greedy | 55.5% (111/200) | 11.88 |
| Block-2 K=3 | 63.0% (126/200) | 11.47 |
| Δ | +7.5 pp | 0.97× (slightly faster) |
McNemar exact test: discordant pairs 36 vs 21, p ≈ 0.063 (borderline at α=0.05). Hardware: single RTX 4080 SUPER, bfloat16. n=200 from GSM8K test split.
The accompanying paper, including method, analysis, and limitations, is at
paper/block2_paper.pdf.
- Greedy is myopic. It commits to the highest-probability
w1even whenw1constrainsw2poorly. Block-2 looks one step ahead. - GPU-friendly. The K-way candidate forward batches 3 inputs into a single GEMM. On a 4080 SUPER this is cheaper than two serial single-token forwards, so Block-2 is slightly faster than greedy in wall-clock time.
- Training-free. Drop-in replacement for greedy, no model changes.
- Reduces to greedy at K=1.
At each step:
- Take the top-K candidates
c_1, ..., c_Kfor the next token fromP_1. - In one K-batched forward pass, compute
P_2^{(i)}for each candidate. - Pick
i* = argmax_i P_1(c_i) * max_{w2} P_2^{(i)}(w2). - Emit
(c_{i*}, argmax P_2^{(i*)})— two tokens per step.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from src.block2 import block2_generate
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen2.5-1.5B-Instruct", dtype=torch.bfloat16
).to("cuda")
text = tok.apply_chat_template(
[{"role": "user", "content": "What is 23 multiplied by 17?"}],
tokenize=False, add_generation_prompt=True,
)
ids = tok(text, return_tensors="pt").input_ids.to(model.device)
out = block2_generate(model, ids, max_new_tokens=128, k=3,
eos_token_id=tok.eos_token_id)
print(tok.decode(out, skip_special_tokens=True))pip install -r requirements.txt
python src/eval_gsm8k.py \
--ckpt Qwen/Qwen2.5-1.5B-Instruct \
--dataset-path gsm8k_test.jsonl \
--n 200 \
--k 3 \
--out gsm8k_results.mdThe full per-problem output for n=200 is in
results/gsm8k_results.md.
paper/
block2_paper.tex LaTeX source
block2_paper.pdf Compiled PDF (7 pages)
src/
block2.py Standalone Block-2 decoder (clean module)
eval_gsm8k.py GSM8K evaluation with auto-judging + McNemar test
compare_interactive.py Interactive Greedy vs Block-2 chat
eval_100.py 100-prompt qualitative benchmark
eval_40.py 40-prompt math/list/MC benchmark
results/
gsm8k_results.md 200-problem GSM8K outputs (both decoders)
eval_100_results.md 100-prompt outputs
eval_40_results.md 40-prompt outputs
- Sample size: 200 problems gives borderline significance (p=0.063). The full 1319-problem GSM8K test is the natural confirmatory experiment.
- Single model: only Qwen2.5-1.5B-Instruct is tested.
- Single benchmark: GSM8K's structured numeric answers favor decoders that commit cleanly to digit strings; the result may not transfer to open-ended generation.
- Single K: only K=3 is reported.
See the paper's Limitations section for full discussion.
@misc{douya2026block2,
title = {Block-2 Joint Top-K Decoding: A 7.5\% Accuracy Gain on GSM8K
with Zero GPU Overhead},
author = {Douya, M.},
year = {2026},
note = {\url{https://github.com/douyamv/block2-decoding}},
}MIT. See LICENSE.