Skip to content

douyamv/block2-decoding

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Block-2 Joint Top-K Decoding

Stars License: MIT Paper GSM8K Speed Python 3.10+ PyTorch 🤗 Transformers

A training-free decoder that emits two tokens per step by jointly maximizing P(w1) * max_{w2} P(w2 | w1) over the top-K candidates for w1.

On GSM8K with Qwen2.5-1.5B-Instruct, Block-2 with K=3 gives:

Decoder Accuracy Sec / problem
Greedy 55.5% (111/200) 11.88
Block-2 K=3 63.0% (126/200) 11.47
Δ +7.5 pp 0.97× (slightly faster)

McNemar exact test: discordant pairs 36 vs 21, p ≈ 0.063 (borderline at α=0.05). Hardware: single RTX 4080 SUPER, bfloat16. n=200 from GSM8K test split.

The accompanying paper, including method, analysis, and limitations, is at paper/block2_paper.pdf.

Why this might matter

  • Greedy is myopic. It commits to the highest-probability w1 even when w1 constrains w2 poorly. Block-2 looks one step ahead.
  • GPU-friendly. The K-way candidate forward batches 3 inputs into a single GEMM. On a 4080 SUPER this is cheaper than two serial single-token forwards, so Block-2 is slightly faster than greedy in wall-clock time.
  • Training-free. Drop-in replacement for greedy, no model changes.
  • Reduces to greedy at K=1.

Algorithm

At each step:

  1. Take the top-K candidates c_1, ..., c_K for the next token from P_1.
  2. In one K-batched forward pass, compute P_2^{(i)} for each candidate.
  3. Pick i* = argmax_i P_1(c_i) * max_{w2} P_2^{(i)}(w2).
  4. Emit (c_{i*}, argmax P_2^{(i*)}) — two tokens per step.

Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from src.block2 import block2_generate

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct", dtype=torch.bfloat16
).to("cuda")

text = tok.apply_chat_template(
    [{"role": "user", "content": "What is 23 multiplied by 17?"}],
    tokenize=False, add_generation_prompt=True,
)
ids = tok(text, return_tensors="pt").input_ids.to(model.device)

out = block2_generate(model, ids, max_new_tokens=128, k=3,
                      eos_token_id=tok.eos_token_id)
print(tok.decode(out, skip_special_tokens=True))

Reproduce GSM8K

pip install -r requirements.txt
python src/eval_gsm8k.py \
    --ckpt Qwen/Qwen2.5-1.5B-Instruct \
    --dataset-path gsm8k_test.jsonl \
    --n 200 \
    --k 3 \
    --out gsm8k_results.md

The full per-problem output for n=200 is in results/gsm8k_results.md.

Repo layout

paper/
  block2_paper.tex       LaTeX source
  block2_paper.pdf       Compiled PDF (7 pages)
src/
  block2.py              Standalone Block-2 decoder (clean module)
  eval_gsm8k.py          GSM8K evaluation with auto-judging + McNemar test
  compare_interactive.py Interactive Greedy vs Block-2 chat
  eval_100.py            100-prompt qualitative benchmark
  eval_40.py             40-prompt math/list/MC benchmark
results/
  gsm8k_results.md       200-problem GSM8K outputs (both decoders)
  eval_100_results.md    100-prompt outputs
  eval_40_results.md     40-prompt outputs

Limitations

  • Sample size: 200 problems gives borderline significance (p=0.063). The full 1319-problem GSM8K test is the natural confirmatory experiment.
  • Single model: only Qwen2.5-1.5B-Instruct is tested.
  • Single benchmark: GSM8K's structured numeric answers favor decoders that commit cleanly to digit strings; the result may not transfer to open-ended generation.
  • Single K: only K=3 is reported.

See the paper's Limitations section for full discussion.

Citation

@misc{douya2026block2,
  title  = {Block-2 Joint Top-K Decoding: A 7.5\% Accuracy Gain on GSM8K
            with Zero GPU Overhead},
  author = {Douya, M.},
  year   = {2026},
  note   = {\url{https://github.com/douyamv/block2-decoding}},
}

License

MIT. See LICENSE.

Contact

douyamv@gmail.com

About

Block-2 Joint Top-K Decoding — +7.5pp on GSM8K vs greedy, 0.97x wall time on GPU. Training-free, drop-in replacement for greedy on Qwen2.5-1.5B.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors