fix: sort dictionary keys bytewise per BEP 3 #206

Open
spokodev wants to merge 1 commit into webtorrent:master from spokodev:fix/bytewise-dict-key-sort

Conversation

@spokodev

encode.dict and encode.dictMap sort keys with Array.prototype.sort(), which orders by UTF-16 code unit. BEP 3 requires dictionary keys to "appear in sorted order (sorted as raw strings, not alphanumerics)" — i.e. bytewise on their UTF-8 encoding (the repo's own test/specifications.md states this).

The two orderings diverge for keys containing an astral (non-BMP) code point: its leading UTF-16 surrogate (0xD800-0xDBFF) sorts before a BMP key in U+E000-U+FFFF, but its first UTF-8 byte (0xF0+) sorts after that key's 0xEE/0xEF byte, so the order inverts.

import bencode from 'bencode'

const d = {}
d['\u{1F600}'] = 2   // astral key, UTF-8: f0 9f 98 80
d['\u{E000}']  = 1   // BMP key,    UTF-8: ee 80 80
bencode.encode(d)
// emits  d 4:<f0 9f 98 80> i2e 3:<ee 80 80> i1e e   (astral first, non-canonical)
// BEP 3  d 3:<ee 80 80> i1e 4:<f0 9f 98 80> i2e e   (ee < f0)
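
The inversion can also be reproduced without the library. The following is a minimal sketch, assuming Node.js (for the global `Buffer`); the variable names are illustrative and not taken from the patch. It sorts the same two keys once with the default comparator and once by their UTF-8 bytes.

```javascript
// Two keys: one astral (U+1F600) and one BMP private-use code point (U+E000).
const keys = ['\u{1F600}', '\u{E000}']

// Default sort compares UTF-16 code units: lead surrogate 0xD83D < 0xE000.
const byCodeUnit = [...keys].sort()

// Bytewise sort compares the UTF-8 encodings: ee 80 80 < f0 9f 98 80.
const byUtf8 = [...keys].sort((a, b) =>
  Buffer.compare(Buffer.from(a, 'utf8'), Buffer.from(b, 'utf8'))
)

// Print each ordering as hex so the byte order is visible.
const hex = (s) => Buffer.from(s, 'utf8').toString('hex')
console.log(byCodeUnit.map(hex)) // [ 'f09f9880', 'ee8080' ]
console.log(byUtf8.map(hex))     // [ 'ee8080', 'f09f9880' ]
```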

Impact

A torrent info-hash is SHA-1 of the bencoded info dictionary, so a dict whose keys trip this inversion produces a non-canonical encoding and a wrong info-hash (28d6de08… vs the canonical 408ec594… for the example above). Ordinary torrents are unaffected — their keys are ASCII/BMP — but bencode is a general-purpose deterministic serializer (content-addressing, DHT values, arbitrary metadata), and any astral-codepoint key (emoji, CJK Ext-B, rare scripts) yields non-canonical output.
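
To illustrate why key order changes the hash, here is a small sketch, assuming Node.js and its built-in `node:crypto` module. It builds the two encodings from the example by hand (the `entry` and `dict` helpers are hypothetical and are not part of bencode) and hashes each with SHA-1. It hashes only this two-key dictionary, so the digests it prints are not claimed to match the ones quoted above.

```javascript
const { createHash } = require('node:crypto')

const astral = Buffer.from('\u{1F600}', 'utf8') // f0 9f 98 80
const bmp = Buffer.from('\u{E000}', 'utf8')     // ee 80 80

// Hand-built bencoding of one pair: <key byte length>:<key bytes>i<int>e
const entry = (key, value) =>
  Buffer.concat([Buffer.from(`${key.length}:`), key, Buffer.from(`i${value}e`)])

// Wrap already-ordered entries in the dictionary delimiters d ... e
const dict = (...entries) =>
  Buffer.concat([Buffer.from('d'), ...entries, Buffer.from('e')])

const codeUnitOrder = dict(entry(astral, 2), entry(bmp, 1)) // astral key first
const bytewiseOrder = dict(entry(bmp, 1), entry(astral, 2)) // order required by BEP 3

const sha1 = (buf) => createHash('sha1').update(buf).digest('hex')

// Same keys and values, different byte order, therefore different digests.
console.log(sha1(codeUnitOrder) === sha1(bytewiseOrder)) // false
```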

Fix

Sort by the bytes that are actually emitted — the UTF-8 encoding for string/number keys, the raw bytes for Buffer keys (matching how encode.dict/encode.dictMap write the keys). Fuzzing (200k adversarial key sets) shows all non-canonical orderings on the current code involve astral keys; with the fix, 0 remain, and byte-sort is identical to the default sort for the all-BMP case.
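
One way such a comparator could look is sketched below, assuming Node.js. The names `keyBytes`, `compareKeys` and `sortedKeys` are hypothetical and are not the identifiers used in the patch; the sketch only shows the idea of comparing keys by the bytes that would be written.

```javascript
// Hypothetical helper: the bytes that would be emitted for a dictionary key.
function keyBytes (key) {
  // Buffer (a Uint8Array subclass) and plain Uint8Array keys are used as-is.
  if (key instanceof Uint8Array) return key
  // String and number keys are compared by their UTF-8 encoding.
  return Buffer.from(String(key), 'utf8')
}

// Bytewise comparison; Buffer.compare returns -1, 0 or 1.
function compareKeys (a, b) {
  return Buffer.compare(keyBytes(a), keyBytes(b))
}

// Returns a sorted copy and leaves the input untouched.
function sortedKeys (keys) {
  return [...keys].sort(compareKeys)
}

console.log(sortedKeys(['\u{1F600}', '\u{E000}', 'a', 10]))
// 10 ('10' = 31 30), then 'a' (61), then U+E000 (ee ...), then U+1F600 (f0 ...)
```

For keys made only of ASCII or BMP characters, this comparator gives the same order as the default sort, which is consistent with the fuzzing result described above.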

Tests

Added a case asserting bytewise order for an astral + BMP key pair. It fails on the current code and passes with the fix. The full suite (421 → 422) is green and the bundled benchmark/test.torrent still re-encodes byte-identically.

encode.dict and encode.dictMap sorted keys with Array.prototype.sort(),
which orders by UTF-16 code unit. BEP 3 requires dictionary keys to be
sorted as raw byte strings (bytewise on their UTF-8 encoding). The two
orders diverge for keys containing an astral (non-BMP) code point: its
leading UTF-16 surrogate (0xD800-0xDBFF) sorts before a BMP key in
U+E000-U+FFFF, but its first UTF-8 byte (0xF0+) sorts after that key's
0xEE/0xEF byte.

Because a torrent info-hash is SHA-1 of the bencoded info dictionary, a
dict with such a key produces a non-canonical encoding and a wrong hash.

Sort by the bytes that are actually emitted: the UTF-8 encoding for
string and number keys, the raw bytes for Buffer keys. ASCII/BMP-only
dictionaries are unaffected.