fix: sort dictionary keys bytewise per BEP 3 #206
Open
spokodev wants to merge 1 commit into
encode.dict and encode.dictMap sorted keys with Array.prototype.sort(), which orders by UTF-16 code unit. BEP 3 requires dictionary keys to be sorted as raw byte strings (bytewise on their UTF-8 encoding). The two orders diverge for keys containing an astral (non-BMP) code point: its leading UTF-16 surrogate (0xD800-0xDBFF) sorts before a BMP key in U+E000-U+FFFF, but its first UTF-8 byte (0xF0+) sorts after that key's 0xEE/0xEF byte. Because a torrent info-hash is SHA-1 of the bencoded info dictionary, a dict with such a key produces a non-canonical encoding and a wrong hash. Sort by the bytes that are actually emitted: the UTF-8 encoding for string and number keys, the raw bytes for Buffer keys. ASCII/BMP-only dictionaries are unaffected.
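The divergence can be reproduced with two single-character keys. This is a minimal illustration, assuming Node.js (for the global `Buffer`); the variable names are made up for this sketch and the keys are chosen only to hit the two ranges described above.

```javascript
// One BMP key in U+E000-U+FFFF and one astral (non-BMP) key.
const bmpKey = '\uFFFD'       // UTF-16: FFFD       UTF-8: EF BF BD
const astralKey = '\u{1F600}' // UTF-16: D83D DE00  UTF-8: F0 9F 98 80

// Default sort compares UTF-16 code units: 0xD83D < 0xFFFD,
// so the astral key is placed first.
const utf16Order = [bmpKey, astralKey].sort()

// Bytewise sort on the UTF-8 encoding: 0xEF < 0xF0,
// so the BMP key is placed first.
const byteOrder = [bmpKey, astralKey].sort((a, b) =>
  Buffer.compare(Buffer.from(a, 'utf8'), Buffer.from(b, 'utf8'))
)

console.log(utf16Order[0] === astralKey) // true
console.log(byteOrder[0] === bmpKey)     // true
```

For keys that contain only ASCII or only BMP code points below the surrogate range, the two comparisons agree, which is why ordinary dictionaries are unaffected.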
encode.dict and encode.dictMap sort keys with Array.prototype.sort(), which orders by UTF-16 code unit. BEP 3 requires dictionary keys to "appear in sorted order (sorted as raw strings, not alphanumerics)", i.e. bytewise on their UTF-8 encoding (the repo's own test/specifications.md states this).

The two orderings diverge for keys containing an astral (non-BMP) code point: its leading UTF-16 surrogate (0xD800–0xDBFF) sorts before a BMP key in U+E000–U+FFFF, but its first UTF-8 byte (0xF0+) sorts after that key's 0xEE/0xEF byte, so the order inverts.

Impact

A torrent info-hash is SHA-1 of the bencoded info dictionary, so a dict whose keys trip this inversion produces a non-canonical encoding and a wrong info-hash (28d6de08… vs the canonical 408ec594… for the example above). Ordinary torrents are unaffected because their keys are ASCII/BMP, but bencode is a general-purpose deterministic serializer (content-addressing, DHT values, arbitrary metadata), and any astral-codepoint key (emoji, CJK Ext-B, rare scripts) yields non-canonical output.

Fix

Sort by the bytes that are actually emitted: the UTF-8 encoding for string/number keys, the raw bytes for Buffer keys (matching how encode.dict/encode.dictMap write the keys). Fuzzing (200k adversarial key sets) shows all non-canonical orderings on the current code involve astral keys; with the fix, 0 remain, and byte-sort is identical to the default sort for the all-BMP case.

Tests

Added a case asserting bytewise order for an astral + BMP key pair. It fails on the current code and passes with the fix. The full suite (421 → 422) is green and the bundled benchmark/test.torrent still re-encodes byte-identically.
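The approach described under Fix could look roughly like the sketch below. It is not the code from this PR: `keyBytes` and `sortKeysBytewise` are hypothetical helper names, and the sketch assumes Node.js `Buffer` keys alongside string and number keys.

```javascript
// Hypothetical helper: the bytes a key would be emitted as.
// Buffers are used as-is; strings and numbers are stringified and UTF-8 encoded.
function keyBytes (key) {
  if (Buffer.isBuffer(key)) return key
  return Buffer.from(String(key), 'utf8')
}

// Hypothetical helper: returns a new array with the keys in bytewise order.
// Each key is encoded once, then the encoded forms are compared.
function sortKeysBytewise (keys) {
  return keys
    .map(key => ({ key, bytes: keyBytes(key) }))
    .sort((a, b) => Buffer.compare(a.bytes, b.bytes))
    .map(entry => entry.key)
}

// ASCII keys first (0x61, 0x62), then the BMP key (0xEF...),
// then the astral key (0xF0...).
const sorted = sortKeysBytewise(['\u{1F600}', 'b', '\uFFFD', 'a'])
console.log(sorted[0], sorted[1]) // a b
```

Encoding each key once before sorting avoids re-encoding inside the comparator, which would otherwise run on every comparison.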