Add a variable-length integer encoder/decoder #744
DHowett
left a comment
How do you feel about non-canonical overlong encodings (a problem UTF-8 also suffers from)?
That would be something like encoding 0x01 as 0x17 0x00 0x00 0x00, if I have parsed your description correctly.
Yeah, I thought about that. SQLite's varint, for instance, doesn't support this but has a more efficient encoding. I intentionally decided against that, for one because decoding becomes faster, and also because non-canonical encodings are quite beneficial: when an LSH instruction jumps further down into the instruction stream, the address offset depends on the number of bytes in between. That number depends on the encoded size of all the varints in between. And those in turn could be downward jumps which, again, depend on the encoded size of other varints. I'm sure I'll come up with a solution to this recursive problem at some point, if I want to. But I'm fairly certain that having non-canonical encodings will allow for easy "tie breakers" for any such algorithm.
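To make the "tie breaker" idea concrete, here is a minimal sketch of how an overlong encoding lets the encoder pick the byte width independently of the value. It does not use this PR's format, which isn't spelled out in the thread; it assumes a generic LEB128-style layout (7 payload bits per byte, high bit as continuation flag) as a stand-in, and `encode_padded` and `decode` are hypothetical names.

```rust
/// Encode `value` in a LEB128-style layout, padded to exactly `width` bytes.
/// ASSUMPTION: this layout is a stand-in, not the encoding from this PR.
/// Returns `None` if `width` is outside 1..=5 or `value` does not fit.
fn encode_padded(value: u32, width: usize) -> Option<Vec<u8>> {
    if width == 0 || width > 5 {
        return None;
    }
    let mut out = Vec::with_capacity(width);
    let mut rest = value;
    for i in 0..width {
        let low = (rest & 0x7f) as u8;
        rest >>= 7;
        // Every byte except the last carries the continuation bit,
        // even when all remaining payload bits are zero (the overlong case).
        let is_last = i + 1 == width;
        out.push(if is_last { low } else { low | 0x80 });
    }
    // Leftover bits mean the value needs more than `width` bytes.
    if rest != 0 {
        return None;
    }
    Some(out)
}

/// Decode one value and report how many bytes it occupied.
/// Overlong (padded) encodings are accepted on purpose.
fn decode(bytes: &[u8]) -> Option<(u32, usize)> {
    let mut value: u32 = 0;
    for (i, &b) in bytes.iter().enumerate().take(5) {
        value |= ((b & 0x7f) as u32) << (7 * i);
        if b & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None // ran out of input or exceeded 5 bytes
}
```

With something like this, an assembler could reserve a fixed width for every forward jump offset and fill in the value later, so the sizes of the varints in between never change after layout.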
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.

//! Variable-length `u32` encoding and decoding, with efficient storage of `u32::MAX`.
I guess I don't understand: it's not a u32 encoding, it's a u28 encoding with a special case for u32::MAX and a pretty significant gap between 268435455 and 4294967295.
Yeah, that's fair. Perhaps I should move this into the lsh project now that I made it a library. 🤔 The reason it's a "u28" is that lsh really doesn't need values >2^28, while an efficient compression for a >2^28 value is still useful (it's used for setting the input offset to max. when matching a .*).
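The range described in this exchange can be written down as a small check. This is only a sketch of the reviewer's reading (a 28-bit range plus a special case for `u32::MAX`), not code from the PR; `U28_MAX`, `GAP_LEN`, and `is_representable` are hypothetical names.

```rust
/// Largest value of the 28-bit range: 2^28 - 1 = 268435455.
const U28_MAX: u32 = (1 << 28) - 1;

/// Number of `u32` values strictly between `U28_MAX` and `u32::MAX`,
/// i.e. the gap the reviewer points out.
const GAP_LEN: u32 = u32::MAX - U28_MAX - 1;

/// ASSUMPTION: follows the reviewer's description, where a value is
/// encodable if it fits in 28 bits or is exactly `u32::MAX`.
fn is_representable(value: u32) -> bool {
    value <= U28_MAX || value == u32::MAX
}
```

Under that reading, 268435455 and 4294967295 are both encodable, while everything from 268435456 through 4294967294 is not.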
For now, this module has no purpose.
I wrote it as an experiment for encoding VM instructions.