@lhecker lhecker commented Jan 20, 2026

For now, this module has no purpose.
I wrote it as an experiment for encoding VM instructions.

@DHowett DHowett left a comment

How do you feel about non-canonical overlong encodings (which is a problem UTF-8 also suffers from)?

That would be something like encoding 0x01 as 0x17 0x00 0x00 0x00, if I have parsed your description correctly.
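
A rough sketch of what such an overlong form could look like. The byte layout below is a guess reconstructed from the 0x17 0x00 0x00 0x00 example and the 28-bit limit discussed in this thread, not the module's confirmed format, and `decode` and `encode_with_len` are hypothetical names.

```rust
// Assumed layout (NOT taken from the PR's code): the number of trailing
// 1-bits in the first byte selects the length, multi-byte forms are
// little-endian.
//   xxxxxxx0            -> 1 byte,   7 payload bits
//   xxxxxx01 + 1 byte   -> 2 bytes, 14 payload bits
//   xxxxx011 + 2 bytes  -> 3 bytes, 21 payload bits
//   xxxx0111 + 3 bytes  -> 4 bytes, 28 payload bits
//   ....1111            -> 1 byte,  u32::MAX

/// Decodes one value and returns it with the number of bytes consumed.
fn decode(bytes: &[u8]) -> Option<(u32, usize)> {
    let first = *bytes.first()?;
    let ones = first.trailing_ones() as usize;
    if ones >= 4 {
        // Assumed single-byte marker for u32::MAX.
        return Some((u32::MAX, 1));
    }
    let len = ones + 1;
    let mut raw = [0u8; 4];
    raw[..len].copy_from_slice(bytes.get(..len)?);
    // Drop the length tag (len - 1 ones plus the terminating zero bit).
    Some((u32::from_le_bytes(raw) >> len, len))
}

/// Encodes `value` using exactly `len` bytes, even if fewer would do.
/// Returns `None` if `len` is not 1..=4 or `value` needs more than
/// 7 * len bits.
fn encode_with_len(value: u32, len: usize) -> Option<Vec<u8>> {
    if !(1..=4).contains(&len) || (value as u64) >= 1u64 << (7 * len) {
        return None;
    }
    let tag = (1u32 << (len - 1)) - 1;
    let word = (value << len) | tag;
    Some(word.to_le_bytes()[..len].to_vec())
}
```

Under these assumptions, `encode_with_len(1, 1)` yields `[0x02]` and `encode_with_len(1, 4)` yields `[0x17, 0x00, 0x00, 0x00]`, and both decode to 1. A decoder that wanted to reject overlong forms would need an extra comparison against the shortest length for the decoded value.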

lhecker commented Jan 20, 2026

Yeah, I thought about that. SQLite's varint, for instance, doesn't support this but has a more efficient encoding. I intentionally decided against that approach, partly because decoding is faster this way, and partly because non-canonical encodings are quite beneficial:

When an LSH instruction jumps further down into the instruction stream, the address offset depends on the number of bytes in between. That number depends on the encoding size of all the varints in between. And those in turn could be downward jumps which, again, depend on the encoding size of other varints.

I'm sure I'll come up with a solution to this recursive problem at some point, if I want to. But I'm fairly certain that having non-canonical encodings will allow for easy "tie breakers" for any such algorithm.
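
One way the "tie breaker" idea could play out is sketched below. This is a generic relaxation loop, not the author's algorithm (the thread says none exists yet). It assumes a hypothetical instruction model (`Inst`), jump operands that hold the target's byte address, and 7 payload bits per operand byte with a 4-byte maximum. Every jump starts at the widest operand and shrinks pass by pass. Because overlong encodings are legal, the layout is valid after any number of passes, so the loop can stop early.

```rust
/// Hypothetical instruction model, for illustration only.
enum Inst {
    /// Any instruction with a fixed encoded size in bytes.
    Fixed(usize),
    /// One opcode byte followed by a varint operand holding the byte
    /// address of the instruction at index `target`.
    Jump { target: usize },
}

/// Shortest operand length for `addr` (assumed: 7 payload bits per byte,
/// at most 4 bytes, so addresses must stay below 2^28).
fn min_len(addr: usize) -> usize {
    match addr {
        0..=0x7F => 1,
        0x80..=0x3FFF => 2,
        0x4000..=0x1F_FFFF => 3,
        _ => 4,
    }
}

/// Byte address of every instruction under the given operand widths.
fn addresses(insts: &[Inst], widths: &[usize]) -> Vec<usize> {
    let mut pos = 0;
    insts
        .iter()
        .zip(widths)
        .map(|(inst, &w)| {
            let here = pos;
            pos += match inst {
                Inst::Fixed(n) => *n,
                Inst::Jump { .. } => 1 + w,
            };
            here
        })
        .collect()
}

/// Picks an operand width per instruction (0 for `Fixed`). Widths only
/// shrink, so addresses only shrink, and a width that was sufficient
/// before a pass is still sufficient after it.
fn layout(insts: &[Inst], max_passes: usize) -> Vec<usize> {
    let mut widths: Vec<usize> = insts
        .iter()
        .map(|inst| match inst {
            Inst::Fixed(_) => 0,
            Inst::Jump { .. } => 4,
        })
        .collect();
    for _ in 0..max_passes {
        let addrs = addresses(insts, &widths);
        let mut changed = false;
        for (inst, w) in insts.iter().zip(widths.iter_mut()) {
            if let Inst::Jump { target } = inst {
                let need = min_len(addrs[*target]);
                if need < *w {
                    *w = need;
                    changed = true;
                }
            }
        }
        if !changed {
            break;
        }
    }
    widths
}
```

For a jump over a 124-byte instruction, the operand shrinks from 4 bytes to 2 in the first pass and to 1 in the second, since each shrink also moves the target closer to address 0. With a canonical-only encoding, the intermediate layouts would not be encodable at all.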

@lhecker lhecker enabled auto-merge (squash) January 20, 2026 20:56
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.

//! Variable-length `u32` encoding and decoding, with efficient storage of `u32::MAX`.
Member

I guess I don't understand: it's not a u32 encoding, it's a u28 encoding with a special case for u32::MAX and a pretty significant gap between 268435455 and 4294967295.

Member Author

Yeah, that's fair. Perhaps I should move this into the lsh project now that I made it a library. 🤔 The reason it's a "u28" is that lsh really doesn't need values >2^28, while a compact encoding for one value above that range (u32::MAX) is still useful (it's used for setting the input offset to the maximum when matching a .*).
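
The representable range described in this exchange can be written down as a small check. This is only a sketch of the value set implied by the comments above (28-bit values plus `u32::MAX`). `MAX_SMALL` and `is_encodable` are hypothetical names, not part of the module.

```rust
/// Largest ordinary value: 2^28 - 1 = 268_435_455.
const MAX_SMALL: u32 = (1 << 28) - 1;

/// True if `value` falls in the set the thread describes: any 28-bit
/// value, or the special-cased `u32::MAX`. Everything in between is
/// the gap the reviewer points out.
fn is_encodable(value: u32) -> bool {
    value <= MAX_SMALL || value == u32::MAX
}
```

A caller encoding arbitrary `u32` values would need to handle the gap explicitly, for example by rejecting such inputs before encoding.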
