[tor-dev] Drafting a proposal for mnemonic onion URLs

Wed Dec 21 05:16:58 UTC 2011

On Tue, Dec 20, 2011 at 11:20 PM, Sai <tor at saizai.com> wrote:
> Howdy all.
>
> I would like to make a mnemonic translation method for onion URLs. Not
> as in a full petname system with registration and so forth, just a
> "four little words" style way of rendering and entering existing
> 80-bit hashes in a way that is memorable to not just detect
> partial-overlap fuzzing attacks but actually memorize and produce the
> URL directly.
>

So, the last time I was talking about this (with a grad student whose
name I have just asked for permission to cite), I came up with the
following scheme.  Let BW be the number of bits per natural-language
word; let BH be the number of bits per hidden service identity key
hash.  Let H() be some hash function.  Represent the onion key
identity hash as a sequence of BH/BW natural-language words, letting
each word W represent the first BW bits of H(W).

So if H() is SHA256, and BW is 20, and BH is 80, the hostname
  bookbind.electrodynamism.unsurnamed.squibbish
would represent the 80-bit sequence
  Ht("bookbind") Ht("electrodynamism") H("unsurnamed") H("squibbish")
where Ht(x) is the first 20 bits of SHA256(x).

(In practice, 80 bits is feeling way too short these days; but that's
another discussion.)

The main advantages and disadvantages of this idea are:

  + You don't need to ship a shared wordlist with the client software.
 (That's a good thing, since big dictionaries get annoying fast.)
  + You don't need to restrict yourself to any particular language; it
is not English-centric.
  - If bw is large enough, there will be sequences of bits that are
not represented as the hash of any particular English word.  (e.g., if
bw is 32, you are in trouble: no language has 4 billion words!)  You
can work around this by having your search program generate new
english-sounding sequences, by using a bigger wordlist or by
generating keys until you find one whose fingerprint *is*
representable with words you have.
  - If you make BW big, you are going to get some pretty obscure words.
  - Encodings are not unique.
  + You can get a lot of bits per word.  If you're willing to do stuff
like "every word in english" x "the numbers 00-99", you can get even
more.
  + It scales to more than 80 bits pretty easily.
  - It doesn't make pretty sentences by default, though you can keep
generating words till you find something you like the sound of.
  - We'd need to watch out for words that look equivalent to a human,
but aren't.
  / You actually need to write a reasonably good search program.  This
isn't too hard, and the database it generates isn't too big.

There are other issues and other possible solutions too.

yrs,
-- 
Nick