[tor-dev] An ANTLR 4 grammar for Tor bridge network statuses
nickm at torproject.org
Wed Nov 4 16:43:28 UTC 2015
On Wed, Nov 4, 2015 at 4:06 AM, Karsten Loesing <karsten at torproject.org> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> Hello developers,
> in the past few days I have been working on a grammar to parse Tor
> bridge network statuses and hopefully other Tor descriptors in the
> future. It's working, for some definition of working, but some issues
> remain and I need some help.
> I just uploaded my sources, consisting only of the grammar with a fair
> amount of documentation:
Nice work, Karsten! I'm hoping we move towards some kind of
machine-readable grammar/schema for all our data formats, and that we
have our actual parsing/encoding code generated from it.
(When I did a survey of where all our crash/assertion bugs for the
last few years were, they seemed to have a higher-than-usual
concentration in our parsing code.)
One thing about this grammar in particular, though: It is over-strict.
It matches only the formats we use today, and not the formats we are
allowed to use in the future. For one example, a flag on an 's' line
can be any non-space string - but this grammar will fail to parse
On the other hand, while we specify the order of r, s, w, p, and a lines
in a generated consensus, clients are required to parse the s, w, p,
and a lines in any order, but not to allow two s lines in a single 'r'
I think that because of the free-ordering and multiplicity-restriction
rules for our data formats, a context-free grammar simply isn't going
to match our spec very well.
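
To make the free-ordering and multiplicity point concrete, here is a rough
sketch in Python of the kind of check a hand-written parser ends up doing and
a plain context-free grammar expresses poorly. Everything named here is
hypothetical (parse_entries, StatusEntryError, the dictionary layout), and
treating w and p as at-most-once is an assumption on my part; only the
restriction on s lines is stated above. Flags are kept as arbitrary non-space
strings so that flags we add in the future still parse.

```python
# Sketch only: names and data layout are hypothetical, not taken from
# metrics-lib, Stem, or Zoossh.

# Assumption: w and p are treated like s (at most once per entry).
AT_MOST_ONCE = {"s", "w", "p"}
# Assumption: a lines may repeat within one entry.
ANY_NUMBER = {"a"}


class StatusEntryError(ValueError):
    """Raised when an entry violates the ordering/multiplicity rules."""


def parse_entries(lines):
    """Group lines into entries, each started by an 'r' line.

    The s, w, p, and a lines may follow the r line in any order.
    """
    entries = []
    current = None
    for lineno, line in enumerate(lines, start=1):
        if not line:
            continue
        keyword, _, rest = line.partition(" ")
        if keyword == "r":
            current = {"r": rest.split(), "a": []}
            entries.append(current)
        elif current is None:
            raise StatusEntryError(
                f"line {lineno}: {keyword!r} line before any 'r' line")
        elif keyword in AT_MOST_ONCE:
            if keyword in current:
                raise StatusEntryError(
                    f"line {lineno}: second {keyword!r} line in one entry")
            # Arguments (e.g. flags) are any non-space strings.
            current[keyword] = rest.split()
        elif keyword in ANY_NUMBER:
            current[keyword].append(rest)
        else:
            # Keep unknown lines instead of failing, for forward compatibility.
            current.setdefault("unrecognized", []).append(line)
    return entries
```

With this, an entry whose w line precedes its s line parses, an s line
carrying a flag nobody has defined yet parses, and a second s line in the same
entry is rejected.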
> Quoting from that file to facilitate discussion here:
> There are multiple goals of having a grammar for Tor descriptors
> available on CollecTor:
> 1. Translate descriptors to JSON for statistical analysis: Some tools
> and databases require Tor descriptors in a standard format like JSON.
> This grammar and a parser generated from it can help make that
> translation as easy as possible and keep future maintenance as
> low as possible.
> 2. Provide a basis for descriptor-parsing libraries: As of late 2015,
> there are three libraries for parsing Tor descriptors: metrics-lib for
> Java, Stem for Python, and Zoossh for Go. It would be beneficial to
> place as much knowledge about the descriptor format into a grammar
> shared by all those libraries and then generate parsers for different
> languages from that grammar.
> 3. Serve as documentation for the Tor directory protocol
> specification: Tor descriptors are already documented using a
> hand-written grammar, but that may contain slight inaccuracies because
> it's not verified. This grammar could fix that by either detecting
> inaccuracies while trying to rewrite it to an executable grammar form
> or by replacing the grammar in the specification documentation with
> this executable grammar.
> Open issues and questions:
> - Was it smart to explicitly include all those SP tokens in the
> rules, or should those be discarded right away by the lexer? The main
> reason for keeping them was to stay as close to the specification as
> possible, but maybe that has downsides on the other goals.
IMO, once we have a grammar that is truly correct, that grammar should
_be_ the spec, and we should revise the main spec to reference the
> - If a bridge uses a nickname (or other token that's supposed to be a
> STRING) that is also a keyword like "r" or "published", things get
> confusing. Try editing the input bridge network status and observe
> the result. But those are perfectly valid nicknames, so what can we do?
Change the lexing rules so that keywords are only recognized as such
at position 0 on the line, outside of a base64 block?
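
Roughly what I have in mind, as a hand-rolled Python sketch rather than ANTLR
lexer rules. The keyword set is an illustrative subset, the token names are
made up, and I am assuming base64 blocks are delimited by "-----BEGIN ...-----"
and "-----END ...-----" lines:

```python
# Sketch only: KEYWORDS is an illustrative subset and the token names
# are hypothetical.
KEYWORDS = {"r", "s", "w", "p", "a", "published"}


def lex(text):
    """Return (kind, value) tokens for a descriptor-like document.

    A word is a KEYWORD only when it is the first word on a line and the
    line is outside a base64 block; everywhere else it is a STRING.
    """
    tokens = []
    in_object = False
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith("-----BEGIN "):
            in_object = True
            tokens.append(("BEGIN", line))
        elif line.startswith("-----END "):
            in_object = False
            tokens.append(("END", line))
        elif in_object:
            # Inside a base64 block nothing is ever a keyword.
            tokens.append(("BASE64", line))
        else:
            first, *rest = line.split(" ")
            kind = "KEYWORD" if first in KEYWORDS else "STRING"
            tokens.append((kind, first))
            tokens.extend(("STRING", word) for word in rest if word)
    return tokens
```

So for the line "r published r AAAA" only the leading "r" comes out as a
keyword, and a bridge nicknamed "published" or "r" is an ordinary STRING.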