Re: [tor-dev] An ANTLR 4 grammar for Tor bridge network statuses

4 Nov 2015

On Wed, Nov 4, 2015 at 4:06 AM, Karsten Loesing <karsten@torproject.org> wrote:
...
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hello developers,
>
> in the past few days I have been working on a grammar to parse Tor
> bridge network statuses and hopefully other Tor descriptors in the
> future.  It's working, for some definition of working, but some issues
> remain and I need some help.
>
> I just uploaded my sources, consisting only of the grammar with a fair
> amount of documentation:
>
> https://people.torproject.org/~karsten/volatile/BridgeNetworkStatus.g4

Nice work, Karsten!  I'm hoping we move towards some kind of
machine-readable grammar/schema for all our data formats, and that we
have our actual parsing/encoding code generated from it.

(When I did a survey of where all our crash/assertion bugs for the
last few years were, they seemed to have a higher-than-usual
concentration in our parsing code.)

One thing about this grammar in particular, though: It is over-strict.
It matches only the formats we use today, and not the formats we are
allowed to use in the future.  For one example, a flag on an 's' line
can be any non-space string - but this grammar will fail to parse
unrecognized flags.

On the other hand, while we specify the order of the r, s, w, p, and a
lines in a generated consensus, clients are required to parse the s, w,
p, and a lines in any order, and to reject two s lines in a single 'r'
entry.

I think that because of the free-ordering and multiplicity-restriction
rules for our data formats, a context-free grammar simply isn't going
to match our spec very well.
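
To make the two points above concrete, here is a minimal Python sketch
of a hand-written entry parser.  It is only an illustration, not code
from tor or from any of the existing libraries: parse_entry and its
at_most_once parameter are hypothetical names, and the only
multiplicity rule it enforces by default is the "no two s lines" rule
mentioned above.  Lines after the 'r' line are accepted in any order,
and every non-space string on an 's' line is kept as a flag.

```python
def parse_entry(lines, at_most_once=("s",)):
    """Parse one 'r' entry: an 'r' line followed by other lines in any order.

    `at_most_once` lists the keywords that may appear only once per entry.
    Only the rule for 's' comes from the discussion; the rest is a sketch.
    """
    if not lines or not lines[0].startswith("r "):
        raise ValueError("entry must start with an 'r' line")
    entry = {"r": lines[0].split(" ")[1:]}
    for line in lines[1:]:
        keyword, _, rest = line.partition(" ")
        if keyword == "r":
            raise ValueError("a second 'r' line starts a new entry")
        if keyword in at_most_once and keyword in entry:
            raise ValueError(f"duplicate '{keyword}' line in entry")
        if keyword == "s":
            # Any non-space string is a flag, so unrecognized flags are kept.
            entry["s"] = rest.split()
        else:
            # Order does not matter; arguments are collected per keyword.
            entry.setdefault(keyword, []).append(rest)
    return entry


entry = parse_entry([
    "r nick AAAA",
    "w Bandwidth=10",
    "s Running FutureFlag",
    "a [::1]:9001",
])
```

A check like the duplicate test is a one-liner in hand-written code,
whereas a context-free grammar would have to enumerate the permitted
orderings to express the same constraint.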
...
> Quoting from that file to facilitate discussion here:
>
> There are multiple goals of having a grammar for Tor descriptors
> available on CollecTor:
>
> 1. Translate descriptors to JSON for statistical analysis: Some tools
> and databases require Tor descriptors in a standard format like JSON.
> This grammar and a parser generated from it can help making that
> translation as easy as possible, also to keep future maintenance as
> low as possible.
>
> 2. Provide a basis for descriptor-parsing libraries: As of late 2015,
> there are three libraries for parsing Tor descriptors: metrics-lib for
> Java, Stem for Python, and Zoossh for Go.  It would be beneficial to
> place as much knowledge about the descriptor format into a grammar
> shared by all those libraries and then generate parsers for different
> languages from that grammar.
>
> 3. Serve as documentation for the Tor directory protocol
> specification: Tor descriptors are already documented using a
> hand-written grammar, but that may contain slight inaccuracies because
> it's not verified.  This grammar could fix that by either detecting
> inaccuracies while trying to rewrite it to an executable grammar form
> or by replacing the grammar in the specification documentation with
> this executable grammar.
>
> Open issues and questions:
>
> - Was it smart to explicitly include all those SP tokens in the
> rules, or should those be discarded right away by the lexer?  The main
> reason for keeping them was to stay as close to the specification as
> possible, but maybe that has downsides on the other goals.

IMO, once we have a grammar that is truly correct, that grammar should
_be_ the spec, and we should revise the main spec to reference the
grammar.
...
> - If a bridge uses a nickname (or other token that's supposed to be a
> STRING) that is also a keyword like "r" or "published", things get
> confusing.  Try editing the input bridge network status and observe
> the result.  But those are perfectly valid nicknames, so what can we do?

Change the lexing rules so that keywords are only recognized as such
at position 0 on the line, outside of a base64 block?
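
As a rough illustration of that lexing rule, here is a small Python
sketch (not ANTLR, and not taken from the grammar file).  The tokenize
function, the token kind names, and the KEYWORDS subset are all
assumptions made for the example; it also assumes that blocks are
delimited by "-----BEGIN ...-----" and "-----END ...-----" lines.

```python
# Illustrative subset only; a real lexer would list every keyword.
KEYWORDS = {"r", "s", "w", "p", "a", "published"}


def tokenize(text):
    """Split a document into (kind, value) tokens, line by line.

    A word is a KEYWORD only if it is the first word on its line and the
    line is outside a BEGIN/END block; everything else is a STRING.
    """
    tokens = []
    in_block = False
    for line in text.splitlines():
        if not line:
            continue  # skip empty lines in this sketch
        if line.startswith("-----BEGIN "):
            in_block = True
            tokens.append(("BLOCK_BEGIN", line))
        elif line.startswith("-----END "):
            in_block = False
            tokens.append(("BLOCK_END", line))
        elif in_block:
            # Block contents are never examined for keywords.
            tokens.append(("BLOCK_DATA", line))
        else:
            for position, word in enumerate(line.split(" ")):
                is_keyword = position == 0 and word in KEYWORDS
                tokens.append(("KEYWORD" if is_keyword else "STRING", word))
    return tokens


tokens = tokenize("published 2015-11-04 04:06:00\nr published AAAA")
```

With this rule, the nickname "published" on the 'r' line comes out as
a STRING, because only the first word of a line can be a keyword.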

best wishes,
-- 
Nick

Nick Mathewson