An ANTLR 4 grammar for Tor bridge network statuses

Hello developers,

in the past few days I have been working on a grammar to parse Tor bridge network statuses and hopefully other Tor descriptors in the future. It's working, for some definition of working, but some issues remain and I need some help.

I just uploaded my sources, consisting only of the grammar with a fair amount of documentation:

https://people.torproject.org/~karsten/volatile/BridgeNetworkStatus.g4

Quoting from that file to facilitate discussion here:

There are multiple goals of having a grammar for Tor descriptors available on CollecTor:

1. Translate descriptors to JSON for statistical analysis: Some tools and databases require Tor descriptors in a standard format like JSON. This grammar and a parser generated from it can help making that translation as easy as possible, also to keep future maintenance as low as possible.

2. Provide a basis for descriptor-parsing libraries: As of late 2015, there are three libraries for parsing Tor descriptors: metrics-lib for Java, Stem for Python, and Zoossh for Go. It would be beneficial to place as much knowledge about the descriptor format into a grammar shared by all those libraries and then generate parsers for different languages from that grammar.

3. Serve as documentation for the Tor directory protocol specification: Tor descriptors are already documented using a hand-written grammar, but that may contain slight inaccuracies because it's not verified. This grammar could fix that by either detecting inaccuracies while trying to rewrite it to an executable grammar form or by replacing the grammar in the specification documentation with this executable grammar.

Open issues and questions:

- Was it smart to explicitly include all those SP tokens in the rules, or should those be discarded right away by the lexer? The main reason for keeping them was to stay as close to the specification as possible, but maybe that has downsides on the other goals.

- If a bridge uses a nickname (or other token that's supposed to be a STRING) that is also a keyword like "r" or "published", things get confusing. Try editing the input bridge network status and observe the result. But those are perfectly valid nicknames, so what can we do?

- It would be really nice to use regular expressions in the grammar to match input more thoroughly than just ~[ \n]+, if only we can fix the lexer troubles. It's a pity that all that verification work would need to be duplicated in each of the language-dependent parsers. That kinda defeats the purpose.

- Is it easy to walk the parse tree and output a JSON format *without* having to write code for each of the rules? Ideally, the translator would be 20 lines of code and not grow at all if we add 10 more descriptor types. Do we need to change the grammar for that?

- The following may turn out to be a non-issue, but some descriptors require lines to be ordered, e.g., "accept" and "reject" lines in server descriptors, and we'll have to retain that order in the parse tree. This should be similar to how we parse entries, starting with "r" lines, but who knows.

Feedback much appreciated!

All the best,
Karsten
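
On the question of walking the parse tree without per-rule code, here is a minimal Python sketch. It assumes a parse tree whose inner nodes carry a rule name and whose leaves carry token text. The Node class, the rule names, and the to_json helper are hypothetical stand-ins, not ANTLR's runtime API and not anything taken from BridgeNetworkStatus.g4; with a generated parser, the same recursion would presumably run over the runtime's own tree nodes.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical parse-tree node: rule nodes have children, leaves have text."""
    name: str
    text: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def to_json(node):
    """Convert any subtree to JSON-ready data without rule-specific code."""
    if not node.children:
        return node.text
    # A list of single-key objects keeps repeated children (two "flag" nodes)
    # and their order; one dict per node would collapse duplicate keys.
    return [{child.name: to_json(child)} for child in node.children]


# Hand-built tree standing in for the parser output of one entry.
entry = Node("entry", children=[
    Node("r_line", children=[
        Node("nickname", "published"),
        Node("address", "10.0.0.1"),
    ]),
    Node("s_line", children=[
        Node("flag", "Running"),
        Node("flag", "Stable"),
    ]),
])

print(json.dumps({entry.name: to_json(entry)}))
```

Because the function only looks at names, text, and children, adding more descriptor types would not require changing it. The price is a verbose JSON shape that mirrors the grammar rather than a hand-designed schema.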

On Wed, Nov 4, 2015 at 4:06 AM, Karsten Loesing <karsten@torproject.org> wrote:
> Hello developers,
>
> in the past few days I have been working on a grammar to parse Tor bridge network statuses and hopefully other Tor descriptors in the future. It's working, for some definition of working, but some issues remain and I need some help.
>
> I just uploaded my sources, consisting only of the grammar with a fair amount of documentation:
>
> https://people.torproject.org/~karsten/volatile/BridgeNetworkStatus.g4

Nice work, Karsten! I'm hoping we move towards some kind of machine-readable grammar/schema for all our data formats, and that we have our actual parsing/encoding code generated from it.

(When I did a survey of where all our crash/assertion bugs for the last few years were, they seemed to have a higher-than-usual concentration in our parsing code.)

One thing about this grammar in particular, though: It is over-strict. It matches only the formats we use today, and not the formats we are allowed to use in the future. For one example, a flag on an 's' line can be any non-space string - but this grammar will fail to parse unrecognized flags.

On the other hand, while we specify the order of r, s, w, p, a, lines in a generated consensus, clients are required to parse the s, w, p, and a lines in any order, but not to allow two s lines in a single 'r' entry.

I think that because of the free-ordering and multiplicity-restriction rules for our data formats, a context-free grammar simply isn't going to match our spec very well.

> Quoting from that file to facilitate discussion here:
>
> There are multiple goals of having a grammar for Tor descriptors available on CollecTor:
>
> 1. Translate descriptors to JSON for statistical analysis: Some tools and databases require Tor descriptors in a standard format like JSON. This grammar and a parser generated from it can help making that translation as easy as possible, also to keep future maintenance as low as possible.
>
> 2. Provide a basis for descriptor-parsing libraries: As of late 2015, there are three libraries for parsing Tor descriptors: metrics-lib for Java, Stem for Python, and Zoossh for Go. It would be beneficial to place as much knowledge about the descriptor format into a grammar shared by all those libraries and then generate parsers for different languages from that grammar.
>
> 3. Serve as documentation for the Tor directory protocol specification: Tor descriptors are already documented using a hand-written grammar, but that may contain slight inaccuracies because it's not verified. This grammar could fix that by either detecting inaccuracies while trying to rewrite it to an executable grammar form or by replacing the grammar in the specification documentation with this executable grammar.
>
> Open issues and questions:
>
> - Was it smart to explicitly include all those SP tokens in the rules, or should those be discarded right away by the lexer? The main reason for keeping them was to stay as close to the specification as possible, but maybe that has downsides on the other goals.

IMO, once we have a grammar that is truly correct, that grammar should _be_ the spec, and we should revise the main spec to reference the grammar.

> - If a bridge uses a nickname (or other token that's supposed to be a STRING) that is also a keyword like "r" or "published", things get confusing. Try editing the input bridge network status and observe the result. But those are perfectly valid nicknames, so what can we do?

Change the lexing rules so that keywords are only recognized as such at position 0 on the line, outside of a base64 block?

best wishes,
--
Nick
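
The position-0 idea can be illustrated outside ANTLR. The following is a minimal Python sketch, not an ANTLR 4 lexer: it treats only the first word of each line as a possible keyword, emits everything after it as plain strings, and skips keyword recognition between BEGIN and END armor lines. The keyword set, the token names, and the tokenize function are assumptions made for illustration and are not taken from the grammar file.

```python
# Assumed subset of keywords, for illustration only.
KEYWORDS = {"published", "flag-thresholds", "r", "a", "s", "w", "p"}


def tokenize(text):
    """Yield (kind, value) pairs; keywords are recognized only at line start."""
    in_block = False
    for line in text.splitlines():
        if not line:
            continue  # skip empty lines in this sketch
        if line.startswith("-----BEGIN "):
            in_block = True
            yield ("BLOCK_BEGIN", line)
        elif line.startswith("-----END "):
            in_block = False
            yield ("BLOCK_END", line)
        elif in_block:
            # Base64 payload: never inspected for keywords.
            yield ("BLOCK_DATA", line)
        else:
            first, *rest = line.split(" ")
            yield ("KEYWORD" if first in KEYWORDS else "STRING", first)
            for word in rest:
                # Position > 0: always a STRING, even if it spells a keyword.
                yield ("STRING", word)
        yield ("NL", "\n")


# A bridge nicknamed "published" and a (made-up) flag named "r".
sample = (
    "published 2015-11-04 08:06:00\n"
    "r published AAAA BBBB 2015-11-04 08:00:00 10.0.0.1 443 0\n"
    "s Running r\n"
)
tokens = list(tokenize(sample))
for token in tokens:
    print(token)
```

In ANTLR 4 itself, lexer modes or a semantic predicate on the current column look like the natural places to try the same thing, though that is untested here.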

On 04/11/15 17:43, Nick Mathewson wrote:
> On Wed, Nov 4, 2015 at 4:06 AM, Karsten Loesing <karsten@torproject.org> wrote:
>> Hello developers,
>>
>> in the past few days I have been working on a grammar to parse Tor bridge network statuses and hopefully other Tor descriptors in the future. It's working, for some definition of working, but some issues remain and I need some help.
>>
>> I just uploaded my sources, consisting only of the grammar with a fair amount of documentation:
>>
>> https://people.torproject.org/~karsten/volatile/BridgeNetworkStatus.g4
>
> Nice work, Karsten! I'm hoping we move towards some kind of machine-readable grammar/schema for all our data formats, and that we have our actual parsing/encoding code generated from it.
>
> (When I did a survey of where all our crash/assertion bugs for the last few years were, they seemed to have a higher-than-usual concentration in our parsing code.)

Great, added as goal #4 in the document.

> One thing about this grammar in particular, though: It is over-strict. It matches only the formats we use today, and not the formats we are allowed to use in the future. For one example, a flag on an 's' line can be any non-space string - but this grammar will fail to parse unrecognized flags.

Right, that one would be easy to fix by just accepting any string and moving the checks whether a string is a valid flag to the parser.

> On the other hand, while we specify the order of r, s, w, p, a, lines in a generated consensus, clients are required to parse the s, w, p, and a lines in any order, but not to allow two s lines in a single 'r' entry.
>
> I think that because of the free-ordering and multiplicity-restriction rules for our data formats, a context-free grammar simply isn't going to match our spec very well.

Indeed, this grammar accepts any number of s, w, and p lines per r entry and leaves it to the parser to make sure there's at most one. It would be good to look for alternatives here. I added this as another item for the "open issues and questions" section.

>> Quoting from that file to facilitate discussion here:
>>
>> There are multiple goals of having a grammar for Tor descriptors available on CollecTor:
>>
>> 1. Translate descriptors to JSON for statistical analysis: Some tools and databases require Tor descriptors in a standard format like JSON. This grammar and a parser generated from it can help making that translation as easy as possible, also to keep future maintenance as low as possible.
>>
>> 2. Provide a basis for descriptor-parsing libraries: As of late 2015, there are three libraries for parsing Tor descriptors: metrics-lib for Java, Stem for Python, and Zoossh for Go. It would be beneficial to place as much knowledge about the descriptor format into a grammar shared by all those libraries and then generate parsers for different languages from that grammar.
>>
>> 3. Serve as documentation for the Tor directory protocol specification: Tor descriptors are already documented using a hand-written grammar, but that may contain slight inaccuracies because it's not verified. This grammar could fix that by either detecting inaccuracies while trying to rewrite it to an executable grammar form or by replacing the grammar in the specification documentation with this executable grammar.
>>
>> Open issues and questions:
>>
>> - Was it smart to explicitly include all those SP tokens in the rules, or should those be discarded right away by the lexer? The main reason for keeping them was to stay as close to the specification as possible, but maybe that has downsides on the other goals.
>
> IMO, once we have a grammar that is truly correct, that grammar should _be_ the spec, and we should revise the main spec to reference the grammar.
>
>> - If a bridge uses a nickname (or other token that's supposed to be a STRING) that is also a keyword like "r" or "published", things get confusing. Try editing the input bridge network status and observe the result. But those are perfectly valid nicknames, so what can we do?
>
> Change the lexing rules so that keywords are only recognized as such at position 0 on the line, outside of a base64 block?

That would work, but I have no clue how to convince ANTLR 4 to generate such a lexer. I added this suggestion to the file in the hope that somebody else who knows more about this stuff can pick this up.

Thanks for the great suggestions! New version available here, if somebody else wants to look:

https://people.torproject.org/~karsten/volatile/BridgeNetworkStatus.g4

All the best,
Karsten
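
The split discussed in this reply, where the grammar accepts s, w, and p lines in any order and any number and a later pass enforces at most one of each per r entry, can be sketched in Python. The function name, the input shape (a list of (keyword, arguments) pairs per entry), and the use of ValueError are assumptions for illustration; they are not part of the grammar or of any existing library. Flags are deliberately accepted as arbitrary strings, so unrecognized future flags pass through.

```python
from collections import Counter

# Keywords allowed at most once per entry, per the thread; "a" lines are
# left unrestricted in this sketch.
AT_MOST_ONCE = {"s", "w", "p"}


def check_entry(lines):
    """Validate one entry given as [(keyword, [args...]), ...], "r" line first."""
    if not lines or lines[0][0] != "r":
        raise ValueError("entry must start with an 'r' line")
    counts = Counter(keyword for keyword, _ in lines[1:])
    if counts["r"]:
        raise ValueError("second 'r' line inside one entry")
    for keyword in sorted(AT_MOST_ONCE):
        if counts[keyword] > 1:
            raise ValueError(f"more than one '{keyword}' line in entry")
    # Build the result; the order of s/w/p lines in the input does not matter.
    entry = {"r": lines[0][1]}
    for keyword, args in lines[1:]:
        if keyword in AT_MOST_ONCE:
            entry[keyword] = args
        else:
            entry.setdefault(keyword, []).append(args)
    return entry


# "w" before "s" is fine, and so is a flag nobody has defined yet.
ok = check_entry([
    ("r", ["Unnamed", "10.0.0.1"]),
    ("w", ["Bandwidth=20"]),
    ("s", ["Running", "FutureFlag"]),
])
print(ok)
```

Passing two ("s", ...) pairs in one entry raises ValueError, which is the multiplicity restriction that a context-free grammar would otherwise have to spell out by enumerating every permitted ordering.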

participants (2)
- Karsten Loesing
- Nick Mathewson