Hi Damian,
I'm interested in building a lightweight, internal domain-specific language to explore archived Tor data. The goal is to make it easy to answer questions like the one that recently came up on tor-relays, "how many guards shift location significantly across the Internet, and how often?" Combining Stem and zoossh seems like a good solution.
Ideally, zoossh should do the heavy lifting as it's implemented in a compiled language. For data exploration, however, having a Stem-enabled Python shell with a set of analysis methods sounds better. The question now is how to pass potentially large amounts of readily-parsed consensuses and descriptors from zoossh to Stem. In a perfect world, we would have bindings to use zoossh in Python. The gopy [0] folks are working on that, but it's a young project; interfaces are not yet supported. Two workarounds come to mind until gopy catches up, both requiring some glue code:
1. Let zoossh do the data filtering and then return a list of files that are then parsed again by Stem. That's easy to implement, but can be quite inefficient if the filtering step still returns plenty of data.
2. Have some IPC mechanism that passes objects from zoossh to Stem. Objects could be serialised in some way to minimise unnecessary parsing. While that might be the most efficient option for now, it probably requires too much work.
3. ...something else I didn't consider?
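For illustration, option 2 could look something like the following on the Python side, assuming the zoossh side were extended to emit one JSON object per router status over a pipe (the exporter and its field names are made up; zoossh has no such output mode today):

```python
import io
import json

def read_router_statuses(stream):
    """Consume newline-delimited JSON router entries, one object per line.

    The zoossh side would serialise each parsed router status with Go's
    encoding/json and write it to the pipe; the field names below are
    invented for illustration.
    """
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# Simulate the pipe with a string; in practice this would be the stdout
# of a zoossh-based exporter process.
pipe = io.StringIO(
    '{"fingerprint": "A1B2", "or_port": 9001, "dir_port": 9030}\n'
    '{"fingerprint": "C3D4", "or_port": 443, "dir_port": 80}\n'
)

routers = list(read_router_statuses(pipe))
print(len(routers), routers[0]["fingerprint"])  # 2 A1B2
```

This keeps the serialisation format trivial on both sides, at the cost of re-materialising every object in Python.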
Please let me know if you have any thoughts.
[0] https://github.com/go-python/gopy
Cheers, Philipp
Herrow,
3. Something else you didn't consider.
You're describing something which I've been tinkering with recently so I'll add some thoughts. I've looked at zoossh and stem for parsing. They are inadequate alone. What you need is to properly define this domain-specific language using a context-free grammar. Then it doesn't matter how you parse the data, or what language, and the semantic analysis phase can be mapped to a variety of analysis/viz tools from SciPy to R.
Thinking of this in terms of the parser is taking too limited a view. The parser is really not that important. The code implemented in both parsers you describe is naive because it alone doesn't perform the semantic analysis needed to derive useful inferences. Parsing only produces primitives for analysis, and the rate at which these primitives are produced is a completely different problem from the one you're trying to solve.
We may cross paths on this subject again in the future. Regards --leeroy
On Tue, Jul 28, 2015 at 07:30:02PM -0400, l.m wrote:
> What you need is to properly define this domain-specific language using a context-free grammar. Then it doesn't matter how you parse the data, or what language, and the semantic analysis phase can be mapped to a variety of analysis/viz tools from SciPy to R.
I think I wasn't clear in my previous email, sorry for that. I want the DSL to be internal, with Python as host language, which means that I don't have to design a language from scratch. The DSL part is really just a set of Python methods (and perhaps redefined operators) to intuitively interact with objects. Therefore, I'm looking into ways to get these objects into the Python interpreter as efficiently as possible.
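To make the idea concrete, here is a mockup of what such an internal DSL could feel like. The RelaySet class, its methods, and the dict fields are all hypothetical; this is not Stem's API, just a sketch of chained filters and a redefined operator:

```python
class RelaySet:
    """Hypothetical wrapper around parsed router statuses."""

    def __init__(self, relays):
        self._relays = list(relays)

    def where(self, predicate):
        # Chainable filter returning a new set.
        return RelaySet(r for r in self._relays if predicate(r))

    def __and__(self, other):
        # Redefined '&' operator: intersection by fingerprint.
        fps = {r["fingerprint"] for r in other._relays}
        return RelaySet(r for r in self._relays if r["fingerprint"] in fps)

    def __len__(self):
        return len(self._relays)

# A toy "consensus"; real objects would come from the parsing backend.
consensus = RelaySet([
    {"fingerprint": "A1", "flags": ["Guard", "Fast"], "bandwidth": 5000},
    {"fingerprint": "B2", "flags": ["Exit", "Fast"], "bandwidth": 2000},
    {"fingerprint": "C3", "flags": ["Guard"], "bandwidth": 300},
])

guards = consensus.where(lambda r: "Guard" in r["flags"])
fast = consensus.where(lambda r: "Fast" in r["flags"])
fast_guards = guards & fast
print(len(fast_guards))  # 1
```

The point is only that the "language" is ordinary Python method calls and operators over descriptor objects.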
Cheers, Philipp
Hi again,
So it's really not a domain-specific language at all then? You can do that without a specific parser and without Stem. Just feed the data subset into your favorite analysis tool. Stem and parsers by themselves are basically useless for analysis. Without an integrated method of performing semantic analysis (specific to the data) you end up with excessive implementation complexity and exponential processing time for even trivial tasks.
You can get the data there fast, sure, but parsing is inherently naive for all data types. The gold standard for efficiency is automata-based recognition. To my knowledge none of the parsers do this, so none can be considered efficient. Even if they did, though, it wouldn't matter. Semantic analysis is hard. So even if the parsers were realized ideally, and you had your pick of the best, the processing would still end up being exponential. Just some food for thought.
In any case, good luck, and I'll probably bring this up at some future measurement team meeting.
Regards --leeroy
Hi Philipp, sorry about the delay! Spread pretty thin right now. Would you mind discussing more about the use cases, and give a mockup for what this new domain specific language would look like in practice?
My first thought is "would such a language be useful enough to be worth investing time to learn?". I've made lots of things that flopped because they didn't serve a true need, and while a domain specific language for descriptors sounds neat I'm not sure if I'm seeing a need for it.
Roger occasionally asks me to write one-off scripts to answer questions about the tor network, such as "how do the votes of dirauth X compare with Y?" or "how many relays are unmeasured by the bandwidth auths?"...
https://stem.torproject.org/tutorials/examples/compare_flags.html
https://stem.torproject.org/tutorials/examples/votes_by_bandwidth_authoritie...
These questions generally take me fifteen minutes or so to answer. Yes, yes, I'm the author of Stem so that's skewed. But still, the descriptor APIs are simple enough that anyone should be able to do much the same with only a basic knowledge of Python.
> Ideally, zoossh should do the heavy lifting as it's implemented in a compiled language.
This is assuming zoossh is dramatically faster than Stem by virtue of being compiled. I know we've discussed this before but I forget the results - with the latest tip of Stem (ie, with lazy loading) how do they compare? I'd expect time to be mostly bound by disk IO, so little to no difference.
> - Let zoossh do the data filtering and then return a list of files that are then parsed again by Stem. That's easy to implement, but can be quite inefficient if the filtering step still returns plenty of data.
Yup, agreed. This plan would essentially be to double parse the results and I'd expect it to be far slower than using either library alone.
> - Have some IPC mechanism that passes objects from zoossh to Stem. Objects could be serialized in some way to minimize unnecessary parsing. While that might be the most efficient option for now, it probably requires too much work.
Again agreed. Theoretically possible - you could make a blank Stem descriptor object, then populate its attributes with the zoossh parsed results. However, this would require you to maintain the hand-built conversion function. And again, I'm also doubtful it would yield a performance benefit in practice.
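A sketch of what that hand-built conversion function might look like, using a SimpleNamespace as a stand-in for the blank Stem descriptor object (a real version would construct a RouterStatusEntryV3; the field names on both sides are invented for illustration):

```python
from types import SimpleNamespace

# Hypothetical mapping from zoossh-style field names to Stem-style
# attribute names; this is the hand-maintained part.
FIELD_MAP = {
    "Fingerprint": "fingerprint",
    "Nickname": "nickname",
    "ORPort": "or_port",
    "DirPort": "dir_port",
}

def zoossh_to_stem_like(zoossh_entry):
    """Populate a blank object with zoossh's parsed results."""
    obj = SimpleNamespace()
    for src, dst in FIELD_MAP.items():
        setattr(obj, dst, zoossh_entry.get(src))
    return obj

entry = zoossh_to_stem_like({"Fingerprint": "A1B2", "Nickname": "moria1",
                             "ORPort": 9101, "DirPort": 9131})
print(entry.nickname, entry.or_port)  # moria1 9101
```

Every new descriptor field would have to be added to FIELD_MAP by hand, which is exactly the maintenance burden described above.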
None of this is to say 'don't give it a shot'. If you think this would be a fun project then by all means dig in! Just voicing my two cents that our efforts might be better spent elsewhere. ;)
Cheers! -Damian
Hi Philipp,
I know I've already mentioned some thoughts on this subject. I would be interested in your thoughts on the types of challenging questions such a hypothetical DSL might answer. I've already put some effort into this (forking metrics-lib), but I'm still new to working with tor network data. There's around a terabyte of it and I can't possibly imagine every interesting scenario at this point. Right now I'm trying to be as flexible (and general) as possible in the implementation. Besides what you've already mentioned, what other types of uses do you envision? I'm interested in being able to answer questions that can only be answered by looking at macroscopic-level details over time. Things like how to draw interesting facts from performance data, and how to improve collection (signalling, messaging, new metrics, etc.) towards making attacks more visible.
Areas I'm fuzzy on include torflow data, mostly because up until a couple of weeks ago I didn't know there *was* a spec (and instead treated it as a black box).
If there are common and challenging questions that are more specific than just 'dive in and explore', please do be creative.
Thanks --leeroy
On Fri, Jul 31, 2015 at 04:22:19PM -0400, l.m wrote:
> I know I've already mentioned some thoughts on this subject. I would be interested in your thoughts on the types of challenging questions such a hypothetical DSL might answer. I've already put some effort into this (forking metrics-lib), but I'm still new to working with tor network data. There's around a terabyte of it and I can't possibly imagine every interesting scenario at this point. Right now I'm trying to be as flexible (and general) as possible in the implementation. Besides what you've already mentioned, what other types of uses do you envision? I'm interested in being able to answer questions that can only be answered by looking at macroscopic-level details over time. Things like how to draw interesting facts from performance data, and how to improve collection (signalling, messaging, new metrics, etc.) towards making attacks more visible.
My work on finding Sybil groups has brought me to this problem, which is why I usually find myself asking questions such as:
- "Which relays satisfy a given pattern, e.g., ORPort = n, DirPort = n+1?"
- "Are these n relays run by the same operator? What are the similarities between their descriptors?"
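The first pattern, for instance, could be checked with a few lines once the port fields are available (the dict layout here is hypothetical, standing in for whatever the parsing backend produces):

```python
def matches_port_pattern(relay):
    """True when DirPort == ORPort + 1, one descriptor pattern that can
    hint at relays configured from the same template."""
    if relay["dir_port"] is None:  # relay has no DirPort at all
        return False
    return relay["dir_port"] == relay["or_port"] + 1

relays = [
    {"fingerprint": "A1", "or_port": 9001, "dir_port": 9002},
    {"fingerprint": "B2", "or_port": 443, "dir_port": 80},
    {"fingerprint": "C3", "or_port": 8080, "dir_port": None},
]

suspicious = [r["fingerprint"] for r in relays if matches_port_pattern(r)]
print(suspicious)  # ['A1']
```

The interesting part is running such predicates over years of archived consensuses quickly, which is where the parsing throughput question below comes in.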
A database might be sufficiently good at answering many of these questions. See the discussion I recently had with Karsten on #tor-project: http://meetbot.debian.net/tor-project/2015/tor-project.2015-07-29-14.01.log.html
Ignoring my own interests, I could imagine several other parties being interested in your language. Relay operators might want to learn things about their relays that Onionoo is unable to provide. Researchers might want to use it to empirically justify design decisions.
I'm probably stating the obvious here, but keep in mind that the usefulness of your language also depends on how easy it is to use. Not everybody will be willing to learn a new language, regardless of its flexibility, when they could instead write their own scripts to achieve the same task in a few hours.
Cheers, Philipp
Hi Philipp,
First, thank you for the input. I will certainly review your discussions with other measurement team members. I'm sorry I wasn't able to attend.
On the subject of databases and why they're a kludge. Databases represent relationships between data as joins. Joins are a construct which must be maintained by the database and persisted or enforced by integrity constraints. A database may be useful for storing data in its final form, and for representing relationships between such entities. It requires computation in an interpreted language, and joins are not represented using formal math. (In a manner of speaking, database theory does encompass some abstract mathematical objects in the form of sets.) Storing data and representing known relationships is what a database is designed for. Analyzing data and finding dynamic relationships is something a database will never do well--it's outside the intended use. Formal (mathematical) methods for representing semantics can always be proved correct using rigorous methods, and will always be faster. Imagine if tor's path selection algorithm were implemented as a database. It would work, but the math-derived implementation would also be vastly superior.
Allow me to clarify further. The formal language described here is used to derive subset languages. In a manner of speaking, the base language is a representation of tor's network communication. By adding additional grammar to this language, a researcher can formally define the semantic relationships that hold particular interest or meaning. One researcher who is only interested in onionoo-like applications (which is me in this case, not Karsten) would create a grammar describing such content. Another who is interested in a particular class of analysis might have another grammar. Right now my objective in the forks is to make this possible (it's not currently).
The advantage is that it's easy to maintain for researchers, easy to maintain for developers, easy to create proofs about the system, and easy to implement formal validation methods (which you may really want for some important classes of research).
So there's really not a language to learn per se. It's a formal method of making all that tor-network gibberish make sense. Once you've described the semantic meaning it's *all* automatic. Want that semantic relationship to build a shiny viz in R--automatic. Want those semantics to trigger an email for censorship--automatic. Would you rather have a report and a graph describing nodes involved in a potential attack--automatic. Would you like to create JSON representations of related entities--automatic.
Strangely, in the history of analysis at the tor project, no one has tried this, and it is not implemented in any reusable/presentable form. I very much doubt a potential sponsor would be willing to sponsor work on metrics-lib, because it's basically useless for analysis (the same as the others I've mentioned). A researcher has to do too much work to perform analysis to see the tor project as having contributed to making it easy.
I hope that clears things up about having to learn a language. Although that's also possible, the techniques are not being used here to create a programming language. The techniques are being used to perform linguistics on tor data. It's possible, however, to extend this work to define a language for programming, but that's not the primary objective. (An implementation such as I describe would make that possible in a formal way--which is good, of course.)
Regards --leeroy
On Fri, Jul 31, 2015 at 10:00:27AM -0700, Damian Johnson wrote:
> Hi Philipp, sorry about the delay! Spread pretty thin right now. Would you mind discussing more about the use cases, and give a mockup for what this new domain specific language would look like in practice?
>
> My first thought is "would such a language be useful enough to be worth investing time to learn?". I've made lots of things that flopped because they didn't serve a true need, and while a domain specific language for descriptors sounds neat I'm not sure if I'm seeing a need for it.
I'm not quite sure yet myself. After talking to Karsten, a simple database might be good enough. Or simply reorganising the directory structure of archived data to efficiently find the consensuses a given relay fingerprint shows up in. Either way, thanks for your thoughts!
> > Ideally, zoossh should do the heavy lifting as it's implemented in a compiled language.
> This is assuming zoossh is dramatically faster than Stem by virtue of being compiled. I know we've discussed this before but I forget the results - with the latest tip of Stem (ie, with lazy loading) how do they compare? I'd expect time to be mostly bound by disk IO, so little to no difference.
zoossh's test framework says that it takes 36364357 nanoseconds (roughly 36 ms) to lazily parse a consensus that is cached in memory (to eliminate the I/O bottleneck). That amounts to approximately 27 consensuses a second.
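As a quick sanity check of that figure (a sketch, not part of the original measurement):

```python
# Convert zoossh's reported nanoseconds-per-consensus into
# consensuses per second.
ns_per_consensus = 36364357          # reported by zoossh's benchmark
per_second = 1e9 / ns_per_consensus
print(round(per_second, 1))  # 27.5
```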
I used the following simple Python script to get a similar number for Stem:
    with open(file_name) as consensus_file:
        for router in stem.descriptor.parse_file(consensus_file,
                                                 'network-status-consensus-3 1.0',
                                                 document_handler = stem.descriptor.DocumentHandler.ENTRIES):
            pass
This script manages to parse 24 consensus files in ~13 seconds, which amounts to 1.8 consensuses a second. Let me know if there's a more efficient way to do this in Stem.
Cheers, Philipp
> > > Ideally, zoossh should do the heavy lifting as it's implemented in a compiled language.
> >
> > This is assuming zoossh is dramatically faster than Stem by virtue of being compiled. I know we've discussed this before but I forget the results - with the latest tip of Stem (ie, with lazy loading) how do they compare? I'd expect time to be mostly bound by disk IO, so little to no difference.
>
> zoossh's test framework says that it takes 36364357 nanoseconds to lazily parse a consensus that is cached in memory (to eliminate the I/O bottleneck). That amounts to approximately 27 consensuses a second.
>
> I used the following simple Python script to get a similar number for Stem:
>
>     with open(file_name) as consensus_file:
>         for router in stem.descriptor.parse_file(consensus_file,
>                                                  'network-status-consensus-3 1.0',
>                                                  document_handler = stem.descriptor.DocumentHandler.ENTRIES):
>             pass
>
> This script manages to parse 24 consensus files in ~13 seconds, which amounts to 1.8 consensuses a second. Let me know if there's a more efficient way to do this in Stem.
Interesting! First thought is 'wonder if zoossh is even reading the file content'. Couple quick things to try are...
    with open(file_name) as consensus_file:
        consensus_file.read()
... to see how much time is disk IO versus parsing. Second is to try doing something practical (say, count the number of relays with the exit flag). Stem does some bytes => unicode normalization which might account for some difference but other than that I'm at a loss for what would be taking the time.
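A minimal sketch of that timing split, with a stand-in parse step and a throwaway file so it is self-contained (a real measurement would read an actual consensus and call Stem's parse_file in place of the stub):

```python
import os
import tempfile
import time

def timed(fn, *args):
    """Return (seconds elapsed, result) for a single call of fn."""
    start = time.perf_counter()
    result = fn(*args)
    return time.perf_counter() - start, result

def read_only(file_name):
    # Measures pure disk IO: read the bytes, do nothing else.
    with open(file_name) as f:
        return f.read()

def parse(text):
    # Stand-in for the real parse step; splitting into lines keeps the
    # sketch self-contained.
    return text.splitlines()

# Write a tiny stand-in "consensus" so the sketch runs anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("r moria1 ...\nr faravahar ...\n")
    file_name = tmp.name

io_time, text = timed(read_only, file_name)
parse_time, routers = timed(parse, text)
os.unlink(file_name)
print(len(routers))  # 2
```

Comparing io_time against parse_time over a real archive would show how much of the gap between the two libraries is IO rather than parsing.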
Cheers! -Damian
On Sun, Aug 16, 2015 at 02:44:40PM -0700, Damian Johnson wrote:
> > > > Ideally, zoossh should do the heavy lifting as it's implemented in a compiled language.
> > >
> > > This is assuming zoossh is dramatically faster than Stem by virtue of being compiled. I know we've discussed this before but I forget the results - with the latest tip of Stem (ie, with lazy loading) how do they compare? I'd expect time to be mostly bound by disk IO, so little to no difference.
> >
> > zoossh's test framework says that it takes 36364357 nanoseconds to lazily parse a consensus that is cached in memory (to eliminate the I/O bottleneck). That amounts to approximately 27 consensuses a second.
> >
> > I used the following simple Python script to get a similar number for Stem:
> >
> >     with open(file_name) as consensus_file:
> >         for router in stem.descriptor.parse_file(consensus_file,
> >                                                  'network-status-consensus-3 1.0',
> >                                                  document_handler = stem.descriptor.DocumentHandler.ENTRIES):
> >             pass
> >
> > This script manages to parse 24 consensus files in ~13 seconds, which amounts to 1.8 consensuses a second. Let me know if there's a more efficient way to do this in Stem.
>
> Interesting! First thought is 'wonder if zoossh is even reading the file content'. Couple quick things to try are...
>
>     with open(file_name) as consensus_file:
>         consensus_file.read()
Disk IO is negligible for both tests because the file content is cached in memory. As expected, consensus_file.read() terminates almost instantly.
FWIW, zoossh doesn't parse as much as Stem does, so it's not quite an apples-to-apples comparison. For example, exit policies are not parsed and simply stored as strings for now.
Cheers, Philipp