Hi Mridul,
I'm copying tor-dev@, so other folks can chime in.
On Thu, May 12, 2016 at 10:16:45AM +0530, Mridul Malpotra wrote:
a) Can you give me a short description about the program flow on how the EventHandler class enables modules to be executed in exitmap? From my initial pondering over the code, the circuits seem to be created in succession. Then for each circuit creation, the listener catches the event and starts by calling module_closure for invoking probe() function specific to each module. However, I am having trouble understanding the role of the command utility and IPC queue, for which I can see a separate queue having the exit_fingerprint and socket pairs but fail to comprehend how it is being used.
Like you said, circuits are created sequentially in exitmap.py:419. We have a configurable inter-circuit delay (exitmap.py:442) to reduce the load on the Tor network.
Before exitmap asks Stem to create circuits, it registers event listeners for all circuit and stream events (exitmap.py:371). So whenever Tor tells Stem that a circuit or stream has changed, Stem notifies exitmap. We only care about a subset of all circuit and stream events, though.
All the event code is in eventhandler.py. Once Tor manages to create a new circuit, it tells Stem, which tells exitmap. We catch this event in the new_circuit() function (eventhandler.py:254).
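To make the "subset of events" idea concrete, here is a rough sketch of how a listener could filter what Stem hands it. It is not the code in eventhandler.py: the CircEvent/StreamEvent tuples are hypothetical stand-ins for Stem's event objects, and EventFilter, new_event() and new_stream() are names I chose for illustration. Only new_circuit() and the BUILT and NEW status values (from the Tor control protocol) are taken as given.

```python
from collections import namedtuple

# Hypothetical stand-ins for the event objects a Stem listener receives.
CircEvent = namedtuple("CircEvent", ["kind", "status", "id"])
StreamEvent = namedtuple("StreamEvent", ["kind", "status", "id"])


class EventFilter:
    """Illustrative listener that reacts only to a subset of events."""

    def __init__(self):
        self.built_circuits = []  # circuits that are ready for a module
        self.new_streams = []     # streams that still need a circuit

    def new_circuit(self, event):
        # Ignore LAUNCHED, EXTENDED, FAILED, CLOSED and so on.
        if event.status == "BUILT":
            self.built_circuits.append(event.id)

    def new_stream(self, event):
        # Only brand-new streams are waiting to be attached.
        if event.status == "NEW":
            self.new_streams.append(event.id)

    def new_event(self, event):
        # Single entry point: dispatch on the event type.
        if event.kind == "CIRC":
            self.new_circuit(event)
        elif event.kind == "STREAM":
            self.new_stream(event)
```

In this sketch, a BUILT circuit is the point where a module's probe would be started for that circuit's exit relay.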
Now here's where the command utility comes in. Exitmap modules can do one of two things. They can either use pure Python to do their scan (e.g., by using httplib to fetch a web site), or use an external tool and parse its output (e.g., by running the openssl command line tool). For the first case, we offer the function run_python_over_tor() (command.py:37). It basically monkey-patches Python's socket.socket and makes it go over Tor.
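In case the monkey-patching part is unclear, here is a minimal sketch of the general technique. It is not the body of run_python_over_tor(): TorSocketStub and patched_socket() are hypothetical names, and the stub only records how it was constructed, where a real replacement would talk SOCKS to Tor.

```python
import contextlib
import socket


class TorSocketStub:
    """Hypothetical replacement class; it only records constructor args."""

    instances = []

    def __init__(self, *args, **kwargs):
        self.args = args
        self.kwargs = kwargs
        TorSocketStub.instances.append(self)


@contextlib.contextmanager
def patched_socket(replacement):
    """Temporarily rebind socket.socket, restoring it afterwards."""
    original = socket.socket
    socket.socket = replacement
    try:
        yield
    finally:
        # Always restore, even if the wrapped code raises.
        socket.socket = original
```

Any library code that looks up socket.socket while the patch is active gets the replacement, which is why a module can keep using ordinary networking code without knowing about Tor.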
External programs are a little bit more tricky, and handled by the Command object (command.py:58). We will have multiple instances of our command line tool running at the same time, and they all connect to Tor's SOCKS port. Exitmap then maps a stream (e.g., whatever openssl does) to a circuit. But how does exitmap know which one among the, say, 10 streams belongs to a given circuit? If we don't attach them correctly, we can still detect MitM attacks, but we will get the alert for the wrong exit relay. To correctly attach streams to circuits, exitmap modules remember the source port of the command line tool and hand it back to exitmap (note that a module runs in a different process than exitmap) over the queue. We can get the source port by parsing the output of torsocks. However, the current version of torsocks does not implement this yet.
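The bookkeeping can be sketched roughly like this. It is an illustration under assumptions, not exitmap's code: StreamAttacher, drain_queue() and circuit_for_stream() are hypothetical names, and I use queue.Queue to keep the example self-contained, whereas modules in a separate process would need a multiprocessing queue.

```python
import queue


class StreamAttacher:
    """Illustrative bookkeeping: map a tool's source port to a circuit."""

    def __init__(self, ipc_queue):
        self.ipc_queue = ipc_queue
        self.port_to_circuit = {}  # source port -> circuit ID

    def drain_queue(self, circuit_for_exit):
        """Read all (exit_fingerprint, source_port) pairs sent by modules.

        circuit_for_exit maps an exit fingerprint to the ID of the
        circuit that was built through that exit.
        """
        while True:
            try:
                fingerprint, port = self.ipc_queue.get_nowait()
            except queue.Empty:
                break
            self.port_to_circuit[port] = circuit_for_exit[fingerprint]

    def circuit_for_stream(self, source_port):
        """Return the circuit a new stream belongs to, or None if unknown."""
        return self.port_to_circuit.get(source_port)
```

When a new stream shows up, its source port is the key: if the lookup returns a circuit ID, the stream can be attached to that circuit, and any alert ends up attributed to the right exit relay.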
b) For calculating all possible exits and creating circuits, ServerDescriptors is set to 0 probably to avoid conflict with changed consensus data in the middle of module execution. Also, for each module run in the same execution, the original consensus downloaded during bootstrapping Tor is being used. In my use case however, which involves long duration scanning, we will need to update the cached-consensus after some iterations. One way I think this can be made possible is by having an asynchronous task that updates the consensus after say, an hour or so. Conflicts could be avoided mid-module execution by either stalling execution if near the DA consensus time period or use the old cached-consensus. I would like to ask you whether my conclusion is correct and if so, what other ways can I explore?
Yes, that sounds reasonable to me. In fact, we might as well periodically download server descriptors instead of consensuses, so we can also scan relays that don't have the Exit flag, but have an exit policy. The consensus does not necessarily tell us what a relay's exit policy looks like. For more information, see: https://github.com/NullHypothesis/exitmap/issues/13
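One possible shape for this, as a sketch only: each scan iteration takes a snapshot of the relay list once and works from that, so a refresh can never change the list in the middle of a module run. RelayListCache and its parameters are hypothetical, and fetch stands for whatever function downloads and parses consensuses or server descriptors.

```python
import threading
import time


class RelayListCache:
    """Illustrative cache that refreshes its relay list when it gets old."""

    def __init__(self, fetch, max_age=3600.0, clock=time.monotonic):
        self._fetch = fetch      # callable returning an iterable of relays
        self._max_age = max_age  # seconds before the list counts as stale
        self._clock = clock      # injectable so it can be tested
        self._lock = threading.Lock()
        self._relays = tuple(fetch())
        self._fetched_at = clock()

    def snapshot(self):
        """Return an immutable relay list, refreshing it first if stale."""
        with self._lock:
            if self._clock() - self._fetched_at >= self._max_age:
                self._relays = tuple(self._fetch())
                self._fetched_at = self._clock()
            return self._relays
```

A long-running scan would call snapshot() at the start of every iteration. Iterations that are already running keep their old tuple, which corresponds to the "use the old cached-consensus" option you mentioned.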
c) In exitmap, we are taking in the analysis_dir parameter in the command line and storing it in a global variable in util.py. However, nowhere is the dump_to_file function [1] being called to report on bad exits. What was the working that was thought of behind this function? Was exitmap to dump the log entries in files inside the analysis_dir for every false negative? Can't seem to find code where the function is called.
dump_to_file() is only used in modules, and not by exitmap itself. You certainly have a point here; the way exitmap logs stuff is inconsistent because it's entirely module-driven. Seems worthy of some improvements.
I hope that clears things up a bit. Let me know if you want me to elaborate on something.
Cheers, Philipp