[tor-dev] Exitmap's control flow

Tue May 17 01:53:34 UTC 2016

Hi Mridul,

I'm copying tor-dev@, so other folks can chime in.

On Thu, May 12, 2016 at 10:16:45AM +0530, Mridul Malpotra wrote:
> a) Can you give me a short description about the program flow on how the
> EventHandler class enables modules to be executed in exitmap? From my
> initial pondering over the code, the circuits seem to be created in
> succession. Then for each circuit creation, the listener catches the event
> and starts by calling module_closure for invoking probe() function specific
> to each module. However, I am having trouble understanding the role of the
> command utility and IPC queue, for which I can see a separate queue having
> the exit_fingerprint and socket pairs but fail to comprehend how it is
> being used.

Like you said, circuits are created sequentially in exitmap.py:419.  We
have a configurable inter-circuit delay (exitmap.py:442) to reduce the
load on the Tor network.
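
As a rough sketch (not the actual exitmap code), the loop boils down to
something like the following.  The names build_circuits, create_circuit,
and delay are made up for illustration; in exitmap, the circuit creation
itself goes through Stem.

```python
import time


def build_circuits(exit_fingerprints, create_circuit, delay=0.5):
    """Ask for one circuit per exit relay, pausing between requests.

    create_circuit is a hypothetical callable that takes an exit
    fingerprint and asks Tor for a circuit ending at that relay.
    """
    requested = []
    for index, fingerprint in enumerate(exit_fingerprints):
        if index > 0:
            # Spread the requests out to reduce load on the Tor network.
            time.sleep(delay)
        create_circuit(fingerprint)
        requested.append(fingerprint)
    return requested
```

Note that the function only requests circuits.  Whether a circuit was
actually built is learned later, through events.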

Before exitmap asks Stem to create circuits, it registers event
listeners for all circuit and stream events (exitmap.py:371).  So
whenever Tor tells Stem that a circuit or stream has changed, Stem
notifies exitmap.  We only care about a subset of all circuit and stream
events, though.

All the event code is in eventhandler.py.  Once Tor has managed to
create a new circuit, it tells Stem, which in turn tells exitmap.  We
catch this event in the new_circuit() function (eventhandler.py:254).
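
To illustrate the dispatching, here is a stripped-down sketch.  It is
not the code in eventhandler.py: the Event tuple is a stand-in for
Stem's event objects, and the class only mimics the idea of routing
events by type and ignoring the ones we don't care about.  With Stem,
I believe the real events would be delivered by registering the method
via controller.add_event_listener(handler.new_event, EventType.CIRC,
EventType.STREAM).

```python
from collections import namedtuple

# Stand-in for Stem's event objects; only the fields used below.
Event = namedtuple("Event", ["type", "id", "status"])


class EventHandler:
    """Hypothetical, stripped-down dispatcher for Tor events."""

    def __init__(self):
        self.built_circuits = []

    def new_circuit(self, event):
        # Only a freshly built circuit is interesting for scanning.
        if event.status == "BUILT":
            self.built_circuits.append(event.id)

    def new_stream(self, event):
        # Stream handling is omitted in this sketch.
        pass

    def new_event(self, event):
        # Route the event by type; ignore everything else.
        if event.type == "CIRC":
            self.new_circuit(event)
        elif event.type == "STREAM":
            self.new_stream(event)


handler = EventHandler()
for event in [Event("CIRC", "7", "LAUNCHED"),
              Event("CIRC", "7", "BUILT"),
              Event("STREAM", "12", "NEW")]:
    handler.new_event(event)
print(handler.built_circuits)
```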

Now here's where the command utility comes in.  Exitmap modules can do
one of two things.  They can either use pure Python to do their scan
(e.g., by using httplib to fetch a web site), or use an external tool
and parse its output (e.g., by running the openssl command line tool).
For the first case, we offer the function run_python_over_tor()
(command.py:37).  It basically monkey-patches Python's socket.socket and
makes it go over Tor.
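
The monkey-patching idea can be sketched as follows.  This is only an
illustration and not what command.py does in detail: patched_socket and
RecordingSocket are hypothetical names, and RecordingSocket merely
records where it was asked to connect instead of speaking SOCKS to Tor.

```python
import contextlib
import socket


@contextlib.contextmanager
def patched_socket(replacement):
    """Temporarily replace socket.socket with another class."""
    original = socket.socket
    socket.socket = replacement
    try:
        yield
    finally:
        # Always restore the real class, even if the scan raised.
        socket.socket = original


class RecordingSocket(socket.socket):
    """Hypothetical stand-in for a socket class that goes over Tor."""

    seen = []

    def connect(self, address):
        # A real replacement would talk to Tor's SOCKS port here.
        RecordingSocket.seen.append(address)


def fetch(host, port):
    # Module code is unaware of the patch; it just uses socket.socket.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.connect((host, port))
    finally:
        sock.close()


with patched_socket(RecordingSocket):
    fetch("example.com", 80)
```

The point of patching at this level is that libraries that create their
sockets through socket.socket pick up the replacement without being
modified.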

External programs are a bit trickier, and are handled by the
Command object (command.py:58).  We will have multiple instances of our
command line tool running at the same time, and they all connect to
Tor's SOCKS port.  Exitmap then maps a stream (e.g., whatever openssl
does) to a circuit.  But how does exitmap know which one among the, say,
10 streams belongs to a given circuit?  If we don't attach them
correctly, we can still detect MitM attacks, but we will get the alert
for the wrong exit relay.  To correctly attach streams to circuits,
exitmap modules remember the source port of the command line tool and
hand it back to exitmap (note that a module runs in a different process
than exitmap) over the queue.  We can get the source port by parsing the
output of torsocks.  However, the current version of torsocks does not
implement this yet.
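
Here is a minimal sketch of the bookkeeping, under a few assumptions
that are mine and not taken from the exitmap code: the class name
StreamAttacher is made up, the queue items are (circuit_id,
source_port) pairs, and the stream's source address arrives as a
"host:port" string (Tor's STREAM events carry this in SOURCE_ADDR).

```python
import queue


class StreamAttacher:
    """Hypothetical bookkeeping that pairs streams with circuits."""

    def __init__(self):
        self.circuit_by_port = {}

    def read_queue(self, ipc_queue):
        # Each item is a (circuit_id, source_port) pair from a module.
        while True:
            try:
                circuit_id, source_port = ipc_queue.get_nowait()
            except queue.Empty:
                break
            self.circuit_by_port[source_port] = circuit_id

    def circuit_for_stream(self, source_address):
        # source_address looks like "127.0.0.1:41712".
        port = int(source_address.rsplit(":", 1)[1])
        # None means no module has announced this port (yet).
        return self.circuit_by_port.get(port)


ipc_queue = queue.Queue()
ipc_queue.put(("5", 41712))   # a module reports its tool's source port
attacher = StreamAttacher()
attacher.read_queue(ipc_queue)
print(attacher.circuit_for_stream("127.0.0.1:41712"))
```

The example uses queue.Queue to stay in a single process.  Since modules
run in their own process, a multiprocessing.Queue would take its place
in practice.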

> b) For calculating all possible exits and creating circuits,
> ServerDescriptors is set to 0 probably to avoid conflict with changed
> consensus data in the middle of module execution. Also, for each module run
> in the same execution, the original consensus downloaded during
> bootstrapping Tor is being used. In my use case however, which involves
> long duration scanning, we will need to update the cached-consensus after
> some iterations. One way I think this can be made possible is by having an
> asynchronous task that updates the consensus after say, an hour or so.
> Conflicts could be avoided mid-module execution by either stalling
> execution if near the DA consensus time period or use the old
> cached-consensus. I would like to ask you whether my conclusion is correct
> and if so, what other ways can I explore?

Yes, that sounds reasonable to me.  In fact, we might as well
periodically download server descriptors instead of consensuses, so we
can also scan relays that don't have the Exit flag, but have an exit
policy.  The consensus does not necessarily tell us what a relay's exit
policy looks like.  For more information, see:
<https://github.com/NullHypothesis/exitmap/issues/13>
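
To make the exit policy point concrete, here is a simplified sketch of
how the accept/reject lines of a server descriptor could be checked for
a port.  It is an illustration only: allows_port is a made-up helper, it
looks only at rules with a wildcard address, and it follows my reading
of dir-spec that rules are evaluated in order and that an address is
accepted if no rule matches.  Stem ships proper exit policy handling,
which is what real code should use.

```python
def allows_port(policy_lines, port):
    """Return True if a wildcard-address rule (or the default) accepts port."""
    for line in policy_lines:
        action, pattern = line.split()
        address, _, port_spec = pattern.rpartition(":")
        if address != "*":
            # Simplification: ignore rules for specific addresses.
            continue
        if port_spec == "*":
            low, high = 1, 65535
        elif "-" in port_spec:
            low, high = (int(p) for p in port_spec.split("-"))
        else:
            low = high = int(port_spec)
        if low <= port <= high:
            # First matching rule wins.
            return action == "accept"
    # No rule matched; treated as accepted (see the assumption above).
    return True


policy = ["reject 10.0.0.0/8:*", "accept *:80", "accept *:443", "reject *:*"]
print(allows_port(policy, 443), allows_port(policy, 25))
```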

> c) In exitmap, we are taking in the analysis_dir parameter in the command
> line and storing it in a global variable in util.py. However, nowhere is
> the dump_to_file function [1] being called to report on bad exits. What was
> the working that was thought of behind this function? Was exitmap to dump
> the log entries in files inside the analysis_dir for every false negative?
> Can't seem to find code where the function is called.

dump_to_file() is only used in modules, and not by exitmap itself.  You
certainly have a point here; the way exitmap logs stuff is inconsistent
because it's entirely module-driven.  Seems worthy of some improvements.

I hope that clears things up a bit.  Let me know if you want me to
elaborate on something.

Cheers,
Philipp