[tor-dev] Python ExoneraTor

Sun Jun 8 23:26:44 UTC 2014

Oh, and another quick thought - you once mentioned that a descriptor
search service would make ExoneraTor obsolete, and in looking it over
I agree. The search functionality ExoneraTor provides is trivial. The
only reason it requires such a huge database is because it's storing a
copy of every descriptor ever made.

I suspect the actual right solution isn't to rewrite ExoneraTor at
all, but rather develop a new service that can be queried for this
descriptor data. That would make for a *much* more worthwhile project.

ExoneraTor? Nice to have. Descriptor archive service? Damn useful. :)

On Sun, Jun 8, 2014 at 3:03 PM, Damian Johnson <atagar at torproject.org> wrote:
> Hi Karsten. This is diving into enough detail that we might as well move this
> over to tor-dev at . For the list's benefit, Karsten and I are discussing a
> Python rewrite of ExoneraTor...
>
>   https://exonerator.torproject.org/
>   https://gitweb.torproject.org/exonerator.git
>
>
> First I think I need to take a step back to figure out exactly what we're
> after. From a quick peek at ExoneraTor it looks like it behaves as follows...
>
>   a. User enters an address (IPv4 or IPv6) and a date (either for a day or an
>      hour).
>
>   b. ExoneraTor lists router status entries for all relays that match the
>      criteria. These entries link to the consensus they came from and server
>      descriptors they reference.
>
>   c. The user can then enter a destination address and port to search exit
>      policies in TorDNSEL entries.
>
> Steps 'a' and 'b' make sense to me. Step 'c' however I'm having a little
> difficulty grokking. Ignoring TorDNSEL entries for a moment, we already have
> all the ingredients to provide the user with three fields to start with...
>
>   * Source Address (required)
>   * Timestamp (required)
>   * Destination Address and/or Port (optional)
>
> The source address and timestamp come from the consensus, and an optional
> 'can it exit to destination X' consults the server descriptor's exit policy.
>
> So what is TorDNSEL providing us and why is it a separate search on the page?
> As I understand it the value of TorDNSEL is that we can't trust the address in
> the router status entries. If that's the case then our present search fields
> don't make sense to me...
>
>   * Our initial search consults consensus information for the address and
>     timestamp but not the exit policy. This is weird both because the address
>     this has is faulty, and we have the exit policy so we could trivially
>     include that in our search criteria.
>
>   * Our second search gives the impression that we're using the earlier
>     consensus results to query exit criteria from TorDNSEL. As I understand
>     it though that's not what it's doing. TorDNSEL is completely independent
>     from the consensus information.
>
> I could understand a search that just consults consensus information (ignoring
> address accuracy, it has everything we need). I could also understand a search
> that just consults TorDNSEL information (ignoring its inconsistent poll rate,
> it has everything we need).
>
> However, this hybrid approach and how it's presented really confuses me.
> Unless I'm mistaken with something above what I'd expect from ExoneraTor is...
>
>   * The three search fields mentioned above.
>
>   * It shows results based on the consensus information like we presently do.
>
>   * If we have TorDNSEL entries that either indicate that a relay we're
>     presenting had a different external address or another relay had the
>     address we're searching for then note that.
>
> That is to say, the base search is based on consensus information (using
> server descriptor exit policies if we want to filter by that), and the
> TorDNSEL results are just appended notes since we can't rely on its poll rate.
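
A rough sketch of that flow, to make the proposal concrete. Everything in it
is hypothetical: the StatusEntry and ExitListEntry records, the search()
helper and the 12 hour window are made up for illustration and are not
ExoneraTor's actual schema or matching rules.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class StatusEntry:
    """Hypothetical record for one router status entry in a consensus."""
    fingerprint: str
    address: str
    valid_after: datetime


@dataclass
class ExitListEntry:
    """Hypothetical record for one TorDNSEL exit list entry."""
    fingerprint: str
    exit_address: str
    scanned: datetime


def search(status_entries, exit_list, address, timestamp,
           window=timedelta(hours=12)):
    """Base search on consensus data, with exit list results as notes.

    The window is an assumption of this sketch, not ExoneraTor's rule.
    """
    def near(when):
        return abs(when - timestamp) <= window

    nearby_scans = [x for x in exit_list if near(x.scanned)]

    matches = []
    for entry in status_entries:
        if entry.address != address or not near(entry.valid_after):
            continue
        # Note when the exit list saw this relay on a different address.
        notes = sorted({
            "exit list saw this relay at %s" % x.exit_address
            for x in nearby_scans
            if x.fingerprint == entry.fingerprint and x.exit_address != address
        })
        matches.append((entry, notes))

    # Relays the consensus search missed, but that the exit list saw
    # using the address we are searching for.
    matched = {entry.fingerprint for entry, _ in matches}
    others = sorted({
        x.fingerprint for x in nearby_scans
        if x.exit_address == address and x.fingerprint not in matched
    })
    return matches, others
```

The consensus drives the result list; exit list data only ever adds notes or
extra fingerprints, so a gap in its polling never hides a consensus match.
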
>
> Thoughts?
>
> Cheers! -Damian
>
> PS. Congratulations on getting me invested. I just spent the last three hours
> in front of a whiteboard trying to puzzle out why ExoneraTor works the way it
> presently does. ;)
>
> PPS. Stem's ExitPolicy class has a can_exit_to() method that would be really
> handy for this...
>
>   https://stem.torproject.org/api/exit_policy.html#stem.exit_policy.ExitPolicy.can_exit_to
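
For readers without Stem at hand, here is a simplified standard-library
stand-in for the kind of check can_exit_to() performs. It is not Stem's
implementation: it only handles IPv4 rules of the form 'accept ADDR:PORT',
and the helper name, the rule parsing and the default for unmatched
destinations are all assumptions of this sketch.

```python
import ipaddress


def can_exit_to(policy, address, port, default=True):
    """Return True if the first matching rule accepts address:port.

    policy is a list of strings such as 'reject 10.0.0.0/8:*' or
    'accept *:80-443'. Rules are checked in order, first match wins.
    default is what to return when nothing matches; treating that case
    as accepted is an assumption worth checking against the dir-spec.
    """
    target = ipaddress.ip_address(address)

    for rule in policy:
        action, pattern = rule.split()
        addr_part, port_part = pattern.rsplit(":", 1)

        # Address match: '*' matches anything, otherwise a network/mask.
        if addr_part != "*":
            network = ipaddress.ip_network(addr_part, strict=False)
            if target not in network:
                continue

        # Port match: '*', a single port, or a 'low-high' range.
        if port_part != "*":
            low, _, high = port_part.partition("-")
            if not int(low) <= port <= int(high or low):
                continue

        return action == "accept"

    return default
```

For example, with the policy ['reject 10.0.0.0/8:*', 'accept *:80',
'reject *:*'] a lookup for 93.184.216.34 port 80 is allowed, while anything
in 10.0.0.0/8 or any other port is not.
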
>
> PPPS. I'm still hesitant about actually tackling this project. Arm is midway
> through being rewritten, and considering its sudden uptick in usage probably
> the most important project on my plate right now.
>
> That said, I'm happy to discuss this. Even if we don't implement it right now
> this thread will be useful so we know where we're going with ticket #8260.
>
> Concerning the earlier discussion of 'work with Karsten on a python project'
> I have a personal bias toward collaborating when the project has few unknowns
> for me, but working alone when *I'm* learning something. That is to say, I'd
> love to work with you on a straightforward Stem project and I'd also like
> to discuss ExoneraTor's design. But when it comes to coding, this has enough
> unknowns that if I take it on I'd prefer to experiment alone for a while - at
> least until I know enough about the APIs involved that I can avoid
> embarrassing myself. :)
>
>
> On Sun, Jun 8, 2014 at 2:56 AM, Karsten Loesing <karsten at torproject.org> wrote:
>> On 08/06/14 06:27, Damian Johnson wrote:
>>>>> Here's a quick overview of the codebase to facilitate reading through it:
>>>>
>>>> Ahhh, very useful - thanks.
>>>
>>> Hmmm. Just took a quick peek at the ExoneraTor codebase and, unless
>>> I'm mistaken, it doesn't actually use metrics-lib, does it?
>>
>> You're right, looks like it doesn't.
>>
>>> Honestly
>>> looking over the code is making me a little hesitant to take this on
>>> after all. I was anticipating a small, quick project of DocTor's scope
>>> but I've never touched SQLAlchemy or Postgres before.
>>
>> I don't think we'll even have to touch the Postgres for moving from Java
>> to Python.  The Python code would simply do SQL calls via its SQL
>> library just like Java does.
>>
>> I just copied all SQL statements that the Python part would have to
>> prepare and execute:
>>
>> CALL insert_descriptor(?, ?);
>> CALL insert_statusentry(?, ?, ?, ?, ?, ?, ?);
>> CALL insert_consensus(?, ?);
>> CALL insert_exitlistentry(?, ?, ?, ?, ?);
>> SELECT MIN(validafter) AS first, MAX(validafter) AS last FROM consensus;
>> SELECT validafter FROM consensus WHERE validafter >= ? AND validafter <= ?;
>> CALL search_statusentries_by_address_date(?, ?);
>> CALL search_addresses_in_same_24 (?, ?);
>> CALL search_addresses_in_same_48 (?, ?);
>> SELECT rawdescriptor FROM descriptor WHERE descriptor = ?;
>> SELECT descriptor, rawdescriptor FROM descriptor WHERE descriptor LIKE ?;
>> SELECT rawconsensus FROM consensus WHERE validafter = ?;
>>
>> That's it.  No further knowledge about Postgres required.
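
A minimal sketch of what 'prepare and execute' looks like from Python's
DB-API, using two of the SELECT statements above. To keep it runnable
without a server it uses an in-memory sqlite3 database as a stand-in for
Postgres, and the table definition and sample rows are invented for the
example. The column aliases are dropped because rows are read by position.

```python
import sqlite3

# Stand-in database. The column types are assumptions, not the real schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE consensus (validafter TEXT PRIMARY KEY, rawconsensus BLOB)"
)
conn.executemany(
    "INSERT INTO consensus (validafter, rawconsensus) VALUES (?, ?)",
    [
        ("2014-06-08 12:00:00", b"placeholder"),
        ("2014-06-08 13:00:00", b"placeholder"),
        ("2014-06-08 14:00:00", b"placeholder"),
    ],
)


def consensus_range(conn):
    """Return (first, last) valid-after times in the database."""
    cursor = conn.execute(
        "SELECT MIN(validafter), MAX(validafter) FROM consensus"
    )
    return cursor.fetchone()


def consensuses_between(conn, start, end):
    """Return valid-after times within [start, end], oldest first."""
    cursor = conn.execute(
        "SELECT validafter FROM consensus "
        "WHERE validafter >= ? AND validafter <= ? ORDER BY validafter",
        (start, end),
    )
    return [row[0] for row in cursor.fetchall()]
```

With a Postgres driver such as psycopg2 the shape stays the same, but the
placeholder is %s rather than ?, and the stored procedures would likely go
through the cursor's callproc() rather than a literal CALL string.
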
>>
>>> Once I wrote this I realized I'm being a damn hypocrite. Here I was
>>> saying "Karsten, learn Python so we can leverage each other's
>>> codebases!" but then I hightail it once the project delves into areas
>>> new to me. New arm users are showing up almost daily on irc and I'm
>>> anxious to give them a new release... but then this is exactly the
>>> issue, isn't it? Deliverables you'd like to focus on crowding out time
>>> to learn new things.
>>>
>>> So TL;DR I'm gonna eat my own words and suggest we focus on our
>>> separate domains for now. I really would like to work on some small
>>> metrics projects with you. Each month I eyeball your status reports
>>> asking myself "Is there anything here I can work with Karsten on to
>>> draw our spaces closer together?" so please let me know if you run
>>> across anything in Metrics we can collaborate on.
>>
>> (Replying below, first replying to the DynamoDB part.)
>>
>>> Your hypocritical friend, ~Damian
>>>
>>> PS. When we next meet I'd like to discuss ExoneraTor's design a bit.
>>> First thought I had when looking at the code was 'huh... I wonder if
>>> this would be a good use case for DynamoDB'.
>>
>> I'm wary about moving to another database, especially NoSQL ones and/or
>> cloud-based ones.  They don't magically make things faster, and Postgres
>> is something I understand quite well by now.  And again, I think that we
>> keep the Postgres part entirely unchanged when moving to Python.  Not
>> saying that DynamoDB can't be the better choice, but switching the
>> database is not a priority for me.
>>
>>
>> So, regarding the rewrite: rather than canceling the project before it
>> starts, how about we find a role for you that you're more comfortable with?
>>
>> For example, I'd want to try rewriting it step by step based on your
>> suggestion of frameworks/libraries and with some code review of yours.
>>
>> If you're interested, which framework would I use for the new Python
>> ExoneraTor?  It's supposed to do the following tasks:
>>
>>  - Provide a simple web site with a web form, backed by the PostgreSQL
>> database.
>>  - Maybe offer a simple RESTful API for lookups that the web form could
>> use to compose responses, but that could also be used by other
>> applications directly.
>>  - Return documents from the database by identifier, so without
>> providing a search functionality.
>>  - Run a scheduled task once per hour that fetches data from CollecTor
>> and puts it in a database.
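
To illustrate the 'return documents by identifier' task without picking a
framework, here is a bare WSGI sketch using only the standard library. It
is not a recommendation: the /document path, the id parameter and the
in-memory DOCUMENTS dict (standing in for the database) are all made up
for the example.

```python
import json
from urllib.parse import parse_qs

# Hypothetical stand-in for the descriptor table in the database.
DOCUMENTS = {
    "abc123": "router example 198.51.100.7 9001 0 0",
}


def app(environ, start_response):
    """WSGI application answering GET /document?id=<identifier> with JSON."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    doc_id = query.get("id", [""])[0]

    if environ.get("PATH_INFO") == "/document" and doc_id in DOCUMENTS:
        status = "200 OK"
        body = {"id": doc_id, "raw": DOCUMENTS[doc_id]}
    else:
        status = "404 Not Found"
        body = {"error": "unknown document"}

    payload = json.dumps(body).encode("utf-8")
    start_response(status, [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(payload))),
    ])
    return [payload]
```

Any WSGI server can host this, including wsgiref.simple_server.make_server
from the standard library for local testing. A web form could then compose
its responses from the same JSON that other applications consume.
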
>>
>> Bonus points if the result is as easy to deploy on Debian Wheezy as
>> possible.  Like, install these few Debian packages, run the setup
>> script, done.
>>
>> Of course, if you'd prefer to focus on other things and not discuss
>> ExoneraTor stuff, that's perfectly fine, too. :)
>>
>> All the best,
>> Karsten
>>