[tor-bugs] #28320 [Metrics/CollecTor]: Rewrite CollecTor relaydescs module using Stem/txtorcon

Mon Nov 5 08:43:47 UTC 2018

#28320: Rewrite CollecTor relaydescs module using Stem/txtorcon
-----------------------------------+--------------------------
     Reporter:  karsten            |      Owner:  metrics-team
         Type:  task               |     Status:  new
     Priority:  Medium             |  Milestone:
    Component:  Metrics/CollecTor  |    Version:
     Severity:  Normal             |   Keywords:
Actual Points:                     |  Parent ID:
       Points:                     |   Reviewer:
      Sponsor:  Sponsor13          |
-----------------------------------+--------------------------
 The CollecTor service collects and archives data from various nodes and
 services in the public Tor network. Internally, it consists of several
 modules that are running in the background following a pre-defined
 schedule. These modules either download data from other hosts or process
 data that has been copied from other hosts to the local file system. The
 processed data is then provided via a locally running static web server.

 CollecTor is written in Java. It uses several APIs provided either by the
 JDK or by third-party libraries. For example, it uses
 `java.util.concurrent` for scheduling. However, it does not use a specific
 framework for batch processing. That is why it has to solve challenges
 like the following on its own:

  - Scheduling: Make sure modules are running, say, once per hour; avoid
    overlapping runs.
  - Dependencies: Make sure that module runs don't interfere with each
    other; one module writes newly obtained files to disk, another tars them
    up, yet another writes an index file and provides that to external
    applications.
  - Shutdowns: Handle externally triggered shutdowns gracefully and make
    sure the service resumes operation after reboot, without missing data.
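
 To make the scheduling and shutdown requirements concrete, here is a
 minimal sketch using only the Python standard library. It is not how
 CollecTor works today, and it is not a replacement for a batch processing
 framework. The names `Module`, `run_once` and `run_periodically` are
 made up for this illustration.

```python
import threading
import time


class Module:
    """A hypothetical batch module that must never overlap with itself."""

    def __init__(self, name, task):
        self.name = name
        self.task = task
        self._lock = threading.Lock()

    def run_once(self):
        """Run the task unless a previous run is still in progress.

        Returns True if the task ran, False if the run was skipped.
        """
        if not self._lock.acquire(blocking=False):
            return False
        try:
            self.task()
            return True
        finally:
            self._lock.release()


def run_periodically(module, interval_seconds, stop_event):
    """Call module.run_once() every interval until stop_event is set."""
    while not stop_event.is_set():
        started = time.monotonic()
        module.run_once()
        elapsed = time.monotonic() - started
        # Event.wait() returns early as soon as a shutdown is requested.
        stop_event.wait(max(0.0, interval_seconds - elapsed))
```

 A signal handler registered with `signal.signal` for SIGTERM could set
 `stop_event`, so that a run in progress finishes before the process
 exits. Resuming after a reboot without missing data would still need
 persistent state, which this sketch does not cover.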

 These are just a few examples, and CollecTor does not resolve all of them
 in the best way possible. These challenges are common to batch
 processing, so somebody has most likely solved them before. We should
 find out, and the best way is probably to try an existing solution out in
 practice.

 In Mexico City we decided to evaluate existing batch processing frameworks
 by rewriting the CollecTor relaydescs module using Python with Stem or
 txtorcon. As an initial proof of concept, it should be sufficient to make
 it work for consensuses and server descriptors. Other descriptor types
 can follow later, if we decide to switch from Java to Python for
 CollecTor.

 The first steps are to write down requirements and to identify candidate
 Python libraries for the batch-processing parts.

 We're done with this task when we have a working prototype of CollecTor in
 Python that fetches consensuses and server descriptors from the directory
 authorities.
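
 As a starting point for such a prototype, the following sketch fetches
 the current consensus with Stem and writes it to disk. It rests on
 several assumptions: the function names and the output directory are
 hypothetical, the file name pattern is only meant to resemble what
 CollecTor serves today, and Stem is expected to pick a directory
 authority when no endpoints are given. Writing via a temporary file and
 `os.replace` is one possible way to keep a tarring or indexing module
 from ever seeing a half-written file.

```python
import os
import tempfile


def consensus_file_name(valid_after):
    """Build a file name from a consensus valid-after time (a datetime)."""
    return valid_after.strftime("%Y-%m-%d-%H-%M-%S") + "-consensus"


def write_atomically(path, data):
    """Write bytes to path so that readers never see a partial file."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(data)
        # Atomic on the same file system: the final name appears all at once.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


def fetch_and_store_consensus(out_dir):
    """Download the current consensus with Stem and archive it in out_dir."""
    # Imported here so the helpers above work without Stem installed.
    from stem.descriptor import DocumentHandler
    from stem.descriptor.remote import DescriptorDownloader

    downloader = DescriptorDownloader(timeout=60)
    # DOCUMENT yields the consensus as one object instead of router entries.
    query = downloader.get_consensus(document_handler=DocumentHandler.DOCUMENT)
    consensus = query.run()[0]  # run() raises if the download fails
    path = os.path.join(out_dir, consensus_file_name(consensus.valid_after))
    write_atomically(path, consensus.get_bytes())
    return path
```

 Server descriptors could probably be fetched in a similar way through
 `DescriptorDownloader.get_server_descriptors()`. Whether Stem or txtorcon
 is the better fit is exactly what this task is meant to find out.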

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/28320>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online