[tor-dev] Scaling Tor Metrics
tl at rat.io
Fri Nov 27 14:22:44 UTC 2015
Since the Berlin dev meeting I’ve been working on setting up and feeding an "analytics server" that will provide a Big Data infrastructure filled with raw metrics descriptors ready for consumption by anybody who’s interested in finding what’s in the Tor Metrics data. That obviously affects my take on Scaling Tor Metrics.
You haven’t mentioned the backend very much - the data gathering machinery, the handful of parsers (in as many different languages), the storage. Are you content with all that? You seem to assume that if the frontend is in better shape it will attract more people to the backend. Maybe, but I’m not sure. At least I found your recent little Antlr excursion pretty neat.
So far I haven’t succeeded in getting an overview of all the parts in the backend that actually make up the Metrics infrastructure today. The number of projects listed in the Roadmap working draft is overwhelming. I see suspiciously many projects with some overlap here and there, a handful of parsers, a lot of attributes gathered. I wonder how much streamlining these offerings would, by itself, reduce the workload.
I’d like to understand which parts are the essential core, which provide valuable services to lots of people, and which are special interest, nice to have or close to abandoned (usage wise).
I have an idea of how a Big Data infrastructure could help metrics scale better in several directions but I can’t say how it would fit into the whole machinery. Having a clearer overview of the essential parts, the languages they are written in and the amount of maintenance they require would help.
The main asset of Tor Metrics is the data. Making that data easily available should be paramount. That means:
1) downloadable from the website
2) in a format that's ready for consumption in popular tools (e.g. JSON)
3) raw (as it comes from CollecTor) as well as pre-aggregated for common use cases.
4) well documented and tastefully arranged.
5) live queryable (with a graphic interface)
Metrics provides 1) but until now neither 2) nor 3). A JSON converter is in the works and a side effect of the analytics server should be the generation of ‘popular’ aggregates. But the analytics server is a project that will require some effort in administration and maintenance if it is to become more than an experiment. 4) is not unreachable. The spec actually goes a long way (but not all of it). 5) may not be achievable in the short term - BUT would be great!
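To illustrate point 2), here is a minimal sketch of what converting a raw descriptor to JSON could look like. It is not the converter that is in the works. The sample descriptor is made up and heavily simplified, the output field names (nickname, or_port, ...) are my own assumptions, and only a few keywords are handled:

```python
import json

# Made-up, heavily simplified server descriptor excerpt. Real descriptors
# carry many more keywords, signatures and multi-line objects.
SAMPLE = """\
router ExampleRelay 192.0.2.1 9001 0 9030
platform Tor 0.2.7.5 on Linux
bandwidth 1048576 2097152 524288
uptime 86400
"""


def descriptor_to_dict(text):
    """Turn 'keyword arguments' lines into a dict (illustrative only)."""
    record = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        keyword, _, rest = line.partition(" ")
        if keyword == "router":
            # router <nickname> <address> <ORPort> <SOCKSPort> <DirPort>
            nickname, address, or_port, _socks_port, dir_port = rest.split()
            record.update(nickname=nickname, address=address,
                          or_port=int(or_port), dir_port=int(dir_port))
        elif keyword == "bandwidth":
            # bandwidth <avg> <burst> <observed>
            avg, burst, observed = (int(v) for v in rest.split())
            record["bandwidth"] = {"avg": avg, "burst": burst,
                                   "observed": observed}
        elif keyword == "uptime":
            record["uptime"] = int(rest)
        else:
            # Unknown keywords are kept as plain strings (assumed field names).
            record[keyword] = rest
    return record


record = descriptor_to_dict(SAMPLE)
print(json.dumps(record, indent=2))
```

The point is only that one JSON object per descriptor, with typed numbers instead of strings, can be loaded directly by most analytics and web tools.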
PostgreSQL seems to be an important backbone of the Tor Metrics infrastructure. Like any RDBMS it is optimized for retrieving specific single datasets whereas Big Data analytics solutions are optimized for finding patterns and structures in large amounts of data. Tor Metrics' problem space clearly belongs to the latter category although data volume is on the edge between the realms of RDBMS and Big Data. The tools that we will deploy on the analytics server provide interfaces in R, Java, Python, Scala and even SQL. They are also available as standalone versions for your laptop - no need to set up a Hadoop cluster to work with them locally on some downloaded chunk of JSON.
If the experiment with the analytics server plays out well, a switch from RDBMS/PostgreSQL to Big Data/Hadoop should be seriously considered. There is one problem though: all that Big Data/Hadoop stuff is _not_ in Debian stable.
I would suggest investing some time in better documentation of the workings of Tor and how the data gathered reflects its workings. I also suspect that some of the data gathered is not used any more, or never was, or is used so seldom that it could be moved into a “misc” section, clearing the view on central aspects. This may sound arrogant and I sure was a little lazy in my research on this matter, but I also started thinking about documentation for the analytics server recently - because new users will probably need some advice on where to find what and how to find something else - and the prospect of documenting all those fields and attributes is slightly scary. New users might feel scared (away) too if we don’t manage to prioritize information and move a good part of it into the background (of the documentation, not of the database).
Visualizations are just one way of providing access to the data, both very selective and very succinct (if done right).
Preproduced visualizations are very useful as answers to common questions (“How many users…”) and to illustrate not so obvious procedures (“How many Authorities consent on which server flags…”) but they can only provide an entry point.
I tried to construct a rather generic visualization tool with “Visionion” but so far I failed - hubris probably - and I’ll have to re-adjust. There are tools in the data analytics world that work on the raw data interactively and generate graphs on the fly, in a slice ’n dice way (provided that the data is in a popular format like JSON or CSV, so CollecTor data has to be converted first). “Tableau” is such a tool. They are not cheap and there is still no open source offering (but there probably will be some day).
And then there are one-off visualizations, handcrafted in visually fine tuned arrangement, made to illustrate specific aspects or prove a point - possibly static, with no live data or interactivity whatsoever.
Successful visualizations live mostly on the edges: they are either very generic or very specific or the tools are expensive. There is very little middle ground, or rather: this middle ground is really hard to achieve.
Tor Metrics should produce some of the “most popular” and generally useful graphs and guarantee that they are both correct and up to date. That is low hanging fruit because it is technically not very hard and brings a lot of benefit in perception.
Beyond that it should concentrate on providing the data that others can use to drive their visualizations in easy to consume formats (pre-aggregated JSON, again). When working on “Visionion” I was spending much more time on aggregating the data than I had expected.
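A rough sketch of what I mean by pre-aggregation, assuming the raw data is already available as one JSON object per line. The records and the field names ("date", "bandwidth_observed") are invented for illustration and do not reflect an existing Metrics format:

```python
import json
from collections import defaultdict

# Invented per-relay records, one JSON object per line (assumed layout).
RAW = """\
{"date": "2015-11-25", "nickname": "a", "bandwidth_observed": 500}
{"date": "2015-11-25", "nickname": "b", "bandwidth_observed": 300}
{"date": "2015-11-26", "nickname": "a", "bandwidth_observed": 700}
{"date": "2015-11-26", "nickname": "b", "bandwidth_observed": 100}
{"date": "2015-11-26", "nickname": "c", "bandwidth_observed": 200}
"""


def aggregate_by_date(lines):
    """Sum observed bandwidth per day and return a date-sorted list."""
    totals = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue
        rec = json.loads(line)
        totals[rec["date"]] += rec["bandwidth_observed"]
    # A flat, sorted array of small objects is easy to consume on the web side.
    return [{"date": day, "bandwidth_observed": totals[day]}
            for day in sorted(totals)]


series = aggregate_by_date(RAW.splitlines())
print(json.dumps(series))
```

If files like this were produced once on the server, a contributor could feed them to a charting library without touching the raw descriptors at all.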
I very much like David Goulet’s approach of curating (and eventually integrating on the metrics website) visualization scripts that work on the data we provide and that are provided to us as patches. I think that is the right way to go. We will have to provide some essential graphs ourselves and can then wait for contributions of more interesting or experimental stuff. We will then “just” have to check that the code does indeed what it claims to do.
I’m not familiar with Munin though and I don’t even know what those .tpo’s he mentions are. As a web developer I would never choose the Java path but go with D3.js, which expects JSON that Metrics would have to provide. D3 is the quasi-standard for visualizations on the web and a very solid solution. It is very close to the metal of web technologies but requires some work on the data beforehand, whereas Munin seems to be very close to the backend but to require some Java prowess. I don’t know which approach works out better in the long run and for the majority of people here.
To sum this up:
- Technically some entry level visualizations (number of users, bandwidth consumed) are low hanging fruit and should be provided by us. It would just be too bad if they were totally missing from the website. And only we can provide them with the required authority.
- Providing the data in easily consumable formats and pre-aggregated aspects is key to spur contributions. Visualizations can then be provided to us as patches.
- Supporting the right tools to encourage further contributions is a tricky question. Web developers are best served with JSON data and some D3 templates. For backend developers David Goulet’s proposal might be a good choice. But which group is more important? Can both be served? Is it worth the effort?
- Linking to some work others did is okay as long as we make clear that it's not “tested/approved/guaranteed” by us and as long as we do not include more than a small screenshot/appetizer.
On 25 Nov 2015, at 16:53, Karsten Loesing <karsten at torproject.org> wrote:
> Hello devs,
> the Tor Metrics website  claims to be "the primary place to learn
> interesting facts about the Tor network" and invites its visitors who
> "come across something that is missing" to contact the website authors
> about it. That's a bold statement I put there! :)
> Yet, there's considerable product backlog with possible enhancements
>  that doesn't seem to ever become shorter. Even worse, it can be
> expected that the backlog will refill quickly once the community
> notices that feature requests are suddenly considered. The main
> reason for this unfortunate situation is that Tor Metrics contains
> many moving parts, including some heavy database lifting that takes
> place below the surface, that all want to be maintained. Adding more
> parts just makes the whole thing even more likely to break. At the
> same time, knowing about the situation that Tor Metrics has become
> almost closed to contributions is painful.
> This posting shall discuss possible solutions. The goal is to let Tor
> Metrics grow in a healthy fashion that encourages contributions from
> the community. These solutions are not mutually exclusive, and the
> best solution may use parts of more than one solution sketched out here.
> 1 Make Tor Metrics better and bigger, internally
> The obvious solution is that the maintainers of Tor Metrics could just
> work harder to overcome the problems stated above. Let's think this
> 1.1 Add more development resources
> If only the current Tor Metrics maintainers had more time to devote to
> cleaning up existing parts and to add new parts, that would solve our
> problem. They could refactor parts that are hard to maintain, and
> they could work off the serious backlog that has piled up. Of course,
> this means dropping or handing over responsibilities for other
> products, and it may mean finding (and paying) new developers to help
> maintain Tor Metrics. It's unclear whether anything like this would
> fit into Tor's budget, and whether these changed priorities would make
> users of tools that had to be dropped or handed over unhappy.
> 1.2 Rewrite internal parts of Tor Metrics to encourage external
> Most of Tor Metrics would have run 10 or 15 years ago with only minor
> modifications. It's not necessarily a bad thing to use established
> technologies. But maybe, if we rewrite it using modern
> data-processing, web, and visualization frameworks, it becomes more
> attractive to other developers to contribute code and help maintain
> existing (well, then rewritten) code. The result would be a larger
> Tor Metrics website that is easier to maintain and hopefully
> maintained by more people. It's unclear how realistic this plan is,
> though, and it requires attention by Tor Metrics maintainers to bring
> it enough into shape for external contributors to get involved.
> 2 Add more ways to contribute to Tor Metrics externally
> It may be possible to further grow Tor Metrics without adding more
> code to it, hence not making it any harder to maintain. However, if
> code to generate visualizations is run elsewhere, there's a certain
> risk that results are not perceived as trustworthy as if that code
> were run as part of Metrics. This is primarily a problem of setting
> user expectations right. We could add different ways for contributing
> to Tor Metrics, depending on the level of commitment that contributors
> are willing to make. Possible new ways (in addition to filing a Trac
> ticket, which is already possible, though not very effective) are:
> 2.1 Accept contribution of static data or static graphs
> Somebody might contribute data (in a tarball, download link, etc.) or
> a static graph (static as in "doesn't break, ever", not "static HTML
> Tor Metrics team reviews that and puts it on the Tor Metrics website,
> together with a short description, author information, license, etc.
> There are plenty of visualizations on Trac and on the mailing lists,
> so we'll have to define criteria what we add and what not, and we'll
> need a good process for making that happen.
> 2.2 Link to external websites
> Somebody might write a website that visualizes Tor network data. The
> Tor Metrics team reviews the idea behind it, but does not necessarily look
> at its code, and adds an external link to Tor Metrics. It becomes
> obvious that the authors remain responsible for their visualization,
> so there's no risk involved for Tor Metrics, but users may not trust
> it as much, because it doesn't have the Tor Metrics label. Note that
> we're already doing this approach by linking to the visualizations
> showing "Tor users as percentage of larger Internet population" 
> and "Data flow in the Tor network" . Also note that we could as
> well have hosted the former directly on Tor Metrics with appropriate
> attribution, because it's a static image. This is not the case with
> the latter.
> 2.3 Run an externally developed website as if it were part of Tor Metrics
> Let's imagine that somebody produces a visualization of Tor network
> data and would like to make it part of Tor Metrics but without
> limiting themselves to the technology used by Tor Metrics. We could
> let them write their visualization as website and integrate it into
> Tor Metrics after reviewing its code.
> Technically, part of this integration would be to "redress" the
> website by applying the Tor Metrics design (which has lots of room for
> improvement, but let's just say the result will look as seamlessly
> integrated into Tor Metrics as the "Network bubble graphs" ).
> Another part would probably be to rewrite web requests, so that users
> still think they're talking to https://metrics.torproject.org/, but
> really they're talking to another webserver behind that.
> Regarding hosting and maintenance, in theory, the website could be
> hosted by the original creators, but that effectively means that the
> Tor Metrics team gives up part of the control about what's on the Tor
> Metrics website. The creators of the external website could change
> parts or add new parts that wouldn't be reviewed by Tor Metrics
> developers, but they would be perceived as part of Metrics, which
> seems bad. The Tor Metrics team could run the externally developed
> website on a separate host or on the same host as Tor Metrics. We
> could imagine variants where the original creator stays around to fix
> any issues as they come up, or we could imagine that they donate their
> visualization that the Tor Metrics people will then maintain. We
> could even imagine that the Tor Metrics maintainers some day decide to
> integrate the originally external website into Tor Metrics proper, but
> that would not be required for this model to work.
> All these ideas require writing down guidelines, criteria, and
> processes. In particular, they require more thoughts and input from
> other people who are not currently involved in Tor Metrics maintenance
> and who can be expected to be more objective. And once these ideas are
> implemented, we'll need more Tor Metrics maintainers than just one.
> What are your thoughts?
> All the best,
>  https://metrics.torproject.org/
>  https://metrics.torproject.org/oxford-anonymous-internet.html
>  https://metrics.torproject.org/uncharted-data-flow.html
>  https://metrics.torproject.org/bubbles.html
thomas lörtsch + hospitalstr. 95 + d 22767 hamburg
+49 173 202 71 99 + tl at rat.io + tomlurge at someOtherServices