[tor-dev] Metrics Plans

Kostas Jakeliunas kostas at jakeliunas.com
Wed May 29 02:05:55 UTC 2013


Hello!
(@tor-dev: I will also write a separate email introducing the GSoC project
at hand.)

This GSoc idea started a year back as a searchable descriptor search
> application, totally unrelated to Onionoo.  It was when I read Kostas'
> proposal that I started thinking about an integration with Onionoo.
> That's why the plan is still a bit vague.  We should work together with
> Kostas very soon to clarify the plan.
>

Indeed, as it currently stands, the scope of the proposed backend part of
the searchable descriptor project is unclear. The original plan did not aim
for a universal backend which could, for example, serve the existing
web-facing Metrics applications. The idea was rather to replace the relay
and consensus search/lookup tools with a single, more powerful "search and
browse descriptor archives" application.

However, I completely agree that an integrated, reusable backend sounds more
exciting and could make the broader Tor metrics ecosystem more uniform,
reducing the number of separate tools and components. I think this is doable
if the tasks/steps of this project are kept somewhat isolated, so that
development can happen incrementally and the project is not an
all-or-nothing gamble. (That is obviously how it is intended to be in
general, but I think it is an especially important aspect of this project.)

> Maybe we should focus on a 'grand unified backend' rather than
> > splitting Kostas' summer between both a backend and frontend? If he
> > could replace the backends of the majority of our metrics services
> > then that would greatly simplify the metrics ecosystem.
>
> I'm mostly interested in the back-end, too.  But I think it won't be as
> much fun for Kostas if he can't also work on something that's visible to
> users.  I don't know what he prefers though.
>

Honestly, I would be up for focusing exclusively on the backend part if need
be. That would probably also prove the most beneficial to the overall
ecosystem of tools. However, such a plan implies that the final goal
(ideally) is a replacement for Onionoo, which means it would have to be
reliably stable and scalable so that multiple frontends could use it at
once. (It will have to be stable in any case, of course.) I think this would
be a great goal, but if we define and isolate development stages well, I
think pursuing two goals at the same time - (a) an Onionoo replacement and
(b) a descriptor search+browse frontend - is fine, and either of them could
be dropped or reduced along the way. With that in mind, here is roughly what
I would propose as incremental deliverables / sub-projects, to be done
sequentially:

1. Work out the database schema for (a) relay descriptors; (b) consensus
statuses; (c) *bridge summaries; (d) *bridge network statuses.

Here, I think it is realistic to try to import all the fields available from
metrics-db-*. My PoC is overly simplistic in this regard: it covers only
relay descriptors, and only a limited subset of their data fields makes it
into the schema and the import. I think it is also realistic to import the
bridge data used and reported by Onionoo. Here is the nicely 'incremental'
part: the Onionoo protocol/design is useful in itself, as a clean "relay
processing" design (what comes in, and in what form it comes out). It
therefore makes sense to design the DB schema with the fields used and
reported by Onionoo in mind. Even if the project ends up not aiming for
Onionoo compatibility in terms of API endpoints, or not reporting everything
(e.g. guard probability) - though I would like to aim for compatibility, as
I suppose would all of you! - there should be little to no duplication of
effort when designing the schema and the descriptor/data import part of the
backend. The bridge data can later be dropped. I will soon look more closely
at whether the schema can be made easily *extensible* to include bridge data
later, but it might be safer to define the whole schema for processing db-R,
db-B and db-P from the beginning, and simply not work on the actual bridge
data import at first (depending on priorities). A very rough schema sketch
follows below.
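
To make this a bit more concrete, here is a very rough sketch of how the
relay descriptor and consensus status tables might relate (SQLAlchemy
declarative style; all table and column names are illustrative only, not a
proposed final schema):

    # Illustrative sketch only: table and column names are placeholders, not
    # a final schema.  Assumes SQLAlchemy (declarative style), which a Flask
    # app can use; the actual PoC may differ.
    from sqlalchemy import Column, DateTime, ForeignKey, Integer, String
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import relationship

    Base = declarative_base()

    class Descriptor(Base):
        """One row per server descriptor (as made available via metrics-db-R)."""
        __tablename__ = 'descriptor'
        descriptor = Column(String(40), primary_key=True)  # hex digest
        fingerprint = Column(String(40), index=True)
        nickname = Column(String(19), index=True)
        address = Column(String(15), index=True)
        or_port = Column(Integer)
        published = Column(DateTime, index=True)
        # ... the remaining descriptor fields would be added here

    class StatusEntry(Base):
        """One row per router status entry in a network status consensus."""
        __tablename__ = 'statusentry'
        id = Column(Integer, primary_key=True)
        validafter = Column(DateTime, index=True)  # consensus valid-after time
        nickname = Column(String(19), index=True)
        fingerprint = Column(String(40), index=True)
        descriptor = Column(String(40), ForeignKey('descriptor.descriptor'),
                            index=True)
        desc = relationship(Descriptor)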

2. Implement the data import part. Again, the focus would be on importing
all the fields available from, most importantly, metrics-db-R: more fields
from relay descriptors, and also from consensus statuses. Descriptor digests
in consensuses will refer to relay descriptors, and it must be possible to
efficiently query the consensus table to ask "in which statuses has this
descriptor been present?" (see the query sketch below).
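
For example, with the illustrative tables from the sketch above, that
question would boil down to something like this (sketch only; the database
name is a placeholder):

    # Sketch only: answer "in which statuses has this descriptor been
    # present?" for a given descriptor digest.  Builds on the illustrative
    # StatusEntry table above.
    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    engine = create_engine('postgresql:///torsearch')  # placeholder database
    Session = sessionmaker(bind=engine)

    def statuses_for_descriptor(digest):
        """Return valid-after times of all consensuses listing this digest."""
        session = Session()
        rows = (session.query(StatusEntry.validafter)
                       .filter(StatusEntry.descriptor == digest)
                       .order_by(StatusEntry.validafter.desc())
                       .all())
        return [r.validafter for r in rows]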

These two parts are crucial whether the project aims to replace Onionoo, to
provide a search & browse frontend, or both.

3. Implement Onionoo-compatible search queries, with (maybe only) a subset
of the result fields. Again, I don't see why using the Onionoo
protocol/design shouldn't work here in any case. (Other Onionoo-specific
nuances, like compressed responses etc., shouldn't be hard at all, I think.)
Make sure the Onionoo-compatible queries scale well over all the archival
data. By queries I mean:

 GET summary
 GET details

Bandwidth/weights documents can wait until the time constraints become
clearer. All the parameters available for filtering Onionoo results [1] make
sense to me: the more powerful search/query system (well, bits of it)
referred to in the original project proposal can be seen as a superset of
which the Onionoo query/filter system is a subset. This is great, because it
means there is nothing wrong with aiming for an Onionoo-compatible query
language that frontends and other applications could use to query the new
backend in any case. A minimal endpoint sketch follows below.
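
For illustration, a minimal Flask view for GET summary might start out
roughly like this (a hedged sketch reusing the illustrative schema and
session above; only the 'search' and 'limit' parameters are taken from the
Onionoo protocol, the rest is made up):

    # Minimal sketch of an Onionoo-style 'GET summary' endpoint in Flask.
    # The 'search' and 'limit' parameters follow the public Onionoo protocol;
    # the field selection and response details are only illustrative.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route('/summary')
    def summary():
        query = Session().query(StatusEntry)
        if 'search' in request.args:
            term = request.args['search']
            # match nickname prefixes or fingerprint prefixes
            query = query.filter(
                (StatusEntry.nickname.like(term + '%')) |
                (StatusEntry.fingerprint.like(term.upper() + '%')))
        if 'limit' in request.args:
            query = query.limit(int(request.args['limit']))
        relays = [{'n': entry.nickname, 'f': entry.fingerprint}
                  for entry in query]
        return jsonify(relays=relays, bridges=[])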

4. At this point we would have an Onionoo-compatible relay/data search
(possibly excluding bridges, and probably excluding bandwidth weights etc.)
over all the archival data, fed to the backend via simple rsync - which
already works very well for the small subset of archival data available by
rsync'ing the 'recent' archive folder. That alone would be great. From here
on, depending on how long all of this took and on what our clarified goals
are, more things can happen, and the goals do become less clear:

As per my original proposal, implementing a more powerful query/filter
system would be part of the plan: expressing and combining AND/OR conditions
(only the actual syntax still needs to be decided on), and also being able
to refer to more fields (this obviously requires being more concrete; I will
be able to work on this). The query/filter syntax can be kept
(backwards-)compatible with current Onionoo either by cheaply adding an
optional parameter that specifies an advanced protocol version and then
changing the rest of the query as needed, or by more carefully designing the
syntax to truly be a superset of the current Onionoo query ruleset. I'm not
sure about this one yet; the good news is that all the previous parts can be
worked on before such decisions are made (see the sketch of the
optional-parameter idea below). Of course, it would be very useful to have
the ideal extended query design and the scope of queries/results clear from
the start, so that we don't end up constraining ourselves with a limited
schema design - though migrating imported data between schemas should be
possible.
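
Just to illustrate the optional-parameter idea (everything here - parameter
names, accepted filters - is hypothetical, not a decided design):

    # Purely hypothetical sketch of the optional "advanced protocol version"
    # parameter: 'query_version' and 'q' are made-up names, not part of
    # Onionoo or of any decided design.
    def parse_filters(args):
        """Dispatch between plain Onionoo filters and a future extended syntax."""
        if args.get('query_version', '1') == '1':
            # plain Onionoo semantics: each recognised parameter acts as an
            # implicit AND condition, exactly as today
            return [(key, value) for key, value in args.items()
                    if key in ('search', 'running', 'flag', 'country')]
        # extended syntax, e.g. one explicit boolean expression in a single
        # parameter: ?query_version=2&q=(nickname:moria* OR address:128.31.*)
        raise NotImplementedError('extended query parser still to be designed')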

5. Optionally, this (vast) part would include working on a frontend
application that makes use of the new, more powerful backend capabilities.
See the original proposal for details (I'll make sure it's reachable by
tor-dev). My idea was to further isolate parts of the frontend for
incremental development, with leveraging the more powerful search
capabilities in a simple frontend as the most important aspect. This is
still very vague though, and I need to refer back to the proposal.

Another, related thing: the PoC currently acts as a backend and a (sorry
excuse for a) frontend all in one. The plan would be to completely separate
the two, code-wise and application-wise, with the backend providing an API
for the frontend. This is the part that is great about Onionoo: I think
implementing an Onionoo-compatible API (or a reduced version of it, if we
eventually go in the latter, frontend-centered direction) is feasible and
makes sense whatever the final direction of the project turns out to be. I
might need to provide more details about this later, but I'd really like to
make the two completely separate (interchangeable, switchable)
application-wise. Yes for modularity!
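
In other words, the frontend would talk to the backend only over the same
HTTP API that any other client would use. A trivial sketch (placeholder URL,
assuming the 'requests' library):

    # Trivial sketch of the frontend/backend split: the frontend is just
    # another HTTP client of the backend's (Onionoo-like) API.
    import requests

    BACKEND = 'http://localhost:5000'  # placeholder backend location

    def search_relays(term, limit=50):
        """Ask the backend for a relay summary, like any other client would."""
        response = requests.get(BACKEND + '/summary',
                                params={'search': term, 'limit': limit})
        response.raise_for_status()
        return response.json()['relays']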

> * The present relay search renders raw router status entries. Does it
> > actually store the text of the router status entries within the
> > database? With the new relay search I suppose we'll be retrieving the
> > attributes rather than raw descriptor text, is that right?
>
> The present relay search and ExoneraTor store raw text of router status
> entries in their databases.  But that doesn't mean that the new relay
> search needs to do that, too.


The idea would be to import all data as DB fields (so, indexable), but it
also makes sense to import the raw text lines, to be able to e.g. supply the
frontend application with raw data if needed, as the current tools do. I
think this could live in a separate table with the descriptor id as primary
key, which means it could also be added later without causing any problems;
there's probably no need to do it right now. A small sketch of that table
follows below.
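
Sketch of what that separate raw-text table could look like (again, names
are illustrative only, reusing the SQLAlchemy style and the Base class from
the earlier sketch):

    # Illustrative sketch only: keep raw descriptor text in its own table,
    # keyed by the same descriptor digest, so it can be added or dropped
    # independently of the indexed fields.
    from sqlalchemy import Column, ForeignKey, String, Text

    class DescriptorRawText(Base):
        __tablename__ = 'descriptor_rawtext'
        descriptor = Column(String(40), ForeignKey('descriptor.descriptor'),
                            primary_key=True)
        rawtext = Column(Text)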

I've probably glossed over the most sensitive/convoluted parts of the plan!
:) Let me know where I should be more specific already at the very start of
the project.

Does the proposed incremental development plan make sense?

I will hopefully follow up later with my more immediate plans. I thought I
would have an extended schema by now - I have more code, but I still need
to sort it out. And I'm still not sure whether trying to import all the
available data fields makes sense. I suspect not much significant progress
*code-wise* will happen until my exams are over, but I am not sure.
Hopefully we can focus on design in the meantime. (Also: I'm itching to
import *all* the archival data, even into a reduced schema, and run some
more nasty queries on it.)

Hopefully I did not just make things more convoluted!
Regards,
Kostas.


[1] https://onionoo.torproject.org/

On Mon, May 27, 2013 at 10:25 PM, Damian Johnson <atagar at torproject.org>wrote:

> Hi Kostas. Now that we no longer need to worry about accidentally
> leaking GSoC selection we can talk more openly about your project.
> Below is an interchange between me and Karsten - thoughts?
>
> ---------- Forwarded message ----------
> From: Karsten Loesing <karsten at torproject.org>
> Date: Thu, May 23, 2013 at 11:37 AM
> Subject: Re: Metrics Plans
> To: Damian Johnson <atagar at torproject.org>
> Cc: Tor Assistants <tor-assistants at lists.torproject.org>
>
>
> On 5/23/13 7:22 PM, Damian Johnson wrote:
> > Hi Karsten. I just finished reading over Kostas' proposal and while it
> > looks great, I'm not sure if I fully understand the plan. Few
> > clarifying questions...
> >
> > * What descriptor information will his backend contain? Complete
> > descriptor attributes (ie, all the attributes from the documents), or
> > only what we need? His proof of concept importer [1] only contains a
> > subset but that's, of course, not necessarily where we're going.
> >
> > If we're aiming for this to be the 'grand unifying backend' for
> > Onionoo, Exonerator, Relay Search, etc then it seems like we might as
> > well aim for it to be complete. But that naturally means more work
> > with schema updates as descriptors change...
>
> This GSoc idea started a year back as a searchable descriptor search
> application, totally unrelated to Onionoo.  It was when I read Kostas'
> proposal that I started thinking about an integration with Onionoo.
> That's why the plan is still a bit vague.  We should work together with
> Kostas very soon to clarify the plan.
>
> > * The present relay search renders raw router status entries. Does it
> > actually store the text of the router status entries within the
> > database? With the new relay search I suppose we'll be retrieving the
> > attributes rather than raw descriptor text, is that right?
>
> The present relay search and ExoneraTor store raw text of router status
> entries in their databases.  But that doesn't mean that the new relay
> search needs to do that, too.
>
> > * Kostas' proposal includes both the backend importing/datastore and
> > also a Flask frontend for rendering the search results. In terms of
> > the present tools diagram [2] I suppose that would mean replacing
> > metrics-web-R and having a python counterpart of metrics-db-R (with
> > the aim of later deprecating the old metrics-db-R). Is that right?
>
> Not quite.  We cannot replace metrics-db-R yet, because that's the tool
> that downloads relay descriptors for all other services.  It needs to
> work really stable.  Replacing metrics-db-R would be a different
> project.  The good thing though is that metrics-db-R offers its files
> via rsync, so that's a very clean interface for services using its data.
>
> In terms of the tools diagram, Kostas would write a second tool in the
> "Process" column above Onionoo that would feed two replacement tools for
> metrics-web-R and metrics-web-E.  His processing tool would use data
> from metrics-db-R and metrics-db-E.
>
> If his tool is supposed to replace more parts of Onionoo and not only
> replace relay search and ExoneraTor, it would use data from metrics-db-B
> and metrics-db-P, too.
>
> > Maybe we should focus on a 'grand unified backend' rather than
> > splitting Kostas' summer between both a backend and frontend? If he
> > could replace the backends of the majority of our metrics services
> > then that would greatly simplify the metrics ecosystem.
>
> I'm mostly interested in the back-end, too.  But I think it won't be as
> much fun for Kostas if he can't also work on something that's visible to
> users.  I don't know what he prefers though.
>
> In my imagination, here's how the tools diagram looks like by the end of
> summer:
>
> - Kostas has written an Onionoo-like back-end that allows searches for
> relays or bridges in our archives since 2007 and provides details for
> any point in the past.  Maybe his tool will implement the existing
> Onionoo interface, so that Atlas and Compass can switch to using it
> instead of Onionoo.
>
> - We'll still keep using Onionoo for aggregating bandwidth and weights
> statistics per relay or bridge, but Kostas' tool would give out that data.
>
> - Thomas has written Visionion and replacements for metrics-web-N and
> metrics-web-U.  You probably saw the long discussion on this list.  This
> is a totally awesome project on its own, but it's sufficiently separate
> from Kostas' project (Kostas is only interested in single
> relays/bridges, whereas Thomas is only interested in aggregates).
>
> I'm aware that not all of this may happen in one summer.  That's why I'm
> quite flexible about plans.  There are quite a lot of missing puzzle
> pieces in the overall picture, people can start wherever they want and
> contribute something useful.
>
> > I was very, very tempted to start up a thread on tor-dev@ to discuss
> > this but couldn't figure out a way of doing so without letting Kostas
> > know that we're taking him on. If you can think of a graceful way of
> > including him or tor-dev@ then feel free.
>
> Let's wait four more days, if that's okay for you.  Starting a new
> discussion there about this together with Kostas sounds like a fine plan.
>
> This will be an exciting summer! :)
>
> Best,
> Karsten
>
>
> > [1] https://github.com/wfn/torsearch/blob/master/tsweb/importer.py#L16
> > [2] https://metrics.torproject.org/tools.html
> >
>
