Hi Kostas. Now that we no longer need to worry about accidentally leaking the GSoC selection we can talk more openly about your project. Below is an exchange between Karsten and me - thoughts?
---------- Forwarded message ----------
From: Karsten Loesing karsten@torproject.org
Date: Thu, May 23, 2013 at 11:37 AM
Subject: Re: Metrics Plans
To: Damian Johnson atagar@torproject.org
Cc: Tor Assistants tor-assistants@lists.torproject.org
On 5/23/13 7:22 PM, Damian Johnson wrote:
Hi Karsten. I just finished reading over Kostas' proposal and while it looks great, I'm not sure I fully understand the plan. A few clarifying questions...
- What descriptor information will his backend contain? Complete descriptor attributes (i.e., all the attributes from the documents), or only what we need? His proof-of-concept importer [1] only contains a subset, but that's, of course, not necessarily where we're going.
If we're aiming for this to be the 'grand unifying backend' for Onionoo, ExoneraTor, Relay Search, etc., then it seems like we might as well aim for it to be complete. But that naturally means more work with schema updates as descriptors change...
This GSoC idea started a year back as a searchable descriptor application, totally unrelated to Onionoo. It was when I read Kostas' proposal that I started thinking about an integration with Onionoo. That's why the plan is still a bit vague. We should work together with Kostas very soon to clarify the plan.
- The present relay search renders raw router status entries. Does it
actually store the text of the router status entries within the database? With the new relay search I suppose we'll be retrieving the attributes rather than raw descriptor text, is that right?
The present relay search and ExoneraTor store raw text of router status entries in their databases. But that doesn't mean that the new relay search needs to do that, too.
- Kostas' proposal includes both the backend importer/datastore and a Flask frontend for rendering the search results. In terms of the present tools diagram [2] I suppose that would mean replacing metrics-web-R and having a Python counterpart of metrics-db-R (with the aim of later deprecating the old metrics-db-R). Is that right?
Not quite. We cannot replace metrics-db-R yet, because that's the tool that downloads relay descriptors for all other services. It needs to be really stable. Replacing metrics-db-R would be a different project. The good thing, though, is that metrics-db-R offers its files via rsync, so that's a very clean interface for services using its data.
In terms of the tools diagram, Kostas would write a second tool in the "Process" column above Onionoo that would feed two replacement tools for metrics-web-R and metrics-web-E. His processing tool would use data from metrics-db-R and metrics-db-E.
If his tool is supposed to replace more parts of Onionoo and not only replace relay search and ExoneraTor, it would use data from metrics-db-B and metrics-db-P, too.
Maybe we should focus on a 'grand unified backend' rather than splitting Kostas' summer between both a backend and frontend? If he could replace the backends of the majority of our metrics services then that would greatly simplify the metrics ecosystem.
I'm mostly interested in the back-end, too. But I think it won't be as much fun for Kostas if he can't also work on something that's visible to users. I don't know what he prefers though.
In my imagination, here's how the tools diagram looks by the end of the summer:
- Kostas has written an Onionoo-like back-end that allows searches for relays or bridges in our archives since 2007 and provides details for any point in the past. Maybe his tool will implement the existing Onionoo interface, so that Atlas and Compass can switch to using it instead of Onionoo.
- We'll still keep using Onionoo for aggregating bandwidth and weights statistics per relay or bridge, but Kostas' tool would serve that data out.
- Thomas has written Visionion and replacements for metrics-web-N and metrics-web-U. You probably saw the long discussion on this list. This is a totally awesome project on its own, but it's sufficiently separate from Kostas' project (Kostas is only interested in single relays/bridges, whereas Thomas is only interested in aggregates).
I'm aware that not all of this may happen in one summer. That's why I'm quite flexible about plans. There are quite a lot of missing puzzle pieces in the overall picture, so people can start wherever they want and contribute something useful.
I was very, very tempted to start up a thread on tor-dev@ to discuss this but couldn't figure out a way of doing so without letting Kostas know that we're taking him on. If you can think of a graceful way of including him or tor-dev@ then feel free.
Let's wait four more days, if that's okay for you. Starting a new discussion there about this together with Kostas sounds like a fine plan.
This will be an exciting summer! :)
Best, Karsten
[1] https://github.com/wfn/torsearch/blob/master/tsweb/importer.py#L16
[2] https://metrics.torproject.org/tools.html
Hello! (@tor-dev: I will also write a separate email introducing the GSoC project at hand.)
This GSoC idea started a year back as a searchable descriptor application, totally unrelated to Onionoo. It was when I read Kostas' proposal that I started thinking about an integration with Onionoo. That's why the plan is still a bit vague. We should work together with Kostas very soon to clarify the plan.
Indeed, as it currently stands, the extent of the proposed backend part of the searchable descriptor project is unclear. The original plan was not to aim for a universal backend that could, for example, serve the existing web-facing Metrics applications. The idea was to hopefully replace the relay and consensus search/lookup tools with a single, more powerful "search and browse descriptor archives" application.
However, I completely agree that an integrated, reusable backend sounds more exciting and could potentially make the broader Tor metrics-* ecosystem more uniform, reducing the number of tools and components. I think this is doable if the tasks/steps of this project are kept somewhat isolated, so that incremental development can happen and it's not an all-or-nothing gamble. (That is how it is intended to be anyway, but I think it would be an especially important aspect of this project.)
Maybe we should focus on a 'grand unified backend' rather than
splitting Kostas' summer between both a backend and frontend? If he could replace the backends of the majority of our metrics services then that would greatly simplify the metrics ecosystem.
I'm mostly interested in the back-end, too. But I think it won't be as much fun for Kostas if he can't also work on something that's visible to users. I don't know what he prefers though.
Honestly, I would be up for focusing exclusively on the backend part, if need be. It would also probably (hopefully) prove to be the most beneficial to the overall ecosystem of tools. However, such a plan would imply that the final goal (ideally) is a replacement for Onionoo, which would have to be reliably stable and scalable so that multiple frontends could all use it at once. (It will have to be stable in any case, of course.) I think that would be a great goal, but if we can define and isolate the development stages well enough, I think it is OK to pursue two goals at the same time - (a) an Onionoo replacement and (b) a descriptor search+browse frontend - with either one dropped or reduced along the way. This is what I'd have in mind, generally speaking, in terms of incremental deliverables / sub-projects that can be done sequentially:
1. Work out the database schema for (a) relay descriptors; (b) consensus statuses; (c) *bridge summaries; (d) *bridge network statuses.
Here, I think it is realistic to try to import all the fields available from metrics-db-*. My PoC is overly simplistic in this regard: it covers only relay descriptors, and only a limited subset of data fields is used in the schema for the import. I also think it is realistic to import the bridge data used and reported by Onionoo. Here is the good, 'incremental' part: the Onionoo protocol/design is useful in itself as a clean "relay processing" design (what comes in, and in what form it comes out), so it makes sense to design the DB schema with the fields used and reported by Onionoo in mind. Even if the project ends up not aiming for Onionoo compatibility (in terms of its API endpoints, or perhaps not reporting everything, e.g. guard probability - though I would like to aim for compatibility, as I suppose would all of you), there should be little to no duplication of effort when designing the schema and the descriptor/data import part of the backend. The bridge data can be dropped later. I will soon look more closely at whether the schema can be made easily *extensible* to include bridge data later, but it might be safer to have the whole schema for processing db-R, db-B and db-P from the beginning, and simply not work on the actual bridge data import at first (depending on priorities). A rough schema sketch follows further below.
2. Implement the data import part: again, the focus would be on importing all the fields available from, most importantly, metrics-db-R - more fields from relay descriptors, and also consensus statuses. Descriptor IDs in consensuses will refer to relay descriptors; it must also be possible to efficiently query the consensus table to ask "in which statuses has this descriptor been present?"
These two parts are crucial whether the project aims to be an Onionoo replacement, a search & browse frontend, or both.
3. Implement Onionoo-compatible search queries, and (maybe only) a subset of result fields. Again, I don't see why using the Onionoo protocol/design shouldn't work here in any case. (Other Onionoo-specific nuances, like compressed responses etc., shouldn't be hard at all, I think.) Make sure Onionoo-compatible queries scale well over all the archival data. By queries I mean:
GET summary
GET details
Bandwidth/weights can wait until the time constraints become more obvious. All the parameters available for filtering Onionoo results [1] make sense to me: the more powerful search/query system (well, bits of it) referred to in the original project proposal can be seen as a superset of which the Onionoo query/filter system would be a subset. Again, this is great: there's nothing wrong with aiming for an Onionoo-compatible query language which frontends / other applications could use to query the new backend anyway. So that's good.
4. At this point, if we have Onionoo-compatible relay/data search (possibly excluding bridges, and probably excluding bandwidth weights etc.) over all the archival data, fed to the backend via simple rsync (rsync'ing the 'recent' archive folder works very well for the small subset of archival data available there), that will already be great. From here on, depending on how long all of this took and what our clarified goals are, more things can happen, and the goals admittedly become less clear:
As per my original proposal, implementing a more powerful query/filter system would be part of the plan: specifying and encapsulating AND/OR (only the actual syntax needs to be decided on), and also being able to refer to more fields (this obviously requires being more concrete; I will be able to work on this). The query/filter syntax can be made (backwards-)compatible with current Onionoo either by cheaply adding an optional parameter that specifies an advanced protocol version and then changing the rest of the query as needed, or by more carefully designing the syntax to truly be a superset of the current Onionoo query ruleset. I'm not sure about this one; the good news is that all the previous parts can be worked on before such decisions are made. Of course, it would be very useful to have the ideal extended query design / scope of querying and results clear from the start, so that we don't constrain ourselves with a limited schema design - though migrating imported data between schemas should be possible.
5. Optionally, this (vast) part would include working on a frontend application which makes use of the new, more powerful backend capabilities. See the original proposal for details (I'll see to it that it's reachable by tor-dev.) My idea was to further isolate parts of the frontend for incremental development, so that leveraging the more powerful search capabilities in a simple frontend would be the most important aspect. This is still very vague though; I need to refer back to the proposal.
Another, related thing: the PoC currently acts as a backend and a (sorry excuse for a) frontend all in one. The plan would be to completely separate the two, code-wise and application-wise, with the backend providing an API for the frontend. This is what is great about Onionoo: I think implementing an Onionoo-compatible API (or a reduced version of it, if we eventually go in the latter, frontend-centric direction) is feasible and makes sense whatever the final direction of the project turns out to be. I may need to provide more details about this later, but I'd really like to make the two completely separate (interchangeable, switchable) applications. Yes to modularity!
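To make steps 1 and 2 above a bit more concrete, here is a rough, illustrative sketch of the two core tables and the "in which statuses has this descriptor been present?" query, written with SQLAlchemy purely for illustration. Table names, column names and the connection string are all placeholders, not a proposed final schema.

# Illustrative only: a minimal cut of the schema from steps 1-2 above.
# Real descriptors carry many more fields; all names here are made up.
from sqlalchemy import (Column, DateTime, ForeignKey, Integer, String,
                        create_engine)
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Descriptor(Base):
    """One row per relay server descriptor."""
    __tablename__ = 'descriptor'
    descriptor = Column(String(40), primary_key=True)  # hex digest of the descriptor
    fingerprint = Column(String(40), index=True)
    nickname = Column(String(19))
    published = Column(DateTime, index=True)
    address = Column(String(15))
    or_port = Column(Integer)
    platform = Column(String(256))
    # ... remaining dir-spec fields would follow here

class StatusEntry(Base):
    """One row per relay per network status consensus."""
    __tablename__ = 'statusentry'
    validafter = Column(DateTime, primary_key=True)
    fingerprint = Column(String(40), primary_key=True)
    nickname = Column(String(19))
    descriptor = Column(String(40), ForeignKey('descriptor.descriptor'), index=True)
    # ... flags, addresses, ports, etc.

engine = create_engine('postgresql:///tordir')  # placeholder connection string
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

def statuses_containing(digest):
    """Answer: in which statuses has this descriptor been present?"""
    return [row.validafter for row in
            session.query(StatusEntry.validafter)
                   .filter(StatusEntry.descriptor == digest)
                   .order_by(StatusEntry.validafter)]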
- The present relay search renders raw router status entries. Does it
actually store the text of the router status entries within the database? With the new relay search I suppose we'll be retrieving the attributes rather than raw descriptor text, is that right?
The present relay search and ExoneraTor store raw text of router status entries in their databases. But that doesn't mean that the new relay search needs to do that, too.
The idea would be to import all data as DB fields (so, indexable), but it makes sense to also import the raw text lines, to be able to e.g. supply the frontend application with raw data if needed, as the current tools do. I think this could be a separate table with the descriptor id as its primary key, which means it can be added later on if need be without causing problems. I guess there's no need to do this right now.
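For illustration, that separate raw-text table could be as small as the following (reusing the declarative Base from the schema sketch above; names are again made up):

from sqlalchemy import Column, ForeignKey, String, Text

class DescriptorRaw(Base):  # Base as defined in the schema sketch above
    """Verbatim descriptor text, keyed by digest, so it can be back-filled later."""
    __tablename__ = 'descriptor_raw'
    descriptor = Column(String(40), ForeignKey('descriptor.descriptor'),
                        primary_key=True)
    raw = Column(Text)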
I've probably glossed over the most sensitive/convoluted parts of the plan! :) Let me know where I should already be more specific at the very start of the project.
Does the proposed incremental development plan make sense?
I will hopefully follow up later with my more immediate plans. I thought I would have an extended schema by now - I have more code, but I still need to sort it out. And I'm still not sure whether trying to import all the available data fields makes sense. I suspect not much significant progress *code-wise* will happen until my exams are over, but I am not sure. Hopefully we can focus on design though. (Also: I'm itching to import *all* archival data, even into a reduced schema, and run some more nasty queries on it.)
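(As a rough sketch of what that bulk import could look like: stem's DescriptorReader can walk an rsync'ed archive directory directly. The path below and the store() stub are placeholders, not actual project code.)

from stem.descriptor.reader import DescriptorReader

def store(desc):
    # Placeholder: would map desc.fingerprint, desc.published, ... onto the schema.
    print(desc.fingerprint, desc.published)

def import_archive(path='recent/relay-descriptors/server-descriptors'):
    # Walks the directory (tarballs included) and yields parsed descriptors.
    with DescriptorReader([path]) as reader:
        for desc in reader:
            store(desc)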
Hopefully I did not just make things more convoluted! Regards, Kostas.
[1] https://onionoo.torproject.org/
On 5/29/13 4:05 AM, Kostas Jakeliunas wrote:
Hello! (@tor-dev: will also write a separate email, introducing the GSoC project at hand.)
This GSoC idea started a year back as a searchable descriptor application, totally unrelated to Onionoo. It was when I read Kostas' proposal that I started thinking about an integration with Onionoo. That's why the plan is still a bit vague. We should work together with Kostas very soon to clarify the plan.
Indeed, as it currently stands, the extent of the proposed backend part of the searchable descriptor project is unclear. The original plan was not to aim for a universal backend which could ideally, for example, service existing web-side Metrics etc. project applications. The idea was to hopefully be able to replace relay and consensus search/lookup tools with a single and more powerful "search and browse descriptor archives" application.
However I completely agree that an integrated, reusable backend sounds more exciting and could potentially/hopefully make the broader Tor metrics-* &c ecosystem more uniform if that's the word - reducing the tool/component counts.
Sounds great! Sorry for making things more complicated by suggesting the Onionoo integration, but it just made sense to me when reading your proposal. Don't feel like you have to do it, though: if you'd rather focus on your original proposal, that's fine by me.
I think this is doable if the tasks/steps of this project are somewhat isolated, so that incremental development can happen, and it's not an all-or-nothing gamble (obviously that is the way it is intended to be, but I think this would be an important aspect of this project in particular as well.)
Incremental development sounds great!
Maybe we should focus on a 'grand unified backend' rather than splitting Kostas' summer between both a backend and frontend? If he could replace the backends of the majority of our metrics services then that would greatly simplify the metrics ecosystem.
I'm mostly interested in the back-end, too. But I think it won't be as much fun for Kostas if he can't also work on something that's visible to users. I don't know what he prefers though.
Honestly, I would actually be up for focusing, if need be, exclusively on the backend part. It would also probably (hopefully) prove to be the most beneficial to the overall ecosystem of tools. However, such a plan would imply that the final goal (ideally) is to have a replacement for Onionoo, which means that it would have to be reliably stable and scalable, so that multiple frontends could all use it at once. (It will have to be stable in any case, of course.) I think this would be a great goal, but if we can define and isolate development stages to a great extent, I think having two goals: (a) Onionoo replacement; (b) descriptor search+browse frontend - at the same time is OK, and either one of them could be dropped/reduced during the process -
I think I understand, but I'm not sure. Just to get this right, is one of these states the planned end state of your GSoC project?
1) descriptor database supporting efficient queries, separate API similar to Onionoo's, front-end application using new search parameters;
2) descriptor database supporting efficient queries, full integration with Onionoo API, no special front-end application using new search parameters; or
3) descriptor database supporting efficient queries, full integration with Onionoo API, front-end application using Onionoo's new search parameters.
this is what I'd have in mind, generally speaking, in terms of general, let's say incremental deliverables / sub-projects, which can be done sequentially:
- Work out the relay schema for (a) relay descriptors; (b)
consensus-statuses; (c) *bridge summaries; (d) *bridge network statuses;
Here, I think it is realistic to try and use and import all the fields available from metrics-db-*. My PoC is overly simplistic in this regard: only relay descriptors, and only a limited subset of data fields is used in the schema, for the import. I think it is realistic to import bridge data used and reported by Onionoo. Here is the good, 'incremental' part I think: the Onionoo protocol/design is useful in itself, as a clean "relay processing" (what comes in and in what form it comes out) design. I think it makes sense to do the DB schema having the fields used and reported by Onionoo in mind. Even if the project ends up not aiming to even be compatible with Onionoo (in terms of its API endpoints, or perhaps not reporting everything (e.g. guard probability) - though I would like to aim for compatibility, as would all of you, I suppose!), I think there should be little to no duplication of effort when designing the schema and the descriptor/data import part of the backend. The bridge data can later be dropped. I will soon try looking closer if the schema can be made such that it may later be very easily *extended* to include bridges data, but it might be safer to at least have the whole schema from the beginning for processing db-R, db-B and db-P, and e.g. simply not work on actual bridge data import at first (depending on priorities.)
Note that there's no Onionoo client that uses bridge data, yet. We have been planning to add bridge support to Atlas for a while, but this hasn't happened yet.
But in general, bridge data is quite similar to relay data. There are some specifics because of sanitized descriptor parts, but in general, data structures are similar.
- Implement data import part: so again, the focus would be on importing
all possible fields available from, most importantly, metrics-db-R. More fields in relay descriptors, and also consensus statuses. Descriptors (IDs) in consensuses will refer to relay descriptors; must be possible to efficiently query the consensus table as well to ask "in which statuses has this descriptor been present?"
These two parts are crucial whether the project is to aim for Onionoo replacement, and/or also provide a search&browse frontend.
- Implement Onionoo-compatible search queries, and (maybe only) a subset
of result fields. Again, I don't see why using the Onionoo protocol/design shouldn't work here in any case. (Other Onionoo-specific nuances, like compressed responses etc., shouldn't be hard at all, I think.) Make sure Onionoo-compatible queries scale well for all archival data. By queries I mean:
GET summary
GET details
Bandwidth/weights can wait until further time constraints become more obvious. All parameters available for filtering Onionoo results [1] make sense to me: the more powerful search/query system (well, bits of it) referred to in the original project proposal can be seen as a superset, of which the Onionoo query/filter system would be a subset. Again, this is great, as I think there's nothing wrong with aiming for an Onionoo-compatible query language which frontends / other applications could query the new backend with anyway! So that's good.
I think it's an advantage here that Onionoo itself has a front-end and a back-end part. The back-end processes data once per hour and writes it to the file system. The front-end is a single Java servlet that does all the filtering and sorting in memory and reads larger JSON files from disk. What we could do is: keep the back-end running, so that it keeps producing details, bandwidth, and weights files, and only replace the servlet by a Python thing that also knows how to respond to more complex search queries.
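(To illustrate, a minimal sketch of what that Python front-end could look like: a Flask view answering an Onionoo-style GET /summary request by filtering in the database rather than in memory. It assumes the StatusEntry model and session from the schema sketch earlier in this thread, imported here from a hypothetical module; only the 'search' and 'limit' parameters are handled, and everything else is illustrative.)

from flask import Flask, jsonify, request

# Hypothetical module holding the StatusEntry model and session sketched earlier.
from torsearch_models import StatusEntry, session

app = Flask(__name__)

@app.route('/summary')
def summary():
    query = session.query(StatusEntry)

    search = request.args.get('search')
    if search:
        # Match nickname or fingerprint prefixes, roughly like Onionoo's 'search'.
        query = query.filter(StatusEntry.nickname.startswith(search) |
                             StatusEntry.fingerprint.startswith(search.upper()))

    limit = int(request.args.get('limit', 50))
    entries = query.order_by(StatusEntry.validafter.desc()).limit(limit)

    return jsonify({
        'relays_published': '',  # would be the latest consensus valid-after time
        'relays': [{'n': e.nickname, 'f': e.fingerprint, 'r': True}  # 'r' hard-coded in this sketch
                   for e in entries],
        'bridges_published': '',
        'bridges': [],
    })

if __name__ == '__main__':
    app.run()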
- At this point, if we have Onionoo-compatible relay/data search (possibly
excluding bridges, and probably excluding bandwidth weights etc) for all the archival data available via simple rsync (it works very well indeed for the (small) subset of archival data available - rsync'ing the 'recent' archive folder) for feeding the data to the backend, it will be great. From here on, depending on how long all of this took and what our clarified goals are, more things can happen, and it becomes less clear goal-wise indeed:
As per my original proposal, implementing a more powerful query/filter system (specifying and encapsulating AND/OR - only the actual syntax needs to be decided on; but also being able to refer to more fields - this obviously requires one to be more concrete - will be able to work on this) would be part of the plan. The query/filter syntax can be made (backwards-)compatible with current Onionoo, either by cheaply adding an additional optional parameter which specifies an advanced protocol version, and then being able to change the rest of the query as is needed, or by more carefully designing the syntax to truly be a superset of the current Onionoo query ruleset. Not sure about this one, the good news is, all previous parts can be worked on before such decisions are made. Of course, it would be very useful to have the ideal extended query design / scope of querying/results clear from the start, so that we don't end up constraining ourselves with a limited schema design. Though migrating imported data between schemas should be possible.
- Optionally, this (vast) part would include working on a frontend
application which would make use of the new powerful backend capabilities. See original proposal for details (I'll see to it so that it's reachable by tor-dev.) My idea was to further isolate parts of the frontend for incremental development, so that leveraging the more powerful search capabilities in a simple frontend would be the most important aspect. This is still very vague though, or I need to refer back to the proposal.
Another, related thing: the PoC acts as a backend and (sorry excuse for a) frontend all-in-one, as of now. The plan would be to completely separate the two code-wise and application-wise, with backend providing an API for the frontend. This is the part that is great about Onionoo: I think implementing an Onionoo-compatible (or a reduced version of, if we eventually go in that (latter, centering-around-frontend) direction) API is feasible and makes sense whatever the final direction of the project is to be. I might need to focus on providing more details about this later, but I'd really like to make the two completely separate (interchangeable, switchable) application-wise. Yes for modularity!
- The present relay search renders raw router status entries. Does it
actually store the text of the router status entries within the database? With the new relay search I suppose we'll be retrieving the attributes rather than raw descriptor text, is that right?
The present relay search and ExoneraTor store raw text of router status entries in their databases. But that doesn't mean that the new relay search needs to do that, too.
The idea would be to import all data as DB fields (so, indexable), but it makes sense to also import raw text lines to be able to e.g. supply the frontend application with raw data if needed, as the current tools do. But I think this could be made to be a separate table, with descriptor id as primary key, which means this can be done later on if need be, would not cause a problem. I guess there's no need to do this right now.
I've probably glossed over the most sensitive/convoluted parts of the plan! :) let me know where I should already be more specific at the very start of the project.
Does the proposed incremental development plan make sense?
It does!
I will hopefully later follow up with my more immediate plans. I thought I would have an extended schema by now - I have more code, but I still need to sort it out. And I'm still not sure whether trying to import all data fields available makes sense. I suspect not much significant progress *code-wise* may happen until my exams are over, but I am not sure. Hopefully we can focus on design though. (Also: I'm itching to import *all* archival data even to a reduced schema and do some more nasty queries on it.)
Sounds good. Please focus on exams first and ignore GSoC for the time being.
Thanks, Karsten
Hi!
Maybe we should focus on a 'grand unified backend' rather than
splitting Kostas' summer between both a backend and frontend? If he could replace the backends of the majority of our metrics services then that would greatly simplify the metrics ecosystem.
I'm mostly interested in the back-end, too. But I think it won't be as much fun for Kostas if he can't also work on something that's visible to users. I don't know what he prefers though.
Honestly, I would actually be up for focusing, if need be, exclusively on the backend part. It would also probably (hopefully) prove to be the most beneficial to the overall ecosystem of tools. However, such a plan would imply that the final goal (ideally) is to have a replacement for Onionoo, which means that it would have to be reliably stable and scalable, so that multiple frontends could all use it at once. (It will have to be stable in any case, of course.) I think this would be a great goal, but if we can define and isolate development stages to a great extent, I think having two goals: (a) Onionoo replacement; (b) descriptor search+browse frontend - at the same time is OK, and either one of them could be dropped/reduced during the process -
I think I understand, but I'm not sure. Just to get this right, is one of these states the planned end state of your GSoC project?
- descriptor database supporting efficient queries, separate API
similar to Onionoo's, front-end application using new search parameters;
- descriptor database supporting efficient queries, full integration
with Onionoo API, no special front-end application using new search parameters; or
- descriptor database supporting efficient queries, full integration
with Onionoo API, front-end application using Onionoo's new search parameters.
Yes - and thanks for helping to articulate them nicely, by the way - in the sense that *any* of these end states would qualify, from my perspective at least, as a success for this project. As I said, I think it is possible to work on things without fear of redundant effort while also not restricting ourselves to one particular end state of the three until some significantly later point in time. This is because we can first build the efficient database, then implement a subset of the Onionoo-like API (with the possibility of diverging from the Onionoo standard later if a need arises), and finally - optionally/hopefully - work on the client-side frontend application. I'd still like to do the frontend if the rest can be done in a subset of the whole timeline, and I'd perhaps also like to work/tinker on it after the official GSoC timeline; but if it turns out in mid-summer that making an Onionoo replacement is possible (the new backend/database scales well for complex queries and so on, and implementing the whole Onionoo API is realistic/easy), I can simply focus on the backend.
Note that there's no Onionoo client that uses bridge data, yet. We have been planning to add bridge support to Atlas for a while, but this hasn't happened yet.
But in general, bridge data is quite similar to relay data. There are some specifics because of sanitized descriptor parts, but in general, data structures are similar.
Understood. Bridge data / sanitized descriptors do seem similar and should fit in nicely.
I think it's an advantage here that Onionoo itself has a front-end and a
back-end part. The back-end processes data once per hour and writes it to the file system. The front-end is a single Java servlet that does all the filtering and sorting in memory and reads larger JSON files from disk. What we could do is: keep the back-end running, so that it keeps producing details, bandwidth, and weights files, and only replace the servlet by a Python thing that also knows how to respond to more complex search queries.
Yes, this sounds great! Basically, this delegates the bandwidth and weights calculation to what we already have, and lets us focus on queries etc. I will have to look into the actual Onionoo back-end implementation, namely how much of the "produce static JSON files including descriptor data" part can be reused.
In any case, I don't think that having Onionoo(-compatibility, etc.) as an additional set of variables / potential deliverables should pose a problem.
This was a vague/generic reply, but I will eventually follow up with more things.
Kostas.
Here, I think it is realistic to try to import all the fields available from metrics-db-*. My PoC is overly simplistic in this regard: it covers only relay descriptors, and only a limited subset of data fields is used in the schema for the import.
I'm not entirely sure what fields that would include. Two options come to mind...
* Include just the fields that we need. This would require us to update the schema and perform another backfill whenever we need something new. I don't consider this 'frequent backfill' requirement to be a bad thing though - this would force us to make it extremely easy to spin up a new instance which is a very nice attribute to have.
* Make the backend a more-or-less complete data store of descriptor data. This would mean schema updates whenever there's a dir-spec addition [1]. An advantage of this is that the ORM could provide us with stem Descriptor instances [2]. For high traffic applications though we'd probably still want to query the backend directly since we usually won't care about most descriptor attributes.
The idea would be to import all data as DB fields (so, indexable), but it makes sense to also import the raw text lines, to be able to e.g. supply the frontend application with raw data if needed, as the current tools do. But I think this could be a separate table, with descriptor id as primary key, which means it can be added later on if need be without causing a problem. I guess there's no need for this right now.
I like this idea. A couple of advantages that this could provide us are...
* The importer can provide warnings when our present schema is out of sync with stem's Descriptor attributes (i.e. there has been a new dir-spec addition).
* After making the schema update, the importer could then run over this raw data table, constructing Descriptor instances from it and performing updates for any missing attributes.
Cheers! -Damian
[1] https://gitweb.torproject.org/torspec.git/blob/HEAD:/dir-spec.txt [2] This might be a no-go. Stem Descriptor instances are constructed from the raw descriptor content, and need it for str(), get_bytes(), and signature validation. If we don't care about those we can subclass Descriptor and override those methods.
Here, I think it is realistic to try and use and import all the fields
available from metrics-db-*.
My PoC is overly simplistic in this regard: only relay descriptors, and
only a limited subset of data fields is used in the schema, for the import.
I'm not entirely sure what fields that would include. Two options come to mind...
- Include just the fields that we need. This would require us to
update the schema and perform another backfill whenever we need something new. I don't consider this 'frequent backfill' requirement to be a bad thing though - this would force us to make it extremely easy to spin up a new instance which is a very nice attribute to have.
- Make the backend a more-or-less complete data store of descriptor
data. This would mean schema updates whenever there's a dir-spec addition [1]. An advantage of this is that the ORM could provide us with stem Descriptor instances [2]. For high traffic applications though we'd probably still want to query the backend directly since we usually won't care about most descriptor attributes.
In truth, I'm not sure here, either. I agree that it basically boils down to one of the two aforementioned options, and I'm okay with either of them. I'd like, however, to see how well the db import scales if we were to import all relay descriptor fields. There aren't a lot of them (dir-spec [1]) if we don't count extra-info, of course, and only deal with the router descriptor format (section 2.1). So I think I should try working with those fields and see whether the import goes well and quickly enough. I plan to write simple Python timeit-based timing helpers (decorators) that can easily be attached to and detached from functions - that would be a simple and clean way to measure things.
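Something along these lines is what I have in mind - just a rough sketch, and the function being timed is a made-up placeholder:
========================================
import functools
import time

def timed(func):
    # Wrap a function so that each call reports its wall-clock duration.
    # The point is that it's trivial to attach while profiling the importer
    # and just as trivial to remove again.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return func(*args, **kwargs)
        finally:
            print('%s took %.3fs' % (func.__name__, time.time() - start))
    return wrapper

@timed
def import_descriptor_batch(descriptors):
    pass  # placeholder for the actual import logic
========================================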
[...] An advantage of [more-or-less complete data store of descriptor data] is that the ORM could provide us with stem Descriptor instances [2]. For high traffic applications though we'd probably still want to query the backend directly since we usually won't care about most descriptor attributes.
I can try experimenting with this later on (e.g. once we have the full/needed importer working), but it might indeed be difficult to scale (not sure, of course). Do you have any specific use cases in mind? (Genuinely curious - it could be interesting to hear.) Footnote [2] is noted; I'll think about it.
The idea would be import all data as DB fields (so, indexable), but it makes sense to also import raw text lines to be able to e.g. supply the frontend application with raw data if needed, as the current tools do. But I think this could be made to be a separate table, with descriptor id as primary key, which means this can be done later on if need be, would not cause a problem. I guess there's no need to this right now.
I like this idea. A couple advantages that this could provide us are...
- The importer can provide warnings when our present schema is out of
sync with stem's Descriptor attributes (ie. there has been a new dir-spec addition).
- After making the schema update the importer could then run over this
raw data table, constructing Descriptor instances from it and performing updates for any missing attributes.
The 'schema/format mismatch report' sounds like a really good idea! Certainly if we aim for Onionoo compatibility / eventual replacement, but in any case this seems like a very useful thing to have going forward. I will keep it in mind for the upcoming database importer rewrite.
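To make it a bit more concrete, a rough sketch of what such a check could look like - the function and its arguments are assumptions about how the importer ends up being wired together, nothing final:
========================================
def find_schema_drift(descriptor, table):
    # descriptor: a parsed stem server descriptor instance
    # table: a SQLAlchemy Table whose columns are meant to mirror the
    #        descriptor attributes (both arguments are assumptions)
    descriptor_attrs = set(name for name in vars(descriptor)
                           if not name.startswith('_'))
    column_names = set(column.name for column in table.columns)

    missing_in_db = descriptor_attrs - column_names
    unused_in_db = column_names - descriptor_attrs

    if missing_in_db:
        print('WARNING: descriptor attributes without a matching column: %s'
              % ', '.join(sorted(missing_in_db)))

    return missing_in_db, unused_in_db
========================================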
- After making the schema update the importer could then run over this
raw data table, constructing Descriptor instances from it and performing updates for any missing attributes.
I can't say I can easily see the specifics of how all this would work, but if we had an always-up-to-date data model (mediated by stem's RelayDescriptor class, though not necessarily), this could work. (The ORM <-> stem Descriptor object mapping itself is trivial, so all is well in that regard.)
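For illustration, here is roughly how I picture that mapping, assuming SQLAlchemy and a deliberately reduced set of columns (none of the names are final):
========================================
from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class ServerDescriptor(Base):
    # A deliberately reduced set of columns, just to show the mapping;
    # the real schema would cover the dir-spec 2.1 field list.
    __tablename__ = 'server_descriptors'

    descriptor_id = Column(String, primary_key=True)  # descriptor digest
    nickname = Column(String)
    fingerprint = Column(String)
    address = Column(String)
    or_port = Column(Integer)
    published = Column(DateTime)

    @classmethod
    def from_stem(cls, desc):
        # desc: a stem.descriptor.server_descriptor.RelayDescriptor
        return cls(descriptor_id=desc.digest(),
                   nickname=desc.nickname,
                   fingerprint=desc.fingerprint,
                   address=desc.address,
                   or_port=desc.or_port,
                   published=desc.published)
========================================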
Ah, forgot to add my footnote to the dirspec - we all know the link, but in any case:
[1]: https://gitweb.torproject.org/torspec.git/blob/HEAD:/dir-spec.txt
This was in the context of discussing which fields from section 2.1 (the router descriptor format) to include.
I can try experimenting with this later on (when we have the full / needed importer working, e.g.), but it might be difficult to scale indeed (not sure, of course). Do you have any specific use cases in mind? (actually curious, could be interesting to hear.)
The advantage of being able to reconstruct Descriptor instances is simpler usage (and hence more maintainable code). I.e., usage could be as simple as...
========================================
from tor.metrics import descriptor_db

# Fetches all of the server descriptors for a given date. These are provided as
# instances of...
#
# stem.descriptor.server_descriptor.RelayDescriptor
for desc in descriptor_db.get_server_descriptors(2013, 1, 1):
  # print the addresses of only the exits
  if desc.exit_policy.is_exiting_allowed():
    print desc.address
========================================
Obviously we'd still want to do raw SQL queries for high traffic applications. However, for applications where maintainability trumps speed this could be a nice feature to have.
- After making the schema update the importer could then run over this
raw data table, constructing Descriptor instances from it and performing updates for any missing attributes.
I can't say I can easily see the specifics of how all this would work, but if we had an always-up-to-date data model (mediated by Stem Relay Descriptor class, but not necessarily), this might work.. (The ORM <-> Stem Descriptor object mapping itself is trivial, so all is well in that regard.)
I'm not sure if I entirely follow. As I understand it the importer...
* Reads raw rsynced descriptor data.
* Uses it to construct stem Descriptor instances.
* Persists those to the database.
My suggestion is that for the first step it could read the rsynced descriptors *or* the raw descriptor content from the database itself. This means that the importer could be used to not only populate new descriptors, but also back-fill after a schema update.
That is to say, adding a new column would simply be...
* Perform the schema update.
* Run the importer, which...
  * Reads raw descriptor data from the database.
  * Uses it to construct stem Descriptor instances.
  * Performs an UPDATE for anything that's out of sync or missing from the database.
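In rough terms the back-fill pass could look something like this - the raw_descriptor table, the column handling, and the psycopg2-style connection are all just assumptions for the sake of the example:
========================================
from stem.descriptor.server_descriptor import RelayDescriptor

def backfill_column(conn, column_name):
    # conn: a DB-API connection (e.g. psycopg2); table and column names here
    # are made up. Re-parses raw descriptor text we stored earlier and fills
    # in a newly added column on the indexed descriptor table.
    read_cur = conn.cursor()
    write_cur = conn.cursor()

    read_cur.execute('SELECT descriptor_id, raw_text FROM raw_descriptor')

    for descriptor_id, raw_text in read_cur:
        desc = RelayDescriptor(raw_text)

        # column_name comes from our own schema, so interpolating it into the
        # statement is fine; values still go through parameter binding.
        write_cur.execute(
            'UPDATE server_descriptors SET ' + column_name + ' = %s '
            'WHERE descriptor_id = %s',
            (getattr(desc, column_name), descriptor_id))

    conn.commit()
========================================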
Cheers! -Damian
Hi,
I forgot to reply to this email earlier.
On Tue, Jun 11, 2013 at 6:02 PM, Damian Johnson atagar@torproject.org wrote:
I can try experimenting with this later on (when we have the full /
needed
importer working, e.g.), but it might be difficult to scale indeed (not sure, of course). Do you have any specific use cases in mind? (actually curious, could be interesting to hear.)
The advantages of being able to reconstruct Descriptor instances is simpler usage (and hence more maintainable code).
[...]
Obviously we'd still want to do raw SQL queries for high traffic applications. However, for applications where maintainability trumps speed this could be a nice feature to have.
Oh, very nice - this would indeed be great, and this kind of usage would, I suppose, reinforce the new tool's role as simplifying 'glue' that folds multiple tools/applications into one. In any case, since the model for a descriptor can be mapped to/from stem's Descriptor instances, this should be possible. Raw(er) SQL queries would still be used for the backend's internal needs - yes, this makes sense.
- After making the schema update the importer could then run over this
raw data table, constructing Descriptor instances from it and performing updates for any missing attributes.
I can't say I can easily see the specifics of how all this would work,
but
if we had an always-up-to-date data model (mediated by Stem Relay
Descriptor
class, but not necessarily), this might work.. (The ORM <-> Stem
Descriptor
object mapping itself is trivial, so all is well in that regard.)
I'm not sure if I entirely follow. As I understand it the importer...
- Reads raw rsynced descriptor data.
- Uses it to construct stem Descriptor instances.
- Persists those to the database.
My suggestion is that for the first step it could read the rsynced descriptors *or* the raw descriptor content from the database itself. This means that the importer could be used to not only populate new descriptors, but also back-fill after a schema update.
That is to say, adding a new column would simply be...
- Perform the schema update.
- Run the importer, which...
- Reads raw descriptor data from the database.
- Uses it to construct stem Descriptor instances.
- Performs an UPDATE for anything that's out of sync or missing from
the database.
Aha, got it - this would probably be a brilliant way to do it. :) That is,
My suggestion is that for the first step it could read the rsynced descriptors *or* the raw descriptor content from the database itself. This means that the importer could be used to not only populate new descriptors, but also back-fill after a schema update.
is definitely possible, and doing UPDATEs could indeed be automated that way. OK, so since I'm writing the new incarnation of the database importer now, it's definitely possible to put each descriptor's raw contents/text into a separate, non-indexed field. It would then simply be a matter of satisfying disk space constraints, and nothing more. There could/should be a way of switching this raw import option off, IMO.
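For example (the table, flag, and helper names below are placeholders, assuming SQLAlchemy again):
========================================
from sqlalchemy import Column, String, Text
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

# Made-up switch; it could just as well be a CLI option on the importer.
STORE_RAW_DESCRIPTORS = True

class RawDescriptor(Base):
    # Raw descriptor text kept separate from the indexed descriptor fields,
    # so it only costs disk space, not query performance.
    __tablename__ = 'raw_descriptor'

    descriptor_id = Column(String, primary_key=True)
    raw_text = Column(Text)

def maybe_store_raw(session, descriptor_id, raw_text):
    # session: a SQLAlchemy session; no-op when the raw import option is off.
    if STORE_RAW_DESCRIPTORS:
        session.add(RawDescriptor(descriptor_id=descriptor_id,
                                  raw_text=raw_text))
========================================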
Kostas.
As a separate item: I will try to write up a coherent design of what we currently have in mind, since the discussion has been spread across multiple places and some span of time. That way we can see what we have in one place and discuss the parts of the system that are still unclear, etc.