Hi,<br><br>I forgot to reply to this email earlier.<br><br>On Tue, Jun 11, 2013 at 6:02 PM, Damian Johnson <span dir="ltr"><<a href="mailto:atagar@torproject.org" target="_blank">atagar@torproject.org</a>></span> wrote:<br>


<blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote"><div class="im">> I can try experimenting with this later on (when we have the full / needed<br>


> importer working, e.g.), but it might be difficult to scale indeed (not<br>

> sure, of course). Do you have any specific use cases in mind? (actually<br>

> curious, could be interesting to hear.)<br>

<br>

</div>The advantage of being able to reconstruct Descriptor instances is<br>

simpler usage (and hence more maintainable code). <br><br></blockquote><div><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote">[...] <br><br>Obviously we'd still want to do raw SQL queries for high traffic<br>


applications. However, for applications where maintainability trumps<br>

speed this could be a nice feature to have.<br></blockquote><div><br>Oh, very nice, this would indeed be great. I suppose this kind of usage would also help the new tool act as a simplifying 'glue' that folds multiple tools/applications into one. In any case, since the model for a descriptor can be mapped to and from Stem's Descriptor instance, this should be possible. And yes, it makes sense that the backend would still use (more) raw SQL queries internally.<br>
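<br>To illustrate the mapping I mean, here is a minimal sketch. All names in it are hypothetical: DescriptorRow stands in for the ORM model, FakeDescriptor stands in for Stem's Descriptor class (it is not Stem's API), and the three attributes are just an example set.<br>

```python
from collections import namedtuple

# Hypothetical ORM-side model: one row of a descriptor table.
DescriptorRow = namedtuple("DescriptorRow", ["nickname", "address", "or_port"])


class FakeDescriptor(object):
    """Stand-in for a Stem Descriptor instance (assumption, not Stem's class)."""

    def __init__(self, nickname, address, or_port):
        self.nickname = nickname
        self.address = address
        self.or_port = or_port


def to_row(desc):
    """Descriptor -> ORM row: copy over the attributes the model knows about."""
    return DescriptorRow(*(getattr(desc, name) for name in DescriptorRow._fields))


def from_row(row):
    """ORM row -> Descriptor: rebuild the object from the stored columns."""
    return FakeDescriptor(**row._asdict())


desc = FakeDescriptor("example", "192.0.2.1", 9001)
row = to_row(desc)
restored = from_row(row)
```

The round trip only preserves the columns the model declares, so anything else on the descriptor would have to be rebuilt from its raw text.<br>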


</div> <br><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote">>> * After making the schema update the importer could then run over this<br>

>> raw data table, constructing Descriptor instances from it and<br>

>> performing updates for any missing attributes.<br>

><br>

> I can't say I can easily see the specifics of how all this would work, but<br>

> if we had an always-up-to-date data model (mediated by Stem Relay Descriptor<br>

> class, but not necessarily), this might work.. (The ORM <-> Stem Descriptor<br>

> object mapping itself is trivial, so all is well in that regard.)<br>

<br>

I'm not sure if I entirely follow. As I understand it the importer...<br>

<br>

* Reads raw rsynced descriptor data.<br>

* Uses it to construct stem Descriptor instances.<br>

* Persists those to the database.<br>

<br>

My suggestion is that for the first step it could read the rsynced<br>

descriptors *or* the raw descriptor content from the database itself.<br>

This means that the importer could be used to not only populate new<br>

descriptors, but also back-fill after a schema update.<br>

<br>

That is to say, adding a new column would simply be...<br>

<br>

* Perform the schema update.<br>

* Run the importer, which...<br>

  * Reads raw descriptor data from the database.<br>

  * Uses it to construct stem Descriptor instances.<br>

  * Performs an UPDATE for anything that's out of sync or missing from<br>

the database.<br></blockquote><div><br>Aha, got it - this would probably be a brilliant way to do it. :) That is,<br><br>> My suggestion is that for the first step it could read the rsynced<br>> descriptors *or* the raw descriptor content from the database itself.<br>


> This means that the importer could be used to not only populate new<br>> descriptors, but also back-fill after a schema update. <br><br>is definitely possible, and the UPDATEs could indeed be automated that way. Since I'm writing the new incarnation of the database importer now, it's straightforward to put each descriptor's raw contents/text into a separate, non-indexed field. The only real cost would then be disk space. IMO there should also be a way to switch this raw import option off.<br>
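<br>A rough sketch of how I picture this, for what it's worth. Everything in it is an assumption made up for illustration: sqlite3 stands in for the real database, the descriptor table and its columns are hypothetical, and parse_descriptor() is a toy stand-in for constructing a Stem Descriptor instance from the raw text.<br>

```python
import sqlite3


def parse_descriptor(raw):
    """Toy stand-in for building a Stem Descriptor from raw text (assumption).

    Only reads the 'router' and 'platform' lines of a server descriptor.
    """
    fields = {}
    for line in raw.splitlines():
        keyword, _, rest = line.partition(" ")
        if keyword == "router":
            parts = rest.split()
            fields["nickname"], fields["address"] = parts[0], parts[1]
        elif keyword == "platform":
            fields["platform"] = rest
    return fields


def import_descriptor(conn, raw, store_raw=True):
    """Persist one descriptor; store_raw=False switches the raw copy off."""
    desc = parse_descriptor(raw)
    conn.execute(
        "INSERT INTO descriptor (nickname, address, raw_contents) VALUES (?, ?, ?)",
        (desc["nickname"], desc["address"], raw if store_raw else None),
    )


def backfill(conn, column):
    """After a schema update, fill `column` from the stored raw contents."""
    # `column` is interpolated into the SQL, so it must come from trusted code.
    rows = conn.execute(
        "SELECT id, raw_contents FROM descriptor "
        "WHERE raw_contents IS NOT NULL AND {0} IS NULL".format(column)
    ).fetchall()
    for row_id, raw in rows:
        value = parse_descriptor(raw).get(column)
        conn.execute(
            "UPDATE descriptor SET {0} = ? WHERE id = ?".format(column),
            (value, row_id),
        )
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE descriptor ("
    "id INTEGER PRIMARY KEY, nickname TEXT, address TEXT, raw_contents TEXT)"
)
raw = "router example 192.0.2.1 9001 0 0\nplatform Tor 0.2.3.25 on Linux"
import_descriptor(conn, raw)

# Schema update: add a new column, then back-fill it from the raw contents.
conn.execute("ALTER TABLE descriptor ADD COLUMN platform TEXT")
updated = backfill(conn, "platform")
```

Rows imported with store_raw=False have nothing to back-fill from, so for those the rsynced archives would still be needed after a schema update.<br>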


<br>Kostas.<br></div><br></div><div class="gmail_quote">On Tue, Jun 11, 2013 at 6:02 PM, Damian Johnson <span dir="ltr"><<a href="mailto:atagar@torproject.org" target="_blank">atagar@torproject.org</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">> I can try experimenting with this later on (when we have the full / needed<br>

> importer working, e.g.), but it might be difficult to scale indeed (not<br>

> sure, of course). Do you have any specific use cases in mind? (actually<br>

> curious, could be interesting to hear.)<br>

<br>

</div>The advantage of being able to reconstruct Descriptor instances is<br>

simpler usage (and hence more maintainable code). I.e., usage could be<br>

as simple as...<br>

<br>

========================================<br>

<br>

from tor.metrics import descriptor_db<br>

<br>

# Fetches all of the server descriptors for a given date. These are provided as<br>

# instances of...<br>

#<br>

#   stem.descriptor.server_descriptor.RelayDescriptor<br>

<br>

for desc in descriptor_db.get_server_descriptors(2013, 1, 1):<br>

  # print the addresses of only the exits<br>

<br>

  if desc.exit_policy.is_exiting_allowed():<br>

    print(desc.address)<br>

<br>

========================================<br>

<br>

Obviously we'd still want to do raw SQL queries for high traffic<br>

applications. However, for applications where maintainability trumps<br>

speed this could be a nice feature to have.<br>

<div class="im"><br>

>> * After making the schema update the importer could then run over this<br>

>> raw data table, constructing Descriptor instances from it and<br>

>> performing updates for any missing attributes.<br>

><br>

> I can't say I can easily see the specifics of how all this would work, but<br>

> if we had an always-up-to-date data model (mediated by Stem Relay Descriptor<br>

> class, but not necessarily), this might work.. (The ORM <-> Stem Descriptor<br>

> object mapping itself is trivial, so all is well in that regard.)<br>

<br>

</div>I'm not sure if I entirely follow. As I understand it the importer...<br>

<br>

* Reads raw rsynced descriptor data.<br>

* Uses it to construct stem Descriptor instances.<br>

* Persists those to the database.<br>

<br>

My suggestion is that for the first step it could read the rsynced<br>

descriptors *or* the raw descriptor content from the database itself.<br>

This means that the importer could be used to not only populate new<br>

descriptors, but also back-fill after a schema update.<br>

<br>

That is to say, adding a new column would simply be...<br>

<br>

* Perform the schema update.<br>

* Run the importer, which...<br>

  * Reads raw descriptor data from the database.<br>

  * Uses it to construct stem Descriptor instances.<br>

  * Performs an UPDATE for anything that's out of sync or missing from<br>

the database.<br>

<br>

Cheers! -Damian<br>

</blockquote></div><br>