[metrics-team] Monthly churn values per relay flag

Karsten Loesing karsten at torproject.org
Mon Feb 15 20:34:58 UTC 2016

On 15/02/16 03:22, Philipp Winter wrote:
> On Thu, Feb 04, 2016 at 05:07:01PM +0100, Karsten Loesing wrote:
>> - Regarding the mentioned outliers and suspected consensus
>> hiccups, I wonder if those are caused by comparing non-adjacent
>> consensuses.  A random example could be the decline from
>> 2007-11-10 to 2007-11-13.
> I already account for that and don't calculate churn values when 
> consensuses are missing.

Okay, great!

>> Or is that an artifact of connecting two valid comparisons of
>> two adjacent consensuses each with numerous missing values in
>> between?  In that case, would it make sense to add NAs to the
>> graph data before plotting it, so that the line would end on
>> 2007-11-10 and another line would start on 2007-11-13?
> What do you mean by "numerous missing values in between?"

Looking at your .csv file, there's one line for 2007-11-08T23:00:00Z
and the next line for 2007-11-12T21:00:00Z and nothing in between,
yet you're connecting those two data points with a line in your
graph.  This gives the impression that there are actual data points
in between.
That's why I would cut the line after that first data point and
restart drawing the line at the second data point.  In R/ggplot2 you'd
do that by adding NA values for each hour where you don't have an
actual data point to plot.  Hope this makes sense...!
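
A minimal sketch of that gap-filling step, written in Python rather than
R and assuming hourly data points.  The function name and the churn
values are made up for illustration; only the two timestamps come from
the .csv file.

```python
from datetime import datetime, timedelta


def fill_hourly_gaps(points, step=timedelta(hours=1)):
    """Insert (timestamp, None) for every missing step between data points.

    `points` is a list of (datetime, value) tuples sorted by time.
    """
    filled = []
    for i, (when, value) in enumerate(points):
        if i > 0:
            missing = points[i - 1][0] + step
            while missing < when:
                filled.append((missing, None))  # None plays the role of NA
                missing += step
        filled.append((when, value))
    return filled


# The two timestamps are from the .csv file; the churn values are made up.
points = [
    (datetime(2007, 11, 8, 23, 0), 0.01),
    (datetime(2007, 11, 12, 21, 0), 0.02),
]
filled = fill_hourly_gaps(points)
```

With the placeholder rows in place, the plotted line can stop at the
first data point and restart at the second instead of bridging the gap.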

>> - I'm still unsure how churn rates are defined exactly, and I know
>> I brought this up in previous discussions, Philipp, I just don't 
>> remember the latest definition.  If I compare two adjacent
>> consensuses C0 and C1, how are the numbers in NewRunning and
>> GoneRunning calculated?  If I were to define them, I think I'd
>> say that NewRunning is the number of relays in C1 that were not
>> listed in C0, divided by the total number of relays in C1, and
>> GoneRunning is the number of relays in C0 that are not listed
>> anymore in C1, divided by the total number of relays in C0.  Note
>> the different denominators.  But I think if we use the same
>> denominator, we can't guarantee that both values are in [0, 1].
>> Is this also the definition you used?
> Yes, that's how I'm doing it now (after changing the definition,
> thanks to your suggestions.)
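
Spelled out as a minimal Python sketch, with relays represented as sets
of fingerprints.  The function name, the example fingerprints, and the
choice to return 0.0 for an empty consensus are assumptions made for
illustration, not taken from sybilhunter.

```python
def churn(c0, c1):
    """Return (new, gone) for two adjacent consensuses C0 and C1.

    new:  relays in C1 that were not listed in C0, divided by |C1|
    gone: relays in C0 that are no longer listed in C1, divided by |C0|
    """
    # Assumption: an empty consensus yields 0.0 instead of dividing by zero.
    new = len(c1 - c0) / len(c1) if c1 else 0.0
    gone = len(c0 - c1) / len(c0) if c0 else 0.0
    return new, gone


# Hypothetical fingerprints, shortened to single letters.
c0 = {"A", "B", "C", "D"}
c1 = {"C", "D", "E"}
new_running, gone_running = churn(c0, c1)  # 1/3 new, 2/4 gone

# With a shared denominator the value can leave [0, 1]:
# three new relays measured against a one-relay C0 give 3 / 1 = 3.
shared_denominator = len({"B", "C", "D"} - {"A"}) / len({"A"})
```

Because each numerator is a subset of the set used as its denominator,
both values stay in [0, 1].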


>> - Would it make sense to not only include one Date in your .csv
>> file (I assume this is C1 in my definition above?), but the
>> valid-after times of C0 and C1 that you're comparing, like
>> DatePrevious and DateCurrent?
> Yes, it's always C1.  I could include it if you think it's helpful,
> but it's redundant because I only compare two adjacent consensuses,
> so C0's valid-after is always C1's valid-after minus one hour.

Well, if you're always comparing consensuses published one hour after
the other, then it's redundant.  But if Tor is ever going to switch to
publishing a new consensus twice per hour (to make use of new relays
faster), or if you want to add different comparisons (see below), then
it would be useful to have that second column.  Again, entirely up to
you to ignore this suggestion.
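
A rough Python sketch of what rows with both valid-after times could
look like, for any comparison interval.  The function name, the input
layout (a dict from valid-after time to a set of relay fingerprints),
and the example data are assumptions for illustration only.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def compare_at_interval(consensuses, interval):
    """Compare each consensus with the one published `interval` earlier.

    `consensuses` maps valid-after datetimes to sets of fingerprints.
    Comparisons are skipped when the earlier consensus is missing.
    """
    rows = []
    for current in sorted(consensuses):
        previous = current - interval
        if previous not in consensuses:
            continue  # no consensus exactly one interval earlier
        c0, c1 = consensuses[previous], consensuses[current]
        if not c0 or not c1:
            continue  # assumption: skip empty consensuses
        rows.append({
            "DatePrevious": previous.strftime(TIME_FORMAT),
            "DateCurrent": current.strftime(TIME_FORMAT),
            "New": len(c1 - c0) / len(c1),
            "Gone": len(c0 - c1) / len(c0),
        })
    return rows


# Hypothetical data: the consensus at 02:00 is missing.
start = datetime(2016, 2, 1, 0, 0)
consensuses = {
    start: {"A", "B"},
    start + timedelta(hours=1): {"B", "C"},
    start + timedelta(hours=3): {"C", "D"},
}
rows = compare_at_interval(consensuses, timedelta(hours=1))
```

The same function would cover a half-hourly consensus or a one-week
comparison just by passing a different `interval`.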

>> - Your .csv file uses a "wide" table format with lots of columns
>> for the different flags.  My experience is that this table format
>> has disadvantages, because you need to know which columns exist
>> and update any code using the .csv file when you add new columns.
>> I find the "long" table format to be more flexible.  In that
>> format you'd add columns for flags that just contain a boolean or
>> null/NA/empty string and one line per combination of flags.
>> Here's an example of what the first lines could look like in the
>> "long" table format:
>> Date,Authority,BadExit,Exit,Fast,Guard,HSDir,Named,Running,Stable,Unnamed,V2Dir,Valid,New,Gone
>> 2007-10-27T13:00:00Z,NA,NA,T,NA,NA,NA,NA,NA,NA,NA,NA,NA,0.09551,0.02669
> That's new to me, thanks for the tip.  I opened a feature report
> for this: <https://github.com/NullHypothesis/sybilhunter/issues/6>
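
A small Python sketch of converting one "wide" row into "long" rows of
the shape shown above.  The wide column names (New<Flag>, Gone<Flag>)
are an assumption based on the NewRunning and GoneRunning columns
mentioned earlier, and the function name is made up.

```python
FLAGS = ["Authority", "BadExit", "Exit", "Fast", "Guard", "HSDir",
         "Named", "Running", "Stable", "Unnamed", "V2Dir", "Valid"]
LONG_COLUMNS = ["Date"] + FLAGS + ["New", "Gone"]


def wide_to_long(wide_row):
    """Turn one wide row (a dict) into one long row per flag it covers."""
    rows = []
    for flag in FLAGS:
        new_key, gone_key = "New" + flag, "Gone" + flag  # assumed names
        if new_key not in wide_row or gone_key not in wide_row:
            continue
        row = {"Date": wide_row["Date"]}
        for other in FLAGS:
            # "T" marks the flag this row is about, "NA" means "any".
            row[other] = "T" if other == flag else "NA"
        row["New"] = wide_row[new_key]
        row["Gone"] = wide_row[gone_key]
        rows.append(row)
    return rows


# Values taken from the example line above, attributed to the Exit flag.
wide = {"Date": "2007-10-27T13:00:00Z",
        "NewExit": "0.09551", "GoneExit": "0.02669"}
long_rows = wide_to_long(wide)
line = ",".join(long_rows[0][column] for column in LONG_COLUMNS)
```

Adding a new flag then only means adding a name to FLAGS; code that
reads the long file can filter on the flag columns it cares about.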


>> - I agree with Nusenu that absolute relay numbers might be 
>> interesting, too.  Using the "long" table format they are
>> relatively easy to add as new columns NewAbs and GoneAbs.  Whether
>> you'd graph them or not is another question.
> Yes, that's already implemented: 
> <https://nymity.ch/sybilhunting/churn-values/churn-all.csv>


>> - Another potentially interesting metric would be the fraction of 
>> consensus weight joining or leaving the network, like NewCW and 
>> GoneCW.  Not sure how useful that would be for joining relays,
>> because new relays typically have very small consensus weights,
>> but it might be interesting to see when a large part of the
>> network by consensus weight leaves.
> That's a good idea.  I opened a feature report for it: 
> <https://github.com/NullHypothesis/sybilhunter/issues/4>

Curious to see the result. :)
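
One possible reading of NewCW and GoneCW, as a minimal Python sketch.
The denominators (total weight of C1 for NewCW, total weight of C0 for
GoneCW) are an assumption by analogy with the relay-count definition
above; the function name and the example weights are made up.

```python
def weight_churn(c0, c1):
    """Return (new_cw, gone_cw) for two consensuses.

    `c0` and `c1` map relay fingerprints to consensus weights.
    """
    total0 = sum(c0.values())
    total1 = sum(c1.values())
    # Weight of relays that appear only in C1 (joined) or only in C0 (left).
    joined = sum(w for fp, w in c1.items() if fp not in c0)
    left = sum(w for fp, w in c0.items() if fp not in c1)
    # Assumption: a consensus with zero total weight yields 0.0.
    new_cw = joined / total1 if total1 else 0.0
    gone_cw = left / total0 if total0 else 0.0
    return new_cw, gone_cw


# Hypothetical weights: a small relay joins, a larger one leaves.
c0 = {"A": 100, "B": 300}
c1 = {"B": 300, "C": 20}
new_cw, gone_cw = weight_churn(c0, c1)  # 20/320 joined, 100/400 left
```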

>> - Did you consider comparing consensuses with more than 1 hour
>> in between them, for example 1 day, 1 week, or 1 month?  That
>> would remove daily/weekly/monthly patterns and might make it
>> easier to observe changes.  It would also reduce the data
>> resolution in the graph, allowing you to plot more than just a
>> month.  I could imagine that a graph from 2007 to 2016 would be
>> much more useful with a data resolution of 1 week or 1 month.
>> Note how a data format with DatePrevious and DateCurrent would
>> allow you to add that data to your .csv file.  Maybe add another
>> column Interval that you set to "1 month" to make plotting
>> easier.
> Also a good idea.  I added another feature report: 
> <https://github.com/NullHypothesis/sybilhunter/issues/5>

Also great.  And sorry for not contributing code but only ideas that
produce more work for you.  But please let me know when I can help
more, for example by merging this code into Metrics.

> Cheers, Philipp

