[metrics-team] Monthly churn values per relay flag

Mon Feb 15 02:22:50 UTC 2016

On Thu, Feb 04, 2016 at 05:07:01PM +0100, Karsten Loesing wrote:
>  - Regarding the mentioned outliers and suspected consensus hiccups, I
> wonder if those are caused by comparing non-adjacent consensuses.  A
> random example could be the decline between 2007-11-10 to 2007-11-13.

I already account for that and don't calculate churn values when
consensuses are missing.
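
To make the check concrete, here is a minimal sketch of such an adjacency filter. It is not sybilhunter's actual code; the function name and the list-of-datetimes input are assumptions made for illustration.

```python
from datetime import datetime, timedelta

HOUR = timedelta(hours=1)

def adjacent_pairs(valid_afters):
    """Return (C0, C1) valid-after pairs that are exactly one hour apart.

    Hypothetical helper: any pair separated by a gap (one or more
    missing consensuses) is dropped, so no churn value is computed for it.
    """
    ordered = sorted(valid_afters)
    return [(prev, cur)
            for prev, cur in zip(ordered, ordered[1:])
            if cur - prev == HOUR]
```

With this filter, a gap of several days between two archived consensuses produces no data point at all instead of one comparison across the gap.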

> Or is that an artifact of connecting two valid comparisons of two
> adjacent consensuses each with numerous missing values in between?  In
> that case, would it make sense to add NAs to the graph data before
> plotting it, so that the line would end on 2007-11-10 and another line
> would start on 2007-11-13?

What do you mean by "numerous missing values in between"?

>  - I'm yet unsure how churn rates are defined exactly, and I know I
> brought this up in previous discussions, Philipp, I just don't
> remember the latest definition.  If I compare two adjacent consensuses
> C0 and C1, how are the numbers in NewRunning and GoneRunning
> calculated?  If I were to define them, I think I'd say that NewRunning
> is the number of relays in C1 that were not listed in C0, divided by
> the total number of relays in C1, and GoneRunning is the number of
> relays in C0 that are not listed anymore in C1, divided by the total
> number of relays in C0.  Note the different denominators.  But I think
> if we use the same denominator, we can't guarantee that both values
> are in [0, 1].  Is this also the definition you used?

Yes, that's how I'm doing it now (after changing the definition, thanks
to your suggestions).
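
Spelled out as a minimal sketch, under the assumption that each consensus is reduced to the set of relay fingerprints carrying a given flag (the function name is illustrative, not taken from sybilhunter):

```python
def churn(c0, c1):
    """Return (new, gone) churn fractions for adjacent consensuses C0, C1.

    c0, c1: sets of relay fingerprints (an assumed representation).
    new  = relays in C1 that were not in C0, divided by |C1|
    gone = relays in C0 that are no longer in C1, divided by |C0|
    The different denominators keep both values in [0, 1].
    """
    new = len(c1 - c0) / len(c1) if c1 else 0.0
    gone = len(c0 - c1) / len(c0) if c0 else 0.0
    return new, gone
```

For example, with C0 = {A, B, C, D} and C1 = {B, C, D, E, F}, two of the five relays in C1 are new (0.4) and one of the four relays in C0 is gone (0.25).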

>  - Would it make sense to not only include one Date in your .csv file
> (I assume this is C1 in my definition above?), but the valid-after
> times of C0 and C1 that you're comparing, like DatePrevious and
> DateCurrent?

Yes, it's always C1.  I could include it if you think it's helpful, but
it's redundant because I only compare two adjacent consensuses, so C0's
valid-after is always C1's valid-after minus one hour.

>  - Your .csv file uses a "wide" table format with lots of columns for
> the different flags.  My experience is that this table format has
> disadvantages, because you need to know which columns exist and update
> any code using the .csv file when you add new columns.  I find the
> "long" table format to be more flexible.  In that format you'd add
> columns for flags that just contain a boolean or null/NA/empty string
> and one line per combination of flags.  Here's an example of how the
> first lines could look like in the "long" table format:
> 
> Date,Authority,BadExit,Exit,Fast,Guard,HSDir,Named,Running,Stable,Unnamed,V2Dir,Valid,New,Gone
> 2007-10-27T13:00:00Z,T,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,0.00000,0.00000
> 2007-10-27T13:00:00Z,NA,NA,T,NA,NA,NA,NA,NA,NA,NA,NA,NA,0.09551,0.02669

That's new to me, thanks for the tip.  I opened a feature report for
this:
<https://github.com/NullHypothesis/sybilhunter/issues/6>
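
A rough sketch of what such a conversion could look like, assuming the wide file names its columns "New" + flag and "Gone" + flag (e.g. NewRunning, GoneRunning); that naming scheme and the helper name are assumptions, not a description of the current .csv file:

```python
FLAGS = ["Authority", "BadExit", "Exit", "Fast", "Guard", "HSDir",
         "Named", "Running", "Stable", "Unnamed", "V2Dir", "Valid"]

def wide_to_long(row):
    """Turn one wide row (a dict of strings) into one long row per flag.

    Assumes wide columns named New<Flag> and Gone<Flag>; flags without
    both columns in the row are skipped.
    """
    long_rows = []
    for flag in FLAGS:
        new_key, gone_key = "New" + flag, "Gone" + flag
        if new_key not in row or gone_key not in row:
            continue
        out = {"Date": row["Date"]}
        # Mark the flag this row describes with T, all others with NA.
        out.update({f: ("T" if f == flag else "NA") for f in FLAGS})
        out["New"] = row[new_key]
        out["Gone"] = row[gone_key]
        long_rows.append(out)
    return long_rows
```

Joining the values of each resulting dict with commas gives lines in the layout of the example quoted above. Rows for combinations of flags would need more than one T per line, which this sketch does not cover.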

>  - I agree with Nusenu that absolute relay numbers might be
> interesting, too.  Using the "long" table format they are relatively
> easy to add as new column NewAbs and GoneAbs.  Whether you'd graph
> them or not is another question.

Yes, that's already implemented:
<https://nymity.ch/sybilhunting/churn-values/churn-all.csv>

>  - Another potentially interesting metric would be fraction of
> consensus weight joining or leaving the network, like NewCW and
> GoneCW.  Not sure how useful that would be for joining relays, because
> new relays typically have very small consensus weights, but it might
> be interesting to see when a large part of the network by consensus
> weight leaves.

That's a good idea.  I opened a feature report for it:
<https://github.com/NullHypothesis/sybilhunter/issues/4>
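
One possible definition, mirroring the relay-count definition above but summing consensus weights instead of counting relays. This is only a sketch of how NewCW and GoneCW might be computed; the mapping from fingerprint to weight and the function name are assumptions:

```python
def cw_churn(c0, c1):
    """Return (NewCW, GoneCW) for adjacent consensuses C0 and C1.

    c0, c1: dicts mapping relay fingerprint -> consensus weight
    (an assumed representation).
    NewCW  = weight in C1 of relays not listed in C0 / total weight of C1
    GoneCW = weight in C0 of relays not listed in C1 / total weight of C0
    """
    total0, total1 = sum(c0.values()), sum(c1.values())
    new_cw = sum(w for fp, w in c1.items() if fp not in c0)
    gone_cw = sum(w for fp, w in c0.items() if fp not in c1)
    return (new_cw / total1 if total1 else 0.0,
            gone_cw / total0 if total0 else 0.0)
```

A single heavy relay leaving shows up clearly here: if C0 = {A: 600, B: 400} and C1 = {B: 400, C: 100}, GoneCW is 0.6 while NewCW is only 0.2, even though one relay joined and one left.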

>  - Did you consider comparing consensuses with more than 1 hours in
> between them, for example 1 day, 1 week, or 1 month?  That would
> remove daily/weekly/monthly patterns and might make it easier to
> observe changes.  It would also reduce the data resolution in the
> graph, allowing you to plot more than just a month.  I could imagine
> that a graph from 2007 to 2016 would be much more useful with a data
> resolution of 1 week or 1 month.  Note how a data format with
> DatePrevious and DateCurrent would allow you to add that data to your
> .csv file.  Maybe add another column Interval that you set to "1
> month" to make plotting easier.

Also a good idea.  I added another feature report:
<https://github.com/NullHypothesis/sybilhunter/issues/5>
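
As a sketch of how the interval comparison could produce rows with DatePrevious, DateCurrent and Interval columns: the dict of valid-after time to fingerprint set and the function name are assumptions for illustration, and the interval has to be a fixed timedelta.

```python
from datetime import datetime, timedelta

def churn_rows(consensuses, interval, label):
    """Compare each consensus with the one exactly `interval` earlier.

    consensuses: dict mapping valid-after datetime -> set of relay
    fingerprints (an assumed representation). Consensuses without a
    counterpart `interval` earlier are skipped, as are empty ones.
    """
    rows = []
    for cur in sorted(consensuses):
        prev = cur - interval
        if prev not in consensuses:
            continue
        c0, c1 = consensuses[prev], consensuses[cur]
        if not c0 or not c1:
            continue
        rows.append({
            "DatePrevious": prev.isoformat(),
            "DateCurrent": cur.isoformat(),
            "Interval": label,
            "New": len(c1 - c0) / len(c1),
            "Gone": len(c0 - c1) / len(c0),
        })
    return rows
```

Calling it with timedelta(weeks=1) and the label "1 week" would give one row per consensus that has a counterpart exactly a week earlier. A "1 month" interval is not a fixed number of hours, so it would need calendar arithmetic instead of a plain timedelta.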

Cheers,
Philipp