[metrics-team] Monthly churn values per relay flag

Karsten Loesing karsten at torproject.org
Thu Feb 4 16:07:01 UTC 2016


On 02/02/16 15:27, Philipp Winter wrote:
> On Tue, Feb 02, 2016 at 12:19:07AM +0000, nusenu wrote:
>> Can one download the data somewhere as well?
> 
> Sure, it's available here: 
> <https://nymity.ch/sybilhunting/churn-values/churn-all.csv>

Hello Philipp and Nusenu,

Interesting stuff!  Here are some questions and comments:

 - Regarding the mentioned outliers and suspected consensus hiccups, I
wonder if those are caused by comparing non-adjacent consensuses.  A
random example could be the decline between 2007-11-10 and 2007-11-13.
Or is that an artifact of the plot connecting two valid comparisons,
each of two adjacent consensuses, across numerous missing values in
between?  In that case, would it make sense to add NAs to the graph
data before plotting it, so that the line would end on 2007-11-10 and
another line would start on 2007-11-13?
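
A minimal sketch of that NA idea in Python, assuming the graph data is a
time-sorted list of (valid-after time, churn value) pairs and that
consensuses are normally one hour apart.  The function name and the
sample values are made up for illustration:

```python
from datetime import datetime, timedelta

def insert_gaps(points, max_gap=timedelta(hours=1)):
    """Insert a (timestamp, None) marker wherever two consecutive
    data points are further apart than max_gap."""
    result = []
    previous = None
    for timestamp, value in points:
        if previous is not None and timestamp - previous > max_gap:
            # The None acts as an NA: most plotting tools break the
            # line there instead of connecting across the gap.
            result.append((previous + max_gap, None))
        result.append((timestamp, value))
        previous = timestamp
    return result

# Made-up sample values with a gap between 2007-11-10 and 2007-11-13.
points = [
    (datetime(2007, 11, 10, 12), 0.02),
    (datetime(2007, 11, 10, 13), 0.03),
    (datetime(2007, 11, 13, 9), 0.01),
]
for timestamp, value in insert_gaps(points):
    print(timestamp.isoformat(), "NA" if value is None else value)
```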

 - I'm still unsure how churn rates are defined exactly.  I know I
brought this up in previous discussions, Philipp, I just don't
remember the latest definition.  If I compare two adjacent consensuses
C0 and C1, how are the numbers in NewRunning and GoneRunning
calculated?  If I were to define them, I think I'd say that NewRunning
is the number of relays in C1 that were not listed in C0, divided by
the total number of relays in C1, and GoneRunning is the number of
relays in C0 that are not listed anymore in C1, divided by the total
number of relays in C0.  Note the different denominators: I think
if we used the same denominator for both, we couldn't guarantee that
both values are in [0, 1].  Is this also the definition you used?
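
The definition proposed above can be sketched in Python as follows,
assuming each consensus is represented as a set of relay fingerprints.
The function name and the single-letter fingerprints are placeholders,
and this is not necessarily how the published numbers were computed:

```python
def churn(c0, c1):
    """Return (new, gone) for two adjacent consensuses c0 and c1,
    each given as a set of relay fingerprints."""
    # Relays in C1 that were not in C0, relative to the size of C1.
    new = len(c1 - c0) / len(c1) if c1 else 0.0
    # Relays in C0 that are no longer in C1, relative to the size of C0.
    gone = len(c0 - c1) / len(c0) if c0 else 0.0
    return new, gone

c0 = {"A", "B", "C", "D"}
c1 = {"B", "C", "D", "E", "F"}
new_running, gone_running = churn(c0, c1)
print(new_running, gone_running)  # 0.4 0.25

# With a shared denominator the value can leave [0, 1]: three new
# relays measured against a one-relay C0 give 3.0.
small_c0 = {"A"}
big_c1 = {"B", "C", "D"}
print(len(big_c1 - small_c0) / len(small_c0))  # 3.0
```

Because each count is a subset of the set it is divided by, both values
stay in [0, 1] by construction.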

 - Would it make sense to not only include one Date in your .csv file
(I assume this is C1 in my definition above?), but the valid-after
times of C0 and C1 that you're comparing, like DatePrevious and
DateCurrent?

 - Your .csv file uses a "wide" table format with lots of columns for
the different flags.  My experience is that this table format has
disadvantages, because you need to know which columns exist and update
any code using the .csv file when you add new columns.  I find the
"long" table format to be more flexible.  In that format you'd have
one column per flag that contains just a boolean or null/NA/empty
string, and one line per combination of flags.  Here's an example of
what the first lines could look like in the "long" table format:

Date,Authority,BadExit,Exit,Fast,Guard,HSDir,Named,Running,Stable,Unnamed,V2Dir,Valid,New,Gone
2007-10-27T13:00:00Z,T,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,0.00000,0.00000
2007-10-27T13:00:00Z,NA,NA,T,NA,NA,NA,NA,NA,NA,NA,NA,NA,0.09551,0.02669
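
To illustrate why this format is easy to consume, here is a rough
Python sketch that reads the two example lines above and picks the row
for one flag combination.  The select() helper is hypothetical, and it
assumes that every column other than Date, New and Gone is a flag
column:

```python
import csv
import io

LONG_CSV = """\
Date,Authority,BadExit,Exit,Fast,Guard,HSDir,Named,Running,Stable,Unnamed,V2Dir,Valid,New,Gone
2007-10-27T13:00:00Z,T,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,0.00000,0.00000
2007-10-27T13:00:00Z,NA,NA,T,NA,NA,NA,NA,NA,NA,NA,NA,NA,0.09551,0.02669
"""

NON_FLAG_COLUMNS = {"Date", "New", "Gone"}

def select(rows, **wanted):
    """Yield rows whose flag columns match `wanted` exactly; flags
    that are not mentioned must be NA."""
    for row in rows:
        flags = [name for name in row if name not in NON_FLAG_COLUMNS]
        if all(row[flag] == wanted.get(flag, "NA") for flag in flags):
            yield row

rows = list(csv.DictReader(io.StringIO(LONG_CSV)))
for row in select(rows, Exit="T"):
    print(row["Date"], row["New"], row["Gone"])
```

The code does not need to know the list of flags in advance, so a new
flag column would not break it.  A combination of flags would be
selected the same way, for example select(rows, Guard="F", Exit="F",
Running="T").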

 - I agree with Nusenu that absolute relay numbers might be
interesting, too.  Using the "long" table format they are relatively
easy to add as new columns NewAbs and GoneAbs.  Whether you'd graph
them or not is another question.

 - Another potentially interesting metric would be the fraction of
consensus weight joining or leaving the network, like NewCW and
GoneCW.  Not sure how useful that would be for joining relays, because
new relays typically have very small consensus weights, but it might
be interesting to see when a large part of the network by consensus
weight leaves.
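
A possible way to compute such values, sketched in Python.  It assumes
each consensus is available as a mapping from relay fingerprint to
consensus weight, and it reuses the denominators from the relay-count
definition above (weight in C1 for NewCW, weight in C0 for GoneCW),
which is only one option.  The fingerprints and weights are made up:

```python
def weight_churn(c0, c1):
    """Return (new_cw, gone_cw) for two consensuses given as dicts
    mapping relay fingerprint to consensus weight."""
    total0 = sum(c0.values())
    total1 = sum(c1.values())
    # Weight of relays that appear only in C1, relative to C1's total.
    new_weight = sum(w for fp, w in c1.items() if fp not in c0)
    # Weight of relays that appear only in C0, relative to C0's total.
    gone_weight = sum(w for fp, w in c0.items() if fp not in c1)
    new_cw = new_weight / total1 if total1 else 0.0
    gone_cw = gone_weight / total0 if total0 else 0.0
    return new_cw, gone_cw

# Made-up weights: a large relay "A" leaves, a small relay "D" joins.
c0 = {"A": 5000, "B": 3000, "C": 2000}
c1 = {"B": 3000, "C": 2000, "D": 20}
new_cw, gone_cw = weight_churn(c0, c1)
print(f"NewCW={new_cw:.4f} GoneCW={gone_cw:.4f}")
```

In this example one of three relays leaves, but it accounts for half of
the consensus weight, which the relay-count metric alone would not show.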

 - Note that the long table format would also allow you to add flag
combinations like the ones mentioned by Nusenu.  For example, 'not
guard and not exit' would get an F in the Guard and Exit columns and
NAs in the rest.  Note that you'd probably want to consider only
running relays, so that would be Guard=F, Exit=F, Running=T, and the
rest NA.

 - Did you consider comparing consensuses with more than 1 hour in
between them, for example 1 day, 1 week, or 1 month?  That would
remove daily/weekly/monthly patterns and might make it easier to
observe changes.  It would also reduce the data resolution in the
graph, allowing you to plot more than just a month.  I could imagine
that a graph from 2007 to 2016 would be much more useful with a data
resolution of 1 week or 1 month.  Note how a data format with
DatePrevious and DateCurrent would allow you to add that data to your
.csv file.  Maybe add another column Interval that you set to "1
month" to make plotting easier.
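
One way to pick the consensus pairs for a longer interval, as a Python
sketch.  It assumes a list of valid-after times is available and simply
skips any consensus whose counterpart one interval earlier is missing.
The function name is hypothetical:

```python
from datetime import datetime, timedelta

def pairs_at_interval(valid_after_times, interval):
    """Return (DatePrevious, DateCurrent) pairs that are exactly
    `interval` apart; times without a counterpart are skipped."""
    available = set(valid_after_times)
    return [
        (current - interval, current)
        for current in sorted(available)
        if current - interval in available
    ]

# Four hourly consensuses, compared at a 2-hour interval.
start = datetime(2016, 1, 1, 0)
times = [start + timedelta(hours=h) for h in range(4)]
for previous, current in pairs_at_interval(times, timedelta(hours=2)):
    print(previous.isoformat(), current.isoformat())
```

Each resulting pair can then go through the same churn computation as
adjacent consensuses, with the chosen interval written to the Interval
column.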

 - I'm thinking about how we would provide such a .csv file on Metrics
and have graphs based on it.  What we should not do is read a file
that is potentially multiple times the current 14M for plotting a
single graph on demand.  I think we'd have to build something to store
subsets of that .csv file to plot graphs provided on Metrics.  That
should be easier in the long table format, where we could filter based
on the flag columns.  But let's talk more about this when you're
planning to move this .csv file to Metrics (assuming that is still
your plan).

Again, great stuff!

Cheers!
Karsten
