[metrics-team] metrics-web detect script update and question

Mon Nov 16 19:59:09 UTC 2015

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 14/11/15 18:04, seamus tuohy wrote:
> 
> Hello All,

Hello Seamus,

> I have been updating the detector scripts in metrics-web with the
> goal of making it easier for others (hopefully with more
> statistical knowledge than I) to work with and build on the code.
> It has been a substantial rewrite that relies heavily on the python
> pandas library. I have just reached the point where I can
> accurately duplicate the functionality of the original code as it
> is called in the 80-run-clients-stats.sh file. This code also
> removes the need for pre-processing the data as done by the
> userstats-detector.R script.
> 
> -- THE ACTUAL QUESTION --
> 
> The original script has data-visualization functions (e.g.
> plot_all()) that don't seem to be called from within metrics-web,
> and I would like some guidance on whether I should re-implement
> them. If so, where should I be looking to make sure that I implement
> them to work seamlessly with the existing system?
> 
> -- FLUFF FOR THE INTERESTED --
> 
> Here is the current code: 
> https://github.com/elationfoundation/metrics-web/blob/master/modules/clients/detector.py
>
> Below is a quick overview of the changes in output that may impact
> other programs, or consumers of this information. I will write up a
> much more in-depth overview of functionality when I submit the
> actual pull request. I am thinking of adding basic PT anomaly
> detection before I submit the pull request. This should be much
> easier with the new code.
> 
> Here is a comparison of the old and new output: 
> https://gist.github.com/elationfoundation/1714e0f1e9f8728eddb1
> 
> NEW_ranges_file_SUBSET.csv OLD_ranges_file_SUBSET.csv
> 
> - The output from the write_all function [now called
>   write_censorship_analysis()] has had some fields added to it. The
>   old code had some duplicate processing built into it. The new code
>   identifies the censorship and spike events the first time it runs
>   through the time series so that the other functions can just read
>   from the ranges output.
> - I have also changed the names of some of the fields. This will
>   impact any code that is currently parsing this output. I can
>   change the field names back, write a separate file that contains
>   only the data and headings in the current format for further
>   processing, or whatever code processes this output can be updated
>   to parse the new format.
> 
> 
> NEW_short_censorship_report.txt OLD_short_censorship_report.txt
> 
> - I have slightly modified the short censorship report produced by 
> write_ml_report() which is now called write_short_report(). The
> changes are merely cosmetic, but I think there is a lot that can be
> done to eventually make the short report a more useful document
> (e.g. putting it in a structured format that will allow others to
> scrape and incorporate it into a threat feed).

Great to see your interest in making the censorship detector better!

So, I have been thinking about your plan to submit a pull request for
the rewrite of the current functionality, and I'd like to suggest a
different plan:

How about you deploy your rewritten code on a minimal website that
visualizes the output of your rewritten censorship detection script,
possibly comparing it to other algorithms, and we link that website
from the Metrics website?

Let me explain this plan a bit more: what we really want is a better
censorship detection algorithm that doesn't produce as many false
positives.  Your rewrite can be a great starting point for that.  But
there's no need to merge code directly into Metrics until we're sure
we've found an algorithm we like better than the current one, and maybe
that requires making two or three attempts to get it right.  For now,
I'd rather add a link to your results.  We can always discuss
replacing the script in Metrics with a new one later, but there are
really no requirements other than that it can read a .csv in the
provided format and write a new .csv in the expected format.
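
To make that last requirement a bit more concrete, here is a rough
sketch of what such a drop-in script could look like.  Everything
specific in it is an assumption for illustration only: the column
names (date, country, clients, flagged), the function names, and the
toy detection rule are made up and are not the actual Metrics formats
or the current algorithm.  It uses only Python's standard csv module
to stay dependency-free; the same shape would work with pandas.

```python
import csv
import io

# Hypothetical column names, for illustration only; the real Metrics
# input and output formats are not specified in this thread.
INPUT_FIELDS = ["date", "country", "clients"]
OUTPUT_FIELDS = ["date", "country", "clients", "flagged"]


def naive_detector(rows, factor=2.0):
    """Toy placeholder, NOT the existing Metrics algorithm.

    Flags a row when the client count for a country is more than
    `factor` times, or less than 1/`factor` of, the previous row
    seen for that same country.
    """
    previous = {}
    for row in rows:
        country = row["country"]
        clients = float(row["clients"])
        last = previous.get(country)
        flagged = last is not None and last > 0 and (
            clients > last * factor or clients < last / factor
        )
        previous[country] = clients
        yield {
            "date": row["date"],
            "country": country,
            "clients": row["clients"],
            "flagged": "1" if flagged else "0",
        }


def run(infile, outfile, detector=naive_detector):
    """Read a CSV, apply a pluggable detector, write a CSV."""
    reader = csv.DictReader(infile)
    missing = set(INPUT_FIELDS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError("input is missing columns: %s" % sorted(missing))
    writer = csv.DictWriter(
        outfile, fieldnames=OUTPUT_FIELDS, lineterminator="\n"
    )
    writer.writeheader()
    for out_row in detector(reader):
        writer.writerow(out_row)


if __name__ == "__main__":
    # Tiny in-memory demo with made-up numbers.
    sample = io.StringIO(
        "date,country,clients\n"
        "2015-11-01,xx,100\n"
        "2015-11-02,xx,110\n"
        "2015-11-03,xx,20\n"
    )
    result = io.StringIO()
    run(sample, result)
    print(result.getvalue(), end="")
```

The point of keeping the detector as a separate function is that two
or three candidate algorithms could be swapped in and compared without
touching the code that reads and writes the .csv files.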

If you're not sure what I mean by link, here are two examples for
external links on Tor Metrics:

https://metrics.torproject.org/oxford-anonymous-internet.html

https://metrics.torproject.org/uncharted-data-flow.html

Does that plan make sense to you?  It's really great that you're
picking up this topic.  Thanks for that!

All the best,
Karsten

> 
> Best, s2e
> 
> -- 
> seamus tuohy | Sr. Technologist - Internet Initiatives
> stuohy at internews.org
> Skype/XMPP on request
> PGP: 36AC 272E B7CF EDD5 F907 E488 B619 3EC7 3CF0 7AA7
> MiniLock: 2G3JmRWRYB3B7rthZqkzomcRe8GwJvPtSooA748XMsTBdf
> 
> INTERNEWS | Local Voices. Global Change.
> www.internews.org | @internews
> _______________________________________________
> metrics-team mailing list
> metrics-team at lists.torproject.org
> https://lists.torproject.org/cgi-bin/mailman/listinfo/metrics-team
> 

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: GPGTools - http://gpgtools.org

iQEcBAEBAgAGBQJWSjWNAAoJEJD5dJfVqbCrqNQH/3M1sp8ZUB5LSGT4f9zO8Srv
9Oeq5APKh+GpMvUavTNoZebjjegP4GzsUmLbFLUM0n0/v0woNxiEpQJV8yS6aumA
dJchopns2xSBsbcQgPc/+x1QKmAnxeqCDmetQWEWLF8VXRO/VJKGkHL39ULRbKwL
5NF8o3Zd3V5uN2PyXArPeWmR35bbMUTse+8HLqlQwb3bj6uazbzBgUC91YMMhmt3
nZrzVU3rOu+CMtXAMZgHdQMmjEbdy3Qx/wt/sdVaj6102RxdH6QgA2cxnf3s4Ftz
19NVpVGOUZ+goxhyI4I1fDW8PJ4Tpo4OpVCMAcA+K39dQdEagY2c+JpQ5y0lbwg=
=5O6k
-----END PGP SIGNATURE-----