[metrics-team] M.Sc projects at Edinburgh

Karsten Loesing karsten at torproject.org
Fri Jan 8 11:14:46 UTC 2016


Hi William,

On 07/01/16 21:03, William Waites wrote:
> Hi all, I visited the IRC channel for some of the meeting earlier 
> today. I am a researcher at the School of Informatics -- though my
> background is network engineering and my interest in things like
> Tor comes from there, my actual area of research is completely
> unrelated. However, it occurred to me that we have some 300 or so
> taught M.Sc students who will be looking for projects across
> various computer science subdisciplines. It might be nice to have
> some of those work on things together with the Tor project.
> One such proposal that I made for making improvements to Ooni can
> be seen here:
> http://j6vwvo5cbgkdwfoo.onion/file/URI:CHK:hlb455txf3koftnq3qhvshkike:4fzxfggr5c5afluqw5etuvptdwqzl3fww2lwqbsqjo2ohqjgldha:3:10:129920/@@named=P028.pdf
>  (sorry, the entire project list is internal to the school, and
> yes that is tahoe-lafs behind a hidden service.)
> It's not necessarily the best exemplar of such a proposal, but the
> basic idea is that it should be a relatively small piece of
> self-contained, interesting work. Self-contained because the
> academic requirement is that the student demonstrate that it is
> their own work. Small because this is for the project phase of an
> M.Sc degree which means, roughly, that from February the student
> would do background research and write a specific plan for what
> they will do, and from March/April until August they would actually
> do it and write it up -- so a fairly short timeframe. It should be
> interesting enough that there are non-obvious aspects to the work.
> It would be great to have more ideas that students could work on
> to help the Tor project and wider ecosystem. I'm sure there is
> plenty of material! As far as supervision arrangements go that will
> need to be worked out -- having someone from Informatics and
> someone from elsewhere is perfectly feasible. Depending on the
> level of interest I can do some but probably not all, but for
> interesting ideas there are other colleagues in the school who can
> help.
> The deadline is soon! The 15th of January. Any specific ideas
> please let me know as soon as possible.

This sounds like a great opportunity.  I'd even say that "small piece
of work" is an understatement, given that we usually have much less
than 6 months to work on a single researchy project at Tor.  I can see
how these projects can produce something quite useful, even when
students need to first get up to speed on things.

I went through various todo lists and tickets to find a few possible
project ideas.  Note that all of these would have to be fleshed out
more.  Please let me know which of these ideas you find most
interesting, and let's nail them down more.

1. Grammar for Tor descriptors

A while ago I started looking into ANTLR 4 for writing down a grammar
for Tor bridge network statuses.  I described my motivation for doing
that on the tor-dev@ mailing list, and the grammar file I wrote
contains more details in the comments.  I don't know enough about
(more) practical alternatives to ANTLR 4 for this task, so I'd
appreciate feedback from somebody (an advisor at your school) before
writing this down as a project for a student.  I would also appreciate
buy-in from another Tor developer working on the Tor daemon as co-mentor.



(Also read the responses to that posting.)
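To make the idea concrete, here's a rough Python sketch of the kind of
per-line rule such a grammar would encode, using the "r" line of a
network status as an example.  The patterns below are simplified
illustrations, not the authoritative syntax from dir-spec.txt, and an
actual grammar (in ANTLR 4 or a comparable tool) would of course cover
all line types, not just one:

```python
import re

# Simplified sketch of the rule a formal grammar would encode for the
# "r" line of a network status.  Field names loosely follow dir-spec.txt;
# the regexes are illustrative, not the authoritative syntax.
R_LINE = re.compile(
    r"^r (?P<nickname>[A-Za-z0-9]{1,19}) "
    r"(?P<identity>[A-Za-z0-9+/]{27}) "      # base64 SHA-1, no padding
    r"(?P<digest>[A-Za-z0-9+/]{27}) "
    r"(?P<published>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"(?P<orport>\d{1,5}) (?P<dirport>\d{1,5})$"
)

def parse_r_line(line):
    """Return the parsed fields, or raise ValueError on a syntax error."""
    m = R_LINE.match(line)
    if m is None:
        raise ValueError("not a valid r line: %r" % line)
    return m.groupdict()
```

The point of the project would be to replace ad-hoc checks like this
with a single declarative grammar that both documents and validates the
format.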

2. Append-only log for ExoneraTor

ExoneraTor is a service for Tor relay operators and law enforcement
people to find out whether an IP address was used as a Tor relay in
the past.  We have been thinking about increasing the integrity of
ExoneraTor data by operating a public, untrusted, append-only log,
similar to Google's Certificate Transparency project.  With such a
system in place, anyone can verify that this log doesn't change and
that ExoneraTor doesn't lie about relays being part of the network.
If this project is successful, ExoneraTor might only be one
application using this log and other projects might start relying on
that log, too.  I'd appreciate input from Linus Nordberg and from a
possible advisor from your school on this topic before suggesting it
to students.
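For illustration, here is a minimal Python sketch of the Merkle-tree
construction such a log would rest on, following the RFC 6962
(Certificate Transparency) scheme.  Treat it as a conceptual sketch,
not a proposed implementation:

```python
import hashlib

# Merkle tree in the style of RFC 6962.  Leaf and interior hashes use
# distinct prefix bytes so a leaf can never be confused with a node.
def leaf_hash(entry):
    return hashlib.sha256(b"\x00" + entry).digest()

def node_hash(left, right):
    return hashlib.sha256(b"\x01" + left + right).digest()

def tree_head(entries):
    """Root hash over the current log.  Appending entries only changes
    the head; previously committed subtrees stay fixed, which is what
    lets auditors verify that the log is append-only."""
    if not entries:
        return hashlib.sha256(b"").digest()
    if len(entries) == 1:
        return leaf_hash(entries[0])
    # Split at the largest power of two smaller than len(entries),
    # as RFC 6962 specifies.
    k = 1
    while k * 2 < len(entries):
        k *= 2
    return node_hash(tree_head(entries[:k]), tree_head(entries[k:]))
```

The student's work would then be the interesting part around this core:
consistency and inclusion proofs, gossiping signed tree heads, and
fitting ExoneraTor's data model into log entries.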


3. Exposed bad relays

A few weeks ago I started working on a visualization of exposed bad
relays in the Tor network, which includes a) relays that got the
BadExit flag assigned, b) relays that got the Valid flag removed, and
c) relays that got outright rejected.  The goal was to only use
publicly available data for this visualization, which is not as
explicit about cases b) and c) above.  The issue is that the Tor
directory authorities disagree more than one would expect about
considering a relay as valid or even about listing it.  It's possible
that one can derive robust criteria for saying when a relay was
exposed as a bad relay, or it might be that machine learning would be
necessary to say this with high enough confidence.  This task requires
processing a huge amount of data.
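To sketch what such criteria might look like, here is a toy Python
classifier over per-authority vote data.  The input format and the
thresholds are invented placeholders, not established criteria; real
input would come from archived votes on CollecTor:

```python
# Toy heuristic for deciding whether a relay was "exposed" as bad,
# given one vote record per directory authority.  Each record is a
# dict with 'listed' (bool) and 'flags' (set of flag names).  The
# majority thresholds are illustrative starting points only.
def classify(votes, total_authorities):
    listed = [v for v in votes if v["listed"]]
    rejected_by = total_authorities - len(listed)
    badexit = sum(1 for v in listed if "BadExit" in v["flags"])
    not_valid = sum(1 for v in listed if "Valid" not in v["flags"])
    if badexit > len(listed) / 2:
        return "badexit"       # case a): BadExit flag assigned
    if rejected_by > total_authorities / 2:
        return "rejected"      # case c): outright rejected
    if not_valid > len(listed) / 2:
        return "not-valid"     # case b): Valid flag removed
    return "ok"
```

The research question is precisely where these thresholds should sit,
given how much the authorities disagree in practice.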




4. Analytics Project

thms is working on the Analytics Project, which he recently described
as follows (I hope it's okay to quote him here):

"Setting up an environment to analyze Tor data is no trivial task.
Access to the raw data is only feasible programmatically through
special libraries. Also, the data is collected on a per-node, per-hour
basis, whereas most of the time one will be interested in certain
kinds of nodes within a certain period of time. Therefore a lot of
aggregation has to be performed upfront before you can even start to
ask the questions that you came for. Not to mention that the sheer
amount of data might block your notebook for days and eat up your
hard drive on the way. The analytics project plans to provide a small
set of tools to ease these problems: a converter from raw CollecTor
data to ubiquitous JSON and more performant Avro/Parquet; a Big Data
setup providing popular interfaces like MapReduce, R and SQL and an
easy migration path to scale huge tasks from notebook to cloud; a
collection of pre-aggregated datasets and aggregation scripts. With
these tools it should become much more feasible to work with Tor data
and perform analytics as need arises, without too much fuss and effort
upfront. This is still work in progress, so please allow a few more

I could imagine that this project fits the criteria stated above.
But of course thms is already working on it, so we'd have to find a
self-contained part of this project to suggest to your students.  If
this sounds interesting to you, we should talk more with thms.
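As a tiny illustration of the conversion step thms mentions, here is a
toy Python converter from simple "keyword value" descriptor lines to
JSON.  The input is made up; a real converter would read actual
CollecTor descriptors via a proper descriptor-parsing library rather
than this simplified line format:

```python
import json

def descriptor_to_json(raw):
    """Turn simple 'keyword value' descriptor lines into one JSON
    object.  Illustrative only: real descriptors have multi-line
    fields and repeated keywords that this ignores."""
    record = {}
    for line in raw.strip().splitlines():
        keyword, _, value = line.partition(" ")
        record[keyword] = value
    return json.dumps(record, sort_keys=True)
```

A self-contained student project could be exactly this slice: a
well-tested converter plus benchmarks against the Avro/Parquet output.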

5. Confidence intervals for user number estimates

A year ago we added hidden-service statistics to Tor Metrics.  These
are based on reports by dozens or even hundreds of relays, each of
which sees a small fraction of the network.  The algorithms used for
extrapolating these data are much better than earlier algorithms used
for estimating user numbers, and we should look into adapting these
algorithms to user number estimates, also to overcome problems as
described in Tor Trac ticket #16555.  What we should also do is
provide confidence intervals for these estimates.  Obviously, this
project should be done by somebody with a background in statistics,
and I'd appreciate help from an advisor with a background in that field.
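To illustrate the statistical core, here is a minimal Python sketch
that computes a normal-approximation confidence interval over
per-relay estimates (the numbers in the test are invented).  A real
analysis would have to model how relays sample the network, so this is
only the simplest possible baseline:

```python
import statistics

def confidence_interval(estimates, confidence=0.95):
    """Normal-approximation confidence interval for the mean of
    per-relay extrapolated estimates.  Baseline sketch only: it
    assumes independent, identically distributed estimates."""
    n = len(estimates)
    mean = statistics.fmean(estimates)
    sem = statistics.stdev(estimates) / n ** 0.5  # std. error of the mean
    z = statistics.NormalDist().inv_cdf((1 + confidence) / 2)
    return mean - z * sem, mean + z * sem
```

The hard part of the project is justifying (or replacing) those
assumptions for relays that each see only a fraction of the network.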





So, let me know which of these sound most promising as project
ideas for your students, and I'll put more thought into those.

Thanks for asking, by the way!

> Cheers, -w

All the best,

> -- William Waites <wwaites at tardis.ed.ac.uk>  |  School of
> Informatics https://tardis.ed.ac.uk/~wwaites/      | University of
> Edinburgh https://hubs.net.uk/             |      HUBS AS60241
> The University of Edinburgh is a charitable body, registered in 
> Scotland, with registration number SC005336. 
> _______________________________________________ metrics-team
> mailing list metrics-team at lists.torproject.org 
> https://lists.torproject.org/cgi-bin/mailman/listinfo/metrics-team


