Hi Georg,
Thanks for the feedback!
On 03/17/2016 10:06 PM, Georg Koppen wrote:
Hi Pierre,
thanks for this proposal. Gunes has already raised some good points and I won't repeat them here. This is part one of my feedback as I need a bit more time to think about the code example section.
Pierre Laperdrix:
Hi Tor Community, .....
Website features

The main feature of the website is to collect a set of fingerprintable attributes on the client and calculate the distribution of values for each attribute, like Panopticlick or AmIUnique. The set of tests would not only include known fingerprinting techniques but also ones developed specifically for the Tor browser. The second main feature of the website would be for Tor users to check how close their current fingerprint is to the ideal fingerprint that most users should share. A list of actions should be added to help users configure their browser to reach this ideal fingerprint.
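To make the distribution and closeness ideas concrete, here is a minimal Python sketch (the names and data structures are mine, purely illustrative): it counts how often each value of each attribute was seen over the collected fingerprints, and lists the attributes where a given fingerprint deviates from the most common value:

    from collections import Counter

    def attribute_distributions(fingerprints):
        """Count, for every attribute, how often each value was seen.

        `fingerprints` is an iterable of dicts mapping attribute names
        (e.g. "userAgent", "timezone") to the values collected client-side.
        """
        distributions = {}
        for fp in fingerprints:
            for attribute, value in fp.items():
                distributions.setdefault(attribute, Counter())[value] += 1
        return distributions

    def deviating_attributes(fingerprint, distributions):
        """Return the attributes whose value differs from the most common
        one, i.e. where this user stands out from the crowd."""
        deviating = []
        for attribute, value in fingerprint.items():
            most_common_value, _ = distributions[attribute].most_common(1)[0]
            if value != most_common_value:
                deviating.append(attribute)
        return deviating

The list returned by deviating_attributes() is essentially the "list of actions" mentioned above: each entry is an attribute the user could try to bring back in line with the crowd.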
We might want to think about that ideal fingerprint idea a bit. I think there is no such thing even for Tor Browser users, as we are e.g. rounding the content window size to a multiple of 200x100 for each user. Thus, we have at least one fingerprintable attribute where we say "you are good if you have one out of a bunch of possible values". The same holds for our security slider, which basically partitions the Tor Browser users. We could revisit these design decisions, and I am especially interested in getting data that is backing/not backing our decisions regarding them. Nevertheless, I assume we won't always be able to put users into just one bucket per attribute due to usability issues. And this in turn does not make the idea of helping users configure their browser any easier.
You are right on that. I'll switch from the idea of an "ideal" fingerprint to an "acceptable" one in my proposal. The idea of partitioning the dataset into categories from the security slider can be really interesting, and we could try to play with some JS benchmarking since some JS optimizations are removed at higher security levels. Then, we could try to detect the security level either through fingerprinting or, following Gunes's suggestion, by having the browser directly add the level as a URL parameter to give us the ground truth.
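As a rough illustration of both ideas (the "seclevel" parameter is purely hypothetical, nothing the browser exposes today), the server side could validate the reported window size against the 200x100 buckets Georg mentions and read the slider position from the URL when it is available:

    from urllib.parse import urlparse, parse_qs

    def window_size_is_bucketed(width, height):
        """Tor Browser rounds the content window to multiples of 200x100,
        so any reported size outside these buckets is suspicious
        (resized window, non-Tor browser, spoofing...)."""
        return width % 200 == 0 and height % 100 == 0

    def security_level_from_url(url):
        """Extract a hypothetical 'seclevel' parameter that the browser
        could append to give us the slider position as ground truth,
        e.g. https://test.example/fp?seclevel=high"""
        params = parse_qs(urlparse(url).query)
        return params.get("seclevel", [None])[0]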
The third main feature would be an API for automated tests as detailed on this page: https://people.torproject.org/~boklm/automation/tor-automation-proposals.htm... . This would enable automatic verification of Tor protection features with regard to fingerprinting. When a new version is released, the output of specific tests will be checked for any evolution/changes/regressions from previous versions (a strawman of what such an API could look like is sketched after the open questions below). The fourth main feature I'd like to include is a complete stats page where the user can go through every attribute and filter by OS, browser version and more. The inclusion of additional features that go beyond the core functionalities of the site should be driven by the needs of the developers and the Tor community. Still, a lot of open questions remain that should be addressed during the bonding period to define precisely how each of these features should ultimately work. Some of these open questions include:
- How closed/private/transparent should the website be about its tests and the results? Should every test be clearly indicated on the webpage with its own description? Or should some tests stay hidden to prevent spreading usable techniques to fingerprint Tor users?
- Should a statistics page exist? Should we give read access to the database to every user (for example in the form of a REST API or another solution)?
- Where should the data be stored? How long should the data be kept? If tests are performed per version, should the data from an old TBB version be removed? Should the data be kept a week, a month or more?
I am not sure about how long the data should be kept. It probably depends on what kind of data we are talking about (e.g. aggregate or not). I think, though, that data we collected with Tor Browser A should not get deleted just because Tor Browser A+1 got released. In fact, we might want to keep that data, especially if we want to give users a guide about how to get a "better" fingerprint. But even if not, we might want to have this data to measure e.g. whether a fix for a particular fingerprinting vector had an impact and, if so, which one.
It makes sense to keep data from previous versions. Moreover, I don't know how fast the majority of Tor users upgrade their browsers, but when a new version is launched, it would be a bad idea to restrict the collection of fingerprints to one or two specific versions.
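To illustrate why keeping the old data matters (a sketch only; the "tbb_version" field and the data layout are my assumptions), comparing the value distribution of one attribute between two versions would show whether a fix for a given vector actually collapsed users into fewer buckets:

    from collections import Counter

    def distribution_for_version(fingerprints, attribute, version):
        """Distribution of one attribute's values, restricted to the
        fingerprints reported by a given Tor Browser version."""
        return Counter(fp[attribute] for fp in fingerprints
                       if fp.get("tbb_version") == version and attribute in fp)

    def fix_had_impact(fingerprints, attribute, old, new):
        """Crude regression check: after a fix we expect fewer distinct
        values for the attribute, i.e. users fall into fewer buckets."""
        before = distribution_for_version(fingerprints, attribute, old)
        after = distribution_for_version(fingerprints, attribute, new)
        return len(after) < len(before)

If the data from version A were deleted when A+1 ships, this kind of before/after comparison would simply be impossible.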
- How should new tests be added: a pull request? A form where submissions are reviewed by admins? A link to the Tor tracker?
From a Tor perspective, opening a ticket and posting the test there, or ideally having a link to a test in the ticket that fixes the fingerprinting vector, seems like the preferred solution. I'd like to avoid the situation where tests get added to the system without our knowledge, leaving us dealing with users who are scared by the new results. So, yes, some review should be involved here.
I always envisioned some form of review to add new tests. I still don't know exactly what the system would be since I don't know the exact structure of the website yet.
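Whatever the submission channel ends up being, the gating itself can stay simple. A minimal sketch (the structure is entirely hypothetical) where only tests an admin has approved are ever served to clients:

    # Each submitted test carries a review status; only tests an admin
    # has approved are included in the suite served to clients.
    SUBMITTED_TESTS = [
        {"name": "canvas_font_probe", "status": "approved"},
        {"name": "webgl_renderer_leak", "status": "pending"},
    ]

    def active_test_suite(tests=SUBMITTED_TESTS):
        return [t["name"] for t in tests if t["status"] == "approved"]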
- Should the website only be accessible through Tor?
I don't think so. I am fine with Chrome/IE etc. users trying to see how they fare on that test. This not closing down right from the start, and proper communication about it, might be important if we want to create a better test platform not only for Tor Browser but for other vendors, as you alluded to above. (Which is a good idea, as it encourages collaboration and a better understanding of the fingerprinting problem in general.)
Yes, for this reason, there is no need to restrict access.
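Regarding the automated-test API mentioned earlier: the real interface should follow boklm's page, but as a strawman (the endpoint path and response format are my assumptions, and Flask is used here only to keep the sketch short), the site could expose each run's results as JSON so a harness driving a fresh Tor Browser build can diff them against the previous release:

    from flask import Flask, jsonify

    app = Flask(__name__)

    # Results keyed by (tbb_version, run_id); filled in by the
    # fingerprinting script when an automated browser visits the site.
    RESULTS = {}

    @app.route("/api/results/<tbb_version>/<run_id>")
    def results(tbb_version, run_id):
        """Machine-readable output of one automated run, so a test
        harness can compare it with the output of a previous release."""
        return jsonify(RESULTS.get((tbb_version, run_id), {}))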
Technical choices

In my opinion, the website must be accessible and modular. It should have the ability to cope with a large number of connections and a large amount of data. With this in mind, and the experience gained from developing AmIUnique, I plan on using the Play framework with a MongoDB database. Developing the website in Java opens the door to many developers to make the website better and more robust after its initial launch, since it is one of the most used programming languages in the world. On the storage and statistics side, MongoDB is a good fit because it is now a mature technology that can scale well with a large amount of data and connections. Moreover, the use of SQL databases for AmIUnique proved to be really powerful, but maintenance after the website was launched became a tedious task, especially when modifying the underlying model of a fingerprint to collect new attributes. A more flexible and modular database seems a better choice for maintenance and for adding/removing tests.
If we look at the Tor side, I guess we have more experience with Python code (which includes me) than with Java. Thus, by using Python it might be easier for us to maintain the code in the longer run. That said, I am fine with the decisions as you made them, especially if you are already familiar with all these tools/languages. And, hey, we always encourage students to stay connected to us and get even deeper involved after the GSoC has ended. So, this might then actually be an area for you... ;)
I wrote that I would use Java and Play because I'm familiar with them, but I'm really open to trying something new. For the past year, I have mainly used Java and Python, so switching to Python is absolutely not a problem for me. In terms of timeline, this would mean that the website would take a little more time to reach a proper first running version, but if it means in the long term that a broader part of the Tor community can participate, it is better. And for me, learning new technologies is part of the fun of development. In terms of framework, the new version of Panopticlick is using Flask, but the Django framework seems to be more complete, with stronger community support.
One thing I'd like you to think about, though, is that we have guidelines for developing services that might be running on Tor project infrastructure one day:
https://trac.torproject.org/projects/tor/wiki/org/operations/Guidelines
Not sure if the tools you had in mind above fit the requirements outlined there. If not, we should try to fix that. (Thanks to Karsten for pointing that out)
Thanks for pointing that out! I wasn't aware of it. If I read the guidelines correctly, to make a service run on the Tor infrastructure, you must use trusted and stable sources; in the case of Tor, that means stable Debian packages. The use of self-provided libraries or third-party package managers is not recommended. If I decide to go the Django way, Django is in Debian's stable repository. However, if I want to use a non-relational database like MongoDB, the connectors come from pip and git repositories, so we end up in the "not recommended" area (even if it is not forbidden). Django has built-in support for SQL databases, but in my opinion it would not be easy to add/remove tests with such rigid models. I don't know who decides whether it is okay or not to be on the Tor infrastructure, but it seems to me that it could work.
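To make the flexibility argument concrete, here is what adding a new test looks like with MongoDB (a sketch assuming pymongo; the attribute names are invented): just an extra key in the document, with no schema migration involved:

    from pymongo import MongoClient

    collection = MongoClient()["fpcentral"]["fingerprints"]

    # Today's fingerprint model...
    collection.insert_one({"tbb_version": "5.5.3",
                           "user_agent": "Mozilla/5.0 ...",
                           "timezone": "UTC"})

    # ...and tomorrow's, with a newly added test: no ALTER TABLE,
    # no model migration, just one more key in the document.
    collection.insert_one({"tbb_version": "5.5.4",
                           "user_agent": "Mozilla/5.0 ...",
                           "timezone": "UTC",
                           "new_canvas_test": "a1b2c3"})

With Django's SQL models, the same change would mean a schema migration on every test added or removed, which is exactly the maintenance burden we ran into with AmIUnique.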
Pierre
Estimated timeline

You will find below a rough estimate of the timeline for the three months of the GSoC.
Community bonding period - Discuss with the mentors and the community the set of features that should be included in the very first version of the website and clarify the open questions raised in one of the previous paragraphs.
23 May - 27 June: Development of the first version of the website with the core features
Week 1 - Development of the first version of the fingerprinting script with the core set of attributes. Special attention will be given so that it is fully compatible with the most recent version of the Tor browser (and older ones too).
Week 2 - Start developing the front-end and the back-end to store fingerprints, with a page containing data on your current fingerprint (try adding a view to see how close/far you are from the ideal fingerprint).
Week 3 - Start developing the statistics page with the necessary visualization for the users. Modification of the back-end to improve statistics computation and lessen the server load.
Week 4 - Finishing the front-end development and refining the statistics page to surface the most relevant information. Adding and testing an API to support automated tests.
Week 5 - Finishing the first version so that it is ready for deployment. Start developing additional features requested by the community (REST API? account management?)
27 June - Mid July: Deployment of the first version online for a beta test, with bug fixing. Finishing development of additional features requested by the mentors/community. Defining the list of new features for the second version.
Mid July - 23 August: Adding a system to make the website as flexible as possible so tests can be added/removed easily (a pull-request system? A test submission form where admins review tests before they are included in the test suite?)
Developing additional features for the website.
Making sure that the website can be opened to more browsers (work done at design time to support any browser will be tested here).
Bug fixing.
That looks like a good timeline estimation to me.
That's it for the first feedback,
Georg
[snip]