Surveillance, Privacy, Encryption, Anonymity and Censorship

Copyright 2016 Hugo M. Connery
License: Creative Commons Attribution ShareAlike 4.0
https://creativecommons.org/licenses/by-sa/4.0/

The Internet, which grew out of a research network (ARPANET) funded by the US Dept. of Defence during the Cold War, has become a very successful network. It now carries a very large share of all inter-human and business communications, and serves as a "library in your home/work place".

The success of the Internet is a clear case for the support of standards, by which disparate systems can inter-operate. One reads the "here is how it works" document and can then make one's own thing work with the other things without any specific understanding of how those other things work, only that they adhere to the standard.

The huge concentration of communication in a single (well, multiple but consistent) mechanism leads those who wish to understand humans through their interactions to want access to those communications. Thus, advertisers, law enforcement and intelligence agencies want access to either the raw communications or summaries thereof that fit their needs. There are surely other actors too: polling agencies, large corporations etc.

Moore's law, the observation that the number of transistors on a chip (and with it, roughly, computing speed and/or capacity) doubles about every two years, has largely held for several decades. The result is that it is now easier to capture the communications of an entire population and, post capture, filter that information, than to target specific communications. This results in mass surveillance -- it is easier and cheaper.

From the above, the Internet can be seen as a heavily used communications platform which is also heavily surveilled. If one understands the risks of global surveillance one may wish to counter that threat.
A little thought can quickly clarify the risk: imagine that someone knew the text, date, time, and place of every single internet search that you have ever made. They would surely know much about you, and that information would be useful to some groups. The end purpose of such groups may be benign or dangerous depending on your situation and outlook.

But, to counter the threat, one must understand the Internet itself. Not all of it, just how some of it works. Thus, one must read and understand the standards documents. The following is the author's best effort at rendering the contents of those documents understandable by persons with little expertise in the engineering fields of computing and network communications.

Let's start with the most trivial thing that persons do on the internet -- they start a browser, and that browser loads the default page for a particular internet site. What happens is as follows:

1. The computer's operating system asks its long term storage device (disk) for the program (the browser), loads it into the short term storage (memory) of the computer, and then starts running that program.

2. The program asks the operating system how it can turn internet names (e.g. site.net) into addresses, so that the program can then use the internet communication protocols to talk to that address. The system that does this 'name to number' translation for the Internet is called the Domain Name System (DNS).

3. The program then submits the name (site.net) to the facility that the operating system told it to use to turn the name into an address, and waits for an answer.

4. Upon receiving an answer the program starts talking to the address (which it thinks is "site.net") and asks it for its default "page". This "page" is a document written in HTML which the program (browser) can then display.
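Steps 2 and 3 above can be made a little more concrete. The following is a minimal sketch, in Python, of the question that gets sent to the 'name to number' facility when classic DNS is used. It only builds the message; it does not send it. The function name build_dns_query and the query id 0x1234 are illustrative choices and not part of any standard.

```python
import struct


def build_dns_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a classic DNS question asking for the IPv4 address of `name`."""
    # Header: id, flags (0x0100 = "recursion desired"), one question,
    # and zero answer, authority and additional records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # The name is sent as length-prefixed labels ending in a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # Question type 1 (A, an IPv4 address) and class 1 (IN, the Internet).
    question = qname + struct.pack("!HH", 1, 1)
    return header + question


packet = build_dns_query("site.net")
print(packet)
```

Printing the packet shows the words "site" and "net" sitting in the message as ordinary readable text, which is what anyone who can see the message would also see.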
The communication from the browser runs through the operating system (which controls the network interface -- wired or wireless) and reaches the end point (site.net's address). The remote site replies, and that reply is handed by the operating system to the browser. In all of these communications between the browser (your computer) and the end point (site.net), each piece of communication displays the source of the communication (the address of your computer running the browser) and the destination of the communication (the address that the browser got from the name to number translation).

And there is the surveillance. Each piece of communication on the internet describes in clear readable form the addresses of both the source (your computer) and the end point (site.net). Thus, anyone watching the communication knows that one address is talking to the other. This has been true from the earliest incarnations of the internet up until the publication of this text. Anyone watching the "wire" (backbone communications channels) knows who is talking to what.

There is a second surveillance mechanism in the above, and that is the name to address translation (DNS). If this communication can be read, then people will know what you are attempting to communicate with without having to watch all the traffic. This communication mechanism is currently unencrypted and thus leaks all of this information. The Internet standards bodies are currently (2016) working to address this glaring problem.

To understand further, we must look at "how" your computer is talking to the site. Internet based communications are commonly described using a model called the seven layered stack (the "OSI model"). This is one of the most beautiful things created by computer/network communications engineering/science. We need only understand a portion of it. Note that it is this layering that allows your browser to talk to web sites via wired networks, wireless networks, mobile phone networks and many others.
All of the dirty details are hidden from the browser and handled at other layers.

One of those layers is the "encryption" layer (more precisely, in the OSI model encryption is usually assigned to the "presentation" layer). The end point to which your application is talking may offer (or demand) some form of encryption. If this is the case, your application (the browser or other program) may be able to meet this offer or requirement. If so, then the people watching the wire will know where/what you are communicating with, but not what you are saying.

The most common form of Internet based encryption happens via the https protocol. This is different from the older but, unfortunately, still too commonly used http protocol. It is the 's' in https that makes the difference. In http the watcher knows both who is talking to whom and what they are saying. In https the watcher only knows who is talking to whom but cannot understand the conversation (unless they know how to break the encryption).

To clarify this, imagine that you are visiting a site which has recipes. In http the watcher knows both that you are visiting the recipe site, and which recipe you are looking at. In https the watcher only knows that you are visiting the recipe site, but not which recipe you are viewing. In a banking analogy, they know that you are visiting a specific bank, but know nothing about which account or transaction you are accessing/executing.

It is worth observing that in the above, the recipe site and the bank both know who is talking to them and what is being said. In the case of the bank, this is essential, as the bank wishes to only allow the owner of an account to access it. For the recipe site, it is not essential that it knows who is asking it about its recipes. It may be interested in knowing how often over various periods of time certain recipes have been read, but that information does not require identifying the reader.
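To illustrate why the watcher always learns who is talking to whom, here is a small Python sketch of the addressing part of a piece of internet communication (an IPv4 header). The addresses 192.0.2.10 and 198.51.100.7 are reserved documentation addresses standing in for your computer and site.net, the function names are illustrative, and the header checksum is left at zero for brevity, so this is a teaching model rather than a packet one could send.

```python
import socket
import struct

# Layout of a minimal 20-byte IPv4 header: version/length, service type,
# total length, id, fragment info, time to live, protocol, checksum,
# source address, destination address.
IPV4_HEADER = struct.Struct("!BBHHHBBH4s4s")


def build_packet(src: str, dst: str, payload: bytes) -> bytes:
    """Prefix a payload with a simplified IPv4 header (checksum not computed)."""
    header = IPV4_HEADER.pack(
        0x45, 0, IPV4_HEADER.size + len(payload), 0, 0, 64, 6, 0,
        socket.inet_aton(src), socket.inet_aton(dst),
    )
    return header + payload


def watcher_view(packet: bytes) -> tuple:
    """Return what any observer can read: (source, destination)."""
    fields = IPV4_HEADER.unpack_from(packet)
    return socket.inet_ntoa(fields[8]), socket.inet_ntoa(fields[9])


# The payload stands in for encrypted https content the watcher cannot read.
packet = build_packet("192.0.2.10", "198.51.100.7", b"<encrypted bytes>")
print(watcher_view(packet))
```

Whatever is placed in the payload, encrypted or not, the two addresses at the front remain readable, because the network needs them to deliver the communication.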
For a person who is concerned about surveillance, the use of http should be of great concern, as anyone who is watching the traffic can know everything. Https offers far more security (the 's' stands for 'secure') but is still of some concern -- the watchers know who/what you are communicating with.

Before continuing on the remaining surveillance problem, that watchers know what/who you are communicating with, it behooves me to expand on the 'secure' in https. It does three things. Encryption (ensuring that watchers don't understand what is being said) is only one of them. The acronym, fittingly, is CIA: Confidentiality (that's the encryption), Integrity and Authenticity. (In the wider security literature the 'A' of this acronym usually stands for Availability; here it stands for Authenticity.)

Integrity guarantees that if the 'watchers' (or anyone in the network path) modify the data sent then the modification is detected and the communication fails. This ensures that you get what the site sent, and vice versa. Or, to put it another way, nobody can modify the communications without people knowing.

Authenticity means that you are really talking to who you think you are. Without it, watchers (or others) can fake the communications, sit in the middle, and forward messages back and forth between you and your desired end point. This is called a Man in the Middle (MITM) attack, and is deadly. You think all is fine, but the attacker is reading and/or changing anything and you don't know. Authenticity is still a major problem for the internet, but its discussion is beyond the scope of this article.

But, be confident that encryption and integrity are problems that have been solved by the cryptographic community. Problems continue to arise, but they are largely due to outdated solutions continuing to be used. Why is that? Because of Moore's Law: encryption and integrity solutions are based partially on the amount of computational work that the attacker must do to decrypt or modify the message. As computational power has increased, the solutions to the encryption and integrity problems have had to be re-engineered.
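The integrity property described above can be illustrated with a short Python sketch. It uses a keyed checksum (an HMAC) from Python's standard library. This is a simplified stand-in and not the exact mechanism that https uses. The key value, the message and the function names protect and verify are assumptions made for the example, and the sketch assumes that both ends already share the key.

```python
import hashlib
import hmac

TAG_LENGTH = 32  # size in bytes of a SHA-256 based tag


def protect(key: bytes, message: bytes) -> bytes:
    """Append a keyed checksum so that later changes can be detected."""
    return message + hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, protected: bytes) -> bytes:
    """Return the message if it is unmodified, otherwise raise ValueError."""
    message, tag = protected[:-TAG_LENGTH], protected[-TAG_LENGTH:]
    expected = hmac.new(key, message, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("message was modified in transit")
    return message


KEY = b"example shared key"  # hypothetical; a real key is negotiated per connection
sent = protect(KEY, b"recipe 42")
print(verify(KEY, sent))

# Someone in the network path changes the message but cannot fix the checksum.
tampered = b"recipe 43" + sent[len(b"recipe 42"):]
try:
    verify(KEY, tampered)
except ValueError as error:
    print("rejected:", error)
```

Without the key, someone in the middle cannot produce a checksum that matches a changed message, so the change is noticed and the communication fails.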
There are several solutions to the "we know who you are talking to" problem. The "what you are saying" problem is solved by using encryption and integrity. Below, Proxying, Virtual Private Networks (VPN) and Onion Routing are discussed as solutions to the "watchers know who you are talking to" problem.

A proxy is a computer which will shuttle messages on your behalf between you and the end point with which you wish to communicate. You talk to the proxy and say "I want the default page from site.net". The proxy takes this communication and says to site.net "I want your default page" and when it receives that information from site.net the proxy sends it back to you. Now, the watcher sees you talking to the proxy rather than site.net. But, if the protocol between you and the proxy is not encrypted then the watcher can see who you are asking the proxy to talk to on your behalf, and you have achieved nothing.

A Virtual Private Network (VPN) is just like a proxy, but it automatically involves encryption between you and the VPN. In this case, the watcher knows that you are talking to the VPN, but nothing else.

There are two key risks to consider: that the watcher is also watching the proxy/VPN, or that the watcher is the proxy/VPN!

In the first case, by knowing both what is coming from your computer, and when it is coming, and what is passing through the VPN and when it is passing through, the watcher can correlate these network flows based on times of communication and thus identify who you are talking to via the VPN. If the watcher has this power, you have achieved nothing more than making the watcher's job a little harder. And, since this additional challenge of doing the timing correlation can largely be automated, one has achieved nothing more than making the watcher watch more things (you and the VPN).

If the watcher is the VPN, then obviously they know who you are talking to, and they don't even have to do the correlation analysis.
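The timing correlation described above can be sketched in a few lines of Python. This is a deliberately naive illustration and not a description of any real monitoring system: the timestamps are invented, the 0.05 second delay window is an arbitrary assumption, and the names correlation_score and best_match are made up for this example.

```python
def correlation_score(client_times, flow_times, max_delay=0.05):
    """Fraction of the client's packets that are followed, within max_delay
    seconds, by a packet leaving the VPN in the given flow."""
    if not client_times:
        return 0.0
    matched = sum(
        1 for sent in client_times
        if any(0 <= seen - sent <= max_delay for seen in flow_times)
    )
    return matched / len(client_times)


def best_match(client_times, flows):
    """Name of the outgoing flow whose timing best matches the client."""
    return max(flows, key=lambda name: correlation_score(client_times, flows[name]))


# Times (in seconds) at which the watcher saw you send something to the VPN.
you_to_vpn = [0.00, 1.30, 2.75, 4.10]

# Times at which the watcher saw the VPN send something to each end point.
vpn_to_sites = {
    "site.net": [0.02, 1.33, 2.78, 4.12],
    "other.example": [0.50, 1.90, 3.40],
}

print(best_match(you_to_vpn, vpn_to_sites))
```

With this invented data every packet you send is followed almost at once by a packet from the VPN to site.net, so the watcher concludes that site.net is who you are talking to, without decrypting anything.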
Finally, we move to Onion Routing. Onion Routing involves encryption, like the VPN, but involves more than just one "proxy". The term used in Onion Routing for machines that forward communications amongst the Onion Routing network is a relay. So, a VPN uses just one relay. Onion Routing uses at least three, and the particular three that you are using change over time.

Onion Routing tries to solve both of the problems discussed above: that watchers may be watching lots of things, and that they may actually be some of those things.

The largest and most used modern Onion Routing network is called Tor (The Onion Routing network). It gets to use the definite article "The" because it grew out of the original Onion Routing project. The original work was done by the US Naval Research Laboratory. Since then, the project has moved into the public sphere. Today, Tor is a volunteer community driven network, to which anyone can contribute. Anyone includes both people who wish to assist in internet anonymity, and their adversaries (e.g. idealistic academics, intelligence agencies, law enforcement, criminal networks etc.). This is a problem that the Tor community is aware of and tries to combat (minimise abusive participation whilst continuing to allow broad constructive participation).

Tor has a collection of partially trusted computers called the "Directory Authorities". These computers know the collection of relays which make up the network. When you connect to Tor the following happens:

1. You contact a directory authority and ask for a list of all of the relays in the network (or at least, what has changed since the last time you asked for information about the network).

2. You select 3 relays from that list. If you have used a certain relay as your first relay before, and it is still available, you will prefer to continue to use that first relay. The next two relays are selected randomly.

3. You contact each of those relays, in order, through the first, and ask for their cryptographic information.
With this, you have formed a "circuit" from you through 3 relays, and have the information to be able to encrypt communication from you to each of those relays, and decrypt return communication from each relay to you.

4. You visit a site:

   a) your request is encrypted to the third (last) relay so that that relay can decrypt the request and send your communication onwards to the site.

   b) you form a request for the second relay to forward the "request" to the third relay. This request is encrypted with the cryptographic details for the second relay.

   c) as above, you create an encapsulated, encrypted communication for the first relay which asks it to send the above on to the second relay.

   d) you send the above information to the first relay.

   e) the first relay decrypts the communication and sees that it should send something that it does not understand to the second relay. It does so, and the second relay decrypts what it gets. It sees a request to forward an encrypted communication that it cannot understand to the third relay. It does so. The third relay decrypts what it receives and sees what you wanted to do -- get the default page from site.net. It contacts site.net and gets that information.

   f) the third relay receives the response from site.net. It encrypts that data with its cryptographic key (which your client knows) and sends that to the second relay. The second relay does the same, adding another layer of encryption based on its key (which your client also knows) and sends it to the first relay, which does the same and sends the final triply encrypted data to your client. The client then decrypts each layer of encryption with the keys it obtained from the relays when it formed the circuit, and finally what site.net said is known and displayed.

As you can see in steps a), b) and c) the client (you) is multiply encrypting communication to different parts of your circuit through the Tor network.
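Steps a) to e) above can be modelled with a short Python sketch. It is a toy and should be read as such: the "cipher" is a simple XOR keystream built from a hash function and is not secure, the keys are fixed example values rather than negotiated ones, and the message layout (a small JSON wrapper) is invented for this illustration. None of this is Tor's real message format or cryptography; it only shows the layering.

```python
import hashlib
import json

# Hypothetical per-relay keys; in Tor these are negotiated when the circuit is formed.
KEYS = {"relay1": b"key-one", "relay2": b"key-two", "relay3": b"key-three"}


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher (NOT secure): XOR data with a hash-derived stream."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def wrap(key: bytes, next_hop: str, inner: bytes) -> bytes:
    """Add one layer: 'forward this inner blob to next_hop', encrypted with key."""
    layer = json.dumps({"next": next_hop, "data": inner.hex()}).encode("ascii")
    return keystream_xor(key, layer)


def unwrap(key: bytes, blob: bytes):
    """Remove one layer, returning (next_hop, inner blob)."""
    layer = json.loads(keystream_xor(key, blob).decode("ascii"))
    return layer["next"], bytes.fromhex(layer["data"])


# The client builds the onion from the inside out: steps a), b) and c).
request = b"GET default page"
onion = wrap(KEYS["relay3"], "site.net", request)
onion = wrap(KEYS["relay2"], "relay3", onion)
onion = wrap(KEYS["relay1"], "relay2", onion)

# Each relay in turn peels exactly one layer: step e).
hop, blob, path = "relay1", onion, []
while hop in KEYS:
    path.append(hop)
    hop, blob = unwrap(KEYS[hop], blob)

print(path, hop, blob)
```

In this model each relay learns only the name of the next hop and receives a blob it cannot read. Only the third relay recovers the request and the name of the site.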
The "layer on layer" encryption is what gives Tor its name: The Onion Routing network.

The end property is that the only part of the entire exchange that knows who is talking to whom is you. You know yourself, the three relays, and the end point (5 things talking). But, each relay only knows itself and the two adjacent components of the communications path. The first relay knows you and the second relay (but not the third relay or end point). The second relay knows the first and the third relays (but not you or the end point). The third relay knows the second relay and the end point (but not you or the first relay).

Tor also sends the 'name to address' translation via the circuit, so that anyone watching the 'name to address' translation sees the third relay making this query, rather than you.

Two additional points are worth noting:

1) Tor uses ephemeral keys, which means that each relay has a long term key which it uses, with each client that connects through it, to generate another key for use with that client, and those keys are different for all clients.

2) No relay shares those ephemeral keys with any other relay or client. The only thing that knows all of the ephemeral keys for the circuit that your client is using is your client.

For a watcher to get useful information out of this set up they need to watch your first relay and your third relay. With that, and using timing, they can with some probability determine that you were talking to a specific end point, and if the end point was not using encryption, what you were saying. Additionally, if the watcher is one of those relays they still cannot know who you were talking to as no relay knows both you and what you are talking to. Finally, if a watcher is watching traffic at any single point within the Tor network they get nothing.
All they see is encrypted traffic moving within the network; they have no idea what is in that traffic (as they don't have the keys) and they don't know the final end points of that traffic.

There are a number of possible attacks against Tor. But, before those are considered, take a moment to think.

* If you are not using Tor and you are using http, then any watcher at any point in the communication knows everything.

* If you are not using Tor but are using https, then every watcher at any point knows who you are talking with, but not what you are saying.

* With Tor and just http the watchers need to watch two things, the first and third relay, which are statistically likely to be geographically disparate. If they do this, and can do the timing correlation analysis, then they can have some confidence that it was you talking to the end point and have the same confidence in what you said.

* If you are using Tor and https, the only entities that know what you said are you and the end point, and watchers who can surveil both the first and third relay and who can do the timing correlation have some probability of confidence that you were saying something unknown to a specific end point.

There is one other important point, and that is that the end point (the site you contacted) thinks it is talking to the Tor network and has no idea where or who you are unless you specifically provide that information. Tor is an "anonymity" network. It is this property, of the end point not knowing who/where the source of the communication is, which provides that anonymity.

A collection of attacks against Tor is now presented. A reader may have noticed that the weakness of this network is the timing itself. This seems to be an inherent problem in "low latency anonymity networks". By "low latency" one means networks that do not deliberately delay communications, but immediately forward on communications.
One may be able to improve the anonymity of the network by introducing some random delay at each point in the network, but this would make the network almost unusable for things like audio and video. It would, however, increase the security of the network as it would be much harder, and perhaps impossible, to perform the correlation analysis.

The other thing that strengthens the network is the number of different people using it. The more diversity there is in the network the harder it is to observe patterns in the network. Additionally, pure volume (the number of people using the network) increases the amount of work that must be done to do correlation attacks.

To attack the Tor network, there are various options. The biggest hurdle is the design of the network itself. One could deploy sufficient network monitoring equipment to monitor the entire internet, or at least a very large part of it, and then perform correlation analysis. Above it was mentioned that the traffic correlation through a proxy/VPN could be largely automated. The challenge with Tor is even larger, as there is an intermediate (second) relay between the first and third, the speed at which it can relay communications may vary, and the second and third relays change over time. This makes the correlation analysis very challenging.

The obvious attack is against the software itself. The software is maintained by a relatively small group of highly competent people who pride themselves on transparency. This makes it difficult for an attacker, even one who expends the time to gain acceptance in the community, to then submit a software change that would not be sufficiently checked.

The next most obvious attack has happened several times: create a large number of Tor relays and submit them to the network. This is the "timing" attack, but instead of just watching the network, one participates in it.
This is analogous to the "watcher actually runs the proxy" problem noted above, and again highlights the importance of the authenticity problem (which largely remains unsolved).

As of early 2016 Tor has about 7000 relays. Let's say that you create 700 relays in disparate network locations and submit those slowly and randomly to the network. This method of joining the network would make it harder for the people who defend the network to notice that something strange was happening. So, you now control about 10% of the Tor relays. On pure probability a tenth of the circuits will have a first relay that you control. Of those, another tenth will also have a third relay that you control. Thus you can de-anonymise about 1% of the circuits (0.1 x 0.1 = 0.01).

This is why "first" relays (called guard relays) are preferentially preserved. People will generally continue to use pre-used first relays. Thus, new elements of the network are less likely to become first nodes and thus the scope for correlation attacks is reduced. Additionally, Tor clients routinely change their circuit. Second and third nodes get changed every ten minutes or so. Thus, a polluted, compromised community of relays has less chance to constantly watch (de-anonymise) persons.

The last attack is the funding attack. Tor was originally fully funded by the US Dept. of Defence (Naval Research Labs). Consider the problem that they were trying to solve: A government employee needs to submit data to the government from a foreign location. The foreign location may have no direct access to US government secure networks, but the internet is accessible. How can the internet be used to allow a government agent to communicate with the government without allowing the foreign organisation to know that they are doing so? Thus, Tor. Assuming that the end point employs secure cryptography, the foreign organisation will have no idea about what is being said, or where it is going.
Could the US Dept. of Defence have built in special access tricks? Yes and no. They may have been there (and may still be there) but it is increasingly difficult over time to maintain these when a group of technically skilled privacy activists of mixed nationality participates in the project and its source code.

Can the funding donor prioritise the work that is done on the project? Yes! The US government continues to be a major sponsor of the project and as such, they can direct attention away from areas which they wish to be untouched. But, beware, this is a double edged sword. Every vulnerability that they maintain is one that can be found by their adversaries. The project is public. Its funding sources are public. Its code and code reviews are public. There are plenty of smart people out there who could find these "problems".

There is another 'side effect' of the use of Tor. This discussion has largely been about avoiding mass surveillance or achieving anonymity. Another 'side effect' is censorship avoidance. If it is not known who it is you are communicating with, then nobody can stop you from communicating with certain end points. The only option is to prevent all communication.

Thus, Tor is a publicly available, anti-mass surveillance, anonymity providing, censorship avoidance, community provided, communications network.