Surveillance, Privacy, Networks and Encryption

The Internet, which grew out of US Dept. of Defence funded research into robust communications networks during the Cold War, has become a very successful network. It now carries a very large share of all communication between people (4th generation mobile phone networks are packet-switched rather than circuit-switched and use the Internet as a backbone; email, Internet Relay Chat and on-line business all run over it), and it serves as a "library in your home or workplace" (search).

The success of the Internet is a clear case for standards, by which disparate systems can inter-operate. One reads the "here is how it works" document and can then make one's own thing work with the other things, without any understanding of how those other things work internally. It is worth comparing the expansion and growth of the Internet with that of other networks, like Google and Facebook: all are successful, but each used different mechanisms for its growth. The core understanding is that the value of a network grows with the number of "people" using it. This reflects the "competition is a sin" mantra of oligarchs. More on that below.

Additionally, the huge concentration of communication in a single (well, multiple but consistent) mechanism leads those who wish to understand humans through their interactions to want access to those communications. Thus, advertisers and intelligence agencies must have access to either the raw communications or summaries thereof that fit their needs. There must be other actors too: polling agencies, large corporations, etc.

Moore's law, popularly stated as the prediction that computing speed and capacity would double roughly every 18 months, largely held for decades. The result is that it is easier to capture the communications of an entire population and filter that information after capture than to target specific communications. This results in mass surveillance -- it is cheaper.
Given the above, the author hopes that it is clear that the Internet is now a global surveillance platform. If one understands the risks of global surveillance, one may wish to counter that threat. A little thought quickly clarifies the risk: imagine that someone knew every single Internet search that you have ever made, along with its date, time, and place. They would surely know much about you, and that information would be useful to some groups (see above for advertisers and intelligence agencies).

But, to counter the threat, one must understand the Internet itself. Not all of it, just how it works. Thus, one must read and understand the standards documents. The following is the author's best effort at rendering the contents of those documents understandable to persons with little expertise in the engineering fields of computing and network communications.

Let's start with the most trivial thing that people do on the Internet -- they start a browser, and that browser loads the default page for a particular Internet site. What happens is as follows:
1. The operating system of the computer that you are using asks its long term storage device (disk) for the program (the browser), loads it into the short term storage (memory) of the computer, and then starts running it.
2. The program asks the operating system how it can turn Internet names (e.g. site.net) into numbers, so that the program can then use the Internet communication protocols to talk to that number (address).
3. The program then submits the name (site.net) to the place that the operating system told it to use to turn the name into a number, and waits for an answer.
4. Upon receiving an answer (a number) the program starts talking to that number (which it thinks is "site.net") and asks it for its default "page". This "page" is a document written in HTML which the program (browser) can then display.
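Steps 2 to 4 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any real browser is written: the function names are invented for this example, "site.net" is the placeholder name used in the text, and the request is the plain (unencrypted) http form.

```python
import socket


def resolve(name: str, port: int = 80) -> str:
    """Steps 2 and 3: ask the operating system's resolver to turn a name into an address."""
    infos = socket.getaddrinfo(name, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return infos[0][4][0]


def build_request(host: str, path: str = "/") -> bytes:
    """Step 4: the plain-text request a browser sends to ask for the default page."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch_default_page(name: str) -> bytes:
    """Resolve the name, connect to the resulting address and read the whole reply."""
    address = resolve(name)
    with socket.create_connection((address, 80), timeout=10) as conn:
        conn.sendall(build_request(name))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


if __name__ == "__main__":
    # Printing the request needs no network access; fetch_default_page() does.
    print(build_request("site.net").decode("ascii"))
```

Note that the name lookup and the page request are two separate conversations, and in this plain form both travel in readable text.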
The communication from the browser runs through the operating system (which controls the network interface -- wired or wireless) and reaches the end point (site.net's number), and the remote site replies. In all of these communications between the browser and the end point, each piece of communication displays the source of the communication (the address of the computer running the browser) and the end point of the communication (the number that the browser got from the name to number translation).

And there is the surveillance. Each piece of communication on the Internet describes in clear readable form the addresses of both the source (your computer) and the end point (site.net). Thus, anyone watching the communication knows that one address is talking to the other. This was true from the earliest incarnations of the Internet up until the publication of this text. Anyone watching the "wire" (the backbone communications channel) knows who is talking to what.

To understand further, we must look at "how" your computer is talking to the site. Internet based communications are commonly described using a model called the seven layer stack. This is one of the most beautiful things created by computer/network communications science. We need only understand a portion of it. Note that it is this layering (the "OSI model") that allows your browser to talk to web sites via wired networks, wireless networks, mobile phone networks and many others. All of the dirty details are hidden from the browser and handled at other layers.

One of those layers is the "encryption" layer (more precisely, the "presentation" layer). The end point to which your application is talking may offer (or demand) some form of encryption. If this is the case, your application (the browser or other program) may be able to meet this offer or requirement. If so, then the people watching the wire will know where/what you are communicating with, but not what you are saying.
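The point that the addresses stay readable even when the content is encrypted can be illustrated with a small sketch. It packs a simplified 20-byte IPv4 header (no options, checksum left at zero) in front of an opaque payload and then reads the addresses back, as a watcher could. The function names and both addresses are made up for this example; the addresses come from ranges reserved for documentation.

```python
import socket
import struct

# Fixed 20-byte IPv4 header without options: version/length, service type,
# total length, id, flags/fragment, TTL, protocol, checksum, source, destination.
IPV4_HEADER = struct.Struct("!BBHHHBBH4s4s")


def build_packet(src: str, dst: str, payload: bytes) -> bytes:
    """Prefix the payload with a simplified IPv4 header (illustration only)."""
    version_ihl = (4 << 4) | 5  # IPv4, header length of five 32-bit words
    total_length = IPV4_HEADER.size + len(payload)
    header = IPV4_HEADER.pack(
        version_ihl, 0, total_length, 0, 0,
        64,                     # time to live
        socket.IPPROTO_TCP,     # protocol carried in the payload
        0,                      # checksum not computed in this sketch
        socket.inet_aton(src),
        socket.inet_aton(dst),
    )
    return header + payload


def watcher_view(packet: bytes):
    """What anyone on the wire can read: both addresses, plus the raw payload bytes."""
    fields = IPV4_HEADER.unpack(packet[:IPV4_HEADER.size])
    src = socket.inet_ntoa(fields[8])
    dst = socket.inet_ntoa(fields[9])
    return src, dst, packet[IPV4_HEADER.size:]


if __name__ == "__main__":
    encrypted = b"\x17\x03\x03 bytes the watcher cannot interpret"
    print(watcher_view(build_packet("192.0.2.10", "198.51.100.7", encrypted)))
```

Whatever is done to the payload, the source and destination fields have to stay readable, because the network uses them to deliver the packet.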
The most common form of Internet based encryption happens via the https protocol. This is different from the older but, unfortunately, still too commonly used http protocol. It is the 's' in https that makes the difference. In http the watcher knows both who is talking to whom and what they are saying. In https the watcher only knows who is talking to whom but cannot understand the conversation (unless they know how to break the encryption).

To clarify this, imagine that you are visiting a site which has recipes. In http the watcher knows both that you are visiting the recipe site and which recipe you are looking at. In https the watcher only knows that you are visiting the recipe site, but not which recipe you are viewing. In a banking analogy, they know that you are visiting a specific bank, but know nothing about which account or transaction you are accessing/executing.

So, to a person who is concerned about surveillance, the use of http should be of great concern. Https offers far more security (which is what the 's' stands for) but is still of some concern -- the watchers know who/what you are communicating with.

Before continuing with the remaining surveillance problem, that watchers know what/who you are communicating with, it behooves me to expand on the 'security' in https. It does 3 things. Encryption (ensuring that watchers don't understand what is being said) is only one of them. The acronym, fittingly, is CIA: Confidentiality (that's the encryption), Integrity and Authenticity. Integrity guarantees that if the 'watchers' (or anyone in the network path) modify the data sent, the change is detected and the entire communication fails. This ensures that you get what they sent, and vice versa. Authenticity means that you are really talking to who you think you are. It is possible for watchers (or others) to fake the communications and sit in the middle, forwarding messages back and forth between you and your desired end point. This is called a Man in the Middle (MITM) attack, and is deadly.
You think all is fine, but the attacker is reading and/or changing everything and you don't know. Authenticity is still a major problem for the Internet, but its discussion is beyond the scope of this article. But be confident that encryption and integrity are problems that have been solved by the cryptographic community. Problems continue to arise, but they are largely due to outdated solutions continuing to be used.

The solutions to the "we know who you are talking to" problem are known as proxies, VPNs and mix networks.

A proxy is a computer which will shuttle messages on your behalf. You talk to the proxy and say "I want to access site.net". The proxy takes this communication and says to site.net "I want your default page", and when it receives that information from site.net it sends it back to you. Now the watcher sees you talking to the proxy rather than to site.net. If you are using an unencrypted protocol to talk to the proxy, the watcher knows everything: that you are using a proxy, what you really want to talk to, what you asked for, and what the site sent back. With encryption, the watcher also needs to be able to watch the proxy to get much of this information; but if they can watch the proxy, the proxy has gained you nothing. The specific risk case is that the watcher controls the proxy, in which case they can identify the sender and recipient and, if the communications are unencrypted, the communications too.

A VPN is just like a proxy, but it automatically involves encryption between you and the VPN. Again, this is of no value if the watcher can also watch the VPN: they know you are talking to the VPN, and with timing analysis can know what you really wanted to communicate with, and also what you asked and what the response was if the communication is unencrypted.

A mix network is a collection of computers that extends the idea of a proxy to multiple steps.
Thus, instead of just shuttling your communication via one intermediary (a single proxy), via the mix network you shuttle it via multiple intermediaries. To do surveillance the watchers then need to watch all the proxies, which makes life harder for them, as the mix network's proxies may be located in widely distributed geographic locations. The "proxies" in a mix network are known as relays (i.e. they relay the communication sent by you amongst themselves). All modern mix networks also mandate encryption between you and the start of the mix network, and between each relay of the mix network.

The largest and most used modern mix network is called Tor (The Onion Router). Tor is a volunteer, community driven network, to which anyone can contribute. Anyone includes both people who wish to help maintain some form of anonymity for Internet usage, and their adversaries (i.e. intelligence agencies and law enforcement). This abusive participation is a problem that the Tor community is aware of and tries to combat.

Tor has a collection of partially trusted computers called the "Directory Authorities". These computers know the collection of relays which make up the network. When you connect to Tor the following happens:
1. You contact a directory authority and ask for a list of all of the nodes in the network.
2. You select 3 nodes from that list, preferring to stick with the first one if you have used it before and it is still there.
3. You contact each of those nodes, in order, through the first, and ask for their cryptographic information. With this, you have formed a "circuit" from you via 3 relays, and have the information needed to encrypt communication from you to each of those relays.
4. You visit a site:
a) Your request is encrypted to the 3rd (last) relay so that it can decrypt that request and send your communication onwards to the site.
b) You form a request for the second relay to forward some communication to the third relay.
The communication to be forwarded is the previously encrypted communication for the third relay. You then encrypt that whole thing with the cryptographic details for the second relay.
c) As above, you create an encapsulated, encrypted communication for the first relay which asks it to send the above on to the second relay.
d) You send the above information to the first relay.
e) The first relay decrypts the communication and sees that it should send something that it does not understand to the second relay. It does so, and the second relay decrypts what it gets. It sees a request to forward an encrypted communication that it cannot understand to the third relay. It does so. The third relay decrypts what it receives and it knows what you wanted to do -- get the default page from site.net. It contacts site.net and gets that information.
f) The reply travels back along the same circuit, gaining a layer of encryption at each relay. The third relay encrypts the data from site.net with the key it shares with you and sends it to the second relay. The second relay adds its own layer of encryption and sends the result on to the first relay, which adds its layer and sends that to you. You remove all three layers and the page is displayed.

As you can see in steps a), b) and c), the client (you) is multiply encrypting communication to different parts of your circuit through the Tor network. This "layer on layer" encryption is what gives Tor its name: The Onion Router.

The end property is that the only part of the entire exchange that knows who is talking to whom is you. You know yourself, the three relays, and the end point (5 things talking). But each relay only knows two things. The first relay knows you and the second relay. The second relay knows the first and the third (but not you or the end point). The third relay knows the second relay and the end point (but not you or the first relay).
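The layering in steps a) to e) can be sketched as follows. This is a toy model and not Tor's actual cell format or cryptography: the cipher is a simple XOR keystream built from SHA-256 that must never be used for real secrecy, and the relay names, keys and function names are all invented for the illustration.

```python
import hashlib
import json


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256 counter keystream. Toy only, NOT secure.

    Applying it twice with the same key returns the original data.
    """
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def wrap(key: bytes, next_hop: str, inner: bytes) -> bytes:
    """One onion layer: 'forward this blob to next_hop', readable only with key."""
    layer = json.dumps({"next": next_hop, "data": inner.hex()}).encode()
    return toy_cipher(key, layer)


def peel(key: bytes, blob: bytes):
    """What a relay does: remove its own layer, learn only the next hop."""
    layer = json.loads(toy_cipher(key, blob))
    return layer["next"], bytes.fromhex(layer["data"])


def build_onion(circuit, destination: str, request: bytes) -> bytes:
    """circuit is an ordered list of (relay_name, key) pairs, first relay first."""
    next_hops = [name for name, _ in circuit[1:]] + [destination]
    blob = request
    # Wrap from the last relay outwards so the first relay's layer ends up outermost.
    for (_, key), next_hop in reversed(list(zip(circuit, next_hops))):
        blob = wrap(key, next_hop, blob)
    return blob


if __name__ == "__main__":
    circuit = [("relay1", b"key-one"), ("relay2", b"key-two"), ("relay3", b"key-three")]
    blob = build_onion(circuit, "site.net", b"GET / HTTP/1.1")
    for name, key in circuit:
        hop, blob = peel(key, blob)
        print(name, "forwards to", hop)
```

Each call to peel() reveals only the name of the next hop; the request itself becomes readable only after the third relay's layer has been removed.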
For a watcher to get useful information out of this setup they need to watch your first relay and your third relay. With that, and using timing, they can with some probability determine that you were talking to a specific end point and, if the end point was not using encryption, what you were saying.

There are a number of possible attacks against the Tor (or other mix) network(s). But, before those are considered, take a moment to think.
* If you are not using Tor and you are using http, then any watcher at any point in the communication knows everything.
* If you are not using Tor but are using https, then every watcher at any point knows who you are talking with, but not what you are saying.
* With Tor and just http, the watchers need to watch two things, the first and third relay, which are statistically likely to be geographically disparate. If they do this, and can do the timing correlation analysis, then they can have some confidence that it was you talking to the end point, and the same confidence in what you said.
* If you are using Tor and https, the only entities that know what you said are you and the end point. Watchers who can surveil both the first and third relay and who can do the timing correlation have some probability of confidence that you were saying something unknown to a specific end point.

A reader may have noticed that the weakness of this mix network is the timing itself. The original mix networks were designed for email. They would wait until a buffer of messages was filled (or some long timeout occurred) and then send the messages on in mixed order. This is a high latency network (i.e. messages might take some time to go through). Tor is a low latency anonymity network, and is thus always vulnerable to timing attacks by a global adversary (who still has to do considerable work). Tor-style layered encryption combined with the deliberate delays of the original mix networks is the best known form of anonymity network.
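The buffering behaviour of the original email mixes described above can be sketched as a small class. This is a minimal illustration only; the class and method names are invented here, and the timeout is left to the caller (who would invoke flush() from a timer) rather than implemented.

```python
import random


class BatchingMix:
    """Hold messages until `threshold` have arrived, then release them shuffled."""

    def __init__(self, threshold: int, rng=None):
        self.threshold = threshold
        self.rng = rng or random.Random()
        self.pool = []

    def receive(self, message):
        """Store the message; return a shuffled batch once the pool is full, else []."""
        self.pool.append(message)
        if len(self.pool) < self.threshold:
            return []
        return self.flush()

    def flush(self):
        """Release everything held, in random order (also usable on a timeout)."""
        batch, self.pool = self.pool, []
        self.rng.shuffle(batch)
        return batch


if __name__ == "__main__":
    mix = BatchingMix(threshold=3)
    for msg in ["from A", "from B", "from C"]:
        print(mix.receive(msg))
```

Because the messages leave together and in a random order, a watcher can no longer match an outgoing message to an incoming one by arrival time alone; the price is the delay while the pool fills.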
Assuming that the software is accurately doing what it purports, that the inter-network encryption is strong, and that the end point supports strong encryption, then it is almost impossible to identify the sender, the recipient and the message. The ability to decipher the message depends on the end point's choice of encryption (and the security of its key). The sender's and recipient's identities are secured by the mix network and its delays and encryption.

To attack the Tor network, there are various options. The biggest hurdle is the design of the network itself. One can deploy sufficient network monitoring equipment to monitor the entire Internet (difficult) and then perform correlation analysis (expensive). The cost of combining both to de-anonymise all Tor traffic is prohibitive, and this is supported by the fact that the author knows of no evidence of a Tor de-anonymisation that was not helped by people using poor "operational security" practices (more on that later).

The obvious attack is against the software itself. Tor is a very successful network, but its software is maintained by relatively few people. This makes it more difficult for an attacker to spend the time to gain acceptance in the community and then submit a software change that would not be checked sufficiently. Possible, but difficult.

The next most obvious attack has happened several times: create a large number of Tor relays and submit them to the network. This is the "timing" attack, but instead of just watching the network, one participates in it. This is analogous to the "watcher actually runs the proxy" problem noted above, and again highlights the importance of the authenticity problem (which largely remains unsolved). Say you build and submit enough relays to control a third of the Tor relays. On pure probability you will control both the first and the third relay in a third times a third of the circuits through the network. Thus you could de-anonymise a ninth of the network. This is why "first" nodes (called guard nodes) are preserved.
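The one-ninth figure above can be checked with a short calculation and a simulation. The sketch assumes that the first and third relays are picked independently and uniformly at random, which is the "pure probability" simplification used in the text and not how a real Tor client weights its choices; the function names are invented for this example.

```python
import random
from fractions import Fraction


def exact_compromised_share(adversary_share):
    """Chance that both the first and the third relay belong to the adversary."""
    return adversary_share * adversary_share


def simulate(adversary_share: float, circuits: int, rng: random.Random) -> float:
    """Build many circuits at random and count those watched at both ends."""
    hits = 0
    for _ in range(circuits):
        first_bad = rng.random() < adversary_share
        third_bad = rng.random() < adversary_share
        if first_bad and third_bad:
            hits += 1
    return hits / circuits


if __name__ == "__main__":
    print(exact_compromised_share(Fraction(1, 3)))
    print(simulate(1 / 3, 100_000, random.Random()))
```

The second relay does not enter the calculation: the timing correlation only needs the two ends of the circuit.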
People will generally continue to use previously used first nodes. Thus, new elements of the network are less likely to become first nodes, and thus the scope for correlation attacks is reduced. Additionally, Tor clients routinely change their circuit. Second and third nodes get changed every ten minutes or so. Thus, a polluted, compromised community of relays has less chance to constantly watch (de-anonymise) persons.

The last attack I will consider is the funding attack. Tor was originally fully funded by the US Dept. of Defence (Naval Research Labs). Consider the problem that they were trying to solve: a government employee needs to submit data to the government from a foreign location. The foreign location may have no direct access to US government secure networks, but the Internet is accessible. How can the Internet be used to allow a government agent to communicate with the government without allowing the foreign organisation to know that they are doing so? Hence the above described three hop encrypted proxy setup that is Tor. Assuming that the end point employs secure cryptography, the foreign organisation will have no idea about what is being said, or where it is going. One can assume that if a variant of Tor is still being used by the US Dept. of Defence or other agencies, then it is, as outlined above, a high latency network.

Could the DoD have built in special access tricks? Yes and no. They may have been there (and may still be there), but it is increasingly difficult over time to maintain these when a group of non-US, technically skilled privacy activists controls the project and its source code. Can the funding donor prioritise the work that is done on the project? Yes! The US government continues to be a major sponsor of the project and, as such, it can direct attention away from areas which it wishes to remain untouched. But beware, this is a double edged sword. Every vulnerability that they maintain is one that can be found by their adversaries.
The project is public. Its funding is public. Its code and code reviews are public. There are plenty of smart people out there who can find these "problems". As much as the National Security Agency has shown itself to be more focussed on the "attack" side of its dual mission, the Tor project (from the Naval Research Labs) has placed itself in public hands and let in all of the disinfecting power of sunlight.