Surveillance, Privacy, Encryption, Anonymity and Censorship

Copyright 2016 Hugo M. Connery
License: Creative Commons Attribution ShareALike 4.0
https://creativecommons.org/licenses/by-sa/4.0/

The Internet, which was created by the US Dept. of Defence
as a fallback communications network during the cold war,
has become a very successful network.  It is now used to 
perform a very large component of all inter-human and 
business communications, and serves as a "library in 
your home/work place".

The success of the Internet is a clear case of support
for standards, by which disparate systems can inter-operate.
One reads the "here is how it works" document and
can then have one's own thing work with the other things
without any specific understanding of how the other things work,
only that they adhere to the standard.

The huge concentration of communication via a single
(well, multiple but consistent) mechanism leads those who wish
to understand humans by their interaction to wish to have access
to those communications.  Thus, advertisers, law enforcement
and intelligence agencies will want access to either the raw
communications or summaries thereof that fit their needs.
There are surely other actors too: polling agencies, large
corporations etc.

Moore's law, the observation that the number of transistors on
a chip (and with it, roughly, computing speed and/or capacity)
doubles about every 2 years, has largely held for several decades.
The result of this is that it is easier to capture the
communications of an entire population and, post capture, filter
that information, than to target specific communications.  This
results in mass surveillance -- it is easier and cheaper.

From the above the internet can be seen as a heavily
used communications platform which is also heavily
surveilled.

If one understands the risks of global surveillance one
may wish to counter that threat.  A little thought can
quickly clarify the risk: imagine that someone knew 
the text, date, time, and place of every single internet
search that you have ever made.  They would surely know
much about you, and that information would be useful to
some groups.  The end purpose of such groups may be 
benign and/or dangerous depending on your situation and
outlook.

But, to counter the threat, one must understand the Internet
itself.  Not all of it, just how some of it works.  Thus, one
must read and understand the standards documents.

The following is the author's best effort at rendering
the contents of those documents understandable by persons
with little expertise in the engineering fields of
computing and network communications.

Let's start with the most trivial thing that persons do
on the internet -- they start a browser, and that browser
loads the default page for a particular internet site.

What happens is as follows:

1. The computer operating system that is being used
asks its long term storage device (disk) for the program
(the browser) and loads that into the short term storage
(memory) of the computer, and then starts running that
program

2. The program asks the operating system how it is that
the program can turn internet names (e.g. site.net)
into an address, so that the program can then use the internet
communication protocols to talk to that address.  The
system that does this 'name to number' translation
for the Internet is called the Domain Name System (DNS).

3. The program then submits the name (site.net) to the
facility that the operating system told it to use to turn
the name into an address, and waits for an answer.

4. Upon receiving an answer the program then
starts talking to the address (which it thinks is "site.net")
and asking it for its default "page".  This "page" is a 
document written in HTML which the program (browser) can
then display.  The communication from the browser runs
through the operating system (which controls the network
interface -- wired or wireless) and reaches the end point
(site.net's address) and the remote site replies and that
communication is handed by the operating system to the 
browser.  
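
Steps 2 to 4 above can be sketched in a few lines of Python.
This is only a minimal sketch and not what a real browser does:
the function names are invented for illustration, site.net is
the placeholder name used above, and plain http on port 80 is
assumed.  Nothing is sent over the network unless
fetch_default_page() is actually called.

```python
import socket

def resolve(name):
    """Steps 2-3: ask the operating system's resolver for an address."""
    infos = socket.getaddrinfo(name, 80, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address as text.
    return infos[0][4][0]

def build_request(name):
    """Step 4: the text sent to ask the site for its default page."""
    return (
        "GET / HTTP/1.1\r\n"
        f"Host: {name}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")

def fetch_default_page(name):
    """Resolve the name, talk to the address, return the raw reply."""
    address = resolve(name)
    with socket.create_connection((address, 80), timeout=10) as conn:
        conn.sendall(build_request(name))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:          # the remote site closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)

# Show the request text only; no network traffic happens here.
print(build_request("site.net").decode("ascii"))
```

The operating system does the real work in both functions: it
performs the name lookup and it carries the bytes to and from
the network interface.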

In all of these communications between the browser (your computer)
and the end point (site.net) each piece of communication displays
the source of the communication (the address of your computer
running the browser) and the end point of the communication (the
address that the browser got from the name to number translation).

And there is the surveillance.  Each piece of communication
on the internet describes in clear readable form the addresses
of both the source (your computer) and the end point (site.net).

Thus, anyone watching the communication knows that one address
is talking to the other.  This was true from the earliest
incarnations of the internet up until the publication of
this text.  Anyone watching the "wire" (backbone communications 
channels) knows who is talking to what.
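
To make this concrete, here is a small sketch of the front part
(the "header") of an Internet Protocol version 4 packet, and of
how little effort it takes to read the two addresses out of it.
The addresses are reserved documentation addresses chosen for
illustration, the function names are invented, and the header
checksum is left at zero for simplicity.

```python
import socket
import struct

def build_ipv4_header(src, dst, payload_len=0):
    """Pack a minimal 20-byte IPv4 header (checksum not computed)."""
    version_ihl = (4 << 4) | 5      # version 4, header of 5 32-bit words
    total_length = 20 + payload_len
    return struct.pack(
        "!BBHHHBBH4s4s",
        version_ihl, 0, total_length,
        0, 0,                       # identification, flags/fragment offset
        64, 6, 0,                   # time to live, protocol 6 (TCP), checksum
        socket.inet_aton(src), socket.inet_aton(dst),
    )

def watcher_reads(header):
    """What anyone on the path can read: bytes 12-15 and 16-19."""
    source = socket.inet_ntoa(header[12:16])
    destination = socket.inet_ntoa(header[16:20])
    return source, destination

header = build_ipv4_header("192.0.2.10", "198.51.100.7")
print(watcher_reads(header))   # ('192.0.2.10', '198.51.100.7')
```

No key or secret is needed to read the addresses; they sit at
fixed positions in every packet so that the network can deliver
it.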

There is a second surveillance mechanism in the above,
and that is the name to address translation (DNS).  If this
communication can be read, then people will know what you
are attempting to communicate with without having to
watch all the traffic.  This communication mechanism
is currently unencrypted and thus leaks all of this
information.  Work to address this glaring problem is
currently (2016) under way in the Internet
standards bodies.
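
The leak is easy to demonstrate.  The sketch below builds a
classic name to address query by hand and then recovers the name
from the raw bytes, as any observer could.  The function names
and the query identifier are invented for illustration, and
nothing is sent over the network.

```python
import struct

def build_dns_query(name, query_id=0x1234):
    """Build a classic DNS query for the address (A) record of a name."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # The name travels as length-prefixed labels: 4 "site" 3 "net" 0
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)   # type A, class IN
    return header + question

def name_seen_by_watcher(packet):
    """Recover the queried name from the raw bytes of the query."""
    labels, pos = [], 12            # the question follows the 12-byte header
    while packet[pos] != 0:
        length = packet[pos]
        labels.append(packet[pos + 1:pos + 1 + length].decode("ascii"))
        pos += 1 + length
    return ".".join(labels)

packet = build_dns_query("site.net")
print(packet)                        # the letters of the name are visible
print(name_seen_by_watcher(packet))  # site.net
```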

To understand further, we must look at "how" your computer
is talking to the site.  Internet based communications are commonly
described using a model called the seven layered stack.  This is one of the most
beautiful things created by computer/network communications
engineering/science.  We need only understand a portion of it.
Note that it is this "OSI model" that allows your browser to talk
to web sites via wired networks, wireless networks, mobile phone
networks and many others.  All of the dirty details are hidden
from the browser and handled at other layers.

One of those layers is the "encryption" layer (more precisely,
the "presentation" layer).  The end point to which your application
is talking may offer (or demand) some form of encryption.  If
this is the case, your application (the browser or other program)
may be able to meet this offer or requirement.  If so, then
the people watching the wire will know where/what you are 
communicating with, but not what you are saying.

The most common form of Internet based encryption happens
via the https protocol.  This is different from the older
but, unfortunately, still too commonly used http protocol.
It is the 's' in https that makes the difference.

In http the watcher knows both who is talking to whom and
what they are saying.  In https the watcher only knows who
is talking to whom but cannot understand the conversation
(unless they know how to break the encryption).

To clarify this, imagine that you are visiting a site which
has recipes.  In http the watcher knows both that you are
visiting the recipe site, and which recipe you are looking
at.  In https the watcher only knows that you are visiting
the recipe site, but not which recipe you are viewing.  In
a banking analogy, they know that you are visiting a specific 
bank, but know nothing about which account or transaction you are
accessing/executing.

It is worth observing that in the above, the recipe site
and bank, both know who is talking to them and what is 
being said.  In the case of the bank, this is essential,
as the bank wishes to only allow the owner of an account
to access it.  For the recipe site, it is not essential
that it knows who is asking it about its recipe.  It may
be interested in knowing how often over various
periods of time certain recipes have been read, but that
information does not require identifying the reader.

For a person who is concerned about surveillance, the 
use of http should be of great concern as anyone who
is watching the traffic can know everything.  Https offers far
more security (the 's' stands for 'secure') but is 
still of some concern -- they know who/what you are 
communicating with.  

Before continuing on the remaining surveillance problem
that watchers know what/who you are communicating with, it
behooves me to expand on the 'secure' in https.  It does
3 things.  Encryption (ensuring that watchers don't understand
what is being said) is only one of them.  The acronym,
fittingly, is CIA: Confidentiality (that's the encryption),
Integrity and Authenticity.

Integrity guarantees that if the 'watchers' (or anyone in
the network path) modify the data sent then the entire
communication will fail.  This ensures that you get what the
site sent, and vice versa.  Or, to put it another way, nobody can
modify the communications without people knowing.
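
One common building block for integrity is a keyed checksum
(a "message authentication code").  The sketch below uses the
HMAC construction from the Python standard library.  It is an
illustration of the idea only: https uses related but more
elaborate constructions, and the key and messages here are made
up.

```python
import hashlib
import hmac

def tag(key, message):
    """Compute a keyed checksum over the message."""
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(key, message, received_tag):
    """Check that the message still matches the checksum sent with it."""
    return hmac.compare_digest(tag(key, message), received_tag)

shared_key = b"key agreed by browser and site"   # hypothetical key
sent = b"recipe: 2 eggs, 100g flour"
sent_tag = tag(shared_key, sent)

tampered = b"recipe: 9 eggs, 100g flour"         # changed in transit

print(verify(shared_key, sent, sent_tag))        # True
print(verify(shared_key, tampered, sent_tag))    # False
```

Someone in the network path who does not know the key cannot
produce a matching checksum for a modified message, so the
change is detected.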

Authenticity means that you are really talking to who
you think you are.  It is possible for watchers (or others)
to fake the communications and sit in the middle and 
forward messages back and forth between you and your
desired end point.  This is called a Man in the Middle
(MITM) attack, and is deadly.  You think all is fine,
but the attacker is reading and/or changing anything
and you don't know.  Authenticity is still a major problem
for the internet, but its discussion is beyond the scope
of this article.  But, be confident that encryption and
integrity are problems that have been solved by the
cryptographic community.  Problems continue to arise,
but they are largely due to outdated solutions continuing
to be used.  Why is that?  Because of Moore's Law;
encryption and integrity solutions are based partially
on the amount of computational work that the attacker
must do to decrypt or modify the message.  As computational
power has increased, the solutions to the encryption and
integrity problems have had to be re-engineered.
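
The point about Moore's Law can be turned into rough arithmetic.
Each extra bit of key length doubles the work of trying every
key, so if computing power doubles every two years then every
two years removes about one bit of protection.  The sketch below
assumes exactly that and nothing more; the figure of 40 bits
being "feasible today" is an illustrative assumption, not a
measurement, and the function name is invented.

```python
def years_until_feasible(key_bits, feasible_bits_today, doubling_years=2.0):
    """Years until trying all 2**key_bits keys costs what trying
    2**feasible_bits_today keys costs today."""
    extra_doublings = max(0, key_bits - feasible_bits_today)
    return extra_doublings * doubling_years

print(years_until_feasible(56, 40))    # 32.0
print(years_until_feasible(128, 40))   # 176.0
```

This ignores cleverer attacks than trying every key, which is
the other reason older solutions have to be retired.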

There are several solutions to the "we know who you are talking
to" problem.  The "what you are saying" problem is solved by
using encryption and integrity.  Below, Proxying, Virtual Private
Networks (VPN) and Onion Routing are discussed as solutions to the
"watchers know who you are talking to" problem.  

A proxy is a computer which will shuttle messages on your
behalf between you and the end point with which you wish to
communicate.  You talk to the proxy and say "I want the default
page from site.net".  The proxy takes this communication and says
to site.net "I want your default page" and when it receives
that information from site.net the proxy sends that back to you.

Now, the watcher sees you talking to the proxy rather
than site.net.  But, if the protocol between you
and the proxy is not encrypted then the watcher can
see who you are asking the proxy to talk to on your
behalf, and you have achieved nothing.

A Virtual Private Network (VPN) is just like a proxy, but
it automatically involves encryption between you and the VPN.
In this case, the watcher knows that you are talking to the VPN,
but nothing else.

There are two key risks to consider: that the watcher is
also watching the proxy/VPN, or that the watcher is the proxy/VPN!

In the first case, by knowing both what is coming from your
computer, and when it is coming, and what is passing through 
the VPN and when it is passing through, the watcher
can correlate these network flows based on times of communication
and thus identify who you are talking to via the VPN.

If the watcher has this power, you have achieved nothing
more than making the watcher's job a little harder.  But, this
additional challenge of doing the timing correlation can largely
be automated and thus, one has achieved nothing more than making
the watcher watch more things (you and the VPN).
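
A toy version of the timing correlation shows how little is
needed to automate it.  The sketch assumes the watcher has two
lists of timestamps (in seconds): when your packets entered the
VPN, and when packets left the VPN towards each destination.
All names, timestamps and the half second window are invented
for illustration; real analysis is statistical and far more
careful.

```python
def correlation_score(entry_times, exit_times, max_delay=0.5):
    """Fraction of packets entering the VPN that are followed, within
    max_delay seconds, by a packet leaving it for one destination."""
    if not entry_times:
        return 0.0
    matched = 0
    for t_in in entry_times:
        if any(0 <= t_out - t_in <= max_delay for t_out in exit_times):
            matched += 1
    return matched / len(entry_times)

def most_likely_destination(entry_times, exits_by_destination):
    """Score every destination and return the best match with all scores."""
    scores = {
        destination: correlation_score(entry_times, times)
        for destination, times in exits_by_destination.items()
    }
    return max(scores, key=scores.get), scores

your_packets = [0.00, 1.30, 1.35, 4.10, 7.80]
vpn_exits = {
    "site.net":      [0.04, 1.33, 1.39, 4.15, 7.86],
    "other.example": [0.90, 2.50, 3.70, 6.00, 9.10],
}

best, scores = most_likely_destination(your_packets, vpn_exits)
print(best)     # site.net
print(scores)   # {'site.net': 1.0, 'other.example': 0.0}
```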

If the watcher is the VPN, then obviously they
know who you are talking to, and they don't even have
to do the correlation analysis.

Finally, we move to Onion Routing.  Onion Routing involves
encryption, like the VPN, but involves more than just one
"proxy".  The term used in Onion Routing for machines that
forward communications amongst the Onion Routing network is
a relay.  So, a VPN uses just one relay.  Onion Routing
uses at least three, and the particular three that you are
using changes over time.

Onion Routing tries to solve both of the problems discussed
above; that watchers may be watching lots of things, and
that they may actually be some of those things.

The largest and most used modern Onion Routing network is called
Tor (The Onion Routing network).  It gets to use the definite
article "The" because it was the first Onion Routing network.
The original work was done by the US Naval Research Laboratory.
Since then, the project has moved into the public sphere.

Today, Tor is a volunteer community driven network, to which
anyone can contribute.  Anyone includes both people who wish to
assist in internet anonymity, and their adversaries (e.g.
idealistic academics, intelligence agencies, law enforcement,
criminal networks etc.).  This is a problem that the Tor community
is aware of and tries to combat (minimise abusive participation
whilst continuing to allow broad constructive participation).

Tor has a collection of partially trusted computers called 
the "Directory Authorities".  These computers know the 
collection of relays which make up the network.

When you connect to Tor the following happens:

1. you contact a directory authority and ask for a list
 of all of the relays in the network (or at least,
 what has changed since last time you asked for information
 about the network)

2. you select 3 relays from that list.  If you have used
a certain relay as your first relay before, and it is
still available, you will prefer to continue to use
that first relay.  The next two relays are selected
randomly.

3. you contact each of those relays, in order, through 
the first, and ask for their cryptographic information.
With this, you have formed a "circuit" from you to
3 relays, and have the information to be able to encrypt
communication from you to each of those relays, and
decrypt return communication from that relay to you.

4. You visit a site:

a) your request is encrypted to the 3rd (last) relay so
that that relay can decrypt that request and send your communication
onwards to the site

b) you form a request for the second relay to forward
the "request" to the third relay.  This request is encrypted 
with the cryptographic details for the second relay.

c) as above, you create an encapsulated, encrypted communication
for the first relay which asks it to send the above on to
the second relay.

d) you send the above information to the first relay.

e) the first relay decrypts the communication and sees that
it should send something that it does not understand to the
second relay.  It does so, and the second relay decrypts 
what it gets.  It sees a request to forward an encrypted
communication that it cannot understand to the third relay.
It does so.  The third relay decrypts what it receives and 
sees what you wanted to do -- get the default page from
site.net.  It contacts site.net and gets that information.

f) the third relay receives the response from site.net.
It encrypts that data with its cryptographic key (which your client
knows) and sends that to the second relay.  The second relay does
the same, adding another layer of encryption based on its key
(which your client also knows) and sends it to the first relay,
which does the same and sends the final triply encrypted data to
your client.  The client then decrypts each layer of encryption
with the keys it has obtained from the relays when it formed
the circuit and finally what site.net said is known
and displayed.
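
Step 2 above (choosing the three relays) can be sketched as
follows.  This is a simplification under stated assumptions:
relays are plain names, the function name is invented, and
every relay is equally likely to be picked.  The real client
applies further rules when choosing relays which are not
modelled here.

```python
import random

def choose_circuit(relays, previous_first=None, rng=random):
    """Pick three distinct relays: a sticky first relay, then two at random."""
    if len(relays) < 3:
        raise ValueError("need at least three relays")
    if previous_first in relays:
        first = previous_first        # keep a first relay used before
    else:
        first = rng.choice(relays)    # no usable history: pick one
    others = [relay for relay in relays if relay != first]
    second, third = rng.sample(others, 2)
    return first, second, third

relays = [f"relay-{n}" for n in range(1, 11)]   # hypothetical relay list
circuit = choose_circuit(relays, previous_first="relay-3")
print(circuit)   # the first entry is always relay-3, the rest vary
```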

As you can see in steps a), b) and c) the client (you)
is multiply encrypting communication to different parts
of your circuit through the Tor network.  The "layer on 
layer" encryption is what gives Tor its name: The Onion
Routing network.
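
The layering can be sketched with a deliberately simple toy
cipher.  To be loud about the assumptions: this is NOT the
cipher Tor uses and must never be used to protect anything; the
keys are made up; and the forwarding instructions that each
layer carries in reality are left out.  Only the wrapping and
peeling of layers is shown.

```python
import hashlib

def keystream(key, length):
    """Toy keystream: SHA-256 of the key plus a counter.  Not secure."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def xor_layer(key, data):
    """Add or remove one layer (XOR with the same keystream undoes itself)."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

def wrap(request, keys_first_to_last):
    """Client side: encrypt for the last relay first, then work outwards."""
    onion = request
    for key in reversed(keys_first_to_last):
        onion = xor_layer(key, onion)
    return onion

keys = [b"key-relay-1", b"key-relay-2", b"key-relay-3"]   # hypothetical
request = b"GET default page of site.net"
onion = wrap(request, keys)

after_first = xor_layer(keys[0], onion)          # relay 1 peels its layer
after_second = xor_layer(keys[1], after_first)   # relay 2 peels its layer
after_third = xor_layer(keys[2], after_second)   # relay 3 sees the request

print(after_first == request)    # False
print(after_second == request)   # False
print(after_third == request)    # True
```

The reply travels the other way: each relay adds its layer and
the client, which holds all three keys, removes them all.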

The end property is that the only part of the entire exchange
that knows who is talking to who is you.  You know yourself,
the three relays, and the end point (5 things talking).  But,
each relay only knows itself and the two adjacent components
of the communications path.  The first relay knows you
and the second relay (but not the third relay or end point).
The second relay knows the first and the third relays (but not you 
or the end point), the third relay knows the second relay and 
the end point (but not you or the first relay).
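
The "who knows whom" property can be checked mechanically.  The
sketch below models only direct contact between neighbours on
the path; the names are placeholders and the function name is
invented.  (You, the client, additionally know the whole path
because you built it.)

```python
def who_talks_to_whom(path):
    """Map each hop to the neighbours it is in direct contact with."""
    contacts = {}
    for i, node in enumerate(path):
        neighbours = set()
        if i > 0:
            neighbours.add(path[i - 1])
        if i < len(path) - 1:
            neighbours.add(path[i + 1])
        contacts[node] = neighbours
    return contacts

path = ["you", "relay 1", "relay 2", "relay 3", "site.net"]
contacts = who_talks_to_whom(path)

for relay in path[1:4]:
    sees_both_ends = {"you", "site.net"} <= contacts[relay]
    print(relay, sorted(contacts[relay]), "sees both ends:", sees_both_ends)
```

No relay is in contact with both ends, which is exactly the
property described above.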

Tor also sends the 'name to address' translation via the
circuit, so that anyone watching the 'name to address'
translation sees the third relay making this query, rather
than you.

Two additional points are worth noting:

1) Tor uses ephemeral keys, which means that each relay 
has a long term key which it uses with each client that
connects through it, to generate another key for the use
with that client, and those keys are different for all 
clients.

2) No relay shares those ephemeral keys with any other
relay or client.  The only thing that knows all of the 
ephemeral keys for the circuit that your client is using
is your client.
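
One well known way for two parties to arrive at a fresh key that
nobody else learns is a Diffie-Hellman style exchange.  The toy
below illustrates that general idea only.  It is not Tor's
actual handshake, it leaves out the relay's long term key (which
in practice lets the client check that it is talking to the
right relay), and the numbers are far too small for real use.
All names are invented.

```python
import secrets

P = 2**127 - 1   # a prime; far too small for real use
G = 3

def new_keypair():
    """Create a fresh (ephemeral) private value and its public value."""
    private = secrets.randbelow(P - 3) + 2    # random value in [2, P-2]
    public = pow(G, private, P)
    return private, public

def shared_secret(my_private, their_public):
    """Both sides compute the same number without ever sending it."""
    return pow(their_public, my_private, P)

# One exchange between a client and a relay.
client_private, client_public = new_keypair()
relay_private, relay_public = new_keypair()

key_at_client = shared_secret(client_private, relay_public)
key_at_relay = shared_secret(relay_private, client_public)
print(key_at_client == key_at_relay)   # True
```

Because fresh private values are created for every exchange,
each client ends up with a different key for the same relay.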

For a watcher to get useful information out of this set up
they need to watch your first relay and your third relay.
With that, and using timing, they can with some probability
determine that you were talking to a specific end point, and
if the end point was not using encryption, what you were 
saying.

Additionally, if the watcher is one of those relays they
still cannot know who you were talking to as no relay
knows both you and what you are talking to.

Finally, if a watcher is watching traffic at any single
point within the Tor network they get nothing.  All they
see is encrypted traffic moving within the network and they
have no idea what is in that traffic (as they don't have
the keys) and they don't know the final end points of that
traffic.

There are a number of possible attacks against Tor.
But, before that is considered, take a moment to think.  

* If you are not using Tor and you are using http, then any
watcher at any point in the communication knows everything.

* If you are not using Tor but are using https, then every watcher
at any point knows who you are talking with.

* With Tor and just http the watchers need to watch two things,
the first and third relay, which are statistically likely to be
geographically disparate.  If they do this, and can do the timing
correlation analysis then they can have some confidence that it
was you talking to the end point and have the same confidence in
what you said.

* If you are using Tor and https, the only entities that know
what you said are you and the end point, and watchers who can
surveil both the first and third relay and who can do the timing
correlation have some probability of confidence that you were
saying something unknown to a specific end point.

There is one other important point, and that is that the 
end point (the site you contacted) thinks it is talking to
the Tor network and has no idea where or who you are 
unless you specifically provide that information.  Tor is
an "anonymity" network.  It is this property of the end point
not knowing who/where the source of the communication is which
provides that anonymity.

A collection of attacks against Tor are now presented.

A reader may have noticed that the weakness of this network
is the timing itself.  This seems to be an inherent problem
in "low latency anonymity networks".  By "low latency" one
means networks that do not deliberately delay communications,
but immediately forward on communications.  One may be able
to improve the anonymity of the network by introducing some
random delay at each point in the network, but this would
make the network almost unusable for things like audio and video.
It would, however, increase the security of the network
as it would be much harder, and perhaps impossible, to perform 
the correlation analysis.  The other thing that strengthens
the network is the number of different people using it.
The more diversity there is in the network the harder it
is to observe patterns in the use of the network.  Additionally,
pure volume (the number of people using the network) increases
the amount of work that must be done to do correlation
attacks.

To attack the Tor network, there are various options.  The
biggest hurdle is the design of the network itself.  One could
deploy sufficient network monitoring equipment to monitor the
entire, or at least a very large part of, the internet and then 
perform correlation analysis.  Above it was mentioned that
the traffic correlation through a proxy/VPN could be largely
automated.  The challenge for Tor is even larger, as there
is an intermediate (second) relay between the first and third and
the speed at which it can relay communications may vary, and
the second and third relays change over time.  This makes
the correlation analysis very challenging.

The obvious attack is against the software itself.  The software
is maintained by a relatively small group of highly
competent people who pride themselves on transparency.
This makes it difficult to expend the time to 
gain acceptance in the community and then submit a software
change that would not be sufficiently checked.

The next most obvious attack has happened several times; 
create a large number of Tor relays and submit them to the 
network.  This is the "timing" attack, but instead of just
watching the network, one participates in it.  This is analogous
to the "watcher actually runs the proxy" problem noted above,
and again highlights the importance of the authenticity 
problem (which largely remains unsolved).

As of early 2016 Tor has about 7000 relays.  Let's say that you
create 700 relays in disparate network locations and submit
those slowly and randomly to the network.  This method of joining
the network would make it harder for the people who defend the
network to notice that something strange was happening.

So, you now control about 10% of the Tor relays.  On pure
probability a tenth of the circuits will have a first relay that
you control.  Of those, another tenth will also have a third
relay that you control.  Thus you can de-anonymise about 1% of the
circuits.  This is why "first" relays (called guard relays) are
preferentially preserved.  People will generally continue to use
pre-used first relays.  Thus, new elements of the network are less
likely to become first nodes and thus the amount of correlation
attack is reduced.  Additionally, Tor clients routinely change
their circuit.  Second and third nodes get changed every ten
minutes or so.  Thus, a polluted, compromised community of relays
has less chance to constantly watch (de-anonymise) persons.
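
The 1% figure can be checked both by arithmetic and by a small
simulation.  The sketch assumes what the "pure probability"
argument assumes: every relay is equally likely to be chosen,
and neither the preference for pre-used first relays nor any
other selection rule is modelled.  The function names are
invented.

```python
import random

def expected_fraction(adversary_relays, total_relays):
    """Chance that both the first and the third relay are the attacker's."""
    share = adversary_relays / total_relays
    return share * share

def simulate(adversary_relays, total_relays, circuits=100_000, seed=1):
    """Build many random circuits and count the fully watched ones."""
    rng = random.Random(seed)
    compromised = 0
    for _ in range(circuits):
        first, third = rng.sample(range(total_relays), 2)
        # Relays numbered below adversary_relays belong to the attacker.
        if first < adversary_relays and third < adversary_relays:
            compromised += 1
    return compromised / circuits

print(expected_fraction(700, 7000))   # close to 0.01
print(simulate(700, 7000))            # also close to 0.01
```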

The last attack is the funding attack.  Tor was originally fully
funded by the US Dept of Defence (Naval Research Laboratory).  Consider
the problem that they were trying to solve: A government employee
needs to submit data to the government from a foreign location.
The foreign location may have no direct access to US government
secure networks, but the internet is accessible.  How can the
internet be used to allow a government agent to communicate with
the government without allowing the foreign organisation to know
that they are doing so?  Thus, Tor.

Assuming that the end point employs secure cryptography, the
foreign organisation will have no idea about what is being said,
or where it is going.

Could the US Dept. of Defence have built in special access tricks?
Yes and no.  They may have been there (and may still be there)
but it is increasingly difficult over time to maintain these
when a group of mixed nationality technically skilled privacy
activists participate in the project and its source code.

Can the funding donor prioritise the work that is done on the
project?  Yes!  The US government continues to be a major sponsor
of the project and as such, they can direct attention away from
areas which they wish to be untouched.  But, beware this is a
double edged sword.  Every vulnerability that they maintain is one
that can be found by their adversaries.  The project is public.
Its funding sources are public.  Its code and code reviews are
public.  There are plenty of smart people out there who could
find these "problems".

There is another 'side effect' of the use of Tor.  This discussion
has largely been about avoiding mass surveillance or achieving
anonymity.  Another 'side effect' is censorship avoidance.
If it is not known who it is you are communicating with, then
nobody can stop you from communicating with certain end points.
The only option is to prevent all communication.  Thus, Tor is a
publicly available, anti-mass surveillance, anonymity providing, 
censorship avoidance, community provided, communications network.