Surveillance, Privacy, Networks and Encryption


The Internet, which was created by the US Dept. of Defence
as a fallback communications network during the cold war,
has become a very successful network.  It now carries a
very large share of all inter-human communication (email,
internet relay chat, on-line business, and 4th generation
mobile phone networks, which use the packet-switched
internet rather than circuit switching as their backbone),
and serves as a "library in your home/work place" (search).

The success of the Internet is a clear case in support
of standards, by which disparate systems can inter-operate.
Thus, you read the "here is how it works" document and
can then have your thing work with the other things
without any understanding of how the other things work.

It is worth comparing the expansion and growth of
the Internet with that of other networks, like Google and
Facebook, as all are successful but each uses a different
mechanism for its growth.  The core understanding is that
the value of a network lies in the number of "people" using
it.  This reflects the "competition is a sin" mantra of
oligarchs.  More on that below. **

Additionally, the huge concentration of communication
via a single (well, multiple but consistent) mechanism
leads those who wish to understand humans by their interaction
to wish to have access to those communications.  Thus,
advertisers and intelligence agencies want access
to either the raw communications or summaries thereof
that fit their needs.  There must be other actors too:
polling agencies, large corporations etc.

Moore's law, the prediction that computing would continuously
improve in speed and capacity, doubling roughly every 18 months,
has largely been confirmed for two decades.  The result of
this is that it is easier to capture the communications of
an entire population and filter that information after capture
than to target specific communications.  This results in
mass surveillance -- it is cheaper.

Given the above, the author hopes that it is clear that
the Internet is now a global surveillance platform.

If one understands the risks of global surveillance one
may wish to counter that threat.  A little thought can
quickly clarify the risk: imagine that someone knew
every single internet search that you have ever made,
along with the date, time, and place at which you made
it.  They would surely know much about you, and that
information would be useful to some groups (see above
for advertisers and intelligence agencies).

But, to counter the threat, one must understand the 
Internet itself.  Not all of it, just how it works.
Thus, one must read and understand the standards documents.

The following is the author's best effort at rendering
the contents of those documents understandable by persons
with little expertise in the engineering fields of 
computing and network communications.

Let's start with the most trivial thing that persons do
on the internet -- they start a browser, and that browser
loads the default page for a particular internet site.

What happens is as follows:

1. The computer operating system that you are using
asks its long term storage device (disk) for the program
(the browser) and loads that into the short term storage
(memory) of the computer, and then starts running that
process.

2. The program asks the operating system how it is that
the program can turn internet names (e.g. site.net)
into a number, so that the program can then use the internet
communication protocols to talk to that number (address).

3. The program then submits the name (site.net) to the
place that the operating system told it to use to turn
the name into a number, and waits for an answer.

4. Upon receiving an answer (a number) the program then
starts talking to the number (which it thinks is "site.net")
and asking it for its default "page".  This "page" is a 
document written in HTML which the program (browser) can
then display.  The communication from the browser runs
through the operating system (which controls the network
interface -- wire or wireless) and reaches the end point
(site.net's number) and the remote site replies.  In all
of these communications between the browser and the end
point each piece of communication displays the source of 
the communication (the address of the computer running
the browser) and the end point of the communication (the
number that the browser got from the name to number translation).

And there is the surveillance.  Each piece of communication
on the internet describes in clear readable form the addresses
of both the source (your computer) and the end point (site.net).

Thus, anyone watching the communication knows that one address
is talking to the other.  This was true from the earliest
incarnations of the internet up until the publication of
this text.  Anyone watching the "wire" (a backbone communications
channel) knows who is talking to what.
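
To make this concrete, here is a minimal sketch in Python that
packs the fixed 20 byte header used by version 4 of the internet
protocol and then reads the two addresses back out, as any
watcher could.  The addresses are reserved documentation
addresses chosen for illustration, the function names are the
author's own, and the checksum is left at zero for brevity.

```python
import socket
import struct

def build_ipv4_header(src: str, dst: str, payload_len: int) -> bytes:
    """Pack a minimal 20-byte IPv4 header (checksum left at zero)."""
    version_ihl = (4 << 4) | 5           # version 4, header of 5 x 32-bit words
    total_length = 20 + payload_len
    return struct.pack(
        "!BBHHHBBH4s4s",
        version_ihl, 0, total_length,
        0, 0,                            # identification, flags/fragment offset
        64, 6, 0,                        # TTL, protocol 6 (TCP), checksum
        socket.inet_aton(src),           # source address, in the clear
        socket.inet_aton(dst),           # destination address, in the clear
    )

def watcher_view(packet: bytes) -> tuple[str, str]:
    """Read the source and destination addresses from bytes 12 to 19."""
    src, dst = struct.unpack("!4s4s", packet[12:20])
    return socket.inet_ntoa(src), socket.inet_ntoa(dst)

# Illustrative addresses only (reserved for documentation use).
header = build_ipv4_header("192.0.2.10", "198.51.100.7", payload_len=0)
print(watcher_view(header))   # ('192.0.2.10', '198.51.100.7')
```

No key or secret is needed to run watcher_view; the addresses
sit at a fixed position in every packet.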

To understand further, we must look at "how" your computer
is talking to the site.  Internet based communications are
commonly described with a model called the seven layer stack.
This is one of the most beautiful things created by
computer/network communications science.  We need only
understand a portion of it.  Note that it is this "OSI model"
that allows your browser to talk to web sites via wired
networks, wireless networks, mobile phone networks and many
others.  All of the dirty details are hidden from the browser
and handled at other layers.

One of those layers is the "encryption" layer (more precisely,
the "presentation" layer).  The end point to which your application
is talking may offer (or demand) some form of encryption.  If
this is the case, your application (the browser or other program)
may be able to meet this offer or requirement.  If so, then
the people watching the wire will know where/what you are 
communicating with, but not what you are saying.

The most common form of Internet based encryption happens
via the https protocol.  This is different from the older
but, unfortunately, still too commonly used http protocol.
It is the 's' in https that makes the difference.

In http the watcher knows both who is talking to whom and
what they are saying.  In https the watcher only knows who
is talking to whom but cannot understand the conversation
(unless they know how to break the encryption).

To clarify this, imagine that you are visiting a site which
has recipes.  In http the watcher knows both that you are
visiting the recipe site, and which recipe you are looking
at.  In https the watcher only knows that you are visiting
the recipe site, but not which recipe you are viewing.  In
a banking analogy, they know that you are visiting a specific 
bank, but know nothing about which account or transaction you are
accessing/executing.
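
The recipe example can be sketched as a small Python model of
what the watcher learns from a web address under each protocol.
This is only a model of the claim made above, not a packet
capture; the site name recipes.example and the function name
watcher_sees are invented for illustration.

```python
from urllib.parse import urlsplit

def watcher_sees(url: str) -> dict:
    """Model of what an observer on the wire learns from one request."""
    parts = urlsplit(url)
    visible = {"site": parts.hostname}      # visible with http and https
    if parts.scheme == "http":
        # Without encryption the page and query travel in the clear.
        visible["page"] = parts.path
        visible["query"] = parts.query
    return visible

# Hypothetical recipe site, used only as an example.
print(watcher_sees("http://recipes.example/cakes/lemon?servings=4"))
# {'site': 'recipes.example', 'page': '/cakes/lemon', 'query': 'servings=4'}
print(watcher_sees("https://recipes.example/cakes/lemon?servings=4"))
# {'site': 'recipes.example'}
```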

So, to a person who is concerned about surveillance, the 
use of http should be of great concern.  Https offers far
more security (which is what the 's' stands for) but is 
still of some concern -- they know who/what you are 
communicating with.

Before continuing on the remaining surveillance problem
of watchers knowing what/who you are communicating with, it
behooves me to expand on the 'security' in https.  It does
3 things.  Encryption (ensuring that watchers don't understand
what is being said) is only one of them.  The acronym,
fittingly, is CIA: Confidentiality (that's the encryption),
Integrity and Authenticity.  Integrity guarantees that
if the 'watchers' (or anyone in the network path) modify
the data sent then the modification will be detected and
the communication will fail.
This ensures that you get what they sent, and vice versa.
Authenticity means that you are really talking to who
you think you are.  Without it, watchers (or others) can
fake the communications and sit in the middle,
forwarding messages back and forth between you and your
desired end point.  This is called a Man in the Middle
(MITM) attack, and is deadly.  You think all is fine,
but the attacker is reading and/or changing anything
and you don't know.  Authenticity is still a major problem
for the internet, but its discussion is beyond the scope
of this article.  But, be confident that encryption and
integrity are problems that have been solved by the
cryptographic community.  Problems continue to arise,
but they are largely due to outdated solutions continuing
to be used.
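
The integrity property can be illustrated with a keyed hash
(HMAC) from the Python standard library.  This is a sketch of
the idea only, not of how https itself protects its records;
the shared key is assumed to have been agreed already, and the
names seal and open_sealed are the author's own.

```python
import hashlib
import hmac

# Assumption: both ends already share this secret key.
KEY = b"shared secret agreed by both ends"

def seal(message: bytes) -> bytes:
    """Prefix the message with a 32-byte HMAC-SHA256 tag."""
    tag = hmac.new(KEY, message, hashlib.sha256).digest()
    return tag + message

def open_sealed(data: bytes) -> bytes:
    """Return the message, or fail if it was changed in transit."""
    tag, message = data[:32], data[32:]
    expected = hmac.new(KEY, message, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("message was modified in transit")
    return message

sealed = seal(b"pay 10 to alice")
print(open_sealed(sealed))               # b'pay 10 to alice'

tampered = sealed[:-5] + b"mallo"        # someone on the path edits the text
try:
    open_sealed(tampered)
    detected = False
except ValueError:
    detected = True
print(detected)                          # True
```

Note that this gives integrity only: the message itself is
still readable by anyone watching.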

The solutions to the "we know who you are talking to" problem 
are known as proxies, VPNs and mix networks.

A proxy is a computer which will shuttle messages on your
behalf.  You talk to the proxy and say "I want to access
site.net".  The proxy takes this communication and says
to site.net "I want your default page" and when it receives
that information from site.net sends that back to you.

Now, the watcher sees you talking to the proxy rather
than site.net.  If you are using an unencrypted protocol
to talk to the proxy, the watcher knows everything: that
you are using a proxy, what you really want to talk to,
what you asked for, and what the site sent back.
With encryption, the watcher also needs to be able to
watch the proxy to get much of this information; but if
they can watch the proxy, you have gained nothing.

The specific risk case is that the watcher controls the
proxy, in which case they can identify the sender and 
recipient, and if the communications are unencrypted,
the communications too.

A VPN is just like a proxy, but it automatically involves
encryption between you and the VPN.  Again, this is of
no value if the watcher can also watch the VPN; they
know you are talking to the VPN, and with timing analysis
can know what you really wanted to communicate with, and
what you asked/what the response was if the communication
is unencrypted.

A mix network is a collection of computers that extend 
the idea of a proxy to multiple steps.  Thus, instead of 
just shuttling your communication via one intermediary
(a single proxy), via the mix network you shuttle it via
multiple intermediaries.  Thus, to do surveillance the 
watchers need to watch all the proxies, which makes life
harder for them, as the mix network's proxies may be located
in widely distributed geographic locations.  The "proxies"
in a mix network are known as relays (i.e.
they relay the communication sent by you amongst themselves).

All modern mix networks also mandate encryption between
you and the start of the mix network, and between each
relay of the mix network.

The largest and most used modern mix network is called
Tor (The Onion Router).  Tor is a volunteer community
driven network, to which anyone can contribute.  Anyone
includes both people who wish to help maintain some form
of anonymity for internet usage, and their adversaries
(i.e. intelligence agencies and law enforcement).  This
problem (abusive participation) is one that the Tor
community is aware of and tries to combat.

Tor has a collection of partially trusted computers called 
the "Directory Authorities".  These computers know the 
collection of relays which make up the network.

When you connect to Tor the following happens:

1. you contact a directory authority and ask for a list
 of all of the nodes in the network

2. you select 3 nodes from that list, preferring to stick
with the first one if you have used it before and it is
still there

3. you contact each of those nodes, in order, through 
the first, and ask for their cryptographic information.
With this, you have formed a "circuit" from you via
3 relays, and have the information to be able to encrypt
communication from you to each of those relays.

4. You visit a site:

a) your request is encrypted to the 3rd (last) relay so
that it can decrypt that request and send your communication
onwards to the site

b) you form a request for the second relay to forward
some communication to the third relay.  That request to be
forwarded is the previously encrypted communication to the
third relay.  You then encrypt that whole thing with the 
cryptographic details for the second relay.

c) as above, you create an encapsulated, encrypted communication
for the first relay which asks it to send the above on to
the second relay.

d) you send the above information to the first relay.

e) the first relay decrypts the communication and sees that
it should send something that it does not understand to the
second relay.  It does so, and the second relay decrypts 
what it gets.  It sees a request to forward an encrypted
communication that it cannot understand to the third relay.
It does so.  The third relay decrypts what it receives and 
it knows what you wanted to do -- get the default page from
site.net.  It contacts site.net and gets that information.

f) the third relay encrypts the data from site.net with 
the key that it shares with you and sends it back to the
second relay.  The second relay cannot read it; it adds its
own layer of encryption, using the key that it shares with
you, and sends it on to the first relay, which does the same.
You receive the result, remove the three layers in turn,
and the page is displayed.

As you can see in steps a), b) and c) the client (you)
is multiply encrypting communication to different parts
of your circuit through the Tor network.  The "layer on 
layer" encryption is what gives Tor its name: The Onion
Router.
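
The layering in steps a) to e) can be sketched in a few lines
of Python.  This is an illustration under loud assumptions: the
"cipher" is a toy built from SHA-256 and is NOT secure, the
relay names and keys are invented, and the real Tor software
negotiates its keys and formats its messages quite differently.
Only the wrapping and peeling order is the point.

```python
import hashlib

def toy_cipher(key: bytes, data: bytes) -> bytes:
    """Toy XOR stream cipher (encrypts and decrypts). NOT secure."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

def wrap(request: bytes, destination: str, route: list) -> bytes:
    """Encrypt for the last relay first, then wrap each earlier relay around it."""
    payload, next_hop = request, destination
    for name, key in reversed(route):
        payload = toy_cipher(key, next_hop.encode() + b"|" + payload)
        next_hop = name
    return payload

def peel(key: bytes, onion: bytes):
    """What one relay does: remove its layer, learn only the next hop."""
    hop, _, inner = toy_cipher(key, onion).partition(b"|")
    return hop.decode(), inner

# Hypothetical relays and keys, for illustration only.
route = [("relay1", b"k1"), ("relay2", b"k2"), ("relay3", b"k3")]
onion = wrap(b"GET default page", "site.net", route)

hops = []
for name, key in route:                  # each relay peels in turn
    next_hop, onion = peel(key, onion)
    hops.append(next_hop)

print(hops)    # ['relay2', 'relay3', 'site.net']
print(onion)   # b'GET default page'
```

Each call to peel reveals one forwarding instruction and an
inner blob that the relay cannot read; only the last relay
ends up holding the request itself.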

The end property is that the only party in the entire exchange
that knows who is talking to whom is you.  You know yourself,
the three relays, and the end point (5 things talking).  But,
each relay only knows two things.  The first relay knows you
and the second relay.  The second relay knows the first and 
the third (but not you or the end point), the third relay 
knows the second relay and the end point (but not you or the 
first relay).

For a watcher to get useful information out of this set up
they need to watch your first relay and your third relay.
With that, and using timing, they can with some probability
determine that you were talking to a specific end point, and
if the end point was not using encryption, what you were 
saying.
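
A very simplified picture of such a timing comparison is given
below in Python.  Every timestamp, site name and delay window
is invented for illustration, and the scoring rule is the
author's own toy; real traffic is far noisier and a real
analysis would be statistical.

```python
def match_score(entry_times, exit_times, min_delay=0.05, max_delay=0.5):
    """Fraction of entry packets followed by an exit packet within the window."""
    hits = 0
    for t in entry_times:
        if any(min_delay <= e - t <= max_delay for e in exit_times):
            hits += 1
    return hits / len(entry_times)

# Seconds at which packets were seen between you and the first relay.
entry = [0.00, 1.20, 1.90, 4.00]

# Seconds at which packets left third relays towards candidate end points.
exits = {
    "site.net":  [0.21, 1.38, 2.15, 4.19],
    "other.org": [0.70, 2.60, 3.10, 5.50],
}

scores = {site: match_score(entry, times) for site, times in exits.items()}
best = max(scores, key=scores.get)
print(scores)   # {'site.net': 1.0, 'other.org': 0.0}
print(best)     # site.net
```

The watcher never decrypts anything here; the pattern of
"when" is enough to make a guess at "who".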

There are a number of possible attacks against the Tor 
(or other mix) network(s).  But, before that is considered,
take a moment to think.

* If you are not using Tor and you are using http, then any
watcher at any point in the communication knows everything.

* If you are not using Tor but are using https, then every watcher
at any point knows who you are talking with.

* With Tor and just http the watchers need to watch two things,
the first and third relay, which are statistically likely to be
geographically disparate.  If they do this, and can do the timing
correlation analysis then they can have some confidence that it
was you talking to the end point and have the same confidence in
what you said.

* If you are using Tor and https, the only entities that know
what you said are you and the end point.  Watchers who can
surveil both the first and third relay and who can do the timing
correlation have some probability of confidence that you were
saying something unknown to a specific end point.

A reader may have noticed that the weakness of this mix network
is the timing itself.  The original mix networks were designed
for email.  They would wait until a buffer of messages was filled
(or some long timeout occurred) and then send the messages on in
mixed order.  This is a high latency (i.e. messages might take
some time to go through) network.  Tor is a low latency anonymity
network, and is thus always vulnerable to timing attacks by
a global adversary (who still has to do considerable work).
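
The buffering behaviour of such a high latency mix can be
sketched as follows in Python.  The class name BatchMix and the
batch size are the author's own choices, and the long timeout
mentioned above is left out to keep the sketch short.

```python
import random

class BatchMix:
    """Hold messages until the buffer is full, then release them shuffled."""

    def __init__(self, batch_size, rng=None):
        self.batch_size = batch_size
        self.rng = rng or random.Random()
        self.buffer = []

    def submit(self, message):
        """Accept one message; return a shuffled batch only when full."""
        self.buffer.append(message)
        if len(self.buffer) < self.batch_size:
            return []                    # nothing leaves the mix yet
        batch, self.buffer = self.buffer, []
        self.rng.shuffle(batch)          # arrival order is destroyed
        return batch

mix = BatchMix(batch_size=3)
print(mix.submit("from A"))              # []
print(mix.submit("from B"))              # []
print(sorted(mix.submit("from C")))      # ['from A', 'from B', 'from C']
```

Because nothing leaves until the batch is full, and the order
on the way out is random, the time at which a message leaves
says little about the time at which it arrived.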

Combining the Tor style of layered encryption with the
deliberate delays of the original mix networks gives the best
known form of anonymity network.  Assuming that
the software is accurately doing what it purports, and that the
inter-network encryption is strong, and that the end point
supports strong encryption, then it is almost impossible to
identify the sender, the recipient and the message.  The ability
to decipher the message depends on the end point's choice 
of encryption (and the security of its key).  The sender's and
recipient's identities are secured by the mix network, its
delays and its encryption.

To attack the Tor network, there are various options.  The
biggest hurdle is the design of the network itself.  One can
deploy sufficient network monitoring equipment to monitor the
entire internet (difficult) and then perform correlation analysis
(expensive).  The cost of doing both to de-anonymise all Tor
traffic is prohibitive.  This is supported by the fact that the
author knows of no public evidence of a Tor de-anonymisation
that was not helped by people using poor "operational security"
practices (more on that later).

The obvious attack is against the software itself.  Tor is a
very successful network, but its software is maintained by
relatively few people.  In such a small community it is more
difficult to expend the time to gain acceptance and then submit
a software change that would not be checked sufficiently.
Possible, but difficult.

The next most obvious attack has happened several times:
create a large number of Tor relays and submit them to the 
network.  This is the "timing" attack, but instead of just
watching the network, one participates in it.  This is analogous
to the "watcher actually runs the proxy" problem noted above,
and again highlights the importance of the authenticity 
problem (which largely remains unsolved).

Say you build and submit enough relays to control a third of
the Tor relays.  On pure probability you will control both the
first and the third relay in a third times a third of the
circuits through the network.  Thus you could
de-anonymise a ninth of the circuits.  This is why "first" 
nodes (called guard nodes) are preserved.  People will generally
continue to use pre-used first nodes.  Thus, new elements of
the network are less likely to become first nodes and thus
the amount of correlation attack is reduced.  Additionally,
Tor clients routinely change their circuit.  Second and third
nodes get changed every ten minutes or so.  Thus, a polluted,
compromised community of relays has less chance to constantly
watch (de-anonymise) persons.
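
The "third times a third" arithmetic is worked below in Python,
using exact fractions.  It assumes, as the pure probability
argument does, that the first and third relay are each picked
independently and uniformly at random; the function name is the
author's own, and real relay selection is not this simple.

```python
from fractions import Fraction

def compromised_circuit_share(relay_share: Fraction) -> Fraction:
    """Chance that both the first and the third relay are the attacker's,
    assuming independent, uniform selection of each."""
    return relay_share * relay_share

print(compromised_circuit_share(Fraction(1, 3)))    # 1/9
print(compromised_circuit_share(Fraction(1, 10)))   # 1/100
```

The share falls with the square of the attacker's foothold,
which is why keeping new relays out of the first position
matters so much.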

The last attack I will consider is the funding attack.  Tor
was originally fully funded by the US Dept of Defence (Naval
Research Labs).  Consider the problem that they were trying
to solve: A government employee needs to submit data to the 
government from a foreign location.  The foreign location may
have no direct access to US government secure networks, but
the internet is accessible.  How can the internet be used
to allow a government agent to communicate with the government
without allowing the foreign organisation to know that they
are doing so?

Hence the above described three hop encrypted proxy setup
that is Tor.  Assuming that the end point employs secure
cryptography, the foreign organisation will have no idea
about what is being said, or where it is going.  One can
assume that if a variant of Tor is still being used by the
US Dept. of Defence or other agencies, then it is, as outlined
above, a high latency network.

Could the DoD have built in special access tricks?  Yes and
no.  They may have been there (and may still be there) but
it is increasingly difficult over time to maintain these 
when a group of non-US technically skilled privacy activists
control the project and its source code.

Can the funding donor prioritise the work that is done on
the project?  Yes!  The US government continues to be a 
major sponsor of the project and as such, they can direct
attention away from areas which they wish to be untouched.
But beware, this is a double edged sword.  Every vulnerability
that they maintain is one that can be found by their adversaries.
The project is public.  Its funding is public.  Its code and
code reviews are public.  There are plenty of smart people
out there who can find these "problems".

As much as the National Security Agency has shown itself to
be more focussed on the "attack" side of its dual mission,
the Tor project (from the Naval Research Labs) has placed
itself in public hands and allowed in all of the disinfecting
power of sunlight.