I would like to follow up on the discussion we had in Florence on some design choices behind OONIB.
In particular, the most controversial point was whether to use HTTP or rsync.
Before discussing the pros and cons of one choice over the other, it would be useful to frame exactly what the requirements for OONIB are.
# What is OONIB
OONIB is the backend of OONI. It will run mainly on one centralized machine and may, at a later stage, run distributed across multiple ones. Currently we have not yet thought about how to make it scale across multiple machines, so we will look at it as if it were running only on one central machine.
It will be responsible for:
* Reporting: a) collecting reports from OONIProbes (such reports are the results of tests); b) collecting reports from third-party censorship measurement tools (e.g. Bismark, NeuBot, etc.)
* Assistance in test running: certain OONI tests require a backend component (e.g. b0wser). OONIB will host the server-side component that assists us in running the test.
Note: certain tests require the server to make connections back to the client. This means the client will need to request that the server probe it.
* Control channel signaling: this is required by some measurements to verify that the test-specific data received by the backend matches what the client sent.
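The control-channel check in the last item can be sketched as a simple digest comparison; this is a toy illustration with hypothetical names, not OONIB's actual wire format:

```python
import hashlib

def digest(payload: bytes) -> str:
    """Hex SHA-256 digest of a test payload."""
    return hashlib.sha256(payload).hexdigest()

# Client side: record exactly what was sent over the measured path.
sent = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
client_digest = digest(sent)

# Backend side: digest what actually arrived (a censor may have altered it).
received = sent  # unmodified in this example
backend_digest = digest(received)

# Control-channel comparison: a mismatch indicates in-flight tampering.
tampered = client_digest != backend_digest
```

The digests, not the payloads, travel over the control channel, so the comparison stays cheap even for large test data.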
# What properties we would like it to have
Note: these are not ordered.
* Efficient even over high latency networks.
* Ease of integration for third party developers.
* Expandable to support requirements of new tests we develop.
* Anonymous
* Secure
# HTTP and rsync comparison
Note: I will not deal with the security aspects of OONIB. We will assume an encrypted and authenticated transport (this can be TLS, Tor Hidden Services, etc.).
## Rsync:
Pro:
* It supports good compression algorithms
* It's efficient and supports resume
* It does integrity checking on the uploaded files
Contra:
* It's designed only for copying files, which means we can't implement any more advanced API-like logic. [*]
* It's not supported by many languages (for example, in Python we only have an implementation of the rsync algorithm, not of the protocol [1])
* It's not commonly used by other application developers with similar requirements.
* Painful to do sanitization of the data sent by clients.
* Does not allow bidirectional communication (Request-Response pattern)
[*] I would like to be able to create a session ID for a specific test and be able to reference such a test ID when interacting with the test helpers. rsync is one-way: I push data to the server, but the server cannot signal me back with some data. This largely impedes its usefulness as an API interface.
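The session logic described in [*] can be sketched as follows; the class and field names are hypothetical, not a concrete OONIB API:

```python
import secrets

class SessionRegistry:
    """Toy model of the request/response session setup that a push-only
    protocol like rsync cannot express."""

    def __init__(self):
        self.sessions = {}

    def create(self, test_name: str) -> str:
        # The server generates the ID and must signal it back to the
        # client: this is exactly the bidirectional step rsync lacks.
        session_id = secrets.token_hex(16)
        self.sessions[session_id] = {"test": test_name, "reports": []}
        return session_id

    def report(self, session_id: str, entry: dict) -> None:
        # Later interactions with the test helpers reference the same ID.
        self.sessions[session_id]["reports"].append(entry)

registry = SessionRegistry()
sid = registry.create("b0wser")
registry.report(sid, {"status": "ok"})
```

The point is only that `create` is a round trip: the client cannot know the session ID until the server answers.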
## HTTP
Note: I am not necessarily talking only about HTTP; we could use any other protocol with similar properties (e.g. SPDY). I will discuss HTTP because it is the one I am most familiar with, but don't take it as the only option.
Pro:
* Industry standard for exposing APIs
* Supported natively in most programming languages
* Well understood protocol
* Implementation of sanitization of passed data can be done more easily
* Allows bidirectional communication
* Good support in Twisted (the framework we use for OONIB)
Contra:
* Compression is not enabled by default (we can use gzip compression with HTTP/1.1), and there is no compression for headers.
* No resume support (this can be implemented on top of HTTP; we could even implement the rsync algorithm on top of HTTP).
* No support for deltas (we can use the rsync algorithm over HTTP if we really need this).
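On the last two points, a much-simplified block-aligned delta shows the kind of exchange that could be carried over HTTP request/response bodies. This is not the real rsync rolling-checksum algorithm, just a sketch of the idea:

```python
import hashlib

BLOCK = 4  # tiny block size, just for demonstration

def signature(old: bytes) -> list:
    """Hash each fixed-size block of the copy the server already has."""
    return [hashlib.sha256(old[i:i + BLOCK]).hexdigest()
            for i in range(0, len(old), BLOCK)]

def delta(new: bytes, sig: list) -> list:
    """Emit block references where the data matches, literals otherwise.
    Unlike real rsync there is no rolling checksum, so only
    block-aligned matches are found."""
    ops = []
    for i in range(0, len(new), BLOCK):
        chunk = new[i:i + BLOCK]
        h = hashlib.sha256(chunk).hexdigest()
        ops.append(("ref", sig.index(h)) if h in sig else ("lit", chunk))
    return ops

def apply_delta(old: bytes, ops: list) -> bytes:
    """Server-side reconstruction from its old copy plus the delta."""
    blocks = [old[i:i + BLOCK] for i in range(0, len(old), BLOCK)]
    return b"".join(blocks[arg] if op == "ref" else arg for op, arg in ops)

old = b"aaaabbbbcccc"   # what the server has
new = b"aaaaXXXXcccc"   # what the client wants to upload
ops = delta(new, signature(old))
assert apply_delta(old, ops) == new
```

Over HTTP, `signature` would be the body of a GET response and `ops` the body of a PUT/PATCH request; only the changed block crosses the wire.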
I feel we are somewhat comparing apples and oranges here, and I don't see why we could not use the rsync algorithm on top of HTTP. Anyway, I would like some feedback on what we should use for something that should have the properties described above.
Thoughts?
- Art.
[1] https://github.com/isislovecruft/pyrsync
On Sun, Jul 15, 2012 at 12:56 PM, Arturo Filastò art@torproject.org wrote:
- No resume support (this can be implemented on top of HTTP; we could
even implement the rsync algorithm on top of HTTP).
Are you sure HTTP doesn't support resume? What does wget -c do?
I believe this requires the HTTP Range header, and it doesn't provide the integrity checking that rsync provides.
All the best, Jake
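To illustrate the point above: a Range request gives you resume, but integrity still has to come from a separately published checksum, which rsync bundles in. A toy sketch with no real network; `serve_range` is a hypothetical stand-in for a 206 Partial Content response:

```python
import hashlib

def serve_range(resource: bytes, start: int) -> bytes:
    """Toy stand-in for an HTTP 206 response to 'Range: bytes=<start>-'."""
    return resource[start:]

resource = b"report data " * 4
# Integrity must come from somewhere else, e.g. a checksum published
# alongside the file; the Range mechanism itself provides none.
expected = hashlib.sha256(resource).hexdigest()

partial = resource[:10]                         # transfer was interrupted
partial += serve_range(resource, len(partial))  # resume, like wget -c

assert hashlib.sha256(partial).hexdigest() == expected
```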
On 7/15/12 3:58 PM, Jacob Appelbaum wrote:
Are you sure HTTP doesn't support resume? What does wget -c do?
I believe this requires the HTTP Range header, and it doesn't provide the integrity checking that rsync provides.
Maybe also an application-level HTTP parameter containing the last offset of the specific data-set transfer, so that the server would seek to that offset and resume sending data from that point?
-naif
Contra:
- No support for deltas (we can use the rsync algorithm over HTTP if we
really need this).
It's a little hackish, but I believe there is a 'standard' way to do this in HTTP too. A client issues a GET (or PUT) request to a resource and receives an ETag that identifies this version of the object. The client then issues a PATCH request to update the object, sending the ETag and either structured XML or JSON with the fields to replace, or binary data with a Range header indicating where in the object to replace.
If the ETag the client sent matches the object stored on the server, the PATCH succeeds and overwrites the data. If the ETag does not match, the client is out of date and must issue a GET, resolve differences, and then retry the PATCH.
-tom
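The ETag-guarded PATCH described above is essentially a compare-and-swap. A toy in-memory sketch of the server-side logic (hypothetical class, not a real HTTP server):

```python
import hashlib

class Store:
    """Toy object store modelling an ETag-guarded PATCH (compare-and-swap)."""

    def __init__(self, body: bytes):
        self.body = body

    @property
    def etag(self) -> str:
        # A content-derived validator, as HTTP servers commonly use.
        return hashlib.sha256(self.body).hexdigest()[:16]

    def patch(self, client_etag: str, offset: int, data: bytes) -> bool:
        if client_etag != self.etag:
            # 412 Precondition Failed: client must GET, merge, retry.
            return False
        self.body = self.body[:offset] + data + self.body[offset + len(data):]
        return True

store = Store(b"hello world")
tag = store.etag
assert store.patch(tag, 6, b"there")       # succeeds: body is b"hello there"
assert not store.patch(tag, 0, b"stale")   # old ETag no longer matches
```

In real HTTP the client would send the ETag in an If-Match header, and the second call would come back as 412.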
On Sun 15 Jul 2012 at 10:13, thus spake Tom Ritter:
Contra:
- No support for deltas (we can use rsych protocol over HTTP if we
really need this).
It's a little hackish, but I believe there is a 'standard' way to do this in HTTP too. A client issues a GET (or PUT) request to a resource and receives an ETag that identifies this version of the object. The client then issues a PATCH request to update the object, sending the ETag and either structured XML or JSON with the fields to replace, or binary data with a Range header indicating where in the object to replace.
While this is quite a clever use for Etags, I have to point out that there would be no identity verification[0] in this scheme, in addition to Etags being subject to birthday attack enumeration (even if we use a secure hash). Therefore, Mallory, knowing the location of the OONIB server, can simply compute many random Etags and issue a PATCH of a blank string to each one, erasing all the collected data.
If the ETag the client sent matches the object stored on the server, the PATCH succeeds and overwrites the data. If the ETag does not match, the client is out of date and must issue a GET, resolve differences, and then retry the PATCH.
Mallory can also /GET...
Perhaps I am biased towards opposition to Etags merely because of their nastier uses for tracking. I kind of wish they would be removed from the protocol, and I don't want to create any legitimate use for them that might deter their removal.
<(A)3 isis agora lovecruft
[0] Identity verification in the "yes-this-is-the-same-client-as-before" sense, not the "this-person-is-named-holger-and-they-live-in-osterreich" sense.
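One way to avoid the enumeration risk described above would be to address reports with high-entropy capability tokens rather than content-derived ETags; a toy sketch with hypothetical names:

```python
import secrets

class ReportStore:
    """Reports addressed by unguessable capability tokens instead of ETags."""

    def __init__(self):
        self._reports = {}

    def create(self, body: bytes) -> str:
        # 128 random bits: infeasible to sweep, unlike a small or
        # content-derived ETag space.
        token = secrets.token_hex(16)
        self._reports[token] = body
        return token

    def patch(self, token: str, body: bytes) -> bool:
        if token not in self._reports:
            return False  # unknown token: reject and reveal nothing
        self._reports[token] = body
        return True

store = ReportStore()
t = store.create(b"measurement 1")
assert store.patch(t, b"measurement 1, updated")
assert not store.patch("0" * 32, b"blanked by Mallory")
```

Only the client that created a report ever learns its token, so Mallory cannot enumerate handles the way she could with hash-derived ETags.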
On 07/15/2012 02:56 PM, Arturo Filastò wrote:
I would like to follow up on the discussion we had in Florence on some design choices behind OONIB.
In particular the most controversy was around using HTTP or rsync.
[...]
# What properties we would like it to have
Note: these are not ordered.
Efficient even over high latency networks.
Ease of integration for third party developers.
Expandable to support requirements of new tests we develop.
Anonymous
Secure
Even though you will probably not end up using this, it may be a good idea to know that it exists:
ZeroC Ice - http://www.zeroc.com/ice.html
It's a middleware platform for building exactly this kind of distributed services. Many supported languages (C++/Python/Ruby/Java/...) and OSes (Mac/Lin/Win/Android/...)
It might seem difficult or "enterprisey", but it's simple for simple things (like RPC) and complicated only when you want complicated things (have a look at the demo programs).
It can optionally use TLS; the interface definition for RPC and structures is written only once (each language binding then loads it and maps it to native objects of its own, used as "usual" method calls or attributes).
Advanced features include asynchronous calls, at-most-once semantics (it can retry RPC calls for methods marked "idempotent", i.e. those whose multiple invocation is the same as one invocation), persistence via Ice Freeze (might work for the file storage, not sure how big your files are; internally it's implemented on top of BerkeleyDB), and forward/backward compatibility among versions of your API (up to a limit)...
Disadvantages:
- you'll have one more library blob to carry around (though Ice is in the default Debian/Ubuntu repos and official RPM repos are available; the core lib is about 3 MB)
- GPL licensed (might conflict with other libraries' licenses)
- certainly not as simple as a GET/POST request
It's probably the most sane "generic" middleware/RPC platform I've seen, and I've worked with a bunch of them - RESTful APIs, variants of XML-RPC, monsters like web services/SOAP and CORBA (it always starts with "I just need this simple thing" and ends with "how do I hack this onto the existing API so that old clients and existing infrastructure won't break?")
Ondrej
On Mon 16 Jul 2012 at 02:15, thus spake Ondrej Mikle:
On 07/15/2012 02:56 PM, Arturo Filastò wrote:
# What properties we would like it to have
Note: these are not ordered.
- Efficient even over high latency networks.
- Ease of integration for third party developers.
- Expandable to support requirements of new tests we develop.
- Anonymous
- Secure
Even though you will probably not end up using this, it may be a good idea to know that it exists:
ZeroC Ice - http://www.zeroc.com/ice.html
I took a brief look at it, and it does actually look quite nice...
I'm about half convinced, but am also rather inexperienced with designing distributed, secure, anonymous, multi-platform, scalable RPC systems. ugh...I think that string of buzzwords just made me puke in my mouth a little bit.
It might seem difficult or "enterprisey", but it's simple for simple things (like RPC) and complicated only when you want complicated things (have look at demo programs).
Oh man. It's not Twisted, that's for sure. :)
Though, it seems that much of Ice is redundant if we are already packaging Twisted. Perhaps we could use their code as reference, and just write out the methods we need in Twisted to avoid the extra dependency?
It can optionally use TLS; the interface definition for RPC and structures is written only once (each language binding then loads it and maps it to native objects of its own, used as "usual" method calls or attributes).
Advanced features include asynchronous calls, at-most-once semantics (it can retry RPC calls for methods marked "idempotent", i.e. those whose multiple invocation is the same as one invocation), persistence via Ice Freeze (might work for the file storage, not sure how big your files are; internally it's implemented on top of BerkeleyDB), and forward/backward compatibility among versions of your API (up to a limit)...
Becoming more convinced. Do you know off the top of your head which protocol it uses? HTTP also, I would assume?
Side note: What are we going to do for countries which block/monitor/MITM SSL connections? If I'm not mistaken, hasn't it been the case that these places have still allowed ssh? Should we have some sort of append-only scp-like fallback? Does Ice have that?
Disadvantages:
- you'll have one more library blob to carry around (though Ice is in default
Debian/Ubuntu repos and official RPM repos are available; core lib is about 3MB large)
Ah, yes, but if you use just the Python libraries, it looks like the size ranges from 600kb for Debian to uh...actually I can't find anywhere which has just the Python libraries for Windows. The largest size for a Linux distro does look to be about 3MB, you're right.
- GPL licensed (might conflict with other libraries' licenses)
- certainly not as simple as GET/POST request
It's probably the most sane "generic" middleware/RPC platform I've seen, and I've worked with a bunch of them - RESTful APIs, variants of XML-RPC, monsters like web services/SOAP and CORBA (it always starts with "I just need this simple thing" and ends with "how do I hack this onto the existing API so that old clients and existing infrastructure won't break?")
<(A)3 isis agora lovecruft
On 07/17/2012 10:08 PM, Isis wrote:
On Mon 16 Jul 2012 at 02:15, thus spake Ondrej Mikle:
On 07/15/2012 02:56 PM, Arturo Filastò wrote:
# What properties we would like it to have
Note: these are not ordered.
* Efficient even over high latency networks.
* Ease of integration for third party developers.
* Expandable to support requirements of new tests we develop.
* Anonymous
* Secure
Even though you will probably not end up using this, it may be a good idea to know that it exists:
ZeroC Ice - http://www.zeroc.com/ice.html
[...]
Oh man. It's not Twisted, that's for sure. :)
Though, it seems that much of Ice is redundant if we are already packaging Twisted. Perhaps we could use their code as reference, and just write out the methods we need in Twisted to avoid the extra dependency?
If you are packaging/using Twisted, then yes, Ice is redundant (unless someone planned to differentiate "signaling" from "data" protocol, for example).
It can optionally use TLS; the interface definition for RPC and structures is written only once (each language binding then loads it and maps it to native objects of its own, used as "usual" method calls or attributes).
Advanced features include asynchronous calls, at-most-once semantics (it can retry RPC calls for methods marked "idempotent", i.e. those whose multiple invocation is the same as one invocation), persistence via Ice Freeze (might work for the file storage, not sure how big your files are; internally it's implemented on top of BerkeleyDB), and forward/backward compatibility among versions of your API (up to a limit)...
Becoming more convinced. Do you know off the top of your head which protocol it uses? HTTP also, I would assume?
At the low level it has its own protocol; it is not HTTP (and it won't work over HTTP).
Side note: What are we going to do for countries which block/monitor/MITM SSL connections? If I'm not mistaken, hasn't it been the case that these places have still allowed ssh? Should we have some sort of append-only scp-like fallback? Does Ice have that?
Unfortunately, there's no fallback in Ice for that (its firewall-evasion support also uses SSL/TLS, which is not useful here). Maybe I misunderstood Arturo's requirement that said TLS or Tor HS was considered for the encrypted/authenticated transport.
Ondrej
On 7/16/12 2:15 AM, Ondrej Mikle wrote:
On 07/15/2012 02:56 PM, Arturo Filastò wrote:
I would like to follow up on the discussion we had in Florence on some design choices behind OONIB.
In particular the most controversy was around using HTTP or rsync.
[...]
# What properties we would like it to have
Note: these are not ordered.
Efficient even over high latency networks.
Ease of integration for third party developers.
Expandable to support requirements of new tests we develop.
Anonymous
Secure
Even though you will probably not end up using this, it may be a good idea to know that it exists:
ZeroC Ice - http://www.zeroc.com/ice.html
There are a bunch of very fancy and nice libraries out there for doing RPC-like things that support a variety of languages.
Another one I am very fond of is ZeroMQ, but I think we should not start worrying about supporting such advanced and scalable libraries at this time.
This is the reason why I stated in the requirements that "we will look at it as if it were running only on one central machine".
The scalability issues will be dealt with once we have properly defined the problem scope. Making the node communication rely on Zero(C|MQ) is something we could integrate later without requiring clients to change their behavior.
I think the overall feeling from the responses is that going for something like an HTTP RESTful API is what we are looking for.
HTTP is a well understood technology, and I have quite some experience in designing and building RESTful APIs based on Twisted (cyclone); this is also what is used in other [1] projects [2] belonging [3] to the [4] Tor community [5].
Moreover, by looking at how the reporting systems of other network measurement tools work [5], we found that almost all of them use an HTTP API to collect reports [6][7][8][9]. I see this as an indication that such a strategy is best practice.
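For illustration, a minimal sketch of what a report-collection handler behind such an HTTP API could look like; the endpoint, field names, and helper are assumptions for the sketch, not OONIB's actual schema:

```python
import json
import secrets

REQUIRED = {"software_name", "test_name", "probe_cc"}
reports = {}

def handle_post_report(raw_body: bytes):
    """Toy stand-in for a 'POST /report' handler in a RESTful collector."""
    try:
        entry = json.loads(raw_body)
    except ValueError:
        return 400, {"error": "invalid JSON"}
    missing = REQUIRED - entry.keys()
    if missing:
        return 400, {"error": "missing fields", "fields": sorted(missing)}
    report_id = secrets.token_hex(8)  # handle for later updates to the report
    reports[report_id] = entry
    return 201, {"report_id": report_id}

status, body = handle_post_report(json.dumps(
    {"software_name": "ooniprobe",
     "test_name": "http_requests",
     "probe_cc": "IT"}).encode())
assert status == 201
```

Validation and sanitization live in one obvious place, which is exactly what is painful to do with rsync.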
For the time being we should go for something simple like this and once we encounter major scalability/performance bottlenecks we can quantify them and figure out what the best path to a solution may be.
If you were the developer of a censorship detection tool, would you want to have to report to anything that is not a RESTful HTTPS API?
- Art.
[1] https://github.com/mmaker/APAF
[2] https://github.com/gsathya/pyonionoo
[3] https://github.com/globaleaks/Tor2web-3.0
[4] https://github.com/globaleaks/GLBackend
[5] https://trac.torproject.org/projects/tor/wiki/doc/OONI/CensorshipDetectionTo...
[6] https://trac.torproject.org/projects/tor/wiki/doc/OONI/CensorshipDetectionTo...
[7] https://trac.torproject.org/projects/tor/wiki/doc/OONI/CensorshipDetectionTo...
[8] https://trac.torproject.org/projects/tor/wiki/doc/OONI/CensorshipDetectionTo...
[9] https://trac.torproject.org/projects/tor/wiki/doc/OONI/CensorshipDetectionTo...
On 07/18/2012 04:46 PM, Arturo Filastò wrote:
On 7/16/12 2:15 AM, Ondrej Mikle wrote:
On 07/15/2012 02:56 PM, Arturo Filastò wrote:
I would like to follow up on the discussion we had in Florence on some design choices behind OONIB.
In particular the most controversy was around using HTTP or rsync.
[...]
# What properties we would like it to have
Note: these are not ordered.
Efficient even over high latency networks.
Ease of integration for third party developers.
Expandable to support requirements of new tests we develop.
Anonymous
Secure
Even though you will probably not end up using this, it may be a good idea to know that it exists:
ZeroC Ice - http://www.zeroc.com/ice.html
There are a bunch of very fancy and nice libraries out there for doing RPC-like things that support a variety of languages.
[...]
Moreover, by looking at how the reporting systems of other network measurement tools work [5], we found that almost all of them use an HTTP API to collect reports [6][7][8][9]. I see this as an indication that such a strategy is best practice.
Since it's already implemented, it's reasonable to keep it that way.
For the time being we should go for something simple like this and once we encounter major scalability/performance bottlenecks we can quantify them and figure out what the best path to a solution may be.
Sure (though this transition is always a PITA once it becomes necessary).
If you were the developer of a censorship detection tool would you like to have to report to anything that is not a RESTful HTTPs API?
Hmm, for some reason I remember there was some debate about stateful requirements on the API, but I can't seem to find it.
Ondrej