[OONI] Designing the OONI Backend (OONIB). RESTful API vs rsync

I would like to follow up on the discussion we had in Florence on some design choices behind OONIB. In particular, the most controversy was around using HTTP or rsync.

Before discussing the pros and cons of one choice over the other, it would be useful to frame exactly what the requirements for OONIB are.

# What is OONIB

OONIB is the backend of OONI. It will run mainly on one centralized machine and may at a later stage run distributed across multiple ones. Currently we have not thought about how to make it scale to being distributed, so we will look at it as if it were running only on one central machine.

It will be responsible for:

* Reporting
  a) Collecting reports from OONIProbes. Such reports are the results of tests.
  b) Collecting reports from third party censorship measurement tools (e.g. Bismark, NeuBot, etc.)

* Assistance in test running
  Certain OONI tests require a backend component (e.g. b0wser). On OONIB we will have the server-side component that will assist us in running the test.
  Note: Certain tests require the server to make connections to the client. This means that the client will need to request the server to probe them.

* Control Channel Signaling
  This is required for making some measurements, to verify that the data received by the backend for a specific test matches the data sent by the client.

# What properties we would like it to have

note: these are not ordered.

* Efficient even over high latency networks.
* Ease of integration for third party developers.
* Expandable to support requirements of new tests we develop.
* Anonymous
* Secure

# HTTP and rsync comparison

note: I will not deal with the security aspects of OONIB. We will suppose we have an encrypted and authenticated transport (this can be TLS, Tor Hidden Services, etc.)

## Rsync:

Pro:
* It supports good compression algorithms
* It's efficient and supports resume
* It does integrity checking on the uploaded files

Contra:
* It's designed only for copying files, which means we can't implement any more advanced API-like logic. [*]
* It's not supported by many languages (for example in Python we only have an implementation of the rsync algorithm, not of the protocol [1])
* It's not as commonly used by other application developers that have similar requirements.
* Painful to do sanitization of the data sent by clients.
* Does not allow bidirectional communication (Request-Response pattern)

[*] I would like to be able to create a session ID for a specific test and be able to reference that ID when interacting with the test helpers. rsync is one way: I push data to the server, but the server cannot signal me back with some data. This largely impedes its usefulness as an API interface.

## HTTP:

note: I am not necessarily talking only about HTTP, we could use any other protocol with similar properties (e.g. SPDY). I will discuss HTTP because it is the one that I am most familiar with, but don't

Pro:
* Industry standard for exposing APIs
* Supported natively in most programming languages
* Well understood protocol
* Implementation of sanitization of passed data can be done more easily
* Allows bidirectional communication
* Good support in Twisted (the framework we use for OONIB)

Contra:
* Compression is not enabled by default (we can use gzip compression with HTTP 1.1), and no compression for headers.
* No resume support (this can be implemented on top of HTTP, we could even implement the rsync algorithm on top of HTTP).
* No support for deltas (we can use the rsync protocol over HTTP if we really need this).

I feel like we are comparing apples and oranges a bit here, and I don't see why we could not use the rsync algorithm on top of HTTP.

Anyway, I would like to get some feedback as to what we should use for something that should have the properties described above.

Thoughts?

- Art.

[1] https://github.com/isislovecruft/pyrsync
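To make the [*] footnote concrete, here is a minimal sketch of the request-response pattern it asks for: the client asks the backend to open a report, the backend answers with an ID, and later requests reference that ID. Everything named here (`ReportStore`, `handle_request`, the `/report` paths, the JSON field names) is hypothetical and chosen only for illustration; nothing in the thread defines the real OONIB API.

```python
import json
import secrets


class ReportStore:
    """In-memory stand-in for the backend's report storage (hypothetical)."""

    def __init__(self):
        self._reports = {}

    def create(self, test_name):
        # The server picks an unguessable ID and hands it back to the client.
        # This reply is the step a one-way push such as rsync cannot express.
        report_id = secrets.token_hex(16)
        self._reports[report_id] = {"test_name": test_name, "entries": []}
        return report_id

    def append(self, report_id, entry):
        report = self._reports[report_id]  # raises KeyError for unknown IDs
        report["entries"].append(entry)
        return len(report["entries"])


def handle_request(store, method, path, body):
    """Map an HTTP-style request onto the store; returns (status, reply)."""
    payload = json.loads(body) if body else {}
    parts = [p for p in path.split("/") if p]
    # POST /report opens a new report and returns its ID.
    if method == "POST" and parts == ["report"] and "test_name" in payload:
        return 201, {"report_id": store.create(payload["test_name"])}
    # PUT /report/<id> adds an entry to a report opened earlier.
    if method == "PUT" and len(parts) == 2 and parts[0] == "report":
        try:
            return 200, {"entries": store.append(parts[1], payload)}
        except KeyError:
            return 404, {"error": "unknown report ID"}
    return 400, {"error": "unsupported request"}
```

A test helper could accept the same ID to tie its own observations to the client's report; that part is left out of the sketch.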

On Sun, Jul 15, 2012 at 12:56 PM, Arturo Filastò <art@torproject.org> wrote:
* No resume support (this can be implemented on top of HTTP, we could even implement the rsync algorithm on top of HTTP).
Are you sure HTTP doesn't support resume? What does wget -c do?

Aaron:
On Sun, Jul 15, 2012 at 12:56 PM, Arturo Filastò <art@torproject.org> wrote:
* No resume support (this can be implemented on top of HTTP, we could even implement the rsync algorithm on top of HTTP).
Are you sure HTTP doesn't support resume? What does wget -c do?
I believe this requires the HTTP Range header, and it doesn't provide the integrity checking that rsync provides.

All the best,
Jake

On 7/15/12 3:58 PM, Jacob Appelbaum wrote:
Are you sure HTTP doesn't support resume? What does wget -c do?
I believe this requires the HTTP Range header, and it doesn't provide the integrity checking that rsync provides.
It may also be an application-level HTTP parameter that contains the last offset of the specific data-set download, so that the server would seek to that offset and start sending data from that point?

-naif
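The two points above (resume via the Range header, and the missing integrity check) can be combined in a small sketch. It simulates the server in memory rather than opening a connection, handles only the open-ended `bytes=N-` form of the header, and assumes the client learns the expected SHA-256 digest through some other channel; the function names are made up for the example.

```python
import hashlib


def range_header(bytes_already_have):
    """Header asking the server for everything from the given offset on."""
    return {"Range": "bytes=%d-" % bytes_already_have}


def serve_range(data, header_value):
    """Toy server side: return (status, body) for a 'bytes=N-' range."""
    start = int(header_value[len("bytes="):].rstrip("-"))
    if start >= len(data):
        return 416, b""  # Range Not Satisfiable
    return 206, data[start:]  # Partial Content


def resume_and_verify(partial, data_on_server, expected_sha256):
    """Fetch the missing tail, then check the whole file against a digest."""
    header = range_header(len(partial))
    status, rest = serve_range(data_on_server, header["Range"])
    if status != 206:
        raise ValueError("server did not honour the range request")
    complete = partial + rest
    # Range alone does not detect a corrupted prefix; the digest check does.
    if hashlib.sha256(complete).hexdigest() != expected_sha256:
        raise ValueError("digest mismatch after resume")
    return complete
```

The digest check is what stands in for rsync's integrity checking here; without it, a damaged partial file would be silently completed.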

Contra: * No support for deltas (we can use the rsync protocol over HTTP if we really need this).
It's a little hackish, but I believe there is a 'standard' way to do this in HTTP also. A client issues a GET (or PUT) request to a resource, and receives an Etag that identifies this version of the object. The client then issues a PATCH request to update the object, sending the Etag, and either structured XML or JSON with the fields to replace, or binary data with a Range header indicating where in the object to replace. If the Etag the client sent matches the object stored on the server, the PATCH succeeds and overwrites the data. If the Etag does not match, the client is out of date and must issue a GET, resolve differences, and then the PATCH.

-tom
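The flow described above can be sketched with an in-memory object instead of a real server. Two assumptions are baked in for illustration: the Etag is the SHA-256 of the content (a server may use any opaque value), and the client sends it back the way the standard `If-Match` header would carry it, with a mismatch answered by 412 Precondition Failed. `VersionedObject` is a hypothetical name.

```python
import hashlib


class VersionedObject:
    """Server-side object guarded by an Etag (illustrative only)."""

    def __init__(self, data):
        self.data = data

    @property
    def etag(self):
        # Assumption: the Etag is derived from the content.
        return hashlib.sha256(self.data).hexdigest()

    def get(self):
        return 200, self.etag, self.data

    def patch(self, if_match, offset, replacement):
        """Overwrite bytes at offset, but only if the client's Etag is current."""
        if if_match != self.etag:
            # Client is out of date: it must GET, resolve, and retry.
            return 412, self.etag
        end = offset + len(replacement)
        self.data = self.data[:offset] + replacement + self.data[end:]
        return 200, self.etag
```

For example, a client that fetched `b"hello world"` and patches offset 6 with `b"OONIB"` succeeds once; replaying the same request with the old Etag is rejected because the stored version has changed.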

On Sun 15 Jul 2012 at 10:13, thus spake Tom Ritter:
Contra: * No support for deltas (we can use the rsync protocol over HTTP if we really need this).
It's a little hackish, but I believe there is a 'standard' way to do this in HTTP also. A client issues a GET (or PUT) request to a resource, and receives an Etag that identifies this version of the object. The client then issues a PATCH request to update the object, sending the Etag, and either structured XML or JSON with the fields to replace, or binary data with a Range header indicating where in the object to replace.
While this is quite a clever use for Etags, I have to point out that there would be no identity verification[0] in this scheme, in addition to Etags being subject to birthday attack enumeration (even if we use a secure hash). Therefore, Mallory, knowing the location of the OONIB server, can simply compute many random Etags and issue a PATCH of a blank string to each one, erasing all the collected data.
If the Etag the client sent is the object stored on the server, the PATCH succeeds and overwrites the data. If the Etag does not match, the client is out of date and must issue a GET, resolve differences, and then the PATCH.
Mallory can also /GET...

Perhaps I am biased towards opposition to Etags merely because of their nastier uses for tracking. I kind of wish they would be removed from the protocol, and I don't want to create any legitimate use for them that might deter their removal.

<(A)3
isis agora lovecruft

[0] Identity verification in the "yes-this-is-the-same-client-as-before" sense, not the "this-person-is-named-holger-and-they-live-in-osterreich" sense.

On 07/15/2012 02:56 PM, Arturo Filastò wrote:
I would like to follow up on the discussion we had in Florence on some design choices behind OONIB.
In particular the most controversy was around using HTTP or rsync.
[...]
# What properties we would like it to have note: these are not ordered.
* Efficient even over high latency networks.
* Ease of integration for third party developers.
* Expandable to support requirements of new tests we develop.
* Anonymous
* Secure
Even though you will probably not end up using this, it may be a good idea to know that it exists:

ZeroC Ice - http://www.zeroc.com/ice.html

It's a middleware platform for building exactly this kind of distributed service. Many supported languages (C++/Python/Ruby/Java/...) and OSes (Mac/Lin/Win/Android/...)

It might seem difficult or "enterprisey", but it's simple for simple things (like RPC) and complicated only when you want complicated things (have a look at the demo programs).

It can optionally use TLS, and the interface definition for RPC and structures is written only once (each language binding then loads it and maps it to a native object of its own, as "usual" method calls or attributes).

Advanced features include asynchronous calls, at-most-once semantics (it can retry an RPC call for methods that are marked "idempotent", i.e. whose multiple invocation is the same as one invocation), persistence via Ice Freeze (might work for the file storage, not sure how big your files are; internally it's implemented on top of BerkeleyDB), forward/backward compatibility among versions of your API (up to a limit)...

Disadvantages:
- you'll have one more library blob to carry around (though Ice is in the default Debian/Ubuntu repos and official RPM repos are available; the core lib is about 3MB large)
- GPL licensed (might conflict with other libraries' licenses)
- certainly not as simple as a GET/POST request

It's probably the most sane "generic" middleware/RPC platform I've seen, and I've worked with a bunch of them: RESTful APIs, variants of XML-RPC, monsters like webservices/SOAP and CORBA (it always starts with "I just need this simple thing" and ends with "how do I hack this onto the existing API so that old clients and existing infrastructure won't break?")

Ondrej
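The "retry only what is marked idempotent" rule mentioned above can be sketched in a few lines of plain Python. This is not Ice's API: the `idempotent` decorator and `call_with_retry` helper are invented for the example, and `ConnectionError` stands in for whatever failure a real transport would raise.

```python
def idempotent(func):
    """Mark a call as safe to repeat (invoking it twice equals invoking it once)."""
    func.is_idempotent = True
    return func


def call_with_retry(func, *args, attempts=3):
    """Retry on connection failure, but only for calls marked idempotent."""
    # Unmarked calls get exactly one attempt, so they run at most once.
    allowed = attempts if getattr(func, "is_idempotent", False) else 1
    last_error = None
    for _ in range(allowed):
        try:
            return func(*args)
        except ConnectionError as exc:
            last_error = exc
    raise last_error
```

A read such as fetching a test deck could carry the marker, while submitting a report entry would not, since a blind retry might store it twice.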

On Mon 16 Jul 2012 at 02:15, thus spake Ondrej Mikle:
On 07/15/2012 02:56 PM, Arturo Filastò wrote:
# What properties we would like it to have note: these are not ordered.
* Efficient even over high latency networks.
* Ease of integration for third party developers.
* Expandable to support requirements of new tests we develop.
* Anonymous
* Secure
Even though you will probably not end up using this, it may be a good idea to know that it exists:
ZeroC Ice - http://www.zeroc.com/ice.html
I took a brief look at it and it does actually look quite nice... I'm about half convinced, but am also rather inexperienced with designing distributed, secure, anonymous, multi-platform, scalable RPC systems. ugh...I think that string of buzzwords just made me puke in my mouth a little bit.
It might seem difficult or "enterprisey", but it's simple for simple things (like RPC) and complicated only when you want complicated things (have a look at the demo programs).
Oh man. It's not Twisted, that's for sure. :) Though, it seems that much of Ice is redundant if we are already packaging Twisted. Perhaps we could use their code as reference, and just write out the methods we need in Twisted to avoid the extra dependency?
It can optionally use TLS, interface definition for RPC and structures is written only once (each language binding then loads it and maps it to native object of its own as "usual" method calls or attributes).
Advanced features include asynchronous calls, at-most-once semantics (it can retry RPC call for methods that are marked "idempotent", i.e. whose multiple invocation is same as one invocation), persistence via Ice Freeze (might work for the file storage, not sure how big are your files, internally it's implemented on top of BerkeleyDB), forward/backward compatibility among versions of your API (up to a limit)...
Becoming more convinced. Do you know off the top of your head which protocol it uses? HTTP also, I would assume? Side note: What are we going to do for countries which block/monitor/MITM SSL connections? If I'm not mistaken, hasn't it been the case that these places have still allowed ssh? Should we have some sort of append-only scp-like fallback? Does Ice have that?
Disadvantages:
- you'll have one more library blob to carry around (though Ice is in default Debian/Ubuntu repos and official RPM repos are available; core lib is about 3MB large)
Ah, yes, but if you use just the Python libraries, it looks like the size ranges from 600kb for Debian to uh...actually I can't find anywhere which has just the Python libraries for Windows. The largest size for a Linux distro does look to be about 3MB, you're right.
- GPL licensed (might conflict with other libraries' licenses) - certainly not as simple as GET/POST request
It's probably the most sane "generic" middleware/RPC platform I've seen and I've worked with a bunch of them - RESTful APIs, variants of XML-RPC, monsters like webservices/SOAP and CORBA (it always starts with "I just need this simple thing" and ends with "how do I hack this onto the existing API so that old clients and existing infrastructure won't break?")

<(A)3
isis agora lovecruft

On 07/17/2012 10:08 PM, Isis wrote:
On Mon 16 Jul 2012 at 02:15, thus spake Ondrej Mikle:
On 07/15/2012 02:56 PM, Arturo Filastò wrote:
# What properties we would like it to have note: these are not ordered.
* Efficient even over high latency networks.
* Ease of integration for third party developers.
* Expandable to support requirements of new tests we develop.
* Anonymous
* Secure
Even though you will probably not end up using this, it may be a good idea to know that it exists:
ZeroC Ice - http://www.zeroc.com/ice.html [...]
Oh man. It's not Twisted, that's for sure. :)
Though, it seems that much of Ice is redundant if we are already packaging Twisted. Perhaps we could use their code as reference, and just write out the methods we need in Twisted to avoid the extra dependency?
If you are packaging/using Twisted, then yes, Ice is redundant (unless someone planned to differentiate "signaling" from "data" protocol, for example).
It can optionally use TLS, interface definition for RPC and structures is written only once (each language binding then loads it and maps it to native object of its own as "usual" method calls or attributes).
Advanced features include asynchronous calls, at-most-once semantics (it can retry RPC call for methods that are marked "idempotent", i.e. whose multiple invocation is same as one invocation), persistence via Ice Freeze (might work for the file storage, not sure how big are your files, internally it's implemented on top of BerkeleyDB), forward/backward compatibility among versions of your API (up to a limit)...
Becoming more convinced. Do you know off the top of your head which protocol it uses? HTTP also, I would assume?
At low-level, it has its own protocol, it's not HTTP (it actually won't work over HTTP).
Side note: What are we going to do for countries which block/monitor/MITM SSL connections? If I'm not mistaken, hasn't it been the case that these places have still allowed ssh? Should we have some sort of append-only scp-like fallback? Does Ice have that?
Unfortunately, there's no fallback in Ice for that (its firewall-evading also uses SSL/TLS which is not useful here). Maybe I misunderstood Arturo's requirement that said TLS or TorHS was considered for encrypted/authenticated transport.

Ondrej

On 7/16/12 2:15 AM, Ondrej Mikle wrote:
On 07/15/2012 02:56 PM, Arturo Filastò wrote:
I would like to follow up on the discussion we had in Florence on some design choices behind OONIB.
In particular the most controversy was around using HTTP or rsync. [...]
# What properties we would like it to have note: these are not ordered.
* Efficient even over high latency networks.
* Ease of integration for third party developers.
* Expandable to support requirements of new tests we develop.
* Anonymous
* Secure
Even though you will probably not end up using this, it may be a good idea to know that it exists:
ZeroC Ice - http://www.zeroc.com/ice.html
There are a bunch of very fancy and nice libraries out there to do RPC-like things that support a variety of languages. Another one I am very fond of is ZeroMQ, but I think we should not start worrying about supporting such advanced and scalable libraries at this time.

This is the reason why I stated at the beginning of the requirements that "we will look at it as if it were running only on one central machine". The scalability issues will be dealt with once we have properly defined the problem scope. Making the node communication rely on Zero(C|MQ) can be something that we integrate without requiring clients to change their behavior.

I think the overall feeling from the responses is that something like an HTTP RESTful API is what we are looking for. HTTP is a well understood technology, I have quite some experience in designing and building RESTful APIs based on Twisted (cyclone), and this is also what is used in other [1] projects [2] belonging [3] to the [4] Tor community [5].

Moreover, by looking at how the reporting systems of other network measurement tools worked [5], we found that almost all of them used an HTTP API to collect reports [6][7][8][9]. I see this as an indication that such a strategy is the best practice.

For the time being we should go for something simple like this, and once we encounter major scalability/performance bottlenecks we can quantify them and figure out what the best path to a solution may be.

If you were the developer of a censorship detection tool, would you like to have to report to anything that is not a RESTful HTTPS API?

- Art.

[1] https://github.com/mmaker/APAF
[2] https://github.com/gsathya/pyonionoo
[3] https://github.com/globaleaks/Tor2web-3.0
[4] https://github.com/globaleaks/GLBackend
[5] https://trac.torproject.org/projects/tor/wiki/doc/OONI/CensorshipDetectionTo...
[6] https://trac.torproject.org/projects/tor/wiki/doc/OONI/CensorshipDetectionTo...
[7] https://trac.torproject.org/projects/tor/wiki/doc/OONI/CensorshipDetectionTo...
[8] https://trac.torproject.org/projects/tor/wiki/doc/OONI/CensorshipDetectionTo...
[9] https://trac.torproject.org/projects/tor/wiki/doc/OONI/CensorshipDetectionTo...
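As a rough illustration of what reporting to such an HTTP API could look like from a third-party tool, the sketch below builds (but does not send) a gzip-compressed JSON POST using only the Python standard library, which also addresses the "compression is not enabled by default" point from the original comparison. The URL, the report fields, and the assumption that the backend accepts `Content-Encoding: gzip` on request bodies are all hypothetical.

```python
import gzip
import json
import urllib.request


def build_report_request(url, report):
    """Prepare a POST carrying a gzip-compressed JSON report."""
    raw = json.dumps(report).encode("utf-8")
    body = gzip.compress(raw)
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Tells the receiver the body must be gunzipped before parsing.
            "Content-Encoding": "gzip",
        },
    )


def decode_report_body(body, content_encoding):
    """Receiver side: undo the compression and parse the JSON."""
    if content_encoding == "gzip":
        body = gzip.decompress(body)
    return json.loads(body.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would actually send it; that step is omitted because no real endpoint is defined in this thread.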

On 07/18/2012 04:46 PM, Arturo Filastò wrote:
On 7/16/12 2:15 AM, Ondrej Mikle wrote:
On 07/15/2012 02:56 PM, Arturo Filastò wrote:
I would like to follow up on the discussion we had in Florence on some design choices behind OONIB.
In particular the most controversy was around using HTTP or rsync. [...]
# What properties we would like it to have note: these are not ordered.
* Efficient even over high latency networks.
* Ease of integration for third party developers.
* Expandable to support requirements of new tests we develop.
* Anonymous
* Secure
Even though you will probably not end up using this, it may be a good idea to know that it exists:
ZeroC Ice - http://www.zeroc.com/ice.html
There are a bunch of very fancy and nice libraries out there to do RPC like things that support a variety of languages. [...]
Moreover by looking at how the reporting systems of other network measurements tools worked [5] we found that almost all of them used an HTTP API to collect reports [6][7][8][9]. I see this as an indication that such a strategy is the best practice.
Since it's already implemented, it's reasonable to keep it that way.
For the time being we should go for something simple like this and once we encounter major scalability/performance bottlenecks we can quantify them and figure out what the best path to a solution may be.
Sure (though this transition is always a PITA once it is necessary).
If you were the developer of a censorship detection tool would you like to have to report to anything that is not a RESTful HTTPs API?
Hmm, for some reason I remembered there was some debate on stateful requirements on the API but can't seem to find it.

Ondrej
participants (7)
- Aaron
- Arturo Filastò
- Fabio Pietrosanti (naif)
- Isis
- Jacob Appelbaum
- Ondrej Mikle
- Tom Ritter