[tor-dev] Client simulation

Tue Jun 11 15:50:35 UTC 2013

On 6/10/13 4:40 AM, Karsten Loesing wrote:
>>> On 6/6/13 7:32 PM, Norman Danner wrote:
>>>> I have two questions regarding a possible research project.
>>>>
>>>> First, the research question:  can one use machine-learning techniques
>>>> to construct a model of Tor client behavior?  Or in a more general form:
>>>> can one use <fill-in-the-blank> to construct a model of Tor client
>>>> behavior?  A student of mine did some work on this over the last year,
>>>> and the results are encouraging, though not strong enough to do anything
>>>> with yet.
>>
>> The intent is that each cluster (represented by a single hidden Markov
>> model) represents a "type" of client, even though we don't know for sure
>> what that client type does.  We can make some guesses about some:  the
>> "type" of steady high-volume cell counts is probably a bulk downloader;
>> the "type" of steady zero cell counts is probably an unused circuit;
>> etc.  But in some sense, I'm thinking that what counts is the behavior
>> of the client, not the reason for that behavior.  We don't have to
>> instrument clients for this.  Of course, then one has to ask whether
>> this kind of modeling is in fact useful.  It is somewhat different than
>> what you are envisioning, I think.
>>
>> There are about a billion variations (at last count) on this theme.  We
>> chose one particular one as a test case to play with the methodology.  I
>> think the methodology is mostly OK, though I'm not completely satisfied
>> with the results of the particular variation Julian worked on.  So now
>> I'm trying to figure out whether to push this forward and in particular
>> what directions and end goals would be useful.
>
> Interesting stuff!  You're indeed taking a different approach than I
> was envisioning by gathering data on a single guard rather than on a
> set of volunteering clients.  Both approaches have their pros and cons,
> but I think your approach leads to some interesting results and can be
> done in a privacy-preserving fashion.
>
> Two thoughts:
>
> - I could imagine that your results are quite valuable for modeling
> better Shadow/ExperimenTor clients or for deriving better client models
> for Tor path simulators.  Maybe Julian's thesis already has some good
> data for that, or maybe we'll have to repeat the experiment in a
> slightly different setting.  I'm cc'ing Rob (the Shadow author) and
> Aaron (working on a path simulator) to make sure they saw this thread.
> I can help by reviewing code changes to Tor to make sure data is
> gathered in a privacy-preserving way, and I'd appreciate if those code
> changes would be made public together with analysis results.

I'm in the process of rewriting the data collection code, and will 
e-mail later with some of the details.  But maybe off-list initially, as 
I think the first few passes will be very special-purpose and hence not 
of general interest (though I'm happy to discuss it more publicly if 
that's more appropriate).

Right now I'm considering focusing on getting a reasonable (partial)
answer to the following question:  how well do various timing-analysis
attacks actually work?  That is, how well do they work when the client
model is "accurate"?  I'm not even sure exactly how to define
"accurate," though I can think of at least a few different ways.  But
I'm hoping that by focusing on a relatively narrow question, we can
identify manageable chunks of the broader questions of what kinds of
data can reasonably be collected and how we can use that data for other
purposes.
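
To make the question a bit more concrete, here is a minimal sketch of
the kind of experiment this suggests.  Everything in it is an assumption
made for illustration:  the two-state (idle/active) client model, the
per-bin cell counts, the fixed delay and random cell loss at the exit,
and the simple correlation-based matching are all hypothetical stand-ins,
not our data collection code and not any particular published attack.

```python
import random

def simulate_flow(rng, n_bins=200, p_start=0.1, p_stop=0.3, rate=20):
    """Cells per time bin from a hypothetical two-state client model."""
    active, counts = False, []
    for _ in range(n_bins):
        # Switch between idle and active with fixed probabilities.
        if active:
            active = rng.random() >= p_stop
        else:
            active = rng.random() < p_start
        counts.append(rng.randint(1, rate) if active else 0)
    return counts

def observe_at_exit(rng, counts, delay=1, drop=0.05):
    """Exit-side view: shifted by `delay` bins, each cell lost w.p. `drop`."""
    shifted = [0] * delay + counts[:len(counts) - delay]
    return [sum(1 for _ in range(c) if rng.random() >= drop) for c in shifted]

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def attack_accuracy(n_flows=50, delay=1, drop=0.05, seed=0):
    """Fraction of entry flows matched to the right exit flow."""
    rng = random.Random(seed)
    entry = [simulate_flow(rng) for _ in range(n_flows)]
    exits = [observe_at_exit(rng, f, delay, drop) for f in entry]
    correct = 0
    for i, flow in enumerate(entry):
        # The attacker is assumed to know the delay and aligns the series.
        aligned = flow[:len(flow) - delay]
        best = max(range(n_flows),
                   key=lambda j: pearson(aligned, exits[j][delay:]))
        correct += (best == i)
    return correct / n_flows

if __name__ == "__main__":
    print(attack_accuracy())
```

The interesting part would be replacing simulate_flow with a model
fitted to real observations and seeing how much the measured accuracy
changes.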

> - It might be interesting to observe how Tor usage changes over time.
> Maybe the research experiment leads to a set of classifiers telling us
> when a circuit is most likely used for bulk downloads, used for web
> browsing, used for IRC, unused, or whatever.  We could then extend
> circuit statistics to have all relays report aggregate data of how
> circuits can be classified.  Requires a proposal and code, but I could
> help with those.

Yes, I can see a number of longer-range applications like this.  I'm not 
sure I want to think about proposals and code just yet.

	- Norman

-- 
Norman Danner - ndanner at wesleyan.edu - http://ndanner.web.wesleyan.edu
Department of Mathematics and Computer Science - Wesleyan University