Search based blocking and network clustering

Nosferatu · #1 April 7th, 2002

Two common requests can be serviced by implementing one mechanism (OK, two mechanisms) (well, three would be nice).

Part 1

Give the user a control so they can drop searches they want to by policy. This will make people happy who object to some content on Gnutella, and who knows, might even have some impact on the more objectionable content.

I don't believe that default policies should be provided - that would be a form of censorship imposed by the developers. But if each user can choose exactly what terms they want to drop, if any, then this is democracy (or mob rule).
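A rough sketch of what Part 1 could look like inside a client, assuming queries arrive as plain text. The names `should_drop`, `handle_query`, `forward` and `dropped_log` are made up for illustration and do not come from any real Gnutella client. The drop list starts out empty, so nothing is blocked unless the user adds terms.

```python
import re

def should_drop(query, blocked_terms):
    """Return True if any word in the query is on the user's own drop list.

    blocked_terms is a set of lowercase words chosen by the user (hypothetical).
    """
    words = set(re.findall(r"[a-z0-9]+", query.lower()))
    return not words.isdisjoint(blocked_terms)

def handle_query(query, blocked_terms, forward, dropped_log):
    """Route the query onward unless the user's policy says to drop it."""
    if should_drop(query, blocked_terms):
        dropped_log.append(query)  # kept out of the normal search monitor
        return False
    forward(query)
    return True
```

With an empty `blocked_terms` set every query is forwarded, which matches the "no default policy" position above.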

Part 3

Some clients provide statistics on which search terms are being used, i.e. the number of searches seen for each individual word. This would be useful, as it lets the user target whatever it is they object to as efficiently as possible.
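The per-word statistics described here could be collected along these lines. This is only a sketch: `SearchTermStats` is a hypothetical name, and the tokenising rule (lowercase letters and digits) is an assumption, not what any particular client does.

```python
from collections import Counter
import re

class SearchTermStats:
    """Counts how many searches each individual word has appeared in."""

    def __init__(self):
        self.counts = Counter()

    def record(self, query):
        # A word repeated inside one search still counts as one search.
        words = set(re.findall(r"[a-z0-9]+", query.lower()))
        self.counts.update(words)

    def top(self, n=10):
        """Most frequently seen words, most common first."""
        return self.counts.most_common(n)
```

A user could look at `top()` to decide which terms, if any, are worth adding to a drop list.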

Part 2

As part of this blocking, there should be an answer packet which says 'I just dropped your search because I don't like it'.

In response, the client searching should drop the connection if it knows of another host who has not blocked the search, and connect to that host instead.

This way, clients who search for avi or mpeg or iso will evolve into a group separate from clients who block these searches.
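The searching client's side of Part 2 might be sketched as below. It assumes the client keeps a set of current connections, a list of other hosts it has heard about, and a record of which hosts have rejected this search; "has not blocked the search" is approximated as "has not been seen rejecting it". All names are hypothetical.

```python
def on_search_rejected(rejecting_host, connections, known_hosts, rejected_by):
    """React to an 'I dropped your search' answer.

    connections: set of hosts we are connected to (mutated in place)
    known_hosts: iterable of other hosts we know about
    rejected_by: set of hosts seen rejecting this search (mutated in place)
    Returns the replacement host, or None if we keep the connection.
    """
    rejected_by.add(rejecting_host)
    for candidate in known_hosts:
        if candidate not in connections and candidate not in rejected_by:
            # Swap the rejecting host for one that has not refused us.
            connections.discard(rejecting_host)
            connections.add(candidate)
            return candidate
    return None  # no better host known, so stay connected
```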

I am sure this topic will stimulate some debate

Meantime I'm trying to figure out how to program so I can implement it.

Nos

Smilin' Joe Fission · #2 April 7th, 2002

Quote:

Originally posted by Nosferatu
Part 1

Give the user a control so they can drop searches they want to by policy. This will make people happy who object to some content on Gnutella, and who knows, might even have some impact on the more objectionable content.

I don't believe that default policies should be provided - that would be a form of censorship imposed by the developers. But if each user can choose exactly what terms they want to drop, if any, then this is democracy (or mob rule).

Most clients already implement search filtering by keyword. What is the difference between that and your proposal?

Quote:

Part 2

As part of this blocking, there should be an answer packet which says 'I just dropped your search because I don't like it'.

In response, the client searching should drop the connection if it knows of another host who has not blocked the search, and connect to that host instead.

This way, clients who search for avi or mpeg or iso will evolve into a group separate from clients who block these searches.

I am sure this topic will stimulate some debate

Meantime I'm trying to figure out how to program so I can implement it.

Adding packet types would mean changing the Gnutella protocol. That cannot be done arbitrarily. Besides that, more packets flying around the network means more data to route. I think the main goal is to cut down on the amount of useless data that needs to be routed. Some of us have bandwidth limits.

And what good would come from separating the network exactly?

Let's not start trying to add politics to a network that's supposed to be politically neutral, shall we? Thanks.

Nosferatu · #3 April 7th, 2002

Quote:

Originally posted by Smilin' Joe Fission
Most clients already implement search filtering by keyword. What is the difference between that and your proposal?

Sorry, yes I wasn't explicit enough. I mean, drop the search, instead of routing it.

Quote:

Adding packet types would mean changing the Gnutella protocol. That cannot be done arbitrarily. Besides that, more packets flying around the network means more data to route. I think the main goal is to cut down on the amount of useless data that needs to be routed. Some of us have bandwidth limits.

Hmm ..

Quote:

And what good would come from separating the network exactly?

I'm not too sure myself, but it has been suggested several times. I guess the reasoning is generally that people interested in mp3s want to be close to people interested in mp3s, people interested in isos close to people interested in isos, to increase the number of useful search results and reduce bandwidth lost to uninterested parties.

Personally I am more interested in reducing the traffic of 'illegal' porn through my PC - I wouldn't advocate the use of gnutella to many people I know because I would be embarrassed.

Yes, I believe in free speech. I think this solution would be a good one: it would leave people with the right to publish what they like, but also give people the right not to take part in that publishing if they wish.

Nos

Smilin' Joe Fission · #4 April 8th, 2002

Quote:

Originally posted by Nosferatu
I'm not too sure myself, but it has been suggested several times. I guess the reasoning is generally that people interested in mp3s want to be close to people interested in mp3s, people interested in isos close to people interested in isos, to increase the number of useful search results and reduce bandwidth lost to uninterested parties.

But there are people like me who have both. How would your proposal handle this?

Quote:

Personally I am more interested in reducing the traffic of 'illegal' porn through my PC - I wouldn't advocate the use of gnutella to many people I know because I would be embarrassed.

Unless you download that sort of stuff directly, no files like that ever reach your PC. About the only things that travel to your PC are the searches, which your computer routes to other nodes even if you don't have the files that meet the search criteria. The file transfers themselves are made outside the Gnutella network (using HTTP) via a direct connection between the person requesting the file and the person hosting the file. Your PC has nothing to do with the file transfer between them.

It would be seriously destructive to the Gnutella network for some clients (or users of those clients) to arbitrarily decide that they don't want to forward queries just because they contain the word "porn" or something.

Quote:

Yes, I believe in free speech. I think this solution would be a good one: it would leave people with the right to publish what they like, but also give people the right not to take part in that publishing if they wish.

A user blocking another user's queries isn't proving anything. How are you giving others the right to publish what they want if you insist on blocking queries to it? How are you upholding the "rights" of others to search for something by blocking it just because YOU disapprove of it? Remember, all you know about is that they're searching... the transfers happen without your knowledge anyway, so why should you care what they're searching for? Also remember, neither you nor your PC is taking any part in the actual propagation of that sort of stuff on the network.

I think your idea would be better suited to a proprietary network where there are only 1 or 2 clients that exist to access it. Unless you can convince ALL other Gnutella client developers that your idea just rocks and to implement it (which I doubt would happen... you haven't yet convinced me it's worth implementing into my client), producing a client with features like that would be useless because there would be a dozen other clients out there that would be ignoring your efforts.

Nosferatu · #5 April 8th, 2002

Quote:

Originally posted by Smilin' Joe Fission
But there are people like me who have both. How would your proposal handle this?

I'm not. Unless you block a search, you are seen by the searching client as neutral.

Quote:

Unless you download that sort of stuff directly, no files like that ever reach your PC. About the only things that travel to your PC are the searches, which your computer routes to other nodes even if you don't have the files that meet the search criteria.

These are what I'm proposing to drop. Drop the searches, and don't display dropped searches in the usual 'search monitor'. Optionally have a separate screen which displays 'Dropped Searches'.

Quote:

It would be seriously destructive to the Gnutella network for some clients (or users of those clients) to arbitrarily decide that they don't want to forward queries just because they contain the word "porn" or something.

No it wouldn't - as I described, it would simply move the clients which want to search for those terms away from the ones which don't.

For HIGHLY UNPOPULAR terms such as lolita preteen xxx I expect it would have a significant impact on the time taken for searches to come back and for the client to find other clients who do not block those terms, assuming everyone installs a client which supports this system and configures the blocking.

But because of the pluralistic nature of the population of gnutella users, for most search terms it wouldn't be any kind of problem.

Quote:

A user blocking another user's queries isn't proving anything. How are you giving others the right to publish what they want if you insist on blocking queries to it?

They are welcome to serve the files from their PC, and pass the information that the files are served to anyone who cares.

Quote:

How are you upholding the "rights" of others to search for something by blocking it just because YOU disapprove of it?

They can search. As I describe above, if most people think their search is OK, most people will propagate it. Do you have any idea how many gnutella clients there are? It's in the tens to hundreds of millions. If one client blocks something, that still leaves a lot of clients which will pass on the request. I have slowed their search by some milliseconds.

Quote:

Remember, all you know about is that they're searching... the transfers happen without your knowledge anyway, so why should you care what they're searching for?

I am sick of the stuff.

Quote:

Also remember, neither you nor your PC is taking any part in the actual propagation of that sort of stuff on the network.

Yes, my PC is taking part by propagating the searches and the results. Yuk. I'm not talking about a mild aversion to naked women, although I think it is legitimate to question the amount of bandwidth used on porn in general.
Also, as I said, people would like to move clients closer together in some cases for some searches. Uses proposed have included language specific searches.

Once the grep searching ability is facilitated in gnutella searches, you will be able to block searches which do not contain a particular term.

Quote:

I think your idea would be better suited to a proprietary network where there are only 1 or 2 clients that exist to access it. Unless you can convince ALL other Gnutella client developers that your idea just rocks and to implement it (which I doubt would happen... you haven't yet convinced me it's worth implementing into my client),

You have a major problem understanding statistics and democracy and how the two interact.

If no one else thinks this is a good idea, not many clients will support it - maybe none if I don't get my finger out.

On the other hand, if it is a good idea, we will figure out how to do it and do it and the idea will spread - users will demand it or move clients.

Your opinion as an individual that 'the idea just plain sux' is only the idea of one person, and you speak for no one except yourself. Thanks for your opinion. Anyone else with the same opinion, consider yourselves spoken for already by Smilin' Joe.

Quote:

Producing a client with features like that would be useless because there would be a dozen other clients out there that would be ignoring your efforts.

I think the disconnect only needs to be done on a client who is directly connected as a 'host', and that <A HREF="http://rfc-gnutella.sourceforge.net/Proposals/BYE/_status.txt">when</A> the <A HREF="http://rfc-gnutella.sourceforge.net/Proposals/BYE/Bye.txt">BYE protocol extension</A> is implemented, it can be used for this.
I believe the appropriate protocol response would be 'relay improper queries (402)'.

So, many clients will understand what is being said to them (at least enough to display an error message to the user). BYE can even be retrofitted to 0.4 clients.

If a client does not understand, that's fine, it just won't necessarily have the same degree of effect in moving clients closer/further away based on searches. But they will be disconnected by my client every time they try an unwanted search .. which is not a great penalty unless it is being done by many many clients.
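The blocking host's decision could be sketched like this. It only models the decision, not the BYE packet on the wire. Treating `hops == 0` as "this query was originated by my direct neighbour" is an assumption about how hops are counted, and the function and constant names are hypothetical; the 402 value is the one suggested in this post.

```python
import re

BYE_RELAY_IMPROPER_QUERIES = 402  # code suggested above for this situation

def react_to_query(query, hops, blocked_terms):
    """Decide what to do with an incoming query.

    Returns ("forward", None), ("drop", None) or ("bye", 402).
    Only a directly connected searcher (assumed: hops == 0) is disconnected;
    blocked queries relayed from further away are silently dropped.
    """
    words = set(re.findall(r"[a-z0-9]+", query.lower()))
    if words.isdisjoint(blocked_terms):
        return ("forward", None)
    if hops == 0:
        return ("bye", BYE_RELAY_IMPROPER_QUERIES)
    return ("drop", None)
```

A client that does not understand the BYE code would simply see the connection close, as described above.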

Nos

<A HREF="http://www.sdf.se/~simon/marvin/songs/save_the_children.html">Who really cares?
Who's willing to try?</A>

Smilin' Joe Fission · #6 April 8th, 2002

Quote:

Originally posted by Nosferatu
But because of the pluralistic nature of the population of gnutella users, for most search terms it wouldn't be any kind of problem.

Until the system starts becoming abused. Yes, it will be abused.

Quote:

They can search. As I describe above, if most people think their search is OK, most people will propagate it. Do you have any idea how many gnutella clients there are? It's in the tens to hundreds of millions. If one client blocks something, that still leaves a lot of clients which will pass on the request. I have slowed their search by some milliseconds.

Have you personally counted all of the Gnutella clients? Have you counted how many of those clients are active at one time? I can pull a number out of my butt too, but unless I can guarantee its accuracy, it's meaningless. I think you're severely overestimating the number of active Gnutella clients out there.

Quote:

I am sick of the stuff.

So that gives you the right to break the network? Think of it this way... I'm connected to the network, but I just happen to be connected to 10 hosts that are searching for terms that you deem unwanted. These 10 hosts I'm connected to are connected to other clients which support search blocking. Even if I'm searching for something legitimate, under your proposal, the clients searching for inappropriate terms will be disconnected. I and the thousand or more other hosts, being unfortunate enough to connect to these 10 hosts now have an ever shifting, ever changing horizon because they're being disconnected all the time. So, you're not only hurting the ones initiating the improper searches, but you're also affecting the hundreds or even thousands of clients connected to them. You're constantly changing their horizons.

Quote:

You have a major problem understanding statistics and democracy and how the two interact.

I do huh? Et tu? I suppose you're the master of statistics. Do YOU have the "statistics" to support your claims? Didn't think so. So I'm telling you what I think will happen to the network if your proposal is implemented. Unless you have the data to back up your claims, all you're doing is saying what you THINK will happen if your proposal is implemented.

Quote:

On the other hand, if it is a good idea, we will figure out how to do it and do it and the idea will spread - users will demand it or move clients.

Fine. Get back to me when that happens.

Quote:

Your opinion as an individual that 'the idea just plain sux' is only the idea of one person, and you speak for no one except yourself. Thanks for your opinion. Anyone else with the same opinion, consider yourselves spoken for already by Smilin' Joe.

In other words "Don't bother saying anything unless you agree with me." That sure makes for a great debate.

Nosferatu · #7 April 8th, 2002

Quote:

Originally posted by Smilin' Joe Fission
Until the system starts becoming abused. Yes, it will be abused.

Well, this is an opinion, not a fact. Below you ask me to present hard statistics.
Can you at least tell me what type of abuse you think is inevitable?

Quote:

Have you personally counted all of the Gnutella clients?

Why do you want me to do it personally? You trust me more than other sources?!

Quote:

Have you counted how many of those clients are active at one time? I can pull a number out of my butt too, but unless I can guarantee its accuracy, it's meaningless. I think you're severely overestimating the number of active Gnutella clients out there.

OK, looks like you got me - I was guessing.
Here is <A HREF="http://www.limewire.com/index.jsp/size">the real figure - it's around 300k - 1/2 million at the moment.</A>
I guess their guess is better than mine. It's around 30 times the size of the generally accepted average horizon size.

Quote:

So that gives you the right to break the network?

Break is your opinion, not mine. Can you at least put forward an argument to show how it will be considered 'broken' after this change in a way that it is not 'broken' now, or to a much larger degree?

Quote:

Think of it this way... I'm connected to the network, but I just happen to be connected to 10 hosts that are searching for terms that you deem unwanted. These 10 hosts I'm connected to are connected to other clients which support search blocking. Even if I'm searching for something legitimate, under your proposal, the clients searching for inappropriate terms will be disconnected. I and the thousand or more other hosts, being unfortunate enough to connect to these 10 hosts now have an ever shifting, ever changing horizon because they're being disconnected all the time.

OK, the scenario you have proposed seems unlikely to me and the logic is inconsistent.

You are assuming that the 10 randomly selected hosts you have chosen are ALL searching for something widely considered inappropriate. This is already unlikely, but no doubt will happen very very occasionally.

By definition, because this term is widely considered inappropriate, and as you say you are searching for only things widely considered legitimate, then the next ten hosts you pick up are pretty much guaranteed to accept your searches and connections.

Quote:

So, you're not only hurting the ones initiating the improper searches, but you're also affecting the hundreds or even thousands of clients connected to them.

Degree of likelihood that all clients connected to one of these 'naughty' guys are ALL connected to ALL ten of them and no one else: 0.00000000..etc..00001 %
Considered 'impossible'.
Even if the impossible happened, all that would be experienced is everyone would for 10 seconds to a minute be searching for 10 new hosts. Since the other 'hundreds or even thousands of clients' who are in your impossible scenario all connected to these 10 'naughty' hosts are all receiving pongs through the 'naughty' hosts up until the time that the 'naughty' guys perform their 'naughty' search, they will already have knowledge of a great number of 'nice' hosts, so they should find a new one without even having to visit a host cache.

Remember, the above is not going to happen.

'Naughty' searchers are going to appear rarely, one at a time.

There is a scenario where what you describe is going to happen, which will be during start up, if a very wide number adopt the strategy of 'specialising' their searches.

Let's look at it this way. I will try to describe a reasonable, but worst-case, scenario, where you the user are not searching for something considered inappropriate by most people.

Say a very high number of people think specialist searches implemented using grep is a good idea, ie search me for iso files only, or search me for mp3s only. What do you think the upper limit would be? 40% of people might think this way? I think that is a very very conservative figure.

OK, for a very back-of-the-envelope kind of figure,

n * p = t
where
n: total number of trials
p: probability that a connection will not reject you when you search
t: target number of host connections

gives
n = t/p

If we say p = 0.6 (ie 60% 'good' connections, 40% 'bad')
and you want to keep up 10 connections,
n = 10/0.6
n = 16.67

On average, a user searching for something which the specialists reject, has to talk to 17-odd hosts at startup in order to establish 10 good hosts.

This does not consider any additional host-rejection scenarios.

We can generalise the answer by setting t = 1:
n = 1/0.6
n = 1.67
You have to connect to, on average, about 1.7 times as many hosts, if there are 40% of people wanting to specialise and you search for something else. 40% is an astonishingly high proportion, and you have said yourself that you don't think this idea will take off at all.

How about if we assume 20%, still a very high figure, but perhaps a realistic high-point.
p = 0.8
n = 1/0.8
n = 1.25

Only a 25% increase in number of initial host connections required at startup.
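The back-of-the-envelope figures above can be reproduced with a few lines. This is just the n = t/p formula from the post; `expected_attempts` is a made-up name, and it says nothing about how much the real number varies around the average.

```python
def expected_attempts(target_hosts, p_accept):
    """Average number of hosts to try before target_hosts of them accept.

    Uses n = t / p, where p is the chance that a host does not reject you.
    """
    if not 0 < p_accept <= 1:
        raise ValueError("p_accept must be in (0, 1]")
    return target_hosts / p_accept

# The two scenarios discussed above: 40% and 20% of hosts rejecting you.
for blocking_share in (0.4, 0.2):
    p = 1 - blocking_share
    print(f"{blocking_share:.0%} blocking: "
          f"{expected_attempts(10, p):.2f} tries for 10 hosts, "
          f"{expected_attempts(1, p):.2f} tries per host")
```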

And as I said before, this ignores the effect of hosts caching hosts who are similar to themselves. (Perhaps this effect would be insignificant anyway until you have done a few searches).

Anyway, I guess this means it might be a good idea if when rejected by a host, if you have plenty of hosts in your cache, that you delete the rejecting host from your cache, thus increasing the chances that clients cache hosts with similar search/drop criteria.
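A minimal sketch of that cache rule, assuming the host cache is a plain set of addresses. The function name and the `min_cache_size` threshold (standing in for "plenty of hosts") are hypothetical and would need tuning.

```python
def forget_rejecting_host(host, host_cache, min_cache_size=20):
    """Remove a host that rejected our search, but only if the cache is well stocked.

    host_cache: set of known hosts (mutated in place)
    Returns True if the host was removed.
    """
    if host in host_cache and len(host_cache) > min_cache_size:
        host_cache.remove(host)
        return True
    return False  # cache is too small to be picky, keep the host
```

Over time this should bias each client's cache toward hosts with compatible search/drop criteria, which is the effect described above.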

Quote:

You're constantly changing their horizons.

Please demonstrate how this is a bad thing. Many of the clients I have tried have constantly changing hosts, ie constantly changing horizons. So what? So the clients I search in ten minutes are different clients to the ones I searched ten minutes ago? I think this is a good thing.

I am more likely to find a result, eventually. I re-search every ten minutes, and get a different group of results. I can still download from most of the machines I located ten minutes ago, if I don't find anything. The only ones I might not be able to download from are firewalled IPs.

Quote:

I do huh? Et tu? I suppose you're the master of statistics. Do YOU have the "statistics" to support your claims?

Maybe. Depends how badly you want them, how much time I have, and whether someone else provides an answer first.

<I>Added later: oops - confused between statistics and probability. No I do not have the statistics, but can model guessed probabilities - see later posting</I>

Quote:

Didn't think so.

You should let people answer before you answer them back.

Quote:

So I'm telling you what I think will happen to the network if your proposal is implemented. Unless you have the data to back up your claims, all you're doing is saying what you THINK will happen if your proposal is implemented.

Fine. Get back to me when that happens.
...
In other words "Don't bother saying anything unless you agree with me." That sure makes for a great debate.

Well, I said what I think, you said what you think. I was simply saying: people, don't flood the channel if you can't add anything not already covered. That makes for a good debate too.

Nos

"We can't train that boy as a Jedi because he is too old and too
full of fear"

#8 April 8th, 2002

Quote:

Originally posted by Smilin' Joe Fission How are you giving others the right to publish what they want if you insist on blocking queries to it? How are you upholding the "rights" of others to search for something by blocking it just because YOU disapprove of it?

"Let's not start trying to add politics to a network that's supposed to be politically neutral, shall we? Thanks."

Where do people get the idea that Gnutella isn't political? Read this thread and tell me it ain't!

#9 April 8th, 2002

Quote:

Give the user a control so they can drop searches they want to by policy. This will make people happy who object to some content on Gnutella, and who knows, might even have some impact on the more objectionable content.

I don't believe that default policies should be provided - that would be a form of censorship imposed by the developers. But if each user can choose exactly what terms they want to drop, if any, then this is democracy (or mob rule).

While GNet is indeed ridden with things that don't belong there (or shouldn't be in anyone's possession in the first place), blocking the actual traffic is not a good idea.

First, and foremost, it'll give anyone "control" over the content seen by others - this is one of the things that's been bugging many developers: if you can control, you can also be asked to control specific files. One can endlessly discuss that particular issue, but basically, a "hands-off" approach is the most appropriate in this case.

Second, the network isn't meant to block certain traffic - it's an open and free protocol. Blocking certain traffic is vendor specific, and that can lead into the great debates you've seen elsewhere on this forum (commercial vs. non-commercial, et al.)

It is better to let the end-user decide what he/she is willing to see, for example the "Family Filters" seen in some clients. Obviously, some developers should make that password protected. Think of GnutellaNet as an Internet atop the Internet. The Internet is an ungoverned place - things you find on GnutellaNet can also be found on the Internet itself, however distasteful that content may be. But as with the Internet, it is at your sole discretion to block/avoid these things, not the maker of an Internet browser.

-- Mike

PS: Mods!! (Morgwen, Cyclo) - for some reason my account is "disabled", I was unable to post with my actual account and had to go as Unregistered. I couldn't even PM you two - wassup?

Nosferatu · #10 April 9th, 2002

OK, the statistical math is very hard going and I'd probably get it wrong and you probably wouldn't understand it (even if you do understand statistical maths!)

Using the Binomial calculator at
http://www.anu.edu.au/nceph/surfstat...me/tables.html

I can quickly plug in n=17 and p=0.6 as determined previously and find out the standard deviation: 2
So we can say, for the horror scenario where 40% of people disallow searches that aren't for some specific resource and you aren't searching for that specific resource, that
5% of the time you get 10 hosts in under 13 tries
33% of the time you get 10 hosts in under 15 tries
67% of the time you get 10 hosts in under 19 tries
95% of the time you get 10 hosts in under 21 tries

For a still fairly bad situation where 20% of people disallow .. blah blah blah ..
by plugging in n=12, n=13 (the mean is 12.5) and p=0.8 as determined earlier, and finding that the standard deviation is 1.4, so
5% of the time 10 hosts in under 10 tries
33% of the time 10 hosts in under 11 tries
67% of the time 10 hosts in under 14 tries
95% of the time 10 hosts in under 15-16 tries
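For anyone wanting to check figures like these without the online calculator, here is a sketch using a different approach from the one above. The standard deviations of 2 and 1.4 describe how many hosts accept you in a fixed number of tries (a binomial), whereas the number of tries needed to collect 10 accepting hosts follows a negative binomial, so this code works with that instead. It assumes each host accepts independently with the same probability p; the function names are made up.

```python
from math import comb, sqrt

def prob_done_within(tries, target, p):
    """Chance of having at least `target` accepting hosts after `tries` attempts."""
    return sum(comb(tries, k) * p**k * (1 - p)**(tries - k)
               for k in range(target, tries + 1))

def attempts_mean_and_sd(target, p):
    """Mean and standard deviation of the number of tries needed (negative binomial)."""
    mean = target / p
    sd = sqrt(target * (1 - p)) / p
    return mean, sd

for p in (0.6, 0.8):
    mean, sd = attempts_mean_and_sd(10, p)
    print(f"p={p}: mean {mean:.1f} tries, sd {sd:.1f}")
    for tries in range(10, 26):
        print(f"  within {tries} tries: {prob_done_within(tries, 10, p):.1%}")
```

Under these assumptions the spread comes out at roughly 3.3 tries for p = 0.6 and 1.8 for p = 0.8, somewhat wider than the binomial figures, and getting 10 hosts in fewer than 10 tries is of course impossible.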

I couldn't find any online application which will graph these outcomes in a useful way.

I wonder whether any of the big commercial vendors have their own gnutella network modellers. If so, they could figure out better what would happen. I guess they wouldn't tell us though.

I wonder if there is a project yet to write a gnutella network model? It would be useful for exploring proposed protocol modifications, and I guess not much different from writing a client.

The hard part would be writing analysis routines to make the data meaningful.
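As a very rough idea of what such a model might look like, here is a toy sketch. It is nothing like a real client: the network is a random graph, queries are flooded with a TTL, and blocking hosts see a query but do not pass it on. Reconnection, pongs and host caches are not modelled, and every name and default parameter here is an assumption picked for illustration.

```python
import random
from collections import deque

def build_network(n_hosts, degree, rng):
    """Random undirected graph: each host opens `degree` links to random others."""
    neighbours = {h: set() for h in range(n_hosts)}
    for h in range(n_hosts):
        for other in rng.sample(range(n_hosts), degree):
            if other != h:
                neighbours[h].add(other)
                neighbours[other].add(h)
    return neighbours

def query_reach(neighbours, blockers, origin, ttl):
    """Number of hosts that see a query flooded from `origin`.

    Blocking hosts receive the query but do not forward it.
    """
    seen = {origin}
    frontier = deque([(origin, ttl)])
    while frontier:
        host, remaining = frontier.popleft()
        if remaining == 0 or (host in blockers and host != origin):
            continue
        for nxt in neighbours[host]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, remaining - 1))
    return len(seen) - 1  # do not count the searcher itself

def average_reach(n_hosts=2000, degree=4, ttl=7, blocked_share=0.2,
                  trials=50, seed=1):
    """Average reach over random non-blocking searchers."""
    rng = random.Random(seed)
    neighbours = build_network(n_hosts, degree, rng)
    blockers = set(rng.sample(range(n_hosts), int(n_hosts * blocked_share)))
    origins = [h for h in range(n_hosts) if h not in blockers]
    total = sum(query_reach(neighbours, blockers, rng.choice(origins), ttl)
                for _ in range(trials))
    return total / trials
```

Comparing `average_reach(blocked_share=0.0)` with higher shares would give a first feel for how much blocking shrinks a search horizon, before worrying about the analysis routines.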

Nos
