Gnutella Forums

Gnutella Forums (https://www.gnutellaforums.com/)
-   General Gnutella Development Discussion (https://www.gnutellaforums.com/general-gnutella-development-discussion/)
-   -   Proposal for development of Gnutella (hashs) (https://www.gnutellaforums.com/general-gnutella-development-discussion/6969-proposal-development-gnutella-hashs.html)

Unregistered January 5th, 2002 03:59 PM

Proposal for development of Gnutella
 
When a servent processes a query and returns the available filename and filesize, it should also return a hash code.

This would allow downloading from multiple sources even when users rename the original file.

A good hash (20 bytes) combined with a filesize check should avoid "false duplicates".

Marc.
marc@szs.ca

Moak January 5th, 2002 04:40 PM

yep, I suggest/vote for it too. :)

There is already a well-documented proposal named 'HUGE' [1]. From its "Motivation & Goals":

o Folding together the display of query results which represent the exact same file -- even if those identical files have different filenames.

o Parallel downloading from multiple sources ("swarming") with final assurance that the complete file assembled matches the remote source files.

o Safe "resume from alternate location" functionality, again with final assurance of file integrity.

o Cross-indexing GnutellaNet content against external catalogs (e.g. Bitzi) or foreign P2P systems (e.g. FastTrack, OpenCola, MojoNation, Freenet, etc.)

[1] "HUGE" - http://groups.yahoo.com/group/the_gd...roposals/HUGE/ (Yahoo account required)

Unregistered January 6th, 2002 06:20 AM

Could be simple.
 
The HUGE thing looks complicated for no reason. The risk of making an error on duplicate files is close to impossible (1 / (billions * billions)) with a 160-bit hash and a filesize check. You simply do the download and check that the received file matches the hash...

And it's quite simple to code a component that can download from multiple sources (swarming). You simply test the servers for resume capability, split the file to download into blocks, and create threads that request these blocks from different servers; you can even create multiple threads (connections) per server.
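Here is a minimal sketch of that block-based "swarm" download in Python rather than Delphi/Indy, assuming every source is a plain HTTP server that supports Range requests; the hostnames, block size and file size are made up for illustration.

import urllib.request
from concurrent.futures import ThreadPoolExecutor

SOURCES = ["http://host-a.example/file.bin", "http://host-b.example/file.bin"]
BLOCK = 256 * 1024                       # request the file in 256 KB blocks
FILESIZE = 4 * 1024 * 1024               # known from the query hit

def fetch_block(job):
    index, url = job
    start = index * BLOCK
    end = min(start + BLOCK, FILESIZE) - 1
    req = urllib.request.Request(url, headers={"Range": "bytes=%d-%d" % (start, end)})
    with urllib.request.urlopen(req) as resp:     # one thread per outstanding block
        return index, resp.read()

jobs = [(i, SOURCES[i % len(SOURCES)]) for i in range((FILESIZE + BLOCK - 1) // BLOCK)]
with ThreadPoolExecutor(max_workers=4) as pool, open("file.bin", "wb") as out:
    for index, data in pool.map(fetch_block, jobs):
        out.seek(index * BLOCK)                   # blocks may come back from different servers
        out.write(data)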

In order to improve download/connection speed, each client should keep a list of other clients that have the same file and reply not only with its own IP but also with the IPs of others that can provide the same file. This could be done if hubs (supernodes) are inserted into the network. They could scan for duplicate files!

I have already programmed a swarming component in Delphi and it's working well. I will now work on adding on-the-fly add/remove of download sources.

If anyone wants to work on it, let me know and I will send you the sources. It uses Indy for TCP access.

Marc.
marc@szs.ca

Moak January 6th, 2002 12:27 PM

I thought HUGE is simple and flexible?

It does explain a lot of basics and also details how to include hashes in binary Gnutella messages (did you notice that you have to encode 0x00, and that compatibility with other/older clients is guaranteed)? If you think it can be done more easily, write a paper... I prefer easy solutions. :-)

PS: About what you said on swarming and a list of alternative downloads: yes, this is another advantage once we have hashes and Queryhit caches. I'm a big fan of superpeers, hashes, swarming and multisegmented downloading. :-)

veniamin January 6th, 2002 01:04 PM

I am not sure, but I think a CRC could do the job. For each file in a query hit we can put its CRC between the two nulls, like Gnotella does for MP3 files.

Moak January 6th, 2002 01:31 PM

You can do that with HUGE. It also describes the encoding between the two nulls, and then the new GET request. I think it prefers SHA1 for the hash, but which one you use is flexible: CRC, MD5...

The question I have: which is the best algorithm? Can someone give a summary/overview? Hmm, it should be unique enough within a typical horizon (high security is not the topic), small in size (to keep broadcast traffic low), and fast to calculate.

Unregistered January 6th, 2002 05:00 PM

Follow up
 
About HUGE: when I said HUGE looks complicated, I meant that, from what you tell me, it's more about verification of data integrity.

I prefer less verification and better speed (a smaller protocol), as long as the verification is good enough.

About CRC: it's really not a good idea to use CRC-16 or CRC-32. The latter gives only 4 billion values, which is not enough; you could get false duplicate files. SHA1 uses 20 bytes (160 bits), which gives a lot more possibilities. To give an idea, it would be around 4 billion * 4 billion * 4 billion * ... You get the point: with this number of possibilities you reduce the chance of producing false duplicates.

SHA1 speed is fast enough (over 1 MB/sec), but that's not so important: a client program could generate all the SHA1 hashes at startup and cache this information in memory; 1000 files would require only 20 KB of memory. Computing a hash for each query would not be a good idea...
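A minimal sketch of that startup indexing in Python, assuming the shared files live in a directory called "shared" (the directory name is made up):

import hashlib, os

def sha1_of(path, chunk=64 * 1024):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    return h.digest()                    # 20 raw bytes per file

shared = {}
for name in os.listdir("shared"):
    path = os.path.join("shared", name)
    if os.path.isfile(path):
        # key on (filesize, digest) so the filesize check comes for free
        shared[(os.path.getsize(path), sha1_of(path))] = path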

Marc

Moak January 6th, 2002 08:04 PM

hmm
 
What do you mean by 'verification'?
The HUGE goals describe exactly what we really need today: a) efficient multisegmented downloading/caching (grouping identical files together from Queryhits for parallel downloads or Query caches); b) efficient automatic requerying (finding alternative download locations).

I agree, the protocol should be as small as possible.
While you agree with SHA1 (I still have no clue about the advantages/disadvantages of CRC-xx, SHA1, MD5, TigerTree etc.), what could be done better than described in the HUGE paper? I think HUGE is pretty simple. It describes hash positioning in Query/Queryhits and the necessary HTTP headers. Then it encodes the hash to make it fit into Gnutella traffic (nulls inside the hash must be encoded!) and also into HTTP traffic. For example, the well-known 'GnutellaProtocol04.pdf' becomes 'urn:sha1:PLSTHIPQGSSZTS5FJUPAKUZWUGYQYPFB'.
Perhaps you don't agree with the Base32 encoding of the hash; what could be done better?
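A minimal sketch of how a raw SHA1 digest becomes such a HUGE-style "urn:sha1:" string via Base32 in Python (the file name is only an example):

import base64, hashlib

with open("GnutellaProtocol04.pdf", "rb") as f:
    digest = hashlib.sha1(f.read()).digest()          # 20 bytes = 160 bits

urn = "urn:sha1:" + base64.b32encode(digest).decode("ascii")
print(urn)   # 32 Base32 characters, e.g. urn:sha1:PLSTHIPQGSSZTS5FJUPAKUZWUGYQYPFB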

CU, Moak

Unregistered January 7th, 2002 05:55 AM

Follow up
 
I know nothing about HUGE; I simply got this from your previous post:

> Safe "resume from alternate location" functionality, again with final assurance of file integrity.

For me "final assurance" means that once the download is complete you must do some kind of block check against the original source, with multiple CRCs, to verify that all the blocks received match the original file.

This is what I call "final assurance". Like I said, I don't know HUGE; what I'm proposing is "not final assurance": only perform a SHA1 lookup and download the file from all matching sources, without performing a check at the end of the transfer. If HUGE is doing this then it can't claim "final assurance of file integrity", but it's exactly what I want to do.

To have "final assurance" would use too much bandwidth; performance would be better with a small risk of a corrupted file, and if that risk is in the range of 1/10000000000000000 it sounds OK to me.

I will try to take the time and check the HUGE principles.

CRC vs SHA1: CRC is as good as SHA1 for randomly picking a number corresponding to some data. But SHA1 adds security: SHA1 was built so that it's impossible to recreate the original data from the hash key (good for password storage). And of course it generates a larger number, since it's a 20-byte key vs. 4 bytes for CRC-32.

Marc.

Tamama January 7th, 2002 05:57 AM

Base32?
 
The only thing I find somewhat weird about HUGE is that the SHA1 is Base32 encoded. This means only 5 bits of an 8-bit byte are used. Just doesn't make sense... oh well.

The GET request is somewhat strange as well... a simple:

GET urn:sha1:452626526(more of this ****)SDGERT GNUTELLA/0.4

would work just as well...

Some thoughts..

Moak January 7th, 2002 11:19 AM

Just a question again: you don't agree with the Base32 encoding of the hash; what could be done better?

Unregistered January 7th, 2002 05:55 PM

Tamama,

Hi, Base32 does not reduce the number of bits; it changes the way they are displayed/sent.

For example:

2 is "10" in base 2. Both "2" and "10" contain the exact same information; it's just displayed in another way. Base32 and Base64 conversions are used to ensure that data can travel to a different OS/computer and still have the same value.

Moak January 7th, 2002 06:21 PM

wrong
 
Base32 codes 5 bits into one character (8 bits).

Unregistered January 9th, 2002 06:30 AM

Wrong? I guess you don't understand.

Yes, Base32 uses 5 bits per character, but when it encodes 11111111 (255) it does not scrap the remaining 3 bits; like I previously explained, it only changes how they are sent.

In this case 255 would be sent as 00011111 then 00000111.

So sending those two bytes in Base32 is the same as sending the byte 255 in Base256: the result is the same, the value is 255.

Changing the base DOES NOT CHANGE THE VALUE; it changes how it's displayed. Otherwise it's not a base change, it's a value change.
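A small illustration of both points in Python: the value survives the round trip unchanged, but each Base32 character carries only 5 bits, so a 20-byte hash grows to 32 ASCII characters on the wire (the all-ones hash is made up for the example):

import base64

raw = bytes([255]) * 20                  # a made-up 20-byte hash, all bits set
encoded = base64.b32encode(raw)          # 32 ASCII characters
assert base64.b32decode(encoded) == raw  # same value after the round trip
print(len(raw), len(encoded))            # 20 32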


Marc.

Moak January 9th, 2002 08:01 AM

plz look into existing code

Moak January 11th, 2002 01:14 AM

Hashes in Queries (SMALL != HUGE)
 
Hi,
I thought more about hashes in Gnutella Queries. I personally think they should be as small as possible, because I expect increased Query/Queryhit traffic from new clients with features like automatic resume and multisegmented downloads. While automatic requeries are a key technology for those features, the Query traffic, especially for hashes, will increase. Perhaps it will also be necessary to group multiple searches together into a single message (multiple searches in one Gnutella Query, to avoid repeatedly sending the Gnutella descriptor header, 23 bytes + more repeated payload). A small query/hash will be necessary in my eyes, as small as possible.

Different people have different ideas of a small hash. It should still be unique enough to fit our needs; AFAIK the common suggestions are:

* CRC-32, size 32 bit (256 hashes/KB *) [1]
* MD5, size 128 bit (64 hashes/KB) [2]
* SHA1, size 160 bit (51.2 hashes/KB) [3]
* Tiger, size 192 bit, or truncated to 128 or 160 (42.6 hashes/KB) [4]
* Snefru, size 256 bit (32 hashes/KB) [5]

I'm not sure about which hash to use (or prefer). There seems to be nothing between 32 bits (CRC-32) and 128 bits (MD5) in length. CRC-32 will not be unique enough within a typical Gnutella horizon, so better to start with MD5 or higher. Is it possible to truncate a big hash to e.g. 64 bits, and does this make sense? I'm not familiar with cryptography, this is only a short summary... perhaps someone else wants to add some more qualified comments? :-)

At least the hash should IMHO be pure binary inside the query (not Base32 encoded, which blows up the size again); in HTTP headers it might be Base-whatever encoded to gain the highest HTTP/1.x compatibility. I think indexing speed is secondary [6]. Indexing local shared files can be performed in the background on first startup (meanwhile the client does not answer with its own hashes, but can already search for them).

* = pure binary hash; descriptor headers and other protocol overhead not included

[1] CRC-32 - http://www.ietf.org/rfc/rfc1510.txt (ISO 3309)
[2] MD5 - http://www.faqs.org/rfcs/rfc1321.html
[3] SHA1 - http://www.faqs.org/rfcs/rfc3174.html
[4] Tiger - http://www.cs.technion.ac.il/~biham/Reports/Tiger/
[5] Snefru - http://www.faqs.org/faqs/cryptography-faq/part07/
[6] Hash Indexing Speed - http://groups.yahoo.com/group/the_gdf/message/1970

Moak January 12th, 2002 12:42 AM

PS: Some people suggest combining a small CRC hash with the filesize (which is already in Queryhits, not in Queries or other Gnutella descriptors)... let's play around with this idea. This would be a 32-bit hash + 32-bit filesize (taken from Gnutella protocol v0.4) = a 64-bit key to use. Perhaps it would be more unique to use a real 64-bit hash instead of the 64-bit CRC+filesize combo, e.g. a truncated MD5?

Here an overview of minimum possibilities:
* CRC-32, size 32 bit
* CRC-32 filesize combo, size 64 bit
* MD5 truncated, size 64 bit
* MD5, size 128 bit

Notes: I have chosen MD5 in this case because it is the smallest and fastest compared to the other hashes (SHA1, Tiger, Snefru). CRC-32 alone is too small. The CRC-filesize combo might be enough; the truncated 64-bit MD5 might be mathematically more unique, while it wastes 32 bits of information in Queryhits (not in Query, GET, PUSH). The next higher alternative is a 128-bit MD5; e.g. FastTrack uses an MD5 hash AFAIK.
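A minimal sketch of the two 64-bit candidates in Python, a CRC-32 + filesize combo versus an MD5 digest truncated to 8 bytes (the file name is made up for illustration, and the filesize is packed as the 32-bit field of protocol v0.4):

import hashlib, os, struct, zlib

path = "song.mp3"
with open(path, "rb") as f:
    data = f.read()

crc_combo = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF, os.path.getsize(path))
md5_trunc = hashlib.md5(data).digest()[:8]

print(crc_combo.hex(), md5_trunc.hex())  # both are 8-byte (64-bit) keys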

I'm not sure if a minimum alternative is the best solution for Gnutella's future. Perhaps a 64-bit key makes us happy now; in the future, with more superpeers and bigger horizons, we might want a bigger hash (MD5 or SHA1)?

A possibility could be an encoding à la HUGE: the hash has a prefix telling the hash type. For binary Gnutella messages (Query/Queryhits) this could be a payload like: byte 0 = hash type, remaining bytes = binary hash. The protocol defines a list of known hashes; since clients need a common solution, this list will be short, e.g. start with the CRC-filesize combo today and use SHA1 in the future. In HTTP-like Gnutella messages we can work with encoded hashes (not binary), similar to the HUGE proposal [1].
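A minimal sketch of that "type byte + binary hash" payload in Python; the type numbers are made up for illustration:

import hashlib

HASH_CRC32_FILESIZE = 0x01
HASH_SHA1 = 0x02

digest = hashlib.sha1(b"example file contents").digest()
payload = bytes([HASH_SHA1]) + digest              # 1 type byte + 20 bytes of binary hash

hash_type, binary_hash = payload[0], payload[1:]   # how a receiver would split it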

Conclusion: I have none. :) I suggest implementing and testing a minimum solution (CRC-32-filesize combo) and a bigger hash (MD5 or SHA1) for a while. With more experience in a real-world environment we can hopefully find a suitable solution. Feedback, tests and mathematical analysis are welcome!

[1] "HUGE" - http://groups.yahoo.com/group/the_gd...roposals/HUGE/ (Yahoo account required)

Moak January 12th, 2002 03:06 PM

PPS: Here is a summary of why to use Base32 or Base64 encoding in HTTP-style requests/headers.

------ snip ------
From http://groups.yahoo.com/group/the_gdf/message/2442:
- We could choose any encoding, but...
- Base32 is useful for compatibility with URLs and domain names, so...
- We might as well use it in protocol-fields, saving extra conversions and developer inconvenience.
------ snap ------

Which sounds logical... BUT... we could at least also use Base64 after a '?' in the location/URN. An HTTP GET could look like this:

Definition: GET /get/hash?[URN] HTTP/1.0
Base32: GET /get/hash?sha1:BCMD5DIPKJJTG2GHI2AZ9HG7HZUN5ZPH HTTP/1.0
Base64: GET /get/hash?sha1:/9n6YmKqNRmcLIiKC+2xRccm68 HTTP/1.0

Right now I prefer Base64 for HTTP encoding (it's smaller than Base32), binary encoding inside binary Gnutella messages, and a smaller hash than SHA1.
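A quick comparison of the three encodings for one SHA1 hash, as a Python sketch:

import base64, hashlib

digest = hashlib.sha1(b"example").digest()
print(len(digest))                        # 20 bytes raw (binary, inside Gnutella messages)
print(len(base64.b32encode(digest)))      # 32 characters (Base32, as in HUGE)
print(len(base64.b64encode(digest)))      # 28 characters (Base64, including '=' padding)

Note that standard Base64 output can contain '+' and '/', which may need escaping when placed in URLs.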

veniamin January 27th, 2002 01:54 PM

If a client responds to a query with query hits that each carry a URN, then a lot of bandwidth is being wasted.

What I want to say is that for some files a client should not include a URN in the query hit.
For example, text files are usually small files which can be downloaded again in case of an incomplete download; there is no need to apply a URN to a text file. Not just .TXT files but also .HTM, .XML, .DOC and other text files. Another reason is that if you just alter a text file (e.g. add a comma) then the hash changes, so many users can have (almost) the same text file, yet not exactly identical, and send different URNs.
URNs should be used only for binary files, but not for every binary file either: for small binary files (800~1000 KB) a client should not reply with a URN.

Also, when a client searches for the same file on other servers it should use the size of the file as well, not just the hash. This way we can use an algorithm smaller than SHA1. If you have two binary files with different file sizes then there is no chance they are the same (or one of them is corrupted). :D

Unregistered January 27th, 2002 02:51 PM

Confused: do you really mean "Uniform Resource Names" (URNs) or simply "hashes"?

gnutellafan January 28th, 2002 04:29 AM

There is a standard!
 
Developers cannot use whatever hash algorithm they feel like. All developers must use the same one or it defeats the point. SHA1 is the one agreed on by the GDF and should be used by all clients.

SHA1 was chosen because it would be difficult to create a fake file with a hash that matches something else.

TigerTree can be used in combination with SHA1 and would be an excellent way to provide support for the sharing of partial files.

Moak January 28th, 2002 05:27 AM

Re: There is a standard!
 
I don't agree.
Gnutella is a protocol in development, and while I do not agree with some GDF ideas, I prefer collecting/evolving ideas and suggesting new ones. In this particular case, the GDF HUGE proposal itself is still a proposal in flux, and it also does not prescribe SHA1. Don't claim something is a standard when it isn't.

> SHA1 was used because it would be difficult to create a fake file with the hash the matches something else.

That's no argument. You give me a SHA1 hash, I send back junk data. It has often been discussed that hashes are NOT used for security reasons, just for simple file identification; a smaller hash is fine too. If you need assurance that you get no junk data, use overlapped file resume.
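A minimal sketch of such an overlapped-resume check in Python: before appending data from a new source, re-request a small range you already have and compare it byte for byte (the fetch_range callback and the overlap size are made up for illustration):

OVERLAP = 4096

def source_matches(local_path, resume_at, fetch_range):
    start = max(0, resume_at - OVERLAP)
    with open(local_path, "rb") as f:
        f.seek(start)
        ours = f.read(resume_at - start)
    theirs = fetch_range(start, resume_at - 1)   # inclusive byte range from the new source
    return ours == theirs                        # only resume if the overlap is identical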

> TigerTree [...] would be an excellent way for the sharing of partial files.

Please give an explanation or URL for our readers.

gnutellafan January 28th, 2002 08:53 AM

I believe that HUGE was voted on and approved, was it not? So for the time being developers should not choose any hash they want but the one that was agreed on, or else it makes hashing worthless.

While a malicious client could send you junk data, that data will not propagate when SHA1 is used, provided a client confirms (rehashes) after downloading: it will see that the data is crap and get rid of it. With a less robust hash, fake data could be crafted that has the same hash value and would thus be allowed to propagate.

I should have said that the use of Tiger tree hashes for partial file sharing is my own idea, but it should work. For more on what a Tiger tree is, go to:

http://groups.yahoo.com/group/the_gdf/message/4871

gnutellafan January 28th, 2002 08:59 AM

From Hash/URN Gnutella Extensions (HUGE) v0.93 :


Quote:

To be in compliance with this specification, you should support at least the SHA1 hash algorithm and format reflected here, and be able to downconvert "bitprint" requests/reports to SHA1. Other URN namespaces are optional and should be gracefully ignored when not understood. Please refer to the rest of this document for other important details.

So yes, SHA1 is "prescribed" and should be supported. Developers may opt to add additional hashes if they wish.

TorK January 28th, 2002 09:22 AM

SHA1 should be used since MD5 is not strong enough. However, I was thinking of an idea for how to reduce the size of queries:
What if you could specify only the beginning of the hash and a '*' to indicate that any bytes may follow? Some files with the wrong hash would be returned, but since the whole hash is in the replies, those would be filtered out. Extra hits are also much cheaper than extra bytes in queries.

This, of course, requires the hashes to be in Base32, since the '*' character could not be recognized in raw binary data.
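A minimal sketch of that prefix-query filtering in Python; the query carries only the first characters of a Base32 hash plus '*', and replies are checked against the full hash on the querying side (the prefix length and file contents are made up):

import base64, hashlib

full = base64.b32encode(hashlib.sha1(b"some shared file").digest()).decode("ascii")
query = full[:12] + "*"                       # shortened query string

def matches(query, full_hash_from_reply):
    return full_hash_from_reply.startswith(query.rstrip("*"))

assert matches(query, full)                   # the true hit survives the filter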

/Tor

Moak January 28th, 2002 09:46 AM

Hi TorK,

please define "strong": isn't a 128-bit hash enough to identify a file within a typical horizon? I like the idea of shortened keys. I already suggested using a truncated MD5 hash, since 64 bits should be enough to identify a file in a typical horizon (less bandwidth wasted). The hash in binary messages does not need to be Base32 (or whatever) encoded.

/Moak

Moak January 28th, 2002 09:52 AM

Gnutellafan, rehashing doesn't work if you do not have the full content. So any bad/broken client can send junk data, and you won't recognize it until you have downloaded all the partials. A SHA1 hash really gives no security at resume time; a simple overlapped check already does this.
So while HUGE favours SHA1, I think it's an unnecessary waste of bandwidth, and therefore I personally don't support this idea; I prefer other alternatives. The GDF can vote on a lot of things... I think discussing ideas, improving the design, understanding background details and finding better alternatives for the next protocol is still allowed in this place.

gnutellafan January 28th, 2002 10:10 AM

nothing is set in stone
 
nothing is set in stone, but for the time being SHA1 is it!

Different clients CANNOT use different hashes if we want to be able to use the hashes across the network. If at some later time developers agree that SHA1 is not working and choose something else, then the protocol changes. But one developer should not use MD5 or whatever they feel like; they should use SHA1.

Moak January 28th, 2002 10:24 AM

thx
 
So you're not interested in my ideas, knowledge and analysis... because what the GDF says is the only word? Okay, then I'm not interested in a CGDF (commercial Gnutella developer forum) under LW/BS pushing force, where things get implemented in _current_ clients before they are well tested and improved in beta clients; motto: I implemented it now, all others eat it or die. Improving an open protocol this way is inefficient in my eyes, like forcing people to use an odd-numbered (development) Linux kernel and then declaring that improving weak spots is no longer allowed.

gnutellafan January 28th, 2002 11:43 AM

I never said that
 
Hmm, I don't think I ever said, or even implied, that I am not interested in your ideas. Any ideas that improve Gnutella are valuable. I only said that developers should comply with the accepted standard decided on by the GDF.

I'm not sure, but how many votes do LimeWire/BearShare get? One each, I would imagine. So how many developers are there? I don't know how you can call it a cGDF.

Moak January 28th, 2002 12:28 PM

I have a very different view of the_GDF and its role.

Basically it's one valuable source of knowledge among others. I take their proposals as proposals, which means I try to understand and improve what could be done better from my programming knowledge and point of view. If people don't like that, fine, but discussing should still be allowed here. Some months back (before Mike Green cleaned up the chaos), the GDF was very chaotic and sometimes discussed things to death until they didn't move anymore. We gained no improvement in network topology or for the end users for months, while other P2P systems were much more innovative and gained more users every week. Currently the GDF is more attractive, I note positively: great for Gnutella!

But I still see LW and BS implementing things, making them a "de facto standard" and coming with an "eat or die" mentality. I still see the GDF hiding behind a lack of documentation for new coders who would like to contribute or to program a client that follows the best available Gnutella technology. With this high-society behaviour it is easy to say... oh, you are bad for "the network" because you don't follow our rules. About votes from important developers: which developers? We have only 4-5 clients in active development, so I don't see a large, active, balanced developer community compared to other open source places. The invisible pushing force is LW and BS = commercial interests. (Note: for example, I see the name "ultrapeer" as marketing, but well, some say it's because FT's "supernode" sounds so close... do we name hashes ultrahashes or smarthashes soon?) Finding an end: the GDF was meant to be a forum for coordination between developers, but I think promotion of new concepts and new coders comes behind implementation questions inside the GDF. Implementation isn't everything; will the GDF not take you seriously until you have coded an(other) client?

I feel uncomfortable with your description of this place as a wish-they-were-developers forum. I and others spend a lot of free time writing articles, helping new Gnutella programmers and bringing new ideas and concepts to Gnutella. I don't want to turn Gnutella into FastTrack or eDonkey... but I get inspired by all of them, not only by the GDF.

gnutellafan January 28th, 2002 01:12 PM

Sorry
 
Please don't take it personally. I really knew that I shouldn't have put the comment in there, but it was really more of a swing at myself than at anyone else. I don't know any programming and can only contribute possible theories on how to improve Gnutella.

I was not aware that anyone here actually had a client (besides cultiv8r, which I have yet to see).

If you are a programmer/developer, great. What languages do you know and what projects are you working on? Gnucleus, Phex, LW...?

Moak January 28th, 2002 03:11 PM

Sorry too
 
I spoke in defense of this forum; the most important thing is that we all work together. :)

About coders lurking here at the moment, I know there are Cultiv8or, Tama and GodX, to name a few. I have a C++ background; there's no Gnutella client I contribute code to.

Greets, Moak

Moak January 29th, 2002 09:32 AM

Yeah, offtopic
 
Let me just add this: the current mails on the GDF perfectly illustrate the "high society" and "commercial" attitude I mentioned:

Raphael wrote: "Let's vote on this issue. I'll manually conclude the vote when I see all the active servent developpers have voted. Please, only Gnutella developpers should vote, and only ONE vote per servent. I.e. LW gets ONE vote."

Vinnie wrote: "Besides, regardless of the poll results, BearShare will follow whatever scheme LimeWire uses in the interest of not creating divergent implementations."

Vinnie wrote: "Contrary to popular belief, the GDF is not a democracy, and servent developers are under no obligation to "adhere" to GDF proceedings. This having been said, cooperation is *recommended* but not required."

veniamin January 29th, 2002 01:08 PM

Sorry, I was confused too... I mean hashes... :D

I only wondered why SHA1 is used, because from my point of view I find it "too much" for the Gnutella protocol and for the purpose it is used for. I read some articles on reducing network traffic for Gnutella (like Pong/Query hashes or Ultrapeers). These "extensions" are meant to save bandwidth and speed up downloads, but implementing SHA1 is like going in the opposite direction.

I am not a Gnutella developer, although I know how to program. So I am not trying to "push" my ideas and make other developers adopt them; I just like discussing a protocol under development like Gnutella.

gnutellafan, you said that this forum is for Gnutella developer wannabes; well, IMHO (always :) ) this forum is more organized than the GDF, and I have found more information on the Gnutella protocol by following the links posted here than by downloading files from the GDF. I also don't like registering with Yahoo.

Neither of you answered whether not using hashes for text files is a good idea. I know there are not too many people out there downloading text files, but it's a case that might be best to exclude.

When someone sends a query, do all the other clients respond with query hits that have hashes inside them, or do you have to send a query with a hash inside to find alternatives?

gnutellafan January 29th, 2002 02:26 PM

lend a helping hand
 
Moak and veniamin, since you both know programming languages, why don't you help out on an existing open source project? Any help would be very valuable to the development of the network.

As for the GDF, they of course have no power to enforce anything. However, if developers don't work together the network will fall apart and become fragmented.

Moak January 29th, 2002 03:33 PM

Re: lend a helping hand
 
Speaking for myself: I only do things that are fun, especially in my free time. I haven't found a free Gnutella client yet that I feel comfortable with. (Perhaps when Mutella becomes better documented, Max *g*, or Xolox goes open source.)

PS: Veniamin, currently I prefer always sending a hash when the other side requests it (0.7 proposal, appendix B). Small files are not that common in Gnutella these days, so it's not a big traffic issue IMHO; also, hashes help find alternative locations when a host is "firewalled" or drops the connection.

Morgwen January 30th, 2002 03:02 AM

about wannabes!
 
Gnutellafan!

Why do you think that the developers here must be wannabes?

I'll tell you something... I know "one" of these wannabes: his name is godXblue! He developed a client within three months... all the people who have tested this client so far (they used BearShare, LimeWire and Xolox before!) say this program is great!

So let's compare this with a GDF developer, for example Vinnie... he has been working on BearShare for about 15 months...

So our wannabe developed, in 20% of the time, a much better client! And our wannabe doesn't use open source...

Hmm...

now I wonder who the wannabe is...?

Perhaps many of the GDF developers are thinking about more important things: how to make money!

No offense, Gnutellafan, just a little story about "wannabes"!

Morgwen

gnutellafan January 30th, 2002 04:29 AM

Hmm, I believe I said that I was the wannabe!! I was not aware of any developers here besides cultiv8r (whose client I still have yet to see; the same goes for the client by godXblue). I apologised for any offense, removed it, let it drop!

Moak, if you don't like the direction of a client, take it in a new direction.

I would really like to see someone with the know-how add data encryption and encrypted file caches (as well as partial file sharing).

Morgwen January 30th, 2002 07:34 AM

Quote:

Originally posted by gnutellafan
I apoligised for any offense, removed it, let it drop!

OK! :)

Morgwen

maksik January 30th, 2002 09:21 AM

Now I feel like I have to make a comment. I have not been to the GDF for a while, but when I was, I was really unimpressed by the amount of mess over there. It was nearly impossible to find what you were looking for, or to understand what those people had finally agreed on. I have no time to participate in the forums; really, I develop Mutella part-time and that's ALL the time I can actually devote to the topic. It's a shame there's no place where I can go and check what the latest updates to the protocol are, etc. Clip2 made a major contribution by releasing the Gnutella Protocol Specification v0.4.x, and it's a shame nobody has repeated that for v0.6 and later.

Btw, the first functional version of Mutella was developed in ~1 month. Well, I didn't do it from scratch, which I regret. :-)

Maksik

veniamin January 30th, 2002 12:29 PM

...since you both know programing languages why dont you help out on an existing open source project....

...and have to argue all the time about protocol extensions for Gnutella with other developers? Nope... thanks...

gnutellafan January 31st, 2002 04:45 AM

New direction
 
Well, to the programmers here who are not working on a current client, let me suggest that you get involved and take Gnutella in a new direction. Use the ideas, optimize the protocol and code, and add some great new features.

I think the most important thing that Gnutella is lacking is security in the form of anonymity. I think the best way to do this, and I have expressed it here, at the GDF, etc., is to add encryption to file transfers and to cache a percentage of the file transfers. In addition, the cached files would be encrypted. Users would be required to provide at least 100 MB or more of HD space for encrypted files (or partial files) and may choose to share as much as they want of their own files. They would not be able to decrypt those files and therefore would not know what they had. The network should not know whether the files being transferred are encrypted files or regular shared files; therefore no one could say who is sharing what. In addition this provides a huge benefit, because now all users are sharing something, even if it is only the 100 MB of encrypted files. If there were encryption, I would of course also have no problem with the program requiring that the download and partial folders be shared, providing even more resources to the net.

It turns out that we have some very talented people here, and it would be wonderful to see them greatly advance Gnutella to Gnutella2. :)

I guess I am the only one here that doesn't know anything ;)

Unregistered January 31st, 2002 10:57 AM

OT
 
Gnutellafan, how about starting new threads and explaining your ideas in more detail there?

Pferdo April 5th, 2002 12:23 AM

@ maksik
 
is this what you're looking for?
http://rfc-gnutella.sourceforge.net/

Nosferatu April 5th, 2002 02:53 AM

Back to the TOPIC
 
SHA1 is the "agreed" method and is already implemented in some clients (Bearshare .. any others?)





The limit for conflicting files is not determined by the horizon ie 10,000 odd PCs - because you do not choose the 10,000 PCs to connect to and the files are not predetermined. This is covered in a statistics course, for anyone who would like to argue.



The limit is the number of available files in the world, which is in fact infinite, because they are created all the time and can be any size, storage is growing all the time.



But to be reasonable, for now, say no one is going to use gnutella to download a file bigger than 1G, so the number is the number of possible permutations of binary digits in a 1G file .. which is an astonishingly large number.



It's about 256<SUP>1000000000</SUP> if I am not getting too confused (I was, but I editted this message to try to fix it. I think I have the number approximately correct this time). It's actually more, since I didn't count the files smaller than 1G, but ... well, the number is too big for debian's arbitrary precision calculator to display, let's just leave it at that.
Most people in the systems administration field have been happily using MD5 for years, but files are growing so maybe MD5 is no longer considered enough.



I would like to see sources for why not.





The file hash should not be sent back with queries, since in most cases it is unnecessary. The file size is plenty to make a rough estimate of duplicate files in the queries received.

The second step should be: when the user decides to download a file, you request the file hash from the client serving the file.

Then you send out that file hash to find duplicate sources where the name differs greatly; believe me, there are plenty.

Once the file is retrieved completely, the hash is taken, and if it doesn't match what you were 'quoted' then you start again (maybe using a different source! ;-)

OK? Comprende?

For small files you can use smaller hashes to determine duplicates, since the number of permutations of bits in a 10 KB file is (comparatively!) very small.

Perhaps when sending out the hash-based query for alternative sources, you send out the file size plus the hash.

Here is another use for hashes:

Hashes would be great for eliminating the downloading of files you already have.

The way it should be implemented, though, is not to filter the query results out when they are received, since this would require hashes for every single search result. Instead, the hash should be retrieved when the user clicks on the file (or when the automatic-download algorithm determines that a file matches its criteria); if the file already exists on the user's PC, the Gnutella client just says 'skipping download - you already have the file in location x'.
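A minimal sketch of that check in Python, assuming the client keeps a dictionary mapping the Base32 SHA1 of each local file to its path (all names here are made up):

local_files = {"PLSTHIPQGSSZTS5FJUPAKUZWUGYQYPFB": "/share/GnutellaProtocol04.pdf"}

def on_download_requested(remote_hash, start_download):
    if remote_hash in local_files:
        print("skipping download - you already have the file in", local_files[remote_hash])
    else:
        start_download()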





Nos
[Edited 5 Apr 2002 to fix guess at number of permutations in a 1 G file]

Unregistered April 12th, 2002 06:12 AM

Re: Back to the TOPIC
 
Quote:

Originally posted by Nosferatu
But to be reasonable, for now, say no one is going to use gnutella to download a file bigger than 1G, so the number is the number of possible permutations of binary digits in a 1G file .. which is an astonishingly large number.

It's about 256^1000000000 if I am not getting too confused

There aren't even that many atoms in the whole Universe! Which I think is about 10^90, more or less.

Smilin' Joe Fission April 12th, 2002 10:10 AM

Re: Re: Back to the TOPIC
 
Quote:

Originally posted by Unregistered
There's not even that amount of atoms in the whole Universe! Which I think is about 10^90, more or less.

However, if you do the math, a 1 GB file has 8,589,934,592 bits. The number Nosferatu came up with is the total of all permutations of that 1 GB file where 1 OR MORE of those bits has changed. When you change even 1 bit, the resulting file is a completely new file because its hash value will be different. I believe the number Nosferatu came up with may be pretty close.

As for the number of atoms in the universe.... I don't think that number is even close. Whatever scientist came up with that number is on drugs.

Taliban April 12th, 2002 10:46 AM

The number of atoms in the universe is about 10^78. You can estimate this number by counting galaxies, measuring how bright they are and estimating how big their mass is.

You don't need any drugs for that.

Nosferatu April 13th, 2002 08:12 PM

Re: Re: Re: Back to the TOPIC
 
I just had a conversation on IRC... someone had a good idea; maybe some of you have heard it before.

Anyway, the idea is this: hash the first meg of the file as well as the whole file.

So that way you can tell that 'buffy vampire.divx' 20M is the same file as 'buffy vampyre.divx' 80M, and get at least the first 20M.

Then you repeat the search later for files with first-meg hash = x.

To implement this most reliably and sensibly, instead of the HUGE proposal's technique of always and only hashing the whole file, the best implementation would be to have a query 'please hash the file from range x-y'.
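A minimal sketch of such a range-hash helper in Python; the function name, file name and range are made up for illustration:

import hashlib

def sha1_of_range(path, start, length, chunk=64 * 1024):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        f.seek(start)
        remaining = length
        while remaining > 0:
            block = f.read(min(chunk, remaining))
            if not block:
                break
            h.update(block)
            remaining -= len(block)
    return h.hexdigest()

# e.g. hash only the first megabyte to match a longer file that starts the same
first_meg_hash = sha1_of_range("buffy_vampire.divx", 0, 1024 * 1024)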

This shouldn't be totally automated... because someone might have a text file which includes a smaller text file that should be considered complete... e.g. they may have tacked some personal notes onto the end of some classic document. You probably don't want the extended version, so a user control button is needed: 'Find bigger files which start off the same', or not.

In fact a really good implementation (not necessary for each client to implement for it to work, as long as clients support the 'hash this part of the file please' extension) would be the one suggested below:

<Justin_> bigger or smaller:)
<Justin_> or have a slider hehe
<Justin_> the way md5sum works having 100 sums is not that intensive to make right?
<Justin_> cause its incremental no?
<Justin_> so if you had a slider in the program, that starts at 100%, that you can lower by 10% incremnts to find more files
<Justin_> as in, the default is files that match 100%, or files that match at the 90% mark, well not % it would have to be 10M intervals

Having the ability to request hashes for arbitrary portions of files would additionally make their use for verifying contents reliable: if someone could generate two files with the same hash (or when this happens randomly), simply checking the hash for a given subportion would detect the difference.

Nos

----------

Quote:

Originally posted by Smilin' Joe Fission

However, if you do the math, a 1GB file has 8589934592 bits. the number Nosferatu came up with is a total of all permutations of that 1GB file where 1 OR MORE of those bits has changed. When you change even 1 bit, the resulting file is a completely new file because its hash value will be different.

Well, this is the question. Is the hash indeed large enough to *have* a unique value for each individual permutation of a 1G file, and if not, does it really matter?

Certainly we are not going to generate every possible version of a 1G file... ever (well, unless some pr!ck sits down in the far future and does it on purpose as a programming exercise using some newfangled superdupercomputer we can't even imagine yet... but I stray from the topic). We do need a hash that has enough values that *most probably* each individual file we generate will have a unique value... but it can't be known for sure unless you actually generate the hash for each file (i.e. generate each file).

Hashes are funny things. (I'm still searching for a good reference to back that statement up... but I don't have time to find one right now... see a later posting.)

I think if you look at the file size and the hash, you have enough certainty to call it a definite match when searching for alternate download sources. A better technique is described above in the first portion of this post.

Quote:

I believe the number Nosferatu came up with may be pretty close.

As for the number of atoms in the universe.... I don't think that number is even close. Whatever scientist came up with that number is on drugs.

I did a quick one on my calculator based on the figure for 'mass of observable universe' from O'Hanian's 'Physics' textbook... and 1e70 would seem to be what "they" (the scientists) think. But I agree about the drugs ;)

This will do as a reference: http://groups.google.com/groups?q=number+atoms+universe&hl=en&scoring=r&selm=4kc1fu%24gej%40agate.berkeley.edu&rnum=1 - at least the guy has the word 'physics' in his email address, as well as the word 'berkeley'. I couldn't be bothered checking any more thoroughly than that.

Nos
[Edited 14-04-2002 to add URL reference for atom count left out of initial post]

