Proposal for development of Gnutella (hashs)

Moak · #1 (**permalink**) January 11th, 2002

Hi,
I thought more about hashs in Gnutella Queries. I personally think they should be as small as possible, because I expect a increased Query/Queryhit traffic from new clients with features like automatic resume and multisegmented downloads. While automatic requeries are a key technology for those features, the Query traffic especially for hash wil increase. Perhaps it will be also necesarry to group multiple searches together into a single message (multiple searches in one Gnutella Query to avoid repeated sending of Gnutella descriptor header, 23 bytes + more repeated payload). A small query/hash will be necessary in my eyes, as small as possible.

Different people have different ideas of a small hash. It should be still unique enough to fit our needs, common are AFAIK those suggestions:

* CRC-32, size 32 bit (256 hashs/KB *) [1]
* MD5, size 128 bit (64 hashs/KB) [2]
* SHA1, size 160 bit (51.2 hashs/KB) [3]
* Tiger, size 192 bit or truncated to 128 or 160 (42.6 hashs/KB) [4]
* Snefru, 256 bit (32 hashs/KB) [5]

I'm not sure about which hash to use (prefered). There seems to be nothing between 32 bit (CRC-32) and 128 bit (MD5) length. CRC32 will be not unique enough within a typical Gnutella horizon, better start to use MD5 or higher. Is it possible to truncate a big hash to e.g. 64 bits, does this make sense? I'm not familiar with cryptography, this is only a short summary... perhaps someone else wants to add some more qualified comments? :-)

At least the hash should be IMHO pure binary in inside the query (not BASE32 encoded which blows up the size again), in HTTP headers it might be BaseWhatever encoded to gain highest HTTP/1.x compatibilty. I think indexing speed is secondary [6]. Indexing local shared files can be performed in background on first startup (meanwhile the client does not answer with own hashs, but can allready search for).

* = pure binary hash, not included are descriptor headers or other protocol overhead

[1] CRC-32 - http://www.ietf.org/rfc/rfc1510.txt (ISO 3309)
[2] MD5 - http://www.faqs.org/rfcs/rfc1321.html
[3] SHA1 - http://www.faqs.org/rfcs/rfc3174.html
[4] Tiger - http://www.cs.technion.ac.il/~biham/Reports/Tiger/
[5] Snefru - http://www.faqs.org/faqs/cryptography-faq/part07/
[6] Hash Indexing Speed - http://groups.yahoo.com/group/the_gdf/message/1970

Moak · #2 (**permalink**) January 12th, 2002

PS: Some ppl suggest to combine a small CRC hash with filesize (which is allready in Queryhits, not in Queries or other Gnutella descriptors).... let's play around with this idea. This would be a 32 bit hash + 32 bit filesize (taken from Gnutella protocol v0.4) = 64 bit key to use. Perhaps it would be more unique to use a real hash of 64 bit instead of the 64 bit CRC+filesize combo, e.g. an truncated MD5?

Here an overview of minimum possibilities:
* CRC-32, size 32 bit
* CRC-32 filesize combo, size 64 bit
* MD5 truncated, size 64 bit
* MD5, size 128 bit

Notes: I have choosen MD5 in this case, because it is the smallest and fastest compared to other hashs (SHA1, Tiger, Snefru). The CRC-32 alone is too small. The CRC-filesize combo might be enough, the truncated 64 bit MD5 might be mathematically more unique while it wastes 32 bit information in Queryhits (not in Query, GET, PUSHS). The next higher alternative is a 128 Bit MD5, e.g FastTrack uses an MD5 hash AFAIK.

I'm not sure if a minimum alternative is the best solution for Gnutella's future. Perhaps a 64 bit key does make us happy now, in future with more superpeers and bigger horizons we might want to have a bigger hash (MD5 or SHA1)?

An possibility could be an encoding alla HUGE. The hash has an prefix telling the hash type. For binary Gnutella messages (Query/Queryhits) this could be a payload like: byte 0 = hash type, more bytes = binary hash. The protocoll defines a list of known hash, while clients need a common solution, this list will be short, e.g start with CRC-filesize combo today and use SHA1 in future. In HTTP-alike Gnutella messages we can work with encoded hashs (not binary), similar to the HUGE proposal [1].

Conclusion: I have none.

I suggest to implement and test a minimum solution (CRC32-filesize combo) and a bigger hash (MD5 or SHA1) for a while. With more experience in a real world environment we can hopefully find a suitable solution. Feedback, tests and mathematical analysis are welcome!

[1] "HUGE" - http://groups.yahoo.com/group/the_gd...roposals/HUGE/ (Yahoo account required)

Moak · #3 (**permalink**) January 12th, 2002

PPS: Here is a summary why using Base32 or Base64 encoding in HTTP-style requests/headers.

------ snip ------
From http://groups.yahoo.com/group/the_gdf/message/2442:
- We could choose any encoding, but...
- Base32 is useful for compatibility with URLs and domain names, so...
- We might as well use it in protocol-fields, saving extra conversions and developer inconvenience.
------ snap ------

Which sounds logical... BUT.... at least we could also use Base64 after having a '?' in the location/URN. A HTTP GET could look like this:

Definition: GET /get/hash?[URN] HTTP/1.0
Base32: GET /get/hash?sha1:BCMD5DIPKJJTG2GHI2AZ9HG7HZUN5ZPH HTTP/1.0
Base64: GET /get/hash?sha1:/9n6YmKqNRmcLIiKC+2xRccm68 HTTP/1.0

Right now I prefer Base64 for HTTP encoding (it's smaller then Base32), binary encoding inside binary Gnutella binary messages and a smaller hash than SHA1.

veniamin · #4 (**permalink**) January 27th, 2002

If a client respond to a query with query hits that each one has a URN then a lot of bandwidth is beeing wasted.

What i want to say is that for some files a client should not respond with a URN/Query Hit.
For example text files usually are small files which can be downloaded again in case of an incomplete download. There is no need to apply a URN for a text file. Not just .TXT files but also .HTM, .XML, .DOC and other text files. Another reason is that , if you just alter a text file (ex: put a comma) then the HASH changes. So many users can have (almost) the same (text) file but not the exactly identical and sending diferrent URNs.
URNs should be used only for binary files. But not just for every binary file. For small binary files (800~1000KB) a client should not reply with a URN.

Also when a client searches for the same file in other servers should use and the size of the file, not just the HASH. This way we can use an algorithm smaller than SHA1. If you have two binary files with different file size then there is no chance to be the same (or one of them is corrupted).

#5 (**permalink**) January 27th, 2002

confused, do you really mean "Uniform Resource Names" (urn) or simply "hashs"?

gnutellafan · #6 (**permalink**) January 28th, 2002

Developers can not use what ever hash algorith they feel like. All developers must use the same one or it defeats the point. SHA1 is the one agreed on by the GDF and should be used by all clients.

SHA1 was used because it would be difficult to create a fake file with the hash the matches something else.

TigerTree can be used in combination with SHA1 and would be an excellent way to provide support for the sharing of partial files.

Moak · #7 (**permalink**) January 28th, 2002

I don't agree.
Gnutella is a protocol in development and while I do not agree with some GDF ideas I prefer collecting/envolving ideas and suggesting new ones. In this particular case, the GDF-HUGE proposal itself is still an proposal in change and it also does not prescribe SHA1. Don't claim something a standard which isn't.

> SHA1 was used because it would be difficult to create a fake file with the hash the matches something else.

No argument. You give me SHA1 hash, I send back something junk data. It was often discussed that hashs are NOT used for security reasons just for simple file identifcation, a smaller hash is fine also. If you need a ensurance to get no junk data, use overlapped file resume.

> TigerTree [...] would be an excellent way for the sharing of partial files.

Please give an explanation or URL for our readers.

gnutellafan · #8 (**permalink**) January 28th, 2002

I believe that HUGE was voted on and approved was it not? So for the time being developers should not choose any hash they want but the one that was agreed on or else it makes hashing worthless.

While a malicious client could sent you junk data this will not be propagated using SHA1, if a client confirms (rehashes) after downloading. It will see that the data is crap and get rid of it. With a less robust hash fake data can be made that would have the same hash value and thus be allowed to propagate.

I should have said that the use of tiger tree for partial file sharing is my idea but it should work. For more on what tiger tree is go to:

http://groups.yahoo.com/group/the_gdf/message/4871

gnutellafan · #9 (**permalink**) January 28th, 2002

From Hash/URN Gnutella Extensions (HUGE) v0.93 :

Quote:

To be in compliance with this specification, you should support at least the SHA1 hash algorithm and format reflected here, and be able to downconvert "bitprint" requests/reports to SHA1. Other URN namespaces are optional and should be gracefully ignored when not understood. Please refer to the rest of this document for other important details.

So yes, SHA1 is "prescribed" and should be supported. Developers may opt to add additional hashes if they wish.

TorK · #10 (**permalink**) January 28th, 2002

SHA1 should be used since MD5 is not strong enough. However, I was thinking of an idea of hot reduce the size of queries:
What if you could specify only the beggining of the hash and a * to indicate that any bytes may follow. Some files with the wrong hash would be returned, but since the whole hash is in the replies, those would be filterd out. Extra hits are also much cheaper than extra bytes in queries.

This, of course requires the hashes to be in base32, since the '*' char could not be recognized in raw binary data.

/Tor

Moak · #11 (**permalink**) January 28th, 2002

Hi Tork,

please define "strong", a 128 bit hash isn't enough to identify a file within a typical horizon? I like the idea of shortened keys. I allready suggested to use an truncated MD5 hash, since 64 Bit should be enough to identify a file in a typical horizon (less bandwith wasted). The hash in binary messages does not need to be Base32 (or whatever) encoded.

/Moak

Moak · #12 (**permalink**) January 28th, 2002

Gnutellafan, rehashing doesn't work if you do not have the full content. So any bad/broken client can send junk data, you won't recognize until you downloaded all partials. A SHA1 hash does really give no security on resume time, a simple overlapped check does this allready.
So while HUGE favours SHA1, I think it's a unnessecary waste of bandwith and therfore I personally don't support this idea, I prefer other alternatives. The GDF can vote a lot of things... I think discussing ideas, improving design, understanding background details, find better alternatives for the next protocoll is still allowed in this place.

gnutellafan · #13 (**permalink**) January 28th, 2002

nothing is set in stone, but for the time being SHA1 is it!

Different clients CANNOT use different hashes if we want to be able to use the hashes across the network. If at some later time developers agree that SHA1 is not working and choose something else than the protocol changes. But one developer should not use md5 or whatever they feel like, they should use SHA1.

Moak · #14 (**permalink**) January 28th, 2002

So you're not interested in my ideas, knowledge and analysis... because what GDF says is the only word? Okay, I'm not interested in the CGDF (commercial gnutella developer forum) under LW/BS pushing force, where things get implemented in _current_ clients before they are well tested and improved in beta clients, motto: I implemented it now, all others eat it or die. Improving an open protocol this way is inefficient in my eyes, like forcing ppl using uneven Linux kernel and improving weak spots is not allowed anymore.

gnutellafan · #15 (**permalink**) January 28th, 2002

Hmm, I dont think I ever said, or even implied that I am not interested in your ideas. Any ideas that improve gnutella are valuable. I only said that developers should comply with the accepted standard decided on by the GDF.

Im not sure but how many votes does limewire/bearshare get? 1 each I would imagine. So how many developers are there? I dont know how you can call it a cGDF?

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
Gnutella Protocoll v0.7 Proposal	Moak	General Gnutella Development Discussion	41	August 17th, 2002 11:55 AM
gnutella development plans	Iamnacho	General Gnutella Development Discussion	11	March 9th, 2002 07:21 PM
My Proposal for XoloX!!!	Unregistered	User Experience	1	February 6th, 2002 09:11 AM
Xolox and Gnutella development	Moak	Rants	6	November 25th, 2001 07:05 AM
---a Radical Proposal---	Unregistered	General Gnutella / Gnutella Network Discussion	0	September 21st, 2001 01:08 PM