Gnutella Forums

Gnutella Forums (https://www.gnutellaforums.com/)
-   General Gnutella Development Discussion (https://www.gnutellaforums.com/general-gnutella-development-discussion/)
-   -   Proposal for development of Gnutella (hashs) (https://www.gnutellaforums.com/general-gnutella-development-discussion/6969-proposal-development-gnutella-hashs.html)

Smilin' Joe Fission April 13th, 2002 10:09 PM

Re: Re: Re: Re: Back to the TOPIC
 
Quote:

Originally posted by Nosferatu
I just had a conversation on irc .. someone had a good idea, maybe some of you have heard it before.

Anyway, the idea is this: hash the first meg of the file as well as the whole file.

So that way you can tell that 'buffy vampire.divx' 20M is the same file as 'buffy vampyre.divx' 80M, and get at least the first 20M.

Then you repeat search later for files with first meg hash = x.

To implement this most reliably and sensibly, instead of the HUGE proposal's technique of always and only hashing the whole file, the best implementation would be to have a query 'please hash the file from range x-y'.

I believe this was part of the HUGE proposal as well... The part about using a Tiger tree to hash sections of a file. Is it not?

Quote:

Well, this is the question. Is the HASH indeed large enough to have a unique value for each individual permutation of a 1G file, and if not, does it really matter?
I believe it may be, although I haven't verified it. What makes me think this is that the SHA1 hash is good for files up to 2^64 bits long, for which I would think it would generate a unique hash for each unique file.

Quote:

Certainly we are not going to generate each version of a 1G file that is possible .. ever (well, unless some pr!ck sits down in the far future and does it on purpose as a programming exercise using some newfangled superdupercomputer we can't even imagine yet .. but I stray from the topic). We do need a hash that has enough values that most probably each individual file we generate will have a unique value .. but it can't be known for sure unless you actually generate the hash for each file (i.e. generate each file).
Agreed.
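
To put some rough numbers on "most probably" (my own back-of-envelope figures, so take them with a grain of salt): SHA1 produces 160-bit values, which is about 1.5e48 possibilities. By the usual birthday estimate, even a billion (1e9) distinct files would all hash differently except with probability on the order of (1e9)^2 / 2^161, roughly 3e-31. So uniqueness is never literally guaranteed, but the odds of an accidental clash are absurdly small.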

Quote:

I think if you look at the file size and the hash, you have enough certainty to call it a definite match in searching for alternate download sources. Better technique described above in first portion of post.
Personally, I would trust just the hash because a file of a different size should theoretically generate a different hash. But that's just my opinion.

Quote:

I did a quick one on my calculator based on the figure for 'mass of observable universe' from O'Hanian's 'Physics' textbook .. and 1e70 would seem to be what "they" think (the scientists). But I agree about the drugs ;)

Well, hopefully they'll do a count someday to find out an exact number. Heh. :)

Nosferatu April 13th, 2002 11:47 PM

Quote:

Quote:

To implement this most reliably and sensibly, instead of the HUGE proposal's technique of always and only hashing the whole file, the best implementation would be to have a query 'please hash the file from range x-y'.
I believe this was part of the HUGE proposal as well... The part about using a Tiger tree to hash sections of a file. Is it not?
Can't find the word tiger or anything that looks like hashing of parts of the file at

http://rfc-gnutella.sourceforge.net/...-huge-0_92.txt or http://rfc-gnutella.sourceforge.net/...-huge-0_93.txt .
Perhaps it was discussed and then dropped .. ? Got a reference? I found that http://bitzi.com/ proposes/uses tiger-tree (http://bitzi.com/developer/bitprint) as an attempt to index as many files as they can .. looks like a good project to incorporate into gnutella clients - have a bitzi index lookup.

Also found the Tiger hash algorithm homepage (http://www.cs.technion.ac.il/~biham/Reports/Tiger/) and the tiger-tree homepage (http://sourceforge.net/projects/tigertree/).

Unfortunately, between these three sources I can't find a description of the tiger-tree process in words I can understand. "TigerTree is based on the Tiger hash algorithm, applied to each 1024-byte block of a file, then combined up through a binary hash tree to a final summary value" (http://bitzi.com/developer/bitprint) really doesn't cut it for me.

Anyone know what it means? They imply that it can be used for incremental portions of the file .. but I don't understand the process.

These bitzi guys are JUST doing hashing of files, and are covering any files you care to name .. so they probably have thrashed this issue out pretty well.

Also, if there aren't competing schemes to index all of filespace, then it really makes a lot of sense to use their hashing scheme so that you can link in and allow users to search bitzi's index to see what it has to say about what the user receives in their search results. I think this is a really exciting idea. Could save a lot of bandwidth downloading broken mp3s etc, for example.
Quote:

Quote:

I think if you look at the file size and the hash, you have enough certainty to call it a definite match in searching for alternate download sources. Better technique described above in first portion of post.
Personally, I would trust just the hash because a file of a different size should theoretically generate a different hash. But that's just my opinion.
The file size could be used to eliminate the case where two files have the same hash. It's possible, but I would expect (and I don't know enough to say this is definitely the case) that the chances of two files of the same size having the same hash are much smaller than the chance of two files of differing sizes having the same hash. It's just a way to rule out 99% of files which could have a duplicate hash but be a different file.

Nos

Smilin' Joe Fission April 14th, 2002 01:32 AM

Quote:

Originally posted by Nosferatu
Can't find the word tiger or anything that looks like hashing of parts of the file at

http://rfc-gnutella.sourceforge.net/...-huge-0_92.txt or http://rfc-gnutella.sourceforge.net/...-huge-0_93.txt .
Perhaps it was discussed and then dropped .. ? Got a reference? I found that http://bitzi.com/ proposes/uses tiger-tree (http://bitzi.com/developer/bitprint) as an attempt to index as many files as they can .. looks like a good project to incorporate into gnutella clients - have a bitzi index lookup.

You're right... I remember reading about the value of the Tiger Tree hash and, without actually looking at the HUGE proposal again to verify it, I assumed it was the proposal where I originally saw it. The HUGE proposal does, however, include a provision for a 39 character tiger tree value, but it doesn't explain how it is used or how it is generated.

Quote:

Also found the Tiger hash algorithm homepage (http://www.cs.technion.ac.il/~biham/Reports/Tiger/) and the tiger-tree homepage (http://sourceforge.net/projects/tigertree/).

Unfortunately, between these three sources I can't find a description of the tiger-tree process in words I can understand. "TigerTree is based on the Tiger hash algorithm, applied to each 1024-byte block of a file, then combined up through a binary hash tree to a final summary value" (http://bitzi.com/developer/bitprint) really doesn't cut it for me.

Anyone know what it means? They imply that it can be used for incremental portions of the file .. but I don't understand the process.

I'll see if I can reiterate correctly how it works. Basically, the Tiger algorithm hashes the file in 1024-byte blocks. Then a tree system is set up for arriving at the final hash, which looks similar to:
Code:

A   B   C   D
 \ /     \ /
  E       F
   \     /
    \   /
     \ /
      G

From what I remember, working with an "expected" hash value, one can then determine if any of the component hashes are bad. For instance, given the value of E, one could verify that blocks A and B are correct. Given the value of G, one could verify the hashes for E and F. If, for instance, block C were corrupted, the hash for F would be wrong, so E and F would no longer combine to give G. Knowing that the hash for F was wrong, a person could then trace the fault back to either block C or D. And, given the true hash values for C and D, individual comparisons could be made to determine that C is incorrect.
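
Here is a rough sketch in Python of that pair-and-combine idea (SHA1 stands in for Tiger purely for illustration, and the pairing rule is just one possible convention - the real TigerTree uses the Tiger algorithm over 1024-byte blocks):
Code:

import hashlib

def h(data):
    # Stand-in hash; TigerTree would use Tiger rather than SHA1.
    return hashlib.sha1(data).digest()

def build_tree(blocks):
    """Hash each block, then pair the hashes up level by level."""
    level = [h(b) for b in blocks]      # leaf hashes: A, B, C, D ...
    tree = [level]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                nxt.append(h(level[i] + level[i + 1]))   # E = h(A+B), F = h(C+D), ...
            else:
                nxt.append(level[i])                     # odd hash is promoted as-is
        level = nxt
        tree.append(level)
    return tree                                          # tree[-1][0] is G, the root

blocks = [b"block A", b"block B", b"block C", b"block D"]
root = build_tree(blocks)[-1][0]
# If block C were corrupted, F (and therefore G) would change while E still
# matched, which is what lets you narrow the fault down to C or D.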

I hope this makes a shred of sense; it's early morning as I'm writing this and my brain is falling asleep. Besides that, I can't seem to find the reference material I got this from.
Quote:

I think this is a really exciting idea. Could save a lot of bandwidth downloading broken mp3s etc, for example.
Or, in my case, broken video files. I really hate having to download a 500+ MB file again because a few frames in the middle somehow got garbled during transfer.

gnutellafan April 14th, 2002 12:35 PM

tiger hash
 
First, tiger is a hash algorithm just like md5 or sha1. "Tree" describes a way of using that algorithm where segments of the file are hashed individually. The tigertree implementation used by Bitzi uses 1024-byte blocks (though they could use any size). I have no evidence, but I think that around 1 MB would be the best.

The tree hash is the best way to share partial files. A tree hash can use any hash algorithm (i.e. md5, sha1, tiger, etc.). Small chunks of the file are individually hashed, and all of these hashes make up the tree hash. Because of this you could set it up so that there is a hash for every 1 MB of the file. Then you could securely and confidently download 1 MB pieces from multiple hosts, even hosts that only have partial files.
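
As a minimal sketch of the downloader's side (1 MB chunks and sha1 chosen arbitrarily here, and I'm assuming the sharer publishes the per-chunk hash list alongside the whole-file hash):
Code:

import hashlib

CHUNK = 1024 * 1024  # 1 MB, the size suggested above; any size would work

def verify_piece(piece, index, published_hashes):
    """Check a chunk fetched from any host against the published per-chunk
    hash list before writing it into the partially downloaded file."""
    return hashlib.sha1(piece).hexdigest() == published_hashes[index]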



An added bonus of the tree hash method is the ability to resume from nearly identical files.

For example:

I want to download songx, so I search for it and find it. There are quite a few versions with the same size, bitrate, etc., but they have different metadata so the hash is different.

Well, with the tree hash you could swarm from all of those sources for the parts of the file that are the same!!! This would greatly increase swarming speeds while providing the same security and confidence we currently have with hashed files!

Nosferatu April 14th, 2002 08:35 PM

Tree is stored where?
 
Hmm .. hash trees ...

What I have understood from searching the web:

Tree-hashes are also known as Merkle Trees. The idea was patented in 1979 (http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/srchnum.htm&r=1&f=G&l=50&s1=%274,309,569%27.WKU.&OS=PN/4,309,569&RS=PN/4,309,569), but I read somewhere the patent ran out in 1999. The tree works like this:

Hash tree

           (Whole File)
              /     \
            /         \
         (m)           (n)
         / \           / \
        /    \        /    \
      (i)    (j)    (k)    (l)
      / \    / \    / \    / \
    (a) (b)(c) (d)(e) (f)(g) (h)

You get parts a and b of a file. You get the hash of the entire file, plus the hash values for only j and n, and you can verify that a and b are part of the file by generating i, then m, and then, combining m with n, the whole file hash.
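
In code it would look something like this (SHA1 as a stand-in for whatever hash is actually used, and the combining rule is just one possible convention):
Code:

import hashlib

def h(x):
    return hashlib.sha1(x).digest()   # stand-in hash

# You hold parts a and b, plus the published hashes j and n and the root.
def verify_a_and_b(part_a, part_b, hash_j, hash_n, whole_file_hash):
    i = h(h(part_a) + h(part_b))              # (a)(b) -> (i)
    m = h(i + hash_j)                         # (i)(j) -> (m)
    return h(m + hash_n) == whole_file_hash   # (m)(n) -> whole file hash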

But in gnutella you wouldn't do this - it's absolutely pointless. For it to work for all parts of the file, all values in the tree hash need to be stored centrally where you can get them to check. If you have an index available (bitzi, http://bitzi.com/, only stores the whole file hashes) you would just download the correct hash for section x and check it.

I can't see a feasible way to make that aspect work without a central server storing all the intermediate hash values, otherwise you might just as well do this:

If you didn't use a tree, you might store all the values a-h and download and check each one individually. For a big file this is a lot of downloading and checking.

So you might make a hash of all the hashes together and store that value - but that is useless unless you have downloaded <I>all</I> the sub parts.

So the tree covers these in-between cases.

BUT you need to have all these values available somewhere for verification for it to be useful, i.e. if you find only parts a, c and n, you still need someone to tell you the correct hashes for b and d, plus the whole-file hash, before you can verify anything. The idea is that you can download part of a file from someone who doesn't know what the whole file is - the person you're downloading the portion from might not know what the whole file looks like - so you have to find the info from someone else.

Now, to set up a service like Bitzi storing all the subtree values, I guess the storage would blow out. That's obviously why they don't do it.

And I can't see a sane way to track all these sub-portions within the gnutella protocol. I guess you could automatically make lists and store them in config files .. then automatically share them out as virtual files and automatically query for the virtual files .. but it sounds like a big headache to me.

Another option is calculating them on request, but this seems .. bad too.

So the hash tree idea doesn't seem helpful to me (except in maybe an all-out swarming effort .. which seems like too much work for not enough benefit at this stage).

Can anyone point out something I've missed? Is anyone implementing this as a way to do swarmed downloads?

I'm back to thinking that the easy and reliable solution which works is just to query for a hash of a given byte-range.

This has an additional benefit I didn't mention before: you could ask for the hash of mp3 files offset by 64 bytes or 64k or whatever size the id3 tag is. Then files which are the same except for metadata could be easily found:

Search for files matching words blah foo bananas.

Query hash of file "foo bananas go blah.mp3" 2.35M offset by 64 bytes or whatever it is. Same with file "foo bananass goes blah.mp3" 2.37M. They match! Queue both as alternative downloads! Unfortunately "bananas go blah foo.mp3" 2.34M turned out to be a different file (must be 160bit or something ;P )
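
A small sketch of what that offset hashing could look like (the filenames are just the made-up ones from the example above, and 64 bytes is only a placeholder for the real tag size):
Code:

import hashlib

def hash_from_offset(path, offset):
    """Hash everything after `offset` bytes, so files that differ only in a
    leading metadata block (e.g. an id3 tag) can still be matched."""
    sha = hashlib.sha1()
    with open(path, "rb") as f:
        f.seek(offset)
        for block in iter(lambda: f.read(65536), b""):
            sha.update(block)
    return sha.hexdigest()

# hash_from_offset("foo bananas go blah.mp3", 64) == hash_from_offset("foo bananass goes blah.mp3", 64)
# would flag the two as the same audio despite the different tags.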

Nos
[Edited 15 Apr 2002 to clean up drawing - sorry, pic looks OK in (gak!) IE but maybe not in your browser]

gnutellafan April 15th, 2002 05:33 AM

No need for a central server. Whoever is hosting (sharing) the file keeps the whole file hash as well as the 1 MB incremental hashes. These are stored just like the sha1 for HUGE. Then if I start a download from you, I get that hash info. Now I can use it to search the gnet for the other parts, even from hosts that only have partial files, and swarm from them.

Smilin' Joe Fission April 15th, 2002 11:20 AM

What would be better is if a method were used where one could determine the component hashes by disassembling the full file hash. Then, only the full file hash would need to be sent when requesting a file.

I suppose that may be asking a bit much though.

gnutellafan April 15th, 2002 02:08 PM

just a bit ;-)

Yes, this would be great. But a downfall might be that an actual set of data could be calculated that would match such a hash, and then it would be possible to create fake data with the same hash and screw up the dls on the gnet.

Nosferatu April 15th, 2002 05:41 PM

Can be done
 
It's another way of doing it, but I didn't mention it because basically it's not.

You just make the hash for the whole file the concatenation of the hashes for the parts.

It means that either you select parts as being pretty big compared with the size of the whole file, or you end up with a long hash.
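
To put rough numbers on that trade-off (my own back-of-envelope, not from any proposal): with 1 MB parts and 20-byte sha1 digests, a 700 MB file would need 700 x 20 = 14,000 bytes of concatenated 'whole file hash'. So you either accept identifiers that long, or cut the file into far fewer, much larger parts.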

Nos

gnutellafan April 17th, 2002 06:05 AM

improvement of my idea
 
After talking with Gordon from Bitzi, I think tree hashes are overkill. Instead you could simply hash ranges of the file with sha1. This could be done in 1 MB chunks. So basically all files would be hashed 2x: once for a full file hash, and once where a hash is generated for each 1 MB portion of the file starting from the beginning. Since the file will not be an exact multiple of 1 MB, the last hash may be of a segment shorter than 1 MB.
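
A minimal sketch of that double hashing in one pass over the file (sha1 and the 1 MB figure are from the idea above; everything else is just illustrative):
Code:

import hashlib

SEGMENT = 1024 * 1024  # 1 MB, per the scheme described above

def hash_file_twice(path):
    """Return (full-file sha1, list of per-1MB-segment sha1s)."""
    full = hashlib.sha1()
    segments = []
    with open(path, "rb") as f:
        while True:
            data = f.read(SEGMENT)
            if not data:
                break
            full.update(data)                               # feeds the whole-file hash
            segments.append(hashlib.sha1(data).hexdigest()) # per-segment hash
    # The last segment simply comes out shorter when the size isn't a multiple of 1 MB.
    return full.hexdigest(), segments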

I don't have any basis for choosing 1 MB, of course. A bit of trial and error would be needed to optimize the system. Anything larger than 1 MB, say 5 MB or 10 MB, would be good for large files but would not provide the benefit, especially the metadata benefits, for small files such as mp3s.

Does anyone know more about metadata? Is it always stored at the end of files, even for videos etc.?

