Re: Re: Re: Re: Back to the TOPIC
http://rfc-gnutella.sourceforge.net/...-huge-0_92.txt or http://rfc-gnutella.sourceforge.net/...-huge-0_93.txt . Perhaps it was discussed and then dropped .. ? Got a reference?

I found that http://bitzi.com/ proposes and uses <A HREF="http://bitzi.com/developer/bitprint">tiger-tree</A> in an attempt to index as many files as they can. It looks like a good project to incorporate into gnutella clients - have a Bitzi index lookup. I also found the <A HREF="http://www.cs.technion.ac.il/~biham/Reports/Tiger/">Tiger hash algorithm homepage</A> and the <A HREF="http://sourceforge.net/projects/tigertree/">tiger-tree homepage</A>. Unfortunately, between these three sources I can't find a description of the tiger-tree process in words I can understand. <A HREF="http://bitzi.com/developer/bitprint">"TigerTree is based on the Tiger hash algorithm, applied to each 1024-byte block of a file, then combined up through a binary hash tree to a final summary value"</A> really doesn't cut it for me. Anyone know what it means? They imply that it can be used for incremental portions of the file, but I don't understand the process.

These Bitzi guys are doing JUST the hashing of files, and are covering any files you care to name, so they have probably thrashed this issue out pretty well. Also, if there aren't competing schemes to index all of filespace, then it really makes a lot of sense to use their hashing scheme, so that you can link in and let users search Bitzi's index to see what it has to say about the files in their search results. I think this is a <B><I>really exciting idea</I></B>. It could save a lot of bandwidth downloading broken mp3s, for example.
Nos
Code:
A B C D

I hope this makes a shred of sense; it's early morning as I'm writing this and my brain is falling asleep. Besides that, I can't seem to find the reference material I got this from.
tiger hash

First, Tiger is a hash algorithm just like MD5 or SHA1. "Tree" describes a way of using that algorithm where segments of the file are hashed individually. The tigertree implementation used by Bitzi uses 1024-byte blocks (though any size could be used). I have no evidence, but I think around 1 MB would be best.

The tree hash is the best way to share partial files. A tree hash can use any hash algorithm (i.e. MD5, SHA1, Tiger, etc.). Small chunks of the file are individually hashed, and all of these hashes make up the tree hash. Because of this you could arrange for there to be a hash for every 1 MB of the file. Then you could securely and confidently download 1 MB pieces from multiple hosts, even hosts that only have partial files.

An added bonus of the tree hash method is the ability to resume from nearly identical files. For example: I want to download songx, so I search for it and find it. There are quite a few versions with the same size, bitrate, etc., but they have different metadata, so the hash is different. Well, with the tree hash you could swarm from all of those sources for the parts of the file that are the same!!! This would greatly increase swarming speeds while providing the same security and confidence we currently have with hashed files!
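A minimal sketch of the idea above, with SHA1 standing in for Tiger (Tiger isn't in Python's standard library) and the 1 MB leaf size suggested in the post — chunk hashes first, then pairwise combination up to a single root:

```python
import hashlib

CHUNK = 1024 * 1024  # 1 MB leaf size, per the post's suggestion

def leaf_hashes(data: bytes) -> list[bytes]:
    """Hash each fixed-size chunk of the file individually."""
    return [hashlib.sha1(data[i:i + CHUNK]).digest()
            for i in range(0, max(len(data), 1), CHUNK)]

def tree_root(hashes: list[bytes]) -> bytes:
    """Pairwise-combine the chunk hashes up to a single summary value."""
    while len(hashes) > 1:
        hashes = [hashlib.sha1(hashes[i] + hashes[i + 1]).digest()
                  if i + 1 < len(hashes) else hashes[i]
                  for i in range(0, len(hashes), 2)]
    return hashes[0]

data = b"x" * (3 * CHUNK + 123)       # a pretend 3-MB-and-a-bit file
leaves = leaf_hashes(data)
print(len(leaves))                    # 4 chunk hashes
print(tree_root(leaves).hex())        # one root summarising them all
```

Each 1 MB piece can be checked against its leaf hash on its own, while the root still pins down the file as a whole.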
Tree is stored where?

Hmm .. hash trees ... What I have understood from searching the web: tree-hashes are also known as Merkle trees. The idea was <A HREF="http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/srchnum.htm&r=1&f=G&l=50&s1=%274,309,569%27.WKU.&OS=PN/4,309,569&RS=PN/4,309,569">patented in 1979</A>, but I read somewhere the patent ran out in 1999. The tree works like this:

Hash tree
<PRE>
           (Whole File)
                |
               / \
              /   \
             /     \
          (m)       (n)
         /   \     /   \
       (i)   (j) (k)   (l)
       / \   / \ / \   / \
     (a)(b)(c)(d)(e)(f)(g)(h)
</PRE>

Say you get parts a and b of a file. If you also have the hash of the entire file and the hash values for only j and n, you can verify that a and b are part of the file: generate i from a and b, combine it with j to get m, then combine m with n to reproduce the whole-file hash.

But in gnutella you wouldn't do this - it's absolutely pointless. For it to work for all parts of the file, all values in the tree need to be stored somewhere central where you can get them to check against. If you had such an index available (<A HREF="http://bitzi.com/">bitzi</A> only stores the whole-file hashes) you would just download the correct hash for the section you hold and check it. I can't see a feasible way to make that aspect work without a central server storing all the intermediate hash values; otherwise you might just as well do this: if you didn't use a tree, you could store all the values a-h and download and check each one individually. For a big file this is a lot of downloading and checking. So you might make a hash of all the hashes together and store that value - but that is useless unless you have downloaded <I>all</I> the sub-parts. So the tree covers these in-between cases. BUT you need all these values available somewhere for verification for it to be useful, i.e. if you find only parts a, c and n, you still need someone to tell you the correct hashes for b and d and the whole-file hash to verify everything.
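The verification walk described above (parts a and b in hand, hashes j and n supplied by someone) can be sketched like this. The node names follow the drawing, and SHA1 stands in for whatever hash the tree actually uses:

```python
import hashlib

H = lambda b: hashlib.sha1(b).digest()

# Eight leaf chunks (a..h), matching the drawing above.
chunks = [bytes([c]) * 10 for c in range(8)]
a, b, c, d, e, f, g, h = [H(x) for x in chunks]

# Interior nodes, named as in the drawing.
i, j, k, l = H(a + b), H(c + d), H(e + f), H(g + h)
m, n = H(i + j), H(k + l)
root = H(m + n)                           # the whole-file hash

# Verify chunks a and b given only the sibling hashes j and n:
i_check = H(H(chunks[0]) + H(chunks[1]))  # recompute i from the raw data
m_check = H(i_check + j)                  # climb to m using supplied j
assert H(m_check + n) == root             # and to the root using supplied n
print("verified")
```

Note the verifier only needed two intermediate values (j and n) plus the root, not all of a-h — which is exactly the "in-between cases" advantage, and exactly why those intermediate values have to be stored somewhere.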
The idea is that you can download part of a file from someone who doesn't have the whole file, so the person you're downloading the portion from may not know what the whole file looks like, and you have to get the verification info from someone else.

Now, to set up a service like Bitzi storing all the subtree values, I guess the storage would blow out. That's obviously why they don't do it. And I can't see a sane way to track all these sub-portions within the gnutella protocol. I guess you could automatically make lists and store them in config files, then automatically share them out as virtual files and automatically query for the virtual files, but it sounds like a big headache to me. Another option is calculating them on request, but this seems bad too. So the hash tree idea doesn't seem helpful to me (except maybe in an all-out swarming effort, which seems like too much work for not enough benefit at this stage). Can anyone point out something I've missed? Is anyone implementing this as a way to do swarmed downloads?

I'm back to thinking that the easy and reliable solution which works is just to query for a hash of a given byte-range. This has an additional benefit I didn't mention before: you could ask for the hash of mp3 files offset by 64 bytes or 64k or whatever size the id3 tag is. Then files which are the same except for metadata could easily be found: search for files matching the words blah foo bananas. Query the hash of file "foo bananas go blah.mp3" 2.35M offset by 64 bytes or whatever it is. Same with file "foo bananass goes blah.mp3" 2.37M. They match! Queue both as alternative downloads! Unfortunately "bananas go blah foo.mp3" 2.34M turned out to be a different file (must be 160bit or something ;P )

Nos

<I>[Edited 15 Apr 2002 to clean up drawing - sorry, pic looks OK in (gak!) IE but maybe not in <B>your</B> browser]</I>
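The offset-hash trick for metadata can be sketched like this. The 128-byte skip and the fake tag layout are purely illustrative (no real ID3 parsing here); SHA1 is assumed as the hash:

```python
import hashlib

def audio_hash(data: bytes, skip: int) -> str:
    """Hash the file contents after skipping a metadata prefix.

    `skip` would be whatever the tag size turns out to be; 128 here
    is just an illustrative guess, not derived from a real ID3 tag.
    """
    return hashlib.sha1(data[skip:]).hexdigest()

audio = b"\xffsome-frame-data" * 100   # the actual song bytes
file_a = b"TAGv2-metadata-A".ljust(128, b"\x00") + audio
file_b = b"TAGv2-other-meta".ljust(128, b"\x00") + audio

print(audio_hash(file_a, 128) == audio_hash(file_b, 128))  # True: same song
print(hashlib.sha1(file_a).hexdigest() ==
      hashlib.sha1(file_b).hexdigest())                    # False: whole files differ
```

Two copies that differ only in their tags get the same offset hash, so they could be queued as alternative download sources even though their whole-file hashes disagree.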
No need for a central server. Whoever is hosting (sharing) the file keeps the whole-file hash as well as the 1 MB incremental hashes. These are stored just like the sha1 for HUGE. Then if I start a download from you, I get that hash info. Now I can use it to search the gnet for the other parts - even from hosts with only partial files - to swarm from.
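A sketch of the checking step this implies, assuming a hypothetical arrangement where the uploader hands over its list of 1 MB chunk hashes along with the download (HUGE itself only standardises the whole-file sha1 urn):

```python
import hashlib

CHUNK = 1024 * 1024

# Hashes the original uploader sent along with the download
# (hypothetical layout, as explained above).
file_data = b"A" * CHUNK + b"B" * CHUNK + b"C" * 1000
advertised = [hashlib.sha1(file_data[i:i + CHUNK]).hexdigest()
              for i in range(0, len(file_data), CHUNK)]

def verify_part(part: bytes, index: int) -> bool:
    """Check a 1 MB part fetched from some other (maybe partial) host."""
    return hashlib.sha1(part).hexdigest() == advertised[index]

# A part swarmed from a second host can be checked before it is kept:
print(verify_part(b"B" * CHUNK, 1))   # True
print(verify_part(b"X" * CHUNK, 1))   # False: corrupt or a different file
```

The downloader never needs a central server - the hash list travels with the first download, and every later part from any host is checked against it.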
What would be better is a method where one could determine the component hashes by disassembling the full-file hash. Then only the full-file hash would need to be sent when requesting a file. I suppose that may be asking a bit much, though.
just a bit ;-) Yes, this would be great. But a downfall might be that an actual set of data could be calculated to match such a hash, and then it would be possible to create fake data with the same hash and screw up the downloads on the gnet.
Can be done

It's another way of doing it, but I didn't mention it because it isn't really a different scheme: you just make the hash for the whole file the concatenation of the hashes of the parts. It means that either you select parts that are pretty big compared with the size of the whole file, or you end up with a very long hash.

Nos
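The concatenation scheme, and the long-hash downside it runs into, can be sketched like this (SHA1 parts assumed):

```python
import hashlib

def concat_hash(data: bytes, part_size: int) -> bytes:
    """'Whole-file hash' built by concatenating the part hashes."""
    return b"".join(hashlib.sha1(data[i:i + part_size]).digest()
                    for i in range(0, max(len(data), 1), part_size))

# The downside described above: the identifier grows with the file.
small = concat_hash(b"x" * 2_000_000, 1024 * 1024)    # 2 parts
big = concat_hash(b"x" * 50_000_000, 1024 * 1024)     # 48 parts
print(len(small), len(big))                           # 40 vs 960 bytes
```

Any chosen part is still individually checkable against its 20-byte slice of the "hash", but a 50 MB file already needs nearly a kilobyte of identifier, which is why fixed-size tree roots are attractive instead.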
improvement of my idea

After talking with Gordon from Bitzi, I think tree hashes are overkill. Instead you could simply hash ranges of the file with sha1. This could be done in 1 MB chunks. So basically all files would be hashed twice: once for a full-file hash, and once where a hash is generated for each 1 MB portion of the file, starting from the beginning. Since a file will generally not be an exact multiple of 1 MB, the last hash may cover a segment shorter than 1 MB.

I don't have any basis for choosing 1 MB, of course; a bit of trial and error would be needed to optimize the system. Anything larger than 1 MB, say 5 MB or 10 MB, would be good for large files, but would not provide the benefits - especially the metadata benefits - for small files such as mp3s. Does anyone know more about metadata? Is it always stored at the end of files, even for videos etc.?
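The hash-everything-twice proposal can be sketched as follows, again with SHA1 and the 1 MB guess as assumptions:

```python
import hashlib

CHUNK = 1024 * 1024  # 1 MB: the post's starting guess, to be tuned

def double_hash(data: bytes) -> tuple[str, list[str]]:
    """Hash the file twice: once whole, once per 1 MB segment.

    The final segment is simply shorter when the file size isn't a
    multiple of 1 MB, exactly as described above.
    """
    whole = hashlib.sha1(data).hexdigest()
    parts = [hashlib.sha1(data[i:i + CHUNK]).hexdigest()
             for i in range(0, max(len(data), 1), CHUNK)]
    return whole, parts

whole, parts = double_hash(b"z" * (2 * CHUNK + 500))
print(len(parts))          # 3: two full segments plus a 500-byte tail
```

Compared with a full tree, this flat scheme loses the ability to verify arbitrary subtrees cheaply, but it is far simpler and covers the swarming case the thread cares about.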
Copyright © 2020 Gnutella Forums.
All Rights Reserved.