Signature tables for better searches...

Something for the inventors of the ultrapeer to look at:

I've written up an idea that might be of use in gnutella-like networks - I apologize in advance for any and all ignorance on my part about the state of the art:


Pretty Darn Good Peer to Peer Search Prediction

russjj@mail.com, posted March 7, 2002

The idea here is to prevent most peer-to-peer search-related traffic in applications like gnutella by providing a pretty good means of predicting whether sending a particular search on down the network might be successful. Traffic can be prevented if, most of the time, a small reference table can tell you that what you seek isn't available at a given machine or the whole network attached to it. The table doesn't tell you where a file is - it tells you, nearly all the time, where the file isn't, preventing the vast majority of search-related traffic.

This is derived from a program (later abandoned when I fell ill) that I created to form deltas/diffs on any kind of file efficiently, particularly for creating patches to .exe files automatically, and which I hoped might aid what is now called bioinformatics. Unfortunately, no one was interested in that subject area then.

The reference table is built up from checksum-like "signatures", one of which is created for each word in every description of every file. The reference table starts as all zeros. A one is placed at each location in the table corresponding to each signature/checksum. Let's say each signature of each word is 16 bits, and the table 65,536 bits (it could certainly be larger), or 8 KB. Then if "Cohen" yields a checksum/signature of 17,354, the 17,354th bit of the 8 KB table array becomes a one. If another word in the description of another file also yields a signature of 17,354, the bit stays a one, of course - it means that at least one word descriptive of a file had that signature. Once every word in every file description has been used to form a signature, and all the corresponding bits of our 8 KB table have been turned on, the table (which is intended to be quite sparse at this point) can be run-length encoded (RLEed) and sent "up the line" to the next machine, as it were.
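To make the construction concrete, here is a minimal sketch in Python. The hash function, the lower-casing, and the whitespace word-splitting are my own assumptions, not part of the proposal; any well-mixed checksum truncated to 16 bits would do:

    import hashlib

    TABLE_BITS = 16                 # 16-bit signatures -> 65,536-bit (8 KB) table
    TABLE_SIZE = 1 << TABLE_BITS

    def signature(word, bits=TABLE_BITS):
        # Checksum-like signature: hash the word, keep the low `bits` bits.
        # (MD5 is just a stand-in here; any well-mixed checksum would work.)
        digest = hashlib.md5(word.lower().encode()).digest()
        return int.from_bytes(digest[:4], "big") % (1 << bits)

    def build_table(descriptions):
        # Set one bit per word signature across all file descriptions.
        table = bytearray(TABLE_SIZE // 8)           # starts as all zeros
        for desc in descriptions:
            for word in desc.split():
                sig = signature(word)
                table[sig // 8] |= 1 << (sig % 8)    # duplicates just stay one
        return table

So build_table(["Bob Cohen live 1999.mp3"]) sets one bit for each of the four words, and a collision between two words is harmless: a shared bit just means at least one word had that signature.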

There, the tables that machine receives from the machines connected under it can be combined (logically added), and the table from this machine added in too. That's the beauty of these reference tables in such an application. The resulting table is still very sparse, and will yield an accurate "don't bother with further enquiries" (zero) for the vast majority of searches. The machines further down the line are never directly queried - we know the search would be unsuccessful. (Remember, this system doesn't necessarily predict whether a search sent on will be successful; what it does predict, usually, is when a search will not be successful. If the table predicts no success, that is accurate. If it says a successful search is possible, that's all it predicts - the search might or might not be successful.) The tables received from the machines directly under this one are kept as well, so that if a search looks like it might be successful (there was a one at the right spot in the combined table) these per-machine tables can be checked and the search sent on only in the appropriate direction(s).
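In the same sketch, "logically adding" tables is a byte-wise OR. (Readers today may recognize this scheme as essentially a Bloom filter with a single hash function: a zero bit is a definite no, a one bit only a maybe.)

    def combine(tables):
        # Logically add (OR) several tables into one to pass up the tree.
        result = bytearray(TABLE_SIZE // 8)
        for t in tables:
            for i, b in enumerate(t):
                result[i] |= b
        return result

    def bit_set(table, sig):
        # True if this table says a word with this signature *might* be reachable.
        return bool(table[sig // 8] & (1 << (sig % 8)))

A node would keep each child's table individually and pass combine([own_table] + child_tables) upward.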

If you imagine a tree structure imposed on the peer-to-peer network simply for search purposes - easily formed and repaired dynamically, since tables are easy to add - then as you go up toward the top of the tree, the tables are added together at each level, becoming less sparse. Searches are sent off only in potentially useful directions, and most branches of the tree won't even be asked about most searches, because it's known that the search would be unsuccessful. The bulk of the traffic that searches would otherwise have generated simply doesn't happen, but the results are still just as accurate.

For large networks, in order to keep the tables reasonably sparse even after thousands of tables have been summed together, it would be more reasonable to use, say, 24-bit signatures for each word and create 2-megabyte tables (2^24 = 16,777,216 bits, or 2 MB). Because these tables are very sparse, particularly at the edges of the network (where they will very likely contain no more ones than the 16-bit tables did, namely one per signature), they can be RLEed down to quite small files for transmission purposes.


Additional Notes:
If you're searching for "Bob Cohen", of course, you'll only send the search on if there's a one at the place in the reference table corresponding to the signature number for "Bob", and a one at the place corresponding to the signature number for "Cohen" as well. Etc.
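Continuing the sketch above, that multi-word check is just an AND across the per-word bits (the function names here are mine):

    def might_match(table, query):
        # Worth forwarding only if *every* word's bit is set: a zero anywhere
        # is a definite "not reachable here"; all ones is only a "maybe".
        return all(bit_set(table, signature(w)) for w in query.split())

    def route(query, child_tables):
        # child_tables maps each child connection to the table it sent up.
        # Forward only toward children whose tables don't rule the search out.
        return [child for child, t in child_tables.items() if might_match(t, query)]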

Note that this method still allows most of the work to be done at the edges of the network: signatures are calculated on each individual machine, and most table additions are done toward the edges of the network as well.

Since changes in tables have to be passed up the tree (resulting in new logical additions of tables to form result tables), it may be desirable to create one table per session per computer, rather than updating the table each time a new file is downloaded from another part of the network.

It may be desirable to throw out very common words and punctuation, that is, not to register their signatures in the reference tables.

Combined-model networks are possible - one could have a peer-to-peer association of more orderly trees, for example. This might be useful if the tables became quite dense at the "center" (from the perspective of our search trees) of the network.

For very sparse tables a two-stage (or more) RLE might be useful. This would first RLE which bytes were zero/non-zero, then RLE the string of non-zero bytes to specify them fully. A number-glue approach to the RLE might be helpful as well, which would allow one to use numbers as small as nibble-size and as large as needed.
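One possible reading of that two-stage scheme, sketched in Python; the exact framing, and byte-granular variable-length integers standing in for the "number glue", are my guesses at the intent:

    def encode_varint(n, out):
        # "Number glue": 7 bits per byte, high bit set means more bytes follow.
        while n >= 0x80:
            out.append((n & 0x7F) | 0x80)
            n >>= 7
        out.append(n)

    def rle_sparse(table):
        # Stage 1: alternating run lengths of zero / non-zero bytes.
        # Stage 2: the non-zero bytes themselves, appended after the runs.
        runs, payload = [], bytearray()
        i, n = 0, len(table)
        while i < n:
            j = i
            while j < n and table[j] == 0:
                j += 1
            runs.append(j - i)        # zero run (0 if the table starts non-zero)
            i = j
            while j < n and table[j] != 0:
                j += 1
            runs.append(j - i)        # non-zero run
            payload.extend(table[i:j])
            i = j
        out = bytearray()
        encode_varint(len(runs), out)   # run count first, so decoding is unambiguous
        for r in runs:
            encode_varint(r, out)
        return bytes(out + payload)

For the nearly empty 8 KB table above, this collapses thousands of consecutive zero bytes into a handful of small run lengths.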

No copyright is claimed on this document; it may be freely reproduced so long as the email address is included.