Gnutella Forums

Gnutella Forums (https://www.gnutellaforums.com/)
-   New Feature Requests (https://www.gnutellaforums.com/new-feature-requests/)
-   -   File size filter (https://www.gnutellaforums.com/new-feature-requests/53331-file-size-filter.html)

TorpGuy February 20th, 2006 02:43 AM

File size filter
 
Hello,

LimeWire is a great application!

I need the file size filter..
My examples are like this:

Maximum file size: ______
Minimum file size : ______

I need it because there are 'broken' files with 851.7
Kbytes. These files are not usable.

I hope you guys can make it available!

Only A Hobo February 20th, 2006 03:49 AM

I can't help feeling you are not the first to ask this;)

I think for now you will just have to work around the problem and not go for these small sizes which are no doubt much smaller than what you are looking for.

Limewire has a junk filter, which I do not have, but those who have mastered it find it very useful.

I hope this helps. (for now)

kiwichick888 March 15th, 2006 04:29 PM

I absolutely agree LW need a file size filter to get rid of all those annoying virus results!!!

mayotruck March 16th, 2006 10:32 AM

Get k++
 
Agreed, this is an OK P2P server but without the size filter, this program is very weak. Is there any add ons or anything that can be found, that will add a size filter to LW?? It's funny when i originally got LW everyone was telling me how much better LW was than Kazaa. I got rid of kazaa and jumped into my * new p2p sharing application and was excited to see what new innovations this program had. My first search was "dvd movies" it pulled up 5 actual dvd movies and the rest (of 500+ found results)were 1-35min movie clips. Unless i am doing something wrong(correct me if I am), I am not impressed at all with this program.I seached for sze filters eveywhere, but unfortunatly i have found nothing.

Lord of the Rings March 27th, 2006 08:17 AM

Well I find the Junk filter works quite well. It's a dynamic filter so you need to train it. It learns bit by bit. But yes, the size filter was my 1st request on the forum roughly 18 months ago. It would be handy for multiple reasons. But remember, if the intention is just to filter out virus, then they will increase the size of the virus. :rolleyes:

The network wasn't as full of virus as it is now. There's been a concentrated effort to spread virus & spam by dubious sources over the past 12 months (by autogenerated spamming results.) Particularly the past 6 months.

Grandpa March 30th, 2006 09:43 PM

TorpGuy

Why do I get the felling you opened some of those broken files. I hope I am wrong about that but if you did you are now infected with a virus. The files you mentioned 851.7kb are a known virus you might want to check out the link below.

http://www.limewire.com/english/content/firewalls.shtml

lwmcd1 May 29th, 2006 05:04 PM

Quote:

Originally Posted by Lord of the Rings
Well I find the Junk filter works quite well. It's a dynamic filter so you need to train it. It learns bit by bit. But yes, the size filter was my 1st request on the forum roughly 18 months ago. It would be handy for multiple reasons. But remember, if the intention is just to filter out virus, then they will increase the size of the virus. :rolleyes:

The network wasn't as full of virus as it is now. There's been a concentrated effort to spread virus & spam by dubious sources over the past 12 months (by autogenerated spamming results.) Particularly the past 6 months.

Yep, the file filter helped me with 851.7 ,197.1 ,228.0 so on and so on, but unforunately I've now started getting more and more search results containing nothing but those files. I can't help thinking the largest cause of this problem is the result of LW users that continue to carry these files in their PCs.Go figure.

yothisbling June 22nd, 2006 08:09 PM

Vote For This!
 
Hey everyone that wants this feature go to the file size filter JIRA issue and vote on it! The more votes the faster this feature will come!!!:D

ukbobboy01 June 23rd, 2006 05:09 AM

Yo...Bling

Good idea getting us to vote on this much needed feature, goodness only knows why the LW programmers have turned their collectives backs on this request which has been constantly asked for over the past two or so years.



UK Bob

lwmcd1 July 14th, 2006 07:34 PM

to bad
 
Well what I feared is coming to pass. I've done a number of searches and came up with nothing but viriuses. At the rate the 197.9 files and it's like are spreading I'm afraid that LimeWire will soon be rendered useless. That's to bad.

AaronWalkhouse July 15th, 2006 04:16 AM

Your salvation is coming. I have already gathered the hashes of all those worms
and spam and upcoming versions of LimeWire and other gnutella servents will be
using the lists to block them from your searches.

lwmcd1 July 15th, 2006 04:59 AM

Yea!
 
:) That's good news. Good for you. Seeing as how it has been going on so long I was starting to wonder if it were even possible to do anything about it.:)

lwmcd1 July 22nd, 2006 08:34 PM

As I've feared, LimeWire has become useless. I now rarely get ANY files that are not corrupted.

Lord of the Rings July 22nd, 2006 09:30 PM

Patience & some skill helps! Learn from your previous errors.

BruceLeeFan July 23rd, 2006 03:25 AM

Quote:

Originally Posted by AaronWalkhouse
Your salvation is coming. I have already gathered the hashes of all those worms
and spam and upcoming versions of LimeWire and other gnutella servents will be
using the lists to block them from your searches.

Salvation indeed:) Can't wait! It is getting to the stage these files are seriously messing up limewire.

A file size filter would be the finest though!

Question for those who have voted no --- why?

Ian

AaronWalkhouse July 23rd, 2006 03:50 AM

It's too simplistic to work by itself. Add it as a factor in the Junk Filter and
maybe it would be useful, otherwise people will be shooting themselves in the
foot all over the place. Hashes, on the other hand, are an absolute
identification of specific files and therefore more reliable.

Ed91 July 24th, 2006 05:42 AM

Maybe not for stopping junk, worms etc, but when I'm looking for a (very) large file, it would be useful to be able to filter the size to remove relivant, but unwanted results. As part an advanced search it would be more than useful.

musicaddict July 24th, 2006 05:43 AM

I just been marking all the obviously sus files as junk and setting the filter to not show junk. Also the the keywords option in filters is useful for cutting out things/sources you don't want to see. Supplementary question... if I mark a file as junk on my system, does that get rolled out over the network?

AaronWalkhouse July 24th, 2006 06:15 AM

No. It stays on your machine to filter your own searches.
If it was propagated over the network it would be open to abuse.

Ed91 July 24th, 2006 06:38 AM

It's probably unfeasible, but it would brilliant to have a "report file" system, when the client reports the file as junk, it's checked out, and if found to be junk, the hash is automaticly downloaded to all limewire clients to be used as part of their junk filter.. you could even (server side) store the file as it's hash which would easily allow you to check/disguard multiple reports of the same file..

may take a bit of time at first, but I'm sure after a while it would also act as a deterent to potential spammers.

But like I said, a lot of work.

Ed

AaronWalkhouse July 24th, 2006 08:27 AM

Not only that, but unscrupulous people like the MAFIAA would seize on the
capability and get courts to force developers to pervert it to their own ends.

This is why control must be left to the individual and collective operations must
be kept voluntary, like Bitzi.

Upcoming anti-spam and anti-malware capabilities will be user operated, and it
will be up to users to decide whether or not to trust the various sources of
the blocklists they can find.

lwmcd1 July 24th, 2006 09:14 AM

Quote:

Originally Posted by AaronWalkhouse
Not only that, but unscrupulous people like the MAFIAA would seize on the
capability and get courts to force developers to pervert it to their own ends.

This is why control must be left to the individual and collective operations must
be kept voluntary, like Bitzi.

Upcoming anti-spam and anti-malware capabilities will be user operated, and it
will be up to users to decide whether or not to trust the various sources of
the blocklists they can find.

My feelings exactly.......It's called self-reliance and or individual responsibility.
People now days just seem to want way to much done FOR them in lieu of taking care of it themselves. Just give me the tools..I'll do it myself. I don't intent any offense by this post so please don't take any.

Ed91 July 24th, 2006 09:42 AM

None taken. We all want the same thing here, and I'm perfectly happy to use the tools myself. I'm just thinking about the software and possible ways it could be improved, but yeah, I see where you're coming from.
I suppose I'm just trying to find ways around that, as it would make Limewire more "professional" and modern security wise. Could such a service be made optional, thus still volentry? That way, the legal route would seem less appealing as it could just be turned on/off, and even if the legal route was taken over the automatic downloading, it could be used to market limewire as a safe/secure way to distrubute music legally for users that would be worried by the prospect of ilegal downloads. Those less worried could just turn the service off. Or if it even got to that stage it would be well within your rights to make it a subscription service for the users of the professional software only, thus killing it off as no one would want it.

I have a feeling it's just one of those arguments I'm not going to win though, so I'll go back to the drawing board!

Ed

lwmcd1 July 24th, 2006 09:49 AM

Me I LOVE tools so I have no problem with any LimeWire cares to offer. I don't think I've ever won any arguments in any forum I've ever been in. Nor do I think has anyboby else. So don't woory about it. lol

AaronWalkhouse July 24th, 2006 10:50 AM

I win every time! I wonder why…
http://www3.telus.net/public/walkhoub/Axeman.JPG

Sleepless July 24th, 2006 04:24 PM

Junkrating ???
 
1 Attachment(s)
Quote:

Originally Posted by AaronWalkhouse
No. It stays on your machine to filter your own searches.
If it was propagated over the network it would be open to abuse.

This makes me have to ask why the junkrating on many GOOD files in my searches have a junkrating of 15-35 % ??

In the whole time that the junkfilter has been in the program I have marked 3 134.4 KB files as junk and maybe 10 of the 851.7 KB ones. Then why would the junkrating show on 700 MB files that I know are good ?? Shouldn't that be 0% then

All these files are good:

Attachment 2600

Better make that 51%

musicaddict July 24th, 2006 06:53 PM

Been thinking about this size filter and I don't think it would work. I am seeing search results that include my exact search string as the file name or embedded in it. If the spammers can write code that takes my search and pastes it into the filename of their trash, what is to stop them writing code that would return random sizes or from padding that trash to random sizes?

Sleepless July 24th, 2006 07:05 PM

It's in Phex and is working quite good there I might add. Not that I use Phex much but I've tried it

musicaddict July 25th, 2006 10:28 PM

Sounds good then Sleepless. They will probably work around it in the long term tho.
I am now wondering if the host blocking might work better, if there was a list of bad sites that could be centrally maintained so it wasn't abused?

Sleepless July 26th, 2006 09:01 AM

There is.

AaronWalkhouse is working on getting the programmers to have Limewire support the importing of lists.

Sounds like it may be possible allready in the next major version. (pretty soon)

lwmcd1 July 27th, 2006 11:23 AM

Does it do any good to block the hosts that are carrying the 197.7 files and others like it or is it a waste of time. I've been doing that but it is time consuming since there are so many of them.

Sleepless July 27th, 2006 11:42 AM

Thats why the list would be good ;)

I don't bother blocking them. Just don't download them

lwmcd1 July 27th, 2006 02:28 PM

Nor do I.
It's just gotten that MOST of the time that's ALL I get on a search.
I was just wondering if it would help cut down on it some.

Sleepless July 27th, 2006 03:23 PM

A combination of the Block lists and a file size filter would help a great deal I'm sure.

Here are some tips on how to bypass some of the spam

How to find music :)

et voilą August 27th, 2006 05:52 AM

Well maybe the issue here is that the files in the result are near same file size and probably have many sources. LW filter seems to be very picking when there are lots of sources, thinking it could be spammers. Of course, there is room for improvements here, as the spam filter helps, but not too much.

Ciao

mrdudemanx2 August 28th, 2006 09:53 AM

Why isn't there a "Hell yeah! This is what everyone's been waiting for!" option in the poll?

By the way, has a virus/spam file size thread been created yet? If not, I'd like to make one containing the file sizes of guaranteed junk files and their file type, such as the 197.7 KB Zip File, or the 134.4 KB Mp3 file. I think this may help a lot, especially if we do come out with a file size filter.

macho6868 September 3rd, 2006 09:17 PM

you people are pushing the size filter the worng way....
 
you people are pushing the size filter request the worng way.... iv'e explained already in other threads! (lordoftherings has warned also).... my explaination: yes i agree we need a file size filter. but not just any, it has to be a SPECIFIC size(s) filter....just like the way the "keryord" junk filter works... for example if you filter out the word "cat" and you'll never get any search results with the word cat..... MY POINT IS if we create a file size filter it must be a filter so we can add "specific file sizes... like lordofthering said in previous threads... "if we have a file size filter the spammer will adjust (workaround) to it and just make then bigger from 851.7 or 197.7 kb to "megabytes... or hundreds of megabytes..... also its not being fair to the network filtering out like let's say: size from____ to___ because then were missing out on a whole range of files we might want to look up in the near future.... like photos, pdf files.. or docments... my conclusion is we must be sensitive with this area of type of filter.... and if in the near future limewire does get creative and make a size filter... PLEAZE I only beg that it is not a size filter in range of from__ from to to__ any it be a "list" type like a keyword filter... EXAMPLE: filter out these specific file sizes....
851.7
197.7
....etc

and when you do a search result, there shouldnt be any of those files showing up.... (until riaa or sony hires some low life hacker to do thier dirty work and make a "workaround")
PS: there is always a workaround to a workaround.... LOL

PSS: there should also be an intelligent filter to auto-ignore results like B_L_U_E__C_A_T__R_I_S_I_N_G.wma from the real file "blue cat rising.wma"
:bangh:

Sleepless September 3rd, 2006 09:28 PM

There is add _ to yout keyword filter

And BTW if you had read the thread properly you would see that this is about three seperate filters

1. A filesize filter with ranges
2. An IP filter built in to Limewire
and 3. A HASH filter

They would find themselves in heaps of trouble coming up with many bad files if ALL of them were implemented into Limewire

And BTW the ability to search by exact phrase would also help as the bad results would not be triggered nearly as often

macho6868 September 3rd, 2006 09:42 PM

hmmmm???
 
.....and we ALL know that those "pretty" autogenerated spam search results servers or somputers are being "financed/maintained" by riaa/sony/mpaa .... but were not snitching... LOL:D :D

off topic; (have you also noticed that if you yahoo or google you get "spammed" results, harmless but annoying website "advertising" similar things you type in... and maybe now&then you get a lil pop up window (small rectangle) lol.... oh and try mis-typring a website lol:D you really get a surprise.... im just surprised there is no action there since we all know the gov/nt cant control what is in someone's computer (like p2p networks) but they can control whats on the net (like servers/ip)...hmm..?? new trend..... i guess the internet will soon get less users & be paralyzed in like 3 yrs..? who knows.... (the effect is happening to p2p networks (like discouraging users by auto-results spamming/viruses) now were getting auto results in searches in google or yahoo or any search engine))... i remember like last year if i typed in something that is a unique word.. in yahoo or google it woud say "search results 0" (empty) LOL now you can type in anything and get "thousands of websites" claiming results ranging from sales, information, to plain just anything...lol spammers these days...gotaa love em (hate em) (for screwing up the internet & undustry....) there pushing the internet to a point that if it doesnt come to a halt it will be controlled by the gov/nt just like in china.... lol spammers beware, you are only pushing you "hobbies" off a cliff...then you have nothing else better to do.... LOL.... id love to see your face right now....
let the war begin; real hackers VS riaa/mpaa/sony & thier auto-gen spammer servers & beautifil ip addresses that flip like pages... (heck that'd be a good project for you guys when you're bored)(getting rid of those auto-gens & tearing up thier systems heck you're NOT doing anything illegal! since your stopping them from thier illegal actions.... tell me if im wrong!!! because if im wrongg then people that spread viruses are doing something legal..LOL ) do it so limewire doesnt have to work sooooo hard to create a size filter... lol

opps im talking too much.... this post might get edited or deleted by the moderator...lol or i might get traced....lol (if i do then I know im right...lol):bangh: :bangh: :bangh: :tease: :tease: :tease:

macho6868 September 3rd, 2006 09:45 PM

did you not read all that i typed in why range filters (size) will NOT work.. because they will create a workaround... and hash (also soo off topic) also ip filter is soo off topic.. and will never WORK!!! ip addresses always change!!!!!!!!
and explain how "hash" filter will work... (it never will)

macho6868 September 3rd, 2006 09:47 PM

and yes i read the thread and thier 3 topics... but i do have the right to focus on just one for a sec.....
also (im that that bright.... whats does BTW mean)..... it took me 6 years to figure out whats lol means.... where can i get an "internet dictionary?)

Sleepless September 3rd, 2006 09:56 PM

By The Way ;)

And you actually think that blocking those specific files with the sizes you mentioned will work better than a HASH filter :xirokrotima: :rofl: :cheesy:

I maybe should tell you that what you were talking about was a HASH filter ;)

macho6868 September 3rd, 2006 10:08 PM

???
 
"real programmers know whats a REAL hash filter does" and how it works.... it just that internet revolution changes :tease: :tease: :dance: :dance: :dance: :dance:
opps you DID do a funny:dance: :dance: :dance:

...and what the word hash really means.....
maybe you shoud pm me to you "brain war" so we wont take up precious space in this forum.... and there wont be any public embarasments on both sides.... oh but maybe...... WERE ON THE SAME SIDE!!!!!!! HELLO??

Sleepless September 3rd, 2006 10:15 PM

Limewire uses hashes to identify files. Haven't got time for a brainwar, and I am not saying that I know more about this than you do but that I know

How would you filter out the files of specific sizes if not using HASH

Sleepless September 3rd, 2006 10:19 PM

And just one more thing before I go

If you know so much about programming then why not make one and ask them to put it in the code for the next version

Or even take the code and offer it as an alternative with the filter. It's open source you know

;)

macho6868 September 3rd, 2006 10:21 PM

Hashing is the transformation of a string of characters into a usually shorter fixed-length value or key that represents the original string. Hashing is used to index and retrieve items in a database because it is faster to find the item using the shorter hashed key than to find it using the original value. It is also used in many encryption algorithms.
As a simple example of the using of hashing in databases, a group of people could be arranged in a database like this:

Abernathy, Sara
Epperdingle, Roscoe
Moore, Wilfred
Smith, David
(and many more sorted into alphabetical order)

Each of these names would be the key in the database for that person's data. A database search mechanism would first have to start looking character-by-character across the name for matches until it found the match (or ruled the other entries out). But if each of the names were hashed, it might be possible (depending on the number of names in the database) to generate a unique four-digit key for each name. For example:
7864 Abernathy, Sara
9802 Epperdingle, Roscoe
1990 Moore, Wilfred
8822 Smith, David
(and so forth)

A search for any name would first consist of computing the hash value (using the same hash function used to store the item) and then comparing for a match using that value. It would, in general, be much faster to find a match across four digits, each having only 10 possibilities, than across an unpredictable value length where each character had 26 possibilities.
The hashing algorithm is called the hash function (and probably the term is derived from the idea that the resulting hash value can be thought of as a "mixed up" version of the represented value). In addition to faster data retrieval, hashing is also used to encrypt and decrypt digital signatures (used to authenticate message senders and receivers). The digital signature is transformed with the hash function and then both the hashed value (known as a message-digest) and the signature are sent in separate transmissions to the receiver. Using the same hash function as the sender, the receiver derives a message-digest from the signature and compares it with the message-digest it also received. They should be the same.

The hash function is used to index the original value or key and then used later each time the data associated with the value or key is to be retrieved. Thus, hashing is always a one-way operation. There's no need to "reverse engineer" the hash function by analyzing the hashed values. In fact, the ideal hash function can't be derived by such analysis. A good hash function also should not produce the same hash value from two different inputs. If it does, this is known as a collision. A hash function that offers an extremely low risk of collision may be considered acceptable.

Here are some relatively simple hash functions that have been used:

The division-remainder method: The size of the number of items in the table is estimated. That number is then used as a divisor into each original value or key to extract a quotient and a remainder. The remainder is the hashed value. (Since this method is liable to produce a number of collisions, any search mechanism would have to be able to recognize a collision and offer an alternate search mechanism.)
Folding: This method divides the original value (digits in this case) into several parts, adds the parts together, and then uses the last four digits (or some other arbitrary number of digits that will work ) as the hashed value or key.
Radix transformation: Where the value or key is digital, the number base (or radix) can be changed resulting in a different sequence of digits. (For example, a decimal numbered key could be transformed into a hexadecimal numbered key.) High-order digits could be discarded to fit a hash value of uniform length.
Digit rearrangement: This is simply taking part of the original value or key such as digits in positions 3 through 6, reversing their order, and then using that sequence of digits as the hash value or key.
A hash function that works well for database storage and retrieval might not work as for cryptographic or error-checking purposes. There are several well-known hash functions used in cryptography. These include the message-digest hash functions MD2, MD4, and MD5, used for hashing digital signatures into a shorter value called a message-digest, and the Secure Hash Algorithm (SHA), a standard algorithm, that makes a larger (60-bit) message digest and is similar to MD4

macho6868 September 3rd, 2006 10:23 PM

yo sleepness i cant belive i typed this much today.... im soo lazy.... LOL

macho6868 September 3rd, 2006 10:25 PM

The purpose of this one-way-hash filter is to have large scale hashing in a small subset of data.

It is similar to a bloom filter except that instead of using a binary system the filter uses an a-z system where each point can hold more that one value and is not limited to binary values.


View source code v.01 going to add in bitwise shifting soon.
View example of a spellchecker done entirely in php with a 300,000 word dictionary (2.5M), the hash table is a 50x50 (1 in 3,250,000 probabilty) at that size the hash table is 419k serialized and uncompressed, it could be done in a 25x25 hash table which would be around 200k however the probabilty of false positives is much much higher (Compare 25x25 matrix example against the above example).


--------------------------------------------------------------------------------

Why use these?

A 250x250 matrix (large) has a unique factor of over 40 million and is around 16 megabytes blank. The structure is designed for large hashes. It would be easily possible to hold the entire domain name system hashed in around 20 megs. What this means is you could have very fast lookups on if a name is in the system.


--------------------------------------------------------------------------------

How does this filter work?


Take an item and hash it four times, modulus each value to the following values.
A marker for the X axis of the matrix
A marker for the Y axis of the matrix
A marker for how deep into the hash at the matrix point to go into.
A marker for the letter to use,
Insert it twice by flipping the X and Y values.

Each item hashed in the database will have a unique value set to each of the above markers. For lookups, the letter must be in two exact places withing the entire filter. In the above example of the spelling checker, there is a 1/1.625M probabily of a duplicate at any given point (due to doubling data inserting).

These filters can take a while to generate, they are very processor intensive to create. These filters use a matrix format for speed in lookups, each matrix point holds a small filter.


--------------------------------------------------------------------------------

How is that this filter can hold so much information?

Example small filter

0000000000

Insert "d" at position '2' (note:Counting starts at zero)

00d0000000

Insert "z" at position '5' (note:Counting starts at zero)

00d00z0000

Now it becomes interesting. insert "f" at position '2' (on top of another letter in place)

00Df00z0000

If another needs to be positioned at the same point. insert "g" at position '2' (on top of another two letters in place)

00DFg00z0000

The array maintains all positions and if it encounters an uppercase letter it concats until a non-uppercase letter is found.

Array
(
[0] => 0
[1] => 0
[2] => DFg
[3] => 0
[4] => 0
[5] => z
[6] => 0
[7] => 0
[8] => 0
[9] => 0
)

Multiple hashing algorithms are used for placements.

TO DO:


Non-XY-flip version that uses a unique secondary value or a combination of other values, however I felt that at this time the number of hashes is approaching maximum.
More documentation and list of papers cited.
List of links to projects/experiments using bloom filters for large datasets.

macho6868 September 3rd, 2006 10:26 PM

just in case i dinnt type all of this......

A hash function (or hash algorithm) is a way of creating a small digital "fingerprint" from any kind of data. The function chops and mixes (i.e., substitutes or transposes) the data to create the fingerprint, often called a hash value. The hash value is commonly represented as a short string of random-looking letters and numbers (Binary data written in hexadecimal notation). A good hash function is one that yields few hash collisions in expected input domains. In hash tables and data processing, collisions inhibit the distinguishing of data, making records more costly to find.


A typical hash function at workContents [hide]
1 Properties of hash function
2 Applications of hash function
2.1 Cryptography
2.2 Hash tables
2.3 Error correction
2.4 Audio identification
2.5 Rabin-Karp string search algorithm
3 Origins of the term
4 See also
5 References
6 External links



[edit]
Properties of hash function
A fundamental property of all hash functions is that if two hashes (according to the same function) are different, then the two inputs are different in some way. This property is a consequence of hash functions being deterministic. On the other hand, a hash function is not injective, i.e. the equality of two hash values ideally strongly suggests, but does not guarantee, the equality of the two inputs. If a hash value is calculated for a piece of data, and then one bit of that data is changed, a hash function with strong mixing property usually produces a completely different hash value.

Typical hash functions have an infinite domain, such as byte strings of arbitrary length, and a finite range, such as bit sequences of some fixed length. In certain cases, hash functions can be designed with one-to-one mapping between identically sized domain and range. Hash functions that are one-to-one are also called permutations. Reversibility is achieved by using a series of reversible "mixing" operations on the function input.

[edit]
Applications of hash function
Because of the variety of applications for hash functions (details below), they are often tailored to the application. For example, cryptographic hash functions assume the existence of an adversary who can deliberately try to find inputs with the same hash value. A well designed cryptographic hash function is a "one-way" operation: there is no practical way to calculate a particular data input that will result in a desired hash value, so it is also very difficult to forge. Functions intended for cryptographic hashing, such as MD5, are commonly used as stock hash functions.

Functions for error detection and correction focus on distinguishing cases in which data has been disturbed by random processes. When hash functions are used for checksums, the relatively small hash value can be used to verify that a data file of any size has not been altered.

[edit]
Cryptography
Main article: cryptographic hash function
A typical cryptographic one-way function is not one-to-one and makes an effective hash function; a typical cryptographic trapdoor function is one-to-one and makes an effective randomization function.

[edit]
Hash tables
Main article: Hash table
Hash tables, a major application for hash functions, enable fast lookup of a data record given its key. (Note: Keys are not usually secret as in cryptography, but both are used to "unlock" or access information.) For example, keys in an English dictionary would be English words, and their associated records would contain definitions. In this case, the hash function must map alphabetic strings to indexes for the hash table's internal array.

The generally impossible/impractical ideal for a hash table's hash function is to map each key to a unique index (see perfect hashing), because this guarantees access to each data record in the first probe into the table.

Hash functions that are truly random with uniform output (including most cryptographic hash functions) are good in that, on average, only one or two probes will be needed (depending on the load factor). Perhaps as important is that excessive collision rates with random hash functions are highly improbable—if not computationally infeasible for an adversary. However, a small, predictable number of collisions is virtually inevitable (see birthday paradox).

In many cases, a heuristic hash function can yield many fewer collisions than a random hash function. Heuristic functions take advantage of regularities in likely sets of keys. For example, one could design a heuristic hash function such that file names such as FILE0000.CHK, FILE0001.CHK, FILE0002.CHK, etc. map to successive indices of the table, meaning that such sequences will not collide. Beating a random hash function on "good" sets of keys usually means performing much worse on "bad" sets of keys, which can arise naturally—not just through attacks. Bad performance of a hash table's hash function means that lookup can degrade to a costly linear search.

Aside from minimizing collisions, the hash function for a hash table should also be fast relative to the cost of retrieving a record in the table, as the goal of minimizing collisions is minimizing the time needed to retrieve a desired record. Consequently, the optimal balance of performance characteristics depends on the application.

One of the most respected hash functions for use in typical hash tables is Bob Jenkins' LOOKUP2 hash function, published in an article in Dr. Dobb's Journal. The hash function performs well as long as there is no adversary, for it is trivially reversible and useless as a cryptographic hash function.

[edit]
Error correction
Main article: Error correction and detection
Using a hash function to detect errors in transmission is straightforward. The hash function is computed for the data at the sender, and the value of this hash is sent with the data. The hash function is performed again at the receiving end, and if the hash values do not match, an error has occurred at some point during the transmission. This is called a redundancy check.

For error correction, a distribution of likely perturbations is assumed at least approximately. Perturbations to a string are then classified into large (improbable) and small (probable) errors. The second criterion is then restated so that if we are given H(x) and x+s, then we can compute x efficiently if s is small. Such hash functions are known as error correction codes. Important sub-class of these correction codes are cyclic redundancy checks and Reed-Solomon codes.

[edit]
Audio identification
For audio identification such as finding out whether an MP3 file matches one of a list of known items, one could use a conventional hash function such as MD5, but this would be very sensitive to highly likely perturbations such as time-shifting, CD read errors, different compression algorithms or implementations or changes in volume. Using something like MD5 is useful as a first pass to find exactly identical files, but another more advanced algorithm is required to find all items that would nonetheless be interpreted as identical to a human listener. Though they are not common[citation needed], hashing algorithms do exist that are robust to these minor differences. Most of the algorithms available are not extremely robust, but some are so robust that they can identify music played on loud-speakers in a noisy room.[citation needed] One example of this in practical use is the service Shazam. Customers call a number and place their telephone near a speaker. The service then analyses the music, and compares it to known hash values in its database. The name of the music is sent to the user (for a charge!). An open source alternative to this service is MusicBrainz which creates a fingerprint for an audio file and matches it to its online community driven database.

[edit]
Rabin-Karp string search algorithm
Rabin-Karp string search algorithm is a relatively fast string searching algorithm that works in O(n) time on average. It is based on the use of hashing to compare strings.

[edit]
Origins of the term
The term "hash" apparently comes by way of analogy with its standard meaning in the physical world, to "chop and mix." Knuth notes that Hans Peter Luhn of IBM appears to have been the first to use the concept, in a memo dated January 1953; the term hash came into use some ten years later.

In the SHA-1 algorithm, for example, the domain is "flattened" and "chopped" into "words" which are then "mixed" with one another using carefully chosen mathematical functions. The range ("hash value") is made to be a definite size, 160 bits (which may be either smaller or larger than the domain), through the use of modular division.

macho6868 September 3rd, 2006 10:27 PM

this is why hash doesnt work.....(im figuring out a workaround)......


All times are GMT -7. The time now is 02:48 PM.

Powered by vBulletin® Version 3.8.7
Copyright ©2000 - 2024, vBulletin Solutions, Inc.
SEO by vBSEO 3.6.0 ©2011, Crawlability, Inc.

Copyright © 2020 Gnutella Forums.
All Rights Reserved.