Gnutella Forums - View Single Post

kmag · #13 · December 19th, 2005

Thanks for the feedback.

In 4.9.40, Roger fixed a display inconsistency: the threshold for showing a trash can in the "quality" column is now equal to the threshold for hiding junk (if the hiding junk option is enabled).

The problem of too many files being marked junk over time is likely caused by the filter being set up to learn more from bad hints than from good hints. (In the code, these hints are called "tokens".) Hopefully LW 4.9.40 is much better about this; the learning should be much less biased toward the bad unless you set the sensitivity above 50%.

The spam filter is actually a set of filters, where the file starts out being 100% good, and each filter multiplies the goodness by some value between 1.0 (inclusive) and 0.0 (exclusive). It's probably a host of different filters that are whittling the files down to a "junk" rating.
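To make the multiplication concrete, here is a rough Python sketch (not LimeWire's actual code). The function name, the example factors, and the 0.5 junk cutoff are all made up for illustration; only the "start at 100% good, multiply by a value in (0.0, 1.0]" idea comes from the description above.

```python
from typing import Iterable

# Hypothetical cutoff for illustration only; not a value taken from LimeWire.
JUNK_CUTOFF = 0.5


def combined_goodness(factors: Iterable[float]) -> float:
    """Start at 100% good and let each filter multiply the goodness down."""
    goodness = 1.0
    for factor in factors:
        # Each filter contributes a value in (0.0, 1.0]: 1.0 means "no opinion".
        if not 0.0 < factor <= 1.0:
            raise ValueError(f"filter factor out of range: {factor}")
        goodness *= factor
    return goodness


def is_junk(factors: Iterable[float]) -> bool:
    """A file counts as junk here once its combined goodness drops below the cutoff."""
    return combined_goodness(factors) < JUNK_CUTOFF


# Four mild filters, none harsher than 0.8, still push the file under the cutoff.
example_factors = [0.9, 0.8, 0.8, 0.8]
example_score = combined_goodness(example_factors)  # roughly 0.46
```

The point of the example is that no single filter has to be harsh: several mildly negative ones are enough to produce a "junk" rating.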

Basically, if you search for some terms and end up getting a result that you mark spam, LW will internally create a bunch of tokens for different things LW knows about the file. There's a token for the size of the file, a token for each word in the title, and so on. Tokens that keep showing up in search results with "very bad" spam ratings gradually get marked more and more "bad". Tokens that keep showing up in search results with "very good" spam ratings gradually get marked more and more "good".

Part of the problem is probably that the standard for "very good" was stricter than the standard for "very bad" (hard-coded to below 15% junk vs. above 70% junk), so with each search, the effect of the "bad" tokens relative to the "good" tokens was multiplied. Basically, lots of very "bad" tokens mean lots of search results get very bad spam ratings, which means lots of tokens slowly get marked more "bad"... it's a snowball effect, and we need an opposing "good" snowball effect to cancel it out. This is an over-simplification, but hopefully it helps you get a general idea of what goes on inside the spam filter.
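Here is a very rough Python sketch of that learning loop, again not the real LimeWire code. The 15% and 70% thresholds are the hard-coded values mentioned above; the token format, the 0.5 starting value, the step size, and all function names are assumptions made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List

VERY_GOOD_BELOW = 0.15  # from the post: "very good" means below 15% junk
VERY_BAD_ABOVE = 0.70   # from the post: "very bad" means above 70% junk
STEP = 0.1              # hypothetical learning step, not a LimeWire value


def tokens_for(title: str, size: int) -> List[str]:
    """One token for the file size plus one per word in the title (format is made up)."""
    return [f"size:{size}"] + [f"word:{w}" for w in title.lower().split()]


def new_badness_table() -> Dict[str, float]:
    """Unknown tokens start out neutral at 0.5 (an assumption)."""
    return defaultdict(lambda: 0.5)


def learn(tokens: List[str], junk_rating: float, badness: Dict[str, float]) -> None:
    """Nudge tokens toward 'bad' or 'good' only when the result is rated at an extreme."""
    if junk_rating > VERY_BAD_ABOVE:
        for t in tokens:
            badness[t] = min(1.0, badness[t] + STEP)
    elif junk_rating < VERY_GOOD_BELOW:
        for t in tokens:
            badness[t] = max(0.0, badness[t] - STEP)
    # Ratings between the two thresholds teach the filter nothing.


table = new_badness_table()
result_tokens = tokens_for("Free Song", 1234)
learn(result_tokens, 0.80, table)  # "very bad" result: every token gets a bit worse
learn(result_tokens, 0.40, table)  # middling result: no change
```

With these thresholds, a result has to land in the bottom 15% of the junk scale to push its tokens toward "good", but anywhere in the top 30% pushes them toward "bad", which is the imbalance that feeds the snowball.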

Give 4.9.40 a try and let us know how it works for you.

Don't be shy about going into the options and changing the sensitivity of the junk filter. In 4.9.40 (unlike 4.9.39 and 4.9.38), to some extent the sensitivity of the junk filter affects the balance of influence between bad junk ratings and good junk ratings. Below a sensitivity of 50%, it's hard to say which way the learning is biased. Above 50% sensitivity, the learning becomes more and more biased toward increasing the "bad-ness" of tokens. Hopefully with feedback from real-world usage, we can tweak the filter to have very little bias in the sensitivity ranges that people actually use.