
On Spam

Spam is a tricky problem. Or as Matt Haughey says "spam bloggers sure are resourceful little bastards."

For a while now, the Blogger team has been contending with spam on Blog*Spot through mechanisms like Flag as Objectionable and comment/blog creation CAPTCHAs. The spam classifier that Pal described has also dramatically reduced the amount of spam that folks experience when browsing NextBlog.

However, spam is still being created and, as was widely noted, Blogger was especially targeted this weekend.

Among the folks particularly affected by blog spam are those who use blog search services and those who subscribe to feeds of results from those services. When spam goes up, it directly degrades the quality of those results. I'm exceedingly sympathetic to these folks because, well, we run one of those services ourselves.

So given that the problem is hard, what more are we doing? One thing we can do is improve the quality of the Recently Updated information we publish.

Recently Updated lists like the one Blogger publishes are used by search services to determine what to crawl and index. A big goal in deploying the filtered NextBlog and Flag as Objectionable was to improve our spam classifiers. As we improve these algorithms, we plan to pass the filtered information along automatically. As a first step, we're publishing a list of deleted subdomains that were created this weekend during the spamalanche.
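To make that concrete, here's a minimal sketch of how a search service might use such a list of deleted spam subdomains to filter a Recently Updated feed before crawling. The file format, function names, and sample data below are all hypothetical, not an actual Blogger interface:

```python
# Hypothetical sketch: drop Blog*Spot subdomains that appear on a published
# list of deleted spam blogs before queuing recently-updated URLs for a crawl.

def load_deleted_subdomains(path):
    """Read a plain-text list of deleted subdomains, one per line (hypothetical format)."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def filter_recently_updated(urls, deleted_subdomains):
    """Keep only URLs whose blogspot subdomain is not on the deleted-spam list."""
    kept = []
    for url in urls:
        host = url.split("//", 1)[-1].split("/", 1)[0].lower()
        subdomain = host.split(".", 1)[0]  # e.g. "somespamblog" from somespamblog.blogspot.com
        if subdomain not in deleted_subdomains:
            kept.append(url)
    return kept

# Example usage with made-up data:
deleted = {"somespamblog", "anotherspamblog"}
updates = ["http://somespamblog.blogspot.com/", "http://legitblog.blogspot.com/"]
print(filter_recently_updated(updates, deleted))  # -> ['http://legitblog.blogspot.com/']
```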

Greg from Blogdigger (one of the folks who consume blog data) points out that "ultimately the responsibility for providing a quality service rests on the shoulders of the individual services themselves, not Google and/or Blogger." However, we think that by sharing what we've learned about spam on Blogger, we can improve the situation for everyone.

We can also make it more difficult for suspected spammers to create content. This includes placing challenges in front of would-be spammers to deter automation.
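As a rough illustration of the idea, here's an invented sketch of gating posting on a classifier score, so challenges are reserved for accounts that already look suspicious. The thresholds, score ranges, and decision names are made up for the example and aren't how Blogger actually works:

```python
# Invented sketch: gate posting on a spam-classifier score so that trusted
# users post freely, suspicious ones get a CAPTCHA, and near-certain spam
# is rejected outright.
CAPTCHA_THRESHOLD = 0.5   # hypothetical score above which a challenge is shown
BLOCK_THRESHOLD = 0.95    # hypothetical score above which the request is refused

def decide_action(spam_score):
    """Map a classifier's spam probability onto a posting decision."""
    if spam_score >= BLOCK_THRESHOLD:
        return "block"
    if spam_score >= CAPTCHA_THRESHOLD:
        return "captcha"
    return "allow"

for score in (0.1, 0.7, 0.99):
    print(score, "->", decide_action(score))  # allow, captcha, block
```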

Of course, false positives are an unavoidable risk with automatic classifiers. And it's important to remember that the majority of content being posted on Blog*Spot is not spam (we know this from the ongoing manual reviews used to train the spam classifier).

Some have suggested that we go a step further and place CAPTCHA challenges in front of all users before posting. I don't believe this is an acceptable solution.

First off, CAPTCHAs represent a burden for all users (the majority of whom are legit), an impossible barrier for some, and are incompatible with API access to Blogger.

But, most importantly, wrongdoers are already breaking CAPTCHAs on a daily basis, and not through clever algorithmic means but via the old-fashioned, human-powered way. We've actually been able to observe when human-powered CAPTCHA solvers come online by analyzing our logs. You can even use the timestamps to determine where this CAPTCHA-solving originates.
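For the curious, here's a toy sketch of the kind of timestamp analysis that makes this visible: bucket successful CAPTCHA solves by UTC hour and watch where the volume spikes. The log format and sample data are made up for illustration, not our actual pipeline:

```python
# Toy sketch: histogram CAPTCHA-solve timestamps by UTC hour to see when
# human-powered solvers come online; the peak hours hint at their timezone.
from collections import Counter
from datetime import datetime

def solves_by_hour(timestamps):
    """Count successful solves per UTC hour of day (ISO 8601 strings assumed)."""
    counts = Counter()
    for ts in timestamps:
        counts[datetime.fromisoformat(ts).hour] += 1
    return counts

# Made-up sample of solve timestamps.
sample = ["2005-10-17T03:12:00", "2005-10-17T03:45:00",
          "2005-10-17T04:02:00", "2005-10-17T15:30:00"]
hist = solves_by_hour(sample)
peak = max(hist, key=hist.get)
print("solves per UTC hour:", dict(hist), "| peak hour:", peak)
```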

One thing we've learned from Blog Search is that even if spam were completely solved on Blog*Spot, there would still be a problem. As others have concluded, this is going to be an ongoing challenge for Blogger, Google, and all of us who are interested in making it easier for people to create and share content online.
