In the last ten days, four very important Sampa sites were created. They were created by very close friends. Why are they important? Because these people used the Sampa Alpha version more than a year ago and never came back. I didn't bother them recently to create sites, but they did so without even talking to me.
The first site is about a couple that moved to Washington DC and thought that a blog would be a good way to keep in touch with friends in Seattle. The second is another couple just creating a personal site. The third is from a guy who wants to create easy pages when he wants to sell something on Craigslist, and the fourth is the personal site of a female friend.
Most of them are Microsoft people -- so they shall remain anonymous -- and they could have created their sites on MSN Spaces (Live Spaces) or Office Live. I'm sure they have sites on those services as well. But the fact that they are using Sampa says a lot about our feature list and overall user experience. Maybe they can't do what they want on MSN Spaces (Office Live is mostly for business anyway), or maybe Sampa had just the extra thing that they needed.
IMHO, this is a sign that we are moving our user experience in the right direction.
Amazingly enough, Sampa gets more page requests from crawlers and bots than from real users. For example, yesterday 59% of the page requests were from crawlers and bots, and only 41% were from real users.
Of course, in our stats we always discard the crawlers, because they are not real users requesting pages and they can seriously inflate your number of Unique Users and Visits, since crawlers (mostly) don't support cookies.
So, to help my fellow Web 2.0 entrepreneurs, I'm listing some of the string matches that we use to detect whether a user-agent is a crawler:
Crawlers:
bot
crawler
spider
spyder
fetch
perl
search
feedseek
screenshot
scout
thumbnail
reader
mediapartners
jeeves
ia_archiver
slurp
yahoofeed
yahoo-blogs
del.icio.us
nutch
netnewswire
moreover
stackrambler
boitho
blogpulse
snap.com
everest
filangy
stumble
zyborg
baldric
hanzoweb
yacy
wazzup
python
feedcheck
dragonfly
netcraft
grabber
linkwalker
egothor
irlbot
psbot
heritrix
tmcrawler
libwww
jakarta
httpclient
java/1
wget/
Besides those, any user-agent that has fewer than 15 characters is considered a crawler. I have to update this list every month, because every month there are at least a couple of new crawlers whose user-agent string doesn't contain the word "crawler" or "bot", and most of the time they come from some university CS department.
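To make the rule concrete, here is a minimal sketch in Python of how such a check could look. It is not Sampa's actual code: the names `CRAWLER_SUBSTRINGS`, `is_crawler`, and `crawler_share` are made up for illustration, and I'm assuming the match is a case-insensitive substring test and that a missing user-agent counts as a crawler.

```python
# Substrings from the list above, matched against the lowercased user-agent.
CRAWLER_SUBSTRINGS = """
bot crawler spider spyder fetch perl search feedseek screenshot scout
thumbnail reader mediapartners jeeves ia_archiver slurp yahoofeed yahoo-blogs
del.icio.us nutch netnewswire moreover stackrambler boitho blogpulse snap.com
everest filangy stumble zyborg baldric hanzoweb yacy wazzup python feedcheck
dragonfly netcraft grabber linkwalker egothor irlbot psbot heritrix tmcrawler
libwww jakarta httpclient java/1 wget/
""".split()

# Any user-agent shorter than this is treated as a crawler.
MIN_USER_AGENT_LENGTH = 15


def is_crawler(user_agent):
    """Return True if the user-agent string looks like a crawler or bot."""
    # Assumption: a missing or empty user-agent is treated like a short one.
    ua = (user_agent or "").strip().lower()
    if len(ua) < MIN_USER_AGENT_LENGTH:
        return True
    return any(token in ua for token in CRAWLER_SUBSTRINGS)


def crawler_share(user_agents):
    """Return the fraction of requests (0.0 to 1.0) that came from crawlers."""
    agents = list(user_agents)
    if not agents:
        return 0.0
    return sum(1 for ua in agents if is_crawler(ua)) / len(agents)
```

Plain substring matching is crude on purpose: it is cheap to run on every request and easy to extend when a new crawler shows up, at the cost of the occasional false positive.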
Now, we also have a list of feed readers, which are crawlers too, but they are working on behalf of a real person (mostly). In those cases, we treat them a bit differently because we want to grab the number of subscribers to that feed.
We don't add feed subscribers to our UU count, but we are still interested in knowing how many people subscribe to each Sampa site feed. The reason we don't count them is that subscribing to a feed doesn't mean the person actually saw it. Bloglines might say that I have 25 subscribers, but maybe only a handful really read what I write (in Bloglines), and the only way to detect that is by adding a tracking GIF to each blog post, which is something we are not planning on doing for now.
The list of strings to identify a feed reader is:
bloglines
yahoofeedseeker
newsgator
feedster
feedfetcher-google
netvibes
pubsub
sharpreader
rssbandit
feedbite
zhuaxia
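Here is a rough sketch, again in Python, of how the feed reader check and the subscriber count could work. The names `FEED_READER_SUBSTRINGS`, `is_feed_reader`, and `subscriber_count` are hypothetical. I'm also assuming that the reader reports its count inside the user-agent as "N subscribers", which is how some hosted aggregators do it; readers that don't follow that pattern simply return no count.

```python
import re

# Substrings from the feed reader list above, matched case-insensitively.
FEED_READER_SUBSTRINGS = (
    "bloglines", "yahoofeedseeker", "newsgator", "feedster",
    "feedfetcher-google", "netvibes", "pubsub", "sharpreader",
    "rssbandit", "feedbite", "zhuaxia",
)

# Assumed format: the count appears in the user-agent as "<N> subscriber(s)".
SUBSCRIBERS_RE = re.compile(r"(\d+)\s+subscribers?", re.IGNORECASE)


def is_feed_reader(user_agent):
    """Return True if the user-agent matches one of the known feed readers."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in FEED_READER_SUBSTRINGS)


def subscriber_count(user_agent):
    """Return the subscriber count a feed reader reports, or None."""
    if not is_feed_reader(user_agent):
        return None
    match = SUBSCRIBERS_RE.search(user_agent)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    ua = "Bloglines/3.1 (http://www.bloglines.com; 25 subscribers)"
    print(is_feed_reader(ua), subscriber_count(ua))
```

Note that some of these names also match entries in the crawler list ("feedfetcher-google" contains "fetch", "sharpreader" contains "reader"), so a request would need to be checked against the feed reader list first, before it is discarded as a plain crawler.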
One of the biggest problems with feed reader detection is that the new IE 7 and Outlook 2007 use the regular IE 7 user-agent, which makes it impossible to distinguish a user who subscribes to a feed from a user who just clicks on the feed link.
I hope this helps your startup. If you know of other crawlers, bots, or feed readers that I'm missing, please let me know.