The blog of Tobin

Tobin's nerd blog on .NET, Software, Tech and Nice Shiny Gadgets.

Wednesday, October 12, 2005

The Geek Bachelor Fridge


Crikey, I just looked at the fridge and realised I'm living in Bachelor World. Shouldn't two grown men have more than this? Is it the coding that keeps us stuck to the keyboard and away from the supermarket? By the way, all 8 items in this fridge are mine except for the beer, which I'm about to steal from my lodger - The Bear. Sorry John, but I think it's time you got your lazy ass down to Morrisons anyway...

Sunday, October 09, 2005

Open Source Web Indexing with .NET

For the last 12 months I've been working on a web indexing solution. It's for a client who is building a database of news articles and offering news alert and clipping services. We're leaning on open source software wherever possible in our application, and it occurred to me just how fantastic it is to have such an active and varied collection of tools at our disposal.

For anyone who's interested in indexing, here's the lowdown on the technologies we're using.

1. MySQL Database. This is great, we've got it tuned to cope with millions of articles on a single modest server. The .NET db connector library is quite mature too.

2. Lucene Search Engine. This gives us our keyword searching capabilities. Again, we have an index of over a million articles and searches complete within 0.5 seconds. DotLucene certainly rocks!

3. The SGML Library from GotDotNet. This allows us to convert HTML pages into an XML tree, which is then used for information extraction. We have some nifty algorithms for extracting only the body of a news article without the surrounding menus and link bars, and our application doesn't need to have site-specific rules for finding the article.
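To give a flavour of what "no site-specific rules" can look like, here's a rough sketch of one common heuristic: score each container by how much plain text it holds versus how much link text, and keep the winner. This is not our actual algorithm, and it's in Python with the standard `html.parser` module rather than .NET and the SGML library, purely to keep it short. The names `BodyFinder`, `extract_article` and `BLOCK_TAGS` are made up for the example, and the scoring rule (plain characters minus link characters) is an assumption.

```python
from html.parser import HTMLParser

# Assumption: these containers are the candidates for "the article body".
BLOCK_TAGS = {"div", "td", "article", "section", "body"}
# Text inside these tags is never article content.
SKIP_TAGS = {"script", "style"}


class BodyFinder(HTMLParser):
    """Collects text per candidate block, split into plain and link text."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open candidate blocks, innermost last
        self.blocks = []     # every candidate block seen so far
        self.link_depth = 0  # > 0 while inside an <a> element
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            block = {"tag": tag, "text": [], "plain": 0, "linked": 0}
            self.stack.append(block)
            self.blocks.append(block)
        elif tag == "a":
            self.link_depth += 1
        elif tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            # Close the nearest open block with this tag; tolerate sloppy HTML.
            for i in range(len(self.stack) - 1, -1, -1):
                if self.stack[i]["tag"] == tag:
                    del self.stack[i:]
                    break
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1
        elif tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text or self.skip_depth or not self.stack:
            return
        # Text is credited only to the innermost open block, so outer
        # wrappers do not win just by containing everything.
        block = self.stack[-1]
        block["text"].append(text)
        if self.link_depth:
            block["linked"] += len(text)
        else:
            block["plain"] += len(text)


def extract_article(html):
    """Return the text of the block with the most plain, non-link text."""
    finder = BodyFinder()
    finder.feed(html)
    finder.close()
    if not finder.blocks:
        return ""
    best = max(finder.blocks, key=lambda b: b["plain"] - b["linked"])
    return " ".join(best["text"])
```

Menus and link bars are nearly all link text, so they score at or below zero, while the story block with its paragraphs floats to the top. A real extractor would need a lot more than this (nested layouts, comment sections, pages split across tables), but the principle holds without knowing anything about the particular site.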

We're also using other open source components for things such as FTP, emailing, PDF reports etc. And we've written our web portal in Ruby on Rails.

All this lot runs on a single modestly specified box, whose CPU is humming along at about 50%. Once we start to hit capacity, we'll be looking at the open source Grid Computing offerings to help us scale out.

Anyone else doing web indexing in .NET? Got any tips or tools to share?