I started working with Swish-E again recently. This is an open source search engine that, for my money, is one of the best deals in the open source world.
A few years ago, I spent some time working with Inktomi Enterprise Search (now Verity Ultraseek), but after a few days getting re-acclimated to Swish-E, I’m convinced the latter does everything the former does and more.
When spidering, you can use a regular expression to extract part of the URL and store that as a META tag in the index so you can query on it and return it. So you can grab, say, the top-level folder name and store that as a “virtual” META tag called “Section.”
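To make the idea concrete, here is a minimal Python sketch of pulling the top-level folder out of a URL. It is not Swish-E's own code or configuration syntax; the function name `extract_section`, the pattern, and the `"root"` fallback are all made up for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern: capture the first folder in the URL path.
SECTION_PATTERN = re.compile(r"^/([^/]+)/")


def extract_section(url, default="root"):
    """Return the top-level folder of a URL, or `default` if there is none."""
    path = urlparse(url).path
    match = SECTION_PATTERN.match(path)
    return match.group(1) if match else default


if __name__ == "__main__":
    # The extracted value is what would be stored as the "Section" META tag.
    print(extract_section("http://example.com/products/widgets/index.html"))
```

A page sitting directly under the site root has no folder to capture, so the sketch falls back to the default value instead of failing.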
Using special HTML comments, you can exclude parts of a Web page from being indexed, so you can make sure that common text in headers, footers, and sidebars doesn’t get included (what would be the point, when the text is on every page?).
This is such a powerful feature, and something that my $14,000 Inktomi install was completely incapable of. Using this, I can index only the unique content of each page, which is really all I want people to search anyway.
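Here is a rough Python sketch of what comment-based exclusion amounts to. The marker names `<!-- noindex -->` and `<!-- index -->` are an assumption for illustration (check the Swish-E documentation for the exact comment syntax it recognizes), and `strip_unindexed` is a hypothetical helper, not part of Swish-E.

```python
import re

# Assumed markers: everything from "<!-- noindex -->" up to the next
# "<!-- index -->" is treated as boilerplate and dropped.
NOINDEX_BLOCK = re.compile(
    r"<!--\s*noindex\s*-->.*?<!--\s*index\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def strip_unindexed(html):
    """Return the page with every noindex...index region removed."""
    return NOINDEX_BLOCK.sub("", html)


if __name__ == "__main__":
    page = (
        "<!-- noindex --><div>Site menu</div><!-- index -->"
        "<p>Unique article text</p>"
        "<!-- noindex --><div>Footer</div><!-- index -->"
    )
    print(strip_unindexed(page))
```

Wrap the header, footer, and sidebar includes in the markers once in your page templates, and every page on the site gets the treatment for free.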
You can set thresholds for common terms. If a term appears on, say, more than 80% of pages, then you can toss it out of the index since it’s effectively useless as a search term.
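The logic behind that threshold can be sketched in a few lines of Python. This only illustrates the idea and is not how Swish-E implements it; `prune_common_terms` and its `max_fraction` parameter are hypothetical names.

```python
from collections import Counter


def prune_common_terms(docs, max_fraction=0.8):
    """Return the set of terms found in more than `max_fraction` of documents.

    `docs` maps a document id to the list of terms found in that document.
    """
    doc_freq = Counter()
    for terms in docs.values():
        # Count each term once per document, however often it repeats.
        doc_freq.update(set(terms))
    limit = max_fraction * len(docs)
    return {term for term, count in doc_freq.items() if count > limit}


if __name__ == "__main__":
    docs = {
        1: ["acme", "widget"],
        2: ["acme", "gadget"],
        3: ["acme", "widget"],
        4: ["acme", "search"],
        5: ["acme", "index"],
    }
    # "acme" shows up in every document, so it is the one term to drop.
    print(prune_common_terms(docs))
```

The count is per document, not per occurrence: a term used fifty times on one page is still a fine search term, while a company name that shows up on every page is not.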
Swish-E (it stands for “Simple Web Indexing System for Humans – Enhanced”) is very close to the metal: it runs off of Apache-ish configuration files, and all output is from the command line. But there are Perl scripts, PHP classes, Java classes, and even a COM object to abstract you away from the guts.
Swish-E is an example of everything good about open-source software. Understanding the power this gives the everyday guy for no expense makes me want to smack people who say the GPL is un-American.