Our Own Google – David Sterry

The other day I began to think about search engines and their innately commercial nature. I thought, why is it that the Internet, which exists as a large network of networks, relies on commercial organizations to index the web and serve search results? Shouldn’t the Internet index itself and provide search as a basic utility or protocol? Given the success of the open source development model, there must be great potential to create a community-owned-and-operated search engine.

What would be the benefits of such a system? Ideally it would collect only aggregate, anonymous logs (for example, by hashing IP addresses with random session keys), and those logs would be used for the sole purpose of improving search. The servers providing search would be donated through a distributed effort similar to SETI@home, except that participation would mostly be limited to local ISPs, which have the server resources to serve search results and maintain replicated indices.
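
To make the anonymization idea a little more concrete, here is a minimal sketch in Python of what hashing IP addresses with random session keys might look like. It is only one way to read the idea, not a finished design: the function names `new_session_key` and `anonymize_ip` are hypothetical, and the choice of HMAC-SHA256 with a 32-byte key is my own assumption.

```python
import hashlib
import hmac
import secrets


def new_session_key() -> bytes:
    """Generate a random key meant to be discarded when the session ends."""
    # Assumption: 32 random bytes is a reasonable key size for HMAC-SHA256.
    return secrets.token_bytes(32)


def anonymize_ip(ip: str, session_key: bytes) -> str:
    """Return a keyed hash of the IP address instead of the address itself."""
    # A plain hash of an IPv4 address could be reversed by trying all
    # ~4 billion addresses; keying the hash with a secret, short-lived
    # key is what prevents that once the key has been thrown away.
    return hmac.new(session_key, ip.encode("utf-8"), hashlib.sha256).hexdigest()


# Two sessions, each with its own random key.
key_a = new_session_key()
key_b = new_session_key()

# 203.0.113.7 is a reserved documentation address, used here as a stand-in.
token_1 = anonymize_ip("203.0.113.7", key_a)
token_2 = anonymize_ip("203.0.113.7", key_a)
token_3 = anonymize_ip("203.0.113.7", key_b)
```

Within one session the same address always maps to the same token, so queries can still be grouped to improve search. Across sessions the tokens differ, so the logs cannot be stitched into a long-term profile of any one visitor.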

After a bit of thought on this concept, I did some research. Turns out there are a couple of projects over at the Apache Software Foundation by the names of Lucene and Nutch. These are free software projects with the goal of developing world-class indexing and search software. I’m not sure if they’re trying to build a distributed web search infrastructure, but their work is certainly important to the idea I’ve been thinking about. One thing I learned from their FAQ is that indexing the web is not the most bandwidth-intensive task a company like Google or Yahoo has to deal with. It’s actually serving the search results. That makes sense when you compare how many times people search for a specific site with how often that site actually changes.

One other thought before I call an end to this post. Free software is generally good at recreating, or making small improvements to, commodity proprietary software whose features have been stable for some time. Since search engines are being developed rapidly in the commercial sector, it is difficult for free software to beat leaders like Google and Yahoo at their own game. It may be necessary to start the search engine I’m talking about by indexing sites that would bring in little or no ad revenue. Think scientific research and academia.