Links - PHP Search Engines

Updated Oct 2 2008: The Mission

There is a decided dearth of good PHP open source search engine solutions, but Sphider seems to be the leading candidate these days - "a lightweight web spider and search engine written in PHP, using MySQL as its back end database".

Its bigger brother is the GPL derivative Sphider-Plus. I worked with it a bit on localhost and it's quite good in the role of a big multi-site search engine - a worthy replacement for the also very good Perl-based Fluid Dynamic Search Engine.

A good general description of search engine features is available at searchengineshowdown.com.


PHP and the Text Parsing Mission

Oct 2 2008

Search engines crawl over targeted web sites and extract the keywords to build an index of which words appear on which page. The extraction process involves parsing text and trying to make sense of it in a limited way. It is a complex process, and doing it across many pages is computationally intensive.
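The "which words appear on which page" idea can be sketched in a few lines. This is only an illustration under stated assumptions, not how Sphider or any other engine mentioned here actually does it: it is written in Python for brevity, the sample pages and the names tokenize, build_index and search are all hypothetical, and a real engine would also fetch the pages, strip HTML, drop common words and rank the results.

```python
import re
from collections import defaultdict

# Hypothetical sample pages standing in for crawled content (URL -> page text).
PAGES = {
    "http://example.com/a": "PHP search engines index keywords",
    "http://example.com/b": "Perl search scripts parse text",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page URLs it appears on."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the sorted URLs of pages containing every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    # Copy the first word's page set so the index itself is never modified.
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)


INDEX = build_index(PAGES)
```

Calling search(INDEX, "PHP search") looks up each word in the index and intersects the page sets, so only pages containing both words come back. Building the index is the expensive step; once it exists, a lookup is cheap.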

Most search engines are implemented in fast, efficient languages such as C, C++ or Perl. These languages are speedy, but the speed comes at a price: the complexity of developing or customizing the application, of installing and configuring it on a server, or both. C/C++ applications generally need to be compiled on the server, and the security implications of a Perl script are beyond most people, even those who think of themselves as "computer literate".

For a corporation with a skilled staff and a dedicated server, these issues are not a problem; it has the time and money to make the application work. For a small personal site running in a shared hosting environment, these issues are usually a knock-out. Generally, the owners of such sites must accept whatever search capabilities their CMS provides; sophisticated multi-site indexing using a high-level search engine is beyond their budget in terms of time and effort.

Enter PHP. It is easy to install and configure in a shared environment and fairly easy to customize if necessary. However, indexing is still a very resource-intensive activity - it will run very slowly on the server, suck down huge chunks of network capacity and possibly provoke "CPU exceeded" messages from the server operations staff. There is a solution, however.

Enter Sphider-Plus. See the short article about Sphider and Sphider-Plus.

More on this subject later ...