
Implementing Site Searching on GeodSoft.com - 6/1/00

Despite the fact that Index Server was working months before I got serious about the GeodSoft.com web site, I could never get it configured to work with the new virtual site. Swish-e was very easy to set up on Linux and provides fully functional, though somewhat slow, searching.

Update 6/2/00

Previously I've made references to problems with Microsoft's Index Server. The server selection page described problems getting Index Server to work on a virtual site, and More NT Quirks subsequently described Index Server's failure to continue indexing what it had been indexing for months. Almost a month later the situation hasn't changed. Index Server now returns only pages in the virtual site, but the search form is set up in my default site, so all the relative URLs are wrong: they point into the virtual site and don't exist in the default site. The result is that no searching works on the NT server.

Since I'd already spent more time than I thought was reasonable on Index Server, I decided to stop banging my head against the wall and see what new things I could learn. The Apache FAQ page quickly led me to two "Open source search engines that are often used with Apache." These were Swish-e and DIG. It also linked to a page of Web Site Search Tools with dozens of links to a wide range of tools, from simple free open source products to incredibly expensive commercial products. The tools were categorized by development environment: Perl, Java, etc.

I spent a while looking at the descriptions of a variety of tools, focusing mostly on Perl and Java products. The descriptions of Swish-e and DIG were complete enough that nothing in the other products' descriptions suggested they would be superior to either one, so I decided to focus on the two products specifically mentioned in the Apache FAQ. While it's quite possible that superior products, even free ones, are available, it's not clear how I could find them without a lot of investigative work, which probably means installing and testing the products. The most widely used products are usually solid products.

Between the two products, the feature that caused me to select Swish-e over DIG was Swish-e's ability to index a local file system. Both products include spidering capabilities, that is, the ability to index multiple sites via HTTP. While I plan to have multiple sites, they will have duplicate content, so I don't want them indexed together; that would yield duplicate search results. It also seems to me that if you don't need to index multiple sites, file system based indexing is likely to be much more efficient than HTTP based indexing.

I began reading and printing Swish-e documentation and, after going through the installation and readme docs, downloaded the product as a tar.gz file. Uncompress on the Red Hat Linux system would not create a renamed uncompressed file, but zcat piped to tar expanded the archive successfully. I had to find the location of gcc and change one line in the Makefile. Swish-e compiled with a few warning messages the first time and successfully completed the test indexing job described in the install documentation.
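
From memory, the expansion and build steps looked roughly like this; the archive name and Makefile details are reconstructions, not a captured session:

    zcat swish-e.tar.gz | tar xvf -   # uncompress alone wouldn't expand it
    cd swish-e/src
    vi Makefile                       # point the CC line at gcc's actual location
    make                              # compiled with a few warnings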

It was very quickly clear that I needed a better understanding of the Swish-e config file and command line options before I could make practical use of it. I skimmed through the documentation, printing it as I went. I also reviewed the front end tools listed on the Swish-e site for making Swish-e available via the web. I downloaded three different products, but the HTTP downloads of tar.gz files that first went to my NT workstation were corrupted, probably because NT saved them as text files. (The NT workstation is still the only machine I have with an Internet connection. Getting a DSL line is starting to look like a story in itself.)

After about an hour of reading documentation and changing options in the configuration file, I was ready to try Swish-e from the command line again. The first attempt resulted in a number of config file syntax errors, which were identified by line number. Most were spider options that the documentation clearly said had to be commented out but that I had missed. After commenting out these problem lines, Swish-e successfully indexed my site and the Apache documentation on the second attempt.
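
The working config came down to a few directives plus the commented-out spider lines. This is a sketch based on the documentation; the paths and the particular spider directives shown are my assumptions:

    # swish-e.conf -- index the local file system, not HTTP
    # Site document root and the Apache docs (paths are assumptions):
    IndexDir /usr/local/apache/htdocs
    IndexFile ./site.idx
    # Spider (HTTP) directives that have to be commented out
    # for file system indexing:
    # MaxDepth 5
    # Delay 60

The index itself is then built by running swish-e -c swish-e.conf.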

Printed to standard out was the list of words that had been excluded from the index because they were too common. I'd set the configuration to exclude any word that appeared in more than 50% of the files and in more than 50 files. I picked the low absolute number because my site was small; the 1000 file default would have meant there were no stop words. Looking at the excluded words, which included all the words in my standard page headers and footers, I knew there were words I would want to search on, even if they were in every page on the site.

I changed the config file so there would be no stop words (excluded words). Thinking about searches I'd done in the past, I knew how frustrated I'd been at not being able to search for a phrase because it included stop words. I could not see any drawbacks to including common words in the indexes except index size and possibly performance. Obviously if you put common words in your search criteria you're going to get large result sets. The specific words that pushed me towards indexing everything were "privacy" and "policy", which were common due to their being part of the standard page navigation.
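
In Swish-e the automatic stop words are controlled by the IgnoreLimit directive, so the change was just this (using the 50/50 values described above):

    # Original setting: exclude words found in more than 50% of all
    # files AND in more than 50 files:
    # IgnoreLimit 50 50
    # With the directive commented out, no words are excluded.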

After building the indexes again with no stop words, I tried some simple searches starting with "privacy policy". As expected, the results included every GeodSoft web page, but the "Privacy Policy" page was first with a relevance ranking of 1000, with other pages having much lower rankings going as low as 33. This and subsequent searches made it clear that Swish-e uses the frequency of the search words relative to the total number of words in a page as part of the relevance ranking. This is what I expect and want from a search engine.
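
From the command line the tests looked something like this; the index path and the result lines shown are illustrative, not captured output:

    swish-e -f ./site.idx -w "privacy policy"
    # Result lines are rank, file, title and size, e.g. (illustrative):
    # 1000 /usr/local/apache/htdocs/privacy.htm "GeodSoft Privacy Policy" 6142
    #   33 /usr/local/apache/htdocs/index.htm "GeodSoft Home" 8420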

With it clear that I had functional indexes, I turned my attention to search front ends for the web. That was when I discovered the corrupt tar.gz files, leaving me only with search.pl by Steve van der Burg. I put it in my cgi-bin directory and tried it. I got a not authorized error, which went away as soon as I used chmod to make the file executable. Then I got a server error message which suggested looking in the error log. That revealed incomplete HTTP headers. I've mentioned this before, but one thing I really do like about IIS is that it displays the script output as text if it's not valid HTTP output. This may be a security weakness and should be controlled by a configuration option, but it's really handy when you're debugging CGI scripts.
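
The incomplete headers message is the classic symptom of a CGI script that exits before printing valid headers. A minimal valid Perl CGI response, for comparison, is just:

    #!/usr/bin/perl
    # Headers first, then a blank line, then the body; a script that dies
    # before this point produces the incomplete headers error.
    print "Content-type: text/html\n\n";
    print "<html><body>CGI is working</body></html>\n";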

I ran search.pl from the command line and got a program not found error, which I correctly surmised meant that the first line was pointing to the wrong location for the Perl executable. After fixing this, I ran the script from a browser again and got a search entry form. I tried a search that I knew should have results and got none. It only took about five minutes of looking at the script to find three configuration variables pointing to the locations of the swish-e executable, the configuration file and the index file. As soon as these were fixed, the next search gave a results page with 20 hits and links to four more results pages, with a "next" link as well.
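
All the fixes were near the top of the script, something on this order; the variable names are my illustration, not search.pl's actual names:

    #!/usr/bin/perl
    # The shebang line above had to point at Perl's real location first.
    # Paths below are assumptions:
    my $swish  = '/usr/local/bin/swish-e';           # swish-e executable
    my $config = '/usr/local/swish-e/swish-e.conf';  # configuration file
    my $index  = '/usr/local/swish-e/site.idx';      # index built earlier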

In less than three hours total time expended, I had functioning full text searching using products I'd never used before on a platform on which I'd not had full text searching. The three hours includes all the time I spent looking at documentation for competing products, the download and install, reading the Swish-e documentation, and getting a working front end. Among the options I know how to control are which directory trees to index and, within those, which file extensions to index and which to index by file name only, not contents. This last is for graphics files, if I want them indexed. I know where Swish-e keeps its default list of stop words, in swish.h, and how to control automatic stop words if I decide I want stop words based on actual indexed content. The biggest searching capability that I have not found is searching for exact multi-word phrases. I don't know whether Swish-e can't do this or I just haven't found it.
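
The extension control comes from two config directives; the extension lists here are examples rather than my production settings:

    # Only index these extensions at all:
    IndexOnly .htm .html .txt .pl
    # Index file names but not contents (e.g. graphics):
    NoContents .gif .jpg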

I also know that from this point forward, I only need to gain a better understanding of Swish-e's capabilities to extend what I do with it. On the front end, there is only search.pl, which is less than 400 lines in a standard language that I understand well. By modifying a Perl script I'll be able to make the results pages look like the rest of my site and control how many hits appear per page. Options on the existing form suggest that controlling the scope of the search will be straightforward. I'll add updates to this page as I go.

It was about two years ago that ATLA set up Index Server on its web site. I delegated this to an assistant, who took a few days to get it to work. The version of Index Server that works with IIS 3 does not provide control over the files that are indexed; it indexes everything under the directory trees you have it index. The scripting language that comes with Index Server does allow the file types displayed in results to be controlled via the forms and scripts. It took me at least another two days, after my assistant gave up, to gain control of output file types and integrate the output with our standard page appearance.

The scripting for Index Server is not particularly difficult but it's totally proprietary. In fact it's specific to Index Server and contained in two different file types. The scripts that control the execution of Index Server are .idq files. The results from .idq files are output to .htx files. The .htx files are conceptually similar to ASP and ColdFusion files: HTML pages with embedded, Index Server specific tags. So to control Index Server you need to learn two sets of syntax and a list of proprietary variables that are passed between .idq and .htx files. None of this is well documented, so there is a lot of trial and error in getting these things to work. The output from .htx files is standard HTML, so any HTML form can be used to initiate or re-invoke a search.
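
From memory, a minimal .idq/.htx pair looks something like the following; the parameter and tag names are as I recall them from the documentation and may not be exact:

    # search.idq -- controls the query
    [Query]
    CiColumns=vpath,filename,size,rank
    CiRestriction=%CiRestriction%
    CiMaxRecordsPerPage=20
    CiScope=/
    CiTemplate=/search/search.htx

    <!-- search.htx -- formats the results with Index Server tags -->
    <%begindetail%>
      <a href="<%vpath%>"><%filename%></a> (rank <%rank%>)<br>
    <%enddetail%>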

I've seen enough to be reasonably sure that Swish-e is not as powerful as Index Server. There's surely nothing like Index Server's tight integration with the OS, so that users automatically see only results they have rights to see and retrieve. On the other hand, Swish-e is immeasurably easier to set up and gain meaningful control over, at least for an IT professional with an extensive development background. Perhaps a non-technical user could get Index Server "to work", but no one without a solid programming background will ever tightly integrate it with an existing web site and give users meaningful control options specific to the site. I can't imagine Swish-e exhibiting Index Server's totally bizarre and unpredictable behavior.

Overall, at this point I'd give a modest lead to Swish-e over Index Server for public web sites, but recognize that some sites will need capabilities they can find in Index Server but not Swish-e. I've concluded that all really sophisticated web sites that need granular security will have to build application level security to control access to resources within individual scripts. If there is a practical way to build centralized security functions that can be called from Apache's authorization modules, from standalone CGI scripts, and from search scripts to control results lists, then open source systems will have a better way of doing something that has been one of NT's strengths.

Update 6/2/00

As I expected, tying the CGI front end into GeodSoft's site design was simply a matter of Perl programming. By late yesterday (6/01) I had search.pl integrated with the site. Essentially all that was required was to add several standard lines to determine the absolute path to the site root directory and require my function libraries. Then I could call the standard page top and bottom functions. To produce valid HTML output I had to perform a couple of minor substitutions on the standard content because CGI.pm does things in a slightly incompatible manner.
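
Stripped to its essentials, the integration looked like this sketch; the library path and page function names are hypothetical stand-ins for my site library, not code from search.pl:

    # Added near the top of search.pl (names and paths are hypothetical):
    my $siteroot = "/usr/local/apache/htdocs";   # assumed document root
    require "$siteroot/lib/sitelib.pl";          # shared page functions

    print "Content-type: text/html\n\n";
    print page_top("Search GeodSoft.com");       # standard site header
    # ... result output generated by search.pl's existing code ...
    print page_bottom();                         # standard site footer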

Search.pl was already set up to limit which areas of the site are searched. All I had to do was replace the sample data structures with real relative paths and meaningful descriptions. Somewhat more difficult was suppressing the output I didn't want. Since search.pl was designed as a fully functional standalone script, it duplicated options that I have in the standard search form in the left column. I wanted to suppress all the options except refine search, so that users would return to the standard form to start a new search, limit the area of the site to be searched, or change the number of hits per page.
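
The data structure is on the order of relative path and description pairs, something like this (the values are illustrative):

    # Areas of the site a search can be restricted to (hypothetical values):
    my %areas = (
        'howto/'   => 'How-To Articles',
        'opinion/' => 'Opinion',
        'about/'   => 'About GeodSoft',
    );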

This last item, changing the number of hits per page, was the trickiest part. Search.pl had a variable that set the number of hits to 20; this could easily be changed in the source code, but I wanted the user to be able to change it. While I was able to get the first page to display correctly pretty quickly, subsequent invocations of search.pl reverted to the hard coded value. Later I got all but the last page to work. I had to come up with logic that would determine whether this was the first invocation and calculate the page size. The calculated size had to be passed to all subsequent invocations, which required finding every place the script generated a URL embedded in an output form, i.e. all the "Prev 1 2 3 . . . Next" links, and changing them. In the end this still took much less time than it has taken me in the past to tie Index Server into a site.
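
The idea, reduced to a sketch (this is not search.pl's actual code), is to read the page size once and then thread it through every generated link:

    #!/usr/bin/perl
    use strict;
    use CGI;

    my $q = CGI->new;
    # First invocation: take the size from the form or fall back to the
    # old default of 20; later invocations get it back from the links.
    my $hits_per_page = $q->param('pagesize') || 20;
    my $query = $q->param('query') || '';
    my $start = $q->param('start') || 0;

    # Every "Prev 1 2 3 . . . Next" link must carry pagesize forward:
    my $next_url = sprintf('search.pl?query=%s&start=%d&pagesize=%d',
                           CGI::escape($query), $start + $hits_per_page,
                           $hits_per_page);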

The biggest disappointments with Swish-e and search.pl are the time delay overhead and the lack of phrase searching. Every search imposes about a 6 second delay, which is a huge CGI overhead. Static pages, and even search.pl without search terms, return in significantly less than a second on my 100Mbps LAN, but every search with any search words takes about 6 seconds to return. This is on a tiny site with just over 100 pages, and it's pushing the limit of acceptability; I doubt the delay would be acceptable on a much larger site. I think this is specific to search.pl and not Swish-e, because I did some of the sample Swish-e searches with huge result sets over my modem. While they took longer, it wasn't proportionally longer, and most of the time was clearly page size, that is, download time over a modem.

I still can't find anything to suggest that Swish-e provides any phrase or proximity search capabilities. Both are very important for real text searching of large sites or text databases. These and the time limitations will probably push me to look at alternative tools as time permits. For now I have fully functional site searching that's OK for my little site.

I decided to go ahead and set up searching on the OpenBSD system to see how long it took. It only took a few minutes to ftp the files over, edit the Swish-e config file to account for the different directory locations on BSD versus Linux, and generate the index for the site. Then I FTP'd over search.pl, the newer Perl library files, and the search form definition file.

I got a server error when I tried running search.pl. I tried one of the simple CGI test scripts that I use and got the same error. I'd forgotten that I didn't yet have CGI working on the BSD system. I changed file permissions, but that did not fix the problem. Then I looked at the error_log, which immediately identified the problem. As soon as I changed the Apache Options configuration directive for the cgi-bin directory from None to ExecCGI and restarted Apache, the test script worked. Search.pl got another error, but again the answer was in the error_log: GDBM_File, which search.pl uses, was not installed with the version of Perl that I have.
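
The Apache fix was one directive in the cgi-bin Directory block of httpd.conf, followed by a restart; the directory path is an assumption about the OpenBSD layout:

    <Directory "/var/www/cgi-bin">
        # was: Options None
        Options ExecCGI
    </Directory>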

I'm going to continue my discussion of what I encountered trying to add GDBM_File to Perl on OpenBSD on another page. While the problems I encountered are specific to the GDBM_File module of Perl on OpenBSD, they are symptomatic of the type of problems that users of open source systems encounter with some regularity.

