scrounge.org

Installing and Configuring the ht://Dig Search Engine

ht://Dig is an excellent search engine to install on your web server. Try it out! See the Features and Requirements page for more information. Check the ht://Dig home page for the latest news and updates. I'm going to cover some additional installation and configuration hints.

Getting it going

Tips and Techniques

Please report any errors or omissions to me. Suggestions are welcome too. Thank you.


Getting it going

Quick Start (for the intrepid)

If you are using Red Hat or Mandrake Linux and you are reasonably familiar with using Apache, you can probably get by with these Quick Start instructions. Otherwise, use the complete instructions.

Note that the RPM installer creates a cron job in /etc/cron.daily that runs /usr/sbin/rundig once a day, so the search index will be updated automatically.

But you still should look over the rest of this documentation.

 

Installation (Long form)

Before you start, you should look over the Features and Requirements page. ht://Dig is available in source "tarball" and Red Hat style RPM distributions. The RPM distribution is much easier to install, but the tarball gives you more flexibility in specifying the locations where everything will be installed. Your choice. This document covers installing both the htdig-3.1.5.tar.gz "tarball" and the RPM file. The Where to get it page is the best place to get the most recent version of ht://Dig.

Installing the RPM

Mandrake 7.2 has ht://Dig on the install CD, so it might already be installed on your system. Red Hat 7.0 has it on the "Power Tools" CD. You can get other RPM distributions from here. (Or from here.) Download one of these:

Put it somewhere on your Linux machine and (as root) type rpm -Uvh htdig*.rpm. Bang, it's installed. Now skip to Where everything is.

* There is a bug with vixie-cron for Red Hat 5.0 and 5.1. The ht://Dig team recommends upgrading to a newer version of vixie-cron. Look for vixie-cron-3.0.1-37.5.2.i386.rpm. This affects you because the RPM installer installs rundig as an /etc/cron.daily job. Get the updated vixie-cron from here.

** If you are using Red Hat 7.0 and don't have the Power Tools CD, then you can use htdig-3.1.5-0glibc21.i386.rpm, but it needs some additional work to get it going. You must first install compat-libstdc++-6.2-2.9.0.9.i386.rpm from the first Red Hat 7.0 install CD. The default HTML directory in previous versions of Red Hat was /home/httpd/html; it is now /var/www/html. htdig-3.1.5-0glibc21.i386.rpm installs several things in /home/httpd/html, and these need to be moved to /var/www/html.

Move search.html and the htdig directory to /var/www/html. You must also move /home/httpd/cgi-bin/htsearch to /var/www/cgi-bin/htsearch. The 'local_urls' variable in /etc/htdig/htdig.conf needs to be modified because it refers to /home/httpd/html.
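The local_urls edit can be done by hand in an editor, or with a sed one-liner. A sketch, shown here against a sample local_urls line rather than the real file; on a real system you would run the same expression over /etc/htdig/htdig.conf after making a backup:

```shell
# Rewrite the old document root wherever it appears.
# (Sample input line; substitute the real htdig.conf on your system.)
echo 'local_urls: http://192.168.1.1/=/home/httpd/html/' \
  | sed 's|/home/httpd/html|/var/www/html|g'
# → local_urls: http://192.168.1.1/=/var/www/html/
```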

Installing the tarball

For the tarball, you must decide where you want ht://Dig to install its programs before you install it, because you can't move it after it is installed. (Except by deleting the entire installation and re-installing from scratch.) The default is to install in the /opt/www directory. The assorted ht://Dig binaries and configuration files will be located in this directory tree, and you must configure your Web server to execute the ht://Dig CGI programs from here. If this is not acceptable, then change these locations during the installation procedure.

OK, now follow the ht://Dig installation instructions. (You probably should open them in a new window so that you can refer to this page.) When you get to the Configure step, you have the opportunity to edit the CONFIGURE script that defines where everything will get installed. If you want to go with the default location, then just continue on through the procedure.

Configuring Apache (tarball only)

The RPM installation should need no Apache configuration changes, assuming that your installation uses the standard locations, because everything goes in "standard" places.

Assuming that you installed ht:/Dig in the default /opt/www directory, here are the configuration changes that you should add to your Apache configuration file(s).

Alias /htdig/ /opt/www/htdocs/htdig/

so that you can "point" to assorted graphic files, e.g., <img src="/htdig/htdig.gif">. The default search.html file is also located here.

It is a real good idea to keep the /htdig/ definition, because the template files that are used to display the search results all refer to htdig/ to locate files.

ScriptAlias /htdig-cgi/ /opt/www/cgi-bin/

is how you access the htsearch program for searching, e.g., <form method="post" action="/htdig-cgi/htsearch">. And

<Directory /opt/www/cgi-bin/>
AllowOverride None
Options ExecCGI
</Directory>

so that Apache will allow access to the ht://Dig cgi-bin directory.

After editing your Apache configuration files, type /etc/rc.d/init.d/httpd restart to restart Apache.

Where everything is

Name             RPM locations           Tarball (default) locations  Used for
${CONFIG_DIR}    /etc/htdig              /opt/www/htdig/conf          htdig.conf configuration file
${COMMON_DIR}    /var/lib/htdig/common   /opt/www/htdig/common        Template files used for search results
${BIN_DIR}       /usr/sbin               /opt/www/htdig/bin           rundig and other "digging" binaries
${DATABASE_DIR}  /var/lib/htdig/db       /opt/www/htdig/db            The search index database files
${CGIBIN_DIR}    /home/httpd/cgi-bin     /opt/www/cgi-bin             htsearch
${IMAGE_DIR}     /home/httpd/html/htdig  /opt/www/htdocs/htdig        htdig.gif and other graphic files
${SEARCH_DIR}    /home/httpd/html        /opt/www/htdocs/htdig        search.html sample search form

Configuring the htdig.conf file

Important note for RPM users: The RPM installation program attempts to configure ht://Dig so that it will work "out of the box." It installs the various files in "standard" Red Hat locations. One thing that is never standard, however, is the name of your machine. The ht://Dig RPM installer attempts to glean this information from your existing configuration files and appends new definitions at the end of the htdig.conf file, in addition to the "stock" definitions that are scattered throughout the file. This includes the all-important start_url: variable. Variable definitions at the end of the file override earlier definitions. Bear this in mind as you scroll through htdig.conf.

Edit ${CONFIG_DIR}/htdig.conf. Scroll down and find the start_url: line. This line defines what ht://Dig will index for searching. The default is to index the http://www.htdig.org/ site. This is not a good site to test with, because it takes a long time to index. Change this to point to a "site" on your own machine. For speed, change the URL to use your machine's IP address, rather than the full domain name. For example, if your machine is addressed as 192.168.1.1, then set start_url: to be http://192.168.1.1/
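For example, assuming your machine answers at 192.168.1.1, the relevant htdig.conf lines might look like this (limit_urls_to is shown with its stock default, which keeps the dig from wandering off your site):

```
# Index the local site by IP address (fast for testing)
start_url:      http://192.168.1.1/

# Only follow links within the start_url site (the stock default)
limit_urls_to:  ${start_url}
```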

start_url: must specify the site the same way that your web server serves it, because ht://Dig works like a web crawler and accesses your HTML pages the same way a web browser does. So use a browser to access the site on your own machine, and put the same URL that works in your browser into start_url:.

Using the IP address to refer to the site is a shortcut for testing. This IP address will be returned in the search results, so 192.168.1.1, for example, isn't what you would use when you release the search form to the public. In this case, you either have to set start_url: to the actual domain that the site uses, or (preferably) use two configuration files (one for digging and another for searching) and use the url_part_aliases directive to translate from a local IP address to the real domain. This is more complicated than what you should be doing until you have it working and are familiar with the basic operations.
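As a sketch of that two-file approach (the file names and domain here are examples): the digging and searching configurations map the same internal alias, *site, in opposite directions.

```
# In the "digging" configuration file (used by htdig/htmerge):
start_url:        http://192.168.1.1/
url_part_aliases: http://192.168.1.1/ *site

# In the "searching" configuration file (used by htsearch):
url_part_aliases: http://www.example.com/ *site
```

Because both files translate *site, the index stores the compact alias and htsearch expands it to the public domain in the results.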

For an additional speed boost, check out the local_urls: directive, which lets ht://Dig access the files through the local filesystem rather than going through the web server. But, again, wait until you have ht://Dig working and are reasonably familiar with how everything works before you try using this.
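When you do get to that point, the directive looks like this (assuming the RPM's /var/www/html document root; adjust for your server):

```
# Read http://192.168.1.1/... pages straight from the filesystem while digging
local_urls: http://192.168.1.1/=/var/www/html/
```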

You should create a robots.txt file in the server's root directory to specify what you do not want ht://Dig (or any other search engine!) to index. Here is a sample robots.txt file:

# robots.txt for http://www.example.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html

From A Standard for Robot Exclusion.

Reference for all configuration file directives

Generating the search index

Before you can search, you must generate the search index database. Change to ${BIN_DIR} and use the rundig script to run the ht://Dig indexing programs. Type ./rundig -v. Rundig will run htdig (the "digging," or indexing, step) and htmerge (the second step of creating the search index). The -v option tells them to be verbose, meaning that you should see each file as it is indexed, followed by indications of the merging activity.

This should complete in a reasonable length of time (depending on the size of your site.) If you see prolonged periods of inactivity, then press Ctrl-C to abort the programs and check start_url: in the ${CONFIG_DIR}/htdig.conf configuration file. If indexing is taking too long for testing, consider changing start_url: to only index a subset of your site until you are done wrestling with the configuration file.

Note that you must update the index whenever the site is updated. If your site is large and indexing is time consuming, then you might want to do the indexing in a cron job that is run in the middle of the night.
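For example, a root crontab entry like this (a sketch; the path assumes the default tarball layout) would rebuild the index quietly at 3:00 each morning:

```
# minute hour day-of-month month day-of-week command
0 3 * * * /opt/www/htdig/bin/rundig -s
```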

RPM users should know that the RPM installer creates an /etc/cron.daily job that will automatically run rundig once a day. This may be all that you need.

When you get the configuration file squared away, use ./rundig -s for a considerably shorter display. Alternatively, if something is giving you problems, try ./rundig -vvv for an extremely detailed and verbose display. In this case, you would probably want to redirect the output to a file: ./rundig -vvv > debug.txt. Then load debug.txt in an editor.

Right now the only way you have to generate the index is by running the rundig (or rundig2) script, which can be limiting because it regenerates the whole index from scratch each time it is run. This has two undesirable side effects: first, it takes time and machine resources, and second, searching returns no results while the rundig script is running.

There are other ways to do the search index database updating to sidestep these issues. You should examine the command line options for the indexing programs so that you can develop an indexing procedure that best suits your site's needs.
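One common approach is sketched below. It is not one of the stock scripts; it assumes the -a ("alternate work files") option of the 3.1.x htdig and htmerge, which builds db.*.work files beside the live database so that searches keep working during the dig, and then swaps the work files into place:

```shell
# Sketch: rebuild the index "to the side," then swap it in.
# bindir/dbdir are passed in; e.g. /opt/www/htdig/bin and /opt/www/htdig/db
reindex() {
    bindir=$1; dbdir=$2
    "$bindir/htdig" -i -a &&            # full dig into db.*.work files
    "$bindir/htmerge" -a || return 1    # merge the work files
    for f in "$dbdir"/db.*.work; do
        mv "$f" "${f%.work}"            # swap each work file into place
    done
}

# e.g., reindex /opt/www/htdig/bin /opt/www/htdig/db
```

The swap at the end is not perfectly atomic, but the window is far shorter than the dig itself.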

More information on the htdig, htmerge, htnotify, and htfuzzy programs that are used to generate the search index database.

Doing a search. Finally.

Look at ${SEARCH_DIR}/search.html. This is your sample search form.

For the tarball installation, you probably have to change one line, because we defined the CGI directory to be htdig-cgi in the Apache configuration file. So change

<form method="post" action="/cgi-bin/htsearch">

to

<form method="post" action="/htdig-cgi/htsearch">

and save the file.

Now use a browser to access this search form. If the IP address of your server is 192.168.1.1, then enter either http://192.168.1.1/htdig/search.html (tarball) or http://192.168.1.1/search.html (RPM) as the URL for your browser. You should see the search form. Enter a word that you know is somewhere on your site. Click the search button.

(Fingers are crossed.)

You should see the search results displayed, almost instantly.

More information on the htsearch CGI program that does the actual searching.

Troubleshooting

If something isn't working right, the first thing to do is to go back and check your configuration and try repeating the above procedures. If this doesn't help, then the ht://Dig site has a lot of valuable reference material. Check the configuration page, check the FAQ. Check the on-line reference section. Most important, make sure to visit the ht://Dig Mailing List Archive. The ht://Dig community provides excellent support. Most (if not all) common "why doesn't this work" type questions have already been asked and answered on the mailing list, or in the FAQ.

Use the search box at the bottom of the main ht://Dig page to search the archives (and the rest of the ht://Dig site.)

 

Tips and Techniques

Customizing the search results

Examine ${SEARCH_DIR}/search.html. You use this as a basis for how you want the search forms to look. The search results are defined by the template files that are located in ${COMMON_DIR}. You edit these to change how the search results are displayed.

One tricky part is that ht://Dig totally ignores the template files unless you add a template_map directive to htdig.conf. Like this:

this_base:  myweb

search_results_header: ${common_dir}/${this_base}/header.html
search_results_footer: ${common_dir}/${this_base}/footer.html
nothing_found_file: ${common_dir}/${this_base}/nomatch.html
syntax_error_file: ${common_dir}/${this_base}/syntax.html

template_map:   Long builtin-long ${common_dir}/${this_base}/long.html \
                Short builtin-short ${common_dir}/${this_base}/short.html \
                Default default ${common_dir}/${this_base}/long.html
template_name: Default

In this case I defined a new variable, this_base:, with a value of myweb. The way I use this is to first create a myweb directory under ${COMMON_DIR} and copy all the template files into it before I start editing them. This leaves an untouched set of the template files.

Once this is done, I go through and edit all the template files so that they display the way I want, e.g., editing ${COMMON_DIR}/myweb/header.html, ${COMMON_DIR}/myweb/footer.html, etc. This method is also valuable if you are indexing (and searching) multiple sites and are using multiple configuration files: you keep each different set of template files in a different directory (defined by the value that is assigned to this_base).

Optional. You could also separate the database files by defining them like

database_base:    ${database_dir}/${this_base}

The database files are named db.docdb, db.word.db, etc. by default. Making the above change would result in the database files being named myweb.docdb, myweb.word.db, etc. Again, this is important if you are using multiple configuration files to manage multiple search databases on the same machine. If you are only using one search database, then you can skip defining database_base:.

Making the date display all four digits of the year in search results

Add a date_format: command to htdig.conf.

Example:  date_format: %m/%d/%Y   will display like 01/23/2000.

See man strftime for full reference.
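You can preview a strftime pattern from the shell with the date(1) command, which uses the same % codes:

```shell
# Same format string as the date_format: example above;
# prints today's date as mm/dd/yyyy
date +%m/%d/%Y
```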

An alternate rundig script

ht://Dig supplies the rundig script that is sufficient to manage some ht://Dig indexing operations. But rundig doesn't support all the possible htdig, htmerge, and htfuzzy command line options. It is also difficult to use when you are specifying a different configuration file, because you have to type in the complete path to the configuration file.

I have modified rundig to address this. The modified script is named rundig2. It now supports all the command line options. It also supplies the path and file extension when you use the -c config file option.

Download whichever of these is most appropriate. Rename it to rundig2, check that the variables that define locations (DBDIR, etc.) are correct, move it to ${BIN_DIR}, and chmod it to be executable (chmod 755 rundig2).

Now you can use rundig2 instead of rundig when you are creating the database files. If rundig2 doesn't work for you, for some reason, then go back to using rundig and please let me know about it.

Indexing PDF files

ht://Dig will index Adobe Acrobat PDF files quite nicely, but you must download and install a PDF-to-text converter and do some additional configuration. Here's how.

Download the Xpdf package from the Xpdf Download page. Linux Intel users can download the pre-compiled binaries (x86, Linux 2.0, libc6). Once you have the binaries, copy pdftotext and pdfinfo to a suitable location (${BIN_DIR} or /usr/bin, for example).

Alternatively, you can also use one of these Xpdf RPM files. Download one of these files:

Install the RPM (rpm -Uvh xpdf*.rpm) and pdftotext and pdfinfo will be installed in /usr/bin. (Double-check the location with rpm -ql xpdf.)

Download conv_doc.pl from here and copy it to your ${BIN_DIR} directory. Chmod it to be executable (chmod 755 conv_doc.pl). Then load it in your editor, change the $CATPDF variable to point to where pdftotext is, and change $PDFINFO to where pdfinfo is.
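After editing, the two variables in conv_doc.pl will look something like this (the paths are examples; use wherever you actually put the Xpdf binaries):

```
# In conv_doc.pl -- point these at the Xpdf binaries
$CATPDF  = "/usr/bin/pdftotext";
$PDFINFO = "/usr/bin/pdfinfo";
```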

Finally, edit ${CONFIG_DIR}/htdig.conf and add

external_parsers:  application/pdf->text/html /usr/local/bin/conv_doc.pl

Replace /usr/local/bin/ with the location where you copied conv_doc.pl. More about the external_parsers: directive.

Important note. ht://Dig must read each PDF file in its entirety in order to index it. This is governed by the max_doc_size: directive in htdig.conf. Make sure that max_doc_size: is set larger than your largest PDF file.
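For example (5000000 is an arbitrary value; pick one larger than your biggest PDF, since the stock default is considerably smaller):

```
# htdig.conf -- allow documents up to ~5 MB to be read in full
max_doc_size: 5000000
```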

pdftotext is pretty nifty. It can also be interfaced to lynx. Check /etc/lynx.cfg and ~/.mailcap.

Indexing Microsoft Word files

Installing a Microsoft Word to text converter is similar to Indexing PDF Files. Follow the procedures there to install and configure conv_doc.pl. The only difference is that you install a Word-to-Text converter, such as catdoc. These go together, so it is almost as easy to install both the Word and PDF converters at the same time. conv_doc.pl is already partially configured to use catdoc. Add

external_parsers:  application/msword->text/html /usr/local/bin/conv_doc.pl

to ${CONFIG_DIR}/htdig.conf. If you were installing both the PDF and Word converters, then you'd add

external_parsers:  application/msword->text/html /usr/local/bin/conv_doc.pl \
                   application/pdf->text/html /usr/local/bin/conv_doc.pl

Again, replace /usr/local/bin/ with the location where you have actually installed the conv_doc.pl script.

Logging search requests

It is valuable to have a record of what people are searching for so that you know what they are interested in. This can give you hints on additional content that you need to add to your site.

To log search requests, add logging: true to your configuration file. This will direct the system logging facility to log search requests.

However, you might want to change the default logfile where syslog sends these messages. (By default they go to /var/log/messages.) To do this, edit your /etc/syslog.conf file and add this to it:

# Log ht://Dig search requests
local5.*                            /var/log/htdig

Remember to use tabs and NOT spaces in your syslog.conf file. Otherwise it won't work.

The system will now log search requests to both /var/log/messages as well as to /var/log/htdig, so now you have to tell it not to log search requests to /var/log/messages. To do this, add ;local5.none to your /var/log/messages line. It should look something like this:

# Log anything (except mail) of level info or higher.
# Don't log private authentication messages!
*.info;mail.none;authpriv.none;local5.none              /var/log/messages

For the changes to take effect, you'll need to restart your syslog daemon. To do so, just do a

killall -HUP syslogd

That will force syslogd to re-read its config file for the changes to take effect.

See man 5 syslog.conf for more information.

Syslog information courtesy of Bruce A. Buhler


Back to the scrounge.org home page.