FindinSite-CD: Search engine for CD/DVD

 

FindinSite-CD-Wizard


Scanning
Making and viewing a search page
Editing the search database
Command-line interface
Technical details

Introduction

FindinSite-CD-Wizard indexes or scans your existing CD to build a database of all the words. You can then edit the search database.

FindinSite-CD-Wizard can build a basic search page to run FindinSite-CD with your search database. It can copy the necessary FindinSite-CD program files and launch your default browser to view the search page.

FindinSite-CD-Wizard is a Windows application. There is an alternative indexing tool, Findex, which is a platform-independent Java application. Findex indexes HTML, PDF, DOC, PPT/PPS, TXT and JPEG files, but will not build a search page and has no editor. Findex also indexes meta-data for field searches, including RDF/XML files.

Search databases

A search database contains a list of the words on CD, from web pages, DOC, RTF, XLS, PPT and PDF files and JPEG/TIFF image meta-data. You have various scan options that will determine how many pages are scanned and how large the database will be.

The database contains details of each web page or file in your site. It stores the page title and an abstract, as well as the target frame. You can change these details in the editor once the pages have been scanned.

Note that each search database is stored in 14-17 files, each having an extension starting with ".hi", eg ".his", ".hi1", ".hi2", etc. However, when you open or save a search database in FindinSite-CD-Wizard, it only asks you for the name of one of the files, ie the one with extension ".his". Note that you are still dealing with all the 14-17 search database files.

Make sure that you copy all the 14-17 files with extensions starting ".hi" to your CD.
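
As an illustration of the point above, here is a minimal Python sketch that gathers every file belonging to one search database and copies the set to a CD staging directory. The function names and the staging directory are hypothetical; the only rule taken from this page is that the files share one filename prefix and have extensions starting with ".hi".

```python
from pathlib import Path
import shutil


def database_files(his_path):
    """Return all files next to the .his file that share its name and
    have an extension starting with '.hi' (eg .his, .hi1, .hi2)."""
    his = Path(his_path)
    return sorted(
        p for p in his.parent.iterdir()
        if p.is_file()
        and p.stem.lower() == his.stem.lower()
        and p.suffix.lower().startswith(".hi")
    )


def copy_database(his_path, dest_dir):
    """Copy the whole search database into dest_dir (hypothetical staging area)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for f in database_files(his_path):
        shutil.copy2(f, dest / f.name)
        copied.append(f.name)
    # This page says a complete database has 14-17 files, so a much
    # smaller count here suggests that something is missing.
    return copied
```

Calling copy_database with the path of the ".his" file and your CD staging directory returns the names of the files copied, which you can check against the expected 14-17.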

FindinSite-CD-Wizard cannot open search databases created by Findex that contain field search information.

Abstracts

An abstract is a short description of a file. By default, FindinSite-CD shows it in the results to help your user choose a suitable page to view. (You can change what is displayed in the results list.)

See the file types page for a summary of how the abstract is obtained for each supported file type.

For web pages, the abstract is normally taken from the META DESCRIPTION of a page. If this tag is not present, the first words of the page are used as the abstract. Therefore it is best to add META DESCRIPTION tags to all your pages.

Alternatively you can use a new META ABSTRACT tag to give the abstract, eg:
<META NAME="abstract" CONTENT="This page is the introduction<BR><BR>Start here">
Note that the abstract text may contain the characters <BR> to force a line feed in the abstract. Put the META ABSTRACT after the META DESCRIPTION.

If there is no META DESCRIPTION or ABSTRACT (or if you have deselected these scan options) then the abstract is built from the first words of the page body. You can specify how many words to include in the abstract.
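
The selection rules above can be sketched in Python as follows. This is not the Wizard's own code. It assumes that a META ABSTRACT, when present and enabled, is preferred over the META DESCRIPTION (this page only says to put the ABSTRACT after the DESCRIPTION), and the default of 25 words is a made-up value.

```python
from html.parser import HTMLParser


class AbstractFinder(HTMLParser):
    """Collects the META description/abstract and the words of the page body."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.in_body = False
        self.body_words = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("description", "abstract"):
                self.meta[name] = attrs.get("content") or ""
        elif tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.body_words.extend(data.split())


def choose_abstract(html, use_description=True, use_abstract=True, max_words=25):
    """Pick an abstract: META ABSTRACT, else META DESCRIPTION, else body words.
    The ABSTRACT-first order and max_words=25 are assumptions."""
    finder = AbstractFinder()
    finder.feed(html)
    finder.close()
    if use_abstract and finder.meta.get("abstract"):
        return finder.meta["abstract"]
    if use_description and finder.meta.get("description"):
        return finder.meta["description"]
    return " ".join(finder.body_words[:max_words])
```

Setting use_description or use_abstract to False mimics deselecting the corresponding scan option, in which case the abstract falls back to the first words of the body.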


Scanning

Select File+New to build a new search database. See the File Types page for details of the types of file that FindinSite-CD-Wizard can scan.

A wizard takes you through several steps before the scan starts. First, enter a name for your project.

Then select how FindinSite-CD-Wizard will find the files on your CD. It can either find all files in a directory (or directories) on a local disk or CD, or follow the hypertext links from an initial file. This initial file can be a file on a local disk/CD or a URL on a web site. Scans of local disk files will be quicker.

If following links, the scanner follows relative links to other web pages. It will not follow links to absolute URLs, eg "http://...". If a FRAME tag or an A HREF tag (or similar) has an attribute SPY=ignore or REL=nofollow, then the link is not followed, eg the link in <A HREF="newpage.htm" REL="nofollow"> is not followed.
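
The link rules above might be expressed like this. It is a sketch only: the function name is hypothetical, and treating any link with a URL scheme (such as "http:") as absolute is an assumption about how the scanner decides.

```python
from urllib.parse import urlsplit


def should_follow(href, attrs=None):
    """Return True if a link would be followed under the rules on this page.

    attrs is a dict of the tag's other attributes, eg {"REL": "nofollow"}.
    """
    attrs = {k.lower(): (v or "").lower() for k, v in (attrs or {}).items()}
    # SPY=ignore or REL=nofollow stops the link being followed.
    if attrs.get("spy") == "ignore":
        return False
    if "nofollow" in attrs.get("rel", "").split():
        return False
    # Assumption: a link with a scheme such as "http:" counts as absolute.
    return urlsplit(href).scheme == ""
```

For example, should_follow("newpage.htm") is True, while should_follow("newpage.htm", {"REL": "nofollow"}) and should_follow("http://www.phdcc.com/") are both False.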

The third wizard page asks what type of files you want scanned. FindinSite-CD-Wizard can find words in HTML web pages, PDF files, various Microsoft® Office files, TXT text files and JPEG/TIFF images. You can also change the file specification to indicate which files are recognised as belonging to a file type. For example, you could set the "HTML files" file specification to *.htm, *.html, *.asp if you want ASP script files scanned as well as basic HTML files. A "File mapping" button lets you index one type of file but show a different filetype if the user gets a hit in this file.
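
To see which of your files a file specification would pick up, a check along the following lines may help. It assumes the specification is a comma-separated list of wildcard patterns, as in the example above, and that matching ignores case; neither detail is confirmed here.

```python
from fnmatch import fnmatchcase


def matches_spec(filename, spec):
    """Return True if filename matches any pattern in a specification
    such as "*.htm, *.html, *.asp" (case-insensitive by assumption)."""
    patterns = [p.strip().lower() for p in spec.split(",") if p.strip()]
    return any(fnmatchcase(filename.lower(), p) for p in patterns)
```

With the specification "*.htm, *.html, *.asp", the file default.asp matches but report.pdf does not.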

The next wizard page asks you to select a local file to store the search database. Just enter a pathname without a file extension. The wizard adds extension ".his" automatically. As stated above, the scanner generates 14-17 files in the complete search database, each with the same filename prefix but different extensions.

This wizard page gives you the option of making a sub-directory for all FindinSite-CD files. Most people find that this is useful because it keeps all the FindinSite-CD files separate from the rest of your CD.

Finally, you can set various scan options. First time round, just use the default options.

If you want to rebuild your search database automatically, then use FindinSite-CD-Wizard's Command-line interface.


Stop words
You can opt not to store stop words, ie common words - such as 'the', 'and' and 'in'. An English language stop word list is provided in StopWordsEn.txt. A prototype French stop word list is in StopWordsFr.txt.

You can edit these stop word lists or create your own stop word lists easily - have the words in a plain text file, one per line. Note that non-alphanumeric characters should not be used, so you should not include "e.g." as a stop word.
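
If you build a stop word list from another source, a small check like this can catch entries that break the rule above. The function names are hypothetical, and the text encoding the Wizard expects for the file is not stated on this page, so the sketch only prepares the text and leaves saving it to you.

```python
def clean_stop_words(words):
    """Split candidate stop words into usable ones and rejected ones.
    Entries with non-alphanumeric characters (eg "e.g.") are rejected."""
    kept, rejected = [], []
    for word in words:
        word = word.strip()
        if not word:
            continue
        if word.isalnum():
            kept.append(word)
        else:
            rejected.append(word)
    return kept, rejected


def stop_word_file_text(words):
    """Return the list in the plain text layout: one word per line."""
    kept, _ = clean_stop_words(words)
    return "\n".join(kept) + "\n"
```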

Store word positions
FindinSite-CD-Wizard can also optionally store the positions of each word on each page. This lets your customers find adjacent words. This option is desirable, but will increase the database size.

Stop lone word positions
If you do not store stop word positions, then some words may be surrounded by stop words, eg time in The time of day.
A search for "The time of" (ie in exactly that order) will not get any hits, so it may seem sensible not to store the position of time in this case.

Use META description as abstract
Select this option if you want the META DESCRIPTION to be used as the page abstract.

Use META abstract as abstract
Select this option if you want the META ABSTRACT to be used as the page abstract.

Words in abstract
Enter the number of words to include in the abstract, if it is built from the page body.

No title: ignore page
Some simple pages may not have titles. If this option is selected then words on pages without a title will not be included in the database, and links will not be followed from this page.

Case is significant
This specifies whether the case of web page filenames is important, ie whether page "haggis.html" is different from "Haggis.html". In Windows systems, both these filenames will refer to the same page. (Note that on non-Windows servers, the case of link page names is important.)

Parse up directories
If this box is checked, the scanner will follow links that go up a directory, from the directory of the initial page.

Report PDF character problems
If this box is checked, the PDF Scanner will report any character code and glyph problems (these can normally be ignored).

PDF Passwords
If any of your PDF files require passwords, type in the passwords here, comma separated. Open (user) or master (security or owner) passwords are supported.
This option may not be available. If so, a message Sorry, password-protected PDFs not supported is shown. See the main PDF page for more details.


Once you press Finish, the scan starts. It displays its progress in a Scan Report window. When complete, a list of any problems encountered is shown. Click on OK to complete the scan and begin editing the search database.

FindinSite-CD-Wizard often finds some genuine errors in web sites - it is very easy to make mistakes. Press the "Save Report As..." button if you want to save the list of problems to a text file.

If you cancel a scan part way through, the database for the pages already scanned will be written correctly.

Ignoring words in web pages

Words and links in between APPLET../APPLET, SCRIPT../SCRIPT and STYLE../STYLE tags are ignored, along with ASP script code in between <% and %>.
Words and links in between a <DIV class=nospy> tag and the next </DIV> tag are ignored.
Words in between a <DIV class=nospytext> tag and the next </DIV> tag are ignored, but links are followed.
Words and links in between a <DIV class=nospyabstract> tag and the next </DIV> tag are put in the search database but not put in the abstract.
Words and links after a NOFRAMES tag are ignored, apart from storing the abstract.
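
The following Python sketch shows how most of these rules could be applied to a page. It is an illustration, not the Wizard's scanner: it leaves out the NOFRAMES rule, treats only A HREF tags as links, and removes ASP code with a simple regular expression. All class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser

SKIP_TAGS = {"applet", "script", "style"}
SPY_CLASSES = {"nospy", "nospytext", "nospyabstract"}


class WordCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_tag = None      # APPLET/SCRIPT/STYLE element we are inside
        self.div_mode = None      # current nospy... DIV class, if any
        self.words = []           # words for the search database
        self.abstract_words = []  # words that may be used in the abstract
        self.links = []           # links that would be followed

    def handle_starttag(self, tag, attrs):
        if self.skip_tag:
            return
        attrs = dict(attrs)
        if tag in SKIP_TAGS:
            self.skip_tag = tag
        elif tag == "div":
            cls = (attrs.get("class") or "").lower()
            if cls in SPY_CLASSES:
                self.div_mode = cls
        elif tag == "a" and attrs.get("href") and self.div_mode != "nospy":
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == self.skip_tag:
            self.skip_tag = None
        elif tag == "div":
            self.div_mode = None  # the region ends at the next </DIV>

    def handle_data(self, data):
        if self.skip_tag or self.div_mode in ("nospy", "nospytext"):
            return
        found = data.split()
        self.words.extend(found)
        if self.div_mode != "nospyabstract":
            self.abstract_words.extend(found)


def collect(html):
    """Strip ASP code between <% and %>, then collect words and links."""
    html = re.sub(r"<%.*?%>", " ", html, flags=re.S)
    collector = WordCollector()
    collector.feed(html)
    collector.close()
    return collector
```

After calling collect on a page's HTML, the words, abstract_words and links lists show what survives each rule.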


Making and viewing a search page

FindinSite-CD-Wizard can build a basic search page for you, either when you first scan your existing pages to build a search database, or later using the Test+Create search page.. menu.

FindinSite-CD-Wizard can also copy all the necessary FindinSite-CD program files into the same directory as the search page. Note that the FindinSite-CD runtime files include a com sub-directory. Make sure that com and all its sub-directories are put onto your CD.

You can tailor the search page as you wish, eg in your favourite HTML editor. FindinSite-CD-Wizard can run the Windows Notepad text editor to view or change the source HTML of the search page.

FindinSite-CD-Wizard can then display the search page in your default browser, using the Test+View search page.. menu or its toolbar short-cut.

In the scan wizard you can ask that all the FindinSite-CD files be grouped in a separate subdirectory.


Editing the search database

FindinSite-CD-Wizard screenshot

The FindinSite-CD-Wizard main window lets you edit a search database, ie the words themselves, complete pages of words and the Base URLs.

Most of the time you will just be trying to slim down the search database, ie to remove pages that should not be found, or words that people are unlikely to search for.

It is also a good idea to check the abstract for each page.

The top of the edit window displays the description and various statistics about the search database. Only the description can be edited.

If your files contain any words with characters that cannot be displayed in the current system locale then the characters will not be displayed correctly. However they can be edited - with care. See the character sets page for full details.


Word Editing

Click on the "Words" tab to edit the words.

Initially, no words are displayed. You have to select one or more of the check boxes to make some or all of the words appear.

Note that deleting words does not make them disappear straight away. Indeed, you can still see them listed if you have the "Show deleted words" checked. This is useful as it allows you, say, to delete all words with non-alphabetic characters and then go through the list undeleting the ones that are of interest.

Check box              Shows
All                    All words
Just numbers           Words with just numbers in
All capitals           Words with only capital letters
Any non-latin chars    Words with any non-latin characters
Starting with          Words starting with the given letters
Shorter than           Words with fewer than the given number of characters
Longer than            Words with more than the given number of characters

Select one of the check boxes to indicate which words you want to see. Note that you can select more than one option at once. For example, the above screen shows words with All Capitals and Starting With C.

The word list shows each word on a separate line, sorted alphabetically. In the "Cases" column, any instances of a word with different capital letters are shown. Finally, the "Pages" column shows all the pages that a word appears on.

To the left of each word is a little state icon. This is a blue tick if the word is in the database, or a thin red X cross if the word is deleted.

Click on the state icon to delete or undelete a word. Alternatively press the Del key, right-click and select "Delete", or select menu "Edit+Delete".

Select the "Delete all shown words" button to delete all the shown words.

You can click on the "Words" and "Cases" column header to sort the word list in different ways.

You can change the list of pages that a word refers to by right-clicking and selecting "Word properties...", or by selecting menu "Edit+Word properties...".

The Word properties box shows the different letter cases of the selected word and the pages in which the word appears. Click on the "Remove page" button to remove the highlighted page from the list of pages that the word finds.


Page Editing

FindinSite-CD-Wizard Pages screenshot

Click on the "Pages" tab to edit the pages that the search database refers to.

First select a page to edit from the list.

Pressing "Delete" deletes the page and all the words it contains from the search database.

For each page, the Title, URL, Base URL, Target frame, Priority and Abstract are shown, and can be edited. The list of anchors on the page is shown but cannot be edited. (Base URLs are explained below.)

The Priority field can be used to re-order the FindinSite-CD results list. You can set the Priority using the custom META phd-spy-priority tag.


Base URLs

FindinSite-CD-Wizard Base URLs screenshot

Click on the "Base URLs" tab to edit the Base URLs of the pages.

A "Base URL" is the characters to put in front of the characters of the page URL. So if the page URL is "index.html" and its Base URL is "http://www.phdcc.com/" then the page displayed is "http://www.phdcc.com/index.html".
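
In other words, the Base URL is simply prefixed to the page URL. A one-line Python sketch, with a hypothetical function name and with None standing in for a Base URL of <None>:

```python
def displayed_url(base_url, page_url):
    """Prefix the Base URL (if any) to the page URL by plain concatenation."""
    return (base_url or "") + page_url
```

So displayed_url("http://www.phdcc.com/", "index.html") gives "http://www.phdcc.com/index.html", and a Base URL of "../" gives "../index.html".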

If you asked the FindinSite-CD-Wizard scan wizard to make the search database in a subdirectory, then there will usually be a single Base URL "../". Otherwise there will be no Base URLs, and the Base URL for each page will display <None>.

Normally you should leave the Base URLs alone. However you can add, edit and delete the Base URLs in this Base URLs tab. Simply press "Add" to add a new Base URL. Or select a Base URL and edit it below. Note that you can only delete a Base URL if it is not used by any page. (You may have to exit FindinSite-CD-Wizard and re-enter to get Delete enabled.)

Use the "Set all URLs to this Base URL" button to make all the pages use the currently selected Base URL. Alternatively, use "Set all URLs to <None>" to reset all pages' Base URLs to <None>.

Back in the "Pages" tab, you can choose the Base URL for an individual page in the "Base URL" combo box.



Command-line interface

You can run FindinSite-CD-Wizard from a command-line to rebuild a search database without any user interaction. This lets you update a search database easily from an MS-DOS batch file or equivalent.

Running FindinSite-CD-Wizard from a command-line is equivalent to selecting menu File+Rebuild this search database and therefore must refer to an existing search database. (You cannot currently make a new search database from the command-line. You cannot edit the search database from the command line.)

Run FindinSite-CD-Wizard from the command-line as follows (change the path to fisCDWiz.exe if necessary):

"C:\Program Files\PHD\fisCDv5\fisCDWiz.exe" /rebuild search_database [-c] [optional_log_file]
You must specify /rebuild as the first parameter, and a search database filename for the second parameter, including the .his extension. Optionally specify -c if you want a list of files output to a new console (note that this is not the standard output so it cannot be redirected). Optionally add the name of a file for the scan report (in text format). It is safest to use the full pathname for each file. Put the filename(s) in double-quotes if they contain space characters, eg:
"C:\Program Files\PHD\fisCD\fisCDWiz.exe" /rebuild "C:\My CD\fiscd\mysite.his" C:\scanlog.txt
FindinSite-CD-Wizard does not output any information to the command window. However it will return a non-zero error code for serious errors. For successful runs (even if there are scan errors), FindinSite-CD-Wizard returns zero.
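
If you prefer a script to a batch file, the same rebuild can be driven from Python, as in this sketch. The install path is copied from the first example above and may differ on your machine; the function names are hypothetical. Only the /rebuild and -c parameters and the zero/non-zero return code described on this page are relied on.

```python
import subprocess

# Assumed install location, taken from the example on this page.
WIZARD = r"C:\Program Files\PHD\fisCDv5\fisCDWiz.exe"


def rebuild_command(database, log_file=None, console=False, wizard=WIZARD):
    """Build the argument list: /rebuild, the .his file, optional -c and log."""
    if not database.lower().endswith(".his"):
        raise ValueError("the search database name must include the .his extension")
    cmd = [wizard, "/rebuild", database]
    if console:
        cmd.append("-c")
    if log_file:
        cmd.append(log_file)
    return cmd


def rebuild(database, log_file=None, console=False, wizard=WIZARD):
    """Run the rebuild and return True if the Wizard returned zero."""
    result = subprocess.run(rebuild_command(database, log_file, console, wizard))
    return result.returncode == 0
```

Passing the arguments as a list lets Python add the double-quotes needed for pathnames that contain spaces. Remember that a zero return code only means there was no serious error, so read the scan report file for individual scan problems.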

If FindinSite-CD-Wizard has been used for more than 30 days under the Free licence, it will show a nag message box suggesting a purchase, which stops the command-line interface from running uninterrupted. Similarly, scans of DOC, XLS or PPT files can result in message boxes appearing from Word, Excel or PowerPoint if, say, a file is corrupted.


Technical details

The Character sets page contains information on how FindinSite-CD-Wizard scans HTML files. The PDF Scanning Support page has details of the PDF Scanning module. See the file types page for details of how FindinSite-CD-Wizard indexes other file types.

If you use framesets, FindinSite-CD will usually not show a result page in the desired frameset. Please consult the framesets page for several solutions to this problem.

Directories and files that have a hash (#) character in their pathnames cannot be scanned because FindinSite-CD-Wizard cannot distinguish them from anchor names. An appropriate error is sent to the Scan Report.

META phd-spy-rebase

A custom META tag can be used to change the Base URL, filename and target for a page in the search database. A META tag usually appears in the page header. The phd-spy-rebase META tag must have an attribute called name with a value of phd-spy-rebase. The content attribute specifies a replacement Base URL, filename and target - comma separated with no spaces.

Each element is optional. If an element is not present then it is not changed.

  • If the replacement Base URL is present but empty, the page Base URL is not changed.
  • If the replacement filename is present but empty, then the page's filename is set to empty.
  • If the replacement target is present but empty, then the page's target is set to empty.

This example changes the Base URL for its page to http://www.xxx.com/, the filename to newfilename.htm and the target to Main.

<META name="phd-spy-rebase" content="http://www.xxx.com/,newfilename.htm,Main">
If the replacement filename is -, any directories in the page's filename are removed. In this example, if the page filename is yy/z/apage.htm, then the Base URL is changed to http://www.xxx.com/ and the filename to apage.htm.
<META name="phd-spy-rebase" content="http://www.xxx.com/,-">
If the replacement filename is +, then the page filename is not changed. This is useful if you only want to change the target, eg:
<META name="phd-spy-rebase" content=",+,Main">
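
The rules for the content attribute can be summarised in a short Python sketch. This is a reading of the rules on this page, not the Wizard's code, and the function name is hypothetical. It assumes that directories in a page filename are separated by forward slashes.

```python
def apply_rebase(content, base_url, filename, target):
    """Apply a phd-spy-rebase content value to a page's current
    Base URL, filename and target, returning the new three values."""
    parts = content.split(",")
    # Base URL: changed only if present and not empty.
    if parts[0] != "":
        base_url = parts[0]
    # Filename: "-" strips directories, "+" keeps it, anything else
    # (including an empty value) replaces it.
    if len(parts) > 1:
        if parts[1] == "-":
            filename = filename.rsplit("/", 1)[-1]
        elif parts[1] != "+":
            filename = parts[1]
    # Target: replaced if present, even when empty.
    if len(parts) > 2:
        target = parts[2]
    return base_url, filename, target
```

For instance, applying "http://www.xxx.com/,-" to a page with filename yy/z/apage.htm leaves the filename as apage.htm and the target unchanged.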

META phd-spy-priority

A custom META tag can be used to set a "Priority number" for each web page. Priority numbers can be used to change the order that FindinSite-CD displays the results pages. See the Screen layout - Results Ordering section for more details (you will need to set the ReorderResults FindinSite-CD search page parameter).

A META tag usually appears in the page header. The phd-spy-priority META tag must have an attribute called name with a value of phd-spy-priority. (The old value of phd-spy-revision is also supported.) The content attribute specifies an integer in the range 0 to 255 to be used as the page Priority.

By default, each page has a Priority of zero. This example changes the Priority to 10.

<META name="phd-spy-priority" content="10">
The source of this page contains an example of a META phd-spy-priority tag.

META robots

The standard robots META tag can be used to tell FindinSite-CD-Wizard whether to index a page and whether to follow the links in the page (when following links). See http://www.robotstxt.org/wc/meta-user.html for full details.

A META tag usually appears in the page header. The robots META tag must have an attribute called name with a value of robots. The content attribute specifies one or more comma-separated directives:

Directive    Description
index        Index the words on this page
noindex      Don't index the words on this page
follow       Follow the links on this page
nofollow     Don't follow the links on this page
none         Same as noindex,nofollow
all          Same as index,follow
spyignore    Ignore the other directives, ie index and follow regardless

By default, FindinSite-CD-Wizard indexes each page and follows links. This example indexes the page but does not follow links.

<META name="robots" content="index,nofollow">
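
As a sketch of how the directives in the table combine, the following Python function turns a content value into a pair of index/follow decisions. The function name is hypothetical, and letting the last directive win when two conflict is an assumption; this page does not say how conflicts are resolved.

```python
def parse_robots(content):
    """Return (index, follow) for a robots META content value."""
    index, follow = True, True  # default: index the page and follow links
    directives = [d.strip().lower() for d in content.split(",") if d.strip()]
    for d in directives:
        if d == "index":
            index = True
        elif d == "noindex":
            index = False
        elif d == "follow":
            follow = True
        elif d == "nofollow":
            follow = False
        elif d == "none":
            index = follow = False
        elif d == "all":
            index = follow = True
    # spyignore overrides everything else: index and follow regardless.
    if "spyignore" in directives:
        index = follow = True
    return index, follow
```

For the example above, parse_robots("index,nofollow") gives (True, False).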
All site Copyright © 1996-2011 PHD Computer Consultants Ltd, PHDCC

Last modified: 8 February 2006.
