findinsite-ms indexing advanced options

findinsite-ms has an indexing facility, controlled in the Control Panel.

The Create new indexing run wizard makes it easy to index a web site to build a search database. However, you can also specify many advanced options to control the indexing process.

Enter advanced options at stage 4 of the indexing wizard, when prompted to Enter any advanced options. In the box below the prompt, type the settings one per line, each in the form name=value. For example, to enable indexing of text files with file extensions .txt and .bat, enter this:

ParseTXT=true
TXT_Files=*.txt,*.bat

If you want to remove an option, then simply delete the relevant line from the Advanced options box.
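
The one-setting-per-line name=value format described above can be read with a few lines of Python. This is only a sketch for checking your own settings text before pasting it into the wizard, not findinsite-ms code; the function name parse_advanced_options is hypothetical, and splitting on the first = sign is an assumption.

```python
def parse_advanced_options(text):
    """Parse name=value lines into a dict; blank lines are skipped."""
    options = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        # Assumption: the name ends at the first "=", so values may contain "=".
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected name=value, got: {line!r}")
        options[name.strip()] = value.strip()
    return options


example = "ParseTXT=true\nTXT_Files=*.txt,*.bat\n"
options = parse_advanced_options(example)
print(options)
```

Removing an option corresponds to deleting its line, so it simply no longer appears in the parsed result.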

Advanced option list

Name	Description	Default
Description	The search database description	Taken from the first page title found
ScanType	Indicates how findinsite-ms finds files to index: `dir` scans all files in ScanDirectory to a depth of ScanDirLevels; `file` scans by following links from ScanPathname; `url` scans by following links from ScanURL.	url
ScanDirectory	The directory used to find files if ScanType is dir	
ScanDirLevels	The number of directory levels to scan if ScanType is dir. Use a number in the range 0 to 255, or all.	all
ScanPathname	The initial file scanned if ScanType is file	
ScanURL	The initial URL scanned if ScanType is url	Set in wizard
ParseHTML	Specify true if you want to scan HTML web pages, or false if not.	true
HTML_Files	The file specification for HTML files, using * and ? wildcards as needed. Separate individual specifiers with a comma.	*.htm,*.html,*.asp,*.aspx
ParseTXT	Specify true if you want to scan TXT text files, or false if not.	false
TXT_Files	The file specification for TXT files, using * and ? wildcards as needed. Separate individual specifiers with a comma.	*.txt
ParsePDF	Specify true if you want to scan PDF files, or false if not.	false
PDF_Files	The file specification for PDF files, using * and ? wildcards as needed. Separate individual specifiers with a comma.	*.pdf
PDF_Passwords	Specify a comma-separated list of passwords to open PDF files.	
PDF_ReportCharacterDecodeProblems	Specify true if you want to have any PDF character decode problems listed, or false if not.	false
ParseDOC	Specify true if you want to scan DOC Word document files, or false if not.	false
DOC_Files	The file specification for DOC files, using * and ? wildcards as needed. Separate individual specifiers with a comma.	*.doc, *.docx, *.docm
ParseXLS	Specify true if you want to scan XLS Excel spreadsheet files, or false if not.	false
XLS_Files	The file specification for XLS files, using * and ? wildcards as needed. Separate individual specifiers with a comma.	*.xls, *.xlsx, *.xlsm
ParsePPT	Specify true if you want to scan PPT PowerPoint presentation files, or false if not.	false
PPT_Files	The file specification for PPT files, using * and ? wildcards as needed. Separate individual specifiers with a comma.	*.ppt, *.pptx, *.pptm
ParsePUB	Specify true if you want to scan PUB Publisher files, or false if not.	false
PUB_Files	The file specification for PUB files, using * and ? wildcards as needed. Separate individual specifiers with a comma.	*.pub
ParseImage	Specify true if you want to scan JPEG images for meta-data, or false if not.	false
Image_Files	The file specification for JPEG files, using * and ? wildcards as needed. Separate individual specifiers with a comma.	*.jpg,*.jpeg,*.tif,*.tiff
CaseSignificant	If finding files by following links, the case of filenames is ignored if false. If true, findinsite-ms treats test.htm and Test.htm as separate files. Windows normally ignores filename case; on Unix, filename case must be correct.	Windows: false
StoreStopWords	If false, findinsite-ms does not include words specified in StopWordFile.	true
StopWordFile	The pathname of the file containing stop words, with one word per line in UTF-8 format.	
NoTitleIgnorePageLinks	If finding files by following links and this property is set to true, then links are not followed if a page has no title.	false
ParseUpHierarchy	If finding files by following links and this property is set to true, then links are followed to directories above the initial file.	false
StorePositions	If true, findinsite-ms stores word positions so that "adjacent word" searches will work.	true
StoreLoneWords	If true, findinsite-ms stores a word's position even if the two surrounding words are stop words.	true
UseNoBaseURLs	Determines whether to include a Base URL prefix for each page in the search database	false
UseMetaDescriptionAsAbstract	If true, the page abstract will be taken from the page META description tag.	true
UseMetaAbstractAsAbstract	If true, the page abstract will be taken from the (new) page META abstract tag.	true
AbstractWords	If building the abstract from the words in a file, this property indicates the number of words to use.	0
Include	A list of file specifications to read and include in the search database. See below.	All files will be included
Exclude	A list of file specifications to read but exclude from the search database. See below.	No files will be excluded
HardExclude	A list of file specifications to exclude from the search database. See below.	No files will be hard excluded
UserAgent	The name of the UserAgent to use when indexing	FindInSiteBot/version
ObeyRobots	Whether to read indexing instructions from ROBOTS.TXT	true
Credentials	A list of username/password credentials. See below.	No usernames/passwords
MaxURLLength	The maximum URL length. Set to 0 for no limit.	1024
FieldsToExclude	A comma-separated list of fields to ignore (case-insensitive)	No fields ignored
rule1, rule2, etc	Optional indexing rules, see below	No indexing rules

Include, Exclude and HardExclude files

The Include, Exclude and HardExclude properties provide optional lists of file-specs that determine which files are included in or excluded from the search database. Files that match HardExclude are not read at all, which is equivalent to being listed in the site ROBOTS.TXT file. All other files are still read; however, only those matching Include and not matching Exclude are added to the search database.

The initial list of acceptable files is determined by the HTML_Files, TXT_Files, etc. Then:

If a HardExclude file-spec set is given, then any files meeting one of the given file-specs will not be read or indexed.

If an Include file-spec set is given, then only files meeting one of the given file-specs will be indexed.

If an Exclude file-spec set is given, then any files meeting one of the given file-specs will not be indexed.

Note that the Includes are processed first and the Excludes afterwards, so an Exclude file-spec takes precedence.

An individual file-spec can include zero or more * or ? wildcard characters, where ? matches exactly one character, and * matches zero or more characters. For example file???.ht* would match:
file001.htm, file101.html and file111.ht
but not
file1001.htm
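
The wildcard behaviour just described (? for exactly one character, * for zero or more) can be reproduced by translating a file-spec into a regular expression. This is an illustrative sketch rather than the matcher findinsite-ms uses; the helper names are hypothetical, and case-sensitive matching against a bare file name is an assumption.

```python
import re


def file_spec_to_regex(spec):
    """Translate a file-spec with * and ? wildcards into a compiled regex."""
    parts = []
    for ch in spec:
        if ch == "*":
            parts.append(".*")   # zero or more characters
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts), re.DOTALL)


def matches(spec, name):
    """Return True if the whole name matches the file-spec."""
    return file_spec_to_regex(spec).fullmatch(name) is not None


for name in ["file001.htm", "file101.html", "file111.ht", "file1001.htm"]:
    print(name, matches("file???.ht*", name))
```

The first three names match and file1001.htm does not, because ??? must consume exactly three characters before the literal period.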

A list of file-specs can be given directly in the property, or indirectly in a file.

Direct file-specs

Direct file-specs are semi-colon separated, eg:
Include=iso*;*12*
Exclude=file???.ht*
This specifies two Include file-specs and one Exclude file-spec.
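
To see how the three properties interact, the sketch below splits direct file-spec lists on semi-colons and applies the order described earlier: HardExclude first, then Include, then Exclude. It is a model of the documented behaviour under stated assumptions, not the product's code; the function names and the three result labels are hypothetical.

```python
import re


def matches(spec, name):
    """Match a name against a file-spec using * and ? wildcards."""
    pattern = "".join(
        ".*" if c == "*" else "." if c == "?" else re.escape(c) for c in spec
    )
    return re.fullmatch(pattern, name, re.DOTALL) is not None


def split_specs(value):
    """Split a direct, semi-colon separated file-spec list."""
    return [s.strip() for s in value.split(";") if s.strip()]


def classify(name, include="", exclude="", hard_exclude=""):
    """Return what happens to a file under the given property values."""
    if any(matches(s, name) for s in split_specs(hard_exclude)):
        return "not read"
    includes = split_specs(include)
    # An empty Include list means every file is a candidate.
    if includes and not any(matches(s, name) for s in includes):
        return "read, not indexed"
    # Excludes are checked after Includes, so Exclude takes precedence.
    if any(matches(s, name) for s in split_specs(exclude)):
        return "read, not indexed"
    return "indexed"


for name in ["iso9660.htm", "file012.htm", "about.htm"]:
    print(name, "->", classify(name, include="iso*;*12*", exclude="file???.ht*"))
```

Here file012.htm matches the Include spec *12* but is still left out of the search database because it also matches the Exclude spec.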

Indirect file-specs in a file

An indirect value consists of @ followed by a file name, where file-specs are specified one per line in plain text. The above direct example may be expressed indirectly as follows:
Include=@includes.txt
Exclude=@excludes.txt
where includes.txt contains:
iso*
*12*
and excludes.txt contains:
file???.ht*
If an indirect file cannot be opened, an error message is reported.
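
The direct and indirect forms can be handled by one loader, sketched below. The function name load_file_specs, the base_dir parameter, and the UTF-8 encoding are assumptions made for illustration; the documentation does not say which directory findinsite-ms resolves the file name against. A missing file raises OSError here, mirroring the reported error.

```python
import os
import tempfile


def load_file_specs(value, base_dir="."):
    """Return file-specs from a direct list or from an @file reference."""
    value = value.strip()
    if value.startswith("@"):
        # Indirect form: one file-spec per line in a plain text file.
        path = os.path.join(base_dir, value[1:])
        with open(path, encoding="utf-8") as handle:
            return [line.strip() for line in handle if line.strip()]
    # Direct form: semi-colon separated file-specs.
    return [s.strip() for s in value.split(";") if s.strip()]


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "includes.txt"), "w", encoding="utf-8") as handle:
        handle.write("iso*\n*12*\n")
    indirect = load_file_specs("@includes.txt", base_dir=tmp)

direct = load_file_specs("iso*;*12*")
print(indirect, direct)
```

Both calls yield the same two file-specs, which is why the two forms are interchangeable.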

Username/password credentials

If the web site being indexed requires one or more usernames/passwords, then pass this information in the Credentials property. findinsite-ms indexing supports "basic", "digest" and "NTLM" (Integrated Windows Authentication) authentication.

The Credentials property must consist of a semi-colon separated list of credentials. Each credential contains comma-separated fields: a username, a password and an optional path. Spaces are trimmed at the ends of all fields. To use a blank password, specify a period (.) in that field.

For example, for a single username (uname) and password (pwd), use this:

Credentials=uname,pwd

Only one credential can be supplied for each path on the web site. Therefore, if you are using more than one credential, then the paths must be different. Suppose you are indexing www.example.org. If username/password uname1/pwd1 is required for directory www.example.org/manager/ and uname2/pwd2 is required for all other directories, then use this:

Credentials=uname1,pwd1,manager/ ; uname2,pwd2
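
The Credentials format can be checked with a small parser like the following sketch. It follows the rules stated above (semi-colon separated credentials, comma-separated fields, trimmed spaces, a period for a blank password, one credential per path). The names Credential and parse_credentials are hypothetical, and how findinsite-ms treats a password that itself contains a comma or semi-colon is not documented, so this sketch does not attempt to handle that case.

```python
from typing import NamedTuple


class Credential(NamedTuple):
    username: str
    password: str
    path: str  # empty string when no path was given


def parse_credentials(value):
    """Parse a Credentials property value into a list of Credential tuples."""
    credentials = []
    seen_paths = set()
    for entry in value.split(";"):
        if not entry.strip():
            continue
        fields = [field.strip() for field in entry.split(",")]
        if len(fields) not in (2, 3):
            raise ValueError(f"expected username,password[,path]: {entry!r}")
        username, password = fields[0], fields[1]
        path = fields[2] if len(fields) == 3 else ""
        if password == ".":
            password = ""  # a period stands for a blank password
        if path in seen_paths:
            raise ValueError(f"more than one credential for path {path!r}")
        seen_paths.add(path)
        credentials.append(Credential(username, password, path))
    return credentials


for credential in parse_credentials("uname1,pwd1,manager/ ; uname2,pwd2"):
    print(credential)
```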

Indexing rules

Indexing rules provide a limited means of altering aspects of the indexing process. You can specify several rules, each named rule1, rule2, etc.

Each rule must have one or more conditions, and must have one action. For example, this rule has two conditions (that the file being indexed is a PDF, and its URL starts with Default) and one action (store its referer as the file URL):

rule1=C:type==PDF;C:url==Default;A:url=referer;

Each rule consists of several elements, each separated by a semi-colon (;)
Condition elements start with C:
The Action element starts with A:
If all the conditions are true, then the action is performed

Conditions that start with C:type== check that the file is a certain type, from this list: html pdf txt doc xls ppt image pub.

Conditions that start with C:url== check that the file's URL starts with the subsequent characters. Note that the "base URL" should not be included here, i.e. if the indexing run started at http://www.example.com/subdir/ and you want to check for files that start http://www.example.com/subdir/another/ then use condition C:url==another/

The Action A:url=referer sets the URL for this page to its referer page, if it exists. In practice this restricts this rule to standard URL indexing runs. Note that as a consequence, the referer URL will appear twice in the search database.
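
The rule syntax can be broken down mechanically, as in this sketch. It is not the findinsite-ms rule engine: the function names are hypothetical, comparing the type without regard to letter case is an assumption (the example rule uses PDF while the type list is in lower case), and the URL passed in is assumed to be already relative to the base URL.

```python
def parse_rule(text):
    """Split a rule into a list of (name, value) conditions and one action."""
    conditions = []
    action = None
    for element in text.split(";"):
        element = element.strip()
        if not element:
            continue  # tolerate the trailing semi-colon
        if element.startswith("C:"):
            name, sep, value = element[2:].partition("==")
            if not sep:
                raise ValueError(f"condition needs '==': {element!r}")
            conditions.append((name, value))
        elif element.startswith("A:"):
            if action is not None:
                raise ValueError("a rule must have exactly one action")
            name, sep, value = element[2:].partition("=")
            if not sep:
                raise ValueError(f"action needs '=': {element!r}")
            action = (name, value)
        else:
            raise ValueError(f"unknown element: {element!r}")
    if not conditions or action is None:
        raise ValueError("a rule needs at least one condition and one action")
    return conditions, action


def rule_applies(conditions, file_type, relative_url):
    """Return True only if every condition holds for this file."""
    for name, value in conditions:
        if name == "type":
            if file_type.lower() != value.lower():  # assumption: case ignored
                return False
        elif name == "url":
            if not relative_url.startswith(value):
                return False
        else:
            raise ValueError(f"unsupported condition: {name!r}")
    return True


conditions, action = parse_rule("C:type==PDF;C:url==Default;A:url=referer;")
print(conditions, action)
print(rule_applies(conditions, "pdf", "Default/manual.pdf"))
```

Only when rule_applies returns True would the action, here setting the URL to the referer, be carried out.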

Example

	Description=My web site
	ScanType=url
	ScanURL=http://www.mycompany.com/
	ParseHTML=true
	HTML_Files=*.htm,*.html,*.asp
	ParseTXT=false
	ParsePDF=true
	PDF_Files=*.pdf
	CaseSignificant=false
	StoreStopWords=true
	StopWordFile=
	NoTitleIgnorePageLinks=true
	ParseUpHierarchy=false
	StorePositions=true
	StoreLoneWords=true
	UseMetaDescriptionAsAbstract=true
	UseMetaAbstractAsAbstract=true
	AbstractWords=0
	Include=
	Exclude=