findinsite-ms indexing advanced options
findinsite-ms has an indexing facility, controlled in the Control Panel.
The Create new indexing run wizard makes it easy to index a web site to build a
search database. However, you can also specify many advanced options to control
the indexing process.
Enter any advanced options at stage 4 of the indexing wizard, when prompted to Enter
any advanced options. In the box below, type in your settings, one per line, with
each line in the form name=value. For example, to enable indexing of text files with
file extensions .txt and .bat, enter this:
ParseTXT=true
TXT_Files=*.txt,*.bat
If you want to remove an option, then simply delete the relevant line from the Advanced options box.
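The name=value format described above can be illustrated with a small sketch. This is not how findinsite-ms itself reads the box; the function name parse_advanced_options is hypothetical, and the handling of blank lines and surrounding spaces is an assumption.

```python
def parse_advanced_options(text):
    """Parse one name=value setting per line into a dict.

    Assumptions (not confirmed by the findinsite-ms documentation):
    blank lines are skipped and spaces around names and values are trimmed.
    """
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first '=' only, so values such as
        # "C:type==PDF;A:url=referer;" keep their own '=' characters.
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected name=value, got: {line!r}")
        options[name.strip()] = value.strip()
    return options


options = parse_advanced_options("ParseTXT=true\nTXT_Files=*.txt,*.bat")
print(options)
```

Splitting on the first = only matters for the rule1, rule2 properties described later, whose values contain further = characters.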
Advanced option list
Name | Description | Default |
Description | The search database description | Taken from the first page title found |
ScanType | Indicates how findinsite-ms finds files to index: url, dir or file (as used by ScanURL, ScanDirectory and ScanPathname) | url |
ScanDirectory | The directory used to find files if ScanType is dir | |
ScanDirLevels | The number of directory levels to scan if ScanType is dir. Use a number in the range 0 to 255, or all. | all |
ScanPathname | The initial file scanned if ScanType is file | |
ScanURL | The initial URL scanned if ScanType is url | Set in wizard |
ParseHTML | Specify true if you want to scan HTML web pages, or false if not. | true |
HTML_Files | The file specification for HTML files, using * and ? wildcards as needed. Separate individual specifiers with a comma. | *.htm,*.html,*.asp,*.aspx |
ParseTXT | Specify true if you want to scan TXT text files, or false if not. | false |
TXT_Files | The file specification for TXT files, using * and ? wildcards as needed. Separate individual specifiers with a comma. | *.txt |
ParsePDF | Specify true if you want to scan PDF files, or false if not. | false |
PDF_Files | The file specification for PDF files, using * and ? wildcards as needed. Separate individual specifiers with a comma. | *.pdf |
PDF_Passwords | Specify a comma-separated list of passwords to open PDF files. | |
PDF_ReportCharacterDecodeProblems | Specify true if you want to have any PDF character decode problems listed, or false if not. | false |
ParseDOC | Specify true if you want to scan DOC Word document files, or false if not. | false |
DOC_Files | The file specification for DOC files, using * and ? wildcards as needed. Separate individual specifiers with a comma. | *.doc, *.docx, *.docm |
ParseXLS | Specify true if you want to scan XLS Excel spreadsheet files, or false if not. | false |
XLS_Files | The file specification for XLS files, using * and ? wildcards as needed. Separate individual specifiers with a comma. | *.xls, *.xlsx, *.xlsm |
ParsePPT | Specify true if you want to scan PPT PowerPoint presentation files, or false if not. | false |
PPT_Files | The file specification for PPT files, using * and ? wildcards as needed. Separate individual specifiers with a comma. | *.ppt, *.pptx, *.pptm |
ParsePUB | Specify true if you want to scan PUB Publisher files, or false if not. | false |
PUB_Files | The file specification for PUB files, using * and ? wildcards as needed. Separate individual specifiers with a comma. | *.pub |
ParseImage | Specify true if you want to scan JPEG images for meta-data, or false if not. | false |
Image_Files | The file specification for image files, using * and ? wildcards as needed. Separate individual specifiers with a comma. | *.jpg,*.jpeg,*.tif,*.tiff |
CaseSignificant | If finding files by following links, then the case of filenames is ignored if false. If true then findinsite-ms views test.htm and Test.htm as separate files. Windows generally ignores filename case; on Unix, filename case must be correct. | Windows: false |
StoreStopWords | If false, findinsite-ms does not include words specified in StopWordFile. | true |
StopWordFile | The pathname of the file containing stop words, with one word per line in UTF-8 format. | |
NoTitleIgnorePageLinks | If finding files by following links and this property is set to true, then links are not followed if a page has no title. | false |
ParseUpHierarchy | If finding files by following links and this property is set to true, then links are followed to directories above the initial file. | false |
StorePositions | If true then findinsite-ms stores word positions so that "adjacent word" searches will work. | true |
StoreLoneWords | If true then findinsite-ms stores a word's position even if the two surrounding words are stop words. | true |
UseNoBaseURLs | Determines whether to include a Base URL prefix for each page in the search database | false |
UseMetaDescriptionAsAbstract | If true then the page abstract will be taken from the page META description tag. | true |
UseMetaAbstractAsAbstract | If true then the page abstract will be taken from the (new) page META abstract tag. | true |
AbstractWords | If building the abstract from the words in a file, this property indicates the number of words to use. | 0 |
Include | A list of file specifications to read and include in the search database. See below | All files will be included |
Exclude | A list of file specifications to read but exclude from the search database. See below | No files will be excluded |
HardExclude | A list of file specifications to exclude from the search database. See below | No files will be hard excluded |
UserAgent | The name of the UserAgent to use when indexing | FindInSiteBot/version |
ObeyRobots | Whether to read indexing instructions from ROBOTS.TXT | true |
Credentials | A list of username/password credentials. See below | No usernames/passwords |
MaxURLLength | The maximum URL length. Set to 0 for no limit. | 1024 |
FieldsToExclude | A comma-separated list of fields to ignore (case-insensitive) | No fields ignored |
rule1, rule2, etc | Optional indexing rules, see below | No indexing rules |
Include, Exclude and HardExclude files
The Include, Exclude and HardExclude
properties provide an optional list of file-specs to determine the files to include or exclude
in the search database.
Those that match HardExclude are not read at all - this is equivalent to being listed in the
site ROBOTS.TXT file.
Otherwise, note that all files are still read - however only those matching Include and not matching
Exclude are included in the search database.
The initial list of acceptable files is determined by the
HTML_Files, TXT_Files, etc.
Then:
- If a HardExclude file-spec set is given, then any files meeting one of the given file-specs will not be read or indexed.
- If an Include file-spec set is given, then only files meeting one of the given file-specs will be indexed.
- If an Exclude file-spec set is given, then any files meeting one of the given file-specs will not be indexed.
Note that the Includes are processed first and the Excludes afterwards,
so an Exclude file-spec takes precedence.
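The order of the three checks can be sketched as follows. This is only an illustration of the rules stated above, not the findinsite-ms implementation: the function name classify and its return labels are hypothetical, Python's fnmatchcase stands in for the product's own wildcard matching (it also gives [ and ] a special meaning), and matching against a bare file name is an assumption.

```python
from fnmatch import fnmatchcase


def classify(name, include=(), exclude=(), hard_exclude=()):
    """Return 'skipped', 'read-only' or 'indexed' for one file name.

    'skipped'   : matches HardExclude, so the file is not read at all
    'read-only' : the file is read but not added to the search database
    'indexed'   : the file is read and added to the search database
    """
    def matches_any(specs):
        return any(fnmatchcase(name, spec) for spec in specs)

    if matches_any(hard_exclude):
        return "skipped"
    # An empty Include list means "all files will be included".
    if include and not matches_any(include):
        return "read-only"
    # Excludes are checked after Includes, so Exclude takes precedence.
    if matches_any(exclude):
        return "read-only"
    return "indexed"


include = ["iso*", "*12*"]
exclude = ["file???.ht*"]
for name in ["iso9001.htm", "file123.htm", "other.htm"]:
    print(name, classify(name, include, exclude))
```

Here file123.htm matches the Include spec *12* but is still not indexed, because it also matches the Exclude spec.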
An individual file-spec can include zero or more * or ? wildcard characters,
where ? matches exactly one character, and * matches zero or more characters.
For example file???.ht* would match file001.htm, file101.html and file111.ht
but not file1001.htm.
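The wildcard rules above can be checked with a short sketch that translates a file-spec into a regular expression. The helper names filespec_to_regex and matches are hypothetical, and the matching is case-sensitive here, which may differ from what findinsite-ms does when CaseSignificant is false.

```python
import re


def filespec_to_regex(spec):
    """Translate a file-spec into a compiled regular expression.

    '?' becomes '.' (exactly one character), '*' becomes '.*' (zero or
    more characters), and every other character is matched literally.
    """
    parts = []
    for ch in spec:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def matches(spec, name):
    """Return True if the whole of name matches the file-spec."""
    return filespec_to_regex(spec).fullmatch(name) is not None


for name in ["file001.htm", "file101.html", "file111.ht", "file1001.htm"]:
    print(name, matches("file???.ht*", name))
```

The last name fails because ??? consumes exactly three characters, leaving 1.htm where the spec requires .ht to follow.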
A list of file-specs can be given directly in the property, or indirectly in a file.
Direct file-specs
Direct file-specs are semi-colon separated, eg:
Include=iso*;*12*
Exclude=file???.ht*
This specifies two Include file-specs and one Exclude file-spec.
Indirect file-specs in a file
An indirect value consists of @ followed by a file name,
where file-specs are specified one per line in plain text.
The above direct example may be expressed indirectly as follows:
Include=@includes.txt
Exclude=@excludes.txt
where includes.txt contains:
iso*
*12*
and excludes.txt contains:
file???.ht*
If an indirect file cannot be opened, an error message is reported.
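As a sketch of the two forms, the function below turns an Include, Exclude or HardExclude value into a list of file-specs. The name load_filespecs and the base_dir parameter are hypothetical; the documentation does not say which directory indirect files are resolved against, nor their encoding, so both are assumptions here.

```python
from pathlib import Path


def load_filespecs(value, base_dir="."):
    """Return the list of file-specs for a direct or indirect value.

    Direct:   "iso*;*12*"      -> semi-colon separated file-specs
    Indirect: "@includes.txt"  -> one file-spec per line in that file
    """
    value = value.strip()
    if value.startswith("@"):
        # Assumption: relative to base_dir and UTF-8 encoded.
        # A missing file raises OSError, standing in for the
        # error message that findinsite-ms reports.
        path = Path(base_dir) / value[1:]
        lines = path.read_text(encoding="utf-8").splitlines()
        return [line.strip() for line in lines if line.strip()]
    return [spec.strip() for spec in value.split(";") if spec.strip()]


print(load_filespecs("iso*;*12*"))
print(load_filespecs("file???.ht*"))
```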
Username/password credentials
If the web site being indexed requires one or more usernames/passwords, then pass this information
in the Credentials property. findinsite-ms
indexing supports "basic", "digest" and "NTLM" (Integrated Windows Authentication) authentication.
The Credentials property must consist of a semi-colon separated list of
credentials. Each credential contains comma-separated fields: a username, a password and an optional path.
Spaces are trimmed at the ends of all fields.
To use a blank password, specify a period (.) in that field.
For example, for a single username (uname) and password (pwd), use this:
Credentials=uname,pwd
Only one credential can be supplied for each path on the web site. Therefore, if you are using more than one credential,
then the paths must be different. Suppose you are indexing www.example.org.
If username/password uname1/pwd1 is required for directory www.example.org/manager/ and
uname2/pwd2 is required for all other directories, then use this:
Credentials=uname1,pwd1,manager/ ; uname2,pwd2
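The Credentials format can be illustrated with the sketch below. It follows only the rules stated above (semi-colon separated credentials, comma-separated fields, trimmed spaces, a period for a blank password); the names Credential and parse_credentials are hypothetical, and how findinsite-ms treats passwords that themselves contain commas or semi-colons is not covered.

```python
from typing import NamedTuple, Optional


class Credential(NamedTuple):
    username: str
    password: str
    path: Optional[str]  # None when no path was given


def parse_credentials(value):
    """Parse a Credentials property value into a list of Credential tuples."""
    credentials = []
    for item in value.split(";"):
        if not item.strip():
            continue
        # Spaces are trimmed at the ends of all fields.
        fields = [field.strip() for field in item.split(",")]
        if len(fields) not in (2, 3):
            raise ValueError(f"expected username,password[,path]: {item!r}")
        username, password = fields[0], fields[1]
        if password == ".":
            password = ""  # a period stands for a blank password
        path = fields[2] if len(fields) == 3 and fields[2] else None
        credentials.append(Credential(username, password, path))
    return credentials


for credential in parse_credentials("uname1,pwd1,manager/ ; uname2,pwd2"):
    print(credential)
```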
Indexing rules
Indexing rules provide a limited means of altering aspects of the indexing process.
You can specify several rules, each named rule1, rule2, etc.
Each rule must have one or more conditions, and must have one action. For example, this rule
has two conditions (that the file being indexed is a PDF, and its URL starts with Default )
and one action (store its referer as the file URL):
rule1=C:type==PDF;C:url==Default;A:url=referer;
- Each rule consists of several elements, each separated by a semi-colon (;)
- Condition elements start with C:
- The Action element starts with A:
- If all the conditions are true, then the action is performed
Conditions that start with C:type== check that the file is a certain type, from this list:
html, pdf, txt, doc, xls, ppt, image, pub.
Conditions that start with C:url== check that the file's URL starts with the subsequent characters.
Note that the "base URL" should not be included here, ie if the indexing run started at
http://www.example.com/subdir/ and you want to check for files that start
http://www.example.com/subdir/another/ then use condition C:url==another/
The Action A:url=referer sets the URL for this page to its
referer page, if it exists. In practice this restricts this rule to standard URL indexing runs.
Note that as a consequence, the referer URL will appear twice in the search database.
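The rule syntax can be sketched as a small parser plus a condition check. This is an illustration only: parse_rule and rule_applies are hypothetical names, comparing types case-insensitively is an assumption (the example rule uses PDF while the type list uses pdf), and the URL passed in is assumed to already have the base URL removed.

```python
def parse_rule(value):
    """Split a rule value into ([(name, expected), ...], (name, value))."""
    conditions, action = [], None
    for element in value.split(";"):
        element = element.strip()
        if not element:
            continue  # tolerate the trailing semi-colon
        if element.startswith("C:"):
            name, sep, expected = element[2:].partition("==")
            if not sep:
                raise ValueError(f"bad condition: {element!r}")
            conditions.append((name, expected))
        elif element.startswith("A:"):
            if action is not None:
                raise ValueError("a rule must have exactly one action")
            name, sep, new_value = element[2:].partition("=")
            if not sep:
                raise ValueError(f"bad action: {element!r}")
            action = (name, new_value)
        else:
            raise ValueError(f"unknown element: {element!r}")
    if not conditions or action is None:
        raise ValueError("a rule needs at least one condition and one action")
    return conditions, action


def rule_applies(conditions, file_type, relative_url):
    """Return True only if every condition is true for this file."""
    for name, expected in conditions:
        if name == "type":
            if file_type.lower() != expected.lower():  # assumed case-insensitive
                return False
        elif name == "url":
            if not relative_url.startswith(expected):
                return False
        else:
            raise ValueError(f"unknown condition: {name!r}")
    return True


conditions, action = parse_rule("C:type==PDF;C:url==Default;A:url=referer;")
print(conditions, action)
print(rule_applies(conditions, "pdf", "Default/manual.pdf"))
```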
Example
Description=My web site
ScanType=url
ScanURL=http://www.mycompany.com/
ParseHTML=true
HTML_Files=*.htm,*.html,*.asp
ParseTXT=false
ParsePDF=true
PDF_Files=*.pdf
CaseSignificant=false
StoreStopWords=true
StopWordFile=
NoTitleIgnorePageLinks=true
ParseUpHierarchy=false
StorePositions=true
StoreLoneWords=true
UseMetaDescriptionAsAbstract=true
UseMetaAbstractAsAbstract=true
AbstractWords=0
Include=
Exclude=