findinsite-ms indexing
Before you can search your web site, you must index it to build a
search database.
- findinsite-ms can index HTML, PDF, DOC, DOCX, XLS, XLSX, PPT/PPS, PPTX, PUB, TXT, JPEG and TIFF file types.
It supports a regular indexing schedule and can email you the indexing results.
If need be, please see our notes on
using findinsite-ms on a load balanced server farm/cluster.
Setting the findinsite-ms Search database
findinsite-ms can search one or more search databases.
To tell findinsite-ms which search database to use,
first make your search database as described below.
Then go to the Control Panel
Searching section and either:
- add your new search database, or
- change a search database to your new one and press Make Changes.
After an indexing run successfully rebuilds a current findinsite-ms search database,
findinsite-ms automatically reloads the new search database.
You can confirm the index dates and times for the current search databases by
looking in the Control Panel Searching section.
The findinsite-ms Indexer
The findinsite-ms indexer only builds one search database at a time;
any other pending indexing runs are held in an Immediate queue.
The indexer checks its Scheduled list frequently to see if it
has any work to do.
Application life-time and Indexing
The findinsite-ms indexer only ever does scheduled indexing runs on the hour.
For the indexing system to work, the findinsite-ms application needs to keep running continuously.
This will, for example, let it do an indexing run every day at 2am.
While some systems do run findinsite-ms continuously,
other systems will stop the findinsite-ms application if it does not receive regular use,
eg from searches being carried out.
Some systems are set up to stop an application after a certain number of hits, memory use or after 29 hours.
findinsite-ms restarts if it has been stopped. At the next hour,
findinsite-ms checks to see if any indexing runs have been missed, and runs them if need be.
If findinsite-ms is in regular use, then it will be restarted fairly quickly after stopping.
Therefore indexing runs should happen more-or-less when expected.
If you expect your site to be very quiet overnight and want your
indexing run to happen at that time, then you must use an external scheduling tool to wake up findinsite-ms.
The scheduling tool can be set to access any findinsite-ms URL; however, it is recommended that
you access this page: search.aspx?keep=alive
If your indexing run is set up to run at 2am, then you should schedule
your wake up task for a short time before, eg 1:50am.
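As an illustration of the wake-up task described above, here is a minimal Python sketch that an external scheduler could run. The function names, the 10 minute lead time and the 30 second timeout are assumptions made for this example; only the search.aspx?keep=alive page comes from the text above.

```python
from urllib.parse import urljoin
from urllib.request import urlopen


def build_keepalive_url(base_url: str) -> str:
    """Return the keep-alive URL for a findinsite-ms install at base_url."""
    # A trailing slash makes urljoin keep the last path segment.
    if not base_url.endswith("/"):
        base_url += "/"
    return urljoin(base_url, "search.aspx?keep=alive")


def wake_up_time(run_hour: int, lead_minutes: int = 10) -> tuple[int, int]:
    """Return (hour, minute) at which to wake findinsite-ms before a run."""
    total = (run_hour * 60 - lead_minutes) % (24 * 60)
    return divmod(total, 60)


def wake_up(base_url: str, timeout: float = 30.0) -> int:
    """Request the keep-alive page and return the HTTP status code."""
    with urlopen(build_keepalive_url(base_url), timeout=timeout) as response:
        return response.status


# Example call from a scheduled task (hypothetical install location):
#     wake_up("http://www.example.org/findinsite/")
```

For a 2am indexing run, wake_up_time(2) gives (1, 50), matching the 1:50am suggestion above. Any scheduler that can run a script at a fixed time will do.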
Indexing Configuration
findinsite-ms's indexing is controlled from the
Control Panel. The Indexing section
of the Control Panel lets you set up immediate or regular indexing runs.
Each indexing run builds one search database by indexing one web site.
The screenshot on the right shows the Indexing section of the Control Panel menu.
Click on one of these options...
- Indexing: shows the indexer status, and lets you set up indexing limits and
email reporting.
- Immediate queue: shows a list of indexing runs queued, in progress or run recently.
- Scheduled list: shows a schedule of the regular indexing runs.
- Create new: runs the wizard to set up a new indexing run.
Creating a new Indexing Run
The Create new indexing run wizard has five steps:
- Select time to run: "now" or regular schedule (see Regular indexing times below)
- Choose a filename for the search database (see below): either reindex an existing search database, or build a new search database
- Enter the URL of the web site to index, and check the file types that you want indexed
- Enter any Advanced Options
- Confirm indexing run: store it in the schedule and optionally run it now
Regular indexing times:
- Hourly, on the hour
- Daily at specified hour
- Weekly at specified week-day and hour
- Monthly at specified first week-day and hour
- Monthly at specified day and hour
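To make the regular indexing times concrete, the following Python sketch works out the next on-the-hour run time for the hourly, daily and weekly cases. It is not how findinsite-ms itself is implemented; the function name, the kind strings and the Monday-is-0 weekday numbering are assumptions for this example, and the two monthly options are left out for brevity.

```python
from datetime import datetime, timedelta


def next_run(now: datetime, kind: str, hour: int = 0, weekday: int = 0) -> datetime:
    """Return the first on-the-hour run time strictly after `now`.

    kind is "hourly", "daily" or "weekly"; weekday follows Python's
    convention (Monday is 0). These names are illustrative only.
    """
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    if kind == "hourly":
        return top_of_hour + timedelta(hours=1)
    candidate = top_of_hour.replace(hour=hour)
    if kind == "daily":
        if candidate <= now:
            candidate += timedelta(days=1)
        return candidate
    if kind == "weekly":
        # Move forward to the requested week-day, then skip a week if already past.
        candidate += timedelta(days=(weekday - candidate.weekday()) % 7)
        if candidate <= now:
            candidate += timedelta(days=7)
        return candidate
    raise ValueError(f"unsupported schedule kind: {kind}")
```

For example, with the current time at 14:30 on a Wednesday, a daily 2am run is next due at 02:00 on Thursday.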
Search database filenames and the findinsite-ms Work directory
- Filenames: A search database is actually stored in many files,
each with the basic filename you choose, but with a different extension,
eg for index1 the actual files are index1.his, index1.hi1, etc.
In addition, findinsite-ms may also use the basic filename with
an underscore appended,
eg index1_, ie files index1_.his, index1_.hi1, etc.
When findinsite-ms remakes an existing search database it chooses
the oldest filename to remake, eg index1 or index1_.
This ensures that a good search database is still available in the event that
an indexing run fails, eg because network access to the site fails.
- Work directory: findinsite-ms always puts its search database files in its work directory.
Make sure that this is in a suitable location by looking at the
Control Panel General section.
If you decide to change the work directory then you must
change the work appSettings value in Web.Config,
as described here.
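The alternation between a basic filename and its underscore variant can be sketched as follows. This is only an illustration of the "remake the oldest" rule, not the product's own code: the function name is hypothetical, and using file modification times (with a name that has no files yet counting as oldest) is an assumption.

```python
from pathlib import Path


def choose_rebuild_name(work_dir: Path, base: str) -> str:
    """Pick which of base or base_ should be remade: the one with the oldest files."""

    def newest_mtime(name: str) -> float:
        # Matches name.his, name.hi1, etc. A name with no files sorts as oldest.
        files = work_dir.glob(name + ".*")
        return max((p.stat().st_mtime for p in files), default=float("-inf"))

    return min((base, base + "_"), key=newest_mtime)
```

Because the newer set of files is left untouched while the older set is rebuilt, a failed run never overwrites the search database currently in use.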
Scheduled list of indexing runs
Clicking on the Scheduled list option in the Control Panel Indexing
section displays a list of your regular indexing runs.
The list shows a summary of each indexing run, with various control options, as described in the
next section. If an indexing run has completed then it will also show a summary of the
run output - see the Immediate list screen below for an example.
Indexing run control options
When an indexing run summary is displayed on screen, click on the appropriate icon
for the following options:
- Details: Display a full description of the Indexing run and the Last output it generated.
- Edit: Start the Create new wizard to edit the indexing run.
Note that the indexing run will be removed from the scheduled list when
you start an edit; therefore you must complete the wizard if you want to
store your indexing run.
- Run: Put the indexing run in the Immediate queue so that it is run
when it comes to the front of the queue.
- Stop/Remove: The action of this option depends on the state of the indexing run:
  - If in progress, then stop the run
  - If in the completed runs list, then remove from this list
  - If in the scheduled list, then remove and delete the run
  In each case, you are asked to confirm the action first.
Immediate queue of indexing runs
Clicking on the Immediate queue option in the Control Panel Indexing
section displays this information:
- The indexing run in progress
- The list of indexing runs queued waiting for execution
- The list of recent completed indexing runs (within the last 20 minutes)
If any indexing runs are in progress, then the display updates every 20 seconds
to show you the latest status.
The example screenshot below shows an indexing run in progress and one recently completed.
Notice how each summary lists the number of pages and words found. If any problems are reported,
click on the Details (i) icon
for more information on these problems.
Indexing status and general options
Clicking in the main Control Panel Indexing
section displays the current indexing status and settings, with an option to make changes
to your general configuration.
The Indexing limits values apply to all your indexing runs.
Use them if you do not want a run to take too long or the search database to become too large.
If you set all the Indexing Email reporting values
then findinsite-ms will email you with details of each indexing run completed -
useful to keep an eye on findinsite-ms.
Press the Make Changes button if you alter any settings.
Current status
- Immediate queue: A summary of the size of the indexing queue,
whether an indexing run is in progress,
and the number of completed indexing runs.
- Scheduled list: The number of regular scheduled indexing runs.
Indexing Limits
- Time limit: The maximum number of minutes for an indexing run,
or 0 to have no limit.
- File limit: The maximum number of files for an indexing run,
or 0 to have no limit.
Indexing Email reporting
All the following boxes must be completed to enable email reporting of index results.
- SMTP send mail server: The name of your mail server, eg mail.mycompany.com
- SMTP send port: The mail server port, eg 25 by default
- From name: The name of the email sender, eg Julie Wilson
- From email address: The email address of the sender.
- From email password: If your mail server requires send authentication, enter your password here.
The password is stored in plain text in the work directory file findinsite.xml.
For more security, store it in the Web.Config file appSettings EmailFromPassword value -
see here for details.
- To email address: The email address of the recipient.
- Send email if findinsite-ms restarted: If this box is checked, findinsite-ms will send an email whenever it
is started by the servlet engine.
- Send test email: Click to send a test email.
Make sure that you press "Make Changes" first if you have just entered
any changes.
If any of the above options are in grey boxes then you cannot change them;
the value has been set by your webmaster or servlet administrator.
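The "0 means no limit" convention for the Time limit and File limit settings can be sketched like this. The function and parameter names are hypothetical, and whether the indexer stops exactly at the limit or just after it is an assumption; the text only says the values are maximums.

```python
def limit_reached(elapsed_minutes: float, files_indexed: int,
                  time_limit: int = 0, file_limit: int = 0) -> bool:
    """Return True once either configured limit is reached; 0 disables a limit."""
    if time_limit > 0 and elapsed_minutes >= time_limit:
        return True
    if file_limit > 0 and files_indexed >= file_limit:
        return True
    return False
```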
Technical details
Indexer user-agent
When the findinsite-ms indexer spiders/crawls a web site, it
calls itself FindInSiteBot . The user-agent HTTP header is set as follows:
FindInSiteBot/1.17.2235.31507 (http://www.phdcc.com/findinsite/bot.htm http://www.example.org/findinsite/)
This string includes the findinsite-ms version, a link to an explanation page
(http://www.phdcc.com/findinsite/bot.htm) and the URL
of the current instance of findinsite-ms. The latter field is useful in determining
which instance of findinsite-ms is indexing the site.
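If you want to pick these fields out of your web server logs, a small parser along these lines may help. It is a sketch that assumes the user-agent always has exactly the shape shown above; the class and function names are made up for this example.

```python
import re
from typing import NamedTuple, Optional


class BotInfo(NamedTuple):
    version: str
    info_url: str
    instance_url: str


# Shape assumed from the example above: name/version (explanation-url instance-url)
_USER_AGENT = re.compile(r"^FindInSiteBot/(\S+) \((\S+) (\S+)\)$")


def parse_user_agent(user_agent: str) -> Optional[BotInfo]:
    """Split a FindInSiteBot user-agent into its three fields, or return None."""
    match = _USER_AGENT.match(user_agent.strip())
    return BotInfo(*match.groups()) if match else None
```

The instance_url field identifies which findinsite-ms installation made the request.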
Indexing run sessions and cookies
The findinsite-ms indexer maintains cookies and so preserves session state.
robots.txt
findinsite-ms supports the robots.txt exclusion file -
see the Robot Exclusion Standard for details.
findinsite-ms looks for the FindInSiteBot user-agent; if this is not present
it honours the commands for the * user-agent.
findinsite-ms supports the Crawl-Delay option
in robots.txt . The Crawl-Delay number indicates the number of seconds between accesses;
values greater than 60 are reduced to 60. The default value is zero.
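The rules above can be tried out with Python's standard urllib.robotparser module. Note that this is Python's parser and not the one inside findinsite-ms, so it only approximates the behaviour described; the sample robots.txt, the example.org URLs and the effective_crawl_delay helper are assumptions for this sketch.

```python
from urllib.robotparser import RobotFileParser

MAX_CRAWL_DELAY = 60  # the text above says larger values are reduced to 60

# Hypothetical robots.txt with a FindInSiteBot section and a catch-all section.
ROBOTS_TXT = """\
User-agent: FindInSiteBot
Crawl-delay: 90
Disallow: /private/

User-agent: *
Disallow: /
"""


def effective_crawl_delay(parser: RobotFileParser, agent: str = "FindInSiteBot") -> int:
    """Return the delay in seconds: 0 if none is given, capped at 60."""
    delay = parser.crawl_delay(agent)
    return min(int(delay or 0), MAX_CRAWL_DELAY)


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# FindInSiteBot has its own section, so the * section is not used for it.
print(parser.can_fetch("FindInSiteBot", "http://www.example.org/index.htm"))
print(parser.can_fetch("FindInSiteBot", "http://www.example.org/private/page.htm"))
print(effective_crawl_delay(parser))
```

With this file, the indexer may fetch everything outside /private/ while other robots are kept out of the whole site.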
Page and link indexing control
The META robots tag is supported, including noindex and nofollow options.
The rel="nofollow" attribute for A tags is supported.
Excluding sections of pages
You can exclude portions of web pages from indexing or spidering as follows:
Comments
You can use commands within HTML comments. The commands "googleoff:" and "FindinSiteoff:" turn off the specified options,
while "googleon:" and "FindinSiteon:" turn on the specified options. The following options are available:
- index: text indexing
- follow: following links
- all: both index and follow
Examples:
<!--googleoff: index--> ... <!--googleon: index-->
<!-- FindinSiteoff:follow index--> ... <!--findinsiteon: all-->
The commands must appear at the start of the comment.
The options must be space or comma separated.
Commands cannot be nested.
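The comment command rules can be summarised in a short Python sketch that tracks whether indexing and link following are currently switched on. This is an illustration only: the function name and state dictionary are hypothetical, and treating the commands as case-insensitive and allowing leading spaces are assumptions based on the examples above.

```python
import re

# The command must be the first thing in the comment text.
_COMMAND = re.compile(r"^\s*(google|findinsite)(on|off):(.*)$",
                      re.IGNORECASE | re.DOTALL)


def apply_comment(comment_text: str, state: dict) -> dict:
    """Return the new {'index': bool, 'follow': bool} state after one comment.

    comment_text is the text between <!-- and -->.
    """
    new_state = dict(state)
    match = _COMMAND.match(comment_text)
    if not match:
        return new_state  # an ordinary comment changes nothing
    turn_on = match.group(2).lower() == "on"
    # Options are space or comma separated.
    options = [opt.lower() for opt in re.split(r"[\s,]+", match.group(3).strip()) if opt]
    for opt in options:
        if opt == "all":
            new_state["index"] = turn_on
            new_state["follow"] = turn_on
        elif opt in new_state:
            new_state[opt] = turn_on
    return new_state
```

Starting from both options on, the comment text "googleoff: index" leaves follow on and turns index off until a matching "googleon: index" comment is seen.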
Other options
- Words and links in between a <DIV class=nospy> tag and the next </DIV> tag are ignored.
- Words in between a <DIV class=nospytext> tag and the next </DIV> tag are ignored, but links are followed.
- Words and links in between a <DIV class=nospyabstract> tag and the next </DIV> tag are put in the search
database but not put in the abstract.
- Words and links in between APPLET../APPLET, SCRIPT../SCRIPT and STYLE../STYLE tags are ignored,
along with ASP script code in between <% and %>.
- Words and links after a NOFRAMES tag are ignored, apart from storing the abstract.
Searches by web crawler robots
Some web crawler robots fill in forms with (random) words.
These will appear as normal searches to FindinSite-MS.
If the associated UserAgent contains http:// then the search is marked as coming from a robot
in the logs and in the search count displayed in the Control Panel About section.
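The robot test described above amounts to a simple substring check, sketched here in Python. The function name is hypothetical, and the check is written as case-sensitive because the text does not say otherwise.

```python
from collections import Counter


def count_searches(user_agents: list[str]) -> Counter:
    """Count searches as 'robot' if the user-agent contains http://, else 'human'."""
    counts = Counter(robot=0, human=0)
    for user_agent in user_agents:
        counts["robot" if "http://" in user_agent else "human"] += 1
    return counts
```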
You can stop honest web crawlers from visiting the FindinSite-MS search page by adding a suitable
Disallow entry to your site's robots.txt file, eg:
Disallow: /findinsite/search.aspx