FindinSite-CD: Search engine for CD/DVD
 

 

FindinSite rules for word stemming and synonyms


Introduction

A rules file tells FindinSite how to do word stemming, match synonyms and correct spellings. Rules files are used by FindinSite-CD, FindinSite-JS and FindinSite-MS.

Word stemming means taking the stem of a word and generating common variants of the word. As an example, if the search text is throws then the word stem is throw and common variants of this stem include thrower, throwers and throwing. FindinSite uses rules to check to see if the word stem or any of the variants exist in the search database. If they do, then these words are added as alternatives to the original text.

A synonym is a different word with the same meaning. The FindinSite rules contain a list of equivalent words. As well as coping with synonyms, this lets FindinSite match regional variations, eg color matches colour and vice versa. A specialised rule form lets FindinSite correct mis-spellings, so that teh matches the, but not vice versa.

With these rules in place, a search for teh color fade matches The colour fades. Note that the rules can also produce unwanted matches: for example, a search for car will match carer.

Languages supported

The default search page created by FindinSite-CD-Wizard tells FindinSite-CD to use the English rules file rulesen.txt. We also have prototype rules files for French (rulesfr.txt) and German (rulesde.txt - thanks to Paul Croome, Software AG). Let us know if these rules work for you.

If your information is in French, then change the rules parameter to refer to the French rules file only:

<PARAM NAME=rules VALUE="rulesfr.txt">

We need to improve FindinSite-CD's handling of rules files for multiple languages. Suppose a French user is looking at the FindinSite-CD English documentation. If both English and French rules files are available, then FindinSite-CD will use the French rules for the French user. In fact, even this French user should use the English rules because that's the language of the information.

In most cases this is not a problem, because your information will only be in one language and you should provide only one language file, as described earlier.


Using FindinSite-CD rules

The default search page generated by FindinSite-CD-Wizard tells FindinSite-CD to use the English rules file called rulesen.txt, described below. It is this line in the search page that tells FindinSite-CD to use this rules file:
<PARAM NAME=rules VALUE="rulesen.txt">
If you remove this line completely then FindinSite-CD will not use any rules, ie there will be no word stemming, synonyms or spelling corrections.

You can specify more than one rules file, comma-separated in the rules parameter. Each rules file has locale identifiers, ie language and optional country codes. FindinSite-CD chooses the most appropriate rules file for the user's locale at startup, or the first rules file if there is no match. If you then switch language, FindinSite-CD repeats this selection for the new locale.
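The selection described above can be sketched as follows. This is only an illustration in Python, not FindinSite-CD's actual code: the function name choose_rules_file, the tuple layout and the exact matching order (language and country first, then language only, then the first file) are assumptions.

```python
def choose_rules_file(rules_files, user_locale):
    """Pick a rules file for the user's locale (hypothetical helper).

    rules_files: list of (filename, language, countries) tuples, in the
                 order the files appear in the rules parameter.
    user_locale: (language, country) tuple; country may be None.
    """
    language, country = user_locale
    language_only_match = None
    for filename, file_language, file_countries in rules_files:
        if file_language != language:
            continue
        if country is not None and country in file_countries:
            return filename  # language and country both match
        if language_only_match is None:
            language_only_match = filename  # remember first language match
    if language_only_match is not None:
        return language_only_match
    return rules_files[0][0]  # no match: fall back to the first rules file


# Assumed example set-up: two rules files listed in the rules parameter.
AVAILABLE = [("rulesen.txt", "en", ["GB"]), ("rulesfr.txt", "fr", [])]
french_choice = choose_rules_file(AVAILABLE, ("fr", "FR"))
german_choice = choose_rules_file(AVAILABLE, ("de", None))
```

Here a French user gets rulesfr.txt, while a German user falls back to rulesen.txt because it is listed first.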

If you write your own rules file and give it a different filename, then make sure that you change the filename in the search page text. If you want a rules file put alongside all new FindinSite-CD-Wizard generated search pages then put it in the FindinSite-CD Redist directory. See the template page if you want to alter the search page that FindinSite-CD-Wizard generates.


The Rules File Format

A rules file is either a plain text file in ANSI characters or a text file using UTF-8 characters. In both cases, the file is divided up into lines.

Line 1            1 if using ANSI, 2 if using UTF-8
Line 2            The language code and optional country codes, separated by a space, eg en or en GB
Line 3            A description of the rules file, displayed in FindinSite
Line 4 onwards    One rule per line. Each rule is a comma (or greater-than-sign) separated list of items.
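As an illustration of this layout, the following Python sketch splits the text of a rules file into its header fields and rules. The function name parse_rules_text and the sample content are made up for this example, and skipping blank lines between rules is an assumption. The text is assumed to have been decoded already.

```python
def parse_rules_text(text):
    """Split decoded rules-file text into header fields and rule lines."""
    lines = text.splitlines()
    if len(lines) < 3:
        raise ValueError("a rules file needs at least three header lines")
    encoding_flag = lines[0].strip()
    if encoding_flag not in ("1", "2"):
        raise ValueError("line 1 must be 1 (ANSI) or 2 (UTF-8)")
    locale_codes = lines[1].split()
    if not locale_codes:
        raise ValueError("line 2 must start with a language code")
    return {
        "encoding": "ANSI" if encoding_flag == "1" else "UTF-8",
        "language": locale_codes[0],
        "countries": locale_codes[1:],
        "description": lines[2].strip(),
        # Assumption: blank lines between groups of rules are ignored.
        "rules": [line.strip() for line in lines[3:] if line.strip()],
    }


# A small made-up rules file: ANSI, British English, three rules.
SAMPLE = "1\nen GB\nBasic English rules\n*s,*\ncolor,colour\n\nteh>the\n"
parsed = parse_rules_text(SAMPLE)
```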

There are three types of rules: word stemming rules, equivalent words rules and spelling correction rules.

How it works

FindinSite takes each word in the search text and applies all rules to this word. All rules are then applied to each generated word, and this is repeated until no new words are generated. (The rules are not applied again if a generated word is longer than the previous word; this stops words becoming infinitely long.)

The end result is a list of possible alternatives to the original word. FindinSite then goes through this list of words and removes a word if it does not appear anywhere in the search database.

All rule tests are letter case insensitive. Rules are not applied to words with non-Latin characters. Do not put punctuation characters in rules, eg co-operation,cooperation will not work.
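The repeated application and final filtering can be sketched like this in Python. It is a reading of the description above, not FindinSite's code: expand_word and toy_rules are hypothetical names, and treating "longer than the previous word" as "longer than the word it was generated from" is an assumption.

```python
def expand_word(word, apply_rules, database_words):
    """Return the alternatives for word that exist in the search database.

    apply_rules: function taking one word and returning the words that
                 all rules generate from it (supplied by the caller).
    """
    seen = {word}
    pending = [word]
    while pending:
        current = pending.pop()
        for candidate in apply_rules(current):
            if candidate in seen:
                continue
            seen.add(candidate)
            # Assumption: a longer generated word is kept as an
            # alternative but does not have the rules applied to it again.
            if len(candidate) <= len(current):
                pending.append(candidate)
    # Drop the original word and anything not in the search database.
    return sorted(w for w in seen if w != word and w in database_words)


def toy_rules(word):
    """Tiny stand-in rule set: strip a final s, add er and ing."""
    generated = [word[:-1]] if word.endswith("s") else []
    return generated + [word + "er", word + "ing"]


alternatives = expand_word("throws", toy_rules, {"throw", "thrower", "throwing"})
```

With this toy rule set, throws generates throw, throwser and throwsing, and throw generates thrower and throwing; only the words present in the database survive the final step.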

Word stemming rules

A word stemming rule is a comma-separated list of items. If a word matches the first item in a rule, then word variants specified in the remaining items are added to the list of possible words.

For example, a rule *,*s,*es applied to the search word throw produces alternative words throws and throwes.

All items must start with an asterisk *. If there are any subsequent characters in the first item then the end of the test word must contain the same characters. These characters are then removed to form the word stem.

The second and following items in the rule describe what alternative words should be generated. The initial * in these items is replaced by the word stem. Any subsequent characters in these items are added to the word stem. A word stem of just one character is not used. Rules are not applied to words of just one character.

As an example, a rule *ise,*ize applied to the search word authorise produces an alternative word authorize. Note that you need the following rule if you want authorize to produce an alternative word authorise: *ize,*ise.
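A minimal Python sketch of a plain word stemming rule, without the special rule forms, may make the one-way nature of a rule clearer. The function name apply_suffix_rule is hypothetical and this is not FindinSite's own implementation.

```python
def apply_suffix_rule(rule, word):
    """Apply one plain stemming rule such as '*ise,*ize' to a word."""
    first, *others = rule.split(",")
    ending = first[1:]  # characters after the initial asterisk
    word = word.lower()  # rule tests are letter case insensitive
    if len(word) < 2 or not word.endswith(ending):
        return []  # one-character words and non-matching words are skipped
    stem = word[:len(word) - len(ending)]
    if len(stem) < 2:
        return []  # a word stem of just one character is not used
    # Replace the asterisk in each remaining item with the word stem.
    return [stem + item[1:] for item in others]


plural_forms = apply_suffix_rule("*,*s,*es", "throw")
us_spelling = apply_suffix_rule("*ise,*ize", "authorise")
no_match = apply_suffix_rule("*ise,*ize", "authorize")
```

The last call returns an empty list, which is why the separate rule *ize,*ise is needed for the reverse direction.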

Special rule forms can check for words that contain either consonants or vowels:

  • *# matches words that end in a consonant, and
  • *$ matches words that end in a vowel.

Examples:

  • For rule *#,*#er, the search word throw produces alternative word thrower.
  • For rule *$,*er the search word care produces alternative word carer.

Multiple # or $ characters can be used, where each # or $ must match the same character. For example:

  • For rule *#,*##er,*##ed,*##ing the search word begin produces alternative words beginner, beginned and beginning.
  • For rule *##ing,*# the search word beginning produces alternative word begin.
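The special # and $ forms can be sketched in Python as below. Several details are assumptions made for this illustration: the vowels are taken to be a, e, i, o and u, the # or $ characters are expected directly after the asterisk (as in every example above), and the word stem is measured without the matched letter. The names split_item and apply_stemming_rule are hypothetical.

```python
VOWELS = "aeiou"  # assumption: y is treated as a consonant


def split_item(item):
    """Split '*##ing' into ('#', 2, 'ing'); '*er' gives ('', 0, 'er')."""
    if not item.startswith("*"):
        raise ValueError("every item must start with an asterisk")
    body = item[1:]
    placeholder = body[0] if body[:1] in ("#", "$") else ""
    count = len(body) - len(body.lstrip(placeholder)) if placeholder else 0
    return placeholder, count, body[count:]


def apply_stemming_rule(rule, word):
    """Apply one stemming rule, including the # and $ forms, to a word."""
    items = rule.split(",")
    word = word.lower()
    placeholder, count, suffix = split_item(items[0])
    if len(word) < 2 or not word.endswith(suffix):
        return []
    rest = word[:len(word) - len(suffix)]
    letter = ""
    if count:
        if len(rest) < count:
            return []
        letter = rest[-1]
        wants_vowel = placeholder == "$"
        if not letter.isalpha() or (letter in VOWELS) != wants_vowel:
            return []
        if rest[-count:] != letter * count:
            return []  # each # or $ must match the same character
        rest = rest[:-count]
    if len(rest) < 2:
        return []  # a word stem of just one character is not used
    results = []
    for item in items[1:]:
        _, out_count, out_suffix = split_item(item)
        results.append(rest + letter * out_count + out_suffix)
    return results


doubled = apply_stemming_rule("*#,*##er,*##ed,*##ing", "begin")
undoubled = apply_stemming_rule("*##ing,*#", "beginning")
```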

Equivalent words rules

Equivalent words are simply put in a comma-separated list. For example, for the rule paper,magazine,journal, the search word magazine produces alternative words paper and journal.

Rules can be used to cope with different regional spellings, eg color,colour copes with the US and UK spelling of this word.

Similarly, his,her,their could be used to make common possessive pronouns interchangeable.

Spelling correction rules

To correct spelling mistakes in the search text, put the mis-spelling first, then a greater-than-sign > and then the correct spelling.

For example, teh>the generates the as an alternative word for teh. However teh is not generated as an alternative word for the.

If you want searches to cope with spelling mistakes in your web pages as well, then use an equivalent words rule instead, eg teh,the, so that a search for the also matches teh.
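The difference between an equivalent words rule and a spelling correction rule can be sketched in Python as follows. The function name apply_word_rule is hypothetical, and the sketch assumes that a spelling correction rule has exactly two items.

```python
def apply_word_rule(rule, word):
    """Apply an equivalent words or spelling correction rule to a word."""
    word = word.lower()  # rule tests are letter case insensitive
    if ">" in rule:
        # Spelling correction: one-way, mis-spelling first.
        wrong, right = (part.strip().lower() for part in rule.split(">", 1))
        return [right] if word == wrong else []
    # Equivalent words: every item matches every other item.
    items = [item.strip().lower() for item in rule.split(",")]
    if word not in items:
        return []
    return [item for item in items if item != word]


synonyms = apply_word_rule("paper,magazine,journal", "magazine")
corrected = apply_word_rule("teh>the", "teh")
not_reversed = apply_word_rule("teh>the", "the")
```

The equivalent words form teh,the would return an alternative in both directions, which is what lets a search for the find a mis-spelt teh in a page.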


Basic English Rules file

The supplied English rules file has several basic rules. Please suggest any improvements to these rules. Let us have a copy of rules files for other languages.

Word stemming rules

The rules first remove any common word endings to get a word stem.
*s,*
*er,*
*ers,*
*ed,*
*ing,*
*eer,*
*ier,*
*ly,*
*ion,*
*ise,*
*ize,*

*er,*e
*ed,*e
*ion,*e

*##ing,*#
*##er,*#
*##ed,*#

The rules then try different word endings:

*ise,*ize
*ize,*ise
*or,*er
*er,*or
*our,*or
*or,*our
*y,*ies
*able,*ible
*ible,*able
*ance,*ence
*ence,*ance
*g,*gue
*gue,*g

The rules then add common word endings:

*,*s,*es
*#,*#e,*#er,*#ers,*#ed,*#ing,*#eer,*#ier,*#ly,*#ise,*#ize,*#ion
*e,*er,*ers,*ed,*ing,*ion
*#,*##er,*##ers,*##ed,*##ing

Equivalent words and Spelling corrections

color,colour
licence,license
language,langauge
a,an
his,her,their
affect,effect
teh>the
neccesary>necessary
recieve>receive
francais,français
All site Copyright © 1996-2011 PHD Computer Consultants Ltd, PHDCC

Last modified: 8 February 2006.
