FindinSite.TextExtractor: finds text and words

FindinSite.TextExtractor for .NET


Features

  • .NET class library DLL
  • Extracts plain text and field meta-data
  • HTML, PDF, DOC, DOCX, XLS, XLSX, PPT, PPTX, PUB, TXT, JPEG and TIFF files supported
    (except older format variants)
  • Characters are canonicalised, eg from ª to a
  • Finds unique (lower-cased) words

  • Trial download - restricted information returned: see below to get started
  • US$199 for use on one server, or US$499 for royalty-free use in any runtimes

Input:

  • Parses a System.IO.Stream of a given file type

Various outputs available, all optional:

  • All plain text to System.IO.TextWriter
  • All plain text to System.Text.StringBuilder
  • Each different lower-case word reported as an event
  • Different lower-case words in a sorted System.Collections.ArrayList or System.Text.StringBuilder
  • All meta-data text in a System.Collections.Hashtable:
    System.Collections.Specialized.StringCollection values, keyed by string field name
  • Any error, reported as an event

Supplied code

  • Runtime class DLLs - no source
    • phdcc.fis.Find.dll
    • phdcc.fis.Findex.dll
  • Example C# console application - project source and binary

Getting started

  1. Unzip the supplied kit in a clean directory.
  2. Try out the supplied TextExtractTest console application.
  3. Load the TextExtractTest project into VS.NET and check that it compiles.
  4. In your VS.NET project, add a reference to bin/phdcc.fis.Find.dll and bin/phdcc.fis.Findex.dll.
  5. Add code to your project to call TextExtractor.

TextExtractTest example application

TextExtractTest lets you try out all the features of TextExtractor. It is a console application, so you will probably run it from a 'Command Prompt' DOS-box.

  1. Go to the directory where you installed the development kit.
  2. Enter bin\TextExtractTest.exe followed by the filename or path of the file that you want to parse. For example, to parse the supplied file test.htm, enter
    bin\TextExtractTest.exe test.htm
  3. TextExtractTest outputs all the information received to the console output. You can alter the output seen by changing the project source code and re-compiling.
  4. You can use TextExtractTest to parse any of the accepted file types. TextExtractTest uses the file extension to determine the file type.
  5. TextExtractTest only opens local files, so URLs are not accepted. However, TextExtractor itself will parse any Stream, including those obtained from URLs.
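
Step 4 says that TextExtractTest picks the file type from the file extension. The following is a minimal sketch of how such a lookup could be written; it is not the code shipped in TextExtractTest.cs. The ParseType enum is a local stand-in that mirrors the values of URLtoParse.Type so that the sketch compiles without the DLL. The mapping of .docx, .xlsx and .pptx to DOC, XLS and PPT, and of .jpg, .jpeg, .tif and .tiff to Image, is an assumption drawn from the supported file list, not documented behaviour.

```csharp
using System;
using System.IO;

// Local stand-in mirroring URLtoParse.Type, so this sketch compiles without the DLL.
public enum ParseType { HTML, TXT, PDF, DOC, XLS, PPT, Image, PUB }

public static class FileTypeSketch
{
    // Returns true and sets 'type' when the extension is recognised.
    // The extension-to-type mapping below is an assumption, not documented behaviour.
    public static bool TryGetType(string path, out ParseType type)
    {
        string ext = (Path.GetExtension(path) ?? "").ToLowerInvariant();
        switch (ext)
        {
            case ".htm":
            case ".html": type = ParseType.HTML; return true;
            case ".txt":  type = ParseType.TXT; return true;
            case ".pdf":  type = ParseType.PDF; return true;
            case ".doc":
            case ".docx": type = ParseType.DOC; return true;
            case ".xls":
            case ".xlsx": type = ParseType.XLS; return true;
            case ".ppt":
            case ".pptx": type = ParseType.PPT; return true;
            case ".jpg":
            case ".jpeg":
            case ".tif":
            case ".tiff": type = ParseType.Image; return true;
            case ".pub":  type = ParseType.PUB; return true;
            default:
                // Unknown extension: the value of 'type' is meaningless here
                type = ParseType.HTML;
                return false;
        }
    }
}
```

In real code you would return URLtoParse.Type values instead of the stand-in enum, and skip the file (or report an error) when TryGetType returns false.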

TextExtractTest VS.NET project

Open an existing solution or a blank solution in VS.NET. Use "Add an existing project" to load TextExtractTest.csproj in the development kit directory.

The main example code is in C# source code file TextExtractTest.cs. You may edit this code as you wish.


Class definitions

Here are the public definitions of the main TextExtractor class, the file type class URLtoParse, and the associated event classes FindexWordFoundEventArgs and FindexErrorEventArgs.

In summary, you need to make a new TextExtractor object, set up event handlers, open your stream and then call TextExtractor.Parse(). Then process what you have received.

C#  
namespace com.phdcc.findex
{
  // Main TextExtractor class
  public class TextExtractor
  {
      public event FindexWordFoundEventHandler WordFound;
      public event FindexErrorEventHandler Error;
      public TextExtractor();
      public int Parse(
        URLtoParse.Type type,
        Stream InputStream,
        TextWriter OutputWriter,
        StringBuilder AllText,
        ArrayList alIndividualWords,
        StringBuilder sbIndividualWords,
        Hashtable Fields
      );
  }

  // WordFound event definition
  public class FindexWordFoundEventArgs : EventArgs
  {
      public string word;
      public FindexWordFoundEventArgs(string word) 
      {
          this.word = word;
      }
  }
  public delegate void FindexWordFoundEventHandler(
    object sender,
    FindexWordFoundEventArgs fwfe
  );

  // Error event definition
  public class FindexErrorEventArgs : EventArgs
  {
      public string msg;
      public FindexErrorEventArgs(string msg)
      {
          this.msg = msg;
      }
  }
  public delegate void FindexErrorEventHandler(
    object sender,
    FindexErrorEventArgs fee
  );
  
  public class URLtoParse
  {
    public enum Type { HTML, TXT, PDF, DOC, XLS, PPT, Image, PUB, };
  }
}

TextExtractor.Parse() method definition

You must supply two valid input parameters to Parse():

  • type must be set to one of the file type enumeration values, eg URLtoParse.Type.HTML
  • InputStream must be set to an open Stream to the file, URL or other resource.

There are five output parameters, any of which may be null if they are not desired:

  • All plain text characters are sent to the OutputWriter TextWriter
  • All plain text characters are appended to the AllText StringBuilder
  • All unique (lower-cased) words are added to the alIndividualWords ArrayList in sorted order
  • All unique (lower-cased) words are appended to the sbIndividualWords StringBuilder in sorted order, separated by spaces
  • Each field string is added to the Fields Hashtable, with a string key of the field name (in lower-case) and a StringCollection value. The StringCollection contains one string element for each instance of the field found. Note that fields are not split into words.
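
As an illustration of the Fields layout described above, here is a minimal sketch that walks a Hashtable of StringCollection values and formats it as text. It uses only standard .NET classes, so it runs without the DLL. The class and method names (FieldsSketch, Describe) and the output format are hypothetical and are not part of the TextExtractor API.

```csharp
using System;
using System.Collections;
using System.Collections.Specialized;
using System.Text;

public static class FieldsSketch
{
    // Formats each field as "name: value1 | value2", one field per line,
    // with the field names in ordinal sorted order.
    public static string Describe(Hashtable fields)
    {
        ArrayList names = new ArrayList(fields.Keys);
        names.Sort(StringComparer.Ordinal);

        StringBuilder sb = new StringBuilder();
        foreach (string name in names)
        {
            // Each value is a StringCollection holding one string per field instance
            StringCollection values = (StringCollection)fields[name];
            string[] items = new string[values.Count];
            values.CopyTo(items, 0);
            sb.Append(name).Append(": ").Append(string.Join(" | ", items)).Append('\n');
        }
        return sb.ToString();
    }
}
```

After a call to Parse() you could pass the Fields object to FieldsSketch.Describe() and write the result to the console. The field names that actually appear depend on the file being parsed.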

Parse() returns the number of bytes processed, ie the size of the file, or one of these negative error codes:

  • -1: InputStream null
  • -2: Unrecognised file type
  • -3: Any error found
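
A caller will usually want to turn this return value into something readable. The sketch below is one way to do that, using the three codes listed above. The ParseResultSketch class is a hypothetical helper, not part of the library, and its handling of any other negative value is an assumption because no such codes are documented.

```csharp
using System;

public static class ParseResultSketch
{
    // Converts the int returned by Parse() into a short description.
    public static string Describe(int result)
    {
        switch (result)
        {
            case -1: return "InputStream null";
            case -2: return "Unrecognised file type";
            case -3: return "Error found";
            default:
                // Zero or more is a byte count; other negative values are undocumented
                return result >= 0
                    ? result + " bytes processed"
                    : "Unknown result " + result;
        }
    }
}
```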

In addition Parse() may raise the following events during processing:

  • Each unique (lower-cased) word is reported using the WordFound event.
  • Any error messages are reported using the Error event.

Notes

  • All the plain characters can be found using either the OutputWriter or AllText parameters, or both.
  • The individual words can be found using any or all of the alIndividualWords parameter, the sbIndividualWords parameter and the WordFound event.
  • Processing will be quicker if you do not look for individual words at all (because the code has to maintain a list of words and check to see if the word has already been found).

Character canonicalisation

Characters are canonicalised before being processed. Canonicalisation means converting characters to a basic root character, eg:

  • Characters such as ª are changed to a, and ç to c
  • Ligatures such as ﬁ are expanded to the two characters fi

Note that the HTML parser will already have decoded escaped 'named entity' characters such as &ccedil; to ç.
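
To give a feel for what canonicalisation does, the sketch below approximates it with standard Unicode compatibility decomposition (NFKD) followed by removal of combining marks. This is not the algorithm used inside TextExtractor, whose exact rules are not published here. It simply reproduces the three examples above using built-in .NET calls, and the CanonicaliseSketch class name is hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class CanonicaliseSketch
{
    // Approximates canonicalisation: decompose each character (NFKD),
    // then drop the combining marks left behind by accented letters.
    public static string Canonicalise(string text)
    {
        string decomposed = text.Normalize(NormalizationForm.FormKD);
        StringBuilder sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

With this approach ª (U+00AA) becomes a, ç (U+00E7) becomes c, and the ligature ﬁ (U+FB01) becomes the two characters fi. The library may treat other characters differently.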

Word definition

A word is a sequence of characters delimited by white space, punctuation characters, line breaks, table breaks or similar. Note that this means that "John's code." is reported as three words: "john", "s" and "code".

Each non-Latin character, eg an Asian character, is reported as a single word: one character = one word.
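
The word rule for Latin text can be sketched as follows. This is an illustrative approximation, not the library's tokenizer. It treats any run of letters or digits as a word (counting digits as word characters is an assumption), lower-cases it, keeps one copy of each word and sorts the result in ordinal order. It does not implement the one-character-per-word rule for non-Latin scripts. The WordSketch class name is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class WordSketch
{
    // Returns the unique lower-cased words in 'text', in ordinal sorted order.
    public static List<string> UniqueWords(string text)
    {
        List<string> words = new List<string>();
        StringBuilder current = new StringBuilder();

        // The trailing space makes sure the last word is flushed
        foreach (char c in text + " ")
        {
            if (char.IsLetterOrDigit(c))
            {
                current.Append(char.ToLowerInvariant(c));
            }
            else if (current.Length > 0)
            {
                string word = current.ToString();
                if (!words.Contains(word))
                    words.Add(word);
                current.Length = 0;
            }
        }

        words.Sort(StringComparer.Ordinal);
        return words;
    }
}
```

For the input "John's code." this returns the three words "code", "john" and "s", because the apostrophe and the full stop both act as delimiters.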

Calling TextExtractor

To call TextExtractor:

  • Make a new instance of TextExtractor
  • Add WordFound and Error event handlers if desired
  • Open a System.IO.Stream to the file contents
  • Determine the file type, eg URLtoParse.Type.HTML
  • Set up objects to receive the results
  • Call TextExtractor.Parse()
C#  
using System;
using System.Collections;
using System.IO;
using System.Text;
using com.phdcc.findex;

// The wrapper class name is illustrative only
class TextExtractExample
{
  static void Main(string[] args)
  {
    TextExtractor te = new TextExtractor();
    te.WordFound += new FindexWordFoundEventHandler(WordFound);
    te.Error += new FindexErrorEventHandler(ErrorFound);

    // Build objects to receive the output
    TextWriter OutputWriter = System.Console.Out;
    StringBuilder AllText = new StringBuilder();
    ArrayList alIndividualWords = new ArrayList();
    StringBuilder sbIndividualWords = new StringBuilder();
    Hashtable Fields = new Hashtable();

    // Open input stream and determine file type;
    // the using block closes the stream once parsing is done
    using (Stream InputStream = new FileStream(@"test.htm", FileMode.Open, FileAccess.Read))
    {
      URLtoParse.Type type = URLtoParse.Type.HTML;

      // Parse file; a negative return value is an error code
      int BytesParsed = te.Parse(type, InputStream, OutputWriter,
                                 AllText, alIndividualWords, sbIndividualWords, Fields);
      if (BytesParsed < 0)
        Console.WriteLine("Parse failed with code " + BytesParsed);
    }
  }

  // Word found event handler
  static void WordFound(object sender, FindexWordFoundEventArgs fwfe)
  {
    Console.WriteLine("Word found: " + fwfe.word);
  }

  // Error found event handler
  static void ErrorFound(object sender, FindexErrorEventArgs fee)
  {
    Console.WriteLine("Error: " + fee.msg);
  }
}

Please see the TextExtractTest code for examples of how to process the output.


Copyright and Licensing

All code, images and documentation are © Copyright 1998-2007 PHD Computer Consultants Ltd. The FindinSite.TextExtractor trial version finds a restricted amount of information in each file.

To purchase, please contact sales@phdcc.com, mentioning FindinSite.TextExtractor.

Let us know what features you want provided. This DLL may be provided as a component in future.

TextExtractorTest: version 1.4, 8 January 2007.


Last modified: 8 January 2007.