findinsite-ms version details

findinsite-ms ASP.NET application

Bugs: Wild card in field search fails

1.74	April 30, 2018	Fix: can now access TLS 1.2 secure sites
		Rebuilt in .NET 4
1.73	July 7, 2015	PDF: read xref correctly
1.72	February 26, 2014	Email: Set send email credentials better
1.71	April 10, 2012	Search API: Fields now getting through again
		Indexing: Cope with repeated Location headers
1.70	July 21, 2011	Indexing: PDF: Cope with unexpected name format
1.69	May 15, 2011	Indexing: Fix minor bug indexing PDFs
1.68	March 2, 2010	Indexing: Parse DOCX better so no broken words
		Web service: Correctly remove High and Low surrogate characters from returned XML
		Search databases: check every hour for updates and reload if necessary - see web farm information.
		Templates: nowrap removed from header template
		Indexing: Remove port when creating an automatic search database
		Indexing: For directory scans, cope if directory inaccessible
1.67	August 20, 2009	Highlighting: Remove various "if-" headers that cause highlighting to fail: "304 not modified" returned
		Highlighting: "ShowCredentials" parameter supported in Web.Config
		Indexing: Credentials now "Negotiate" to support Kerberos
		Dynamic database searching: don't show %DB_CREATION_DATE% if no main database loaded
		Languages: 13 European languages added to user interface
		Highlighting: works for Greek and Bulgarian text
		Indexing: now uses latest version of ICSharpCode.SharpZipLib for unzipping new office files
1.66	April 3, 2009	Indexing: Cope with Content-Location HTTP header that refers to the current URL
1.65	March 9, 2009	Indexing: PDF: Output anchor "page=n" for pages 1 to 31.
		Indexing: Check for cancel during directory find all files.
		About: better statistics list
		Log: log IP address and Robot name
		Log: log pages highlighted
		Log: XML encode message
		About: Keep count of robot searches separately
		Template: by default has meta robots nofollow and noindex
1.64	January 8, 2009	Indexing: PDF: Cope with unexpected too-large integer number
1.63	November 14, 2008	Indexing: PDF 1.5 format finally supported: cross reference streams and object streams
		Indexing: PDF Flate DecodeParms Predictor 12/Up supported
		Indexing: PDF \r\r line ends recognised
		Indexing: Content-Type HTTP header used to override file type
		Indexing: Use moved location of initial URL, eg follow "findinsite" to "findinsite/"
		Indexing: Use "Content-Location" to reduce duplicate URLs indexed, eg "findinsite/" is the same as "findinsite/default.htm"
		Indexing: Rules added
		Highlighting: base tag and (changed) header added at better position in web page
		Indexing email: server port added
		Search: Load database files more efficiently
		Templates: Consistently use %SEARCH_TEXT%, though %SEARCHTEXT% still supported
		Output and templates: updated to use better XHTML
		Output: default target supported
1.62	July 4, 2008	Indexing PDF: Finds endstream better
		Highlighting: Fixed bug highlighting URL with non-standard characters
		Search API: Snippet has search words highlighted using a SPAN with class hilite
		Indexing: FieldsToExclude advanced option added
1.61	April 17, 2008	Indexing: Credentials now supports Integrated Windows Authentication
		Indexing: Fixed bug when removing indexing from completed list
1.60	November 22, 2007	Compiled to run in ASP.NET 2.0+ web site
		Search: dynamic database searching supported
		Search: Highlighting of search words in results fixed for multiple subsets
		Search: Field searches fixed for multiple subsets
		Look and Feel: %DYNAMIC_DB% supported in header and footer
		Config: Searching section new option added "Dynamic database searching regular expression"
		Indexing: Cope with unusual BASE tag values
		Indexing: Cope with Moved Location even better
		Indexing: HardExclude advanced option added
		Startup: reallySetLanguages exception handled
		Indexing: PDF: Cope with format variant
		Emails: sent using ASP.NET 2.0+ method
		Config: Indexing From Email Password is a 'password' type input field
		Search: cope with bad URL parameters better
1.51	May 8, 2007	Indexing: XLS and PPT: TextExtractor call bug fixed
		Indexing: PDF and XLS: Floating point numbers identified correctly on non-English computers
1.50	December 18, 2006	Indexing: Algorithm changed to reduce memory requirement
		Indexing: HTML: Cope with just 'text/html' and 'text-html' charsets
		Indexing: PDF: indexing speed-ups
		Indexing: PDF: Only report unrecognised encoding `/Identity-H` if PDF_ReportCharacterDecodeProblems set
		Indexing: PDF: UnicodeEncoding bug fix
		Indexing: Image: Find XMP (Extensible Metadata Platform) meta-data, eg Vista Tags
		Indexing: Cope with (ie ignore) read errors
		Indexing: Cope with include/exclude/robots after HTTP redirect
		Indexing: Robots not case significant
		Indexing: Pause every 100 files for 0.1 second
		Indexing: Don't write fields or anchors if file not being indexed
		Control Panel: Memory, searches and indexings counts since restart listed on About page
1.21	November 22, 2006	Indexing: Word 2007 DOCX/DOCM files supported
		Indexing: Excel 2007 XLSX/XLSM files supported
		Indexing: PowerPoint 2007 PPTX/PPTM files supported
1.20	September 21, 2006	Indexing: Ignore <?xml...> in web pages
		Indexing: BASE tag supported
		Config: Load template files in UTF-8
		Highlight: Find charset more flexibly
		Highlight: Fix bug if search word found in header
		Language: Thai language supported
		Search: Fix bug if space searched for
1.19	July 4, 2006	Indexing: Excel XLS file indexing - minor improvements
		Indexing: Sections of web pages can be excluded using GoogleOn/Off and FindinSiteOn/Off comments
		Indexing: URL recursion stopped using MaxURLLength, with default of 1024.
		Look and Feel: displayError template supported in finderror.htt
		General: FindinSite image returned accurately
1.18	March 23, 2006	Indexing: Excel XLS file indexing and searching supported
1.17	February 13, 2006	Language: "Languages to Use" option added to Look and feel Control Panel
		Language: Language and text direction forced to English for config page heading
		Email: SMTP Mail Host option provided on Indexing config page
		Email: SMTP send basic authentication password support (can be stored in Web.Config appSettings)
1.16	October 28, 2005	Indexing: Publisher PUB file indexing now supported
		Language: Norwegian language file added
		General: Logo and web site change and rename to findinsite-ms
		General: Bug fix: Disregard include in template variable substitutions
		General: Improved results sorting
1.15	July 29, 2005	Language: Bug fix: non-Western characters identified correctly
1.14	July 28, 2005	Language: Arabic (العربية) user interface added (thanks to Lubna Sorour)
		Language: Arabic words now delimited by spaces etc
		Language: Arabic character versions handled better (ا ى ه و)
		Language: Arabic 'the' (ال) at start of word handled correctly
		Language: Arabic search for 'the' by itself ignored
		Language: Language files now assumed to be in UTF-8
		Language: Right-to-left (RTL) languages supported using %L_HTML_TAG%, %L_BODY_TAG% and %L_ALIGN_TAG% strings in templates
		Language: findinsite-ms version date localised
		Language: Slovenian (Slovenščina) user interface added (thanks to Luka Malenšek)
		Indexing: If `Content-Type` HTTP header specifies HTML charset, use this and ignore META charset.
		Indexing: Try to determine HTML charset from META charset before main parse.
		Highlighting: Bug fix: pages starting with UTF-8 marker bytes incorrectly recognised
1.13	June 29, 2005	Output: Extra linefeeds removed from around Included file content
		Output: Included files only sent form data if included file is an .aspx
		Highlight: "highlighted by" footer removed because it was not shown in the correct position by Firefox on some sites
		Installation: bin dll library files renamed with `phdcc.fis.` prefix - be careful to delete old DLLs before installing new ones
		Search API: Remaining result line variables made available
1.12	May 27, 2005	Search API: Highlight URL returned now works with Firefox
		Indexing: First suggested filename doesn't have 1 appended
		Indexing: Results email includes URL, File or Directory
		Indexing: Search db description not saved if indexing run edited
		Indexing: Report better error if image file has zero length
		Search: Bug fix: crash if search db not loaded successfully
		Search: Remove ? from end of search if question asked, ie if more than 1 word
		Config: cope better if existing search db corrupted
		Config: better on-page JavaScript handling for create new indexing
		Output: Site(s) being searched added to default template using %L_SITE% and %SITES%
1.11	May 19, 2005	Config: Very first control panel has easy option to make index and search
		Indexing: For charset "text/html;" assume ISO 8859-1
		Indexing: Unrecognised robots tags ignored
		Indexing: redirect out of directory handled better
		Indexing: .php added to default HTML file types
		Highlight: content-type checked better, so aspx pages work
		Highlight: works for sites that use Transfer-Encoding in response header
		Search: cope with apostrophes better
1.10	April 15, 2005	Indexing: Username/password supported using new Credentials advanced option (basic/digest credentials supported)
		Output: Various speed ups
		Output: %L_APPNAME% not made HTML-safe
1.9	April 14, 2005	Indexing: PDF and TXT indexing speed increased
		Indexing: Abort mid-file implemented
		Indexing: Bug fixed: slowness if AbstractWords set to 0
		Indexing: Redirections off-site not reported as errors
		Indexing: Minor DOC parsing fixes
1.8	April 2, 2005	Indexing: Bug fixed: page redirection timeout
1.7	April 1, 2005	Indexing: Bug fixed: page redirection
1.6	April 1, 2005	Output: Results list has snippet excerpts from each page, with search words highlighted
		Output: Default template redesign
		Output: Styles used in many generated HTML elements
		Output: New results variables supported: file size, date, date-indexed, word-count, etc
		Output: More output dates localised
		Output: New language file strings supported
		Indexing: More information stored for each indexed file
		Indexing: If file fails Include or Exclude then it is still spidered and links followed
		Indexing: UserAgent and ObeyRobots advanced options added
		Indexing: <br> not added to abstract at line breaks
		Indexing: web errors made more concise: no stack trace
		Indexing: AbstractWords now defaults to 0, ie abstract not obtained from first words of file
		Highlight: Bug fixed: highlight fails for search of *
		Highlight: Copes with bad HTML better
		Config: Bug fixed: Pages now counted correctly when db removed
1.5 (5.4)	March 3, 2005	Indexing: `Crawl-Delay` throttle implemented for `robots.txt`
1.4 (5.4)	February 22, 2005	Indexing: Page redirect works better
		Highlight: Bug fixed: does not pass on "accept-encoding" header
		Output: Last run output for indexing in progress has better message
		Output: Default result logo updated
		Licensing: All starts logged at phdcc.com
1.3 (5.4)	February 10, 2005	Indexing: robots.txt supported
		Indexing: Cookies maintained throughout each indexing run, saving session state
		System: Fix initialise security exception on some shared hosts
1.2 (5.4)	February 2, 2005	Highlight: Highlight of hits in HTML pages; highlight configuration options added
		API: Search API updated to add HighlightURL to each returned result
		API: Search API bug fixed: GetFieldNames() causes exception if no fields available
		Indexing: `FindInSiteBot` user-agent HTTP header added to indexer, referring to robots bot page
		Indexing: Various PDF indexing fixes
		Indexing: REL="nofollow" supported in A tags
1.1 (5.4)	January 4, 2005	Release
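
Example: the robots.txt support (1.3) and `Crawl-Delay` throttle (1.5) listed above can be pictured with the following minimal Python sketch. It is not findinsite-ms code (findinsite-ms is an ASP.NET application); it only illustrates the general technique using Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs and the helper names load_rules and plan_fetch are assumptions made up for illustration; only the FindInSiteBot user-agent name comes from the list above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: FindInSiteBot
Crawl-delay: 2
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def plan_fetch(parser, user_agent, url):
    """Return (allowed, delay_seconds) for one URL and user agent."""
    allowed = parser.can_fetch(user_agent, url)
    # crawl_delay() returns None when no Crawl-delay applies; treat that as 0.
    delay = parser.crawl_delay(user_agent) or 0
    return allowed, delay


rules = load_rules(ROBOTS_TXT)
for url in ("http://example.com/index.htm", "http://example.com/private/a.htm"):
    allowed, delay = plan_fetch(rules, "FindInSiteBot", url)
    print(url, "allowed" if allowed else "blocked", "delay", delay)
```

An indexer built along these lines would skip blocked URLs and sleep for the returned delay between requests to the same site.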

Possible problems

Email: Note that not all hosts support email from ASP.NET programs.
Highlight: In a very small number of cases, pages do not show correctly when findinsite-ms highlights across domains.
Indexing: Running Visual Studio.NET may cause findinsite-ms to hang while indexing. Technical: when a page has been redirected, a hang sometimes occurs when the HttpWebResponse Close() method is called.

Known bugs

If you edit an indexing run that has not been run, the indexing information is lost
https: access may go wrong

Possible improvements

Search: display specific pages for specific searches
Install: .msi installer
Indexing: Handle larger sites
Indexing: Indexing depth option
Indexing: Split include into include_filter and include_index. The same for exclude.
Indexing: Configurable start and end abstract indicators
Indexing: Provide a site report, eg pages that need a META DESCRIPTION added, pages whose META DESCRIPTION is all the same, etc
Indexing: Show indexing problem count as indexing is running
Indexing: support robots noarchive
Indexing: Last run output in a more helpful format, eg CSV or XML
Log: Option to log hits shown using highlighter
Output: Provide better information when cross-site scripting parameter attacks stopped, ie provide better response for < > etc
Output: Indicate file type for each result eg [PDF]
Output: Provide user with option to show more details for each link, eg %ALLTEXT%
Output: Somehow make HTML optional, eg do not include following <br> if abstract empty
Output: Provide parameters for %SNIPPET%
Output: Make complete set of template variables available as Include file variables
Highlight: header of highlighting page: could add found X words on page, to refine search, etc
Highlight: provide separate colours for each word
User: Results per page option for user
User: Advanced search: results design option for each user, ie choose which elements to show - stored in cookie
General: Image db containing thumbnails
General: Support for browsers with cookies turned off
General: Provide a DotNetNuke DNN module to interface to an external instance of findinsite-ms