Tuesday, 21 June 2011

Can't we make it a little more Google?

As I work for a Library Management software vendor, it is perhaps not surprising that I have somewhat of an interest in search technologies.  However, what I believe is surprising is the lack of interest many software packages seem to have in this technology.  In the "Information Age", search technology is perhaps the king of all technologies.

Search technology has grown up a lot in the last few years.  Perhaps Google's dominance, and the wealth it has earned as the market leader in this technology, has elevated the lowly search's status, but quite simply I wonder why it was neglected for so long.

I remember a time of using multiple web search engines.  Webcrawler was my early favourite, but Altavista soon became superior, then it was a 50/50 battle between Altavista and Yahoo, and then all of a sudden Google came from nowhere and won.  Anyway, early searches tended to follow the wildcard boolean search paradigm, i.e. you had to be very specific: a search for "child" would not return any results for "children", for example.

This exact-matching, wildcard-based search is sometimes helpful if you know the data you are searching and are trying to find something you know exists.  To some extent it also requires an understanding of boolean logic, and I have found that even trained librarians can struggle with the difference between AND and OR.

Now in the early days of the internet this was not too bad: you could do a simple wildcard search and, due to the limited material available, trawl through each and every link to find relevant information.  However, with the internet explosion even some of the most isolated niches simply have too much data to sift through to find the information you are attempting to retrieve.

This is where relevance ranking becomes of such great value.

Relevance ranking is what allows search engines such as Google to provide such simple search interfaces and still return useful, relevant data.  More importantly, it allows you to search through data and find items without needing to know whether the data exists, with reasonable confidence that if you do not find anything then there is nothing there to find.

If I take a simple contrived example, the "old style" exact matching search could lead you to believe a record is not there more easily than a relevance ranked search.  If a person wanted to find the book "Harry Potter and the Deathly Hallows", with the older search methodology they could mistakenly search for "harry potter and the dead hallows", not find a result and perhaps assume that the item does not exist.  To counteract this problem the older methodology would perhaps use drop words, removing terms from the search until something matches.  However, if we imagine that this search were to trigger a couple of drop words, you could easily be swamped by all Harry Potter related material and have to sift through a lot of relatively irrelevant items.

Relevance ranking gets round this problem by essentially introducing ORs between each of the words.  This will return dramatically more results than the non-relevance ranked search, but as the title Harry Potter and the Deathly Hallows contains a higher percentage of the searched terms within its record, it will have a higher relevance rank than the other records.  So even if you end up returning every book in the library with your search, you will have the most important one at the top.
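
To make the idea concrete, here is a minimal sketch in Python of OR matching followed by ranking on the fraction of searched terms found in each title.  The catalogue, the function names and the scoring formula are all assumptions made up for this illustration; real engines use more refined scoring (term weighting, stemming and so on), so treat this as a toy rather than how any particular product works.

```python
def tokenize(text):
    """Lower-case a string and split it into words on whitespace."""
    return text.lower().split()


def rank(query, titles):
    """OR-match every query word, then sort by fraction of query words matched."""
    query_terms = set(tokenize(query))
    scored = []
    for title in titles:
        matched = query_terms & set(tokenize(title))
        if matched:  # OR semantics: one shared word is enough to be returned
            scored.append((len(matched) / len(query_terms), title))
    # Highest score first; ties broken alphabetically for a stable order
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored


# Hypothetical catalogue, invented for this illustration
CATALOGUE = [
    "Harry Potter and the Deathly Hallows",
    "Harry Potter and the Goblet of Fire",
    "The Dead Zone",
    "Gardening for Beginners",
]

QUERY = "harry potter and the dead hallows"

# "Old style" exact matching: the mistyped title finds nothing at all
exact_hits = [t for t in CATALOGUE if t.lower() == QUERY]

# Relevance ranking: more rows come back, but the intended title is first
ranked = rank(QUERY, CATALOGUE)

if __name__ == "__main__":
    print("exact hits:", exact_hits)
    for score, title in ranked:
        print(f"{score:.2f}  {title}")
```

With this sample data the exact match returns nothing, while the ranked search returns three of the four titles with "Harry Potter and the Deathly Hallows" at the top, because it shares five of the six searched words.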

Another aspect that is great about relevance ranking is word stemming / breaking.  The integrated full text searching in MS SQL 2008, with word breakers in multiple languages, is in my opinion its most important new feature.  OK, you could point to Transparent Data Encryption, but for me the text searching means that any application based on SQL 2008 no longer has an excuse to have a poor search mechanism.

I was recently dealing with a CRM solution that for some reason has a very poor search mechanism.  Data entry in CRMs is going to be relatively inaccurate: the people using it are doing so as a subsection of their job, and they are not qualified catalogue librarians with OCD cataloguing skills.  A contact could be registered as Jon, John, Jonathan or Johnathan; Saint Francis Primary School could be entered as St., St or Saint, and to make it even harder it could be referred to by some people as Primary School and by others as First School.  These inconsistencies can range from a minor irritation to, in the worst case, an incorrect denial of customer service.

Now in my situation I made a very simple change: I introduced wildcard searches on all terms and provided some limited training advice on how best to search.  So the SQL driven searches went from

account.name like 'St Francis' 

to

account.name like 'St% Francis%'

Now this change does introduce a longer processing time, but in the case of this application it moved the page request speed from 0.128 secs to 0.188 secs.  Yes, significantly slower, but completely acceptable in the situation it was employed and unnoticeable to the staff using the application.

Additionally, it could be considered a problem that it returns more results, some of them incorrect.  Again this was completely acceptable, as it significantly reduced the number of 0 result searches, it rarely increased the size of a result set significantly, and most searches returned fewer than 10 results.

What it did allow, though, was some really fast searching.  Before, you had to type John Smith to return all John Smiths, and as I stated this might not even include the person you were looking for.  Well, now you can search for "j smi" and return 12 results.  Much less typing and a better result set.
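
As a rough sketch of the automatic wildcard insertion described above, the Python below appends a % to every term the user types and runs the resulting LIKE against an in-memory SQLite table.  The table contents, the function names and the use of SQLite are assumptions for the sake of a runnable example; the CRM in question was not SQLite, and LIKE case sensitivity differs between databases (SQLite's LIKE ignores case for ASCII letters by default).

```python
import sqlite3


def to_like_pattern(user_input):
    """Append a % wildcard to every whitespace-separated term."""
    return " ".join(term + "%" for term in user_input.split())


# Hypothetical sample accounts, invented for this illustration
SAMPLE_ACCOUNTS = [
    ("John Smith",),
    ("Jonathan Smithson",),
    ("Jane Smiley",),
    ("Joan Brown",),
    ("St Francis Primary School",),
    ("St. Francis First School",),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT)")
conn.executemany("INSERT INTO account (name) VALUES (?)", SAMPLE_ACCOUNTS)


def search(user_input):
    """Return account names matching the wildcarded search, sorted by name."""
    pattern = to_like_pattern(user_input)
    # The pattern is bound as a parameter, not concatenated into the SQL
    rows = conn.execute(
        "SELECT name FROM account WHERE name LIKE ? ORDER BY name",
        (pattern,),
    )
    return [name for (name,) in rows]


if __name__ == "__main__":
    print(to_like_pattern("St Francis"))
    print(search("j smi"))
    print(search("St Francis"))
```

Against this sample data "j smi" finds the three Smith-like accounts, and "St Francis" finds both the "St" and "St." spellings of the school.  It would still miss "Saint Francis", which is where thesaurus terms in a proper full text solution earn their keep.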

Too many applications do not even automatically insert wildcards, which under most circumstances would help the situation.  If you are upset about the string handling inefficiency this introduces, then perhaps you should implement a proper full text solution with word stemming / breaking, thesaurus terms, spell checking, phrase checking and relevance ranking.  There are plenty of tools to do this; yes, some are expensive, but as Lucene is open source you really have no excuse.

Introduce relevance ranking and so many UI problems go away; introduce faceting and you are even further down the road to providing your users with the information they want, easily and quickly.
