Simplifying search using Elasticsearch and understanding search relevancy

tanmay
8 min read · Aug 18, 2019
elastic search — searching simplified

Introduction

Assume you have a customer information system or a customer on-boarding system. Its most important functionality would be search.

  • The on-boarding system needs to check whether a customer with the same name already exists.
  • The customer information system needs to search by name.

This article shows how Elasticsearch makes searching easier. It covers some interesting features of Elasticsearch:

  • Searching documents (made of text) with the input words in any order (a.k.a. full text search).
  • Searching documents whose spelling varies slightly from the input (a.k.a. edit distance search).
  • Searching documents that sound similar to the input (a.k.a. phonetic search).

The subsequent article will cover search relevancy using TF-IDF.

Use case

Imagine a call centre application. A customer calls in. The customer service representative asks for customer identification data. Most of the time we give our mobile number, email id or some such unique id. Assume that the customer is able to give only a name. Our customer database looks like this:

Customer database

The customer gives his name as Mr Waugh. Assume that the customer service representative does not know whether Waugh is a first name or a last name (Steve Waugh is a famous cricketer, so ideally the customer service representative would know 😃).

If this customer database were stored in an RDBMS (a typical Oracle or Postgres), you would have written a query like this in your backend service:

select * from customers where firstName like '%inputName%' or lastName like '%inputName%' or middleName like '%inputName%';

Here inputName would be waugh.

We want to simplify this query at the backend, so as part of our database design we add one more column (fullName). When a new customer is added, we compute the full name of the customer and store it along with the other attributes. It looks like this now:

Customer database with fullName column for easier querying .

Our query would look like this:

select * from customers where fullName like '%inputName%';

Happy path

This solves a simple problem: now if you enter either steve or waugh, the query works and gives us the desired output.

Unhappy path

However, if the user enters inputName as waugh steve, the query on fullName returns no results, and steve waugh does not find Steve Rodger Waugh either, because like only matches the input as one contiguous substring.
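The unhappy path can be reproduced with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for Oracle or Postgres (an assumption made only so the example runs anywhere; note that SQLite's LIKE ignores ASCII case, which the other databases do not do by default). The table holds only the fullName column, and search_like is a hypothetical helper name.

```python
import sqlite3

# In-memory table mirroring the fullName column of the customer table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (fullName TEXT)")
conn.executemany(
    "INSERT INTO customers (fullName) VALUES (?)",
    [("Steve Rodger Waugh",), ("Mark Rodger Waugh",), ("Austin steve Waugh",)],
)


def search_like(input_name):
    """Return full names that contain input_name as one contiguous substring."""
    rows = conn.execute(
        "SELECT fullName FROM customers WHERE fullName LIKE ? ORDER BY fullName",
        (f"%{input_name}%",),
    )
    return [row[0] for row in rows]


print(search_like("waugh"))        # happy path: a single word finds every Waugh
print(search_like("waugh steve"))  # unhappy path: reversed word order finds nothing
print(search_like("steve waugh"))  # the middle name hides Steve Rodger Waugh
```

The pattern is compared as a single string, so both the word order and anything between the words have to line up exactly with the stored value.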

Elasticsearch to the rescue. Let us try this scenario with Elasticsearch. We will create an index named customer, which imitates the customer table in the RDBMS. We will use the REST API of Elasticsearch to create the index. First we create the index (customer) and the type (customer).

PUT localhost:9200/customer

Body:
{
  "mappings": {
    "customer": {
      "properties": {
        "firstName": {
          "type": "text"
        },
        "middleName": {
          "type": "text"
        },
        "lastName": {
          "type": "text"
        },
        "fullName": {
          "type": "text"
        }
      }
    }
  }
}

You will get a success response.

{
"acknowledged": true,
"shards_acknowledged": true,
"index": "customer"
}

Data type — text

In the index definition above, note the data type of the fields. It is called “text”. This is very significant. For such fields, Elasticsearch tokenizes the data into individual words and searches against those words. From the Elasticsearch documentation:

A field to index full-text values, such as the body of an email or the description of a product. These fields are analyzed, that is they are passed through an analyzer to convert the string into a list of individual terms before being indexed. The analysis process allows Elasticsearch to search for individual words within each full text field.

Basically, it allows the user to do full text search within the field. If you want the equivalent of a relational database column, which is matched only as a whole value, the field would be of type keyword.
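The difference between the two field types can be pictured with a rough sketch. The analyze function below is only a stand-in for a real analyzer (it lowercases and splits on anything that is not an ASCII letter or digit), and the two match helpers are hypothetical names that imitate the behaviour described above; none of this is Elasticsearch code.

```python
import re


def analyze(text):
    """Rough stand-in for an analyzer: lowercase, then split into terms."""
    return [term for term in re.split(r"[^0-9a-z]+", text.lower()) if term]


def text_field_matches(stored, query):
    """'text' behaviour: a hit if any analyzed query term is a stored term."""
    return bool(set(analyze(stored)) & set(analyze(query)))


def keyword_field_matches(stored, query):
    """'keyword' behaviour: the whole value has to be identical."""
    return stored == query


full_name = "Steve Rodger Waugh"
print(analyze(full_name))                                        # the individual terms
print(text_field_matches(full_name, "waugh"))                    # one word is enough
print(keyword_field_matches(full_name, "waugh"))                 # partial value: no match
print(keyword_field_matches(full_name, "Steve Rodger Waugh"))    # exact value: match
```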

Let us create the same records in the customer index as shown in the table above. We will again use the REST API of Elasticsearch.

PUT localhost:9200/customer/customer/1

Body:
{
  "firstName": "Steve",
  "middleName": "Rodger",
  "lastName": "Waugh",
  "fullName": "Steve Rodger Waugh"
}

The success response from the API would be as follows:

{
"_index": "customer",
"_type": "customer",
"_id": "1",
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"created": true
}

Let us create the 2nd and 3rd documents in a similar fashion.

We can fetch all documents from the index. We should get all 3 records (or documents) for the customer index.

GET localhost:9200/customer/customer/_search

Response:
{
"took": 67,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1,
"hits": [
{
"_index": "customer",
"_type": "customer",
"_id": "2",
"_score": 1,
"_source": {
"firstName": "Mark",
"middleName": "Rodger",
"lastName": "Waugh",
"fullName": "Mark Rodger Waugh"
}
},
{
"_index": "customer",
"_type": "customer",
"_id": "1",
"_score": 1,
"_source": {
"firstName": "Steve",
"middleName": "Rodger",
"lastName": "Waugh",
"fullName": "Steve Rodger Waugh"
}
},
{
"_index": "customer",
"_type": "customer",
"_id": "3",
"_score": 1,
"_source": {
"firstName": "Austin",
"middleName": "Steve",
"lastName": "Waugh",
"fullName": "Austin steve Waugh"
}
}
]
}
}

Search

Let us test the unhappy path now. In the fullName field the data is stored as “Steve Rodger Waugh”, and we will search with Waugh Steve. So we are giving the words in reverse order, and the name is incomplete.

GET localhost:9200/customer/customer/_search

Body:
{
  "query": {
    "match": {
      "fullName": "Waugh Steve"
    }
  }
}

We get results here, all 3 records. Note that the most relevant records are shown at the top: documents 1 and 3 contain both words and score higher than document 2, which contains only Waugh. The full text search capability of Elasticsearch lets us search for individual words, in any order, within the full text.

{
"took": 26,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.5063205,
"hits": [
{
"_index": "customer",
"_type": "customer",
"_id": "1",
"_score": 0.5063205,
"_source": {
"firstName": "Steve",
"middleName": "Rodger",
"lastName": "Waugh",
"fullName": "Steve Rodger Waugh"
}
},
{
"_index": "customer",
"_type": "customer",
"_id": "3",
"_score": 0.5063205,
"_source": {
"firstName": "Austin",
"middleName": "Steve",
"lastName": "Waugh",
"fullName": "Austin steve Waugh"
}
},
{
"_index": "customer",
"_type": "customer",
"_id": "2",
"_score": 0.25316024,
"_source": {
"firstName": "Mark",
"middleName": "Rodger",
"lastName": "Waugh",
"fullName": "Mark Rodger Waugh"
}
}
]
}
}
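Why the word order stops mattering can be sketched with a toy inverted index. This is a simplified illustration and not how Elasticsearch is implemented: the tokenizer just lowercases and splits on whitespace, and the ranking simply counts matched terms, which is not the real relevance formula. The names documents, inverted_index and match are hypothetical.

```python
from collections import defaultdict

# The three fullName values from the customer index, keyed by document id.
documents = {
    "1": "Steve Rodger Waugh",
    "2": "Mark Rodger Waugh",
    "3": "Austin steve Waugh",
}


def tokenize(text):
    """Simplified analysis: lowercase and split on whitespace."""
    return text.lower().split()


# Inverted index: each term points to the ids of the documents containing it.
inverted_index = defaultdict(set)
for doc_id, full_name in documents.items():
    for term in tokenize(full_name):
        inverted_index[term].add(doc_id)


def match(query):
    """Return (doc_id, matched_term_count) pairs, best match first."""
    counts = defaultdict(int)
    for term in set(tokenize(query)):
        for doc_id in inverted_index.get(term, ()):
            counts[doc_id] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


print(match("Waugh Steve"))  # [('1', 2), ('3', 2), ('2', 1)]
```

Each query word is looked up on its own, so "Waugh Steve" and "Steve Waugh" hit exactly the same entries, and documents that contain more of the words come first.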

Spelling mistakes and fuzzy search

Elasticsearch is very adept at performing fuzzy searches. If the name input is wrong by a letter or two, Elasticsearch will still give results. For this, the query relies on the Levenshtein (edit) distance.

A simple example of the same: let us add a new customer with the name ‘Suraj Sharma’.

PUT localhost:9200/customer/customer/5

Body:
{
  "firstName": "Suraj",
  "lastName": "Sharma",
  "fullName": "Suraj Sharma"
}

Now try to search for the name with a slightly different spelling. Instead of Suraj, we will search with the first name Sooraj. This is a real-world scenario, as different people spell names differently. Note also that we are not searching fullName with the string “sooraj sharma”, as that would obviously return the record through the word sharma (remember that Elasticsearch tokenizes the field internally and runs the query against individual words).

GET localhost:9200/customer/customer/_search

Body:
{
  "query": {
    "match": {
      "firstName": {
        "query": "Sooraj"
      }
    }
  }
}

Result:
{
"took": 12,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}

The result is expected: Sooraj and Suraj did not match, hence the query did not fetch any data. Here comes the fuzzy query, which is able to fetch Suraj Sharma in spite of the spelling difference. Note the fuzziness parameter.

GET localhost:9200/customer/customer/_search

Body:
{
  "query": {
    "match": {
      "firstName": {
        "query": "Sooraj",
        "fuzziness": "2"
      }
    }
  }
}

Result:
{
"took": 15,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.17260925,
"hits": [
{
"_index": "customer",
"_type": "customer",
"_id": "5",
"_score": 0.17260925,
"_source": {
"firstName": "Suraj",
"lastName": "Sharma",
"fullName": "Suraj Sharma"
}
}
]
}
}
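From a backend service, the same fuzzy request could be sent along these lines. This is a minimal sketch using only the Python standard library; it assumes Elasticsearch is reachable at the localhost:9200 address used in this article, and build_match_query and search are hypothetical helper names, not part of any Elasticsearch client.

```python
import json
import urllib.request

# Search endpoint used throughout this article (assumed to be running locally).
SEARCH_URL = "http://localhost:9200/customer/customer/_search"


def build_match_query(field, text, fuzziness=None):
    """Build a match query body, optionally with a fuzziness setting."""
    options = {"query": text}
    if fuzziness is not None:
        options["fuzziness"] = fuzziness
    return {"query": {"match": {field: options}}}


def search(body, url=SEARCH_URL):
    """POST the query body as JSON and return the parsed response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


fuzzy_body = build_match_query("firstName", "Sooraj", fuzziness="2")
print(json.dumps(fuzzy_body, indent=2))
# With a running node, the hits would be read like this:
# hits = search(fuzzy_body)["hits"]["hits"]
```

Keeping the body builder separate from the HTTP call makes it easy to switch between the exact and the fuzzy variant of the query.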

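Another way of writing the same query is the ~ fuzzy operator. It is part of the query_string syntax, so the query type changes from match to query_string.

GET localhost:9200/customer/customer/_search

Body:
{
  "query": {
    "query_string": {
      "default_field": "firstName",
      "query": "Sooraj~"
    }
  }
}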

Edit distance

Given 2 strings, the edit distance is the number of single-character changes (insertions, deletions or substitutions) required to transform one string into the other.

Let us take the example of Sooraj and Suraj. They are within an edit distance of 2 of each other (replace the first o with u, then delete the second o), hence Elasticsearch was able to find the record.

2 edits can transform Sooraj into Suraj
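The count of 2 can be checked by hand with a small sketch of the classic Levenshtein dynamic programming algorithm. This is a plain textbook version written for illustration, not the optimized implementation inside Elasticsearch, and edit_distance is a hypothetical function name.

```python
def edit_distance(source, target):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,                      # delete source_char
                current[j - 1] + 1,                   # insert target_char
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(edit_distance("sooraj", "suraj"))    # 2
print(edit_distance("stephen", "steven"))  # 2
print(edit_distance("geeta", "geetha"))    # 1
```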

Phonetic searches

Many times two names sound similar although their spellings differ. A classic example: in the southern part of India, names often carry an extra H.

  • A name can be spelt Aditi in some parts of India and Adithi in other parts.
  • A name can be spelt Geeta in some places and Geetha in others.
  • A name can be spelt Stephen in some places and Steven in others.

In some of these cases we can still use the edit distance algorithm, but it has its own limitation: fuzzy matching in Elasticsearch supports an edit distance of at most 2.

Elasticsearch provides a special construct for this: a phonetic token filter that can be attached to a field of an index through an analyzer. The filter comes from the Phonetic Analysis plugin (analysis-phonetic), which has to be installed first. For this example, we will create a simple employee index with a phonetic filter attached to its only field, firstName.

Note also the use of the double metaphone algorithm for phonetic matching.

PUT localhost:9200/employee

Body:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "phonetic_analyzer": {
            "tokenizer": "standard",
            "filter": ["lowercase", "phonetic_filter"]
          }
        },
        "filter": {
          "phonetic_filter": {
            "type": "phonetic",
            "replace": false,
            "encoder": "double_metaphone"
          }
        }
      }
    }
  },
  "mappings": {
    "employee": {
      "properties": {
        "firstName": {
          "type": "text",
          "analyzer": "phonetic_analyzer"
        }
      }
    }
  }
}

Let us create an employee record with the first name “stephen”.

PUT localhost:9200/employee/employee/1

Body:
{
  "firstName": "stephen"
}

Now let us query the record with a slightly different spelling of the same-sounding name (phonetically similar).

GET localhost:9200/employee/employee/_search

Body:
{
  "query": {
    "match": {
      "firstName": "steven"
    }
  }
}

You will get the matching record, as steven and stephen are phonetically similar.

Result:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.36165747,
"hits": [
{
"_index": "employee",
"_type": "employee",
"_id": "1",
"_score": 0.36165747,
"_source": {
"firstName": "stephen"
}
}
]
}
}
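To get a feel for how a phonetic encoder makes steven and stephen collide, here is a small sketch of Soundex. Please note that this is a different and much simpler algorithm than the double metaphone encoder configured above; it is swapped in only because it fits in a few lines. The soundex function is a hypothetical helper that follows the commonly described American Soundex rules.

```python
# Consonant groups that share a Soundex digit; vowels, h, w and y have no digit.
SOUNDEX_CODES = {}
for group, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                     ("l", "4"), ("mn", "5"), ("r", "6")]:
    for letter in group:
        SOUNDEX_CODES[letter] = digit


def soundex(name):
    """Return the 4-character Soundex code of a name (ASCII letters only)."""
    letters = [ch for ch in name.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    encoded = letters[0].upper()
    last_code = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of identical codes
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != last_code:
            encoded += code
        last_code = code  # a vowel resets the run, so a repeated code counts again
    return (encoded + "000")[:4]


for first, second in [("Stephen", "Steven"), ("Aditi", "Adithi"), ("Geeta", "Geetha")]:
    print(first, soundex(first), second, soundex(second))
```

Both spellings in each pair reduce to the same code, which is the idea a phonetic filter relies on: names are indexed and searched by their sound codes, so any spelling with the same code matches.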

Summary

In this article, we looked at the text matching capabilities of Elasticsearch.

We looked at full text search, edit distance search and phonetic search.

Next

In the next article, we will understand search relevancy and the TF-IDF algorithm, hand compute TF-IDF values for a few sentences and match them against the scikit-learn implementation.


tanmay

Interests: software design, architecture, search, open banking, machine learning, mobility