Saturday, February 15, 2014

Document boosting in Elasticsearch

There has been some discussion on the Elasticsearch mailing list lately about applying index-time boosts at the document level, aka document boosting. The practice has been discouraged (in fact, the Elasticsearch team has officially deprecated document boosting) in favor of query-time scoring, but without any detailed explanation why. Instead of repeating myself each time the question is asked, I have decided to detail the various reasons why document boosts should no longer be used.

First, let us index some simple sample data with a single analyzed text field and a document boost field. Additional dummy documents have been added to further influence the TF-IDF algorithm and affect the scoring. For all queries we will enable explanations in order to see the effects of the boosts (omitted from the displayed queries for brevity).

 # Initial mapping  
 curl -XPOST localhost:9200/boosts -d '  
 {  
  "settings": {  
   "number_of_shards": 1,  
   "number_of_replicas": 0  
  },  
  "mappings": {  
   "doc": {  
    "_boost": {  
     "name": "boost",  
     "null_value": 1.0  
    },  
    "properties": {  
     "text": {  
      "type": "string"  
     }  
    }  
   }  
  }  
 }  
 '  

 # Documents  
 curl -XPUT 'http://localhost:9200/boosts/doc/1' -d '{  
   "text" : "The quick brown fox jumped over the lazy dog",  
   "boost" : 100  
 }'  
 curl -XPUT 'http://localhost:9200/boosts/doc/2' -d '{  
   "text" : "The quick black fox leaped over the sleeping dog"  
 }'  
 curl -XPUT 'http://localhost:9200/boosts/doc/3' -d '{  
   "text" : "Mr. Fox",  
   "boost" : 50  
 }'  
 curl -XPUT 'http://localhost:9200/boosts/doc/4' -d '{  
   "text" : "Dr. Dog",  
   "boost" : 100  
 }'  

Issue #1: Hard to read

Let us execute a simple text match query for the term 'quick'. Only the first two documents should match with their respective boost values of 100 and 1 (the default).

 {  
   "query": {  
     "match": {  
       "text": "quick"     
     }  
   }  
 }  

Without going into the details of the explanation, let us look at overall scores. The first two documents match as expected, but the first document has a score that is only 85 times the score of the second, although the boost should be 100. Let us see where the difference is in the explanation:

Truncated for brevity

 {  
  "value": 70.52713,  
  "details": [  
   {  
    "value": 1,  
    "description": "tf(freq=1.0), with freq of:",  
    "details": [  
     {  
      "value": 1,  
      "description": "termFreq=1.0"  
     }  
    ]  
   },  
   {  
    "value": 2.2039728,  
    "description": "idf(docFreq=2, maxDocs=10)"  
   },  
   {  
    "value": 32,  
    "description": "fieldNorm(doc=0)"  
   }  
  ]  
 }  

 {  
  "value": 0.8264898,  
  "details": [  
   {  
    "value": 0.8264898,  
    "description": "fieldWeight in 1, product of:",  
    "details": [  
     {  
      "value": 1,  
      "description": "tf(freq=1.0), with freq of:",  
      "details": [  
       {  
        "value": 1,  
        "description": "termFreq=1.0"  
       }  
      ]  
     },  
     {  
      "value": 2.2039728,  
      "description": "idf(docFreq=2, maxDocs=10)"  
     },  
     {  
      "value": 0.375,  
      "description": "fieldNorm(doc=1)"  
     }  
    ]  
   }  
  ]  
 }  


Both have a term frequency (tf) of 1 and the same inverse document frequency (idf) of 2.2039728. The sole difference lies in the fieldNorm. In our example, a boost value of 100 equals a field norm of 32 for the first document, while the default boost of 1 for the second document results in a field norm of 0.375. Not only is the multiple "only" around 85, but the effect of the boost is hard to calculate. When fine-tuning your documents and queries for better search relevancy, the ability to understand the explanation is important. The above query is extremely simple: only one term searched on one field. The explanation becomes more confusing as the query becomes more complicated.
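The arithmetic behind the two scores can be checked with a short sketch. It only uses the tf, idf and fieldNorm values copied from the explanations above; the helper name field_weight is made up for illustration and is not an Elasticsearch or Lucene API.

```python
# Values copied from the two explanations above.
TF = 1.0
IDF = 2.2039728
NORM_BOOSTED = 32.0    # fieldNorm of document 1 (boost 100)
NORM_DEFAULT = 0.375   # fieldNorm of document 2 (default boost of 1)


def field_weight(tf, idf, field_norm):
    """Hypothetical helper: fieldWeight as the product tf * idf * fieldNorm."""
    return tf * idf * field_norm


score_boosted = field_weight(TF, IDF, NORM_BOOSTED)
score_default = field_weight(TF, IDF, NORM_DEFAULT)
ratio = score_boosted / score_default

print("boosted score:", round(score_boosted, 5))
print("default score:", round(score_default, 7))
print("ratio:", round(ratio, 2))
```

Since tf and idf are identical for both documents, the ratio of the scores is simply the ratio of the two field norms (32 / 0.375, roughly 85.33), not the ratio of the configured boosts.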

Why are the fieldNorm values the way they are? The reason is ...

Issue #2: Lossy encoding

Field norms in Lucene are encoded into a single byte. This limitation causes a loss of precision when boosting, especially if your range of boosts varies greatly. To better illustrate the severity of the issue, let us try another query:

 {  
   "query": {  
     "match": {  
       "text": "fox"      
     }  
   }  
 }  

In this case, although the document boost for the first document (#1) is twice as much as the second matched document (#3), the field norm is the same and therefore they receive the same score (since the TF and IDF are identical).

 {  
  "_id": "1",  
  "_score": 61.321304,  
  "_explanation": {  
   "details": [  
    {  
     ...  
     {  
      "value": 32,  
      "description": "fieldNorm(doc=0)"  
     }  
    ]  
   }  
  ]  
 }  
 {  
  "_id": "3",  
  "_score": 61.321304,  
  "_explanation": {  
   "details": [  
    {  
     {  
      "value": 32,  
      "description": "fieldNorm(doc=2)"  
     }  
    ]  
   }  
  ]  
 }  

Due to the lossy encoding of the field norms, the effective document boost is the same. Obviously, this is not the ideal situation.
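To see how two different boosts can collapse into the same stored value, here is a rough Python sketch of the one-byte encoding. It is a port, from memory, of Lucene's SmallFloat.floatToByte315 and byte315ToFloat routines, so treat the bit manipulation as an assumption rather than a reference implementation. Two further assumptions: the raw norm is boost / sqrt(number of terms), and the analyzer dropped both occurrences of "the" from the long sentences, leaving 7 terms. The function names are illustrative only.

```python
import math
import struct

# Offset between the float exponent and the one-byte exponent
# (assumed to match the constants in Lucene's SmallFloat).
FZERO = (63 - 15) << 3


def float_to_byte315(value):
    """Encode a float into one byte (sketch of SmallFloat.floatToByte315)."""
    if value <= 0:
        return 0  # zero and negative values map to byte 0
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    small = bits >> 21  # keep only the leading bits of the float
    if small <= FZERO:
        return 1  # underflow maps to the smallest non-zero value
    if small >= FZERO + 0x100:
        return 255  # overflow maps to the largest value
    return small - FZERO


def byte315_to_float(encoded):
    """Decode the byte back into a float (sketch of SmallFloat.byte315ToFloat)."""
    if encoded == 0:
        return 0.0
    bits = ((encoded & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def stored_norm(boost, num_terms):
    """Assumed norm: boost / sqrt(num_terms), squeezed through one byte."""
    raw = boost / math.sqrt(num_terms)
    return byte315_to_float(float_to_byte315(raw))


# (document id, boost, assumed number of indexed terms)
for doc_id, boost, num_terms in [(1, 100, 7), (2, 1, 7), (3, 50, 2), (4, 100, 2)]:
    print(doc_id, boost, num_terms, stored_norm(boost, num_terms))
```

Under these assumptions, documents 1 and 3 both decode to 32.0 and document 2 decodes to 0.375, which agrees with the fieldNorm values shown in the explanations.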

Issue #3: Length norms

One issue faced during search relevancy tuning is length normalization. The default in Lucene (and therefore Elasticsearch) is to apply length normalization to a field, which means shorter fields that match a query will score higher than longer fields. In some use cases this behavior is ideal, but not in others. Here is an example where the field length affects scoring.

 {  
   "query": {  
     "match": {  
       "text": "dog"      
     }  
   },  
   "fields": ["boost"]  
 }  

Results

 "hits": [  
  {  
   "_index": "boosts",  
   "_type": "doc",  
   "_id": "4",  
   "_score": 122.64261,  
   "fields": {  
    "boost": 100  
   }  
  },  
  {  
   "_index": "boosts",  
   "_type": "doc",  
   "_id": "1",  
   "_score": 61.321304,  
   "fields": {  
    "boost": 100  
   }  
  },  
  {  
   "_index": "boosts",  
   "_type": "doc",  
   "_id": "2",  
   "_score": 0.71860904  
  }  
 ]  

The top hit (document #4) contains only two words, so it scores higher than document #1 despite having the same boost. To fix this problem, we can omit norms from a field. We will create a new field sans norms and reindex all the documents with the new field.

 curl -XPUT localhost:9200/boosts/doc/_mapping -d '  
 {  
  "doc": {  
   "properties": {  
    "text_no_norms": {  
     "type": "string",  
     "norms": { "enabled": false }  
    }  
   }  
  }  
 }  
 '  

Now we can query the new field

 {  
   "query": {  
     "match": {  
       "text_no_norms": "dog"  
     }  
   }  
 }  

 "hits": [  
  {  
   "_index": "boosts",  
   "_type": "doc",  
   "_id": "1",  
   "_score": 1.9162908  
  },  
  {  
   "_index": "boosts",  
   "_type": "doc",  
   "_id": "2",  
   "_score": 1.9162908  
  },  
  {  
   "_index": "boosts",  
   "_type": "doc",  
   "_id": "4",  
   "_score": 1.9162908  
  }  
 ]  

After eliminating the norms from the searched field, each document now receives the same score. In Lucene, length normalization is also encoded in the field norm, so if you omit norms, you also lose document and field boosts on that field. Once again, not the right way to go.
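The two result sets for 'dog' can be combined to show what the norm was contributing. The sketch below divides each score from the normed field by the score from the norm-less field, on the assumption that the norm-less score is the plain tf * idf part for this single-term query. All numbers are copied from the results above; the variable names are illustrative.

```python
# Scores for the 'dog' query on the normed "text" field, keyed by document id.
scores_with_norms = {"4": 122.64261, "1": 61.321304, "2": 0.71860904}

# Score every document received on the "text_no_norms" field.
score_without_norms = 1.9162908

# Dividing out the shared tf * idf part leaves the field norm, which mixes
# the document boost and the length normalization into a single number.
recovered_norms = {
    doc_id: round(score / score_without_norms, 3)
    for doc_id, score in scores_with_norms.items()
}

for doc_id, norm in recovered_norms.items():
    print("doc", doc_id, "field norm ~", norm)
```

Documents 4 and 1 carry the same boost of 100, yet the recovered norms differ by a factor of two (64 versus 32) purely because of field length. Dropping norms removes that factor together with the boost.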

Issue #4: Incompatibility with older versions

Besides using a field for the boost value, it might also be convenient to search, facet or sort on this value. My personal use case is to sort on the boost value whenever executing a match all query. Using the boost field for anything besides boosting is not possible with Elasticsearch versions prior to 0.90.6: the field cannot be indexed or stored and therefore cannot be used for other purposes. Starting with Elasticsearch 0.90.6, the boost field can be indexed, but this is not enabled by default, which is the opposite behavior of an unmapped field.

Issue #5: Currently deprecated

As of right now, document boosts are deprecated in Elasticsearch. Judging by the commit, only the documentation has changed, while the code is still intact. As with most APIs, it is not a good practice to rely on deprecated methods because they might disappear completely in the future.

The issues that I have addressed above primarily regard document boosts. Index-time field boosts have the same issues which pertain to field norms (#1-#3). Field boosts can be easily translated into query-time boosts by using many of the boostable queries such as term, query string, span and match.
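As a rough illustration of moving a field boost to query time, the sketch below builds a match query body that carries its own boost. The helper name boosted_match and the boost value of 2.0 are made up for the example; the "query" and "boost" keys follow the expanded form of the match query.

```python
import json


def boosted_match(field, text, boost):
    """Hypothetical helper: build a match query with a query-time boost."""
    return {
        "query": {
            "match": {
                field: {
                    "query": text,   # the text to search for
                    "boost": boost,  # applied at query time, not stored in the index
                }
            }
        }
    }


body = boosted_match("text", "fox", 2.0)
print(json.dumps(body, indent=2))
```

Because the boost lives in the request rather than in the index, it can be changed between searches without reindexing and is not squeezed through the one-byte norm.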

Solution: function score boosting

Elasticsearch has the ability to script the value for a document's score since version 0.19.0. The various methods introduced since then have been recently merged and improved, and released as the new function score query.

Since we want to eliminate document boosts, the mapping should now contain the field to boost with as a normal field instead of the explicit boost field.

 curl -XPOST localhost:9200/boosts -d '  
 {  
  "settings": {  
   "number_of_shards": 1,  
   "number_of_replicas": 0  
  },  
  "mappings": {  
   "doc": {  
    "properties": {  
     "text": {  
      "type": "string"  
     },  
     "boost": {  
      "type": "double"  
     }  
    }  
   }  
  }  
 }  
 '  

After deleting the previous index and reindexing the data, we will wrap the original query in a function score query:

 {  
   "query": {  
    "function_score": {  
      "query": {  
       "match": {  
         "text": "quick"  
       }  
      },  
      "script_score": {  
       "script": "doc['boost'].isEmpty() ? 1 : doc['boost'].value"  
      }  
    }  
   }  
 }  

If your documents are guaranteed to contain the field to be used as a boost value, you can skip the isEmpty() check.

 "hits": [  
  {  
   "_index": "boosts",  
   "_type": "doc",  
   "_id": "1",  
   "_score": 82.64898,  
   "_source": {  
    "text": "The quick brown fox jumped over the lazy dog",  
    "boost": 100  
   }  
  },  
  {  
   "_index": "boosts",  
   "_type": "doc",  
   "_id": "2",  
   "_score": 0.8264898,  
   "_source": {  
    "text": "The quick black fox leaped over the sleeping dog"  
   }  
  }  
 ]  

Without going into the score explanation, we can see that the score of the boosted document is exactly 100 times that of the document that is not boosted. The relationship between boosts is now much easier to understand.
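The effect of the script can be mimicked in a few lines of Python. This is only a model of the scoring, not how Elasticsearch executes it: it assumes the script result is multiplied with the query score, and that both documents get the same base score of 0.8264898 now that no index-time boost is stored. The function names are illustrative.

```python
import math

BASE_SCORE = 0.8264898  # score of the plain match query, taken from the results above


def script_score(doc):
    """Mirror of the script: use the 'boost' field, or 1 when it is missing."""
    return doc.get("boost", 1)


def function_score(query_score, doc):
    """Assumed combination: query score multiplied by the script result."""
    return query_score * script_score(doc)


doc1 = {"text": "The quick brown fox jumped over the lazy dog", "boost": 100}
doc2 = {"text": "The quick black fox leaped over the sleeping dog"}

score1 = function_score(BASE_SCORE, doc1)
score2 = function_score(BASE_SCORE, doc2)

print("doc 1:", round(score1, 5))
print("doc 2:", round(score2, 7))
print("ratio:", round(score1 / score2, 2))
```

The ratio between the two scores is the boost itself, because the value is read as a regular number at query time instead of being folded into the lossy field norm.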

Improvements

Of course, the biggest limitation of function scores is that they are executed for each document returned by a query. With index-time document boosts, the value is calculated only once and stored in the index. Elasticsearch uses mvel as the default scripting language, which offers quick performance. In addition, mvel scripts are cached, making execution even faster. For additional speed improvements, mvel scripts can be converted into native Java scripts.

Conclusion

Although document boosts can be problematic and function scores are a superior substitute, I do think they can be useful in specific use cases. Having the boost value calculated only once is an obvious advantage and although mvel scripts are fast, they still will have an effect on the overall CPU usage. If search relevancy is important and careful fine tuning of queries is critical, function scores are far easier to work with.

Gist for all the example code
