{ "blog" : "brusic" }

Thursday, March 28, 2024

Leveraging open-source LLMs for use with Elasticsearch's weighted_tokens sparse vector search

With the introduction of ChatGPT and LLMs to the general public, there has been an increased focus on vector search, which includes topics such as semantic search. In addition to specialized vector databases (Vespa, Pinecone) built specifically for such tasks, many existing data stores such as MongoDB, Lucene, and even PostgreSQL now support vector search.

Vector Search

Search engines such as Elasticsearch have long used traditional TF-IDF/BM25 search methods, executing queries that rely on literal matching of terms. With semantic search, search engines can now capture intent by using trained large language models to better query the data. Most of the focus is on dense vectors, where content is reduced to a single vector that is then compared for similarity against other vectors.

Sparse vector search is a hybrid of sorts, utilizing concepts from both types of search. Using LLMs, the terms in the content are expanded with other terms, each with a weight describing how closely it matches the intent of the content. Scoring is similar to TF-IDF: the score is the dot product between the weighted terms of the query and those of the document. There are numerous models trained for sparse vector search, such as SPLADE and Elasticsearch's own ELSER model.
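As a rough illustration of that scoring, assume both the query and the document have already been expanded into token/weight dictionaries. The tokens and weights below are made up for the example and are not real model output; the helper name is hypothetical.

```python
def sparse_dot_product(query_tokens, doc_tokens):
    """Dot product over the tokens that the query and document share."""
    # Iterate over the smaller dictionary; tokens missing on either side add 0.
    if len(query_tokens) > len(doc_tokens):
        query_tokens, doc_tokens = doc_tokens, query_tokens
    return sum(
        weight * doc_tokens[token]
        for token, weight in query_tokens.items()
        if token in doc_tokens
    )


# Illustrative weights only
query = {"investigation": 1.5, "detective": 1.0, "crime": 0.5}
doc = {"detective": 2.0, "killer": 1.5, "crime": 1.0}

score = sparse_dot_product(query, doc)
print(score)  # 1.0 * 2.0 + 0.5 * 1.0 = 2.5
```

Only "detective" and "crime" appear on both sides, so the query term "investigation" contributes nothing unless the model expanded the document with it as well.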

Elasticsearch

Elasticsearch introduced sparse vector search with the text_expansion query in Elasticsearch 8.8. Originally it used the rank_features field type, until the sparse_vector field type was introduced in 8.11.

The format of the query is simple: it takes the query text and the id of a pre-registered model as parameters.

GET _search
{
   "query":{
      "text_expansion":{
         "<sparse_vector_field>":{
            "model_id":"the model to produce the token weights",
            "model_text":"the query string"
         }
      }
   }
}

Elasticsearch's kNN dense vector search allows a vector to be computed outside of Elasticsearch and passed in directly, as in the example below, but sparse vector search via text_expansion does not.

{
  "knn": {
    "field": "byte-image-vector",
    "query_vector": [-5, 9],
    "k": 10,
    "num_candidates": 100
  },
  "fields": [ ... ]
}

eland, Elasticsearch's tool for registering models, does not support SPLADE or similar models, and ELSER is only available with an Elastic Stack subscription. With no open-source or free models to use, users of the free version of Elasticsearch could not realistically use the text_expansion query.

New weighted_tokens query

Elasticsearch 8.13 introduced a new weighted_tokens query. Similar to the knn query, it allows a dictionary of weighted tokens to be passed as a parameter.

POST _search
{
  "query": {
    "weighted_tokens": {
      "query_expansion_field": {
        "tokens": {"2161": 0.4679, "2621": 0.307, "2782": 0.1299, "2851": 0.1056,...}
      }
    }
  }
}
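The request body above is easy to assemble from a plain dictionary. The helper below is a hypothetical convenience, not part of any Elasticsearch client, and keeping only the highest-weighted tokens is an illustrative choice rather than something the query requires.

```python
def weighted_tokens_query(field, tokens, max_tokens=None):
    """Wrap a token/weight dict in a weighted_tokens request body.

    `field` is the sparse vector field to search. `max_tokens` is an
    optional, illustrative cap that keeps only the highest-weighted tokens.
    """
    ranked = sorted(tokens.items(), key=lambda item: item[1], reverse=True)
    if max_tokens is not None:
        ranked = ranked[:max_tokens]
    return {"query": {"weighted_tokens": {field: {"tokens": dict(ranked)}}}}


body = weighted_tokens_query(
    "query_expansion_field",
    {"2161": 0.4679, "2621": 0.307, "2782": 0.1299},
    max_tokens=2,
)
print(body)
```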

With this new functionality, we can now create sparse vectors outside an Elasticsearch ingest pipeline. To create these sparse vectors, we will use OpenSearch's Neural Sparse Retrieval model.

Enter OpenSearch

OpenSearch introduced the Neural Sparse Retrieval model in 2.11 under the Apache 2.0 license, and it is available for direct download or via Hugging Face. The Elasticsearch weighted_tokens query can be paired with OpenSearch's model.

The model will create sparse vectors as PyTorch tensors, which are not directly usable in Elasticsearch. After the content is tokenized and passed through the model, its representation must then be transformed into the relevant tokens and their weights. Luckily, the model card on Hugging Face has just the code we need.

import itertools
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# get sparse vector from dense vectors with shape batch_size * seq_len * vocab_size
def get_sparse_vector(feature, output):
    values, _ = torch.max(output*feature["attention_mask"].unsqueeze(-1), dim=1)
    values = torch.log(1 + torch.relu(values))
    values[:,special_token_ids] = 0
    return values
    
# transform the sparse vector to a dict of (token, weight)
def transform_sparse_vector_to_dict(sparse_vector):
    sample_indices,token_indices=torch.nonzero(sparse_vector,as_tuple=True)
    non_zero_values = sparse_vector[(sample_indices,token_indices)].tolist()
    number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
    tokens = [transform_sparse_vector_to_dict.id_to_token[_id] for _id in token_indices.tolist()]

    output = []
    end_idxs = list(itertools.accumulate([0]+number_of_tokens_for_each_sample))
    for i in range(len(end_idxs)-1):
        token_strings = tokens[end_idxs[i]:end_idxs[i+1]]
        weights = non_zero_values[end_idxs[i]:end_idxs[i+1]]
        output.append(dict(zip(token_strings, weights)))
    return output
    
# load the model
model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-v1")
tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-v1")

# set the special tokens and id_to_token transform for post-process
special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
get_sparse_vector.special_token_ids = special_token_ids
id_to_token = ["" for i in range(tokenizer.vocab_size)]
for token, _id in tokenizer.vocab.items():
    id_to_token[_id] = token
transform_sparse_vector_to_dict.id_to_token = id_to_token

The example code does not directly produce the token/weight dictionary we need for a single piece of text, so we will add another helper method.

def to_sparse_vector_dict(query):
    features = tokenizer([query], padding=True, truncation=True,  
                         return_tensors='pt', return_token_type_ids=False)
    output = model(**features)[0]
    sparse_vector = get_sparse_vector(features, output)    
    return transform_sparse_vector_to_dict(sparse_vector)[0]
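To make the pooling step in get_sparse_vector more concrete, here is the same arithmetic in plain Python for a single text: mask out padding positions, take the maximum over the sequence for each vocabulary entry, apply log(1 + relu(x)), and zero out special tokens. This is a sketch for intuition only; the tiny logits matrix is invented, and pool_token_weights is a hypothetical name rather than part of the model card code.

```python
import math


def pool_token_weights(logits, attention_mask, special_token_ids=()):
    """Collapse seq_len x vocab_size logits for one text into vocab_size weights."""
    vocab_size = len(logits[0])
    weights = []
    for token_id in range(vocab_size):
        # Masked (padding) positions are multiplied by 0, as in output * attention_mask
        best = max(row[token_id] * keep for row, keep in zip(logits, attention_mask))
        # log(1 + relu(x)) keeps weights non-negative and dampens large values
        weights.append(math.log(1 + max(best, 0.0)))
    for token_id in special_token_ids:
        weights[token_id] = 0.0
    return weights


# Invented logits: 3 sequence positions, vocabulary of 4 entries
logits = [
    [2.0, -1.0, 0.5, 3.0],
    [0.0, 1.0, -2.0, 1.0],
    [9.0, 9.0, 9.0, 9.0],  # padding position, ignored via the mask
]
mask = [1, 1, 0]

weights = pool_token_weights(logits, mask, special_token_ids=[0])
sparse = {token_id: w for token_id, w in enumerate(weights) if w > 0}
print(sparse)
```

Entry 0 is dropped because it is treated as a special token, and the padding row never wins the maximum because the mask zeroes it out first.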

Generating content with sparse vectors

We will largely use Elastic Labs' ELSER example as a model for how to utilize sparse vectors. Instead of the previous text_expansion query, the new weighted_tokens query will be used, with the computation of the vector occurring externally. The example uses the classic movie data set and a local Docker instance of Elasticsearch 8.13.

docker pull elasticsearch:8.13.0
docker run --name elastic_weighted_tokens -p 9200:9200 -p 9300:9300 \
-e "discovery.type=single-node" -e "xpack.security.enabled=false" elasticsearch:8.13.0

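# imports used by the indexing and search code in this section
import json
import time
from urllib.request import urlopen

from elasticsearch import Elasticsearch, helpers
from elasticsearch.helpers import BulkIndexError

client = Elasticsearch('http://localhost:9200')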
 
client.indices.create(
    index="movies",
    mappings={
        "properties": {
            "plot": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            },
            "plot_embeddings": {"type": "sparse_vector"},
        }
    },
    settings = {
        'index': {
          "number_of_replicas": 0,
          "number_of_shards": 1
        }  
    }
)
 

Sparse vector embeddings are generated using the model and the data in the plot field.

url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/notebooks/search/movies.json"
response = urlopen(url)

# Load the response data into a JSON object
data_json = json.loads(response.read())

# Prepare the documents to be indexed
documents = []
for doc in data_json:
    # additional process needed when using an external model 
    embeddings = to_sparse_vector_dict(doc['plot'])
    doc['plot_embeddings'] = embeddings
    documents.append(
        {
            "_index": "movies",
            "_source": doc,
        }
    )

# Use helpers.bulk to index
try:
    helpers.bulk(client, documents)
except BulkIndexError as bie:
    print(bie.errors)
    
time.sleep(5)
print("Done indexing documents into `movies` index!")

We are immediately hit with an indexing error:

{
      'error': {
        'type': 'document_parsing_exception',
        'reason': '[1:440] failed to parse: [sparse_vector] fields do not support dots in feature names but found [.]',...
      }
      ...
} 

It turns out that the OpenSearch model produces tokens containing dots ('.'), which are not found in the ELSER model and which the sparse_vector field rejects. Instead of using the textual token, we could use the token ids returned by the model, but since these tokens are probably not useful, we will simply remove them.

for doc in data_json:
    # additional process needed when using an external model 
    embeddings = to_sparse_vector_dict(doc['plot'])
    # remove tokens with dots
    embeddings = { k:v for (k,v) in embeddings.items() if '.' not in k}
    doc['plot_embeddings'] = embeddings
    documents.append(
        {
            "_index": "movies",
            "_source": doc,
        }
    )
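For completeness, the alternative of swapping dotted tokens for their vocabulary ids could look roughly like the sketch below. The function name, the "id_" prefix (added here so a replaced key cannot collide with a purely numeric token), and the small vocab dictionary are all assumptions made for illustration; in practice the mapping would come from tokenizer.vocab, and the same replacement would have to be applied to the query tokens so both sides still match.

```python
def replace_dotted_tokens(embeddings, vocab, prefix="id_"):
    """Replace tokens containing '.' with a prefixed vocabulary id."""
    cleaned = {}
    for token, weight in embeddings.items():
        # Only dotted tokens are rewritten; everything else is kept as-is
        key = f"{prefix}{vocab[token]}" if "." in token else token
        cleaned[key] = weight
    return cleaned


# Made-up vocabulary ids and weights, for illustration only
vocab = {"detective": 101, ".": 7, "...": 8}
embeddings = {"detective": 1.2, ".": 0.3, "...": 0.1}

cleaned = replace_dotted_tokens(embeddings, vocab)
print(cleaned)
```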

Similar to the original example, we will keep the query structure, but use the weighted_tokens query with the dictionary passed as a parameter.

query = 'investigation'
query_embeddings = to_sparse_vector_dict(query)
response = client.search(
    index="movies",
    size=3,
    query={
        "weighted_tokens": {
            "plot_embeddings": {
                "tokens": query_embeddings,
                "pruning_config": {
                    "tokens_freq_ratio_threshold": 5,
                    "tokens_weight_threshold": 0.4,
                    "only_score_pruned_tokens": False
                }
            }
        }
    }
)

for hit in response["hits"]["hits"]:
    doc_id = hit["_id"]
    score = hit["_score"]
    title = hit["_source"]["title"]
    plot = hit["_source"]["plot"]
    print(f"Score: {score}\nTitle: {title}\nPlot: {plot}\n")

How do the models compare?

The results using the ELSER model, directly from the example:

Score: 6.403748
Title: se7en
Plot: Two detectives, a rookie and a veteran, hunt a serial killer who uses the seven deadly 
sins as his motives.

Score: 3.6703482
Title: the departed
Plot: An undercover cop and a mole in the police attempt to identify each other while 
infiltrating an Irish gang in South Boston.

Score: 2.9359207
Title: the usual suspects
Plot: A sole survivor tells of the twisty events leading up to a horrific gun battle on a 
boat, which began when five criminals met at a seemingly random police lineup.

The results using the OpenSearch model:

Score: 1.4934435
Title: Se7en
Plot: Two detectives, a rookie and a veteran, hunt a serial killer who uses the seven deadly 
sins as his motives.

Score: 1.1345206
Title: The Departed
Plot: An undercover cop and a mole in the police attempt to identify each other while 
infiltrating an Irish gang in South Boston.

Score: 1.0607098
Title: The Matrix
Plot: A computer hacker learns from mysterious rebels about the true nature of his reality 
and his role in the war against its controllers.

The first two results are a positional match, but the third is different. Given the differences between the models, this is to be expected. Both are based on SPLADE, using BERT for the encoding. Elastic describes their newer v2 model as "improved retrieval accuracy and more efficient indexing. This enhancement is attributed to the extension of the training data set". For OpenSearch "the model is based on the training procedure of SPLADE, but we also add some enhancements on training data. And we use IDF value to enhance the doc-only mode."
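The positional comparison can also be done mechanically. The sketch below uses the titles from the two result lists above and compares them case-insensitively, since the ELSER example lowercases titles; positional_matches is a hypothetical helper name.

```python
def positional_matches(ranking_a, ranking_b):
    """Return, per rank, whether both rankings hold the same title (ignoring case)."""
    return [a.casefold() == b.casefold() for a, b in zip(ranking_a, ranking_b)]


elser = ["se7en", "the departed", "the usual suspects"]
opensearch = ["Se7en", "The Departed", "The Matrix"]

matches = positional_matches(elser, opensearch)
print(matches)  # [True, True, False]
```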

The indexing and querying are not performed with any tuning in mind. A deeper evaluation of any model would require using real content with actual queries. The Elasticsearch example does lowercase the title as part of the ingestion pipeline, which should not affect either the recall or score of the results.

While Elasticsearch has further opened up its use of sparse vectors with this new query type, OpenSearch still only supports queries that use a model id as a parameter, instead of a dictionary of token/weight pairs. There is an open issue for this functionality.

The full version of the code can be found in a Python notebook.

Update

I found the new name confusing, and apparently so did someone else at Elastic. Both query types will be merged and renamed into a sparse_vector query. https://github.com/elastic/elasticsearch/issues/10626

Monday, January 15, 2018

Whither NoSQL

When I first started the NoSQL NYC meetup back in 2010, it was the first NoSQL group of its kind on Meetup.com. The reason for creating the group was not that I was an expert on the subject, but that I knew nothing. How could I learn more? So many new database types were being released around this time. Hadoop started small in 2006, CouchDB became an Apache subproject in 2008 and MongoDB was released in 2009. Developers looking to move beyond whatever limitations they had with SQL databases now had choices. However, developers experienced in these technologies were few, so the meetup was created to bring like-minded people together.

By the time I hosted my last meetup before moving to California almost two years later, group membership was over a thousand of these like-minded developers. Interest was high and events were full. Document stores, graph databases, distributed filesystems, not a line of SQL in sight.

It was sad to leave the group behind, but it was left in the capable hands of a new organizer. Not long after, the number of meetups began to dwindle. It saddened me even more that the group I created and grew over those two years was being neglected. Should I have picked a different organizer? Should I have maintained an active role in organizing? The truth is not that the new organizer was negligent in his duties, but simply that NoSQL became the new norm. There was a better chance that a new startup was using MongoDB than MySQL. There was no longer a need to lump all these new databases under one convenient term.

NoSQL was never against the SQL syntax, but simply an alternative to relational databases. Over time, even though the software development world fully embraced these new concepts, the term NoSQL simply went away. You will hear references to the CAP theorem, to eventual consistency, to document stores, but not NoSQL.

While the name is gone, the spirit will continue. Thanks for the memories.

Saturday, February 15, 2014

Document boosting in Elasticsearch

There has been some discussion on the Elasticsearch mailing list lately about applying index-time boosts at the document level, aka document boosting. The practice has been discouraged (in fact, the Elasticsearch team has officially deprecated document boosting) in favor of query-time scoring, but without any detailed explanation why. Instead of repeating myself each time the question is asked, I have decided to detail the various reasons why document boosts should no longer be used.

How private is your online job profile?

If you have been in the workforce for a considerable amount of time, you might have a job profile on a job site or two. Although using one's network is perhaps one of the best ways to secure a new position, I found myself once again using a job site to find a new position in an area where I had no connections.

For this job hunt, I posted my resume on Monster.com and Dice.com, two of the most popular job sites in the US. Once I secured a new job, I set all my online job profiles to Private in order to stop receiving emails about other opportunities. However, after a few months, I received a curious email that started off:

"I found your resume on Monster.com and wanted to run a new Java Developer position in Dulles, VA by you."

The email was sent via Monster using their anonymous mailer <Monster> anonredir@route.monster.com and the recruiter introduced themselves using my first name.

I quickly assumed that I had left my profile as Public on Monster. However, my resume was in fact Private. Here is what my profile looks like:

How would this situation be possible? How can a recruiter contact me via Monster if my resume is set to Private? I regarded the email as a fluke and thought nothing of it. Then I received two more emails from the same recruiter. I quickly sent an email to Monster to understand how it could be possible to receive emails via my private resume. Their response was as follows:

The reason employers may be contacting you is that at one time your resume was set to Public or you applied to an employers job posting. In this case, a resume that you used to apply online for a job or that was searchable, employers, recruiters, and others who have paid for access to the Monster resume database, or have paid to obtain a copy of that database, as well as parties who have otherwise gained access, may have retained a copy of your resume in their own files or databases. Monster is not responsible for the retention, use, or privacy of resumes in these instances.

Let us break down the different scenarios on how a recruiter might have my information:

1) "at one time your resume was set to Public"

My job hunt was not a secret. I was moving to a new area and I had already left my previous position. Setting my resume to Public helped me identify companies I might not have found myself, since they would be able to contact me (nothing interesting came up, but that's another story). Simply setting my resume to Public meant any employer or recruiter with access on Monster could save my profile ad infinitum, regardless of any privacy changes I might make.

2) "or you applied to an employers job posting."

Not applicable in my case, but it is an acceptable scenario for someone's profile to be viewable.

3) "and others who have paid for access to the Monster resume database or have paid to obtain a copy of that database"

Is Monster.com telling me that even if someone sets their profile to Private, Monster.com can still sell their resume database to anyone with a checkbook? There is no opt-out from having your information sold. According to Monster.com's own online FAQ, a Private profile is defined as:

"If you select private, your resume will not be seen by employers conducting resume searches. However, you can use your private resume to apply for jobs."

"If you select private as your resume status, your resume will not be seen by employers conducting resume searches (but you can still use your private resume to apply for specific jobs online)."

Monster.com has two conflicting viewpoints on how Private is defined. Further in the FAQ, regarding the deletion of a resume, Monster.com states:

"If you delete a resume that you used to apply online for a job or that was searchable, employers, recruiters, and others who have paid for access to the Monster resume database, or have paid to obtain a copy of that database, as well as parties who have otherwise gained access, may have retained a copy of your resume in their own files or databases. Monster is not responsible for the retention, use, or privacy of resumes in these instances."

This policy is similar to their Private resume policy. If a resume was deleted or set to Private, why does Monster.com allow someone to email the candidate in question? The email was not sent directly, but via Monster.com's email system. Monster.com can shut off access at any time, but chooses not to. Nowadays, many of us have a virtual resume via a LinkedIn profile, but LinkedIn allows its users granular control over what is publicly available. How much information does Monster.com actually sell? How much privacy are we entitled to when searching for a new job?

Friday, September 9, 2011

Create pluggable REST endpoints in elasticsearch

A quick introduction on how to create a plugin in elasticsearch that allows you to define new REST endpoints.