Thursday, March 28, 2024

Leveraging open-source LLMs for use with Elasticsearch's weighted_tokens sparse vector search

 

With the introduction of ChatGPT and LLMs to the general public, there has been an increased focus on vector search, which includes topics such as semantic search. In addition to specialized vector databases (Vespa, Pinecone) built specifically for such tasks, many existing data stores such as MongoDB, Lucene, and even PostgreSQL now support vector search.

Vector Search 

Search engines such as Elasticsearch have long used traditional TF-IDF/BM25 methods, executing queries that rely on literal matching of terms. With semantic search, search engines can now capture intent, using trained large language models to better query the data. Most of the focus is on dense vectors, which reduce a piece of content to a single vector that is then compared against other vectors to find similar content.
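As a rough illustration of the dense case, similarity is usually the cosine of the angle between two fixed-length embeddings. The vectors below are made up for illustration, not produced by any real model:

import numpy as np

# hypothetical embeddings for a query and a document
query_vec = np.array([0.12, -0.38, 0.70, 0.05])
doc_vec = np.array([0.10, -0.30, 0.65, 0.20])

# cosine similarity: dot product of the two vectors, normalized by their lengths
cosine = np.dot(query_vec, doc_vec) / (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec))
print(f"cosine similarity: {cosine:.3f}")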

Sparse vector search is a hybrid of sorts, using concepts from both approaches. Using LLMs, terms are expanded with related terms, each weighted by how closely it matches the intent of the content. Scoring is similar to TF-IDF: the dot product between the terms of the query and the document. There are numerous models trained for sparse vector search, such as SPLADE and Elasticsearch's own ELSER model.
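A minimal sketch of that scoring idea, with invented token weights: the score is the dot product over the tokens the query and document share.

# hypothetical expanded query and document, as token -> weight dictionaries
query_tokens = {"detective": 1.2, "crime": 0.8, "investigation": 1.5}
doc_tokens = {"detective": 0.9, "murder": 0.7, "investigation": 1.1, "city": 0.2}

# dot product over the shared tokens
score = sum(weight * doc_tokens[token]
            for token, weight in query_tokens.items()
            if token in doc_tokens)
print(f"score: {score:.2f}")  # 1.2*0.9 + 1.5*1.1 = 2.73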

Elasticsearch

Elasticsearch introduced sparse vector search with the text_expansion query in Elasticsearch 8.8. It originally used the rank_feature data type until the sparse_vector field type was introduced in 8.11.

The format of the query is simple: it takes a query string and a pre-registered model_id as parameters.
 

GET _search
{
  "query": {
    "text_expansion": {
      "<sparse_vector_field>": {
        "model_id": "the model to produce the token weights",
        "model_text": "the query string"
      }
    }
  }
}

Elasticsearch's kNN dense vector search allows a vector to be computed outside of indexing and querying, but sparse vector search via text_expansion does not.

{
  "knn": {
    "field": "byte-image-vector",
    "query_vector": [-5, 9],
    "k": 10,
    "num_candidates": 100
  },
  "fields": [ ... ]
}

eland, Elasticsearch's tool for registering models, does not support SPLADE or similar models, and ELSER is only available with an Elastic Stack subscription. With no open-source/free models to use, users of the free version of Elasticsearch could not realistically use the text_expansion query.

New weighted_tokens query

Elasticsearch 8.13 introduced a new weighted_tokens query. Similar to the knn query, it allows a dictionary of weighted tokens to be passed as a parameter.

POST _search
{
  "query": {
    "weighted_tokens": {
      "query_expansion_field": {
        "tokens": {"2161": 0.4679, "2621": 0.307, "2782": 0.1299, "2851": 0.1056, ...}
      }
    }
  }
}

 

With this new functionality, we can now create sparse vectors outside an Elasticsearch ingest pipeline. To create these sparse vectors, we will use OpenSearch's Neural Sparse Retrieval model.

Enter OpenSearch

OpenSearch introduced the Neural Sparse Retrieval model in 2.11 under the Apache 2.0 license, and it is available for direct download or via Hugging Face. With the Elasticsearch weighted_tokens query, we can now pair the two.

The model produces sparse PyTorch tensors, which are not directly usable in Elasticsearch. After the content is tokenized and passed through the model, the resulting representation must be transformed into the relevant tokens and their weights. Luckily, the model card on Hugging Face has just the code we need.

import itertools
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# get sparse vector from dense vectors with shape batch_size * seq_len * vocab_size
def get_sparse_vector(feature, output):
    values, _ = torch.max(output * feature["attention_mask"].unsqueeze(-1), dim=1)
    values = torch.log(1 + torch.relu(values))
    values[:, special_token_ids] = 0
    return values

# transform the sparse vector to a dict of (token, weight)
def transform_sparse_vector_to_dict(sparse_vector):
    sample_indices, token_indices = torch.nonzero(sparse_vector, as_tuple=True)
    non_zero_values = sparse_vector[(sample_indices, token_indices)].tolist()
    number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
    tokens = [transform_sparse_vector_to_dict.id_to_token[_id] for _id in token_indices.tolist()]

    output = []
    end_idxs = list(itertools.accumulate([0] + number_of_tokens_for_each_sample))
    for i in range(len(end_idxs) - 1):
        token_strings = tokens[end_idxs[i]:end_idxs[i + 1]]
        weights = non_zero_values[end_idxs[i]:end_idxs[i + 1]]
        output.append(dict(zip(token_strings, weights)))
    return output

# load the model
model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-v1")
tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-v1")

# set the special tokens and id_to_token transform for post-process
special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
get_sparse_vector.special_token_ids = special_token_ids
id_to_token = ["" for i in range(tokenizer.vocab_size)]
for token, _id in tokenizer.vocab.items():
    id_to_token[_id] = token
transform_sparse_vector_to_dict.id_to_token = id_to_token


The example code does not directly produce the sparse vector dictionary we need, so we will add another helper method.

def to_sparse_vector_dict(query):
    features = tokenizer([query], padding=True, truncation=True,
                         return_tensors='pt', return_token_type_ids=False)
    output = model(**features)[0]
    sparse_vector = get_sparse_vector(features, output)
    return transform_sparse_vector_to_dict(sparse_vector)[0]
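
Calling the helper on a short piece of text returns a plain token-to-weight dictionary, which is the shape the weighted_tokens query expects. The tokens and weights shown in the comment are illustrative, not actual model output:

# illustrative only: actual tokens and weights depend on the model
print(to_sparse_vector_dict("a detective investigates a murder"))
# e.g. {'detective': 1.82, 'investigate': 1.41, 'murder': 1.77, 'crime': 0.63, ...}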

Generating content with sparse vectors

We will largely follow Elasticsearch Labs' ELSER example as a model for how to use sparse vectors. Instead of the previous text_expansion query, the new weighted_tokens query will be used, with the computation of the vector occurring externally. The example uses the classic movie data set with a local Docker instance of Elasticsearch 8.13.

docker pull elasticsearch:8.13.0
docker run --name elastic_weighted_tokens -p 9200:9200 -p 9300:9300 \
-e "discovery.type=single-node" -e "xpack.security.enabled=false" elasticsearch:8.13.0

from elasticsearch import Elasticsearch

client = Elasticsearch('http://localhost:9200')

client.indices.create(
    index="movies",
    mappings={
        "properties": {
            "plot": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            },
            "plot_embeddings": {"type": "sparse_vector"},
        }
    },
    settings={
        "index": {
            "number_of_replicas": 0,
            "number_of_shards": 1
        }
    }
)
 

Sparse vector embeddings are generated using the model and the data in the plot field:

import json
import time
from urllib.request import urlopen

from elasticsearch import helpers
from elasticsearch.helpers import BulkIndexError

url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/notebooks/search/movies.json"
response = urlopen(url)

# Load the response data into a JSON object
data_json = json.loads(response.read())

# Prepare the documents to be indexed
documents = []
for doc in data_json:
    # additional processing needed when using an external model
    embeddings = to_sparse_vector_dict(doc['plot'])
    doc['plot_embeddings'] = embeddings
    documents.append(
        {
            "_index": "movies",
            "_source": doc,
        }
    )

# Use helpers.bulk to index
try:
    helpers.bulk(client, documents)
except BulkIndexError as bie:
    print(bie.errors)
time.sleep(5)
print("Done indexing documents into `movies` index!")

We are hit immediately with an indexing error:

{
  'error': {
    'type': 'document_parsing_exception',
    'reason': '[1:440] failed to parse: [sparse_vector] fields do not support dots in feature names but found [.]', ...
  }
  ...

It turns out that the OpenSearch model produces tokens containing dots ('.'), which do not appear in the ELSER model's output. Instead of the textual tokens, we could use the token IDs returned by the model, but since these tokens are probably not useful, we will simply remove them.

documents = []
for doc in data_json:
    # additional processing needed when using an external model
    embeddings = to_sparse_vector_dict(doc['plot'])
    # remove tokens with dots
    embeddings = {k: v for (k, v) in embeddings.items() if '.' not in k}
    doc['plot_embeddings'] = embeddings
    documents.append(
        {
            "_index": "movies",
            "_source": doc,
        }
    )
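
If dropping tokens feels too lossy, the other option mentioned above is to key the dictionary by the model's token IDs rather than the token strings, since numeric keys contain no dots. A minimal sketch of that variant, reusing the model, tokenizer, and get_sparse_vector helper loaded earlier (the helper name to_sparse_vector_id_dict is mine, not from the model card):

def to_sparse_vector_id_dict(query):
    # same as to_sparse_vector_dict, but keys are token IDs as strings,
    # which sidesteps the '.' restriction on sparse_vector feature names
    features = tokenizer([query], padding=True, truncation=True,
                         return_tensors='pt', return_token_type_ids=False)
    output = model(**features)[0]
    sparse_vector = get_sparse_vector(features, output)
    # single query, so all entries belong to sample 0
    sample_indices, token_indices = torch.nonzero(sparse_vector, as_tuple=True)
    weights = sparse_vector[(sample_indices, token_indices)].tolist()
    return {str(token_id): weight
            for token_id, weight in zip(token_indices.tolist(), weights)}

Both documents and queries would have to be encoded with the same helper for the keys to line up.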

Similar to the original example, we keep the query structure but use the weighted_tokens query, with the dictionary passed as a parameter:

query = 'investigation'
query_embeddings = to_sparse_vector_dict(query)
response = client.search(
    index="movies",
    size=3,
    query={
        "weighted_tokens": {
            "plot_embeddings": {
                "tokens": query_embeddings,
                "pruning_config": {
                    "tokens_freq_ratio_threshold": 5,
                    "tokens_weight_threshold": 0.4,
                    "only_score_pruned_tokens": False
                }
            }
        }
    }
)

for hit in response["hits"]["hits"]:
    doc_id = hit["_id"]
    score = hit["_score"]
    title = hit["_source"]["title"]
    plot = hit["_source"]["plot"]
    print(f"Score: {score}\nTitle: {title}\nPlot: {plot}\n")

How do the models compare?

The results using the ELSER model, taken directly from the example:

Score: 6.403748
Title: se7en
Plot: Two detectives, a rookie and a veteran, hunt a serial killer who uses the seven deadly 
sins as his motives.

Score: 3.6703482
Title: the departed
Plot: An undercover cop and a mole in the police attempt to identify each other while 
infiltrating an Irish gang in South Boston.

Score: 2.9359207
Title: the usual suspects
Plot: A sole survivor tells of the twisty events leading up to a horrific gun battle on a 
boat, which began when five criminals met at a seemingly random police lineup.

The results using the OpenSearch model:

Score: 1.4934435
Title: Se7en
Plot: Two detectives, a rookie and a veteran, hunt a serial killer who uses the seven deadly 
sins as his motives.

Score: 1.1345206
Title: The Departed
Plot: An undercover cop and a mole in the police attempt to identify each other while 
infiltrating an Irish gang in South Boston.

Score: 1.0607098
Title: The Matrix
Plot: A computer hacker learns from mysterious rebels about the true nature of his reality 
and his role in the war against its controllers.

The first two results are a positional match, but the third is different. Given the differences between the models, this is to be expected. Both are based on SPLADE, using BERT for the encoding. Elastic describes their newer v2 model as having "improved retrieval accuracy and more efficient indexing. This enhancement is attributed to the extension of the training data set". For OpenSearch, "the model is based on the training procedure of SPLADE, but we also add some enhancements on training data. And we use IDF value to enhance the doc-only mode."

The indexing and querying were not performed with any tuning in mind. A deeper evaluation of any model would require using real content with actual queries. The Elasticsearch example does lowercase the title as part of its ingest pipeline, which should not affect either the recall or the score of the results.

While Elasticsearch has further opened up its use of sparse vectors with this new query type, OpenSearch still only supports queries that use a model ID as a parameter, rather than a dictionary of token/weight pairs. There is an open issue for this functionality.

The full version of the code can be found in a Python notebook.

Update

I found the new name confusing, and apparently so did someone else at Elastic. Both query types will be merged and renamed into a sparse_vector query.  https://github.com/elastic/elasticsearch/issues/10626



