
Unlocking FAISS: Vector Indexing and ANN with Serverless Architecture

A few months ago, while working on a recommendation and search system, I came across FAISS, a library from Facebook used to find similar items in large datasets. I got curious, and after spending a few hours reading the documentation and researching online, I noticed the gap between building a robust system and deploying it to serve requests.

In this post, I aim to bridge this gap and explain how we can take a piece of code from a Jupyter notebook and turn it into a deployable application. But before that, let's understand a bit about Faiss.

Introduction

Faiss, which stands for "Facebook AI Similarity Search," is a powerful and efficient library for similarity search and indexing of dense vectors.

Developed by Facebook AI Research, Faiss is primarily used for searching and indexing high-dimensional data, making it a valuable tool in various applications, including natural language processing, computer vision, and recommendation systems. Key features and characteristics of Faiss include:

  1. Efficient Vector Search: Faiss is optimized for fast similarity search in large datasets of high-dimensional vectors. It provides both exact and approximate search algorithms, making it suitable for a wide range of use cases.

  2. GPU Support: Faiss includes GPU support, allowing users to take advantage of the computational power of modern graphics processing units to accelerate similarity search operations.

  3. Diverse Indexing Structures: Faiss provides a variety of indexing structures, including flat indexes, IVF (Inverted File) indexes, HNSW (Hierarchical Navigable Small World) indexes, and more, each tailored to specific data and performance requirements.

Terminologies

FAISS comes with various terminologies and concepts related to similarity search and high-dimensional indexing that are important to understand before we move forward. Let's go through them:

  1. Similarity Search: This is the fundamental task that FAISS is designed for. It involves finding items or data points in a dataset that are most similar to a given query item, often based on distance or similarity metrics.
  2. Index: An index is a data structure used to organize and efficiently search through high-dimensional vectors. FAISS provides various index structures optimized for different use cases.
  3. Vector: A vector represents a data point or item in a high-dimensional space. These vectors are the entities that you search for or compare using FAISS.
  4. L2 Distance (Euclidean Distance): L2 distance is a common metric used to measure the distance between vectors in Euclidean space. It is the square root of the sum of squared differences between corresponding elements of two vectors.
  5. Cosine Similarity: Cosine similarity is a metric that measures the cosine of the angle between two vectors. It is often used for text or document similarity tasks and ranges from -1 (opposite) to 1 (identical), with 0 meaning the vectors are orthogonal (unrelated). Both metrics are illustrated in the short snippet after this list.
  6. Index Type: FAISS offers a variety of index types, including “Flat” indexes, IVF (Inverted File) indexes, HNSW (Hierarchical Navigable Small World) indexes, etc. These determine how vectors are organized and searched within the index.
  7. ANN (Approximate Nearest Neighbors): ANN techniques search for neighbors that are approximately (rather than exactly) the closest to a given query, trading a small amount of accuracy for much faster search. FAISS is often used for ANN search tasks.
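To make these two metrics concrete, here is a minimal NumPy sketch (the vectors are made up purely for illustration) showing how L2 distance and cosine similarity are computed for a pair of vectors:

import numpy as np

a = np.array([0.1, 0.4, 0.8], dtype=np.float32)  # toy vectors, for illustration only
b = np.array([0.2, 0.3, 0.9], dtype=np.float32)

# L2 (Euclidean) distance: square root of the sum of squared element-wise differences
l2_distance = np.sqrt(np.sum((a - b) ** 2))

# Cosine similarity: dot product divided by the product of the vector norms
cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(l2_distance, cosine_similarity)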

Building a Search Engine

The first thing we need is data. We'll be using an open-source dataset available on Kaggle, which consists of summarized news from Inshorts and news articles from the Hindu, Indian Times, and Guardian.

We will read this dataset and extract the relevant text column into a single list.

FAISS can take advantage of Graphics Processing Units (GPUs) for acceleration, which is particularly useful for large-scale similarity searches. We will also select the device (GPU if available, otherwise CPU) on which our code will run.

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import time

import faiss
from sentence_transformers import SentenceTransformer
import torch

# https://www.kaggle.com/datasets/sunnysai12345/news-summary/
df_news = pd.read_csv("../input/news-summary/news_summary_more.csv")

torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer("all-MiniLM-L6-v2", device=torch_device)
Setting up the device
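As mentioned above, we extract the relevant text column into a single list. A quick sketch of that step, assuming the 'text' column (the same column the search code below uses):

# quick look at the data and the text we will embed
print(df_news.shape)
print(df_news.columns)

news_texts = df_news['text'].astype(str).to_list()  # illustrative; create_index below does this internally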

Curious Case of Faiss Indexes

Faiss comes with many different index types — many of which can be mixed and matched to produce multiple layers of indexes. We will focus on a few indexes that prioritize search speed, search quality, or index memory.

Which of these indexes we use depends very much on our use case. We must consider factors such as dataset size, search frequency, and the trade-off between search quality and search speed.

1. IndexFlatL2 / IndexFlatIP

  • This is the simplest index structure: all data points are stored without any transformation or compression. Flat indexes are 'flat' because they do not modify the vectors that we feed into them.
  • Because there is no approximation or clustering of vectors — these indexes produce the most accurate results. We have perfect search quality, but this comes at the cost of significant search times.
  • With flat indexes, we introduce our query vector xq and compare it against every other full-size vector in our index — calculating the distance/inner-product to each. This is an EXHAUSTIVE SEARCH.
  • After calculating all of these distances, we will return the nearest k of those as our nearest matches: a k-nearest neighbors (kNN) search (a minimal sketch of this idea follows after this list).
  • And for flat indexes, that is all we need to do — there is no training (as we have no parameters to optimize when storing vectors without transformations or clustering).
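To see what a flat index does under the hood, here is a minimal NumPy sketch of this exhaustive kNN search; the vectors are random and purely illustrative, and the real FAISS code follows further below:

import numpy as np

np.random.seed(0)
db = np.random.rand(1000, 384).astype(np.float32)  # 1,000 database vectors (toy data)
xq = np.random.rand(1, 384).astype(np.float32)     # one query vector
k = 5

# exhaustive search: compute the L2 distance from xq to every vector in the index
distances = np.linalg.norm(db - xq, axis=1)

# return the k nearest matches (a kNN search)
nearest_ids = np.argsort(distances)[:k]
print(nearest_ids, distances[nearest_ids])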

When To Use -

  • Search quality is a very high priority.
  • Search time does not matter OR when using a small index (<10K)

Below is the code that builds the index using the inner product; because we L2-normalize the vectors before adding and searching, the inner product is equivalent to cosine similarity.

def create_index(data, text_column, model):
    embedding = model.encode(data[text_column].to_list())

    # dimension of the embedding vectors
    dimension = embedding.shape[1]

    # create the embedding vectors and their IDs
    db_vectors = embedding.copy().astype(np.float32)
    db_ids = data.index.values.astype(np.int64)

    faiss.normalize_L2(db_vectors)
    index = faiss.IndexFlatIP(dimension)
    index = faiss.IndexIDMap(index)
    index.add_with_ids(db_vectors, db_ids)

    return index

def search_query(index, model, query, k):

    t=time.time()
    query_vector = model.encode([query]).astype(np.float32)
    faiss.normalize_L2(query_vector)

    similarities, similarities_ids = index.search(query_vector, k)
    print('totaltime: {}\n'.format(time.time()-t))

    similarities = np.clip(similarities, 0, 1)

    output = []
    for i in range(len(similarities_ids[0])):
        item = {
            'id': similarities_ids[0][i],
            'text': df_news.loc[similarities_ids[0][i], 'text'],
            'similarity_score': similarities[0][i]
        }
        output.append(item)

    return output

# build the flat inner-product index over the news text
index_flatIP = create_index(data=df_news, text_column='text', model=model)

# Check output 1
search_query(index=index_flatIP,
             model=model,
             query="Artificial Intelligence",
             k=5)

# Check output 2
search_query(index=index_flatIP,
             model=model,
             query="Data Science",
             k=5)
Create an index and search your query using IndexFlatIP

Poor Speed!

Using a flat index (IndexFlatL2 or IndexFlatIP) alone is computationally expensive; it doesn't scale well.

When using this index, we perform an exhaustive search: we compare our query vector xq to every other vector in our index, which in our case means roughly 98k inner-product calculations for every search.

Imagine the speed of our search for datasets containing 1M, 1B, or even more vectors — it will be insane!

So, how can we make our search faster?

There are two primary approaches:

  1. Reduce vector size — through dimensionality reduction or reducing the number of bits representing our vector values.
  2. Reduce search scope — we can do this by clustering or organizing vectors into tree structures based on certain attributes, similarity, or distance, and restricting our search to the closest clusters or filtering through the most similar branches.

Using either of these approaches means that we are no longer performing an exhaustive nearest-neighbors search but an approximate nearest-neighbors (ANN) search, as we no longer search the entire, full-resolution dataset. What we produce is a more balanced mix that trades a little search quality for much better search speed.

2. IndexIVFFlat

The Inverted File (IVF) index reduces the search scope through clustering. It's a very popular index as it's easy to use, with high search quality and reasonable search speed. It is built on the concept of Voronoi diagrams.

To understand Voronoi diagrams, imagine our high-dimensional vectors projected into a 2D space. We then place a few additional points in this space, which become the centroids of our clusters (Voronoi cells). We then extend an equal radius out from each centroid; at some point, the circumferences of these cell circles collide with one another, creating our cell edges.

Voronoi diagram

Using this method, we would take a query vector, identify the cell it belongs to, and then use our IndexFlatL2 (or another metric) to search between the query vector and all other vectors belonging to that specific cell.

So, we are reducing the scope of our search, producing an approximate answer, rather than exact (as produced through exhaustive search).

To implement this, we first initialize a flat L2 index, but this time we use it as the coarse quantizer step, which we feed into the partitioning IndexIVFFlat index.

# embeddings at module level: the same sentence embeddings we indexed before,
# normalized so L2 distances are consistent with the normalized query vectors in search_query
embedding = model.encode(df_news['text'].to_list()).astype(np.float32)
faiss.normalize_L2(embedding)
dimension = embedding.shape[1]

nlist = 60  # how many Voronoi cells/partitions
quantizer = faiss.IndexFlatL2(dimension)
indexIVFFlat = faiss.IndexIVFFlat(quantizer, dimension, nlist)

indexIVFFlat.is_trained  # False
# We now train our index on our data, which we must do before adding any data to the index.

indexIVFFlat.train(embedding)
indexIVFFlat.is_trained  # check that the index is now trained

indexIVFFlat.add(embedding)
indexIVFFlat.ntotal  # number of embeddings indexed

# With the index trained and the data added, we can search again using the same query function
# and the same query vectors as before.

def search_query(index, model, query, k):

    t=time.time()
    query_vector = model.encode([query]).astype(np.float32)
    faiss.normalize_L2(query_vector)

    similarities, similarities_ids = index.search(query_vector, k)
    print('totaltime: {}\n'.format(time.time()-t))

    similarities = np.clip(similarities, 0, 1)

    output = []
    for i in range(len(similarities_ids[0])):
        item = {
            'id': similarities_ids[0][i],
            'text': df_news.loc[similarities_ids[0][i], 'text']
        }
        output.append(item)

    return output

# Check output 1
search_query(index=indexIVFFlat,
             model=model,
             query="Artificial Intelligence",
             k=5)

# Check output 2
search_query(index=indexIVFFlat,
             model=model,
             query="Data Science",
             k=5)

For more details, refer to my GitHub repository.

Note:

  1. Now, because we’re searching a larger scope by increasing the nprobe value, we will see the search time increase too!

  2. A higher nlist means that we must compare our vector to more centroid vectors — but after selecting the nearest centroid’s cells to search, there will be fewer vectors within each cell. So, increase nlist to prioritize search-speed.

  3. As for nprobe, we find the opposite. Increasing nprobe increases the search scope — thus prioritizing search-quality.
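The IndexIVFFlat code above searches with FAISS's default nprobe = 1. A small sketch of how this parameter could be tuned (the values here are illustrative, not recommendations):

# nprobe = how many nearby Voronoi cells are visited at search time (default is 1)
indexIVFFlat.nprobe = 1   # fastest, but lower recall
search_query(index=indexIVFFlat, model=model, query="Artificial Intelligence", k=5)

indexIVFFlat.nprobe = 10  # larger search scope: better quality, longer search time
search_query(index=indexIVFFlat, model=model, query="Artificial Intelligence", k=5)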

3. IndexIVFPQ (Quantization)

Quantization is a generic method that refers to the compression of data into a smaller space.

  • Quantization reduces the scope S of possible vectors. Note that before quantization, this scope is effectively infinite.

  • There are many ways of doing this; clustering is one example. When we cluster a set of vectors, we replace the larger scope of potential values (all possible vectors) with a smaller, discrete, and symbolic set of centroids (the short k-means sketch after this list illustrates this).

  • And this is really how we can define a quantization operation: the transformation of a vector into a space with a finite number of possible values, where those values are symbolic representations of the original vector.
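As a small illustration of this idea (separate from the PQ index itself), here is a sketch that clusters random vectors with FAISS's built-in k-means and replaces each vector with the ID of its nearest centroid:

import numpy as np
import faiss

d, n, n_centroids = 64, 10000, 256
vectors = np.random.rand(n, d).astype(np.float32)  # toy data for illustration

# learn a finite set of centroids (the 'symbolic' values)
kmeans = faiss.Kmeans(d, n_centroids, niter=20, verbose=False)
kmeans.train(vectors)

# quantize: map every vector to the ID of its nearest centroid
_, centroid_ids = kmeans.index.search(vectors, 1)
print(centroid_ids[:5].ravel())  # each vector is now represented by a single ID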

How Product Quantization Works

Let’s work through the logic of PQ. We would usually have many vectors (all of equal length) — but for the sake of simplicity, we will use a single vector in our examples.

In short, PQ is the process of:

  • Taking a big, high-dimensional vector
  • Splitting it into equally sized chunks — our subvectors
  • Running a clustering algorithm over each chunk position and assigning every subvector to its nearest centroid (also called reproduction/reconstruction values). The number of centroids is set by the modeler in advance; the more centroids there are, the more accurate the result. The average distance between a subvector and its centroid is called the quantization error, and we want to minimize it, which we can do by increasing the number of centroids.
  • Replacing each subvector with the unique ID of its centroid. This ID is usually an 8-bit integer.

Earlier, each value was a 32-bit float and the vector had 128 dimensions, so each vector took 32 × 128 = 4,096 bits. Now we store only the vector of IDs: 8 subvectors × 8 bits per ID = 64 bits per vector.

At the end of the process, we’ve reduced a high-dimensional vector that requires a lot of memory to a tiny vector of IDs that requires very little memory.
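Applying the same arithmetic to the embeddings used in this post (384-dimensional all-MiniLM-L6-v2 vectors, with the m = 8 and bits = 8 settings used in the code below), a quick back-of-the-envelope sketch:

# back-of-the-envelope memory per vector (ignoring index overhead)
dimension = 384               # all-MiniLM-L6-v2 embedding size
m, bits = 8, 8                # PQ parameters used in the code below

original_bits = dimension * 32    # 384 float32 values -> 12,288 bits (1,536 bytes)
compressed_bits = m * bits        # 8 centroid IDs x 8 bits -> 64 bits (8 bytes)

print(original_bits, compressed_bits, original_bits / compressed_bits)  # ~192x smaller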

m = 8  # number of subvectors (centroid IDs) in each compressed vector
bits = 8  # number of bits per centroid ID (so 2^8 = 256 centroids per subspace)

quantizer = faiss.IndexFlatL2(dimension)  # we keep the same L2 distance flat index
indexIVFPQ = faiss.IndexIVFPQ(quantizer, dimension, nlist, m, bits)

indexIVFPQ.is_trained

indexIVFPQ.train(embedding)

indexIVFPQ.add(embedding)
indexIVFPQ.ntotal  # number of embeddings indexed

indexIVFPQ.nprobe = 8

# Check output 1
search_query(index=indexIVFPQ,
             model=model,
             query="Artificial Intelligence",
             k=5)

# Check output 2
search_query(index=indexIVFPQ,
             model=model,
             query="Data Science",
             k=5)

After adding PQ, the search time is similar to our IVF search time on this dataset, but as the data scales up, the search becomes significantly faster (and the index uses far less memory).

However, we should also take note of the slightly different results being returned because now the vectors are compressed!

For more details, refer to my GitHub repository.

Deploy using AWS serverless architecture

To deploy the FAISS-based semantic search model using AWS serverless architecture, I followed these steps:

  1. Containerize Your Code: First, you need to containerize your code and dependencies. You can create a Docker container that encapsulates your code, libraries, the FAISS library, and the saved model/index artifacts (a short sketch of saving these artifacts follows after this list). Ensure your code can be executed within the container environment.
  2. Choose a Serverless Service: AWS offers several serverless services, such as AWS Lambda and AWS Fargate, for deploying containerized applications. You’ll need to choose the one that best suits your requirements. For this example, let’s consider AWS Lambda.
  3. Create an AWS Lambda Function: Create the Lambda function from the container image you pushed to Amazon ECR.
  4. Set Environment Variables: If your code requires environment variables (e.g., model paths or API keys), set them in your Lambda function’s environment variables.
  5. Define an API Gateway: If you want to create an API for invoking your Lambda function, set up an Amazon API Gateway. This allows you to create RESTful APIs or HTTP endpoints to interact with your Lambda function.
  6. Deploy Your Lambda Function: Deploy your Lambda function and any associated API Gateway if you’ve set it up.
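The Lambda handler shown below expects the trained index, the sentence-transformer model, and the data file to be available inside the container (or on attached storage) at the paths given by its environment variables. A minimal sketch of how these artifacts might be saved before building the Docker image; the paths here are illustrative, not the exact ones used in the deployment:

# persist the artifacts that the Lambda function will load (paths are illustrative)
faiss.write_index(index_flatIP, "artifacts/index/news_summary.index")
model.save("artifacts/sentence_transformer_trained_model/")
df_news.to_csv("artifacts/input/news_summary_more.csv", index=False)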

To understand how to create a REST API using API Gateway, please refer to the article I wrote earlier.

import numpy as np
import pandas as pd
import os
import time
import json

import faiss
from sentence_transformers import SentenceTransformer
import torch


# grab model and index details from lambda environment variables
model_name_or_path = str(os.environ.get('model_name_or_path')) #"../artifacts/sentence_transformer_trained_model/"
index_path = str(os.environ.get('index_path')) #"../artifacts/index/"
data_path = str(os.environ.get('data_path')) #"../input/news_summary_more.csv"

def lambda_handler(event, context):

    try:
        # Parse the JSON input from the API Gateway
        request_body = json.loads(event['body'])

        # Retrieve the user query and top K from the JSON input
        user_query = request_body.get('query', '')
        k = int(request_body.get('top_k', 5))  # default to 5 results if top_k is missing

        # Load the model, index and data
        torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
        model = SentenceTransformer(model_name_or_path, device=torch_device)
        index = faiss.read_index(index_path)
        data = pd.read_csv(data_path)
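        # Note: loading the model, index, and data here repeats this work on every
        # invocation; moving these loads to module scope (outside the handler) lets
        # warm Lambda containers reuse them and reduces latency.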

        print("Sentence Transformer Model and Index are loaded.")

        response_data = search_your_query(data=data,
                                          index=index,
                                          model=model,
                                          query=user_query,
                                          k=k)

        # Create a response body
        response_body = {
            'message': 'Success',
            'data': response_data
        }

        # Return the response to the API Gateway
        return {
            'statusCode': 200,
            'body': json.dumps(response_body)
        }

    except Exception as e:
        # Handle any errors and return an error response
        return {
            'statusCode': 500,
            'body': json.dumps({'message': 'Error', 'error': str(e)})
        }

def search_your_query(data, index, model, query, k):

    t=time.time()
    query_vector = model.encode([query]).astype(np.float32)
    faiss.normalize_L2(query_vector)

    similarities, similarities_ids = index.search(query_vector, k)
    print('totaltime: {}\n'.format(time.time()-t))

    similarities = np.clip(similarities, 0, 1)

    output = []
    for i in range(len(similarities_ids[0])):
        item = {
            # cast NumPy types to native Python types so json.dumps can serialize them
            'id': int(similarities_ids[0][i]),
            'text': data.loc[similarities_ids[0][i], 'text'],
            'similarity_score': float(similarities[0][i])
        }
        output.append(item)

    return output
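Once the Lambda function is wired up behind an API Gateway endpoint, it can be called like any other REST API. A hedged example using the requests library; the URL is a placeholder, not a real endpoint:

import json
import requests

API_URL = "https://<your-api-id>.execute-api.<region>.amazonaws.com/prod/search"  # placeholder

payload = {"query": "Artificial Intelligence", "top_k": 5}
response = requests.post(API_URL, json=payload, timeout=30)

print(response.status_code)
print(json.dumps(response.json(), indent=2))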

What’s Under the Hood?

FAISS has a handful of features including:

  1. It is used in conjunction with deep learning models, making it a valuable tool for encoding text, images, and other data into high-dimensional vectors, which can then be efficiently indexed and searched.
  2. Faiss is used in various applications, including content-based recommendation systems, similarity-based search engines, image retrieval, and document clustering, to name a few.
  3. Faiss is designed to handle large datasets, making it suitable for both small-scale projects and large-scale production systems.

To conclude, Faiss is a versatile and efficient library for similarity search and indexing, particularly well-suited for handling high-dimensional data. Its capabilities in both exact and approximate search, GPU support, and a wide range of indexing structures make it a valuable resource for researchers and developers working on similarity-based applications.

Congratulations! 🎉👏🎊

You now understand how to build a robust semantic similarity search engine and deploy it with serverless magic leveraging Lambda and API Gateway.

If you enjoyed reading this article, comment "Hell Yes!" in the comment section and let me know if you have any feedback.

Feel free to follow me on Medium and GitHub, or say hi on LinkedIn. I am excited to discuss AI, ML, NLP, and MLOps!

Appendix

The end-to-end code for this module is saved to my GitHub repository.



