In the realm of digital data, traditional search engines often struggle to grasp the deeper intent behind user queries. Semantic search emerges as a solution, aiming to understand the context and meaning behind words and phrases. This article navigates through the intricacies of semantic search, exploring its mechanics and relevance in modern information retrieval. Additionally, we delve into the practical implementation of semantic search with LLMs using the Pinecone Vector Database, a managed, cloud-native solution optimized for high-performance search and similarity matching. Through hands-on examples and tutorials, we illuminate how semantic search, coupled with Pinecone, revolutionizes the search experience, delivering contextually relevant results with precision and efficiency.

Semantic Search

At its core, semantic search aims to enhance search accuracy and intelligence by understanding the meaning behind words and phrases. Unlike traditional search methods, which rely solely on keyword matches, semantic search leverages advanced algorithms to analyze the context, semantics, and intent embedded within the query. By deciphering the user’s true information needs, semantic search engines can deliver more relevant and contextually appropriate search results.

How Does Semantic Search Work?

Semantic search represents a paradigm shift in information retrieval, aiming to understand the underlying meaning and context behind user queries, rather than relying solely on keyword matches. At its core, semantic search engines employ advanced algorithms and techniques to decipher the semantic nuances inherent in user queries. This process involves several key steps:

  1. Query Understanding: Semantic search engines analyze user queries to discern the underlying intent, context, and semantic nuances embedded within them. Rather than treating queries as mere strings of keywords, these engines strive to unravel the true meaning behind the user’s information needs.
  2. Semantic Analysis: Once the user query is parsed, semantic search engines employ sophisticated algorithms to conduct semantic analysis. This entails extracting semantic entities, identifying relationships between words and phrases, and discerning contextual nuances such as synonyms, antonyms, and word ambiguity.
  3. Embedding Generation: One of the pivotal components of semantic search is the generation of embeddings – dense numerical representations that encapsulate the semantic essence of textual inputs. These embeddings are typically derived from Large Language Models (LLMs) such as BERT or GPT, which possess the capability to understand the semantic context and nuances of language.
  4. Vector Space Representation: The generated embeddings are then mapped into a high-dimensional vector space, where each vector represents a unique semantic concept or entity. By encoding textual information into numerical vectors, semantic search engines facilitate efficient comparison and analysis based on semantic similarity.
  5. Similarity Matching: With the query represented as a vector in the semantic space, semantic search engines employ similarity matching algorithms to identify documents or records in the database that exhibit the closest semantic resemblance to the query vector. This process often involves metrics such as cosine similarity or Euclidean distance to quantify the similarity between vectors.
  6. Ranking and Retrieval: Once the closest matches are identified, semantic search engines rank the retrieved documents based on their similarity scores and relevance to the query. This ensures that the most contextually relevant and semantically aligned results are presented to the user, enhancing the overall search experience.

In summary, semantic search engines operate by deciphering the underlying semantics of user queries, generating embeddings to represent textual information in a high-dimensional vector space, and employing similarity matching algorithms to retrieve contextually relevant search results. Through the fusion of advanced algorithms and semantic understanding, semantic search revolutionizes information retrieval by providing more accurate, contextually relevant, and nuanced search experiences.
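
To make these steps concrete, below is a minimal, self-contained sketch of the embed-compare-rank pipeline, assuming the sentence-transformers library (used later in this article) is installed; the model name, sample documents, and query are purely illustrative.

# A minimal sketch of steps 3-6 above: generate embeddings, place them in a
# shared vector space, score them with cosine similarity, and rank the results.
# Assumes sentence-transformers is installed; documents and query are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

documents = [
    "Tokyo is the most populous metropolitan area in the world.",
    "Semantic search ranks results by meaning rather than keywords.",
    "Cosine similarity measures the angle between two embedding vectors.",
]

# Steps 3-4: embedding generation and vector space representation
doc_vectors = model.encode(documents)
query_vector = model.encode("which city has the most people?")

# Steps 5-6: similarity matching (cosine) and ranking by score
scores = util.cos_sim(query_vector, doc_vectors)[0]
for score, doc in sorted(zip(scores.tolist(), documents), reverse=True):
    print(f"{score:.2f}  {doc}")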

Relevance with LLMs

Large Language Models play a pivotal role in semantic search, as they possess the ability to decipher the intent and meaning behind user queries. Through the generation of embeddings, LLMs transform textual inputs into numerical representations that encapsulate the semantic essence of the text. These embeddings serve as the foundation for semantic search engines, enabling them to conduct similarity matching and retrieve contextually relevant search results.

Pinecone Vector Database

Pinecone is a managed, cloud-native vector database designed for high-performance search and similarity matching of high-dimensional vector data. This type of data is often generated by large language models (LLMs), making Pinecone a valuable tool for applications utilizing them.

Key Features of Pinecone:

  • Simplicity: Pinecone offers a simple API and eliminates infrastructure management, allowing developers to focus on building applications.
  • Scalability: It handles billions of vectors with low latency and fresh, filtered query results.
  • Advanced Features: Pinecone goes beyond basic search, offering functionalities like:
    • Filtering: Refine searches based on additional metadata associated with vectors (illustrated in the sketch after this list).
    • Approximate Nearest Neighbor (ANN) search: Efficiently identify the closest matches within a large dataset.
    • Backups and collections: Ensure data security and manage specific subsets of data.
  • Security and Enterprise Readiness: Pinecone is SOC 2 and HIPAA certified, providing robust security and reliability for mission-critical applications.
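
As a brief illustration of the filtering capability listed above, the hypothetical query below restricts similarity matching to vectors whose metadata satisfies a condition. This is a minimal sketch assuming an existing index handle index, an encoded query vector xq, and the v2 Python client; the metadata field category and its value are illustrative only.

# Hypothetical sketch: metadata filtering with a Pinecone query (v2 client).
# Assumes an existing index handle `index` and an encoded query vector `xq`;
# the metadata field 'category' and its value are illustrative only.
results = index.query(
    vector=xq,
    top_k=5,
    include_metadata=True,
    filter={'category': {'$eq': 'travel'}}  # match only vectors tagged 'travel'
)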

Applicability for Large Language Models:

LLMs like LaMDA or GPT-3 generate text embeddings as numerical representations of semantic meaning. These embeddings can be stored and searched in Pinecone, enabling various functionalities:

  • Search within LLM Outputs: Find relevant text snippets or code generated by the LLM based on a user query.
  • Recommendation Systems: Recommend similar text content or code based on a user’s current query or interaction.
  • Document Retrieval: Retrieve similar documents from a vast corpus based on their semantic similarity.
  • Zero-shot Learning: Train LLMs on tasks without labeled data by finding similar examples from past interactions stored in Pinecone.

By efficiently searching and managing high-dimensional vector data, Pinecone empowers developers to leverage LLMs effectively in various applications.

Overall, Pinecone’s combination of simplicity, scalability, advanced features, and security makes it a valuable tool for developers working with large language models and other applications involving high-dimensional vector data.

Hands-on Implementations: Semantic Search with LLM in Python

With an understanding of semantic search in place, let’s delve into a hands-on implementation of semantic search with an LLM and the Pinecone Vector Database in Python, working through the code step by step. For the corresponding output of each code snippet, refer to the notebook embedded at the end of this section.

1. Installing Required Packages

The code begins with the installation of necessary Python packages using pip. These packages include:

  • pinecone-client[grpc] version 2.2.1: Pinecone client library for interacting with the Pinecone vector database, including support for gRPC communication.
  • datasets version 2.12.0: A library for easily accessing and loading various datasets, used here to load the Quora dataset.
  • sentence-transformers version 2.2.2: A library for generating sentence embeddings using pre-trained transformer models.
# Installing Required Packages
!pip install -qU \
  "pinecone-client[grpc]"==2.2.1 \
  datasets==2.12.0 \
  sentence-transformers==2.2.2

2. Loading Quora Dataset

The Quora dataset is loaded using the load_dataset function from the datasets library. Specifically, the training split from indices 240,000 to 290,000 is loaded, yielding a 50,000-row subset that keeps the demonstration manageable.

# Loading Quora Dataset
from datasets import load_dataset

dataset = load_dataset('quora', split='train[240000:290000]')
dataset

This code snippet loads a subset of the Quora dataset using the load_dataset function from the datasets library. Specifically, it loads the training split of the dataset from indices 240,000 to 290,000. The loaded dataset is stored in the variable dataset.

3. Extracting Questions

The questions from the loaded dataset are extracted into a list named questions. Each record in the Quora dataset holds a pair of question texts, both of which are added to the list. Duplicate questions are then removed to ensure uniqueness.

# Extracting Questions
questions = []

for record in dataset['questions']:
    questions.extend(record['text'])

# Remove duplicates
questions = list(set(questions))
print('\n'.join(questions[:5]))
print(len(questions))

In this step, the code iterates over each record in dataset['questions'] and extends the questions list with the question texts contained in that record. Duplicates are then removed by converting the list to a set and back, ensuring uniqueness. Finally, the first five unique questions are printed, along with the total number of unique questions.
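
For reference, each record in dataset['questions'] is a dictionary holding a pair of related questions, which is why extend (rather than append) is used to add both texts at once. Its shape is roughly as follows; the IDs and texts shown are illustrative:

# Illustrative structure of one record in dataset['questions'];
# actual IDs and question texts will differ.
{'id': [207550, 351729],
 'text': ['What is the best way to learn Python?',
          'How do I learn Python effectively?']}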

4. Building Embeddings and Upsert Format

The MiniLM-L6 sentence transformer model is initialized using the SentenceTransformer class from the sentence-transformers library. This model is used to generate embeddings for the questions. The embeddings, along with their corresponding IDs and metadata, are formatted into a list suitable for upserting into the Pinecone index.

from sentence_transformers import SentenceTransformer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
if device != 'cuda':
    print(f"You are using {device}. This is much slower than using "
          "a CUDA-enabled GPU. If on Colab you can change this by "
          "clicking Runtime > Change runtime type > GPU.")

model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
model

  1. It first checks if a GPU (CUDA) is available. If not, it prints a message informing the user that the process will be slower compared to using a GPU. It also provides guidance on how to switch to a GPU environment if using Google Colab.
  2. The MiniLM-L6 model is then loaded using the SentenceTransformer class from the sentence-transformers library. The model is loaded onto either the GPU or CPU based on the availability of CUDA.
  3. Finally, the initialized model object is stored in the variable model.

query = 'which city is the most populated in the world?'

xq = model.encode(query)
xq.shape

  1. A query string is defined as 'which city is the most populated in the world?'.
  2. The query string is encoded into a vector representation using the encode method of the initialized model.
  3. The shape of the resulting vector, xq, is obtained using the shape attribute.

This code generates a vector representation of the query using the all-MiniLM-L6-v2 sentence transformer and retrieves its shape. This model produces 384-dimensional embeddings, so xq.shape is (384,).

_id = '0'
metadata = {'text': query}

vectors = [(_id, xq, metadata)]

  1. An _id variable is assigned the value '0'.
  2. A metadata dictionary is created with the key 'text' mapped to the query string defined earlier.
  3. A list vectors is created containing a single tuple. The tuple consists of three elements:
    • The _id string.
    • The encoded vector representation of the query xq.
    • The metadata dictionary containing information about the text of the query.

This structure prepares the data in a format suitable for upserting into the Pinecone index. Each tuple in the vectors list represents a record to be inserted or updated in the index, with the _id serving as a unique identifier for each record.

5. Initializing and Creating the Pinecone Index

The Pinecone client library is initialized with the appropriate API key and environment variables. A new index named “semantic-search” is created if it does not already exist. The dimensionality of the embeddings and the similarity metric (cosine similarity) are specified during index creation.

import os
import pinecone

# get api key from app.pinecone.io
api_key = os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY'
# find your environment next to the api key in pinecone console
env = os.environ.get('PINECONE_ENVIRONMENT') or 'PINECONE_ENVIRONMENT'

pinecone.init(
    api_key=api_key,
    environment=env
)

This code snippet initializes the Pinecone client library for interacting with the Pinecone vector database. Here’s a breakdown of what each part does:

  1. import os: Imports the Python os module, which provides a portable way of using operating system-dependent functionality, including accessing environment variables.
  2. import pinecone: Imports the pinecone module, which provides functionality for interacting with the Pinecone vector database, such as initializing the client, creating indexes, and performing queries.
  3. api_key = os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY': Retrieves the Pinecone API key from the environment variables. If the API key is not found in the environment variables, it defaults to the string 'PINECONE_API_KEY'.
  4. env = os.environ.get('PINECONE_ENVIRONMENT') or 'PINECONE_ENVIRONMENT': Retrieves the Pinecone environment name from the environment variables. If the environment name is not found in the environment variables, it defaults to the string 'PINECONE_ENVIRONMENT'.
  5. pinecone.init(api_key=api_key, environment=env): Initializes the Pinecone client with the provided API key and environment name. This step is necessary before performing any operations with the Pinecone database. It establishes a connection to the Pinecone service using the provided credentials and environment settings.

index_name = 'semantic-search'

This line assigns the string 'semantic-search' to the variable index_name, the name of the index that will be created or accessed in the Pinecone vector database. Naming indexes is important for organization and identification, especially when working with multiple indexes; the name 'semantic-search' reflects this index’s purpose.

# only create index if it doesn't exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=model.get_sentence_embedding_dimension(),
        metric='cosine'
    )

# now connect to the index
index = pinecone.Index(index_name)

This code snippet performs the following tasks:

  1. Check Index Existence: It checks whether an index with the name stored in the variable index_name exists in the Pinecone vector database. This is done by calling the list_indexes() method of the Pinecone client, which returns the names of all existing indexes. If index_name is not found in this list, the index doesn’t exist yet.
  2. Create Index (if necessary): If the index does not exist (i.e., index_name is not found in the list of existing indexes), the code proceeds to create the index using the create_index() method of the Pinecone client. The parameters passed to this method include:
    • name: The name of the index, which is set to index_name.
    • dimension: The dimensionality of the vectors to be stored in the index. In this case, it is obtained by calling model.get_sentence_embedding_dimension(), which returns the dimensionality of the embeddings produced by the sentence transformer model.
    • metric: The distance metric used for similarity calculations. Here, 'cosine' is specified, indicating that cosine similarity will be used to measure the similarity between vectors.
  3. Connect to the Index: After either creating the index or confirming its existence, the code establishes a connection to the index using the Index() constructor provided by Pinecone. This creates an instance of the Index class representing the specified index, allowing further operations such as upserting vectors, querying, and deleting. The connected index instance is stored in the variable index for future use.
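
As a side note, because the client was installed with the [grpc] extra, the index handle can optionally be created over gRPC, which generally offers higher upsert throughput; a minimal sketch with the v2 client:

# Optional (v2 client with the [grpc] extra installed): a gRPC-backed index
# handle, which generally provides faster bulk upserts than the REST-based one.
index = pinecone.GRPCIndex(index_name)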

6. Upserting Data in Batches

The embeddings and metadata for the questions are upserted into the Pinecone index in batches to improve efficiency. This is done using a loop that iterates over the questions list in batch sizes. For each batch, IDs, metadata, and embeddings are created and then upserted into the Pinecone index using the upsert method.

from tqdm.auto import tqdm

batch_size = 128
vector_limit = 100000

questions = questions[:vector_limit]

for i in tqdm(range(0, len(questions), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(questions))
    # create IDs batch
    ids = [str(x) for x in range(i, i_end)]
    # create metadata batch
    metadatas = [{'text': text} for text in questions[i:i_end]]
    # create embeddings
    xc = model.encode(questions[i:i_end])
    # create records list for upsert
    records = zip(ids, xc, metadatas)
    # upsert to Pinecone
    index.upsert(vectors=records)

# check number of records in the index
index.describe_index_stats()

This code snippet performs the following tasks:

  1. Import tqdm Module: It imports the tqdm module, which provides a progress bar for iterations.
  2. Define Batch Parameters: The batch_size variable is set to 128, indicating the number of records processed in each batch. The vector_limit variable is set to 100,000, which limits the number of questions to be processed.
  3. Trim Questions: The questions list is trimmed to contain a maximum of vector_limit elements.
  4. Iterate Over Batches with tqdm: The code iterates over the questions list in batches using tqdm for visualizing the progress. The range of iteration is determined by the length of the questions list, with each iteration processing batch_size number of questions.
  5. Batch Processing: Within each iteration, the code:
    • Determines the end index of the current batch (i_end) using min(i+batch_size, len(questions)).
    • Generates a list of IDs for the batch using list comprehension.
    • Creates a list of metadata dictionaries for the batch, where each dictionary contains the text of a question.
    • Encodes the batch of questions into embeddings (xc) using the pre-trained model.
    • Zips the IDs, embeddings, and metadata into a list of tuples (records).
    • Upserts the batch of records into the Pinecone index using the upsert method.
  6. Check Index Stats: After processing all batches, the code retrieves and prints the statistics of the index using the describe_index_stats() method. This provides information such as the total number of vectors in the index and their dimensionality.

7. Making Queries and Semantic Search

Once the index is populated, semantic search queries are made to find similar questions based on user queries. The user provides a sample query, which is encoded into a vector using the sentence transformer model. Semantic search is then performed using the Pinecone index to retrieve the most similar questions to the query. The top-k most relevant results are returned along with their scores.

query = "which city has the highest population in the world?"

# create the query vector
xq = model.encode(query).tolist()

# now query
xc = index.query(vector=xq, top_k=5, include_metadata=True)
xc

In this code snippet:

  1. A query string is defined as "which city has the highest population in the world?".
  2. The query string is encoded into a vector representation using the encode method of the initialized model. The .tolist() method is used to convert the resulting tensor into a Python list.
  3. The encoded query vector xq is passed to the query method of the index object, which represents the Pinecone index. The top_k parameter is set to 5, indicating that the top 5 most similar records should be retrieved. The include_metadata parameter is set to True, indicating that metadata associated with the retrieved records should be included in the results.
  4. The result of the query is stored in the variable xc, which contains information about the most similar records to the query, including their IDs, scores, and metadata.
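
For reference, the returned object has roughly the following shape; the IDs, scores, and texts below are illustrative, not actual output:

# Illustrative shape of a query response (values are examples only):
{'matches': [
     {'id': '69331',
      'score': 0.78,
      'values': [],
      'metadata': {'text': 'What is the most populated city in the world?'}},
     # ... four more matches when top_k=5
 ],
 'namespace': ''}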

8. Displaying Query Results

The results of the semantic search are displayed in a readable format, showing the score and the corresponding question text for each match returned in the response.

for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

This code snippet iterates over the matches returned by the query and prints the score and text associated with each match. Let’s break it down:

  • xc['matches']: This retrieves the list of matches from the query result xc. Each match represents a record that is similar to the query.
  • for result in xc['matches']:: This starts a loop that iterates over each match in the list of matches.
  • print(f"{round(result['score'], 2)}: {result['metadata']['text']}"): Within the loop, this line prints the score and text associated with each match. Here’s what each part does:
    • round(result['score'], 2): This rounds the score associated with the match to two decimal places. It accesses the 'score' key of the current match (result) and applies the round function to it.
    • {result['metadata']['text']}: This retrieves the text associated with the match from its metadata. It accesses the 'metadata' dictionary of the current match (result) and then retrieves the value associated with the 'text' key.

So, this code effectively prints out the score and text of each match returned by the query, providing insights into the similarity between the query and each matched record.

9. Modifying Query and Repeating Semantic Search

The robustness of the semantic search system is tested by modifying the query and performing another semantic search. This helps assess if the system can still retrieve relevant questions even when the wording of the query is slightly different.

query = "which metropolis has the highest number of people?"

# create the query vector
xq = model.encode(query).tolist()

# now query
xc = index.query(vector=xq, top_k=5, include_metadata=True)
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

This code snippet performs a semantic search with a modified query and prints the top matching records. Here’s a breakdown:

  • query = "which metropolis has the highest number of people?": Defines a new query string with a modified question.
  • xq = model.encode(query).tolist(): Encodes the modified query into a vector representation using the pre-trained model, and converts it into a Python list.
  • xc = index.query(vector=xq, top_k=5, include_metadata=True): Performs a semantic search using the encoded query vector. The top_k parameter specifies to retrieve the top 5 most similar records. The include_metadata parameter is set to True to include metadata associated with the retrieved records.
  • for result in xc['matches']:: Iterates over the matches returned by the query.
  • print(f"{round(result['score'], 2)}: {result['metadata']['text']}"): Within the loop, prints the score and text associated with each match. The score represents the similarity between the query and the matched record, and the text is the content of the matched record.

Overall, this code snippet demonstrates how the semantic search system can handle modified queries and retrieve relevant records based on semantic similarity, providing insights into the effectiveness of the search algorithm.

10. Deleting the Index

Finally, after completing all operations, including data insertion, semantic search, and testing, the Pinecone index is deleted to release allocated resources and ensure efficient resource management.

pinecone.delete_index(index_name)


This line of code deletes the Pinecone index with the name stored in the variable index_name. Deleting an index removes all stored data and frees up resources associated with that index. It’s important to be cautious when using this operation, as it permanently removes all data stored in the index and cannot be undone.

In this context, index_name is the name of the index created earlier in the code. Calling pinecone.delete_index(index_name) removes that index from the Pinecone vector database. This step is typically done when the index is no longer needed or when the script has completed its tasks and resources should be released.
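
As a defensive variant (an assumption on our part, not part of the original code), the deletion can be guarded so that the call does not fail if the index has already been removed:

# Guarded deletion (v2 client): delete the index only if it still exists.
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)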

Reference Notebook

Conclusion

In summary, the provided code outlines a robust process for semantic search using Pinecone and MiniLM-L6. It loads a dataset, generates embeddings, and upserts them into Pinecone indexes. With semantic queries, it efficiently retrieves relevant records. The system showcases adaptability by handling modified queries effectively. This guide underscores the importance of efficient vector storage and powerful language models in semantic search systems.
