In the realm of digital data, traditional search engines often struggle to grasp the deeper intent behind user queries. Semantic search emerges as a solution, aiming to understand the context and meaning behind words and phrases. This article navigates through the intricacies of semantic search, exploring its mechanics and relevance in modern information retrieval. Additionally, we delve into the practical implementation of semantic search with LLMs using the Pinecone Vector Database, a managed, cloud-native solution optimized for high-performance search and similarity matching. Through hands-on examples and tutorials, we illuminate how semantic search, coupled with Pinecone, revolutionizes the search experience, delivering contextually relevant results with precision and efficiency.
Semantic Search
At its core, semantic search aims to enhance search accuracy and intelligence by understanding the meaning behind words and phrases. Unlike traditional search methods, which rely solely on keyword matches, semantic search leverages advanced algorithms to analyze the context, semantics, and intent embedded within the query. By deciphering the user’s true information needs, semantic search engines can deliver more relevant and contextually appropriate search results.
How Does Semantic Search Work?
Semantic search represents a paradigm shift in information retrieval, aiming to understand the underlying meaning and context behind user queries, rather than relying solely on keyword matches. At its core, semantic search engines employ advanced algorithms and techniques to decipher the semantic nuances inherent in user queries. This process involves several key steps:
- Query Understanding: Semantic search engines analyze user queries to discern the underlying intent, context, and semantic nuances embedded within them. Rather than treating queries as mere strings of keywords, these engines strive to unravel the true meaning behind the user’s information needs.
- Semantic Analysis: Once the user query is parsed, semantic search engines employ sophisticated algorithms to conduct semantic analysis. This entails extracting semantic entities, identifying relationships between words and phrases, and discerning contextual nuances such as synonyms, antonyms, and word ambiguity.
- Embedding Generation: One of the pivotal components of semantic search is the generation of embeddings – dense numerical representations that encapsulate the semantic essence of textual inputs. These embeddings are typically derived from Large Language Models (LLMs) such as BERT or GPT, which possess the capability to understand the semantic context and nuances of language.
- Vector Space Representation: The generated embeddings are then mapped into a high-dimensional vector space, where each vector represents a unique semantic concept or entity. By encoding textual information into numerical vectors, semantic search engines facilitate efficient comparison and analysis based on semantic similarity.
- Similarity Matching: With the query represented as a vector in the semantic space, semantic search engines employ similarity matching algorithms to identify documents or records in the database that exhibit the closest semantic resemblance to the query vector. This process often involves metrics such as cosine similarity or Euclidean distance to quantify the similarity between vectors.
- Ranking and Retrieval: Once the closest matches are identified, semantic search engines rank the retrieved documents based on their similarity scores and relevance to the query. This ensures that the most contextually relevant and semantically aligned results are presented to the user, enhancing the overall search experience.
In summary, semantic search engines operate by deciphering the underlying semantics of user queries, generating embeddings to represent textual information in a high-dimensional vector space, and employing similarity matching algorithms to retrieve contextually relevant search results. Through the fusion of advanced algorithms and semantic understanding, semantic search revolutionizes information retrieval by providing more accurate, contextually relevant, and nuanced search experiences.
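To make this pipeline concrete, here is a minimal, illustrative sketch (separate from the tutorial code later in this article) that embeds a toy corpus and a query with the sentence-transformers library, scores each document with cosine similarity, and ranks the results. The example sentences are assumptions chosen for demonstration; the all-MiniLM-L6-v2 model is the same one used in the hands-on section below.
# Minimal sketch of the semantic search pipeline: embed -> compare -> rank.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

# A toy corpus standing in for an indexed document collection.
corpus = [
    "Tokyo is the most populous metropolitan area in the world.",
    "Photosynthesis converts sunlight into chemical energy.",
    "Which programming language should I learn for data science?",
]
query = "what is the largest city on earth?"

# Embedding generation for the corpus and the query.
corpus_vecs = model.encode(corpus)
query_vec = model.encode(query)

# Similarity matching via cosine similarity.
def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine_similarity(query_vec, v) for v in corpus_vecs]

# Ranking and retrieval: highest similarity first.
for score, text in sorted(zip(scores, corpus), reverse=True):
    print(f"{score:.2f}: {text}")
The query about the largest city should rank the Tokyo sentence first, because the two share meaning even though they share almost no keywords. A vector database such as Pinecone performs the same similarity matching at scale over millions or billions of vectors.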
Relevance with LLMs
Large Language Models play a pivotal role in semantic search, as they possess the ability to decipher the intent and meaning behind user queries. Through the generation of embeddings, LLMs transform textual inputs into numerical representations that encapsulate the semantic essence of the text. These embeddings serve as the foundation for semantic search engines, enabling them to conduct similarity matching and retrieve contextually relevant search results.
Pinecone Vector Database
Pinecone is a managed, cloud-native vector database designed for high-performance search and similarity matching of high-dimensional vector data. This type of data is often generated by large language models (LLMs), making Pinecone a valuable tool for applications utilizing them.
Key Features of Pinecone:
- Simplicity: Pinecone offers a simple API and eliminates infrastructure management, allowing developers to focus on building applications.
- Scalability: It handles billions of vectors with low latency and fresh, filtered query results.
- Advanced Features: Pinecone goes beyond basic search, offering functionalities like:
- Filtering: Refine searches based on additional metadata associated with vectors (see the sketch after this list).
- Approximate Nearest Neighbor (ANN) search: Efficiently identify the closest matches within a large dataset.
- Backups and collections: Ensure data security and manage specific subsets of data.
- Security and Enterprise Readiness: Pinecone is SOC 2 and HIPAA certified, providing robust security and reliability for mission-critical applications.
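As a brief illustration of the filtering feature, the sketch below restricts a similarity query by metadata. It is illustrative only: the index name, the 'category' and 'year' metadata fields, and the placeholder query vector are assumptions, and the call follows the pinecone-client 2.x API used in the hands-on section later in this article.
# Illustrative only: a Pinecone query restricted by a metadata filter.
import os
import pinecone

pinecone.init(
    api_key=os.environ['PINECONE_API_KEY'],
    environment=os.environ['PINECONE_ENVIRONMENT'],
)
index = pinecone.Index('articles')  # hypothetical index name

# In practice this vector would come from an embedding model.
query_vector = [0.1] * 384  # placeholder of the index's dimensionality

results = index.query(
    vector=query_vector,
    top_k=5,
    include_metadata=True,
    # Only consider vectors tagged as news articles from 2023 onwards.
    filter={"category": {"$eq": "news"}, "year": {"$gte": 2023}},
)
for match in results['matches']:
    print(match['score'], match['metadata'])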
Applicability for Large Language Models:
LLMs like LaMDA or GPT-3 generate text embeddings as numerical representations of semantic meaning. These embeddings can be stored and searched in Pinecone, enabling various functionalities:
- Search within LLM Outputs: Find relevant text snippets or code generated by the LLM based on a user query.
- Recommendation Systems: Recommend similar text content or code based on a user’s current query or interaction.
- Document Retrieval: Retrieve similar documents from a vast corpus based on their semantic similarity.
- Zero-shot Learning: Support LLMs on tasks without labeled training data by retrieving similar examples from past interactions stored in Pinecone (see the sketch below).
By efficiently searching and managing high-dimensional vector data, Pinecone empowers developers to leverage LLMs effectively in various applications.
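As a rough sketch of the last point, the snippet below retrieves the most similar past interactions for a new question and assembles them into a few-shot prompt for an LLM. The index name 'qa-history' and the 'question'/'answer' metadata fields are hypothetical, and the snippet assumes the Pinecone client has already been initialized.
# Illustrative only: retrieve similar past examples from Pinecone to
# build a few-shot prompt. Index and metadata field names are hypothetical.
from sentence_transformers import SentenceTransformer
import pinecone

model = SentenceTransformer('all-MiniLM-L6-v2')
index = pinecone.Index('qa-history')  # assumes pinecone.init() was already called

new_question = "How do I reset my account password?"
query_vec = model.encode(new_question).tolist()

# Fetch the three most similar stored interactions.
results = index.query(vector=query_vec, top_k=3, include_metadata=True)

# Assemble the retrieved examples into a prompt for the LLM.
examples = "\n\n".join(
    f"Q: {m['metadata']['question']}\nA: {m['metadata']['answer']}"
    for m in results['matches']
)
prompt = f"{examples}\n\nQ: {new_question}\nA:"
print(prompt)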
Overall, Pinecone’s combination of simplicity, scalability, advanced features, and security makes it a valuable tool for developers working with large language models and other applications involving high-dimensional vector data.
Hands-on Implementations: Semantic Search with LLM in Python
Now that we have an understanding of semantic search, let’s delve into a hands-on implementation of semantic search with an LLM and the Pinecone Vector Database in Python. We will work through the code step by step in the sections below. For the corresponding output of each code snippet, refer to the notebook embedded at the end of this section.
1. Installing Required Packages
The code begins with the installation of the necessary Python packages using pip. These packages include:
- pinecone-client[grpc] (version 2.2.1): Pinecone client library for interacting with the Pinecone vector database, including support for gRPC communication.
- datasets (version 2.12.0): A library for easily accessing and loading various datasets, used here to load the Quora dataset.
- sentence-transformers (version 2.2.2): A library for generating sentence embeddings using pre-trained transformer models.
# Installing Required Packages
!pip install -qU \
"pinecone-client[grpc]"==2.2.1 \
datasets==2.12.0 \
sentence-transformers==2.2.2
2. Loading Quora Dataset
The Quora dataset is loaded using the load_dataset function from the datasets library. Specifically, the training split of the dataset from indices 240,000 to 290,000 is loaded. This slice provides a manageable subset of the full dataset for demonstration purposes.
# Loading Quora Dataset
from datasets import load_dataset
dataset = load_dataset('quora', split='train[240000:290000]')
dataset
This code snippet loads a subset of the Quora dataset using the load_dataset function from the datasets library: the training split from indices 240,000 to 290,000. The loaded dataset is stored in the variable dataset.
3. Extracting Questions
The questions from the loaded dataset are extracted into a list named questions. Each question's text is appended to this list, and duplicate questions are removed to ensure uniqueness.
# Extracting Questions
questions = []
for record in dataset['questions']:
    questions.extend(record['text'])
# Remove duplicates
questions = list(set(questions))
print('\n'.join(questions[:5]))
print(len(questions))
In this step, the code iterates over each record in dataset['questions'] and extracts the text of the questions, appending them to the list questions. Duplicates are then removed from the list to ensure uniqueness. Finally, the first five unique questions are printed, along with the total number of unique questions.
4. Building Embeddings and Upsert Format
The MiniLM-L6 sentence transformer model is initialized using the SentenceTransformer class from the sentence-transformers library. This model is used to generate embeddings for the questions. The embeddings, along with their corresponding IDs and metadata, are formatted into a list suitable for upserting into the Pinecone index.
from sentence_transformers import SentenceTransformer
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
if device != 'cuda':
    print(f"You are using {device}. This is much slower than using "
          "a CUDA-enabled GPU. If on Colab you can change this by "
          "clicking Runtime > Change runtime type > GPU.")
model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
model
- It first checks if a GPU (CUDA) is available. If not, it prints a message informing the user that the process will be slower compared to using a GPU. It also provides guidance on how to switch to a GPU environment if using Google Colab.
- The MiniLM-L6 model is then loaded using the SentenceTransformer class from the sentence-transformers library. The model is loaded onto either the GPU or CPU based on the availability of CUDA.
- Finally, the initialized model object is stored in the variable model.
query = 'which city is the most populated in the world?'
xq = model.encode(query)
xq.shape
- A query string is defined as 'which city is the most populated in the world?'.
- The query string is encoded into a vector representation using the encode method of the initialized model.
- The shape of the resulting vector, xq, is obtained using the shape attribute.
This code essentially generates a vector representation of the query using the MiniLM-L6 sentence transformer model and retrieves its shape, indicating the dimensions of the resulting vector.
_id = '0'
metadata = {'text': query}
vectors = [(_id, xq, metadata)]
- An _id variable is assigned the value '0'.
- A metadata dictionary is created with the key 'text' mapped to the query string defined earlier.
- A list vectors is created containing a single tuple. The tuple consists of three elements:
  - The _id string.
  - The encoded vector representation of the query, xq.
  - The metadata dictionary containing information about the text of the query.
This structure prepares the data in a format suitable for upserting into the Pinecone index. Each tuple in the vectors list represents a record to be inserted or updated in the index, with the _id serving as a unique identifier for each record.
5. Initializing and Creating the Pinecone Index
The Pinecone client library is initialized with the appropriate API key and environment variables. A new index named “semantic-search” is created if it does not already exist. The dimensionality of the embeddings and the similarity metric (cosine similarity) are specified during index creation.
import os
import pinecone

# get api key from app.pinecone.io
api_key = os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY'
# find your environment next to the api key in pinecone console
env = os.environ.get('PINECONE_ENVIRONMENT') or 'PINECONE_ENVIRONMENT'

pinecone.init(
    api_key=api_key,
    environment=env
)
This code snippet initializes the Pinecone client library for interacting with the Pinecone vector database. Here’s a breakdown of what each part does:
- import os: Imports the Python os module, which provides a portable way of using operating system-dependent functionality, including accessing environment variables.
- import pinecone: Imports the pinecone module, which provides functionality for interacting with the Pinecone vector database, such as initializing the client, creating indexes, and performing queries.
- api_key = os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY': Retrieves the Pinecone API key from the environment variables. If the API key is not found in the environment variables, it defaults to the placeholder string 'PINECONE_API_KEY'.
- env = os.environ.get('PINECONE_ENVIRONMENT') or 'PINECONE_ENVIRONMENT': Retrieves the Pinecone environment name from the environment variables. If the environment name is not found in the environment variables, it defaults to the placeholder string 'PINECONE_ENVIRONMENT'.
- pinecone.init(api_key=api_key, environment=env): Initializes the Pinecone client with the provided API key and environment name. This step is necessary before performing any operations with the Pinecone database; it establishes a connection to the Pinecone service using the provided credentials and environment settings.
index_name = 'semantic-search'
This line of code simply assigns the string 'semantic-search' to the variable index_name. This variable represents the name of the index that will be created or accessed in the Pinecone vector database. Naming indexes is important for organization and identification purposes, especially when working with multiple indexes. The chosen name 'semantic-search' indicates that this index will be used for semantic search operations.
# only create index if it doesn't exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=model.get_sentence_embedding_dimension(),
        metric='cosine'
    )

# now connect to the index
index = pinecone.Index(index_name)
This code snippet performs the following tasks:
- Check Index Existence: It checks whether an index with the name stored in the variable index_name exists in the Pinecone vector database. This is done by calling the list_indexes() method of the Pinecone client, which returns the names of all existing indexes. If index_name is not found in this list, the index doesn't exist yet.
- Create Index (if necessary): If the index does not exist, the code creates it using the create_index() method of the Pinecone client. The parameters passed to this method include:
  - name: The name of the index, which is set to index_name.
  - dimension: The dimensionality of the vectors to be stored in the index. In this case, it is obtained by calling model.get_sentence_embedding_dimension(), which returns the dimensionality of the embeddings produced by the sentence transformer model.
  - metric: The distance metric used for similarity calculations. Here, 'cosine' is specified, indicating that cosine similarity will be used to measure the similarity between vectors.
- Connect to the Index: After either creating the index or confirming its existence, the code establishes a connection to the index using the Index() constructor provided by Pinecone. This creates an instance of the Index class representing the specified index, allowing further operations such as upserting vectors, querying, and deleting. The connected index instance is stored in the variable index for future use.
6. Upserting Data in Batches
The embeddings and metadata for the questions are upserted into the Pinecone index in batches to improve efficiency. This is done using a loop that iterates over the questions list in batches. For each batch, IDs, metadata, and embeddings are created and then upserted into the Pinecone index using the upsert method.
from tqdm.auto import tqdm
batch_size = 128
vector_limit = 100000
questions = questions[:vector_limit]
for i in tqdm(range(0, len(questions), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(questions))
    # create IDs batch
    ids = [str(x) for x in range(i, i_end)]
    # create metadata batch
    metadatas = [{'text': text} for text in questions[i:i_end]]
    # create embeddings
    xc = model.encode(questions[i:i_end])
    # create records list for upsert
    records = zip(ids, xc, metadatas)
    # upsert to Pinecone
    index.upsert(vectors=records)
# check number of records in the index
index.describe_index_stats()
This code snippet performs the following tasks:
- Import tqdm Module: It imports the tqdm module, which provides a progress bar for iterations.
- Define Batch Parameters: The batch_size variable is set to 128, indicating the number of records processed in each batch. The vector_limit variable is set to 100,000, which limits the number of questions to be processed.
- Trim Questions: The questions list is trimmed to contain at most vector_limit elements.
- Iterate Over Batches with tqdm: The code iterates over the questions list in batches, using tqdm to visualize progress. The range of iteration is determined by the length of the questions list, with each iteration processing batch_size questions.
- Batch Processing: Within each iteration, the code:
  - Determines the end index of the current batch (i_end) using min(i+batch_size, len(questions)).
  - Generates a list of IDs for the batch using a list comprehension.
  - Creates a list of metadata dictionaries for the batch, where each dictionary contains the text of a question.
  - Encodes the batch of questions into embeddings (xc) using the pre-trained model.
  - Zips the IDs, embeddings, and metadata into a sequence of records.
  - Upserts the batch of records into the Pinecone index using the upsert method.
- Check Index Stats: After processing all batches, the code retrieves and prints the statistics of the index using the describe_index_stats() method. This provides information such as the total number of vectors in the index, their dimensionality, and per-namespace counts.
7. Making Queries and Semantic Search
Once the index is populated, semantic search queries are made to find similar questions based on user queries. The user provides a sample query, which is encoded into a vector using the sentence transformer model. Semantic search is then performed using the Pinecone index to retrieve the most similar questions to the query. The top-k most relevant results are returned along with their scores.
query = "which city has the highest population in the world?"
# create the query vector
xq = model.encode(query).tolist()
# now query
xc = index.query(vector=xq, top_k=5, include_metadata=True)
xc
In this code snippet:
- A query string is defined as "which city has the highest population in the world?".
- The query string is encoded into a vector representation using the encode method of the initialized model. The .tolist() method converts the resulting array into a Python list.
- The encoded query vector xq is passed to the query method of the index object, which represents the Pinecone index. The top_k parameter is set to 5, indicating that the top 5 most similar records should be retrieved. The include_metadata parameter is set to True, so that metadata associated with the retrieved records is included in the results.
- The result of the query is stored in the variable xc, which contains information about the most similar records to the query, including their IDs, scores, and metadata.
8. Displaying Query Results
The results of the semantic search are displayed in a readable format, showing the score and the corresponding question text for each match returned in the response.
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")
This code snippet iterates over the matches returned by the query and prints the score and text associated with each match. Let’s break it down:
- xc['matches']: Retrieves the list of matches from the query result xc. Each match represents a record that is similar to the query.
- for result in xc['matches']:: Starts a loop that iterates over each match in the list of matches.
- print(f"{round(result['score'], 2)}: {result['metadata']['text']}"): Within the loop, prints the score and text associated with each match. Here's what each part does:
  - round(result['score'], 2): Rounds the score associated with the match to two decimal places. It accesses the 'score' key of the current match (result) and applies the round function to it.
  - result['metadata']['text']: Retrieves the text associated with the match from its metadata. It accesses the 'metadata' dictionary of the current match (result) and then retrieves the value associated with the 'text' key.
So, this code effectively prints out the score and text of each match returned by the query, providing insights into the similarity between the query and each matched record.
9. Modifying Query and Repeating Semantic Search
The robustness of the semantic search system is tested by modifying the query and performing another semantic search. This helps assess if the system can still retrieve relevant questions even when the wording of the query is slightly different.
query = "which metropolis has the highest number of people?"
# create the query vector
xq = model.encode(query).tolist()
# now query
xc = index.query(vector=xq, top_k=5, include_metadata=True)
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")
This code snippet performs a semantic search with a modified query and prints the top matching records. Here’s a breakdown:
- query = "which metropolis has the highest number of people?": Defines a new query string with a modified question.
- xq = model.encode(query).tolist(): Encodes the modified query into a vector representation using the pre-trained model and converts it into a Python list.
- xc = index.query(vector=xq, top_k=5, include_metadata=True): Performs a semantic search using the encoded query vector. The top_k parameter specifies that the top 5 most similar records should be retrieved, and the include_metadata parameter is set to True to include metadata associated with the retrieved records.
- for result in xc['matches']:: Iterates over the matches returned by the query.
- print(f"{round(result['score'], 2)}: {result['metadata']['text']}"): Within the loop, prints the score and text associated with each match. The score represents the similarity between the query and the matched record, and the text is the content of the matched record.
Overall, this code snippet demonstrates how the semantic search system can handle modified queries and retrieve relevant records based on semantic similarity, providing insights into the effectiveness of the search algorithm.
10. Deleting the Index
Finally, after completing all operations, including data insertion, semantic search, and testing, the Pinecone index is deleted to release allocated resources and ensure efficient resource management.
pinecone.delete_index(index_name)
This line of code deletes the Pinecone index with the name stored in the variable index_name. Deleting an index removes all stored data and frees up the resources associated with that index. It's important to be cautious when using this operation, as it permanently removes all data stored in the index and cannot be undone.
In this context, index_name represents the name of the index that was created earlier in the code. By calling pinecone.delete_index(index_name), the index with that name is deleted from the Pinecone vector database. This step is typically done when the index is no longer needed or when the script has completed its tasks and needs to clean up resources.
Reference Notebook
Conclusion
In summary, the provided code outlines a robust process for semantic search using Pinecone and MiniLM-L6. It loads a dataset, generates embeddings, and upserts them into Pinecone indexes. With semantic queries, it efficiently retrieves relevant records. The system showcases adaptability by handling modified queries effectively. This guide underscores the importance of efficient vector storage and powerful language models in semantic search systems.