In today’s digital age, information is abundant and readily available on the internet. However, navigating the sheer volume of web pages to find relevant information can be daunting. Fortunately, modern tools and frameworks can streamline this process. In this article, we combine large language models (LLMs) with web scraping techniques to build an LLM-powered web reader application using LangChain.
Understanding LangChain
LangChain is a framework designed to simplify the development of applications leveraging large language models (LLMs). With LangChain, developers can easily integrate LLMs into their applications to perform various natural language processing tasks such as document analysis, summarization, chatbots, and code analysis.
The Power of Apify
Apify is a cloud platform tailored for web scraping and data extraction tasks. It offers a vast ecosystem of pre-built apps called Actors, designed for various web scraping, crawling, and data extraction use cases. Actors can scrape data from websites like Google, Instagram, Facebook, Amazon, and more, making it a versatile tool for extracting information from the web.
Combining LangChain and Apify
By combining the capabilities of LangChain and Apify, developers can create sophisticated applications that leverage the power of LLMs for analyzing and extracting insights from web content. This integration enables the creation of a web reader application capable of extracting text content from web pages, analyzing it with an LLM, and providing relevant information in response to user queries.
Building an LLM-Powered Web Reader
In this article, we will demonstrate how to build an LLM-powered web reader using LangChain and Apify. The process involves the following steps:
- Setting up environment variables: Before getting started, we need to set up environment variables for authentication with the OpenAI and Apify APIs. This ensures secure access to the required services throughout the application.
- Using Apify for web scraping: We will utilize the Apify platform to crawl website content using the “website-content-crawler” Actor. This Actor can deeply crawl websites such as documentation, knowledge bases, help centers, or blogs, and extract text content from web pages.
- Creating a vector index: Once the content is extracted from web pages, we will create a vector index using the VectorstoreIndexCreator provided by LangChain. This index will allow us to efficiently query the crawled content and retrieve relevant information.
- Querying the index with LLM: With the vector index in place, we can now leverage the power of LLMs to analyze the crawled content and answer user queries. We will demonstrate how to formulate queries and retrieve answers using the LLM-powered index.
Building an LLM-Powered Web Reader: Step-by-step guide
Now that we’ve covered how LangChain and Apify complement each other, let’s dive into the hands-on implementation. The following sections walk through setting up the environment, crawling web content with Apify, creating a vector index with LangChain, and querying that index with an LLM to extract insights from web content.
1. Installation of required packages:
!pip install apify-client openai langchain chromadb tiktoken
This command installs the necessary Python packages required for running the code.
2. Importing necessary modules:
import os

from langchain.docstore.document import Document
from langchain.indexes import VectorstoreIndexCreator
from langchain.utilities import ApifyWrapper
These lines import the modules the code below actually uses: os for setting environment variables, Document to wrap each scraped page, VectorstoreIndexCreator to build the index, and ApifyWrapper to call Apify Actors.
3. Setting environment variables:
os.environ["OPENAI_API_KEY"] = "*****pass your token here******"
os.environ["APIFY_API_TOKEN"] = "****pass your token here*******"
Here, environment variables are set for the OpenAI API key and the Apify API token. This allows authentication with these services.
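Hardcoding tokens directly in source files risks leaking them if the code is shared. A safer pattern is to set the variables outside the code (e.g. in the shell) and fail fast if they are missing. The sketch below illustrates this; the helper name require_env is our own, not part of any library:

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, failing fast if it is unset."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# In a real run these are set outside the source code,
# e.g. `export OPENAI_API_KEY=...` in the shell before launching.
os.environ.setdefault("OPENAI_API_KEY", "sk-placeholder")
os.environ.setdefault("APIFY_API_TOKEN", "apify-placeholder")

openai_key = require_env("OPENAI_API_KEY")
apify_token = require_env("APIFY_API_TOKEN")
```

Both LangChain and the Apify client read these variables automatically, so no further wiring is needed once they are set.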
4. Using Apify to crawl website content:
apify = ApifyWrapper()
loader = apify.call_actor(
    actor_id="apify/website-content-crawler",
    run_input={"startUrls": [{"url": "https://ambilio.com/"}]},
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"] or "", metadata={"source": item["url"]}
    ),
)
This code segment uses the Apify API to run the “website-content-crawler” Actor, which crawls the specified URL (“https://ambilio.com/”) and stores the extracted text in a dataset. Each dataset item is then mapped to a Document object, with the page text as content and the page URL as source metadata, for further processing.
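The dataset_mapping_function runs once per dataset item. To see its behavior without calling the Apify API, the sketch below applies the same mapping to hand-written sample items; the Doc dataclass is a stand-in for LangChain’s Document, and the sample items are invented to mirror the text/url keys used above:

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Minimal stand-in for LangChain's Document class."""
    page_content: str
    metadata: dict = field(default_factory=dict)

# Invented sample items shaped like website-content-crawler dataset rows.
items = [
    {"text": "Ambilio builds AI solutions.", "url": "https://ambilio.com/"},
    {"text": None, "url": "https://ambilio.com/empty-page"},  # a page with no text
]

# The same mapping used in call_actor: `or ""` guards against missing text.
docs = [
    Doc(page_content=item["text"] or "", metadata={"source": item["url"]})
    for item in items
]

for d in docs:
    print(repr(d.page_content), d.metadata["source"])
```

The `or ""` fallback matters: pages that yield no text become empty Documents instead of raising an error later during indexing.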
5. Creating a vector index from the crawled content:
index = VectorstoreIndexCreator().from_loaders([loader])
A vector index is created using the VectorstoreIndexCreator from the content loaded by the loader. This index allows for efficient querying of the crawled content.
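Under the hood, VectorstoreIndexCreator splits the documents into chunks, embeds each chunk (by default into a Chroma store using OpenAI embeddings), and at query time retrieves the chunks most similar to the query. The toy sketch below illustrates that retrieval idea using bag-of-words vectors and cosine similarity in place of real embeddings; the chunk texts and function names are our own illustrations, not library code:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words term-frequency vector."""
    return Counter(w.strip(".,?!") for w in text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Stand-ins for text chunks extracted from crawled pages.
chunks = [
    "Ambilio builds GenAI solutions for retail and enterprise clients.",
    "LangChain simplifies building applications on top of LLMs.",
    "Apify offers Actors for web scraping and crawling.",
]

def top_chunk(query: str) -> str:
    """Return the chunk most similar to the query vector."""
    q = embed(query)
    return max(chunks, key=lambda c: cosine(q, embed(c)))

print(top_chunk("what does Apify offer for scraping?"))
```

Real embeddings capture semantic similarity rather than word overlap, but the retrieval loop, embed the query and rank stored chunks by similarity, is the same.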
6. Querying the index:
query = "what are Top 5 GenAI-Based Solutions for Retail?"
result = index.query_with_sources(query)
print(result["answer"])
print(result["sources"])
query = "What is Ambilio?"
result = index.query_with_sources(query)
print(result["answer"])
print(result["sources"])
Note: run the code with your own API tokens to see the outputs.
These lines execute queries against the index created earlier. The first query retrieves information about top GenAI-based solutions for retail, while the second query seeks information about Ambilio. The results are then printed, including the answers and the sources of the information.
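The result of query_with_sources is a dict whose "sources" value is a single string, typically comma-separated URLs. A small helper can turn that into a readable citation list; the function name format_result and the sample dict below are our own illustrations, assuming that result shape:

```python
def format_result(result: dict) -> str:
    """Render a query_with_sources-style result as an answer plus a source list."""
    answer = result.get("answer", "").strip()
    # Split the comma-separated sources string into individual URLs.
    sources = [s.strip() for s in result.get("sources", "").split(",") if s.strip()]
    lines = [answer] + [f"  - {s}" for s in sources]
    return "\n".join(lines)

# Hypothetical result dict shaped like the ones printed above.
sample = {
    "answer": "Ambilio is an AI consulting and product company.",
    "sources": "https://ambilio.com/, https://ambilio.com/about",
}
print(format_result(sample))
```

Printing the sources alongside each answer lets readers verify claims against the original pages, which is especially useful when the LLM summarizes across several crawled documents.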
Conclusion
By harnessing the capabilities of LangChain and Apify, developers can build sophisticated web reader applications that extract and analyze text content from the web using LLMs. This integration opens up a wide range of possibilities for applying natural language processing to surface insights and valuable information to users. The steps above, from crawling with an Apify Actor to querying a vector index, provide a template you can adapt to your own sites and questions.