Spaces:
Sleeping
A newer version of the Gradio SDK is available:
5.29.0
Knowledge Base Website Loader
This loader is a web crawler and scraper that fetches text content from websites hosting public knowledge bases. Examples are the Intercom help center or the Robinhood help center. Typically these sites have a directory structure with several sections and many articles in each section. This loader crawls and finds all links that match the article path provided, and scrapes the content of each article. This can be used to create bots that answer customer questions based on public documentation.
It uses Playwright to drive a browser. This reduces the chance of getting blocked by Cloudflare or other CDNs, but makes it a bit more challenging to run on cloud services.
Usage
First run
playwright install
This installs the browsers that Playwright requires.
To use this loader, you need to pass in the root URL and the string to search for in the URL to tell if the crawler has reached an article. You also need to pass in several CSS selectors so the cralwer knows which links to follow and which elements to extract content from. use
from llama_index import download_loader
KnowledgeBaseWebReader = download_loader("KnowledgeBaseWebReader")
loader = KnowledgeBaseWebReader()
documents = loader.load_data(
root_url="https://www.intercom.com/help",
link_selectors=[".article-list a", ".article-list a"],
article_path="/articles",
body_selector=".article-body",
title_selector=".article-title",
subtitle_selector=".article-subtitle",
)
Examples
This loader is designed to be used as a way to load data into LlamaIndex and/or subsequently used as a Tool in a LangChain Agent.
LlamaIndex
from llama_index import VectorStoreIndex, download_loader
KnowledgeBaseWebReader = download_loader("KnowledgeBaseWebReader")
loader = KnowledgeBaseWebReader()
documents = loader.load_data(
root_url="https://support.intercom.com",
link_selectors=[".article-list a", ".article-list a"],
article_path="/articles",
body_selector=".article-body",
title_selector=".article-title",
subtitle_selector=".article-subtitle",
)
index = VectorStoreIndex.from_documents(documents)
index.query("What languages does Intercom support?")
LangChain
Note: Make sure you change the description of the Tool
to match your use-case.
from llama_index import VectorStoreIndex, download_loader
from langchain.agents import initialize_agent, Tool
from langchain.llms import OpenAI
from langchain.chains.conversation.memory import ConversationBufferMemory
KnowledgeBaseWebReader = download_loader("KnowledgeBaseWebReader")
loader = KnowledgeBaseWebReader()
documents = loader.load_data(
root_url="https://support.intercom.com",
link_selectors=[".article-list a", ".article-list a"],
article_path="/articles",
body_selector=".article-body",
title_selector=".article-title",
subtitle_selector=".article-subtitle",
)
index = VectorStoreIndex.from_documents(documents)
tools = [
Tool(
name="Website Index",
func=lambda q: index.query(q),
description=f"Useful when you want answer questions about a product that has a public knowledge base.",
),
]
llm = OpenAI(temperature=0)
memory = ConversationBufferMemory(memory_key="chat_history")
agent_chain = initialize_agent(
tools, llm, agent="zero-shot-react-description", memory=memory
)
output = agent_chain.run(input="What languages does Intercom support?")