Unstructured.io URL Loader
This loader extracts the text from URLs using Unstructured.io. The partition_html function partitions an HTML document and returns a list of document Element objects.
Usage
from llama_index import download_loader

UnstructuredURLLoader = download_loader("UnstructuredURLLoader")

urls = [
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023",
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023",
]

loader = UnstructuredURLLoader(
    urls=urls, continue_on_failure=False, headers={"User-Agent": "value"}
)
loader.load()
Note:
If the installed version of unstructured is older than 0.5.7 and headers is not an empty dict, the loader emits a warning (You are using old version of unstructured. The headers parameter is ignored).
If you create the UnstructuredURLLoader object without the headers parameter, or with an empty dict, no warning is shown.
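The version rule in this note can be illustrated with a small standalone sketch. It is not the loader's actual code: the names `parse_version`, `effective_headers`, and `MIN_HEADERS_VERSION` are hypothetical, and the version parsing assumes plain dotted numeric versions such as 0.5.7.

```python
import warnings
from typing import Dict, Optional, Tuple

# First unstructured release that honours headers, per the note above.
MIN_HEADERS_VERSION = (0, 5, 7)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '0.5.7' into (0, 5, 7). Assumes purely numeric dotted parts."""
    return tuple(int(part) for part in version.split("."))


def effective_headers(
    installed_version: str, headers: Optional[Dict[str, str]] = None
) -> Dict[str, str]:
    """Return the headers that would actually be used (hypothetical helper).

    Warns and drops the headers only when they are non-empty AND the
    installed version is older than MIN_HEADERS_VERSION.
    """
    headers = headers or {}
    if headers and parse_version(installed_version) < MIN_HEADERS_VERSION:
        warnings.warn(
            "You are using old version of unstructured. "
            "The headers parameter is ignored"
        )
        return {}
    return headers
```

Comparing versions as integer tuples rather than strings keeps a release such as 0.10.0 correctly ordered after 0.5.7. Real version strings can carry suffixes (for example pre-release tags), which this sketch does not handle.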