Support for LangChain Integration
Is it possible to load this quantised model for integration into LangChain via LangChain's HuggingFace Local Pipeline integration? The original MPT-7B-Instruct could be loaded in a similar fashion.
Check out ctransformers. This has LangChain integration and supports CPU inference on these GGML MPT models.
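For reference, here is a minimal sketch of loading the GGML file directly with ctransformers (outside LangChain); the local path is a placeholder for wherever you downloaded the q5_0 file:

from ctransformers import AutoModelForCausalLM

# Load the quantized GGML weights on CPU; model_type tells ctransformers this is an MPT model.
llm = AutoModelForCausalLM.from_pretrained('path/to/mpt-7b-instruct.ggmlv3.q5_0.bin',
                                           model_type='mpt')
print(llm("Write a one-line summary of what MPT-7B-Instruct is."))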
Here is my partial code with this model; the rest can be found in the LangChain and ctransformers docs. It works well.
from langchain.vectorstores import FAISS
from ctransformers.langchain import CTransformers
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceInstructEmbeddings

llm = CTransformers(model='D:\\Ai\\models\\MPT-7B-Instruct-GGML\\mpt-7b-instruct.ggmlv3.q5_0.bin',
                    model_type='mpt')
instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2",
                                                      model_kwargs={"device": "cpu"})
db = FAISS.load_local("faiss_index", instructor_embeddings)
retriever = db.as_retriever(search_kwargs={"k": 3})
qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                       chain_type="stuff",
                                       retriever=retriever)
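For completeness, a minimal usage sketch (the query string is just a placeholder):

# Run the retrieval-augmented chain against the FAISS index built above.
answer = qa_chain.run("Your question about the indexed documents goes here")
print(answer)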
If this code is used with the llama-65B-GGML model, the qa_chain.run method takes a very long time. How can this be solved?
When trying the code above, it returns OSError: /lib64/libm.so.6: version `GLIBC_2.29' not found for the ctransformers library. Is there any way to use ctransformers without upgrading the GLIBC version?
Then just leave out the CT_CUBLAS=1 part:
pip install ctransformers --no-binary ctransformers
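For reference, the CUDA-enabled build that the CT_CUBLAS=1 flag refers to looks like this (per the ctransformers README):

CT_CUBLAS=1 pip install ctransformers --no-binary ctransformers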
@vsns It would be great if you could share a more complete example where this works for you. I have been trying your example and others from LangChain on many of these models, but the responses are nonsensical and/or completely outside the context. Very similar code just works with OpenAI models (ada for embeddings and 3.5-turbo as the model), which makes me wonder whether I am doing something wrong or these models are just not capable.
Here you go:
import typer
# 0xVs
from ctransformers.langchain import CTransformers
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import SentenceTransformersTokenTextSplitter
from langchain.document_loaders import PDFPlumberLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from rich import print
from rich.prompt import Prompt

app = typer.Typer()
device = "cpu"


@app.command()
def import_pdfs(dir: str, embedding_model="sentence-transformers/all-MiniLM-L6-v2"):
    loader = DirectoryLoader(dir, glob="./*.pdf", loader_cls=PDFPlumberLoader, show_progress=True)
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)
    embeddings = HuggingFaceInstructEmbeddings(model_name=embedding_model,
                                               model_kwargs={"device": device})
    db = FAISS.from_documents(docs, embeddings)
    db.save_local("faiss_index")


@app.command()
def question(model_path: str = "./models/mpt-7b-instruct.ggmlv3.q5_0.bin",
             model_type='mpt',
             embedding_model="sentence-transformers/all-MiniLM-L6-v2",
             search_breadth: int = 5, threads: int = 6, temperature: float = 0.4):
    embeddings = HuggingFaceInstructEmbeddings(model_name=embedding_model,
                                               model_kwargs={"device": device})
    config = {'temperature': temperature, 'threads': threads}
    llm = CTransformers(model=model_path, model_type=model_type, config=config)
    db = FAISS.load_local("faiss_index", embeddings)
    retriever = db.as_retriever(search_kwargs={"k": search_breadth})
    memory = ConversationBufferMemory(memory_key="chat_history", output_key="answer", return_messages=True)
    qa = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever,
                                               memory=memory, return_source_documents=True)
    while True:
        query = Prompt.ask('[bright_yellow]\nQuestion[/bright_yellow] ')
        res = qa({"question": query})
        print("[spring_green4]" + res['answer'] + "[/spring_green4]")
        if "source_documents" in res:
            print("\n[italic grey46]References[/italic grey46]:")
            for ref in res["source_documents"]:
                print("> [grey19]" + ref.metadata['source'] + "[/grey19]")


if __name__ == "__main__":
    app()
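Assuming the script is saved as test.py, typer exposes the two functions as subcommands: python test.py import-pdfs <dir> builds the FAISS index from a folder of PDFs, and python test.py question starts the interactive loop (typer converts underscores in function names to hyphens by default).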
Some notes:
- From my experience (take it with a pinch of salt), for QA, creating good vector data matters more than the model (I avoid proprietary systems and models).
- I haven't tested the code much, so multiple optimizations are possible: a different embedding model, a custom prompt template (a sketch follows after these notes), configuration tweaks, etc.
- I'm currently considering VMware/open-llama-7b-open-instruct with llama-cpp-python, since I'm not getting good results when using documents from narrow domains with little text.
- Ultimately I plan to ship a single static binary (with the naive assumption that qdrant can be packed inside it) using Rustformers and falcon-40b-instruct, once support is available there.
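As a rough illustration of the custom-prompt-template idea mentioned above, here is a sketch that assumes LangChain's combine_docs_chain_kwargs argument and MPT-7B-Instruct's instruction-style prompt format; the template wording is only an example, and llm, retriever, and memory are the objects built in the code above:

from langchain.prompts import PromptTemplate

# Hypothetical instruction-style template for the "stuff" documents chain.
template = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Answer the question using only the following context.
{context}

Question: {question}

### Response:"""
qa_prompt = PromptTemplate(template=template, input_variables=["context", "question"])

qa = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory,
                                           return_source_documents=True,
                                           combine_docs_chain_kwargs={"prompt": qa_prompt})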
What should "question" be in 'python test.py question'? A string? Another Python file?
I am getting:
AttributeError: 'CTransformers' object has no attribute 'task'
It appears to come from this block of code:
huggingface_pipeline.py:169, in HuggingFacePipeline._call(self, prompt, stop, run_manager)
    162 def _call(
    163     self,
    164     prompt: str,
    165     stop: Optional[List[str]] = None,
    166     run_manager: Optional[CallbackManagerForLLMRun] = None,
    167 ) -> str:
    168     response = self.pipeline(prompt)
--> 169     if self.pipeline.task == "text-generation":
    170         # Text generation return includes the starter text.
    171         text = response[0]["generated_text"][len(prompt) :]
    172     elif self.pipeline.task == "text2text-generation":
It looks like we need to add some sort of pipeline abstraction to ctransformers now?
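For comparison, a minimal sketch using ctransformers' own LangChain wrapper (as in the earlier snippets), which does not go through HuggingFacePipeline and therefore never reaches the pipeline.task check; the path is a placeholder:

from ctransformers.langchain import CTransformers

# ctransformers ships a LangChain LLM class of its own, so HuggingFacePipeline is not needed here.
llm = CTransformers(model='path/to/mpt-7b-instruct.ggmlv3.q5_0.bin', model_type='mpt')
print(llm("Write one sentence about mountains."))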
How can I increase the context_length and max_input_seq_token of this quantized MPT model?
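I haven't verified this on the MPT models, but ctransformers has a context_length config option that can be passed through the LangChain wrapper; here is a sketch under that assumption (max_input_seq_token is not a ctransformers parameter, so only the context window is covered here):

# Assumes ctransformers' context_length config applies to this MPT GGML model; path is a placeholder.
config = {'context_length': 4096, 'max_new_tokens': 512}
llm = CTransformers(model='path/to/mpt-7b-instruct.ggmlv3.q5_0.bin', model_type='mpt', config=config)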