Loading any TheBloke GGUF model with CTransformers from LangChain results in the maximum context length being limited to 512
2024-05-28 13:51:08,540 - INFO - Loading LLM TheBloke/LLaMA-Pro-8B-GGUF using mode = GGUF
2024-05-28 13:51:08,835 - INFO - Model's Default Context Length: 2048
2024-05-28 13:51:08,835 - INFO - Using context length: 2048
Fetching 1 files: 100%|███████████████████████████████████████████████████████████████████████| 1/1 [00:00<?, ?it/s]
Fetching 1 files: 100%|███████████████████████████████████████████████████████████████████████| 1/1 [00:00<?, ?it/s]
2024-05-28 13:52:44,008 - INFO - Loading faiss with AVX2 support.
2024-05-28 13:52:44,085 - INFO - Successfully loaded faiss with AVX2 support.
2024-05-28 13:52:53,592 - INFO - Managing returned Sources ... mode = GGUF
2024-05-28 13:52:53,603 - WARNING - Number of tokens (1155) exceeded maximum context length (512).
Even though the log reports "Using context length: 2048", the warning above shows that the effective maximum context length is still 512. The model is loaded with:
try:
    logging.info(f"Loading LLM {self.llm_model} using mode = {self.mode}")
    config = transformers.AutoConfig.from_pretrained(self.llm_model)
    model_context_length = getattr(config, "max_position_embeddings", None)
    if model_context_length:
        logging.info(f"Model's Default Context Length: {model_context_length}")
        # Ensure context_length is within the model's maximum context length
        context_length = min(4096, model_context_length)
        logging.info(f"Using context length: {context_length}")
    else:
        context_length = 2048  # Fallback if not specified
    model = CTransformers(
        model=self.llm_model,
        batch_size=52,
        max_new_tokens=1024,
        context_length=context_length,
        gpu_layers=0
    )
    return model
except OSError as e:
    logging.error(f"Error loading {self.llm_model} model: {e}")
except Exception as e:
    logging.error(f"Unexpected error loading {self.llm_model} model: {e}")
and it is then invoked with:
generated_text = self.loaded_model.invoke(
    formatted_prompt,
    **generate_kwargs,
    do_sample=True,
    stream=True,
    details=True,
    return_full_text=False,
)
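To separate the context-length problem from my generate_kwargs, this is the stripped-down call I would use as a repro. It assumes the llm instance from the sketch above, and the repeated string is only a rough way to get well past 512 tokens:

# Minimal-repro sketch (untested): a long prompt, invoked with no
# extra generation kwargs at all.
long_prompt = "lorem ipsum dolor sit amet " * 200
print(llm.invoke(long_prompt))
# If context_length was never applied at construction time, I expect
# ctransformers to log the same warning as above:
#   Number of tokens (...) exceeded maximum context length (512).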