LlamaCppChatGenerator
LlamaCppChatGenerator enables chat completion using an LLM running on Llama.cpp.
| Most common position in a pipeline | After a ChatPromptBuilder |
| Mandatory init variables | "model": The path of the model to use |
| Mandatory run variables | "messages": A list of ChatMessage instances representing the input messages |
| Output variables | "replies": A list of ChatMessage instances containing all the replies generated by the LLM |
| API reference | Llama.cpp |
| GitHub 链接 | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp |
Overview
Llama.cpp is a library written in C/C++ for efficient inference of Large Language Models. It leverages the efficient quantized GGUF format, dramatically reducing memory requirements and accelerating inference. This means it is possible to run LLMs efficiently on standard machines (even without GPUs).
Llama.cpp uses the quantized binary file of the LLM in GGUF format, which can be downloaded from Hugging Face. LlamaCppChatGenerator supports models running on Llama.cpp: pass the path to a locally saved GGUF file as the model parameter at initialization.
Installation
Install the llama-cpp-haystack package to use this integration:
pip install llama-cpp-haystack
Using a different compute backend
The default installation behavior is to build llama.cpp for CPU on Linux and Windows and to use Metal on macOS. To use a different compute backend:
- Follow the instructions on the llama.cpp installation page to install llama-cpp-python for your preferred compute backend.
- Install llama-cpp-haystack using the command above.
For example, to use llama-cpp-haystack with the cuBLAS backend, run the following commands:
export GGML_CUDA=1
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
pip install llama-cpp-haystack
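To confirm that the installed llama-cpp-python build can actually offload layers to the GPU, you can query its low-level bindings (a sketch; llama_supports_gpu_offload is assumed to be exposed by your llama-cpp-python version):
# Prints True if the installed build was compiled with GPU offload support
import llama_cpp
print(llama_cpp.llama_supports_gpu_offload())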
Usage
- Download the GGUF version of the desired LLM. The GGUF versions of popular models can be downloaded from Hugging Face (a download sketch follows the initialization example below).
- Initialize LlamaCppChatGenerator with the path to the GGUF file and specify the required model and text generation parameters:
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage
generator = LlamaCppChatGenerator(
model="/content/openchat-3.5-1210.Q3_K_S.gguf",
n_ctx=512,
n_batch=128,
model_kwargs={"n_gpu_layers": -1},
generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages)
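To download the GGUF file for step 1 programmatically, you can use the huggingface_hub library (a minimal sketch; the repository ID and filename shown are examples and may differ for the model you want):
# pip install huggingface_hub
from huggingface_hub import hf_hub_download
# Example repo and filename -- replace them with the GGUF model you want to use
model_path = hf_hub_download(
    repo_id="TheBloke/openchat-3.5-1210-GGUF",
    filename="openchat-3.5-1210.Q3_K_S.gguf",
    local_dir=".",
)
print(model_path)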
Passing additional model parameters
The model, n_ctx, and n_batch arguments have been exposed for convenience and can be passed directly to the Generator during initialization as keyword arguments. Note that model translates to llama.cpp's model_path parameter.
The model_kwargs parameter can pass additional arguments when initializing the model. In case of duplication, these parameters override the model, n_ctx, and n_batch initialization parameters.
See Llama.cpp's LLM documentation for more information on the available model arguments.
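As a minimal sketch of the override behavior described above (the values are illustrative only), the model below is loaded with a context size of 2048, because the n_ctx passed through model_kwargs takes precedence over the convenience argument:
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
# n_ctx from model_kwargs overrides the n_ctx keyword argument
generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    model_kwargs={"n_ctx": 2048},
)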
Note: Llama.cpp automatically extracts the chat_template from the model metadata to format ChatMessages. You can override the chat_template used by passing a custom chat_handler or chat_format as a model parameter.
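For instance, a custom chat format could be supplied through model_kwargs (a sketch; "chatml" is one of the built-in formats of llama-cpp-python, adjust it to your model):
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
# Force the "chatml" prompt format instead of the template stored in the GGUF metadata
generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    model_kwargs={"chat_format": "chatml"},
)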
For example, to offload the model to the GPU during initialization:
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage
generator = LlamaCppChatGenerator(
model="/content/openchat-3.5-1210.Q3_K_S.gguf",
n_ctx=512,
n_batch=128,
model_kwargs={"n_gpu_layers": -1}
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages, generation_kwargs={"max_tokens": 128})
generated_reply = result["replies"][0].content
print(generated_reply)
Passing text generation parameters
The generation_kwargs parameter can pass additional generation arguments like max_tokens, temperature, top_k, top_p, and others to the model during inference.
See Llama.cpp's Chat Completion API documentation for more information on the available generation arguments.
Note: JSON mode, Function Calling, and Tools are all supported as generation_kwargs. Please see the llama-cpp-python GitHub README for more information on how to use them.
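For instance, JSON mode can be requested at inference time through generation_kwargs (a sketch based on llama-cpp-python's response_format option; adjust it for your version and model):
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage
generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
)
generator.warm_up()
messages = [ChatMessage.from_user("List three American actors as a JSON array.")]
result = generator.run(
    messages,
    generation_kwargs={
        "max_tokens": 128,
        # Constrain the reply to valid JSON (llama-cpp-python's JSON mode)
        "response_format": {"type": "json_object"},
    },
)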
For example, to set the max_tokens and temperature:
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage
generator = LlamaCppChatGenerator(
model="/content/openchat-3.5-1210.Q3_K_S.gguf",
n_ctx=512,
n_batch=128,
generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages)
The generation_kwargs can also be passed to the run method of the generator directly:
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage
generator = LlamaCppChatGenerator(
model="/content/openchat-3.5-1210.Q3_K_S.gguf",
n_ctx=512,
n_batch=128,
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(
messages,
generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
In a pipeline
We use the LlamaCppChatGenerator in a Retrieval-Augmented Generation pipeline on the Simple Wikipedia Dataset from Hugging Face and generate answers using the OpenChat-3.5 LLM.
Load the dataset:
# Install HuggingFace Datasets using "pip install datasets"
from datasets import load_dataset
from haystack import Document, Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders import ChatPromptBuilder
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.dataclasses import ChatMessage
# Import LlamaCppChatGenerator
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
# Load first 100 rows of the Simple Wikipedia Dataset from HuggingFace
dataset = load_dataset("pszemraj/simple_wikipedia", split="validation[:100]")
docs = [
Document(
content=doc["text"],
meta={
"title": doc["title"],
"url": doc["url"],
},
)
for doc in dataset
]
Index the documents into the InMemoryDocumentStore using the SentenceTransformersDocumentEmbedder and DocumentWriter:
doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
# Install sentence transformers using "pip install sentence-transformers"
doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
# Indexing Pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
indexing_pipeline.add_component(instance=DocumentWriter(document_store=doc_store), name="DocWriter")
indexing_pipeline.connect("DocEmbedder", "DocWriter")
indexing_pipeline.run({"DocEmbedder": {"documents": docs}})
Create the RAG pipeline and add the LlamaCppChatGenerator to it:
system_message = ChatMessage.from_system(
"""
Answer the question using the provided context.
Context:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}
"""
)
user_message = ChatMessage.from_user("Question: {{question}}")
assistant_message = ChatMessage.from_assistant("Answer: ")
chat_template = [system_message, user_message, assistant_message]
rag_pipeline = Pipeline()
text_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
# Load the LLM using LlamaCppChatGenerator
model_path = "openchat-3.5-1210.Q3_K_S.gguf"
generator = LlamaCppChatGenerator(model=model_path, n_ctx=4096, n_batch=128)
rag_pipeline.add_component(
instance=text_embedder,
name="text_embedder",
)
rag_pipeline.add_component(instance=InMemoryEmbeddingRetriever(document_store=doc_store, top_k=3), name="retriever")
rag_pipeline.add_component(instance=ChatPromptBuilder(template=chat_template), name="prompt_builder")
rag_pipeline.add_component(instance=generator, name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
rag_pipeline.connect("text_embedder", "retriever")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm", "answer_builder")
rag_pipeline.connect("retriever", "answer_builder.documents")
Run the pipeline:
question = "Which year did the Joker movie release?"
result = rag_pipeline.run(
{
"text_embedder": {"text": question},
"prompt_builder": {"question": question},
"llm": {"generation_kwargs": {"max_tokens": 128, "temperature": 0.1}},
"answer_builder": {"query": question},
}
)
generated_answer = result["answer_builder"]["answers"][0]
print(generated_answer.data)
# The Joker movie was released on October 4, 2019.
