Module document_store

BM25DocumentStats

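A class for managing document statistics for BM25 retrieval.

Parameters:

freq_token: A counter of token frequencies in the document.
doc_len: The number of tokens in the document.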
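
As a rough illustration of the two fields above, the sketch below builds per-document statistics from a list of tokens. The names `DocStatsSketch` and `build_stats` are hypothetical stand-ins for this example and are not the library's own class.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class DocStatsSketch:
    """Hypothetical stand-in mirroring the two documented fields."""

    freq_token: Counter  # how often each token occurs in the document
    doc_len: int  # total number of tokens in the document


def build_stats(tokens: list[str]) -> DocStatsSketch:
    # Count every token once and record the overall document length.
    return DocStatsSketch(freq_token=Counter(tokens), doc_len=len(tokens))


stats = build_stats(["to", "be", "or", "not", "to", "be"])
print(stats.freq_token["to"], stats.doc_len)
```

BM25 variants need exactly these two quantities per document: the term frequencies and the length used for length normalization.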

InMemoryDocumentStore

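Stores data in memory. The store is ephemeral: data is lost when the instance is destroyed unless it is explicitly written out with save_to_disk.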

InMemoryDocumentStore.__init__

def __init__(bm25_tokenization_regex: str = r"(?u)\b\w\w+\b",
             bm25_algorithm: Literal["BM25Okapi", "BM25L",
                                     "BM25Plus"] = "BM25L",
             bm25_parameters: Optional[dict] = None,
             embedding_similarity_function: Literal["dot_product",
                                                    "cosine"] = "dot_product",
             index: Optional[str] = None,
             async_executor: Optional[ThreadPoolExecutor] = None,
             return_embedding: bool = True)

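Initializes the DocumentStore.

Parameters:

bm25_tokenization_regex: The regular expression used to tokenize text for BM25 retrieval.
bm25_algorithm: The BM25 algorithm to use. One of "BM25Okapi", "BM25L", or "BM25Plus".
bm25_parameters: Parameters for the BM25 implementation, as a dictionary. For example: {'k1': 1.5, 'b': 0.75, 'epsilon': 0.25}. See https://github.com/dorianbrown/rank_bm25 for more information on these parameters.
embedding_similarity_function: The similarity function used to compare document embeddings. One of "dot_product" (default) or "cosine". To choose the most appropriate function, check the documentation of your embedding model.
index: A specific index to store the documents in. If not specified, a random UUID is used. Using the same index allows you to store documents across multiple InMemoryDocumentStore instances.
async_executor: An optional ThreadPoolExecutor to use for async calls. If not provided, a single-threaded executor is initialized and used.
return_embedding: Whether to return the embeddings of retrieved documents. Defaults to True.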
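
To see what the default `bm25_tokenization_regex` matches, the sketch below applies it with Python's `re` module. The helper name `tokenize` is hypothetical, and any extra preprocessing the store may apply (such as lowercasing) is deliberately left out.

```python
import re

# Default pattern from the signature above: runs of two or more word characters.
DEFAULT_REGEX = r"(?u)\b\w\w+\b"


def tokenize(text: str, pattern: str = DEFAULT_REGEX) -> list[str]:
    # findall returns every non-overlapping match in order of appearance.
    return re.findall(pattern, text)


tokens = tokenize("A cat sat on a mat in 2024")
print(tokens)  # ['cat', 'sat', 'on', 'mat', 'in', '2024']
```

Single-character tokens such as "A" and "a" are dropped, which matters for languages or corpora where one-character words carry meaning; a custom pattern can be passed in that case.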

InMemoryDocumentStore.__del__

def __del__()

Cleans up when the instance is destroyed.

InMemoryDocumentStore.shutdown

def shutdown()

Explicitly shuts down the executor if this instance owns it.

InMemoryDocumentStore.storage

@property
def storage() -> dict[str, Document]

Utility property that returns the storage used by this InMemoryDocumentStore instance.

InMemoryDocumentStore.to_dict

def to_dict() -> dict[str, Any]

Serializes the component to a dictionary.

Returns:

A dictionary with the serialized data.

InMemoryDocumentStore.from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "InMemoryDocumentStore"

Deserializes the component from a dictionary.

Parameters:

data: The dictionary to deserialize from.

Returns:

The deserialized component.

InMemoryDocumentStore.save_to_disk

def save_to_disk(path: str) -> None

Writes the database and its data to disk as a JSON file.

Parameters:

path: The path to the JSON file.

InMemoryDocumentStore.load_from_disk

@classmethod
def load_from_disk(cls, path: str) -> "InMemoryDocumentStore"

Loads the database and its data from a JSON file on disk.

Parameters:

path: The path to the JSON file.

Returns:

The loaded InMemoryDocumentStore.

InMemoryDocumentStore.count_documents

def count_documents() -> int

Returns the number of documents present in the DocumentStore.

InMemoryDocumentStore.filter_documents

def filter_documents(
        filters: Optional[dict[str, Any]] = None) -> list[Document]

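Returns the documents that match the provided filters.

For a detailed description of filters, see the DocumentStore.filter_documents() protocol documentation.

Parameters:

filters: The filters to apply to the document list.

Returns:

A list of documents that match the given filters.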

InMemoryDocumentStore.write_documents

def write_documents(documents: list[Document],
                    policy: DuplicatePolicy = DuplicatePolicy.NONE) -> int

See the DocumentStore.write_documents() protocol documentation.

If policy is set to DuplicatePolicy.NONE, it defaults to DuplicatePolicy.FAIL.
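
The fallback from NONE to FAIL can be pictured with a toy model. Everything here is an assumption for illustration: the `Policy` enum is a hypothetical stand-in for DuplicatePolicy, the SKIP and OVERWRITE members are included only to contrast with FAIL, and `ValueError` stands in for whatever error the store raises on a duplicate ID.

```python
from enum import Enum


class Policy(Enum):
    """Hypothetical stand-in for DuplicatePolicy."""

    NONE = "none"
    SKIP = "skip"
    OVERWRITE = "overwrite"
    FAIL = "fail"


def write_sketch(storage: dict, docs: list[tuple[str, str]], policy: Policy = Policy.NONE) -> int:
    # NONE means "no preference", so fall back to the strict behaviour.
    if policy is Policy.NONE:
        policy = Policy.FAIL
    written = 0
    for doc_id, content in docs:
        if doc_id in storage:
            if policy is Policy.FAIL:
                raise ValueError(f"ID '{doc_id}' already exists.")
            if policy is Policy.SKIP:
                continue  # keep the stored document untouched
        storage[doc_id] = content  # new ID, or OVERWRITE of an existing one
        written += 1
    return written


storage: dict[str, str] = {}
print(write_sketch(storage, [("1", "first version")]))
print(write_sketch(storage, [("1", "second version")], Policy.SKIP))
```

In this model the return value counts only the documents actually stored, so a skipped duplicate does not contribute to it.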

InMemoryDocumentStore.delete_documents

def delete_documents(document_ids: list[str]) -> None

Deletes all documents with matching document_ids from the DocumentStore.

Parameters:

document_ids: The IDs of the documents to delete.

InMemoryDocumentStore.bm25_retrieval

def bm25_retrieval(query: str,
                   filters: Optional[dict[str, Any]] = None,
                   top_k: int = 10,
                   scale_score: bool = False) -> list[Document]

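Retrieves the documents most relevant to the query using the BM25 algorithm.

Parameters:

query: The query string.
filters: A dictionary of filters to narrow down the search space.
top_k: The number of top documents to retrieve. Defaults to 10.
scale_score: Whether to scale the scores of the retrieved documents. Defaults to False.

Returns:

A list of the top_k documents most relevant to the query.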
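
For intuition about what a BM25 score measures, here is a minimal textbook Okapi BM25 sketch over pre-tokenized documents. It is an assumption-laden illustration, not the store's implementation: the function name `bm25_okapi_scores` is hypothetical, the IDF uses a common non-negative variant, and the BM25L and BM25Plus variants adjust the formula further.

```python
import math
from collections import Counter


def bm25_okapi_scores(
    query_tokens: list[str],
    docs_tokens: list[list[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> list[float]:
    if not docs_tokens:
        return []
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: in how many documents each token appears.
    doc_freq: Counter = Counter()
    for doc in docs_tokens:
        doc_freq.update(set(doc))
    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for token in query_tokens:
            tf = term_freq.get(token, 0)
            if tf == 0:
                continue
            # Rare tokens get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[token] + 0.5) / (doc_freq[token] + 0.5) + 1.0)
            # b controls length normalization, k1 controls term-frequency saturation.
            length_norm = 1.0 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [["python", "snake"], ["python", "code", "python"], ["java", "code"]]
scores = bm25_okapi_scores(["python"], docs)
# Indices of the two best-scoring documents, best first.
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:2]
print(ranked)
```

The last two lines mirror the role of top_k: score every candidate, sort in descending order, and keep the first k.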

InMemoryDocumentStore.embedding_retrieval

def embedding_retrieval(
        query_embedding: list[float],
        filters: Optional[dict[str, Any]] = None,
        top_k: int = 10,
        scale_score: bool = False,
        return_embedding: Optional[bool] = False) -> list[Document]

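Retrieves the documents most similar to the query embedding using a vector similarity metric.

Parameters:

query_embedding: The embedding of the query.
filters: A dictionary of filters to narrow down the search space.
top_k: The number of top documents to retrieve. Defaults to 10.
scale_score: Whether to scale the scores of the retrieved documents. Defaults to False.
return_embedding: Whether to return the embeddings of the retrieved documents. If not provided, the value of the return_embedding parameter set at component initialization is used. Defaults to False.

Returns:

A list of the top_k documents most relevant to the query.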
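
The choice between "dot_product" and "cosine" can change the ranking, as the following self-contained sketch suggests. The helper names and the toy embeddings are hypothetical, and the code only models the two similarity functions, not the store's retrieval path.

```python
import math
from typing import Callable


def dot_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    # Dot product of the vectors after normalizing each to unit length.
    norm = math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    return dot_product(a, b) / norm if norm else 0.0


def top_k_ids(
    query: list[float],
    embeddings: dict[str, list[float]],
    similarity: Callable[[list[float], list[float]], float],
    top_k: int = 10,
) -> list[str]:
    # Rank document IDs by similarity to the query, best first.
    ranked = sorted(embeddings, key=lambda doc_id: similarity(query, embeddings[doc_id]), reverse=True)
    return ranked[:top_k]


embeddings = {"a": [1.0, 0.0], "b": [3.0, 3.0], "c": [0.0, 1.0]}
query = [1.0, 0.0]
print(top_k_ids(query, embeddings, dot_product))  # ['b', 'a', 'c']
print(top_k_ids(query, embeddings, cosine))  # ['a', 'b', 'c']
```

Dot product rewards the long vector "b", while cosine ignores magnitude and prefers "a", which points in exactly the query's direction. Models that produce normalized embeddings give the same ranking under both functions.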

InMemoryDocumentStore.count_documents_async

async def count_documents_async() -> int

Returns the number of documents present in the DocumentStore.
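
One common way to offer an async variant of a synchronous method is to run it on a ThreadPoolExecutor, which is what the async_executor parameter hints at. The sketch below shows that pattern under stated assumptions: `TinyStore` is a hypothetical stand-in, and whether the real store is wired exactly this way is not confirmed here.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


class TinyStore:
    """Hypothetical stand-in, not the real InMemoryDocumentStore."""

    def __init__(self, async_executor: Optional[ThreadPoolExecutor] = None):
        self._docs: dict[str, str] = {}
        # Remember whether the executor was created here, so shutdown()
        # never closes an executor that the caller passed in.
        self._owns_executor = async_executor is None
        self._executor = async_executor or ThreadPoolExecutor(max_workers=1)

    def count_documents(self) -> int:
        return len(self._docs)

    async def count_documents_async(self) -> int:
        # Run the synchronous method on the executor without blocking the event loop.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self._executor, self.count_documents)

    def shutdown(self) -> None:
        if self._owns_executor:
            self._executor.shutdown(wait=True)


store = TinyStore()
store._docs["doc-1"] = "hello"
count = asyncio.run(store.count_documents_async())
store.shutdown()
print(count)
```

Tracking ownership of the executor is the reason a separate shutdown step exists: an executor supplied by the caller may be shared with other components and should be closed by the caller.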

InMemoryDocumentStore.filter_documents_async

async def filter_documents_async(
        filters: Optional[dict[str, Any]] = None) -> list[Document]

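Returns the documents that match the provided filters.

For a detailed description of filters, see the DocumentStore.filter_documents() protocol documentation.

Parameters:

filters: The filters to apply to the document list.

Returns:

A list of documents that match the given filters.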

InMemoryDocumentStore.write_documents_async

async def write_documents_async(
        documents: list[Document],
        policy: DuplicatePolicy = DuplicatePolicy.NONE) -> int

See the DocumentStore.write_documents() protocol documentation.

If policy is set to DuplicatePolicy.NONE, it defaults to DuplicatePolicy.FAIL.

InMemoryDocumentStore.delete_documents_async

async def delete_documents_async(document_ids: list[str]) -> None

Deletes all documents with matching document_ids from the DocumentStore.

Parameters:

document_ids: The IDs of the documents to delete.

InMemoryDocumentStore.bm25_retrieval_async

async def bm25_retrieval_async(query: str,
                               filters: Optional[dict[str, Any]] = None,
                               top_k: int = 10,
                               scale_score: bool = False) -> list[Document]

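Retrieves the documents most relevant to the query using the BM25 algorithm.

Parameters:

query: The query string.
filters: A dictionary of filters to narrow down the search space.
top_k: The number of top documents to retrieve. Defaults to 10.
scale_score: Whether to scale the scores of the retrieved documents. Defaults to False.

Returns:

A list of the top_k documents most relevant to the query.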

InMemoryDocumentStore.embedding_retrieval_async

async def embedding_retrieval_async(
        query_embedding: list[float],
        filters: Optional[dict[str, Any]] = None,
        top_k: int = 10,
        scale_score: bool = False,
        return_embedding: bool = False) -> list[Document]

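Retrieves the documents most similar to the query embedding using a vector similarity metric.

Parameters:

query_embedding: The embedding of the query.
filters: A dictionary of filters to narrow down the search space.
top_k: The number of top documents to retrieve. Defaults to 10.
scale_score: Whether to scale the scores of the retrieved documents. Defaults to False.
return_embedding: Whether to return the embeddings of the retrieved documents. Defaults to False.

Returns:

A list of the top_k documents most relevant to the query.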
