retrievers

Retriever objects help retrieve texts based on a set of queries. All retrievers derive from the Retriever base class.

Tip

Instead of calling run() directly, use a step that takes a Retriever in its args, such as Retrieve or RAGPrompt. Some other steps, like FewShotPromptWithRetrieval, use retrievers internally.

Caching

A retriever typically builds its index once and caches it to disk. Retrievers also cache query results to disk, so if you retrieve results for the same query multiple times, the retrieval is performed only once and later runs reuse the cached results.
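The per-query caching described above can be illustrated with a small standalone sketch. It does not use DataDreamer and is not its implementation: the CachedLookup class, its word-overlap scoring, and the JSON-file cache layout are all hypothetical, chosen only to show a repeated query being served from disk.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class CachedLookup:
    """Toy stand-in for a retriever that caches per-query results on disk."""

    def __init__(self, texts, cache_dir):
        self.texts = list(texts)
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.retrieve_calls = 0  # counts real (uncached) retrievals

    def _cache_path(self, query, k):
        # Key the cache on everything that affects the result.
        key = hashlib.sha256(json.dumps([query, k]).encode("utf-8")).hexdigest()
        return self.cache_dir / f"{key}.json"

    def _retrieve(self, query, k):
        # Hypothetical scoring: count words shared by the query and each text.
        self.retrieve_calls += 1
        query_words = set(query.lower().split())
        ranked = sorted(
            self.texts,
            key=lambda text: len(query_words & set(text.lower().split())),
            reverse=True,
        )
        return ranked[:k]

    def run(self, query, k=1):
        path = self._cache_path(query, k)
        if path.exists():
            # Cache hit: reuse the stored result instead of retrieving again.
            return json.loads(path.read_text(encoding="utf-8"))
        result = self._retrieve(query, k)
        path.write_text(json.dumps(result), encoding="utf-8")
        return result


with tempfile.TemporaryDirectory() as cache_dir:
    lookup = CachedLookup(["red apple pie", "green tea leaves"], cache_dir)
    first = lookup.run("apple pie recipe")
    second = lookup.run("apple pie recipe")  # served from the on-disk cache
    calls_after_repeat = lookup.retrieve_calls
```

Keying the cache on both the query and k keeps a cached top-1 result from being returned for a later top-5 request.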

class datadreamer.retrievers.Retriever(texts, cache_folder_path=None)

Bases: _Cachable

Base class for all retrievers.

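Parameters:
  • texts (OutputDatasetColumn | OutputIterableDatasetColumn) – The texts to index for retrieval.

  • cache_folder_path (Optional[str], default: None) – The path to the cache folder. If None, the default cache folder for the DataDreamer session will be used.

abstract property index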

The index instance being used.

unload_model()

Unloads resources required to run the retriever from memory.

class datadreamer.retrievers.EmbeddingRetriever(texts, embedder, truncate=False, index_batch_size=10, index_instruction=None, query_instruction=None, cache_folder_path=None, device=None, **kwargs)

Bases: Retriever

Loads an embedding retriever.

Parameters:
  • texts (OutputDatasetColumn | OutputIterableDatasetColumn) – The texts to index for retrieval.

  • embedder (Embedder) – The embedder to use for embedding the texts.

  • truncate (bool, default: False) – Whether to truncate the texts.

  • index_batch_size (int, default: 10) – The batch size to use for indexing.

  • index_instruction (Optional[str], default: None) – An instruction to prepend to the texts when indexing.

  • query_instruction (Optional[str], default: None) – An instruction to prepend to the queries when querying.

  • cache_folder_path (Optional[str], default: None) – The path to the cache folder. If None, the default cache folder for the DataDreamer session will be used.

  • device (Union[None, int, str, device, list[int | str | device]], default: None) – The device (or list of devices) to use.

  • **kwargs – Additional keyword arguments to pass to the embedder.
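To make index_batch_size, index_instruction, and query_instruction concrete, here is a minimal standalone sketch. It assumes a hypothetical bag-of-words toy_embed function in place of a real Embedder, and the build_index and embed_query helpers are illustrative names, not DataDreamer functions. It shows texts being embedded batch by batch with the index instruction prepended, and a query being embedded with its own instruction.

```python
from collections import Counter


def toy_embed(batch):
    """Hypothetical embedder: one bag-of-words vector per text."""
    return [Counter(text.lower().split()) for text in batch]


def build_index(texts, index_batch_size=10, index_instruction=None):
    """Embed texts batch by batch, prepending the instruction if given."""
    index = []
    batches_embedded = 0
    for start in range(0, len(texts), index_batch_size):
        batch = texts[start:start + index_batch_size]
        if index_instruction is not None:
            batch = [f"{index_instruction} {text}" for text in batch]
        index.extend(toy_embed(batch))
        batches_embedded += 1
    return index, batches_embedded


def embed_query(query, query_instruction=None):
    """Embed a single query, prepending the query instruction if given."""
    if query_instruction is not None:
        query = f"{query_instruction} {query}"
    return toy_embed([query])[0]


texts = ["cats purr", "dogs bark", "birds sing"]
index, batches_embedded = build_index(
    texts, index_batch_size=2, index_instruction="passage:"
)
query_vector = embed_query("do cats purr", query_instruction="query:")
```

Separate index and query instructions matter for embedding models trained with asymmetric prefixes, where passages and queries are expected to carry different markers.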

property index

The index instance being used.

run(queries, k=5, batch_size=10, batch_scheduler_buffer_size=None, adaptive_batch_size=False, progress_interval=60, force=False, cache_only=False, verbose=None, log_level=None, total_num_queries=None, return_generator=False, **kwargs)

Retrieves the closest texts to the input queries.

Parameters:
  • queries (Iterable[Any]) – The queries to retrieve the closest texts to.

  • k (int, default: 5) – The number of closest texts to retrieve.

  • batch_size (int, default: 10) – The batch size to use for retrieval.

  • batch_scheduler_buffer_size (Optional[int], default: None) – The buffer size to use for the batch scheduler.

  • adaptive_batch_size (bool, default: False) – Whether to use adaptive batch sizing.

  • progress_interval (Optional[int], default: 60) – How often to log progress in seconds.

  • force (bool, default: False) – Whether to force the retriever to run again (ignore saved results).

  • cache_only (bool, default: False) – Whether to only use the cache.

  • verbose (Optional[bool], default: None) – Whether or not to print verbose logs.

  • log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).

  • total_num_queries (Optional[int], default: None) – The total number of queries being processed (helps with displaying progress).

  • return_generator (bool, default: False) – Whether to return a generator instead of a list.

  • **kwargs – Additional keyword arguments to pass to the embedder.

Return type:

Union[Generator[dict[str, Any], None, None], list[dict[str, Any]]]

Returns:

The retrieval results: a list of dictionaries, or a generator of dictionaries if return_generator is True.
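The roles of k and return_generator can be sketched without DataDreamer. The example below is an illustration under stated assumptions: it uses a hypothetical bag-of-words embedding with cosine similarity, and the result keys ("indices", "texts", "scores") are assumed for the sake of the example rather than taken from the signature above.

```python
import math
from collections import Counter


def embed(text):
    """Hypothetical embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def _iter_results(index, texts, queries, k):
    # Lazily yield one result dictionary per query.
    for query in queries:
        query_vector = embed(query)
        scored = sorted(
            ((cosine(query_vector, vector), i) for i, vector in enumerate(index)),
            key=lambda pair: (-pair[0], pair[1]),  # best score first, ties by position
        )
        top = scored[:k]
        yield {
            "indices": [i for _, i in top],
            "texts": [texts[i] for _, i in top],
            "scores": [score for score, _ in top],
        }


def run(texts, queries, k=5, return_generator=False):
    """Return the k closest texts per query, as a list or as a generator."""
    index = [embed(text) for text in texts]
    results = _iter_results(index, texts, queries, k)
    return results if return_generator else list(results)


texts = ["the cat sat", "dogs chase cats", "a cat and a dog"]
as_list = run(texts, ["cat"], k=2)
as_generator = run(texts, ["cat"], k=2, return_generator=True)
first_from_generator = next(as_generator)
```

Returning a generator lets a caller start consuming results before every query has been processed, which is useful when the query set is large.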

class datadreamer.retrievers.ParallelRetriever(*retrievers)

Bases: _ParallelCachable, Retriever

Creates a retriever that will run multiple retrievers in parallel. See running models in parallel for more details.

Parameters:
  • *retrievers (Retriever) – The retrievers to run in parallel.

run(queries, *args, **kwargs)

Retrieves the closest texts to the input queries.

Parameters:
  • queries (Iterable[Any]) – The queries to retrieve the closest texts to.

  • k (int, default: 5) – The number of closest texts to retrieve.

  • batch_size (int, default: 10) – The batch size to use for retrieval.

  • batch_scheduler_buffer_size (Optional[int], default: None) – The buffer size to use for the batch scheduler.

  • adaptive_batch_size (bool, default: False) – Whether to use adaptive batch sizing.

  • progress_interval (Optional[int], default: 60) – How often to log progress in seconds.

  • force (bool, default: False) – Whether to force the retriever to run again (ignore saved results).

  • cache_only (bool, default: False) – Whether to only use the cache.

  • verbose (Optional[bool], default: None) – Whether or not to print verbose logs.

  • log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).

  • total_num_queries (Optional[int], default: None) – The total number of queries being processed (helps with displaying progress).

  • return_generator (bool, default: False) – Whether to return a generator instead of a list.

  • **kwargs – Additional keyword arguments to pass to the embedder.

Return type:

Union[Generator[dict[str, Any], None, None], list[dict[str, Any]]]

Returns:

The retrieval results: a list of dictionaries, or a generator of dictionaries if return_generator is True.
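One way to picture running several retrievers in parallel is to split the queries across identical retriever instances and stitch the results back together in input order. The sketch below does this with threads. It is an assumption about how the work could be divided, not a description of ParallelRetriever's internals, and KeywordRetriever and run_in_parallel are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


class KeywordRetriever:
    """Hypothetical minimal retriever exposing a run(queries, k) method."""

    def __init__(self, texts):
        self.texts = list(texts)

    def run(self, queries, k=1):
        results = []
        for query in queries:
            words = set(query.lower().split())
            ranked = sorted(
                self.texts,
                key=lambda text: len(words & set(text.lower().split())),
                reverse=True,
            )
            results.append({"texts": ranked[:k]})
        return results


def run_in_parallel(retrievers, queries, **kwargs):
    """Split queries into contiguous chunks, one per retriever, and rejoin in order."""
    queries = list(queries)
    chunk_size = max(1, -(-len(queries) // len(retrievers)))  # ceiling division
    chunks = [queries[i:i + chunk_size] for i in range(0, len(queries), chunk_size)]
    with ThreadPoolExecutor(max_workers=len(retrievers)) as pool:
        futures = [
            pool.submit(retriever.run, chunk, **kwargs)
            for retriever, chunk in zip(retrievers, chunks)
        ]
        # Collect in submission order so results line up with the input queries.
        return [result for future in futures for result in future.result()]


texts = ["apple pie", "banana bread", "cherry tart"]
retrievers = [KeywordRetriever(texts), KeywordRetriever(texts)]
queries = ["cherry", "banana", "apple"]
parallel_results = run_in_parallel(retrievers, queries, k=1)
sequential_results = retrievers[0].run(queries, k=1)
```

Because the chunks are contiguous and collected in submission order, the combined output matches what a single retriever would return for the same queries.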