retrievers
Retriever objects help retrieve texts based on a set of queries. All retrievers derive from the Retriever base class.
Tip
Instead of using run() directly, use a step that takes a Retriever as an args argument, such as Retrieve or RAGPrompt. Some other steps, like FewShotPromptWithRetrieval, use retrievers internally.
Caching
Retrievers typically build an index once and cache it to disk. They also cache retrieval results to disk, so if you retrieve results for the same query multiple times, the retriever computes them only once and reuses the cached results on future runs.
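The result caching described above can be illustrated with a minimal, library-independent sketch. Everything in it is hypothetical: `CachedToyRetriever`, its word-overlap scoring, and the `results.json` cache file are stand-ins chosen for illustration and are not DataDreamer's actual caching implementation.

```python
import json
import tempfile
from pathlib import Path


class CachedToyRetriever:
    """Hypothetical retriever that caches query results as JSON on disk."""

    def __init__(self, texts, cache_folder_path):
        self.texts = list(texts)
        self.cache_file = Path(cache_folder_path) / "results.json"
        self.num_computed = 0  # counts real (non-cached) retrievals
        # Load results cached by an earlier run, if there are any.
        if self.cache_file.exists():
            self.cache = json.loads(self.cache_file.read_text())
        else:
            self.cache = {}

    def _retrieve(self, query, k):
        # Stand-in scoring: number of lowercase words shared with the query.
        self.num_computed += 1
        words = set(query.lower().split())
        ranked = sorted(
            self.texts,
            key=lambda t: len(words & set(t.lower().split())),
            reverse=True,
        )
        return ranked[:k]

    def run(self, queries, k=1):
        results = []
        for query in queries:
            key = json.dumps([query, k])  # cache key covers query and k
            if key not in self.cache:
                self.cache[key] = self._retrieve(query, k)
            results.append(self.cache[key])
        # Persist the cache so future runs can reuse it.
        self.cache_file.write_text(json.dumps(self.cache))
        return results


cache_dir = tempfile.mkdtemp()
texts = ["cats purr softly", "dogs bark loudly", "birds sing at dawn"]

first = CachedToyRetriever(texts, cache_dir)
first_results = first.run(["why do dogs bark", "why do dogs bark"])

second = CachedToyRetriever(texts, cache_dir)  # simulates a later run
second_results = second.run(["why do dogs bark"])
```

In this sketch the repeated query is computed once by `first`, and `second` answers it entirely from the file written by the earlier run.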
- class datadreamer.retrievers.Retriever(texts, cache_folder_path=None)
Bases: _Cachable
Base class for all retrievers.
- Parameters:
texts (UnionType[None, OutputDatasetColumn, OutputIterableDatasetColumn]) – The texts to index for retrieval.
cache_folder_path (Optional[str], default: None) – The path to the cache folder. If None, the default cache folder for the DataDreamer session will be used.
- class datadreamer.retrievers.EmbeddingRetriever(texts, embedder, truncate=False, index_batch_size=10, index_instruction=None, query_instruction=None, cache_folder_path=None, device=None, **kwargs)
Bases: Retriever
Loads an embedding retriever.
- Parameters:
texts (OutputDatasetColumn | OutputIterableDatasetColumn) – The texts to index for retrieval.
embedder (Embedder) – The embedder to use for embedding the texts.
truncate (bool, default: False) – Whether to truncate the texts.
index_batch_size (int, default: 10) – The batch size to use for indexing.
index_instruction (Optional[str], default: None) – An instruction to prepend to the texts when indexing.
query_instruction (Optional[str], default: None) – An instruction to prepend to the texts when querying.
cache_folder_path (Optional[str], default: None) – The path to the cache folder. If None, the default cache folder for the DataDreamer session will be used.
device (Union[None, int, str, device, list[int | str | device]], default: None) – The device or list of devices to use.
**kwargs – Additional keyword arguments to pass to the embedder.
- run(queries, k=5, batch_size=10, batch_scheduler_buffer_size=None, adaptive_batch_size=False, progress_interval=60, force=False, cache_only=False, verbose=None, log_level=None, total_num_queries=None, return_generator=False, **kwargs)
Retrieves the closest texts to the input queries.
- Parameters:
queries (Iterable[Any]) – The queries to retrieve the closest texts to.
k (int, default: 5) – The number of closest texts to retrieve.
batch_size (int, default: 10) – The batch size to use for retrieval.
batch_scheduler_buffer_size (Optional[int], default: None) – The buffer size to use for the batch scheduler.
adaptive_batch_size (bool, default: False) – Whether to use adaptive batch sizing.
progress_interval (Optional[int], default: 60) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
cache_only (bool, default: False) – Whether to only use the cache.
verbose (Optional[bool], default: None) – Whether or not to print verbose logs.
log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).
total_num_queries (Optional[int], default: None) – The total number of queries being processed (helps with displaying progress).
return_generator (bool, default: False) – Whether to return a generator instead of a list.
**kwargs – Additional keyword arguments to pass to the embedder.
- Return type:
Union[Generator[dict[str, Any], None, None], list[dict[str, Any]]]
- Returns:
A set of results.
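To make the idea of retrieving the k closest texts concrete, here is a small self-contained sketch of embedding-based retrieval. It does not use DataDreamer: `toy_embed` is a hypothetical bag-of-words stand-in for a real Embedder, `ToyEmbeddingRetriever` is an illustrative class, and the result keys (`indices`, `texts`, `scores`) are assumptions made for the example. It mirrors only the documented concepts of batched indexing, index and query instructions, `k`, and `return_generator`.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical embedder: bag-of-words counts, not a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyEmbeddingRetriever:
    def __init__(self, texts, index_batch_size=10,
                 index_instruction=None, query_instruction=None):
        self.texts = list(texts)
        self.query_prefix = query_instruction or ""
        index_prefix = index_instruction or ""
        self.index = []
        # Build the index once, embedding the texts one batch at a time.
        for start in range(0, len(self.texts), index_batch_size):
            batch = self.texts[start:start + index_batch_size]
            self.index.extend(toy_embed(index_prefix + t) for t in batch)

    def _results(self, queries, k):
        for query in queries:
            q = toy_embed(self.query_prefix + query)
            # Rank by descending similarity; ties are broken by text position.
            ranked = sorted(
                ((cosine(q, vec), i) for i, vec in enumerate(self.index)),
                key=lambda pair: (-pair[0], pair[1]),
            )[:k]
            yield {
                "indices": [i for _, i in ranked],
                "texts": [self.texts[i] for _, i in ranked],
                "scores": [score for score, _ in ranked],
            }

    def run(self, queries, k=5, return_generator=False):
        results = self._results(queries, k)
        return results if return_generator else list(results)


texts = [
    "the cat sat on the mat",
    "stock markets fell sharply",
    "a cat chased a mouse",
]
retriever = ToyEmbeddingRetriever(
    texts, index_batch_size=2,
    index_instruction="passage: ", query_instruction="query: ",
)
results = retriever.run(["cat mat"], k=2)
lazy = retriever.run(["markets"], k=1, return_generator=True)
```

With `return_generator=True` the sketch computes each result only when it is consumed, which is the usual reason to prefer a generator over a list for large query sets.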
- class datadreamer.retrievers.ParallelRetriever(*retrievers)
Bases: _ParallelCachable, Retriever
Creates a retriever that will run multiple retrievers in parallel. See running models in parallel for more details.
- Parameters:
*retrievers (Retriever) – The retrievers to run in parallel.
- run(queries, *args, **kwargs)
Retrieves the closest texts to the input queries.
- Parameters:
queries (Iterable[Any]) – The queries to retrieve the closest texts to.
k (int, default: 5) – The number of closest texts to retrieve.
batch_size (int, default: 10) – The batch size to use for retrieval.
batch_scheduler_buffer_size (Optional[int], default: None) – The buffer size to use for the batch scheduler.
adaptive_batch_size (bool, default: False) – Whether to use adaptive batch sizing.
progress_interval (Optional[int], default: 60) – How often to log progress in seconds.
force (bool, default: False) – Whether to force run the step (ignore saved results).
cache_only (bool, default: False) – Whether to only use the cache.
verbose (Optional[bool], default: None) – Whether or not to print verbose logs.
log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).
total_num_queries (Optional[int], default: None) – The total number of queries being processed (helps with displaying progress).
return_generator (bool, default: False) – Whether to return a generator instead of a list.
**kwargs – Additional keyword arguments to pass to the embedder.
- Return type:
Union[Generator[dict[str, Any], None, None], list[dict[str, Any]]]
- Returns:
A set of results.
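One way to picture running multiple retrievers in parallel is to split the queries into contiguous shards, hand each shard to one retriever on its own thread, and stitch the results back together in query order. The sketch below does exactly that with hypothetical classes (`KeywordRetriever`, `ToyParallelRetriever`). The sharding strategy is an assumption made for illustration; the documentation above only states that the retrievers run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


class KeywordRetriever:
    """Hypothetical stand-in retriever: returns texts containing the query."""

    def __init__(self, texts):
        self.texts = list(texts)

    def run(self, queries, k=5):
        return [
            [t for t in self.texts if q.lower() in t.lower()][:k]
            for q in queries
        ]


class ToyParallelRetriever:
    """Illustrative wrapper that spreads queries over several retrievers."""

    def __init__(self, *retrievers):
        self.retrievers = retrievers

    def run(self, queries, *args, **kwargs):
        queries = list(queries)
        n = len(self.retrievers)
        # Assumed strategy: contiguous shards, one per retriever.
        chunk = -(-len(queries) // n)  # ceiling division
        shards = [queries[i * chunk:(i + 1) * chunk] for i in range(n)]
        with ThreadPoolExecutor(max_workers=n) as pool:
            futures = [
                pool.submit(retriever.run, shard, *args, **kwargs)
                for retriever, shard in zip(self.retrievers, shards)
            ]
            # Collect in submission order so results line up with queries.
            return [result for f in futures for result in f.result()]


texts = ["red apple", "green pear", "red cherry"]
parallel = ToyParallelRetriever(KeywordRetriever(texts), KeywordRetriever(texts))
parallel_results = parallel.run(["red", "green", "cherry"], k=2)
```

Because extra positional and keyword arguments are forwarded unchanged, the wrapper in this sketch accepts the same call as the retrievers it wraps, which matches the `run(queries, *args, **kwargs)` signature shown above.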