embedders#
Embedder objects convert texts to embeddings. All embedders derive from the Embedder base class.
Tip
Instead of calling run() directly, use a step that takes an Embedder as an args argument, such as Embed, or construct an EmbeddingRetriever with the embedder and then use a retrieval step such as Retrieve.
Caching#
Embedders internally cache their results to disk, so if you embed the same text multiple times, the embedder only embeds the text once and reuses the cached result on future runs.
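The cache behaves, conceptually, like an on-disk key-value store keyed by a hash of the text. The sketch below illustrates that pattern with the standard library only; it is not DataDreamer's actual implementation (the class name, storage schema, and dummy embedding are invented for illustration):

```python
import hashlib
import json
import sqlite3


class CachedEmbedder:
    """Toy embedder illustrating the cache-on-disk pattern (not DataDreamer's real code)."""

    def __init__(self, db_path=":memory:"):
        self.db = sqlite3.connect(db_path)
        self.db.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)")
        self.compute_calls = 0  # counts actual (non-cached) embedding computations

    def _embed(self, text):
        # Stand-in for a real model call.
        self.compute_calls += 1
        return [float(len(text)), float(sum(map(ord, text)) % 100)]

    def run(self, texts):
        results = []
        for text in texts:
            key = hashlib.sha256(text.encode("utf-8")).hexdigest()
            row = self.db.execute("SELECT value FROM cache WHERE key = ?", (key,)).fetchone()
            if row is not None:
                results.append(json.loads(row[0]))  # cache hit: no model call
            else:
                embedding = self._embed(text)
                self.db.execute(
                    "INSERT INTO cache (key, value) VALUES (?, ?)", (key, json.dumps(embedding))
                )
                results.append(embedding)
        return results


embedder = CachedEmbedder()
embedder.run(["hello", "world", "hello"])  # "hello" is embedded only once
print(embedder.compute_calls)  # → 2
```

Because the cache is keyed by content, re-running the same pipeline later (or with cache_only=True) can serve every embedding from disk without touching the model.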
- class datadreamer.embedders.Embedder(model_name, cache_folder_path=None)[source]#
Bases: TaskModel
Base class for all embedders.
- Parameters:
  - model_name (str) – The name of the model to use.
  - cache_folder_path (Optional[str], default: None) – The path to the cache folder. If None, the default cache folder for the DataDreamer session will be used.
- abstract run(texts, truncate=False, instruction=None, batch_size=10, batch_scheduler_buffer_size=None, adaptive_batch_size=True, progress_interval=60, force=False, cache_only=False, verbose=None, log_level=None, total_num_texts=None, return_generator=False, **kwargs)[source]#
Runs the model on the texts.
- Parameters:
  - texts (Iterable[Any]) – The texts to run against the model.
  - instruction (str) – An instruction to prepend to the texts before running.
  - truncate (bool, default: False) – Whether to truncate the texts.
  - batch_size (int, default: 10) – The batch size to use.
  - batch_scheduler_buffer_size (Optional[int], default: None) – The buffer size to use for the batch scheduler.
  - adaptive_batch_size (bool, default: False) – Whether to use adaptive batch sizing.
  - progress_interval (Optional[int], default: 60) – How often to log progress in seconds.
  - force (bool, default: False) – Whether to force run the step (ignore saved results).
  - cache_only (bool, default: False) – Whether to only use the cache.
  - verbose (Optional[bool], default: None) – Whether or not to print verbose logs.
  - log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).
  - total_num_texts (Optional[int], default: None) – The total number of texts being processed (helps with displaying progress).
  - return_generator (bool, default: False) – Whether to return a generator instead of a list.
  - **kwargs – Additional keyword arguments to pass when running the model.
- Return type:
  Union[Generator[dict[str, Any], None, None], list[dict[str, Any]]]
- Returns:
  The result of running the model on the texts.
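Depending on return_generator, run() either collects its results into a list eagerly or yields them lazily; each result is a dict. The schematic below illustrates that return contract only; the dict keys and placeholder vector are invented for illustration, not DataDreamer's actual output format:

```python
from typing import Any, Generator, Iterable, Union


def run_sketch(
    texts: Iterable[str], return_generator: bool = False
) -> Union[Generator[dict[str, Any], None, None], list[dict[str, Any]]]:
    """Mimics run()'s return contract with a dummy result dict (keys are illustrative)."""

    def generate() -> Generator[dict[str, Any], None, None]:
        for text in texts:
            yield {"text": text, "embedding": [0.0] * 4}  # placeholder vector

    return generate() if return_generator else list(generate())


results = run_sketch(["a", "b"])  # eager: a list of dicts
lazy = run_sketch(["a", "b"], return_generator=True)  # lazy: a generator of dicts
print(type(results).__name__, type(lazy).__name__)  # → list generator
```

Returning a generator is useful when embedding a large corpus, since results can be consumed and discarded one batch at a time instead of being held in memory all at once.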
- class datadreamer.embedders.OpenAIEmbedder(model_name, dimensions=DEFAULT, organization=None, api_key=None, base_url=None, api_version=None, retry_on_fail=False, cache_folder_path=None, **kwargs)[source]#
Bases: Embedder
Loads an OpenAI embedder.
- Parameters:
  - model_name (str) – The name of the model to use.
  - dimensions (int | Default, default: DEFAULT) – The number of dimensions to use for the embeddings. If None, the default number of dimensions for the model will be used.
  - organization (Optional[str], default: None) – The organization to use for the API.
  - api_key (Optional[str], default: None) – The API key to use for the API.
  - base_url (Optional[str], default: None) – The base URL to use for the API.
  - api_version (Optional[str], default: None) – The version of the API to use.
  - retry_on_fail (bool, default: False) – Whether to retry API calls if they fail.
  - cache_folder_path (Optional[str], default: None) – The path to the cache folder. If None, the default cache folder for the DataDreamer session will be used.
  - **kwargs – Additional keyword arguments to pass to the OpenAI client.
- run(texts, truncate=False, batch_size=10, batch_scheduler_buffer_size=None, adaptive_batch_size=False, progress_interval=60, force=False, cache_only=False, verbose=None, log_level=None, total_num_texts=None, return_generator=False, **kwargs)[source]#
Runs the model on the texts.
- Parameters:
  - texts (Iterable[Any]) – The texts to run against the model.
  - truncate (bool, default: False) – Whether to truncate the texts.
  - batch_size (int, default: 10) – The batch size to use.
  - batch_scheduler_buffer_size (Optional[int], default: None) – The buffer size to use for the batch scheduler.
  - adaptive_batch_size (bool, default: False) – Whether to use adaptive batch sizing.
  - progress_interval (Optional[int], default: 60) – How often to log progress in seconds.
  - force (bool, default: False) – Whether to force run the step (ignore saved results).
  - cache_only (bool, default: False) – Whether to only use the cache.
  - verbose (Optional[bool], default: None) – Whether or not to print verbose logs.
  - log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).
  - total_num_texts (Optional[int], default: None) – The total number of texts being processed (helps with displaying progress).
  - return_generator (bool, default: False) – Whether to return a generator instead of a list.
  - **kwargs – Additional keyword arguments to pass when running the model.
- Return type:
  Union[Generator[dict[str, Any], None, None], list[dict[str, Any]]]
- Returns:
  The result of running the model on the texts.
- class datadreamer.embedders.SentenceTransformersEmbedder(model_name, trust_remote_code=False, device=None, dtype=None, cache_folder_path=None, **kwargs)[source]#
Bases: Embedder
Loads a SentenceTransformers embedder.
- Parameters:
  - model_name (str) – The name of the model to use.
  - trust_remote_code (bool, default: False) – Whether to trust remote code.
  - device (Union[None, int, str, device], default: None) – The device to use for the model.
  - dtype (Union[None, str, dtype], default: None) – The type to use for the model weights.
  - cache_folder_path (Optional[str], default: None) – The path to the cache folder. If None, the default cache folder for the DataDreamer session will be used.
  - **kwargs – Additional keyword arguments to pass to the SentenceTransformers constructor.
- run(texts, truncate=False, instruction=None, batch_size=10, batch_scheduler_buffer_size=None, adaptive_batch_size=True, progress_interval=60, force=False, cache_only=False, verbose=None, log_level=None, total_num_texts=None, return_generator=False, **kwargs)[source]#
Runs the model on the texts.
- Parameters:
  - texts (Iterable[Any]) – The texts to run against the model.
  - instruction (str) – An instruction to prepend to the texts before running.
  - truncate (bool, default: False) – Whether to truncate the texts.
  - batch_size (int, default: 10) – The batch size to use.
  - batch_scheduler_buffer_size (Optional[int], default: None) – The buffer size to use for the batch scheduler.
  - adaptive_batch_size (bool, default: False) – Whether to use adaptive batch sizing.
  - progress_interval (Optional[int], default: 60) – How often to log progress in seconds.
  - force (bool, default: False) – Whether to force run the step (ignore saved results).
  - cache_only (bool, default: False) – Whether to only use the cache.
  - verbose (Optional[bool], default: None) – Whether or not to print verbose logs.
  - log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).
  - total_num_texts (Optional[int], default: None) – The total number of texts being processed (helps with displaying progress).
  - return_generator (bool, default: False) – Whether to return a generator instead of a list.
  - **kwargs – Additional keyword arguments to pass when running the model.
- Return type:
  Union[Generator[dict[str, Any], None, None], list[dict[str, Any]]]
- Returns:
  The result of running the model on the texts.
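Instruction-tuned embedding models (common among SentenceTransformers checkpoints, e.g. E5- or Instructor-style models) expect a task instruction prepended to each text, which is what the instruction parameter does. A minimal sketch of the idea; the exact joining DataDreamer performs, and the instruction string shown, are assumptions for illustration:

```python
def apply_instruction(texts, instruction=None):
    """Sketch: prepend a task instruction to each text before embedding (illustrative)."""
    if instruction is None:
        return list(texts)
    return [f"{instruction}{text}" for text in texts]


print(apply_instruction(["capital of France?"], instruction="Represent the query for retrieval: "))
# → ['Represent the query for retrieval: capital of France?']
```

For such models, queries and documents typically use different instructions, so you would pass the appropriate one per run() call.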
- class datadreamer.embedders.TogetherEmbedder(model_name, api_key=None, max_context_length=None, tokenizer_model_name=None, tokenizer_revision=None, tokenizer_trust_remote_code=False, retry_on_fail=True, cache_folder_path=None, warn_tokenizer_model_name=True, warn_max_context_length=True, **kwargs)[source]#
Bases: Embedder
Loads a Together AI embedder.
- Parameters:
  - model_name (str) – The name of the model to use.
  - api_key (Optional[str], default: None) – The API key to use for the API.
  - max_context_length (Optional[int], default: None) – The maximum context length to use for the model. If None, the maximum context length will be inferred.
  - tokenizer_model_name (Optional[str], default: None) – The name of the tokenizer model to use. If None, the tokenizer model will be inferred.
  - tokenizer_revision (Optional[str], default: None) – The revision of the tokenizer model to use.
  - tokenizer_trust_remote_code (bool, default: False) – Whether to trust remote code for the tokenizer.
  - retry_on_fail (bool, default: True) – Whether to retry API calls if they fail.
  - cache_folder_path (Optional[str], default: None) – The path to the cache folder. If None, the default cache folder for the DataDreamer session will be used.
  - warn_tokenizer_model_name (Optional[bool], default: True) – Whether to warn if the tokenizer model name is inferred and not explicitly specified.
  - warn_max_context_length (Optional[bool], default: True) – Whether to warn if the maximum context length is inferred and not explicitly specified.
  - **kwargs – Additional keyword arguments to pass to the Together client.
- run(texts, truncate=False, batch_size=10, batch_scheduler_buffer_size=None, adaptive_batch_size=False, progress_interval=60, force=False, cache_only=False, verbose=None, log_level=None, total_num_texts=None, return_generator=False, **kwargs)[source]#
Runs the model on the texts.
- Parameters:
  - texts (Iterable[Any]) – The texts to run against the model.
  - truncate (bool, default: False) – Whether to truncate the texts.
  - batch_size (int, default: 10) – The batch size to use.
  - batch_scheduler_buffer_size (Optional[int], default: None) – The buffer size to use for the batch scheduler.
  - adaptive_batch_size (bool, default: False) – Whether to use adaptive batch sizing.
  - progress_interval (Optional[int], default: 60) – How often to log progress in seconds.
  - force (bool, default: False) – Whether to force run the step (ignore saved results).
  - cache_only (bool, default: False) – Whether to only use the cache.
  - verbose (Optional[bool], default: None) – Whether or not to print verbose logs.
  - log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).
  - total_num_texts (Optional[int], default: None) – The total number of texts being processed (helps with displaying progress).
  - return_generator (bool, default: False) – Whether to return a generator instead of a list.
  - **kwargs – Additional keyword arguments to pass when running the model.
- Return type:
  Union[Generator[dict[str, Any], None, None], list[dict[str, Any]]]
- Returns:
  The result of running the model on the texts.
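With truncate=True, texts longer than max_context_length must be shortened before being sent to the API, which is why TogetherEmbedder needs a tokenizer (inferred from tokenizer_model_name when not given). The rough stdlib sketch below approximates token-budget truncation with whitespace tokens; real truncation uses the model's actual tokenizer:

```python
def truncate_to_context(texts, max_context_length):
    """Approximate truncation to a token budget using whitespace tokens (illustrative)."""
    truncated = []
    for text in texts:
        tokens = text.split()
        truncated.append(" ".join(tokens[:max_context_length]))
    return truncated


print(truncate_to_context(["one two three four five"], max_context_length=3))
# → ['one two three']
```

Because whitespace tokens and model tokens differ, pinning tokenizer_model_name and max_context_length explicitly (rather than letting them be inferred) avoids the warnings controlled by warn_tokenizer_model_name and warn_max_context_length.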
- class datadreamer.embedders.ParallelEmbedder(*embedders)[source]#
Bases: ParallelTaskModel, Embedder
Creates an embedder that will run multiple embedders in parallel. See running models in parallel for more details.
- Parameters:
  - *embedders (Embedder) – The embedders to run in parallel.
- run(texts, *args, **kwargs)[source]#
Runs the model on the texts.
- Parameters:
  - texts (Iterable[Any]) – The texts to run against the model.
  - truncate (bool, default: False) – Whether to truncate the texts.
  - batch_size (int, default: 10) – The batch size to use.
  - batch_scheduler_buffer_size (Optional[int], default: None) – The buffer size to use for the batch scheduler.
  - adaptive_batch_size (bool, default: False) – Whether to use adaptive batch sizing.
  - progress_interval (Optional[int], default: 60) – How often to log progress in seconds.
  - force (bool, default: False) – Whether to force run the step (ignore saved results).
  - cache_only (bool, default: False) – Whether to only use the cache.
  - verbose (Optional[bool], default: None) – Whether or not to print verbose logs.
  - log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).
  - total_num_texts (Optional[int], default: None) – The total number of texts being processed (helps with displaying progress).
  - return_generator (bool, default: False) – Whether to return a generator instead of a list.
  - **kwargs – Additional keyword arguments to pass when running the model.
- Return type:
  Union[Generator[dict[str, Any], None, None], list[dict[str, Any]]]
- Returns:
  The result of running the model on the texts.
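A ParallelEmbedder fans one workload out across its child embedders and reassembles the results in input order. The sketch below illustrates that fan-out with a thread pool; DataDreamer's actual scheduling differs, and the round-robin splitting strategy and stand-in embedders here are assumptions for illustration:

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_run(embedders, texts):
    """Split texts round-robin across embedders, run shards concurrently, and
    reassemble results in the original order of `texts` (illustrative)."""
    n = len(embedders)
    shards = [texts[i::n] for i in range(n)]  # shard i gets texts i, i+n, i+2n, ...
    with ThreadPoolExecutor(max_workers=n) as pool:
        shard_results = list(pool.map(lambda pair: pair[0](pair[1]), zip(embedders, shards)))
    # Interleave shard results back into the original order.
    results = [None] * len(texts)
    for shard_idx, shard in enumerate(shard_results):
        for j, res in enumerate(shard):
            results[shard_idx + j * n] = res
    return results


def fake_embedder(batch):
    # Stand-in embedder: a one-dimensional "embedding" of each text's length.
    return [[float(len(t))] for t in batch]


print(parallel_run([fake_embedder, fake_embedder], ["a", "bb", "ccc"]))
# → [[1.0], [2.0], [3.0]]
```

Because each child embedder caches independently, this style of parallelism is most useful when the children wrap separate devices or API quotas.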