embedders#

Embedder objects help convert texts to embeddings. All embedders derive from the Embedder base class.

Tip

Instead of calling run() directly, pass an Embedder to a step that accepts one in its args (such as Embed), or construct an EmbeddingRetriever with the embedder and then use a retrieval step such as Retrieve.

Caching#

Embedders cache their results to disk: if you embed the same text multiple times, it is only embedded once, and the cached result is reused on subsequent runs.
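
Conceptually, the cache keys each text (for example, by a hash of its contents) so repeated texts are looked up instead of re-embedded. A minimal in-memory sketch of this idea (illustrative only, not DataDreamer's actual on-disk implementation):

```python
import hashlib

class CachingEmbedder:
    """Toy embedder that caches results keyed by a hash of the text."""

    def __init__(self):
        self.cache = {}       # in a real embedder this lives on disk
        self.embed_calls = 0  # counts how often the "model" actually runs

    def _embed(self, text):
        self.embed_calls += 1
        # Stand-in for a real model: a deterministic pseudo-embedding.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [b / 255.0 for b in digest[:4]]

    def run(self, texts):
        results = []
        for text in texts:
            key = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if key not in self.cache:
                self.cache[key] = self._embed(text)
            results.append(self.cache[key])
        return results

embedder = CachingEmbedder()
embedder.run(["Hello world", "Hello world", "Goodbye"])
# The duplicate text is only embedded once.
```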

class datadreamer.embedders.Embedder(model_name, cache_folder_path=None)[source]#

Bases: TaskModel

Base class for all embedders.

Parameters:
  • model_name (str) – The name of the model to use.

  • cache_folder_path (Optional[str], default: None) – The path to the cache folder. If None, the default cache folder for the DataDreamer session will be used.

abstract count_tokens(value)[source]#

Counts the number of tokens in a string.

Parameters:

value (str) – The string to count tokens for.

Return type:

int

Returns:

The number of tokens in the string.

abstract property model_max_length: int[source]#

The maximum input length (in tokens) of the model.

abstract property dims: int[source]#

The dimensions of the embeddings.

abstract run(texts, truncate=False, instruction=None, batch_size=10, batch_scheduler_buffer_size=None, adaptive_batch_size=True, progress_interval=60, force=False, cache_only=False, verbose=None, log_level=None, total_num_texts=None, return_generator=False, **kwargs)[source]#

Runs the model on the texts.

Parameters:
  • texts (Iterable[Any]) – The texts to run against the model.

  • instruction (Optional[str], default: None) – An instruction to prepend to the texts before running.

  • truncate (bool, default: False) – Whether to truncate the texts.

  • batch_size (int, default: 10) – The batch size to use.

  • batch_scheduler_buffer_size (Optional[int], default: None) – The buffer size to use for the batch scheduler.

  • adaptive_batch_size (bool, default: True) – Whether to use adaptive batch sizing.

  • progress_interval (Optional[int], default: 60) – How often to log progress in seconds.

  • force (bool, default: False) – Whether to force run the step (ignore saved results).

  • cache_only (bool, default: False) – Whether to only use the cache.

  • verbose (Optional[bool], default: None) – Whether or not to print verbose logs.

  • log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).

  • total_num_texts (Optional[int], default: None) – The total number of texts being processed (helps with displaying progress).

  • return_generator (bool, default: False) – Whether to return a generator instead of a list.

  • **kwargs – Additional keyword arguments to pass when running the model.

Return type:

Union[Generator[dict[str, Any], None, None], list[dict[str, Any]]]

Returns:

The result of running the model on the texts.
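
The return_generator flag matters for large inputs: a generator yields results lazily instead of materializing one list in memory. A hedged sketch of the calling convention (the names mirror the signature above, but the model call is a stand-in):

```python
from typing import Any, Generator, Iterable, Union

def run_stub(
    texts: Iterable[str], return_generator: bool = False
) -> Union[Generator[dict[str, Any], None, None], list[dict[str, Any]]]:
    """Mimics Embedder.run()'s return convention with a fake embedding."""

    def generate():
        for text in texts:
            # Stand-in for the real model: embed each text lazily.
            yield {"text": text, "embedding": [float(len(text))]}

    return generate() if return_generator else list(generate())

# As a list: everything is computed up front.
as_list = run_stub(["a", "bb"])
# As a generator: results are produced one at a time.
as_gen = run_stub(["a", "bb"], return_generator=True)
```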

class datadreamer.embedders.OpenAIEmbedder(model_name, dimensions=DEFAULT, organization=None, api_key=None, base_url=None, api_version=None, retry_on_fail=False, cache_folder_path=None, **kwargs)[source]#

Bases: Embedder

Loads an OpenAI embedder.

Parameters:
  • model_name (str) – The name of the model to use.

  • dimensions (int | Default, default: DEFAULT) – The number of dimensions to use for the embeddings. If DEFAULT, the default number of dimensions for the model will be used.

  • organization (Optional[str], default: None) – The organization to use for the API.

  • api_key (Optional[str], default: None) – The API key to use for the API.

  • base_url (Optional[str], default: None) – The base URL to use for the API.

  • api_version (Optional[str], default: None) – The version of the API to use.

  • retry_on_fail (bool, default: False) – Whether to retry API calls if they fail.

  • cache_folder_path (Optional[str], default: None) – The path to the cache folder. If None, the default cache folder for the DataDreamer session will be used.

  • **kwargs – Additional keyword arguments to pass to the OpenAI client.

property client: OpenAI | AzureOpenAI[source]#

The API client instance being used.

property tokenizer: Encoding[source]#

The tokenizer instance being used.

property model_max_length: int[source]#

The maximum input length (in tokens) of the model.

property dims: int[source]#

The dimensions of the embeddings.

run(texts, truncate=False, batch_size=10, batch_scheduler_buffer_size=None, adaptive_batch_size=False, progress_interval=60, force=False, cache_only=False, verbose=None, log_level=None, total_num_texts=None, return_generator=False, **kwargs)[source]#

Runs the model on the texts.

Parameters:
  • texts (Iterable[Any]) – The texts to run against the model.

  • truncate (bool, default: False) – Whether to truncate the texts.

  • batch_size (int, default: 10) – The batch size to use.

  • batch_scheduler_buffer_size (Optional[int], default: None) – The buffer size to use for the batch scheduler.

  • adaptive_batch_size (bool, default: False) – Whether to use adaptive batch sizing.

  • progress_interval (Optional[int], default: 60) – How often to log progress in seconds.

  • force (bool, default: False) – Whether to force run the step (ignore saved results).

  • cache_only (bool, default: False) – Whether to only use the cache.

  • verbose (Optional[bool], default: None) – Whether or not to print verbose logs.

  • log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).

  • total_num_texts (Optional[int], default: None) – The total number of texts being processed (helps with displaying progress).

  • return_generator (bool, default: False) – Whether to return a generator instead of a list.

  • **kwargs – Additional keyword arguments to pass when running the model.

Return type:

Union[Generator[dict[str, Any], None, None], list[dict[str, Any]]]

Returns:

The result of running the model on the texts.
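
When truncate=True, over-length texts are cut down to fit within the model's maximum input length. A naive sketch of that decision, with whitespace tokens standing in for a real tokenizer (an OpenAI embedder would count tokens with its Encoding instead of str.split()):

```python
def truncate_text(text: str, max_tokens: int) -> str:
    """Truncate a text to at most max_tokens whitespace-delimited tokens.

    Illustrative only: a real embedder counts tokens with the model's
    own tokenizer, not with str.split().
    """
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text
    return " ".join(tokens[:max_tokens])

truncate_text("one two three four five", 3)  # → "one two three"
```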

class datadreamer.embedders.SentenceTransformersEmbedder(model_name, trust_remote_code=False, device=None, dtype=None, cache_folder_path=None, **kwargs)[source]#

Bases: Embedder

Loads a SentenceTransformers embedder.

Parameters:
  • model_name (str) – The name of the model to use.

  • trust_remote_code (bool, default: False) – Whether to trust remote code.

  • device (Union[None, int, str, device], default: None) – The device to use for the model.

  • dtype (Union[None, str, dtype], default: None) – The type to use for the model weights.

  • cache_folder_path (Optional[str], default: None) – The path to the cache folder. If None, the default cache folder for the DataDreamer session will be used.

  • **kwargs – Additional keyword arguments to pass to the SentenceTransformers constructor.

property model: SentenceTransformer[source]#

The model instance being used.

property tokenizer: Any[source]#

The tokenizer instance being used.

property model_max_length: int[source]#

The maximum input length (in tokens) of the model.

property dims: int[source]#

The dimensions of the embeddings.

run(texts, truncate=False, instruction=None, batch_size=10, batch_scheduler_buffer_size=None, adaptive_batch_size=True, progress_interval=60, force=False, cache_only=False, verbose=None, log_level=None, total_num_texts=None, return_generator=False, **kwargs)[source]#

Runs the model on the texts.

Parameters:
  • texts (Iterable[Any]) – The texts to run against the model.

  • instruction (Optional[str], default: None) – An instruction to prepend to the texts before running.

  • truncate (bool, default: False) – Whether to truncate the texts.

  • batch_size (int, default: 10) – The batch size to use.

  • batch_scheduler_buffer_size (Optional[int], default: None) – The buffer size to use for the batch scheduler.

  • adaptive_batch_size (bool, default: True) – Whether to use adaptive batch sizing.

  • progress_interval (Optional[int], default: 60) – How often to log progress in seconds.

  • force (bool, default: False) – Whether to force run the step (ignore saved results).

  • cache_only (bool, default: False) – Whether to only use the cache.

  • verbose (Optional[bool], default: None) – Whether or not to print verbose logs.

  • log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).

  • total_num_texts (Optional[int], default: None) – The total number of texts being processed (helps with displaying progress).

  • return_generator (bool, default: False) – Whether to return a generator instead of a list.

  • **kwargs – Additional keyword arguments to pass when running the model.

Return type:

Union[Generator[dict[str, Any], None, None], list[dict[str, Any]]]

Returns:

The result of running the model on the texts.
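
The batch_size parameter controls how many texts are sent to the model at once. Chunking an iterable into fixed-size batches, as an embedder does internally, can be sketched as:

```python
from itertools import islice
from typing import Iterable, Iterator

def batched(texts: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most batch_size texts."""
    iterator = iter(texts)
    while batch := list(islice(iterator, batch_size)):
        yield batch

list(batched(["a", "b", "c", "d", "e"], 2))
# → [["a", "b"], ["c", "d"], ["e"]]
```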

class datadreamer.embedders.TogetherEmbedder(model_name, api_key=None, max_context_length=None, tokenizer_model_name=None, tokenizer_revision=None, tokenizer_trust_remote_code=False, retry_on_fail=True, cache_folder_path=None, warn_tokenizer_model_name=True, warn_max_context_length=True, **kwargs)[source]#

Bases: Embedder

Loads a Together AI embedder.

Parameters:
  • model_name (str) – The name of the model to use.

  • api_key (Optional[str], default: None) – The API key to use for the API.

  • max_context_length (Optional[int], default: None) – The maximum context length to use for the model. If None, the maximum context length will be inferred.

  • tokenizer_model_name (Optional[str], default: None) – The name of the tokenizer model to use. If None, the tokenizer model will be inferred.

  • tokenizer_revision (Optional[str], default: None) – The revision of the tokenizer model to use.

  • tokenizer_trust_remote_code (bool, default: False) – Whether to trust remote code for the tokenizer.

  • retry_on_fail (bool, default: True) – Whether to retry API calls if they fail.

  • cache_folder_path (Optional[str], default: None) – The path to the cache folder. If None, the default cache folder for the DataDreamer session will be used.

  • warn_tokenizer_model_name (Optional[bool], default: True) – Whether to warn if the tokenizer model name is inferred and not explicitly specified.

  • warn_max_context_length (Optional[bool], default: True) – Whether to warn if the maximum context length is inferred and not explicitly specified.

  • **kwargs – Additional keyword arguments to pass to the Together client.

property client: Any[source]#

The API client instance being used.

property tokenizer: Encoding[source]#

The tokenizer instance being used.

property model_max_length: int[source]#

The maximum input length (in tokens) of the model.

property dims: int[source]#

The dimensions of the embeddings.

run(texts, truncate=False, batch_size=10, batch_scheduler_buffer_size=None, adaptive_batch_size=False, progress_interval=60, force=False, cache_only=False, verbose=None, log_level=None, total_num_texts=None, return_generator=False, **kwargs)[source]#

Runs the model on the texts.

Parameters:
  • texts (Iterable[Any]) – The texts to run against the model.

  • truncate (bool, default: False) – Whether to truncate the texts.

  • batch_size (int, default: 10) – The batch size to use.

  • batch_scheduler_buffer_size (Optional[int], default: None) – The buffer size to use for the batch scheduler.

  • adaptive_batch_size (bool, default: False) – Whether to use adaptive batch sizing.

  • progress_interval (Optional[int], default: 60) – How often to log progress in seconds.

  • force (bool, default: False) – Whether to force run the step (ignore saved results).

  • cache_only (bool, default: False) – Whether to only use the cache.

  • verbose (Optional[bool], default: None) – Whether or not to print verbose logs.

  • log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).

  • total_num_texts (Optional[int], default: None) – The total number of texts being processed (helps with displaying progress).

  • return_generator (bool, default: False) – Whether to return a generator instead of a list.

  • **kwargs – Additional keyword arguments to pass when running the model.

Return type:

Union[Generator[dict[str, Any], None, None], list[dict[str, Any]]]

Returns:

The result of running the model on the texts.

class datadreamer.embedders.ParallelEmbedder(*embedders)[source]#

Bases: ParallelTaskModel, Embedder

Creates an embedder that will run multiple embedders in parallel. See running models in parallel for more details.

Parameters:

*embedders (Embedder) – The embedders to run in parallel.

run(texts, *args, **kwargs)[source]#

Runs the model on the texts.

Parameters:
  • texts (Iterable[Any]) – The texts to run against the model.

  • truncate (bool, default: False) – Whether to truncate the texts.

  • batch_size (int, default: 10) – The batch size to use.

  • batch_scheduler_buffer_size (Optional[int], default: None) – The buffer size to use for the batch scheduler.

  • adaptive_batch_size (bool, default: False) – Whether to use adaptive batch sizing.

  • progress_interval (Optional[int], default: 60) – How often to log progress in seconds.

  • force (bool, default: False) – Whether to force run the step (ignore saved results).

  • cache_only (bool, default: False) – Whether to only use the cache.

  • verbose (Optional[bool], default: None) – Whether or not to print verbose logs.

  • log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).

  • total_num_texts (Optional[int], default: None) – The total number of texts being processed (helps with displaying progress).

  • return_generator (bool, default: False) – Whether to return a generator instead of a list.

  • **kwargs – Additional keyword arguments to pass when running the model.

Return type:

Union[Generator[dict[str, Any], None, None], list[dict[str, Any]]]

Returns:

The result of running the model on the texts.
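
A parallel embedder fans the input texts out across its child embedders and collects the results back in input order. A minimal thread-based sketch of that fan-out idea (illustrative only, not DataDreamer's actual scheduler; the embed functions are stand-ins for Embedder.run()):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_run(embed_fns, texts):
    """Round-robin texts across embed functions and restore input order.

    embed_fns: callables, each taking a list of texts and returning a
    list of embeddings (hypothetical stand-ins for Embedder.run()).
    """
    n = len(embed_fns)
    # Partition: text i goes to embedder i % n.
    shards = [texts[i::n] for i in range(n)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        shard_results = list(
            pool.map(lambda pair: pair[0](pair[1]), zip(embed_fns, shards))
        )
    # Re-interleave shard results back into the original input order.
    results = [None] * len(texts)
    for shard_idx, shard_out in enumerate(shard_results):
        for j, emb in enumerate(shard_out):
            results[shard_idx + j * n] = emb
    return results

fake = lambda batch: [[float(len(t))] for t in batch]
parallel_run([fake, fake], ["a", "bb", "ccc", "dddd"])
# → [[1.0], [2.0], [3.0], [4.0]]
```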