task_models#

TaskModel objects perform arbitrary NLP tasks (classification, etc.). All task models derive from the TaskModel base class.

Tip

Instead of using run() directly, use a step that takes a TaskModel as an args argument, such as RunTaskModel.
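
For instance, a minimal sketch of this pattern, wrapping a task model in the RunTaskModel step inside a DataDreamer session (the model name and the exact inputs/args keys shown here are assumptions; consult the RunTaskModel step's documentation):

from datadreamer import DataDreamer
from datadreamer.steps import DataSource, RunTaskModel
from datadreamer.task_models import HFClassificationTaskModel

with DataDreamer("./output"):
    # A tiny in-memory data source with a single "texts" column.
    texts = DataSource(
        "Example Texts",
        data={"texts": ["I loved this movie!", "This was a waste of time."]},
    )

    # Run the task model over the "texts" column via the RunTaskModel step.
    results = RunTaskModel(
        "Classify Sentiment",
        inputs={"texts": texts.output["texts"]},
        args={
            "model": HFClassificationTaskModel(
                "distilbert-base-uncased-finetuned-sst-2-english"
            )
        },
    )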

Caching#

Task models cache their results to disk, so if you run the same texts multiple times, the model only runs once; subsequent runs are served from the cache.
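
An illustrative sketch of this behavior (the model name and cache path are assumptions):

from datadreamer.task_models import HFClassificationTaskModel

model = HFClassificationTaskModel(
    "distilbert-base-uncased-finetuned-sst-2-english",
    cache_folder_path="./.task_model_cache",
)
first = model.run(["I loved this movie!"])   # runs the model
second = model.run(["I loved this movie!"])  # same text: served from the disk cache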

class datadreamer.task_models.TaskModel(cache_folder_path=None)[source]#

Bases: _Cachable

Base class for all task models.

Parameters:

cache_folder_path (Optional[str], default: None) – The path to the cache folder. If None, the default cache folder for the DataDreamer session will be used.

abstract count_tokens(value)[source]#

Counts the number of tokens in a string.

Parameters:

value (str) – The string to count tokens for.

Return type:

int

Returns:

The number of tokens in the string.

abstract property model_max_length: int[source]#

The maximum sequence length (in tokens) supported by the model.
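
A short sketch combining count_tokens() and model_max_length to decide whether a text needs truncation before running it (model is assumed to be any concrete TaskModel instance):

# `model` is assumed to be a concrete TaskModel subclass instance.
text = "A very long document... " * 500

# Truncate only when the text exceeds the model's maximum sequence length.
needs_truncation = model.count_tokens(text) > model.model_max_length
results = model.run([text], truncate=needs_truncation)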

abstract run(texts, truncate=False, batch_size=10, batch_scheduler_buffer_size=None, adaptive_batch_size=True, progress_interval=60, force=False, cache_only=False, verbose=None, log_level=None, total_num_texts=None, return_generator=False, **kwargs)[source]#

Runs the model on the texts.

Parameters:
  • texts (Iterable[Any]) – The texts to run against the model.

  • truncate (bool, default: False) – Whether to truncate the texts.

  • batch_size (int, default: 10) – The batch size to use.

  • batch_scheduler_buffer_size (Optional[int], default: None) – The buffer size to use for the batch scheduler.

  • adaptive_batch_size (bool, default: True) – Whether to use adaptive batch sizing.

  • progress_interval (Optional[int], default: 60) – How often to log progress in seconds.

  • force (bool, default: False) – Whether to force run the model (ignore cached results).

  • cache_only (bool, default: False) – Whether to only use the cache.

  • verbose (Optional[bool], default: None) – Whether or not to print verbose logs.

  • log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).

  • total_num_texts (Optional[int], default: None) – The total number of texts being processed (helps with displaying progress).

  • return_generator (bool, default: False) – Whether to return a generator instead of a list.

  • **kwargs – Additional keyword arguments to pass when running the model.

Return type:

Union[Generator[dict[str, Any], None, None], list[dict[str, Any]]]

Returns:

The result of running the model on the texts.
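
Because run() can return either a list or a generator, here is a hedged sketch of streaming results over a large input (model is assumed to be a concrete TaskModel instance; the fields in each result dict depend on the concrete task model):

texts = ["text 1", "text 2", "text 3"]

# Stream results one at a time instead of materializing a full list;
# total_num_texts helps progress logging (useful when the texts argument
# is a generator of known size rather than a list).
for result in model.run(texts, return_generator=True, total_num_texts=len(texts)):
    print(result)  # one dict[str, Any] of task outputs per input text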

unload_model()[source]#

Unloads from memory the resources required to run the model.

class datadreamer.task_models.HFClassificationTaskModel(model_name, revision=None, trust_remote_code=False, device=None, device_map=None, dtype=None, adapter_name=None, adapter_kwargs=None, cache_folder_path=None, **kwargs)[source]#

Bases: TaskModel

Loads a classification task model backed by a Hugging Face Transformers model (a short instantiation sketch follows the parameter list).

Parameters:
  • model_name (str) – The name of the model to use.

  • revision (Optional[str], default: None) – The revision (branch name, tag, or commit hash) of the model to use.

  • trust_remote_code (bool, default: False) – Whether to trust remote code.

  • device (Union[None, int, str, device, list[int | str | device]], default: None) – The device to use for the model.

  • device_map (Union[None, dict, str], default: None) – The device map to use for the model.

  • dtype (Union[None, str, dtype], default: None) – The type to use for the model weights.

  • adapter_name (Optional[str], default: None) – The name of the adapter to use.

  • adapter_kwargs (Optional[dict], default: None) – Additional keyword arguments to pass to the PeftModel constructor.

  • cache_folder_path (Optional[str], default: None) – The path to the cache folder. If None, the default cache folder for the DataDreamer session will be used.

  • **kwargs – Additional keyword arguments to pass to the Hugging Face model constructor.
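
A hedged instantiation sketch (the model name, device, and dtype are illustrative choices, not requirements):

import torch

from datadreamer.task_models import HFClassificationTaskModel

model = HFClassificationTaskModel(
    "distilbert-base-uncased-finetuned-sst-2-english",
    device=0,             # first CUDA device
    dtype=torch.float16,  # half-precision weights
)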

property model: PreTrainedModel[source]#

The model instance being used.

property tokenizer: PreTrainedTokenizer[source]#

The tokenizer instance being used.

property model_max_length: int[source]#

The maximum sequence length (in tokens) supported by the model.

run(texts, truncate=False, batch_size=10, batch_scheduler_buffer_size=None, adaptive_batch_size=True, progress_interval=60, force=False, cache_only=False, verbose=None, log_level=None, total_num_texts=None, return_generator=False, **kwargs)[source]#

Runs the model on the texts.

Parameters:
  • texts (Iterable[Any]) – The texts to run against the model.

  • truncate (bool, default: False) – Whether to truncate the texts.

  • batch_size (int, default: 10) – The batch size to use.

  • batch_scheduler_buffer_size (Optional[int], default: None) – The buffer size to use for the batch scheduler.

  • adaptive_batch_size (bool, default: True) – Whether to use adaptive batch sizing.

  • progress_interval (Optional[int], default: 60) – How often to log progress in seconds.

  • force (bool, default: False) – Whether to force run the model (ignore cached results).

  • cache_only (bool, default: False) – Whether to only use the cache.

  • verbose (Optional[bool], default: None) – Whether or not to print verbose logs.

  • log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).

  • total_num_texts (Optional[int], default: None) – The total number of texts being processed (helps with displaying progress).

  • return_generator (bool, default: False) – Whether to return a generator instead of a list.

  • **kwargs – Additional keyword arguments to pass when running the model.

Return type:

Union[Generator[dict[str, Any], None, None], list[dict[str, Any]]]

Returns:

The result of running the model on the texts.
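
A sketch of a classification run; the exact keys in each per-text result dict are assumed here to be the model's labels mapped to scores:

results = model.run(
    ["I loved this movie!", "This was a waste of time."],
    batch_size=2,
)
for result in results:
    print(result)  # e.g., {"POSITIVE": 0.99, "NEGATIVE": 0.01} (format assumed)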

class datadreamer.task_models.ParallelTaskModel(*task_models)[source]#

Bases: _ParallelCachable, TaskModel

Creates a task model that runs multiple task models in parallel (a usage sketch follows the parameter list). See running models in parallel for more details.

Parameters:

*task_models (TaskModel) – The task models to run in parallel.
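
A usage sketch, assuming two CUDA devices are available and splitting work across two copies of the same model:

from datadreamer.task_models import (
    HFClassificationTaskModel,
    ParallelTaskModel,
)

parallel_model = ParallelTaskModel(
    HFClassificationTaskModel(
        "distilbert-base-uncased-finetuned-sst-2-english", device=0
    ),
    HFClassificationTaskModel(
        "distilbert-base-uncased-finetuned-sst-2-english", device=1
    ),
)

# run() is assumed to split the texts across the underlying task models.
results = parallel_model.run(["text 1", "text 2", "text 3", "text 4"])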

property model_max_length: int[source]#

The maximum sequence length (in tokens) supported by the model.

run(texts, *args, **kwargs)[source]#

Runs the model on the texts.

Parameters:
  • texts (Iterable[Any]) – The texts to run against the model.

  • truncate (bool, default: False) – Whether to truncate the texts.

  • batch_size (int, default: 10) – The batch size to use.

  • batch_scheduler_buffer_size (Optional[int], default: None) – The buffer size to use for the batch scheduler.

  • adaptive_batch_size (bool, default: True) – Whether to use adaptive batch sizing.

  • progress_interval (Optional[int], default: 60) – How often to log progress in seconds.

  • force (bool, default: False) – Whether to force run the model (ignore cached results).

  • cache_only (bool, default: False) – Whether to only use the cache.

  • verbose (Optional[bool], default: None) – Whether or not to print verbose logs.

  • log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).

  • total_num_texts (Optional[int], default: None) – The total number of texts being processed (helps with displaying progress).

  • return_generator (bool, default: False) – Whether to return a generator instead of a list.

  • **kwargs – Additional keyword arguments to pass when running the model.

Return type:

Union[Generator[dict[str, Any], None, None], list[dict[str, Any]]]

Returns:

The result of running the model on the texts.