task_models#
TaskModel objects perform an arbitrary NLP task (classification, etc.). All task models derive from the TaskModel base class.
Tip
Instead of calling run() directly, use a step that takes a TaskModel as an args argument, such as RunTaskModel.
Caching#
Task models internally cache their results to disk, so if you run the same text multiple times, the task model only runs once; subsequent runs reuse the cached results.
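DataDreamer's actual cache lives in the internal _Cachable base class, but the general pattern can be sketched as a key–value store keyed by a hash of the model identity, the input text, and the run arguments. The TinyCache class below is an illustrative stand-in, not DataDreamer's implementation; its key scheme and SQLite backing are assumptions.

```python
import hashlib
import json
import sqlite3


class TinyCache:
    """Illustrative on-disk cache: results keyed by a hash of (model, text, kwargs)."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)"
        )

    def key(self, model_name, text, **kwargs):
        # Stable key: hash the model identity, the input text, and the run kwargs.
        payload = json.dumps([model_name, text, kwargs], sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, key):
        row = self.db.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def set(self, key, value):
        self.db.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?)", (key, json.dumps(value))
        )
        self.db.commit()
```

Because the key covers the run arguments as well as the text, changing an argument such as truncation would (correctly) miss the cache and trigger a fresh run.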
- class datadreamer.task_models.TaskModel(cache_folder_path=None)[source]#
Bases: _Cachable
Base class for all task models.
- Parameters:
  - cache_folder_path (Optional[str], default: None) – The path to the cache folder. If None, the default cache folder for the DataDreamer session will be used.
- abstract run(texts, truncate=False, batch_size=10, batch_scheduler_buffer_size=None, adaptive_batch_size=True, progress_interval=60, force=False, cache_only=False, verbose=None, log_level=None, total_num_texts=None, return_generator=False, **kwargs)[source]#
Runs the model on the texts.
- Parameters:
  - texts (Iterable[Any]) – The texts to run against the model.
  - truncate (bool, default: False) – Whether to truncate the texts.
  - batch_size (int, default: 10) – The batch size to use.
  - batch_scheduler_buffer_size (Optional[int], default: None) – The buffer size to use for the batch scheduler.
  - adaptive_batch_size (bool, default: True) – Whether to use adaptive batch sizing.
  - progress_interval (Optional[int], default: 60) – How often to log progress, in seconds.
  - force (bool, default: False) – Whether to force run the step (ignore saved results).
  - cache_only (bool, default: False) – Whether to only use the cache.
  - verbose (Optional[bool], default: None) – Whether or not to print verbose logs.
  - log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).
  - total_num_texts (Optional[int], default: None) – The total number of texts being processed (helps with displaying progress).
  - return_generator (bool, default: False) – Whether to return a generator instead of a list.
  - **kwargs – Additional keyword arguments to pass when running the model.
- Return type:
  Union[Generator[dict[str, Any], None, None], list[dict[str, Any]]]
- Returns:
  The result of running the model on the texts.
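The Union return type means callers get either a fully materialized list of result dicts or a lazy generator, depending on return_generator. The ToyTaskModel below is a minimal stand-in showing how a subclass might honor that contract; it is not DataDreamer's implementation, and its batching and output fields are invented for illustration.

```python
from typing import Any, Generator, Iterable, Union


class ToyTaskModel:
    """Illustrative run() implementation: batches texts, yields one dict per text."""

    def run(
        self,
        texts: Iterable[Any],
        batch_size: int = 10,
        return_generator: bool = False,
        **kwargs,
    ) -> Union[Generator[dict[str, Any], None, None], list[dict[str, Any]]]:
        def generate() -> Generator[dict[str, Any], None, None]:
            batch: list[Any] = []
            for text in texts:
                batch.append(text)
                if len(batch) == batch_size:
                    yield from ({"text": t, "length": len(t)} for t in batch)
                    batch = []
            # Flush the final partial batch, if any.
            yield from ({"text": t, "length": len(t)} for t in batch)

        generator = generate()
        # return_generator=False materializes everything; True streams lazily.
        return generator if return_generator else list(generator)
```

Requesting a generator is useful when the text collection is large and results should be consumed incrementally rather than held in memory all at once.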
- class datadreamer.task_models.HFClassificationTaskModel(model_name, revision=None, trust_remote_code=False, device=None, device_map=None, dtype=None, adapter_name=None, adapter_kwargs=None, cache_folder_path=None, **kwargs)[source]#
Bases: TaskModel
Loads a Hugging Face classification task model.
- Parameters:
  - model_name (str) – The name of the model to use.
  - revision (Optional[str], default: None) – The version (commit hash) of the model to use.
  - trust_remote_code (bool, default: False) – Whether to trust remote code.
  - device (Union[None, int, str, device, list[int | str | device]], default: None) – The device to use for the model.
  - device_map (Union[None, dict, str], default: None) – The device map to use for the model.
  - dtype (Union[None, str, dtype], default: None) – The type to use for the model weights.
  - adapter_name (Optional[str], default: None) – The name of the adapter to use.
  - adapter_kwargs (Optional[dict], default: None) – Additional keyword arguments to pass to the PeftModel constructor.
  - cache_folder_path (Optional[str], default: None) – The path to the cache folder. If None, the default cache folder for the DataDreamer session will be used.
  - **kwargs – Additional keyword arguments to pass to the Hugging Face model constructor.
- run(texts, truncate=False, batch_size=10, batch_scheduler_buffer_size=None, adaptive_batch_size=True, progress_interval=60, force=False, cache_only=False, verbose=None, log_level=None, total_num_texts=None, return_generator=False, **kwargs)[source]#
Runs the model on the texts.
- Parameters:
  - texts (Iterable[Any]) – The texts to run against the model.
  - truncate (bool, default: False) – Whether to truncate the texts.
  - batch_size (int, default: 10) – The batch size to use.
  - batch_scheduler_buffer_size (Optional[int], default: None) – The buffer size to use for the batch scheduler.
  - adaptive_batch_size (bool, default: True) – Whether to use adaptive batch sizing.
  - progress_interval (Optional[int], default: 60) – How often to log progress, in seconds.
  - force (bool, default: False) – Whether to force run the step (ignore saved results).
  - cache_only (bool, default: False) – Whether to only use the cache.
  - verbose (Optional[bool], default: None) – Whether or not to print verbose logs.
  - log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).
  - total_num_texts (Optional[int], default: None) – The total number of texts being processed (helps with displaying progress).
  - return_generator (bool, default: False) – Whether to return a generator instead of a list.
  - **kwargs – Additional keyword arguments to pass when running the model.
- Return type:
  Union[Generator[dict[str, Any], None, None], list[dict[str, Any]]]
- Returns:
  The result of running the model on the texts.
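The exact keys in each result dict depend on the model; for a classification model, a per-text dict mapping labels to scores is a typical shape. The snippet below is a hypothetical keyword-based classifier sketching that output shape only; its labels, word lists, and scoring rule are invented and bear no relation to how HFClassificationTaskModel actually scores texts.

```python
# Invented word lists for a toy sentiment classifier (illustration only).
POSITIVE_WORDS = {"good", "great", "love"}
NEGATIVE_WORDS = {"bad", "awful", "hate"}


def classify(texts):
    """Toy classifier: returns one dict per text mapping labels to scores."""
    results = []
    for text in texts:
        words = set(text.lower().split())
        pos = len(words & POSITIVE_WORDS)
        neg = len(words & NEGATIVE_WORDS)
        total = pos + neg or 1  # avoid division by zero for neutral texts
        results.append({"positive": pos / total, "negative": neg / total})
    return results
```

Downstream steps can then consume each per-text dict uniformly, regardless of which task model produced it.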
- class datadreamer.task_models.ParallelTaskModel(*task_models)[source]#
Bases: _ParallelCachable, TaskModel
Creates a task model that will run multiple task models in parallel. See running models in parallel for more details.
- Parameters:
  - *task_models (TaskModel) – The task models to run in parallel.
- run(texts, *args, **kwargs)[source]#
Runs the model on the texts.
- Parameters:
  - texts (Iterable[Any]) – The texts to run against the model.
  - truncate (bool, default: False) – Whether to truncate the texts.
  - batch_size (int, default: 10) – The batch size to use.
  - batch_scheduler_buffer_size (Optional[int], default: None) – The buffer size to use for the batch scheduler.
  - adaptive_batch_size (bool, default: True) – Whether to use adaptive batch sizing.
  - progress_interval (Optional[int], default: 60) – How often to log progress, in seconds.
  - force (bool, default: False) – Whether to force run the step (ignore saved results).
  - cache_only (bool, default: False) – Whether to only use the cache.
  - verbose (Optional[bool], default: None) – Whether or not to print verbose logs.
  - log_level (Optional[int], default: None) – The logging level to use (DEBUG, INFO, etc.).
  - total_num_texts (Optional[int], default: None) – The total number of texts being processed (helps with displaying progress).
  - return_generator (bool, default: False) – Whether to return a generator instead of a list.
  - **kwargs – Additional keyword arguments to pass when running the model.
- Return type:
  Union[Generator[dict[str, Any], None, None], list[dict[str, Any]]]
- Returns:
  The result of running the model on the texts.
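ParallelTaskModel's actual scheduling lives in the internal _ParallelCachable class. The essential idea, splitting one collection of texts across several models and recombining results in the original input order, can be sketched with concurrent.futures. The run_parallel function and its round-robin sharding below are assumptions for illustration, not DataDreamer internals; the models are plain callables here for testability, whereas in DataDreamer each would be a TaskModel whose run() handles its shard.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(models, texts):
    """Split texts across models round-robin, run concurrently, recombine in input order."""
    n = len(models)
    shards = [texts[i::n] for i in range(n)]  # shard i goes to model i
    with ThreadPoolExecutor(max_workers=n) as pool:
        shard_results = list(
            pool.map(lambda pair: pair[0](pair[1]), zip(models, shards))
        )
    # Interleave each shard's results back into the original positions.
    results = [None] * len(texts)
    for i, shard in enumerate(shard_results):
        results[i::n] = shard
    return results
```

Because results are reassembled by position, callers see the same ordering they would get from a single model, only produced concurrently.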