 This method returns the auto-fixed dataset. It works for text or tabular dataset only.
 Args:
     cleanset_id (str): ID of cleanset.
+    original_df (pd.DataFrame): The original dataset in DataFrame format.
     params (dict, optional): Default parameter dictionary containing confidence threshold for auto-relabelling, and
         fraction of rows to drop for each issue type. If not provided, default values will be used.
-
-        Example:
+        This dictionary includes the following options:
+
+        * drop_ambiguous (float): Fraction of rows to drop when encountering ambiguous data. Default is 0.0 (no rows dropped).
+        * drop_label_issue (float): Fraction of rows to drop when facing label-related issues. Default is 0.5 (50% of rows dropped).
+        * drop_near_duplicate (float): Fraction of rows to drop for near-duplicate data. Default is 0.5 (50% of rows dropped).
+        * drop_outlier (float): Fraction of rows to drop for outlier data. Default is 0.2 (20% of rows dropped).
+        * relabel_confidence_threshold (float): Confidence threshold for auto-relabelling. Default is 0.95.
+        For example, the default values are:
         {
             'drop_ambiguous': 0.0,
             'drop_label_issue': 0.5,
             'drop_near_duplicate': 0.5,
             'drop_outlier': 0.2,
             'relabel_confidence_threshold': 0.95
         }
+
+        Specify values in params to customize the behavior for specific scenarios. If params are provided, the values in params take precedence over default ones.
+
     strategy (str): Auto-fixing strategy to use,
         Possible strategies: optimized_training_data, drop_all_issues, suggested_actions
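
The docstring says that values supplied in `params` take precedence over the defaults. A minimal sketch of that precedence rule follows. It is not the library's actual implementation: `resolve_params`, `DEFAULT_PARAMS`, and `STRATEGIES` are hypothetical names introduced here for illustration. Rejecting unknown keys and requiring values in the range 0.0 to 1.0 are assumptions that the diff does not state.

```python
from typing import Dict, Optional

# Default values copied from the docstring in the diff.
DEFAULT_PARAMS: Dict[str, float] = {
    "drop_ambiguous": 0.0,
    "drop_label_issue": 0.5,
    "drop_near_duplicate": 0.5,
    "drop_outlier": 0.2,
    "relabel_confidence_threshold": 0.95,
}

# Strategy names listed in the docstring.
STRATEGIES = ("optimized_training_data", "drop_all_issues", "suggested_actions")


def resolve_params(params: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Hypothetical helper: overlay caller-supplied values on the defaults."""
    params = params or {}

    # Assumption: unknown keys are treated as errors rather than ignored.
    unknown = set(params) - set(DEFAULT_PARAMS)
    if unknown:
        raise ValueError(f"Unknown params: {sorted(unknown)}")

    # Later entries win, so caller-supplied values override the defaults.
    merged = {**DEFAULT_PARAMS, **params}

    # Assumption: fractions and the confidence threshold lie in [0.0, 1.0].
    for name, value in merged.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
    return merged


def check_strategy(strategy: str) -> str:
    """Hypothetical helper: accept only the strategy names the docstring lists."""
    if strategy not in STRATEGIES:
        raise ValueError(f"Unknown strategy {strategy!r}; expected one of {STRATEGIES}")
    return strategy
```

Under these assumptions, `resolve_params({"drop_outlier": 0.0})` returns a dictionary in which only `drop_outlier` differs from the defaults, and `resolve_params()` returns a copy of the defaults unchanged.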