Datasets dataset class (2)
2022-06-26 14:32:00 【Live up to your youth】
Object methods (important)
1. The map function
map(
function: Optional[Callable] = None,
with_indices: bool = False,
with_rank: bool = False,
input_columns: Optional[Union[str, List[str]]] = None,
batched: bool = False,
batch_size: Optional[int] = 1000,
drop_last_batch: bool = False,
remove_columns: Optional[Union[str, List[str]]] = None,
keep_in_memory: bool = False,
load_from_cache_file: bool = True,
cache_file_names: Optional[Dict[str, Optional[str]]] = None,
writer_batch_size: Optional[int] = 1000,
features: Optional[Features] = None,
disable_nullable: bool = False,
fn_kwargs: Optional[dict] = None,
num_proc: Optional[int] = None,
desc: Optional[str] = None,
)
Applies a mapping function, given by the parameter function, to every element in the Dataset. If function is not specified, the default is lambda x: x. The parameter batched indicates whether elements are processed in batches, and batch_size sets the batch size, i.e. how many elements are processed per call (default 1000). The parameter drop_last_batch indicates whether to drop the last batch when it contains fewer than batch_size elements (if True, the incomplete final batch is skipped).
>>> import transformers
>>> import datasets
>>> dataset = datasets.load_dataset("glue", "cola", split="train")
>>> dataset
Dataset({
features: ['sentence', 'label', 'idx'],
num_rows: 8551
})
>>> tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")
>>> dataset = dataset.map(lambda data: tokenizer(data["sentence"],
padding="max_length",
truncation=True,
max_length=10),
batched=True,
batch_size=1000,
drop_last_batch=False)
>>> dataset
Dataset({
features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 8551
})
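The batching behaviour of batched, batch_size, and drop_last_batch can be sketched in plain Python. This is a hypothetical helper that only illustrates how batched map slices the data, not the actual datasets internals:

```python
# Sketch of batched mapping: the function receives up to batch_size
# elements at a time instead of a single element.
def batched_map(rows, function, batch_size=1000, drop_last_batch=False):
    out = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        if drop_last_batch and len(batch) < batch_size:
            break  # skip the incomplete final batch
        out.extend(function(batch))
    return out

# Batches are [1, 2], [3, 4], [5]; the final batch [5] is dropped.
doubled = batched_map([1, 2, 3, 4, 5], lambda b: [x * 2 for x in b],
                      batch_size=2, drop_last_batch=True)
# doubled == [2, 4, 6, 8]
```

With drop_last_batch=False (the default), the final short batch would also be processed and 10 would appear at the end of the result.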
The parameter input_columns specifies which columns are passed to the function; by default all columns of the Dataset are passed in as a dictionary. When input_columns is set, the values of the named columns are passed directly as positional arguments. The parameter remove_columns specifies columns to remove from the result.
>>> import transformers
>>> import datasets
>>> dataset = datasets.load_dataset("glue", "cola", split="train")
>>> dataset
Dataset({
features: ['sentence', 'label', 'idx'],
num_rows: 8551
})
>>> tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")
>>> dataset = dataset.map(lambda data: tokenizer(data["sentence"],padding=True), batched=True)
>>> dataset
Dataset({
features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 8551
})
# When using the input_columns parameter, note the form of the lambda:
# it receives the column values directly, not a dictionary
>>> dataset = dataset.map(lambda data: tokenizer(data,
padding="max_length",
truncation=True,
max_length=10),
batched=True,
batch_size=1000,
drop_last_batch=False,
input_columns=["sentence"])
>>> dataset
Dataset({
features: ['sentence', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 8551
})
# Using the remove_columns parameter
>>> dataset = dataset.map(lambda data: tokenizer(data["sentence"],padding=True), batched=True, remove_columns=["sentence", "idx"])
>>> dataset
Dataset({
features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 8551
})
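The difference input_columns makes to the mapped function's argument can be sketched in plain Python. apply_map below is a hypothetical helper for illustration only, not the real datasets implementation:

```python
# Sketch: without input_columns the function gets the whole example dict;
# with input_columns it gets only the named column values.
def apply_map(example, function, input_columns=None):
    if input_columns is None:
        return function(example)           # whole dict is passed in
    args = [example[col] for col in input_columns]
    return function(*args)                 # only the named columns

row = {"sentence": "hello world", "label": 1, "idx": 0}

# Without input_columns, the lambda must index the dict itself:
a = apply_map(row, lambda data: {"n_chars": len(data["sentence"])})

# With input_columns=["sentence"], the value arrives directly:
b = apply_map(row, lambda data: {"n_chars": len(data)},
              input_columns=["sentence"])
# a == b == {"n_chars": 11}
```

This is why the lambda in the example above indexes data["sentence"] in the first call but uses data directly when input_columns=["sentence"] is set.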
2. The to_tf_dataset function
to_tf_dataset(
columns: Union[str, List[str]],
batch_size: int,
shuffle: bool,
collate_fn: Callable,
drop_remainder: bool = None,
collate_fn_args: Dict[str, Any] = None,
label_cols: Union[str, List[str]] = None,
dummy_labels: bool = False,
prefetch: bool = True,
)
Creates a tf.data.Dataset from the datasets.Dataset object. If batch_size is set, the tf.data.Dataset loads one batch of data at a time from the datasets.Dataset; each batch is a dictionary whose keys come from the columns parameter.
The parameter columns specifies the keys of the generated data; valid values are one or more of the datasets.Dataset's features. The parameter batch_size sets the size of each batch. The parameter shuffle indicates whether to shuffle the data. The parameter collate_fn is a function that collates multiple samples into a single batch.
The parameter drop_remainder indicates whether to drop the last incomplete batch when loading, ensuring that all batches have the same length in the batch dimension. The parameter label_cols specifies the dataset columns to load as labels. The parameter prefetch indicates whether to run the data loader in a separate background thread and maintain a small buffer of batches, which improves performance by loading data while the model is training.
>>> import transformers
>>> import datasets
>>> dataset = datasets.load_dataset("glue", "cola", split="train")
>>> dataset
Dataset({
features: ['sentence', 'label', 'idx'],
num_rows: 8551
})
>>> tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")
>>> dataset = dataset.map(lambda data: tokenizer(data["sentence"],padding=True), batched=True)
>>> dataset
Dataset({
features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 8551
})
>>> data_collator = transformers.DataCollatorWithPadding(tokenizer, return_tensors="tf")
>>> dataset = dataset.to_tf_dataset(columns=["label", "input_ids"], batch_size=16, shuffle=False, collate_fn=data_collator)
>>> dataset
<PrefetchDataset element_spec={'input_ids': TensorSpec(shape=(None, None), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(None, None), dtype=tf.int64, name=None), 'labels': TensorSpec(shape=(None,), dtype=tf.int64, name=None)}>
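The prefetch behaviour described above can be sketched with a background thread and a bounded queue. This is a simplified illustration of the idea behind prefetch=True, not the tf.data internals:

```python
import queue
import threading

# Sketch: a producer thread fills a small buffer of batches while the
# consumer (e.g. a training loop) drains it, so loading overlaps training.
def prefetch(batches, buffer_size=2):
    buf = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def producer():
        for batch in batches:
            buf.put(batch)   # blocks while the buffer is full
        buf.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while (item := buf.get()) is not sentinel:
        yield item

# Batches come out in order, but were loaded ahead of time in the background.
batches = list(prefetch([[0, 1], [2, 3], [4, 5]]))
```

The small buffer bounds memory use: the producer stalls once buffer_size batches are waiting, rather than loading the whole dataset ahead of the consumer.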