Datasets dataset class (2)
2022-06-26 14:32:00 【Live up to your youth】
Object methods (important)
1. The map function
map(
function: Optional[Callable] = None,
with_indices: bool = False,
with_rank: bool = False,
input_columns: Optional[Union[str, List[str]]] = None,
batched: bool = False,
batch_size: Optional[int] = 1000,
drop_last_batch: bool = False,
remove_columns: Optional[Union[str, List[str]]] = None,
keep_in_memory: bool = False,
load_from_cache_file: bool = True,
cache_file_names: Optional[Dict[str, Optional[str]]] = None,
writer_batch_size: Optional[int] = 1000,
features: Optional[Features] = None,
disable_nullable: bool = False,
fn_kwargs: Optional[dict] = None,
num_proc: Optional[int] = None,
desc: Optional[str] = None,
)
Applies a mapping function, given by the parameter function, to every element in the Dataset. If function is not specified, the default is lambda x: x. The parameter batched indicates whether elements are processed in batches, and batch_size sets the batch size, i.e. how many elements are processed per call (default 1000). The parameter drop_last_batch indicates whether to drop the last batch when it contains fewer than batch_size elements (if True, the incomplete final batch is skipped).
>>> import transformers
>>> import datasets
>>> dataset = datasets.load_dataset("glue", "cola", split="train")
>>> dataset
Dataset({
features: ['sentence', 'label', 'idx'],
num_rows: 8551
})
>>> tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")
>>> dataset = dataset.map(lambda data: tokenizer(data["sentence"],
padding="max_length",
truncation=True,
max_length=10),
batched=True,
batch_size=1000,
drop_last_batch=False)
>>> dataset
Dataset({
features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 8551
})
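The batching behaviour of batched, batch_size, and drop_last_batch can be sketched in plain Python. This is a hypothetical helper that only illustrates how batched map slices the data, not the actual datasets internals:

```python
# Sketch of batched mapping: the function receives up to batch_size
# elements at a time instead of a single element.
def batched_map(rows, function, batch_size=1000, drop_last_batch=False):
    out = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        if drop_last_batch and len(batch) < batch_size:
            break  # skip the incomplete final batch
        out.extend(function(batch))
    return out

# Batches are [1, 2], [3, 4], [5]; the final batch [5] is dropped.
doubled = batched_map([1, 2, 3, 4, 5], lambda b: [x * 2 for x in b],
                      batch_size=2, drop_last_batch=True)
# doubled == [2, 4, 6, 8]
```

With drop_last_batch=False (the default), the final short batch would also be processed and 10 would appear at the end of the result.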
The parameter input_columns specifies which columns are passed to the function; by default all columns of the Dataset are passed in as a dictionary. When input_columns is set, the values of the named columns are passed directly as positional arguments. The parameter remove_columns specifies columns to remove from the result.
>>> import transformers
>>> import datasets
>>> dataset = datasets.load_dataset("glue", "cola", split="train")
>>> dataset
Dataset({
features: ['sentence', 'label', 'idx'],
num_rows: 8551
})
>>> tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")
>>> dataset = dataset.map(lambda data: tokenizer(data["sentence"],padding=True), batched=True)
>>> dataset
Dataset({
features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 8551
})
# When using the input_columns parameter, note the form of the lambda:
# it receives the column values directly, not a dictionary
>>> dataset = dataset.map(lambda data: tokenizer(data,
padding="max_length",
truncation=True,
max_length=10),
batched=True,
batch_size=1000,
drop_last_batch=False,
input_columns=["sentence"])
>>> dataset
Dataset({
features: ['sentence', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 8551
})
# Using the remove_columns parameter
>>> dataset = dataset.map(lambda data: tokenizer(data["sentence"],padding=True), batched=True, remove_columns=["sentence", "idx"])
>>> dataset
Dataset({
features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 8551
})
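The difference input_columns makes to the mapped function's argument can be sketched in plain Python. apply_map below is a hypothetical helper for illustration only, not the real datasets implementation:

```python
# Sketch: without input_columns the function gets the whole example dict;
# with input_columns it gets only the named column values.
def apply_map(example, function, input_columns=None):
    if input_columns is None:
        return function(example)           # whole dict is passed in
    args = [example[col] for col in input_columns]
    return function(*args)                 # only the named columns

row = {"sentence": "hello world", "label": 1, "idx": 0}

# Without input_columns, the lambda must index the dict itself:
a = apply_map(row, lambda data: {"n_chars": len(data["sentence"])})

# With input_columns=["sentence"], the value arrives directly:
b = apply_map(row, lambda data: {"n_chars": len(data)},
              input_columns=["sentence"])
# a == b == {"n_chars": 11}
```

This is why the lambda in the example above indexes data["sentence"] in the first call but uses data directly when input_columns=["sentence"] is set.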
2. The to_tf_dataset function
to_tf_dataset(
columns: Union[str, List[str]],
batch_size: int,
shuffle: bool,
collate_fn: Callable,
drop_remainder: bool = None,
collate_fn_args: Dict[str, Any] = None,
label_cols: Union[str, List[str]] = None,
dummy_labels: bool = False,
prefetch: bool = True,
)
Creates a tf.data.Dataset from the datasets.Dataset object. If batch_size is set, the tf.data.Dataset loads one batch of data at a time from the datasets.Dataset; each batch is a dictionary whose keys come from the columns parameter.
The parameter columns specifies the keys of the generated data; valid values are one or more of the datasets.Dataset's features. The parameter batch_size sets the size of each batch. The parameter shuffle indicates whether to shuffle the data. The parameter collate_fn is a function that collates multiple samples into a single batch.
The parameter drop_remainder indicates whether to drop the last incomplete batch when loading, ensuring that all batches have the same length in the batch dimension. The parameter label_cols specifies the dataset columns to load as labels. The parameter prefetch indicates whether to run the data loader in a separate background thread and maintain a small buffer of batches, which improves performance by loading data while the model is training.
>>> import transformers
>>> import datasets
>>> dataset = datasets.load_dataset("glue", "cola", split="train")
>>> dataset
Dataset({
features: ['sentence', 'label', 'idx'],
num_rows: 8551
})
>>> tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")
>>> dataset = dataset.map(lambda data: tokenizer(data["sentence"],padding=True), batched=True)
>>> dataset
Dataset({
features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 8551
})
>>> data_collator = transformers.DataCollatorWithPadding(tokenizer, return_tensors="tf")
>>> dataset = dataset.to_tf_dataset(columns=["label", "input_ids"], batch_size=16, shuffle=False, collate_fn=data_collator)
>>> dataset
<PrefetchDataset element_spec={'input_ids': TensorSpec(shape=(None, None), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(None, None), dtype=tf.int64, name=None), 'labels': TensorSpec(shape=(None,), dtype=tf.int64, name=None)}>
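The prefetch behaviour described above can be sketched with a background thread and a bounded queue. This is a simplified illustration of the idea behind prefetch=True, not the tf.data internals:

```python
import queue
import threading

# Sketch: a producer thread fills a small buffer of batches while the
# consumer (e.g. a training loop) drains it, so loading overlaps training.
def prefetch(batches, buffer_size=2):
    buf = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def producer():
        for batch in batches:
            buf.put(batch)   # blocks while the buffer is full
        buf.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while (item := buf.get()) is not sentinel:
        yield item

# Batches come out in order, but were loaded ahead of time in the background.
batches = list(prefetch([[0, 1], [2, 3], [4, 5]]))
```

The small buffer bounds memory use: the producer stalls once buffer_size batches are waiting, rather than loading the whole dataset ahead of the consumer.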