当前位置:网站首页>transformers DataCollatorWithPadding类
transformers DataCollatorWithPadding类
2022-06-26 13:27:00 【不负韶华ღ】
构造方法
DataCollatorWithPadding(tokenizer:PreTrainedTokenizerBase
padding:typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = True
max_length : typing.Optional[int] = None
pad_to_multiple_of : typing.Optional[int] = None
return_tensors : str = 'pt ' )
在transfomers中,定义了一个DataCollator类,该类用于将数据集的单个元素打包成一批数据。DataCollatorWithPadding类是DataCollator类的一个实现类,该类在打包时将动态填充输入的数据。
参数tokenizer表示输入的分词器。参数padding可以为bool类型,True表示填充,False表示不填充;也可以为字符串,表示填充策略,"longest"表示根据输入数据中最长的数据来进行填充,"max_length"表示填充至参数max_length设置的长度,“do_not_pad"表示不填充。参数pad_to_multiple_of表示填充的数据的倍数。参数return_tensors表示返回的数据类型,可以为"pt”,pytorch数据类型;“tf”,tensorflow数据类型;“np”,"numpy"数据类型。
使用示例
>>> import transformers
>>> import datasets
>>> dataset = datasets.load_dataset("glue", "cola", split="train")
>>> dataset = dataset.map(lambda data: tokenizer(data["sentence"],padding=True), batched=True)
>>> dataset
Dataset({
features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 8551
})
>>> tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")
>>> data_collator = transformers.DataCollatorWithPadding(tokenizer,
padding="max_length",
max_length=12,
return_tensors="tf")
>>> dataset = dataset.to_tf_dataset(columns=["label", "input_ids"], batch_size=16, shuffle=False, collate_fn=data_collator)
>>> dataset
<PrefetchDataset element_spec={'input_ids': TensorSpec(shape=(None, None), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(None, None), dtype=tf.int64, name=None), 'labels': TensorSpec(shape=(None,), dtype=tf.int64, name=None)}>
边栏推荐
- Common operation and Principle Exploration of stream
- Server create virtual environment run code
- [ahoi2005] route planning
- Gartner 2022 Top Strategic Technology Trends Report
- Usage of unique function
- 虫子 STL string 下 练习题
- Lucky numbers in the matrix
- Insect operator overloaded a fun
- 永远不要使用Redis过期监听实现定时任务!
- 近期比较重要消息
猜你喜欢

From Celsius to the three arrows: encrypting the domino of the ten billion giants, and drying up the epic liquidity

Sword finger offer 45.61 Sort (simple)

One article of the quantification framework backtrader read observer

7.consul service registration and discovery

量化框架backtrader之一文读懂observer观测器

AGCO AI frontier promotion (6.26)

Eigen(3):error: ‘Eigen’ has not been declared

ThreadLocal giant pit! Memory leaks are just Pediatrics

Zero basics of C language lesson 8: Functions

9 regulations and 6 prohibitions! The Ministry of education and the emergency management department jointly issued the nine provisions on fire safety management of off campus training institutions
随机推荐
9项规定6个严禁!教育部、应急管理部联合印发《校外培训机构消防安全管理九项规定》
Stream常用操作以及原理探索
Svn commit error after deleting files locally
[hnoi2010] flying sheep
C language | Consortium
数学建模经验分享:国赛美赛对比/选题参考/常用技巧
character constants
The most critical elements of team management
量化框架backtrader之一文读懂observer观测器
Insect operator overloaded a fun
Caelus - full scene offline mixed Department solution
Hard (magnetic) disk (I)
【HCSD应用开发实训营】一行代码秒上云评测文章—实验过程心得
Hands on data analysis unit 3 model building and evaluation
免费的机器学习数据集网站(6300+数据集)
A solution to the problem that the display of newff function in neural network cannot be converted from double to struct
Logical operation
Luogu p4145 seven minutes of God created questions 2 / Huashen travels around the world
First k large XOR value problem
Design of PHP asymmetric encryption algorithm (RSA) encryption mechanism