🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Overview




🤗Datasets is a lightweight library providing two main features:

  • one-line dataloaders for many public datasets: one-liners to download and pre-process any of the major public datasets (in 467 languages and dialects!) provided on the HuggingFace Datasets Hub. With a simple command like squad_dataset = load_dataset("squad"), get any of these datasets ready to use in a dataloader for training/evaluating an ML model (NumPy/pandas/PyTorch/TensorFlow/JAX),
  • efficient data pre-processing: simple, fast and reproducible data pre-processing for the above public datasets as well as your own local datasets in CSV/JSON/text. With simple commands like tokenized_dataset = dataset.map(tokenize_example), efficiently prepare the dataset for inspection and ML model evaluation and training.

🎓 Documentation 🕹 Colab tutorial

🔎 Find a dataset in the Hub 🌟 Add a new dataset to the Hub

🤗Datasets also provides access to more than 15 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics.

🤗Datasets has many additional interesting features:

  • Thrive on large datasets: 🤗Datasets frees the user from RAM limitations; all datasets are memory-mapped using an efficient zero-serialization-cost backend (Apache Arrow).
  • Smart caching: never wait for your data to process several times.
  • Lightweight and fast with a transparent and pythonic API (multi-processing/caching/memory-mapping).
  • Built-in interoperability with NumPy, pandas, PyTorch, TensorFlow 2 and JAX.
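
The memory-mapping idea in the list above can be illustrated with the standard library alone. This is a minimal sketch, assuming fixed-width records in a hypothetical records.bin file; it uses Python's mmap module rather than Apache Arrow, so it shows the principle and not the library's actual backend.

```python
import mmap
import os
import tempfile

# Write a small file of fixed-width records to stand in for a large on-disk table.
# (File name and record layout are assumptions made for this illustration.)
RECORD_SIZE = 16
path = os.path.join(tempfile.mkdtemp(), "records.bin")
with open(path, "wb") as f:
    for i in range(1000):
        f.write(f"row-{i:04d}".encode("ascii").ljust(RECORD_SIZE, b" "))


def read_record(mapped, index):
    """Return one record by slicing the mapping instead of reading the whole file."""
    start = index * RECORD_SIZE
    return mapped[start:start + RECORD_SIZE].decode("ascii").rstrip()


with open(path, "rb") as f:
    # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        first = read_record(mapped, 0)
        last = read_record(mapped, 999)

print(first, last)
```

Random access by row index costs the same whether the file holds a thousand records or a billion, which is the property the bullet on large datasets relies on.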

🤗Datasets originated from a fork of the awesome TensorFlow Datasets and the HuggingFace team wants to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗Datasets and tfds can be found in the section Main differences between 🤗Datasets and tfds.

Installation

With pip

🤗Datasets can be installed from PyPI and should be installed in a virtual environment (venv or conda, for instance):

pip install datasets

With conda

🤗Datasets can be installed using conda as follows:

conda install -c huggingface -c conda-forge datasets

Follow the installation pages of TensorFlow and PyTorch to see how to install them with conda.

For more details on installation, check the installation page in the documentation: https://huggingface.co/docs/datasets/installation.html

Installation to use with PyTorch/TensorFlow/pandas

If you plan to use 🤗Datasets with PyTorch (1.0+), TensorFlow (2.2+) or pandas, you should also install PyTorch, TensorFlow or pandas.

For more details on using the library with NumPy, pandas, PyTorch or TensorFlow, check the quick tour page in the documentation: https://huggingface.co/docs/datasets/quicktour.html

Usage

🤗Datasets is made to be very simple to use. The main methods are:

  • datasets.list_datasets() to list the available datasets
  • datasets.load_dataset(dataset_name, **kwargs) to instantiate a dataset
  • datasets.list_metrics() to list the available metrics
  • datasets.load_metric(metric_name, **kwargs) to instantiate a metric

Here is a quick example:

from datasets import list_datasets, load_dataset, list_metrics, load_metric

# Print all the available datasets
print(list_datasets())

# Load a dataset and print the first example in the training set
squad_dataset = load_dataset('squad')
print(squad_dataset['train'][0])

# List all the available metrics
print(list_metrics())

# Load a metric
squad_metric = load_metric('squad')

# Process the dataset - add a column with the length of the context texts
dataset_with_length = squad_dataset.map(lambda x: {"length": len(x["context"])})

# Process the dataset - tokenize the context texts (using a tokenizer from the 🤗Transformers library)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

tokenized_dataset = squad_dataset.map(lambda x: tokenizer(x['context']), batched=True)

For more details on using the library, check the quick tour page in the documentation: https://huggingface.co/docs/datasets/quicktour.html

Another introduction to 🤗Datasets is the tutorial on Google Colab.

Add a new dataset to the Hub

We have a very detailed step-by-step guide to add a new dataset to the datasets already provided on the HuggingFace Datasets Hub.

You will find the step-by-step guide here to add a dataset to this repository.

You can also have your own repository for your dataset on the Hub under your or your organization's namespace and share it with the community. More information in the documentation section about dataset sharing.

Main differences between 🤗Datasets and tfds

If you are familiar with the great TensorFlow Datasets, here are the main differences between 🤗Datasets and tfds:

  • the scripts in 🤗Datasets are not provided within the library but are queried, downloaded/cached and dynamically loaded upon request
  • 🤗Datasets also provides evaluation metrics in a similar fashion to the datasets, i.e. as dynamically installed scripts with a unified API. This gives access to a benchmark dataset paired with its benchmark metric, for instance for benchmarks like SQuAD or GLUE.
  • the backend serialization of 🤗Datasets is based on Apache Arrow instead of TF Records and leverages Python dataclasses for info and features, with some diverging features (we mostly don't do encoding and store the raw data as much as possible in the backend serialization cache).
  • the user-facing dataset object of 🤗Datasets is not a tf.data.Dataset but a built-in framework-agnostic dataset class with methods inspired by what we like in tf.data (like a map() method). It basically wraps a memory-mapped Arrow table cache.
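
To make the last point concrete, here is a toy sketch of a framework-agnostic, column-oriented dataset object with a map() method. ColumnTable and its methods are hypothetical names invented for this illustration; the real class wraps a memory-mapped Arrow table and has a much richer API.

```python
class ColumnTable:
    """Toy column-oriented table; a stand-in for the Arrow table the text mentions."""

    def __init__(self, columns):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.columns = columns

    def __len__(self):
        return len(next(iter(self.columns.values()), []))

    def __getitem__(self, index):
        # Integer indexing returns one row as a dict of column name to value.
        return {name: values[index] for name, values in self.columns.items()}

    def map(self, function):
        """Apply `function` to each row and return a new table with the added columns."""
        new_columns = {name: list(values) for name, values in self.columns.items()}
        for i in range(len(self)):
            for name, value in function(self[i]).items():
                new_columns.setdefault(name, [None] * len(self))[i] = value
        return ColumnTable(new_columns)


table = ColumnTable({"context": ["short text", "a longer context"]})
with_length = table.map(lambda row: {"length": len(row["context"])})
print(with_length[0])
```

The map() call returns a new table and leaves the original untouched, which mirrors the way the quick example above assigns the result of map() to a new variable.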

Disclaimers

Similar to TensorFlow Datasets, 🤗Datasets is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use them. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.

If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!

BibTeX

If you want to cite this framework you can use this:

@article{2020HuggingFace-datasets,
  title={Datasets},
  author={Thomas Wolf and Quentin Lhoest and Patrick von Platen and Yacine Jernite and Mariama Drame and Julien Plu and Julien Chaumond and Clement Delangue and Clara Ma and Abhishek Thakur and Suraj Patil and Joe Davison and Teven Le Scao and Victor Sanh and Canwen Xu and Nicolas Patry and Angie McMillan-Major and Simon Brandeis and Sylvain Gugger and François Lagunas and Lysandre Debut and Morgan Funtowicz and Anthony Moi and Sasha Rush and Philipp Schmidd and Pierric Cistac and Victor Muštar and Jeff Boudier and Anna Tordjmann},
  journal={GitHub. Note: https://github.com/huggingface/datasets},
  volume={1},
  year={2020}
}
Comments
  • Load text file for RoBERTa pre-training.

    I am migrating my question from https://github.com/huggingface/transformers/pull/4009#issuecomment-690039444

    I tried to train a RoBERTa from scratch using transformers, but I got OOM issues when loading a large text file. Following the suggestion from @thomwolf, I tried to use datasets to load my text file. This test.txt is a simple sample where each line is a sentence.

    import torch
    from datasets import load_dataset
    dataset = load_dataset('text', data_files='test.txt',cache_dir="./")
    dataset.set_format(type='torch',columns=["text"])
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)
    next(iter(dataloader))
    

    But the dataloader cannot yield a sample, and the error is:

    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    <ipython-input-12-388aca337e2f> in <module>
    ----> 1 next(iter(dataloader))
    
    /Library/Python/3.7/site-packages/torch/utils/data/dataloader.py in __next__(self)
        361 
        362     def __next__(self):
    --> 363         data = self._next_data()
        364         self._num_yielded += 1
        365         if self._dataset_kind == _DatasetKind.Iterable and \
    
    /Library/Python/3.7/site-packages/torch/utils/data/dataloader.py in _next_data(self)
        401     def _next_data(self):
        402         index = self._next_index()  # may raise StopIteration
    --> 403         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
        404         if self._pin_memory:
        405             data = _utils.pin_memory.pin_memory(data)
    
    /Library/Python/3.7/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
         42     def fetch(self, possibly_batched_index):
         43         if self.auto_collation:
    ---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
         45         else:
         46             data = self.dataset[possibly_batched_index]
    
    /Library/Python/3.7/site-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
         42     def fetch(self, possibly_batched_index):
         43         if self.auto_collation:
    ---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
         45         else:
         46             data = self.dataset[possibly_batched_index]
    
    KeyError: 0
    

    dataset.set_format(type='torch',columns=["text"]) returns a log that says:

    Set __getitem__(key) output type to torch for ['text'] columns (when key is int or slice) and don't output other (un-formatted) columns.
    

    I noticed the dataset is DatasetDict({'train': Dataset(features: {'text': Value(dtype='string', id=None)}, num_rows: 44)}). Each sample can be accessed by dataset["train"]["text"] instead of dataset["text"].

    Could you please give me any suggestions on how to modify this code to load the text file?
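
The KeyError: 0 above is consistent with indexing a mapping of splits by integer. The sketch below reproduces that behaviour with a plain dict standing in for the DatasetDict (the rows are hypothetical and no datasets import is needed). With the real library, the analogous step would be to hand dataset["train"] to the DataLoader, or to pass split="train" to load_dataset.

```python
# Plain-Python stand-in: a mapping from split name to a list of rows.
# (Hypothetical data; the real object is a DatasetDict of Dataset splits.)
splits = {"train": [{"text": "first sentence"}, {"text": "second sentence"}]}

# A DataLoader indexes its dataset with integers, which a split mapping rejects.
try:
    splits[0]
    lookup_error = None
except KeyError as err:
    lookup_error = err

# Selecting the split first gives an object that supports integer indexing.
train_rows = splits["train"]
first_row = train_rows[0]

print(repr(lookup_error))
print(first_row)
```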

    Versions: Python version 3.7.3 PyTorch version 1.6.0 TensorFlow version 2.3.0 datasets version: 1.0.1

    opened by chiyuzhang94 43
  • load_dataset for text files not working

    Trying the following snippet, I get different problems on Linux and Windows.

    dataset = load_dataset("text", data_files="data.txt")
    # or 
    dataset = load_dataset("text", data_files=["data.txt"])
    

    (P.S. This example shows that you can use a string as input for data_files, but the signature says Union[Dict, List].)

    The problem on Linux is that the script crashes with a CSV error (even though it isn't a CSV file). On Windows the script just seems to freeze or get stuck after loading the config file.

    Linux stack trace:

    PyTorch version 1.6.0+cu101 available.
    Checking /home/bram/.cache/huggingface/datasets/b1d50a0e74da9a7b9822cea8ff4e4f217dd892e09eb14f6274a2169e5436e2ea.30c25842cda32b0540d88b7195147decf9671ee442f4bc2fb6ad74016852978e.py for additional imports.
    Found main folder for dataset https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py at /home/bram/.cache/huggingface/modules/datasets_modules/datasets/text
    Found specific version folder for dataset https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py at /home/bram/.cache/huggingface/modules/datasets_modules/datasets/text/7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7
    Found script file from https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py to /home/bram/.cache/huggingface/modules/datasets_modules/datasets/text/7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7/text.py
    Couldn't find dataset infos file at https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/dataset_infos.json
    Found metadata file for dataset https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py at /home/bram/.cache/huggingface/modules/datasets_modules/datasets/text/7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7/text.json
    Using custom data configuration default
    Generating dataset text (/home/bram/.cache/huggingface/datasets/text/default-0907112cc6cd2a38/0.0.0/7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7)
    Downloading and preparing dataset text/default-0907112cc6cd2a38 (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/bram/.cache/huggingface/datasets/text/default-0907112cc6cd2a38/0.0.0/7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7...
    Dataset not on Hf google storage. Downloading and preparing it from source
    Downloading took 0.0 min
    Checksum Computation took 0.0 min
    Unable to verify checksums.
    Generating split train
    Traceback (most recent call last):
      File "/home/bram/Python/projects/dutch-simplification/utils.py", line 45, in prepare_data
        dataset = load_dataset("text", data_files=dataset_f)
      File "/home/bram/.local/share/virtualenvs/dutch-simplification-NcpPZtDF/lib/python3.8/site-packages/datasets/load.py", line 608, in load_dataset
        builder_instance.download_and_prepare(
      File "/home/bram/.local/share/virtualenvs/dutch-simplification-NcpPZtDF/lib/python3.8/site-packages/datasets/builder.py", line 468, in download_and_prepare
        self._download_and_prepare(
      File "/home/bram/.local/share/virtualenvs/dutch-simplification-NcpPZtDF/lib/python3.8/site-packages/datasets/builder.py", line 546, in _download_and_prepare
        self._prepare_split(split_generator, **prepare_split_kwargs)
      File "/home/bram/.local/share/virtualenvs/dutch-simplification-NcpPZtDF/lib/python3.8/site-packages/datasets/builder.py", line 888, in _prepare_split
        for key, table in utils.tqdm(generator, unit=" tables", leave=False, disable=not_verbose):
      File "/home/bram/.local/share/virtualenvs/dutch-simplification-NcpPZtDF/lib/python3.8/site-packages/tqdm/std.py", line 1130, in __iter__
        for obj in iterable:
      File "/home/bram/.cache/huggingface/modules/datasets_modules/datasets/text/7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7/text.py", line 100, in _generate_tables
        pa_table = pac.read_csv(
      File "pyarrow/_csv.pyx", line 714, in pyarrow._csv.read_csv
      File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
      File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
    pyarrow.lib.ArrowInvalid: CSV parse error: Expected 1 columns, got 2
    

    Windows just seems to get stuck. Even with a tiny dataset of 10 lines, it has been stuck for 15 minutes already at this message:

    Checking C:\Users\bramv\.cache\huggingface\datasets\b1d50a0e74da9a7b9822cea8ff4e4f217dd892e09eb14f6274a2169e5436e2ea.30c25842cda32b0540d88b7195147decf9671ee442f4bc2fb6ad74016852978e.py for additional imports.
    Found main folder for dataset https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py at C:\Users\bramv\.cache\huggingface\modules\datasets_modules\datasets\text
    Found specific version folder for dataset https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py at C:\Users\bramv\.cache\huggingface\modules\datasets_modules\datasets\text\7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7
    Found script file from https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py to C:\Users\bramv\.cache\huggingface\modules\datasets_modules\datasets\text\7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7\text.py
    Couldn't find dataset infos file at https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text\dataset_infos.json
    Found metadata file for dataset https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py at C:\Users\bramv\.cache\huggingface\modules\datasets_modules\datasets\text\7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7\text.json
    Using custom data configuration default
    
    dataset bug 
    opened by BramVanroy 41
  • Create Audio feature

    Create Audio feature to handle raw audio files.

    Some decisions to be further discussed:

    • I have chosen soundfile as the audio library; another interesting library is librosa, but this requires soundfile (see here). If we require some more advanced functionalities, we could eventually switch the library.
    • I have implemented the audio feature as an extra: pip install datasets[audio]. For the moment, the typical datasets user works only with text datasets and has no need for the additional audio/image package requirements.
    • For tests, I require audio dependencies (so that all audio functionalities are checked with our CI test suite); I exclude Linux platforms, which require an additional library to be installed with the distribution package manager
      • I also require pytest-datadir, which allows keeping (audio) data files for tests
    • The audio data contain: array and sample_rate.
    • The array is reshaped as 1D array (expected input for Wav2Vec2).

    Note that to install soundfile on Linux, you need to install libsndfile using your distribution’s package manager, for example sudo apt-get install libsndfile1.

    Requirements Specification

    • Access example with audio loading and resampling:
      ds[0]["audio"]
      
    • Map with audio loading & resampling:
      def preprocess(batch):
           batch["input_values"] = processor(batch["audio"]).input_values
           return batch
      
      ds = ds.map(preprocess)
      
    • Map without audio loading and resampling:
      def preprocess(batch):
           batch["labels"] = processor(batch["target_text"]).input_values
           return batch
      
      ds = ds.map(preprocess)
      
    • Additional requirement specification (see https://github.com/huggingface/datasets/pull/2324#pullrequestreview-768864998): Cast audio column to change sampling rate:
      ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
      
    opened by albertvillanova 30
  • Fatal error condition occurred in aws-c-io

    Describe the bug

    Fatal error when using the library

    Steps to reproduce the bug

    from datasets import load_dataset
    dataset = load_dataset('wikiann', 'en')
    

    Expected results

    No fatal errors

    Actual results

    Fatal error condition occurred in D:\bld\aws-c-io_1633633258269\work\source\event_loop.c:74: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
    Exiting Application
    

    Environment info

    • datasets version: 1.15.2.dev0
    • Platform: Windows-10-10.0.22504-SP0
    • Python version: 3.8.12
    • PyArrow version: 6.0.0
    bug 
    opened by Crabzmatic 26
  • Checksums didn't match for dataset source

    Dataset viewer issue for 'wiki_lingua*'

    Link: link to the dataset viewer page

    data = datasets.load_dataset("wiki_lingua", name=language, split="train[:2000]")

    short description of the issue

    NonMatchingChecksumError: Checksums didn't match for dataset source files:
    ['https://drive.google.com/uc?export=download&id=11wMGqNVSwwk6zUnDaJEgm3qT71kAHeff']
    

    Am I the one who added this dataset ? No
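
A checksum mismatch of this kind means the downloaded bytes differ from the digest recorded for the source file. The following is a minimal sketch of that comparison using only hashlib; the file, its contents, and the choice of SHA-256 are assumptions made for the illustration and are not taken from the library's code.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected):
    """Return True when the file's digest equals the recorded one."""
    return sha256_of_file(path) == expected


# Hypothetical downloaded file and the digest recorded when it was first added.
path = os.path.join(tempfile.mkdtemp(), "source.bin")
with open(path, "wb") as f:
    f.write(b"original contents")
recorded = hashlib.sha256(b"original contents").hexdigest()
matches_before = verify_checksum(path, recorded)

# Simulate the upstream host later serving different bytes for the same URL.
with open(path, "wb") as f:
    f.write(b"changed contents")
matches_after = verify_checksum(path, recorded)

print(matches_before, matches_after)
```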

    dataset-viewer 
    opened by rafikg 25
  • Only user permission of saved cache files, not group

    Hello,

    It seems when a cached file is saved from calling dataset.map for preprocessing, it gets the user permissions and none of the user's group permissions. As we share data files across members of our team, this is causing a bit of an issue as we have to continually reset the permission of the files. Do you know any ways around this or a way to correctly set the permissions?
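
One possible workaround on POSIX systems, sketched here as an assumption and not as the library's behaviour, is to copy the owner's read/write bits to the group after the cache file is written. The helper add_group_permissions and the file name are hypothetical.

```python
import os
import stat
import tempfile


def add_group_permissions(path):
    """Copy the owner's read/write bits to the group on an existing file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & stat.S_IRUSR:
        mode |= stat.S_IRGRP
    if mode & stat.S_IWUSR:
        mode |= stat.S_IWGRP
    os.chmod(path, mode)
    return mode


# Hypothetical cache file created with owner-only permissions (0o600).
path = os.path.join(tempfile.mkdtemp(), "cache-file.arrow")
with open(path, "wb") as f:
    f.write(b"")
os.chmod(path, 0o600)

new_mode = add_group_permissions(path)
print(oct(new_mode))
```

Running such a helper over the cache directory after each dataset.map call would avoid resetting permissions by hand, at the cost of an extra pass over the files.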

    enhancement good first issue 
    opened by lorr1 23
  • Adding support for generic multi-dimensional tensors and auxiliary image data for multimodal datasets

    nlp/features.py:

    The main factory class is MultiArray. Every time this class is called, a corresponding pyarrow extension array and type class is generated (and added to the list of globals for future use) for a given root data type and set of dimensions/shape. I provide examples on working with this in datasets/lxmert_pretraining_beta/test_multi_array.py

    src/nlp/arrow_writer.py

    I had to add a method for writing batches that include extension array types because, despite having a unique class for each multidimensional array shape, pyarrow is unable to write any other "array-like" data class to a batch object unless it is of the type pyarrow.ExtensionType. The problem is that when writing multiple batches, the order of the schema and the data to be written get mixed up (the pyarrow datatype in the schema only refers to ExtensionArray, but each ExtensionArray subclass has a different shape) ... possibly I am missing something here and would be grateful if anyone else could take a look!

    datasets/lxmert_pretraining_beta/lxmert_pretraining_beta.py & datasets/lxmert_pretraining_beta/to_arrow_data.py:

    I have begun adding the data from the original LXMERT paper (https://arxiv.org/abs/1908.07490) hosted here: (https://github.com/airsplay/lxmert). The reason I am not pulling from the source of truth for each individual dataset is that it seems there will also need to be functionality to aggregate multimodal datasets to create a pre-training corpus (:sleepy: ). For now, this is just being used to test and run edge cases for the MultiArray feature, so I've labeled it as "beta_pretraining"!

    (still working on the pretraining, just wanted to push out the new functionality sooner rather than later)

    opened by eltoto1219 23
  • map/filter multiprocessing raises errors and corrupts datasets

    After upgrading to 1.0, I started seeing errors in my data loading script after enabling multiprocessing.

        ...
        ner_ds_dict = ner_ds.train_test_split(test_size=test_pct, shuffle=True, seed=seed)
        ner_ds_dict["validation"] = ner_ds_dict["test"]
        rel_ds_dict = rel_ds.train_test_split(test_size=test_pct, shuffle=True, seed=seed)
        rel_ds_dict["validation"] = rel_ds_dict["test"]
        return ner_ds_dict, rel_ds_dict
    

    The first train_test_split, ner_ds/ner_ds_dict, returns a train and test split that are iterable. The second, rel_ds/rel_ds_dict in this case, returns a Dataset dict that has rows, but selecting from or slicing into it returns an empty dictionary, e.g. rel_ds_dict['train'][0] == {} and rel_ds_dict['train'][0:100] == {}.

    Ok I think I know the problem -- the rel_ds was mapped through a mapper with num_proc=12. If I remove num_proc, the dataset loads.

    I also see errors with other map and filter functions when num_proc is set.

    Done writing 67 indices in 536 bytes .
    Done writing 67 indices in 536 bytes .
    Fatal Python error: PyCOND_WAIT(gil_cond) failed
    
    bug 
    opened by timothyjlaurent 22
  • Very slow data loading on large dataset

    I made a simple Python script to check the speed of the NLP library by loading 1.1 TB of textual data. After 8 hours it is still in the loading step. It works when the text dataset is small (about 1 GB), but it doesn't scale. It also uses a single thread during the data loading step.

    import glob
    import random

    import nlp

    train_files = glob.glob("xxx/*.txt",recursive=True)
    random.shuffle(train_files)
    
    print(train_files)
    
    dataset = nlp.load_dataset('text', 
                               data_files=train_files,
                               name="customDataset",
                               version="1.0.0",
                               cache_dir="xxx/nlp")
    

    Is there something that I am missing ?

    opened by agemagician 22
  • Add a Depth Estimation dataset - DIODE / NYUDepth / KITTI

    Name

    NYUDepth

    Paper

    http://cs.nyu.edu/~silberman/papers/indoor_seg_support.pdf

    Data

    https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html

    Motivation

    Depth estimation is an important problem in computer vision. We have a couple of Depth Estimation models on Hub as well:

    Would be nice to have a dataset for depth estimation. These datasets usually have three things: input image, depth map image, and depth mask (validity mask to indicate if a reading for a pixel is valid or not). Since we already have semantic segmentation datasets on the Hub, I don't think we need any extended utilities to support this addition.

    Having this dataset would also allow us to author data preprocessing guides for depth estimation, particularly like the ones we have for other tasks (example).

    Ccing @osanseviero @nateraw @NielsRogge

    Happy to work on adding it.

    dataset request 
    opened by sayakpaul 21
  • Dataset librispeech_asr fails to load

    Describe the bug

    The dataset librispeech_asr (standard Librispeech) fails to load.

    Steps to reproduce the bug

    datasets.load_dataset("librispeech_asr")
    

    Expected results

    It should download and prepare the whole dataset (all subsets).

    In the doc, it says it has two configurations (clean and other). However, the dataset doc says that not specifying split should just load the whole dataset, which is what I want.

    Also, for this specific dataset, this is the standard that the community uses. When you look at any publications with results on Librispeech, they always use the whole train dataset for training.

    Actual results

    ...
      File "/home/az/.cache/huggingface/modules/datasets_modules/datasets/librispeech_asr/1f4602f6b5fed8d3ab3e3382783173f2e12d9877e98775e34d7780881175096c/librispeech_asr.py", line 119, in LibrispeechASR._split_generators
        line: archive_path = dl_manager.download(_DL_URLS[self.config.name])
        locals:
          archive_path = <not found>
          dl_manager = <local> <datasets.utils.download_manager.DownloadManager object at 0x7fc07b426160>
          dl_manager.download = <local> <bound method DownloadManager.download of <datasets.utils.download_manager.DownloadManager object at 0x7fc07b426160>>
          _DL_URLS = <global> {'clean': {'dev': 'http://www.openslr.org/resources/12/dev-clean.tar.gz', 'test': 'http://www.openslr.org/resources/12/test-clean.tar.gz', 'train.100': 'http://www.openslr.org/resources/12/train-clean-100.tar.gz', 'train.360': 'http://www.openslr.org/resources/12/train-clean-360.tar.gz'}, 'other'...
          self = <local> <datasets_modules.datasets.librispeech_asr.1f4602f6b5fed8d3ab3e3382783173f2e12d9877e98775e34d7780881175096c.librispeech_asr.LibrispeechASR object at 0x7fc12a633310>
          self.config = <local> BuilderConfig(name='default', version=0.0.0, data_dir='/home/az/i6/setups/2022-03-20--sis/work/i6_core/datasets/huggingface/DownloadAndPrepareHuggingFaceDatasetJob.TV6Nwm6dFReF/output/data_dir', data_files=None, description=None)
          self.config.name = <local> 'default', len = 7
    KeyError: 'default'
    

    Environment info

    • datasets version: 2.1.0
    • Platform: Linux-5.4.0-107-generic-x86_64-with-glibc2.31
    • Python version: 3.9.9
    • PyArrow version: 6.0.1
    • Pandas version: 1.4.2
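
The KeyError: 'default' in the traceback above comes from looking up a configuration name that is not a key of the URL table. The sketch below mimics that lookup with a shortened, hypothetical table (the file names are placeholders, and urls_for_config and urls_for_all_configs are invented helpers), and shows one way a loader could merge all configurations. With the real library, passing a configuration name explicitly, for example load_dataset("librispeech_asr", "clean"), avoids the failing lookup.

```python
# Hypothetical, shortened copy of the per-configuration URL table in the traceback.
DL_URLS = {
    "clean": {"dev": "dev-clean.tar.gz", "test": "test-clean.tar.gz"},
    "other": {"dev": "dev-other.tar.gz", "test": "test-other.tar.gz"},
}


def urls_for_config(name):
    """Look up URLs for one configuration, failing with a readable message."""
    if name not in DL_URLS:
        raise ValueError(f"Unknown config {name!r}; choose one of {sorted(DL_URLS)}")
    return DL_URLS[name]


def urls_for_all_configs():
    """Merge every configuration, prefixing split names to keep them distinct."""
    merged = {}
    for config, splits in DL_URLS.items():
        for split, url in splits.items():
            merged[f"{config}.{split}"] = url
    return merged


try:
    urls_for_config("default")
    message = None
except ValueError as err:
    message = str(err)

all_urls = urls_for_all_configs()
print(message)
print(sorted(all_urls))
```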
    bug 
    opened by albertz 21
  • Finish Deprecating the `fs=` arg

    See #5385 for some discussion on this

    The fs= arg was deprecated from Dataset.save_to_disk and Dataset.load_from_disk in 2.8.0 (to be removed in 3.0.0). There are a few other places where the fs= arg was still used (functions/methods in datasets.info and datasets.load). This PR adds a similar behavior, warnings and the storage_options= arg to these functions and methods.

    One question: should the "deprecated" / "added" versions be 2.8.1 for the docs/warnings on these? Right now I'm going with "fs was deprecated in 2.8.0" but "storage_options= was added in 2.8.1" where appropriate.
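
The deprecation pattern described in this PR can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: load_from_disk_sketch, the sentinel default, and FakeFileSystem are hypothetical, and the warning text simply restates the versions mentioned above.

```python
import warnings


def load_from_disk_sketch(path, fs="deprecated", storage_options=None):
    """Hypothetical loader showing how a deprecated `fs=` can map to `storage_options=`."""
    if fs != "deprecated":
        warnings.warn(
            "'fs' was deprecated in favor of 'storage_options' in version 2.8.0 "
            "and will be removed in 3.0.0.",
            FutureWarning,
            stacklevel=2,
        )
        # Reuse the options carried by the filesystem object when the caller
        # did not pass storage_options explicitly.
        if storage_options is None:
            storage_options = getattr(fs, "storage_options", None)
    return {"path": path, "storage_options": storage_options or {}}


class FakeFileSystem:
    """Stand-in for a filesystem object; only the attribute used above."""

    storage_options = {"anon": True}


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = load_from_disk_sketch("s3://bucket/data", fs=FakeFileSystem())

print(result, [w.category.__name__ for w in caught])
```

A sentinel default lets the function distinguish "fs was not passed" from "fs=None was passed", so the warning fires only for callers that still use the old argument.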

    @mariosasko

    opened by dconathan 2
  • Whisper Event - RuntimeError: The size of tensor a (504) must match the size of tensor b (448) at non-singleton dimension 1 100% 1000/1000 [2:52:21<00:00, 10.34s/it]

    Done in a VM with a GPU (Ubuntu) following the Whisper Event - PYTHON instructions.

    Attempted using RuntimeError: he size of tensor a (504) must match the size of tensor b (448) at non-singleton dimension 1 100% 1000/1000 - WEB - another person experiencing the same issue. But could not resolve the issue with the google/fleurs data. Not clear what can be modified in the PY code to resolve the input data size mismatch, as the training data is already very small.
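
One workaround commonly used for this kind of mismatch, offered here as an assumption and not as a confirmed fix for this report, is to drop examples whose tokenized transcript is longer than the decoder can accept. The sketch uses plain lists in place of real tokenizer output; is_labels_in_length_range and the example rows are hypothetical, and the 448 limit is taken from the size of tensor b in the reported error.

```python
# 448 is the size of tensor b in the reported error; 504 is the offending length.
MAX_LABEL_LENGTH = 448


def is_labels_in_length_range(labels, max_length=MAX_LABEL_LENGTH):
    """Keep only examples whose tokenized transcript fits the decoder."""
    return len(labels) <= max_length


# Hypothetical tokenized examples; real labels would come from the tokenizer.
examples = [
    {"id": "short", "labels": list(range(120))},
    {"id": "too_long", "labels": list(range(504))},
]

kept = [ex for ex in examples if is_labels_in_length_range(ex["labels"])]
print([ex["id"] for ex in kept])
```

The same predicate would need to be applied to the evaluation split as well, since the traceback shows the failure occurring during evaluation.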

    Tried posting on Discord, @sanchit-gandhi and @vaibhavs10. Was hoping that the event is over and some input/help is now available. Hugging Face - whisper-small-amet.

    According to the paper Robust Speech Recognition via Large-Scale Weak Supervision, am_et is a low-resource language (Table E), with WER results ranging from 120 to 229 depending on model size (Whisper small WER=120.2).

    ---> Initial Training Output

    /usr/local/lib/python3.8/dist-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
      warnings.warn(
    [INFO|trainer.py:1641] 2022-12-18 05:23:28,799 >> ***** Running training *****
    [INFO|trainer.py:1642] 2022-12-18 05:23:28,799 >> Num examples = 446
    [INFO|trainer.py:1643] 2022-12-18 05:23:28,799 >> Num Epochs = 72
    [INFO|trainer.py:1644] 2022-12-18 05:23:28,799 >> Instantaneous batch size per device = 16
    [INFO|trainer.py:1645] 2022-12-18 05:23:28,799 >> Total train batch size (w. parallel, distributed & accumulation) = 32
    [INFO|trainer.py:1646] 2022-12-18 05:23:28,799 >> Gradient Accumulation steps = 2
    [INFO|trainer.py:1647] 2022-12-18 05:23:28,800 >> Total optimization steps = 1000
    [INFO|trainer.py:1648] 2022-12-18 05:23:28,801 >> Number of trainable parameters = 241734912

    ---> Error

    14% 9/65 [07:07<48:34, 52.04s/it]
    [INFO|configuration_utils.py:523] 2022-12-18 05:03:07,941 >> Generate config GenerationConfig {
      "begin_suppress_tokens": [220, 50257],
      "bos_token_id": 50257,
      "decoder_start_token_id": 50258,
      "eos_token_id": 50257,
      "max_length": 448,
      "pad_token_id": 50257,
      "transformers_version": "4.26.0.dev0",
      "use_cache": false
    }

    Traceback (most recent call last): File "run_speech_recognition_seq2seq_streaming.py", line 629, in main() File "run_speech_recognition_seq2seq_streaming.py", line 578, in main train_result = trainer.train(resume_from_checkpoint=checkpoint) File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1534, in train return inner_training_loop( File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1859, in _inner_training_loop self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval) File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2122, in _maybe_log_save_evaluate metrics = self.evaluate(ignore_keys=ignore_keys_for_eval) File "/usr/local/lib/python3.8/dist-packages/transformers/trainer_seq2seq.py", line 78, in evaluate return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix) File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2818, in evaluate output = eval_loop( File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 3000, in evaluation_loop loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys) File "/usr/local/lib/python3.8/dist-packages/transformers/trainer_seq2seq.py", line 213, in prediction_step outputs = model(**inputs) File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl return forward_call(*input, **kwargs) File "/usr/local/lib/python3.8/dist-packages/transformers/models/whisper/modeling_whisper.py", line 1197, in forward outputs = self.model( File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl return forward_call(*input, **kwargs) File "/usr/local/lib/python3.8/dist-packages/transformers/models/whisper/modeling_whisper.py", line 1066, in forward decoder_outputs = self.decoder( File 
"/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl return forward_call(*input, **kwargs) File "/usr/local/lib/python3.8/dist-packages/transformers/models/whisper/modeling_whisper.py", line 873, in forward hidden_states = inputs_embeds + positions RuntimeError: The size of tensor a (504) must match the size of tensor b (448) at non-singleton dimension 1 100% 1000/1000 [2:52:21<00:00, 10.34s/it]
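The RuntimeError above reports a sequence of length 504 being added to 448 positions, and 448 is also the `max_length` in the logged generation config. One possible workaround, sketched here on the assumption that over-long label sequences are the cause, is to drop such examples before training. The field name `labels` and the helper `is_within_max_length` are hypothetical and do not come from the issue.

```python
# Hypothetical limit, taken from the "max_length": 448 value in the log above.
MAX_LABEL_LENGTH = 448


def is_within_max_length(example, max_length=MAX_LABEL_LENGTH):
    """Return True if the tokenized labels fit within max_length positions."""
    return len(example["labels"]) <= max_length


# Made-up examples: one too long (504 tokens), two that fit.
examples = [
    {"labels": list(range(504))},
    {"labels": list(range(448))},
    {"labels": list(range(10))},
]

kept = [ex for ex in examples if is_within_max_length(ex)]
print([len(ex["labels"]) for ex in kept])
```

With a 🤗Datasets object, a predicate like this could be passed to `filter` instead of the list comprehension.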

    opened by catswithbats 0
  • Missing documentation page : improve-performance

    Describe the bug

    Trying to access https://huggingface.co/docs/datasets/v2.8.0/en/package_reference/cache#improve-performance, the page is missing.

    The link is in here : https://huggingface.co/docs/datasets/v2.8.0/en/package_reference/loading_methods#datasets.load_dataset.keep_in_memory

    Steps to reproduce the bug

    Access the page and see it's missing.

    Expected behavior

    The page should exist.

    Environment info

    Doesn't matter

    opened by astariul 1
  • Is `fs=` deprecated in `load_from_disk()` as well?

    Describe the bug

    The fs= argument was deprecated from Dataset.save_to_disk and Dataset.load_from_disk in favor of automagically figuring it out via fsspec: https://github.com/huggingface/datasets/blob/9a7272cd4222383a5b932b0083a4cc173fda44e8/src/datasets/arrow_dataset.py#L1339-L1340

    Is there a reason the same shouldn't apply to datasets.load.load_from_disk() as well?

    https://github.com/huggingface/datasets/blob/9a7272cd4222383a5b932b0083a4cc173fda44e8/src/datasets/load.py#L1779

    Steps to reproduce the bug

    n/a

    Expected behavior

    n/a

    Environment info

    n/a

    opened by dconathan 2
Releases(2.8.0)
  • 2.8.0(Dec 19, 2022)

    Important

    • Removed YAML integer keys from class_label metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/5277
      • From now on, datasets pushed on the Hub and using ClassLabel will use a new YAML model to store the feature types
      • The new model uses strings instead of integers for the ids in the label name mapping (e.g. 0 -> "0"). This is due to Hub limitations. In a few months the Hub may stop allowing users to push the old YAML model.
      • Old versions of datasets are not able to reload datasets pushed with this new model, so we encourage everyone to update.
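The id change described above (0 -> "0") can be illustrated with a plain dictionary. This is only a sketch of the idea, not the library's serialization code; the helper `stringify_label_ids` and the label names are hypothetical.

```python
def stringify_label_ids(names_by_id):
    """Convert integer label ids to string keys, e.g. {0: "neg"} -> {"0": "neg"}."""
    return {str(label_id): name for label_id, name in names_by_id.items()}


# Made-up label mapping in the old style, with integer keys.
old_mapping = {0: "negative", 1: "positive"}
new_mapping = stringify_label_ids(old_mapping)
print(new_mapping)
```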

    Datasets Features

    • Fix methods using IterableDataset.map that lead to features=None by @alvarobartt in https://github.com/huggingface/datasets/pull/5287
      • Datasets in streaming mode now update their features after column renaming or removal
    • Add num_proc to from_csv/generator/json/parquet/text by @lhoestq in https://github.com/huggingface/datasets/pull/5239
      • Use multiprocessing to load multiple files in parallel
    • Add features param to IterableDataset.map by @alvarobartt in https://github.com/huggingface/datasets/pull/5311
    • Sharded save_to_disk + multiprocessing by @lhoestq in https://github.com/huggingface/datasets/pull/5268
      • Pass num_shards or max_shard_size to ds.save_to_disk() or ds.push_to_hub()
      • Pass num_proc to use multiprocessing.
    • Support for decoding Image/Audio types in map when format type is not default one by @mariosasko in https://github.com/huggingface/datasets/pull/5252
    • Support torch dataloader without torch formatting for IterableDataset by @lhoestq in https://github.com/huggingface/datasets/pull/5357
      • You can now pass any dataset in streaming mode to a PyTorch DataLoader directly:
      from datasets import load_dataset
      from torch.utils.data import DataLoader
      ds = load_dataset("c4", "en", streaming=True, split="train")
      dataloader = DataLoader(ds, batch_size=32, num_workers=4)
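For the sharded save_to_disk entry in the list above, it can help to picture how a dataset of a given length might be divided into `num_shards` contiguous pieces. The release notes do not describe the exact splitting rule, so the helper `shard_bounds` below is a hypothetical sketch of one reasonable scheme, not the library's implementation.

```python
def shard_bounds(num_rows, num_shards):
    """Split range(num_rows) into num_shards contiguous (start, end) pairs.

    The first (num_rows % num_shards) shards receive one extra row.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    base, extra = divmod(num_rows, num_shards)
    bounds = []
    start = 0
    for index in range(num_shards):
        size = base + (1 if index < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


print(shard_bounds(10, 3))
```

Each (start, end) pair could then be written by a separate worker, which is the general idea behind combining sharding with multiprocessing.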
      

    Docs

    • Complete doc migration by @mishig25 in https://github.com/huggingface/datasets/pull/5248

    General improvements and bug fixes

    • typo by @WrRan in https://github.com/huggingface/datasets/pull/5253
    • typo by @WrRan in https://github.com/huggingface/datasets/pull/5254
    • remove an unused statement by @WrRan in https://github.com/huggingface/datasets/pull/5257
    • fix wrong print by @WrRan in https://github.com/huggingface/datasets/pull/5256
    • Fix max_shard_size docs by @lhoestq in https://github.com/huggingface/datasets/pull/5267
    • Specify arguments as keywords in librosa.reshape to avoid future errors by @polinaeterna in https://github.com/huggingface/datasets/pull/5266
    • Change release procedure to use only pull requests by @albertvillanova in https://github.com/huggingface/datasets/pull/5250
    • Warn about checksums by @lhoestq in https://github.com/huggingface/datasets/pull/5279
    • Tweak readme by @lhoestq in https://github.com/huggingface/datasets/pull/5210
    • Save file name in embed_storage by @lhoestq in https://github.com/huggingface/datasets/pull/5285
    • Use correct dataset type in from_generator docs by @mariosasko in https://github.com/huggingface/datasets/pull/5307
    • Support streaming datasets with pathlib.Path.with_suffix by @albertvillanova in https://github.com/huggingface/datasets/pull/5294
    • Fix xjoin for Windows pathnames by @albertvillanova in https://github.com/huggingface/datasets/pull/5297
    • Fix xopen for Windows pathnames by @albertvillanova in https://github.com/huggingface/datasets/pull/5299
    • Ci py3.10 by @lhoestq in https://github.com/huggingface/datasets/pull/5065
    • Update Overview.ipynb google colab by @lhoestq in https://github.com/huggingface/datasets/pull/5211
    • Support xPath for Windows pathnames by @albertvillanova in https://github.com/huggingface/datasets/pull/5310
    • Fix description of streaming in the docs by @polinaeterna in https://github.com/huggingface/datasets/pull/5313
    • Fix Text sample_by paragraph by @albertvillanova in https://github.com/huggingface/datasets/pull/5319
    • [Extract] Place the lock file next to the destination directory by @lhoestq in https://github.com/huggingface/datasets/pull/5320
    • Fix loading from HF GCP cache by @lhoestq in https://github.com/huggingface/datasets/pull/5321
      • This was affecting datasets like wikipedia or natural_questions
    • Fix docs building for main by @albertvillanova in https://github.com/huggingface/datasets/pull/5328
    • Origin/fix missing features error by @eunseojo in https://github.com/huggingface/datasets/pull/5318
    • fix: 🐛 pass the token to get the list of config names by @severo in https://github.com/huggingface/datasets/pull/5333
    • Clarify imagefolder is for small datasets by @stevhliu in https://github.com/huggingface/datasets/pull/5329
    • Close stream in ArrowWriter.finalize before inference error by @mariosasko in https://github.com/huggingface/datasets/pull/5309
    • Use same num_proc for dataset download and generation by @mariosasko in https://github.com/huggingface/datasets/pull/5300
    • Set IterableDataset.map param batch_size typing as optional by @alvarobartt in https://github.com/huggingface/datasets/pull/5336
    • fix: dataset path should be absolute by @vigsterkr in https://github.com/huggingface/datasets/pull/5234
    • Clean up DatasetInfo and Dataset docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5340
    • Clean up docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5334
    • Remove tasks.json by @lhoestq in https://github.com/huggingface/datasets/pull/5341
    • Support topdown parameter in xwalk by @mariosasko in https://github.com/huggingface/datasets/pull/5308
    • Improve use_auth_token docstring and deprecate use_auth_token in download_and_prepare by @mariosasko in https://github.com/huggingface/datasets/pull/5302
    • Clean up Loading methods docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5350
    • Clean up remaining Main Classes docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5349
    • Clean up Dataset and DatasetDict by @stevhliu in https://github.com/huggingface/datasets/pull/5344
    • Clean up Table class docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5355
    • Raise error for .tar archives in the same way as for .tar.gz and .tgz in _get_extraction_protocol by @polinaeterna in https://github.com/huggingface/datasets/pull/5322
    • Clean filesystem and logging docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5356
    • ExamplesIterable fixes by @lhoestq in https://github.com/huggingface/datasets/pull/5366
    • Simplify skipping by @Muennighoff in https://github.com/huggingface/datasets/pull/5373
    • Release: 2.8.0 by @lhoestq in https://github.com/huggingface/datasets/pull/5375

    New Contributors

    • @WrRan made their first contribution in https://github.com/huggingface/datasets/pull/5253
    • @eunseojo made their first contribution in https://github.com/huggingface/datasets/pull/5318
    • @vigsterkr made their first contribution in https://github.com/huggingface/datasets/pull/5234
    • @Muennighoff made their first contribution in https://github.com/huggingface/datasets/pull/5373

    Full Changelog: https://github.com/huggingface/datasets/compare/2.7.0...2.8.0

  • 2.7.1(Nov 22, 2022)

    Bug fixes

    • Remove YAML integer keys from class_label metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/5277

    Full Changelog: https://github.com/huggingface/datasets/compare/2.7.0...2.7.1

  • 2.6.2(Nov 22, 2022)

    Bug fixes

    • Remove YAML integer keys from class_label metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/5277

    Full Changelog: https://github.com/huggingface/datasets/compare/2.6.1...2.6.2

  • 2.7.0(Nov 16, 2022)

    Dataset Features

    • Multiprocessed dataset builder by @TevenLeScao in https://github.com/huggingface/datasets/pull/5107
      • Load big datasets faster than before using multiprocessing:
      from datasets import load_dataset
      ds = load_dataset("imagenet-1k", num_proc=4)
      
    • Make torch.Tensor and spacy models cacheable by @mariosasko in https://github.com/huggingface/datasets/pull/5191
      • Function passed to map or filter that uses tensors or pipelines can now be cached
    • Drop labels in Image and Audio folders if files are on different levels in directory or if there is only one label by @polinaeterna in https://github.com/huggingface/datasets/pull/5192
    • TextConfig: added "errors" by @NightMachinery in https://github.com/huggingface/datasets/pull/5155

    Audio setup

    • Add ffmpeg4 installation instructions in warnings by @polinaeterna in https://github.com/huggingface/datasets/pull/5167

    Docs

    • Update create image dataset docs by @stevhliu in https://github.com/huggingface/datasets/pull/5177
    • add: segmentation guide. by @sayakpaul in https://github.com/huggingface/datasets/pull/5188
    • Reword E2E training and inference tips in the vision guides by @sayakpaul in https://github.com/huggingface/datasets/pull/5217
    • Add SQL guide by @stevhliu in https://github.com/huggingface/datasets/pull/5223

    General improvements and bug fixes

    • Add pyproject.toml for black by @mariosasko in https://github.com/huggingface/datasets/pull/5125
    • Fix tqdm zip bug by @david1542 in https://github.com/huggingface/datasets/pull/5120
    • Install tensorflow-macos dependency conditionally by @albertvillanova in https://github.com/huggingface/datasets/pull/5124
    • [TYPO] Update new_dataset_script.py by @cakiki in https://github.com/huggingface/datasets/pull/5119
    • Avoid extra cast in class_encode_column by @mariosasko in https://github.com/huggingface/datasets/pull/5130
    • Use yaml for issue templates + revamp by @mariosasko in https://github.com/huggingface/datasets/pull/5116
    • Update docs once dataset scripts transferred to the Hub by @albertvillanova in https://github.com/huggingface/datasets/pull/5136
    • Delete duplicate issue template file by @albertvillanova in https://github.com/huggingface/datasets/pull/5146
    • Deprecate num_proc parameter in DownloadManager.extract by @ayushthe1 in https://github.com/huggingface/datasets/pull/5142
    • Raise ImportError instead of OSError by @ayushthe1 in https://github.com/huggingface/datasets/pull/5141
    • Fix CI require beam by @albertvillanova in https://github.com/huggingface/datasets/pull/5168
    • Make iter_files deterministic by @albertvillanova in https://github.com/huggingface/datasets/pull/5149
    • Add PB and TB in convert_file_size_to_int by @lhoestq in https://github.com/huggingface/datasets/pull/5171
    • Reduce default max writer_batch_size by @mariosasko in https://github.com/huggingface/datasets/pull/5163
    • Support dill 0.3.6 by @albertvillanova in https://github.com/huggingface/datasets/pull/5166
    • Make filename matching more robust by @riccardobucco in https://github.com/huggingface/datasets/pull/5128
    • Preserve None in list type cast in PyArrow 10 by @mariosasko in https://github.com/huggingface/datasets/pull/5174
    • Raise ffmpeg warnings only once by @polinaeterna in https://github.com/huggingface/datasets/pull/5173
    • Add "ipykernel" to list of co_filenames to remove by @gpucce in https://github.com/huggingface/datasets/pull/5169
    • chore: add notebook links to img cls and obj det. by @sayakpaul in https://github.com/huggingface/datasets/pull/5187
    • Fix docs about dataset_info in YAML by @albertvillanova in https://github.com/huggingface/datasets/pull/5194
    • fsspec lock reset in multiprocessing by @lhoestq in https://github.com/huggingface/datasets/pull/5159
    • Add note about the name of a dataset script by @polinaeterna in https://github.com/huggingface/datasets/pull/5198
    • Deprecate dummy data generation command by @mariosasko in https://github.com/huggingface/datasets/pull/5199
    • Do not sort splits in dataset info by @polinaeterna in https://github.com/huggingface/datasets/pull/5201
    • Add missing DownloadConfig.use_auth_token value by @alvarobartt in https://github.com/huggingface/datasets/pull/5205
    • Update canonical links to Hub links by @stevhliu in https://github.com/huggingface/datasets/pull/5203
    • Refactor CI hub fixtures to use monkeypatch instead of patch by @albertvillanova in https://github.com/huggingface/datasets/pull/5208
    • Update github pr docs actions by @mishig25 in https://github.com/huggingface/datasets/pull/5214
    • Use hfh hf_hub_url function by @albertvillanova in https://github.com/huggingface/datasets/pull/5196
    • Pin typer version in tests to <0.5 to fix Windows CI by @polinaeterna in https://github.com/huggingface/datasets/pull/5235
    • Fix shards in IterableDataset.from_generator by @lhoestq in https://github.com/huggingface/datasets/pull/5233
    • Fix class name of symbolic link by @riccardobucco in https://github.com/huggingface/datasets/pull/5126
    • Make Version hashable by @mariosasko in https://github.com/huggingface/datasets/pull/5238
    • Handle ArrowNotImplementedError caused by try_type being Image or Audio in cast by @mariosasko in https://github.com/huggingface/datasets/pull/5236
    • Encode path only for old versions of hfh by @lhoestq in https://github.com/huggingface/datasets/pull/5237
    • Fix CI require_beam maximum compatible dill version by @albertvillanova in https://github.com/huggingface/datasets/pull/5212
    • Support hfh rc version by @lhoestq in https://github.com/huggingface/datasets/pull/5241
    • Cleaner error tracebacks for dataset script errors by @mariosasko in https://github.com/huggingface/datasets/pull/5240

    New Contributors

    • @david1542 made their first contribution in https://github.com/huggingface/datasets/pull/5120
    • @ayushthe1 made their first contribution in https://github.com/huggingface/datasets/pull/5142
    • @gpucce made their first contribution in https://github.com/huggingface/datasets/pull/5169
    • @sayakpaul made their first contribution in https://github.com/huggingface/datasets/pull/5187
    • @NightMachinery made their first contribution in https://github.com/huggingface/datasets/pull/5155

    Full Changelog: https://github.com/huggingface/datasets/compare/2.6.1...2.7.0

  • 2.6.1(Oct 14, 2022)

    Bug fixes

    • Fix filter indices when batched by @albertvillanova in https://github.com/huggingface/datasets/pull/5113
      • fixed a bug where filter could return examples with the wrong indices
    • Fix iter_batches by @lhoestq in https://github.com/huggingface/datasets/pull/5115
      • fixed a bug where map with batched=True could return a dataset with fewer examples
    • Fix a typo in arrow_dataset.py by @yangky11 in https://github.com/huggingface/datasets/pull/5108

    New Contributors

    • @yangky11 made their first contribution in https://github.com/huggingface/datasets/pull/5108

    Full Changelog: https://github.com/huggingface/datasets/compare/2.6.0...2.6.1

  • 2.6.0(Oct 13, 2022)

    Important

    • [GH->HF] Remove all dataset scripts from github by @lhoestq in https://github.com/huggingface/datasets/pull/4974
      • all the dataset scripts and dataset cards are now on https://hf.co/datasets
      • we invite users and contributors to open discussions or pull requests on the Hugging Face Hub from now on

    Datasets features

    • Add ability to read-write to SQL databases. by @Dref360 in https://github.com/huggingface/datasets/pull/4928
      • Read from sqlite file:
      from datasets import Dataset
      dataset = Dataset.from_sql("data_table", "sqlite:///sqlite_file.db")
      
      • Allow connection objects in from_sql + small doc improvement by @mariosasko in https://github.com/huggingface/datasets/pull/5091
      from datasets import Dataset
      from sqlite3 import connect
      con = connect(...)
      dataset = Dataset.from_sql("SELECT text FROM table WHERE length(text) > 100 LIMIT 10", con)
      
    • Image & Audio formatting for numpy/torch/tf/jax by @lhoestq in https://github.com/huggingface/datasets/pull/5072
      • return numpy/torch/tf/jax tensors with
      from datasets import load_dataset
      ds = load_dataset("imagenet-1k").with_format("torch")  # or numpy/tf/jax
      ds[0]["image"]
      
    • Added IterableDataset.from_generator by @hamid-vakilzadeh in https://github.com/huggingface/datasets/pull/5052
    • Fast dataset iter by @mariosasko in https://github.com/huggingface/datasets/pull/5030
      • speed up by a factor of 2 using the Arrow Table reader
    • Dataset infos in yaml by @lhoestq in https://github.com/huggingface/datasets/pull/4926
      • you can now specify the feature types and number of samples in the dataset card, see https://huggingface.co/docs/datasets/dataset_card
    • Add kwargs to Dataset.from_generator by @mariosasko in https://github.com/huggingface/datasets/pull/5049
    • Support converters in CsvBuilder by @mariosasko in https://github.com/huggingface/datasets/pull/5057
    • Restore saved format state in load_from_disk by @asofiaoliveira in https://github.com/huggingface/datasets/pull/5073
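For the from_sql entries in the list above, the following sketch builds a small in-memory SQLite database with the standard library and runs the same kind of query shown in the release note. The table name `data_table` and its rows are made up for illustration.

```python
import sqlite3

# In-memory database with a hypothetical table named "data_table".
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data_table (text TEXT)")
rows = [("short",), ("x" * 150,), ("y" * 101,), ("z" * 100,)]
con.executemany("INSERT INTO data_table (text) VALUES (?)", rows)
con.commit()

# Keep only texts longer than 100 characters, at most 10 of them.
query = "SELECT text FROM data_table WHERE length(text) > 100 LIMIT 10"
long_texts = [row[0] for row in con.execute(query)]
print(sorted(len(text) for text in long_texts))
```

The same `query` and `con` could then be passed as `Dataset.from_sql(query, con)`, following the form shown in the release note.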

    Dataset changes

    • Update: hendrycks_test - support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/5041
    • Update: swiss judgment prediction by @JoelNiklaus in https://github.com/huggingface/datasets/pull/5019
      • Update swiss judgment prediction by @JoelNiklaus in https://github.com/huggingface/datasets/pull/5042
    • Fix: xcsr - fix languages of X-CSQA configs by @albertvillanova in https://github.com/huggingface/datasets/pull/5022
    • Fix: sbu_captions - fix URLs by @donglixp in https://github.com/huggingface/datasets/pull/5020
    • Fix: xcsr - fix string features by @albertvillanova in https://github.com/huggingface/datasets/pull/5024
    • Fix: hendrycks_test - fix NonMatchingChecksumError by @albertvillanova in https://github.com/huggingface/datasets/pull/5040
    • Fix: cats_vs_dogs - fix number of samples by @lhoestq in https://github.com/huggingface/datasets/pull/5047
    • Fix: lex_glue - fix bug with labels of eurlex config of lex_glue dataset by @iliaschalkidis in https://github.com/huggingface/datasets/pull/5048
    • Fix: msr_sqa - fix dataset generation by @Timothyxxx in https://github.com/huggingface/datasets/pull/3715

    Dataset cards

    • Add description to hellaswag dataset by @julien-c in https://github.com/huggingface/datasets/pull/4810
    • Add deprecation warning to multilingual_librispeech dataset card by @albertvillanova in https://github.com/huggingface/datasets/pull/5010
    • Update languages in aeslc dataset card by @apergo-ai in https://github.com/huggingface/datasets/pull/3357
    • Update license to bookcorpus dataset card by @meg-huggingface in https://github.com/huggingface/datasets/pull/3526
    • Update paper link in medmcqa dataset card by @monk1337 in https://github.com/huggingface/datasets/pull/4290
    • Add oversampling strategy iterable datasets interleave by @ylacombe in https://github.com/huggingface/datasets/pull/5036
    • Fix license/citation information of squadshifts dataset card by @albertvillanova in https://github.com/huggingface/datasets/pull/5054

    General improvements and bug fixes

    • Fix missing use_auth_token in streaming docstrings by @albertvillanova in https://github.com/huggingface/datasets/pull/5003
    • Add some note about running the transformers ci before a release by @lhoestq in https://github.com/huggingface/datasets/pull/5007
    • Remove license tag file and validation by @albertvillanova in https://github.com/huggingface/datasets/pull/5004
    • Re-apply input columns change by @mariosasko in https://github.com/huggingface/datasets/pull/5008
    • patch CI_HUB_TOKEN_PATH with Path instead of str by @Wauplin in https://github.com/huggingface/datasets/pull/5026
    • Fix typo in error message by @severo in https://github.com/huggingface/datasets/pull/5027
    • Fix import in ClassLabel docstring example by @alvarobartt in https://github.com/huggingface/datasets/pull/5029
    • Remove redundant code from some dataset module factories by @albertvillanova in https://github.com/huggingface/datasets/pull/5033
    • Fix typos in load docstrings and comments by @albertvillanova in https://github.com/huggingface/datasets/pull/5035
    • Prefer split patterns from directories over split patterns from filenames by @polinaeterna in https://github.com/huggingface/datasets/pull/4985
    • Fix tar extraction vuln by @lhoestq in https://github.com/huggingface/datasets/pull/5016
    • Support hfh 0.10 implicit auth by @lhoestq in https://github.com/huggingface/datasets/pull/5031
    • Fix flatten_indices with empty indices mapping by @mariosasko in https://github.com/huggingface/datasets/pull/5043
    • Improve CI performance speed of PackagedDatasetTest by @albertvillanova in https://github.com/huggingface/datasets/pull/5037
    • Revert task removal in folder-based builders by @mariosasko in https://github.com/huggingface/datasets/pull/5051
    • Fix backward compatibility for dataset_infos.json by @lhoestq in https://github.com/huggingface/datasets/pull/5055
    • Fix typo by @stevhliu in https://github.com/huggingface/datasets/pull/5059
    • Fix CI hfh token warning by @albertvillanova in https://github.com/huggingface/datasets/pull/5062
    • Mark CI tests as xfail when 502 error by @albertvillanova in https://github.com/huggingface/datasets/pull/5058
    • Fix passed download_config in HubDatasetModuleFactoryWithoutScript by @albertvillanova in https://github.com/huggingface/datasets/pull/5077
    • Fix CONTRIBUTING once dataset scripts transferred to Hub by @albertvillanova in https://github.com/huggingface/datasets/pull/5067
    • Fix header level in Audio docs by @stevhliu in https://github.com/huggingface/datasets/pull/5078
    • Support DEFAULT_CONFIG_NAME when no BUILDER_CONFIGS by @albertvillanova in https://github.com/huggingface/datasets/pull/5071
    • Support streaming gzip.open by @albertvillanova in https://github.com/huggingface/datasets/pull/5066
    • adding keep in memory by @Mustapha-AJEGHRIR in https://github.com/huggingface/datasets/pull/5082
    • refactor: replace AssertionError with more meaningful exceptions (#5074) by @galbwe in https://github.com/huggingface/datasets/pull/5079
    • fix: update exception throw from OSError to EnvironmentError in `push… by @rahulXs in https://github.com/huggingface/datasets/pull/5076
    • Align signature of list_repo_files with latest hfh by @albertvillanova in https://github.com/huggingface/datasets/pull/5063
    • Align signature of create/delete_repo with latest hfh by @albertvillanova in https://github.com/huggingface/datasets/pull/5064
    • Fix filter with empty indices by @Mouhanedg56 in https://github.com/huggingface/datasets/pull/5087
    • Fix tutorial (#5093) by @riccardobucco in https://github.com/huggingface/datasets/pull/5095
    • Use HTML relative paths for tiles in the docs by @lewtun in https://github.com/huggingface/datasets/pull/5092
    • Fix loading how to guide (#5102) by @riccardobucco in https://github.com/huggingface/datasets/pull/5104
    • url encode hub url (#5099) by @riccardobucco in https://github.com/huggingface/datasets/pull/5103
    • Free the "hf" filesystem protocol for hffs by @lhoestq in https://github.com/huggingface/datasets/pull/5101
    • Fix task template reload from dict by @lhoestq in https://github.com/huggingface/datasets/pull/5106

    New Contributors

    • @Wauplin made their first contribution in https://github.com/huggingface/datasets/pull/5026
    • @donglixp made their first contribution in https://github.com/huggingface/datasets/pull/5020
    • @Timothyxxx made their first contribution in https://github.com/huggingface/datasets/pull/3715
    • @hamid-vakilzadeh made their first contribution in https://github.com/huggingface/datasets/pull/5052
    • @Mustapha-AJEGHRIR made their first contribution in https://github.com/huggingface/datasets/pull/5082
    • @galbwe made their first contribution in https://github.com/huggingface/datasets/pull/5079
    • @rahulXs made their first contribution in https://github.com/huggingface/datasets/pull/5076
    • @Mouhanedg56 made their first contribution in https://github.com/huggingface/datasets/pull/5087
    • @riccardobucco made their first contribution in https://github.com/huggingface/datasets/pull/5095
    • @asofiaoliveira made their first contribution in https://github.com/huggingface/datasets/pull/5073

    Full Changelog: https://github.com/huggingface/datasets/compare/2.5.1...2.6.0

  • 2.5.2(Oct 5, 2022)

    Bug fixes

    • Revert task removal in folder-based builders (#5051)
    • Support hfh 0.10 implicit auth (#5031)

    Full Changelog: https://github.com/huggingface/datasets/compare/2.5.1...2.5.2

  • 2.5.1(Sep 21, 2022)

    Bug fixes

    • Revert input_columns change by @lhoestq in https://github.com/huggingface/datasets/pull/5006

    Full Changelog: https://github.com/huggingface/datasets/compare/2.5.0...2.5.1

  • 2.5.0(Sep 21, 2022)

    Important

    • Drop Python 3.6 support by @mariosasko in https://github.com/huggingface/datasets/pull/4460
    • Deprecate metrics by @albertvillanova in https://github.com/huggingface/datasets/pull/4739
      • Metrics are now deprecated and have been moved to evaluate:
        !pip install evaluate
        import evaluate
        metric = evaluate.load("accuracy")
        
    • Load GitHub datasets from Hub by @albertvillanova in https://github.com/huggingface/datasets/pull/4059
      • datasets with no namespace, like "squad", were previously loaded from this GitHub repository; they are now loaded from https://huggingface.co/datasets
    • Decode mp3 with librosa if torchaudio is > 0.12 as a temporary workaround by @polinaeterna in https://github.com/huggingface/datasets/pull/4923
      • latest version of torchaudio 0.12 now requires ffmpeg (version 4) to read MP3 files, please downgrade to 0.12 for now or use librosa
    • Use HTTP requests to access data and metadata through the Datasets REST API (docs here)

    Datasets features

    No-code loaders

    • Add AudioFolder packaged loader by @polinaeterna in https://github.com/huggingface/datasets/pull/4530
    • Add support for CSV metadata files to ImageFolder by @mariosasko in https://github.com/huggingface/datasets/pull/4837
    • Add support for parsing JSON files in array form by @mariosasko in https://github.com/huggingface/datasets/pull/4997

    Dataset methods

    • add Dataset.from_list by @sanderland in https://github.com/huggingface/datasets/pull/4890
    • Add Dataset.from_generator by @mariosasko in https://github.com/huggingface/datasets/pull/4957
    • Add oversampling strategies to interleave datasets by @ylacombe in https://github.com/huggingface/datasets/pull/4831
    • Preserve non-input_columns in Dataset.map if input_columns are specified by @mariosasko in https://github.com/huggingface/datasets/pull/4971
    • Add fn_kwargs param to IterableDataset.map by @mariosasko in https://github.com/huggingface/datasets/pull/4975
    • More rigorous shape inference in to_tf_dataset by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4763
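For the oversampling entry in the list above, the idea can be sketched with plain lists: sources are visited in round-robin order, and the stopping rule decides whether to stop once the shortest source runs out or to keep restarting exhausted sources until every source has been fully seen. The function `interleave` and the strategy names used here are assumptions for illustration, not the library's code.

```python
def interleave(sources, stopping_strategy="first_exhausted"):
    """Round-robin over lists.

    "first_exhausted": stop after the round in which any source runs out.
    "all_exhausted": restart exhausted sources (oversampling) until every
    source has been fully consumed at least once.
    """
    if stopping_strategy not in ("first_exhausted", "all_exhausted"):
        raise ValueError(f"unknown stopping_strategy: {stopping_strategy}")
    if not sources or any(len(source) == 0 for source in sources):
        return []
    positions = [0] * len(sources)
    exhausted = [False] * len(sources)
    output = []
    while True:
        for i, source in enumerate(sources):
            # Wrap around so an exhausted source starts again from its beginning.
            output.append(source[positions[i] % len(source)])
            positions[i] += 1
            if positions[i] >= len(source):
                exhausted[i] = True
        if stopping_strategy == "first_exhausted" and any(exhausted):
            break
        if stopping_strategy == "all_exhausted" and all(exhausted):
            break
    return output


d1, d2, d3 = [0, 1, 2], [10, 11, 12, 13], [20, 21, 22, 23, 24]
print(interleave([d1, d2, d3]))
print(interleave([d1, d2, d3], stopping_strategy="all_exhausted"))
```

With the second strategy, items from the shorter lists appear more than once, which is what oversampling means here.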

    Parquet support

    • Download and prepare as Parquet for cloud storage by @lhoestq in https://github.com/huggingface/datasets/pull/4724
    • Shard parquet in download_and_prepare by @lhoestq in https://github.com/huggingface/datasets/pull/4747
    • Embed image/audio data in dl_and_prepare parquet by @lhoestq in https://github.com/huggingface/datasets/pull/4987

    Datasets changes

    • Update: natural questions - Add long answer candidates by @seirasto in https://github.com/huggingface/datasets/pull/4368
    • Update: opus_paracrawl - update version by @albertvillanova in https://github.com/huggingface/datasets/pull/4816
    • Update: ReCoRD - Include entity positions as feature by @richarddwang in https://github.com/huggingface/datasets/pull/4479
    • Update: swda - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4914
    • Update: Enwik8 - update broken link and information by @mtanghu in https://github.com/huggingface/datasets/pull/4
    • Update: compguesswhat - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4968
    • Update: nli_tr - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4970
    • Update: IndicGLUE - update download links by @sumanthd17 in https://github.com/huggingface/datasets/pull/4978
    • Update: iwslt2017 - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4992
    • Fix: mbpp - fix NonMatchingChecksumError by @albertvillanova in https://github.com/huggingface/datasets/pull/4788
    • Fix: mkqa - Update data URL by @albertvillanova in https://github.com/huggingface/datasets/pull/4823
    • Fix: exams - fix bug and checksums by @albertvillanova in https://github.com/huggingface/datasets/pull/4853
    • Fix: trec - use fine classes by @albertvillanova in https://github.com/huggingface/datasets/pull/4801
    • Fix: wmt datasets - fix CWMT zh subsets by @lhoestq in https://github.com/huggingface/datasets/pull/4871
    • Fix: LibriSpeech - Fix dev split local_extracted_archive for 'all' config by @sanchit-gandhi in https://github.com/huggingface/datasets/pull/4904
    • Fix: compguesswhat - fix data URLs by @albertvillanova in https://github.com/huggingface/datasets/pull/4959
    • Fix: vivos - fix data URL and metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/4969
    • Fix: MBPP - Add splits by @cwarny in https://github.com/huggingface/datasets/pull/4943

    Dataset cards

    • Add language_bcp47 tag by @lhoestq in https://github.com/huggingface/datasets/pull/4753
    • Added more information in the README about contributors of the Arabic Speech Corpus by @nawarhalabi in https://github.com/huggingface/datasets/pull/4701
    • Remove "unkown" language tags by @lhoestq in https://github.com/huggingface/datasets/pull/4754
    • Highlight non-commercial license in amazon_reviews_multi dataset card by @sbroadhurst-hf in https://github.com/huggingface/datasets/pull/4712
    • Added dataset information in clinic oos dataset card by @Arnav-Ladkat in https://github.com/huggingface/datasets/pull/4751
    • Fix opus_gnome dataset card by @gojiteji in https://github.com/huggingface/datasets/pull/4806
    • Complete the mlqa dataset card by @eldhoittangeorge in https://github.com/huggingface/datasets/pull/4809
    • Fix loading example in opus dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4813
    • Add missing language tags to resources by @albertvillanova in https://github.com/huggingface/datasets/pull/4819
    • Fix titles in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4824
    • Fix language tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4826
    • Add license metadata to pg19 by @julien-c in https://github.com/huggingface/datasets/pull/4827
    • Fix task tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4830
    • Fix tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4832
    • Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4833
    • Fix documentation card of recipe_nlg dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4834
    • Fix documentation card of ethos dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4835
    • Update documentation card of miam dataset by @PierreColombo in https://github.com/huggingface/datasets/pull/4846
    • Update stackexchange license by @cakiki in https://github.com/huggingface/datasets/pull/4842
    • Update ted_talks_iwslt license to include ND by @cakiki in https://github.com/huggingface/datasets/pull/4841
    • Fix documentation card of adv_glue dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4838
    • Complete tags of superglue dataset card by @richarddwang in https://github.com/huggingface/datasets/pull/4867
    • Fix license tag and Source Data section in billsum dataset card by @kashif in https://github.com/huggingface/datasets/pull/4851
    • Fix documentation card of covid_qa_castorini dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4877
    • Fix Citation Information section in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4879
    • Fix documentation card of math_qa dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4884
    • Added names of less-studied languages by @BenjaminGalliot in https://github.com/huggingface/datasets/pull/4880
    • Fix language tags resource file by @albertvillanova in https://github.com/huggingface/datasets/pull/4882
    • Add citation to ro_sts and ro_sts_parallel datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/4892
    • Add citation information to makhzan dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4894
    • Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4891
    • Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4896
    • Re-add code and und language tags by @albertvillanova in https://github.com/huggingface/datasets/pull/4899
    • Add "cc-by-nc-sa-2.0" to list of licenses by @osanseviero in https://github.com/huggingface/datasets/pull/4887
    • Update GLUE evaluation metadata by @lewtun in https://github.com/huggingface/datasets/pull/4909
    • Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4908
    • Add license and citation information to cosmos_qa dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4913
    • Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4921
    • Add cc-by-nc-2.0 to list of licenses by @albertvillanova in https://github.com/huggingface/datasets/pull/4930
    • Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4931
    • Add Papers with Code ID to scifact dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4941
    • Fix license information in qasc dataset card by @albertvillanova in https://github.com/huggingface/datasets/pull/4951
    • Fix multilinguality tag and missing sections in xquad_r dataset card by @albertvillanova in https://github.com/huggingface/datasets/pull/4940
    • Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4979
    • Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4991

    Documentation

    • Update map docs by @stevhliu in https://github.com/huggingface/datasets/pull/4743
    • Add image classification processing guide by @stevhliu in https://github.com/huggingface/datasets/pull/4748
    • Fix train_test_split docs by @NielsRogge in https://github.com/huggingface/datasets/pull/4821
    • Update local loading script docs by @stevhliu in https://github.com/huggingface/datasets/pull/4778
    • Docs for creating a loading script for image datasets by @stevhliu in https://github.com/huggingface/datasets/pull/4783
    • Docs for creating an audio dataset by @stevhliu in https://github.com/huggingface/datasets/pull/4872

    General improvements and bug fixes

    • Use CI unit/integration tests by @albertvillanova in https://github.com/huggingface/datasets/pull/4738
    • Fix multiprocessing in map_nested by @albertvillanova in https://github.com/huggingface/datasets/pull/4740
    • Add 2.4.0 version added to docstrings by @albertvillanova in https://github.com/huggingface/datasets/pull/4767
    • Update CI badge by @mariosasko in https://github.com/huggingface/datasets/pull/4764
    • Fix version in map_nested docstring by @albertvillanova in https://github.com/huggingface/datasets/pull/4765
    • fix typo by @xwwwwww in https://github.com/huggingface/datasets/pull/4770
    • Unpin rouge_score test dependency by @albertvillanova in https://github.com/huggingface/datasets/pull/4768
    • Remove apache_beam import from module level in natural_questions dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4780
    • Require torchaudio<0.12.0 to avoid RuntimeError by @albertvillanova in https://github.com/huggingface/datasets/pull/4777
    • Remove dummy data generation docs by @stevhliu in https://github.com/huggingface/datasets/pull/4771
    • Require torchaudio<0.12.0 in docs by @albertvillanova in https://github.com/huggingface/datasets/pull/4785
    • Fix bug in function validate_type for Python >= 3.9 by @albertvillanova in https://github.com/huggingface/datasets/pull/4812
    • Fix typo in streaming docs by @flozi00 in https://github.com/huggingface/datasets/pull/4843
    • Fix test of _get_extraction_protocol for TAR files by @albertvillanova in https://github.com/huggingface/datasets/pull/4850
    • Fix typos in documentation by @fl-lo in https://github.com/huggingface/datasets/pull/
    • Mark CI tests as xfail if Hub HTTP error by @albertvillanova in https://github.com/huggingface/datasets/pull/4845
    • [Windows] Fix Access Denied when using os.rename() by @DougTrajano in https://github.com/huggingface/datasets/pull/4825
    • [docs] Some tiny doc tweaks by @julien-c in https://github.com/huggingface/datasets/pull/4874
    • Document loading from relative path by @stevhliu in https://github.com/huggingface/datasets/pull/4773
    • Fix CI reporting by @albertvillanova in https://github.com/huggingface/datasets/pull/4903
    • Add 'val' to VALIDATION_KEYWORDS. by @akt42 in https://github.com/huggingface/datasets/pull/4844
    • Raise ManualDownloadError from get_dataset_config_info by @albertvillanova in https://github.com/huggingface/datasets/pull/4901
    • feat: improve error message on Keys mismatch. closes #4917 by @PaulLerner in https://github.com/huggingface/datasets/pull/4919
    • Fixes a typo in loading documentation by @sighingnow in https://github.com/huggingface/datasets/pull/4929
    • Remove main branch rename notice by @lhoestq in https://github.com/huggingface/datasets/pull/4938
    • Fix NonMatchingChecksumError in adv_glue dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4939
    • Remove deprecated identical_ok by @lhoestq in https://github.com/huggingface/datasets/pull/4937
    • Pin TensorFlow temporarily by @albertvillanova in https://github.com/huggingface/datasets/pull/4954
    • Fix minor typo in error message for missing imports by @mariosasko in https://github.com/huggingface/datasets/pull/4948
    • Fix TF tests for 2.10 by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4956
    • fix BLEU metric card by @antoniolanza1996 in https://github.com/huggingface/datasets/pull/4927
    • Update doc upload_dataset.mdx by @mishig25 in https://github.com/huggingface/datasets/pull/4789
    • Improve features resolution in streaming by @lhoestq in https://github.com/huggingface/datasets/pull/4762
    • Fix label renaming and add a battery of tests by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4781
    • Strip "/" in local dataset path to avoid empty dataset name error by @apohllo in https://github.com/huggingface/datasets/pull/4967
    • Introduce regex check when pushing as well by @LysandreJik in https://github.com/huggingface/datasets/pull/4946
    • [doc] Fix broken snippet that had too many quotes by @tomaarsen in https://github.com/huggingface/datasets/pull/4986
    • Fix map batched with torch output by @lhoestq in https://github.com/huggingface/datasets/pull/4972
    • fix: avoid casting tuples after Dataset.map by @szmoro in https://github.com/huggingface/datasets/pull/4993
    • decode mp3 with librosa if torchaudio is > 0.12 as a temporary workaround by @polinaeterna in https://github.com/huggingface/datasets/pull/4923
    • Don't add a tag on the Hub on release by @lhoestq in https://github.com/huggingface/datasets/pull/4998
    • Add EmptyDatasetError by @lhoestq in https://github.com/huggingface/datasets/pull/4999

    New Contributors

    • @seirasto made their first contribution in https://github.com/huggingface/datasets/pull/4368
    • @sbroadhurst-hf made their first contribution in https://github.com/huggingface/datasets/pull/4712
    • @nawarhalabi made their first contribution in https://github.com/huggingface/datasets/pull/4701
    • @Arnav-Ladkat made their first contribution in https://github.com/huggingface/datasets/pull/4751
    • @xwwwwww made their first contribution in https://github.com/huggingface/datasets/pull/4770
    • @gojiteji made their first contribution in https://github.com/huggingface/datasets/pull/4806
    • @eldhoittangeorge made their first contribution in https://github.com/huggingface/datasets/pull/4809
    • @flozi00 made their first contribution in https://github.com/huggingface/datasets/pull/4843
    • @fl-lo made their first contribution in https://github.com/huggingface/datasets/pull/4869
    • @BenjaminGalliot made their first contribution in https://github.com/huggingface/datasets/pull/4880
    • @DougTrajano made their first contribution in https://github.com/huggingface/datasets/pull/4825
    • @ylacombe made their first contribution in https://github.com/huggingface/datasets/pull/4831
    • @osanseviero made their first contribution in https://github.com/huggingface/datasets/pull/4887
    • @akt42 made their first contribution in https://github.com/huggingface/datasets/pull/4844
    • @sanderland made their first contribution in https://github.com/huggingface/datasets/pull/4890
    • @sighingnow made their first contribution in https://github.com/huggingface/datasets/pull/4929
    • @mtanghu made their first contribution in https://github.com/huggingface/datasets/pull/4950
    • @antoniolanza1996 made their first contribution in https://github.com/huggingface/datasets/pull/4927
    • @apohllo made their first contribution in https://github.com/huggingface/datasets/pull/4967
    • @cwarny made their first contribution in https://github.com/huggingface/datasets/pull/4943
    • @tomaarsen made their first contribution in https://github.com/huggingface/datasets/pull/4986
    • @szmoro made their first contribution in https://github.com/huggingface/datasets/pull/4993

    Full Changelog: https://github.com/huggingface/datasets/compare/2.4.0...2.5.0

  • 2.4.0(Jul 25, 2022)

    Dataset Features

    • Add concatenate_datasets for iterable datasets by @lhoestq in https://github.com/huggingface/datasets/pull/4500
    • Support parallelism with PyTorch DataLoader with parquet/json/csv/text/image/etc. files by @mariosasko in https://github.com/huggingface/datasets/pull/4625
    • Support using PCM audio files (#4323) by @YooSungHyun in https://github.com/huggingface/datasets/pull/4409
    • [data_files] Files disambiguation: match split names in data files if they are between separators by @lhoestq in https://github.com/huggingface/datasets/pull/4633
    • Support extract 7-zip compressed data files by @albertvillanova in https://github.com/huggingface/datasets/pull/4672
    • Support extract lz4 compressed data files by @albertvillanova in https://github.com/huggingface/datasets/pull/4700
    • Support metadata.jsonl from parent directories in imagefolder by @mariosasko in https://github.com/huggingface/datasets/pull/4576
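
    The data_files disambiguation entry above describes matching split names only when they sit between separators. A minimal sketch of that idea in plain Python follows. The keyword table, the separator class and the helper name infer_split are all assumptions made for illustration; this is not the library's actual pattern set or implementation.

```python
import re

# Hypothetical keyword table for illustration; the library's real lists may differ.
SPLIT_KEYWORDS = {
    "train": ["train", "training"],
    "test": ["test", "testing"],
    "validation": ["validation", "valid", "dev", "val"],
}

# Assumed separator class: any non-word character, or an underscore.
SEPARATORS = r"[\W_]"


def infer_split(filename):
    """Return the split whose keyword appears between separators, else None."""
    name = filename.lower()
    for split, keywords in SPLIT_KEYWORDS.items():
        for keyword in keywords:
            # The keyword must be bounded by a separator or by the string edge,
            # so "contest.csv" does not count as a "test" file.
            pattern = rf"(?:^|{SEPARATORS}){re.escape(keyword)}(?:$|{SEPARATORS})"
            if re.search(pattern, name):
                return split
    return None


print(infer_split("my_test_file.csv"))
print(infer_split("contest.csv"))
```

    Requiring a separator on both sides is what keeps file names such as contest.csv or interval.csv from being assigned to a split by accident.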

    Dataset changes

    • Update: allocine - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4563
    • Update: multi_news - Host data on the Hub instead of Google Drive by @albertvillanova in https://github.com/huggingface/datasets/pull/4585
    • Update: pn_summary - Host data on the Hub instead of Google Drive by @albertvillanova in https://github.com/huggingface/datasets/pull/4586
    • Update: financial_phrasebank - Host data on the Hub by @albertvillanova in https://github.com/huggingface/datasets/pull/4598
    • Update: cfq - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4579
    • Update: head_qa - Host data on the Hub and fix NonMatchingChecksumError by @albertvillanova in https://github.com/huggingface/datasets/pull/4588
    • Update: bookcorpus - Support streaming dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4564
    • Update: fever - Refactor and add metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/4503
    • Update: mlsum - Support streaming dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4574
    • Fix: cats_vs_dogs - Update download url and improve card by @mariosasko in https://github.com/huggingface/datasets/pull/4523
    • Fix: conll2003 - fix empty example by @lhoestq in https://github.com/huggingface/datasets/pull/4662
    • Fix: WMT datasets - fix loading issue when choosing specific subsets and docs update by @khushmeeet in https://github.com/huggingface/datasets/pull/4554
    • Fix: xtreme - fix empty examples in dataset for bucc18 config by @lhoestq in https://github.com/huggingface/datasets/pull/4706
    • Fix: crd3 - fix splits that were containing the same data by @lhoestq in https://github.com/huggingface/datasets/pull/4705

    Dataset Cards

    • Add action names in schema_guided_dstc8 dataset card by @lhoestq in https://github.com/huggingface/datasets/pull/4559
    • Add evaluation data to acronym_identification by @lewtun in https://github.com/huggingface/datasets/pull/4561
    • Update WinoBias README by @sashavor in https://github.com/huggingface/datasets/pull/4631
    • Support "tags" yaml tag by @lhoestq in https://github.com/huggingface/datasets/pull/4716
    • Fix POS tags by @lhoestq in https://github.com/huggingface/datasets/pull/4715
    • AESLC dataset: Add summarization tags by @hobson in https://github.com/huggingface/datasets/pull/4517

    Documentation

    • Update docs around audio and vision by @stevhliu in https://github.com/huggingface/datasets/pull/4440
    • Update Google Cloud Storage documentation and add Azure Blob Storage example by @alvarobartt in https://github.com/huggingface/datasets/pull/4513
    • Remove multiple config section by @stevhliu in https://github.com/huggingface/datasets/pull/4600
    • Create new sections for audio and vision in guides by @stevhliu in https://github.com/huggingface/datasets/pull/4519
    • Document installation of sox OS dependency for audio by @albertvillanova in https://github.com/huggingface/datasets/pull/4713

    General improvements and bug fixes

    • Add regression test for ArrowWriter.write_batch when batch is empty by @alvarobartt in https://github.com/huggingface/datasets/pull/4510
    • Support all negative values in ClassLabel by @lhoestq in https://github.com/huggingface/datasets/pull/4511
    • Add uppercased versions of image file extensions for automatic module inference by @mariosasko in https://github.com/huggingface/datasets/pull/4515
    • Patch tests for hfh v0.8.0 by @LysandreJik in https://github.com/huggingface/datasets/pull/4518
    • Replace deprecated logging.warn with logging.warning by @hugovk in https://github.com/huggingface/datasets/pull/4539
    • [CI] Fix upstream hub test url by @lhoestq in https://github.com/huggingface/datasets/pull/4543
    • Fix timestamp conversion from Pandas to Python datetime in streaming mode by @lhoestq in https://github.com/huggingface/datasets/pull/4541
    • [CI] fixing seqeval install in ci by pinning setuptools-scm by @lhoestq in https://github.com/huggingface/datasets/pull/4546
    • Tell users to upload on the hub directly by @lhoestq in https://github.com/huggingface/datasets/pull/4552
    • Add batch_size parameter when calling add_faiss_index and add_faiss_index_from_external_arrays by @alvarobartt in https://github.com/huggingface/datasets/pull/4535
    • Make DuplicateKeysError more user friendly [For Issue #2556] by @VijayKalmath in https://github.com/huggingface/datasets/pull/4545
    • Properly raise FileNotFound even if the dataset is private by @lhoestq in https://github.com/huggingface/datasets/pull/4536
    • Fix hashing for python 3.9 by @lhoestq in https://github.com/huggingface/datasets/pull/4516
    • [CI] Fix some warnings by @lhoestq in https://github.com/huggingface/datasets/pull/4547
    • Validate new_fingerprint passed by user by @lhoestq in https://github.com/huggingface/datasets/pull/4587
    • Update CI Windows orb by @albertvillanova in https://github.com/huggingface/datasets/pull/4604
    • Perform hidden file check on relative data file path by @mariosasko in https://github.com/huggingface/datasets/pull/4551
    • Align more metadata with other repo types (models,spaces) by @julien-c in https://github.com/huggingface/datasets/pull/4607
    • Align/fix license metadata info by @julien-c in https://github.com/huggingface/datasets/pull/4613
    • Preserve member order by MockDownloadManager.iter_archive by @albertvillanova in https://github.com/huggingface/datasets/pull/4611
    • Add authentication tip to load_dataset by @mariosasko in https://github.com/huggingface/datasets/pull/4577
    • Stop dropping columns in to_tf_dataset() before we load batches by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4553
    • fix(dataset_wrappers): Fixes access to fsspec.asyn in torch_iterable_dataset.py. by @gugarosa in https://github.com/huggingface/datasets/pull/4630
    • Fix xisfile, xgetsize, xisdir, xlistdir in private repo by @lhoestq in https://github.com/huggingface/datasets/pull/4608
    • Rename master to main by @lhoestq in https://github.com/huggingface/datasets/pull/4643
    • Set HF_SCRIPTS_VERSION to main by @lhoestq in https://github.com/huggingface/datasets/pull/4645
    • [Minor fix] Typo correction by @cakiki in https://github.com/huggingface/datasets/pull/4644
    • fixed duplicate calculation of spearmanr function in metrics wrapper. by @benlipkin in https://github.com/huggingface/datasets/pull/4627
    • Generalize meta_path json file creation in load.py [#4540] by @VijayKalmath in https://github.com/huggingface/datasets/pull/4590
    • Fix time type _arrow_to_datasets_dtype conversion by @mariosasko in https://github.com/huggingface/datasets/pull/4628
    • Fix _resolve_single_pattern_locally on Windows with multiple drives by @albertvillanova in https://github.com/huggingface/datasets/pull/4660
    • Replace assertEqual with assertTupleEqual in unit tests for verbosity by @alvarobartt in https://github.com/huggingface/datasets/pull/4496
    • Fix embed_storage on features inside lists/sequences by @mariosasko in https://github.com/huggingface/datasets/pull/4615
    • Add links to vision tasks scripts in ADD_NEW_DATASET template by @mariosasko in https://github.com/huggingface/datasets/pull/4512
    • Transfer CI to GitHub Actions by @albertvillanova in https://github.com/huggingface/datasets/pull/4659
    • Fix mock fsspec by @albertvillanova in https://github.com/huggingface/datasets/pull/4685
    • Trigger CI also on push to main by @albertvillanova in https://github.com/huggingface/datasets/pull/4687
    • Fix ImageFolder with parameters drop_metadata=True and drop_labels=False (when metadata.jsonl is present) by @polinaeterna in https://github.com/huggingface/datasets/pull/4622
    • Skip test_extractor only for zstd param if zstandard not installed by @albertvillanova in https://github.com/huggingface/datasets/pull/4688
    • Test extractors for all compression formats by @albertvillanova in https://github.com/huggingface/datasets/pull/4689
    • Refactor base extractors by @albertvillanova in https://github.com/huggingface/datasets/pull/4690
    • Update create dataset card docs by @stevhliu in https://github.com/huggingface/datasets/pull/4683
    • Add text decorators by @stevhliu in https://github.com/huggingface/datasets/pull/4663
    • Skip tests only for lz4/zstd params if not installed by @albertvillanova in https://github.com/huggingface/datasets/pull/4704
    • Ensure ConcatenationTable.cast uses target_schema metadata by @dtuit in https://github.com/huggingface/datasets/pull/4614
    • Docs: Fix same-page hashlinks by @mishig25 in https://github.com/huggingface/datasets/pull/4722
    • Fix broken link to the Hub by @stevhliu in https://github.com/huggingface/datasets/pull/4726
    • Refactor conftest fixtures by @albertvillanova in https://github.com/huggingface/datasets/pull/4723
    • Add object detection processing tutorial by @nateraw in https://github.com/huggingface/datasets/pull/4710
    • Fix require torchaudio and refactor test requirements by @albertvillanova in https://github.com/huggingface/datasets/pull/4708
    • docs: ✏️ fix TranslationVariableLanguages example by @severo in https://github.com/huggingface/datasets/pull/4731
    • Pin rouge_score test dependency by @albertvillanova in https://github.com/huggingface/datasets/pull/4735
    • Fix named split sorting and remove unnecessary casting by @albertvillanova in https://github.com/huggingface/datasets/pull/4714
    • Make cast in from_pandas more robust by @mariosasko in https://github.com/huggingface/datasets/pull/4703
    • Make Extractor accept Path as input by @albertvillanova in https://github.com/huggingface/datasets/pull/4718
    • Refactor Hub tests by @albertvillanova in https://github.com/huggingface/datasets/pull/4729
    • Fix to dict conversion of DatasetInfo/Features by @mariosasko in https://github.com/huggingface/datasets/pull/4741

    New Contributors

    • @hugovk made their first contribution in https://github.com/huggingface/datasets/pull/4539
    • @VijayKalmath made their first contribution in https://github.com/huggingface/datasets/pull/4545
    • @gugarosa made their first contribution in https://github.com/huggingface/datasets/pull/4630
    • @benlipkin made their first contribution in https://github.com/huggingface/datasets/pull/4627
    • @YooSungHyun made their first contribution in https://github.com/huggingface/datasets/pull/4409
    • @hobson made their first contribution in https://github.com/huggingface/datasets/pull/4517
    • @khushmeeet made their first contribution in https://github.com/huggingface/datasets/pull/4554
    • @dtuit made their first contribution in https://github.com/huggingface/datasets/pull/4614

    Full Changelog: https://github.com/huggingface/datasets/compare/2.3.2...2.4.0

  • 2.3.2(Jun 15, 2022)

    Bug fixes

    • Fix double dots in data files by @lhoestq in https://github.com/huggingface/datasets/pull/4505
      • fix a bug when /../ is passed to data_files causing FileNotFoundError
    • fix ETT m1/m2 test/val dataset by @kashif in https://github.com/huggingface/datasets/pull/4499
    • Corrected broken links in doc by @clefourrier in https://github.com/huggingface/datasets/pull/4501
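
    To illustrate the kind of path involved in the double-dots fix above, the sketch below shows how a POSIX-style path containing /../ resolves to a simpler equivalent. It uses only the standard library, and the helper name normalize_data_file is hypothetical. It does not reproduce how the library resolves data_files internally.

```python
import posixpath


def normalize_data_file(path):
    """Collapse '.' and '..' segments in a POSIX-style path (illustrative helper)."""
    return posixpath.normpath(path)


# A path that steps into a directory and back out again names the same file
# as the shorter path without the detour.
print(normalize_data_file("data/raw/../train.csv"))

# Leading '..' segments of a relative path cannot be collapsed and are kept.
print(normalize_data_file("../shared/train.csv"))
```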

    New Contributors

    • @clefourrier made their first contribution in https://github.com/huggingface/datasets/pull/4501

    Full Changelog: https://github.com/huggingface/datasets/compare/2.3.1...2.3.2

  • 2.3.1(Jun 15, 2022)

    Bug fixes

    • Fix patching module that doesn't exist by @lhoestq in https://github.com/huggingface/datasets/pull/4495
      • fix bug when importing the lib when scipy is not installed
    • Re-add download_manager module in utils by @lhoestq in https://github.com/huggingface/datasets/pull/4497
      • fix moved imports of DownloadConfig, DownloadMode, DownloadManager
    • Support streaming UDHR dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4487

    Full Changelog: https://github.com/huggingface/datasets/compare/2.3.0...2.3.1

  • 2.3.0(Jun 14, 2022)

    Datasets Changes

    • New: ImageNet-Sketch by @nateraw in https://github.com/huggingface/datasets/pull/4301
    • New: Biwi Kinect Head Pose by @dnaveenr in https://github.com/huggingface/datasets/pull/3903
    • New: enwik8 by @HallerPatrick in https://github.com/huggingface/datasets/pull/4321
    • New: LCCC dataset by @silverriver in https://github.com/huggingface/datasets/pull/4416
    • New: TruthfulQA by @jon-tow in https://github.com/huggingface/datasets/pull/4159
    • New: BIG-bench by @andersjohanandreassen in https://github.com/huggingface/datasets/pull/4125
    • New: QuickDraw by @mariosasko in https://github.com/huggingface/datasets/pull/3592
    • New: SST-2 by @albertvillanova in https://github.com/huggingface/datasets/pull/4473
    • Update: imagenet-1k - remove manual download by @mariosasko in https://github.com/huggingface/datasets/pull/4299
      • ImageNet can now be loaded in Python with load_dataset, without requiring a manual download!
      • It also supports streaming mode with load_dataset("imagenet-1k", streaming=True)
    • Update: spider - Remove Google Drive URL by @albertvillanova in https://github.com/huggingface/datasets/pull/4410
    • Update: blended_skill_talk - add missing columns by @mariosasko in https://github.com/huggingface/datasets/pull/4437
    • Update: multi-news - Use newer version with fixes by @JohnGiorgi in https://github.com/huggingface/datasets/pull/4451
    • Update: fever - update data URLs by @albertvillanova in https://github.com/huggingface/datasets/pull/4455
    • Update: udhr - Add and fix language tags by @albertvillanova in https://github.com/huggingface/datasets/pull/
    • Update: udhr - update metadata by @leondz in https://github.com/huggingface/datasets/pull/4362
    • Update: wider_face - Replace data URLs once hosted on the Hub by @albertvillanova in https://github.com/huggingface/datasets/pull/4469
    • Update: PASS - update dataset version by @mariosasko in https://github.com/huggingface/datasets/pull/4488
    • Fix: GEM - fix bug in wiki_auto_asset_turk config by @albertvillanova in https://github.com/huggingface/datasets/pull/4389
    • Fix: GEM - fix URL for totto config by @albertvillanova in https://github.com/huggingface/datasets/pull/4396
    • Fix: timit_asr - fix DuplicatedKeysError by @albertvillanova in https://github.com/huggingface/datasets/pull/4424
    • Fix: timit_asr - Make extensions case-insensitive by @albertvillanova in https://github.com/huggingface/datasets/pull/4425
    • Fix: timit_asr - Fix directory names for LDC data by @albertvillanova in https://github.com/huggingface/datasets/pull/4436
    • Fix: iwslt2017 by @lhoestq in https://github.com/huggingface/datasets/pull/4481

    Dataset Features

    • to_tf_dataset rewrite by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4170
    • Support DataLoader with num_workers > 0 in streaming mode by @lhoestq in https://github.com/huggingface/datasets/pull/4375
    • Added stratify option to train_test_split by @nandwalritik in https://github.com/huggingface/datasets/pull/4322
    • Re-add support for Apache Beam functionality by @albertvillanova in https://github.com/huggingface/datasets/pull/4328
    • Resume push_to_hub: skip identical files in push_to_hub instead of overwriting by @mariosasko in https://github.com/huggingface/datasets/pull/4402
    • Support nested/complex feature types as features in packaged loaders by @mariosasko in https://github.com/huggingface/datasets/pull/4364
    • Optimize contiguous shard and select by @lhoestq in https://github.com/huggingface/datasets/pull/4466
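
    The stratify option added to train_test_split keeps class proportions the same in both splits. The sketch below shows what stratification means using plain Python lists. The function stratified_split and its parameters are hypothetical and written only for illustration; it is not the library's implementation or its API.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.25, seed=0):
    """Return (train_indices, test_indices), sampling test_size from each class."""
    # Group example indices by their label.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Take the same fraction from every class so proportions are preserved.
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


labels = ["a"] * 8 + ["b"] * 4
train_idx, test_idx = stratified_split(labels, test_size=0.25)
print(len(train_idx), len(test_idx))
```

    With 8 examples of class "a" and 4 of class "b", a 25% test split takes 2 and 1 examples respectively, so both splits keep the original 2:1 ratio.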

    Dataset Cards

    • Minor fixes/improvements in scene_parse_150 card by @mariosasko in https://github.com/huggingface/datasets/pull/4447
    • Tidy up license metadata for google_wellformed_query, newspop, sick by @leondz in https://github.com/huggingface/datasets/pull/4378
    • Fix example in opus_ubuntu, Add license info by @leondz in https://github.com/huggingface/datasets/pull/4360
    • Update README.md of fquad by @lhoestq in https://github.com/huggingface/datasets/pull/4450

    Documentation

    • Add API code examples for loading methods by @stevhliu in https://github.com/huggingface/datasets/pull/4300
    • Add API code examples for remaining main classes by @stevhliu in https://github.com/huggingface/datasets/pull/4292
    • Generalize tutorials for audio and vision by @stevhliu in https://github.com/huggingface/datasets/pull/4468
    • [Docs] How to use with PyTorch page by @lhoestq in https://github.com/huggingface/datasets/pull/4474
    • First draft of the docs for TF + Datasets by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4457

    Other improvements and bug fixes

    • Update CI deprecated legacy image by @albertvillanova in https://github.com/huggingface/datasets/pull/4393
    • remove int documentation from logging docs by @lvwerra in https://github.com/huggingface/datasets/pull/4392
    • Fix docstring in DatasetDict::shuffle by @felixdivo in https://github.com/huggingface/datasets/pull/4344
    • Fix Version equality by @albertvillanova in https://github.com/huggingface/datasets/pull/4359
    • Set builder name from module instead of class by @albertvillanova in https://github.com/huggingface/datasets/pull/4388
    • Test dill by @albertvillanova in https://github.com/huggingface/datasets/pull/4385
    • Refactor download by @albertvillanova in https://github.com/huggingface/datasets/pull/4384
    • Fix dependency on dill version by @albertvillanova in https://github.com/huggingface/datasets/pull/4397
    • Support remote cache_dir by @albertvillanova in https://github.com/huggingface/datasets/pull/4347
    • Update imagenet gate by @lhoestq in https://github.com/huggingface/datasets/pull/4408
    • Fix dataset builder default version by @albertvillanova in https://github.com/huggingface/datasets/pull/4356
    • Uncomment logging deactivation for ArrowBasedBuilder by @thomasw21 in https://github.com/huggingface/datasets/pull/4403
    • Rename DatasetBuilder config_name by @albertvillanova in https://github.com/huggingface/datasets/pull/4414
    • Fix metadata validation by @albertvillanova in https://github.com/huggingface/datasets/pull/4390
    • Add HF.co for PRs/Issues for specific datasets by @lhoestq in https://github.com/huggingface/datasets/pull/4427
    • Fix type hint and documentation for new_fingerprint by @fxmarty in https://github.com/huggingface/datasets/pull/4326
    • Skip hidden files/directories in data files resolution and iter_files by @mariosasko in https://github.com/huggingface/datasets/pull/4412
    • Fix docstring of inspect_dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4438
    • Fix builder docstring by @albertvillanova in https://github.com/huggingface/datasets/pull/4432
    • Fix kwargs in docstrings by @albertvillanova in https://github.com/huggingface/datasets/pull/4444
    • Fix missing args in docstring of load_dataset_builder by @albertvillanova in https://github.com/huggingface/datasets/pull/4445
    • Add missing kwargs to docstrings by @albertvillanova in https://github.com/huggingface/datasets/pull/4446
    • Add extractor for bzip2-compressed files by @asivokon in https://github.com/huggingface/datasets/pull/4421
    • Fix dummy dataset generation script for handling nested types of _URLs by @silverriver in https://github.com/huggingface/datasets/pull/4434
    • Update dataset_infos.json with new split info in Dataset.push_to_hub to avoid verification error by @mariosasko in https://github.com/huggingface/datasets/pull/4415
    • Update builder docstring for deprecated/added arguments by @albertvillanova in https://github.com/huggingface/datasets/pull/4429
    • Extend support for streaming datasets that use xml.dom.minidom.parse by @albertvillanova in https://github.com/huggingface/datasets/pull/4464
    • Fix script fetching and local path handling in inspect_dataset and inspect_metric by @mariosasko in https://github.com/huggingface/datasets/pull/4433
    • Fix bigbench config names by @lhoestq in https://github.com/huggingface/datasets/pull/4465
    • Fix 401 error for unauthenticated requests to non-existing repos by @lhoestq in https://github.com/huggingface/datasets/pull/4472
    • Reorder returned validation/test splits in script template by @albertvillanova in https://github.com/huggingface/datasets/pull/4470
    • Better ImportError message when a dataset script dependency is missing by @lhoestq in https://github.com/huggingface/datasets/pull/4484
    • Fix cast to null by @lhoestq in https://github.com/huggingface/datasets/pull/4485
    • Update _format_columns in remove_columns by @alvarobartt in https://github.com/huggingface/datasets/pull/4411
    • Fix wrong map parameter name in cache docs by @h4iku in https://github.com/huggingface/datasets/pull/4293
    • Pin the revision in imagenet download links by @lhoestq in https://github.com/huggingface/datasets/pull/4492
    • Refactor column mappings for question answering datasets by @lewtun in https://github.com/huggingface/datasets/pull/4391
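
    The "Skip hidden files/directories in data files resolution and iter_files" entry above describes a filtering rule that is easy to picture in plain Python. The following is a minimal sketch under the assumption that "hidden" means a name starting with a dot; the helper name iter_visible_files is hypothetical and this is not the library's own implementation.

```python
import os
from typing import Iterator


def iter_visible_files(root: str) -> Iterator[str]:
    """Yield file paths under root, skipping hidden files and directories.

    Assumption: an entry is "hidden" when its name starts with a dot.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune hidden directories in place so os.walk never descends into them.
        dirnames[:] = sorted(d for d in dirnames if not d.startswith("."))
        for name in sorted(filenames):
            if not name.startswith("."):
                yield os.path.join(dirpath, name)
```

    With a layout containing .git/config, data/train.csv, data/.DS_Store and README.md, only README.md and data/train.csv would be yielded.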

    New Contributors

    • @leondz made their first contribution in https://github.com/huggingface/datasets/pull/4378
    • @felixdivo made their first contribution in https://github.com/huggingface/datasets/pull/4344
    • @nandwalritik made their first contribution in https://github.com/huggingface/datasets/pull/4322
    • @fxmarty made their first contribution in https://github.com/huggingface/datasets/pull/4326
    • @HallerPatrick made their first contribution in https://github.com/huggingface/datasets/pull/4321
    • @silverriver made their first contribution in https://github.com/huggingface/datasets/pull/4416
    • @asivokon made their first contribution in https://github.com/huggingface/datasets/pull/4421
    • @andersjohanandreassen made their first contribution in https://github.com/huggingface/datasets/pull/4125

    Full Changelog: https://github.com/huggingface/datasets/compare/2.2.2...lol

  • 2.2.2 (May 20, 2022)

    Datasets fixes

    • Fix: irc_disentangle - fix checksum and bug dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4377
    • Fix: CC-Aligned - fix invalid url by @juntang-zhuang in https://github.com/huggingface/datasets/pull/4231
    • Fix: multi_news - don't strip preceding hyphen by @JohnGiorgi in https://github.com/huggingface/datasets/pull/4353

    Bug fixes

    • Support lists of multi-dimensional numpy arrays by @albertvillanova in https://github.com/huggingface/datasets/pull/4194
    • Check if dataset features match before push in DatasetDict.push_to_hub by @mariosasko in https://github.com/huggingface/datasets/pull/4372
    • Pin dill by @albertvillanova in https://github.com/huggingface/datasets/pull/4380
      • dill 0.3.5 has some issues in transformers - pinning the version to <0.3.5 for now
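
    The pin above restricts dill to versions strictly below 0.3.5 (as a pip requirement this would typically be written dill<0.3.5). As a rough illustration of what that bound accepts and rejects, here is a small sketch; the helper names are hypothetical and it only handles plain dotted numeric versions, not pre-release or local version tags.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a plain dotted version such as '0.3.4' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def satisfies_upper_bound(installed: str, upper_bound: str = "0.3.5") -> bool:
    """Return True when `installed` is strictly below `upper_bound`."""
    return parse_version(installed) < parse_version(upper_bound)


# 0.3.4 is allowed by the pin, 0.3.5 itself is excluded.
print(satisfies_upper_bound("0.3.4"))
print(satisfies_upper_bound("0.3.5"))
```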

    Dataset Cards

    • Adding eval metadata for ade v2 by @sashavor in https://github.com/huggingface/datasets/pull/4319
    • Adding eval metadata for AG News by @sashavor in https://github.com/huggingface/datasets/pull/4329
    • Adding eval metadata to Allociné dataset by @sashavor in https://github.com/huggingface/datasets/pull/4330
    • Adding eval metadata to Amazon Polarity by @sashavor in https://github.com/huggingface/datasets/pull/4331
    • Adding eval metadata for arabic speech corpus by @sashavor in https://github.com/huggingface/datasets/pull/4332
    • Adding eval metadata for Banking 77 by @sashavor in https://github.com/huggingface/datasets/pull/4333
    • Eval metadata batch 4: Tweet Eval, Tweets Hate Speech Detection, VCTK, Weibo NER, Wisesight Sentiment, XSum, Yahoo Answers Topics, Yelp Polarity, Yelp Review Full by @sashavor in https://github.com/huggingface/datasets/pull/4338
    • Eval metadata batch 3: Reddit, Rotten Tomatoes, SemEval 2010, Sentiment 140, SMS Spam, Snips, SQuAD, SQuAD v2, Timit ASR by @sashavor in https://github.com/huggingface/datasets/pull/4337
    • Eval metadata batch 1: BillSum, CoNLL2003, CoNLLPP, CUAD, Emotion, GigaWord, GLUE, Hate Speech 18, Hate Speech by @sashavor in https://github.com/huggingface/datasets/pull/4335
    • Eval metadata batch 2: Health Fact, Jigsaw Toxicity, LIAR, LJ Speech, MSRA NER, Multi News, NCBI Disease, Poem Sentiment by @sashavor in https://github.com/huggingface/datasets/pull/4336

    Docs

    • Add API code examples for Builder classes by @stevhliu in https://github.com/huggingface/datasets/pull/4313
    • Add redirect to dataset script in the repo structure page by @lhoestq in https://github.com/huggingface/datasets/pull/4369

    Other improvements and bug fixes

    • Fix failing CI on Windows for sari and wiki_split metrics by @albertvillanova in https://github.com/huggingface/datasets/pull/4342
    • Fix never ending GH Action to build documentation by @albertvillanova in https://github.com/huggingface/datasets/pull/4345
    • Fix warning in upload_file by @albertvillanova in https://github.com/huggingface/datasets/pull/4355
    • Fix warning in push_to_hub by @albertvillanova in https://github.com/huggingface/datasets/pull/4357
    • Remove config names as yaml keys by @lhoestq in https://github.com/huggingface/datasets/pull/4367
    • Add missing language tags for udhr dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4371
    • Remove links in docs to old dataset viewer by @mariosasko in https://github.com/huggingface/datasets/pull/4373

    New Contributors

    • @JohnGiorgi made their first contribution in https://github.com/huggingface/datasets/pull/4353
    • @juntang-zhuang made their first contribution in https://github.com/huggingface/datasets/pull/4231

    Full Changelog: https://github.com/huggingface/datasets/compare/2.2.1...2.2.2

  • 2.2.1 (May 11, 2022)

    Datasets bug fixes

    • Fix cnn_dailymail (dm stories were ignored) by @lhoestq in https://github.com/huggingface/datasets/pull/4317
      • datasets 2.2.0 introduced a bug in cnn_dailymail and some examples were missing in the dataset

    General improvements and bug fixes

    • Fix: Add missing comma by @mrm8488 in https://github.com/huggingface/datasets/pull/4303
    • Catch pull error when mirroring by @lhoestq in https://github.com/huggingface/datasets/pull/4314
    • Remove unused multiprocessing args from test CLI by @albertvillanova in https://github.com/huggingface/datasets/pull/4308
    • Fix CLI run_beam namespace by @albertvillanova in https://github.com/huggingface/datasets/pull/4315
    • Support passing config_kwargs to CLI run_beam by @albertvillanova in https://github.com/huggingface/datasets/pull/4316
    • Don't check f.loc in _get_extraction_protocol_with_magic_number by @lhoestq in https://github.com/huggingface/datasets/pull/4318

    New Contributors

    • @mrm8488 made their first contribution in https://github.com/huggingface/datasets/pull/4303

    Full Changelog: https://github.com/huggingface/datasets/compare/2.2.0...2.2.1

  • 2.2.0 (May 10, 2022)

    Dataset Changes

    • New: ImageNet by @apsdehal in https://github.com/huggingface/datasets/pull/4178
      • Manual download only for now
    • New: Google Conceptual Captions by @abhishekkrthakur in https://github.com/huggingface/datasets/pull/1459
    • New: Conceptual 12M by @thomasw21 in https://github.com/huggingface/datasets/pull/4162
    • New: Visual Genome by @thomasw21 in https://github.com/huggingface/datasets/pull/4161
    • New: RVL-CDIP by @dnaveenr in https://github.com/huggingface/datasets/pull/4050
    • New: Text-based NP Enrichment (TNE) by @yanaiela in https://github.com/huggingface/datasets/pull/4153
    • New: TextVQA by @apsdehal in https://github.com/huggingface/datasets/pull/3967
    • New: ETT time series dataset by @kashif in https://github.com/huggingface/datasets/pull/4213
    • Update: assin2 - update metadata by @lhoestq in https://github.com/huggingface/datasets/pull/4172
    • Update: Librispeech - Add 'all' config by @patrickvonplaten in https://github.com/huggingface/datasets/pull/4184
    • Update: XGLUE - Support streaming dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4249
    • Update: crd3 - group all the turns in one example by @shanyas10 in https://github.com/huggingface/datasets/pull/4240
    • Update: pubmed_qa - Remove google drive URL by @lhoestq in https://github.com/huggingface/datasets/pull/4255
    • Update: SAMSum - Replace data URL dataset and support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4254
    • Update: SAMSum - Replace data URL dataset within the same repository by @albertvillanova in https://github.com/huggingface/datasets/pull/4267
    • Update: big_patent - Replace data URL in dataset and support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4236
    • Update: openbookqa - Add missing features for additional config by @albertvillanova in https://github.com/huggingface/datasets/pull/4278
    • Update: commonsense_qa - Add missing features by @albertvillanova in https://github.com/huggingface/datasets/pull/4280
    • Fix: Common Voice - Make sure bytes are correctly deleted if path exists by @patrickvonplaten in https://github.com/huggingface/datasets/pull/4212
    • Fix: openbookqa - fix bug in choices labels by @manandey in https://github.com/huggingface/datasets/pull/4259
    • Fix: openbookqa - fix style in openbookqa dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4270

    Dataset Features

    • Add support for metadata files to imagefolder by @mariosasko in https://github.com/huggingface/datasets/pull/4069
    • Infer splits from the data_dir parameter when loading datasets without script by @polinaeterna in https://github.com/huggingface/datasets/pull/4144
    • Enable label alignment for token classification datasets by @lewtun in https://github.com/huggingface/datasets/pull/4277
    • Add drop_last_batch to IterableDataset.map by @mariosasko in https://github.com/huggingface/datasets/pull/4215
    • Load dataset with TSV files by @albertvillanova in https://github.com/huggingface/datasets/pull/4246
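
    The drop_last_batch option added to IterableDataset.map controls what happens to a final batch that is smaller than the batch size. The sketch below illustrates those semantics in plain Python, assuming the option simply discards the incomplete remainder; iter_batches is a hypothetical helper, not the library's code.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(
    items: Iterable[T], batch_size: int, drop_last_batch: bool = False
) -> Iterator[List[T]]:
    """Group items into lists of batch_size, optionally dropping a short tail."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # A non-empty remainder here is a final batch smaller than batch_size.
    if batch and not drop_last_batch:
        yield batch
```

    Dropping the short tail is mainly useful when the mapped function expects every batch to have exactly the same size.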

    Dataset Cards

    • Autoeval config by @nrajani in https://github.com/huggingface/datasets/pull/4234
      • Add train-eval-index metadata to automate evaluation on your datasets based on their tasks
    • Adding license information for Openbookcorpus by @meg-huggingface in https://github.com/huggingface/datasets/pull/3525
    • Make code for image downloading from image urls cacheable by @mariosasko in https://github.com/huggingface/datasets/pull/4218
    • Fix description links in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4222
    • Add YAML tags to Dataset Card rotten tomatoes by @mo6zes in https://github.com/huggingface/datasets/pull/4262
    • Remove a copy-paste sentence in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4281
    • Update LexGLUE README.md by @iliaschalkidis in https://github.com/huggingface/datasets/pull/4285
    • Leaderboard info added for TNE by @yanaiela in https://github.com/huggingface/datasets/pull/4273
    • Add Lahnda language tag by @mariosasko in https://github.com/huggingface/datasets/pull/4286
    • Add license and point of contact to big_patent dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4269
    • Add HF Speech Bench to Librispeech Dataset Card by @sanchit-gandhi in https://github.com/huggingface/datasets/pull/4266

    Metrics Changes

    • Perplexity Speedup by @emibaylor in https://github.com/huggingface/datasets/pull/4108
    • Add AUC ROC Metric by @emibaylor in https://github.com/huggingface/datasets/pull/4158
    • Small fixes in ROC AUC docs by @wschella in https://github.com/huggingface/datasets/pull/4239
    • Fix/start token mask issue and update documentation by @TristanThrush in https://github.com/huggingface/datasets/pull/4258
    • Add pearsonr metric card, update functionality to match the original docs by @emibaylor in https://github.com/huggingface/datasets/pull/4226

    Metric Cards

    • Metric card for the XTREME-S dataset by @sashavor in https://github.com/huggingface/datasets/pull/4251
    • Creating metric card for MAE by @sashavor in https://github.com/huggingface/datasets/pull/4252
    • Create metric cards for mean IOU by @sashavor in https://github.com/huggingface/datasets/pull/4253
    • Create metric card for Mahalanobis Distance by @sashavor in https://github.com/huggingface/datasets/pull/4257
    • Create metric card for MSE by @sashavor in https://github.com/huggingface/datasets/pull/4256
    • Fix exact match by @emibaylor in https://github.com/huggingface/datasets/pull/4166
    • Fix google bleu typos, examples by @emibaylor in https://github.com/huggingface/datasets/pull/4165
    • Add f1 metric card, update docstring in py file by @emibaylor in https://github.com/huggingface/datasets/pull/4227
    • Add Recall Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4204
    • Matthews Correlation Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4110
    • Add Precision Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4203
    • Add Accuracy Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4223
    • Add Spearmanr Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4109
    • Metric card template by @emibaylor in https://github.com/huggingface/datasets/pull/3915

    Documentation

    • Document save_to_disk and push_to_hub on images and audio files by @lhoestq in https://github.com/huggingface/datasets/pull/4193
    • Add to docs how to load from local script by @albertvillanova in https://github.com/huggingface/datasets/pull/4200
    • Add code examples to API docs by @stevhliu in https://github.com/huggingface/datasets/pull/4168
    • Add code examples for DatasetDict by @stevhliu in https://github.com/huggingface/datasets/pull/4245
    • Add API code examples for IterableDataset by @stevhliu in https://github.com/huggingface/datasets/pull/4274
    • Add packaged builder configs to the documentation by @lhoestq in https://github.com/huggingface/datasets/pull/4307
    • [Imagefolder] Docs + Don't infer labels from file names when there are metadata + Error messages when metadata and images aren't linked correctly by @lhoestq in https://github.com/huggingface/datasets/pull/4311

    General improvements and bug fixes

    • Generate tasks.json taxonomy from huggingface_hub by @julien-c in https://github.com/huggingface/datasets/pull/4154
    • Fix when map function modifies input in-place by @thomasw21 in https://github.com/huggingface/datasets/pull/4174
    • Support streaming cnn_dailymail dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4188
    • Don't duplicate data when encoding audio or image by @lhoestq in https://github.com/huggingface/datasets/pull/4187
    • Fix outdated docstring about default dataset config by @lhoestq in https://github.com/huggingface/datasets/pull/4186
    • Deprecate shard_size in push_to_hub in favor of max_shard_size by @mariosasko in https://github.com/huggingface/datasets/pull/4190
    • Fix some type annotation in doc by @thomasw21 in https://github.com/huggingface/datasets/pull/4202
    • Update GH template for dataset viewer issues by @albertvillanova in https://github.com/huggingface/datasets/pull/4201
    • Update auth when mirroring datasets on the hub by @lhoestq in https://github.com/huggingface/datasets/pull/4242
    • Rename imagenet2012 -> imagenet-1k by @lhoestq in https://github.com/huggingface/datasets/pull/4263
    • Skip checksum computation in Imagefolder by default by @mariosasko in https://github.com/huggingface/datasets/pull/4214
    • Fix convert_file_size_to_int for kilobits and megabits by @mariosasko in https://github.com/huggingface/datasets/pull/4205
    • Fix typo in logging docs by @stevhliu in https://github.com/huggingface/datasets/pull/4272
    • Bump PyArrow Version to 6 by @dnaveenr in https://github.com/huggingface/datasets/pull/4250
    • task id update by @nrajani in https://github.com/huggingface/datasets/pull/4244
    • Avoid recursion error in map if example is returned as dict value by @mariosasko in https://github.com/huggingface/datasets/pull/4216
    • Update minimal PyArrow version warning by @mariosasko in https://github.com/huggingface/datasets/pull/4279
    • [Minor edit] Fix typo in class name by @cakiki in https://github.com/huggingface/datasets/pull/4207
    • Stream private zipped images by @lhoestq in https://github.com/huggingface/datasets/pull/4173
    • Fix filesystem docstring by @stevhliu in https://github.com/huggingface/datasets/pull/4283
    • Document how to use FAISS index for special operations by @albertvillanova in https://github.com/huggingface/datasets/pull/4189
    • Contributing MedMCQA dataset by @monk1337 in https://github.com/huggingface/datasets/pull/4064
    • Don't do unnecessary list type casting to avoid replacing None values by empty lists by @lhoestq in https://github.com/huggingface/datasets/pull/4282
    • Fix missing lz4 dependency for tests by @albertvillanova in https://github.com/huggingface/datasets/pull/4295
    • Altered faiss installation comment by @vishalsrao in https://github.com/huggingface/datasets/pull/4220
    • Fix CLI run_beam save_infos by @albertvillanova in https://github.com/huggingface/datasets/pull/4294
    • Add missing faiss import to fix https://github.com/huggingface/datasets/issues/4287 by @alvarobartt in https://github.com/huggingface/datasets/pull/4288
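
    Two entries in the list above (max_shard_size in push_to_hub and the convert_file_size_to_int fix for kilobits and megabits) deal with human-readable size strings. The following sketch shows one way such strings could be converted to a byte count. It assumes that an uppercase B means bytes, a lowercase b means bits, and an "i" infix (as in MiB) selects binary multiples; size_to_bytes is a hypothetical name and the accepted formats are assumptions, not the library's exact behavior.

```python
import re

_DECIMAL = {"K": 10**3, "M": 10**6, "G": 10**9}
_BINARY = {"KI": 2**10, "MI": 2**20, "GI": 2**30}


def size_to_bytes(size: str) -> int:
    """Convert strings such as '500MB', '1MiB' or '8Mb' to a number of bytes."""
    match = re.fullmatch(r"(\d+)\s*([KMGkmg][Ii]?)([Bb])", size.strip())
    if match is None:
        raise ValueError(f"Unrecognized size string: {size!r}")
    number, prefix, unit = match.groups()
    prefix = prefix.upper()
    multiplier = _BINARY[prefix] if prefix.endswith("I") else _DECIMAL[prefix]
    total = int(number) * multiplier
    # Assumption: a lowercase 'b' denotes bits, so divide by 8 to get bytes.
    return total // 8 if unit == "b" else total
```

    Under these assumptions "1KB" is 1000 bytes while "1Kb" is 125 bytes, which is the kind of distinction the kilobit/megabit fix concerns.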

    New Contributors

    • @shanyas10 made their first contribution in https://github.com/huggingface/datasets/pull/4240
    • @apsdehal made their first contribution in https://github.com/huggingface/datasets/pull/4178
    • @wschella made their first contribution in https://github.com/huggingface/datasets/pull/4239
    • @TristanThrush made their first contribution in https://github.com/huggingface/datasets/pull/4258
    • @yanaiela made their first contribution in https://github.com/huggingface/datasets/pull/4153
    • @mo6zes made their first contribution in https://github.com/huggingface/datasets/pull/4262
    • @nrajani made their first contribution in https://github.com/huggingface/datasets/pull/4244
    • @sanchit-gandhi made their first contribution in https://github.com/huggingface/datasets/pull/4266
    • @cakiki made their first contribution in https://github.com/huggingface/datasets/pull/4207
    • @monk1337 made their first contribution in https://github.com/huggingface/datasets/pull/4064
    • @alvarobartt made their first contribution in https://github.com/huggingface/datasets/pull/4288

    Full Changelog: https://github.com/huggingface/datasets/compare/2.1.0...2.2.0

  • 2.1.0 (Apr 14, 2022)

    Datasets Changes

    • New: initial monash time series forecasting by @kashif in https://github.com/huggingface/datasets/pull/3743
    • New: Roman Urdu Hate Speech dataset by @bp-high in https://github.com/huggingface/datasets/pull/3972
    • New: Adversarial GLUE by @jxmorris12 in https://github.com/huggingface/datasets/pull/3849
    • New: MetaShift by @dnaveenr in https://github.com/huggingface/datasets/pull/3900
    • New: GSM8K by @jon-tow in https://github.com/huggingface/datasets/pull/4103
    • New: SBU Captions Photo by @thomasw21 in https://github.com/huggingface/datasets/pull/4130
    • Deprecated: Multilingual Librispeech - deprecate dataset in favor of facebook/multilingual_librispeech by @polinaeterna in https://github.com/huggingface/datasets/pull/4060
    • Update (BREAKING): TIMIT - Redirect users to download data manually from LDC by @lhoestq in https://github.com/huggingface/datasets/pull/4145
    • Update: Wikipedia by @albertvillanova in https://github.com/huggingface/datasets/pull/3821 and https://github.com/huggingface/datasets/pull/3989
    • Update: conll2012_ontonotesv5 - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4002
    • Update: daily_dialog - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4008
    • Update: id_clickbait - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4014
    • Update: blimp - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4016
    • Update: scan - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4017
    • Update: yelp_review_full - Replace data url by @lhoestq in https://github.com/huggingface/datasets/pull/4018
    • Update: yelp_polarity - Support streaming by @lhoestq in https://github.com/huggingface/datasets/pull/4019
    • Update: amazon_polarity - Replace data URL by @lhoestq in https://github.com/huggingface/datasets/pull/4020
    • Update: dbpedia_14 - Replace data url by @lhoestq in https://github.com/huggingface/datasets/pull/4022
    • Update: xtreme - Support streaming dataset for bucc18 config by @albertvillanova in https://github.com/huggingface/datasets/pull/4026
    • Update: yahoo_answers_topics - Replace data url by @lhoestq in https://github.com/huggingface/datasets/pull/4023
    • Update: ASSIN 2 dataset - replace broken Google Drive URLS by links on github by @ruanchaves in https://github.com/huggingface/datasets/pull/4004
    • Update: xcopa - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4039
    • Update: medical_dialog - Add configs with processed data by @albertvillanova in https://github.com/huggingface/datasets/pull/4127
    • Update: xtreme - Support streaming for udpos config by @albertvillanova in https://github.com/huggingface/datasets/pull/4131
    • Update: xtreme - Support streaming for PAWS-X config by @albertvillanova in https://github.com/huggingface/datasets/pull/4132
    • Update: xtreme - Support streaming for PAN-X config by @albertvillanova in https://github.com/huggingface/datasets/pull/4135
    • Update: SQuAD v2 - Use a constant for the articles regex by @bryant1410 in https://github.com/huggingface/datasets/pull/4030
    • Update: HANS - Support streaming by @mariosasko in https://github.com/huggingface/datasets/pull/4155
    • Fix: cats_vs_dogs - fix checksum error dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4033
    • Fix: xcopa - fix null checksum by @albertvillanova in https://github.com/huggingface/datasets/pull/4034
    • Fix: amazon_us_reviews - fix metadata - 4/4/2022 by @trentonstrong in https://github.com/huggingface/datasets/pull/4092

    Dataset Cards

    • Updated annotations for nli_tr dataset by @e-budur in https://github.com/huggingface/datasets/pull/4058
    • Add missing label for emotion description by @lijiazheng99 in https://github.com/huggingface/datasets/pull/4151
    • Remove unnecessary 'pylint disable' message in ReadMe by @Datta0 in https://github.com/huggingface/datasets/pull/3955
    • Improve RedCaps dataset card by @mariosasko in https://github.com/huggingface/datasets/pull/4100
    • Fix duplicate key in multi_news by @lhoestq in https://github.com/huggingface/datasets/pull/4164

    Datasets Tags and Search on the Hugging Face Hub

    • Tasks alignment with models by @lhoestq in https://github.com/huggingface/datasets/pull/4066
    • Update datasets task tags to align tags with models by @lhoestq in https://github.com/huggingface/datasets/pull/4067

    Metrics Changes

    • Xtreme-S Metrics by @patrickvonplaten in https://github.com/huggingface/datasets/pull/3799
    • Fix xtreme s metrics by @patrickvonplaten in https://github.com/huggingface/datasets/pull/3957
    • Avoid info log messages from transformers in FrugalScore metric by @albertvillanova in https://github.com/huggingface/datasets/pull/3938
    • Add exact match metric by @emibaylor in https://github.com/huggingface/datasets/pull/3899
    • Fix comet metric by @lhoestq in https://github.com/huggingface/datasets/pull/3945
    • Add zero_division argument to precision and recall metrics by @albertvillanova in https://github.com/huggingface/datasets/pull/4035
    • Support float data types in pearsonr/spearmanr metrics by @albertvillanova in https://github.com/huggingface/datasets/pull/4054
    • Remove GLEU metric by @emibaylor in https://github.com/huggingface/datasets/pull/3949
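
    The zero_division argument mentioned above matters when a metric's denominator is zero, for example precision when no positive predictions were made. The sketch below shows the idea for binary precision in plain Python; it assumes the argument is simply the value returned in that undefined case (scikit-learn's precision_score has a parameter of the same name), and the function itself is illustrative rather than the metric's actual code.

```python
from typing import Sequence


def precision(
    references: Sequence[int],
    predictions: Sequence[int],
    positive: int = 1,
    zero_division: float = 0.0,
) -> float:
    """Binary precision: true positives divided by predicted positives."""
    true_pos = sum(
        1 for r, p in zip(references, predictions) if p == positive and r == positive
    )
    pred_pos = sum(1 for p in predictions if p == positive)
    if pred_pos == 0:
        # Precision is undefined here; return the caller-chosen fallback.
        return zero_division
    return true_pos / pred_pos
```

    Recall would follow the same pattern with the number of actual positives as the denominator.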

    Metric Cards

    • Perplexity Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/3905
    • Create README.md by @sashavor in https://github.com/huggingface/datasets/pull/3917
    • Create README.md for CER metric by @sashavor in https://github.com/huggingface/datasets/pull/3911
    • Create README.md by @sashavor in https://github.com/huggingface/datasets/pull/3944
    • Update README.md by @sashavor in https://github.com/huggingface/datasets/pull/3933
    • Create SARI metric card by @sashavor in https://github.com/huggingface/datasets/pull/3932
    • Create MAUVE metric card by @sashavor in https://github.com/huggingface/datasets/pull/3934
    • Create CoVAL metric card by @sashavor in https://github.com/huggingface/datasets/pull/3940
    • Google BLEU Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/3948
    • Create metric card for BERTScore by @sashavor in https://github.com/huggingface/datasets/pull/3966
    • Rename wer to cer by @pmgautam in https://github.com/huggingface/datasets/pull/4012
    • Create metric card for XNLI by @sashavor in https://github.com/huggingface/datasets/pull/4046
    • Create metric card for the Code Eval metric by @sashavor in https://github.com/huggingface/datasets/pull/4049
    • Add TER metric card by @emibaylor in https://github.com/huggingface/datasets/pull/3981
    • BLEU metric card by @emibaylor in https://github.com/huggingface/datasets/pull/3947
    • Create metric card for CUAD by @sashavor in https://github.com/huggingface/datasets/pull/4043
    • Create metric card for METEOR by @sashavor in https://github.com/huggingface/datasets/pull/4065
    • Create a metric card for Competition MATH by @sashavor in https://github.com/huggingface/datasets/pull/4073
    • Create metric card for seqeval by @sashavor in https://github.com/huggingface/datasets/pull/4070
    • Create README.md by @sashavor in https://github.com/huggingface/datasets/pull/3930
    • Create metric card for Frugal Score by @sashavor in https://github.com/huggingface/datasets/pull/4089
    • Updating FrugalScore metric card by @sashavor in https://github.com/huggingface/datasets/pull/4097
    • Proposing WikiSplit metric card by @sashavor in https://github.com/huggingface/datasets/pull/4098
    • Fix formatting in BLEU metric card by @mariosasko in https://github.com/huggingface/datasets/pull/4157

    Documentation

    • Doc maintenance by @stevhliu in https://github.com/huggingface/datasets/pull/3926
    • [Doc] Don't use v for version tags on GitHub by @sgugger in https://github.com/huggingface/datasets/pull/3943
    • Use templates for doc-building jobs by @sgugger in https://github.com/huggingface/datasets/pull/3914
    • Add align_labels_with_mapping docs by @stevhliu in https://github.com/huggingface/datasets/pull/3931
    • Add tip on how to speed up loading with ImageFolder by @mariosasko in https://github.com/huggingface/datasets/pull/3980
    • Fix main_classes docs index by @lhoestq in https://github.com/huggingface/datasets/pull/3925
    • More consistent references in docs by @mariosasko in https://github.com/huggingface/datasets/pull/3988
    • Docs maintenance by @stevhliu in https://github.com/huggingface/datasets/pull/3999
    • Add ROUGE Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4076
    • Add chrF(++) Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4082
    • Add SacreBLEU Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4083

    General improvements and bug fixes

    • Fix flatten of complex feature types by @mariosasko in https://github.com/huggingface/datasets/pull/3723
    • Fix flatten of Sequence feature type by @lhoestq in https://github.com/huggingface/datasets/pull/3962
    • Exclude Google Drive tests of the CI by @lhoestq in https://github.com/huggingface/datasets/pull/3982
    • Close PIL.Image file handler in Image.decode_example by @mariosasko in https://github.com/huggingface/datasets/pull/3995
    • Fix Faiss custom_index device by @albertvillanova in https://github.com/huggingface/datasets/pull/3987
    • Fix None issue with Sequence of dict by @lhoestq in https://github.com/huggingface/datasets/pull/4010
    • Update main readme by @lhoestq in https://github.com/huggingface/datasets/pull/3927
    • Fix map remove_columns on empty dataset by @lhoestq in https://github.com/huggingface/datasets/pull/4021
    • Fix Audio.encode_example() when writing an array by @polinaeterna in https://github.com/huggingface/datasets/pull/3998
    • Use audio feature in ASR task template by @lhoestq in https://github.com/huggingface/datasets/pull/4006
    • Improve out of bounds error message by @lhoestq in https://github.com/huggingface/datasets/pull/4068
    • Increase max retries for GitHub metrics by @albertvillanova in https://github.com/huggingface/datasets/pull/4063
    • Fix CLI dummy data generation by @albertvillanova in https://github.com/huggingface/datasets/pull/4045
    • Fix docs on audio feature installation by @albertvillanova in https://github.com/huggingface/datasets/pull/4028
    • Add installation instructions to image_process doc by @mariosasko in https://github.com/huggingface/datasets/pull/4072
    • Fix GithubMetricModuleFactory instantiation with None download_config by @albertvillanova in https://github.com/huggingface/datasets/pull/4078
    • Increase max retries for GitHub datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/4079
    • Close parquet writer properly in push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/4081
    • fix typo in rename_column error message by @hunterlang in https://github.com/huggingface/datasets/pull/4095
    • Fix BeamWriter output Parquet file by @albertvillanova in https://github.com/huggingface/datasets/pull/4087
    • Remove unused legacy Beam utils by @albertvillanova in https://github.com/huggingface/datasets/pull/4088
    • Hotfix failing CI tests on Windows by @albertvillanova in https://github.com/huggingface/datasets/pull/4119
    • Update security policy by @albertvillanova in https://github.com/huggingface/datasets/pull/4111
    • Avoid writing empty license files by @albertvillanova in https://github.com/huggingface/datasets/pull/4090
    • Support huggingface_hub 0.5 by @lhoestq in https://github.com/huggingface/datasets/pull/4106
    • Pretty print dataset info files by @mariosasko in https://github.com/huggingface/datasets/pull/4116
    • Add single dataset citations for TweetEval by @gchhablani in https://github.com/huggingface/datasets/pull/4137
    • Adjust path to datasets tutorial in How-To by @NimaBoscarino in https://github.com/huggingface/datasets/pull/4147
    • Applied index-filters on scores in search.py by @vishalsrao in https://github.com/huggingface/datasets/pull/3971
    • More robust cast_to_python_objects in TypedSequence by @mariosasko in https://github.com/huggingface/datasets/pull/4128
    • Sync Features dictionaries by @mariosasko in https://github.com/huggingface/datasets/pull/3997
    • Avoid rate limit in update hub repositories by @lhoestq in https://github.com/huggingface/datasets/pull/4167

    New Contributors

    • @bp-high made their first contribution in https://github.com/huggingface/datasets/pull/3972
    • @ruanchaves made their first contribution in https://github.com/huggingface/datasets/pull/4004
    • @pmgautam made their first contribution in https://github.com/huggingface/datasets/pull/4012
    • @hunterlang made their first contribution in https://github.com/huggingface/datasets/pull/4095
    • @trentonstrong made their first contribution in https://github.com/huggingface/datasets/pull/4092
    • @NimaBoscarino made their first contribution in https://github.com/huggingface/datasets/pull/4147
    • @jon-tow made their first contribution in https://github.com/huggingface/datasets/pull/4103
    • @lijiazheng99 made their first contribution in https://github.com/huggingface/datasets/pull/4151
    • @Datta0 made their first contribution in https://github.com/huggingface/datasets/pull/3955
    • @vishalsrao made their first contribution in https://github.com/huggingface/datasets/pull/3971

    Full Changelog: https://github.com/huggingface/datasets/compare/2.0.0...2.1.0

  • 2.0.0 (Mar 15, 2022)

    🤗 Datasets 2.0.0

    We're happy to announce that our new documentation is available at hf.co/docs/datasets!

    Dataset Features

    • Load a folder of images using the imagefolder dataset loader:
      • Add imagefolder dataset by @nateraw in https://github.com/huggingface/datasets/pull/2830
      • Faster ImageFolder + add option to drop labels by @mariosasko in https://github.com/huggingface/datasets/pull/3887
    • Push your image and audio datasets on the Hugging Face Hub with push_to_hub:
      • Add support for Audio and Image feature in push_to_hub by @mariosasko in https://github.com/huggingface/datasets/pull/3685
    • New processing methods for streaming datasets:
      • Add IterableDataset.filter by @lhoestq in https://github.com/huggingface/datasets/pull/3826
      • Manipulate columns on IterableDataset (rename columns, cast, etc.) by @lhoestq in https://github.com/huggingface/datasets/pull/3862
      • Add the new methods to IterableDatasetDict by @lhoestq in https://github.com/huggingface/datasets/pull/3923
    • And more:
      • Add more compression types for to_json by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/3551
      • Multi-GPU support for FaissIndex by @rentruewang in https://github.com/huggingface/datasets/pull/3721

    Breaking changes

    • API changes for map and shuffle for datasets loaded in streaming mode:
      • Align map when streaming: update instead of overwrite + add missing parameters by @lhoestq in https://github.com/huggingface/datasets/pull/3801
      • Align IterableDataset.shuffle with Dataset.shuffle by @lhoestq in https://github.com/huggingface/datasets/pull/3842
    • Rename GenerateMode to DownloadMode by @albertvillanova in https://github.com/huggingface/datasets/pull/3759
    • Remove deprecated methods/params (preparation for v2.0) by @mariosasko in https://github.com/huggingface/datasets/pull/3803
    • Remove deprecated remove_columns param in filter by @mariosasko in https://github.com/huggingface/datasets/pull/3827
    • Module namespace cleanup for v2.0 by @mariosasko in https://github.com/huggingface/datasets/pull/3875

    Dataset Changes

    • New: CFPB Consumer Complaints by @kayvane1 in https://github.com/huggingface/datasets/pull/3617
    • New: told-br (Brazilian hate speech) by @JAugusto97 in https://github.com/huggingface/datasets/pull/3683
    • New: electricity load diagram by @kashif in https://github.com/huggingface/datasets/pull/3722
    • New: MIT Scene Parsing Benchmark by @mariosasko in https://github.com/huggingface/datasets/pull/3607
    • New: ElkarHizketak v1.0 by @antxa in https://github.com/huggingface/datasets/pull/3780
    • New: wikitablequestions by @SivilTaram in https://github.com/huggingface/datasets/pull/3870
    • New: ontonotes_conll by @richarddwang in https://github.com/huggingface/datasets/pull/3853
    • Update: BnL Historical Newspapers - make the dataset streamable by @albertvillanova in https://github.com/huggingface/datasets/pull/3616
    • Update: Common Voice - add validated partition by @shalymin-amzn in https://github.com/huggingface/datasets/pull/3669
    • Update: Common Voice - add local paths to audio files by @lhoestq in https://github.com/huggingface/datasets/pull/3736
    • Update: Common Voice - simplify code by @lhoestq in https://github.com/huggingface/datasets/pull/3817
    • Update: Natural Questions - add dev-only configuration by @albertvillanova in https://github.com/huggingface/datasets/pull/3699
    • Update: pubmed - update data url by @albertvillanova in https://github.com/huggingface/datasets/pull/3692
    • Update: pubmed - make the dataset streamable by @abhi-mosaic in https://github.com/huggingface/datasets/pull/3740
    • Update: RedCaps - make the dataset streamable by @mariosasko in https://github.com/huggingface/datasets/pull/3737
    • Update: cats_vs_dogs - update metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/3752
    • Update: newsroom - update manual download url by @albertvillanova in https://github.com/huggingface/datasets/pull/3779
    • Update: xcopa - update to new version by @albertvillanova in https://github.com/huggingface/datasets/pull/3810
    • Update: cats_vs_dogs size by @mariosasko in https://github.com/huggingface/datasets/pull/3878
    • Fix: sem_eval_2018_task_1 - fix download location by @maxpel in https://github.com/huggingface/datasets/pull/3643
    • Fix: newsqa - fix unique keys by @albertvillanova in https://github.com/huggingface/datasets/pull/3696
    • Fix: The Pile datasets - fix host urls by @albertvillanova in https://github.com/huggingface/datasets/pull/3627
    • Fix: Evidence Infer Treatment - fix dataset script by @albertvillanova in https://github.com/huggingface/datasets/pull/3718
    • Fix: NewsQA - fix dataset script by @albertvillanova in https://github.com/huggingface/datasets/pull/3734
    • Fix: head_qa - fix data url by @albertvillanova in https://github.com/huggingface/datasets/pull/3766
    • Fix: msr_sqa - fix unique keys by @albertvillanova in https://github.com/huggingface/datasets/pull/3771
    • Fix: reddit_tifu - fix data url by @albertvillanova in https://github.com/huggingface/datasets/pull/3774
    • Fix: wiki_lingua - fix Spanish data file url by @albertvillanova in https://github.com/huggingface/datasets/pull/3806
    • Fix: beans - fix data urls by @mariosasko in https://github.com/huggingface/datasets/pull/3890
    • Fix: CRD3 - fix NonMatchingChecksumError by @albertvillanova in https://github.com/huggingface/datasets/pull/3921
    • Fix: MultiWOZ 2.2 - fix NonMatchingChecksumError by @albertvillanova in https://github.com/huggingface/datasets/pull/3922

    Dataset cards

    • Add code example in wikipedia card by @lhoestq in https://github.com/huggingface/datasets/pull/3678
    • Fix Multi-News dataset metadata and card by @albertvillanova in https://github.com/huggingface/datasets/pull/3731
    • Reddit dataset card additions by @anna-kay in https://github.com/huggingface/datasets/pull/3781
    • Update gigaword card and info by @mariosasko in https://github.com/huggingface/datasets/pull/3775
    • Reddit dataset card contribution by @anna-kay in https://github.com/huggingface/datasets/pull/3797

    Metric Changes

    • New: FrugalScore by @moussaKam in https://github.com/huggingface/datasets/pull/3674
    • New: Mahalanobis distance by @JoaoLages in https://github.com/huggingface/datasets/pull/3794
    • New: mIoU by @NielsRogge in https://github.com/huggingface/datasets/pull/3745
    • New: MSE and MAE - V2 by @dnaveenr in https://github.com/huggingface/datasets/pull/3874
    • Fix: METEOR - fix bug due to nltk version by @albertvillanova in https://github.com/huggingface/datasets/pull/3884

    Metric cards

    • Add perplexity to metrics by @emibaylor in https://github.com/huggingface/datasets/pull/3757
    • Create SQuAD metric README.md by @sashavor in https://github.com/huggingface/datasets/pull/3873
    • SQuAD v2 metric: create README.md by @sashavor in https://github.com/huggingface/datasets/pull/3879
    • Update README.md for SQuAD v2 metric by @sashavor in https://github.com/huggingface/datasets/pull/3908
    • Update README.md for SQuAD metric by @sashavor in https://github.com/huggingface/datasets/pull/3907
    • Create README.md for WER metric by @sashavor in https://github.com/huggingface/datasets/pull/3898
    • Create README.md for GLUE by @sashavor in https://github.com/huggingface/datasets/pull/3916

    New documentation

    • Update docs to new frontend/UI by @mishig25 in https://github.com/huggingface/datasets/pull/3690
    • Image process doc by @stevhliu in https://github.com/huggingface/datasets/pull/3882

    General improvements and bug fixes

    • Better TQDM output by @mariosasko in https://github.com/huggingface/datasets/pull/3654
    • Prioritize module.builder_kwargs over defaults in TestCommand by @lvwerra in https://github.com/huggingface/datasets/pull/3672
    • Extend support for streaming datasets that use os.path.relpath by @albertvillanova in https://github.com/huggingface/datasets/pull/3623
    • Add Fon language tag by @albertvillanova in https://github.com/huggingface/datasets/pull/3620
    • Remove unnecessary 'r' arg in by @bryant1410 in https://github.com/huggingface/datasets/pull/3661
    • Fix TestCommand to copy dataset_infos to local dir with only data files by @albertvillanova in https://github.com/huggingface/datasets/pull/3680
    • Upgrade black to version ~=22.0 by @LysandreJik in https://github.com/huggingface/datasets/pull/3691
    • Fix streaming for servers not supporting HTTP range requests by @albertvillanova in https://github.com/huggingface/datasets/pull/3689
    • Pin ElasticSearch by @lhoestq in https://github.com/huggingface/datasets/pull/3701
    • Raise informative error when loading a save_to_disk dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/3705
    • Fix ClassLabel to/from dict when passed names_file by @albertvillanova in https://github.com/huggingface/datasets/pull/3695
    • Fix CI code quality issue by @albertvillanova in https://github.com/huggingface/datasets/pull/3710
    • Check if indices values in Dataset.select are within bounds by @mariosasko in https://github.com/huggingface/datasets/pull/3719
    • Pin pandas to avoid bug in streaming mode by @albertvillanova in https://github.com/huggingface/datasets/pull/3725
    • Use config pandas version in CSV dataset builder by @albertvillanova in https://github.com/huggingface/datasets/pull/3726
    • Set base path to hub url for canonical datasets by @lhoestq in https://github.com/huggingface/datasets/pull/3709
    • Fix ValueError message formatting in int2str by @akulchik in https://github.com/huggingface/datasets/pull/3742
    • Patch all module attributes in its namespace by @albertvillanova in https://github.com/huggingface/datasets/pull/3727
    • Fix typo in train split name by @albertvillanova in https://github.com/huggingface/datasets/pull/3751
    • feat: 🎸 generate info if dataset_infos.json does not exist by @severo in https://github.com/huggingface/datasets/pull/3670
    • Support streaming in size estimation function in push_to_hub by @mariosasko in https://github.com/huggingface/datasets/pull/3732
    • Expose method and fix param by @severo in https://github.com/huggingface/datasets/pull/3767
    • Fix HfFileSystem docstring by @lhoestq in https://github.com/huggingface/datasets/pull/3768
    • process .opus files (for Multilingual Spoken Words) by @polinaeterna in https://github.com/huggingface/datasets/pull/3666
    • Fix: dataset name is stored in keys by @thomasw21 in https://github.com/huggingface/datasets/pull/3772
    • Use the same seed to shuffle shards and metadata in streaming mode by @lhoestq in https://github.com/huggingface/datasets/pull/3746
    • Start removing canonical datasets logic by @lhoestq in https://github.com/huggingface/datasets/pull/3777
    • Support passing str to iter_files by @albertvillanova in https://github.com/huggingface/datasets/pull/3783
    • Fix Google Drive URL to avoid Virus scan warning by @albertvillanova in https://github.com/huggingface/datasets/pull/3787
    • Skip checksum computation if ignore_verifications is True by @mariosasko in https://github.com/huggingface/datasets/pull/3796
    • Fix error message in CSV loader for newer Pandas versions by @mariosasko in https://github.com/huggingface/datasets/pull/3798
    • Add data_dir to data_files resolution and misc improvements to HfFileSystem by @mariosasko in https://github.com/huggingface/datasets/pull/3791
    • Error of writing with different schema, due to nonpreservation of nullability by @richarddwang in https://github.com/huggingface/datasets/pull/3782
    • Handle Nones in PyArrow struct by @mariosasko in https://github.com/huggingface/datasets/pull/3814
    • Fix iter_archive getting reset by @lhoestq in https://github.com/huggingface/datasets/pull/3815
    • Added computer vision tasks by @merveenoyan in https://github.com/huggingface/datasets/pull/3800
    • Fix typo in doc build yml by @mishig25 in https://github.com/huggingface/datasets/pull/3819
    • Allow not specifying feature cols other than predictions/references in Metric.compute by @mariosasko in https://github.com/huggingface/datasets/pull/3824
    • Logo float left by @mishig25 in https://github.com/huggingface/datasets/pull/3836
    • Pin responses to fix CI for Windows by @albertvillanova in https://github.com/huggingface/datasets/pull/3840
    • Fix dead dataset scripts creation link by @dnaveenr in https://github.com/huggingface/datasets/pull/3834
    • Remove decode: true for image feature in head_qa by @craffel in https://github.com/huggingface/datasets/pull/3805
    • Update faiss device docstring by @lhoestq in https://github.com/huggingface/datasets/pull/3846
    • Update index.mdx margins by @gary149 in https://github.com/huggingface/datasets/pull/3858
    • Fix push_to_hub with null images by @lhoestq in https://github.com/huggingface/datasets/pull/3856
    • Redundant add dataset information and dead link. by @dnaveenr in https://github.com/huggingface/datasets/pull/3852
    • Update image dataset tags by @mariosasko in https://github.com/huggingface/datasets/pull/3864
    • Bring back imgs so that forks don't get broken by @mishig25 in https://github.com/huggingface/datasets/pull/3866
    • Small typos in How-to-train tutorial by @lkhphuc in https://github.com/huggingface/datasets/pull/3833
    • Small doc fixes by @mishig25 in https://github.com/huggingface/datasets/pull/3860
    • add pandas to env command by @patrickvonplaten in https://github.com/huggingface/datasets/pull/3871
    • Ignore duplicate keys if ignore_verifications=True by @mariosasko in https://github.com/huggingface/datasets/pull/3868
    • Update code blocks by @lhoestq in https://github.com/huggingface/datasets/pull/3863
    • Fix download_mode in dataset_module_factory by @albertvillanova in https://github.com/huggingface/datasets/pull/3876
    • Fix some shuffle docs by @lhoestq in https://github.com/huggingface/datasets/pull/3885
    • Fix race condition in doc build by @lhoestq in https://github.com/huggingface/datasets/pull/3891
    • Add default branch for doc building by @sgugger in https://github.com/huggingface/datasets/pull/3893
    • [docs] make dummy data creation optional by @lhoestq in https://github.com/huggingface/datasets/pull/3894
    • Fix code examples indentation by @lhoestq in https://github.com/huggingface/datasets/pull/3895
    • Align tqdm control/cache control with Transformers by @mariosasko in https://github.com/huggingface/datasets/pull/3897
    • Fix CLI test checksums by @albertvillanova in https://github.com/huggingface/datasets/pull/3892
    • Fix Google Drive URL to avoid Virus scan warning in streaming mode by @mariosasko in https://github.com/huggingface/datasets/pull/3843
    • Change the framework switches to the new syntax by @sgugger in https://github.com/huggingface/datasets/pull/3880
    • Document cases for GitHub datasets by @lhoestq in https://github.com/huggingface/datasets/pull/3924
    • Fix text loader to split only on universal newlines by @albertvillanova in https://github.com/huggingface/datasets/pull/3910
    • Retry HfApi call inside push_to_hub when 504 error by @albertvillanova in https://github.com/huggingface/datasets/pull/3886

    New Contributors

    • @kayvane1 made their first contribution in https://github.com/huggingface/datasets/pull/3617
    • @JAugusto97 made their first contribution in https://github.com/huggingface/datasets/pull/3683
    • @shalymin-amzn made their first contribution in https://github.com/huggingface/datasets/pull/3669
    • @kashif made their first contribution in https://github.com/huggingface/datasets/pull/3722
    • @akulchik made their first contribution in https://github.com/huggingface/datasets/pull/3742
    • @abhi-mosaic made their first contribution in https://github.com/huggingface/datasets/pull/3740
    • @emibaylor made their first contribution in https://github.com/huggingface/datasets/pull/3757
    • @anna-kay made their first contribution in https://github.com/huggingface/datasets/pull/3781
    • @JoaoLages made their first contribution in https://github.com/huggingface/datasets/pull/3794
    • @mishig25 made their first contribution in https://github.com/huggingface/datasets/pull/3690
    • @antxa made their first contribution in https://github.com/huggingface/datasets/pull/3780
    • @dnaveenr made their first contribution in https://github.com/huggingface/datasets/pull/3834
    • @lkhphuc made their first contribution in https://github.com/huggingface/datasets/pull/3833
    • @rentruewang made their first contribution in https://github.com/huggingface/datasets/pull/3721
    • @gary149 made their first contribution in https://github.com/huggingface/datasets/pull/3858
    • @NielsRogge made their first contribution in https://github.com/huggingface/datasets/pull/3745
    • @sashavor made their first contribution in https://github.com/huggingface/datasets/pull/3873
    • @SivilTaram made their first contribution in https://github.com/huggingface/datasets/pull/3870

    Full Changelog: https://github.com/huggingface/datasets/compare/1.18.3...2.0.0

  • 1.18.4 (Mar 7, 2022)

    Bug fixes

    • Prioritize module.builder_kwargs over defaults in TestCommand #3672 (@lvwerra)
    • Fix TestCommand to copy dataset_infos to local dir with only data files #3680 (@albertvillanova)
    • Upgrade black to version ~=22.0 #3691 (@LysandreJik)
    • Fix streaming for servers not supporting HTTP range requests #3689 (@albertvillanova)
    • Pin ElasticSearch #3701 (@lhoestq)
    • Fix ClassLabel to/from dict when passed names_file #3695 (@albertvillanova)
    • Fix CI code quality issue #3710 (@albertvillanova)
    • Check if indices values in Dataset.select are within bounds #3719 (@mariosasko)
    • Pin pandas to avoid bug in streaming mode #3725 (@albertvillanova)
    • Use config pandas version in CSV dataset builder #3726 (@albertvillanova)
    • Fix dataset mirroring (@lhoestq)
    • Fix ValueError message formatting in int2str #3742 (@akulchik)
    • Patch all module attributes in its namespace #3727 (@albertvillanova)
    • Fix HfFileSystem docstring #3768 (@lhoestq)
    • Fix: dataset name is stored in keys #3772 (@thomasw21)
    • Fix Google Drive URL to avoid Virus scan warning #3787 (@albertvillanova)
    • Fix error message in CSV loader for newer Pandas versions #3798 (@mariosasko)
    • Pin responses to fix CI for Windows #3840 (@albertvillanova)

    Full Changelog: https://github.com/huggingface/datasets/compare/1.18.3...1.18.4

  • 1.18.3 (Feb 2, 2022)

    Bug fixes

    • Fix MP3 resampling when a dataset's audio files have different sampling rates by @lhoestq in https://github.com/huggingface/datasets/pull/3665
    • Extend dataset builder for streaming in get_dataset_split_names by @mariosasko in https://github.com/huggingface/datasets/pull/3657

    Dataset changes

    • New: Turkic X-WMT evaluation set for machine translation by @mirzakhalov in https://github.com/huggingface/datasets/pull/3605
    • New: British Library books dataset by @davanstrien in https://github.com/huggingface/datasets/pull/3603
    • Fix: wiki_bio - Update link by @jxmorris12 in https://github.com/huggingface/datasets/pull/3651

    Other improvements

    • sp. Columbia => Colombia by @serapio in https://github.com/huggingface/datasets/pull/3652
    • Run pyupgrade for Python 3.6+ by @bryant1410 in https://github.com/huggingface/datasets/pull/3560

    New Contributors

    • @serapio made their first contribution in https://github.com/huggingface/datasets/pull/3652
    • @mirzakhalov made their first contribution in https://github.com/huggingface/datasets/pull/3605

    Full Changelog: https://github.com/huggingface/datasets/compare/1.18.2...1.18.3

  • 1.18.2 (Jan 28, 2022)

    Bug fixes

    • Fix streaming datasets that are not reset correctly by @lhoestq in https://github.com/huggingface/datasets/pull/3646
    • Fix numpy rngs when shuffling with seed=None by @mariosasko in https://github.com/huggingface/datasets/pull/3641
    • Fix dataset slicing with negative bounds when indices mapping is not None by @mariosasko in https://github.com/huggingface/datasets/pull/3642
    • Fix add_column on datasets with indices mapping by @mariosasko in https://github.com/huggingface/datasets/pull/3647

    Other improvements

    • Update index.rst by @VioletteLepercq in https://github.com/huggingface/datasets/pull/3636
    • Fix Windows CI: bump python to 3.7 by @lhoestq in https://github.com/huggingface/datasets/pull/3648

    New Contributors

    • @VioletteLepercq made their first contribution in https://github.com/huggingface/datasets/pull/3636

    Full Changelog: https://github.com/huggingface/datasets/compare/1.18.1...1.18.2

  • 1.18.1 (Jan 26, 2022)

    Improvements

    • Make decoding of Audio and Image feature optional by @mariosasko in https://github.com/huggingface/datasets/pull/3430

    Bug fixes

    • Fix prepare_for_task() by @mariosasko in https://github.com/huggingface/datasets/pull/3614
    • Fix: Multilingual Librispeech - fix bad url formatting by @polinaeterna in https://github.com/huggingface/datasets/pull/3619

    Full Changelog: https://github.com/huggingface/datasets/compare/1.18.0...1.18.1

  • 1.18.0 (Jan 21, 2022)

    Datasets Changes

    • New: VCTK
      • Add VCTK dataset by @jaketae in https://github.com/huggingface/datasets/pull/3351
      • Fix VCTK encoding by @lhoestq in https://github.com/huggingface/datasets/pull/3493
      • Docs: Add VCTK dataset description by @jaketae in https://github.com/huggingface/datasets/pull/3500
    • New: CPPE-5 dataset by @mariosasko in https://github.com/huggingface/datasets/pull/3517
    • New: RedCaps dataset by @mariosasko in https://github.com/huggingface/datasets/pull/3424
    • New: WIDER FACE dataset by @mariosasko in https://github.com/huggingface/datasets/pull/3413
    • New: SVHN dataset by @mariosasko in https://github.com/huggingface/datasets/pull/3535
    • New: BNL newspapers by @davanstrien in https://github.com/huggingface/datasets/pull/3397
    • New: PASS dataset by @mariosasko in https://github.com/huggingface/datasets/pull/3576
    • New: Text2log Dataset by @apergo-ai in https://github.com/huggingface/datasets/pull/3579
    • Update: beans, cats_vs_dogs - Use iter_files instead of str(Path(...)) in image dataset by @mariosasko in https://github.com/huggingface/datasets/pull/3477
    • Update: PIB - update version and make it streamable by @albertvillanova in https://github.com/huggingface/datasets/pull/3496
    • Update: code_x_glue_tt_text_to_text, compguesswhat - Remove print statements in datasets by @mariosasko in https://github.com/huggingface/datasets/pull/3546
    • Update: MuchoCine - add missing tasks by @mariosasko in https://github.com/huggingface/datasets/pull/3571
    • Fix: Tashkeela - fix to yield stripped text by @albertvillanova in https://github.com/huggingface/datasets/pull/3471
    • Fix: asset - change to raw.githubusercontent.com URLs by @VictorSanh in https://github.com/huggingface/datasets/pull/3516
    • Fix: CC100 - use HTTPS for the data source URL by @aajanki in https://github.com/huggingface/datasets/pull/3519
    • Fix: vision datasets - Fix bug in ImageClassification task template by @mariosasko in https://github.com/huggingface/datasets/pull/3557
    • Fix: tweet_qa - fix DuplicatedKeysError and improve card by @mariosasko in https://github.com/huggingface/datasets/pull/3559
    • Fix: mC4 - fix multiple language downloading by @polinaeterna in https://github.com/huggingface/datasets/pull/3594
    • Fix: CoNLL2003:
      • Use old url for conll2003 by @lhoestq in https://github.com/huggingface/datasets/pull/3600
      • Update url for conll2003 by @lhoestq in https://github.com/huggingface/datasets/pull/3602
      • Add conll2003 licensing by @lhoestq in https://github.com/huggingface/datasets/pull/3601

    Datasets Features

    • [Time series] Add support for time, date, duration, and decimal dtypes by @mariosasko in https://github.com/huggingface/datasets/pull/3591
    • [Image][Audio] Add flexible casting for Image and Audio + Support nested casting by @lhoestq in https://github.com/huggingface/datasets/pull/3575
    • Allows DatasetDict.filter to have batching option by @thomasw21 in https://github.com/huggingface/datasets/pull/3506
    • Add desc parameter to filter by @mariosasko in https://github.com/huggingface/datasets/pull/3513
    • Add gzip for to_json by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/3492
    • Allow multiple task templates of the same type by @mariosasko in https://github.com/huggingface/datasets/pull/3562
    • Add parameter preserve_index to from_pandas by @Sorrow321 in https://github.com/huggingface/datasets/pull/3565
    • Dataset Streaming:
      • Fix str(Path(...)) conversion in streaming on Linux by @mariosasko in https://github.com/huggingface/datasets/pull/3472
      • Extend support for streaming datasets that use ET.parse by @albertvillanova in https://github.com/huggingface/datasets/pull/3476
      • Extend support for streaming datasets that use os.walk by @albertvillanova in https://github.com/huggingface/datasets/pull/3478

    Metrics Changes

    • Add Mauve metric by @jthickstun in https://github.com/huggingface/datasets/pull/3573

    Dataset cards

    • update pretty_name for first 200 datasets by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/3498
    • update pretty_name for all the other datasets by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/3536
    • pib: Update pib dataset card by @albertvillanova in https://github.com/huggingface/datasets/pull/3501
    • arabic_speech_corpus: Adding link to license. by @meg-huggingface in https://github.com/huggingface/datasets/pull/3524
    • Covost2: Update README.md by @meg-huggingface in https://github.com/huggingface/datasets/pull/3528
    • librispeech_asr: Update README.md by @meg-huggingface in https://github.com/huggingface/datasets/pull/3529
    • vivos: Update README.md by @meg-huggingface in https://github.com/huggingface/datasets/pull/3530
    • audio datasets: Audio datacard update - first pass by @meg-huggingface in https://github.com/huggingface/datasets/pull/3520
    • common_language: Update README.md by @meg-huggingface in https://github.com/huggingface/datasets/pull/3527
    • wiki_dpr: Update wiki_dpr README.md by @lhoestq in https://github.com/huggingface/datasets/pull/3534
    • qa4mre: Fix qa4mre tags by @lhoestq in https://github.com/huggingface/datasets/pull/3574
    • HellaSwag: Update HellaSwag README.md by @borgr in https://github.com/huggingface/datasets/pull/3588
    • ANLI: Update ANLI README.md by @borgr in https://github.com/huggingface/datasets/pull/3590
    • tweet_eval: Update README.md by @borgr in https://github.com/huggingface/datasets/pull/3593

    Documentation

    • Fix rendering of docs by @albertvillanova in https://github.com/huggingface/datasets/pull/3470
    • Fix to_tf_dataset references in docs by @mariosasko in https://github.com/huggingface/datasets/pull/3514
    • added PII statements and license links to data cards by @mcmillanmajora in https://github.com/huggingface/datasets/pull/3537
    • Readme usage update by @meg-huggingface in https://github.com/huggingface/datasets/pull/3538
    • Update the CC-100 dataset card by @aajanki in https://github.com/huggingface/datasets/pull/3542
    • Research wording for nc licenses by @meg-huggingface in https://github.com/huggingface/datasets/pull/3539
    • Added links to licensing and PII message in vctk dataset by @mcmillanmajora in https://github.com/huggingface/datasets/pull/3523
    • Give clearer instructions to add the YAML tags by @albertvillanova in https://github.com/huggingface/datasets/pull/3532

    General improvements and bug fixes

    • Fix overriding of filesystem info by @albertvillanova in https://github.com/huggingface/datasets/pull/3481
    • Update ADD_NEW_DATASET.md by @apergo-ai in https://github.com/huggingface/datasets/pull/3487
    • Fix weird spacing in ManualDownloadError message by @bryant1410 in https://github.com/huggingface/datasets/pull/3486
    • Clone full repo to detect new tags when mirroring datasets on the Hub by @lhoestq in https://github.com/huggingface/datasets/pull/3494
    • Remove unused phony rule from Makefile by @bryant1410 in https://github.com/huggingface/datasets/pull/3483
    • fix: 🐛 pass token when retrieving the split names by @severo in https://github.com/huggingface/datasets/pull/3545
    • Pin torchmetrics to fix the COMET test by @lhoestq in https://github.com/huggingface/datasets/pull/3589
    • Preserve encoding/decoding with features in Iterable.map call by @mariosasko in https://github.com/huggingface/datasets/pull/3556

    New Contributors

    • @apergo-ai made their first contribution in https://github.com/huggingface/datasets/pull/3487
    • @bryant1410 made their first contribution in https://github.com/huggingface/datasets/pull/3486
    • @meg-huggingface made their first contribution in https://github.com/huggingface/datasets/pull/3527
    • @aajanki made their first contribution in https://github.com/huggingface/datasets/pull/3519
    • @Sorrow321 made their first contribution in https://github.com/huggingface/datasets/pull/3565
    • @jthickstun made their first contribution in https://github.com/huggingface/datasets/pull/3573
    • @borgr made their first contribution in https://github.com/huggingface/datasets/pull/3588

    Full Changelog: https://github.com/huggingface/datasets/compare/1.17.0...1.18.0

  • 1.17.0(Dec 21, 2021)

    Dataset Changes

    • New: The Pile
      • Add The Pile dataset and PubMed Central subset by @albertvillanova in https://github.com/huggingface/datasets/pull/3287
      • Add The Pile Free Law subset by @albertvillanova in https://github.com/huggingface/datasets/pull/3359
      • Add The Pile USPTO subset by @albertvillanova in https://github.com/huggingface/datasets/pull/3360
      • Add The Pile subsets by @albertvillanova in https://github.com/huggingface/datasets/pull/3378
      • Add The Pile Enron Emails subset by @albertvillanova in https://github.com/huggingface/datasets/pull/3427
    • New: British Library Books Genre by @davanstrien in https://github.com/huggingface/datasets/pull/3312
    • New: Americas NLI by @fdschmidt93 in https://github.com/huggingface/datasets/pull/3371
    • New: Speech commands by @polinaeterna in https://github.com/huggingface/datasets/pull/3335
    • New: eli5_category by @jingshenSN2 in https://github.com/huggingface/datasets/pull/3420
    • New: OneStopQa by @scaperex in https://github.com/huggingface/datasets/pull/3436
    • Update: LABR - make the dataset streamable by @albertvillanova in https://github.com/huggingface/datasets/pull/3352
    • Update: CLUE benchmark - update cluewsc2020, chid, c3 and tnews by @mariosasko in https://github.com/huggingface/datasets/pull/3376
    • Update: beans, cats_vs_dogs, cifar10, cifar100, fashion_mnist, mnist, head_qa: use the new Image feature type + streaming support by @mariosasko in https://github.com/huggingface/datasets/pull/3362
    • Update: CC100 - add Georgian data by @AnzorGozalishvili in https://github.com/huggingface/datasets/pull/3383
    • Update: disaster_response_messages - update download urls (+ add validation split) by @mariosasko in https://github.com/huggingface/datasets/pull/3426
    • Update: swahili_news - update to new version by @albertvillanova in https://github.com/huggingface/datasets/pull/3463
    • Fix: WikiAuto, Jeopardy, definite_pronoun_resolution - fix URLs by @LashaO in https://github.com/huggingface/datasets/pull/3266
    • Fix: QED - fix type of bridge field by @mariosasko in https://github.com/huggingface/datasets/pull/3417
    • Fix: ASSET - fix dataset data URLs by @tianjianjiang in https://github.com/huggingface/datasets/pull/3342

    Dataset Features

    • Add Image feature by @mariosasko in https://github.com/huggingface/datasets/pull/3163
    • to_tf_dataset() refactor by @Rocketknight1 in https://github.com/huggingface/datasets/pull/3356
    • More robust None handling by @mariosasko in https://github.com/huggingface/datasets/pull/3195
    • Add cast_column to IterableDataset by @mariosasko in https://github.com/huggingface/datasets/pull/3439
    • Support streaming zipped dataset repo by passing only repo name by @albertvillanova in https://github.com/huggingface/datasets/pull/3375
    • Extend support for streaming datasets that use pd.read_excel by @albertvillanova in https://github.com/huggingface/datasets/pull/3355
    • Extend iter_archive to support file object input by @albertvillanova in https://github.com/huggingface/datasets/pull/3443
    • Extend text to support yielding lines, paragraphs or documents by @albertvillanova in https://github.com/huggingface/datasets/pull/3442
    • Push dataset_infos.json to Hub to preserve feature types by @lhoestq in https://github.com/huggingface/datasets/pull/3467
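    The "lines, paragraphs or documents" item above describes three granularities for turning a text file into examples. The sketch below illustrates the idea with plain Python. It is not the library's loader: the function name `iter_samples`, the `sample_by` argument, and the choice of a blank line as the paragraph separator are assumptions made for illustration.

```python
def iter_samples(text, sample_by="line"):
    """Yield examples from raw text at the requested granularity (illustrative only)."""
    if sample_by == "line":
        # One example per line.
        yield from text.splitlines()
    elif sample_by == "paragraph":
        # One example per block of text separated by a blank line (assumed separator).
        for paragraph in text.split("\n\n"):
            if paragraph:
                yield paragraph
    elif sample_by == "document":
        # The whole file is a single example.
        yield text
    else:
        raise ValueError(f"Unknown sample_by value: {sample_by!r}")


text = "first line\nsecond line\n\nnew paragraph"
lines = list(iter_samples(text, "line"))
paragraphs = list(iter_samples(text, "paragraph"))
documents = list(iter_samples(text, "document"))
```

    The coarser the granularity, the fewer and longer the examples, which matters when a downstream tokenizer needs context that spans several lines.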

    Dataset cards

    • Change TriviaQA license (#3313) by @avinashsai in https://github.com/huggingface/datasets/pull/3330
    • Add missing tags to XTREME by @mariosasko in https://github.com/huggingface/datasets/pull/3322
    • Remove duplicate name from dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/3354
    • Fix typos in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/3386
    • Fix duplicated tag in wikicorpus dataset card by @lhoestq in https://github.com/huggingface/datasets/pull/3458

    Dataset Tasks

    • Create Language Modeling task by @albertvillanova in https://github.com/huggingface/datasets/pull/3387

    Metric Changes

    • BLEURT: Match key names to correspond with filename by @jaehlee in https://github.com/huggingface/datasets/pull/3348
    • Fix links in metrics description by @albertvillanova in https://github.com/huggingface/datasets/pull/3461
    • Fix METEOR missing NLTK's omw-1.4 by @lhoestq in https://github.com/huggingface/datasets/pull/3469

    Docs

    • Add ArrayXD docs by @stevhliu in https://github.com/huggingface/datasets/pull/3344
    • Document a training loop for streaming dataset by @lhoestq in https://github.com/huggingface/datasets/pull/3370
    • Fix formatting in IterableDataset.map docs by @mariosasko in https://github.com/huggingface/datasets/pull/3395
    • Correctly indent builder config in dataset script docs by @mariosasko in https://github.com/huggingface/datasets/pull/3432
    • Update BLEURT hyperlink by @lewtun in https://github.com/huggingface/datasets/pull/3437

    Additional improvements and bug fixes

    • Quick fix error formatting by @NouamaneTazi in https://github.com/huggingface/datasets/pull/3328
    • Fix error message and add extension fallback by @mariosasko in https://github.com/huggingface/datasets/pull/3332
    • Avoid content-encoding issue while streaming datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/3350
    • Fix JSON ClassLabel casting for integers by @lhoestq in https://github.com/huggingface/datasets/pull/3340
    • Better error message when download fails by @lhoestq in https://github.com/huggingface/datasets/pull/3343
    • Fix dict source_datasets tagset validator by @albertvillanova in https://github.com/huggingface/datasets/pull/3368
    • Fix typo in other-structured-to-text task tag by @albertvillanova in https://github.com/huggingface/datasets/pull/3367
    • Fix temporary dataset_path creation for URIs related to remote fs by @francisco-perez-sorrosal in https://github.com/huggingface/datasets/pull/3296
    • Fix flaky test of the temporary directory used by load_from_disk by @lhoestq in https://github.com/huggingface/datasets/pull/3388
    • More robust first elem check in encode/cast example by @mariosasko in https://github.com/huggingface/datasets/pull/3402
    • Fix module inference for archive with a directory by @albertvillanova in https://github.com/huggingface/datasets/pull/3406
    • Fix dependencies conflicts in Windows CI after conda update to 4.11 by @lhoestq in https://github.com/huggingface/datasets/pull/3410
    • Pass new_fingerprint in multiprocessing by @lhoestq in https://github.com/huggingface/datasets/pull/3409
    • Fix flaky test again for s3 serialization by @lhoestq in https://github.com/huggingface/datasets/pull/3412
    • Skip None encoding (line deleted by accident in #3195) by @mariosasko in https://github.com/huggingface/datasets/pull/3414
    • Clean squad dummy data by @lhoestq in https://github.com/huggingface/datasets/pull/3428
    • #3337 Add typing overloads to Dataset.__getitem__ for mypy by @Dref360 in https://github.com/huggingface/datasets/pull/3382
    • Make cast cacheable (again) on Windows by @mariosasko in https://github.com/huggingface/datasets/pull/3429
    • Use max number of data files to infer module by @albertvillanova in https://github.com/huggingface/datasets/pull/3407
    • Fix iter_archive generator by @albertvillanova in https://github.com/huggingface/datasets/pull/3454
    • [Staging] Update dataset repos automatically on the Hub by @lhoestq in https://github.com/huggingface/datasets/pull/3451
    • Update supported versions of Python in setup.py by @mariosasko in https://github.com/huggingface/datasets/pull/3438
    • raise exception instead of using assertions. by @manisnesan in https://github.com/huggingface/datasets/pull/3349

    New Contributors

    • @avinashsai made their first contribution in https://github.com/huggingface/datasets/pull/3330
    • @NouamaneTazi made their first contribution in https://github.com/huggingface/datasets/pull/3328
    • @davanstrien made their first contribution in https://github.com/huggingface/datasets/pull/3312
    • @francisco-perez-sorrosal made their first contribution in https://github.com/huggingface/datasets/pull/3296
    • @LashaO made their first contribution in https://github.com/huggingface/datasets/pull/3266
    • @fdschmidt93 made their first contribution in https://github.com/huggingface/datasets/pull/3371
    • @polinaeterna made their first contribution in https://github.com/huggingface/datasets/pull/3335
    • @AnzorGozalishvili made their first contribution in https://github.com/huggingface/datasets/pull/3383
    • @tianjianjiang made their first contribution in https://github.com/huggingface/datasets/pull/3342
    • @jingshenSN2 made their first contribution in https://github.com/huggingface/datasets/pull/3420
    • @scaperex made their first contribution in https://github.com/huggingface/datasets/pull/3436

    Full Changelog: https://github.com/huggingface/datasets/compare/1.16.1...1.17.0

  • 1.16.1(Nov 26, 2021)

    Bug fixes

    • Fix import datasets on Python 3.10 by @lhoestq in https://github.com/huggingface/datasets/pull/3326
    • Fix wrongly converted assert by @eliasws in https://github.com/huggingface/datasets/pull/3323
  • 1.16.0(Nov 26, 2021)

    Datasets Changes

    • New: riddle_sense by @ziyiwu9494 in https://github.com/huggingface/datasets/pull/3161
    • New: Multi-Lingual LibriSpeech by @patrickvonplaten in https://github.com/huggingface/datasets/pull/3198
    • New: XCSR by @yangxqiao in https://github.com/huggingface/datasets/pull/3074
    • New: CMU Hinglish DoG by @Ishan-Kumar2 in https://github.com/huggingface/datasets/pull/3149
    • New: Multidoc2dial by @sivasankalpp in https://github.com/huggingface/datasets/pull/3205
    • New: IndoNLI by @afaji in https://github.com/huggingface/datasets/pull/3307
    • Update: DaNE - updated URL for download by @MalteHB in https://github.com/huggingface/datasets/pull/3203
    • Update: xcopa - (fix checksum issues + add translated data) by @mariosasko in https://github.com/huggingface/datasets/pull/3254
    • Update: tatoeba - update to v2021-07-22 by @KoichiYasuoka in https://github.com/huggingface/datasets/pull/3225
    • Update: KILT - update metadata JSON by @albertvillanova in https://github.com/huggingface/datasets/pull/3276
    • Update: Covost 2 - update download instructions by @patrickvonplaten in https://github.com/huggingface/datasets/pull/3281
    • Update: Common Voice, OpenSLR, LibriSpeech ASR, Vivos - make several audio datasets streamable by @lhoestq in https://github.com/huggingface/datasets/pull/3290
    • Fix: tuple_ie - fix download url by @mariosasko in https://github.com/huggingface/datasets/pull/3213
    • Fix: id_newspapers_2018 - fix streaming by @lhoestq in https://github.com/huggingface/datasets/pull/3249
    • Fix: bookcorpusopen - fix RAM usage by @lhoestq in https://github.com/huggingface/datasets/pull/3280
    • Fix: Scielo - fix ConnectionError by @mariosasko in https://github.com/huggingface/datasets/pull/3260
    • Fix: tatoeba - fix URLs for a subset of xtreme by @mariosasko in https://github.com/huggingface/datasets/pull/3321

    Datasets Features

    • Push to hub capabilities for Dataset and DatasetDict by @LysandreJik in https://github.com/huggingface/datasets/pull/3098:
      • Upload your dataset to the Hugging Face Hub with the push_to_hub() method!
      • See the documentation for details.
    • 200+ datasets now support streaming:
      • Stream TAR-based dataset using iter_archive by @lhoestq in https://github.com/huggingface/datasets/pull/3110
      • Stream from Google Drive and other hosts by @lhoestq in https://github.com/huggingface/datasets/pull/3248
      • Support Audio feature in streaming mode by @albertvillanova in https://github.com/huggingface/datasets/pull/3133
      • Support Audio feature for TAR archives in sequential access by @albertvillanova in https://github.com/huggingface/datasets/pull/3129
    • Resolve data_files by split name automatically by @lhoestq in https://github.com/huggingface/datasets/pull/3221
      • It takes into account the file names to know which file goes into which split
      • See the documentation for details.
    • Filter method for batched=True by @thomasw21 in https://github.com/huggingface/datasets/pull/3244
    • Adding with_rank arg to pass process rank to map by @TevenLeScao in https://github.com/huggingface/datasets/pull/3314
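    The split resolution feature listed above assigns files to splits based on their names. The following is a minimal sketch of that idea using only the standard library. It is not the library's implementation: the keyword table, the function name `infer_splits`, and the fallback of unmatched files to "train" are assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Hypothetical keyword table; the patterns the library really uses may differ.
SPLIT_KEYWORDS = {
    "train": ("train",),
    "validation": ("validation", "valid", "dev"),
    "test": ("test",),
}


def infer_splits(filenames):
    """Group file names by the split keyword found among their name tokens."""
    splits = defaultdict(list)
    for name in filenames:
        # Tokenize on anything that is not a letter or digit, so that
        # "latest.csv" does not match the keyword "test".
        tokens = re.split(r"[^a-z0-9]+", name.lower())
        for split, keywords in SPLIT_KEYWORDS.items():
            if any(keyword in tokens for keyword in keywords):
                splits[split].append(name)
                break
        else:
            # Assumption: files without a recognizable keyword go to "train".
            splits["train"].append(name)
    return dict(splits)


resolved = infer_splits(["train.csv", "data/dev-00.json", "test_set.csv", "latest.csv"])
```

    Matching on whole tokens rather than substrings is a deliberate choice in this sketch, since substring matching would misclassify names that merely contain a keyword.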

    Dataset Cards

    • Add full tagset to conll2003 README by @BramVanroy in https://github.com/huggingface/datasets/pull/3230
    • Fix some contact information formats by @lhoestq in https://github.com/huggingface/datasets/pull/3274
    • Add wikipedia tags by @lhoestq in https://github.com/huggingface/datasets/pull/3301
    • Updating details of IRC disentanglement data by @jkkummerfeld in https://github.com/huggingface/datasets/pull/3259

    Metrics Changes

    • New: OpenAI's pass@k code evaluation metric by @lvwerra in https://github.com/huggingface/datasets/pull/2916
    • Update: BLEURT - options to use updated bleurt checkpoints by @jaehlee in https://github.com/huggingface/datasets/pull/3235
    • Update: CER - update to support latest release by @mariosasko in https://github.com/huggingface/datasets/pull/3252
    • Update: WER - update to the documentation by @wooters in https://github.com/huggingface/datasets/pull/3278

    Documentation

    • Add docs for to_tf_dataset by @stevhliu in https://github.com/huggingface/datasets/pull/3175
    • Small updates to to_tf_dataset documentation by @Rocketknight1 in https://github.com/huggingface/datasets/pull/3215
    • Update link to Datasets Tagging app in Spaces by @albertvillanova in https://github.com/huggingface/datasets/pull/3194
    • Improve repository structure docs by @lhoestq in https://github.com/huggingface/datasets/pull/3233
    • Swap descriptions of v1 and raw-v1 configs of WikiText dataset and fix metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/3241
    • Add docs for audio processing by @stevhliu in https://github.com/huggingface/datasets/pull/3222
    • Add push_to_hub docs by @lhoestq in https://github.com/huggingface/datasets/pull/3319

    Additional improvements and bug fixes

    • Catch token invalid error in CI by @lhoestq in https://github.com/huggingface/datasets/pull/3200
    • Pin keras version until TF fixes its release by @albertvillanova in https://github.com/huggingface/datasets/pull/3208
    • Fix disable_nullable default value to False by @lhoestq in https://github.com/huggingface/datasets/pull/3211
    • Fix code quality in riddle_sense dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/3218
    • Better error msg if len(predictions) doesn't match len(references) in metrics by @mariosasko in https://github.com/huggingface/datasets/pull/3160
    • Use huggingface_hub.HfApi to list datasets/metrics by @mariosasko in https://github.com/huggingface/datasets/pull/3121
    • Pin version exclusion for tensorflow incompatible with keras by @albertvillanova in https://github.com/huggingface/datasets/pull/3216
    • Group tests in multiprocessing workers by test file by @albertvillanova in https://github.com/huggingface/datasets/pull/3231
    • Fix load_from_disk temporary directory by @lhoestq in https://github.com/huggingface/datasets/pull/3245
    • [tiny] fix typo in stream docs by @nollied in https://github.com/huggingface/datasets/pull/3246
    • Avoid PyArrow type optimization if it fails by @mariosasko in https://github.com/huggingface/datasets/pull/3234
    • Remove redundant isort module placement by @mariosasko in https://github.com/huggingface/datasets/pull/3243
    • asserts replaced by exception for text classification task with test. by @manisnesan in https://github.com/huggingface/datasets/pull/3256
    • Add os.listdir for streaming by @lhoestq in https://github.com/huggingface/datasets/pull/3270
    • asserts replaced with exception for image classification task, csv, json by @manisnesan in https://github.com/huggingface/datasets/pull/3262
    • Force data files extraction if download_mode='force_redownload' by @mariosasko in https://github.com/huggingface/datasets/pull/3275
    • Minor Typo Fix - Precision to Recall by @SebastinSanty in https://github.com/huggingface/datasets/pull/3279
    • Decode audio from remote by @lhoestq in https://github.com/huggingface/datasets/pull/3271
    • Fix build_docs CI by @lhoestq in https://github.com/huggingface/datasets/pull/3286
    • Allow datasets with indices table when concatenating along axis=1 by @mariosasko in https://github.com/huggingface/datasets/pull/3288
    • f-string formatting by @Mehdi2402 in https://github.com/huggingface/datasets/pull/3277
    • Unpin markdown for build_docs now that it's fixed by @lhoestq in https://github.com/huggingface/datasets/pull/3289
    • Pin version exclusion for Markdown by @albertvillanova in https://github.com/huggingface/datasets/pull/3293
    • Use f-strings in the dataset scripts by @Carlosbogo in https://github.com/huggingface/datasets/pull/3291
    • fix old_val typo in f-string by @Mehdi2402 in https://github.com/huggingface/datasets/pull/3302
    • asserts replaced with exception for fingerprint.py, search.py, arrow_writer.py and metric.py by @Ishan-Kumar2 in https://github.com/huggingface/datasets/pull/3305
    • fix: files counted twice in inferred structure by @borisdayma in https://github.com/huggingface/datasets/pull/3309
    • Finish transition to PyArrow 3.0.0 by @mariosasko in https://github.com/huggingface/datasets/pull/3318
    • Removing query params for dynamic URL caching by @anton-l in https://github.com/huggingface/datasets/pull/3315

    Citation

    • Update BibTeX entry by @albertvillanova in https://github.com/huggingface/datasets/pull/3223
    • Fix paper BibTeX citation with proceedings reference by @albertvillanova in https://github.com/huggingface/datasets/pull/3226
    • Add CITATION file by @albertvillanova in https://github.com/huggingface/datasets/pull/3228
    • Fix URL in CITATION file by @albertvillanova in https://github.com/huggingface/datasets/pull/3229

    Deprecations

    • Deprecate prepare_module by @albertvillanova in https://github.com/huggingface/datasets/pull/3166

    Full Changelog: https://github.com/huggingface/datasets/compare/1.15.1...1.16.0

  • 1.15.1(Nov 2, 2021)

  • 1.15.0(Nov 2, 2021)

    Dataset Changes

    • Update: JNLPBA - add tag names by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/3092
    • Update: OpenSLR - add SLR83 to OpenSLR by @tyrius02 in https://github.com/huggingface/datasets/pull/3125 and https://github.com/huggingface/datasets/pull/3176
    • Update: RONEC - update to v2 by @dumitrescustefan in https://github.com/huggingface/datasets/pull/3184
    • Fix: Arabic Billion Words - Fix script to return all data by @albertvillanova in https://github.com/huggingface/datasets/pull/3136
    • Fix: HLGD - fix label mapping by @VictorSanh in https://github.com/huggingface/datasets/pull/3180

    Dataset Features

    • Allow dynamic first dimension for ArrayXD by @rpowalski in https://github.com/huggingface/datasets/pull/2891
    • add multi-proc in to_csv by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/2896
    • QOL improvements: auto-flatten_indices and desc in map calls by @mariosasko in https://github.com/huggingface/datasets/pull/3196

    Dataset Cards

    • Fill in dataset card for NCBI disease dataset by @edugp in https://github.com/huggingface/datasets/pull/3115

    Metrics Changes

    • New: metric for the MATH dataset (competition_math). by @hacobe in https://github.com/huggingface/datasets/pull/3020
    • New: Google BLEU (aka GLEU) metric by @slowwavesleep in https://github.com/huggingface/datasets/pull/3108
    • New: TER by @BramVanroy in https://github.com/huggingface/datasets/pull/3153
    • New: ChrF(++) by @BramVanroy in https://github.com/huggingface/datasets/pull/3187

    General improvements and bug fixes

    • Correctly update metadata to preserve features when concatenating datasets with axis=1 by @mariosasko in https://github.com/huggingface/datasets/pull/3120
    • Fixes to to_tf_dataset by @Rocketknight1 in https://github.com/huggingface/datasets/pull/3085
    • Add security policy to the project by @albertvillanova in https://github.com/huggingface/datasets/pull/2958
    • Update doc links to point to new docs by @mariosasko in https://github.com/huggingface/datasets/pull/3116
    • Fix caching bugs by @mariosasko in https://github.com/huggingface/datasets/pull/3141
    • Fix numpy deprecation warning for ragged tensors by @lhoestq in https://github.com/huggingface/datasets/pull/3137
    • Fixed: duplicate parameter and missing parameter in docstring by @PanQiWei in https://github.com/huggingface/datasets/pull/3157
    • Fix some typos in the documentation by @h4iku in https://github.com/huggingface/datasets/pull/3152
    • Fix string encoding for Value type by @lhoestq in https://github.com/huggingface/datasets/pull/3158
    • Fix CLI test to ignore verifications when saving infos by @albertvillanova in https://github.com/huggingface/datasets/pull/3147
    • Make inspect.get_dataset_config_names always return a non-empty list by @albertvillanova in https://github.com/huggingface/datasets/pull/3159
    • Fix issue with filelock filename being too long on encrypted filesystems by @mariosasko in https://github.com/huggingface/datasets/pull/3173
    • Asserts replaced by exceptions (huggingface#3171) by @joseporiolayats in https://github.com/huggingface/datasets/pull/3174
    • Preserve ordering in zip_dict by @mariosasko in https://github.com/huggingface/datasets/pull/3170
    • Don't memoize strings when hashing since two identical strings may have different Python ids by @lhoestq in https://github.com/huggingface/datasets/pull/3182
    • Re-add faiss to windows testing suite by @BramVanroy in https://github.com/huggingface/datasets/pull/3151
    • Add missing docstring to DownloadConfig by @mariosasko in https://github.com/huggingface/datasets/pull/3183
    • More efficient nested features encoding by @eladsegal in https://github.com/huggingface/datasets/pull/3124
    • Fix optimized encoding for arrays by @lhoestq in https://github.com/huggingface/datasets/pull/3197
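    The string memoization fix in the list above addresses a subtle property of pickle-style serialization: objects are memoized by identity, so two equal values can serialize to different bytes. The sketch below shows the effect with the standard `pickle` module on CPython. It is only an illustration of the problem and does not reproduce the library's hashing code.

```python
import pickle

# Build two equal strings at runtime so they are distinct objects in CPython.
a = "".join(["hello ", "world"])
b = "".join(["hello ", "world"])

same_object = [a, a]    # the second element is the very same object
equal_objects = [a, b]  # the second element is equal but a different object

# pickle memoizes objects by id(): a repeated object becomes a back-reference,
# while an equal-but-distinct object is written out in full.
dump_same = pickle.dumps(same_object)
dump_equal = pickle.dumps(equal_objects)

print(same_object == equal_objects)  # compares the list contents
print(dump_same == dump_equal)       # compares the serialized bytes
```

    A fingerprint computed from such bytes would change between runs that produce equal data, which is why memoizing strings is undesirable when the goal is a content-based hash.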
  • 1.14.0(Oct 19, 2021)

    Dataset changes

    • Update: LexGLUE and MultiEURLEX README - update dataset cards #3075 (@iliaschalkidis)
    • Update: SUPERB - use Audio features #3101 (@anton-l)
    • Fix: Blog Authorship Corpus - fix URLs #3106 (@albertvillanova)

    Dataset features

    • Add iter_archive #3066 (@lhoestq)
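    To give a feel for what iterating over an archive means, here is a minimal standard-library sketch that reads a TAR file sequentially and yields one (path, file object) pair per member. The helper names `iter_tar_members` and `make_demo_tar` are hypothetical, and the sketch is an approximation of the idea rather than the library's iter_archive code.

```python
import io
import tarfile


def iter_tar_members(fileobj):
    """Yield (path, file-like) pairs while reading the TAR stream front to back."""
    # Mode "r|*" opens the archive as a non-seekable stream, so members must be
    # consumed in order, which is what makes this usable for streaming.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member)


def make_demo_tar():
    """Build a small in-memory TAR archive to iterate over."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in [("a.txt", b"hello"), ("b.txt", b"world")]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


# Each file object is read before the generator advances to the next member.
contents = {path: f.read() for path, f in iter_tar_members(make_demo_tar())}
```

    Because the archive is never extracted to disk or read out of order, the same pattern works on a network stream, at the cost of having to read each member as soon as it is yielded.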

    General improvements and bug fixes

    • Replace FSTimeoutError with parent TimeoutError #3100 (@albertvillanova)
    • Fix project description in PyPI #3103 (@albertvillanova)
    • Align tqdm control with cache control #3031 (@mariosasko)
    • Add paper BibTeX citation #3107 (@albertvillanova)
  • 1.13.3(Oct 15, 2021)

    Dataset changes

    • Update: Adapt all audio datasets #3081 (@patrickvonplaten)

    Bug fixes

    • Update BibTeX entry #3090 (@albertvillanova)
    • Use template column_mapping to transmit_format instead of template features #3088 (@mariosasko)
    • Fix Audio feature mp3 resampling #3096 (@albertvillanova)
Owner
Hugging Face
Solving NLP, one commit at a time!