# catalogue_data

Scripts to prepare catalogue data.
## Setup

Clone this repo.

Install [git-lfs](https://github.com/git-lfs/git-lfs/wiki/Installation):

```shell
sudo apt-get install git-lfs
git lfs install
```
Install system dependencies:

```shell
sudo apt-add-repository non-free
sudo apt-get update
sudo apt-get install unrar
```
Create a virtual environment, activate it, and install the Python dependencies:

```shell
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
Create a User Access Token (with write access) at the Hugging Face Hub (https://huggingface.co/settings/token) and set the environment variables in the `.env` file at the root of the repository:

```
HF_USERNAME=
HF_USER_ACCESS_TOKEN=
GIT_USER=
GIT_EMAIL=
```
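The scripts presumably read these variables from the process environment; how the repository actually loads the `.env` file is not shown (it may use a library such as python-dotenv). As a stdlib-only sketch of the idea:

```python
import os


def load_dotenv(path: str = ".env") -> None:
    """Parse simple KEY=VALUE lines from a .env file into os.environ.

    Illustrative only: real .env parsers also handle quoting and
    variable expansion, which this sketch skips.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and comments.
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            # Do not overwrite variables already set in the environment.
            os.environ.setdefault(key.strip(), value.strip())
```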
## Create metadata

To create the dataset metadata (in the file `dataset_infos.json`), run:

```shell
python create_metadata.py --repo <repo_id>
```

where you should replace `<repo_id>` with the ID of the dataset repository, e.g. `bigscience-catalogue-lm-data/lm_ca_viquiquad`.
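The internals of `create_metadata.py` are not shown here; the real script most likely uses the `datasets` library to load the Hub repository. As a stdlib-only sketch of the general idea (function name and output fields are illustrative, not the script's actual API), computing per-split statistics and writing them to `dataset_infos.json` could look like:

```python
import json


def create_metadata(repo_id: str, splits: dict, out_path: str = "dataset_infos.json") -> dict:
    """Compute simple per-split statistics and write them as JSON.

    `splits` maps split names (e.g. "train") to lists of records; in the
    real script these would be loaded from the Hub dataset `repo_id`.
    """
    info = {
        repo_id: {
            "splits": {
                name: {
                    "num_examples": len(records),
                    # Rough size estimate: length of the serialized records.
                    "num_bytes": sum(len(json.dumps(r)) for r in records),
                }
                for name, records in splits.items()
            }
        }
    }
    with open(out_path, "w") as f:
        json.dump(info, f, indent=2)
    return info
```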
## Aggregate datasets

To create an aggregated dataset from multiple datasets and save it as sharded JSON Lines GZIP files, run:

```shell
python aggregate_datasets.py --dataset_ratios_path <path_to_file_with_dataset_ratios> --save_path <dir_path_to_save_aggregated_dataset>
```

where you should replace:

- `<path_to_file_with_dataset_ratios>`: path to a JSON file containing a dict with dataset names (keys) and their sampling ratios (values) between 0 and 1.
- `<dir_path_to_save_aggregated_dataset>`: path to the directory where the aggregated dataset will be saved.
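A stdlib-only sketch of the aggregation step, under the assumption that each dataset's ratio is the fraction of its examples to sample (the dataset contents and the helper name here are illustrative; the actual script presumably loads the datasets from the Hub):

```python
import gzip
import json
import random


def aggregate(datasets: dict, ratios: dict, save_path: str,
              shard_size: int = 100_000, seed: int = 0) -> list:
    """Sample each dataset by its ratio and write sharded JSON Lines GZIP files.

    `datasets` maps names to lists of records; `ratios` is the dict from
    the ratios JSON file, e.g. {"lm_ca_viquiquad": 0.5}.
    """
    rng = random.Random(seed)
    sampled = []
    for name, ratio in ratios.items():
        records = datasets[name]
        # Number of examples to keep from this dataset.
        k = round(len(records) * ratio)
        sampled.extend(rng.sample(records, k))
    rng.shuffle(sampled)

    # Write the aggregated examples as shard-00000.jsonl.gz, shard-00001.jsonl.gz, ...
    paths = []
    for i in range(0, len(sampled), shard_size):
        path = f"{save_path}/shard-{i // shard_size:05d}.jsonl.gz"
        with gzip.open(path, "wt") as f:
            for record in sampled[i:i + shard_size]:
                f.write(json.dumps(record) + "\n")
        paths.append(path)
    return paths
```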