# catalogue_data

Scripts to prepare catalogue data.
## Setup

Clone this repo.

Install [git-lfs](https://github.com/git-lfs/git-lfs/wiki/Installation):

```shell
sudo apt-get install git-lfs
git lfs install
```
Install system dependencies:

```shell
sudo apt-add-repository non-free
sudo apt-get update
sudo apt-get install unrar
```
Create a virtual environment, activate it, and install the Python dependencies:

```shell
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
Create a User Access Token (with write access) at the Hugging Face Hub (https://huggingface.co/settings/token) and set the environment variables in the `.env` file at the root of the repository:

```
HF_USERNAME=
HF_USER_ACCESS_TOKEN=
GIT_USER=
GIT_EMAIL=
```
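The scripts presumably read these variables from the process environment; how the repository actually loads the `.env` file is not shown (it may use a library such as python-dotenv). As a stdlib-only sketch of the idea:

```python
import os


def load_dotenv(path: str = ".env") -> None:
    """Parse simple KEY=VALUE lines from a .env file into os.environ.

    Illustrative only: real .env parsers also handle quoting and
    variable expansion, which this sketch skips.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and comments.
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            # Do not overwrite variables already set in the environment.
            os.environ.setdefault(key.strip(), value.strip())
```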
## Create metadata

To create the dataset metadata (in the file `dataset_infos.json`), run:

```shell
python create_metadata.py --repo <repo_id>
```

where you should replace `<repo_id>` with the ID of the dataset repository, e.g. `bigscience-catalogue-lm-data/lm_ca_viquiquad`.
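The internals of `create_metadata.py` are not shown here; the real script most likely uses the `datasets` library to load the Hub repository. As a stdlib-only sketch of the general idea (function name and output fields are illustrative, not the script's actual API), computing per-split statistics and writing them to `dataset_infos.json` could look like:

```python
import json


def create_metadata(repo_id: str, splits: dict, out_path: str = "dataset_infos.json") -> dict:
    """Compute simple per-split statistics and write them as JSON.

    `splits` maps split names (e.g. "train") to lists of records; in the
    real script these would be loaded from the Hub dataset `repo_id`.
    """
    info = {
        repo_id: {
            "splits": {
                name: {
                    "num_examples": len(records),
                    # Rough size estimate: length of the serialized records.
                    "num_bytes": sum(len(json.dumps(r)) for r in records),
                }
                for name, records in splits.items()
            }
        }
    }
    with open(out_path, "w") as f:
        json.dump(info, f, indent=2)
    return info
```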
## Aggregate datasets

To create an aggregated dataset from multiple datasets and save it as sharded JSON Lines GZIP files, run:

```shell
python aggregate_datasets.py --dataset_ratios_path <path_to_file_with_dataset_ratios> --save_path <dir_path_to_save_aggregated_dataset>
```

where you should replace:

- `<path_to_file_with_dataset_ratios>`: path to a JSON file containing a dict with dataset names (keys) and their sampling ratios (values) between 0 and 1.
- `<dir_path_to_save_aggregated_dataset>`: path to the directory where the aggregated dataset will be saved.
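A stdlib-only sketch of the aggregation step, under the assumption that each dataset's ratio is the fraction of its examples to sample (the dataset contents and the helper name here are illustrative; the actual script presumably loads the datasets from the Hub):

```python
import gzip
import json
import random


def aggregate(datasets: dict, ratios: dict, save_path: str,
              shard_size: int = 100_000, seed: int = 0) -> list:
    """Sample each dataset by its ratio and write sharded JSON Lines GZIP files.

    `datasets` maps names to lists of records; `ratios` is the dict from
    the ratios JSON file, e.g. {"lm_ca_viquiquad": 0.5}.
    """
    rng = random.Random(seed)
    sampled = []
    for name, ratio in ratios.items():
        records = datasets[name]
        # Number of examples to keep from this dataset.
        k = round(len(records) * ratio)
        sampled.extend(rng.sample(records, k))
    rng.shuffle(sampled)

    # Write the aggregated examples as shard-00000.jsonl.gz, shard-00001.jsonl.gz, ...
    paths = []
    for i in range(0, len(sampled), shard_size):
        path = f"{save_path}/shard-{i // shard_size:05d}.jsonl.gz"
        with gzip.open(path, "wt") as f:
            for record in sampled[i:i + shard_size]:
                f.write(json.dumps(record) + "\n")
        paths.append(path)
    return paths
```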