Quickly download, clean up, and install public datasets into a database management system

Overview

Retriever logo

Finding data is one thing. Getting it ready for analysis is another. Acquiring, cleaning, standardizing, and importing publicly available data is time-consuming because many datasets lack machine-readable metadata and do not conform to established data structures and formats. The Data Retriever automates the first steps in the data analysis pipeline by downloading, cleaning, and standardizing datasets, and importing them into relational databases, flat files, or programming languages. Automating this process reduces the time it takes a user to get most large datasets up and running by hours, and in some cases days.

Installing the Current Release

If you have Python installed, you can install the current release using either pip:

pip install retriever

or conda after adding the conda-forge channel (conda config --add channels conda-forge):

conda install retriever

Depending on your system configuration, this may require sudo for pip:

sudo pip install retriever

Precompiled binary installers are also available for Windows, OS X, and Ubuntu/Debian on the releases page. These do not require a Python installation.

List of Available Datasets

Installing From Source

To install the Data Retriever from source, you'll need Python 3.6.8+ with the following packages installed:

  • xlrd

The following packages are optionally needed to interact with associated database management systems:

  • PyMySQL (for MySQL)
  • sqlite3 (for SQLite; included in the Python standard library)
  • psycopg2-binary (for PostgreSQL), previously psycopg2.
  • pyodbc (for MS Access - this option is only available on Windows)
  • Microsoft Access Driver (ODBC, Windows only)

To install from source

Either use pip to install directly from GitHub:

pip install git+https://[email protected]/weecology/retriever.git

or:

  1. Clone the repository
  2. From the directory containing setup.py, run the following command: pip install . (you may need to include sudo at the beginning of the command depending on your system, i.e., sudo pip install .).

More extensive documentation for those who are interested in developing can be found here

Using the Command Line

After installing, run retriever update to download all of the available dataset scripts. To see the full list of command line options and datasets run retriever --help. The output will look like this:

usage: retriever [-h] [-v] [-q]
                 {download,install,defaults,update,new,new_json,edit_json,delete_json,ls,citation,reset,help}
                 ...

positional arguments:
  {download,install,defaults,update,new,new_json,edit_json,delete_json,ls,citation,reset,help}
                        sub-command help
    download            download raw data files for a dataset
    install             download and install dataset
    defaults            displays default options
    update              download updated versions of scripts
    new                 create a new sample retriever script
    new_json            CLI to create retriever datapackage.json script
    edit_json           CLI to edit retriever datapackage.json script
    delete_json         CLI to remove retriever datapackage.json script
    ls                  display a list of all available dataset scripts
    citation            view citation
    reset               reset retriever: removes configuration settings,
                        scripts, and cached data
    help

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -q, --quiet           suppress command-line output

To install datasets, use retriever install:

usage: retriever install [-h] [--compile] [--debug]
                         {mysql,postgres,sqlite,msaccess,csv,json,xml} ...

positional arguments:
  {mysql,postgres,sqlite,msaccess,csv,json,xml}
                        engine-specific help
    mysql               MySQL
    postgres            PostgreSQL
    sqlite              SQLite
    msaccess            Microsoft Access
    csv                 CSV
    json                JSON
    xml                 XML

optional arguments:
  -h, --help            show this help message and exit
  --compile             force re-compile of script before downloading
  --debug               run in debug mode

Examples

These examples use the Iris flower dataset. More examples can be found in the Data Retriever documentation.

Using Install

retriever install -h   (gives install options)

Using a specific database engine: retriever install {Engine}

retriever install mysql -h     (gives install mysql options)
retriever install mysql --user myuser --password ******** --host localhost --port 8888 --database_name testdbase iris

To install data into an SQLite database named iris.db, you would use:

retriever install sqlite iris -f iris.db
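
The result is an ordinary SQLite file, so you can inspect it with Python's built-in sqlite3 module. A minimal sketch (table names depend on the dataset script, so the sketch lists them rather than assuming one):

import sqlite3

# Open the database created by `retriever install sqlite iris -f iris.db`
conn = sqlite3.connect("iris.db")

# Table names vary by dataset script, so list them first
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)

# Peek at the first few rows of the first table
print(conn.execute(f"SELECT * FROM {tables[0]} LIMIT 5").fetchall())
conn.close()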

Using download

retriever download -h    (gives you help options)
retriever download iris
retriever download iris --path C:\Users\Documents

Using citation

retriever citation   (citation of the retriever engine)
retriever citation iris  (citation for the iris data)
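
The retriever can also be driven from Python; a Python interface was added in v2.1.0 (see the release notes below). A minimal sketch, assuming the module-level helpers that interface exposes:

import retriever as rt

# Download or update the dataset scripts (equivalent to `retriever update`)
rt.check_for_updates()

# List available datasets, then install one as CSV files
print(rt.dataset_names())
rt.install_csv("iris")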

Spatial Dataset Installation

Set up Spatial support

To set up spatial support for Postgres using Postgis please refer to the spatial set-up docs.

retriever install postgres harvard-forest # Vector data
retriever install postgres bioclim # Raster data
# Install only the data of USGS elevation in the given extent
retriever install postgres usgs-elevation -b -94.98704597353938 39.027001800158615 -94.3599408119917 40.69577051867074
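
To confirm that PostGIS is active and see which tables the install created, you can query the database directly. A sketch using psycopg2-binary; the connection details are placeholders that should match whatever you passed to retriever install postgres:

import psycopg2

# Placeholder credentials; use your own connection settings
conn = psycopg2.connect(dbname="testdbase", user="myuser",
                        password="********", host="localhost")
cur = conn.cursor()

# PostGIS_Version() succeeds only once the postgis extension is installed
cur.execute("SELECT PostGIS_Version();")
print(cur.fetchone())

# List public tables to see what the retriever created
cur.execute("SELECT table_name FROM information_schema.tables "
            "WHERE table_schema = 'public';")
print(cur.fetchall())
conn.close()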

Website

For more information see the Data Retriever website.

Acknowledgments

Development of this software was funded by the Gordon and Betty Moore Foundation's Data-Driven Discovery Initiative through Grant GBMF4563 to Ethan White and the National Science Foundation as part of a CAREER award to Ethan White.

Comments
  • [WIP] Allow consuming JSON data

    NOTE: I closed this PR by mistake. I'll re-open this. This pull request caters to the issue "Allow consuming JSON data" (#1334). Currently, we support two kinds of JSON datasets, with a third in progress:

    1. Where the dataset's rows are present in a certain key of the JSON. In the linked example, that key is data.
    2. Where the dataset's rows are present in a certain key of different parts of the JSON. In the linked example, that key is laureates.
    3. [WIP] Where the JSON is in the form of a list (example). The current implementation for this is commented out, since it gets stuck in a recursion loop.
    opened by DumbMachine 37
  • Updated internal variable names to match that of datapackage #860

    Updated internal variable names to match the datapackage spec (#765). The following variable name changes were made:

    tags -> keywords
    nulls -> missingValues
    name -> title
    shortname -> name
    
    The changes were done in the following files -
    
    retriever/lib/compile.py
    retriever/lib/datapackage.py
    retriever/lib/engine.py
    retriever/lib/parse_script_to_json.py
    retriever/lib/templates.py
    retriever/lib/tools.py
    scripts/bioclim.py
    scripts/biomass_allometry_db.py
    scripts/breed_bird_survey.py
    scripts/breed_bird_survey_50stop.py
    scripts/forest_inventory_analysis.py
    scripts/gentry_forest_transects.py
    scripts/npn.py
    scripts/plant_life_hist_eu.py
    scripts/prism_climate.py
    scripts/vertnet.py
    scripts/wood_density.py
    scripts/*.json (almost all datapackages): missingValues -> missing_values
    test/test_retriever.py
    retriever/__main__.py
    

    @henrykironde I have made the changes after updating it with the master branch

    Under Review and Tests 
    opened by jainamritanshu 37
  • Working on: Expansion of Spatial Data Support to the Data Retriever

    I have started looking into and working on the project: Expansion of Spatial Data Support to the Data Retriever.

    @ethanwhite @henrykironde Please let me know of any aspects in particular I should be prioritising over others.

    Would it be okay if I carried on the discussion through this Issue?

    opened by ss-is-master-chief 31
  • Test OSX .app file

    I'm in the process of trying to get the EcoData Retriever fully functional on OSX (since so many of the awesome ecological informaticsy people I know use Macs). As I've mentioned elsewhere it looks like building from source now works, at least when using homebrew (http://ecodataretriever.org/getting_started.html).

    What I'm working on now is getting the .app working so that you don't need to be comfortable in the shell (and have XCode installed) to use the Retriever. I have a version that is working on the machine that I built it on (our only Mac) and was wondering if some kind Mac folks like @karthik, @sckott, @emhart, @sarahsupp and @dfalster might have a few minutes to give it a trial run.

    The file is here: https://www.dropbox.com/s/26b1pj91mqucc0l/retriever.zip

    Basically I'm just looking for folks to unzip it, double click on it, and see if:

    1. It opens at all.
    2. You can install things successfully when setting the database management system to CSV and sqlite (these don't have any external dependencies).
    3. If you have either MySQL or PostgreSQL installed if it works with them. (MySQL is a bit fragile at the moment. It is currently working for most datasets, but not all, so just try a few if you get errors).
    4. Report back.

    Thanks in advance. And, yes, I wrote this issue... on a Mac.

    opened by ethanwhite 31
  • Allow consuming JSON data

    Currently we only support ingesting delimited tabular data. It is increasingly common for tabular-style data to be distributed in JSON files, and it would be nice to be able to consume this as well. We would probably just convert it to CSV as a starting point and then process it using our standard pipeline. (A rough sketch of such a conversion follows below.)

    There are a few not particularly active packages for doing this, but the code to do it is simple enough that, since none of the packages seem to be widely adopted, we might be better off just writing and maintaining it ourselves.

    (no rush on this, just a thought while looking at a cool dataset that's only available in JSON: http://data.unpaywall.org/products/snapshot)

    Feature Request 
    opened by ethanwhite 30
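
    A rough sketch of the conversion floated above, flattening a list of JSON records to CSV with the standard library (file names are hypothetical):

    import csv
    import json

    # Hypothetical input: a JSON file holding a list of flat records
    with open("records.json") as f:
        rows = json.load(f)

    # Use the union of keys so records with missing fields still fit
    fieldnames = sorted({key for row in rows for key in row})

    with open("records.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
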
  • Update internals in reference to issue #765

    @ethanwhite @henrykironde I have made the following changes

    • name -> title
    • shortname -> name
    • tags -> keywords
    • nulls -> missingValues

    I wanted to ask about changing missingValues to missing_values, as the original seems to be camelCase rather than the PEP 8 naming convention; if you allow, I will change it. I am still finding such internal name cases that could be updated. I think the code is clean. Kindly go through it once, and if there are any suggestions I will start working on them.

    Changes Requested 
    opened by jainamritanshu 30
  • Removed default encoding in reference to #716

    Sir, I have removed the hard-coded assignments for encoding and added an encoding field. I haven't edited the existing scripts for the existing data. Should I do them manually? Kindly review the code and tell me of any changes needed. @ethanwhite @henrykironde

    opened by jainamritanshu 29
  • an eBird Basic Dataset workflow

    Hey all,

    I've mostly gotten the eBird data into a PostgreSQL/PostGIS database, and I thought I'd share my code with you in case you wanted to integrate it into something more robust with EcoDataRetriever. If you know how to optimize it better, I'd love to hear what you come up with.

    If you do decide to include it, please acknowledge Matt Jones and Jim Regetz, since they helped me through this.

    Let me know if you have any questions!

    Dave

    PS the "world" data set unzips to be 50 gigabytes, so you'll probably want to work with something smaller...

    -- Data file available via http://ebird.org/ebird/data/download
    
    -- commands to extract the text file from the tarball:
       -- tar xvf ebd_relMay-2013.tar
       -- gunzip ebd_relMay-2013.txt.gz
    -- WARNING: The resulting file is almost 50 gigabytes!
    
    -- In retrospect, there's probably some premature optimization for some of these columns: if the data set changes,
    -- it might be safer to use longer varchar arguments.
    CREATE TABLE eBird (
      GLOBAL_UNIQUE_IDENTIFIER     char(50),      -- always 45-47 characters needed (so far)
      TAXONOMIC_ORDER              numeric,       -- Probably not needed
      CATEGORY                     varchar(20),   -- Probably 10 would be safe
      COMMON_NAME                  varchar(70),   -- Some hybrids have really long names
      SCIENTIFIC_NAME              varchar(70),   --  ''
      SUBSPECIES_COMMON_NAME       varchar(70),   --  ''
      SUBSPECIES_SCIENTIFIC_NAME   varchar(70),   --  ''
      OBSERVATION_COUNT            varchar(8),    -- Someone saw 1.3 million Auklets.
                                                  -- Unfortunately, it can't be an integer 
                                                  -- because some are just presence/absence
      BREEDING_BIRD_ATLAS_CODE     char(2),       -- need to confirm that these are always length 2
      AGE_SEX                      text,          -- Potentially long, but almost always blank
      COUNTRY                      varchar(50),   -- long enough for "Saint Helena, Ascension and Tristan da Cunha"
      COUNTRY_CODE                 char(2),       -- alpha-2 codes
      STATE_PROVINCE               varchar(50),   -- no idea if this is long enough? U.S. Virgin Islands may be almost 30
      SUBNATIONAL1_CODE            char(10),      -- looks standardized at 5 characters?
      COUNTY                       varchar(50),   -- who knows how long it could be
      SUBNATIONAL2_CODE            char(12),      -- looks standardized at 9 characters?
      IBA_CODE                     char(16),
      LOCALITY                     text,          -- unstructured/potentially long
      LOCALITY_ID                  char(10),      -- maximum observed so far is 8
      LOCALITY_TYPE                char(2),       -- short codes
      LATITUDE                     real,          -- Is this the appropriate level of precision?
      LONGITUDE                    real,          --    ''
      OBSERVATION_DATE             date,          -- Do I need to specify YMD somehow?
      TIME_OBSERVATIONS_STARTED    time,          -- How do I make this a time?
      TRIP_COMMENTS                text,          -- Comments are long, unstructured, 
      SPECIES_COMMENTS             text,          --    and inconsistent, but sometimes interesting
      OBSERVER_ID                  char(12),      -- max of 9 in the data I've seen so far
      FIRST_NAME                   text,          -- Already have observer IDs
      LAST_NAME                    text,          -- ''
      SAMPLING_EVENT_IDENTIFIER    char(12),      -- Probably want to index on this.
      PROTOCOL_TYPE                varchar(50),   -- Needs to be at least 30 for sure.
      PROJECT_CODE                 varchar(20),   -- Needs to be at least 10 for sure.
      DURATION_MINUTES             int,           -- bigint?
      EFFORT_DISTANCE_KM           real,          -- precision?
      EFFORT_AREA_HA               real,          -- precision?
      NUMBER_OBSERVERS             int,           -- just a small int
      ALL_SPECIES_REPORTED         int,           -- Seems to always be 1 or 0.  Maybe I could make this Boolean?
      GROUP_IDENTIFIER             varchar(10),   -- Appears to be max of 7 or 8
      APPROVED                     int,           -- Can be Boolean?
      REVIEWED                     int,           -- Can be Boolean?
      REASON                       char(17),      -- May need to be longer if data set includes unvetted data
      X                            text           -- Blank
    );
    
    
    COPY eBird
      FROM '/home/dharris/eBird/ebd_relMay-2013.txt' 
      HEADER
      CSV
      QUOTE E'\5'       -- The file has unbalanced quotes. Using an obscure character as a quote mark instead.
      DELIMITER E'\t';
    
    
    -- Note: it's probably slightly faster to load postgis and add a geographic column first (see below).
    -- I'm keeping the original ordering in this document for accuracy's sake.
    CREATE INDEX ON eBird (sampling_event_identifier);
    
    -- Example query: SELECT SCIENTIFIC_NAME FROM eBird WHERE SAMPLING_EVENT_IDENTIFIER = 'S9605852';
    -- Example query: SELECT count(SCIENTIFIC_NAME) FROM eBird WHERE SAMPLING_EVENT_IDENTIFIER = 'S9605852';
    
    
    CREATE EXTENSION postgis;
    ALTER TABLE eBird ADD COLUMN geog geography(POINT,4326); -- I hope 4326 is correct...
    UPDATE eBird SET geog = ST_GeogFromText('POINT(' || longitude || ' ' ||  latitude || ')');
    CREATE INDEX geog_index ON eBird USING GIST (geog); 
    
    -- Example query: find all the species within 1000 meters of my dorm:
    -- SELECT SCIENTIFIC_NAME FROM eBird WHERE ST_DWithin(geog, ST_GeographyFromText('SRID=4326;POINT(-119.6972 34.4208)'), 1000);
    
    -- Slightly fancier version:
    -- SELECT DISTINCT SCIENTIFIC_NAME, COMMON_NAME FROM eBird 
    --   WHERE ST_DWithin(geog, ST_GeographyFromText('SRID=4326;POINT(-119.855385 34.417239)'), 1000) 
    --   ORDER BY SCIENTIFIC_NAME;
    

    (Edited to add some amazing PostGIS queries and some better comments, etc.)

    PS: After poking around a bit more, it looks like I should have used doubles rather than reals to store lat/lon. I had misread the documentation about how much precision was used for reals.

    opened by davharris 28
  • Gracefully handle failed downloads

    It is not uncommon for a data source to go down (e.g., #902) or for a download to fail for some reason (e.g., #863). We should catch these failures, avoid caching whatever comes down (which is sometimes a corrupt file and sometimes a 404 HTML page), and report to the user that the source appears to be down, that they should try again, and that if it still fails later they should let us know. (A sketch of this pattern follows below.)

    opened by ethanwhite 26
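
    A sketch of the pattern described above, using requests; it illustrates checking the response before caching, not the retriever's actual download code:

    import requests

    def download(url, path):
        """Download url to path, refusing to cache failed responses."""
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()  # turn 404s and 5xx responses into exceptions
        except requests.RequestException as exc:
            # Nothing is written to the cache; report the failure to the user instead
            raise RuntimeError(
                f"The source at {url} appears to be down ({exc}); "
                "please try again later and report it if the failure persists."
            ) from exc
        with open(path, "wb") as f:
            f.write(response.content)
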
  • Updated internal variable names to match that of datapackage

    Updated internal variable names to match the datapackage spec (#765). The following variable name changes were made:

    tags -> keywords
    nulls -> missingValues
    name -> title
    shortname -> name
    
    The changes were done in the following files -
    
    retriever/lib/compile.py
    retriever/lib/datapackage.py
    retriever/lib/engine.py
    retriever/lib/parse_script_to_json.py
    retriever/lib/templates.py
    retriever/lib/tools.py
    scripts/bioclim.py
    scripts/biomass_allometry_db.py
    scripts/breed_bird_survey.py
    scripts/breed_bird_survey_50stop.py
    scripts/forest_inventory_analysis.py
    scripts/gentry_forest_transects.py
    scripts/npn.py
    scripts/plant_life_hist_eu.py
    scripts/prism_climate.py
    scripts/vertnet.py
    scripts/wood_density.py
    scripts/*.json (almost all datapackages): missingValues -> missing_values
    test/test_retriever.py
    retriever/__main__.py
    
    Changes Requested 
    opened by henrykironde 25
  • Add fetch to python Interface

    Hi @henrykironde. Sorry, I was off schedule these last days, so I couldn't work on this issue as I told you. This should solve #1019, but is this the right place for the method?

    Changes Requested 
    opened by adhaamehab 23
  • hacktoberfest guide

    For contributors who want to take part in Hacktoberfest, please check the issue lists from the various projects:

    Retriever: https://github.com/weecology/retriever/issues
    Retriever-recipes: https://github.com/weecology/retriever-recipes/issues
    Rdataretriever: https://github.com/ropensci/rdataretriever/issues
    Retriever.jl: https://github.com/weecology/Retriever.jl/issues

    opened by henrykironde 0
  • Downloading fails for files with no Content-Disposition

    Example packages:

    1. Package file: https://github.com/weecology/retriever-recipes/blob/main/scripts/usda_agriculture_plants_database.py
       Sample URL: https://plants.sc.egov.usda.gov/csvdownload?plantLst=plantCompleteList

    2. Package file: https://github.com/weecology/retriever-recipes/blob/main/scripts/aquatic_animal_excretion.py
       Sample URL: https://esajournals.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fecy.1792&file=ecy1792-sup-0001-DataS1.zip

    One plausible fallback for the missing header is sketched below.

    opened by henrykironde 1
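
    One plausible fallback when Content-Disposition is absent is deriving a file name from the final URL path (a sketch, not the retriever's actual logic; the helper name is hypothetical):

    import os
    from urllib.parse import unquote, urlparse

    def filename_from_response(response, default="downloaded_file"):
        """Prefer Content-Disposition; fall back to the URL path, then a default."""
        disposition = response.headers.get("Content-Disposition", "")
        if "filename=" in disposition:
            return disposition.split("filename=")[-1].strip('"; ')
        # No header: use the last path segment of the final (post-redirect) URL
        name = os.path.basename(unquote(urlparse(response.url).path))
        return name or default
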
  •  display_all_rdatasets_names in rdatasets takes a list of package_name

    display_all_rdataset_names takes a list of package names; passing a single package name as a string causes it to iterate over the string's characters instead. (A defensive fix is sketched below the example output.)

    >>> display_all_rdataset_names("aer")
    List of all available Rdatasets in packages: aer
    No package named 'a' found in Rdatasets
    No package named 'e' found in Rdatasets
    No package named 'r' found in Rdatasets
    
    >>> display_all_rdataset_names(["aer"])
    List of all available Rdatasets in packages: ['aer']
    Package: aer              Dataset: affairs                   Script Name: rdataset-aer-affairs
    Package: aer              Dataset: argentinacpi              Script Name: rdataset-aer-argentinacpi
    Package: aer              Dataset: bankwages                 Script Name: rdataset-aer-bankwages
    Package: aer              Dataset: benderlyzwick             Script Name: rdataset-aer-benderlyzwick
    Package: aer              Dataset: bondyield                 Script Name: rdataset-aer-bondyield
    Package: aer              Dataset: cartelstability           Script Name: rdataset-aer-cartelstability
    Package: aer              Dataset: caschools                 Script Name: rdataset-aer-caschools
    Package: aer              Dataset: chinaincome               Script Name: rdataset-aer-chinaincome
    Package: aer              Dataset: cigarettesb               Script Name: rdataset-aer-cigarettesb
    Package: aer              Dataset: cigarettessw              Script Name: rdataset-aer-cigarettessw
    Package: aer              Dataset: collegedistance           Script Name: rdataset-aer-collegedistance
    Package: aer              Dataset: consumergood              Script Name: rdataset-aer-consumergood
    Package: aer              Dataset: cps1985                   Script Name: rdataset-aer-cps1985
    Package: aer              Dataset: cps1988                   Script Name: rdataset-aer-cps1988
    ....
    opened by Nageshbansal 1
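
    A defensive fix for this class of bug is to normalize the argument before iterating (a sketch; the real lookup logic is elided):

    def display_all_rdataset_names(package_name=None):
        # A bare string would otherwise be iterated character by character,
        # producing the "No package named 'a'" errors shown above
        if isinstance(package_name, str):
            package_name = [package_name]
        for package in package_name or []:
            ...  # look up and print the datasets for each package
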
  • not able to use gdal==3.3.2 while working with ".shp" files

    NOTES

    Expected behavior and actual behavior.

    While I have GDAL 3.3.2, if I try to import ogr in a script dealing with ".shp" files, it doesn't import; but if I downgrade my GDAL to 3.0.2, I'm able to import ogr and the script runs successfully.

    (Screenshots: ogr failing to import under GDAL 3.3.2, and importing successfully under GDAL 3.0.2.)

    Operating system

    Ubuntu 20.04

    GDAL version and provenance

    GDAL 3.3.2 version from ubuntugis-unstable PPA

    opened by Nageshbansal 0
  • Make sure that the R API datasets are run on the retrieverdash

    We have added some APIs to the retriever. Some of them, like Tidycensus, can be run and tested on the retriever dashboard.

    You can clone the retrieverdash project and test locally using the developer docs for the dashboard https://retrieverdash.readthedocs.io/developer.html#setting-up-locally.

    When working locally, first you will need to have the APIs working well on the retriever. Use the DEV LIST in the retriever dashboard to test only the required scripts.

    opened by henrykironde 0
Releases (v3.1.0)
  • v3.1.0(Apr 26, 2022)

    v3.1.0

    Major changes

    • Remove Travis and use GitHub Actions
    • Improve autocreate script template creation tool
    • Update server setup docs
    • Change default branch from master to main
    • Update Kaggle API function
    • Add Anaconda badges
    • Update BBS breed bird survey
    • Add HDF5-to-CSV conversion test
    • Add HDF5 engine
    • Add XML-to-CSV conversion test
    • Add JSON-to-CSV function with tests
    • Add SQLite-to-CSV conversion test
    • Add GeoJSON-to-CSV conversion test script
    • Add tidycensus dataset
    • Improve Dockerfile and automate Docker push to the registry
    • Add support for clipping images
    • Add Socrata API
    • Add RDatasets API
    • Add auto-publish to TestPyPI and PyPI

    Source code(tar.gz)
    Source code(zip)
  • v3.0.0(Jul 16, 2020)

    v3.0.0

    Major changes

    • Add provenance support to the Data Retriever
    • Use UTF-8 as the default encoding
    • Move scripts from Retriever to the retriever-recipes repository
    • Adopt Google code style, add linters, and use yapf
    • Test linters
    • Extend CSV field size limit
    • Improve output when a connection is not made
    • Add version to the interface
    • Prompt the user if a newer version of a script is available
    • Add all the recipes datasets
    • Add test for installation of committed datasets
    • Add function to commit a dataset

    Minor changes

    • Improve "argcomplete-command"
    • Add NUMFOCUS logo in README

    Source code(tar.gz)
    Source code(zip)
  • v2.4.0(Jun 10, 2019)

  • v2.3.0(May 1, 2019)

    Retriever v2.3.0

    Major changes

    • Change psycopg2 to psycopg2-binary
    • Add spatial data testing on Docker
    • Add option for pretty JSON
    • Keep order of fetched tables and order of processing resources
    • Add reset for a specific dataset and script function
    • Use tqdm 4.30.0
    • Install data into a custom directory using the data_dir option
    • Download data into a custom directory using sub_dir

    Minor changes

    • Add tests for the reset script
    • Add smaller samples of GIS data for testing
    • Reactivate MySQL tests on Travis
    • Allow custom arguments for psql
    • Add docs and examples for PostGIS support
    • Change testdb name to testdb_retriever
    • Improve PyPI retriever description
    • Update documentation for passwordless setup of Postgres on Windows
    • Set up infrastructure for automating script creation

    New datasets

    • USA ecoregions, ecoregions-us
    • LTREB Prairie-forest ecotone of eastern Kansas (Foster Lab dataset)
    • Sonoran Desert, sonoran-desert
    • Acton Lake dataset, acton-lake

    Dataset changes

    • MammalSuperTree.py to mammal_super_tree.py
    • lakecats_finaltables.json to lakecats_final_tables
    • harvard_forests.json to harvard_forest.json
    • macroalgal_communities to macroalgal-communities

    Source code(tar.gz)
    Source code(zip)
    mac.zip(77.86 MB)
    python3-retriever_2.3.0-1_all.deb(43.29 KB)
    RetrieverSetup.exe(22.84 MB)
  • v.2.2.0(Nov 6, 2018)

    Major changes

    • Use the requests package to fetch data
    • Add PostGIS spatial support for Postgres
    • Update ls to include more details about the scripts
    • Update license lookup for datasets
    • Update keywords lookup for datasets
    • Use tqdm for all progress tracking
    • Changed all "-" in JSON files to "_"

    Minor changes

    • Documentation refinement
    • Connect to MySQL using the preferred encoding
    • Add license search and keyword search
    • Add Conda-Forge docs
    • Add Zenodo badge to link to archive
    • Add test for extracting data

    New datasets

    • Add NOAA Fisheries trade, noaa-fisheries-trade
    • Add Fishery Statistical Collections data, fao-global-capture-product
    • Add BUPA liver disorders dataset, bupa-liver-disorders
    • Add GLOBI interactions data, globi-interaction
    • Add the National Aquatic Resource Surveys (NARS), nla
    • Add Partners in Flight dataset, partners-in-flight
    • Add the ND-GAIN Country Index, nd-gain
    • Add world GDP in current US dollars, dgp
    • Add airports dataset, airports
    • Repair aquatic animal excretion
    • Add Biotime dataset
    • Add LakeCats final tables dataset, lakecats-final-tables
    • Add Harvard Forest data, harvard forests
    • Add USGS elevation data, usgs-elevation

    Source code(tar.gz)
    Source code(zip)
    python-retriever_2.2.0-1_all.deb(38.28 KB)
    retriever-2.2.0.tar.gz(55.76 KB)
    retriever.app.zip(65.22 MB)
    RetrieverSetup.exe(28.16 MB)
  • v2.1.0(Oct 27, 2017)

    v2.1.0

    Major changes

    • Add Python interface
    • Add Retriever to conda
    • Auto complete of Retriever commands on Unix systems

    Minor changes

    • Add license to datasets
    • Change the structure of raw data from string to list
    • Add testing on any modified dataset
    • Improve memory usage in cross-tab processing
    • Add capability for datasets to use custom encoding
    • Use new Python interface for regression testing
    • Use Frictionless Data specification terminology for internals

    New datasets

    • Add ant dataset and weather data to the portal dataset
    • NYC TreesCount
    • PREDICTS
    • aquatic_animal_excretion
    • biodiversity_response
    • bird_migration_data
    • chytr_disease_distr
    • croche_vegetation_data
    • dicerandra_frutescens
    • flensburg_food_web
    • great_basin_mammal_abundance
    • macroalgal_communities
    • macrocystis_variation
    • marine_recruitment_data
    • mediter_basin_plant_traits
    • nematode_traits
    • ngreatplains-flowering-dates
    • portal-dev
    • portal
    • predator_prey_body_ratio
    • predicts
    • socean_diet_data
    • species_exctinction_rates
    • streamflow_conditions
    • tree_canopy_geometries
    • turtle_offspring_nesting
    • Add VertNet individual datasets: vertnet_amphibians, vertnet_birds, vertnet_fishes, vertnet_mammals, vertnet_reptiles
    Source code(tar.gz)
    Source code(zip)
    retriever.app.zip(10.16 MB)
    RetrieverSetup.exe(11.64 MB)
    retriever_2.1.0.deb(33.99 KB)
  • v2.0.0(Feb 24, 2017)

    v2.0.0

    Major changes

    • Add Python 3 support, python 2/3 compatibility
    • Add json and xml as output formats
    • Switch to using the frictionless data datapackage JSON standard. This is a backwards-incompatible change, as the form of the dataset description files the retriever uses to describe the location and processing of simple datasets has changed.
    • Add CLI for creating, editing, deleting datapackage.json scripts
    • Broaden scope to include non-ecological data and rename to Data Retriever
    • Major expansion of documentation and move documentation to Read the Docs
    • Add developer documentation
    • Remove the GUI
    • Use the csv module for reading raw data to improve handling of newlines in fields
    • Major expansion of integration testing
    • Refactor regression testing to produce a single hash for a dataset regardless of output format
    • Add continuous integration testing for Windows

    Minor changes

    • Use pyinstaller for creating exe for windows and app for mac and remove py2app
    • Use 3 level semantic versioning for both scripts and core code
    • Rename datasets with more descriptive names
    • Add a retriever minimum version for each dataset
    • Rename dataset description files to follow python modules conventions
    • Switch to py.test from nose
    • Expand unit testing
    • Add version requirements for sqlite and postgresql
    • Default to latin encoding
    • Improve UI for updating user on downloading and processing progress

    New datasets

    • Added machine learning datasets from the UC Irvine Machine Learning Repository
    Source code(tar.gz)
    Source code(zip)
    python3-retriever_2.0.0-1_all.deb(33.13 KB)
    retriever-OSX.zip(10.41 MB)
    RetrieverSetup.exe(11.16 MB)
  • v1.8.3(Feb 12, 2016)

    v1.8.3

    • Fixed regression in GUI

    v1.8.2

    • Improved cleaning of column names
    • Fixed thread bug causing Gentry dataset to hang when installed via GUI
    • Removed support for 32-bit only Macs in binaries
    • Removed unused code

    v1.8.0

    • Added scripts for 21 new datasets: leaf herbivory, biomass allocation, community dynamics of shortgrass steppe plants, mammal and bird foraging attributes, tree demography in India, small mammal community dynamics in Chile, community dynamics of Sonoran Desert perennials, biovolumes of freshwater phytoplankton, plant dynamics in Montana, Antarctic Site Inventory breeding bird survey, community abundance data compiled from the literature, spatio-temporal population data for butterflies, fish parasite host ecological characteristics, eBird, Global Wood Density Database, multiscale community data on vascular plants in North Carolina, vertebrate home range sizes, PRISM climate data, Amniote life history database, woody plant Biomass And Allometry Database, Vertnet data on amphibians, birds, fishes, mammals, reptiles
    • Added reset command to allow resetting database configuration settings, scripts, and cached raw data
    • Added Dockerfile for building docker containers of each version of the software for reproducibility
    • Added support for wxPython 3.0
    • Added support for tar and gz archives
    • Added support for archive files whose contents don't fit in memory
    • Added checks for and use of system proxies
    • Added ability to download archives from web services
    • Added tests for regressions in download engine
    • Added citation command to provide information on citing datasets
    • Improved column name cleanup
    • Improved whitespace consistency
    • Improved handling of Excel files
    • Improved function documentation
    • Improved unit testing and added coverage analysis
    • Improved the sample script by adding a url field
    • Improved script loading behavior by only loading a script the first time it is discovered
    • Improved operating system identification
    • Improved download engine by adding the ability to maintain archive and subdirectory structure (particularly relevant for spatial data)
    • Improved cross-platform directory and line ending handling
    • Improved testing across platforms
    • Improved checking for updated scripts so that scripts are only downloaded if the current version isn't available
    • Improved metadata in setup.py
    • Fixed type issues in Portal dataset
    • Fixed GUI always downloading scripts instead of checking if it needed to
    • Fixed bug that sometimes resulted in .retriever directories not belonging to the user who did the installation
    • Fixed issues with downloading files to specific paths
    • Fixed BBS50 script to match newer structure of the data
    • Fixed bug where csv files were not being closed after installation
    • Fixed errors when closing the GUI
    • Fixed issue where enclosing quotes in csv files were not being respected during cross-tab restructuring
    • Fixed bug causing v1.6 to break when newer scripts were added to version.txt
    • Fixed Bioclim script to include hdr files
    • Fixed missing icon images on Windows
    • Removed unused code
    Source code(tar.gz)
    Source code(zip)
    python-retriever_1.8.3-1_all.deb(96.11 KB)
    retriever.zip(29.07 MB)
    RetrieverSetup.exe(8.32 MB)
  • v1.8.2(Feb 12, 2016)

    This is the 1.8 release of the EcoData Retriever.

    v1.8.2

    • Improved cleaning of column names
    • Fixed thread bug causing Gentry dataset to hang when installed via GUI
    • Removed support for 32-bit only Macs in binaries
    • Removed unused code

    v1.8.0

    • Added scripts for 21 new datasets: leaf herbivory, biomass allocation, community dynamics of shortgrass steppe plants, mammal and bird foraging attributes, tree demography in India, small mammal community dynamics in Chile, community dynamics of Sonoran Desert perennials, biovolumes of freshwater phytoplankton, plant dynamics in Montana, Antarctic Site Inventory breeding bird survey, community abundance data compiled from the literature, spatio-temporal population data for butterflies, fish parasite host ecological characteristics, eBird, Global Wood Density Database, multiscale community data on vascular plants in North Carolina, vertebrate home range sizes, PRISM climate data, Amniote life history database, woody plant Biomass And Allometry Database, Vertnet data on amphibians, birds, fishes, mammals, reptiles
    • Added reset command to allow resetting database configuration settings, scripts, and cached raw data
    • Added Dockerfile for building docker containers of each version of the software for reproducibility
    • Added support for wxPython 3.0
    • Added support for tar and gz archives
    • Added support for archive files whose contents don't fit in memory
    • Added checks for and use of system proxies
    • Added ability to download archives from web services
    • Added tests for regressions in download engine
    • Added citation command to provide information on citing datasets
    • Improved column name cleanup
    • Improved whitespace consistency
    • Improved handling of Excel files
    • Improved function documentation
    • Improved unit testing and added coverage analysis
    • Improved the sample script by adding a url field
    • Improved script loading behavior by only loading a script the first time it is discovered
    • Improved operating system identification
    • Improved download engine by adding the ability to maintain archive and subdirectory structure (particularly relevant for spatial data)
    • Improved cross-platform directory and line ending handling
    • Improved testing across platforms
    • Improved checking for updated scripts so that scripts are only downloaded if the current version isn't available
    • Improved metadata in setup.py
    • Fixed type issues in Portal dataset
    • Fixed GUI always downloading scripts instead of checking if it needed to
    • Fixed bug that sometimes resulted in .retriever directories not belonging to the user who did the installation
    • Fixed issues with downloading files to specific paths
    • Fixed BBS50 script to match newer structure of the data
    • Fixed bug where csv files were not being closed after installation
    • Fixed errors when closing the GUI
    • Fixed issue where enclosing quotes in csv files were not being respected during cross-tab restructuring
    • Fixed bug causing v1.6 to break when newer scripts were added to version.txt
    • Fixed Bioclim script to include hdr files
    • Fixed missing icon images on Windows
    • Removed unused code
    Source code(tar.gz)
    Source code(zip)
    python-retriever_1.8.2-1_all.deb(96.08 KB)
    retriever.zip(29.07 MB)
    RetrieverSetup.exe(8.32 MB)
  • v1.7.0(Oct 5, 2014)

    This is the v1.7.0 release of the EcoData Retriever.

    • Added ability to download files directly for non-tabular data
    • Added scripts to download Bioclim and Mammal Supertree data
    • Added a script for the MammalDIET database
    • Fixed bug where some nationally standardized FIA surveys were not included
    • Added check for wxpython on installation to allow non-gui installs
    • Fixed several minor issues with the Gentry script, including a missing site and a misnamed column in one file
    • Windows install now adds the retriever to the path to facilitate command line use
    • Fixed a bug preventing installation from PyPI
    • Added icons to installers
    • Fixed the retriever failing when given a script it couldn't handle
    Source code(tar.gz)
    Source code(zip)
    python-retriever_1.7.0-1_all.deb(96.21 KB)
    retriever-app.zip(17.61 MB)
    RetrieverSetup.exe(6.73 MB)
  • v1.6.0(Feb 11, 2014)
