Quickly download, clean up, and install public datasets into a database management system

Overview

Finding data is one thing. Getting it ready for analysis is another. Acquiring, cleaning, standardizing and importing publicly available data is time consuming because many datasets lack machine readable metadata and do not conform to established data structures and formats. The Data Retriever automates the first steps in the data analysis pipeline by downloading, cleaning, and standardizing datasets, and importing them into relational databases, flat files, or programming languages. The automation of this process reduces the time for a user to get most large datasets up and running by hours, and in some cases days.

Installing the Current Release

If you have Python installed you can install the current release using either pip:

pip install retriever

or conda after adding the conda-forge channel (conda config --add channels conda-forge):

conda install retriever

Depending on your system configuration this may require sudo for pip:

sudo pip install retriever

Precompiled binary installers are also available for Windows, OS X, and Ubuntu/Debian on the releases page. These do not require a Python installation.

List of Available Datasets

Installing From Source

To install the Data Retriever from source, you'll need Python 3.6.8+ with the following packages installed:

  • xlrd

The following packages are optionally needed to interact with associated database management systems:

  • PyMySQL (for MySQL)
  • sqlite3 (for SQLite)
  • psycopg2-binary (for PostgreSQL), previously psycopg2.
  • pyodbc (for MS Access - this option is only available on Windows)
  • Microsoft Access Driver (ODBC for windows)
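
Before installing from source, it can help to check which of the optional database drivers are already importable. The sketch below is only an illustration: the helper name `available_engines` and the `DRIVER_MODULES` mapping are hypothetical, and the import names (pymysql, sqlite3, psycopg2, pyodbc) are the modules those packages normally provide.

```python
import importlib.util

# Import names normally provided by the optional packages listed above.
# This mapping is an illustration, not part of the Data Retriever itself.
DRIVER_MODULES = {
    "MySQL": "pymysql",
    "SQLite": "sqlite3",
    "PostgreSQL": "psycopg2",
    "MS Access": "pyodbc",
}


def available_engines():
    """Return {engine name: True/False} depending on whether its driver imports."""
    return {
        engine: importlib.util.find_spec(module) is not None
        for engine, module in DRIVER_MODULES.items()
    }


for engine, found in available_engines().items():
    print(f"{engine}: {'driver found' if found else 'driver missing'}")
```

A missing driver only matters if you plan to use that engine; the CSV, JSON, and XML outputs need none of them.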

To install from source

Either use pip to install directly from GitHub:

pip install git+https://git@github.com/weecology/retriever.git

or:

  1. Clone the repository
  2. From the directory containing setup.py, run the following command: pip install . Depending on your system, you may need to prefix the command with sudo (i.e., sudo pip install .)

More extensive documentation for those that are interested in developing can be found here

Using the Command Line

After installing, run retriever update to download all of the available dataset scripts. To see the full list of command line options and datasets run retriever --help. The output will look like this:

usage: retriever [-h] [-v] [-q]
                 {download,install,defaults,update,new,new_json,edit_json,delete_json,ls,citation,reset,help}
                 ...

positional arguments:
  {download,install,defaults,update,new,new_json,edit_json,delete_json,ls,citation,reset,help}
                        sub-command help
    download            download raw data files for a dataset
    install             download and install dataset
    defaults            displays default options
    update              download updated versions of scripts
    new                 create a new sample retriever script
    new_json            CLI to create retriever datapackage.json script
    edit_json           CLI to edit retriever datapackage.json script
    delete_json         CLI to remove retriever datapackage.json script
    ls                  display a list all available dataset scripts
    citation            view citation
    reset               reset retriever: removes configuration settings,
                        scripts, and cached data
    help

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -q, --quiet           suppress command-line output

To install datasets, use retriever install:

usage: retriever install [-h] [--compile] [--debug]
                         {mysql,postgres,sqlite,msaccess,csv,json,xml} ...

positional arguments:
  {mysql,postgres,sqlite,msaccess,csv,json,xml}
                        engine-specific help
    mysql               MySQL
    postgres            PostgreSQL
    sqlite              SQLite
    msaccess            Microsoft Access
    csv                 CSV
    json                JSON
    xml                 XML

optional arguments:
  -h, --help            show this help message and exit
  --compile             force re-compile of script before downloading
  --debug               run in debug mode

Examples

These examples use the Iris flower dataset. More examples can be found in the Data Retriever documentation.

Using Install

retriever install -h   (gives install options)

Using a specific database engine: retriever install {Engine}

retriever install mysql -h     (gives install mysql options)
retriever install mysql --user myuser --password ******** --host localhost --port 8888 --database_name testdbase iris

To install data into a SQLite database named iris.db, use:

retriever install sqlite iris -f iris.db
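
If you want to drive these install commands from a script, one option is to assemble the argument list in Python and hand it to subprocess. This is a minimal sketch, not part of the Data Retriever: the helper names `build_install_command` and `run_install` are hypothetical, and only the long options shown in the MySQL example above are assumed to exist.

```python
import shutil
import subprocess


def build_install_command(engine, dataset, **options):
    """Return the argument list for `retriever install <engine> ... <dataset>`."""
    command = ["retriever", "install", engine]
    for name, value in options.items():
        # Each keyword becomes a long option, e.g. port=8888 -> --port 8888
        command.extend([f"--{name}", str(value)])
    command.append(dataset)
    return command


def run_install(engine, dataset, **options):
    """Run the install command, failing early if retriever is not on PATH."""
    if shutil.which("retriever") is None:
        raise RuntimeError("retriever is not installed or not on PATH")
    return subprocess.run(build_install_command(engine, dataset, **options), check=True)


print(build_install_command("mysql", "iris", user="myuser", host="localhost",
                            port=8888, database_name="testdbase"))
```

Passing a list rather than a single string avoids shell quoting problems with passwords and paths.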

Using download

retriever download -h    (gives you help options)
retriever download iris
retriever download iris --path C:\Users\Documents

Using citation

retriever citation   (citation of the retriever engine)
retriever citation iris  (citation for the iris data)

Spatial Dataset Installation

Set up Spatial support

To set up spatial support for Postgres using Postgis please refer to the spatial set-up docs.

retriever install postgres harvard-forest # Vector data
retriever install postgres bioclim # Raster data
# Install only the data of USGS elevation in the given extent
retriever install postgres usgs-elevation -b -94.98704597353938 39.027001800158615 -94.3599408119917 40.69577051867074
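
When generating the extent arguments from code, a small check catches swapped corners before the command runs. The sketch below assumes the four values after -b are ordered xmin ymin xmax ymax, which is inferred from the example above rather than taken from the documentation; the helper name `bbox_arguments` is hypothetical.

```python
def bbox_arguments(xmin, ymin, xmax, ymax):
    """Return the -b arguments for an extent, assuming xmin ymin xmax ymax order."""
    if xmin >= xmax:
        raise ValueError("xmin must be smaller than xmax")
    if ymin >= ymax:
        raise ValueError("ymin must be smaller than ymax")
    return ["-b"] + [str(value) for value in (xmin, ymin, xmax, ymax)]


# Extent taken from the usgs-elevation example above
command = ["retriever", "install", "postgres", "usgs-elevation"] + bbox_arguments(
    -94.98704597353938, 39.027001800158615, -94.3599408119917, 40.69577051867074
)
print(" ".join(command))
```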

Website

For more information see the Data Retriever website.

Acknowledgments

Development of this software was funded by the Gordon and Betty Moore Foundation's Data-Driven Discovery Initiative through Grant GBMF4563 to Ethan White and the National Science Foundation as part of a CAREER award to Ethan White.

Comments
  • [WIP] Allow consuming JSON data

    NOTE: I closed this PR by mistake. I'll re-open this. This pull request addresses the issue "Allow consuming JSON data" (#1334). Currently, we support 2 kinds of JSON datasets, with a third in progress:

    1. Where the dataset's rows are present under a certain key of the JSON. For example, refer to this example; here certain_key is data.
    2. Where the dataset's rows are present under a certain key in different parts of the JSON. For example, refer to this example; here the certain key is laureates.
    3. [WIP] Where the JSON is in the form of a list (example). The current implementation for this is commented out, since it is getting stuck in a recursion loop.
    opened by DumbMachine 37
  • Updated internal variable names to match that of datapackage #860

    Updated internal variable names to match that of datapackage spec #765 The following changes were done for the variable names

    tags -> keywords
    nulls -> missingValues
    name -> title
    shortname -> name
    
    The changes were done in the following files -
    
    retriever/lib/compile.py
    retriever/lib/datapackage.py
    retriever/lib/engine.py
    retriever/lib/parse_script_to_json.py
    retriever/lib/templates.py
    retriever/lib/tools.py
    scripts/bioclim.py
    scripts/biomass_allometry_db.py
    scripts/breed_bird_survey.py
    scripts/breed_bird_survey_50stop.py
    scripts/forest_inventory_analysis.py
    scripts/gentry_forest_transects.py
    scripts/npn.py
    scripts/plant_life_hist_eu.py
    scripts/prism_climate.py
    scripts/vertnet.py
    scripts/wood_density.py
    scripts/*.json(almost all datapackages) transition missingValues -> missing_values
    test/test_retriever.py
    retriever/__main__.py
    

    @henrykironde I have made the changes after updating it with the master branch

    Under Review and Tests 
    opened by jainamritanshu 37
  • Working on : Expansion of Spatial Data Support to the Data Retriever

    I have started looking into and working on the project: Expansion of Spatial Data Support to the Data Retriever.

    @ethanwhite @henrykironde Please let me know of any aspects in particular I should be prioritising over others.

    Would it be okay if I carried on the discussion through this Issue?

    opened by ss-is-master-chief 31
  • Test OSX .app file

    I'm in the process of trying to get the EcoData Retriever fully functional on OSX (since so many of the awesome ecological informaticsy people I know use Macs). As I've mentioned elsewhere it looks like building from source now works, at least when using homebrew (http://ecodataretriever.org/getting_started.html).

    What I'm working on now is getting the .app working so that you don't need to be comfortable in the shell (and have XCode installed) to use the Retriever. I have a version that is working on the machine that I built it on (our only Mac) and was wondering if some kind Mac folks like @karthik, @sckott, @emhart, @sarahsupp and @dfalster might have a few minutes to give it a trial run.

    The file is here: https://www.dropbox.com/s/26b1pj91mqucc0l/retriever.zip

    Basically I'm just looking for folks to unzip it, double click on it, and see if:

    1. It opens at all.
    2. You can install things successfully when setting the database management system to CSV and sqlite (these don't have any external dependencies).
    3. If you have either MySQL or PostgreSQL installed, whether it works with them. (MySQL is a bit fragile at the moment. It is currently working for most datasets, but not all, so just try a few if you get errors).
    4. Report back.

    Thanks in advance. And, yes, I wrote this issue... on a Mac.

    opened by ethanwhite 31
  • Allow consuming JSON data

    Currently we only support ingesting delimited tabular data. It is increasingly common for tabular style data to be distributed in JSON files and it would be nice to also be able to consume this. We would probably just convert it to CSV as a starting point and then process it using our standard pipeline.

    There are a few not particularly active packages for doing this, but the code to do it is simple enough that, since none of the packages seem to be widely adopted, we might be better off just writing and maintaining it ourselves.

    (no rush on this, just a thought while looking at a cool dataset that's only available in JSON: http://data.unpaywall.org/products/snapshot)
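
As a rough illustration of the JSON-to-CSV idea described above, the sketch below handles two simple layouts: rows stored under a single key, and a top-level list of rows. It is not the Retriever's implementation; the function name `json_rows_to_csv` and the `rows_key` parameter are hypothetical, and nested values are written as their string form rather than flattened.

```python
import csv
import io
import json


def json_rows_to_csv(json_text, rows_key, out_file):
    """Write the rows found in json_text to out_file as CSV; return the row count."""
    document = json.loads(json_text)
    # A top-level list is treated as the rows; otherwise look under rows_key.
    rows = document if isinstance(document, list) else document[rows_key]

    # Collect column names in first-seen order so ragged rows still line up.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    writer = csv.DictWriter(out_file, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # missing keys are written as empty cells
    return len(rows)


buffer = io.StringIO()
json_rows_to_csv('{"data": [{"a": 1, "b": 2}, {"a": 3, "c": 4}]}', "data", buffer)
print(buffer.getvalue())
```

The resulting CSV could then go through the standard delimited-data pipeline, as suggested above.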

    Feature Request 
    opened by ethanwhite 30
  • Update internals in reference to issue #765

    @ethanwhite @henrykironde I have made the following changes

    • name -> title
    • shortname -> name
    • tags -> keywords
    • nulls -> missingValues

    I wanted to ask whether I may change missingValues to missing_values, as the original is in camel case and does not follow the PEP 8 naming convention; if you allow it, I will make the change. I am still finding internal names that could be updated. I think the code is clean. Kindly go through it once, and if there are any suggestions I will start working on them.
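
One detail of the renames listed above is that name -> title and shortname -> name must be applied in a single pass; renaming sequentially would turn shortname into name and then into title. A minimal sketch, assuming a script description held in a plain dict (the helper `rename_fields` and the sample values are hypothetical):

```python
# Old key -> new key, as listed in the pull request description above
RENAMES = {
    "name": "title",
    "shortname": "name",
    "tags": "keywords",
    "nulls": "missingValues",
}


def rename_fields(script):
    """Return a copy of script with every old key renamed in one pass."""
    # Each key is looked up once against the original names, so the
    # chained renames (shortname -> name -> title) cannot cascade.
    return {RENAMES.get(key, key): value for key, value in script.items()}


old_style = {"name": "Iris Plants", "shortname": "iris", "tags": ["plants"], "nulls": [-999]}
print(rename_fields(old_style))
```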

    Changes Requested 
    opened by jainamritanshu 30
  • Removed default encoding in reference to #716

    Sir I have removed the hard coded assignments for encoding and added a field of encoding. I haven't edited the existing scripts for the existing data. Should I do them manually? Kindly review the code and tell me any changes needed for the code. @ethanwhite @henrykironde

    opened by jainamritanshu 29
  • an eBird Basic Dataset workflow

    Hey all,

    I've mostly gotten the eBird data into a PostgreSQL/PostGIS database, and I thought I'd share my code with you in case you wanted to integrate it into something more robust with EcoDataRetriever. If you know how to optimize it better, I'd love to hear what you come up with.

    If you do decide to include it, please acknowledge Matt Jones and Jim Regetz, since they helped me through this.

    Let me know if you have any questions!

    Dave

    PS the "world" data set unzips to be 50 gigabytes, so you'll probably want to work with something smaller...

    -- Data file available via http://ebird.org/ebird/data/download
    
    -- commands to extract the text file from the tarball:
       -- tar xvf ebd_relMay-2013.tar
       -- gunzip ebd_relMay-2013.txt.gz
    -- WARNING: The resulting file is almost 50 gigabytes!
    
    -- In retrospect, there's probably some premature optimization for some of these columns: if the data set changes,
    -- it might be safer to use longer varchar arguments.
    CREATE TABLE eBird (
      GLOBAL_UNIQUE_IDENTIFIER     char(50),      -- always 45-47 characters needed (so far)
      TAXONOMIC_ORDER              numeric,       -- Probably not needed
      CATEGORY                     varchar(20),   -- Probably 10 would be safe
      COMMON_NAME                  varchar(70),   -- Some hybrids have really long names
      SCIENTIFIC_NAME              varchar(70),   --  ''
      SUBSPECIES_COMMON_NAME       varchar(70),   --  ''
      SUBSPECIES_SCIENTIFIC_NAME   varchar(70),   --  ''
      OBSERVATION_COUNT            varchar(8),    -- Someone saw 1.3 million Auklets.
                                                  -- Unfortunately, it can't be an integer 
                                                  -- because some are just presence/absence
      BREEDING_BIRD_ATLAS_CODE     char(2),       -- need to confirm that these are always length 2
      AGE_SEX                      text,          -- Potentially long, but almost always blank
      COUNTRY                      varchar(50),   -- long enough for "Saint Helena, Ascension and Tristan da Cunha"
      COUNTRY_CODE                 char(2),       -- alpha-2 codes
      STATE_PROVINCE               varchar(50),   -- no idea if this is long enough? U.S. Virgin Islands may be almost 30
      SUBNATIONAL1_CODE            char(10),      -- looks standardized at 5 characters?
      COUNTY                       varchar(50),   -- who knows how long it could be
      SUBNATIONAL2_CODE            char(12),      -- looks standardized at 9 characters?
      IBA_CODE                     char(16),
      LOCALITY                     text,          -- unstructured/potentially long
      LOCALITY_ID                  char(10),      -- maximum observed so far is 8
      LOCALITY_TYPE                char(2),       -- short codes
      LATITUDE                     real,          -- Is this the appropriate level of precision?
      LONGITUDE                    real,          --    ''
      OBSERVATION_DATE             date,          -- Do I need to specify YMD somehow?
      TIME_OBSERVATIONS_STARTED    time,          -- How do I make this a time?
      TRIP_COMMENTS                text,          -- Comments are long, unstructured, 
      SPECIES_COMMENTS             text,          --    and inconsistent, but sometimes interesting
      OBSERVER_ID                  char(12),      -- max of 9 in the data I've seen so far
      FIRST_NAME                   text,          -- Already have observer IDs
      LAST_NAME                    text,          -- ''
      SAMPLING_EVENT_IDENTIFIER    char(12),      -- Probably want to index on this.
      PROTOCOL_TYPE                varchar(50),   -- Needs to be at least 30 for sure.
      PROJECT_CODE                 varchar(20),   -- Needs to be at least 10 for sure.
      DURATION_MINUTES             int,           -- bigint?
      EFFORT_DISTANCE_KM           real,          -- precision?
      EFFORT_AREA_HA               real,          -- precision?
      NUMBER_OBSERVERS             int,           -- just a small int
      ALL_SPECIES_REPORTED         int,           -- Seems to always be 1 or 0.  Maybe I could make this Boolean?
      GROUP_IDENTIFIER             varchar(10),   -- Appears to be max of 7 or 8
      APPROVED                     int,           -- Can be Boolean?
      REVIEWED                     int,           -- Can be Boolean?
      REASON                       char(17),      -- May need to be longer if data set includes unvetted data
      X                            text           -- Blank
    );
    
    
    COPY eBird
      FROM '/home/dharris/eBird/ebd_relMay-2013.txt' 
      HEADER
      CSV
      QUOTE E'\5'       -- The file has unbalanced quotes. Using an obscure character as a quote mark instead.
      DELIMITER E'\t';
    
    
    -- Note: it's probably slightly faster to load postgis and add a geographic column first (see below).
    -- I'm keeping the original ordering in this document for accuracy's sake.
    CREATE INDEX ON eBird (sampling_event_identifier);
    
    -- Example query: SELECT SCIENTIFIC_NAME FROM eBird WHERE SAMPLING_EVENT_IDENTIFIER = 'S9605852';
    -- Example query: SELECT count(SCIENTIFIC_NAME) FROM eBird WHERE SAMPLING_EVENT_IDENTIFIER = 'S9605852';
    
    
    CREATE EXTENSION postgis;
    ALTER TABLE eBird ADD COLUMN geog geography(POINT,4326); -- I hope 4326 is correct...
    UPDATE eBird SET geog = ST_GeogFromText('POINT(' || longitude || ' ' ||  latitude || ')');
    CREATE INDEX geog_index ON eBird USING GIST (geog); 
    
    -- Example query: find all the species within 1000 meters of my dorm:
    -- SELECT SCIENTIFIC_NAME FROM eBird WHERE ST_DWithin(geog, ST_GeographyFromText('SRID=4326;POINT(-119.6972 34.4208)'), 1000);
    
    -- Slightly fancier version:
    -- SELECT DISTINCT SCIENTIFIC_NAME, COMMON_NAME FROM eBird 
    --   WHERE ST_DWithin(geog, ST_GeographyFromText('SRID=4326;POINT(-119.855385 34.417239)'), 1000) 
    --   ORDER BY SCIENTIFIC_NAME;
    

    (Edited to add some amazing PostGIS queries and some better comments, etc.)

    PS: After poking around a bit more, it looks like I should have used doubles rather than reals to store lat/lon. I had misread the documentation about how much precision was used for reals.

    opened by davharris 28
  • Gracefully handle failed downloads

    It is not uncommon for a data source to go down (e.g. #902) or for a download to fail for some reason (e.g., #863). We should catch these failures, avoid caching whatever comes down (sometimes a corrupt file, sometimes a 404 HTML page), and tell the user that the source appears to be down, that they should try again, and that they should let us know if it still fails later.
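
One way to sketch this behavior is to download into a temporary file and move it into the cache only after the response has passed some basic checks. The code below is an illustration under stated assumptions, not the Retriever's download engine: the names `download_to_cache` and `DownloadFailed` are hypothetical, it uses the standard library's urllib rather than any particular HTTP package, it reads the whole response into memory, and treating an unexpected HTML response as a failure is only a heuristic.

```python
import os
import tempfile
import urllib.request


class DownloadFailed(Exception):
    """Raised when a source looks unavailable; nothing is written to the cache."""


def download_to_cache(url, cache_path, opener=urllib.request.urlopen):
    """Fetch url and store it at cache_path only if the response looks valid."""
    try:
        with opener(url) as response:
            content_type = response.headers.get("Content-Type", "")
            body = response.read()
    except OSError as err:  # URLError and HTTPError (e.g. 404) are OSError subclasses
        raise DownloadFailed(
            f"{url} appears to be down ({err}). Please try again later, "
            "and let us know if it keeps failing."
        ) from err

    # Heuristic: a data file that arrives as HTML is probably an error page.
    if "text/html" in content_type:
        raise DownloadFailed(f"{url} returned an HTML page instead of data.")

    # Write next to the target first, then rename, so a partial file is never cached.
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(body)
    os.replace(tmp_path, cache_path)
    return cache_path
```

The `opener` parameter exists only so the function can be exercised without network access; by default it is `urllib.request.urlopen`.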

    opened by ethanwhite 26
  • Updated internal variable names to match that of datapackage

    Updated internal variable names to match that of datapackage spec #765 The following changes were done for the variable names

    tags -> keywords
    nulls -> missingValues
    name -> title
    shortname -> name
    
    The changes were done in the following files -
    
    retriever/lib/compile.py
    retriever/lib/datapackage.py
    retriever/lib/engine.py
    retriever/lib/parse_script_to_json.py
    retriever/lib/templates.py
    retriever/lib/tools.py
    scripts/bioclim.py
    scripts/biomass_allometry_db.py
    scripts/breed_bird_survey.py
    scripts/breed_bird_survey_50stop.py
    scripts/forest_inventory_analysis.py
    scripts/gentry_forest_transects.py
    scripts/npn.py
    scripts/plant_life_hist_eu.py
    scripts/prism_climate.py
    scripts/vertnet.py
    scripts/wood_density.py
    scripts/*.json(almost all datapackages) transition missingValues -> missing_values
    test/test_retriever.py
    retriever/__main__.py
    
    Changes Requested 
    opened by henrykironde 25
  • Add fetch to python Interface

    Hi @henrykironde, sorry, I was off schedule the last few days, so I couldn't work on this issue as I told you. This should solve #1019, but is this the right place for the method?

    Changes Requested 
    opened by adhaamehab 23
  • hacktoberfest guide

    For contributors who want to take part in the hacktoberfest, please check the issue lists from the various projects

    Retriever: https://github.com/weecology/retriever/issues
    Retriever-recipes: https://github.com/weecology/retriever-recipes/issues
    Rdataretriever: https://github.com/ropensci/rdataretriever/issues
    Retriever.jl: https://github.com/weecology/Retriever.jl/issues

    opened by henrykironde 0
  • Downloading fails for files with no Content-Disposition

    Example packages:
    1: Package file: https://github.com/weecology/retriever-recipes/blob/main/scripts/usda_agriculture_plants_database.py Sample url: https://plants.sc.egov.usda.gov/csvdownload?plantLst=plantCompleteList

    2: package file: https://github.com/weecology/retriever-recipes/blob/main/scripts/aquatic_animal_excretion.py url: https://esajournals.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fecy.1792&file=ecy1792-sup-0001-DataS1.zip
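
A possible fallback, sketched below, is to take the file name from the Content-Disposition header when it is present and otherwise derive one from the URL path. This is only an illustration of the idea, not the Retriever's code: the function name `guess_filename` and the default name are hypothetical, and for URLs like the two above the path-based fallback gives a name without an extension.

```python
import posixpath
from email.message import Message
from urllib.parse import unquote, urlparse


def guess_filename(url, content_disposition=None, default="download"):
    """Pick a file name from the header if available, else from the URL path."""
    if content_disposition:
        # Let the standard library parse the filename parameter of the header.
        message = Message()
        message["Content-Disposition"] = content_disposition
        filename = message.get_filename()
        if filename:
            return posixpath.basename(filename)

    # No usable header: fall back to the last segment of the URL path.
    name = posixpath.basename(unquote(urlparse(url).path))
    return name or default


print(guess_filename("https://plants.sc.egov.usda.gov/csvdownload?plantLst=plantCompleteList"))
```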

    opened by henrykironde 1
  •  display_all_rdatasets_names in rdatasets takes a list of package_name

    display_all_rdatasets_names takes a list of package names instead of a single package name string as its parameter

    >>> display_all_rdataset_names("aer")
    List of all available Rdatasets in packages: aer
    No package named 'a' found in Rdatasets
    No package named 'e' found in Rdatasets
    No package named 'r' found in Rdatasets
    
    >>> display_all_rdataset_names(["aer"])
    List of all available Rdatasets in packages: ['aer']
    Package: aer              Dataset: affairs                   Script Name: rdataset-aer-affairs
    Package: aer              Dataset: argentinacpi              Script Name: rdataset-aer-argentinacpi
    Package: aer              Dataset: bankwages                 Script Name: rdataset-aer-bankwages
    Package: aer              Dataset: benderlyzwick             Script Name: rdataset-aer-benderlyzwick
    Package: aer              Dataset: bondyield                 Script Name: rdataset-aer-bondyield
    Package: aer              Dataset: cartelstability           Script Name: rdataset-aer-cartelstability
    Package: aer              Dataset: caschools                 Script Name: rdataset-aer-caschools
    Package: aer              Dataset: chinaincome               Script Name: rdataset-aer-chinaincome
    Package: aer              Dataset: cigarettesb               Script Name: rdataset-aer-cigarettesb
    Package: aer              Dataset: cigarettessw              Script Name: rdataset-aer-cigarettessw
    Package: aer              Dataset: collegedistance           Script Name: rdataset-aer-collegedistance
    Package: aer              Dataset: consumergood              Script Name: rdataset-aer-consumergood
    Package: aer              Dataset: cps1985                   Script Name: rdataset-aer-cps1985
    Package: aer              Dataset: cps1988                   Script Name: rdataset-aer-cps1988
    ....
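
The first call above fails because iterating over the string "aer" yields its characters one at a time. A common way to accept both forms is to wrap a lone string in a list before looping; the helper below is a hypothetical sketch of that normalization, not the actual fix in the Retriever.

```python
def as_package_list(package_name):
    """Return a list of package names whether given one string or an iterable."""
    if isinstance(package_name, str):
        # A bare string would otherwise be iterated character by character.
        return [package_name]
    return list(package_name)


print(as_package_list("aer"))
print(as_package_list(["aer", "mass"]))
```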
    opened by Nageshbansal 1
  • not able to use gdal==3.3.2 while working with ".shp" files

    NOTES

    Expected behavior and actual behavior.

    With GDAL 3.3.2 installed, importing ogr in a script that deals with ".shp" files fails, but if I downgrade GDAL to 3.0.2, ogr imports and the script runs successfully.

    Operating system

    Ubuntu 20.04 bit

    GDAL version and provenance

    GDAL 3.3.2 version from ubuntugis-unstable PPA

    opened by Nageshbansal 0
  • Make sure that the R API datasets are run on the retrieverdash

    We have added some APIs to the retriever. Some of the APIs, like Tidycensus, can be run and tested on the retriever dashboard.

    You can clone the retrieverdash project and test locally using the developer docs for the dashboard https://retrieverdash.readthedocs.io/developer.html#setting-up-locally.

    When working locally, first you will need to have the APIs working well on the retriever. Use the DEV LIST in the retriever dashboard to test only the required scripts.

    opened by henrykironde 0
Releases(v3.1.0)
  • v3.1.0(Apr 26, 2022)

    v3.1.0

    Major changes

    • Remove Travis and use GitHub actions
    • Improve autocreate script template creation tool
    • Update Server setup docs
    • Change default branch from master to main
    • Update Kaggle API function
    • Add Anaconda badges
    • Update BBS breed bird survey
    • ADD hdf5 to CSV files conversion test
    • ADD HDF5 engine
    • XML to CSV conversion test
    • JSON to CSV function with tests
    • SQLite to CSV files conversion test
    • Geojson to CSV conversion test script
    • Added tidycensus dataset
    • improve Dockerfile and automate Docker push to the registry
    • Add support for clipping images
    • Add Socrata API
    • Added RDatasets API
    • Add auto publish to testPyPi and PyPi

    Source code(tar.gz)
    Source code(zip)
  • v3.0.0(Jul 16, 2020)

    v3.0.0

    Major changes

    • Add provenance support to the Data Retriever
    • Use utf-8 as default
    • Move scripts from Retriever to retriever-recipes repository
    • Adapt google code style and add linters, use yapf.
    • Test linters
    • Extend CSV field size limit
    • Improve output when connection is not made
    • Add version to the interface
    • Prompt user if a newer version of script is available
    • Add all the recipes datasets
    • Add test for installation of committed dataset
    • Add function to commit dataset

    Minor changes

    • Improve "argcomplete-command"
    • Add NUMFOCUS logo in README

    Source code(tar.gz)
    Source code(zip)
  • v2.4.0(Jun 10, 2019)

  • v2.3.0(May 1, 2019)

    Retriever v2.3.0

    Major changes

    • Change Psycopg2 to psycopg2-binary
    • Add Spatial data testing on Docker
    • Add option for pretty json
    • keep order of fetched tables and order of processing resources
    • Add reset to specific dataset and script function
    • Use tqdm 4.30.0
    • Install data into custom directory using data_dir option
    • Download data into custom directory using sub_dir

    Minor changes

    • Add tests for reset script
    • Add smaller samples of GIS data for testing
    • Reactivate MySQL tests on Travis
    • Allow custom arguments for psql
    • Add docs and examples for Postgis support
    • Change testdb name to testdb_retriever
    • Improve Pypi retriever description
    • Update documentation for passwordless setup of Postgres on Windows
    • Setting up infrastructure for automating script creation

    New datasets

    • USA ecoregions, ecoregions-us
    • LTREB Prairie-forest ecotone of eastern Kansas/Foster Lab dataset
    • Sonoran Desert, sonoran-desert
    • Adding Acton Lake dataset acton-lake

    Dataset changes

    • MammalSuperTree.py to mammal_super_tree.py
    • lakecats_finaltables.json to lakecats_final_tables
    • harvard_forests.json to harvard_forest.json
    • macroalgal_communities to macroalgal-communities

    Source code(tar.gz)
    Source code(zip)
    mac.zip(77.86 MB)
    python3-retriever_2.3.0-1_all.deb(43.29 KB)
    RetrieverSetup.exe(22.84 MB)
  • v.2.2.0(Nov 6, 2018)

    Major changes

    • Using requests package to fetch data.
    • Add postgis, a Spatial support for postgres.
    • Update ls, include more details about the scripts.
    • update license lookup for datasets
    • Update keywords lookup for datasets
    • Use tqdm for all progress tracking.
    • Changed all "-" in JSON files to "_"

    Minor changes

    • Documentation refinement.
    • Connect to MySQL using preferred encoding.
    • License search and keyword search added.
    • Conda_Forge docs
    • Add Zenodo badge to link to archive
    • Add test for extracting data

    New datasets

    • Add Noaa Fisheries trade, noaa-fisheries-trade.
    • Add Fishery Statistical Collections data, fao-global-capture-product.
    • Add bupa liver disorders dataset, bupa-liver-disorders.
    • Add GLOBI interactions data, globi-interaction.
    • Addition of the National Aquatic Resource Surveys (NARS), nla.
    • Addition of partners in flight dataset, partners-in-flight.
    • Add the ND-GAIN Country Index, nd-gain.
    • Add world GDP in current US Dollars, dgp.
    • Add airports dataset, airports.
    • Repair aquatic animal excretion.
    • Add Biotime dataset.
    • Add lakecats final tables dataset, lakecats-final-tables.
    • Add harvard forests data, harvard forests.
    • Add USGS elevation data, usgs-elevation.

    Source code(tar.gz)
    Source code(zip)
    python-retriever_2.2.0-1_all.deb(38.28 KB)
    retriever-2.2.0.tar.gz(55.76 KB)
    retriever.app.zip(65.22 MB)
    RetrieverSetup.exe(28.16 MB)
  • v2.1.0(Oct 27, 2017)

    v2.1.0

    Major changes

    • Add Python interface
    • Add Retriever to conda
    • Auto complete of Retriever commands on Unix systems

    Minor changes

    • Add license to datasets
    • Change the structure of raw data from string to list
    • Add testing on any modified dataset
    • Improve memory usage in cross-tab processing
    • Add capability for datasets to use custom Encoding
    • Use new Python interface for regression testing
    • Use Frictionless Data specification terminology for internals

    New datasets

    • Add ant dataset and weather data to the portal dataset
    • NYC TreesCount
    • PREDICTS
    • aquatic_animal_excretion
    • biodiversity_response
    • bird_migration_data
    • chytr_disease_distr
    • croche_vegetation_data
    • dicerandra_frutescens
    • flensburg_food_web
    • great_basin_mammal_abundance
    • macroalgal_communities
    • macrocystis_variation
    • marine_recruitment_data
    • mediter_basin_plant_traits
    • nematode_traits
    • ngreatplains-flowering-dates
    • portal-dev
    • portal
    • predator_prey_body_ratio
    • predicts
    • socean_diet_data
    • species_exctinction_rates
    • streamflow_conditions
    • tree_canopy_geometries
    • turtle_offspring_nesting
    • Add vertnet individual datasets: vertnet_amphibians, vertnet_birds, vertnet_fishes, vertnet_mammals, vertnet_reptiles
    Source code(tar.gz)
    Source code(zip)
    retriever.app.zip(10.16 MB)
    RetrieverSetup.exe(11.64 MB)
    retriever_2.1.0.deb(33.99 KB)
  • v2.0.0(Feb 24, 2017)

    v2.0.0

    Major changes

    • Add Python 3 support, python 2/3 compatibility
    • Add json and xml as output formats
    • Switch to using the frictionless data datapackage json standard. This is a backwards-incompatible change, as the form of the dataset description files the retriever uses to describe the location and processing of simple datasets has changed.
    • Add CLI for creating, editing, deleting datapackage.json scripts
    • Broaden scope to include non-ecological data and rename to Data Retriever
    • Major expansion of documentation and move documentation to Read the Docs
    • Add developer documentation
    • Remove the GUI
    • Use csv module for reading of raw data to improve handling of newlines in fields
    • Major expansion of integration testing
    • Refactor regression testing to produce a single hash for a dataset regardless of output format
    • Add continuous integration testing for Windows

    Minor changes

    • Use pyinstaller for creating exe for windows and app for mac and remove py2app
    • Use 3 level semantic versioning for both scripts and core code
    • Rename datasets with more descriptive names
    • Add a retriever minimum version for each dataset
    • Rename dataset description files to follow python modules conventions
    • Switch to py.test from nose
    • Expand unit testing
    • Add version requirements for sqlite and postgresql
    • Default to latin encoding
    • Improve UI for updating user on downloading and processing progress

    New datasets

    • Added machine learning datasets from UC Irvine's machine learning data sets
    Source code(tar.gz)
    Source code(zip)
    python3-retriever_2.0.0-1_all.deb(33.13 KB)
    retriever-OSX.zip(10.41 MB)
    RetrieverSetup.exe(11.16 MB)
  • v1.8.3(Feb 12, 2016)

    v1.8.3

    • Fixed regression in GUI

    v1.8.2

    • Improved cleaning of column names
    • Fixed thread bug causing Gentry dataset to hang when installed via GUI
    • Removed support for 32-bit only Macs in binaries
    • Removed unused code

    v1.8.0

    • Added scripts for 21 new datasets: leaf herbivory, biomass allocation, community dynamics of shortgrass steppe plants, mammal and bird foraging attributes, tree demography in India, small mammal community dynamics in Chile, community dynamics of Sonoran Desert perennials, biovolumes of freshwater phytoplankton, plant dynamics in Montana, Antarctic Site Inventory breeding bird survey, community abundance data compiled from the literature, spatio-temporal population data for butterflies, fish parasite host ecological characteristics, eBird, Global Wood Density Database, multiscale community data on vascular plants in North Carolina, vertebrate home range sizes, PRISM climate data, Amniote life history database, woody plant Biomass And Allometry Database, VertNet data on amphibians, birds, fishes, mammals, reptiles
    • Added reset command to allow resetting database configuration settings, scripts, and cached raw data
    • Added Dockerfile for building docker containers of each version of the software for reproducibility
    • Added support for wxPython 3.0
    • Added support for tar and gz archives
    • Added support for archive files whose contents don't fit in memory
    • Added checks for and use of system proxies
    • Added ability to download archives from web services
    • Added tests for regressions in download engine
    • Added citation command to provide information on citing datasets
    • Improved column name cleanup
    • Improved whitespace consistency
    • Improved handling of Excel files
    • Improved function documentation
    • Improved unit testing and added coverage analysis
    • Improved the sample script by adding a url field
    • Improved script loading behavior by only loading a script the first time it is discovered
    • Improved operating system identification
    • Improved download engine by allowing it to maintain archive and subdirectory structure (particularly relevant for spatial data)
    • Improved cross-platform directory and line ending handling
    • Improved testing across platforms
    • Improved checking for updated scripts so that scripts are only downloaded if the current version isn't available
    • Improved metadata in setup.py
    • Fixed type issues in Portal dataset
    • Fixed GUI always downloading scripts instead of checking if it needed to
    • Fixed bug that sometimes resulted in .retriever directories not belonging to the user who did the installation
    • Fixed issues with downloading files to specific paths
    • Fixed BBS50 script to match newer structure of the data
    • Fixed bug where csv files were not being closed after installation
    • Fixed errors when closing the GUI
    • Fixed issue where enclosing quotes in csv files were not being respected during cross-tab restructuring
    • Fixed bug causing v1.6 to break when newer scripts were added to version.txt
    • Fixed Bioclim script to include hdr files
    • Fixed missing icon images on Windows
    • Removed unused code
    Source code(tar.gz)
    Source code(zip)
    python-retriever_1.8.3-1_all.deb(96.11 KB)
    retriever.zip(29.07 MB)
    RetrieverSetup.exe(8.32 MB)
  • v1.8.2(Feb 12, 2016)

    This is the 1.8 release of the EcoData Retriever.

    v1.8.2

    • Improved cleaning of column names
    • Fixed thread bug causing Gentry dataset to hang when installed via GUI
    • Removed support for 32-bit only Macs in binaries
    • Removed unused code

    v1.8.0

    • Added scripts for 21 new datasets: leaf herbivory, biomass allocation, community dynamics of shortgrass steppe plants, mammal and bird foraging attributes, tree demography in India, small mammal community dynamics in Chile, community dynamics of Sonoran Desert perennials, biovolumes of freshwater phytoplankton, plant dynamics in Montana, Antarctic Site Inventory breeding bird survey, community abundance data compiled from the literature, spatio-temporal population data for butterflies, fish parasite host ecological characteristics, eBird, Global Wood Density Database, multiscale community data on vascular plants in North Carolina, vertebrate home range sizes, PRISM climate data, Amniote life history database, woody plant Biomass And Allometry Database, VertNet data on amphibians, birds, fishes, mammals, reptiles
    • Added reset command to allow resetting database configuration settings, scripts, and cached raw data
    • Added Dockerfile for building docker containers of each version of the software for reproducibility
    • Added support for wxPython 3.0
    • Added support for tar and gz archives
    • Added support for archive files whose contents don't fit in memory
    • Added checks for and use of system proxies
    • Added ability to download archives from web services
    • Added tests for regressions in download engine
    • Added citation command to provide information on citing datasets
    • Improved column name cleanup
    • Improved whitespace consistency
    • Improved handling of Excel files
    • Improved function documentation
    • Improved unit testing and added coverage analysis
    • Improved the sample script by adding a url field
    • Improved script loading behavior by only loading a script the first time it is discovered
    • Improved operating system identification
    • Improved download engine by allowing it to maintain archive and subdirectory structure (particularly relevant for spatial data)
    • Improved cross-platform directory and line ending handling
    • Improved testing across platforms
    • Improved checking for updated scripts so that scripts are only downloaded if the current version isn't available
    • Improved metadata in setup.py
    • Fixed type issues in Portal dataset
    • Fixed GUI always downloading scripts instead of checking if it needed to
    • Fixed bug that sometimes resulted in .retriever directories not belonging to the user who did the installation
    • Fixed issues with downloading files to specific paths
    • Fixed BBS50 script to match newer structure of the data
    • Fixed bug where csv files were not being closed after installation
    • Fixed errors when closing the GUI
    • Fixed issue where enclosing quotes in csv files were not being respected during cross-tab restructuring
    • Fixed bug causing v1.6 to break when newer scripts were added to version.txt
    • Fixed Bioclim script to include hdr files
    • Fixed missing icon images on Windows
    • Removed unused code
    Source code(tar.gz)
    Source code(zip)
    python-retriever_1.8.2-1_all.deb(96.08 KB)
    retriever.zip(29.07 MB)
    RetrieverSetup.exe(8.32 MB)
  • v1.7.0(Oct 5, 2014)

    This is the v1.7.0 release of the EcoData Retriever.

    • Added ability to download files directly for non-tabular data
    • Added scripts to download Bioclim and Mammal Supertree data
    • Added a script for the MammalDIET database
    • Fixed bug where some nationally standardized FIA surveys were not included
    • Added check for wxpython on installation to allow non-gui installs
    • Fixed several minor issues with Gentry script including a missing site and a column in one file that was misnamed
    • Windows install now adds the retriever to the path to facilitate command line use
    • Fixed a bug preventing installation from PyPI
    • Added icons to installers
    • Fixed the retriever failing when given a script it couldn't handle
    Source code(tar.gz)
    Source code(zip)
    python-retriever_1.7.0-1_all.deb(96.21 KB)
    retriever-app.zip(17.61 MB)
    RetrieverSetup.exe(6.73 MB)
  • v1.6.0(Feb 11, 2014)
