AWS Data Wrangler

Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatch Logs, DynamoDB, EMR, Secrets Manager, PostgreSQL, MySQL, SQL Server and S3 (Parquet, CSV, JSON and Excel).

Overview

An AWS Professional Service open source initiative | [email protected]


Source   Installation Command
PyPI     pip install awswrangler
Conda    conda install -c conda-forge awswrangler

⚠️ For platforms without PyArrow 3 support (e.g. EMR, Glue PySpark Job):
➡️ pip install pyarrow==2 awswrangler


Quick Start

Installation command: pip install awswrangler

⚠️ For platforms without PyArrow 3 support (e.g. EMR, Glue PySpark Job):
➡️ pip install pyarrow==2 awswrangler

import awswrangler as wr
import pandas as pd
from datetime import datetime

df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})

# Storing data on the Data Lake
wr.s3.to_parquet(
    df=df,
    path="s3://bucket/dataset/",
    dataset=True,
    database="my_db",
    table="my_table"
)

# Retrieving the data directly from Amazon S3
df = wr.s3.read_parquet("s3://bucket/dataset/", dataset=True)

# Retrieving the data from Amazon Athena
df = wr.athena.read_sql_query("SELECT * FROM my_table", database="my_db")

# Get a Redshift connection from the Glue Catalog and retrieve data from Redshift Spectrum
con = wr.redshift.connect("my-glue-connection")
df = wr.redshift.read_sql_query("SELECT * FROM external_schema.my_table", con=con)
con.close()

# Amazon Timestream Write
df = pd.DataFrame({
    "time": [datetime.now(), datetime.now()],
    "my_dimension": ["foo", "boo"],
    "measure": [1.0, 1.1],
})
rejected_records = wr.timestream.write(
    df,
    database="sampleDB",
    table="sampleTable",
    time_col="time",
    measure_col="measure",
    dimensions_cols=["my_dimension"],
)

# Amazon Timestream Query
wr.timestream.query("""
SELECT time, measure_value::double, my_dimension
FROM "sampleDB"."sampleTable" ORDER BY time DESC LIMIT 3
""")

Read The Docs

Community Resources

Please send a Pull Request with your resource reference and @githubhandle.

Logging

Examples of enabling internal logging:

import logging
logging.basicConfig(level=logging.INFO, format="[%(name)s][%(funcName)s] %(message)s")
logging.getLogger("awswrangler").setLevel(logging.DEBUG)
logging.getLogger("botocore.credentials").setLevel(logging.CRITICAL)

Inside AWS Lambda:

import logging
logging.getLogger("awswrangler").setLevel(logging.DEBUG)

Who uses AWS Data Wrangler?

Knowing which companies are using this library is important to help prioritize the project internally.

Please send a Pull Request with your company name and @githubhandle if you wish.

What is Amazon SageMaker Data Wrangler?

Amazon SageMaker Data Wrangler is a new SageMaker Studio feature that has a similar name but a different purpose from the AWS Data Wrangler open source project.

  • AWS Data Wrangler is open source, runs anywhere, and is focused on code.

  • Amazon SageMaker Data Wrangler is specific for the SageMaker Studio environment and is focused on a visual interface.

Comments
  • Enable Athena and Redshift tests, and address errors

    Feature or Bugfix

    • Feature

    Detail

    • Athena tests weren't enabled for the distributed mode

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    opened by LeonLuttenberger 64
  • Add tests for Glue Ray jobs

    Feature or Bugfix

    • Feature

    Detail

    • Added a CloudFormation stack which creates the Glue Ray job(s)
    • Created a load test which triggers an example Glue job and checks for successful and timely execution
    • Wrote a bash script which packages the working version of Wrangler and uploads it to S3. This can then be loaded by the Glue job so that we test the working version of Wrangler rather than the one pre-packaged into Glue.
      • This script will need to be executed from the CodeBuild job so that the working version of Wrangler is uploaded to S3 before execution

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    opened by LeonLuttenberger 43
  • distributed s3 write text

    Feature or Bugfix

    • Feature

    Detail

    • Adding distributed versions of s3.write_csv and s3.write_json

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    feature 
    opened by LeonLuttenberger 40
  • Load Testing Benchmark Analytics

    • Write load test results to a Parquet dataset stored in internal S3.
    • ToDo: Determine whether to restrict to just default branch (i.e. release-3.0.0) or not.

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    opened by malachi-constant 36
  • Timestream write ray support

    Feature or Bugfix

    • Feature
    • Refactoring

    Detail

    • Ray support for Timestream write
    • num_threads argument changed to use_threads for consistency with the rest of awswrangler, with support for os.cpu_count() (see the sketch below)
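
    A minimal sketch of the renamed argument, mirroring the Quick Start Timestream example (values illustrative):

    rejected_records = wr.timestream.write(
        df,
        database="sampleDB",
        table="sampleTable",
        time_col="time",
        measure_col="measure",
        dimensions_cols=["my_dimension"],
        use_threads=True,  # formerly num_threads; True derives the thread count from os.cpu_count()
    )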

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    opened by cnfait 36
  • Load Test Benchmarking

    • Add custom metric fixture
    • Add logic to publish elapsed_time per test to custom metric
    • Environment variable controlling whether to opt in to publishing.
      • Data should only be published when running against release-3.0.0
    • Metric data can be organized into dashboards as seen fit.

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    opened by malachi-constant 32
  • (feat): Refactor to distribute s3.read_parquet

    Feature or Bugfix

    • Feature
    • Refactoring

    Detail

    1. Refactor wr.s3.read_parquet and other methods in the _read_parquet S3 module to reduce technical debt:
       • Leverage a thread pool executor when possible
       • Simplify chunk generation logic
       • Reduce the number of conditionals by generalising edge cases
       • Improve documentation
    2. Distribute both read_file_metadata and read_parquet calls:
       • read_file_metadata is distributed as a @ray_remote method via the executor
       • read_parquet is distributed using a custom datasource and the read_datasource Ray public API

    Testing

    • Standard tests are passing with minimal changes to the tests
    • Two tests are added to the load_test (simple and partitioned case)

    Related Issue

    • #1490

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    major release feature 
    opened by jaidisido 27
  • (refactor): Make room for additional distributed engines

    Feature or Bugfix

    • Refactoring

    Detail

    Currently, the codebase assumes that there is a single distributed execution engine, referred to with the distributed keyword. This is highly restrictive, as it closes the door on adding new execution engines (e.g. pyspark, dask...) in the future.

    A major change in this PR is splitting the distributed dependency installation and configuration into two (modin AND ray instead of distributed only). I believe this has two benefits: 1) it is explicit, i.e. the user knows exactly what they are installing; 2) it is flexible, allowing more combinations in the future such as modin on dask or mars on ray.

    This change includes:

    • Modify the extra dependency installation from pip install awswrangler['distributed'] to pip install awswrangler['modin', 'ray']
    • Modify the configuration to use two items (execution_engine and memory_format); see the sketch after this list
    • Modify the conditionals across the codebase as a result
    • Move the distributed modules under the subdirectory distributed/ray
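
    A minimal sketch of what this split looks like from the user's side, assuming the two configuration items can be set through environment variables; the variable names are illustrative, derived from the PR description rather than a confirmed API:

    # First: pip install "awswrangler[modin,ray]"  (replaces awswrangler[distributed])
    import os

    # Choose the execution engine and the in-memory format independently
    # (WR_EXECUTION_ENGINE / WR_MEMORY_FORMAT are illustrative names):
    os.environ["WR_EXECUTION_ENGINE"] = "ray"    # could later be e.g. "dask"
    os.environ["WR_MEMORY_FORMAT"] = "modin"     # could later be e.g. "mars"

    import awswrangler as wr  # assumed to pick up the configuration on import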

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    enhancement major release dependencies 
    opened by jaidisido 26
  • (feat): Add Amazon Neptune support 🚀

    Description of changes: First draft of what a Neptune interface might look like.

    I did have an outstanding question, though, on the naming of the write functions. There seem to be several conventions (put, to_sql, index, etc.) that different services have used based on how they work. Is there a preferred naming convention we would like to follow here?

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    opened by bechbd 25
  • Ray Load Tests CDK Stack and Instructions for Load Testing

    Feature or Bugfix

    • Load Testing Documentation

    Detail

    • Ray load testing documentation
    • Ray CDK stack for creating prerequisites for launching ray clusters in aws

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    documentation 
    opened by malachi-constant 24
  • Distributed s3 delete objects

    Feature or Bugfix

    • Refactor s3.delete_objects to run in a distributed fashion.

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    enhancement 
    opened by malachi-constant 24
  • (feat) opensearch serverless

    Feature or Bugfix

    • Feature

    Detail

    • Update the existing client to support serverless
    • Add wr.opensearch.create_collection (see the usage sketch after this list)
    • Add helpers to generate default encryption and network policies for collections
    • Update tests to run against serverless OpenSearch
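
    A minimal usage sketch of the API described above; the collection name, endpoint, and parameters are illustrative assumptions, not confirmed signatures:

    import awswrangler as wr
    import pandas as pd

    df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})

    # Create a serverless collection (name illustrative):
    wr.opensearch.create_collection(name="my-collection")

    # Connect to the collection endpoint and index the DataFrame (host illustrative):
    client = wr.opensearch.connect(host="my-collection.us-east-1.aoss.amazonaws.com")
    wr.opensearch.index_df(client, df=df, index="my-index")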

    Relates

    • #1917

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    feature 
    opened by kukushking 3
  • I am getting ValueError: I/O operation on closed file

    I am getting ValueError: I/O operation on closed file on the call below. My path is s3://bucket/file_name.json. Kindly suggest: is there any process to open and read lines explicitly?

    wr.opensearch.index_json(
        client,
        path=path,  # path can be s3 or local
        index="sf_restaurants_inspections_dedup",
        id_keys=["inspection_id"],  # can be multiple fields; applicable to all index_* functions
    )

    opened by deeproker 0
  • Add integration with OpenSearch Serverless

    Is your feature request related to a problem? Please describe. Given that Amazon OpenSearch Service now has OpenSearch Serverless in preview, it would be nice if the AWS SDK for pandas supported OpenSearch Serverless just like it supports OpenSearch.

    Describe the solution you'd like: The AWS SDK for pandas should integrate with OpenSearch Serverless the way it does with OpenSearch, acknowledging that some of its dependencies may need to support OpenSearch Serverless first.

    Describe alternatives you've considered: N/A

    Additional context: The AWS SDK for pandas should be able to

    • Initialize collections in OpenSearch Serverless
    • index data to collections
    • search data in collections
    • delete data in collections

    Similar to how it supports AWS OpenSearch https://github.com/aws/aws-sdk-pandas/blob/main/tutorials/031%20-%20OpenSearch.ipynb

    P.S. Please do not attach files as it's considered a security risk. Add code snippets directly in the message body as much as possible.

    feature 
    opened by RobotCharlie 2
  • (poc) mutation testing

    POC of using mutation testing to improve coverage.

    • Added an example workflow to mutate S3 list module
    • Runs mocked tests against the mutants
    • Generates console and HTML reports

    Note: we will probably not need any workflows to use this concept; this is merely an example to share with the team.

    Proper mutation testing workflow is described here.

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    opened by kukushking 1
  • pandas FutureWarning in to_parquet with length-1 partition_cols argument

    Describe the bug

    When writing a parquet dataset via to_parquet and setting the partition_cols argument as a length-1 list (to just partition on a single column), I get the following warning:

    .../awswrangler/s3/_write_dataset.py:92: FutureWarning: In a future version of pandas, a length 1 tuple will be returned when iterating over a groupby with a grouper equal to a list of length 1. Don't supply a list with a single grouper to avoid this warning.
      for keys, subgroup in df.groupby(by=partition_cols, observed=True):

    How to Reproduce

    awswrangler version: 2.18.0
    pandas version: 1.5.1

    from awswrangler.s3 import to_parquet
    import pandas as pd
    
    df = pd.DataFrame(data={'col1':[1,2,2,3], 'col2':['a','b','c','d']})
    to_parquet(df, 's3://my-bucket/dataset/', dataset=True, partition_cols = ['col1'])
    

    Expected behavior

    No warning should be given, since awswrangler should properly call pandas groupby when given a single column as the partition column. I suggest allowing the partition_cols argument to be either a list of strings or a single string.
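
    A minimal sketch of the suggested normalization (a hypothetical helper, not awswrangler code) that would avoid passing a length-1 list as the groupby grouper:

    from typing import List, Optional, Union

    def normalize_partition_cols(
        partition_cols: Optional[Union[str, List[str]]],
    ) -> Optional[Union[str, List[str]]]:
        # Hypothetical helper: accept a single column name or a list of names.
        if partition_cols is None:
            return None
        if isinstance(partition_cols, str):
            return partition_cols  # a single grouper avoids the FutureWarning
        if len(partition_cols) == 1:
            return partition_cols[0]  # unwrap a length-1 list before groupby
        return list(partition_cols)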

    Your project

    No response

    Screenshots

    No response

    OS

    Linux

    Python version

    3.9.13

    AWS SDK for pandas version

    2.18.0

    Additional context

    No response

    bug 
    opened by abefrandsen 2
Releases
  • 2.18.0(Dec 2, 2022)

    Noteworthy

    • Pyarrow 10 support 🔥 by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1731
    • Lambda layers now available in af-south-1 (Cape Town) 🌍 by @malachi-constant

    Features & enhancements

    • Add unload_approach to athena.read_sql_table by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1634
    • Pass additional partition projection params to wr.s3.to_parquet & cat… by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1627
    • Regenerate poetry.lock with no update by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1663
    • Upgrading poetry installed in workflow by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1677
    • Improve bucketing series generation by casting only the required columns by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1664
    • Add get_query_executions generating DataFrames from Athena query executions detail by @KhueNgocDang in https://github.com/aws/aws-sdk-pandas/pull/1676
    • Dependency: Set Pandas version != 1.5.0 due to memory leak by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1688
    • read_csv: read file as binary when encoding_errors is set to ignore by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1723
    • Deps: Remove upper bound limit on 'python' version by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1720
    • (enhancement) Redshift: Adding 'primary_keys' to parameter validation by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1728
    • Add describe_log_streams and filter_log_events to the CloudWatch module by @KhueNgocDang in https://github.com/aws/aws-sdk-pandas/pull/1785
    • Update lambda layers with pyarrow 10 by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1758
    • Add ctas_write_compression argument to athena.read_sql_query by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1795
    • Add auto termination policy to EMR by @vikramsg in https://github.com/aws/aws-sdk-pandas/pull/1818
    • timestream.query: add QueryId and NextToken to df attributes by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1821
    • Add support for boto3 kwargs to timestream.create_table by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1819
    • Adding args to submit spark step by @vikramsg in https://github.com/aws/aws-sdk-pandas/pull/1826

    Bug fixes

    • Fix athena.read_sql_query for empty table and chunk size not returning an empty frame generator by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1685
    • Fixing index column validation in s3.read.parquet() validate schema by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1735
    • Bug: Replace extra_registries with extra_public_registries by @vikramsg in https://github.com/aws/aws-sdk-pandas/pull/1757
    • Fix: map datatype issue of athena by @pal0064 in https://github.com/aws/aws-sdk-pandas/pull/1753
    • Fix Redshift commands breaking with hyphenated table names by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1762
    • Add correct service names for timestream boto3 clients by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1716
    • Allow read partitions with extra = in the value by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1779

    Documentation

    • Update install page in docs with screenshot of new managed layer name by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1636
    • Remove semicolon from python code eol in s3 tutorial by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1673
    • Consistent kernel for jupyter notebooks by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1674
    • Correct a few typos in our ipynb tutorials by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1694
    • Fix broken links in readme by @lucasasmith in https://github.com/aws/aws-sdk-pandas/pull/1702
    • Typos in comments and docs by @mycaule in https://github.com/aws/aws-sdk-pandas/pull/1761

    Tests

    • Support for test infrastructure in private subnets by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1698
    • Upgrade engine versions to match defaults from aws console by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1709
    • Set redshift and Neptune clusters removal policy to destroy by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1675
    • Upgrade pytest-xdist by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1760
    • Fix timestream endpoint tests by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1781

    New Contributors

    • @lucasasmith made their first contribution in https://github.com/aws/aws-sdk-pandas/pull/1702
    • @vikramsg made their first contribution in https://github.com/aws/aws-sdk-pandas/pull/1757
    • @mycaule made their first contribution in https://github.com/aws/aws-sdk-pandas/pull/1761
    • @pal0064 made their first contribution in https://github.com/aws/aws-sdk-pandas/pull/1753

    Thanks

    We thank the following contributors/users for their work on this release: @lucasasmith, @vikramsg, @mycaule, @pal0064, @LeonLuttenberger, @cnfait, @malachi-constant, @kukushking, @jaidisido

    Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/2.17.0...2.18.0

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.18.0-py3-none-any.whl(249.29 KB)
    awswrangler-layer-2.18.0-py3.7.zip(45.85 MB)
    awswrangler-layer-2.18.0-py3.8-arm64.zip(43.38 MB)
    awswrangler-layer-2.18.0-py3.8.zip(47.38 MB)
    awswrangler-layer-2.18.0-py3.9-arm64.zip(43.40 MB)
    awswrangler-layer-2.18.0-py3.9.zip(47.35 MB)
  • 3.0.0rc2(Nov 23, 2022)

    What's Changed

    • (enhancement): Enable missing unit tests and Redshift, Athena, LF load tests by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1736
    • (enhancement): configure scheduling options, remove dependencies on internal ray impl by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1734
    • (testing): Enable Athena and Redshift tests, and address errors by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1721
    • (feat): Make tqdm progress reporting opt-in by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1741

    Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0rc1...3.0.0rc2

    Source code(tar.gz)
    Source code(zip)
  • 3.0.0rc1(Oct 27, 2022)

    What's Changed

    • (enhancement): Move RayLogger out of non-distributed modules by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1686
    • (perf): Distribute data types inference by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1692
    • (docs): Update config tutorial to include new configuration values by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1696
    • (fix): partition block overwriting by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1695
    • (refactor): Optimize distributed CSV I/O by adding PyArrow-based datasource by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1699
    • (docs): Improve documentation on running SDK for pandas at scale by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1697
    • (enhancement): Apply modin repartitioning where required only by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1701
    • (enhancement): Remove local from ray.init call by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1708
    • (feat): Validate partitions along row axis, add warning by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1700
    • (feat): Expand SQL formatter to LakeFormation by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1684
    • (feat): Distribute parquet datasource and add missing features, enable all tests by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1711
    • (convention): Add Arrow prefix to parquet datasource for consistency by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1724
    • (perf): Distribute Timestream write with executor by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1715

    Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0b3...3.0.0rc1

    Source code(tar.gz)
    Source code(zip)
  • 3.0.0b3(Oct 12, 2022)

    What's Changed

    • (feat): Add partitioning on block level by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1653
    • (refactor): Make room for additional distributed engines by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1646
    • (feat): Distribute s3 write text by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1631
    • (docs): Add "Introduction to Ray" Tutorial by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1661
    • (fix): Return address config param by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1660
    • (refactor): Enable new engines with custom dispatching and other constructs by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1666
    • (deps): Uptick modin to 0.16 by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1659

    Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0b2...3.0.0b3

    Source code(tar.gz)
    Source code(zip)
  • 3.0.0b2(Sep 30, 2022)

    What's Changed

    • (feat) Update to Ray 2.0 by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1635
    • (feat) Ray logging by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1623
    • (enhancement): Reduce LOC in S3 write methods create_table by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1626
    • (docs) Tutorial: Run SDK for pandas job on ray cluster by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1616

    Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0b1...3.0.0b2

    Source code(tar.gz)
    Source code(zip)
    awswrangler-3.0.0b2-py3-none-any.whl(261.29 KB)
    awswrangler-3.0.0b2.tar.gz(200.86 KB)
  • 3.0.0b1(Sep 22, 2022)

    What's Changed

    • (test) Consolidate unit and load tests by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1525
    • (feat) Distribute S3 read text by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1567
    • (feat) Distribute s3 wait_objects by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1539
    • (test) Ray Load Tests CDK Stack and Instructions for Load Testing by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1583
    • (fix) Fix S3 read text with version ID was not working by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1587
    • (feat) Add distributed s3 write parquet by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1526
    • (fix) Distribute write text regression, change to singledispatch, add repartitioning utility by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1611
    • (enhancement) Optimise distributed s3.read_text to load data in chunks by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1607

    Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0a2...3.0.0b1

    Source code(tar.gz)
    Source code(zip)
  • 2.17.0(Sep 20, 2022)

    Enhancements

    • Returning empty DataFrame for empty Timestream query #1430
    • Added support for INSERT IGNORE for mysql.to_sql #1429
    • Added use_column_names to redshift.copy akin to redshift.to_sql #1437
    • Enable passing kwargs to redshift.connect #1467
    • Add timestream_endpoint_url property to the config #1483
    • Add support for upserting to an empty Glue table #1579

    Documentation

    • Fix typos in documentation #1434

    Bug Fix

    • validate_schema=True for wr.s3.read_parquet breaks with partition columns and dataset=True #1426
    • wr.neptune.to_property_graph failing for Neptune version 1.1.1.0 #1407
    • ValueError when using opensearch.index_df with documents with an array field #1444
    • Missing catalog_id in wr.catalog.create_database #1480
    • Check for pair of brackets in query preparation for Athena cache #1529
    • Fix wrong type hint for TagColumnOperation in quicksight.create_athena_dataset #1570
    • s3.to_json compression parameters is passed twice when dataset=True #1585
    • Cast Athena array, map & struct types to pandas object #1581
    • In the OpenSearch module, use SSL only for HTTPS (port 443) #1603

    Noteworthy

    AWS Lambda Managed Layers

    Since the last release, the library has been accepted as an official SDK for AWS and rebranded as AWS SDK for pandas 🚀. The module names in Python will remain the same. One noteworthy change, however, is that the AWS Lambda managed layer name has been renamed from AWSDataWrangler to AWSSDKPandas.

    You can view the ARN value for the layers here.

    PyArrow 7 Support

    ⚠️ For platforms without PyArrow 7 support (e.g. MWAA, EMR, Glue PySpark Job):

    pip install pyarrow==2 awswrangler

    Thanks

    We thank the following contributors/users for their work on this release:

    @bechbd, @maxispeicher, @timgates42, @aeeladawy, @KhueNgocDang, @szemek, @malachi-constant, @cnfait, @jaidisido, @LeonLuttenberger, @kukushking

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.17.0-py3-none-any.whl(245.73 KB)
    awswrangler-layer-2.17.0-py3.7.zip(43.01 MB)
    awswrangler-layer-2.17.0-py3.8-arm64.zip(40.31 MB)
    awswrangler-layer-2.17.0-py3.8.zip(44.57 MB)
    awswrangler-layer-2.17.0-py3.9-arm64.zip(40.32 MB)
    awswrangler-layer-2.17.0-py3.9.zip(44.54 MB)
  • 3.0.0a2(Aug 17, 2022)

    This is a pre-release for the [email protected] project

    What's Changed

    • (feat): Add directory for Distributed Wrangler Load Tests by @malachi-constant in https://github.com/awslabs/aws-data-wrangler/pull/1464
    • (CI): Distribute tests in tox config by @malachi-constant in https://github.com/awslabs/aws-data-wrangler/pull/1469
    • (feat): Distribute s3 delete objects by @malachi-constant in https://github.com/awslabs/aws-data-wrangler/pull/1474
    • (CI): Enable new CI pipeline for standard & distributed tests by @malachi-constant in https://github.com/awslabs/aws-data-wrangler/pull/1481
    • (feat): Refactor to distribute s3.read_parquet by @jaidisido in https://github.com/awslabs/aws-data-wrangler/pull/1513
    • (bug): s3 delete tests failing in distributed codebase by @malachi-constant in https://github.com/awslabs/aws-data-wrangler/pull/1517

    Full Changelog: https://github.com/awslabs/aws-data-wrangler/compare/3.0.0a1...3.0.0a2

    Source code(tar.gz)
    Source code(zip)
  • 3.0.0a1(Aug 17, 2022)

    This is a pre-release for the [email protected] project

    What's Changed

    • (feat): Add distributed config flag and initialise method by @jaidisido in https://github.com/awslabs/aws-data-wrangler/pull/1389
    • (feat): Add distributed Lake Formation read by @jaidisido in https://github.com/awslabs/aws-data-wrangler/pull/1397
    • (feat): Distribute S3 select over multiple paths and scan ranges by @jaidisido in https://github.com/awslabs/aws-data-wrangler/pull/1445
    • (refactor): Refactor threading/ray; add single-path distributed s3 select impl by @kukushking in https://github.com/awslabs/aws-data-wrangler/pull/1446

    Full Changelog: https://github.com/awslabs/aws-data-wrangler/compare/2.16.1...3.0.0a1

    Source code(tar.gz)
    Source code(zip)
  • 2.16.1(Jun 28, 2022)

    Noteworthy

    🐛 Fixed issue introduced by 2.16.0 to method s3.read_parquet()

    Patch

    • Fix bug in s3.read_parquet(): pq_file.schema.names(): TypeError: 'list' object is not callable #1412

    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Full Changelog: https://github.com/awslabs/aws-data-wrangler/compare/2.16.0...2.16.1

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.16.1-py3-none-any.whl(242.74 KB)
    awswrangler-layer-2.16.1-py3.7.zip(42.48 MB)
    awswrangler-layer-2.16.1-py3.8-arm64.zip(39.51 MB)
    awswrangler-layer-2.16.1-py3.8.zip(43.72 MB)
    awswrangler-layer-2.16.1-py3.9-arm64.zip(39.52 MB)
    awswrangler-layer-2.16.1-py3.9.zip(43.70 MB)
  • 2.16.0(Jun 22, 2022)

    Noteworthy

    ⚠️ For platforms without PyArrow 7 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    New Functionalities

    • Add support for Oracle Database 🔥 #1259 Check out the tutorial.

    Enhancements

    • add test infrastructure for oracle database #1274
    • revisiting S3 Select performance #1287
    • migrate test infra from cdk v1 to cdk v2 #1288
    • to_sql() make column names quoted identifiers to allow sql keywords #1392
    • throw NoFilesFound exception on 404 #1290
    • fast executemany #1299
    • add precombine key to upsert method for Redshift #1304
    • pass precombine to redshift.copy() #1319
    • use DataFrame column names in INSERT statement for UPSERT operation #1317
    • add data_source param to athena.repair_table #1324
    • modify athena2quicksight datatypes to allow startswith for varchar #1332
    • add TagColumnOperation to quicksight.create_athena_dataset #1342
    • enable list timestream databases and tables #1345
    • enable s3.to_parquet to receive "zstd" compression type #1369
    • create a way to perform PartiQL queries to a Dynamo DB table #1390
    • s3 proxy support with data wrangler #1361

    Documentation

    • be more explicit about awswrangler.s3.to_parquet overwrite behavior #1300
    • fix Python Version in Readme #1302

    Bug Fix

    • set encoding to utf-8 when no encoding is specified when reading/writing to s3 #1257
    • fix Redshift Locking Behavior #1305
    • specify cfn deletion policy for sqlserver and oracle instances #1378
    • to_sql() make column names quoted identifiers to allow sql keywords #1392
    • fix extension dtype index handling #1333
    • fix issue with redshift.to_sql() method when mode set to "upsert" and schema contains a hyphen #1360
    • timestream - array cols to str #1368
    • read_parquet Does Not Throw Error for Missing Column #1370

    Thanks

    We thank the following contributors/users for their work on this release:

    @bnimam, @IldarAlmakaev, @syokoysn, @IldarAlmakaev, @thomasniebler, @maxdavidson91, @takeknock, @Sleekbobby1011, @snikolakis, @willsmith28, @malachi-constant, @cnfait, @jaidisido, @kukushking


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.16.0-py3-none-any.whl(242.73 KB)
    awswrangler-layer-2.16.0-py3.7.zip(42.48 MB)
    awswrangler-layer-2.16.0-py3.8-arm64.zip(39.02 MB)
    awswrangler-layer-2.16.0-py3.8.zip(43.54 MB)
    awswrangler-layer-2.16.0-py3.9-arm64.zip(39.01 MB)
    awswrangler-layer-2.16.0-py3.9.zip(43.54 MB)
  • 2.15.1(Apr 11, 2022)

    Noteworthy

    ⚠️ Dropped Python 3.6 support

    ⚠️ For platforms without PyArrow 7 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Patch

    • Add sparql extra & make SPARQLWrapper dependency optional #1252

    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.15.1-py3-none-any.whl(234.00 KB)
    awswrangler-layer-2.15.1-py3.7.zip(42.34 MB)
    awswrangler-layer-2.15.1-py3.8-arm64.zip(38.90 MB)
    awswrangler-layer-2.15.1-py3.8.zip(43.42 MB)
    awswrangler-layer-2.15.1-py3.9-arm64.zip(38.88 MB)
    awswrangler-layer-2.15.1-py3.9.zip(43.42 MB)
  • 2.15.0(Mar 28, 2022)

    Noteworthy

    ⚠️ Dropped Python 3.6 support

    ⚠️ For platforms without PyArrow 7 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    New Functionalities

    • Amazon Neptune module 🚀 #1084 Check out the tutorial. Thanks to @bechbd & @sakti-mishra !
    • ARM64 Support for Python 3.8 and 3.9 layers 🔥 #1129 Many thanks @cnfait !

    Enhancements

    • Timestream module - support multi-measure records #1214
    • Warnings for implicit float conversion of nulls in to_parquet #1221
    • Support additional sql params in Redshift COPY operation #1210
    • Add create_ctas_table to Athena module #1207
    • S3 Proxy support #1206
    • Add Athena get_named_query_statement #1183
    • Add manifest parameter to 'redshift.copy_from_files' method #1164

    Documentation

    • Update install section #1242
    • Update lambda layers section #1236

    Bug Fix

    • Give precedence to user path for Athena UNLOAD S3 Output Location #1216
    • Honor User specified workgroup in athena.read_sql_query with unload_approach=True #1178
    • Support map type in Redshift copy #1185
    • data_api.rds.read_sql_query() does not preserve data type when column is all NULLS - switches to Boolean #1158
    • Allow decimal values within struct when writing to parquet #1179

    Thanks

    We thank the following contributors/users for their work on this release:

    @bechbd, @sakti-mishra, @mateogianolio, @jasadams, @malachi-constant, @cnfait, @jaidisido, @kukushking


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.15.0-py3-none-any.whl(233.14 KB)
    awswrangler-layer-2.15.0-py3.7.zip(43.98 MB)
    awswrangler-layer-2.15.0-py3.8-arm64.zip(40.51 MB)
    awswrangler-layer-2.15.0-py3.8.zip(45.04 MB)
    awswrangler-layer-2.15.0-py3.9-arm64.zip(40.50 MB)
    awswrangler-layer-2.15.0-py3.9.zip(45.04 MB)
  • 2.14.0(Jan 28, 2022)

    Caveats

    ⚠️ For platforms without PyArrow 6 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    New Functionalities

    • Support Athena Unload 🚀 #1038

    Enhancements

    • Add the ExcludeColumnSchema=True argument to the glue.get_partitions call to reduce response size #1094
    • Add PyArrow flavor argument to write_parquet via pyarrow_additional_kwargs #1057
    • Add rename_duplicate_columns and handle_duplicate_columns flag to sanitize_dataframe_columns_names method #1124
    • Add timestamp_as_object argument to all database read_sql_table methods #1130
    • Add ignore_null to read_parquet_metadata method #1125

    Documentation

    • Improve documentation on installing SAR Lambda layers with the CDK #1097
    • Fix broken link to tutorial in to_parquet method #1058

    Bug Fix

    • Ensure that partition locations retrieved from AWS Glue always end in a "/" #1094
    • Fix bucketing overflow issue in Athena #1086

    Thanks

    We thank the following contributors/users for their work on this release:

    @dennyau, @kailukowiak, @lucasmo, @moykeen, @RigoIce, @vlieven, @kepler, @mdavis-xyz, @ConstantinoSchillebeeckx, @kukushking, @jaidisido


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.14.0-py3-none-any.whl(221.29 KB)
    awswrangler-layer-2.14.0-py3.6.zip(37.31 MB)
    awswrangler-layer-2.14.0-py3.7.zip(40.59 MB)
    awswrangler-layer-2.14.0-py3.8.zip(41.70 MB)
    awswrangler-layer-2.14.0-py3.9.zip(41.68 MB)
  • 2.13.0(Dec 3, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 6 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Breaking changes

    • Fix sanitize methods to align with Glue/Hive naming conventions #579

    New Functionalities

    • AWS Lake Formation Governed Tables 🚀 #570
    • Support for Python 3.10 🔥 #973
    • Add partitioning to JSON datasets #962
    • Add ability to use unbuffered cursor for large MySQL datasets #928

    Enhancements

    • Add awswrangler.s3.list_buckets #997
    • Add partitions_parameters to catalog partitions methods #1035
    • Refactor pagination config in list objects #955
    • Add error message to EmptyDataframe exception #991

    Documentation

    • Clarify docs & add tutorial on schema evolution for CSV datasets #964

    Bug Fix

    • catalog.add_column() without column_comment triggers exception #1017
    • catalog.create_parquet_table Key in dictionary does not always exist #998
    • Fix Catalog StorageDescriptor get #969

    Thanks

    We thank the following contributors/users for their work on this release:

    @csabz09, @Falydoor, @moritzkoerber, @maxispeicher, @kukushking, @jaidisido


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.13.0-py3-none-any.whl(217.33 KB)
    awswrangler-layer-2.13.0-py3.6.zip(38.81 MB)
    awswrangler-layer-2.13.0-py3.7.zip(40.52 MB)
    awswrangler-layer-2.13.0-py3.8.zip(41.02 MB)
    awswrangler-layer-2.13.0-py3.9.zip(41.00 MB)
  • 2.12.1(Oct 18, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 5 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Patch

    • Removing unnecessary dev dependencies from main #961

    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.12.1-py3-none-any.whl(206.15 KB)
    awswrangler-layer-2.12.1-py3.6.zip(37.33 MB)
    awswrangler-layer-2.12.1-py3.7.zip(39.09 MB)
    awswrangler-layer-2.12.1-py3.8.zip(39.66 MB)
  • 2.12.0(Oct 13, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 5 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    New Functionalities

    • Add Support for Opensearch #891 🔥 Check out the tutorial. Many thanks to @AssafMentzer and @mureddy19 for this contribution

    Enhancements

    • redshift.read_sql_query - handle empty table corner case #874
    • Refactor read parquet table to reduce file list scan based on available partitions #878
    • Shrink lambda layer with strip command #884
    • Enabling DynamoDB endpoint URL #887
    • EMR jobs concurrency #889
    • Add feature to allow custom AMI for EMR #907
    • wr.redshift.unload_to_files empties the S3 folder instead of overwriting existing files #914
    • Add catalog_id arg to wr.catalog.does_table_exist #920
    • Add endpoint_url for AWS Secrets Manager #929

    Documentation

    • Update docs for awswrangler.s3.to_csv #868

    Bug Fix

    • wr.mysql.to_sql with use_column_names=True when column names are reserved words #918

    Thanks

    We thank the following contributors/users for their work on this release:

    @AssafMentzer, @mureddy19, @isichei, @DonnaArt, @kukushking, @jaidisido


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.12.0-py3-none-any.whl(206.20 KB)
    awswrangler-layer-2.12.0-py3.6.zip(59.05 MB)
    awswrangler-layer-2.12.0-py3.7.zip(60.79 MB)
    awswrangler-layer-2.12.0-py3.8.zip(61.29 MB)
  • 2.11.0(Sep 1, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 5 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    New Functionalities

    • Redshift and RDS Data Api Support #828 🚀 Check out the tutorial. Many thanks to @pwithams for this contribution

    Enhancements

    • Upgrade to PyArrow 5 #861
    • Add Pagination for TimestreamDB #838

    Documentation

    • Clarifying structure of SSM secrets in connect methods #871

    Bug Fix

    • Use botocores' Loader and ServiceModel to extract accepted kwargs #832

    Thanks

    We thank the following contributors/users for their work on this release:

    @pwithams, @maxispeicher, @kukushking, @jaidisido


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.11.0-py3-none-any.whl(194.22 KB)
    awswrangler-layer-2.11.0-py3.6.zip(44.41 MB)
    awswrangler-layer-2.11.0-py3.7.zip(46.18 MB)
    awswrangler-layer-2.11.0-py3.8.zip(47.26 MB)
  • 2.10.0(Jul 21, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 4 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Enhancements

    • Add upsert support for Postgresql #807
    • Add schema evolution parameter to wr.s3.to_csv #787
    • Enable order by in CTAS Athena queries #785
    • Add header to wr.s3.to_csv when dataset = True #765
    • Add CSV as unload format to wr.redshift.unload_files #761

    Bug Fix

    • Fix deleting CTAS temporary Glue tables #782
    • Ensure safe get of Glue table parameters #779 and #783

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @kukushking, @jaidisido, @mohdaliiqbal


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.10.0-py3-none-any.whl(180.47 KB)
    awswrangler-layer-2.10.0-py3.6.zip(42.68 MB)
    awswrangler-layer-2.10.0-py3.7.zip(44.42 MB)
    awswrangler-layer-2.10.0-py3.8.zip(45.08 MB)
  • 2.9.0(Jun 18, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 4 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Documentation

    Enhancements

    • Enable server-side predicate filtering using S3 Select 🚀 #678
    • Support VersionId parameter for S3 read operations #721
    • Enable prefix in output S3 files for wr.redshift.unload_to_files #729
    • Add option to skip commit on wr.redshift.to_sql #705
    • Move integration test infrastructure to CDK 🎉 #706

    Bug Fix

    • Wait until athena query results bucket is created #735
    • Remove explicit Excel engine configuration #742
    • Fix bucketing types #719
    • Change end_time to UTC #720

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @kukushking, @jaidisido


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.9.0-py3-none-any.whl(179.25 KB)
    awswrangler-layer-2.9.0-py3.6.zip(42.65 MB)
    awswrangler-layer-2.9.0-py3.7.zip(43.24 MB)
    awswrangler-layer-2.9.0-py3.8.zip(43.87 MB)
  • 2.8.0(May 19, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 4 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Documentation

    • Install Lambda Layers and Python wheels from public S3 bucket 🎉 #666
    • Clarified docs around potential in-place mutation of dataframe when using to_parquet #669

    Enhancements

    • Enable parallel s3 downloads (~20% speedup) 🚀 #644
    • Apache Arrow 4.0.0 support (enables ARM instances support as well) #557
    • Enable LOCK before concurrent COPY calls in Redshift #665
    • Make use of Pyarrow iter_batches (>= 3.0.0 only) #660
    • Enable additional options when overwriting Redshift table (drop, truncate, cascade) #671
    • Reuse s3 client across threads for s3 range requests #684

    Bug Fix

    • Add dtypes for empty ctas athena queries #659
    • Add Serde properties when creating CSV table #672
    • Pass SSL properties from Glue Connection to MySQL #554

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @kukushking, @igorborgest, @gballardin, @eferm, @jaklan, @Falydoor, @chariottrider, @chriscugliotta, @konradsemsch, @gvermillion, @russellbrooks, @mshober.


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.8.0-py3-none-any.whl(175.13 KB)
    awswrangler-layer-2.8.0-py3.6.zip(42.64 MB)
    awswrangler-layer-2.8.0-py3.7.zip(43.22 MB)
    awswrangler-layer-2.8.0-py3.8.zip(43.86 MB)
  • 2.7.0(Apr 15, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 3 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Documentation

    • Updated documentation to clarify wr.athena.read_sql_query params argument use #609

    New Functionalities

    • Supporting MySQL upserts #608
    • Enable prepending S3 parquet files with a prefix in wr.s3.write.to_parquet #617
    • Add exist_ok flag to safely create a Glue database #642
    • Add "Unsupported Pyarrow type" exception #639

    Bug Fix

    • Fix chunked mode in wr.s3.read_parquet_table #627
    • Fix missing \ character from wr.s3.read_parquet_table method #638
    • Support postgres as an engine value #630
    • Add default workgroup result configuration #633
    • Raise exception when merge_upsert_table fails or data_quality is insufficient #601
    • Fixing nested structure bug in athena2pyarrow method #612

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @igorborgest, @mattboyd-aws, @vlieven, @bentkibler, @adarsh-chauhan, @impredicative, @nmduarteus, @JoshCrosby, @TakumiHaruta, @zdk123, @tuannguyen0901, @jiteshsoni, @luminita.


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.7.0-py3-none-any.whl(172.06 KB)
    awswrangler-layer-2.7.0-py3.6.zip(41.19 MB)
    awswrangler-layer-2.7.0-py3.7.zip(41.78 MB)
    awswrangler-layer-2.7.0-py3.8.zip(41.84 MB)
  • 2.6.0(Mar 16, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 3 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Enhancements

    • Added a chunksize parameter to the to_sql function (default: 200), decreasing insertion time from 120 seconds to 1 second #599
    • path argument is now optional in s3.to_parquet and s3.to_csv functions #586
    • Added a map_types boolean (set to True by default) to convert pyarrow DataTypes to pandas ExtensionDtypes #580
    • Added optional ctas_database_name argument to store ctas_temporary_table in an alternative database #576

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @igorborgest, @ilyanoskov, @VashMKS, @jmahlik, @dimapod, @Reeska


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.6.0-py3-none-any.whl(170.55 KB)
    awswrangler-layer-2.6.0-py3.6.zip(41.08 MB)
    awswrangler-layer-2.6.0-py3.7.zip(41.66 MB)
    awswrangler-layer-2.6.0-py3.8.zip(41.70 MB)
  • 2.5.0(Mar 3, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 3 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Documentation

    • New HTML tutorials #551
    • Use bump2version for changing version numbers #573
    • Mishandling of wildcard characters in read_parquet #564

    Enhancements

    • Support for ExpectedBucketOwner #562

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @impredicative, @adarsh-chauhan, @Malkard.


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.5.0-py3-none-any.whl(168.46 KB)
    awswrangler-layer-2.5.0-py3.6.zip(40.96 MB)
    awswrangler-layer-2.5.0-py3.7.zip(41.53 MB)
    awswrangler-layer-2.5.0-py3.8.zip(41.57 MB)
  • 2.4.0-docs(Feb 4, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 3 support (e.g. EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Documentation

    • Update to include PyArrow 3 caveats for EMR and Glue PySpark Job. #546 #547

    New Functionalities

    • Redshift COPY now supports the new SUPER type (i.e. SERIALIZETOJSON) #514
    • S3 Upload/download files #506
    • Include dataset BUCKETING for s3 datasets writing #443
    • Enable Merge Upsert for existing Glue Tables on Primary Keys #503
    • Support Requester Pays S3 Buckets #430
    • Add botocore Config to wr.config #535

    Enhancements

    • Pandas 1.2.1 support #525
    • Numpy 1.20.0 support
    • Apache Arrow 3.0.0 support #531
    • Python 3.9 support #454

    Bug Fix

    • Return DataFrame with unique index for Athena CTAS queries #527
    • Remove unnecessary schema inference. #524

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @danielwo, @jiteshsoni, @igorborgest, @njdanielsen, @eric-valente, @gvermillion, @zseder, @gdbassett, @orenmazor, @senorkrabs, @Natalie-Caruana, @dragonH, @nikwerhypoport, @hwangji.


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.4.0-py3-none-any.whl(167.60 KB)
    awswrangler-layer-2.4.0-py3.6.zip(40.95 MB)
    awswrangler-layer-2.4.0-py3.7.zip(41.51 MB)
    awswrangler-layer-2.4.0-py3.8.zip(41.56 MB)
  • 2.4.0(Feb 3, 2021)

    New Functionalities

    • Redshift COPY now supports the new SUPER type (i.e. SERIALIZETOJSON) #514
    • S3 Upload/download files #506
    • Include dataset BUCKETING for s3 datasets writing #443
    • Enable Merge Upsert for existing Glue Tables on Primary Keys #503
    • Support Requester Pays S3 Buckets #430
    • Add botocore Config to wr.config #535

    Enhancements

    • Pandas 1.2.1 support #525
    • Numpy 1.20.0 support
    • Apache Arrow 3.0.0 support #531
    • Python 3.9 support #454

    Bug Fix

    • Return DataFrame with unique index for Athena CTAS queries #527
    • Remove unnecessary schema inference. #524

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @danielwo, @jiteshsoni, @igorborgest, @njdanielsen, @eric-valente, @gvermillion, @zseder, @gdbassett, @orenmazor, @senorkrabs, @Natalie-Caruana.


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.4.0-py3-none-any.whl(167.60 KB)
    awswrangler-layer-2.4.0-py3.6.zip(40.95 MB)
    awswrangler-layer-2.4.0-py3.7.zip(41.51 MB)
    awswrangler-layer-2.4.0-py3.8.zip(41.56 MB)
  • 2.3.0(Jan 10, 2021)

    New Functionalities

    • DynamoDB support #448
    • SQLServer support (Driver must be installed separately) #356
    • Excel files support #419 #509
    • Amazon S3 Access Point support #393
    • Amazon Chime initial support #494
    • Write compressed CSV and JSON files on S3 #308 #359 #412

    Enhancements

    • Add query parameters for Athena #432
    • Add metadata caching for Athena #461
    • Add suffix filters for s3.read_parquet_table() #495

    Bug Fix

    • Fix keep_files behavior for failed Redshift COPY executions #505

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @danielwo, @jiteshsoni, @gvermillion, @rodalarcon, @imanebosch, @dwbelliston, @tochandrashekhar, @kylepierce, @njdanielsen, @jasadams, @gtossou, @JasonSanchez, @kokes, @hanan-vian @igorborgest.


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.3.0-py3-none-any.whl(160.79 KB)
    awswrangler-layer-2.3.0-py3.6.zip(40.52 MB)
    awswrangler-layer-2.3.0-py3.7.zip(40.73 MB)
    awswrangler-layer-2.3.0-py3.8.zip(40.79 MB)
  • 2.2.0(Dec 23, 2020)

    New Functionalities

    • Add aws_access_key_id, aws_secret_access_key, aws_session_token and boto3_session for Redshift copy/unload #484

    Bug Fix

    • Remove dtype print statement #487

    Thanks

    We thank the following contributors/users for their work on this release:

    @danielwo, @thetimbecker, @njdanielsen, @igorborgest.


    P.S. Lambda Layer zip file and Glue wheel/egg files are available below. Just upload it and run!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.2.0-py3-none-any.whl(147.74 KB)
    awswrangler-2.2.0-py3.6.egg(319.53 KB)
    awswrangler-layer-2.2.0-py3.6.zip(39.52 MB)
    awswrangler-layer-2.2.0-py3.7.zip(39.45 MB)
    awswrangler-layer-2.2.0-py3.8.zip(39.52 MB)
  • 2.1.0(Dec 21, 2020)

    New Functionalities

    • Add secretmanager module and support for databases connections #402
    con = wr.redshift.connect(secret_id="my-secret", dbname="my-db")
    df = wr.redshift.read_sql_query("SELECT ...", con=con)
    con.close()
    

    Bug Fix

    • Fix connection attributes quoting for wr.*.connect() #481
    • Fix parquet table append for nested struct columns #480

    Thanks

    We thank the following contributors/users for their work on this release:

    @danielwo, @nmduarteus, @nivf33, @kinghuang, @igorborgest.


    P.S. Lambda Layer zip file and Glue wheel/egg files are available below. Just upload it and run!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.1.0-py3-none-any.whl(147.04 KB)
    awswrangler-2.1.0-py3.6.egg(318.06 KB)
    awswrangler-layer-2.1.0-py3.6.zip(39.52 MB)
    awswrangler-layer-2.1.0-py3.7.zip(39.45 MB)
    awswrangler-layer-2.1.0-py3.8.zip(39.51 MB)
Owner

Amazon Web Services - Labs (AWS Labs)