Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

Overview

AWS Data Wrangler

An AWS Professional Service open source initiative | [email protected]

Source | Installation Command
PyPI | pip install awswrangler
Conda | conda install -c conda-forge awswrangler

⚠️ For platforms without PyArrow 3 support (e.g. EMR, Glue PySpark Job):
➡️ pip install pyarrow==2 awswrangler

Quick Start

Installation command: pip install awswrangler

⚠️ For platforms without PyArrow 3 support (e.g. EMR, Glue PySpark Job):
➡️ pip install pyarrow==2 awswrangler

import awswrangler as wr
import pandas as pd
from datetime import datetime

df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})

# Storing data on Data Lake
wr.s3.to_parquet(
    df=df,
    path="s3://bucket/dataset/",
    dataset=True,
    database="my_db",
    table="my_table"
)

# Retrieving the data directly from Amazon S3
df = wr.s3.read_parquet("s3://bucket/dataset/", dataset=True)

# Retrieving the data from Amazon Athena
df = wr.athena.read_sql_query("SELECT * FROM my_table", database="my_db")

# Get a Redshift connection from the Glue Catalog and retrieve data from Redshift Spectrum
con = wr.redshift.connect("my-glue-connection")
df = wr.redshift.read_sql_query("SELECT * FROM external_schema.my_table", con=con)
con.close()

# Amazon Timestream Write
df = pd.DataFrame({
    "time": [datetime.now(), datetime.now()],
    "my_dimension": ["foo", "boo"],
    "measure": [1.0, 1.1],
})
rejected_records = wr.timestream.write(
    df=df,
    database="sampleDB",
    table="sampleTable",
    time_col="time",
    measure_col="measure",
    dimensions_cols=["my_dimension"],
)

# Amazon Timestream Query
wr.timestream.query("""
SELECT time, measure_value::double, my_dimension
FROM "sampleDB"."sampleTable" ORDER BY time DESC LIMIT 3
""")

Read The Docs

Community Resources

Please send a Pull Request with your resource reference and @githubhandle.

Logging

Enabling internal logging examples:

import logging
logging.basicConfig(level=logging.INFO, format="[%(name)s][%(funcName)s] %(message)s")
logging.getLogger("awswrangler").setLevel(logging.DEBUG)
logging.getLogger("botocore.credentials").setLevel(logging.CRITICAL)
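Python loggers form a dot-separated hierarchy, so a level set on "awswrangler" should also apply to loggers created beneath it. The following is a minimal standard-library sketch of that behavior; the child logger name "awswrangler.s3" is an assumption used only for illustration:

```python
import logging

# Same setup as above: INFO by default, DEBUG for the library's logger tree.
logging.basicConfig(level=logging.INFO, format="[%(name)s][%(funcName)s] %(message)s")
logging.getLogger("awswrangler").setLevel(logging.DEBUG)

# "awswrangler.s3" is a hypothetical child logger name used only for illustration.
child = logging.getLogger("awswrangler.s3")

# The child has no level of its own, so it inherits DEBUG from its parent.
effective_level = logging.getLevelName(child.getEffectiveLevel())
```

Checking the effective level this way can help confirm which loggers are actually emitting debug output before a long-running job starts.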

In AWS Lambda:

import logging
logging.getLogger("awswrangler").setLevel(logging.DEBUG)

Who uses AWS Data Wrangler?

Knowing which companies are using this library is important to help prioritize the project internally.

If you can, please send a Pull Request with your company name and @githubhandle.

What is Amazon SageMaker Data Wrangler?

Amazon SageMaker Data Wrangler is a new SageMaker Studio feature that has a similar name to the AWS Data Wrangler open source project but a different purpose.

  • AWS Data Wrangler is open source, runs anywhere, and is focused on code.

  • Amazon SageMaker Data Wrangler is specific to the SageMaker Studio environment and is focused on a visual interface.

Comments
  • Enable Athena and Redshift tests, and address errors

    Feature or Bugfix

    • Feature

    Detail

    • Athena tests weren't enabled for the distributed mode

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    opened by LeonLuttenberger 64
  • Add tests for Glue Ray jobs

    Feature or Bugfix

    • Feature

    Detail

    • Added a CloudFormation stack which creates the Glue Ray job(s)
    • Created a load test which triggers an example Glue job and checks for successful and timely execution
    • Wrote a bash script which packages the working version of Wrangler and uploads it to S3. This can then be loaded by the Glue job so that we test the working version of Wrangler rather than the one pre-packaged into Glue.
      • This script will need to be executed from the CodeBuild job so that the working version of Wrangler is uploaded to S3 before execution

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    opened by LeonLuttenberger 43
  • distributed s3 write text

    Feature or Bugfix

    • Feature

    Detail

    • Adding distributed versions of s3.write_csv and s3.write_json

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    feature 
    opened by LeonLuttenberger 40
  • Load Testing Benchmark Analytics

    • Write load tests result to parquet dataset stored in internal S3.
    • ToDo: Determine whether to restrict to just default branch (i.e. release-3.0.0) or not.

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    opened by malachi-constant 36
  • Timestream write ray support

    Feature or Bugfix

    • Feature
    • Refactoring

    Detail

    • Ray support for timestream write
    • num_threads argument renamed to use_threads for consistency with the rest of awswrangler, plus support for os.cpu_count()
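As a rough illustration of the use_threads convention described above, a boolean-or-integer argument could be resolved to a worker count as follows. This is only a sketch: resolve_thread_count is a hypothetical helper name, not a function from the library.

```python
import os
from typing import Union


def resolve_thread_count(use_threads: Union[bool, int]) -> int:
    """Hypothetical helper: map a use_threads argument to a worker count."""
    # Check bool first, because bool is a subclass of int in Python.
    if isinstance(use_threads, bool):
        # os.cpu_count() may return None, so fall back to a single worker.
        return (os.cpu_count() or 1) if use_threads else 1
    # An explicit integer is taken as the requested number of threads (at least 1).
    return max(1, int(use_threads))
```

Under these assumptions, use_threads=False means a single worker, use_threads=True means one worker per CPU, and an integer requests that many workers.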

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    opened by cnfait 36
  • Load Test Benchmarking

    • Add custom metric fixture
    • Add logic to publish elapsed_time per test to custom metric
    • Add an environment variable controlling whether to opt in to publishing.
      • Data should only be published when running against release-3.0.0
    • Metric data can be organized into dashboards as needed.
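The opt-in publishing idea in this pull request can be sketched with the standard library alone. Everything named here is an assumption for illustration: the variable name PUBLISH_BENCHMARK_METRICS, the timed helper, and the dictionary that stands in for the custom metric are hypothetical and not taken from the PR.

```python
import os
import time
from contextlib import contextmanager
from typing import Dict, Iterator

# Hypothetical variable name; the PR does not state the real one.
PUBLISH_ENV_VAR = "PUBLISH_BENCHMARK_METRICS"


def publishing_enabled() -> bool:
    """Opt in only when the environment variable is set to a truthy value."""
    return os.environ.get(PUBLISH_ENV_VAR, "").lower() in {"1", "true", "yes"}


@contextmanager
def timed(test_name: str, sink: Dict[str, float]) -> Iterator[None]:
    """Measure elapsed time and record it in `sink` only if publishing is enabled."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if publishing_enabled():
            # `sink` stands in for the real metric publisher.
            sink[test_name] = elapsed
```

Wrapping a test body in `with timed("my_test", results):` would then record its elapsed time only in runs that set the variable, which keeps local runs from publishing data.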

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    opened by malachi-constant 32
  • (feat): Refactor to distribute s3.read_parquet

    Feature or Bugfix

    • Feature
    • Refactoring

    Detail

    1. Refactor wr.s3.read_parquet and other methods in the _read_parquet S3 module to reduce technical debt:
      • Leverage thread pool executor when possible
      • Simplify chunk generation logic
      • Reduce number of conditionals by generalising edge cases
      • Improve documentation
    2. Distribute both read_file_metadata and read_parquet calls:
      • read_file_metadata is distributed as a @ray_remote method via the executor
      • read_parquet is distributed using a custom datasource and the read_datasource Ray public API
    Testing

    • Standard tests are passing with minimal changes to the tests
    • Two tests are added to the load_test (simple and partitioned case)

    Related Issue

    • #1490

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    major release feature 
    opened by jaidisido 27
  • (refactor): Make room for additional distributed engines

    Feature or Bugfix

    • Refactoring

    Detail

    Currently, the codebase assumes that there is a single distributed execution engine referred to with the distributed keyword. This is highly restrictive as it closes the door on adding new execution engines (e.g. pyspark, dask...) in the future.

    A major change in this PR is splitting the distributed dependency installation and configuration into two (modin AND ray instead of distributed only). I believe this has two benefits: (1) it is explicit, so the user knows exactly what they are installing, and (2) it is flexible, allowing more combinations in the future such as modin on dask or mars on ray.

    This change includes:

    • Modify the extra dependency installation from pip install "awswrangler[distributed]" to pip install "awswrangler[modin,ray]" instead
    • Modify the configuration to use two items (execution_engine and memory_format)
    • Modify the conditionals across the codebase as a result
    • Move the distributed modules under the subdirectory distributed/ray

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    enhancement major release dependencies 
    opened by jaidisido 26
  • (feat): Add Amazon Neptune support 🚀

    Issue #, if available:

    Description of changes: First draft of what a Neptune interface might look like.

    I did have an outstanding question, though, about the naming of the write functions. There seem to be several conventions (put, to_sql, index, etc.) that different services have used based on how they work. Is there a preferred naming convention we would like to follow here?

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    opened by bechbd 25
  • Ray Load Tests CDK Stack and Instructions for Load Testing

    Feature or Bugfix

    • Load Testing Documentation

    Detail

    • Ray load testing documentation
    • Ray CDK stack for creating prerequisites for launching ray clusters in aws

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    documentation 
    opened by malachi-constant 24
  • Distributed s3 delete objects

    Feature or Bugfix

    • Refactor s3.delete_objects to run in distributed fashion.

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    enhancement 
    opened by malachi-constant 24
  • (feat) opensearch serverless

    Feature or Bugfix

    • Feature

    Detail

    • Update existing client to support serverless
    • Add wr.opensearch.create_collection
    • Add helpers to generate default encryption and network policies for collections
    • Update tests to run against serverless opensearch

    Relates

    • #1917

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    feature 
    opened by kukushking 3
  • I am getting ValueError: I/O operation on closed file

    I am getting ValueError: I/O operation on closed file with the call below. My path is S3://bucket/file_name.json; is there a way to open the file and read its lines explicitly?

    wr.opensearch.index_json(
        client,
        path=path,  # path can be s3 or local
        index="sf_restaurants_inspections_dedup",
        id_keys=["inspection_id"]  # can be multiple fields. arg applicable to all index_* functions
    )

    opened by deeproker 0
  • Add integration with OpenSearch Serverless

    Is your feature request related to a problem? Please describe.

    AWS OpenSearch Service now has OpenSearch Serverless in preview. It would be nice if AWS SDK for pandas supported OpenSearch Serverless just as it supports OpenSearch.

    Describe the solution you'd like

    AWS SDK for pandas should integrate with OpenSearch Serverless as it does with OpenSearch. Some of its dependencies may first need to integrate with OpenSearch Serverless.

    Describe alternatives you've considered

    N/A

    Additional context

    AWS SDK for pandas should be able to:

    • Initialize collections in OpenSearch Serverless
    • Index data to collections
    • Search data in collections
    • Delete data in collections

    This is similar to how it supports AWS OpenSearch: https://github.com/aws/aws-sdk-pandas/blob/main/tutorials/031%20-%20OpenSearch.ipynb

    P.S. Please do not attach files as it's considered a security risk. Add code snippets directly in the message body as much as possible.

    feature 
    opened by RobotCharlie 2
  • (poc) mutation testing

    POC of using mutation testing to improve coverage.

    • Added an example workflow to mutate S3 list module
    • Runs mocked tests against the mutants
    • Generates console and HTML reports

    Note that we will probably not need any workflows to use this concept; this is merely an example to share with the team.

    Proper mutation testing workflow is described here.

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    opened by kukushking 1
  • pandas FutureWarning in to_parquet with length-1 partition_cols argument

    Describe the bug

    When writing a parquet dataset via to_parquet and setting the partition_cols argument as a length-1 list (to just partition on a single column), I get the following warning:

    .../awswrangler/s3/_write_dataset.py:92: FutureWarning: In a future version of pandas, a length 1 tuple will be returned when iterating over a groupby with a grouper equal to a list of length 1. Don't supply a list with a single grouper to avoid this warning.
      for keys, subgroup in df.groupby(by=partition_cols, observed=True):

    How to Reproduce

    awswrangler version 2.18.0
    pandas version 1.5.1

    from awswrangler.s3 import to_parquet
    import pandas as pd
    
    df = pd.DataFrame(data={'col1': [1, 2, 2, 3], 'col2': ['a', 'b', 'c', 'd']})
    to_parquet(df, 's3://my-bucket/dataset/', dataset=True, partition_cols=['col1'])

    Expected behavior

    No warning should be given, since awswrangler should properly call pandas groupby when given a single column as the partition column. I suggest allowing the partition_cols argument to be either a list of strings or a single string.
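One way the suggestion above might look in practice is sketched below. Both functions are hypothetical helpers written for illustration; they are not part of awswrangler and do not describe how the library resolved this issue.

```python
from typing import List, Sequence, Union


def normalize_partition_cols(partition_cols: Union[str, Sequence[str], None]) -> List[str]:
    """Hypothetical helper: accept a single column name or a sequence of names."""
    if partition_cols is None:
        return []
    # A bare string is one column name, not a sequence of characters.
    if isinstance(partition_cols, str):
        return [partition_cols]
    return list(partition_cols)


def groupby_key(partition_cols: Sequence[str]) -> Union[str, List[str]]:
    """Return a scalar key for one column so groupby is not given a length-1 list."""
    cols = list(partition_cols)
    return cols[0] if len(cols) == 1 else cols
```

With a scalar key, iterating over a pandas groupby yields scalar group keys rather than tuples, so any code that unpacks the keys would need to handle both shapes.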

    Your project

    No response

    Screenshots

    No response

    OS

    Linux

    Python version

    3.9.13

    AWS SDK for pandas version

    2.18.0

    Additional context

    No response

    bug 
    opened by abefrandsen 2
Releases(2.18.0)
  • 2.18.0(Dec 2, 2022)

    Noteworthy

    • Pyarrow 10 support 🔥 by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1731
    • Lambda layers now available in af-south-1 (Cape Town) 🌍 by @malachi-constant

    Features & enhancements

    • Add unload_approach to athena.read_sql_table by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1634
    • Pass additional partition projection params to wr.s3.to_parquet & cat… by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1627
    • Regenerate poetry.lock with no update by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1663
    • Upgrading poetry installed in workflow by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1677
    • Improve bucketing series generation by casting only the required columns by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1664
    • Add get_query_executions generating DataFrames from Athena query executions detail by @KhueNgocDang in https://github.com/aws/aws-sdk-pandas/pull/1676
    • Dependency: Set Pandas Version != 1.5.0 due to memory leak by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1688
    • read_csv: read file as binary when encoding_errors is set to ignore by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1723
    • Deps: Remove upper bound limit on 'python' version by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1720
    • (enhancement) Redshift: Adding 'primary_keys' to parameter validation by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1728
    • Add describe_log_streams and filter_log_events to the CloudWatch module by @KhueNgocDang in https://github.com/aws/aws-sdk-pandas/pull/1785
    • Update lambda layers with pyarrow 10 by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1758
    • Add ctas_write_compression argument to athena.read_sql_query by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1795
    • Add auto termination policy to EMR by @vikramsg in https://github.com/aws/aws-sdk-pandas/pull/1818
    • timestream.query: add QueryId and NextToken to df attributes by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1821
    • Add support for boto3 kwargs to timestream.create_table by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1819
    • Adding args to submit spark step by @vikramsg in https://github.com/aws/aws-sdk-pandas/pull/1826

    Bug fixes

    • Fix athena.read_sql_query for empty table and chunk size not returning an empty frame generator by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1685
    • Fixing index column validation in s3.read_parquet() validate schema by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1735
    • Bug: Replace extra_registries with extra_public_registries by @vikramsg in https://github.com/aws/aws-sdk-pandas/pull/1757
    • Fix: map datatype issue of athena by @pal0064 in https://github.com/aws/aws-sdk-pandas/pull/1753
    • Fix Redshift commands breaking with hyphenated table names by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1762
    • Add correct service names for timestream boto3 clients by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1716
    • Allow read partitions with extra = in the value by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1779
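To illustrate the last fix in this list, partition values that contain an extra "=", the sketch below parses Hive-style key=value path segments by splitting only on the first "=". The parse_partitions helper and the example path are assumptions for illustration, not the library's implementation.

```python
from typing import Dict


def parse_partitions(path: str) -> Dict[str, str]:
    """Hypothetical helper: extract Hive-style key=value partitions from a path."""
    partitions: Dict[str, str] = {}
    for segment in path.split("/"):
        if "=" not in segment:
            continue
        # Split only on the first "=", so any further "=" stays in the value.
        key, value = segment.split("=", 1)
        partitions[key] = value
    return partitions


example = parse_partitions("s3://bucket/dataset/col1=a=b/col2=1/file.parquet")
```

Splitting without a limit would break "col1=a=b" into three parts, which is the kind of failure a value containing "=" can trigger.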

    Documentation

    • Update install page in docs with screenshot of new managed layer name by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1636
    • Remove semicolon from python code eol in s3 tutorial by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1673
    • Consistent kernel for jupyter notebooks by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1674
    • Correct a few typos in our ipynb tutorials by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1694
    • Fix broken links in readme by @lucasasmith in https://github.com/aws/aws-sdk-pandas/pull/1702
    • Typos in comments and docs by @mycaule in https://github.com/aws/aws-sdk-pandas/pull/1761

    Tests

    • Support for test infrastructure in private subnets by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1698
    • Upgrade engine versions to match defaults from aws console by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1709
    • Set redshift and Neptune clusters removal policy to destroy by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1675
    • Upgrade pytest-xdist by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1760
    • Fix timestream endpoint tests by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1781

    New Contributors

    • @lucasasmith made their first contribution in https://github.com/aws/aws-sdk-pandas/pull/1702
    • @vikramsg made their first contribution in https://github.com/aws/aws-sdk-pandas/pull/1757
    • @mycaule made their first contribution in https://github.com/aws/aws-sdk-pandas/pull/1761
    • @pal0064 made their first contribution in https://github.com/aws/aws-sdk-pandas/pull/1753

    Thanks

    We thank the following contributors/users for their work on this release: @lucasasmith, @vikramsg, @mycaule, @pal0064, @LeonLuttenberger, @cnfait, @malachi-constant, @kukushking, @jaidisido

    Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/2.17.0...2.18.0

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.18.0-py3-none-any.whl(249.29 KB)
    awswrangler-layer-2.18.0-py3.7.zip(45.85 MB)
    awswrangler-layer-2.18.0-py3.8-arm64.zip(43.38 MB)
    awswrangler-layer-2.18.0-py3.8.zip(47.38 MB)
    awswrangler-layer-2.18.0-py3.9-arm64.zip(43.40 MB)
    awswrangler-layer-2.18.0-py3.9.zip(47.35 MB)
  • 3.0.0rc2(Nov 23, 2022)

    What's Changed

    • (enhancement): Enable missing unit tests and Redshift, Athena, LF load tests by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1736
    • (enhancement): configure scheduling options, remove dependencies on internal ray impl by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1734
    • (testing): Enable Athena and Redshift tests, and address errors by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1721
    • (feat): Make tqdm progress reporting opt-in by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1741

    Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0rc1...3.0.0rc2

    Source code(tar.gz)
    Source code(zip)
  • 3.0.0rc1(Oct 27, 2022)

    What's Changed

    • (enhancement): Move RayLogger out of non-distributed modules by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1686
    • (perf): Distribute data types inference by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1692
    • (docs): Update config tutorial to include new configuration values by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1696
    • (fix): partition block overwriting by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1695
    • (refactor): Optimize distributed CSV I/O by adding PyArrow-based datasource by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1699
    • (docs): Improve documentation on running SDK for pandas at scale by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1697
    • (enhancement): Apply modin repartitioning where required only by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1701
    • (enhancement): Remove local from ray.init call by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1708
    • (feat): Validate partitions along row axis, add warning by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1700
    • (feat): Expand SQL formatter to LakeFormation by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1684
    • (feat): Distribute parquet datasource and add missing features, enable all tests by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1711
    • (convention): Add Arrow prefix to parquet datasource for consistency by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1724
    • (perf): Distribute Timestream write with executor by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1715

    Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0b3...3.0.0rc1

    Source code(tar.gz)
    Source code(zip)
  • 3.0.0b3(Oct 12, 2022)

    What's Changed

    • (feat): Add partitioning on block level by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1653
    • (refactor): Make room for additional distributed engines by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1646
    • (feat): Distribute s3 write text by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1631
    • (docs): Add "Introduction to Ray" Tutorial by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1661
    • (fix): Return address config param by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1660
    • (refactor): Enable new engines with custom dispatching and other constructs by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1666
    • (deps): Uptick modin to 0.16 by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1659

    Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0b2...3.0.0b3

    Source code(tar.gz)
    Source code(zip)
  • 3.0.0b2(Sep 30, 2022)

    What's Changed

    • (feat) Update to Ray 2.0 by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1635
    • (feat) Ray logging by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1623
    • (enhancement): Reduce LOC in S3 write methods create_table by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1626
    • (docs) Tutorial: Run SDK for pandas job on ray cluster by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1616

    Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0b1...3.0.0b2

    Source code(tar.gz)
    Source code(zip)
    awswrangler-3.0.0b2-py3-none-any.whl(261.29 KB)
    awswrangler-3.0.0b2.tar.gz(200.86 KB)
  • 3.0.0b1(Sep 22, 2022)

    What's Changed

    • (test) Consolidate unit and load tests by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1525
    • (feat) Distribute S3 read text by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1567
    • (feat) Distribute s3 wait_objects by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1539
    • (test) Ray Load Tests CDK Stack and Instructions for Load Testing by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1583
    • (fix) Fix S3 read text with version ID was not working by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1587
    • (feat) Add distributed s3 write parquet by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1526
    • (fix) Distribute write text regression, change to singledispatch, add repartitioning utility by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1611
    • (enhancement) Optimise distributed s3.read_text to load data in chunks by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1607

    Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0a2...3.0.0b1

    Source code(tar.gz)
    Source code(zip)
  • 2.17.0(Sep 20, 2022)

    New Functionalities

    Enhancements

    • Returning empty DataFrame for empty TimeStream query #1430
    • Added support for INSERT IGNORE for mysql.to_sql #1429
    • Added use_column_names to redshift.copy akin to redshift.to_sql #1437
    • Enable passing kwargs to redshift.connect #1467
    • Add timestream_endpoint_url property to the config #1483
    • Add support for upserting to an empty Glue table #1579

    Documentation

    • Fix typos in documentation #1434

    Bug Fix

    • validate_schema=True for wr.s3.read_parquet breaks with partition columns and dataset=True #1426
    • wr.neptune.to_property_graph failing for Neptune version 1.1.1.0 #1407
    • ValueError when using opensearch.index_df with documents with an array field #1444
    • Missing catalog_id in wr.catalog.create_database #1480
    • Check for pair of brackets in query preparation for Athena cache #1529
    • Fix wrong type hint for TagColumnOperation in quicksight.create_athena_dataset #1570
    • s3.to_json compression parameter is passed twice when dataset=True #1585
    • Cast Athena array, map & struct types to pandas object #1581
    • In the OpenSearch module, use SSL only for HTTPS (port 443) #1603

    Noteworthy

    AWS Lambda Managed Layers

    Since the last release, the library has been accepted as an official SDK for AWS, and rebranded as AWS SDK for pandas 🚀. The module names in Python will remain the same. One noteworthy change, however, is that the AWS Lambda Managed Layer has been renamed from AWSDataWrangler to AWSSDKPandas.

    You can view the ARN value for the layers here.

    PyArrow 7 Support

    ⚠️ For platforms without PyArrow 7 support (e.g. MWAA, EMR, Glue PySpark Job):

    pip install pyarrow==2 awswrangler

    Thanks

    We thank the following contributors/users for their work on this release:

    @bechbd, @maxispeicher, @timgates42, @aeeladawy, @KhueNgocDang, @szemek, @malachi-constant, @cnfait, @jaidisido, @LeonLuttenberger, @kukushking

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.17.0-py3-none-any.whl(245.73 KB)
    awswrangler-layer-2.17.0-py3.7.zip(43.01 MB)
    awswrangler-layer-2.17.0-py3.8-arm64.zip(40.31 MB)
    awswrangler-layer-2.17.0-py3.8.zip(44.57 MB)
    awswrangler-layer-2.17.0-py3.9-arm64.zip(40.32 MB)
    awswrangler-layer-2.17.0-py3.9.zip(44.54 MB)
  • 3.0.0a2(Aug 17, 2022)

    This is a pre-release for the [email protected] project

    What's Changed

    • (feat): Add directory for Distributed Wrangler Load Tests by @malachi-constant in https://github.com/awslabs/aws-data-wrangler/pull/1464
    • (CI): Distribute tests in tox config by @malachi-constant in https://github.com/awslabs/aws-data-wrangler/pull/1469
    • (feat): Distribute s3 delete objects by @malachi-constant in https://github.com/awslabs/aws-data-wrangler/pull/1474
    • (CI): Enable new CI pipeline for standard & distributed tests by @malachi-constant in https://github.com/awslabs/aws-data-wrangler/pull/1481
    • (feat): Refactor to distribute s3.read_parquet by @jaidisido in https://github.com/awslabs/aws-data-wrangler/pull/1513
    • (bug): s3 delete tests failing in distributed codebase by @malachi-constant in https://github.com/awslabs/aws-data-wrangler/pull/1517

    Full Changelog: https://github.com/awslabs/aws-data-wrangler/compare/3.0.0a1...3.0.0a2

    Source code(tar.gz)
    Source code(zip)
  • 3.0.0a1(Aug 17, 2022)

    This is a pre-release for the [email protected] project

    What's Changed

    • (feat): Add distributed config flag and initialise method by @jaidisido in https://github.com/awslabs/aws-data-wrangler/pull/1389
    • (feat): Add distributed Lake Formation read by @jaidisido in https://github.com/awslabs/aws-data-wrangler/pull/1397
    • (feat): Distribute S3 select over multiple paths and scan ranges by @jaidisido in https://github.com/awslabs/aws-data-wrangler/pull/1445
    • (refactor): Refactor threading/ray; add single-path distributed s3 select impl by @kukushking in https://github.com/awslabs/aws-data-wrangler/pull/1446

    Full Changelog: https://github.com/awslabs/aws-data-wrangler/compare/2.16.1...3.0.0a1

    Source code(tar.gz)
    Source code(zip)
  • 2.16.1(Jun 28, 2022)

    Noteworthy

    🐛 Fixed issue introduced by 2.16.0 to method s3.read_parquet()

    Patch

    • Fix bug in s3.read_parquet(): calling pq_file.schema.names() raised TypeError: 'list' object is not callable #1412

    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload them and run, or use them from our public S3 bucket!

    Full Changelog: https://github.com/awslabs/aws-data-wrangler/compare/2.16.0...2.16.1

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.16.1-py3-none-any.whl(242.74 KB)
    awswrangler-layer-2.16.1-py3.7.zip(42.48 MB)
    awswrangler-layer-2.16.1-py3.8-arm64.zip(39.51 MB)
    awswrangler-layer-2.16.1-py3.8.zip(43.72 MB)
    awswrangler-layer-2.16.1-py3.9-arm64.zip(39.52 MB)
    awswrangler-layer-2.16.1-py3.9.zip(43.70 MB)
  • 2.16.0(Jun 22, 2022)

    Noteworthy

    ⚠️ For platforms without PyArrow 7 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    New Functionalities

    • Add support for Oracle Database 🔥 #1259 Check out the tutorial.

    Enhancements

    • add test infrastructure for oracle database #1274
    • revisiting S3 Select performance #1287
    • migrate test infra from cdk v1 to cdk v2 #1288
    • to_sql() make column names quoted identifiers to allow sql keywords #1392
    • throw NoFilesFound exception on 404 #1290
    • fast executemany #1299
    • add precombine key to upsert method for Redshift #1304
    • pass precombine to redshift.copy() #1319
    • use DataFrame column names in INSERT statement for UPSERT operation #1317
    • add data_source param to athena.repair_table #1324
    • modify athena2quicksight datatypes to allow startswith for varchar #1332
    • add TagColumnOperation to quicksight.create_athena_dataset #1342
    • enable list timestream databases and tables #1345
    • enable s3.to_parquet to receive "zstd" compression type #1369
    • create a way to perform PartiQL queries to a Dynamo DB table #1390
    • s3 proxy support with data wrangler #1361
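
The precombine key added to the Redshift upsert (#1304, #1319) can be pictured with a small in-memory sketch. It assumes that, when a primary key matches, the row with the larger precombine value is kept. The function name `upsert_with_precombine` and the dict-based rows are hypothetical and are not part of awswrangler.

```python
from typing import Any, Dict, List


def upsert_with_precombine(
    target: List[Dict[str, Any]],
    source: List[Dict[str, Any]],
    primary_key: str,
    precombine_key: str,
) -> List[Dict[str, Any]]:
    """Merge source rows into target rows (hypothetical helper, not awswrangler API)."""
    merged = {row[primary_key]: row for row in target}
    for row in source:
        current = merged.get(row[primary_key])
        # Insert unseen keys; replace an existing row only if the incoming
        # row carries a larger precombine value (assumed semantics).
        if current is None or row[precombine_key] > current[precombine_key]:
            merged[row[primary_key]] = row
    return list(merged.values())


target = [
    {"id": 1, "value": "old", "updated_at": 10},
    {"id": 2, "value": "keep", "updated_at": 20},
]
source = [
    {"id": 1, "value": "new", "updated_at": 15},
    {"id": 2, "value": "stale", "updated_at": 5},
]
result = upsert_with_precombine(target, source, primary_key="id", precombine_key="updated_at")
```

Here the row with id 1 is replaced because the incoming `updated_at` is larger, while the row with id 2 is left untouched.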

    Documentation

    • be more explicit about awswrangler.s3.to_parquet overwrite behavior #1300
    • fix Python Version in Readme #1302

    Bug Fix

    • set encoding to utf-8 when no encoding is specified when reading/writing to s3 #1257
    • fix Redshift Locking Behavior #1305
    • specify cfn deletion policy for sqlserver and oracle instances #1378
    • to_sql() make column names quoted identifiers to allow sql keywords #1392
    • fix extension dtype index handling #1333
    • fix issue with redshift.to_sql() method when mode set to "upsert" and schema contains a hyphen #1360
    • timestream - array cols to str #1368
    • read_parquet Does Not Throw Error for Missing Column #1370
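
The quoted-identifier change to to_sql() (#1392) matters when a DataFrame column is named after a SQL keyword. The sketch below shows the general idea with the standard-library `sqlite3` module standing in for the target database. The helper `quote_identifier` and the table name are illustrative assumptions, not awswrangler's implementation.

```python
import sqlite3


def quote_identifier(name: str) -> str:
    """Wrap a column name in double quotes, doubling any embedded quote (hypothetical helper)."""
    return '"' + name.replace('"', '""') + '"'


columns = ["id", "order", "select"]  # "order" and "select" are SQL keywords
column_sql = ", ".join(quote_identifier(c) for c in columns)
placeholders = ", ".join("?" for _ in columns)

con = sqlite3.connect(":memory:")
con.execute(f"CREATE TABLE my_table ({column_sql})")
# Without the quotes, this INSERT would fail to parse because of the keywords.
con.execute(f"INSERT INTO my_table ({column_sql}) VALUES ({placeholders})", (1, "first", "all"))
rows = con.execute(f"SELECT {column_sql} FROM my_table").fetchall()
con.close()
```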

    Thanks

    We thank the following contributors/users for their work on this release:

    @bnimam, @IldarAlmakaev, @syokoysn, @thomasniebler, @maxdavidson91, @takeknock, @Sleekbobby1011, @snikolakis, @willsmith28, @malachi-constant, @cnfait, @jaidisido, @kukushking


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.16.0-py3-none-any.whl(242.73 KB)
    awswrangler-layer-2.16.0-py3.7.zip(42.48 MB)
    awswrangler-layer-2.16.0-py3.8-arm64.zip(39.02 MB)
    awswrangler-layer-2.16.0-py3.8.zip(43.54 MB)
    awswrangler-layer-2.16.0-py3.9-arm64.zip(39.01 MB)
    awswrangler-layer-2.16.0-py3.9.zip(43.54 MB)
  • 2.15.1(Apr 11, 2022)

    Noteworthy

    ⚠️ Dropped Python 3.6 support

    ⚠️ For platforms without PyArrow 7 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Patch

    • Add sparql extra & make SPARQLWrapper dependency optional #1252

    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.15.1-py3-none-any.whl(234.00 KB)
    awswrangler-layer-2.15.1-py3.7.zip(42.34 MB)
    awswrangler-layer-2.15.1-py3.8-arm64.zip(38.90 MB)
    awswrangler-layer-2.15.1-py3.8.zip(43.42 MB)
    awswrangler-layer-2.15.1-py3.9-arm64.zip(38.88 MB)
    awswrangler-layer-2.15.1-py3.9.zip(43.42 MB)
  • 2.15.0(Mar 28, 2022)

    Noteworthy

    ⚠️ Dropped Python 3.6 support

    ⚠️ For platforms without PyArrow 7 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    New Functionalities

    • Amazon Neptune module 🚀 #1084 Check out the tutorial. Thanks to @bechbd & @sakti-mishra !
    • ARM64 Support for Python 3.8 and 3.9 layers 🔥 #1129 Many thanks @cnfait !

    Enhancements

    • Timestream module - support multi-measure records #1214
    • Warnings for implicit float conversion of nulls in to_parquet #1221
    • Support additional sql params in Redshift COPY operation #1210
    • Add create_ctas_table to Athena module #1207
    • S3 Proxy support #1206
    • Add Athena get_named_query_statement #1183
    • Add manifest parameter to 'redshift.copy_from_files' method #1164

    Documentation

    • Update install section #1242
    • Update lambda layers section #1236

    Bug Fix

    • Give precedence to user path for Athena UNLOAD S3 Output Location #1216
    • Honor User specified workgroup in athena.read_sql_query with unload_approach=True #1178
    • Support map type in Redshift copy #1185
    • data_api.rds.read_sql_query() does not preserve data type when column is all NULLS - switches to Boolean #1158
    • Allow decimal values within struct when writing to parquet #1179

    Thanks

    We thank the following contributors/users for their work on this release:

    @bechbd, @sakti-mishra, @mateogianolio, @jasadams, @malachi-constant, @cnfait, @jaidisido, @kukushking


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.15.0-py3-none-any.whl(233.14 KB)
    awswrangler-layer-2.15.0-py3.7.zip(43.98 MB)
    awswrangler-layer-2.15.0-py3.8-arm64.zip(40.51 MB)
    awswrangler-layer-2.15.0-py3.8.zip(45.04 MB)
    awswrangler-layer-2.15.0-py3.9-arm64.zip(40.50 MB)
    awswrangler-layer-2.15.0-py3.9.zip(45.04 MB)
  • 2.14.0(Jan 28, 2022)

    Caveats

    ⚠️ For platforms without PyArrow 6 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    New Functionalities

    • Support Athena Unload 🚀 #1038

    Enhancements

    • Add the ExcludeColumnSchema=True argument to the glue.get_partitions call to reduce response size #1094
    • Add PyArrow flavor argument to write_parquet via pyarrow_additional_kwargs #1057
    • Add rename_duplicate_columns and handle_duplicate_columns flag to sanitize_dataframe_columns_names method #1124
    • Add timestamp_as_object argument to all database read_sql_table methods #1130
    • Add ignore_null to read_parquet_metadata method #1125
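
The duplicate-column handling added to sanitize_dataframe_columns_names (#1124) can be sketched in plain Python. This is a simplified approximation: it only lowercases names and replaces characters outside `a-z`, `0-9` and `_`, and the `rename_duplicates` flag and numeric suffix scheme are assumptions made for illustration rather than the library's exact rules.

```python
import re
from typing import Dict, List


def sanitize_name(name: str) -> str:
    """Lowercase a name and replace characters other than a-z, 0-9 and _ with underscores."""
    return re.sub(r"[^a-z0-9_]", "_", name.lower())


def sanitize_columns(names: List[str], rename_duplicates: bool = True) -> List[str]:
    """Sanitize column names and optionally suffix repeats (hypothetical helper)."""
    seen: Dict[str, int] = {}
    result: List[str] = []
    for name in names:
        clean = sanitize_name(name)
        count = seen.get(clean, 0)
        seen[clean] = count + 1
        # A repeated sanitized name gets a numeric suffix so columns stay unique.
        if count and rename_duplicates:
            clean = f"{clean}_{count}"
        result.append(clean)
    return result


columns = sanitize_columns(["My Col", "my-col", "ID", "value"])
```

Both "My Col" and "my-col" sanitize to the same name, which is why a renaming or dropping strategy is needed at all.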

    Documentation

    • Improve documentation on installing SAR Lambda layers with the CDK #1097
    • Fix broken link to tutorial in to_parquet method #1058

    Bug Fix

    • Ensure that partition locations retrieved from AWS Glue always end in a "/" #1094
    • Fix bucketing overflow issue in Athena #1086

    Thanks

    We thank the following contributors/users for their work on this release:

    @dennyau, @kailukowiak, @lucasmo, @moykeen, @RigoIce, @vlieven, @kepler, @mdavis-xyz, @ConstantinoSchillebeeckx, @kukushking, @jaidisido


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.14.0-py3-none-any.whl(221.29 KB)
    awswrangler-layer-2.14.0-py3.6.zip(37.31 MB)
    awswrangler-layer-2.14.0-py3.7.zip(40.59 MB)
    awswrangler-layer-2.14.0-py3.8.zip(41.70 MB)
    awswrangler-layer-2.14.0-py3.9.zip(41.68 MB)
  • 2.13.0(Dec 3, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 6 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Breaking changes

    • Fix sanitize methods to align with Glue/Hive naming conventions #579

    New Functionalities

    • AWS Lake Formation Governed Tables 🚀 #570
    • Support for Python 3.10 🔥 #973
    • Add partitioning to JSON datasets #962
    • Add ability to use unbuffered cursor for large MySQL datasets #928
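
Partitioned JSON datasets (#962) follow the same Hive-style `key=value/` layout used for other S3 datasets. The sketch below only plans which JSON Lines payload would go under which prefix. The function names are hypothetical, nothing is written to S3, and dropping partition columns from the payload is an assumption of this sketch.

```python
import json
from collections import defaultdict
from typing import Any, Dict, List


def partition_prefix(base_path: str, row: Dict[str, Any], partition_cols: List[str]) -> str:
    """Build a Hive-style prefix such as s3://bucket/dataset/year=2021/."""
    suffix = "".join(f"{col}={row[col]}/" for col in partition_cols)
    return f"{base_path.rstrip('/')}/{suffix}"


def plan_json_partitions(
    base_path: str, rows: List[Dict[str, Any]], partition_cols: List[str]
) -> Dict[str, str]:
    """Group rows by partition prefix and render each group as JSON Lines."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for row in rows:
        # Partition values are encoded in the path, so they are left out of the payload.
        payload = {k: v for k, v in row.items() if k not in partition_cols}
        groups[partition_prefix(base_path, row, partition_cols)].append(json.dumps(payload))
    return {prefix: "\n".join(lines) for prefix, lines in groups.items()}


rows = [
    {"id": 1, "year": 2021, "value": "foo"},
    {"id": 2, "year": 2021, "value": "boo"},
    {"id": 3, "year": 2022, "value": "bar"},
]
plan = plan_json_partitions("s3://bucket/dataset/", rows, ["year"])
```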

    Enhancements

    • Add awswrangler.s3.list_buckets #997
    • Add partitions_parameters to catalog partitions methods #1035
    • Refactor pagination config in list objects #955
    • Add error message to EmptyDataframe exception #991

    Documentation

    • Clarify docs & add tutorial on schema evolution for CSV datasets #964

    Bug Fix

    • catalog.add_column() without column_comment triggers exception #1017
    • catalog.create_parquet_table Key in dictionary does not always exist #998
    • Fix Catalog StorageDescriptor get #969

    Thanks

    We thank the following contributors/users for their work on this release:

    @csabz09, @Falydoor, @moritzkoerber, @maxispeicher, @kukushking, @jaidisido


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.13.0-py3-none-any.whl(217.33 KB)
    awswrangler-layer-2.13.0-py3.6.zip(38.81 MB)
    awswrangler-layer-2.13.0-py3.7.zip(40.52 MB)
    awswrangler-layer-2.13.0-py3.8.zip(41.02 MB)
    awswrangler-layer-2.13.0-py3.9.zip(41.00 MB)
  • 2.12.1(Oct 18, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 5 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Patch

    • Removing unnecessary dev dependencies from main #961

    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.12.1-py3-none-any.whl(206.15 KB)
    awswrangler-layer-2.12.1-py3.6.zip(37.33 MB)
    awswrangler-layer-2.12.1-py3.7.zip(39.09 MB)
    awswrangler-layer-2.12.1-py3.8.zip(39.66 MB)
  • 2.12.0(Oct 13, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 5 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    New Functionalities

    • Add support for OpenSearch #891 🔥 Check out the tutorial. Many thanks to @AssafMentzer and @mureddy19 for this contribution

    Enhancements

    • redshift.read_sql_query - handle empty table corner case #874
    • Refactor read parquet table to reduce file list scan based on available partitions #878
    • Shrink lambda layer with strip command #884
    • Enabling DynamoDB endpoint URL #887
    • EMR jobs concurrency #889
    • Add feature to allow custom AMI for EMR #907
    • wr.redshift.unload_to_files empties the S3 folder instead of overwriting existing files #914
    • Add catalog_id arg to wr.catalog.does_table_exist #920
    • Add endpoint_url for AWS Secrets Manager #929

    Documentation

    • Update docs for awswrangler.s3.to_csv #868

    Bug Fix

    • wr.mysql.to_sql with use_column_names=True when column names are reserved words #918

    Thanks

    We thank the following contributors/users for their work on this release:

    @AssafMentzer, @mureddy19, @isichei, @DonnaArt, @kukushking, @jaidisido


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.12.0-py3-none-any.whl(206.20 KB)
    awswrangler-layer-2.12.0-py3.6.zip(59.05 MB)
    awswrangler-layer-2.12.0-py3.7.zip(60.79 MB)
    awswrangler-layer-2.12.0-py3.8.zip(61.29 MB)
  • 2.11.0(Sep 1, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 5 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    New Functionalities

    • Redshift and RDS Data API support #828 🚀 Check out the tutorial. Many thanks to @pwithams for this contribution

    Enhancements

    • Upgrade to PyArrow 5 #861
    • Add Pagination for TimestreamDB #838

    Documentation

    • Clarifying structure of SSM secrets in connect methods #871

    Bug Fix

    • Use botocore's Loader and ServiceModel to extract accepted kwargs #832

    Thanks

    We thank the following contributors/users for their work on this release:

    @pwithams, @maxispeicher, @kukushking, @jaidisido


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.11.0-py3-none-any.whl(194.22 KB)
    awswrangler-layer-2.11.0-py3.6.zip(44.41 MB)
    awswrangler-layer-2.11.0-py3.7.zip(46.18 MB)
    awswrangler-layer-2.11.0-py3.8.zip(47.26 MB)
  • 2.10.0(Jul 21, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 4 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Enhancements

    • Add upsert support for Postgresql #807
    • Add schema evolution parameter to wr.s3.to_csv #787
    • Enable order by in CTAS Athena queries #785
    • Add header to wr.s3.to_csv when dataset = True #765
    • Add CSV as unload format to wr.redshift.unload_to_files #761
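
An upsert, as added for PostgreSQL in #807, inserts new rows and updates rows whose key already exists. The sketch below shows that behavior with the standard-library `sqlite3` module, whose `ON CONFLICT ... DO UPDATE` clause is modeled on PostgreSQL's. It assumes SQLite 3.24 or newer and does not claim to reproduce the statement awswrangler generates.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, value TEXT)")
con.executemany("INSERT INTO my_table (id, value) VALUES (?, ?)", [(1, "foo"), (2, "boo")])

# "excluded" refers to the row that failed to insert because of the key conflict.
upsert_sql = (
    "INSERT INTO my_table (id, value) VALUES (?, ?) "
    "ON CONFLICT (id) DO UPDATE SET value = excluded.value"
)
con.executemany(upsert_sql, [(2, "bar"), (3, "baz")])
rows = con.execute("SELECT id, value FROM my_table ORDER BY id").fetchall()
con.close()
```

Row 2 is updated in place and row 3 is inserted, while row 1 is left unchanged.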

    Bug Fix

    • Fix deleting CTAS temporary Glue tables #782
    • Ensure safe get of Glue table parameters #779 and #783

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @kukushking, @jaidisido, @mohdaliiqbal


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.10.0-py3-none-any.whl(180.47 KB)
    awswrangler-layer-2.10.0-py3.6.zip(42.68 MB)
    awswrangler-layer-2.10.0-py3.7.zip(44.42 MB)
    awswrangler-layer-2.10.0-py3.8.zip(45.08 MB)
  • 2.9.0(Jun 18, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 4 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Enhancements

    • Enable server-side predicate filtering using S3 Select 🚀 #678
    • Support VersionId parameter for S3 read operations #721
    • Enable prefix in output S3 files for wr.redshift.unload_to_files #729
    • Add option to skip commit on wr.redshift.to_sql #705
    • Move integration test infrastructure to CDK 🎉 #706

    Bug Fix

    • Wait until athena query results bucket is created #735
    • Remove explicit Excel engine configuration #742
    • Fix bucketing types #719
    • Change end_time to UTC #720

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @kukushking, @jaidisido


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.9.0-py3-none-any.whl(179.25 KB)
    awswrangler-layer-2.9.0-py3.6.zip(42.65 MB)
    awswrangler-layer-2.9.0-py3.7.zip(43.24 MB)
    awswrangler-layer-2.9.0-py3.8.zip(43.87 MB)
  • 2.8.0(May 19, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 4 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Documentation

    • Install Lambda Layers and Python wheels from public S3 bucket 🎉 #666
    • Clarified docs around potential in-place mutation of dataframe when using to_parquet #669

    Enhancements

    • Enable parallel s3 downloads (~20% speedup) 🚀 #644
    • Apache Arrow 4.0.0 support (enables ARM instances support as well) #557
    • Enable LOCK before concurrent COPY calls in Redshift #665
    • Make use of Pyarrow iter_batches (>= 3.0.0 only) #660
    • Enable additional options when overwriting Redshift table (drop, truncate, cascade) #671
    • Reuse s3 client across threads for s3 range requests #684
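
Parallel downloads and range requests (#644, #684) rest on a simple idea: split an object into byte ranges, fetch the ranges concurrently, and join them in order. The sketch below illustrates this with an in-memory byte string instead of S3. `split_ranges` and `fetch_range` are hypothetical helpers, and the chunk size is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def split_ranges(size: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) byte ranges covering an object of the given size."""
    return [(start, min(start + chunk_size, size) - 1) for start in range(0, size, chunk_size)]


def fetch_range(blob: bytes, byte_range: Tuple[int, int]) -> bytes:
    """Stand-in for a ranged GET; a real client would send 'Range: bytes=start-end'."""
    start, end = byte_range
    return blob[start : end + 1]


blob = bytes(range(256)) * 40  # 10,240 bytes standing in for an S3 object
ranges = split_ranges(len(blob), chunk_size=4096)
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() yields results in input order, so the chunks can be joined directly.
    chunks = list(pool.map(lambda r: fetch_range(blob, r), ranges))
downloaded = b"".join(chunks)
```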

    Bug Fix

    • Add dtypes for empty ctas athena queries #659
    • Add Serde properties when creating CSV table #672
    • Pass SSL properties from Glue Connection to MySQL #554

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @kukushking, @igorborgest, @gballardin, @eferm, @jaklan, @Falydoor, @chariottrider, @chriscugliotta, @konradsemsch, @gvermillion, @russellbrooks, @mshober.


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.8.0-py3-none-any.whl(175.13 KB)
    awswrangler-layer-2.8.0-py3.6.zip(42.64 MB)
    awswrangler-layer-2.8.0-py3.7.zip(43.22 MB)
    awswrangler-layer-2.8.0-py3.8.zip(43.86 MB)
  • 2.7.0(Apr 15, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 3 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Documentation

    • Updated documentation to clarify wr.athena.read_sql_query params argument use #609

    New Functionalities

    • Supporting MySQL upserts #608
    • Enable prepending S3 parquet files with a prefix in wr.s3.to_parquet #617
    • Add exist_ok flag to safely create a Glue database #642
    • Add "Unsupported Pyarrow type" exception #639

    Bug Fix

    • Fix chunked mode in wr.s3.read_parquet_table #627
    • Fix missing \ character from wr.s3.read_parquet_table method #638
    • Support postgres as an engine value #630
    • Add default workgroup result configuration #633
    • Raise exception when merge_upsert_table fails or data_quality is insufficient #601
    • Fixing nested structure bug in athena2pyarrow method #612

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @igorborgest, @mattboyd-aws, @vlieven, @bentkibler, @adarsh-chauhan, @impredicative, @nmduarteus, @JoshCrosby, @TakumiHaruta, @zdk123, @tuannguyen0901, @jiteshsoni, @luminita.


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.7.0-py3-none-any.whl(172.06 KB)
    awswrangler-layer-2.7.0-py3.6.zip(41.19 MB)
    awswrangler-layer-2.7.0-py3.7.zip(41.78 MB)
    awswrangler-layer-2.7.0-py3.8.zip(41.84 MB)
  • 2.6.0(Mar 16, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 3 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Enhancements

    • Added a chunksize parameter to the to_sql function. Default set to 200. Decreased insertion time from 120 to 1 second #599
    • path argument is now optional in s3.to_parquet and s3.to_csv functions #586
    • Added a map_types boolean (set to True by default) to convert pyarrow DataTypes to pandas ExtensionDtypes #580
    • Added optional ctas_database_name argument to store ctas_temporary_table in an alternative database #576
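
The chunksize parameter for to_sql (#599) amounts to sending rows in batches rather than one statement per row. A minimal sketch of that batching, using the standard-library `sqlite3` module as a stand-in database, is shown below. The `chunked` helper and the table are hypothetical. Only the batch size of 200 comes from the release note.

```python
import sqlite3
from typing import Iterator, List, Tuple

Row = Tuple[int, str]


def chunked(rows: List[Row], chunksize: int) -> Iterator[List[Row]]:
    """Yield consecutive slices of at most chunksize rows."""
    for start in range(0, len(rows), chunksize):
        yield rows[start : start + chunksize]


rows = [(i, f"value-{i}") for i in range(450)]
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE my_table (id INTEGER, value TEXT)")

batch_sizes = []
for batch in chunked(rows, chunksize=200):
    # One executemany call per batch keeps the number of round trips low.
    con.executemany("INSERT INTO my_table (id, value) VALUES (?, ?)", batch)
    batch_sizes.append(len(batch))
con.commit()
total = con.execute("SELECT COUNT(*) FROM my_table").fetchone()[0]
con.close()
```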

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @igorborgest, @ilyanoskov, @VashMKS, @jmahlik, @dimapod, @Reeska


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.6.0-py3-none-any.whl(170.55 KB)
    awswrangler-layer-2.6.0-py3.6.zip(41.08 MB)
    awswrangler-layer-2.6.0-py3.7.zip(41.66 MB)
    awswrangler-layer-2.6.0-py3.8.zip(41.70 MB)
  • 2.5.0(Mar 3, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 3 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Documentation

    • New HTML tutorials #551
    • Use bump2version for changing version numbers #573
    • Mishandling of wildcard characters in read_parquet #564

    Enhancements

    • Support for ExpectedBucketOwner #562

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @impredicative, @adarsh-chauhan, @Malkard.


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.5.0-py3-none-any.whl(168.46 KB)
    awswrangler-layer-2.5.0-py3.6.zip(40.96 MB)
    awswrangler-layer-2.5.0-py3.7.zip(41.53 MB)
    awswrangler-layer-2.5.0-py3.8.zip(41.57 MB)
  • 2.4.0-docs(Feb 4, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 3 support (e.g. EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Documentation

    • Update to include PyArrow 3 caveats for EMR and Glue PySpark Job. #546 #547

    New Functionalities

    • Redshift COPY now supports the new SUPER type (i.e. SERIALIZETOJSON) #514
    • S3 Upload/download files #506
    • Include dataset BUCKETING for s3 datasets writing #443
    • Enable Merge Upsert for existing Glue Tables on Primary Keys #503
    • Support Requester Pays S3 Buckets #430
    • Add botocore Config to wr.config #535

    Enhancements

    • Pandas 1.2.1 support #525
    • Numpy 1.20.0 support
    • Apache Arrow 3.0.0 support #531
    • Python 3.9 support #454

    Bug Fix

    • Return DataFrame with unique index for Athena CTAS queries #527
    • Remove unnecessary schema inference. #524

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @danielwo, @jiteshsoni, @igorborgest, @njdanielsen, @eric-valente, @gvermillion, @zseder, @gdbassett, @orenmazor, @senorkrabs, @Natalie-Caruana, @dragonH, @nikwerhypoport, @hwangji.


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.4.0-py3-none-any.whl(167.60 KB)
    awswrangler-layer-2.4.0-py3.6.zip(40.95 MB)
    awswrangler-layer-2.4.0-py3.7.zip(41.51 MB)
    awswrangler-layer-2.4.0-py3.8.zip(41.56 MB)
  • 2.4.0(Feb 3, 2021)

    New Functionalities

    • Redshift COPY now supports the new SUPER type (i.e. SERIALIZETOJSON) #514
    • S3 Upload/download files #506
    • Include dataset BUCKETING for s3 datasets writing #443
    • Enable Merge Upsert for existing Glue Tables on Primary Keys #503
    • Support Requester Pays S3 Buckets #430
    • Add botocore Config to wr.config #535

    Enhancements

    • Pandas 1.2.1 support #525
    • Numpy 1.20.0 support
    • Apache Arrow 3.0.0 support #531
    • Python 3.9 support #454

    Bug Fix

    • Return DataFrame with unique index for Athena CTAS queries #527
    • Remove unnecessary schema inference. #524

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @danielwo, @jiteshsoni, @igorborgest, @njdanielsen, @eric-valente, @gvermillion, @zseder, @gdbassett, @orenmazor, @senorkrabs, @Natalie-Caruana.


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.4.0-py3-none-any.whl(167.60 KB)
    awswrangler-layer-2.4.0-py3.6.zip(40.95 MB)
    awswrangler-layer-2.4.0-py3.7.zip(41.51 MB)
    awswrangler-layer-2.4.0-py3.8.zip(41.56 MB)
  • 2.3.0(Jan 10, 2021)

    New Functionalities

    • DynamoDB support #448
    • SQLServer support (Driver must be installed separately) #356
    • Excel files support #419 #509
    • Amazon S3 Access Point support #393
    • Amazon Chime initial support #494
    • Write compressed CSV and JSON files on S3 #308 #359 #412

    Enhancements

    • Add query parameters for Athena #432
    • Add metadata caching for Athena #461
    • Add suffix filters for s3.read_parquet_table() #495

    Bug Fix

    • Fix keep_files behavior for failed Redshift COPY executions #505

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @danielwo, @jiteshsoni, @gvermillion, @rodalarcon, @imanebosch, @dwbelliston, @tochandrashekhar, @kylepierce, @njdanielsen, @jasadams, @gtossou, @JasonSanchez, @kokes, @hanan-vian, @igorborgest.


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.3.0-py3-none-any.whl(160.79 KB)
    awswrangler-layer-2.3.0-py3.6.zip(40.52 MB)
    awswrangler-layer-2.3.0-py3.7.zip(40.73 MB)
    awswrangler-layer-2.3.0-py3.8.zip(40.79 MB)
  • 2.2.0(Dec 23, 2020)

    New Functionalities

    • Add aws_access_key_id, aws_secret_access_key, aws_session_token and boto3_session for Redshift copy/unload #484

    Bug Fix

    • Remove dtype print statement #487

    Thanks

    We thank the following contributors/users for their work on this release:

    @danielwo, @thetimbecker, @njdanielsen, @igorborgest.


    P.S. Lambda Layer zip file and Glue wheel/egg files are available below. Just upload it and run!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.2.0-py3-none-any.whl(147.74 KB)
    awswrangler-2.2.0-py3.6.egg(319.53 KB)
    awswrangler-layer-2.2.0-py3.6.zip(39.52 MB)
    awswrangler-layer-2.2.0-py3.7.zip(39.45 MB)
    awswrangler-layer-2.2.0-py3.8.zip(39.52 MB)
  • 2.1.0(Dec 21, 2020)

    New Functionalities

    • Add secretmanager module and support for databases connections #402
    import awswrangler as wr

    # Connection details are read from the Secrets Manager secret "my-secret"
    con = wr.redshift.connect(secret_id="my-secret", dbname="my-db")
    df = wr.redshift.read_sql_query("SELECT ...", con=con)
    con.close()
    

    Bug Fix

    • Fix connection attributes quoting for wr.*.connect() #481
    • Fix parquet table append for nested struct columns #480

    Thanks

    We thank the following contributors/users for their work on this release:

    @danielwo, @nmduarteus, @nivf33, @kinghuang, @igorborgest.


    P.S. Lambda Layer zip file and Glue wheel/egg files are available below. Just upload it and run!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.1.0-py3-none-any.whl(147.04 KB)
    awswrangler-2.1.0-py3.6.egg(318.06 KB)
    awswrangler-layer-2.1.0-py3.6.zip(39.52 MB)
    awswrangler-layer-2.1.0-py3.7.zip(39.45 MB)
    awswrangler-layer-2.1.0-py3.8.zip(39.51 MB)
Owner: Amazon Web Services - Labs (AWS Labs)