In this project, an ETL pipeline is built on a data warehouse hosted on AWS Redshift.

Overview

ETL Pipeline for AWS

Project Description

In this project, an ETL pipeline is built on a data warehouse hosted on AWS Redshift. The data is loaded from S3 into staging tables on Redshift, and SQL queries then create the analytics tables from the staging tables.

Dataset Structure

The dataset consists of two parts, the song data and the log data, both stored in an S3 bucket.

Song Data

The song data is a dataset with millions of entries. Each file is in JSON format and contains data about one song and the artist of that song. The files are partitioned by the first three letters of the song ID. A single entry of the song dataset looks like this:

    {
        "num_songs": 1,
        "artist_id": "ARJIE2Y1187B994AB7",
        "artist_latitude": null,
        "artist_longitude": null,
        "artist_location": "",
        "artist_name": "Line Renaud",
        "song_id": "SOUPIRU12A6D4FA1E1",
        "title": "Der Kleine Dompfaff",
        "duration": 152.92036,
        "year": 0
    }
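
As a minimal sketch of how such a record could be read in Python, the snippet below parses the sample entry with the standard `json` module. The `Song` class and the `parse_song` helper are hypothetical names introduced for illustration, and treating `year` 0 as "unknown" is an assumption, not something the project states.

```python
import json
from dataclasses import dataclass
from typing import Optional

# The sample song record shown above, as raw JSON text.
RAW_SONG = """
{
    "num_songs": 1,
    "artist_id": "ARJIE2Y1187B994AB7",
    "artist_latitude": null,
    "artist_longitude": null,
    "artist_location": "",
    "artist_name": "Line Renaud",
    "song_id": "SOUPIRU12A6D4FA1E1",
    "title": "Der Kleine Dompfaff",
    "duration": 152.92036,
    "year": 0
}
"""


@dataclass
class Song:
    """Hypothetical container for the fields of one song file."""
    song_id: str
    title: str
    artist_id: str
    artist_name: str
    artist_latitude: Optional[float]
    duration: float
    year: Optional[int]


def parse_song(raw: str) -> Song:
    record = json.loads(raw)  # JSON null becomes Python None
    return Song(
        song_id=record["song_id"],
        title=record["title"],
        artist_id=record["artist_id"],
        artist_name=record["artist_name"],
        artist_latitude=record["artist_latitude"],
        duration=record["duration"],
        # Assumption: a year of 0 means the release year is unknown.
        year=record["year"] or None,
    )


song = parse_song(RAW_SONG)
print(song.title, song.duration, song.year)
```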
    

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These files simulate activity logs from an imaginary music streaming app, based on configuration settings.

Logs Data

The log dataset is also in JSON format and is produced by the event simulator from the song dataset. It represents the activity logs of the music app. A single entry looks like this:

    {
        "artist": "Pavement",
        "auth": "Logged in",
        "firstName": "Sylvie",
        "gender": "F",
        "itemInSession": 0,
        "lastName": "Cruz",
        "length": 99.16036,
        "level": "free",
        "location": "Klamath Falls, OR",
        "method": "PUT",
        "page": "NextSong",
        "registration": 1.540266e+12,
        "sessionId": 345,
        "song": "Mercy: The Laundromat",
        "status": 200,
        "ts": 1541990258796,
        "userAgent": "Mozilla/5.0...",
        "userId": 10
    }
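
The `ts` field is what the time dimension table is built from. The sketch below shows one way to split it into the time columns listed in the schema (`start_time, hour, day, week, month, year, weekday`). It assumes that `ts` is a Unix timestamp in milliseconds, that times are interpreted as UTC, that `week` means the ISO week number, and that `weekday` counts from Monday as 0. The project may use different conventions, and `time_row` is a hypothetical helper.

```python
from datetime import datetime, timezone


def time_row(ts_ms: int) -> dict:
    """Split an epoch-millisecond timestamp into time-dimension columns."""
    # Assumption: ts is in milliseconds since the Unix epoch, read as UTC.
    start_time = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {
        "start_time": start_time,
        "hour": start_time.hour,
        "day": start_time.day,
        "week": start_time.isocalendar()[1],  # ISO week number
        "month": start_time.month,
        "year": start_time.year,
        "weekday": start_time.weekday(),  # Monday is 0
    }


row = time_row(1541990258796)  # the ts value from the sample log entry
print(row)
```

Under these assumptions the sample value falls on Monday, 12 November 2018, shortly after 02:37 UTC.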
    

Data Warehouse schema

There are two staging tables.

Event table:
*artist VARCHAR, auth VARCHAR, firstName VARCHAR, gender VARCHAR, itemInSession INT, lastName VARCHAR, length DOUBLE PRECISION, level VARCHAR, location VARCHAR, method VARCHAR, page VARCHAR, registration VARCHAR, sessionid INT, song VARCHAR, status INT, ts VARCHAR, userAgent VARCHAR, userId INT*

Song table:
*num_songs INTEGER, artist_id VARCHAR, artist_latitude VARCHAR, artist_longitude VARCHAR, artist_location VARCHAR, artist_name VARCHAR, song_id VARCHAR, title VARCHAR, duration NUMERIC NOT NULL, year INTEGER*
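
The two staging definitions above could be kept as SQL strings and applied with a drop-then-create loop, roughly as sketched here. The table names `staging_events` and `staging_songs` and the helper `drop_and_create` are assumptions made for illustration. The sketch runs against an in-memory SQLite database purely as a local stand-in, so it checks only that the statements are well formed, not how Redshift treats the types.

```python
import sqlite3

# Hypothetical table names; the columns are the ones listed above.
STAGING_TABLES = {
    "staging_events": """
        artist VARCHAR, auth VARCHAR, firstName VARCHAR, gender VARCHAR,
        itemInSession INT, lastName VARCHAR, length DOUBLE PRECISION,
        level VARCHAR, location VARCHAR, method VARCHAR, page VARCHAR,
        registration VARCHAR, sessionid INT, song VARCHAR, status INT,
        ts VARCHAR, userAgent VARCHAR, userId INT
    """,
    "staging_songs": """
        num_songs INTEGER, artist_id VARCHAR, artist_latitude VARCHAR,
        artist_longitude VARCHAR, artist_location VARCHAR,
        artist_name VARCHAR, song_id VARCHAR, title VARCHAR,
        duration NUMERIC NOT NULL, year INTEGER
    """,
}


def drop_and_create(conn) -> None:
    """Drop each staging table if it exists, then create it again."""
    cur = conn.cursor()
    for name, columns in STAGING_TABLES.items():
        cur.execute(f"DROP TABLE IF EXISTS {name}")
        cur.execute(f"CREATE TABLE {name} ({columns})")
    conn.commit()


# SQLite is used here only as a local stand-in for the warehouse.
conn = sqlite3.connect(":memory:")
drop_and_create(conn)
drop_and_create(conn)  # running it twice is safe because of the DROP

tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```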

These staging tables are used to build the dimension tables and the fact table:

Dimension Tables:
users:
*user_id, first_name, last_name, gender, level*
songs:
*song_id, title, artist_id, year, duration*
artists:
*artist_id, name, location, latitude, longitude*
time:
*start_time, hour, day, week, month, year, weekday*
Fact Table:
songplays:
*songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent*

Every table has a primary key, so that each row in the table can be uniquely identified.
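
To illustrate how a dimension table can be derived from a staging table, here is a sketch that fills `users` from the event staging table with `INSERT ... SELECT DISTINCT`. The staging table name `staging_events` is an assumption, the second input row (an event without a user) is a made-up illustration, and SQLite again stands in for Redshift so the example can run locally. Redshift treats primary keys as informational and does not enforce them, which is why the query itself has to remove duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # local stand-in for the warehouse
conn.executescript("""
CREATE TABLE staging_events (
    userId INT, firstName VARCHAR, lastName VARCHAR,
    gender VARCHAR, level VARCHAR
);
CREATE TABLE users (
    user_id INT PRIMARY KEY, first_name VARCHAR, last_name VARCHAR,
    gender VARCHAR, level VARCHAR
);
""")

events = [
    # Two events by the same user, using values from the sample log entry.
    (10, "Sylvie", "Cruz", "F", "free"),
    (10, "Sylvie", "Cruz", "F", "free"),
    # Hypothetical event with no user attached, to show the NULL filter.
    (None, None, None, None, "free"),
]
conn.executemany("INSERT INTO staging_events VALUES (?, ?, ?, ?, ?)", events)

# One row per distinct user; events without a user are skipped.
conn.execute("""
INSERT INTO users (user_id, first_name, last_name, gender, level)
SELECT DISTINCT userId, firstName, lastName, gender, level
FROM staging_events
WHERE userId IS NOT NULL
""")

users = conn.execute("SELECT * FROM users").fetchall()
print(users)
```

A user whose `level` changes between events would appear twice in this query's result, so a real pipeline needs a rule for which row to keep.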

ETL Process

The ETL process comprises two steps:

  • Load the data from the S3 bucket into the staging tables
  • Insert the data from the staging tables into the dimension and fact tables, which follow a star schema
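
The first step is commonly done with Redshift's `COPY` command, which loads files from S3 in bulk. The sketch below only builds the statement text. The helper `build_copy`, the table name, the bucket path and the role ARN are all placeholders chosen for illustration, and whether this project uses `'auto'` or a JSONPaths file for the JSON mapping is not stated in this README.

```python
from typing import Optional


def build_copy(
    table: str,
    s3_path: str,
    iam_role_arn: str,
    json_option: str = "auto",
    region: Optional[str] = None,
) -> str:
    """Return a Redshift COPY statement for JSON files stored in S3."""
    lines = [
        f"COPY {table}",
        f"FROM '{s3_path}'",
        f"IAM_ROLE '{iam_role_arn}'",
        f"FORMAT AS JSON '{json_option}'",
    ]
    if region is not None:
        lines.append(f"REGION '{region}'")
    return "\n".join(lines) + ";"


# All values below are placeholders, not the project's real settings.
statement = build_copy(
    table="staging_songs",
    s3_path="s3://example-bucket/song_data",
    iam_role_arn="arn:aws:iam::123456789012:role/example-role",
)
print(statement)
```

In practice the statement would be executed over a database connection to the cluster, with the path and role read from the configuration file rather than hard-coded.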

Files Description

- create_tables.py: Drops the tables if they already exist, then creates them.
- etl.py: Reads and processes the data files.
- dwh.cfg: Contains the data warehouse settings for AWS: the CLUSTER, IAM_ROLE and S3 settings for the ETL pipeline.
- sql_queries: Contains the SQL queries for dropping tables, creating tables, and selecting data from them.
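
A file like dwh.cfg can be read with Python's standard `configparser` module. In the sketch below only the section names CLUSTER, IAM_ROLE and S3 come from the description above. Every key and value is a placeholder invented for illustration, and the text is parsed from a string so the example runs without a real file.

```python
import configparser

# Section names follow the README; all keys and values are placeholders.
SAMPLE_CFG = """
[CLUSTER]
HOST = example-cluster.abc123.us-west-2.redshift.amazonaws.com
DB_NAME = dev
DB_PORT = 5439

[IAM_ROLE]
ARN = arn:aws:iam::123456789012:role/example-role

[S3]
LOG_DATA = s3://example-bucket/log_data
SONG_DATA = s3://example-bucket/song_data
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CFG)  # with a real file: config.read("dwh.cfg")

port = config.getint("CLUSTER", "DB_PORT")
role_arn = config.get("IAM_ROLE", "ARN")
song_data = config.get("S3", "SONG_DATA")
print(port, role_arn, song_data)
```

Keeping credentials and endpoints in a config file like this keeps them out of the Python scripts; the file itself should stay out of version control.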
Owner
Mobeen Ahmed