Pyspark Spotify ETL

Description

This is my first data engineering project. It extracts the user's recently played tracks from Spotify's API, transforms the data, and loads it into PostgreSQL through a SQLAlchemy engine. The data is shown as a Spark DataFrame before loading, and the whole ETL job is scheduled with crontab. The access token never goes stale because the script starts by requesting a fresh one with an HTTP POST to Spotify's token API.
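The token refresh mentioned above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: it assumes the refresh-token flow against Spotify's token endpoint and uses the standard library's urllib instead of whatever HTTP client the script uses. The function names and the placeholder credentials are hypothetical.

```python
import base64
import json
import urllib.parse
import urllib.request

# Spotify's token endpoint for the refresh-token flow.
TOKEN_URL = "https://accounts.spotify.com/api/token"


def build_refresh_request(client_id, client_secret, refresh_token):
    """Build (but do not send) the POST request that trades a refresh token
    for a new access token. Function name is hypothetical."""
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    basic_auth = base64.b64encode(credentials).decode("ascii")
    body = urllib.parse.urlencode(
        {"grant_type": "refresh_token", "refresh_token": refresh_token}
    ).encode("ascii")
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={
            "Authorization": f"Basic {basic_auth}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


def fetch_access_token(request):
    """Send the request and return the short-lived access token
    from the JSON reply. Needs network access and real credentials."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["access_token"]


# Placeholder values; replace them with your own app credentials.
request = build_refresh_request("my-client-id", "my-client-secret", "my-refresh-token")
print(request.get_method(), request.full_url)
```

Calling fetch_access_token(request) at the top of each run would give the rest of the job a fresh token, which is why the hourly expiry stops being a problem.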

The purpose of this project is to help those who, like me, want to become Data Engineers create their first project.

Essentials

Extra modules that must be imported: sys, json, datetime (all part of the Python standard library).
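As a rough idea of where json and datetime might come in during the transform step, the sketch below flattens a recently-played response into rows. It assumes the usual shape of Spotify's response (an "items" list with "track" and "played_at" fields). The function name, the output column names and the sample payload are made up for illustration and are not taken from the project's script.

```python
import json
from datetime import datetime


def flatten_recently_played(payload):
    """Turn the nested API response into a flat list of row dicts.
    Column names here are hypothetical."""
    rows = []
    for item in payload["items"]:
        # Spotify timestamps end in "Z"; swap it for an explicit UTC offset
        # so datetime.fromisoformat can parse them.
        played_at = datetime.fromisoformat(item["played_at"].replace("Z", "+00:00"))
        rows.append(
            {
                "song_name": item["track"]["name"],
                "artist_name": item["track"]["artists"][0]["name"],
                "played_at": played_at,
                "played_date": played_at.date().isoformat(),
            }
        )
    return rows


# Invented sample payload with a single played track.
sample = json.loads(
    '{"items": [{"played_at": "2022-01-01T12:34:56.789Z",'
    ' "track": {"name": "Song", "artists": [{"name": "Artist"}]}}]}'
)
for row in flatten_recently_played(sample):
    print(row["played_date"], row["artist_name"], row["song_name"])
```

A list of flat dicts like this is easy to hand to Spark or pandas for display before loading.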

ETL Execution

Install all the necessary libraries from the Pipfile.
Read the "Token_request_instructions" to get your own refresh token. If you'd rather not, you can get a temporary token from https://developer.spotify.com/console/get-recently-played/, but it has to be replaced every hour.
Add your PostgreSQL credentials to the engine variable. If you'll be using another RDBMS, see https://docs.sqlalchemy.org/en/14/core/engines.html for the matching connection URL.
Create SQL Database/Table (Optional).
Create a bash file. This file is where you'll write down the paths to Spark, Python and your script. Without it you'll get a "ModuleNotFoundError" for each module you import inside your script. (Think of this as the ETL's own ~/.bash_profile.)
Create a new crontab, or use the existing one if you want the job to run at midnight every day.
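For the credentials step above, the value given to the engine is a database URL. The sketch below shows one way to assemble it for PostgreSQL, escaping special characters in the password so they don't break the URL. The helper name and all credential values are placeholders, and the project's script may build its engine differently.

```python
from urllib.parse import quote_plus


def postgres_url(user, password, host, port, database):
    """Return a SQLAlchemy-style PostgreSQL URL with escaped credentials.
    Helper name is hypothetical."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Placeholder credentials; use your own.
url = postgres_url("etl_user", "p@ss/word", "localhost", 5432, "spotify")
print(url)

# With SQLAlchemy installed, the URL would then be passed on as:
#   from sqlalchemy import create_engine
#   engine = create_engine(url)
```

Escaping matters because characters such as "@" or "/" in a password would otherwise be read as part of the URL structure.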

Extras

To verify that your scheduled job is working, you can temporarily change the crontab schedule to "* * * * *", which runs it every minute.
See https://developer.spotify.com/documentation/general/guides/scopes/ for other Spotify scopes in case you don't want to use "recently played tracks".
Thank you, Karolina Sowinska, for your DE beginners guide.
