ETL pipeline on movie data using Python and PostgreSQL

Last update: Jul 07, 2021

Movies-ETL

Overview

This project consisted of an automated extract, transform, and load (ETL) pipeline. The pipeline extracted movie data from Wikipedia, Kaggle, and MovieLens, then cleaned, transformed, and merged it using Pandas. The product was a merged table of movies and ratings loaded into PostgreSQL.
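The merge step described above can be sketched with `DataFrame.merge`. This is a minimal illustration rather than the project's actual code: the two tiny DataFrames, the `imdb_id` and `title` column names, and the suffixes are hypothetical stand-ins for the cleaned Wikipedia and Kaggle tables.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the cleaned Wikipedia and Kaggle tables.
wiki_movies = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002", "tt0000003"],
    "title": ["First", "Second", "Third"],
})
kaggle_movies = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002", "tt0000004"],
    "title": ["First", "Second (alt)", "Fourth"],
})

# An inner join keeps only movies present in both sources; the suffixes keep
# overlapping column names apart so they can be compared and reconciled later.
merged_movies = wiki_movies.merge(
    kaggle_movies, on="imdb_id", how="inner", suffixes=("_wiki", "_kaggle")
)
```

An inner join is only one possible choice here; a left or outer join would keep movies that appear in just one source, at the cost of more null values to handle.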

Resources

Data sources:
- movies_metadata.csv
- ratings.csv
- wikipedia_movies.json
Software:
- Python
- PostgreSQL
- Pandas
- SQLAlchemy
- Regular Expressions

Results

Final output table: FINAL_Merged_Movies_and_Ratings.csv
Datasets uploaded to PostgreSQL so that other users could analyze the movie data (hackathon).
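As a rough illustration of the upload step, a large CSV such as ratings.csv can be read and appended to a database table in chunks, so the whole file never has to fit in memory at once. The sketch below is an assumption about how this could be done, not the project's actual code: the function name `load_csv_in_chunks`, the sample rows, and the column names are hypothetical, and an in-memory SQLite connection stands in for PostgreSQL so the example runs without a server.

```python
import io
import sqlite3

import pandas as pd


def load_csv_in_chunks(csv_source, table_name, con, chunksize=1_000_000):
    """Append a CSV to a SQL table one chunk at a time; return rows written."""
    rows_written = 0
    # With chunksize set, read_csv yields DataFrames of at most that many rows.
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        chunk.to_sql(table_name, con=con, if_exists="append", index=False)
        rows_written += len(chunk)
    return rows_written


# Tiny demo: three sample rows, written two rows at a time.
sample_csv = io.StringIO("userId,movieId,rating\n1,10,4.0\n1,20,3.5\n2,10,5.0\n")
demo_con = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
total_rows = load_csv_in_chunks(sample_csv, "ratings", demo_con, chunksize=2)
```

With PostgreSQL, the same function would presumably receive a SQLAlchemy engine (created with `sqlalchemy.create_engine` and a PostgreSQL connection string) as `con`, and a file path instead of the in-memory sample.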

Summary

The pipeline was created under the following assumptions:

- I was able to join the Wikipedia, Kaggle, and ratings movie data on the IMDb ID column.
- The Wikipedia dataset didn't have an IMDb ID column, so I had to extract the ID from the URL given for each movie.
- Each dataset had to be cleaned on its own because of overlapping columns (such as 'Director' and 'Directed By'), unnecessary columns, many null values, TV shows, outliers, duplicates, incorrect data types, formatting issues, and other errors.
- The Wikipedia movie data was in JSON format.
- Not every movie had a rating for each rating level.
- The ratings dataset had more than 26 million entries, which created a time constraint and a data-processing challenge.
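Extracting the IMDb ID from a URL, as mentioned in the list above, can be sketched with a regular expression. This is a minimal example under stated assumptions: it assumes IDs look like `tt` followed by at least seven digits, and the function name `extract_imdb_id` and the sample link are illustrative, not taken from the project.

```python
import re

# Assumed ID shape: "tt" followed by seven or more digits.
IMDB_ID_PATTERN = re.compile(r"(tt\d{7,})")


def extract_imdb_id(imdb_link):
    """Return the IMDb ID found in a link, or None if there is none."""
    if not isinstance(imdb_link, str):
        return None  # missing values (None, NaN) have no ID to extract
    match = IMDB_ID_PATTERN.search(imdb_link)
    return match.group(1) if match else None


example_id = extract_imdb_id("https://www.imdb.com/title/tt0114709/")
```

In Pandas, the same pattern could be applied to a whole column with `Series.str.extract`, and rows where no ID is found could then be dropped before joining.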


Owner

Juan Nicolas Serrano
