A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

Overview

Streamify

A data pipeline with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

Description

Objective

The project will stream events generated from a fake music streaming service (like Spotify) and create a data pipeline that consumes the real-time data. The data coming in would be similar to an event of a user listening to a song, navigating on the website, authenticating. The data would be processed in real-time and stored to the data lake periodically (every two minutes). The hourly batch job will then consume this data, apply transformations, and create the desired tables for our dashboard to generate analytics. We will try to analyze metrics like popular songs, active users, user demographics etc.

Dataset

Eventsim is a program that generates event data to replicate page requests for a fake music web site. The results look like real use data, but are totally fake. The docker image is borrowed from viirya's fork of it, as the original project has gone without maintenance for a few years now.

Eventsim uses song data from Million Songs Dataset to generate events. I have used a subset of 10000 songs.

Tools & Technologies

Architecture

streamify-architecture

Final Result

dashboard

Setup

WARNING: You will be charged for all the infra setup. You can avail 300$ in credit by creating a new account on GCP.

Pre-requisites

If you already have a Google Cloud account and a working terraform setup, you can skip the pre-requisite steps.

Get Going!

A video walkthrough of how I run my project - YouTube Video

  • Procure infra on GCP with Terraform - Setup
  • (Extra) SSH into your VMs, Forward Ports - Setup
  • Setup Kafka Compute Instance and start sending messages from Eventsim - Setup
  • Setup Spark Cluster for stream processing - Setup
  • Setup Airflow on Compute Instance to trigger the hourly data pipeline - Setup

Debug

If you run into issues, see if you find something in this debug guide.

How can I make this better?!

A lot can still be done :).

  • Choose managed Infra
    • Cloud Composer for Airflow
    • Confluent Cloud for Kafka
  • Create your own VPC network
  • Build dimensions and facts incrementally instead of full refresh
  • Write data quality tests
  • Create dimensional models for additional business processes
  • Include CI/CD
  • Add more visualizations

Special Mentions

I'd like to thank the DataTalks.Club for offering this Data Engineering course for completely free. All the things I learnt there, enabled me to come up with this project. If you want to upskill on Data Engineering technologies, please check out the course. :)

Owner
Ankur Chavda
Data Engineer-ish
Ankur Chavda
An app about keyboards, originating from the design of u/Sonnenschirm

keebapp-backend An app about keyboards, originating from the design of u/Sonnenschirm Setup Firstly, ensure that the environment for python is install

8 Sep 04, 2022
TallerStereoVision Convencion Python Chile 2021

TallerStereoVision Convencion Python Chile 2021 Taller Stereo Vision & Python PyCon.cl 2021 Instalación Se recomienta utilizar Virtual Environment pyt

2 Oct 20, 2022
MDAnalysis tool to calculate membrane curvature.

The MDAkit for membrane curvature analysis is part of the Google Summer of Code program and it is linked to a Code of Conduct.

MDAnalysis 19 Oct 20, 2022
Stock Monitoring

Stock Monitoring Description It is a stock monitoring script. This repository is still under developing. Getting Started Prerequisites & Installing pi

Sission 1 Feb 03, 2022
Welcome to my pod transcript search webb app!

pod_transcript_search Welcome to the pod transcript search webb app! Tech stack used: Languages used: Python (for the back-end), JavaScript (for the f

3 Feb 04, 2022
The Python Fuzzer that the world deserves 🐍

pip3 install frelatage Current release : 0.0.2 The Python Fuzzer that the world deserves Installation | How it works | Features | Use Frelatage | Conf

Rog3r 219 Dec 21, 2022
Simple project to learn more about Bézier curves

Python Quadratic Bézier Simple project to learn more about Bézier curves. On this project i used some api's to graphics and gui pygame thorpy in theor

Kenned Ferreira 2 Mar 06, 2022
A person does not exist image bot

A person does not exist image bot

Fayas Noushad 3 Dec 12, 2021
CupScript is a simple programing language made with python

CupScript CupScript is a simple programming language made with python It includes some basic functions, variables, loops, and some other built in func

FUSEN 23 Dec 29, 2022
py-js: python3 objects for max

Simple (and extensible) python3 externals for MaxMSP

Shakeeb Alireza 39 Nov 20, 2022
A Python tool to check ASS subtitles for common mistakes and errors.

A Python tool to check ASS subtitles for common mistakes and errors.

1 Dec 18, 2021
2 Way Sync Between Notion Database and Google Calendar

Notion-and-Google-Calendar-2-Way-Sync 2 Way Sync Between a Notion Database and Google Calendar WARNING: This repo will be undergoing a good bit of cha

248 Dec 26, 2022
Flow control is the order in which statements or blocks of code are executed at runtime based on a condition. Learn Conditional statements, Iterative statements, and Transfer statements

03_Python_Flow_Control Introduction 👋 The control flow statements are an essential part of the Python programming language. A control flow statement

Milaan Parmar / Милан пармар / _米兰 帕尔马 209 Oct 31, 2022
One Ansible Module for using LINE notify API to send notification. It can be required in the collection list.

Ansible Collection - hazel_shen.line_notify Documentation for the collection. ansible-galaxy collection install hazel_shen.line_notify --ignore-certs

Hazel Shen 4 Jul 19, 2021
4Geeks Academy Full-Stack Developer program final project.

Final Project Chavi, Clara y Pablo 4Geeks Academy Full-Stack Developer program final project. Authors Javier Manteca - Coding - chavisam Clara Rojano

1 Feb 05, 2022
This is a small compiler to demonstrate how compilers work.

This is a small compiler to demonstrate how compilers work. It compiles our own dialect to C, while being written in Python.

Md. Tonoy Akando 2 Jul 19, 2022
Get a list of all offline/online members in a discord server

Discord server insights Get a list of all offline/online members in a discord server. Uses Selenium to crawl invite links. Config Download Chrome driv

Prakhar Gurunani 3 Oct 21, 2022
A test repository to build a python package and publish the package to Artifact Registry using GCB

A test repository to build a python package and publish the package to Artifact Registry using GCB. Then have the package be a dependency in a GCF function.

1 Feb 09, 2022
CBLang is a programming language aiming to fix most of my problems with Python

CBLang A bad programming language made in Python. CBLang is a programming language aiming to fix most of my problems with Python (this means that you

Chadderbox 43 Dec 22, 2022
Attempt at creating organized collection of little handy snippets of code I'm receiving along the way

ChaosCode Attempt at creating organized collection of little handy snippets of code I'm receiving along the way I always considered coding and program

INFU 4 Nov 26, 2022