A solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

Overview

Crime Data - Batch Processing:

RDBMS Data Extraction Implementation

This project implements a solution to extract, transform, and load Chicago crime data from an RDS instance into other AWS services.

  • The project contains an Airflow DAG script, two PySpark application scripts, and a bootstrap actions script, each of which is explained below.

Deployment

Preparation:

  • An AWS RDS MySQL instance is created to store the batch of data.
    • An EC2 instance is created to communicate with the RDS instance.
    • The data is loaded onto the EC2 instance.
    • The database and table are created on the RDS instance from this EC2 instance, and the data is loaded into the table.
    • The create&Load.sql file contains the code for this table creation and data-loading step.
    • A secret is stored in AWS Secrets Manager so that the RDS credentials are never hard-coded, and password rotation every 30 days has been configured for security purposes (see the sketch after this list).
  • The DAG described below loads the data prepared in the above steps into the AWS environment.
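
A minimal sketch of how the scripts can look up those rotated credentials at run time, assuming a hypothetical secret name and region and the standard boto3 Secrets Manager client:

    import json
    import boto3

    def get_rds_credentials(secret_id="rds/chicago-crime-db", region="us-east-1"):
        """Fetch the rotated RDS credentials from AWS Secrets Manager."""
        # The secret id and region here are placeholders, not the project's values.
        client = boto3.client("secretsmanager", region_name=region)
        secret_string = client.get_secret_value(SecretId=secret_id)["SecretString"]
        # RDS secrets typically store host, port, username, password and dbname.
        return json.loads(secret_string)

    creds = get_rds_credentials()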

Implementation:

  • The Airflow DAG is placed at the s3://yavula-da-capstone/dag/ location in the S3 bucket, and an environment is created on the Amazon Managed Workflows for Apache Airflow (MWAA) console in a specific VPC.
  • The DAG is scheduled to run daily, with SLA monitoring that triggers an alarm if the tasks take more than 36 minutes to finish the whole ETL process (see the sketch after this list).
  • The DAG usually finishes in 32-34 minutes; if it takes longer, something has interrupted the process and the logs can be checked accordingly.
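
A minimal sketch of how the daily schedule and the 36-minute SLA can be expressed in the DAG definition; the DAG id, start date, and callback body are placeholders rather than the project's actual values:

    from datetime import datetime, timedelta
    from airflow import DAG

    def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
        # In the real setup this could publish to SNS or CloudWatch; here it only logs.
        print(f"SLA missed for tasks: {task_list}")

    default_args = {
        "owner": "airflow",
        "start_date": datetime(2021, 1, 1),
        "sla": timedelta(minutes=36),   # alarm if the ETL runs longer than 36 minutes
        "retries": 0,
    }

    dag = DAG(
        dag_id="chicago_crime_etl",
        default_args=default_args,
        schedule_interval="@daily",
        sla_miss_callback=notify_sla_miss,
        catchup=False,
    )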

emr_job_flow_manual_steps_dag.py

This script defines the Airflow DAG.

Description

  • The script contains the steps for Airflow to create an EMR cluster on AWS for the process explained in the next points.
  • It runs the EMR STEPS that execute the Spark scripts, along with the bootstrap actions defined in the bootstrap_actions.sh script stored in an S3 bucket, which installs required packages such as boto3 on the EMR instances.
  • A step checker is added to watch this process: the step sensor periodically checks whether the last step has completed, been skipped, or terminated (see the sketch after this list).
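
A minimal sketch of how these pieces are typically wired together with the Amazon provider's EMR operators and step sensor, assuming a recent version of the provider package and the dag object from the SLA sketch above; the task ids and step definitions are placeholders, and JOB_FLOW_OVERRIDES is sketched in the bootstrap_actions.sh section below:

    from airflow.providers.amazon.aws.operators.emr import (
        EmrCreateJobFlowOperator,
        EmrAddStepsOperator,
    )
    from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

    # Placeholder step definitions: each step spark-submits one of the scripts from S3.
    SPARK_STEPS = [
        {
            "Name": "ingest_crime_data",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://<bucket>/scripts/spark_ingest_script.py"],
            },
        },
        {
            "Name": "process_crime_data",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://<bucket>/scripts/spark_process_script.py"],
            },
        },
    ]

    # Create the EMR cluster (including the bootstrap actions in JOB_FLOW_OVERRIDES).
    create_emr_cluster = EmrCreateJobFlowOperator(
        task_id="create_emr_cluster",
        job_flow_overrides=JOB_FLOW_OVERRIDES,
        aws_conn_id="aws_default",
        dag=dag,
    )

    # Submit the Spark steps to the cluster created above.
    add_steps = EmrAddStepsOperator(
        task_id="add_steps",
        job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}",
        steps=SPARK_STEPS,
        aws_conn_id="aws_default",
        dag=dag,
    )

    # Step sensor: periodically checks whether the last step has completed,
    # been skipped, or terminated.
    watch_last_step = EmrStepSensor(
        task_id="watch_last_step",
        job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}",
        step_id="{{ task_instance.xcom_pull(task_ids='add_steps', key='return_value')[-1] }}",
        aws_conn_id="aws_default",
        dag=dag,
    )

    create_emr_cluster >> add_steps >> watch_last_step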

spark_ingest_script.py

This Spark script, which is placed in S3 manually, ingests the required data from a table on an RDS instance, stores it in a raw S3 bucket, and catalogs it in Glue.

Description

  • The ingest script connects to the RDS instance using the mysql-connector package.
  • It reads the required crime data from the table into a Spark dataframe, which is then written to S3 and registered in the AWS Glue Data Catalog (see the sketch after this list).
  • S3 file structure where the snapshot data is saved: (bucket)/(key)/(db-name)/(table-name)
  • A Glue Data Catalog table points to the latest partition.
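
A minimal sketch of that ingest flow; the secret name, database, table, and bucket names are placeholders, and the actual script may structure the connection and schema handling differently:

    import json
    import boto3
    import mysql.connector
    from pyspark.sql import SparkSession

    # enableHiveSupport lets Spark register tables in the Glue Data Catalog on EMR.
    spark = (SparkSession.builder
             .appName("crime-data-ingest")
             .enableHiveSupport()
             .getOrCreate())

    # Pull the RDS credentials from Secrets Manager instead of hard-coding them.
    secret = json.loads(
        boto3.client("secretsmanager")
        .get_secret_value(SecretId="rds/chicago-crime-db")["SecretString"]
    )

    # Read the crime table over mysql-connector into a Spark dataframe.
    conn = mysql.connector.connect(
        host=secret["host"],
        user=secret["username"],
        password=secret["password"],
        database="crimes_db",
    )
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM chicago_crimes")
    columns = [desc[0] for desc in cursor.description]
    df = spark.createDataFrame(cursor.fetchall(), schema=columns)

    # Write the snapshot under (bucket)/(key)/(db-name)/(table-name) and
    # register the raw table in the Glue Data Catalog.
    snapshot_path = "s3://<raw-bucket>/<key>/crimes_db/chicago_crimes/"
    spark.sql("CREATE DATABASE IF NOT EXISTS crimes_db")
    (df.write.mode("overwrite")
       .option("path", snapshot_path)
       .saveAsTable("crimes_db.chicago_crimes_raw"))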

spark_process_script.py

This Spark script, which is placed in S3 manually, queries the latest target table, filters the required crime details from it, stores the query results in a new final table, and saves them to the latest partition.

Description

  • The Spark script performs query processing on the crime data.
  • It queries the required crime data from the table, performs some processing, and puts the results into a Spark dataframe, which is then written to S3 and registered in the AWS Glue Data Catalog (see the sketch after this list).
  • S3 file structure where the snapshot data is saved: (bucket)/(key)/(db-name)/(table-name)
  • A Glue Data Catalog table points to the latest partition.
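
A minimal sketch of that processing flow, assuming the raw table sketched in the ingest section and hypothetical column names from the Chicago crime dataset (arrest, primary_type, and so on):

    from datetime import date
    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("crime-data-process")
             .enableHiveSupport()
             .getOrCreate())

    # Query the latest ingested table from the Glue Data Catalog.
    crimes = spark.table("crimes_db.chicago_crimes_raw")

    # Filter the required crime details: here, crimes with no arrest made yet.
    open_crimes = (crimes
                   .filter(F.col("arrest") == False)
                   .select("id", "date", "primary_type", "description",
                           "location_description"))

    # Save the results as a new dated partition of the final table and
    # register it in the Glue Data Catalog.
    (open_crimes
     .withColumn("snapshot_date", F.lit(date.today().isoformat()))
     .write.mode("overwrite")
     .partitionBy("snapshot_date")
     .option("path", "s3://<processed-bucket>/<key>/crimes_db/open_crimes/")
     .saveAsTable("crimes_db.open_crimes"))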

bootstrap_actions.sh

Required for the bootstrap actions.

Description

Installs the packages and dependencies on the cluster that are required for the processes inside the Spark scripts to run.

Deployment

  • This bootstrap script is placed manually in an S3 bucket.
  • The S3 location of the script is referenced in the Airflow DAG's bootstrap actions so that EMR runs it when the cluster is launched (see the sketch after this list).
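
A minimal sketch of how the DAG's cluster configuration can point EMR at that script; the bucket path, release label, and instance sizing are placeholders, not the project's actual values:

    # Hypothetical cluster configuration used by EmrCreateJobFlowOperator in the DAG.
    JOB_FLOW_OVERRIDES = {
        "Name": "chicago-crime-etl",
        "ReleaseLabel": "emr-6.4.0",
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary node", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core nodes", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        # Points EMR at the bootstrap script in S3 so the required packages
        # (for example, boto3) are installed on every node before the steps run.
        "BootstrapActions": [
            {
                "Name": "install_dependencies",
                "ScriptBootstrapAction": {
                    "Path": "s3://<bucket>/bootstrap/bootstrap_actions.sh",
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }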

Business Analysis

The final processed table contains the crime type details for all crimes for which an arrest has not yet been made. This analysis can be viewed from Athena and has also been imported into QuickSight SPICE to view and compare the different types of crimes. An example Athena query is sketched below.
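
As an illustration, the final table could be queried through the Athena API roughly as follows; the database, table, and results location are placeholders:

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Count open (no-arrest) cases per crime type in the final processed table.
    response = athena.start_query_execution(
        QueryString="""
            SELECT primary_type, COUNT(*) AS open_cases
            FROM open_crimes
            GROUP BY primary_type
            ORDER BY open_cases DESC
        """,
        QueryExecutionContext={"Database": "crimes_db"},
        ResultConfiguration={"OutputLocation": "s3://<athena-results-bucket>/queries/"},
    )
    print("Query execution id:", response["QueryExecutionId"])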

Owner
Yesaswi Avula
An Applied Data Science student focused on data analytics, data engineering, business intelligence, ML, big data, and cloud.