A solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

Overview

Crime data- Batch Processing:

RDBMS Data Extraction Implementation

This project is intended to implement a solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

  • There is an airflow dag script, 2 pyspark application scripts, and a bootstrap actions script in this project which are explained below.

Deployment

Preparation:

  • An AWS RDS MySQL instance is created to store the batch of data.
    • An EC2 instance is created to communicate with the RDS instance.
    • The data is loaded onto the EC2 instance.
    • The database and table are created on the RDS instance with the help of the above created EC2 instance. The data is loaded in the table created above.
    • The create&Load.sql file contains the code for the above table data preparation step.
    • A secret on the Secrets Manager console is stored to communicate with the RDS instance secretly. Also, password rotation after 30 days has been configured for security purposes.
  • The following dag loads the data created from the above step into the AWS environment.

Implementation:

  • The airflow dag is put in the s3://yavula-da-capstone/dag/ location in the S3 bucket. An environment is created on the Amazon Managed Workflows for Apache Airflow(MWAA) console in a specific VPC.
  • The dag is scheduled to run on a daily basis along with SLA monitoring to trigger an alarm if the tasks take more than 36 minutes to finish the whole ETL process.
  • It usually takes 32-34 minutes to finish the dag processes. But if it takes, more than that, it means that something has interrupted the dag from finishing its process and we can check the logs accordingly.

emr_job_flow_manual_steps_dag.py

This script is used to create an airflow dag.

Description

  • The script has steps for the airflow to create an EMR cluster on AWS for a process which is explained later in the next steps.
  • It runs the STEPS that process the spark script on the EMR along with the bootstrap actions present in the bootstrap_actions.sh script which is in an s3 bucket that will install the required package like boto3 onto the EMR instance.
  • Then the step checker is also added to watch this process. This step sensor will periodically check if that last step is completed or skipped or terminated.

spark_ingest_script.py

The spark script which is put into S3 manually, is used to ingest the required data from a table which is present on an RDS isntance and store the data into a raw s3 bucket and catalog into Glue.

Description

  • The ingest script connects to the RDS instance using the mysql-connector.
  • It takes the required crime data from the table and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

spark_process_script.py

The spark script which is put into S3 manually, is used to query the latest target table, filter required crime details from it, then store the query results into a new final table and further save it to a latest partition.

Description

  • The spark script uses the crime data and performs some query processing using it.
  • It queries the required crime data from the table, performs some processing and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

bootstrap_actions.sh

Required for the bootstrap actions.

Description

Used to install the packages and dependencies on the cluster that are required for the processes inside the spark script to run.

Deployment

  • This bootstrap script is put manually in an S3 bucket.
  • The location of this bucket is used inside the airflow dag to mention in the bootstrap actions that the required actions are present in the script which is in this particular s3 location.

Business Analysis

The final processed table had the crime type details for all the crimes for which the arrest is not made yet. This business analysis can be viewed from Athena and also has been imported into QuickSight Spice to view the details of different types of crimes and their comparisions.

Owner
Yesaswi Avula
An Applied Data Science student with an escalating learning and performance graph Data analytics, Data engineering, Business Intelligence, ML, Big Data & Cloud
Yesaswi Avula
Telegram Link Shortener Bot (With 20 Shorteners)

Telegram ShortenerBot ShortenerBot: 🇬🇧 Telegram Link Shortener Bot (11 + 9 Shorteners) 🇹🇷 Telegram Link Kısaltıcı Bot (11 + 9 Kısaltıcı) All suppo

Hüzünlü Artemis [HuzunluArtemis] 10 May 24, 2022
Please Do Not Throw Sausage Pizza Away - Side Scrolling Up The OSI Stack

Please Do Not Throw Sausage Pizza Away - Side Scrolling Up The OSI Stack

John Capobianco 2 Jan 25, 2022
Python client for Toyota North America service API

toyota-na Python client for Toyota North America service API Install pip install toyota-na[qt] [qt] is required for generating authorization code. Us

Gavin Ni 18 Sep 06, 2022
The source code of the bot that displays erotic images on Discord

説明 このコードはDiscord.pyとNeko APIを使ったNsfw画像表示ボットのソースコードです。 成人向けコンテンツを含むボットなので、不快になる方はこのボットの作成中止をおすすめします。 使い方 まず、install.batを起動してください。 そのあとに、config.json を開き

はなくそ 1 Dec 28, 2021
telegram bot that calculates file hash / Dosya toplamı hesaplayan telegram botu

Telegram File Hash Bot FileHashBot: 🇬🇧 Bot that calculates file hashes. 🇹🇷 Dosya toplamları hesaplayan bot. Demo in Telegram: @FileHashBot 🇬🇧 Se

Hüzünlü Artemis [HuzunluArtemis] 5 Jun 29, 2022
A Fork of Gitlab's Permifrost tool for managing Snowflake Permissions

permifrost-fork This is a fork of the GitLab permifrost project. As the GitLab team is not currently maintaining the project, we've taken on maintenac

Hightouch 7 Oct 13, 2021
Bezlik Year Calendar Planner

Bezlik Year Calendar Planner Scribus script for creating year planners on one page A1 paper format. Script is based on Year-Calendar-Script-for-Scribu

Bohdan Bobrowski 2 May 24, 2022
Slash util - A simple script to add application command support to discord.py v2.0

slash_util is a simple wrapper around slash commands for discord.py This is writ

Maya 28 Nov 16, 2022
New developed moderation discord bot by archisha

Monitor42 New developed moderation discord bot by αrchιshα#5518. Details Prefix: 42! Commands: Moderation Use 42!help to get command list. Invite http

Kamilla Youver 0 Jun 29, 2022
a script to bulk check usernames on multiple site. includes proxy & threading support.

linked-bulk-checker bulk checks username availability on multiple sites info people have been selling these so i just made one to release dm my discor

krul 9 Sep 20, 2021
Python Discord Server Nuker

Untitled Nuker Python Discord Server Nuker Features: Ban Everyone Kick Everyone Rename Everyone Spam To All Channels Delete All Channels Delete All Ro

22 Dec 22, 2022
Search all history of Chrome in terminal

Chrotry Search all history of Chrome in terminal. Demo Usages Move the Chrome history file to current directory by running move_history.sh Rename hist

Xiaoxu HU 2 Jun 13, 2022
Python API Client for Twitter API v2

🐍 Python Client For Twitter API v2 🚀 Why Twitter Stream ? Twitter-Stream.py a python API client for Twitter API v2 now supports FilteredStream, Samp

Twitivity 31 Nov 19, 2022
A Python Library to Make Quote Images

Quote2Image A Python Library to Make Quote Images How To Use? Download The Latest Package From Releases Extract The Zip File And Place Every File In I

Secrets 28 Dec 30, 2022
Official Python wrapper for the Quantel Finance API

Quantel is a powerful financial data and insights API. It provides easy access to world-class financial information. Quantel goes beyond just financial statements, giving users valuable information l

Guy 47 Oct 16, 2022
This is a free python bot program that crosses you to farm with auto click in space crypto NFT game, having fun :) Creator: Marlon Zanardi

🚀 Space Crypto auto click bot ready-to-use 🚀 This is a free python bot program that crosses you to farm with auto click in space crypto NFT game, ha

170 Dec 20, 2022
A python tool to Automate Whatsapp through Whatsapp web

This python tool is used to Automate Whatsapp through Whatsapp web. We can add number of contacts whom we want to send text messages on perticular time

5 Jul 21, 2022
Discord nuke bot with python

Discord-nuke-bot 🇷🇺 🇷🇺 🇷🇺 🇷🇺 🇷🇺 TODO: Добавить команду: Удаления всех ролей Спама каналами Спама во все каналы @everyone Удаления всего aka

Nikita Maykov 10 Oct 14, 2022
Access Undenied parses AWS AccessDenied CloudTrail events, explains the reasons for them, and offers actionable remediation steps. Open-sourced by Ermetic.

Access Undenied on AWS Access Undenied parses AWS AccessDenied CloudTrail events, explains the reasons for them, and offers actionable fixes. Access U

Ermetic 204 Jan 02, 2023
ETL for tononkira.serasera.org

python-tononkiramalagasy-api Api Endpoints: ### get artists - /artists/int:page [page_offset = 20] ### get artist's songs, index was given by

Titosy Manankasina 1 Dec 24, 2021