Purpose: Integrate bus data from a variety of sources (CSV files, JSON APIs, sensor feeds, ...) into a relational database, covering both batch and real-time processing.
Technique:
- Python
- Applications: Kafka, MQTT Explorer, Grafana, InfluxDB, MS Visual Studio 2019, MS SQL Server, Power BI Desktop
- Libraries: kafka-python, numpy, paho-mqtt, pandas, pyodbc, pyspark
- Database: MS SQL Server
- Environment: Windows 10 64-bit
- Editor: cmd
Workflow:
- Import raw offline data from CSV and TXT sources into a Data Lake (stored in MS SQL Server) with Python, then ETL (Extract, Transform, Load) the data from the Data Lake into a Data Warehouse with SSIS (SQL Server Integration Services).
- Set up a schedule for the ETL pipeline.
- Model and visualize the data from the DWH.
- Crawl the online General Transit Feed Specification (GTFS) realtime feed, convert it from Protobuf to JSON or CSV, and save it to the database with Python and Kafka streaming. Source: https://developer.nationaltransport.ie/
- Stream the sensor data and draw it into a dashboard to show performance, using paho-mqtt (or kafka-python) and the BI tool Grafana.
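
The batch-import step above (loading a raw CSV file into the SQL Server Data Lake with Python) can be sketched as below. The file name, table name, column names, and connection string are placeholders for illustration, not the project's actual values:

```python
"""Batch-import a raw CSV source into a SQL Server Data Lake table.

A minimal sketch of the offline-import workflow step. The CSV file name,
table name, columns, and ODBC connection string are assumptions.
"""
import pandas as pd


def clean_trips(df: pd.DataFrame) -> pd.DataFrame:
    """Basic cleaning before loading: drop rows missing key fields
    and normalise column types."""
    df = df.dropna(subset=["trip_id", "timestamp"]).copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    df["delay_sec"] = df["delay_sec"].fillna(0).astype(int)
    return df


def load_to_sql_server(df: pd.DataFrame, conn_str: str, table: str) -> int:
    """Insert the frame into SQL Server with pyodbc; returns the row count."""
    # Imported lazily so clean_trips stays testable without an ODBC driver.
    import pyodbc

    with pyodbc.connect(conn_str) as conn:
        cur = conn.cursor()
        cur.fast_executemany = True  # batch the INSERTs on the driver side
        cur.executemany(
            f"INSERT INTO {table} (trip_id, ts, delay_sec) VALUES (?, ?, ?)",
            [(r.trip_id, r.timestamp.to_pydatetime(), r.delay_sec)
             for r in df.itertuples(index=False)],
        )
        conn.commit()
    return len(df)


if __name__ == "__main__":
    raw = pd.read_csv("bus_trips.csv")  # hypothetical raw offline source
    conn_str = ("DRIVER={ODBC Driver 17 for SQL Server};"
                "SERVER=localhost;DATABASE=DataLake;Trusted_Connection=yes;")
    n = load_to_sql_server(clean_trips(raw), conn_str, "dbo.raw_bus_trips")
    print(f"loaded {n} rows")
```

From here, SSIS takes over to move the cleaned rows from the Data Lake into the Data Warehouse on a schedule.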
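
The GTFS realtime crawl step (fetch Protobuf, convert to JSON, publish through Kafka) might look like the sketch below. It assumes the `gtfs-realtime-bindings`, `requests`, and `kafka-python` packages; the endpoint URL, the `x-api-key` header name, and the Kafka topic/broker address are placeholders that should be checked against the provider's documentation:

```python
"""Sketch of the GTFS realtime crawl: fetch the Protobuf feed, convert it
to JSON-friendly dicts, and publish one record per entity to Kafka.

The endpoint URL, API-key header, topic name, and broker address are
assumptions for illustration.
"""
import json


def entities_to_records(feed: dict) -> list:
    """Flatten a GTFS-realtime feed dict (MessageToDict output) into one
    flat record per entity, keeping a few useful trip fields."""
    records = []
    for ent in feed.get("entity", []):
        trip = ent.get("tripUpdate", {}).get("trip", {})
        records.append({
            "entity_id": ent.get("id"),
            "trip_id": trip.get("tripId"),
            "route_id": trip.get("routeId"),
            "start_date": trip.get("startDate"),
        })
    return records


def crawl_and_publish(url: str, api_key: str, topic: str) -> int:
    # Imported lazily so entities_to_records is reusable without these deps.
    import requests
    from google.protobuf.json_format import MessageToDict
    from google.transit import gtfs_realtime_pb2  # gtfs-realtime-bindings
    from kafka import KafkaProducer               # kafka-python

    resp = requests.get(url, headers={"x-api-key": api_key}, timeout=30)
    resp.raise_for_status()

    feed = gtfs_realtime_pb2.FeedMessage()
    feed.ParseFromString(resp.content)   # raw Protobuf -> message object
    feed_dict = MessageToDict(feed)      # message object -> plain dict

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    records = entities_to_records(feed_dict)
    for rec in records:
        producer.send(topic, rec)        # JSON record onto the topic
    producer.flush()
    return len(records)
```

A Kafka consumer (or pyspark job) on the other side of the topic can then write the JSON records into the database.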
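
The sensor-streaming step (paho-mqtt into InfluxDB for a Grafana dashboard) can be sketched as below. The broker address, topic, measurement name, and payload fields are assumptions; writing the point to InfluxDB is indicated by a comment rather than implemented:

```python
"""Sketch of the streaming step: subscribe to sensor messages with paho-mqtt
and convert each JSON payload into an InfluxDB line-protocol point that a
Grafana panel can chart. Broker, topic, and payload fields are assumptions.
"""
import json


def to_line_protocol(measurement: str, tags: dict, fields: dict,
                     ts_ns: int) -> str:
    """Render one point in InfluxDB line protocol:
    measurement,tag=value field=value timestamp"""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {ts_ns}"


def run_subscriber(broker: str, topic: str) -> None:
    # Imported lazily so to_line_protocol is testable without paho-mqtt.
    import paho.mqtt.client as mqtt

    def on_message(client, userdata, msg):
        # Assumed payload shape: {"bus_id": ..., "speed_kmh": ..., "ts_ns": ...}
        payload = json.loads(msg.payload)
        point = to_line_protocol(
            "bus_sensor",
            {"bus_id": payload["bus_id"]},
            {"speed_kmh": payload["speed_kmh"]},
            payload["ts_ns"],
        )
        print(point)  # here you would POST the point to InfluxDB's write API

    client = mqtt.Client()  # 1.x style; paho-mqtt 2.x needs CallbackAPIVersion
    client.on_message = on_message
    client.connect(broker)
    client.subscribe(topic)
    client.loop_forever()
```

Grafana then queries InfluxDB for the `bus_sensor` measurement to render the real-time performance dashboard.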
Output:
- Data pipelines from the data sources into the target stores.
- Data stored in the Data Warehouse for analysis.
- Raw data crawled from the online GTFS realtime feed.
- Real-time dashboard with stream processing.
Next Steps:
- Analyze the data in the DWH.
- Build a real-time dashboard for the raw data crawled from the online GTFS realtime feed.