Data Orchestration Platform

Related tags

Miscellaneousdop
Overview

Table of contents

What is DOP

Design Concept

DOP is designed to simplify the orchestration effort across many connected components using a configuration file without the need to write any code. We have a vision to make orchestration easier to manage and more accessible to a wider group of people.

Here are some of the key design concept behind DOP,

  • Built on top of Apache Airflow - Utilises it’s DAG capabilities with interactive GUI
  • DAGs without code - YAML + SQL
  • Native capabilities (SQL) - Materialisation, Assertion and Invocation
  • Extensible via plugins - DBT job, Spark job, Egress job, Triggers, etc
  • Easy to setup and deploy - fully automated dev environment and easy to deploy
  • Open Source - open sourced under the MIT license

Please note that this project is heavily optimised to run with GCP (Google Cloud Platform) services which is our current focus. By focusing on one cloud provider, it allows us to really improve on end user experience through automation

A Typical DOP Orchestration Flow

Typical DOP Flow

Prerequisites - Run in Docker

Note that all the IAM related prerequisites will be available as a Terraform template soon!

For DOP Native Features

  1. Download and install Docker https://docs.docker.com/get-docker/ (if you are on Windows, please follow instruction here as there are some additional steps required for it to work https://docs.docker.com/docker-for-windows/install/)
  2. Download and install Google Cloud Platform (GCP) SDK following instructions here https://cloud.google.com/sdk/docs/install.
  3. Create a dedicated service account for docker with limited permissions for the development GCP project, the Docker instance is not designed to be connected to the production environment
    1. Call it dop-docker-user@<your GCP project id> and create it in https://console.cloud.google.com/iam-admin/serviceaccounts?project=<your GCP project id>
    2. Assign the roles/bigquery.dataEditor and roles/bigquery.jobUser role to the service account under https://console.cloud.google.com/iam-admin/iam?project=<your GCP project id>
  4. Your GCP user / group will need to be given the roles/iam.serviceAccountUser and the roles/iam.serviceAccountTokenCreator role on thedevelopment project just for the dop-docker-user service account in order to enable Service Account Impersonation.
    Grant service account user
  5. Authenticating with your GCP environment by typing in gcloud auth application-default login in your terminal and following instructions. Make sure you proceed to the stage where application_default_credentials.json is created on your machine (For windows users, make a note of the path, this will be required on a later stage)
  6. Clone this repository to your machine.

For DBT

  1. Setup a service account for your GCP project called dop-dbt-user in https://console.cloud.google.com/iam-admin/serviceaccounts?project=<your GCP project id>
  2. Assign the roles/bigquery.dataEditor and roles/bigquery.jobUser role to the service account at project level under https://console.cloud.google.com/iam-admin/iam?project=<your GCP project id>
  3. Your GCP user / group will need to be given the roles/iam.serviceAccountUser and the roles/iam.serviceAccountTokenCreator role on the development project just for the dop-dbt-user service account in order to enable Service Account Impersonation.

Instructions for Setting things up

Run Airflow with DOP in Docker - Mac

See README in the service project setup and follow instructions.

Once it's setup, you should see example DOP DAGs such as dop__example_covid19 Airflow in Docker

Run Airflow with DOP in Docker - Windows

This is currently working in progress, however the instructions on what needs to be done is in the Makefile

Run on Composer

Prerequisites

  1. Create a dedicate service account for Composer and call it dop-composer-user with following roles at project level
    • roles/bigquery.dataEditor
    • roles/bigquery.jobUser
    • roles/composer.worker
    • roles/compute.viewer
  2. Create a dedicated service account for DBT with limited permissions.
    1. [Already done in here if it’s DEV] Call it dop-dbt-user@<GCP project id> and create in https://console.cloud.google.com/iam-admin/serviceaccounts?project=<your GCP project id>
    2. [Already done in here if it’s DEV] Assign the roles/bigquery.dataEditor and roles/bigquery.jobUser role to the service account at project level under https://console.cloud.google.com/iam-admin/iam?project=<your GCP project id>
    3. The dop-composer-user will need to be given the roles/iam.serviceAccountUser and the roles/iam.serviceAccountTokenCreator role just for the dop-dbt-user service account in order to enable Service Account Impersonation.

Create Composer Cluster

  1. Use the service account already created dop-composer-user instead of the default service account
  2. Use the following environment variables
    DOP_PROJECT_ID : {REPLACE WITH THE GCP PROJECT ID WHERE DOP WILL PERSIST ALL DATA TO}
    DOP_LOCATION : {REPLACE WITH GCP REGION LOCATION WHRE DOP WILL PERSIST ALL DATA TO}
    DOP_SERVICE_PROJECT_PATH := {REPLACE WITH THE ABSOLUTE PATH OF THE Service Project, i.e. /home/airflow/gcs/dags/dop_{service project name}
    DOP_INFRA_PROJECT_ID := {REPLACE WITH THE GCP INFRASTRUCTURE PROJECT ID WHERE BUILD ARTIFACTS ARE STORED, i.e. a DBT docker image stored in GCR}
    
    and optionally
    DOP_GCR_PULL_SECRET_NAME:= {This maybe needed if the project storing the gcr images are not he same as where Cloud Composer runs, however this might be a better alternative https://medium.com/google-cloud/using-single-docker-repository-with-multiple-gke-projects-1672689f780c}
    
  3. Add the following Python Packages
    dataclasses==0.7
    
  4. Finally create a new node pool with the following k8 label
    key: cloud.google.com/gke-nodepool
    value: kubernetes-task-pool
    

Deployment

See Service Project README

Misc

Service Account Impersonation

Impersonation is a GCP feature allows a user / service account to impersonate as another service account.
This is a very useful feature and offers the following benefits

  • When doing development locally, especially with automation involved (i.e using Docker), it is very risky to interact with GCP services by using your user account directly because it may have a lot of permissions. By impersonate as another service account with less permissions, it is a lot safer (least privilege)
  • There is no credential needs to be downloaded, all permissions are linked to the user account. If an employee leaves the company, access to GCP will be revoked immediately because the impersonation process is no longer possible

The following diagram explains how we use Impersonation in DOP when it runs in Docker DOP Docker Account Impersonation

And when running DBT jobs on production, we are also using this technique to use the composer service account to impersonate as the dop-dbt-user service account so that service account keys are not required.

There are two very google articles explaining how impersonation works and why using it

You might also like...
Cross-platform config and manager for click console utilities.

climan Help the project financially: Donate: https://smartlegion.github.io/donate/ Yandex Money: https://yoomoney.ru/to/4100115206129186 PayPal: https

YourCity is a platform to match people to their prefect city.
YourCity is a platform to match people to their prefect city.

YourCity YourCity is a city matching App that matches users to their ideal city. It is a fullstack React App made with a Redux state manager and a bac

A multi-platform fuzzer for poking at userland binaries and servers

litefuzz A multi-platform fuzzer for poking at userland binaries and servers litefuzz intro why how it works what it does what it doesn't do support p

A platform for developers 👩‍💻  who wants to share their programs and projects.
A platform for developers 👩‍💻 who wants to share their programs and projects.

Fest-Practice-2021 This project is excluded from Hacktoberfest 2021. Please use this as a testing repo/project. A platform for developers 👩‍💻 who wa

Speed up your typing by some exercises in the multi-platform(Windows/Ubuntu).

Introduction This project purpose is speed up your typing by some exercises in the multi-platform(Windows/Ubuntu). Build Environment Software Environm

An Airdrop alternative for cross-platform users only for desktop with Python

PyDrop An Airdrop alternative for cross-platform users only for desktop with Python, -version 1.0 with less effort, just as a practice. ##############

Platform Tree for Xiaomi Redmi Note 7/7S (lavender)
Platform Tree for Xiaomi Redmi Note 7/7S (lavender)

The Xiaomi Redmi Note 7 (codenamed "lavender") is a mid-range smartphone from Xiaomi announced in January 2019. Device specifications Device Xiaomi Re

A Classroom Engagement Platform

Project Introduction This is project introduction Setup Setting up Postgres This is the most tricky part when setting up the application. You will nee

Traffic flow test platform, especially for reinforcement learning
Traffic flow test platform, especially for reinforcement learning

Traffic Flow Test Platform Traffic flow test platform, especially for reinforcement learning, named TFTP. A traffic signal control framework that can

Comments
  • Release DOP v0.3.0

    Release DOP v0.3.0

    A number of new features where added in this version

    DOP v0.3.0 — 2021-08-11

    Features

    • Support for "generic" airflow operators: you can now use regular python operators as part of your config files.

    • Support for “dbt docs” command to generate documentation for all dbt tasks: Users can now add “docs generate” as a target in their DOP configuration and additionally specify a GCS bucket with the --bucket and --bucket-path options where documents are copied to.

    • Serve dbt docs: Documents generated by dbt can be served as a web page by deploying the provided app on GAE. Note that deploying is an additional step that needs to be carried out after docs have been generated. See infrastructure/dbt-docs/README.md for details.

    • dbt tasks artifacts run_results created by dbt tasks saved to BigQuery: This json file contains information on completed dbt invocations and is saved in the BQ table “run_results” for analysis and debugging.

    • Add support for Airflow v1.10.14 and v1.10.15 local environments: Users can specify which version they want to use by setting the AIRFLOW_VERSION environment variable.

    • Pre-commit linters: added pre-commit hooks to ensure python, yaml and some support for plain text file consistency in formatting and style throughout DOP codebase.

    Changes

    • Ensure DAGs using the same DBT project do not run concurrently: Safety feature to safely allow selective execution of workflows by calling specific commands or tags (e.g. dbt run --m) within a single dbt project. This avoids creating inter-dependant workflows to avoid overriding each other's artifacts, since they will share the same target location (within the dbt container).

    • Test time-partitioning: Time-partitioning of datetime type properly validated as part of schema validation.

    • Use Python 3.7 and dbt 0.19.1 in Composer K8s Operator

    • Add Dataflow example task: with the introduction of "regular" in the yaml config Airflow Operators, it is now possible to run compute intensive Dataflow jobs. Check example_dataflow_template for an example on how to implement a Dataflow pipeline.

    opened by dinigo 0
Releases(v0.3.0)
  • v0.3.0(Aug 17, 2021)

    Features

    • Support for "generic" airflow operators: you can now use regular python operators as part of your config files.

    • Support for “dbt docs” command to generate documentation for all dbt tasks: Users can now add “docs generate” as a target in their DOP configuration and additionally specify a GCS bucket with the --bucket and --bucket-path options where documents are copied to.

    • Serve dbt docs: Documents generated by dbt can be served as a web page by deploying the provided app on GAE. Note that deploying is an additional step that needs to be carried out after docs have been generated. See infrastructure/dbt-docs/README.md for details.

    • dbt tasks artifacts run_results created by dbt tasks saved to BigQuery: This json file contains information on completed dbt invocations and is saved in the BQ table “run_results” for analysis and debugging.

    • Add support for Airflow v1.10.14 and v1.10.15 local environments: Users can specify which version they want to use by setting the AIRFLOW_VERSION environment variable.

    • Pre-commit linters: added pre-commit hooks to ensure python, yaml and some support for plain text file consistency in formatting and style throughout DOP codebase.

    Changes

    • Ensure DAGs using the same DBT project do not run concurrently: Safety feature to safely allow selective execution of workflows by calling specific commands or tags (e.g. dbt run --m) within a single dbt project. This avoids creating inter-dependant workflows to avoid overriding each other's artifacts, since they will share the same target location (within the dbt container).

    • Test time-partitioning: Time-partitioning of datetime type properly validated as part of schema validation.

    • Use Python 3.7 and dbt 0.19.1 in Composer K8s Operator

    • Add Dataflow example task: with the introduction of "regular" in the yaml config Airflow Operators, it is now possible to run compute intensive Dataflow jobs. Check example_dataflow_template for an example on how to implement a Dataflow pipeline.

    Source code(tar.gz)
    Source code(zip)
  • v0.2.0(Mar 30, 2021)

Owner
Datatonic
We accelerate business impact through Machine Learning and Analytics
Datatonic
Path of Exile Vendor Recipe Tracker (Chaos/Regal orb)

Path of Exile Vendor Trade Tracker Are you tired of manually keeping track of collected and missing items for farming Chaos or Regal Orbs in PoE? Me t

1 Nov 09, 2021
This script can be used to get unlimited Gb for WARP.

Warp-Unlimited-GB This script can be used to get unlimited Gb for WARP. How to use Change the value of the 'referrer' to warp id of yours You can down

Anix Sam Saji 1 Feb 14, 2022
This code makes the logs provided by Fiddler proxy of the Google Analytics events coming from iOS more readable.

GA-beautifier-iOS This code makes the logs provided by Fiddler proxy of the Google Analytics events coming from iOS more readable. To run it, create a

Rafael Machado 3 Feb 02, 2022
Pymon is like nodemon but it is for python,

Pymon is like nodemon but it is for python,

Swaraj Puppalwar 2 Jun 11, 2022
This repo created to complete the task HACKTOBER 2021, contribute now and get your special T-Shirt & Sticker. TO SUPPORT OWNER PLEASE PRESS STAR BUTTON

❤ THIS REPO WILL CLOSED IN 31 OCT 00:00 ❤ This repository will automatically assign the hacktoberfest and hacktoberfest-accepted labels to all submitt

Rajendra Rakha 307 Dec 27, 2022
Versión preliminar análisis general de Covid-19 en Colombia

Covid_Colombia_v09 Versión: Python 3.8.8 1/ La base de datos del Ministerio de Salud (Minsalud Colombia) está en https://www.datos.gov.co/Salud-y-Prot

Julián Gómez 1 Jan 30, 2022
Watcher for systemdrun user scopes

Systemctl Memory Watcher Animated watcher for systemdrun user scopes. Usage Launch some process in your GNU-Linux or compatible OS with systemd-run co

Antonio Vanegas 2 Jan 20, 2022
Web-based Sudoku solver built using Python. A demonstration of how backtracking works.

Sudoku Solver A web-based Sudoku solver built using Python and Python only The motivation is to demonstrate how Backtracking algorithm works. Some of

Jerry Ng 2 Dec 31, 2022
Lightweight and Modern kernel for VK Bots

This is the kernel for creating VK Bots written in Python 3.9

Yrvijo 4 Nov 21, 2021
Python Script to add OpenGapps, Magisk, libhoudini translation library and libndk translation library to waydroid !

Waydroid Extras Script Script to add gapps and other stuff to waydroid ! Installation/Usage "lzip" is required for this script to work, install it usi

Casu Al Snek 331 Jan 02, 2023
Macros in Python: quasiquotes, case classes, LINQ and more!

MacroPy3 1.1.0b2 MacroPy is an implementation of Syntactic Macros in the Python Programming Language. MacroPy provides a mechanism for user-defined fu

Li Haoyi 3.2k Jan 06, 2023
Process GPX files (adding sensor metrics, uploading to InfluxDB, etc.) exported from imxingzhe.com

Xingzhe GPX Processor 行者轨迹处理工具 Xingzhe sells cheap GPS bike meters with sensor support including cadence, heart rate and power. But the GPX files expo

Shengqi Chen 8 Sep 23, 2022
NFT-Image-Generator - Utility to generate a large collection of unique images

NFT-Image-Generator Utility for creating a generative art collection from suppli

Sem Moolenschot 60 Dec 15, 2022
An a simple sistem code in python

AMS OS An a simple code in python ⁕¿What is AMS OS? AMS OS is an a simple sistem code writed in python. This code helps you with the cotidian task, yo

1 Nov 10, 2021
A Classroom Engagement Platform

Project Introduction This is project introduction Setup Setting up Postgres This is the most tricky part when setting up the application. You will nee

Santosh Kumar Patro 1 Nov 18, 2021
This is a batch script created to WEB-DL.

widevine-L3-WEB-DL-Script This is a batch script created to WEB-DL. Works well with .mpd files , for m3u8 please use n_m3u8 program (not included in t

Paranjay Singh 312 Dec 31, 2022
Purge all transformation orientations addon for Blender 2.8 and newer versions

CTO Purge This add-on adds a new button to Blender's Transformation Orientation panel which empowers the user to purge all of his/her custom transform

MMMrqs 10 Dec 29, 2022
Install packages with pip as if you were in the past!

A PyPI time machine Do you wish you could just install packages with pip as if you were at some fixed date in the past? If so, the PyPI time machine i

Thomas Robitaille 51 Jan 09, 2023
Create standalone, installable R Shiny apps using Electron

WARNING This is still very much a work in progress and nothing can be assumed stable in any way Temp notes: Two types of created installer, based on w

Chase Clark 5 Dec 24, 2021
Hitchhikers-guide - The Hitchhiker's Guide to Data Science for Social Good

Welcome to the Hitchhiker's Guide to Data Science for Social Good. What is the Data Science for Social Good Fellowship? The Data Science for Social Go

Data Science for Social Good 907 Jan 01, 2023