OpenDILab RL Kubernetes Custom Resource and Operator Lib

Overview

DI Orchestrator

DI Orchestrator is designed to manage DI (Decision Intelligence) jobs using Kubernetes Custom Resource and Operator.

Prerequisites

  • A well-prepared kubernetes cluster. Follow the instructions to create a kubernetes cluster, or create a local kubernetes node referring to kind or minikube
  • Cert-manager. Installation on kubernetes please refer to cert-manager docs. Or you can install it by the following command.
kubectl create -f ./config/certmanager/cert-manager.yaml

Install DI Orchestrator

DI Orchestrator consists of two components: di-operator and di-server. Install di-operator and di-server with the following command.

kubectl create -f ./config/di-manager.yaml

di-operator and di-server will be installed in di-system namespace.

$ kubectl get pod -n di-system
NAME                               READY   STATUS    RESTARTS   AGE
di-operator-57cc65d5c9-5vnvn   1/1     Running   0          59s
di-server-7b86ff8df4-jfgmp     1/1     Running   0          59s

Install global components of DIJob defined in AggregatorConfig:

kubectl create -f config/samples/agconfig.yaml -n di-system

Submit DIJob

# submit DIJob
$ kubectl create -f config/samples/dijob-cartpole.yaml

# get pod and you will see coordinator is created by di-operator
# a few seconds later, you will see collectors and learners created by di-server
$ kubectl get pod

# get logs of coordinator
$ kubectl logs cartpole-dqn-coordinator

User Guide

Refers to user-guide. For Chinese version, please refer to 中文手册

Contributing

Refers to developer-guide.

Contact us throw [email protected]

Comments
  • 在 Pod 内增加集群信息

    在 Pod 内增加集群信息

    希望以 dijob replica 方式提交时,每个 pod 都能见到整个 replica 的 host 信息和自己的启动顺序,增加以下几个环境变量:

    1. replica 中所有 pod 的 FQDN,依据启动顺序排序
    2. 当前 pod 的 FQDN
    3. 当前 pod 的顺序编号

    DI-engine 中会根据这些变量实现对应的网络连接,attach-to 的生成逻辑可以从 di-orchestrator 中移除

    enhancement 
    opened by sailxjx 3
  • add tasks to dijob spec

    add tasks to dijob spec

    1. goal

    There is only one pod template defined in a dijob, which results in that we can not define different commands or resources for different componets of di-engine such as collector, learner and evaluator. So we are supposed to find a more general way to define a custom resource of dijob.

    2. design *

    Inspired by VolcanoJob, we define the spec.tasks to describe different componets of di-engine. spec.tasks is a list, which allows us to define multiple tasks. We can specify different task.type to label the task as one of collector, learner, evaluator and none. none means the task is a general task, which is the default value.

    After change, the dijob can be defined as follow:

    apiVersion: diengine.opendilab.org/v2alpha1
    kind: DIJob
    metadata:
      name: job-with-tasks
    spec:
      priority: "normal"  # job priority, which is a reserved field for allocator
      backoffLimit: 0  # restart count
      cleanPodPolicy: "Running"  # the policy to clean pods after job completion
      preemptible: false  # job is preemtible or not
      minReplicas: 2  
      maxReplicas: 5
      tasks:
      - replicas: 1
        name: "learner"
        type: learner
        template:
          metadata:
            name: di
          spec:
            containers:
            - image: registry.sensetime.com/xlab/ding:nightly
              imagePullPolicy: IfNotPresent
              name: pydi
              env:
              - name: NCCL_DEBUG
                value: "INFO"
              command: ["/bin/bash", "-c",]
              args: 
              - |
                ditask --label learner xxx
              resources:
                requests:
                  cpu: "1"
                  nvidia.com/gpu: 1
            restartPolicy: Never
      - replicas: 1
        name: "evaluator"
        type: evaluator
        template:
          metadata:
            name: di
          spec:
            containers:
            - image: registry.sensetime.com/xlab/ding:nightly
              imagePullPolicy: IfNotPresent
              name: pydi
              env:
              - name: NCCL_DEBUG
                value: "INFO"
              command: ["/bin/bash", "-c",]
              args: 
              - |
                ditask --label evaluator xxx
            restartPolicy: Never
      - replicas: 2
        name: "collector"
        type: collector
        template:
          metadata:
            name: di
          spec:
            containers:
            - image: registry.sensetime.com/xlab/ding:nightly
              imagePullPolicy: IfNotPresent
              name: pydi
              env:
              - name: NCCL_DEBUG
                value: "INFO"
              command: ["/bin/bash", "-c",]
              args: 
              - |
                ditask --label collector xxx
            restartPolicy: Never
    status:
      conditions:
      - lastTransitionTime: "2022-05-26T07:25:11Z"
        lastUpdateTime: "2022-05-26T07:25:11Z"
        message: job created.
        reason: JobPending
        status: "False"
        type: Pending
      - lastTransitionTime: "2022-05-26T07:25:11Z"
        lastUpdateTime: "2022-05-26T07:25:11Z"
        message: job is starting since all pods are created.
        reason: JobStarting
        status: "False"
        type: Starting
      phase: Starting
      profilings: {}
      readyReplicas: 0
      replicas: 4
      taskStatus:
        learner:
          Pending: 1
        evaluator:
          Pending: 1
        collector:
          Pending: 2
      reschedules: 0
      restarts: 0
    

    task definition:

    type Task struct {
    	Name string `json:"name,omitempty"`
    
    	Type TaskType `json:"type,omitempty"`
    
    	Replicas int32 `json:"replicas,omitempty"`
    
    	Template corev1.PodTemplateSpec `json:"template,omitempty"`
    }
    
    type TaskType string
    
    const (
    	TaskTypeLearner TaskType = "learner"
    
    	TaskTypeCollector TaskType = "collector"
    
    	TaskTypeEvaluator TaskType = "evaluator"
    
    	TaskTypeNone TaskType = "none"
    )
    
    

    status.taskStatus definition:

    type DIJobStatus struct {
      // Phase defines the observed phase of the job
      // +kubebuilder:default=Pending
      Phase Phase `json:"phase,omitempty"`
    
      // ...
      
      // map for different task statuses. key: task.name, value: TaskStatus
      TaskStatus map[string]TaskStatus
    
      // ...
    }
    
    // count of different pod phases
    type TaskStatus map[corev1.PodPhase]int32
    
    enhancement 
    opened by konnase 1
  • new version for di-engine new architecture

    new version for di-engine new architecture

    release notes

    features

    • v1.0.0 for DI-engine new architecture
    • remove webhook
    • manage commands with cobra
    • refactor orchestrator architecture inspired from adaptdl
    • use gin to rewrite di-server
    • update di-server http interface
    enhancement 
    opened by konnase 1
  • v0.2.0

    v0.2.0

    • [x] split webhook and operator
    • [x] add dockerfile.dev
    • [x] update CleanPolicyALL to CleanPolicyAll
    • [x] remove k8s service related operations from server, and operator is responsible for managing services
    • [x] add e2e test
    enhancement 
    opened by konnase 1
  • refactor job spec

    refactor job spec

    • refactor job spec definition and add spec.tasks to support multi tasks #20
    • add DI_RANK to pod env and remove engineFields in job.spec #16
    • add e2e test
    • add validator to validate the correctness of dijob spec
    • change job.phase to Pending when job replicas scaled to 0
    • implement a processor to process di-server requests
    • refactor project structure
    enhancement 
    opened by konnase 0
  • Release/v1.0

    Release/v1.0

    release notes

    features

    • v1.0.0 for DI-engine new architecture
    • remove webhook
    • manage commands with cobra
    • refactor orchestrator architecture inspired from adaptdl
    • use gin to rewrite di-server
    • update di-server http interface
    enhancement 
    opened by konnase 0
  • fix: job failed submit when collector/learner missed

    fix: job failed submit when collector/learner missed

    job failed submit when collector/learner missed because webhook create an empty dijob, and golang builder add some default value to some feilds of collector/learner, which result in invalid type error. solved by make coordinator/collector/learner as pointers.

    bug 
    opened by konnase 0
  • Feat/job create event

    Feat/job create event

    • add event handler for dijob, and mark job as Created when job submitted
    • mark collector and learner as optional, only coordinator is required(https://github.com/opendilab/DI-orchestrator/pull/13/commits/653e64af01ec7752b08d4bf8381738d566fca224)
    • mark job Failed when the submitted job is incorrect(https://github.com/opendilab/DI-orchestrator/pull/13/commits/bea840a5eee3508be18b53b325168a5647daff94), but it's hard to test since client-go reflector decodes DIJob strictly, we have no chance to handle DIJob add event when incorrect job submitted
    • version -> v0.2.1
    enhancement 
    opened by konnase 0
  • allocate的一些问题

    allocate的一些问题

    1.目前的allocator的逻辑,对于不可被抢占的job的初始分配,仅利用minreplicas修改replicas属性,那job的pods部署到哪个节点是完全由K8S决定吗?而且Release1.13代码的allocator.go中对不可被抢占job的初始分配部分貌似还没有写。 2.job是否可以被抢占的含义具体是什么?和是否能被调度是不是等价的? 3.调度策略的FitPolicy的Allocate和Optimize方法也没有进行实现,这部分内容什么时候可以补充? 4.文档中存在许多与最新代码不符合的地方,比如DIJob.Spec.Group属性在代码中已经被移除,文档中提到的job.spec.minreplicas属性代码中也没有,而是在JobInfo中。可以更新一下文档吗? 感谢!

    opened by RZ-Q 3
Releases(v1.1.3)
  • v1.1.3(Aug 22, 2022)

  • v1.1.2(Jul 21, 2022)

    bugs fix

    • global cmd flag error(https://github.com/opendilab/DI-orchestrator/pull/23)
    • wrong pod subdomain(https://github.com/opendilab/DI-orchestrator/pull/24)
    • incorrect to get global rank(https://github.com/opendilab/DI-orchestrator/pull/25)
    Source code(tar.gz)
    Source code(zip)
    di-manager.yaml(445.36 KB)
  • v1.1.1(Jul 4, 2022)

  • v1.1.0(Jun 30, 2022)

    • refactor job spec definition and add spec.tasks to support multi tasks #20
    • add DI_RANK to pod env and remove engineFields in job.spec #16
    • add e2e test
    • add validator to validate the correctness of dijob spec
    • change job.phase to Pending when job replicas scaled to 0
    • implement a processor to process di-server requests
    • refactor project structure

    see details in https://github.com/opendilab/DI-orchestrator/pull/21

    Source code(tar.gz)
    Source code(zip)
    di-manager.yaml(374.01 KB)
  • v1.0.0(Mar 23, 2022)

  • v0.2.2(Dec 15, 2021)

  • v0.2.1(Oct 12, 2021)

    feature

    • add event handler for dijob, and mark job as Created when job submitted(https://github.com/opendilab/DI-orchestrator/pull/13)
    • mark collector and learner as optional, only coordinator is required(https://github.com/opendilab/DI-orchestrator/pull/13/commits/653e64af01ec7752b08d4bf8381738d566fca224)
    • mark job Failed when the submitted job is incorrect(https://github.com/opendilab/DI-orchestrator/pull/13/commits/bea840a5eee3508be18b53b325168a5647daff94), but it's hard to test since client-go reflector decodes DIJob strictly, we have no chance to handle DIJob add event when incorrect job submitted
    Source code(tar.gz)
    Source code(zip)
    di-manager.yaml(1.38 MB)
  • v0.2.0(Sep 28, 2021)

  • v0.2.0-rc.0(Sep 6, 2021)

    • split webhook and operator
    • add dockerfile.dev
    • update CleanPolicyALL to CleanPolicyAll
    • remove k8s service related operations from server, and operator is responsible for managing services
    • add e2e test
    Source code(tar.gz)
    Source code(zip)
  • v0.1.0(Jul 8, 2021)

    Features

    • Define DIJob CRD to support DI jobs' submission
    • Define AggregatorConfig CRD to support aggregator definition
    • Add webhook to validate DIJob submission
    • Provide http service for DI jobs to request for DI modules
    • Docs to introduce DI-orchestrator architecture
    Source code(tar.gz)
    Source code(zip)
Owner
OpenDILab
Open sourced Decision Intelligence (DI)
OpenDILab
[CVPR'21] DeepSurfels: Learning Online Appearance Fusion

DeepSurfels: Learning Online Appearance Fusion Paper | Video | Project Page This is the official implementation of the CVPR 2021 submission DeepSurfel

Online Reconstruction 52 Nov 14, 2022
Independent and minimal implementations of some reinforcement learning algorithms using PyTorch (including PPO, A3C, A2C, ...).

PyTorch RL Minimal Implementations There are implementations of some reinforcement learning algorithms, whose characteristics are as follow: Less pack

Gemini Light 4 Dec 31, 2022
Keep CALM and Improve Visual Feature Attribution

Keep CALM and Improve Visual Feature Attribution Jae Myung Kim1*, Junsuk Choe1*, Zeynep Akata2, Seong Joon Oh1† * Equal contribution † Corresponding a

NAVER AI 90 Dec 07, 2022
Official TensorFlow code for the forthcoming paper

~ Efficient-CapsNet ~ Are you tired of over inflated and overused convolutional neural networks? You're right! It's time for CAPSULES :)

Vittorio Mazzia 203 Jan 08, 2023
Predictive Maintenance LSTM

Predictive-Maintenance-LSTM - Predictive maintenance study for Complex case study, we've obtained failure causes by operational error and more deeply by design mistakes.

Amir M. Sadafi 1 Dec 31, 2021
Use Python, OpenCV, and MediaPipe to control a keyboard with facial gestures

CheekyKeys A Face-Computer Interface CheekyKeys lets you control your keyboard using your face. View a fuller demo and more background on the project

69 Nov 09, 2022
Generating Digital Painting Lighting Effects via RGB-space Geometry (SIGGRAPH2020/TOG2020)

Project PaintingLight PaintingLight is a project conducted by the Style2Paints team, aimed at finding a method to manipulate the illumination in digit

651 Dec 29, 2022
unet for image segmentation

Implementation of deep learning framework -- Unet, using Keras The architecture was inspired by U-Net: Convolutional Networks for Biomedical Image Seg

zhixuhao 4.1k Dec 31, 2022
Shitty gaze mouse controller

demo.mp4 shitty_gaze_mouse_cotroller install tensofflow, cv2 run the main.py and as it starts it will collect data so first raise your left eyebrow(bo

16 Aug 30, 2022
Learning RGB-D Feature Embeddings for Unseen Object Instance Segmentation

Unseen Object Clustering: Learning RGB-D Feature Embeddings for Unseen Object Instance Segmentation Introduction In this work, we propose a new method

NVIDIA Research Projects 132 Dec 13, 2022
This is the official implementation of VaxNeRF (Voxel-Accelearated NeRF).

VaxNeRF Paper | Google Colab This is the official implementation of VaxNeRF (Voxel-Accelearated NeRF). This codebase is implemented using JAX, buildin

naruya 132 Nov 21, 2022
HTSeq is a Python library to facilitate processing and analysis of data from high-throughput sequencing (HTS) experiments.

HTSeq DEVS: https://github.com/htseq/htseq DOCS: https://htseq.readthedocs.io A Python library to facilitate programmatic analysis of data from high-t

HTSeq 57 Dec 20, 2022
WTTE-RNN a framework for churn and time to event prediction

WTTE-RNN Weibull Time To Event Recurrent Neural Network A less hacky machine-learning framework for churn- and time to event prediction. Forecasting p

Egil Martinsson 727 Dec 28, 2022
The implementation of "Bootstrapping Semantic Segmentation with Regional Contrast".

ReCo - Regional Contrast This repository contains the source code of ReCo and baselines from the paper, Bootstrapping Semantic Segmentation with Regio

Shikun Liu 128 Dec 30, 2022
EM-POSE 3D Human Pose Estimation from Sparse Electromagnetic Trackers.

EM-POSE: 3D Human Pose Estimation from Sparse Electromagnetic Trackers This repository contains the code to our paper published at ICCV 2021. For ques

Facebook Research 62 Dec 14, 2022
Red Team tool for exfiltrating files from a target's Google Drive that you have access to, via Google's API.

GD-Thief Red Team tool for exfiltrating files from a target's Google Drive that you(the attacker) has access to, via the Google Drive API. This includ

Antonio Piazza 39 Dec 27, 2022
A list of all papers and resoureces on Semantic Segmentation

Semantic-Segmentation A list of all papers and resoureces on Semantic Segmentation. Dataset importance SemanticSegmentation_DL Some implementation of

Alan Tang 1.1k Dec 12, 2022
Time Delayed NN implemented in pytorch

Pytorch Time Delayed NN Time Delayed NN implemented in PyTorch. Usage kernels = [(1, 25), (2, 50), (3, 75), (4, 100), (5, 125), (6, 150)] tdnn = TDNN

Daniil Gavrilov 79 Aug 04, 2022
fastgradio is a python library to quickly build and share gradio interfaces of your trained fastai models.

fastgradio is a python library to quickly build and share gradio interfaces of your trained fastai models.

Ali Abdalla 34 Jan 05, 2023
ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning. In ICCV, 2021.

ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning This repository contains the code for our ICCV 202

sangho.lee 28 Nov 08, 2022