Managing RHOAI with GitOps

GitOps is a common way to manage and deploy applications and resources on Kubernetes clusters.

This page provides an overview of the different objects involved in managing the installation, administration, and usage of OpenShift AI components using GitOps. It is by no means an exhaustive tutorial on each object and all of the features available in them.

When first implementing features with GitOps, it is highly recommended to deploy the resources manually using the Dashboard, then extract the resources the Dashboard created and duplicate them in your GitOps repo.

Installation

Operator Installation

The Red Hat OpenShift AI operator is installed and managed by OpenShift's Operator Lifecycle Manager (OLM) and follows common patterns that can be used to install many different operators.

The Red Hat OpenShift AI operator should be installed in the redhat-ods-operator namespace:

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/display-name: "Red Hat OpenShift AI"
  labels:
    openshift.io/cluster-monitoring: 'true'
  name: redhat-ods-operator

After creating the namespace, OLM requires an OperatorGroup to manage the operators installed in that namespace:

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: redhat-ods-operator-group
  namespace: redhat-ods-operator

Finally, a Subscription can be created to install the operator:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
spec:
  channel: stable # <1>
  installPlanApproval: Automatic # <2>
  name: rhods-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace

Subscription Options:

  1. Operator versions are managed through channels in OLM. Users select a channel that corresponds to the upgrade lifecycle they wish to follow, and OLM updates versions as they are released on that channel. To learn more about the available channels and the release lifecycle, please refer to the official lifecycle documentation.
  2. Platform administrators can also control how upgrades are applied with the installPlanApproval option. If set to Automatic, RHOAI is automatically updated to the latest version available on the selected channel. If set to Manual, administrators are required to approve all upgrades (see the sketch after this list).
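
Expanding on option 2, a cluster that needs tightly controlled upgrades can pin the starting version and require manual approval. A minimal sketch, assuming the startingCSV value matches a CSV actually available on your chosen channel (the version shown is illustrative):

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
spec:
  channel: stable
  installPlanApproval: Manual
  name: rhods-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  startingCSV: rhods-operator.2.10.0 # illustrative version; verify against your catalog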

Component Configuration

When the operator is installed, it automatically creates a DSCInitialization object that sets up several default configurations. While it is not required, administrators can choose to manage the DSCInitialization object via GitOps.

apiVersion: dscinitialization.opendatahub.io/v1
kind: DSCInitialization
metadata:
  name: default-dsci
spec:
  applicationsNamespace: redhat-ods-applications
  monitoring:
    managementState: Managed
    namespace: redhat-ods-monitoring
  serviceMesh:
    auth:
      audiences:
        - 'https://kubernetes.default.svc'
    controlPlane:
      metricsCollection: Istio
      name: data-science-smcp
      namespace: istio-system
    managementState: Managed # <1>
  trustedCABundle:
    customCABundle: ''
    managementState: Managed # <2>

DSCInitialization Options:

  1. KServe requires a ServiceMesh instance to be installed on the cluster. By default the Red Hat OpenShift AI operator will attempt to configure an instance if the ServiceMesh operator is installed. If your cluster already has ServiceMesh configured, you may choose to skip this configuration (a sketch of setting this section to Unmanaged follows this list).
  2. As part of the ServiceMesh configuration, the Red Hat OpenShift AI operator will configure a self-signed cert for any routes created by ServiceMesh.
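
Expanding on option 1, on a cluster that already manages its own ServiceMesh instance, the serviceMesh section can be set to Unmanaged so the operator leaves the existing control plane alone. A minimal sketch with only the relevant fields shown (verify the exact behavior against your RHOAI version):

apiVersion: dscinitialization.opendatahub.io/v1
kind: DSCInitialization
metadata:
  name: default-dsci
spec:
  applicationsNamespace: redhat-ods-applications
  serviceMesh:
    managementState: Unmanaged # the operator will not create or modify the ServiceMesh control plane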

After the operator is installed, a DataScienceCluster object will need to be configured with the different components. Each component has a managementState option which can be set to Managed or Removed. Admins can choose which components are installed on the cluster.

kind: DataScienceCluster
apiVersion: datasciencecluster.opendatahub.io/v1
metadata:
  name: default
spec:
  components:
    codeflare:
      managementState: Managed
    kserve:
      managementState: Managed
      serving:
        ingressGateway:
          certificate:
            type: SelfSigned
        managementState: Managed
        name: knative-serving
    trustyai:
      managementState: Removed
    ray:
      managementState: Managed
    kueue:
      managementState: Managed
    workbenches:
      managementState: Managed
    dashboard:
      managementState: Managed
    modelmeshserving:
      managementState: Managed
    datasciencepipelines:
      managementState: Managed

After the DataScienceCluster object is created, the operator will install and configure the different components on the cluster. Only one DataScienceCluster object can be created on a cluster.

Administration

Dashboard Configs

The Red Hat OpenShift AI Dashboard has many different options configurable through the UI that can be managed using the OdhDashboardConfig object. A default OdhDashboardConfig is created when the Dashboard component is installed:

apiVersion: opendatahub.io/v1alpha
kind: OdhDashboardConfig
metadata:
  name: odh-dashboard-config
  namespace: redhat-ods-applications
  labels:
    app.kubernetes.io/part-of: rhods-dashboard
    app.opendatahub.io/rhods-dashboard: 'true'
spec:
  dashboardConfig:
    enablement: true
    disableDistributedWorkloads: false
    disableProjects: false
    disableBiasMetrics: false
    disableSupport: false
    disablePipelines: false
    disableProjectSharing: false
    disableModelServing: false
    disableKServe: false
    disableAcceleratorProfiles: false
    disableCustomServingRuntimes: false
    disableModelMesh: false
    disableKServeAuth: false
    disableISVBadges: false
    disableInfo: false
    disableClusterManager: false
    disablePerformanceMetrics: false
    disableBYONImageStream: false
    disableModelRegistry: true
    disableTracking: false
  groupsConfig:
    adminGroups: rhods-admins # <1>
    allowedGroups: 'system:authenticated' # <2>
  modelServerSizes: # <3>
    - name: Small
      resources:
        limits:
          cpu: '2'
          memory: 8Gi
        requests:
          cpu: '1'
          memory: 4Gi
    - name: Medium
      resources:
        limits:
          cpu: '8'
          memory: 10Gi
        requests:
          cpu: '4'
          memory: 8Gi
    - name: Large
      resources:
        limits:
          cpu: '10'
          memory: 20Gi
        requests:
          cpu: '6'
          memory: 16Gi
  notebookController:
    enabled: true
    notebookNamespace: rhods-notebooks
    pvcSize: 20Gi # <4>
  notebookSizes: # <5>
    - name: Small
      resources:
        limits:
          cpu: '2'
          memory: 8Gi
        requests:
          cpu: '1'
          memory: 8Gi
    - name: Medium
      resources:
        limits:
          cpu: '6'
          memory: 24Gi
        requests:
          cpu: '3'
          memory: 24Gi
    - name: Large
      resources:
        limits:
          cpu: '14'
          memory: 56Gi
        requests:
          cpu: '7'
          memory: 56Gi
    - name: X Large
      resources:
        limits:
          cpu: '30'
          memory: 120Gi
        requests:
          cpu: '15'
          memory: 120Gi
  templateDisablement: []
  templateOrder:
    - caikit-tgis-runtime
    - kserve-ovms
    - ovms
    - tgis-grpc-runtime
    - vllm-runtime

OdhDashboardConfig Options:

  1. By default, the Dashboard creates a group called rhods-admins; users added to this group are granted admin privileges through the Dashboard. Additionally, any user with the cluster-admin role is an admin in the Dashboard by default. If you wish to change the group used to manage admin access, this option can be updated. Note that this field only impacts a user's ability to modify settings in the Dashboard; it has no impact on a user's ability to modify configurations directly through Kubernetes objects such as this OdhDashboardConfig object.
  2. By default, any user with access to the OpenShift cluster where Red Hat OpenShift AI is installed can access the Dashboard. To restrict who can access the Dashboard, update this option to another group. Like the admin group option, this only impacts a user's ability to access the Dashboard and does not restrict their ability to interact directly with the Kubernetes objects used to deploy AI resources (see the sketch after this list).
  3. When a user creates a new Model Server through the Dashboard, they are presented with a choice of server sizes that determines the resources available to the Model Server pod. Administrators can configure the size options that are available to their users.
  4. When creating a new Workbench, users are asked to create storage for it. The storage size defaults to the value set here, and users can choose a different amount if their use case requires more or less. Admins can change the default presented to users by configuring this option.
  5. Like the Model Server sizes, users are presented with a drop-down menu to select the size of Workbench they wish to create. Admins can customize the size options that are presented to users.
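
For example, to change the admin and allowed groups described in options 1 and 2 above, only the groupsConfig section needs to change. A minimal sketch, assuming the data-science-admins and data-science-users groups already exist on the cluster (only the relevant fields are shown):

apiVersion: opendatahub.io/v1alpha
kind: OdhDashboardConfig
metadata:
  name: odh-dashboard-config
  namespace: redhat-ods-applications
spec:
  groupsConfig:
    adminGroups: data-science-admins # hypothetical group granting Dashboard admin access
    allowedGroups: data-science-users # hypothetical group permitted to access the Dashboard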

Idle Notebook Culling

Admins can enable Idle Notebook Culling, which automatically stops any Notebooks/Workbenches that users haven't interacted with within a set period of time, by creating the following ConfigMap:

kind: ConfigMap
apiVersion: v1
metadata:
  name: notebook-controller-culler-config
  namespace: redhat-ods-applications
  labels:
    opendatahub.io/dashboard: 'true'
data:
  CULL_IDLE_TIME: '240' # <1>
  ENABLE_CULLING: 'true'
  IDLENESS_CHECK_PERIOD: '1'

Idle Notebook Culling Options:

  1. The culler uses Jupyter's activity metrics to determine the last time a user interacted with the Workbench, and shuts the pod down once the idle time exceeds CULL_IDLE_TIME. The value is in minutes, so 240 minutes is 4 hours.

Accelerator Profiles

Accelerator Profiles allow admins to present different GPU options to end users and automatically apply a toleration to Workbenches or Model Servers when a profile is selected. Admins configure an Accelerator Profile with the AcceleratorProfile object:

apiVersion: dashboard.opendatahub.io/v1
kind: AcceleratorProfile
metadata:
  name: nvidia-gpu
  namespace: redhat-ods-applications
spec:
  displayName: nvidia-gpu
  enabled: true
  identifier: nvidia.com/gpu
  tolerations:
    - effect: NoSchedule
      key: nvidia-gpu-only
      operator: Exists
      value: ''
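
The toleration in the profile above only takes effect if a matching taint exists on the GPU nodes; otherwise it is harmless but has no scheduling impact. A sketch of the matching taint as it might appear in a Node or MachineSet spec, reusing the nvidia-gpu-only key from the example profile (adjust the key to your environment):

spec:
  taints:
    - effect: NoSchedule
      key: nvidia-gpu-only
      value: ''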

Notebook Images

Red Hat OpenShift AI ships with several out-of-the-box Notebook/Workbench images, but admins can create additional custom images that users can use to launch new Workbench instances. A Notebook Image is managed with an OpenShift ImageStream object with some required labels:

kind: ImageStream
apiVersion: image.openshift.io/v1
metadata:
  annotations:
    opendatahub.io/notebook-image-desc: A custom Jupyter Notebook built for my organization # <1>
    opendatahub.io/notebook-image-name: My Custom Notebook # <2>
  name: my-custom-notebook
  namespace: redhat-ods-applications
  labels: # <3>
    app.kubernetes.io/created-by: byon
    opendatahub.io/dashboard: 'true'
    opendatahub.io/notebook-image: 'true'
spec:
  lookupPolicy:
    local: true
  tags:
    - name: '1.0' # <4>
      annotations:
        opendatahub.io/notebook-python-dependencies: '[{"name":"PyTorch","version":"2.2"}]' # <5>
        opendatahub.io/notebook-software: '[{"name":"Python","version":"v3.11"}]' # <6>
        opendatahub.io/workbench-image-recommended: 'true' # <7>
      from:
        kind: DockerImage
        name: 'quay.io/my-org/my-notebook:latest' # <8>
      importPolicy:
        importMode: Legacy
      referencePolicy:
        type: Source

Notebook Image Options:

  1. A description of the purpose of the notebook image.
  2. The name that will be displayed to end users in the drop-down menu when creating a Workbench.
  3. The notebook image requires several labels to appear in the Dashboard, including the app.kubernetes.io/created-by: byon label. While this label is traditionally used to trace where an object originated, here it is required for the notebook to be made available to end users.
  4. Multiple image versions can be configured as part of the same Notebook image, and users can select which version they wish to use. This is helpful when you release updated versions of the image and wish to avoid breaking end-user environments with package changes, allowing users to upgrade when ready.
  5. When selecting a Notebook image, users are presented with information from this annotation. opendatahub.io/notebook-python-dependencies is most commonly used to list the versions of the most important Python packages pre-installed in the image.
  6. Like the Python dependencies annotation, the opendatahub.io/notebook-software annotation presents the end user with information about the software installed in the image, most commonly the Python, Jupyter, or CUDA versions.
  7. When multiple tags are created on the ImageStream, the opendatahub.io/workbench-image-recommended annotation controls which version of the image is presented to end users by default. Only one tag should be set to true at any given time (see the two-tag sketch after this list).
  8. It is generally recommended to store Notebook images in an image registry outside of the cluster and reference them in the ImageStream.
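
Expanding on options 4 and 7, when publishing a new version of the image a second tag can be added while the previous version remains selectable. A sketch of the spec.tags section, assuming a hypothetical 2.0 image; note that only the new tag carries the recommended annotation:

  tags:
    - name: '2.0'
      annotations:
        opendatahub.io/notebook-python-dependencies: '[{"name":"PyTorch","version":"2.3"}]'
        opendatahub.io/notebook-software: '[{"name":"Python","version":"v3.11"}]'
        opendatahub.io/workbench-image-recommended: 'true'
      from:
        kind: DockerImage
        name: 'quay.io/my-org/my-notebook:2.0'
    - name: '1.0'
      annotations:
        opendatahub.io/notebook-python-dependencies: '[{"name":"PyTorch","version":"2.2"}]'
        opendatahub.io/notebook-software: '[{"name":"Python","version":"v3.11"}]'
        opendatahub.io/workbench-image-recommended: 'false'
      from:
        kind: DockerImage
        name: 'quay.io/my-org/my-notebook:1.0'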

While it is possible to build a Notebook Image on an OpenShift cluster and publish it directly to an ImageStream using a BuildConfig or a Tekton Pipeline, it can be challenging to get that image to be seen by the Red Hat OpenShift AI Dashboard. The Dashboard only looks at images listed in the spec.tags section, while images pushed directly to the internal image registry are recorded in status.tags. As a workaround, it is possible to "link" a tag pushed directly to the internal image registry to a tag that is visible to the Dashboard:

kind: ImageStream
apiVersion: image.openshift.io/v1
metadata:
  annotations:
    opendatahub.io/notebook-image-desc: A custom Jupyter Notebook built for my organization
    opendatahub.io/notebook-image-name: My Custom Notebook
  name: my-custom-notebook
  namespace: redhat-ods-applications
  labels:
    app.kubernetes.io/created-by: byon
    opendatahub.io/dashboard: 'true'
    opendatahub.io/notebook-image: 'true'
spec:
  lookupPolicy:
    local: false
  tags:
    - name: '1.0'
      annotations:
        opendatahub.io/notebook-python-dependencies: '[{"name":"PyTorch","version":"2.2"}]'
        opendatahub.io/notebook-software: '[{"name":"Python","version":"v3.11"}]'
        opendatahub.io/workbench-image-recommended: 'true'
      from:
        kind: DockerImage
        name: 'image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/my-custom-workbench:latest'
      importPolicy:
        importMode: Legacy
      referencePolicy:
        type: Source
status:
  dockerImageRepository: 'image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/my-custom-workbench'
  tags:
    - tag: latest

Serving Runtime Templates

Red Hat OpenShift AI ships with several out-of-the-box Serving Runtime Templates such as OpenVINO and vLLM, but admins can configure additional templates that allow users to deploy other ServingRuntimes. A Serving Runtime Template is an OpenShift Template object that wraps a ServingRuntime object:

kind: Template
apiVersion: template.openshift.io/v1
metadata:
  name: triton-serving-runtime
  namespace: redhat-ods-applications
  labels:
    opendatahub.io/dashboard: 'true'
  annotations:
    opendatahub.io/apiProtocol: REST
    opendatahub.io/modelServingSupport: '["multi"]'
objects:
  - apiVersion: serving.kserve.io/v1alpha1
    kind: ServingRuntime
    metadata:
      name: triton-23.05
      labels:
        name: triton-23.05
      annotations:
        maxLoadingConcurrency: '2'
        openshift.io/display-name: Triton runtime 23.05
    spec:
      supportedModelFormats:
        - name: keras
          version: '2'
          autoSelect: true
        - name: onnx
          version: '1'
          autoSelect: true
        - name: pytorch
          version: '1'
          autoSelect: true
        - name: tensorflow
          version: '1'
          autoSelect: true
        - name: tensorflow
          version: '2'
          autoSelect: true
        - name: tensorrt
          version: '7'
          autoSelect: true
      protocolVersions:
        - grpc-v2
      multiModel: true
      grpcEndpoint: 'port:8085'
      grpcDataEndpoint: 'port:8001'
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 2Gi
      containers:
        - name: triton
          image: 'nvcr.io/nvidia/tritonserver:23.05-py3'
          command:
            - /bin/sh
          args:
            - '-c'
            - 'mkdir -p /models/_triton_models; chmod 777 /models/_triton_models; exec tritonserver "--model-repository=/models/_triton_models" "--model-control-mode=explicit" "--strict-model-config=false" "--strict-readiness=false" "--allow-http=true" "--allow-sagemaker=false" '
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: '5'
              memory: 1Gi
          livenessProbe:
            exec:
              command:
                - curl
                - '--fail'
                - '--silent'
                - '--show-error'
                - '--max-time'
                - '9'
                - 'http://localhost:8000/v2/health/live'
            initialDelaySeconds: 5
            periodSeconds: 30
            timeoutSeconds: 10
      builtInAdapter:
        serverType: triton
        runtimeManagementPort: 8001
        memBufferBytes: 134217728
        modelLoadingTimeoutMillis: 90000

End User Resources

Data Science Projects

A Data Science Project is simply a normal OpenShift Project with an extra label that the Red Hat OpenShift AI Dashboard uses to distinguish it from other projects. As with OpenShift Projects, it is recommended to create a Namespace object and allow OpenShift to create the corresponding Project object:

apiVersion: v1
kind: Namespace
metadata:
  name: my-data-science-project
  labels:
    opendatahub.io/dashboard: "true"
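
Because a Data Science Project is a plain OpenShift project, access for additional users can also be managed declaratively with standard RBAC. A hedged sketch granting a user edit access to the project; the opendatahub.io/project-sharing label is an assumption about how the Dashboard surfaces the binding in its Permissions UI, so verify it against your RHOAI version:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-user-edit # hypothetical name
  namespace: my-data-science-project
  labels:
    opendatahub.io/dashboard: 'true'
    opendatahub.io/project-sharing: 'true' # assumed label used by the Dashboard's Permissions tab
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: my-user # hypothetical user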

Additionally, when a project is going to be used by ModelMesh for multi-model serving, an additional ModelMesh label should be applied to the namespace:

apiVersion: v1
kind: Namespace
metadata:
  name: my-multi-model-serving-project
  labels:
    opendatahub.io/dashboard: "true"
    modelmesh-enabled: "true"

Workbenches

Workbench objects are managed using the Notebook custom resource. The Notebook object contains a fairly complex configuration, with many items that are autogenerated and annotations that are required for it to display correctly in the Dashboard. The Notebook object essentially acts as a wrapper around a normal pod definition, and you will find many similarities to managing a pod, with options such as the image, PVCs, Secrets, etc.

It is highly recommended to thoroughly test any Notebook configurations configured with GitOps.

apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  annotations:
    notebooks.opendatahub.io/inject-oauth: 'true' # <1>
    opendatahub.io/image-display-name: Minimal Python
    notebooks.opendatahub.io/oauth-logout-url: 'https://rhods-dashboard-redhat-ods-applications.apps.my-cluster.com/projects/my-data-science-project?notebookLogout=my-workbench'
    opendatahub.io/accelerator-name: ''
    openshift.io/description: ''
    openshift.io/display-name: my-workbench
    notebooks.opendatahub.io/last-image-selection: 's2i-minimal-notebook:2024.1'
    notebooks.kubeflow.org/last_activity_check_timestamp: '2024-07-30T20:43:25Z'
    notebooks.opendatahub.io/last-size-selection: Small
    opendatahub.io/username: 'kube:admin'
    notebooks.kubeflow.org/last-activity: '2024-07-30T20:27:25Z'
  name: my-workbench
  namespace: my-data-science-project
spec:
  template:
    spec:
      affinity: {}
      containers:
        - resources: # <2>
            limits:
              cpu: '2'
              memory: 8Gi
            requests:
              cpu: '1'
              memory: 8Gi
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /notebook/my-data-science-project/my-workbench/api
              port: notebook-port
              scheme: HTTP
            initialDelaySeconds: 10
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          name: my-workbench
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /notebook/my-data-science-project/my-workbench/api
              port: notebook-port
              scheme: HTTP
            initialDelaySeconds: 10
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          env:
            - name: NOTEBOOK_ARGS
              value: |-
                --ServerApp.port=8888
                --ServerApp.token=''
                --ServerApp.password=''
                --ServerApp.base_url=/notebook/my-data-science-project/my-workbench
                --ServerApp.quit_button=False
                --ServerApp.tornado_settings={"user":"kube-3aadmin","hub_host":"https://rhods-dashboard-redhat-ods-applications.apps.my-cluster.com","hub_prefix":"/projects/my-data-science-project"}
            - name: JUPYTER_IMAGE
              value: 'image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/s2i-minimal-notebook:2024.1'
            - name: PIP_CERT
              value: /etc/pki/tls/custom-certs/ca-bundle.crt
            - name: REQUESTS_CA_BUNDLE
              value: /etc/pki/tls/custom-certs/ca-bundle.crt
            - name: SSL_CERT_FILE
              value: /etc/pki/tls/custom-certs/ca-bundle.crt
            - name: PIPELINES_SSL_SA_CERTS
              value: /etc/pki/tls/custom-certs/ca-bundle.crt
            - name: GIT_SSL_CAINFO
              value: /etc/pki/tls/custom-certs/ca-bundle.crt
          ports:
            - containerPort: 8888
              name: notebook-port
              protocol: TCP
          imagePullPolicy: Always
          volumeMounts:
            - mountPath: /opt/app-root/src
              name: my-workbench
            - mountPath: /dev/shm
              name: shm
            - mountPath: /etc/pki/tls/custom-certs/ca-bundle.crt
              name: trusted-ca
              readOnly: true
              subPath: ca-bundle.crt
          image: 'image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/s2i-minimal-notebook:2024.1' # <3>
          workingDir: /opt/app-root/src
        - resources: # <4>
            limits:
              cpu: 100m
              memory: 64Mi
            requests:
              cpu: 100m
              memory: 64Mi
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /oauth/healthz
              port: oauth-proxy
              scheme: HTTPS
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          name: oauth-proxy
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /oauth/healthz
              port: oauth-proxy
              scheme: HTTPS
            initialDelaySeconds: 30
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          env:
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          ports:
            - containerPort: 8443
              name: oauth-proxy
              protocol: TCP
          imagePullPolicy: Always
          volumeMounts:
            - mountPath: /etc/oauth/config
              name: oauth-config
            - mountPath: /etc/tls/private
              name: tls-certificates
          image: 'registry.redhat.io/openshift4/ose-oauth-proxy@sha256:4bef31eb993feb6f1096b51b4876c65a6fb1f4401fee97fa4f4542b6b7c9bc46'
          args:
            - '--provider=openshift'
            - '--https-address=:8443'
            - '--http-address='
            - '--openshift-service-account=my-workbench'
            - '--cookie-secret-file=/etc/oauth/config/cookie_secret'
            - '--cookie-expire=24h0m0s'
            - '--tls-cert=/etc/tls/private/tls.crt'
            - '--tls-key=/etc/tls/private/tls.key'
            - '--upstream=http://localhost:8888'
            - '--upstream-ca=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt'
            - '--email-domain=*'
            - '--skip-provider-button'
            - '--openshift-sar={"verb":"get","resource":"notebooks","resourceAPIGroup":"kubeflow.org","resourceName":"my-workbench","namespace":"$(NAMESPACE)"}'
            - '--logout-url=https://rhods-dashboard-redhat-ods-applications.apps.my-cluster.com/projects/my-data-science-project?notebookLogout=my-workbench'
      enableServiceLinks: false
      serviceAccountName: my-workbench
      volumes:
        - name: my-workbench
          persistentVolumeClaim:
            claimName: my-workbench
        - emptyDir:
            medium: Memory
          name: shm
        - configMap:
            items:
              - key: ca-bundle.crt
                path: ca-bundle.crt
            name: workbench-trusted-ca-bundle
            optional: true
          name: trusted-ca
        - name: oauth-config
          secret:
            defaultMode: 420
            secretName: my-workbench-oauth-config
        - name: tls-certificates
          secret:
            defaultMode: 420
            secretName: my-workbench-tls
  1. The Notebook object contains several different annotations that are used by OpenShift AI, but the inject-oauth annotation is one of the most important. There are several OAuth-based configurations in the Notebook that will be automatically generated by this annotation, allowing you to exclude a large amount of configuration from your GitOps repo.
  2. While the Dashboard limits you to a predefined set of resource sizes, you can choose any size you wish for your notebook through the YAML. If you select a non-standard size, however, the Dashboard may report the size as "unknown".
  3. Just like the resource sizes, you can choose any image for the Notebook, including ones that are not available in the Dashboard. If you select a non-standard notebook image, however, the Dashboard may report issues.
  4. The oauth-proxy container is one such item that can be removed from the GitOps-based configuration when utilizing the inject-oauth annotation. Instead of including this section and some other OAuth-related configurations, you can simply rely on the annotation and allow the Notebook controller to manage this portion of the object for you, as shown in the sketch after this list. This also helps prevent problems when upgrading RHOAI.
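
Putting option 4 into practice, a much smaller Notebook definition can be kept in Git, and the controller will inject the oauth-proxy sidecar and its related volumes. A minimal sketch, assuming the my-workbench PVC already exists and reusing the image from the example above; the exact set of fields the controller tolerates may vary by RHOAI version, so test before adopting:

apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  annotations:
    notebooks.opendatahub.io/inject-oauth: 'true' # controller generates the oauth-proxy container, volumes, and args
    openshift.io/display-name: my-workbench
  name: my-workbench
  namespace: my-data-science-project
spec:
  template:
    spec:
      containers:
        - name: my-workbench
          image: 'image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/s2i-minimal-notebook:2024.1'
          resources:
            limits:
              cpu: '2'
              memory: 8Gi
            requests:
              cpu: '1'
              memory: 8Gi
          ports:
            - containerPort: 8888
              name: notebook-port
              protocol: TCP
          volumeMounts:
            - mountPath: /opt/app-root/src
              name: my-workbench
          workingDir: /opt/app-root/src
      volumes:
        - name: my-workbench
          persistentVolumeClaim:
            claimName: my-workbench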

Users have the ability to start and stop the Workbench to help conserve resources on the cluster. To stop a Notebook, the following annotation should be applied to the Notebook object:

metadata:
  annotations:
    kubeflow-resource-stopped: '2024-07-30T20:52:37Z'

Generally, you do not want to include this annotation in your GitOps configuration, as it will force the Notebook to remain shut down, preventing users from starting their Notebooks. The value of the annotation doesn't matter, but by default the Dashboard applies a timestamp of when the Notebook was shut down.

Data Science Connections

A Data Science Connection is a normal Kubernetes Secret object with several labels and annotations, and data keys that follow a specific format:

kind: Secret
apiVersion: v1
type: Opaque
metadata:
  name: aws-connection-my-dataconnection # <1>
  labels:
    opendatahub.io/dashboard: 'true' # <2>
    opendatahub.io/managed: 'true'
  annotations:
    opendatahub.io/connection-type: s3 # <3>
    openshift.io/display-name: my-dataconnection # <4>
data: # <5>
  AWS_ACCESS_KEY_ID: dGVzdA==
  AWS_DEFAULT_REGION: 'dGVzdA=='
  AWS_S3_BUCKET: 'dGVzdA=='
  AWS_S3_ENDPOINT: dGVzdA==
  AWS_SECRET_ACCESS_KEY: dGVzdA==
  1. When creating a data connection through the Dashboard, the name is automatically generated as aws-connection-<your-entered-name>. When generating the data connection from outside of the Dashboard, you do not need to follow this naming convention.
  2. The opendatahub.io/dashboard: 'true' label is used to help determine what secrets to display in the Dashboard. This option must be set to true if you wish for it to be available in the UI.
  3. At this time, the Dashboard only supports S3 as a connection type, but other types may be supported in the future.
  4. The name of the data connection as it will appear in the Dashboard UI.
  5. Like all Secrets, data connection data is only base64 encoded, which is not secure; it should not be committed to Git in this form. Users should instead look into tools such as SealedSecrets or ExternalSecrets to manage secret data in a GitOps workflow (see the sketch after this list).
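
As an illustration of option 5, here is a hedged sketch using the External Secrets Operator to render the same data connection, assuming a ClusterSecretStore named my-secret-store and matching entries in the backing store (all names and properties below are hypothetical):

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: aws-connection-my-dataconnection
  namespace: my-data-science-project
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: my-secret-store # hypothetical store
  target:
    name: aws-connection-my-dataconnection
    template:
      metadata:
        labels:
          opendatahub.io/dashboard: 'true'
          opendatahub.io/managed: 'true'
        annotations:
          opendatahub.io/connection-type: s3
          openshift.io/display-name: my-dataconnection
      data:
        AWS_ACCESS_KEY_ID: '{{ .access_key_id }}'
        AWS_SECRET_ACCESS_KEY: '{{ .secret_access_key }}'
        AWS_S3_ENDPOINT: 'https://s3.amazonaws.com' # illustrative static values
        AWS_DEFAULT_REGION: us-east-1
        AWS_S3_BUCKET: my-bucket
  data:
    - secretKey: access_key_id
      remoteRef:
        key: my-dataconnection # hypothetical path in the backing store
        property: access_key_id
    - secretKey: secret_access_key
      remoteRef:
        key: my-dataconnection
        property: secret_access_key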

Data Science Pipelines

When setting up a new project, a Data Science Pipelines instance needs to be created using the DataSciencePipelinesApplication (DSPA) object. The DSPA creates the pipeline server for the project and allows users to begin interacting with Data Science Pipelines.

apiVersion: datasciencepipelinesapplications.opendatahub.io/v1alpha1
kind: DataSciencePipelinesApplication
metadata:
  name: dspa # <1>
  namespace: my-data-science-project
spec:
  apiServer:
    caBundleFileMountPath: ''
    stripEOF: true
    dbConfigConMaxLifetimeSec: 120
    applyTektonCustomResource: true
    caBundleFileName: ''
    deploy: true
    enableSamplePipeline: false
    autoUpdatePipelineDefaultVersion: true
    archiveLogs: false
    terminateStatus: Cancelled
    enableOauth: true
    trackArtifacts: true
    collectMetrics: true
    injectDefaultScript: true
  database:
    disableHealthCheck: false
    mariaDB:
      deploy: true
      pipelineDBName: mlpipeline
      pvcSize: 10Gi
      username: mlpipeline
  dspVersion: v2
  objectStorage:
    disableHealthCheck: false
    enableExternalRoute: false
    externalStorage: # <2>
      basePath: ''
      bucket: pipelines
      host: 'minio.ai-example-training.svc.cluster.local:9000'
      port: ''
      region: us-east-1
      s3CredentialsSecret:
        accessKey: AWS_ACCESS_KEY_ID
        secretKey: AWS_SECRET_ACCESS_KEY
        secretName: aws-connection-my-dataconnection
      scheme: http
  persistenceAgent:
    deploy: true
    numWorkers: 2
  scheduledWorkflow:
    cronScheduleTimezone: UTC
    deploy: true
  1. The Dashboard expects the object to be named dspa, and it is not recommended to deploy more than one DataSciencePipelinesApplication object in a single namespace.
  2. The externalStorage section is a critical configuration for setting up the S3 backend storage for Data Science Pipelines. When using the Dashboard, you are required to enter the connection details; while you can import these details from a data connection, doing so creates a separate secret containing the S3 credentials instead of reusing the existing data connection secret.

Once a Data Science Pipelines instance has been created, users may wish to configure and manage their pipelines via GitOps. It is important to note that Data Science Pipelines is not "GitOps friendly". When working with Elyra or a kfp pipeline, users are required to manually upload a pipeline file to the Dashboard, which does not generate a corresponding Kubernetes object. Additionally, when executing a pipeline run, users may find an ArgoWorkflow object that is generated for the run; however, this object cannot be reused in a GitOps application to create a new pipeline run in Data Science Pipelines.

As a workaround, one common pattern to "GitOps-ify" a Data Science Pipeline while using kfp is to create a Tekton pipeline that compiles the pipeline and uses the kfp SDK to upload it to Data Science Pipelines, or uses the kfp SDK to trigger a new pipeline run directly from your pipeline code. A sketch of this pattern follows.
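
A hedged sketch of that pattern as a Tekton Task, assuming the pipeline source is available in a workspace and that the DSPA route and bearer token are supplied by your environment (all names below are hypothetical):

apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: upload-pipeline
  namespace: my-data-science-project
spec:
  workspaces:
    - name: source # workspace containing pipeline.py
  steps:
    - name: compile-and-upload
      image: registry.access.redhat.com/ubi9/python-311 # any image with Python available
      workingDir: $(workspaces.source.path)
      script: |
        #!/bin/bash
        set -e
        pip install kfp
        # Compile the kfp pipeline definition into a static YAML package
        kfp dsl compile --py pipeline.py --output pipeline.yaml
        # Upload the compiled package to Data Science Pipelines
        python - <<'EOF'
        import kfp
        client = kfp.Client(
            host='https://ds-pipeline-dspa-my-data-science-project.apps.my-cluster.com',  # hypothetical DSPA route
            existing_token='REPLACE_WITH_TOKEN',  # supply a real bearer token
        )
        client.upload_pipeline('pipeline.yaml', pipeline_name='my-pipeline')
        EOF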

Model Serving

Model Serving in RHOAI has two different flavors: Single Model Serving (KServe) and Multi-Model Serving (ModelMesh). Both model server options utilize the same Kubernetes objects (ServingRuntime and InferenceService), but have different controllers managing them.

As mentioned in the Data Science Project section, in order to utilize ModelMesh, a modelmesh-enabled label must be applied to the namespace:

apiVersion: v1
kind: Namespace
metadata:
  name: my-multi-model-serving-project
  labels:
    opendatahub.io/dashboard: "true"
    modelmesh-enabled: "true"

When creating a model server through the Dashboard, users can select a "Serving Runtime Template", which creates a ServingRuntime instance in their namespace that can then be managed via GitOps. The ServingRuntime defines things such as the container definition, the supported model formats, and the available ports.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations: # <1>
    enable-route: 'true'
    opendatahub.io/accelerator-name: ''
    opendatahub.io/apiProtocol: REST
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
    opendatahub.io/template-display-name: OpenVINO Model Server
    opendatahub.io/template-name: ovms
    openshift.io/display-name: multi-model-server
  name: multi-model-server
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  supportedModelFormats:
    - autoSelect: true
      name: openvino_ir
      version: opset1
    - autoSelect: true
      name: onnx
      version: '1'
    - autoSelect: true
      name: tensorflow
      version: '2'
  builtInAdapter:
    env:
      - name: OVMS_FORCE_TARGET_DEVICE
        value: AUTO
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000
    runtimeManagementPort: 8888
    serverType: ovms
  multiModel: true
  containers:
    - args:
        - '--port=8001'
        - '--rest_port=8888'
        - '--config_path=/models/model_config_list.json'
        - '--file_system_poll_wait_seconds=0'
        - '--grpc_bind_address=127.0.0.1'
        - '--rest_bind_address=127.0.0.1'
      image: 'quay.io/modh/openvino_model_server@sha256:5d04d405526ea4ce5b807d0cd199ccf7f71bab1228907c091e975efa770a4908'
      name: ovms
      resources:
        limits:
          cpu: '2'
          memory: 8Gi
        requests:
          cpu: '1'
          memory: 4Gi
      volumeMounts:
        - mountPath: /dev/shm
          name: shm
  protocolVersions:
    - grpc-v1
  grpcEndpoint: 'port:8085'
  volumes:
    - emptyDir:
        medium: Memory
        sizeLimit: 2Gi
      name: shm
  replicas: 1
  tolerations: []
  grpcDataEndpoint: 'port:8001'
  1. While KServe and ModelMesh share the same object definition, they have some subtle differences, in particular the annotations that are available on them. enable-route is one annotation that is available on a ModelMesh ServingRuntime but not on a KServe-based Model Server.

The InferenceService defines the model that will be deployed as well as which ServingRuntime should be used to deploy it.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: fraud-detection-model
    serving.kserve.io/deploymentMode: ModelMesh
  name: fraud-detection-model
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    model:
      modelFormat:
        name: onnx
        version: '1'
      name: ''
      resources: {}
      runtime: multi-model-server  # <1>
      storage:
        key: aws-connection-multi-model
        path: models/fraud-detection-model/frauddetectionmodel.onnx
  1. The runtime must match the name of the ServingRuntime object that you wish to utilize to deploy the model.

One major difference between ModelMesh and KServe is which object is responsible for creating and managing the pod where the model is deployed.

With KServe, the ServingRuntime acts as a "pod template" and each InferenceService creates its own pod to deploy a model. A single ServingRuntime can be referenced by multiple InferenceServices, and each InferenceService creates a separate pod.

By contrast, with ModelMesh the ServingRuntime creates the pod, and each InferenceService simply tells the model server pod what model to load and from where. A single ServingRuntime with multiple InferenceServices results in a single pod that loads all of the models.
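
For contrast with the ModelMesh example above, a KServe-based InferenceService looks nearly identical but runs in its own pod using the Serverless deployment mode. A hedged sketch, assuming a KServe-compatible ServingRuntime named vllm-runtime already exists in the namespace:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: my-llm
    serving.kserve.io/deploymentMode: Serverless
  name: my-llm # hypothetical model name
  namespace: my-data-science-project
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources: {}
      runtime: vllm-runtime # must match a KServe ServingRuntime in the same namespace
      storage:
        key: aws-connection-my-dataconnection
        path: models/my-llm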

ArgoCD Health Checks

Out of the box, ArgoCD and OpenShift GitOps ship with a health check for KServe InferenceServices that is not compatible with ModelMesh InferenceServices. When attempting to deploy a ModelMesh-based InferenceService, ArgoCD will report the object as degraded.

To resolve this issue, custom health checks that are compatible with both KServe and ModelMesh, as well as other RHOAI objects, can be added to your ArgoCD instance. The Red Hat AI Services Practice maintains several custom health checks that you can utilize in your own ArgoCD instance.
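
As an illustration, a custom health check can be registered through the ArgoCD custom resource (or the argocd-cm ConfigMap on upstream ArgoCD). A hedged sketch of a simplified Lua check that treats any InferenceService with a Ready condition as healthy, which covers both KServe and ModelMesh; field names may vary with your GitOps operator version:

apiVersion: argoproj.io/v1beta1
kind: ArgoCD
metadata:
  name: openshift-gitops
  namespace: openshift-gitops
spec:
  resourceHealthChecks:
    - group: serving.kserve.io
      kind: InferenceService
      check: |
        health_status = {}
        -- ModelMesh InferenceServices never report the Knative URL that the
        -- built-in KServe check waits for, so key off the Ready condition instead
        if obj.status ~= nil and obj.status.conditions ~= nil then
          for _, condition in ipairs(obj.status.conditions) do
            if condition.type == "Ready" and condition.status == "True" then
              health_status.status = "Healthy"
              health_status.message = "InferenceService is ready"
              return health_status
            end
          end
        end
        health_status.status = "Progressing"
        health_status.message = "Waiting for InferenceService to become ready"
        return health_status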