
Deploying and using a Custom Serving Runtime in ODH/RHOAI

Although these instructions were tested mostly using RHOAI (Red Hat OpenShift AI), they apply to ODH (Open Data Hub) as well.

Before you start

This document will guide you through the broad steps necessary to deploy a custom Serving Runtime in order to serve a model using the Triton Runtime (NVIDIA Triton Inference Server).

While RHOAI supports your ability to add your own runtime, it does not support the runtimes themselves. Therefore, it is up to you to configure, adjust and maintain your custom runtimes.

This document expects a bit of familiarity with RHOAI.

This document was created mostly from existing upstream sources.

Adding the custom triton runtime

  1. Log in to your OpenShift AI with a user who is part of the RHOAI admin group.
    1. By default, cluster admins and dedicated admins are part of this group.
  2. Navigate to the Settings menu, then Serving Runtimes

    Serving Runtime Settings

  3. Click on the Add Serving Runtime button:

    Add Serving Runtime

  4. From the drop-down menu, select **Multi-model serving platform**. The option for **REST** will be selected automatically:

    Select Multi-model serving

  5. Click on Start from scratch and in the window that opens up, paste the following YAML:

    # Copyright 2021 IBM Corporation
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    apiVersion: serving.kserve.io/v1alpha1
    # kind: ClusterServingRuntime     ## changed by EG
    kind: ServingRuntime
    metadata:
      name: triton-23.05-20230804
      labels:
        name: triton-23.05-20230804
      annotations:
        maxLoadingConcurrency: "2"
        openshift.io/display-name: "Triton runtime 23.05 - added on 20230804 - with /dev/shm"
    spec:
      supportedModelFormats:
        - name: keras
          version: "2" # 2.6.0
          autoSelect: true
        - name: onnx
          version: "1" # 1.5.3
          autoSelect: true
        - name: pytorch
          version: "1" # 1.8.0a0+17f8c32
          autoSelect: true
        - name: tensorflow
          version: "1" # 1.15.4
          autoSelect: true
        - name: tensorflow
          version: "2" # 2.3.1
          autoSelect: true
        - name: tensorrt
          version: "7" # 7.2.1
          autoSelect: true
      protocolVersions:
        - grpc-v2
      multiModel: true
      grpcEndpoint: "port:8085"
      grpcDataEndpoint: "port:8001"
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 2Gi
      containers:
        - name: triton
          # image: tritonserver-2:replace   ## changed by EG
          image: nvcr.io/nvidia/tritonserver:23.05-py3
          command: [/bin/sh]
          args:
            - -c
            - 'mkdir -p /models/_triton_models;
              chmod 777 /models/_triton_models;
              exec tritonserver
              "--model-repository=/models/_triton_models"
              "--model-control-mode=explicit"
              "--strict-model-config=false"
              "--strict-readiness=false"
              "--allow-http=true"
              "--allow-sagemaker=false"
              '
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: "5"
              memory: 1Gi
          livenessProbe:
            # the server is listening only on 127.0.0.1, so an httpGet probe sent
            # from the kubelet running on the node cannot connect to the server
            # (not even with the Host header or host field)
            # exec a curl call to have the request originate from localhost in the
            # container
            exec:
              command:
                - curl
                - --fail
                - --silent
                - --show-error
                - --max-time
                - "9"
                - http://localhost:8000/v2/health/live
            initialDelaySeconds: 5
            periodSeconds: 30
            timeoutSeconds: 10
      builtInAdapter:
        serverType: triton
        runtimeManagementPort: 8001
        memBufferBytes: 134217728
        modelLoadingTimeoutMillis: 90000

  6. You will likely want to update the name, as well as other parameters such as the display name and the image.
  7. Click Add
  8. Confirm the new runtime is in the list, and reorder the list as needed (the order chosen here is the order in which users will see these choices).

    Runtime List

Creating a project

  • Create a new Data Science Project
  • In this example, the project is called fraud
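Behind the scenes, a Data Science Project is a regular OpenShift project (namespace). As a sketch, the equivalent object looks something like the following; the `opendatahub.io/dashboard` label is, to my understanding, what the ODH/RHOAI dashboard uses to recognize its projects:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: fraud                        # project name used in this example
  labels:
    opendatahub.io/dashboard: "true" # makes the project visible in the dashboard
```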

Creating a model server

  1. In your project, scroll down to the "Models and Model Servers" Section
  2. Click on Configure server

    Configure Server

  3. Fill out the details:

    Server Details - 1

    Server Details - 2

  4. Click Configure

Deploying a model into it

  1. If you don't have any model files handy, you can grab a copy of this file and upload it to your Object Storage of choice.
  2. Click on Deploy Model

    Deploy a Model

  3. Choose a model name and a framework:

    Configure a Model

  4. Then create a new data connection containing the details of where your model is stored in Object Storage:

    Create a Data Connection

  5. After a little while, you should see the following:

    Deploy model
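Each model deployed this way materializes as an `InferenceService` object in the project's namespace. Below is a minimal sketch of what gets created; the model name, runtime name, data connection secret, and storage path are hypothetical placeholders, not values the UI will necessarily produce:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-model               # hypothetical model name
  namespace: fraud
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh  # multi-model serving platform
spec:
  predictor:
    model:
      modelFormat:
        name: onnx                # framework selected in the UI
      runtime: model-server       # hypothetical model server (ServingRuntime) name
      storage:
        key: aws-connection-my-storage  # hypothetical data connection secret
        path: models/fraud/             # hypothetical path in the bucket
```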

Validating the model

  1. If you've used the model mentioned earlier in this document, you can run the following command from a Linux prompt:
    function val-model {
        myhost="$1"
        echo "validating host $myhost"
        time curl -X POST -k "${myhost}" -d '{"inputs": [{ "name": "dense_input", "shape": [1, 7], "datatype": "FP32", "data": [57.87785658389723,0.3111400080477545,1.9459399775518593,1.0,1.0,0.0,0.0]}]}' | jq
    }

    val-model ""
  2. Change the host to match the address for your model.
  3. You should see an output similar to:

        {
          "model_name": "fraud-model__isvc-c1529f9667",
          "model_version": "1",
          "outputs": [
            {
              "name": "dense_3",
              "datatype": "FP32",
              "shape": [ ... ],
              "data": [ ... ]
            }
          ]
        }
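Since the validation command already pipes through `jq`, you can also extract just the prediction value from the response. The sketch below uses a hypothetical captured response (the field values are invented for illustration); in practice, pipe the `curl` output directly instead of using a variable:

```shell
# Hypothetical response body (values invented for illustration)
response='{"model_name":"fraud-model__isvc-c1529f9667","model_version":"1","outputs":[{"name":"dense_3","datatype":"FP32","shape":[1,1],"data":[0.86]}]}'
# Extract the prediction value from the first output tensor
echo "$response" | jq -r '.outputs[0].data[0]'
```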

Extra considerations for disconnected environments

The YAML included in this document references an NVIDIA Triton Inference Server image hosted on NVIDIA's registry (nvcr.io).

Ensure that this image is properly mirrored into your mirror registry, and update the image address in the YAML definition to point to the mirrored location.
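As a sketch, the mirroring could be done with `oc image mirror`. The source tag below assumes the Triton 23.05 image implied by the runtime name in this document, and the destination registry address is a hypothetical placeholder; the block only prints the command so you can review it before running it:

```shell
# Source image assumed from the runtime name (Triton 23.05);
# the destination registry is a hypothetical placeholder
SRC="nvcr.io/nvidia/tritonserver:23.05-py3"
DST="registry.example.com:5000/nvidia/tritonserver:23.05-py3"
echo "oc image mirror $SRC $DST"
# When satisfied, run: oc image mirror "$SRC" "$DST"
```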

Each of the activities performed via the user interface creates a Kubernetes object inside your OpenShift cluster:

  • The addition of a new runtime creates a template in the redhat-ods-applications namespace.
  • Each model server is defined as a ServingRuntime
  • Each model is defined as an InferenceService
  • Each Data Connection is stored as a Secret
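For reference, the runtime added through the dashboard is stored as an OpenShift Template wrapping the ServingRuntime definition. A sketch of its shape (the `opendatahub.io/dashboard` label is, to my understanding, what the dashboard uses to list runtimes):

```yaml
apiVersion: template.openshift.io/v1
kind: Template
metadata:
  name: triton-23.05-20230804
  namespace: redhat-ods-applications
  labels:
    opendatahub.io/dashboard: "true"
objects:
  - apiVersion: serving.kserve.io/v1alpha1
    kind: ServingRuntime
    metadata:
      name: triton-23.05-20230804
    # ... full runtime definition as pasted earlier in this document
```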