
Fully GitOpsified implementation of a RHOAI platform

What is the RHOAI BU Cluster Repository?

At Red Hat, we provide an internal OpenShift AI environment known as the RHOAI BU Cluster, where "BU" stands for our AI Business Unit. This cluster provides a centralized platform for experimentation, prototyping, and the scalable deployment of AI solutions across the company. Its operations are managed through the RHOAI BU Cluster GitOps Repository, which implements a comprehensive GitOps approach using declarative configuration to maintain and evolve the infrastructure behind Red Hat OpenShift AI.

What it manages:

  • Two OpenShift clusters: development (rhoaibu-cluster-dev) and production (rhoaibu-cluster-prod)
  • Complete AI/ML platform infrastructure using GitOps practices
  • Models as a Service (MaaS) platform with 15+ AI models

Purpose:

  • Working example of GitOps for AI infrastructure
  • Reference architecture for organizations implementing AI/ML platforms

Internal Service Notice

This is an internal development and testing service. No Service Level Agreements (SLAs) or Service Level Objectives (SLOs) are provided or guaranteed. This service is not intended for production use cases or mission-critical applications.

Unlike basic GitOps examples, this repository manages:

  • Complete cluster lifecycle from development to production workloads
  • Models as a Service (MaaS) platform with 15+ AI models (Granite, Llama, Mistral, Phi-4), 3Scale API gateway, Red Hat SSO authentication, self-service portal, and usage analytics for internal development and testing
  • Multi-environment support with dev and prod cluster configurations
  • Advanced AI-specific components including GPU autoscaling, declarative model serving, and custom workbenches
  • Advanced features like OAuth integration, RBAC, and certificate management

Info

This guide provides an overview of the RHOAI BU Cluster GitOps Repository, an implementation of GitOps for managing the RHOAI BU Cluster infrastructure at scale, including OpenShift AI and MaaS.

For foundational OpenShift AI & GitOps concepts and object definitions, please refer to our Managing RHOAI with GitOps guide first.

Why GitOps for AI Infrastructure?

GitOps provides unique advantages for AI/ML workloads that traditional infrastructure management approaches struggle to deliver. Rather than reinventing the wheel, this implementation builds on the proven GitOps Catalog from the Red Hat Community of Practice, which provides battle-tested GitOps patterns and components as the foundation for the AI-specific infrastructure.

🔄 Infrastructure Reproducibility: AI experiments require consistent environments. GitOps ensures your development and production clusters are identical, eliminating "works on my machine" issues.

📊 GPU Resource Management: Automated scaling and configuration of expensive GPU resources based on declarative policies, reducing costs while ensuring availability.

🚀 Faster Iteration: Version-controlled infrastructure changes enable rapid experimentation with different configurations, operators, and serving runtimes.

🛡️ Compliance & Auditing: Complete audit trail of all infrastructure changes, critical for regulated industries deploying AI models.

Repository Architecture & Hierarchy

The repository follows a layered architecture that separates concerns while maintaining flexibility:

rhoaibu-cluster/
├── bootstrap/          # Initial cluster setup and GitOps installation
├── clusters/           # Environment-specific configurations
│   ├── base/          # Shared cluster resources
│   └── overlays/      # Dev/prod environment customizations
├── components/        # Modular GitOps components
│   ├── argocd/       # ArgoCD projects and applications
│   ├── configs/      # Cluster-wide configurations
│   ├── instances/    # Operator instance configurations
│   ├── operators/    # Core Red Hat operators
│   └── operators-extra/ # Community and third-party operators
├── demos/            # Demo applications and examples
└── docs/             # Documentation and development guides
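
As an illustration of how these layers compose, an environment overlay's kustomization.yaml might look like the sketch below. The paths and component names are illustrative only, not copied from the repository:

# Hypothetical clusters/overlays/rhoaibu-cluster-prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  # Shared cluster resources defined once in the base layer
  - ../../base
  # Modular components pulled in per environment
  - ../../../components/operators/openshift-ai/operator/overlays/stable
  - ../../../components/instances/openshift-ai/overlays/prod

patches:
  # Environment-specific tweaks, e.g. prod-only autoscaler limits or quotas
  - path: patch-cluster-autoscaler.yaml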

Core Components Deep Dive

1. Bootstrap Layer (bootstrap/)

The bootstrap layer handles initial cluster setup and GitOps installation. Key features include:

  • Cluster Installation: OpenShift cluster deployment with GPU machinesets
  • GitOps Bootstrap: OpenShift GitOps operator installation and initial configuration
  • Authentication Setup: Google OAuth integration for internal access
  • Certificate Management: Let's Encrypt certificates with automatic renewal

📖 Reference: Bootstrap Documentation
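
Conceptually, the bootstrap step installs OpenShift GitOps and then points an Argo CD Application at the environment overlay, after which Argo CD reconciles everything else from Git. A minimal sketch of such an Application; the name, repository URL, and path are assumptions for illustration:

# Hypothetical Argo CD Application created during bootstrap
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: rhoaibu-cluster-prod
  namespace: openshift-gitops
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/rhoaibu-cluster.git  # placeholder URL
    targetRevision: main
    path: clusters/overlays/rhoaibu-cluster-prod
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true       # remove resources deleted from Git
      selfHeal: true    # revert manual drift back to the Git state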

2. Operators Management (components/operators/)

Core Red Hat Operators managed through GitOps:

  • Red Hat OpenShift AI (RHOAI)
  • NVIDIA GPU Operator
  • OpenShift Data Foundation
  • OpenShift Cert Manager
  • OpenShift Service Mesh
  • OpenShift Serverless

📖 Reference: Core Operators Documentation
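
In GitOps terms, each of these operators is typically expressed as an OLM Subscription (plus OperatorGroup) tracked in Git. A minimal sketch for the OpenShift AI operator is shown below; the channel and namespace are common defaults, not necessarily what this repository uses:

# Illustrative OLM Subscription for the OpenShift AI operator
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
spec:
  channel: stable                 # release channel (assumption)
  name: rhods-operator            # package name in the catalog
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Automatic  # let OLM apply operator updates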

Community & Third-Party Operators (components/operators-extra/):

📖 Reference: Operators Documentation

3. Instance Configurations (components/instances/)

Configurations for each operator:

  • RHOAI Instance: Complete DataScienceCluster configuration with custom accelerator profiles, workbenches, and dashboard settings
  • GPU Management: NVIDIA operator policies optimized for AI workloads
  • Storage: OpenShift Data Foundation instance providing storage capabilities (including RWX) for OpenShift AI workloads
  • Certificates: Automated TLS certificate management for model serving endpoints

📖 Reference: Operator Instances Documentation
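
For example, the RHOAI instance itself is a single DataScienceCluster resource whose components are toggled declaratively. A simplified sketch follows; the component selection is illustrative, not the cluster's actual configuration:

# Simplified DataScienceCluster example
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    dashboard:
      managementState: Managed      # RHOAI dashboard
    workbenches:
      managementState: Managed      # Jupyter workbenches
    kserve:
      managementState: Managed      # single-model serving
    modelmeshserving:
      managementState: Removed      # multi-model serving disabled (assumption)
    datasciencepipelines:
      managementState: Managed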

4. Cluster Configurations (components/configs/)

Cluster-wide settings:

  • Authentication: OAuth providers and RBAC configurations
  • Autoscaling: GPU-optimized cluster autoscaler with support for multiple GPU node types
  • Console Customization: OpenShift and RHOAI console with AI-focused navigation
  • Namespace Management: Project request templates and resource quotas

📖 Reference: Cluster Configurations Documentation
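
As an example of a cluster-wide setting kept in Git, the Google OAuth integration mentioned in the bootstrap layer is expressed through the cluster OAuth resource. A trimmed sketch; the client ID, secret name, and hosted domain are placeholders:

# Illustrative cluster OAuth configuration with a Google identity provider
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
    - name: google
      mappingMethod: claim
      type: Google
      google:
        clientID: <client-id>                 # placeholder
        clientSecret:
          name: google-oauth-client-secret    # Secret in openshift-config (placeholder)
        hostedDomain: example.com             # restrict logins to the corporate domain (placeholder)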

Comprehensive Configuration Details

For detailed Kubernetes object definitions and advanced configuration options for model serving, data connections, accelerator profiles, and InferenceService objects, see our comprehensive Managing RHOAI with GitOps guide.

Key GitOps Workflows

Development and Testing Workflow

The repository supports a complete development lifecycle:

  1. Fork and Branch: Developers create feature branches for infrastructure changes
  2. Local Testing: Kustomize allows local validation before deployment
  3. Dev Environment: Changes are tested in the development cluster first
  4. Production Promotion: Validated changes are promoted to production via GitOps

📖 Reference: Development Workflow

GPU Autoscaling for AI Workloads

GPU autoscaling is critical for AI workloads since GPUs are expensive resources that need to scale based on demand. The cluster automatically provisions GPU nodes from separate pools for different use cases (Tesla T4 for cost-effective training/inference, NVIDIA A10G for high-memory workloads and model serving), with configurations for both shared and private access patterns.

📖 Reference: Autoscaling Configuration
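
The mechanics behind this are the standard OpenShift ClusterAutoscaler plus per-pool MachineAutoscaler resources. A sketch of a MachineAutoscaler for a GPU pool; the machineset name and replica bounds are assumptions:

# Illustrative MachineAutoscaler for a Tesla T4 GPU node pool
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: gpu-tesla-t4-shared        # placeholder name
  namespace: openshift-machine-api
spec:
  minReplicas: 0                   # scale GPU nodes to zero when idle
  maxReplicas: 4                   # cap spend on expensive GPU instances
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: rhoaibu-cluster-gpu-t4   # placeholder machineset name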

Accelerator Profiles

OpenShift AI Accelerator Profile configurations for optimal GPU utilization:

# Example accelerator profiles
- NVIDIA Tesla T4: 16GB VRAM, ideal for inference and small model training
- NVIDIA A10G: 24GB VRAM, optimized for large model training
- NVIDIA L40: 48GB VRAM, optimized for large model training
- NVIDIA L40s: 48GB VRAM, designed for multi-modal AI workloads
- NVIDIA H100: 80GB VRAM, cutting-edge performance for large-scale training

📖 Reference: Accelerator Profiles Configuration
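
Each entry above maps to an AcceleratorProfile custom resource that the RHOAI dashboard surfaces to users. A sketch for the T4 profile; the tolerations and naming are illustrative:

# Illustrative AcceleratorProfile for NVIDIA Tesla T4 nodes
apiVersion: dashboard.opendatahub.io/v1
kind: AcceleratorProfile
metadata:
  name: nvidia-tesla-t4
  namespace: redhat-ods-applications
spec:
  displayName: NVIDIA Tesla T4
  enabled: true
  identifier: nvidia.com/gpu       # resource name requested by workloads
  tolerations:
    - key: nvidia.com/gpu          # match the taint on GPU nodes (assumption)
      operator: Exists
      effect: NoSchedule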

AI-Specific Components

Models as a Service (MaaS)

The RHOAI BU Cluster serves as internal infrastructure hosting a complete Models as a Service (MaaS) platform. MaaS runs entirely on top of this GitOps-managed cluster infrastructure, providing AI model serving capabilities to all Red Hat employees for internal development and testing.

The MaaS implementation leverages the Models as a Service repository, which demonstrates how to set up 3Scale and Red Hat SSO in front of models served by OpenShift AI.


Declarative Model Serving as Code

# MaaS Configuration Structure
components/configs/maas/
├── model-serving/base/              # Model deployment configurations
│   ├── granite-3.3-8b-instruct.yaml
│   ├── llama-4-scout.yaml
│   ├── llama-3.2-3b.yaml
│   ├── mistral-small-24b.yaml
│   ├── phi-4.yaml
│   ├── nomic-embed-text-v1-5.yaml
│   ├── docling.yaml
│   ├── sdxl-custom-runtime.yaml
│   └── serving-runtimes/
└── 3scale-config/base/              # API management configurations
    ├── granite-3-3-8b-instruct/
    ├── llama-4-scout/
    ├── llama-3-2-3b/
    ├── mistral-small-24b/
    ├── phi-4/
    └── [per-model configs for authentication, rate limiting, documentation]

The MaaS platform organizes configurations for 15+ models with dedicated GitOps configurations for both OpenShift AI model serving and 3Scale API management.

Available Model Types:

  • Large Language Models: Llama-4-Scout, Granite 3.3 8B, Llama 3.2 3B, Mistral Small 24B, Phi-4
  • Embedding Models: Nomic Embed Text v1.5 for semantic search and RAG applications
  • Vision Models: Granite Vision 3.2 2B, Qwen2.5 VL 7B for multimodal AI
  • Specialized Models: Document processing (Docling), safety checking, image generation (SDXL)
  • Lightweight Models: TinyLlama 1.1B (running on CPU)

📖 Reference: MaaS Configurations
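
Under the hood, each file in model-serving/base/ corresponds to a KServe InferenceService (plus its ServingRuntime) committed to Git. A condensed sketch of what one such deployment might look like; the namespace, runtime, storage location, and resource requests are assumptions, not the repository's actual values:

# Condensed InferenceService sketch for an LLM served with vLLM
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: granite-3-3-8b-instruct
  namespace: maas-models                      # placeholder namespace
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment   # assumption
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-runtime                   # placeholder ServingRuntime name
      storageUri: oci://registry.example.com/models/granite-3.3-8b-instruct  # placeholder
      resources:
        limits:
          nvidia.com/gpu: "1"                 # one GPU per replica
        requests:
          nvidia.com/gpu: "1"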

This GitOps approach ensures:

  • Consistent Deployments: Identical model configurations across dev/prod environments
  • Version Control: Full audit trail of model deployment changes
  • Easy Rollbacks: Quick reversion to previous model versions
  • Automated Scaling: GPU autoscaling based on model demand

Custom AI Workbenches

Pre-configured Jupyter environments optimized for specific AI tasks:

  • PyTorch 2.1.4 with CUDA support for deep learning
  • Elyra with R for data science workflows
  • AnythingLLM for conversational AI development
  • Docling for document processing pipelines
  • Custom tools for specialized AI workflows

📖 Reference: Custom Workbenches Configuration
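
Custom workbench images are typically registered by importing them as ImageStreams that the RHOAI dashboard recognizes through notebook-image labels and annotations. A sketch under that assumption; the image reference and display name are placeholders:

# Illustrative ImageStream registering a custom workbench image
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: custom-anythingllm-workbench          # placeholder
  namespace: redhat-ods-applications
  labels:
    opendatahub.io/notebook-image: "true"     # makes the image selectable as a workbench
  annotations:
    opendatahub.io/notebook-image-name: "AnythingLLM Workbench"
    opendatahub.io/notebook-image-desc: "Conversational AI development environment"
spec:
  lookupPolicy:
    local: true
  tags:
    - name: "1.0"
      from:
        kind: DockerImage
        name: quay.io/example/anythingllm-workbench:1.0   # placeholder image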

While this implementation serves as a practical reference architecture, organizations can extend it with additional features based on their specific requirements.

Next Steps

This repository demonstrates that GitOps isn't just a deployment strategy: it's a comprehensive approach to managing the complex, rapidly evolving infrastructure requirements of modern AI platforms.