Fully GitOpsified implementation of a RHOAI platform
What is the RHOAI BU Cluster Repository?
At Red Hat, we provide an internal OpenShift AI environment known as the RHOAI BU Cluster, where "BU" stands for our AI Business Unit. This cluster provides a centralized platform for experimentation, prototyping, and the scalable deployment of AI solutions across the company. Its operations are managed through the RHOAI BU Cluster GitOps Repository, which implements a comprehensive GitOps approach using declarative configuration to maintain and evolve the infrastructure behind Red Hat OpenShift AI.
What it manages:
- Two OpenShift clusters: development (`rhoaibu-cluster-dev`) and production (`rhoaibu-cluster-prod`)
- Complete AI/ML platform infrastructure using GitOps practices
- Models as a Service (MaaS) platform with 15+ AI models
Purpose:
- Working example of GitOps for AI infrastructure
- Reference architecture for organizations implementing AI/ML platforms
Internal Service Notice
This is an internal development and testing service. No Service Level Agreements (SLAs) or Service Level Objectives (SLOs) are provided or guaranteed. This service is not intended for production use cases or mission-critical applications.
Unlike basic GitOps examples, this repository manages:
- Complete cluster lifecycle from development to production workloads
- Models as a Service (MaaS) platform with 15+ AI models (Granite, Llama, Mistral, Phi-4), 3Scale API gateway, Red Hat SSO authentication, self-service portal, and usage analytics for internal development and testing
- Multi-environment support with dev and prod cluster configurations
- Advanced AI-specific components including GPU autoscaling, declarative model serving, and custom workbenches
- Advanced features like OAuth integration, RBAC, and certificate management
Info
This guide provides an overview of the RHOAI BU Cluster GitOps Repository, an implementation of GitOps for managing the RHOAI BU Cluster infrastructure at scale, including OpenShift AI and MaaS.
For foundational OpenShift AI & GitOps concepts and object definitions, please refer to our Managing RHOAI with GitOps guide first.
Why GitOps for AI Infrastructure?
GitOps provides unique advantages for AI/ML workloads that traditional infrastructure management approaches struggle to deliver. Rather than reinventing the wheel, this implementation builds upon the proven GitOps Catalog from the Red Hat Community of Practice, which provides battle-tested GitOps patterns and components as the foundation for the AI-specific infrastructure.
Infrastructure Reproducibility: AI experiments require consistent environments. GitOps ensures your development and production clusters are identical, eliminating "works on my machine" issues.
GPU Resource Management: Automated scaling and configuration of expensive GPU resources based on declarative policies, reducing costs while ensuring availability.
Faster Iteration: Version-controlled infrastructure changes enable rapid experimentation with different configurations, operators, and serving runtimes.
Compliance & Auditing: Complete audit trail of all infrastructure changes, critical for regulated industries deploying AI models.
Repository Architecture & Hierarchy
The repository follows a layered architecture that separates concerns while maintaining flexibility:
```
rhoaibu-cluster/
├── bootstrap/           # Initial cluster setup and GitOps installation
├── clusters/            # Environment-specific configurations
│   ├── base/            # Shared cluster resources
│   └── overlays/        # Dev/prod environment customizations
├── components/          # Modular GitOps components
│   ├── argocd/          # ArgoCD projects and applications
│   ├── configs/         # Cluster-wide configurations
│   ├── instances/       # Operator instance configurations
│   ├── operators/       # Core Red Hat operators
│   └── operators-extra/ # Community and third-party operators
├── demos/               # Demo applications and examples
└── docs/                # Documentation and development guides
```
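To make the base/overlay split concrete, here is a minimal sketch of what a dev overlay's `kustomization.yaml` might look like; the file and patch names are illustrative, not taken from the repository:

```yaml
# clusters/overlays/dev/kustomization.yaml (illustrative paths)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

# Inherit the shared cluster resources from the base layer
resources:
  - ../../base

# Apply dev-only adjustments, e.g. smaller autoscaling limits
patches:
  - path: patch-autoscaler-limits.yaml   # hypothetical patch file
    target:
      kind: ClusterAutoscaler
      name: default
```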
Core Components Deep Dive
1. Bootstrap Layer (`bootstrap/`)
The bootstrap layer handles initial cluster setup and GitOps installation. Key features include:
- Cluster Installation: OpenShift cluster deployment with GPU machinesets
- GitOps Bootstrap: OpenShift GitOps operator installation and initial configuration
- Authentication Setup: Google OAuth integration for internal access
- Certificate Management: Let's Encrypt certificates with automatic renewal
Reference: Bootstrap Documentation
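As a hedged illustration of the GitOps bootstrap step, an Argo CD "app of apps" Application pointing at this repository might look like the following; the repository URL and path are placeholders:

```yaml
# Hypothetical bootstrap Application handing the cluster over to GitOps
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: rhoaibu-cluster
  namespace: openshift-gitops
spec:
  project: default
  source:
    repoURL: https://github.com/<org>/rhoaibu-cluster.git  # placeholder URL
    targetRevision: main
    path: clusters/overlays/dev
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from Git
      selfHeal: true   # revert out-of-band changes on the cluster
```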
2. Operators Management (`components/operators/`)
Core Red Hat Operators managed through GitOps:
- Red Hat OpenShift AI (RHOAI)
- NVIDIA GPU Operator
- OpenShift Data Foundation
- OpenShift Cert Manager
- OpenShift Service Mesh
- OpenShift Serverless
Reference: Core Operators Documentation
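For illustration, an OLM Subscription like the sketch below is the typical way such operators are declared in Git; the channel shown is a placeholder, since channels vary by release:

```yaml
# Sketch of an OLM Subscription for the NVIDIA GPU Operator;
# an OperatorGroup in the namespace is also required but omitted here
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: stable                     # placeholder channel
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Automatic
```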
Community & Third-Party Operators (`components/operators-extra/`):
Reference: Operators Documentation
3. Instance Configurations (`components/instances/`)
Configurations for each operator:
- RHOAI Instance: Complete DataScienceCluster configuration with custom accelerator profiles, workbenches, and dashboard settings
- GPU Management: NVIDIA operator policies optimized for AI workloads
- Storage: OpenShift Data Foundation instance providing storage capabilities (including RWX access modes) for OpenShift AI workloads
- Certificates: Automated TLS certificate management for model serving endpoints
Reference: Operator Instances Documentation
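A minimal DataScienceCluster sketch gives a feel for what the RHOAI instance configuration declares; the real instance enables more components and custom settings:

```yaml
# Minimal DataScienceCluster sketch; component list is illustrative
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    dashboard:
      managementState: Managed
    workbenches:
      managementState: Managed
    kserve:
      managementState: Managed     # single-model serving used for MaaS models
    modelmeshserving:
      managementState: Removed     # assumed disabled in this sketch
```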
4. Cluster Configurations (`components/configs/`)
Cluster-wide settings:
- Authentication: OAuth providers and RBAC configurations
- Autoscaling: GPU-optimized cluster autoscaler with support for multiple GPU node types
- Console Customization: OpenShift and RHOAI console with AI-focused navigation
- Namespace Management: Project request templates and resource quotas
Reference: Cluster Configurations Documentation
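As an example of the authentication piece, a Google identity provider on the cluster OAuth resource might be declared as below; the client ID and secret name are placeholders:

```yaml
# Sketch of Google OAuth on the cluster-scoped OAuth resource
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
    - name: google
      mappingMethod: claim
      type: Google
      google:
        clientID: <client-id>                  # placeholder
        clientSecret:
          name: google-oauth-client-secret     # placeholder Secret name
        hostedDomain: redhat.com               # restrict logins to the org domain
```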
Comprehensive Configuration Details
For detailed Kubernetes object definitions and advanced configuration options for model serving, data connections, accelerator profiles, and InferenceService objects, see our comprehensive Managing RHOAI with GitOps guide.
Key GitOps Workflows
Development and Testing Workflow
The repository supports a complete development lifecycle:
- Fork and Branch: Developers create feature branches for infrastructure changes
- Local Testing: Kustomize allows local validation before deployment
- Dev Environment: Changes are tested in the development cluster first
- Production Promotion: Validated changes are promoted to production via GitOps
Reference: Development Workflow
GPU Autoscaling for AI Workloads
GPU autoscaling is critical for AI workloads since GPUs are expensive resources that need to scale based on demand. The cluster automatically provisions GPU nodes from separate pools for different use cases (Tesla T4 for cost-effective training/inference, NVIDIA A10G for high-memory workloads and model serving), with configurations for both shared and private access patterns.
Reference: Autoscaling Configuration
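A hedged sketch of this pattern: one MachineAutoscaler per GPU machineset, allowing expensive nodes to scale to zero when idle. The machineset name and replica bounds are illustrative:

```yaml
# Illustrative MachineAutoscaler for a Tesla T4 machineset
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: gpu-t4-autoscaler
  namespace: openshift-machine-api
spec:
  minReplicas: 0                       # scale GPU nodes down to zero when idle
  maxReplicas: 4                       # cap spend on the shared T4 pool
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: rhoaibu-cluster-gpu-t4       # placeholder machineset name
```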
Accelerator Profiles
OpenShift AI Accelerator Profiles configurations for optimal GPU utilization:
Example accelerator profiles:
- NVIDIA Tesla T4: 16GB VRAM, ideal for inference and small model training
- NVIDIA A10G: 24GB VRAM, optimized for large model training
- NVIDIA L40: 48GB VRAM, optimized for large model training
- NVIDIA L40S: 48GB VRAM, designed for multi-modal AI workloads
- NVIDIA H100: 80GB VRAM, cutting-edge performance for large-scale training
Reference: Accelerator Profiles Configuration
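An AcceleratorProfile for one of these pools might look like the following sketch; the taint key in the toleration is an assumption about how the GPU nodes are tainted:

```yaml
# Sketch of an AcceleratorProfile for the Tesla T4 pool
apiVersion: dashboard.opendatahub.io/v1
kind: AcceleratorProfile
metadata:
  name: nvidia-tesla-t4
  namespace: redhat-ods-applications
spec:
  displayName: NVIDIA Tesla T4
  enabled: true
  identifier: nvidia.com/gpu       # resource name requested by workloads
  tolerations:
    - key: nvidia.com/gpu          # assumed taint key on GPU nodes
      operator: Exists
      effect: NoSchedule
```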
AI-Specific Components
Models as a Service (MaaS)
The RHOAI BU Cluster serves as internal infrastructure hosting a complete Models as a Service (MaaS) platform. MaaS runs entirely on top of this GitOps-managed cluster infrastructure, providing AI model serving capabilities for development and testing to all Red Hat employees.
The MaaS implementation leverages the Models as a Service repository, which demonstrates how to set up 3Scale and Red Hat SSO in front of models served by OpenShift AI.
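As a rough sketch of that pattern, the 3scale operator's capabilities CRDs can declare a Backend for the model endpoint and a Product that exposes it through the gateway; all names, URLs, and fields here are illustrative:

```yaml
# Rough sketch: a 3scale Backend wrapping a served model endpoint
apiVersion: capabilities.3scale.net/v1beta1
kind: Backend
metadata:
  name: granite-backend
spec:
  name: granite-3-3-8b-instruct
  systemName: granite
  privateBaseURL: https://granite.example.internal   # placeholder model endpoint
---
# ...and a Product exposing that Backend through the API gateway
apiVersion: capabilities.3scale.net/v1beta1
kind: Product
metadata:
  name: granite-product
spec:
  name: granite-3-3-8b-instruct
  backendUsages:
    granite:           # references the Backend's systemName
      path: /
```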
Internal Service Notice
This is an internal development and testing service. No Service Level Agreements (SLAs) or Service Level Objectives (SLOs) are provided or guaranteed. This service is not intended for production use cases or mission-critical applications.
Declarative Model Serving as Code
```
# MaaS Configuration Structure
components/configs/maas/
├── model-serving/base/       # Model deployment configurations
│   ├── granite-3.3-8b-instruct.yaml
│   ├── llama-4-scout.yaml
│   ├── llama-3.2-3b.yaml
│   ├── mistral-small-24b.yaml
│   ├── phi-4.yaml
│   ├── nomic-embed-text-v1-5.yaml
│   ├── docling.yaml
│   ├── sdxl-custom-runtime.yaml
│   └── serving-runtimes/
└── 3scale-config/base/       # API management configurations
    ├── granite-3-3-8b-instruct/
    ├── llama-4-scout/
    ├── llama-3-2-3b/
    ├── mistral-small-24b/
    ├── phi-4/
    └── [per-model configs for authentication, rate limiting, documentation]
```
The MaaS platform organizes 15+ models, each with dedicated GitOps configuration for both OpenShift AI model serving and 3Scale API management.
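For a concrete feel, a model deployment file under model-serving/base/ would typically contain a KServe InferenceService such as this sketch; the runtime name, storage URI, and deployment mode are placeholders:

```yaml
# Hedged sketch of a KServe InferenceService for one MaaS model
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: granite-3-3-8b-instruct
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment   # placeholder mode
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-runtime          # placeholder ServingRuntime name
      storageUri: oci://registry.example.com/modelcar/granite-3.3-8b  # placeholder
      resources:
        limits:
          nvidia.com/gpu: "1"        # schedules onto an autoscaled GPU node
```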
Available Model Types:
- Large Language Models: Llama-4-Scout, Granite 3.3 8B, Llama 3.2 3B, Mistral Small 24B, Phi-4
- Embedding Models: Nomic Embed Text v1.5 for semantic search and RAG applications
- Vision Models: Granite Vision 3.2 2B, Qwen2.5 VL 7B for multimodal AI
- Specialized Models: Document processing (Docling), safety checking, image generation (SDXL)
- Lightweight Models: TinyLlama 1.1B (running on CPU)
Reference: MaaS Configurations
This GitOps approach ensures:
- Consistent Deployments: Identical model configurations across dev/prod environments
- Version Control: Full audit trail of model deployment changes
- Easy Rollbacks: Quick reversion to previous model versions
- Automated Scaling: GPU autoscaling based on model demand
Custom AI Workbenches
Pre-configured Jupyter environments optimized for specific AI tasks:
- PyTorch 2.1.4 with CUDA support for deep learning
- Elyra with R for data science workflows
- AnythingLLM for conversational AI development
- Docling for document processing pipelines
- Custom tools for specialized AI workflows
Reference: Custom Workbenches Configuration
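Below is a minimal sketch of how a custom workbench image can be registered, assuming the dashboard discovers ImageStreams labeled as notebook images; the name and image reference are placeholders:

```yaml
# Sketch: expose a custom image as a workbench option in the dashboard
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: anythingllm-workbench          # placeholder name
  namespace: redhat-ods-applications
  labels:
    opendatahub.io/notebook-image: "true"   # marks it as a notebook image
spec:
  tags:
    - name: latest
      from:
        kind: DockerImage
        name: quay.io/example/anythingllm-workbench:latest  # placeholder image
```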
While this implementation serves as a practical reference architecture, organizations can extend it with additional features based on their specific requirements.
Next Steps
- Review our foundational GitOps guide for object-level details
- Explore the RHOAI BU Cluster Repository for complete implementation examples
- Check out the AI on OpenShift examples for application-level patterns
This repository demonstrates that GitOps isn't just a deployment strategy; it's a comprehensive approach to managing the complex, rapidly evolving infrastructure requirements of modern AI platforms.