Upgrade Kubeflow Operators

This page describes the manual actions that may be required after upgrading the Kubeflow operators.

For installation steps, see Install Kubeflow Operators.

Migrating from v1.x (Cluster Plugin) to v26.3.0 (OLM Operator)

Starting in v26.3.0 (Alauda AI v2.3), Kubeflow components ship as OLM Helm Operators (kfbase-operator, kfp-operator, kubeflow-trainer-operator) instead of Cluster Plugins (kfbase, kfp, kftraining, kubeflow-trainer). There is no in-place upgrade path between the two form factors, because the Cluster Plugin install descriptor (ModuleInfo) and the OLM Subscription are mutually incompatible. To migrate, back up your data, uninstall the plugins, and then install the operators:

  1. Back up user data:

    • Snapshot Notebook PVCs in every user namespace.
    • Export Profile CRs and any custom RoleBinding / AuthorizationPolicy you created for Kubeflow users.
    • Export your KFP pipelines, experiments, and scheduled runs via the kfp CLI.
    • Export any TrainingRuntime and active TrainJob CRs.
  2. Uninstall the v1.x Cluster Plugins from the AC UI (Cluster Plugins) in this order: kubeflow-trainer, kfp, kftraining (if present), then kfbase. The matching ModuleInfo / ModuleConfig resources are removed automatically.

    Warning: Do not proceed until the backups in step 1 are confirmed. If uninstalling removes the kubeflow.org CRDs, all Profile CRs (and the user namespaces / Notebook PVCs they own) may be cascade-deleted. Verify whether your user namespaces and PVCs survive the uninstall before relying on the restore step below.

  3. Install the v26.3.0 operator set from Administrator > MarketPlace > OperatorHub:

    • kfbase-operator first (other operators depend on the base components).
    • kfp-operator if you need Kubeflow Pipelines (amd64 clusters only).
    • kubeflow-trainer-operator if you need Trainer v2.
  4. Create the matching CR instances (KubeflowBase, KubeflowPipelines, KubeflowTrainer) and reuse the configuration from your v1.x install. The chart values previously set in the Cluster Plugin install form are now exposed through the operator's CSV specDescriptors; most field names are unchanged.

  5. Restore user data: If the Profile CRs and their PVCs were preserved through the uninstall, they reattach automatically when the Notebook controller reconciles. If they were removed, first re-apply the Profile CRs you exported in step 1 and restore the PVCs from your snapshots. In all cases, re-import KFP pipelines and TrainingRuntimes.
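The CR exports in step 1 and the survival check called for in the step 2 warning could be scripted roughly as follows. This is a minimal sketch, not a supported tool: the BACKUP_DIR location, the function names, and the fully qualified resource names (in particular the trainer.kubeflow.org and security.istio.io API groups) are assumptions to verify against your cluster with kubectl api-resources. It does not cover PVC snapshots or the KFP export, which depend on your storage driver and the kfp CLI.

```shell
#!/bin/sh
# Hypothetical backup location; override by exporting BACKUP_DIR.
BACKUP_DIR="${BACKUP_DIR:-./kubeflow-backup}"

# Resource types named in step 1, fully qualified (assumed API groups).
# Exporting all RoleBindings is deliberately broad; filter to your own afterwards.
BACKUP_RESOURCES="profiles.kubeflow.org
rolebindings.rbac.authorization.k8s.io
authorizationpolicies.security.istio.io
trainingruntimes.trainer.kubeflow.org
trainjobs.trainer.kubeflow.org"

# Export one resource type from all namespaces into its own YAML file.
export_resource() {
  kubectl get "$1" --all-namespaces -o yaml > "$BACKUP_DIR/$1.yaml"
}

# Export every listed resource type; stop at the first failure.
backup_all() {
  mkdir -p "$BACKUP_DIR" || return 1
  for res in $BACKUP_RESOURCES; do
    export_resource "$res" || return 1
  done
}

# After the uninstall in step 2: list what survived.
# An error from the first command suggests the kubeflow.org CRDs were removed.
check_survivors() {
  kubectl get profiles.kubeflow.org
  kubectl get pvc --all-namespaces
}
```

Run backup_all before step 2 and check_survivors after it. The exported manifests include server-populated fields such as status, metadata.uid, and metadata.resourceVersion, which you may need to remove before re-applying them in step 5.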

What changed between v1.x and v26.3.0

  • Form factor: Cluster Plugin → OLM Helm Operator.
  • Install descriptor: ModuleInfo → OLM Subscription + ClusterServiceVersion + operator-owned CR.
  • Trainer: kftraining (Training Operator v1, deprecated) is removed; replaced by kubeflow-trainer-operator (Trainer v2).
  • Upstream alignment: all charts re-pinned to kubeflow/manifests 26.03.
  • Architecture: kfp-operator is now amd64-only; kfbase-operator and kubeflow-trainer-operator remain amd64 + arm64.
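Because kfp-operator is amd64-only, it can help to check node architectures before installing it. The sketch below is illustrative only; the function names are hypothetical, and whether a mixed amd64/arm64 cluster is acceptable is not stated here, so the check simply reports whether amd64 is the only architecture present.

```shell
#!/bin/sh
# Print the distinct CPU architectures reported by the cluster's nodes.
node_architectures() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.status.nodeInfo.architecture}{"\n"}{end}' | sort -u
}

# Succeed only if amd64 is the sole architecture in the list read from stdin.
only_amd64() {
  [ "$(cat)" = "amd64" ]
}

# Usage against a live cluster:
#   node_architectures | only_amd64 && echo "all nodes are amd64"
```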

Upgrade Notes for kfbase (v1.x history)

Upgrading from v1.10.13 or earlier

Versions up to v1.10.13 expose the Kubeflow dashboard through NodePort. After the upgrade, the recommended access method is through the gateway endpoint instead.

After the upgrade:

  • Check the kubeflowDomain field in the kfbase plugin configuration to get <your-kubeflow-domain>.
  • Run kubectl -n istio-system get gateway kubeflow-external-gateway to get the gateway IP address.
  • Update DNS resolution, or your local hosts file, so that <your-kubeflow-domain> resolves to the gateway IP address.
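If you use a local hosts file rather than DNS, the entry can be generated as in the sketch below. The hosts_entry helper is hypothetical, and 192.0.2.10 and kubeflow.example.com are placeholders for your gateway IP address and <your-kubeflow-domain>.

```shell
#!/bin/sh
# Build a hosts-file line that maps the Kubeflow domain to the gateway IP.
hosts_entry() {
  printf '%s %s\n' "$1" "$2"
}

# Example with placeholder values (192.0.2.10 is a documentation-only address).
hosts_entry 192.0.2.10 kubeflow.example.com
# prints: 192.0.2.10 kubeflow.example.com

# To append the entry on a Linux or macOS workstation (requires root):
#   hosts_entry <gateway-ip> <your-kubeflow-domain> | sudo tee -a /etc/hosts
```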

If you still need to use NodePort, manually change the istio-system/kubeflow-istio-ingressgateway service to type NodePort, then find the NodePort assigned to port 443 by running:

kubectl -n istio-system get service kubeflow-istio-ingressgateway

You can then access the dashboard through:

https://<ip-of-master-node-of-the-cluster>:<NodePort>/
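The NodePort fallback above can also be done from the command line. The following is a sketch under the assumption that the Service exposes HTTPS on port 443 as described; the function names are hypothetical, and the kubectl patch and jsonpath calls are standard kubectl usage rather than anything specific to this product.

```shell
#!/bin/sh
SVC_NS=istio-system
SVC_NAME=kubeflow-istio-ingressgateway

# Switch the ingress gateway Service to type NodePort.
set_nodeport() {
  kubectl -n "$SVC_NS" patch service "$SVC_NAME" -p '{"spec":{"type":"NodePort"}}'
}

# Print the node port assigned to Service port 443.
https_nodeport() {
  kubectl -n "$SVC_NS" get service "$SVC_NAME" \
    -o jsonpath='{.spec.ports[?(@.port==443)].nodePort}'
}

# Compose the dashboard URL from a node IP and a node port.
dashboard_url() {
  printf 'https://%s:%s/\n' "$1" "$2"
}

# Usage:
#   set_nodeport
#   dashboard_url <ip-of-master-node-of-the-cluster> "$(https_nodeport)"
```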

Upgrading from v1.10.9 or earlier

Before the upgrade, set a default storage class in your cluster. It is used for the pgStorageClass parameter in the kfbase plugin configuration. If no default storage class is set, the upgrade may fail because required parameters are missing. These required parameters, including pgStorageClass, were introduced in v1.10.10.
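A default storage class is normally set through the standard Kubernetes annotation storageclass.kubernetes.io/is-default-class. The sketch below assumes that convention applies to your cluster; the function names are hypothetical, and the class name you pass is whichever StorageClass you want to become the default.

```shell
#!/bin/sh
# List storage classes; the current default is shown with "(default)" after its name.
list_storageclasses() {
  kubectl get storageclass
}

# Mark the named StorageClass as the cluster default.
set_default_storageclass() {
  kubectl patch storageclass "$1" \
    -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
}

# Usage:
#   list_storageclasses
#   set_default_storageclass <storage-class-name>
```

If another class is already marked as default, you would typically set its annotation to "false" first so that only one default remains.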