1 - Integrating with Grafana

Configure Grafana dashboards

Introduction

The Resiliency Operator exports metrics in Prometheus format that can be visualized using custom Grafana dashboards.

Prerequisites

  • Prometheus installed in the Kubernetes cluster
  • Grafana configured with the Prometheus instance as a data source (a provisioning sketch is shown below)
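If the data source is not yet configured, the following is a minimal sketch using Grafana's data source provisioning format. The Prometheus URL is an assumption and must be replaced with the address of the Prometheus service in your cluster.

# Hypothetical Grafana data source provisioning file
# (for example /etc/grafana/provisioning/datasources/prometheus.yaml).
# The URL below is an assumption; point it at your own Prometheus service.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-k8s.monitoring.svc:9090
    isDefault: true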

Process

1. Import dashboard for Assets

Access Grafana and navigate to Home > Dashboards > Import.

Set the dashboard URL to https://astronetes.io/deploy/resiliency-operator/v1.3.5/grafana-dashboard-assets.json and click Load.

Configure the import options and click the Import button to complete the process.

2. Import dashboard for Synchronizations

Access Grafana and navigate to Home > Dashboards > Import.

Set the dashboard URL to https://astronetes.io/deploy/resiliency-operator/v1.3.5/grafana-dashboard-synchronizations.json and click Load.

Configure the import options and click the Import button to complete the process.
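As an alternative to the web console, both dashboards can be imported through the Grafana HTTP API. The sketch below is only an illustration: it assumes a Grafana API token with dashboard write permissions, network access to the Grafana and release URLs, and jq installed locally. GRAFANA_URL and GRAFANA_TOKEN are placeholders.

# Hypothetical sketch: import both dashboards through the Grafana HTTP API
GRAFANA_URL=https://grafana.example.com
GRAFANA_TOKEN=<your-api-token>

for name in assets synchronizations; do
  curl -s "https://astronetes.io/deploy/resiliency-operator/v1.3.5/grafana-dashboard-${name}.json" \
    | jq '{dashboard: ., overwrite: true}' \
    | curl -s -X POST "${GRAFANA_URL}/api/dashboards/db" \
        -H "Authorization: Bearer ${GRAFANA_TOKEN}" \
        -H "Content-Type: application/json" \
        -d @-
done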

2 - Integrating with Grafana Operator

Configure Grafana dashboards

Introduction

The Resiliency Operator exports metrics in Prometheus format that can be visualized using custom Grafana dashboards.

Prerequisites

  • Prometheus installed in the Kubernetes cluster
  • Grafana Operator installed in the cluster and configured with the Prometheus instance as a data source (a GrafanaDatasource sketch is shown below)
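If the Grafana instance managed by the operator does not yet have a Prometheus data source, the following is a minimal sketch using the Grafana Operator v5 API. The instanceSelector labels, namespace, and Prometheus URL are assumptions for your environment.

# Hypothetical GrafanaDatasource sketch (grafana.integreatly.org/v1beta1, Grafana Operator v5).
# The instanceSelector labels, namespace, and URL are assumptions; adjust them to your setup.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: prometheus
  namespace: <grafana-namespace>
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana
  datasource:
    name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-k8s.monitoring.svc:9090
    isDefault: true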

Process

1. Create the GrafanaDashboard for Assets

Create the GrafanaDashboard for Assets from the release manifests:

kubectl apply -f https://astronetes.io/deploy/resiliency-operator/v1.3.5/dashboard-assets.yaml

2. Create the GrafanaDashboard for Synchronizations

Create the GrafanaDashboard for Synchronizations from the release manifests:

kubectl apply -f https://astronetes.io/deploy/resiliency-operator/v1.3.5/dashboard-synchronizations.yaml
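After both manifests are applied, it can be verified that the Grafana Operator picked up the resources; the namespaces and exact resource names depend on the release manifests.

# List the GrafanaDashboard resources created from the release manifests
kubectl get grafanadashboards -A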

3 - Integrating with OpenShift Alerting

Manage alerts based on Prometheus metrics through OpenShift

Introduction

OpenShift allows the creation of alerts based on Prometheus metrics to provide additional information about the functioning and status of the Astronetes operator.

Prerequisites

  • Access Requirement: cluster-admin access to the OpenShift cluster

Configure alerts

Two types of alerts are provided: one for managing the operator’s integration within the cluster and one for monitoring the synchronizations.

Platform alerts

These metrics assess the functionality of the integration between the product and the assets.

Apply these rules:

oc apply -f https://astronetes.io/deploy/resiliency-operator/v1.3.5/alert-rules-resiliency-operator.yaml
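After applying the manifest, it can be confirmed that the PrometheusRule objects exist; the namespace and rule names are defined by the release manifest, and the grep pattern below is only an assumption that the names contain "resiliency".

# Confirm the platform alert rules were created (grep pattern is an assumption)
oc get prometheusrules -A | grep resiliency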

Synchronization alerts

These metrics are employed to assess the status of synchronized objects.

To configure this rule, follow these steps:

  1. Create this PrometheusRule manifest:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: failed-synchronize-items
  namespace: <your-synchronization-namespace>
spec:
  groups:
  - name: synchronization-alerts
    rules:
    - alert: SynchronizationNotInSync
      annotations:
        summary: "There are synchronization items not in sync."
        description: "Synchronization {{ $labels.synchronizationName }} is out of sync in namespace {{ $labels.synchronizationNamespace }}"
      expr: astronetes_total_synchronized_objects{objectStatus!="Sync"} > 0
      for: 1h
      labels:
        severity: warning
    - alert: WriteOperationsFailed
      annotations:
        summary: "There are one or more write operations failed"
        description: "Synchronization {{ $labels.synchronizationName }} failed write operator in namespace {{ $labels.synchronizationNamespace }}"
      expr: astronetes_total_write_operations{writeStatus="failed"} > 0
      for: 1h
      labels:
        severity: warning
  2. Edit the namespace: use the namespace where the Synchronizations are deployed.

  3. Apply the rule:

kubectl apply -f <path-to-your-modified-yaml-file>.yaml
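Once applied, the rule can be checked in the namespace set in the manifest; the expression itself can also be evaluated in the OpenShift console under Observe > Metrics to confirm the metric is being reported.

# Confirm the rule exists in the synchronization namespace
kubectl -n <your-synchronization-namespace> get prometheusrule failed-synchronize-items -o yaml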

How to configure custom alerts

Prometheus provides a powerful set of metrics that can be used to monitor the status of your cluster and the functionality of your operator by creating customized alert rules.

The PrometheusRule should be created in the same namespace as the process that generates these metrics to ensure proper functionality and visibility.

Here is an example of a PrometheusRule YAML file:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: <alert-name>
  namespace: <namespace>
spec:
  groups:
  - name: <group-name>
    rules:
    - alert: <alert-rule-name>
      annotations:
        description: <description>
        summary: <summary>
      expr: <expression>
      for: <duration>
      labels:
        severity: <severity-level>

Field Value Descriptions

In the PrometheusRule YAML file, several fields are essential for defining your alerting rules. Below is a table describing the values that can be used for each field:

Field | Description | Example Values
alert | Specifies the name of the alert that will be triggered. It should be descriptive. | AssetFailure, HighCPUUsage, MemoryThresholdExceeded
for | Defines the duration for which the condition must be true before the alert triggers. | 5m, 1h, 30s
severity | Indicates the criticality of the alert. Helps prioritize alerts. | critical, warning, info
expr | The Prometheus expression (in PromQL) that determines the alerting condition based on metrics. | sum(rate(http_requests_total[5m])) > 100, node_memory_usage > 90
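As an illustration, the sketch below fills in the template with one of the operator’s synchronization metrics mentioned earlier. The alert name, rule name, duration, and severity are example values chosen only for this sketch.

# Hypothetical example: alert when any write operation reported by the operator fails.
# The alert name, duration, and severity are example values only.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-write-failures
  namespace: <your-synchronization-namespace>
spec:
  groups:
  - name: custom-alerts
    rules:
    - alert: CustomWriteOperationsFailed
      annotations:
        summary: "Write operations are failing."
        description: "Synchronization {{ $labels.synchronizationName }} reports failed write operations."
      expr: astronetes_total_write_operations{writeStatus="failed"} > 0
      for: 15m
      labels:
        severity: critical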

Apply to the cluster

Create the new PrometheusRule in the cluster:

oc apply -f <path-to-your-prometheus-rule-file>.yaml

Checking alerts

1. Access OpenShift Web Console:

  • Open your browser and go to the OpenShift web console URL.
  • Log in with your credentials.

2. Navigate to Observe:

  • In the OpenShift console, go to the Observe section from the main menu.
  • In the Alerts tab, you’ll find a list of active and silenced alerts.
  • Check for any alerts triggered by the custom rules you created in Prometheus.
  • You can also see the entire list of configured alerting rules.

3. Filter Custom Alerts:

  • To filter the custom alerts, use the source field and set its value to user. This displays only the alerts generated by user-defined rules. See the OpenShift documentation on filtering for more details.
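The firing alerts can also be listed from the command line by querying the cluster Alertmanager API. The sketch below is an assumption-laden example: it assumes the default alertmanager-main route in the openshift-monitoring namespace, a logged-in oc session, and jq installed locally.

# Hypothetical sketch: list firing alerts through the cluster Alertmanager API
TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route alertmanager-main -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer ${TOKEN}" "https://${HOST}/api/v2/alerts" \
  | jq -r '.[].labels.alertname'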

4 - Update license key

Steps to update the license key for the Resiliency Operator

There is no need to reinstall the operator when updating the license key.

Process

1. Update the license key

Update the Kubernetes Secret that stores the license key with the new license:

Using kubectl:

kubectl -n resiliency-operator apply -f new-license-key.yaml

Using oc (OpenShift):

oc -n resiliency-operator apply -f new-license-key.yaml
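If the new license is delivered as a raw key file rather than a ready-made manifest, the Secret can also be regenerated in place. The Secret name (resiliency-operator-license) and key (license.key) below are assumptions; use the names defined by your installation.

# Hypothetical sketch: recreate the license Secret from a key file.
# The Secret name and key below are assumptions; match them to your installation.
kubectl -n resiliency-operator create secret generic resiliency-operator-license \
  --from-file=license.key=./license.key \
  --dry-run=client -o yaml | kubectl apply -f -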

2. Restart the Resiliency Operator

Restart the Resiliency Operator Deployment to apply the new license:

Using kubectl:

kubectl -n resiliency-operator rollout restart deployment resiliency-operator-bucket-controller
kubectl -n resiliency-operator rollout restart deployment resiliency-operator-database-controller
kubectl -n resiliency-operator rollout restart deployment resiliency-operator-kubernetescluster-controller
kubectl -n resiliency-operator rollout restart deployment resiliency-operator-livesynchronization-controller
kubectl -n resiliency-operator rollout restart deployment resiliency-operator-synchronization-controller
kubectl -n resiliency-operator rollout restart deployment resiliency-operator-synchronizationplan-controller
kubectl -n resiliency-operator rollout restart deployment resiliency-operator-task-controller
kubectl -n resiliency-operator rollout restart deployment resiliency-operator-taskrun-controller

Using oc (OpenShift):

oc -n resiliency-operator rollout restart deployment resiliency-operator-bucket-controller
oc -n resiliency-operator rollout restart deployment resiliency-operator-database-controller
oc -n resiliency-operator rollout restart deployment resiliency-operator-kubernetescluster-controller
oc -n resiliency-operator rollout restart deployment resiliency-operator-livesynchronization-controller
oc -n resiliency-operator rollout restart deployment resiliency-operator-synchronization-controller
oc -n resiliency-operator rollout restart deployment resiliency-operator-synchronizationplan-controller
oc -n resiliency-operator rollout restart deployment resiliency-operator-task-controller
oc -n resiliency-operator rollout restart deployment resiliency-operator-taskrun-controller

3. Wait for the Pods restart

Wait a couple of minutes until all the Resiliency Operator Pods are restarted with the new license.

Using kubectl:

kubectl -n resiliency-operator wait --for=condition=available deployment/resiliency-operator-bucket-controller
kubectl -n resiliency-operator wait --for=condition=available deployment/resiliency-operator-database-controller
kubectl -n resiliency-operator wait --for=condition=available deployment/resiliency-operator-kubernetescluster-controller
kubectl -n resiliency-operator wait --for=condition=available deployment/resiliency-operator-livesynchronization-controller
kubectl -n resiliency-operator wait --for=condition=available deployment/resiliency-operator-synchronization-controller
kubectl -n resiliency-operator wait --for=condition=available deployment/resiliency-operator-synchronizationplan-controller
kubectl -n resiliency-operator wait --for=condition=available deployment/resiliency-operator-task-controller
kubectl -n resiliency-operator wait --for=condition=available deployment/resiliency-operator-taskrun-controller

Using oc (OpenShift):

oc -n resiliency-operator wait --for=condition=available deployment/resiliency-operator-bucket-controller
oc -n resiliency-operator wait --for=condition=available deployment/resiliency-operator-database-controller
oc -n resiliency-operator wait --for=condition=available deployment/resiliency-operator-kubernetescluster-controller
oc -n resiliency-operator wait --for=condition=available deployment/resiliency-operator-livesynchronization-controller
oc -n resiliency-operator wait --for=condition=available deployment/resiliency-operator-synchronization-controller
oc -n resiliency-operator wait --for=condition=available deployment/resiliency-operator-synchronizationplan-controller
oc -n resiliency-operator wait --for=condition=available deployment/resiliency-operator-task-controller
oc -n resiliency-operator wait --for=condition=available deployment/resiliency-operator-taskrun-controller
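Once the wait commands return, the Pods can be listed to confirm that all controllers are back in a Running state:

# Verify that all Resiliency Operator Pods are running again
kubectl -n resiliency-operator get pods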