Plugins implement the logic to synchronize data from a particular type of asset to another instance that runs the same or another technology.
Specifying which plugin to use is required regardless of whether the synchronization is managed through a Synchronization or a SynchronizationPlan Custom Resource.
1 - Kubernetes Objects to Kubernetes
Synchronization across two Kubernetes clusters.
1.1 - Architecture
Kubernetes to Kubernetes plugin Architecture
The cluster is protected with a warm standby paired cluster to which the workloads are offloaded when a disaster occurs. Resources can remain deactivated in the destination cluster until such an event takes place, avoiding unnecessary resource consumption and optimizing organizational costs.
Resiliency Operator extracts the resources from the source cluster and syncs them to the destination cluster, maintaining a consistent state between the two.
Operator monitoring is attached to the operator and is independent of either cluster.
The operator can be deployed in either a 2-clusters or 3-clusters architecture.
2-clusters
This configuration is recommended for training, testing, validation or when the 3-clusters option is not optimal or possible.
The currently active cluster will be the source cluster, while the passive one is the destination cluster. The operator, including all the Custom Resource Definitions (CRDs) and processes, is installed in the latter. The operator will listen for new resources that fulfill the requirements and clone them into the destination cluster.
The source cluster is never aware of the destination cluster and can exist and operate as normal without its presence. The destination cluster needs to have access to it through a KubernetesCluster resource.
3-clusters
In addition to the two existing clusters, this modality includes a management cluster. The operator synchronization workflow is delegated to it instead of depending on the destination cluster. The management cluster is in charge of reading the changes and new resources in the source cluster and syncing them to the destination. Neither the source nor the destination cluster needs to know of the existence of the management cluster, and both can operate without it. Having a separate cluster that is decoupled from direct production activity lowers operational risks and eases access control for both human and software operators. The operator still needs to be installed in the destination cluster so that the recovery process can start without depending on other clusters. Custom Resources that configure the synchronization are deployed in the management cluster, while those only relevant when executing the recovery process are deployed in the destination cluster.
This structure fits organizations that already depend on a management cluster for other tasks, or that are planning to adopt one. Resiliency Operator does not require a standalone management cluster and can be installed and managed from an existing one.
1.2 - Components
Kubernetes to Kubernetes plugin Components
Synchronization across clusters is managed through Kubesync, the Astronetes solution for Kubernetes cluster replication. The following components are deployed when synchronization between two clusters is started:
| Component | Description | Source cluster permissions | Destination cluster permissions |
| --- | --- | --- | --- |
| Events listener | Reads events in the source cluster. | Cluster reader | N/A |
| Processor | Filters and transforms the objects read from the source cluster. | Cluster reader | N/A |
| Synchronizer | Writes processed objects in the destination cluster. | N/A | Write |
| Reconciler | Sends delete events whenever it finds discrepancies between source and destination. | Cluster reader | Cluster reader |
| NATS | Used by other components to send and receive data. | N/A | N/A |
| Redis | Stores metadata about the synchronization state. Most LiveSynchronization components interact with it. | N/A | N/A |
| Metrics exporter | Exports metrics about the LiveSynchronization status. | N/A | N/A |
1.3 - Post-installation configuration
Steps to configure the Kubernetes objects to Kubernetes cluster plugin
1.3.1 - Setting up a Kubernetes cluster
Granting access to the source and destination clusters
Introduction
Connection to both the source and destination clusters is set up using the KubernetesCluster resource. Credentials are stored in Kubernetes Secrets, from which the KubernetesCluster resource collects the access information needed to connect to the clusters.
Requirements
The kubeconfig file with read-only access to the source cluster
The kubeconfig file with cluster-admin access to the destination cluster
The Secret provided by AstroKube to access the Image Registry
Process
1. Prepare
Create Namespace
Create the namespace to configure the recovery process:
```sh
kubectl create namespace <namespace_name>
```
Set up registry credentials
Create the Secret that stores the credentials to the AstroKube image registry:
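The Secret manifest itself is provided by AstroKube; a minimal sketch of this step, assuming the Secret is delivered as a YAML file:

```sh
kubectl apply -n <namespace_name> -f <astrokube-registry-secret>.yaml
```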
How to protect the platform resources from a disaster
Introduction
A LiveSynchronization resource indicates a set of Kubernetes resources to replicate or synchronize between the source cluster and the destination cluster.
Create the livesynchronization.yaml file according to your requirements. For this example, the goal is to synchronize Deployments with the disaster-recovery label set to enabled. Additionally, once replication completes, no pods should be created in the destination cluster, and after a recovery is launched the Deployment should launch active pods again.
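The manifest below is a minimal sketch consistent with the fields described next; the exact shape of the resource identifier, the filter (labelSelector) and the patch values are assumptions to be checked against the product reference:

```yaml
apiVersion: automation.astronetes.io/v1alpha1
kind: LiveSynchronization
metadata:
  name: deployments-sync
  namespace: dr-config
spec:
  suspend: true                  # recovery plan starts inactive; see below
  config:
    sourceName: source           # name of the source KubernetesCluster resource
    destinationName: destination # name of the destination KubernetesCluster resource
    replication:
      resources:
        - resource:
            apiGroup: apps
            apiVersion: v1
            resource: deployments
          filters:
            labelSelector:       # assumed filter shape; see the Filters section
              matchLabels:
                disaster-recovery: enabled
          transformation:
            patch:               # scale to zero while on standby
              - op: replace
                path: /spec/replicas
                value: 0
          recoveryProcess:
            fromPatch:           # restore active pods after a recovery
              - op: replace
                path: /spec/replicas
                value: 1
```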
spec.config.sourceName and spec.config.destinationName refer to the name and namespace of the KubernetesCluster resources for the corresponding clusters.
spec.config.replication.resources is a list of the sets of resources to synchronize. A single LiveSynchronization can cover multiple types or groups of resources, although this example only manages Deployments.
The type of the resource is defined at spec.config.replication.resources[*].resource. The filters can be found in spec.config.replication.resources[*].filters. In this case, the filter matches the content of the disaster-recovery label.
The spec.config.replication.resources[*].transformation and spec.config.replication.resources[*].recoveryProcess establish the actions to take after each resource is synchronized and after it is affected by the recovery process, respectively. In this case, while being replicated, each Deployment will have its replicas set to 0 in the destination cluster, and set back to one after a successful recovery. The resource parameters are always left intact in the source cluster.
2. Suspending and resuming a recovery plan
A keen eye might have noticed the spec.suspend parameter. In this example it is set to true to indicate that the recovery plan is inactive. An inactive or suspended recovery plan will not replicate new or existing resources until it is resumed. Resuming a recovery plan can be done by setting spec.suspend to false and applying the YAML changes. Alternatively, a patch with kubectl will work as well and will not require the original YAML file:
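A sketch of such a patch, assuming the Custom Resource is addressable as livesynchronization (the resource name is an assumption):

```sh
kubectl patch livesynchronization <name> -n <namespace> \
  --type merge \
  -p '{"spec":{"suspend":false}}'
```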
Depending on the use case and the chosen solution for Resiliency Operator, it may be convenient for resources synchronized in the destination cluster to differ from the original copy. Taking a warm standby scenario as an example: in order to optimize infrastructure resources, certain objects such as Deployments or CronJobs do not need to be actively running until there is a disaster. The standby destination cluster can run with minimal computing power and autoscale as soon as the recovery process starts, reducing the required overhead expenditure.
While a resource is being synchronized into the destination cluster, its properties can be transformed to adapt them to the organization's necessities. Then, if and when a disaster occurs, the resource characteristics can be restored to either their original state or an alternative one with the established recovery process.
Filters
Filters are useful to select only the exact objects to synchronize. They are set in the spec.config.replication.resources[*].filters parameter.
Name selector
The nameSelector filters by the name of the resources of the version and type indicated. The following example selects only the ConfigMaps whose names match the regular expression config.*:
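A sketch of how this filter might look; the key holding the regular expression is hypothetical:

```yaml
filters:
  nameSelector:
    regexp: "config.*" # hypothetical key holding the regular expression
```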
Namespace selector
The namespaceSelector filters resources taking into consideration the namespace they belong to. This selector is useful to synchronize entire applications if they are stored in a single namespace. The following example selects every Deployment placed in a namespace with the label disaster-recovery: enabled:
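A sketch, assuming the selector follows the standard Kubernetes matchLabels convention:

```yaml
filters:
  namespaceSelector:
    matchLabels:
      disaster-recovery: enabled # selects objects in namespaces carrying this label
```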
Transformations
Transformations are set in the spec.config.replication.resources[*].transformation parameter and are managed through patches.
Patch modifications alter the underlying object definition using the same mechanism as kubectl patch. As with jsonpatch, the allowed operations are replace, add and remove. Patches are defined in the spec.config.replication.resources[*].transformation.patch list, which admits an arbitrary number of modifications.
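For illustration, a patch list with one modification per allowed operation; the paths are hypothetical examples for a Deployment:

```yaml
transformation:
  patch:
    - op: replace               # overwrite an existing field
      path: /spec/replicas
      value: 0
    - op: add                   # create a new field or list entry
      path: /metadata/labels/standby
      value: "true"
    - op: remove                # drop a field entirely
      path: /spec/paused
```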
While Resiliency Operator supports multiple transformations for the same LiveSynchronization, it does not cover having more than one transformation for the same resource group. Transformations that cover different resources of the same resource group should be in different recovery plans. The same resource or resource set can only be affected by up to one transformation and cannot be present in more than one LiveSynchronization.
RecoveryProcess
The RecoveryProcess of a LiveSynchronization is executed in the case of a disaster to recover the original status of the application in the destination cluster. A resource can be restored either from the original definition stored in a bucket or by performing custom patches, as with Transformations.
To restore from the original data, read the Recovering from a Bucket section. This option will disregard performed transformations and replace the parameters with those of the source cluster.
Patching when recovering is configured in the spec.config.replication.resources[*].recoveryProcess.fromPatch list, which admits an arbitrary number of modifications. It acts on the current state of the resource in the destination cluster, meaning it takes into consideration the transformations performed when the resource was synchronized, unlike when recovering from the original. As with jsonpatch, the allowed operations are replace, add and remove.
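A sketch mirroring the transformation above, restoring the replica count during recovery:

```yaml
recoveryProcess:
  fromPatch:
    - op: replace
      path: /spec/replicas
      value: 1 # acts on the transformed state in the destination cluster
```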
How to save objects and recover them using object storage.
Introduction
A Bucket resource indicates an Object Storage that will be used to restore original objects when recovering from a disaster.
Object Storage stores data in an unstructured format in which each entry represents an object. Unlike other storage solutions, there is no relationship or hierarchy between the data being stored. Organizations can access their files as easily as with traditional hierarchical or tiered storage. Object Storage benefits include virtually infinite scalability and high availability of data.
Many Cloud Providers include their own flavor of Object Storage, and most tools and SDKs can interact with them as they share the same interface. Resiliency Operator officially supports the following Object Storage solutions:
Resiliency Operator can support multiple buckets in different providers as each one is managed independently.
Contents stored in a bucket
A bucket is assigned to a LiveSynchronization by setting it in the spec.config.bucketName item. The bucket stores every object synchronized to the destination cluster, with some internal control annotations added. In the case of a disaster, resources with recoveryProcess.fromOriginal.enabled set to true will be restored using the bucket configuration.
The path of a stored object is as follows: <bucket_namespace>/<bucket_name>/<object_group-version-resource>/<object_namespace>.<object_name>.
Requirements
At least one instance of an Object Storage service in one of the supported Cloud Providers. This is commonly known as a bucket and will be referred to as such in the documentation.
At least one pair of accessKeyID and secretAccessKey that grants both write and read permissions over all objects of the bucket. Refer to the chosen cloud provider's documentation to learn how to create and extract them. It is recommended that each access key pair has access to only a single bucket.
Preparing and setting the bucket
Create the secret
Store the following file and apply it into the cluster, substituting the template parameters with real ones.
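Since the manifest is not reproduced here, the following is a minimal sketch; the Secret name and key names are assumptions matching the requirements above:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: bucket-credentials # hypothetical name, referenced by the Bucket resource
  namespace: <namespace_name>
type: Opaque
stringData:
  accessKeyID: <access_key_id>         # key pair with read and write access to the bucket
  secretAccessKey: <secret_access_key>
```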
If the LiveSynchronization does not set spec.resources[x].recoveryProcess.fromOriginal.enabled to true, where x refers to the index of the desired resource, the contents of the bucket will not be used. For the configuration to work, make sure both the bucket reference and recovery process transformations are correctly set.
Indicating which bucket to use can be accomplished by configuring spec.config.bucketName as in the following example:
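For example (the bucket name is borrowed from the log samples later in this document):

```yaml
spec:
  config:
    bucketName: bucket-dev # name of the Bucket resource to store synchronized objects
```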
Store the following file and apply it into the cluster, substituting the template parameters with real ones.
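A sketch of the Bucket resource, under the assumption that it shares the operator's API group and references the credentials Secret created above; every spec field name here is hypothetical:

```yaml
apiVersion: automation.astronetes.io/v1alpha1 # assumed API group
kind: Bucket
metadata:
  name: bucket-dev
  namespace: dr-config
spec:
  endpoint: s3.us-east-1.amazonaws.com # region must match the target bucket (see note below)
  bucketName: <bucket_name>
  credentialsSecretRef:
    name: bucket-credentials # Secret holding accessKeyID and secretAccessKey
```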
S3 requires that the region in the endpoint matches the region of the target bucket. It has to be explicitly set, as AWS does not infer the bucket's region, e.g. us-east-1 for North Virginia.
If the Recovery Plan does not set spec.resources[x].recoveryProcess.fromOriginal.enabled to true, where x refers to the index of the desired resource, the contents of the bucket will not be used. For the configuration to work, make sure both the bucket reference and recovery process transformations are correctly set.
Indicating which bucket to use can be accomplished by configuring spec.BucketRef as in the following example:
Reconciliation of synchronized resources between the source and destination clusters.
Introduction
Under particular circumstances, some objects might not be synchronized from the source cluster to the destination cluster. To cover this case, Resiliency Operator offers a reconciliation process that adds, deletes or updates objects in the destination cluster whenever their state differs from the source.
Auto pruning
The resynchronization process will delete resources in the destination cluster that are not present in the source cluster. It is recommended to suspend the target LiveSynchronization before recovering from a disaster to avoid potential data loss.
Architecture
Reconciliation is performed at the LiveSynchronization level. Each LiveSynchronization is in charge of its covered objects and of keeping them up to date with the specification. Reconciliation is carried out by two components, the EventsListener and the Reconciler. The former is in charge of additive reconciliation and the latter of subtractive reconciliation.
Additive reconciliation
Refers to the reconciliation of missing objects that are present in the source cluster but, for whatever reason, are not present or not up to date in the destination cluster. The entry point is the EventsListener service, which receives events with the current state in the source cluster of all the objects covered by the Recovery Plan, with a period of one hour by default.
These resync events are then treated like regular events and follow the synchronization communication flow. If the object does not exist in the destination cluster, the Synchronizer will apply it. In the case of updates, only those with a resourceVersion greater than the existing one for that object will be applied, updating the definition of said object.
Subtractive reconciliation
In the case that an object was deleted in the source cluster but not in the destination, the additive reconciliation will not detect it. The source cluster can send events containing the current state of its existing components, but not of those that have ceased to exist.
For that, the Reconciler is activated with a period of one hour by default. It compares the state of the objects covered by the Recovery Plan in both the source and destination clusters. If a discrepancy is found, it creates a delete event in NATS. This event is then processed as a usual delete event throughout the rest of the communication process.
Modifying the periodic interval
By default, the resynchronization process will be launched every hour. This can be changed by modifying the value at spec.config.resyncPeriod in the LiveSynchronization object. The admitted format is %Hh%Mm%Ss, e.g. 1h0m0s for intervals of exactly one hour. Modifying this variable updates the schedule for both additive and subtractive reconciliations.
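For example, to reconcile every thirty minutes:

```yaml
spec:
  config:
    resyncPeriod: 0h30m0s # %Hh%Mm%Ss format; applies to both reconciliations
```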
After defining a LiveSynchronization, a Task resource will be created in the destination cluster. The operator processes the spec.config.replication.resources[*].recoveryProcess parameter to define the required steps to activate the dormant applications. Taking as an example the following definition:
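A sketch of such a definition, reusing the hypothetical field shapes from the earlier examples:

```yaml
spec:
  config:
    replication:
      resources:
        - resource:
            apiGroup: apps
            apiVersion: v1
            resource: deployments
          recoveryProcess:
            fromPatch:
              - op: replace # wake the dormant Deployments up
                path: /spec/replicas
                value: 1
```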
The resulting Task object should not be tampered with. It is managed by its adjacent LiveSynchronization.
On the day of a disaster
Recovering from a disaster requires deploying a TaskRun resource for each applicable Task in order to recover the system and applications. The following example creates a TaskRun that executes the Task defined in the previous section:
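A minimal sketch of such a TaskRun; the kind is named in this section, but its exact schema and the field referencing the Task are assumptions:

```yaml
apiVersion: automation.astronetes.io/v1alpha1 # assumed API group
kind: TaskRun
metadata:
  name: recover-deployments
  namespace: dr-config
spec:
  taskName: <task_name> # hypothetical reference to the Task created by the operator
```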
1.4 - Monitoring
Monitor the state of the synchronization and recovery process
1.4.1 - Audit fields
Parameters built into Resiliency Operator to track when a change was made and who made it
Auditing and version control are important when configuring resources. Knowing when a change was made and which account applied it can be decisive in an investigation into an issue or a configuration mismanagement.
Audit fields
The following annotations are attached to every resource that belongs to the Resiliency Operator Custom Resources:
```yaml
apiVersion: automation.astronetes.io/v1alpha1
kind: LiveSynchronization
metadata:
  annotations:
    audit.astronetes.io/last-update-time: "<date>" # Time at which the last update was applied.
    audit.astronetes.io/last-update-user-uid: "<uid-hash>" # Hash representing the Unique Identifier of the user that applied the change.
    audit.astronetes.io/last-update-username: "<username>" # Human readable name of the user that applied the change.
```
Fields are updated only when a change to the .spec, .labels or .annotations fields is detected. Status modifications by the operator are not recorded.
Objects that are synchronized will not carry these annotations.
1.4.2 - Understanding logging
How to interpret Disaster Recovery Operator log messages and manage them
Disaster Recovery Operator implements a logging system throughout all its components so that the end user has visibility into the system.
JSON fields
| Name | Description |
| --- | --- |
| level | Log level at write time. |
| timestamp | Time at which the log was written. |
| msg | Log message. |
| process | Information about the process identity that generated the log. |
| event | Indicates if the log refers to a create, update or delete action. |
| sourceObject | Object related to the source cluster that is being synchronized. |
| oldSourceObject | Previous state of the sourceObject. Only applicable to update events. |
| sourceCluster | Information about the source managed cluster. |
| destinationObject | Object related to the destination cluster. |
| destinationCluster | Information about the destination managed cluster. |
{"level":"info","timestamp":"2023-11-28T18:05:26.904276629Z","msg":"object read from cluster","process":{"id":"eventslistener"},"sourceCluster":{"name":"source","namespace":"dr-config","resourceVersion":"91015","uid":"3c39aaf0-4216-43a8-b23c-63f082b22436"},"sourceObject":{"apiGroup":"apps","apiVersion":"v1","name":"nginx-deployment-five","namespace":"test-namespace-five","resource":"deployments","resourceVersion":"61949","uid":"5eb6d1d1-b694-4679-a482-d453bcd5317f"},"oldSourceObject":{"apiGroup":"apps","apiVersion":"v1","name":"nginx-deployment-five","namespace":"test-namespace-five","resource":"deployments","resourceVersion":"61949","uid":"5eb6d1d1-b694-4679-a482-d453bcd5317f"},"lastUpdate":{"time":"2023-11-25T13:12:28.251894531Z","userUID":"165d3e9f-04f4-418e-863f-07203389b51e","username":"kubernetes-admin"},"event":{"type":"update"}}
An object was uploaded to a recovery bucket.
{"level":"info","timestamp":"2023-11-28T18:05:27.593493962Z","msg":"object uploaded in bucket","sourceObject":{"apiGroup":"apps","apiVersion":"v1","name":"helloworld","namespace":"test-namespace-one","resource":"deployments","resourceVersion":"936","uid":"7c2ac690-3279-43ca-b14e-57b6d57e78e1"},"oldSourceObject":{"apiGroup":"apps","apiVersion":"v1","name":"helloworld","namespace":"test-namespace-one","resource":"deployments","resourceVersion":"936","uid":"7c2ac690-3279-43ca-b14e-57b6d57e78e1"},"process":{"id":"processor","consumerID":"event-processor-n74"},"bucket":{"name":"bucket-dev","namespace":"dr-config","resourceVersion":"91006","uid":"47b50013-3058-4283-8c0d-ea3a3022a339"},"bucketObject":{"path":"dr-config/pre/apps-v1-deployments/test-namespace-one.helloworld"},"lastUpdate":{"time":"2023-11-25T13:12:29.625399813Z","userUID":"165d3e9f-04f4-418e-863f-07203389b51e","username":"kubernetes-admin"}}
Managing logs
Message structure varies depending on the operation that originated it.
The sourceCluster and destinationCluster fields are only present for operations that required direct access to either cluster. For the former, only messages originating from the eventsListener, processor or reconciler services can include it in their logs. The latter will only be present in synchronizer or reconciler log messages. These parameters will not be present for internal messages, such as those coming from NATS, since there is no direct connection with either cluster.
oldSourceObject is the previous state of the object when performing an update operation. It is not present in other types.
When the bucket and bucketObject parameters are present, the operation is performed against the indicated bucket without any involvement of the source and destination clusters. For create operations, an object was uploaded for the first time to the bucket; for updates, an existing one is modified; and for delete, an object was deleted from the specified bucket.
These characteristics can be exploited to improve log searches by narrowing down the messages to those that are relevant at the moment. As an example, the following command outputs only those logs that affect the source managed cluster, by filtering out the messages that lack the sourceCluster field.
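A sketch, assuming the operator logs are read from a known Deployment (the namespace and Deployment names are placeholders):

```sh
# Keep only log lines that carry a sourceCluster field
kubectl logs -n <operator_namespace> deploy/<operator_deployment> \
  | jq -c 'select(.sourceCluster != null)'
```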
This could be useful when trying to debug and solve connection issues that might arise.
Log messages
The log message is located in the msg parameter. It can be read and interpreted to establish the severity of the log. The following tables group the different log messages depending on whether they should be treated as errors or informative.
Error messages
| msg |
| --- |
| “error reading server groups and resources” |
| “error reading resources for group version” |
| “error getting namespace from cluster” |
| “error creating namespace in cluster” |
| “error getting object from cluster” |
| “error creating object in cluster” |
| “error updating object in cluster” |
| “error listing objects in cluster” |
| “error deleting object in cluster” |
| “error uploading object in bucket” |
| “error deleting object from bucket” |
| “error getting object from bucket” |
Informative messages
Not found objects are not errors
Messages regarding objects that are not found do not represent errors, but rather normal behaviour while synchronizing objects that are not present in one of the clusters.
| msg |
| --- |
| “reading server groups and resources” |
| “server group and resources read from cluster” |
| “reading resources for group version” |
| “resource group version not found” |
| “group resource version found” |
| “reading namespace from cluster” |
| “namespace not found in cluster” |
| “namespace read from cluster” |
| “creating namespace from cluster” |
| “namespace already exists in cluster” |
| “namespace created in cluster” |
| “reading object from cluster” |
| “object not found in cluster” |
| “object read from cluster” |
| “creating object in cluster” |
| “object created in cluster” |
| “updating object in cluster” |
| “object updated in cluster” |
| “deleting object in cluster” |
| “object deleted in cluster” |
| “listing objects in cluster” |
| “list objects not found in cluster” |
| “listed objects in cluster” |
| “uploading object in bucket” |
| “object uploaded in bucket” |
| “deleting object from bucket” |
| “object deleted from bucket” |
| “getting object from bucket” |
| “object got from bucket” |
| “listing object from bucket” |
1.4.3 - Grafana setup
How to configure Grafana
Resiliency Operator offers the option of leveraging an existing Grafana installation to monitor the state of the synchronization and recovery process. Users can incorporate the provided visualizations into their workflows in a transparent manner without affecting their operability.
1. Requirements
Grafana Operator
The operator installation includes the necessary tools to extract the monitoring information. To view that information with the official dashboard, the management cluster is required to have the Grafana Operator installed.
Astronetes Disaster Recovery Operator supports Grafana v4 and Grafana v5.
2a. Using Grafana Operator v4
Create the GrafanaDashboard from the release manifests:
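A sketch of that step; the manifest path within the release artifacts is hypothetical:

```sh
kubectl apply -n <grafana_namespace> \
  -f <release_manifests>/grafana/v4/grafanadashboard.yaml
```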
The dashboard shows detailed information about the write, read and computing processes alongside a general overview of the health of the operator.
General view of the status of the operator:
The dashboard can be filtered according to the following characteristics:
Namespace. Only shows information related to the LiveSynchronizations in a specified namespace.
Recovery Plan. Filters by a specific LiveSynchronization.
Object Namespace. Only shows information for the objects located in a given namespace, regardless of their associated LiveSynchronization.
Object API Group. Objects are filtered according to the API Group they belong to.
Filters can be combined to get more specific results, e.g. getting the networking-related objects that belong to a LiveSynchronization deployed in a given namespace.
1.5 - Configuration
Plugin parameters and accepted values
LiveSynchronization
Configuration
| Name | Description | Type | Required |
| --- | --- | --- | --- |
| sourceName | Kubernetes Cluster acting as source | string | yes |
| destinationName | Kubernetes Cluster acting as destination | string | yes |
| bucketName | Bucket name to upload the synchronization contents | | |