Installing PubSub+ Cloud in Azure Kubernetes Service (AKS)

Azure Kubernetes Service (AKS) simplifies deploying a managed Kubernetes cluster in Azure by offloading the operational overhead to Azure. As a hosted Kubernetes service, Azure handles critical tasks, like health monitoring and maintenance. For more information about AKS, see the Azure Kubernetes Service documentation.

There are a number of environment-specific steps that you must perform to install PubSub+ Cloud.

Before you perform the environment-specific steps described below, ensure that you review and fulfill the general requirements listed in Common Kubernetes Prerequisites.

Solace does not support event broker service integration with service meshes. Service meshes include Istio, Cilium, Linkerd, Consul, and others. If deploying to a cluster with a service mesh, you must:

  • exclude the target-namespace used by PubSub+ Cloud services from the service mesh.
  • set up connectivity to event broker services in the cluster using LoadBalancer or NodePort. See Exposing Event Broker Services to External Traffic for more information.
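
If your cluster runs a service mesh that uses automatic sidecar injection, you can typically exclude the target-namespace with a namespace label. The following is a minimal sketch assuming Istio; the namespace name is a placeholder for your own target-namespace:

    # Exclude the PubSub+ Cloud target-namespace from Istio sidecar injection
    apiVersion: v1
    kind: Namespace
    metadata:
      name: my-pubsub-namespace     # placeholder: your target-namespace
      labels:
        istio-injection: disabled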

Solace provides reference Terraform projects for deploying a Kubernetes cluster to AKS, EKS, and GKE. These Terraform projects have the recommended configuration settings, such as worker node sizes, resource configurations, taints, and labels optimized to install PubSub+ Cloud.

You can download the reference Terraform projects from our GitHub repository: https://github.com/SolaceLabs/customer-controlled-region-reference-architectures

Note that all sample scripts, Terraform modules, and examples are provided as-is. You can modify the files as required, and you are responsible for maintaining the modified files for your Kubernetes cluster.

AKS Cluster Prerequisites

The following are the technical prerequisites for an AKS cluster used to deploy event broker services:

Worker Nodes
Worker nodes that you use for PubSub+ Cloud components must use ephemeral disks for the OS. Solace recommends that you (the customer) use ephemeral disks because Azure's premium managed disks don't provide the performance required for Kubernetes unless you use high-cost disks, while ephemeral OS disks are both performant and cost-effective. Because worker nodes don't persist critical information on the OS disk, there's no requirement to use non-ephemeral disks.
Permissions
The following permissions are required for you to deploy the Terraform module:
  • All the permissions that are required to create and manage the AKS cluster. These permissions can be delegated by the Terraform module you use.
  • Permission to create Service Principals and assign the Contributor role over the whole resource group and the Network Contributor role over the private subnets.

    These permissions are given to the AKS cluster so that it can create and configure load balancers. They also allow the CNI to interact with subnets and route tables.

  • The AKS-managed service (called AzureContainerService) must also be assigned permissions. Specifically, the AzureContainerService AD Application is assigned the Network Contributor role over the entire resource group. The Terraform module requires this permission to read the NAT gateway configuration when it creates a cluster, and to configure networking as required by Azure CNI.
  • An Azure account. The Terraform module you create requires permissions to create and manage the following resources:
    All Virtual Machine resources
    The Terraform module requires access to the VM resources to create the service principal, the AD application, and the AKS cluster. Note that in the available example, the AKS cluster is named using the data center-name value from the vars.tf file, following the convention <data center-name>-aks.
    VNet
    The Terraform module requires this permission to configure the Virtual Network (VNet).
    Standard Load Balancers
    The Terraform module requires this permission to create and configure the load balancers.
    Premium LRS-managed Disks
    The Terraform module requires this permission to access LRS disks.
    Subnets
    The Terraform module requires this permission to set up the subnets.
    Security Groups
    The Terraform module requires this permission to set up the necessary security groups.
    Routing Tables
    The Terraform module requires this permission to create the appropriate route tables and gateways.
    Public IPs
    The Terraform module requires this permission to attach public IP addresses to the NAT gateway.

AKS Cluster Specifications

Before you (the customer) install the Mission Control Agent, you must configure the AKS cluster with the technical specifications listed in the sections that follow.

Node Pool Requirements

For high-availability event broker services, the cluster requires 12 node pools for event broker services. These must be split into four sets of three node pools. Each node pool must be locked to a single availability zone. Locking a node pool to an availability zone allows the cluster autoscaler to function properly. Solace uses pod anti-affinity against the node pools' zone label to ensure that each pod in a high-availability event broker service is in a separate availability zone.

For high-availability event broker services, the default (system) node pool spans all three availability zones.
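
As an illustration of how this zone layout is used, the following sketch shows a pod anti-affinity rule keyed to the zone label; the app label and image are placeholders, not the values the Mission Control Agent actually deploys:

    # Illustrative only: anti-affinity that forces pods with the same label
    # into different availability zones.
    apiVersion: v1
    kind: Pod
    metadata:
      name: broker-example
      labels:
        app: my-event-broker            # placeholder label
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: my-event-broker
              topologyKey: topology.kubernetes.io/zone   # kubernetes.io/hostname in regions without AZs
      containers:
        - name: broker
          image: example/broker:latest  # placeholder image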

The node pools must also meet the following requirements:

  • Configure the node pool settings so that the OS disk type is ephemeral and the OS disk size is 48 GB.

  • AKS worker nodes for monitoring must be a minimum of Standard_D2s_v3. The following table shows the minimum worker node size required based on the largest plan that's supported for your deployment.

    Node Pool Type                    | Recommended Minimum VM Size | Number of Worker Nodes Required
    Monitoring                        | Standard_D2s_v3             | One for each service (the sum of all services of all types)
    Up to Enterprise 1K (Kilo)        | Standard_E2s_v3             | Two for each Enterprise 1K service
    Up to Enterprise 10K (Giga)       | Standard_E4s_v3             | Two for each Enterprise 10K service
    Up to Enterprise 100K (Tera 100k) | Standard_E8s_v3             | Two for each Enterprise 100K service

  • For deployments to Customer-Controlled Regions, these are the virtual machine requirements:

    Plans                         | Possible Virtual Machine Types | VM Cores (mCore) | Allocatable Cores (mCore) | VM Memory (GiB) | Allocatable Memory (GiB)
    Monitoring Nodes              | Standard_D2s_v3                | 2,000            | 1,900                     | 8               | 5
                                  | E2as_v4                        | 2,000            | 1,900                     | 16              | 12.3
                                  | E2s_v3                         | 2,000            | 1,900                     | 16              | 12.3
    Developer,                    | E2as_v4                        | 2,000            | 1,900                     | 16              | 12.3
    Enterprise 250 (Nano), or     | E2s_v3                         | 2,000            | 1,900                     | 16              | 12.3
    Enterprise 1K (Kilo)          | Standard_D4s_v3                | 4,000            | 3,860                     | 16              | 12.3
    Enterprise 5K (Mega) or       | E4as_v4                        | 4,000            | 3,860                     | 32              | 27
    Enterprise 10K (Giga)         | E4s_v3                         | 4,000            | 3,860                     | 32              | 27
                                  | Standard_D8s_v3                | 8,000            | 7,820                     | 32              | 27
    Enterprise 50K (Tera 50k) or  | E8as_v4                        | 8,000            | 7,820                     | 64              | 56.6
    Enterprise 100K (Tera 100k)   | E8s_v3                         | 8,000            | 7,820                     | 64              | 56.6
                                  | Standard_D16s_v3               | 16,000           | 15,740                    | 64              | 56.6
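
The reference Terraform projects apply taints and labels to these node pools so that broker pods land on appropriately sized nodes. As an illustration only (the label and taint keys below are hypothetical, not the values PubSub+ Cloud actually uses), a pod is steered onto a dedicated node pool like this:

    # Hypothetical fragment of a pod spec targeting a labeled, tainted node pool
    spec:
      nodeSelector:
        nodeType: messaging            # hypothetical node pool label
      tolerations:
        - key: serviceClass            # hypothetical taint key on the node pool
          operator: Equal
          value: prod1k
          effect: NoExecute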

Storage Class

For AKS, an autoscaler is included and the deployment script creates a storage class named managed-premium-zoned. The storage class works with PubSub+ Cloud to provide the following:

  • Locally Redundant Storage (LRS) redundancy. Solace requires an LRS disk because other types of redundancy are too slow. The PubSub+ Cloud Enterprise plans use high-availability services that replicate data across two LRS disks.
  • Solace requires block device-based storage as regular filesystem-based storage won't work with the event broker service. As such, Solace requires managed volumes instead of azurefile volumes.
  • The event broker services are designed to use the XFS file system. The fsType setting must be set to xfs to ensure the event broker services meet their required performance levels.
  • To support scale-up, the StorageClass must contain the allowVolumeExpansion property set to true.
  • To deploy PubSub+ Cloud, a custom StorageClass in AKS is required so that the Persistent Volume Claim (PVC) process creates the volume in the same AZ where the pods are scheduled. This storage class uses Managed Premium LRS disks and has the WaitForFirstConsumer binding mode, which instructs the PVC process to wait until a pod is scheduled before deciding which zone the disk is created in.

    This storage class should have properties similar to the following example:

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      labels:
        addonmanager.kubernetes.io/mode: EnsureExists
        kubernetes.io/cluster-service: "true"  
      name: managed-premium-zoned
    parameters:
      cachingmode: ReadOnly
      kind: Managed
      storageaccounttype: Premium_LRS
      fsType: xfs
      zoned: "true"
    provisioner: disk.csi.azure.com
    reclaimPolicy: Delete
    volumeBindingMode: WaitForFirstConsumer
    allowVolumeExpansion: true
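
    A Persistent Volume Claim that uses this storage class might look like the following sketch; the claim name and size are placeholders:

    # Example PVC against managed-premium-zoned; name and size are placeholders.
    # With WaitForFirstConsumer, the disk is not provisioned until a pod using
    # this claim is scheduled, so it is created in that pod's availability zone.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: broker-data
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: managed-premium-zoned
      resources:
        requests:
          storage: 100Gi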

Bastion Host

  • Solace recommends that you have two Bastion hosts dispersed across the availability zones. This allows remote access to your deployment should a zone fail. If remote access to your deployment during a zone failure is not a requirement, you can use a minimum of one Bastion host instead. The Bastion host is deployed using a minimal VM that runs only an SSH server.

  • To determine the size of the subnet, you must know the number of event broker services (and therefore pods) that you plan to run on the AKS cluster. Refer to the Azure subnet documentation for details.

Load Balancer

When using AKS, event broker services are exposed through a single public network load balancer. The source IP address for outgoing connections to internet hosts is a static IP address; this static IP address is used as a front-end public IP associated with the AKS public Standard Load Balancer.

You must use a Standard Load Balancer SKU instead of a Basic Load Balancer SKU. The Standard Load Balancer is required to act as a NAT solution and avoids the need to segregate the AKS cluster into zonal stacks or to deploy a separate NAT. It also allows you to deploy the AKS cluster to a single, private subnet using a single route table, and it simplifies deployment and CIDR planning when you use VNet peering technologies, such as a hub-spoke network topology in Azure.
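
As a sketch of how an event broker service might be exposed through this load balancer (the service name, selector, and port list are placeholders; SMF on port 55555 is the default PubSub+ messaging port):

    # Minimal sketch: a LoadBalancer service that provisions a front end on the
    # AKS public Standard Load Balancer. Names and selector are placeholders.
    apiVersion: v1
    kind: Service
    metadata:
      name: my-broker-lb
    spec:
      type: LoadBalancer
      selector:
        app: my-event-broker
      ports:
        - name: smf
          port: 55555
          targetPort: 55555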

Networking

  • To spread the pods of an event broker service over three different Availability Zones (AZs), pod anti-affinity is used (see the sketch in Node Pool Requirements above). For regions that support AZs, set the topologyKey to topology.kubernetes.io/zone; otherwise, for AKS clusters that do not have AZs, set it to kubernetes.io/hostname.

  • Configure how pod IP addresses are assigned: with kubenet, pod IP addresses are assigned by the cluster; with azure (Azure CNI), they are assigned from the VNet subnet. When you select azure, each worker node is pre-allocated 30 IP addresses from the VNet; kubenet does not pre-allocate IP addresses.

  • Determine the number of SNAT (outbound) ports for each VM in the cluster; this value must be between 0 and 64,000. The number you choose determines the number of worker nodes available: a lower number gives you more worker nodes, but fewer outbound connections per node. For more information about SNAT ports, see Outbound Rules Azure Load Balancer and Scenarios with outbound rules on the Microsoft website.

  • The requirements are one pod per Developer service and three pods for each other event broker service. The subnets must be big enough to contain all the IP addresses used for the pods; for example, ten high-availability event broker services require at least 30 pod IP addresses. For more information, see Configure Azure CNI networking in Azure Kubernetes Service (AKS) on the Microsoft website.

For information about using Azure Kubernetes Service, see the Azure documentation site.

IP Range

There are two networking options in Azure: Kubenet and Azure CNI. For Customer-Controlled Regions, Solace recommends using Kubenet. Kubenet offers the most efficient CIDR requirements for the VNet containing the cluster.

CIDR requirements for Dedicated Regions depend on your need to access event broker services privately through peering. If this is necessary, Solace requires a CIDR that is compatible with your network plan.

You should carefully consider future expansion requirements when estimating the CIDR size required for your AKS cluster. Once deployed, you cannot change the CIDR, and expanding the size of your VNet is not simple. For more information, see AKS Cluster Specifications. You can also use the Solace-provided, downloadable Excel-based CIDR calculator to help calculate your CIDR requirements.

Deployments in Regions With No Availability Zones

Some regions do not have availability zones (AZs). You can deploy to these regions, but the IaaS has a reduced fault tolerance without AZs.

To deploy to regions that don't have AZs available, set the topologyKey in the pod anti-affinity to kubernetes.io/hostname; when AZs are available, it is set to topology.kubernetes.io/zone.

Autoscaling

Your cluster requires autoscaling to provide the appropriate level of available resources for your event broker services as their demands change. Solace recommends using the Kubernetes Cluster Autoscaler, which you can find in the Kubernetes GitHub repository at: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler.

See Automatically scale a cluster to meet application demands on Azure Kubernetes Service (AKS) in the Microsoft Azure documentation for information about implementing a Cluster Autoscaler.
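
If you run the Cluster Autoscaler yourself rather than using the AKS-managed option, its configuration looks roughly like the following sketch; the image tag, node pool name, and node counts are placeholders:

    # Minimal sketch of a self-managed Cluster Autoscaler container spec on AKS
    containers:
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0   # placeholder tag
        command:
          - ./cluster-autoscaler
          - --cloud-provider=azure
          - --balance-similar-node-groups=true    # keeps the zonal node pools evenly sized
          - --nodes=1:10:monitoring-pool          # min:max:node-pool-name (placeholders)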