Note that pretty much all controllers consume:
- CPU: largely based on the number of reconciliations they perform, which are generally related to event activity for resources they’re watching.
- Memory: largely based on the number of primary resources that exist (multiplied by some factor based on the number of operand resources they need to watch as a result) via informer caches.
There is also a concern that one Pod or Container could monopolize all available resources, so Cluster admins must consider the effects that one Pod or Container may have on other components.
To prevent a container from consuming all the resources on a cluster or blocking other workloads from being scheduled, many production clusters define ResourceQuota configurations.
The ResourceQuota configuration also applies to tenant workloads that are managed by your Operator. Cluster administrators will typically set a ResourceQuota for each tenant’s namespace as part of the onboarding. If a LimitRange with default values has not been created in each namespace, and your Operator creates Containers inside the tenant namespace without specifying at least resource requests for CPU and Memory, then the system or quota may reject Pod creation. Check the following statements obtained from the K8s docs:
“If a LimitRange is activated in a namespace for computing resources like CPU and memory, users must specify requests or limits for those values. Otherwise, the system may reject Pod creation.” (Reference).
“If quota is enabled in a namespace for compute resources like cpu and memory, users must specify requests or limits for those values; otherwise, the quota system may reject pod creation.” (Reference).
To support clusters with the above configuration, ensure safe operations, and avoid negatively impacting other workloads, Operators should always include reasonable memory and CPU resource requests for their own deployment as well as for the operands they deploy.
HINT Cluster admins might also be able to avoid the above scenario by setting default values that apply to each Pod and/or Container in a namespace when none are specified.
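As an illustration of the hint above, a cluster admin could create a LimitRange that supplies defaults for containers that omit requests or limits. This is a minimal sketch: the name, namespace, and all values are hypothetical and would need to be tuned for the tenant.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-defaults        # hypothetical name
  namespace: tenant-a          # hypothetical tenant namespace
spec:
  limits:
    - type: Container
      defaultRequest:          # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:                 # applied when a container omits limits
        cpu: 500m
        memory: 256Mi
```

Operators should not rely on such defaults being present, since not every cluster or namespace defines them.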
Therefore, your Operator should always apply at least resource requests for
Pods/Deployments that it creates as part of the reconciliation logic. Ideally your
Operator also applies memory limits to those
Pods/Deployments. You may also consider CPU limits.
Resource requests and limits for the Operator Deployment can be defined in its manifest, as shown below:
...
# TODO(user): Configure the resources accordingly based on the project requirements.
# More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
resources:
  requests:
    cpu: 10m
    memory: 64Mi
...
How to compute default values
IMPORTANT: A single configuration that fits all scenarios is not possible. Therefore, Operator authors MUST ensure that Cluster Admins and their users can change the resource requests/limits of the Operator/manager and of its Operands.
However, you can benchmark your Operator by monitoring its resource usage to arrive at good and reasonable values for the general cases. Kubebuilder and SDK provide some metrics which can help you with this.
NOTE Also, be aware that if the project was generated by the Kubebuilder or SDK scaffold,
then some of these values are populated by default to get you started. However, you ought to
optimize them based on your own tests and the specific needs of your operator.
How to change the Operator/manager resources values when under OLM management
If your operator is managed by OLM, administrators or users can configure your operator’s resource requests and limits via the subscription.
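A minimal sketch of such a Subscription follows, using the `spec.config.resources` field. The operator name, channel, catalog source, namespaces, and resource values are all hypothetical placeholders.

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: my-operator                 # hypothetical
  namespace: operators              # hypothetical
spec:
  channel: stable                   # hypothetical channel
  name: my-operator                 # hypothetical package name
  source: my-catalog                # hypothetical CatalogSource
  sourceNamespace: olm              # hypothetical
  config:
    resources:                      # overrides the resources of the operator's containers
      requests:
        cpu: 100m
        memory: 64Mi
      limits:
        cpu: 500m
        memory: 128Mi
```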
Following are some general recommendations to manage the resources:
- MUST declare resource requests for both CPU and Memory for the Operator Pod and any Pod/Deployment managed by it.
- OPTIONALLY set resource limits for CPU and Memory for the Operator Pod and any Pod/Deployment managed by it.
- SHOULD provide the mechanisms for monitoring compute and memory resource usage so that Cluster Admins can use these metrics to monitor and resize the Operator and its Operands.
  CAVEAT: If the Operator is integrated with OLM and the bundle includes a resource whose CRD is provided by the Prometheus operator, the InstallPlan will fail on a cluster that does not have these CRDs or the Prometheus operator installed. In this case, you might want to enforce the requirement with an OLM dependency or make the requirement clear to the Operator consumers.
- SHOULD allow admins to customize the requests/limits amounts defined for the Pod/Deployment created by the Operator, and not hardcode these values.
- SHOULD document how your Operator consumers can customize/rightsize the resource requests and limits for the Operator and Operands Pod/Deployments, or describe how the solution could be configured to automatically adjust these values based on the environment. Ideally, the Operator automatically adjusts the values to the environment rather than asking its consumers to amend them. You might also consider leveraging the Vertical Pod Autoscaler to have the resources requested by the Operator automatically adjusted to the cluster where it is deployed. You might also look at allowing horizontal pod autoscaling for the other Pod/Deployments created by your Operator. In this case, be aware that resource requests are also required for horizontal pod autoscaling to work.
  CAVEAT: If you are using OpenShift as your Kubernetes distribution, you might want to check the doc Automatically adjust pod resource levels with the vertical pod autoscaler. Also, if the VPA CRD is not available in the cluster where the operator gets deployed, the InstallPlan will fail. Please see how to work with OLM dependency if your project is integrating with it.
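To make the Vertical Pod Autoscaler recommendation concrete, here is a minimal sketch of a VPA object targeting an Operator Deployment. It assumes the VPA CRD (`autoscaling.k8s.io/v1`) is installed on the cluster. The Deployment name, namespace, container name, and bounds are hypothetical.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-operator-vpa                       # hypothetical
  namespace: my-operator-system               # hypothetical
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-operator-controller-manager      # hypothetical Deployment name
  updatePolicy:
    updateMode: "Auto"                        # let the VPA apply its recommendations
  resourcePolicy:
    containerPolicies:
      - containerName: manager                # hypothetical container name
        minAllowed:                           # lower bound for recommended requests
          cpu: 10m
          memory: 64Mi
        maxAllowed:                           # upper bound for recommended requests
          cpu: 500m
          memory: 512Mi
```

Bounding the recommendations with `minAllowed` and `maxAllowed` keeps the autoscaler from shrinking or growing the requests beyond what has been tested.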
Why should you set these?
What happens when the resource requests are not set?
- configurations made by the cluster administrators such as ResourceQuota might not work without LimitRanges. The LimitRanger admission controller can provide default values for resource requests and limits when they have not been defined.
- the Operator’s consumers might face resource shortages on a node when resource usage increases, for example, during a daily peak in request rate.
- the Operator’s consumers might be unable to successfully deploy the Operator because it does not have the minimal resources available.
- the scheduler cannot make an informed placement choice when it picks the nodes the operator pods will be running on.
- when there is memory contention on the node the pod is likely to get either evicted or OOM killed.
- when there is CPU contention on the node the pod is likely to get starved of CPU cycles making the operator unresponsive.
What happens when the resource limits are not set?
Misconfigurations or faulty code can consume all available resources, affecting other components on the cluster. It might also leave the containers more vulnerable, for example to DoS attacks. More info. Note that when there is memory contention on a node, pods will start getting evicted and possibly killed according to their OOM score. The node will be flagged with a MemoryPressure condition and eventually made unschedulable. When there is CPU contention on the node, however, neighbouring pods may get slowed down to the CPU they requested.
However, a popular practice by cluster administrators is to leverage ResourceQuota to limit the total amount of resources that can be requested or allowed in a single namespace. This may protect against over consumption of resources by operands of a faulty or wrongly configured operator. On the other hand it also means that the operator may not be able to create additional pods, limiting its functionality when the limit has been reached.
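For reference, a ResourceQuota of the kind described here might look like the following sketch. The name, namespace, and amounts are hypothetical.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-compute         # hypothetical name
  namespace: tenant-a          # hypothetical tenant namespace
spec:
  hard:
    requests.cpu: "4"          # total CPU that Pods in the namespace may request
    requests.memory: 8Gi       # total memory that Pods in the namespace may request
    limits.cpu: "8"            # total CPU limits across the namespace
    limits.memory: 16Gi        # total memory limits across the namespace
    pods: "20"                 # maximum number of Pods
```

With such a quota in place, a Pod created by the Operator that would push the namespace over any of these totals is rejected at creation time.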
Also, you might want to check the following sections in the K8s docs:
- Motivation for CPU requests and limits and If you do not specify a CPU limit
- Motivation for Memory requests and limits and If you do not specify a memory limit
What happens when:
What happens when the resource limits have been reached?
For Memory: the container might be terminated with the reason OOMKilled. If it is restartable, the kubelet will restart it, as with any other type of runtime failure.
For CPU: the container might or might not be allowed to exceed its CPU limit for extended periods of time. However, it will not be killed for excessive CPU usage. CPU is considered a “compressible” resource: if the Pod starts hitting its CPU limits, Kubernetes relies on the kernel to throttle the container. That means the CPU will be artificially restricted, which only results in potentially worse performance.
You might want to check the Troubleshooting section in the Kubernetes documentation to better understand how to debug these scenarios.
Limits are specified but not requests
What happens when the resource limits are defined but not the requests?
If you specify a CPU or Memory limit for a Container but do not specify a request, Kubernetes automatically assigns a CPU or Memory request that matches the limit. As a result, you will always be requesting the limit and allocating more resources than required. (NOT RECOMMENDED)
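The following sketch illustrates this case. It assumes no LimitRange in the namespace supplies a default request. The Pod name and image are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: limits-only-example                     # hypothetical
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest    # hypothetical image
      resources:
        limits:
          cpu: 500m
          memory: 256Mi
        # No requests are set, so Kubernetes assigns
        # requests.cpu=500m and requests.memory=256Mi (equal to the limits).
```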
Values are too big
Memory and CPU requests and limits are associated with Containers, but be aware that the Memory and CPU requests and limits of a Pod are the sum of the corresponding values of all the containers in the Pod.
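As a small worked example of this sum, consider the sketch below with two containers. The names, images, and values are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: two-container-example                     # hypothetical
spec:
  containers:
    - name: manager
      image: registry.example.com/manager:latest  # hypothetical image
      resources:
        requests:
          cpu: 100m
          memory: 64Mi
    - name: sidecar
      image: registry.example.com/sidecar:latest  # hypothetical image
      resources:
        requests:
          cpu: 250m
          memory: 128Mi
# Pod-level requests are the sum over both containers:
#   cpu:    100m + 250m  = 350m
#   memory: 64Mi + 128Mi = 192Mi
```

The scheduler needs a node with at least 350m CPU and 192Mi memory unallocated to place this Pod.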
If the Memory or CPU requests of your Pods are too big, you will not only be allocating and
blocking more resources than needed. Your Operator consumers might also be able to install
your Operator via OLM, for example, but the Pods/Deployment will not run successfully
when the amount requested exceeds the available capacity. In these scenarios, the Operator consumers
will see that the Pod(s) failed to schedule, with event errors like
Insufficient cpu and/or