What is a Service Mesh?
The goal of service meshes is to provide a smart network for services. There are four pillars that offer the smart network or rather mesh: Connect, Secure, Control and Observe.
In 2019 we saw a lot of buzz around Service Mesh technology as it dominated conference tracks. Popular service mesh technologies include Istio, Linkerd, Maesh, and HashiCorp Consul. Service Meshes provide many features for distributed applications that are well worth considering. These include: Service Discovery, Routing, Service Identity and Authorization, Service Retries, Circuit Breaking and Observability.
Even if you do not decide to utilize a service mesh today it is still worth understanding the concepts behind Service Mesh solutions. This blog post will cover service mesh fundamentals and offer a range of references and descriptions to expand your knowledge.
Why use a Service Mesh?
Connection offers resilience for applications by providing configuration for the traffic between services. This function can enable fault injection, traffic mirroring, A/B Testing, and deployments such as canary and staged rollouts.
Security offers application-independent security. The idea is to take the responsibility for security out of the application and move it into the infrastructure. This can include service-to-service encryption and service-to-service authentication (transport and origin authentication).
Control offers a uniform abstraction for policy control. Configure policies to allow traffic redirection in response to real-time events and rule-based processing based on headers.
Observability offers visibility into application deployments. Visibility features allow for end-to-end monitoring, logging, metrics, and distributed tracing, all bundled at the service mesh level.
How does Service Mesh work?
A service mesh provides a proxy for every service in the mesh. The proxy interacts with services to provide end-to-end resilience, security, policy control, and visibility. Istio’s network proxy is an Envoy Proxy instance. Envoy is a layer 7 network proxy. Each proxy is deployed as a sidecar container. Having a Layer 7 or L7 proxy for service-to-service communication provides features like traffic shaping, service discovery and network policy control. In other mesh implementations, the proxy could be an L4 network proxy or a proxy that is deployed as a DaemonSet.
Mesh implementations provide a control plane to configure these proxies. The control plane’s components interact with the data plane to provide the four pillars of a service mesh. This control plane can be thought of as similar to a Kubernetes control plane. Control plane components manage certificates and keys for authentication and enable additional plugin configurations via adapters. Importantly, the control plane also exposes APIs for Command Line Interface (CLI) use. Istio provides istioctl as its CLI utility.
There are two components of a mesh: the control plane and the data plane. As shown below, service mesh users have a CLI tool to interact with the service mesh’s control plane. Service mesh configurations are written in YAML and applied by that CLI tool, which interacts with the control plane’s APIs to modify the data plane.
Istio is a popular service mesh implementation that is driving the adoption of service mesh due to its feature set and production readiness. Because it has become the de facto choice, many references and architectures are biased towards Istio. This blog post will call out the details that are specific to Istio.
To the point of differing mesh implementations, the Service Mesh Interface (SMI) was released in 2019 for mesh interoperability. SMI is a “specification for service meshes that run on Kubernetes. It defines a common standard that can be implemented by a variety of providers. This allows for both standardization for end-users and innovation by providers of Service Mesh Technology,” as defined by the specification here. The SMI allows the reuse of service mesh configurations across solutions.
Now let’s continue with how service meshes provide the four pillars in more detail.
Providing Resilience through Connect:
Service meshes allow for HTTP/TCP traffic routing and traffic management. TrafficSplit, defined by the SMI, allows users to “incrementally direct percentages of traffic between various services. It will be used by clients such as an ingress controller or proxy sidecars to split the outgoing traffic to different destinations.” A traffic split needs three specifications: a root service that clients use to direct traffic to services, two Kubernetes service resources that potentially have a different selector and type, and a weight for each of those Kubernetes services.
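To make the weighting concrete, here is a minimal sketch of how a proxy might choose a backend for each request according to the weights of a traffic split. It is an illustration only, not the SMI or Envoy implementation. The service names website-v1 and website-v2, the 90/10 weights, and the pick_backend helper are all hypothetical.

```python
import random
from collections import Counter


def pick_backend(backends, rng=random):
    """Pick one backend name with probability proportional to its weight."""
    names = [name for name, _ in backends]
    weights = [weight for _, weight in backends]
    return rng.choices(names, weights=weights, k=1)[0]


# Hypothetical split: 90% of requests go to the stable service, 10% to the canary.
split = [("website-v1", 90), ("website-v2", 10)]

# A seeded generator keeps the simulation repeatable.
rng = random.Random(42)
counts = Counter(pick_backend(split, rng) for _ in range(10_000))
print(counts)
```

Over many requests the observed share of each backend approaches its configured weight, which is what makes gradual canary rollouts possible: the operator only changes the weights.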
Policies around retries, timeouts, and rate limits are currently out of the scope of the SMI. Istio provides these additional features. We will discuss some of the policies that are of importance. Timeouts allow a service to give up on a request after a defined amount of time when calling another service. The Envoy proxy can be configured to wait a predefined amount of time before quitting and returning a 504 status code.
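The timeout behaviour described above can be sketched in a few lines. This is a simplified illustration, not how Envoy is implemented. The call_with_timeout helper, the fast and slow upstream functions, and the timeout values are assumptions made up for the example.

```python
import concurrent.futures
import time


def call_with_timeout(upstream, timeout_seconds):
    """Call upstream(); return (status, body), or a 504 if it takes too long."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(upstream)
    try:
        return 200, future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        return 504, "upstream request timeout"
    finally:
        # Do not block on the slow call; its thread finishes in the background.
        pool.shutdown(wait=False)


def fast():
    return "ok"


def slow():
    time.sleep(0.5)  # hypothetical slow downstream service
    return "late"


fast_result = call_with_timeout(fast, timeout_seconds=1.0)
slow_result = call_with_timeout(slow, timeout_seconds=0.05)
print(fast_result, slow_result)
```

The caller gets an answer within a bounded time either way, which keeps one slow service from tying up every service that depends on it.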
Another feature is circuit breaking, which, as the name suggests, works like a regular circuit breaker. A circuit breaker acts as a wrapper for function calls. Should the circuit breaker trip due to a failure, it prevents the application from performing a function call that is bound to fail. Istio will take the pod out of the Envoy address pool when tripping the circuit breaker.
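The wrapper idea can be sketched as a small class. This shows the general circuit breaker pattern inside an application, not Istio's mechanism, which works at the proxy level by removing pods from the address pool. The class name, the max_failures and reset_after parameters, and their defaults are assumptions for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before tripping
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

A caller would wrap each outbound request, for example breaker.call(fetch_orders), and treat CircuitOpenError as a signal to fail fast or fall back rather than wait on a service that is known to be unhealthy.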
Fault injection allows the definition of throughput and latency delays. Linkerd 2.x is a service mesh implementation which also provides fault injection. “Fault injection is a form of chaos engineering where the error rate of a service is artificially increased to see what impact there is on the system as a whole. Traditionally, this would require modifying the service’s code to add a fault injection library that would be doing the actual work. Linkerd can do this without any service code changes, only requiring a little configuration.” Read about fault injection defined by Linkerd here.
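As a rough illustration of the idea in the quote, the sketch below raises the error rate of a healthy handler without touching the handler's own code. It runs in-process purely for demonstration; a mesh does this at the proxy or routing layer instead. The inject_faults wrapper, the healthy_service handler, and the 20% error rate are hypothetical.

```python
import random


def inject_faults(handler, error_rate, rng=random):
    """Wrap handler so a fraction of calls fail with a 500 before reaching it."""
    def wrapped(request):
        if rng.random() < error_rate:
            return 500, "injected fault"
        return handler(request)
    return wrapped


def healthy_service(request):
    return 200, f"hello {request}"


# A seeded generator keeps the experiment repeatable.
rng = random.Random(7)
faulty = inject_faults(healthy_service, error_rate=0.2, rng=rng)

statuses = [faulty("world")[0] for _ in range(1000)]
observed_error_rate = statuses.count(500) / len(statuses)
print(observed_error_rate)
```

Callers of the wrapped service now see roughly one failure in five, which is enough to check whether their retries, timeouts, and fallbacks behave as intended.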
Resilient systems are able to cope with failures in downstream systems. Service mesh aims to build resilience through connection configurations.
Providing Security through Control:
Service meshes provide control for policy scoped to three levels: the service mesh level, the namespace level, and the service level. The service mesh level is the broadest scope and allows policies to be applied across the entire mesh.
Control over configuring routes depends on a custom resource called TrafficTarget. A TrafficTarget has two parts: a Service Role, which is a set of rules such as GET or POST, and a Service Role Binding, which binds the roles to Kubernetes service accounts.
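The relationship between rules and bindings can be sketched as a small access check. This is a conceptual model only and does not reflect the actual resource schema of any mesh. The Rule and Binding types, the is_allowed helper, and the service account names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    methods: frozenset  # allowed HTTP methods, e.g. {"GET"}
    path_prefix: str    # allowed path prefix, e.g. "/orders"


@dataclass(frozen=True)
class Binding:
    service_account: str  # identity of the calling service
    rules: tuple          # rules granted to that identity


def is_allowed(bindings, service_account, method, path):
    """Return True if any rule bound to the account matches the request."""
    for binding in bindings:
        if binding.service_account != service_account:
            continue
        for rule in binding.rules:
            if method in rule.methods and path.startswith(rule.path_prefix):
                return True
    return False  # default deny: nothing matched


# Hypothetical policy: the frontend may only read orders.
read_orders = Rule(methods=frozenset({"GET"}), path_prefix="/orders")
bindings = [Binding(service_account="frontend", rules=(read_orders,))]

print(is_allowed(bindings, "frontend", "GET", "/orders/17"))
print(is_allowed(bindings, "frontend", "POST", "/orders"))
```

The check is default deny: a request passes only when a rule bound to the caller's service account matches both the method and the path.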
The SMI’s specification on Traffic Spec specifies how to configure the mesh to define rules based on the types of traffic that flow through the mesh. Currently, this spec only covers HTTP/1 and HTTP/2 protocols.
Ingress Gateways allow verified traffic from outside a cluster into the service mesh and ensure end-to-end encryption for incoming traffic. Implementations differ on the kind of ingress for now, as the SMI does not specify one, and some mesh solutions may not have an opinionated approach to ingress. Solo’s Gloo product is an Ingress Controller built on top of Envoy that can be used as an API gateway for instances where you do not get an out-of-the-box implementation. In an Istio architecture, this component is a standalone instance of Envoy.
Istio Security Architecture from Istio’s documentation here.
Encrypted traffic enters the Ingress Proxy as shown. The traffic is authenticated with TLS termination at the proxy. The request is then re-encrypted with the internal service mesh encryption and sent to its target within the mesh. This target is a virtual service or another gateway that will route the traffic to its destination within the mesh.
Istio uses a service mesh control plane component called Citadel for key and certificate management. Citadel handles the creation and rotation of the certificates that are used for encrypted communications between services in the mesh. It handles this at the service account level, based on Kubernetes namespaces managed by the service mesh. Currently, the certificate is stored as a Kubernetes secret resource and mounted as a volume for the Envoy proxy to use. In the future, there will be support for Secret Discovery Service or SDS, which is a more secure method for identity provisioning.
Service meshes provide controllable security for application systems.
Monitoring and Tracing for Observability:
Service meshes also provide pluggable backends for telemetry capture. This is done using an instance of Prometheus. Service mesh solutions like Istio also provide application tracing with an instance of Jaeger. An added benefit of service mesh observability is that it ensures commercial off-the-shelf (COTS) applications have non-zero visibility regarding performance.
Monitoring support is provided through a Prometheus instance living in the control plane. This instance is dedicated to the service mesh, so there may be other instances of Prometheus running in the cluster. The Prometheus instance living in the control plane automatically determines what and where to scrape for the services in your data plane. This auto-discovery utilizes Kubernetes APIs. An idea here is that you’re able to expand your monitoring environment through federation. Applications within the mesh are currently responsible for exposing app-specific data to the Prometheus instance. To get the specifics, read more about traffic metrics as specified by the SMI here.
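For a sense of what exposing app-specific data involves, the sketch below renders a counter in the Prometheus text exposition format, which is what Prometheus reads when it scrapes an application. In practice a client library would normally do this for you. The metric name orders_requests_total, its labels, and the render_counter helper are made up for the example.

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    samples maps a tuple of (label, value) pairs to a numeric count.
    Label values are assumed not to need escaping in this sketch.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_text = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_text}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical request counts kept by an application.
samples = {
    (("method", "GET"), ("status", "200")): 42,
    (("method", "POST"), ("status", "500")): 3,
}

exposition = render_counter(
    "orders_requests_total", "Total HTTP requests handled.", samples
)
print(exposition, end="")
```

An application would serve this text over HTTP, conventionally at a /metrics path, for the mesh's Prometheus instance to scrape.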
A service mesh implementation that implements tracing capabilities is Red Hat’s OpenShift Service Mesh, which uses the Cloud Native Computing Foundation’s Jaeger project. The service mesh runs a version of an all-in-one image, which includes the Collector, UI, and query components. Here Envoy acts as the Jaeger Agent component. Read more about the implementation of Distributed Tracing for a service mesh here.
Finally, another aspect of observability is visibility into the service mesh’s data plane. Tools like Kiali can be installed on top of existing service meshes to add service mesh observability and configuration, providing a view of the service topology and wizards for easily configuring Istio routing via a user interface. Read more about Kiali’s features here. Solo’s Service Mesh Hub is another service mesh level tool.
Service meshes provide observability of systems involving technologies such as Prometheus, Jaeger, and Kiali.
Service mesh technologies support applications through added resilience, security, control, and observability at the mesh infrastructure level. Meshes take this burden off developers by offering an appealing set of capabilities that application developers would otherwise have to implement in their own software, a process that produces redundant or infrastructure-specific code and often additional dependencies.
There is quite a bit to consider when adopting and utilizing a service mesh. Service mesh implementations should continue to mature and manage the complexity associated with their use. We hope the Service Mesh Interface continues to expand to address this.
One heavy consideration for adopting a service mesh is cost. Service meshes use compute and memory resources, and then there is the cost of added latency. There are also caveats around application architecture and design when adopting service mesh solutions. One example is an application that utilizes Kubernetes StatefulSets: these resources use direct pod-to-pod communication and are not good candidates for service mesh tenants. I highly recommend having a plan for installing the service mesh and onboarding applications.