Although most of the IT industry now deploys container- and cloud-based infrastructure, it is important to understand the limitations of these technologies. Traditional containers such as Docker, Linux Containers (LXC), and Rocket (rkt) are not truly isolated, because they share the kernel of the host operating system. They are resource-efficient, but the attack surface and the potential damage from a breach remain large, especially in multi-tenant cloud environments where containers from different customers are co-located.
The root of the problem is the weak separation between containers when the host operating system creates a virtualized user space for each of them. There has been research and development aimed at building true sandboxed containers, and most of the resulting solutions redraw the boundaries between containers to strengthen their isolation. In this article, we will look at four projects from IBM, Google, Amazon, and OpenStack, respectively, that use different methods to achieve the same goal: reliable isolation. IBM Nabla runs containers on top of unikernels, Google gVisor creates a dedicated guest kernel, Amazon Firecracker uses an extremely lightweight hypervisor to sandbox applications, and OpenStack places containers in a specialized virtual machine optimized for orchestration tools.
Overview of modern container technology
Containers are a modern way to package, share, and deploy applications. Unlike a monolithic application, in which all functionality is packed into a single program, containerized applications, or microservices, are designed for a narrow purpose and specialize in a single task.
A container includes all the dependencies (packages, libraries, and binaries) that its application needs to perform that task. As a result, containerized applications are platform-independent and can run on any operating system regardless of its version or installed packages. This convenience saves developers an enormous amount of work adapting different versions of software for different platforms or clients. Although it is not entirely accurate conceptually, many people like to think of containers as “lightweight virtual machines.”
When a container is deployed on a host, each container's resources, such as its file system, processes, and network stack, are placed in a virtually isolated environment that other containers cannot access. This architecture makes it possible to run hundreds or thousands of containers in a single cluster, and each application (or microservice) can then easily be scaled out by replicating more instances.
Container deployment rests on two key “building blocks”: Linux namespaces and Linux control groups (cgroups).
A namespace creates a virtually isolated user space and gives the application dedicated system resources, such as a file system, a network stack, process IDs, and user IDs. In this isolated user space, the application controls the root of its file system and can run as root. This abstraction lets each application work independently, without interfering with other applications on the same host. Six namespaces are currently available: mount, inter-process communication (ipc), UNIX time-sharing system (uts), process ID (pid), network, and user. Two more, time and syslog, have been proposed, but the Linux community has not yet settled on their final specifications.
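To make this concrete, here is a minimal sketch in Go (assuming a Linux host and root privileges) that starts a shell in fresh uts, pid, and mount namespaces, much like a container runtime does:

```go
package main

// Minimal namespace sketch (Linux-only, requires root): run a shell
// child inside new UTS, PID, and mount namespaces. Inside that shell,
// `hostname sandbox` changes only the sandbox's hostname, and the
// shell sees itself as PID 1 of its own pid namespace.

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("/bin/sh")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		// Clone flags asking the kernel for a fresh hostname, PID,
		// and mount namespace for the child process.
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```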
Cgroups enforce hardware-resource limits, prioritization, accounting, and control for applications. Examples of resources they can manage include CPU, memory, devices, and network. By combining namespaces and cgroups, we can safely run multiple applications on the same host, with each application in its own isolated environment, which is the fundamental property of a container.
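A hedged sketch of the second building block: creating a cgroup by hand and capping a process at 64 MB of memory. It assumes a cgroup v1 hierarchy mounted at /sys/fs/cgroup/memory (on cgroup v2 hosts the layout and file names differ, e.g. memory.max), and the group name “demo” is made up for the example:

```go
package main

// Cap this process's memory via the cgroup filesystem (cgroup v1,
// requires root). Container runtimes do essentially the same writes.

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

func main() {
	group := "/sys/fs/cgroup/memory/demo" // hypothetical group name
	if err := os.MkdirAll(group, 0755); err != nil {
		panic(err)
	}
	// Limit the group to 64 MiB of memory.
	limit := []byte(strconv.Itoa(64 * 1024 * 1024))
	if err := os.WriteFile(filepath.Join(group, "memory.limit_in_bytes"), limit, 0644); err != nil {
		panic(err)
	}
	// Move the current process into the group; children inherit it.
	pid := []byte(strconv.Itoa(os.Getpid()))
	if err := os.WriteFile(filepath.Join(group, "cgroup.procs"), pid, 0644); err != nil {
		panic(err)
	}
	fmt.Println("now running under a 64 MiB memory limit")
}
```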
The main difference between a virtual machine (VM) and a container is that a VM is virtualization at the hardware level, while a container is virtualization at the operating-system level. A VM hypervisor emulates a hardware environment for each machine, whereas a container runtime emulates an operating system for each container. VMs share the physical hardware of the host; containers share both the hardware and the OS kernel. Because containers share more resources with the host, their use of storage, memory, and CPU cycles is much more efficient than a VM's. The downside of this sharing, however, lies in security: too much trust is placed between the containers and the host. Figure 1 illustrates the architectural difference between containers and virtual machines.
In general, hardware-level virtualization creates a much stronger security boundary than namespace isolation alone. The risk of an attacker successfully escaping an isolated process is much higher than the chance of escaping a virtual machine. The reason is the weak isolation created by namespaces and cgroups. Linux implements them by attaching new property fields to each process. These fields in the /proc file system tell the host operating system whether one process can see another, or how much CPU and memory a particular process may use. When you view running processes and threads from the host OS (for example, with the top or ps commands), a container process looks just like any other process. As a rule, traditional solutions such as LXC or Docker are not considered fully isolated, since they share the same kernel on the same host. It is therefore not surprising that containers have had a fair number of vulnerabilities: CVE-2014-3519, CVE-2016-5195, CVE-2016-9962, CVE-2017-5123, and CVE-2019-5736, for example, could all let an attacker reach data outside the container.
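The following sketch shows what this weak demarcation looks like in practice: from the host, a container process is just another PID, and its isolation is visible only as namespace links under /proc:

```go
package main

// Inspect a process's namespace membership via /proc. Two processes
// are in the same namespace iff their links resolve to the same inode,
// e.g. "pid:[4026531836]". For a container, substitute its host PID.

import (
	"fmt"
	"os"
)

func main() {
	pid := "self" // replace with a container's host PID, e.g. "12345"
	for _, ns := range []string{"mnt", "uts", "ipc", "pid", "net", "user"} {
		target, err := os.Readlink("/proc/" + pid + "/ns/" + ns)
		if err != nil {
			fmt.Println(ns, "error:", err)
			continue
		}
		fmt.Printf("%-5s -> %s\n", ns, target)
	}
}
```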
Most kernel exploits are viable attack vectors because they typically end in privilege escalation and allow a compromised process to gain control outside its intended namespace. Besides software vulnerabilities, misconfiguration can also play a role. For example, deploying images with excessive privileges (CAP_SYS_ADMIN, privileged access) or mounting critical host paths (such as /var/run/docker.sock) can lead to a breakout. Given these potentially disastrous consequences, you should understand the risk you take when deploying systems in a multi-tenant space or when using containers to store sensitive data.
These problems motivate researchers to build stronger security boundaries. The idea is to create a true sandboxed container, isolated from the host OS as much as possible. Most of these solutions involve a hybrid architecture that draws a strict line between the application and the virtual machine while trying to keep the efficiency of container solutions.
At the time of this writing, no single project was mature enough to be adopted as a standard, but in the future developers will undoubtedly adopt some of these concepts as fundamental ones.
We begin our review with Unikernel, the oldest highly specialized system, which packs an application into a single image with a minimal set of operating-system libraries. The Unikernel concept itself turned out to be fundamental for many projects aimed at creating secure, compact, and optimized images. We then move on to IBM Nabla, a project for running unikernel applications, including containers; Google gVisor, a project for running containers on a user-space kernel; and the virtual-machine-based container solutions, Amazon Firecracker and OpenStack Kata. We wrap up by comparing all of these solutions.
Unikernel

The development of virtualization technology made cloud computing possible. Hypervisors like Xen and KVM laid the foundation for what we now know as Amazon Web Services (AWS) and Google Cloud Platform (GCP). And although modern hypervisors can handle hundreds of virtual machines combined into a single cluster, traditional general-purpose operating systems are not well adapted or optimized for such an environment. A general-purpose OS is designed above all to support as many different applications as possible, so its kernel includes all kinds of drivers, libraries, protocols, schedulers, and so on. However, most virtual machines deployed in the cloud run a single application, for example a DNS server, a proxy, or a database. Since such an application relies only on a small, specific part of the OS kernel, all the rest of the machinery simply wastes system resources and, by its very existence, increases the number of potential attack vectors. After all, the larger the code base, the harder it is to eliminate every flaw, and the more potential vulnerabilities, bugs, and other weaknesses it contains. This problem pushes specialists to develop highly specialized operating systems with a minimal set of kernel functionality, that is, tools built to support one specific application.
The idea of the unikernel first appeared back in the 1990s. It took shape as a specialized machine image with a single address space that can run directly on a hypervisor. It packages the application together with the kernel functions it depends on into a single image. Nemesis and Exokernel are the two earliest research incarnations of the unikernel idea. The packaging and deployment process is shown in Figure 2.
Figure 2. General-purpose operating systems are designed to support all types of applications, so many libraries and drivers are preloaded into them. Unikernels are highly specialized operating systems designed to support one specific application.
Unikernel splits the kernel into several libraries and places only the necessary components into the image. Like regular virtual machines, a unikernel is deployed and runs on a VM hypervisor. Thanks to its small size, it boots quickly and also scales quickly. The most important features of a unikernel are enhanced security, a small footprint, a high degree of optimization, and fast boot. Since these images contain only application-dependent libraries, and the OS shell is not reachable unless it was deliberately included, the number of attack vectors available to attackers is minimal.
That is, not only is it hard for an attacker to gain a foothold in these special-purpose kernels, but the impact of a compromise is also limited to a single kernel instance. Since unikernel images are only a few megabytes, they boot in tens of milliseconds, and literally hundreds of instances can run on a single host. By using memory allocation in a single address space instead of the multi-level page tables of most modern operating systems, unikernel applications get lower memory-access latency than the same application on a regular virtual machine. And since applications are compiled together with the kernel when the image is built, compilers can perform static type checking to optimize the binaries.
The Unikernel.org website maintains a list of unikernel projects. But for all its distinctive features and properties, the unikernel has not seen wide adoption. When Docker acquired Unikernel Systems in 2016, the community decided the company would start packing containers into unikernels. But three years have passed, and there are still no signs of integration. One of the main reasons for this slow adoption is that there is still no mature tool for building unikernel applications, and most such applications can run only on certain hypervisors. In addition, porting an application to a unikernel may require manually rewriting code in other languages, including rewriting the kernel libraries it depends on. Just as important, monitoring and debugging in unikernels is either impossible or carries a significant performance penalty.
All these restrictions keep developers away from the technology. It should be noted that unikernels and containers have many properties in common. Both are narrowly focused, immutable images, meaning the components inside them cannot be updated or patched; to patch an application you always have to build a new image. Today the unikernel is in a position similar to Docker's ancestry: back then there was no container runtime, and developers had to use basic tools to build an isolated application environment (chroot, unshare, and cgroups).
IBM Nabla

Some time ago, researchers at IBM proposed the concept of “unikernel as a process”: a unikernel application that runs as a process on a specialized hypervisor. IBM's Nabla containers project hardens the unikernel security boundary by replacing the general-purpose hypervisor (for example, QEMU) with its own development called Nabla Tender. The rationale for this approach is that the calls between a unikernel and the hypervisor still present the most attack vectors, so a dedicated unikernel hypervisor with a smaller number of allowed system calls can significantly strengthen the security boundary. Nabla Tender intercepts the calls the unikernel sends to the hypervisor and translates them into system calls, while a seccomp Linux policy blocks all other system calls that Tender does not need. The unikernel, together with Nabla Tender, thus runs as a user-space process on the host. Figure 3 below shows how Nabla creates a thin interface between the unikernel and the host.
Figure 3. To link Nabla with existing container orchestration platforms, Nabla uses an OCI-compliant runtime, which in turn can be connected to Docker or Kubernetes.
The developers claim that Nabla Tender uses fewer than seven system calls to interact with the host. Since system calls serve as a bridge between user-space processes and the operating-system kernel, the fewer system calls are available, the smaller the attack surface of the kernel. Another benefit of running a unikernel as a process is that such applications can be debugged with a wide range of tools, for example gdb.
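For illustration, here is roughly what such a seccomp allow-list looks like, using the libseccomp Go bindings. To be clear, this is not Tender's actual policy: the list of permitted calls is an assumption made for the example, and a Go process needs several extra calls just to keep its own runtime alive.

```go
package main

// Sketch of a seccomp allow-list: reject every system call except a
// small set, failing the rest with EPERM so the demo stays observable.
// Uses github.com/seccomp/libseccomp-golang; the allowed calls below
// are illustrative, NOT Nabla Tender's real seven-call policy.

import (
	"fmt"
	"syscall"

	seccomp "github.com/seccomp/libseccomp-golang"
)

func main() {
	filter, err := seccomp.NewFilter(seccomp.ActErrno.SetReturnCode(int16(syscall.EPERM)))
	if err != nil {
		panic(err)
	}
	allowed := []string{
		"read", "write", "exit", "exit_group", "rt_sigreturn",
		// Needed by the Go runtime itself:
		"futex", "mmap", "munmap", "nanosleep", "sched_yield",
	}
	for _, name := range allowed {
		sc, err := seccomp.GetSyscallFromName(name)
		if err != nil {
			panic(err)
		}
		if err := filter.AddRule(sc, seccomp.ActAllow); err != nil {
			panic(err)
		}
	}
	if err := filter.Load(); err != nil { // filter now applies to this process
		panic(err)
	}
	fmt.Println("allow-list loaded; everything else returns EPERM")
}
```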
To work with container orchestration platforms, Nabla provides a dedicated runtime, runnc, implemented according to the Open Container Initiative (OCI) standard. The OCI standard defines the API between clients (for example, Docker, kubectl) and the runtime (for example, runc). Nabla also ships an image builder for creating the images that runnc can run. However, because of file-system differences between unikernels and traditional containers, Nabla images do not conform to the OCI image specification, and Docker images are therefore not compatible with runnc. At the time of this writing, the project is still at an early stage of development. There are other limitations as well, such as the lack of support for mounting/accessing host file systems, attaching multiple network interfaces (required for Kubernetes), or using images from other unikernel projects (for example, MirageOS).
Google gVisor

Google gVisor is the sandboxing technology used in Google Cloud Platform (GCP) App Engine, Cloud Functions, and Cloud ML. At some point, Google recognized the risk of running untrusted applications in its public cloud infrastructure and the inefficiency of sandboxing them with virtual machines. As a result, it developed a user-space kernel for sandboxing such untrusted applications. gVisor sandboxes these applications by intercepting all their system calls to the host kernel and handling them in user space with its Sentry kernel. In essence, it functions as a combination of a guest kernel and a hypervisor. Figure 4 shows the gVisor architecture.
Figure 4. The gVisor kernel (Sentry) and the gVisor file system (Gofer) use a small number of system calls to interact with the host
gVisor creates a strong security boundary between an application and its host. It limits the system calls the application can use in user space. Without relying on virtualization, gVisor runs as a host process that mediates between the sandboxed application and the host. Sentry supports most Linux system calls and core kernel features, such as signal delivery, memory management, the network stack, and the threading model. It implements more than 70% of the 319 Linux system calls to support sandboxed applications, yet uses fewer than 20 Linux system calls to interact with the host kernel. It is worth noting that gVisor and Nabla follow a very similar strategy of protecting the host OS, and both use less than 10% of the Linux system calls to talk to the kernel. But whereas gVisor implements a general-purpose kernel, Nabla relies on unikernels. Both solutions run a specialized guest kernel in user space to support the sandboxed applications entrusted to them.
Some may wonder why gVisor needs its own kernel when the Linux kernel is open source and readily available. The answer is that the gVisor kernel, written in Golang, is safer than the Linux kernel written in C, thanks to Golang's strong type safety and memory management. Another important point about gVisor is its tight integration with Docker, Kubernetes, and the OCI standard. Most Docker images can simply be pulled and run with gVisor by changing the runtime to gVisor's runsc. With Kubernetes, instead of a sandbox for each individual container, gVisor can run a whole pod in a single sandbox.
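For example, here is a hedged sketch that starts a sandboxed container through the Docker Engine Go SDK by selecting the runsc runtime. It assumes runsc is already registered with the daemon (via the runtimes section of /etc/docker/daemon.json), and SDK signatures vary slightly between versions:

```go
package main

// Run a container under gVisor via the Docker Engine API
// (github.com/docker/docker/client). The only gVisor-specific part is
// HostConfig.Runtime: "runsc" instead of the default runc.

import (
	"context"
	"fmt"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}
	resp, err := cli.ContainerCreate(ctx,
		&container.Config{
			Image: "alpine",
			Cmd:   []string{"dmesg"}, // inside gVisor, dmesg prints the Sentry kernel's log
		},
		&container.HostConfig{Runtime: "runsc"}, // gVisor runtime, assumed registered with the daemon
		nil, nil, "")
	if err != nil {
		panic(err)
	}
	if err := cli.ContainerStart(ctx, resp.ID, types.ContainerStartOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("sandboxed container started:", resp.ID)
}
```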
Since gVisor is still in its infancy, it has some limitations. Because gVisor intercepts and handles every system call the sandboxed application makes, there is always some overhead, so it is not suitable for syscall-heavy applications. (Note that Nabla does not have this problem, since unikernel applications do not make system calls; Nabla uses its seven system calls only to handle hypercalls.) gVisor has no direct hardware access (passthrough), so applications that require it, for example access to a GPU, cannot run in it. Finally, since gVisor supports only about 70% of the Linux system calls, applications that use unsupported calls cannot run in gVisor at all.
Amazon Firecracker

Amazon Firecracker is the technology used today in AWS Lambda and AWS Fargate. It is a hypervisor that creates “lightweight virtual machines” (microVMs) specifically for multi-tenant containers and serverless operating models. Before Firecracker, Lambda functions and Fargate tasks for each customer ran inside dedicated EC2 virtual machines to provide strong isolation. Although virtual machines provide sufficient isolation for containers in a public cloud, using general-purpose VMs to sandbox applications is not very efficient in terms of resources. Firecracker addresses both the security and the efficiency problem, being designed specifically for cloud applications. The Firecracker hypervisor gives each guest virtual machine minimal OS functionality and emulated devices to improve both security and performance. Users can easily create virtual machine images from a Linux kernel binary and an ext4 file-system image. Amazon began developing Firecracker in 2017 and open-sourced the project in 2018.
Like the unikernel concept, Firecracker provides only the small subset of functionality required to run container workloads. Compared with traditional virtual machines, microVMs have far fewer potential vulnerabilities, consume far less memory, and start far faster. Practice shows that a Firecracker microVM consumes about 5 MB of memory and boots in ~125 ms on a host with 2 CPUs and 256 GB of RAM. Figure 5 shows the Firecracker architecture and its security boundaries.
Figure 5. The Firecracker hypervisor uses security levels to isolate each individual user’s application
Firecracker is based on KVM, and each instance runs as a user-space process. Each Firecracker process is locked down by seccomp, cgroups, and namespace policies, so its system calls, hardware resources, file system, and network activity are strictly limited. Inside each Firecracker process there are several threads. The API thread provides a control interface between clients on the host and the microVM. The hypervisor thread provides a minimal set of virtIO devices (network and block). Firecracker provides only four emulated devices per microVM: virtio-block, virtio-net, a serial console, and a 1-button keyboard controller whose only purpose is to stop the microVM. For security, the virtual machines have no file-sharing mechanism with the host. Data on the host, such as container images, reaches the microVM through file block devices, and network interfaces are backed by a network bridge. All outgoing packets are copied to the dedicated device, and their rate is limited by the cgroups policy. All these precautions ensure that the chance of one application affecting others is minimized.
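Since the API thread speaks plain REST over a Unix socket, a microVM can be configured and booted with nothing but the Go standard library. The sketch below uses Firecracker's documented endpoints; the socket path and the kernel/rootfs image paths are placeholders and assume a firecracker process already started with --api-sock:

```go
package main

// Drive Firecracker's REST API over its Unix socket: size the microVM,
// point it at a kernel and a root block device, then boot it.

import (
	"bytes"
	"context"
	"fmt"
	"net"
	"net/http"
)

func put(client *http.Client, path, body string) error {
	req, err := http.NewRequest(http.MethodPut, "http://unix"+path, bytes.NewBufferString(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("%s: %s", path, resp.Status)
	}
	return nil
}

func main() {
	// All requests go to the hypervisor's API socket, not a TCP host.
	client := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return net.Dial("unix", "/tmp/firecracker.sock") // placeholder path
			},
		},
	}
	steps := []struct{ path, body string }{
		{"/machine-config", `{"vcpu_count": 1, "mem_size_mib": 128}`},
		{"/boot-source", `{"kernel_image_path": "vmlinux", "boot_args": "console=ttyS0 reboot=k panic=1"}`},
		{"/drives/rootfs", `{"drive_id": "rootfs", "path_on_host": "rootfs.ext4", "is_root_device": true, "is_read_only": false}`},
		{"/actions", `{"action_type": "InstanceStart"}`},
	}
	for _, s := range steps {
		if err := put(client, s.path, s.body); err != nil {
			panic(err)
		}
	}
	fmt.Println("microVM started")
}
```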
At the time of writing, Firecracker has not yet completed its integration with Docker and Kubernetes. Firecracker does not support hardware passthrough, so applications that need a GPU or any device accelerator are incompatible with it. It also has limited file-sharing between virtual machines and a primitive network model. However, since the project is developed by a large community, it should soon be brought in line with the OCI standard and begin supporting more applications.
OpenStack Kata
Seeing the security problems of traditional containers, Intel in 2015 introduced its own virtual-machine-based technology, Clear Containers. Clear Containers relies on Intel VT hardware virtualization technology and a heavily modified QEMU-KVM hypervisor, qemu-lite. At the end of 2017, the Clear Containers project merged with Hyper runV, a hypervisor-based OCI runtime, and became the Kata project. Having inherited all the properties of Clear Containers, Kata now supports a wider range of infrastructures and specifications.
Kata is fully integrated with the OCI, the Container Runtime Interface (CRI), and the Container Network Interface (CNI). It supports various network models (for example, passthrough, MacVTap, bridge, tc mirroring) and customizable guest kernels, so applications that require special network models or kernel versions can all run on it. Figure 6 shows how containers inside Kata virtual machines interact with existing orchestration platforms.
Figure 6. Full integration of Kata containers with Docker and Kubernetes
On the host, Kata has a runtime that starts and configures new containers. For each container in a Kata virtual machine there is a corresponding Kata Shim on the host, which receives API requests from clients (for example, docker or kubectl) and forwards them over VSock to the agent inside the virtual machine. Kata additionally optimizes boot times. NEMU is a lightweight version of QEMU with ~80% of its devices and packages removed. VM templating creates a clone of a running Kata VM instance and shares it with newly created VMs; this significantly reduces boot time and guest memory consumption, but can open the door to side-channel attacks such as CVE-2015-2877. Hot-plugging allows a VM to boot with a minimal set of resources (for example, CPU, memory, virtio block) and add what is missing later, on request.
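As a rough illustration of that host-to-guest channel, the sketch below dials an agent inside a VM over vsock using the third-party github.com/mdlayher/vsock package. This is not Kata's own code, and the CID and port values are invented for the example:

```go
package main

// Open a host-side vsock connection to a hypothetical agent listening
// inside a VM, the same kind of channel Kata's shim uses to reach the
// in-guest agent. CID 3 and port 1024 are placeholder values.

import (
	"fmt"

	"github.com/mdlayher/vsock"
)

func main() {
	conn, err := vsock.Dial(3, 1024, nil) // (guest CID, port, optional config)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	if _, err := conn.Write([]byte("ping")); err != nil {
		panic(err)
	}
	fmt.Println("sent request to in-VM agent over vsock")
}
```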
Kata containers and Firecracker are both virtual-machine-based sandboxing technologies for cloud applications. They share a goal but take different approaches: Firecracker is a specialized hypervisor that creates a secure virtualization environment for guest OSes, while Kata containers are lightweight virtual machines highly optimized for their task. There have also been attempts to run Kata containers on Firecracker. Although this idea is still experimental, it could potentially combine the best features of the two projects.
Conclusion

We have reviewed several solutions that address the weak isolation of today's container technology.
IBM Nabla is a unikernel-based solution that packs applications into a dedicated virtual machine.
Google gVisor is a specialized hypervisor and guest OS kernel that creates a secure interface between applications and their host.
Amazon Firecracker is a specialized hypervisor that provides each guest with a minimal set of hardware and kernel resources.
OpenStack Kata is a highly optimized virtual machine with a built-in container engine that can work on various hypervisors.
It is hard to say which of these solutions works best, since they all have their own pros and cons. The table at the end of the article gives a side-by-side comparison of some important features of all four projects. Nabla is the best choice if you run applications on unikernel systems such as MirageOS or IncludeOS. gVisor currently integrates best with Docker and Kubernetes, but because of its incomplete system-call coverage, some applications are incompatible with it. Firecracker supports custom guest OS images and is a good choice if your applications need to run in a customized virtual machine. Kata containers are fully OCI-compliant and run on both the KVM and Xen hypervisors, which can simplify the deployment of microservices on hybrid platforms.
It may take time for one of these solutions to become a standard, but it is already encouraging that most major cloud providers have begun looking for ways to solve these problems.