An OS update is a crucial step not only to patch security vulnerabilities and fix bugs, but also to add features that improve the performance of the system. However, it comes at the cost of rebooting the system, leading to unavoidable downtime and service disruption. Mitigating critical threats immediately is essential, but for end users the inevitable downtime during an update simply degrades the productivity and usability of the OS. This matters to enterprises as well; for example, Amazon loses a whopping $66,240 [2] for a single minute of downtime. This downtime and disruption can be further exacerbated if the update fails.
To address these issues, two pragmatic techniques are currently in practice:
- Rolling updates - First apply an update to a small group of machines, then extend it to the others if no failure occurs. Although this technique is a safeguard against mass failure, there is still considerable downtime while rebooting the systems.
- Dynamic hot-patching - This technique applies patches directly to the running kernel in-place. As a result, a system reboot becomes unnecessary and there is no application downtime. Although it seems attractive, it is inherently limited to patches consisting of simple code changes rather than semantic ones. We tested an open source tool called kpatch, which could only support 1 out of 16 minor updates over six months of Ubuntu's releases (Linux 3.13.0.32 -> 34). The above figure shows its limitations: the x-axis represents the before- and after-versions, and dotted bars represent failures in executing kpatch.
Even though these solutions are pragmatic, they are either inherently limited or do not completely solve the issues of system downtime and service disruption while updating the system. To address these issues, we came up with the idea of KUP - a simple, yet effective update mechanism that enables seamless kernel updates without any modification of the commodity operating system (or with minimal changes for the least downtime) by using a stable and mature application checkpoint-and-restart (C/R) technique. KUP not only supports a full system update for any kind of complex patch, but also mitigates update failures with two extensions: (1) safe fallback, which enables automatic recovery upon upgrade failure by restoring the original system from before the update; and (2) update dryrun, which allows users to check whether a new system update breaks running applications or services before actually applying it to the real machine. Another crazy extension that we have thought of is application-agnostic fault tolerance, in which the application can be replicated either to provide high availability or even load balancing.
Our evaluation shows that KUP provides a fast kernel update with various running applications such as memcached, mysql, or even a Linux kernel compile in progress. For example, KUP can update the Linux kernel from v3.17-rc7 to 3.17.0 with a total downtime of 2.4 seconds (constant), without losing 5.6 GB (or even more) of memcached data.
Approach and Design
KUP's update procedure is clean and simple: it first checkpoints the process, then performs a kernel switch, and finally restarts the application by restoring its checkpointed state (see above figure). While designing KUP, our prime focus has been to obtain the maximum performance out of C/R without modifying the commodity OS. Later, we also concentrated on squeezing out the maximum performance (i.e., the least downtime) for C/R with minimal changes to the kernel. To achieve this, KUP leverages four new techniques for reducing the system downtime during an update:
Stages | Inc. | Ond. | Inc+Ond | FOAM | RP-RAMFS | PPP
---|---|---|---|---|---|---
Checkpoint | +83.5% | -- | +83.5% | +83.5% | +94.0% | +99.7%
Restore | -15.2% | +99.6% | -42.8% | +99.6% | +99.6% | +99.4%
KUP's performance. Inc. and Ond. represent the incremental checkpoint and on-demand restore, respectively. Inc+Ond is the combination of the previous two, whereas FOAM is KUP's simple data structure for obtaining the best performance without any kernel change. RP-RAMFS represents the FOAM-based C/R scheme used on a RAM file system. PPP is the modification applied in the kernel to achieve the best performance.
1. Incremental checkpoint
For applications with a large working set size (WSS), a single checkpoint results in huge downtime. To mitigate this issue, KUP relies on the idea of an incremental checkpoint: KUP asynchronously takes multiple snapshots of the process' memory, followed by a final synchronous snapshot. With its introduction, we observe a significant improvement of 83.5% (Inc. column) in downtime over a simple checkpoint approach. Currently, KUP relies on criu [1] for application C/R, which already provides incremental checkpoint functionality.
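To give a feel for how incremental snapshots can be taken, below is a minimal sketch (not KUP's or criu's actual code) of the memory-tracking mechanism that criu builds on: the kernel's soft-dirty bits. Writing "4" to /proc/&lt;pid&gt;/clear_refs resets the bits, and bit 55 of each /proc/&lt;pid&gt;/pagemap entry then tells us which pages were written since the last pass, so only those pages need to be dumped in the next incremental snapshot. The function names and the fixed scan range are ours for illustration.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE      4096UL
#define PM_SOFT_DIRTY  (1ULL << 55)   /* soft-dirty bit in /proc/<pid>/pagemap */

/* Reset soft-dirty bits so the next scan sees only newly written pages. */
static void clear_soft_dirty(pid_t pid)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/clear_refs", pid);
    int fd = open(path, O_WRONLY);
    if (fd < 0) { perror("clear_refs"); exit(1); }
    if (write(fd, "4", 1) != 1) { perror("write"); exit(1); }
    close(fd);
}

/* Dump only the pages in [start, end) that were written since the last pass. */
static void dump_dirty_pages(pid_t pid, uintptr_t start, uintptr_t end)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/pagemap", pid);
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("pagemap"); exit(1); }

    for (uintptr_t va = start; va < end; va += PAGE_SIZE) {
        uint64_t entry;
        off_t off = (off_t)(va / PAGE_SIZE) * sizeof(entry);
        if (pread(fd, &entry, sizeof(entry), off) != sizeof(entry))
            break;
        if (entry & PM_SOFT_DIRTY) {
            /* copy this page into the incremental image, e.g. via
               process_vm_readv() or /proc/<pid>/mem (omitted here) */
            printf("dirty page at %#lx\n", va);
        }
    }
    close(fd);
}

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }
    pid_t pid = atoi(argv[1]);

    /* asynchronous pass: only pages dirtied while the app keeps running
       will need to be dumped in the next round */
    clear_soft_dirty(pid);
    sleep(1);
    dump_dirty_pages(pid, 0x400000, 0x800000);  /* illustrative range */
    return 0;
}
```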
2. On-demand restore
Downtime is incurred during both the checkpoint and the restore period; thus, the application restart adds roughly as much time as checkpointing does. To resolve this issue, KUP restarts the application without loading its entire memory, and instead reloads pages on demand when the process tries to access them. This approach is similar to the standard copy-on-write (COW) optimization, and it drastically decreases downtime by 99.6% (Ond. column).
Using the incremental checkpoint along with on-demand restore seems to be an apt choice. But, as the above table shows, even a simple restore from an incremental checkpoint adds an overhead of 15.2% (Restore row of the Inc. column). This is due to the extra work of maintaining the sequence of images taken during the incremental checkpoint, in which the restore code path wastes a significant amount of time linearizing the data back into the process' memory. When we further combine the incremental checkpoint with on-demand restore, we observe that the on-demand restore downtime grows by 42.8%. This occurs because the memory has to be mapped at page granularity (via the mmap() system call) to dynamically reload the pages, which results in a huge overhead.
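As a rough illustration of what on-demand restore looks like (and why per-page mappings hurt when it is combined with incremental images), the sketch below maps each checkpointed page directly from the image file instead of copying it up front; the kernel then pages the data in only when the restored process first touches it. The image layout, helper names, and addresses are our own simplification, not criu's or KUP's actual format.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 4096UL

/*
 * On-demand restore of a single checkpointed region: instead of read()-ing
 * the saved pages into memory (which stalls restart), map them from the
 * image file so each page is faulted in lazily on first access.
 *
 * When the image is a sequence of incremental dumps, every page may live at
 * a different file offset, forcing one mmap() per page - this is the source
 * of the 42.8% overhead mentioned above.
 */
static void restore_region_lazily(int img_fd, uintptr_t va, size_t len,
                                  const off_t *page_off /* per-page offsets */)
{
    for (size_t i = 0; i < len / PAGE_SIZE; i++) {
        void *addr = (void *)(va + i * PAGE_SIZE);
        /* MAP_PRIVATE: later writes stay COW-private to the restored process */
        if (mmap(addr, PAGE_SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_FIXED, img_fd, page_off[i]) == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }
    }
}

int main(void)
{
    /* demo: create a tiny two-page "image" and map it lazily */
    int fd = open("demo.img", O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0 || ftruncate(fd, 2 * PAGE_SIZE) < 0) { perror("image"); return 1; }

    off_t offsets[2] = { 0, (off_t)PAGE_SIZE };
    uintptr_t base = 0x700000000000UL;           /* illustrative fixed address */
    restore_region_lazily(fd, base, 2 * PAGE_SIZE, offsets);

    ((char *)base)[0] = 'x';                     /* first touch faults the page in */
    return 0;
}
```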
3. File offset-based address mapping (FOAM)
In order to effectively use the incremental checkpoint with on-demand restore, KUP uses a simple data structure, called file offset-based address mapping (FOAM), for checkpointing the process' memory. FOAM uses a direct one-to-one mapping between the process' addresses and a huge file, thus representing the whole virtual address space of the process. The free regions in the address space (not allocated to the process) are represented as holes, which are already supported by modern file systems.
By using FOAM, we reap the following benefits:
- There is no longer any metadata maintenance cost.
- There is no data fragmentation issue, i.e., everything gets updated in a single place.
- It simplifies the work needed to enable on-demand restore.
The above table (FOAM column) shows the benefits of using FOAM, as it gives us the best of both worlds.
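Below is a minimal sketch of the FOAM idea, under our own assumptions about the image layout: each page is written into a sparse image file at a file offset equal to its virtual address, so unallocated regions become file holes, incremental dumps simply overwrite the same offsets, and restore can map a whole region with one mmap() instead of one call per page. The helper names and the demo address are illustrative.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 4096UL

/*
 * FOAM-style checkpoint: the image file mirrors the virtual address space,
 * so the page at virtual address `va` always lives at file offset `va`.
 * Untouched regions are never written and stay as holes in the sparse file
 * (this needs a file system that supports large sparse files).
 */
static void foam_save_page(int img_fd, uintptr_t va, const void *page)
{
    if (pwrite(img_fd, page, PAGE_SIZE, (off_t)va) != (ssize_t)PAGE_SIZE) {
        perror("pwrite");
        exit(1);
    }
}

/*
 * FOAM-style on-demand restore: because offsets equal virtual addresses,
 * a whole region is mapped with a single mmap() call, and its pages are
 * still loaded lazily on first access.
 */
static void foam_restore_region(int img_fd, uintptr_t va, size_t len)
{
    if (mmap((void *)va, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_FIXED, img_fd, (off_t)va) == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }
}

int main(void)
{
    /* demo: checkpoint one page and restore a one-page region from it */
    int fd = open("foam.img", O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) { perror("open"); return 1; }

    uintptr_t va = 0x10000000000UL;             /* illustrative address */
    char page[PAGE_SIZE] = "hello";
    foam_save_page(fd, va, page);               /* incremental dumps overwrite in place */
    foam_restore_region(fd, va, PAGE_SIZE);     /* single mapping, lazy paging */
    printf("%s\n", (char *)va);
    return 0;
}
```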
4. Persistent physical pages (PPP)
The bandwidth of the storage medium also plays its part when the application is checkpointed and restarted. To circumvent this, we introduce a RAM-based file system (RP-RAMFS) that stores its data purely in RAM but makes it persistent across reboot. Even though RP-RAMFS seems a viable option, it has a critical limitation: KUP cannot perform C/R on an application whose working set size is more than half of the physical memory.
We overcome the limitation of RP-RAMFS by introducing a new mechanism in the kernel that lets it preserve the application's memory across the update. KUP saves the process' virtual address and physical page pairs, which are then used during restoration. KUP achieves this functionality by introducing two new system calls - preserve(pid, mapinfo, nele) for transferring the data to the kernel while checkpointing, and prestore(mapinfo, nele) during application restart.
The following are the steps that KUP uses for PPP:
- During the checkpoint phase, KUP dumps the virtual-to-physical mapping of the targeted process in userspace and passes the relevant information to the kernel via the preserve syscall.
- Before updating, KUP creates a list of pages that are not to be touched by the new kernel unless specified.
- During the new kernel's boot, KUP globally reserves the required set of requested pages.
- During application restart, using the prestore syscall, KUP passes the information to the page-fault handler, thereby allowing it to correctly rebind the faulted virtual address to the reserved physical page.
This technique not only improves performance by instantly binding pages at page-level granularity, but also avoids redundant memory copies. A rough sketch of how userspace could drive the two syscalls is shown below.
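The snippet below is only a sketch of how a userspace checkpointer might invoke the preserve and prestore syscalls named above; the syscall numbers, the mapinfo layout, the target pid, and the VA/PFN values are all hypothetical, since the actual ABI is not described in this post.

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical syscall numbers for the two new KUP syscalls. */
#define __NR_preserve 400
#define __NR_prestore 401

/* Hypothetical layout of one mapinfo element: a VA/PFN pair of the process. */
struct mapinfo {
    uint64_t vaddr;   /* virtual address of the page */
    uint64_t pfn;     /* physical frame backing it */
};

int main(void)
{
    /* In a real checkpointer these entries would be collected from
       /proc/<pid>/pagemap for every mapped page of the target. */
    struct mapinfo map[2] = {
        { .vaddr = 0x400000, .pfn = 0x1a2b3 },
        { .vaddr = 0x401000, .pfn = 0x1a2b4 },
    };
    pid_t pid = 1234;   /* illustrative target process */

    /* Checkpoint side: hand the VA/PFN pairs to the kernel so the
       physical pages survive the kernel switch. */
    if (syscall(__NR_preserve, pid, map, 2) < 0)
        perror("preserve");

    /* ... kernel switch and application restart happen here ... */

    /* Restore side: give the same pairs to the new kernel so its
       page-fault handler can rebind each faulted VA to its old page. */
    if (syscall(__NR_prestore, map, 2) < 0)
        perror("prestore");

    return 0;
}
```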
Evaluation
It is time to test our design and implementation. We now discuss the effectiveness of our techniques.
The above figure shows the effectiveness of the various techniques - on-demand restore, FOAM, and PPP - on both SSD and RP-RAMFS, using memcached as the application. The y-axis is the network bandwidth and the x-axis is the timeline. At time t=193 seconds, the kernel update is started in each case. As already discussed, PPP is the best approach, with the least downtime of around 2.4 seconds, whereas FOAM is the next best candidate in terms of performance.
The above four figures (a-d) illustrate the impact of using FOAM for both the checkpoint and restore phases. (a) shows the downtime caused by checkpointing with varying WSS. (b) shows the downtime when changing the percentage of writes under a fixed WSS; the overhead of FOAM's incremental approach suddenly becomes high as the write percentage increases. (c) illustrates the advantage of on-demand restore with increasing WSS. (d) shows the performance of PPP and FOAM (on RP-RAMFS and SSD) with varying WSS (50% writes) up to 72 GB, which is larger than half of the system's memory (128 GB). RP-RAMFS fails at 56 GB because it requires free RAM space for checkpointing the process and therefore cannot support a large WSS. In contrast, PPP can efficiently support applications with a large WSS.
There are other aspects that we have not covered here, including testing with other applications, a comparison of KUP against kpatch, and the kernel switch downtime. We will go into detail in our next blog post, along with some other cool ideas.
[1] criu: http://www.criu.org/Main_Page
[2] Amazon.com Goes Down, Loses $66,240 Per Minute. http://www.forbes.com/sites/kellyclay/2013/08/19/amazon-com-goes-down-loses-66240-per-minute
Proofread by Taesoo Kim and Changwoo Min.