An OS update is a crucial step not only to patch security vulnerabilities and fix bugs, but also to add features that improve the performance of the system. However, it comes at the cost of rebooting the system, leading to unavoidable downtime and service disruption. Mitigating critical threats immediately is essential, but for end users the inevitable downtime during an update simply degrades the productivity and usability of the OS. This matters to enterprises as well; for example, Amazon loses a whopping $66,240 [2] for a single minute of downtime. This downtime and disruption can be further exacerbated if the update fails.
To address these issues, two pragmatic techniques are currently in practice:
- Rolling updates - First apply an update to a small group of machines, then extend it to the others if no failure occurs. Although this technique is a safeguard against mass failure, there is still considerable downtime while rebooting the systems.
- Dynamic hot-patching - This technique applies patches directly to the running kernel in-place. As a result, a system reboot becomes unnecessary and there is no application downtime. Although it seems attractive, it is inherently limited to patches consisting of simple code changes rather than semantic ones. We tested an open source tool called kpatch, which could only support 1 out of 16 minor updates over six months of Ubuntu's releases (Linux 3.13.0.32 -> 34). The above figure shows its limitations: the x-axis represents the before- and after-versions, and dotted bars represent failures in executing kpatch.
Even though these solutions are pragmatic, they are either inherently limited or do not completely solve the issues of system downtime and service disruption while updating the system. To address these issues, we came up with the idea of KUP - a simple, yet effective update mechanism that enables seamless kernel updates without any modification of the commodity operating system (or with minimal changes for the least downtime) by using a stable and mature application checkpoint-and-restart (C/R) technique. KUP not only supports a full system update for any kind of complex patch, but also mitigates update failures with two extensions: (1) safe fallback, which enables automatic recovery upon upgrade failure by restoring the original system from before the update; and (2) update dryrun, which allows users to check whether a new system update breaks running applications or services before actually applying it to the real machine. Another crazy extension that we have thought of is application-agnostic fault tolerance, in which the application can be replicated either to provide high availability or even load balancing.
Our evaluation shows that KUP provides a fast kernel update with various running applications such as memcached, mysql, or even a Linux kernel compile in progress. For example, KUP can update the Linux kernel from v3.17-rc7 to 3.17.0 with a total downtime of 2.4 seconds (constant), without losing 5.6 GB (or even more) of memcached data.
Approach and Design
KUP's update procedure is clean and simple: it first checkpoints the process, then performs a kernel switch, and finally restarts the application by restoring its checkpointed state (see above figure). While designing KUP, our prime focus has been to obtain the maximum performance out of C/R without modifying the commodity OS. Later, we also concentrated on squeezing out the maximum performance (i.e., the least downtime) for C/R with minimal changes to the kernel. To achieve this, KUP leverages four new techniques for reducing the system downtime during an update:
Stages | Inc. | Ond. | Inc+Ond | FOAM | RP-RAMFS | PPP
---|---|---|---|---|---|---
Checkpoint | +83.5% | -- | +83.5% | +83.5% | +94.0% | +99.7%
Restore | -15.2% | +99.6% | -42.8% | +99.6% | +99.6% | +99.4%
KUP's performance. Inc. and Ond. represent the incremental checkpoint and on-demand restore, respectively. Inc+Ond is the combination of the previous two, whereas FOAM is KUP's simple data structure for obtaining the best performance without any kernel change. RP-RAMFS represents the FOAM-based C/R scheme used on a RAM file system. PPP is the modification applied in the kernel to achieve the best performance.
1. Incremental checkpoint
For applications with a large working set size (WSS), a single checkpoint results in huge downtime. To mitigate this issue, KUP relies on the idea of an incremental checkpoint: KUP asynchronously takes multiple snapshots of the process' memory, followed by a final synchronous snapshot. With its introduction, we observe a significant improvement of 83.5% (Inc. column) in downtime over a simple checkpoint approach. Currently, KUP relies on criu [1] for application C/R, which already provides incremental checkpoint functionality.
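To give a feel for how incremental snapshots can be taken, below is a minimal sketch (not KUP's or criu's actual code) of the memory-tracking mechanism that criu builds on: the kernel's soft-dirty bits. Writing "4" to /proc/&lt;pid&gt;/clear_refs resets the bits, and bit 55 of each /proc/&lt;pid&gt;/pagemap entry then tells us which pages were written since the last pass, so only those pages need to be dumped in the next incremental snapshot. The function names and the fixed scan range are ours for illustration.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE      4096UL
#define PM_SOFT_DIRTY  (1ULL << 55)   /* soft-dirty bit in /proc/<pid>/pagemap */

/* Reset soft-dirty bits so the next scan sees only newly written pages. */
static void clear_soft_dirty(pid_t pid)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/clear_refs", pid);
    int fd = open(path, O_WRONLY);
    if (fd < 0) { perror("clear_refs"); exit(1); }
    if (write(fd, "4", 1) != 1) { perror("write"); exit(1); }
    close(fd);
}

/* Dump only the pages in [start, end) that were written since the last pass. */
static void dump_dirty_pages(pid_t pid, uintptr_t start, uintptr_t end)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/pagemap", pid);
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("pagemap"); exit(1); }

    for (uintptr_t va = start; va < end; va += PAGE_SIZE) {
        uint64_t entry;
        off_t off = (off_t)(va / PAGE_SIZE) * sizeof(entry);
        if (pread(fd, &entry, sizeof(entry), off) != sizeof(entry))
            break;
        if (entry & PM_SOFT_DIRTY) {
            /* copy this page into the incremental image, e.g. via
               process_vm_readv() or /proc/<pid>/mem (omitted here) */
            printf("dirty page at %#lx\n", va);
        }
    }
    close(fd);
}

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }
    pid_t pid = atoi(argv[1]);

    /* asynchronous pass: only pages dirtied while the app keeps running
       will need to be dumped in the next round */
    clear_soft_dirty(pid);
    sleep(1);
    dump_dirty_pages(pid, 0x400000, 0x800000);  /* illustrative range */
    return 0;
}
```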
2. On-demand restore
Downtime is incurred during both the checkpoint and the restore period; thus, the application restart adds roughly as much time as checkpointing does. To resolve this issue, KUP restarts the application without loading its entire memory, and instead reloads pages on demand when the process tries to access them. This approach is similar to the standard copy-on-write (COW) optimization, and it drastically decreases downtime by 99.6% (Ond. column).
Using the incremental checkpoint along with on-demand restore seems to be an apt choice. But, as the above table shows, even a simple restore from an incremental checkpoint adds an overhead of 15.2% (Restore row of the Inc. column). This is due to the extra work of maintaining the sequence of images taken during the incremental checkpoint, in which the restore code path wastes a significant amount of time linearizing the data back into the process' memory. When we further combine the incremental checkpoint with on-demand restore, we observe that the on-demand restore downtime grows by 42.8%. This occurs because the memory has to be mapped at page granularity (via the mmap() system call) to dynamically reload the pages, which results in a huge overhead.
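As a rough illustration of what on-demand restore looks like (and why per-page mappings hurt when it is combined with incremental images), the sketch below maps each checkpointed page directly from the image file instead of copying it up front; the kernel then pages the data in only when the restored process first touches it. The image layout, helper names, and addresses are our own simplification, not criu's or KUP's actual format.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 4096UL

/*
 * On-demand restore of a single checkpointed region: instead of read()-ing
 * the saved pages into memory (which stalls restart), map them from the
 * image file so each page is faulted in lazily on first access.
 *
 * When the image is a sequence of incremental dumps, every page may live at
 * a different file offset, forcing one mmap() per page - this is the source
 * of the 42.8% overhead mentioned above.
 */
static void restore_region_lazily(int img_fd, uintptr_t va, size_t len,
                                  const off_t *page_off /* per-page offsets */)
{
    for (size_t i = 0; i < len / PAGE_SIZE; i++) {
        void *addr = (void *)(va + i * PAGE_SIZE);
        /* MAP_PRIVATE: later writes stay COW-private to the restored process */
        if (mmap(addr, PAGE_SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_FIXED, img_fd, page_off[i]) == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }
    }
}

int main(void)
{
    /* demo: create a tiny two-page "image" and map it lazily */
    int fd = open("demo.img", O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0 || ftruncate(fd, 2 * PAGE_SIZE) < 0) { perror("image"); return 1; }

    off_t offsets[2] = { 0, (off_t)PAGE_SIZE };
    uintptr_t base = 0x700000000000UL;           /* illustrative fixed address */
    restore_region_lazily(fd, base, 2 * PAGE_SIZE, offsets);

    ((char *)base)[0] = 'x';                     /* first touch faults the page in */
    return 0;
}
```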
3. File offset-based address mapping (FOAM)
In order to effectively use the incremental checkpoint with on-demand restore, KUP uses a simple data structure, called file offset-based address mapping (FOAM), for checkpointing the process' memory. FOAM uses a direct one-to-one mapping between the process' addresses and a huge file, thus representing the whole virtual address space of the process. The free regions in the address space (not allocated to the process) are represented as holes, which are already supported by modern file systems.
By using FOAM, we reap the following benefits:
- There is no longer any metadata maintenance cost.
- There is no data fragmentation issue, i.e., everything gets updated in a single place.
- It simplifies the work needed to enable on-demand restore.
The above table (FOAM column) shows the benefits of using FOAM, as it gives us the best of both worlds.
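Below is a minimal sketch of the FOAM idea, under our own assumptions about the image layout: each page is written into a sparse image file at a file offset equal to its virtual address, so unallocated regions become file holes, incremental dumps simply overwrite the same offsets, and restore can map a whole region with one mmap() instead of one call per page. The helper names and the demo address are illustrative.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 4096UL

/*
 * FOAM-style checkpoint: the image file mirrors the virtual address space,
 * so the page at virtual address `va` always lives at file offset `va`.
 * Untouched regions are never written and stay as holes in the sparse file
 * (this needs a file system that supports large sparse files).
 */
static void foam_save_page(int img_fd, uintptr_t va, const void *page)
{
    if (pwrite(img_fd, page, PAGE_SIZE, (off_t)va) != (ssize_t)PAGE_SIZE) {
        perror("pwrite");
        exit(1);
    }
}

/*
 * FOAM-style on-demand restore: because offsets equal virtual addresses,
 * a whole region is mapped with a single mmap() call, and its pages are
 * still loaded lazily on first access.
 */
static void foam_restore_region(int img_fd, uintptr_t va, size_t len)
{
    if (mmap((void *)va, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_FIXED, img_fd, (off_t)va) == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }
}

int main(void)
{
    /* demo: checkpoint one page and restore a one-page region from it */
    int fd = open("foam.img", O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) { perror("open"); return 1; }

    uintptr_t va = 0x10000000000UL;             /* illustrative address */
    char page[PAGE_SIZE] = "hello";
    foam_save_page(fd, va, page);               /* incremental dumps overwrite in place */
    foam_restore_region(fd, va, PAGE_SIZE);     /* single mapping, lazy paging */
    printf("%s\n", (char *)va);
    return 0;
}
```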
4. Persistent physical pages (PPP)
The bandwidth of the storage medium also plays its part when the application is checkpointed and restarted. To circumvent this, we introduce a RAM-based file system (RP-RAMFS) that stores its data purely in RAM but makes it persistent across reboot. Even though RP-RAMFS seems a viable option, it has a critical limitation: KUP cannot perform C/R on an application whose working set size is more than half of the physical memory.
We overcome the limitation of RP-RAMFS by introducing a new mechanism in the kernel that lets it preserve the application's memory across the update. KUP saves the process' virtual address and physical page pairs, which are then used during restoration. KUP achieves this functionality by introducing two new system calls - preserve(pid, mapinfo, nele) for transferring the data to the kernel while checkpointing, and prestore(mapinfo, nele) during application restart.
The following are the steps that KUP uses for PPP:
- During the checkpoint phase, KUP dumps the virtual-to-physical mapping of the targeted process in userspace and passes the relevant information to the kernel via the preserve syscall.
- Before updating, KUP creates a list of pages that are not to be touched by the new kernel unless specified.
- During the new kernel's boot, KUP globally reserves the required set of requested pages.
- During application restart, using the prestore syscall, KUP passes the information to the page-fault handler, thereby allowing it to correctly rebind the faulted virtual address to the reserved physical page.
This technique not only improves performance by instantly binding pages at page-level granularity, but also avoids redundant memory copies. A rough sketch of how userspace could drive the two syscalls is shown below.
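The snippet below is only a sketch of how a userspace checkpointer might invoke the preserve and prestore syscalls named above; the syscall numbers, the mapinfo layout, the target pid, and the VA/PFN values are all hypothetical, since the actual ABI is not described in this post.

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical syscall numbers for the two new KUP syscalls. */
#define __NR_preserve 400
#define __NR_prestore 401

/* Hypothetical layout of one mapinfo element: a VA/PFN pair of the process. */
struct mapinfo {
    uint64_t vaddr;   /* virtual address of the page */
    uint64_t pfn;     /* physical frame backing it */
};

int main(void)
{
    /* In a real checkpointer these entries would be collected from
       /proc/<pid>/pagemap for every mapped page of the target. */
    struct mapinfo map[2] = {
        { .vaddr = 0x400000, .pfn = 0x1a2b3 },
        { .vaddr = 0x401000, .pfn = 0x1a2b4 },
    };
    pid_t pid = 1234;   /* illustrative target process */

    /* Checkpoint side: hand the VA/PFN pairs to the kernel so the
       physical pages survive the kernel switch. */
    if (syscall(__NR_preserve, pid, map, 2) < 0)
        perror("preserve");

    /* ... kernel switch and application restart happen here ... */

    /* Restore side: give the same pairs to the new kernel so its
       page-fault handler can rebind each faulted VA to its old page. */
    if (syscall(__NR_prestore, map, 2) < 0)
        perror("prestore");

    return 0;
}
```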
Evaluation
It is time to test our design and implementation. We now discuss the effectiveness of our techniques.
The above figure shows the effectiveness of the various techniques - on-demand restore, FOAM, and PPP - on both SSD and RP-RAMFS, using memcached as the application. The y-axis is the network bandwidth and the x-axis is the timeline. At time t=193 seconds, the kernel update is started in each case. As already discussed, PPP is the best approach, with the least downtime of around 2.4 seconds, whereas FOAM is the next best candidate in terms of performance.
The above four figures (a-d) illustrate the impact of using FOAM for both the checkpoint and restore phases. (a) shows the downtime caused by checkpointing with varying WSS. (b) shows the downtime when changing the percentage of writes under a fixed WSS; the overhead of FOAM's incremental approach suddenly becomes high as the write percentage increases. (c) illustrates the advantage of on-demand restore with increasing WSS. (d) shows the performance of PPP and FOAM (on RP-RAMFS and SSD) with varying WSS (50% writes) up to 72 GB, which is larger than half of the system's memory (128 GB). RP-RAMFS fails at 56 GB because it requires free RAM space for checkpointing the process and therefore cannot support a large WSS. In contrast, PPP can efficiently support applications with a large WSS.
There are other aspects that we have not covered here, including testing with other applications, a comparison of KUP against kpatch, and the kernel switch downtime. We will go into detail in our next blog post, along with some other cool ideas.
[1] criu: http://www.criu.org/Main_Page
[2] Amazon.com Goes Down, Loses $66,240 Per Minute. http://www.forbes.com/sites/kellyclay/2013/08/19/amazon-com-goes-down-loses-66240-per-minute
Proofread by Taesoo Kim and Changwoo Min.