The "Monster" VMs

Today, cloud service providers are provisioning VMs with high core counts coupled with large amounts of memory; for example, Microsoft Azure's largest (aka "baby monster") VM instance consists of 32 vCPUs with 488 GB of RAM. Similar configurations are also available from other cloud service providers such as Amazon EC2 and Google Compute Engine (GCE). With these instances, the cloud service providers are targeting applications that require a large amount of memory along with heavy computation, such as big-data processing and in-memory databases (e.g., SAP HANA). The provisioning of such large VMs has been made possible by the ready availability of cost-effective, large multicore machines, and this trend will continue as the number of cores per commodity CPU increases (e.g., SPARC M7-based servers already scale to hundreds of cores).

But the questions we would like to answer are the following:

  • What are the scalability characteristics of the large VM instances provided by the cloud service providers?
  • Is the underlying virtualization technology really scalable enough to support a VM with hundreds of vCPUs in the future?

We try to answer these questions in the following sections by running a Linux kernel compile, an embarrassingly parallel job that end users might expect to scale vertically, on popular cloud services, and by then replicating the same experiment on our 80-core machine to project the scalability trend.

1) Cloud Scalability

[Figure: Linux kernel compile scalability on Amazon EC2, Google Compute Engine, and Microsoft Azure]

We executed the Linux kernel compile on three popular cloud service providers: Amazon EC2, Google Compute Engine, and Microsoft Azure. All the VMs scale clearly up to 16 vCPUs, and both the EC2 and GCE instances degrade beyond 16 vCPUs. This degradation occurs because the 32-vCPU VMs are backed by hyperthreads. We confirmed this behavior by running the same experiment in our lab on a 16-core E5-2630 v3 machine with a configuration similar to the one used by these cloud service providers. Interestingly, Azure scales well beyond 16 vCPUs, exhibiting near-ideal scalability. We believe this is because Azure backs its vCPUs with physical cores rather than logical (hyperthreaded) ones.

2) Monster VMs and their scalability characteristics

[Figure: Linux kernel compile scalability of PVM and HVM on our 80-core E7-8870 machine]

We replicated the same experiment on our 80-core E7-8870 machine with two kinds of VMs: a highly optimized paravirtualized VM (PVM) and a hardware-assisted VM (HVM). In theory, the performance of PVM should be similar to or better than that of HVM. However, this trend breaks once the vCPU count grows beyond 20. The behavior is counterintuitive and is observed only when the number of vCPUs exceeds 20 physical cores. We classify this problem as the sleepy lock anomaly; it is caused by the paravirtual interface in the ticket spinlock implementation, which was introduced to solve the lock holder preemption (LHP) problem without relying on architectural support (i.e., Pause Loop Exiting, PLE).

Currently, the paravirtual interface complements PLE and boosts performance compared with HVM. During lock acquisition, a lock waiter first spins on the lock for a fixed number of loop iterations (fast path); if it fails to acquire the lock, it voluntarily yields to other vCPUs by issuing a halt instruction (slow path). When the lock is released, the next waiter is woken up: the releasing vCPU kicks the next waiting vCPU based on its ticket value. However, the figure above shows a sudden performance drop, caused by a steep increase in halt exits from the 30th vCPU onwards. At that point, contention among the vCPUs becomes so high that the lock acquisition time increases drastically; most vCPUs fail to acquire the lock in the fast path and trap to the hypervisor at roughly the same time. From then on, the cost of waking up the other vCPUs starts to dominate, which results in performance collapse.
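
For intuition, the following is a minimal userspace analogy of this protocol (our sketch, not kernel code): a bounded spin stands in for the fast path, a futex wait stands in for the halt (trap to the hypervisor), and a futex wake stands in for the kick. The names (pv_ticket_lock, SPIN_BUDGET) are ours, and waking all sleepers is a simplification; the kernel kicks only the vCPU holding the next ticket.

  #define _GNU_SOURCE
  #include <limits.h>
  #include <stdatomic.h>
  #include <stdint.h>
  #include <linux/futex.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #define SPIN_BUDGET (1 << 15)   /* fast-path budget, analogous to SPIN_THRESHOLD */

  struct pv_ticket_lock {
      _Atomic uint32_t head;      /* ticket currently being served (zero-initialize) */
      _Atomic uint32_t tail;      /* next ticket to hand out (zero-initialize) */
  };

  static long futex(_Atomic uint32_t *uaddr, int op, uint32_t val)
  {
      return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
  }

  void pv_ticket_lock_acquire(struct pv_ticket_lock *l)
  {
      uint32_t me = atomic_fetch_add(&l->tail, 1);

      for (;;) {
          /* fast path: spin for a bounded number of iterations */
          for (int i = 0; i < SPIN_BUDGET; i++)
              if (atomic_load(&l->head) == me)
                  return;

          /* slow path: "halt" until the unlocker "kicks" us; re-reading
           * head right before sleeping avoids a lost wake-up */
          uint32_t h = atomic_load(&l->head);
          if (h == me)
              return;
          futex(&l->head, FUTEX_WAIT, h);
      }
  }

  void pv_ticket_lock_release(struct pv_ticket_lock *l)
  {
      atomic_fetch_add(&l->head, 1);
      /* "kick": wake sleepers so they re-check whose turn it is */
      futex(&l->head, FUTEX_WAKE, INT_MAX);
  }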

Our Fix

To solve this issue for virtualized instances, we design and implement a variant of the spinlock, called opportunistic ticket spinlock (oticket), that exploits the distance between the lock holder and a lock waiter: in the ticket spinlock implementation, the time until a waiter acquires the lock is roughly proportional to that distance. We introduce two new schemes to tackle the performance degradation:

  • Opportunistic spinning: The right spin duration for the fast path depends on the workload and the hardware configuration. During the lock phase, we therefore determine the spin duration dynamically from the distance between the lock waiter and the lock holder. Nearer waiters opportunistically spin for a longer duration, hoping to avoid the costly switch between the guest OS and the hypervisor. Conversely, farther waiters spin for a shorter duration and yield early, giving the lock holder more chance to make progress.
  • Opportunistic wake-up: During the unlock phase, the next N waiters are woken up. This amortizes the vCPU wake-up latency incurred in a virtualized environment.

The following is the modified code snippet of the paravirtual ticket spinlock implementation in Linux kernel version 4.0; lines marked with + are our additions:

  #define SPIN_THRESHOLD     (1 << 15)
+ #define SPIN_MAX_THRESHOLD (1UL << 34)
+ #define TICKET_QUEUE_WAIT  (18)
+ #define EAGER_WAKEUP_NCPU  (4)

+ static __always_inline
+ unsigned int __ticket_distance(__ticket_t head, __ticket_t tail)
+ {
+   return (tail - (head & ~TICKET_SLOWPATH_FLAG))
+     / TICKET_LOCK_INC;
+ }

  static __always_inline
  void arch_spin_lock(arch_spinlock_t *lock)
  {
    register struct __raw_tickets inc = {.tail = TICKET_LOCK_INC};
+   unsigned int dist;

    /* default threshold set in Linux */
    u64 threshold = SPIN_THRESHOLD;

    /* try locking */
    inc = xadd(&lock->tickets, inc);
    if (likely(inc.head == inc.tail))
      goto out;

+   /* opportunistically determine the spinning threshold */
+   dist = __ticket_distance(inc.head, inc.tail);
+   if (dist < TICKET_QUEUE_WAIT)
+     threshold = SPIN_MAX_THRESHOLD >> (dist - 1);

    for (;;) {
      /* spinning (fast path) */
      u64 count = threshold;
      do {
        inc.head = READ_ONCE(lock->tickets.head);
        if (__tickets_equal(inc.head, inc.tail))
          goto clear_slowpath;
        cpu_relax();
      } while (--count);

      /* yield (slow path) */
      __ticket_lock_spinning(lock, inc.tail);
    }

  clear_slowpath:
    __ticket_check_and_clear_slowpath(lock, inc.head);
  out:
    barrier();
  }

  static __always_inline
  void arch_spin_unlock(arch_spinlock_t *lock)
  {
    if (TICKET_SLOWPATH_FLAG &&
        static_key_false(&paravirt_ticketlocks_enabled)) {
      __ticket_t head;

      head = xadd(&lock->tickets.head, TICKET_LOCK_INC);

      if (unlikely(head & TICKET_SLOWPATH_FLAG)) {
        u8 count;
        head &= ~TICKET_SLOWPATH_FLAG;

+       /* opportunistic wake-up: kick the next N waiters */
+       for (count = 1; count <= EAGER_WAKEUP_NCPU; ++count)
+         __ticket_unlock_kick(lock,
+                     (head + count * TICKET_LOCK_INC));
      }
    } else
      __add(&lock->tickets.head,
            TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
  }
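
To make the effect of opportunistic spinning concrete, the short standalone sketch below (ours, not kernel code) reproduces the threshold selection from arch_spin_lock() above: the nearest waiter (dist = 1) may spin up to 2^34 iterations, the 17th waiter up to 2^18, and any waiter 18 or more tickets away falls back to the stock 2^15 budget.

  /* Standalone illustration of the threshold selection above (assumes
   * a 64-bit unsigned long); constants mirror the patch. */
  #include <stdio.h>

  #define SPIN_THRESHOLD     (1UL << 15)
  #define SPIN_MAX_THRESHOLD (1UL << 34)
  #define TICKET_QUEUE_WAIT  (18)

  /* spin budget for a waiter `dist` tickets behind the lock holder */
  static unsigned long spin_threshold(unsigned int dist)
  {
      if (dist < TICKET_QUEUE_WAIT)
          return SPIN_MAX_THRESHOLD >> (dist - 1);  /* nearer => larger budget */
      return SPIN_THRESHOLD;                        /* far waiters: stock budget */
  }

  int main(void)
  {
      for (unsigned int dist = 1; dist <= 20; dist++)
          printf("dist=%2u  spin budget=%lu\n", dist, spin_threshold(dist));
      return 0;
  }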

Performance comparison

[Figure: Kernel compile scalability of oticket compared with stock PVM, qspinlock, and HVM]

In the figure above, we compare our oticket implementation with the stock ticket spinlock implementation (PVM) and with qspinlock, a queue-based spinlock designed to further reduce cacheline contention. oticket clearly outperforms PVM and shows the same scalability trend as HVM. qspinlock performs almost the same as PVM: its implementation also relies on a fast-path/slow-path approach and therefore suffers from the same sleepy lock anomaly as PVM.

[Figure: Kernel compile scalability of oticket vs. stock PVM with two co-scheduled VMs (oversubscribed case)]

The figure above shows the Linux kernel compilation time inside a VM that is co-scheduled with another VM, with both VMs compiling the kernel simultaneously. We compare our oticket implementation against the existing stock version (PVM). oticket still outperforms PVM, but it also suffers from a performance collapse beyond 40 vCPUs. We suspect that PLE and oticket counteract each other in an oversubscribed environment.

Conclusion

We have taken a first step toward analyzing the scalability behavior of VMs with high core counts (up to 80 cores). Our preliminary study suggests that, besides the cache contention bottleneck, the use of ticket spinlocks is another culprit behind the degradation. As a next step, we will analyze the scalability characteristics of these monster VMs by running other benchmark suites from Mosbench.