If you've read anything about scaling large websites, you've probably heard about memcached. memcached is a high-performance, distributed memory object caching system. Here at Facebook, we're likely the world's largest user of memcached. We use memcached to alleviate database load. memcached is already fast, but we need it to be faster and more efficient than most installations. We use more than 800 servers supplying over 28 terabytes of memory to our users. Over the past year as Facebook's popularity has skyrocketed, we've run into a number of scaling issues. This ever increasing demand has required us to make modifications to both our operating system and memcached to achieve the performance that provides the best possible experience for our users.
Because we have thousands and thousands of computers, each running a hundred or more Apache processes, we end up with hundreds of thousands of TCP connections open to our memcached processes. The connections themselves are not a big problem, but the way memcached allocates memory for each TCP connection is. memcached uses a per-connection buffer to read and write data out over the network. When you get into hundreds of thousands of connections, this adds up to gigabytes of memory-- memory that could be better used to store user data. To reclaim this memory for user data, we implemented a per-thread shared connection buffer pool for TCP and UDP sockets. This change enabled us to reclaim multiple gigabytes of memory per server.
Although we improved the memory efficiency with TCP, we moved to UDP for get operations to reduce network traffic and implement application-level flow control for multi-gets (gets of hundreds of keys in parallel). We discovered that under load on Linux, UDP performance was downright horrible. This is caused by considerable lock contention on the UDP socket lock when transmitting through a single socket from multiple threads. Fixing the kernel by breaking up the lock is not easy. Instead, we used separate UDP sockets for transmitting replies (with one of these reply sockets per thread). With this change, we were able to deploy UDP without compromising performance on the backend.
Another issue we saw in Linux is that under load, one core would get saturated, doing network soft interrupt handing, throttling network IO. In Linux, a network interrupt is delivered to one of the cores, consequently all receive soft interrupt network processing happens on that one core. Additionally, we saw an excessively high rate of interrupts for certain network cards. We solved both of these by introducing “opportunistic” polling of the network interfaces. In this model, we do a combination of interrupt driven and polling driven network IO. We poll the network interface anytime we enter the network driver (typically for transmitting a packet) and from the process scheduler’s idle loop. In addition, we also take interrupts (to keep latencies bounded) but we take far fewer network interrupts (typically by setting interrupt coalescing thresholds aggressively). Since we do network transmission on every core and since we poll for network IO from the scheduler’s idle loop, we distribute network processing evenly across all cores.
Finally, as we started deploying 8-core machines and in our testing, we discovered new bottlenecks. First, memcached's stat collection relied on a global lock. A nuisance with 4 cores, with 8 cores, the lock now accounted for 20-30% of CPU usage. We eliminated this bottleneck by moving stats collection per-thread and aggregating results on-demand. Second, we noticed that as we increased the number of threads transmitting UDP packets, performance decreased. We found significant contention on the lock that protects each network device’s transmit queue. Packets are enqueued for transmission and dequeued by the device driver. This queue is managed bv Linux’s “netdevice” layer that sits in-between IP and device drivers. Packets are added and removed from the queue one at a time, causing significant contention. One of our engineers changed the dequeue algorithm to batch dequeues for transmit, drop the queue lock, and then transmit the batched packets. This change amortizes the cost of the lock acquisition over many packets and reduces lock contention significantly, allowing us to scale memcached to 8 threads on an 8-core system.
Since we’ve made all these changes, we have been able to scale memcached to handle 200,000 UDP requests per second with an average latency of 173 microseconds. The total throughput achieved is 300,000 UDP requests/s, but the latency at that request rate is too high to be useful in our system. This is an amazing increase from 50,000 UDP requests/s using the stock version of Linux and memcached.
We’re hoping to get our changes integrated into the official memcached repository soon, but until that happens, we’ve decided to release all our changes to memcached on github.
Because we have thousands and thousands of computers, each running a hundred or more Apache processes, we end up with hundreds of thousands of TCP connections open to our memcached processes. The connections themselves are not a big problem, but the way memcached allocates memory for each TCP connection is. memcached uses a per-connection buffer to read and write data out over the network. When you get into hundreds of thousands of connections, this adds up to gigabytes of memory-- memory that could be better used to store user data. To reclaim this memory for user data, we implemented a per-thread shared connection buffer pool for TCP and UDP sockets. This change enabled us to reclaim multiple gigabytes of memory per server.
Although we improved the memory efficiency with TCP, we moved to UDP for get operations to reduce network traffic and implement application-level flow control for multi-gets (gets of hundreds of keys in parallel). We discovered that under load on Linux, UDP performance was downright horrible. This is caused by considerable lock contention on the UDP socket lock when transmitting through a single socket from multiple threads. Fixing the kernel by breaking up the lock is not easy. Instead, we used separate UDP sockets for transmitting replies (with one of these reply sockets per thread). With this change, we were able to deploy UDP without compromising performance on the backend.
Another issue we saw in Linux is that under load, one core would get saturated, doing network soft interrupt handing, throttling network IO. In Linux, a network interrupt is delivered to one of the cores, consequently all receive soft interrupt network processing happens on that one core. Additionally, we saw an excessively high rate of interrupts for certain network cards. We solved both of these by introducing “opportunistic” polling of the network interfaces. In this model, we do a combination of interrupt driven and polling driven network IO. We poll the network interface anytime we enter the network driver (typically for transmitting a packet) and from the process scheduler’s idle loop. In addition, we also take interrupts (to keep latencies bounded) but we take far fewer network interrupts (typically by setting interrupt coalescing thresholds aggressively). Since we do network transmission on every core and since we poll for network IO from the scheduler’s idle loop, we distribute network processing evenly across all cores.
Finally, as we started deploying 8-core machines and in our testing, we discovered new bottlenecks. First, memcached's stat collection relied on a global lock. A nuisance with 4 cores, with 8 cores, the lock now accounted for 20-30% of CPU usage. We eliminated this bottleneck by moving stats collection per-thread and aggregating results on-demand. Second, we noticed that as we increased the number of threads transmitting UDP packets, performance decreased. We found significant contention on the lock that protects each network device’s transmit queue. Packets are enqueued for transmission and dequeued by the device driver. This queue is managed bv Linux’s “netdevice” layer that sits in-between IP and device drivers. Packets are added and removed from the queue one at a time, causing significant contention. One of our engineers changed the dequeue algorithm to batch dequeues for transmit, drop the queue lock, and then transmit the batched packets. This change amortizes the cost of the lock acquisition over many packets and reduces lock contention significantly, allowing us to scale memcached to 8 threads on an 8-core system.
Since we’ve made all these changes, we have been able to scale memcached to handle 200,000 UDP requests per second with an average latency of 173 microseconds. The total throughput achieved is 300,000 UDP requests/s, but the latency at that request rate is too high to be useful in our system. This is an amazing increase from 50,000 UDP requests/s using the stock version of Linux and memcached.
We’re hoping to get our changes integrated into the official memcached repository soon, but until that happens, we’ve decided to release all our changes to memcached on github.