An automated translation process converts plain old Java objects directly into Corfu objects through both runtime and compile-time transformation of code. This allows programmers to quickly adapt existing code to run on top of Corfu. The Corfu runtime also provides strong support for transactions, which enable multiple applications to read and modify objects without relaxing consistency guarantees. We show that with stream materialization, Corfu can store large amounts of state while supporting strong consistency and transactions. In Chapter 6, we describe our experience in both writing new applications and adapting existing applications to Corfu. We start by building an adapter for Apache ZooKeeper clients to run on top of Corfu, then describe the implementation of Silver, a new distributed file system which leverages the power of the vCorfu stream store. We then conclude the chapter by describing our efforts to retrofit a large and complex application, a software-defined network switch controller, and detail how the strong transaction model and rich object interface greatly reduce the burden on distributed system programmers. Finally, Chapter 7 summarizes the findings from the previous chapters and concludes the dissertation.
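As a concrete illustration of the object interface and transaction model described above, the sketch below shows roughly how a client might open a Corfu-backed map and update it inside a transaction. The class and method names (CorfuRuntime, getObjectsView, SMRMap, TXBegin, TXEnd) follow the general shape of the CorfuDB Java client, but the exact signatures, the connection string, and the stream name are illustrative assumptions rather than a definitive API reference.

```java
import java.util.Map;

import org.corfudb.runtime.CorfuRuntime;          // assumed client entry point
import org.corfudb.runtime.collections.SMRMap;    // assumed runtime-transformed map type

public class CorfuObjectExample {
    public static void main(String[] args) {
        // Connect to a Corfu deployment; the address is a placeholder.
        CorfuRuntime runtime = new CorfuRuntime("localhost:9000").connect();

        // Open a plain Map interface whose state is backed by a Corfu stream.
        Map<String, String> accounts = runtime.getObjectsView()
                .build()
                .setStreamName("accounts")   // hypothetical stream name
                .setType(SMRMap.class)
                .open();

        // Group reads and writes into a transaction; conflicting
        // transactions abort rather than expose inconsistent state.
        runtime.getObjectsView().TXBegin();
        String balance = accounts.get("alice");
        accounts.put("alice", balance == null ? "100" : balance);
        runtime.getObjectsView().TXEnd();
    }
}
```

Because the object behaves like an ordinary Map, existing code that manipulates in-memory collections can be pointed at Corfu with little modification.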
Before the cloud era, system designers focused on building systems which provided increasingly stronger guarantees. Problems such as consensus and Byzantine fault tolerance were at the forefront of distributed systems research. Stronger guarantees simplify programming complex, unreliable distributed systems, and few would consider relaxing those guarantees. Indeed, it was often possible to avoid distributed algorithms altogether by storing all data on a single, centralized server. Feature-rich databases could coordinate the data needs of an entire system, and support for transactions and complex queries made it easy for multiple clients to safely and concurrently operate on data.

The cloud era and “big data”, however, changed the landscape. Suddenly, system designers had to deal with workloads they had never dealt with before – workloads that no longer fit within the confines of a single machine. The focus became scalability, and the quickest way to achieve it was to split the data that used to fit in a single database across multiple machines, a process known as sharding or partitioning. With partitioning, the system can utilize the aggregate throughput of all the machines the data has been partitioned across to service requests. Unfortunately, partitioning was not a panacea. Splitting data across multiple nodes made the features programmers had come to depend on, such as complex queries, transactions, and consistency, difficult to support. Strong consistency was seen as the enemy, and the NoSQL movement sought to relax consistency as much as possible in order to achieve maximum scalability. Consistency guarantees were greatly relaxed, and support for transactions and queries across multiple partitions was often dropped entirely.
Key-value stores, which have a greatly simplified interface, rose to prominence, providing unparalleled scalability but placing the burden of maintaining consistency across partitions on the programmer. Migrating applications from SQL databases to early NoSQL stores was difficult and bug-prone for all but the simplest applications. In order to build feature-rich, reliable distributed applications, programmers really needed the capabilities of a traditional database with strong guarantees. System designers began to make compromises, trading off some scalability in order to provide some guarantees. Relaxed consistency models such as eventual consistency and causal consistency emerged to fill the gap, but these models increased the burden on the programmer further by forcing them to reason about an unconventional consistency model. Yet other systems used multi-version concurrency control on key-value stores with protocols such as two-phase locking, or specialized hardware, to ensure consistency, at the cost of a significant performance penalty and added system complexity.

Since the beginning of the cloud era, developers have thought of consistency and scalability as mutually exclusive: scalable systems must sacrifice consistency, and strongly consistent systems cannot scale. This line of thought surfaced when system designers were pushed into a corner to deliver as much performance as possible. Shedding consistency did indeed make their systems more scalable, but programmers paid the price, sometimes with catastrophic results. System designers reacted by retrofitting consistency back on, which resulted in performance losses and complexity as protocols had to work around the strict partitioning that had been added for scalability.
In the next chapters, we will explore Corfu, which takes a clean-slate approach. Instead of tacking on consistency, Corfu addresses scalability and consistency at its core, providing a strongly consistent yet scalable fabric, the Corfu distributed log, which applications leverage through the use of an object-oriented interface. One motivation has changed significantly, however. Corfu was originally designed for flash memory, and the shared log was cast as a distributed flash translation layer for networked devices which exposed raw flash. Flash memory has a unique idiosyncrasy: although random reads are fast, the ideal write pattern is sequential due to the cost of garbage collection. The Corfu log takes all the writes across the data center and transforms them into sequential writes, enabling Corfu to consume the full write throughput of flash devices. It turns out Corfu works well on disk as well: due to the nature of the log, random reads on the log are rare and clients typically read the tail of the log. In the next chapters, we will see that the Corfu log works well irrespective of whether the underlying storage devices are flash or disk.

The Corfu storage unit is now referred to as a logging unit, and the projection is now a component of a layout which describes the entire Corfu system. In this chapter, we will use the older terminology. We begin by introducing the Corfu distributed log and describing its interface and implementation in Section 3.3. Next, in Section 3.4 we describe a hardware-accelerated implementation of the log on an FPGA, and then conclude in Section 3.5 with an evaluation of the performance of the log.

Traditionally, system designers have been led to believe that the only way to scale up reliable data stores, whether on-premises or cloud-hosted, is to shard the database. In this manner, recent systems like Percolator, Megastore, WAS and Spanner are able to drive parallel I/O across enormous fleets of storage machines. Unfortunately, these designs defer to costly mechanisms like two-phase locking or centralized concurrency managers in order to provide strong consistency across partitions. At the same time, consensus protocols like Paxos have been used as a building block for reliable distributed systems, typically deployed on a small number of servers, to provide a lever for a variety of consistent services: virtual block devices, replicated storage systems, lock services, coordination facilities, configuration management utilities, and transaction coordinators. As these successful designs have become pillars of today’s data centers and cloud back-ends, there is a growing recognition of the need for these systems to scale in the number of machines, storage volume, and bandwidth. Internally, the Corfu distributed shared log is distributed over a cluster of machines and is fully replicated for fault tolerance, without sharding data or sacrificing global consistency. The Corfu distributed shared log allows hundreds of concurrent client machines in a large data center to append to the tail of a shared log and read from it over a network.
Nevertheless, there is no single I/O bottleneck to either appends or reads, and the aggregated cluster throughput may be utilized by clients accessing the log.

Historically, shared log designs have appeared in a diverse array of systems. QuickSilver and Camelot used shared logs for failure atomicity and node recovery. LBRM uses shared logs for recovery from multicast packet loss. Shared logs are also used in distributed storage systems for consistent remote mirroring. In such systems, Corfu fits the shared log role perfectly, pooling together the aggregate cluster resources for higher throughput and lower latency. Moreover, a shared log is a panacea for replicated transactional systems. For instance, Hyder is a recently proposed high-performance database designed around a shared log, where servers speculatively execute transactions by appending them to the shared log and then use the log order to decide commit/abort status. In fact, Hyder was the original motivating application for Corfu and has been fully implemented over our code base. Interestingly, a shared log can also be used as a consensus engine, providing functionality identical to consensus protocols such as Paxos. Used in this manner, Corfu provides a fast, fault-tolerant service for imposing and durably storing a total order on events in a distributed system. From this perspective, Corfu can be used as a drop-in replacement for existing Paxos implementations, with far better performance than previous solutions.

So far, we have argued the power and hence the desirability of a shared log. The key to its success is high performance, which we realize through a paradigm shift from existing cluster storage designs. In Corfu, each position in the shared log is projected onto a set of storage pages on different storage units. The projection map is maintained – consistently and compactly – at the clients. To read a particular position in the shared log, a client uses its local copy of this map to determine a corresponding physical storage page, and then directly issues a read to the storage unit storing that page. To append data, a client first determines the next available position in the shared log – using a sequencer node as an optimization for avoiding contention with other appending clients – and then writes data directly to the set of physical storage pages mapped to that position. In this way, the log in its entirety is managed without a leader, and Corfu circumvents the throughput cap of any single storage node. Instead, we can append data to the log at the aggregate bandwidth of the cluster, limited only by the speed at which the sequencer can assign clients 64-bit tokens, i.e., new positions in the log. The evaluation for this chapter was done with a user-space sequencer serving 200K tokens/s, whereas our more recent user-space sequencer is capable of 500K tokens/s. Moreover, we can support reads at the aggregate cluster bandwidth. Essentially, Corfu’s design decouples ordering from I/O, extracting parallelism from the cluster for all I/O while providing single-copy semantics for the shared log. Naturally, the real throughput clients obtain may depend on application workloads. However, we argue that Corfu provides excellent throughput in many realistic scenarios. Corfu’s append protocol generates a nearly perfect sequential write load, which can be turned into a high-throughput, purely sequential access pattern at the storage units with very little buffering effort.
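The sketch below illustrates this client-driven append/read path. The Sequencer and StorageUnit interfaces, the round-robin projection, and the single-replica write are simplifying assumptions made for exposition; the actual Corfu protocol writes each position to a replica set and uses a richer projection structure.

```java
import java.util.List;

// A minimal sketch of the client-side protocol described above, under the
// simplifying assumptions named in the text (single replica per position,
// round-robin projection).
public class CorfuClientSketch {

    interface Sequencer {
        long nextToken();                    // reserve the next free log position
    }

    interface StorageUnit {
        void write(long page, byte[] data);  // write-once storage page
        byte[] read(long page);
    }

    private final Sequencer sequencer;
    private final List<StorageUnit> units;   // the client-local projection

    CorfuClientSketch(Sequencer sequencer, List<StorageUnit> units) {
        this.sequencer = sequencer;
        this.units = units;
    }

    // Map a global log position to a storage unit and a page on that unit.
    private StorageUnit unitFor(long position) {
        return units.get((int) (position % units.size()));
    }

    private long pageFor(long position) {
        return position / units.size();
    }

    /** Append: obtain a token, then write directly to the mapped storage unit. */
    public long append(byte[] data) {
        long position = sequencer.nextToken();   // ordering only; no data flows through it
        unitFor(position).write(pageFor(position), data);
        return position;
    }

    /** Read: consult the local projection and issue the read directly. */
    public byte[] read(long position) {
        return unitFor(position).read(pageFor(position));
    }
}
```

Note that the sequencer hands out positions but never sees payload data, which is why ordering is decoupled from I/O and the sequencer does not become a bandwidth bottleneck.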
Meanwhile, reads can be served from a memory cache or from a cold standby. This leaves non-sequential accesses which may result from log compaction procedures. Because we design for large clusters, compaction need not be performed during normal operations, and may be fulfilled by switching between sets of active storage drives. Finally, we originally argued for the use of nonvolatile flash memory as Corfu storage units, which alleviates altogether any performance degradation due to random accesses. Indeed, our present evaluation of Corfu is carried out on servers equipped with SSD drives.

The last matter we need to address is efficient failure handling. When storage units fail, clients must move consistently to a new projection from log positions to storage pages. Corfu achieves this via a reconfiguration mechanism capable of restoring availability within tens of milliseconds on drive failures. A challenging failure mode peculiar to a client-centric design involves ‘holes’ in the log; a client can obtain a log position from the sequencer for an append and then crash without completing the write to that position. To handle such situations, Corfu provides a fast hole-filling primitive that allows other clients to complete an unfinished append within a millisecond, as illustrated by the sketch at the end of this section.

The system setting for Corfu is a data center with a large number of application servers and a cluster of storage units. Our goal is to provide applications running on the clients with a shared log abstraction implemented over the storage cluster. Our design for this shared log abstraction aims to drive appends and random reads to/from the log at the aggregate throughput that the storage cluster provides, while avoiding bottlenecks in the distributed software design.
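Returning to the hole-filling primitive mentioned above, the sketch below shows one way a reader might handle an unwritten position. The ReadStatus values, the fill() call, and the grace period are assumptions made for exposition; the real logging-unit protocol differs in detail, but the write-once property of each position is what makes the race between the original writer and the filler safe.

```java
// Illustrative sketch of hole filling, under the assumptions named above.
public class HoleFillingSketch {

    enum ReadStatus { DATA, EMPTY, FILLED }

    interface LoggingUnit {
        ReadStatus read(long position);
        // Writes a junk/fill marker; fails if data was already written,
        // because each position is write-once.
        boolean fill(long position);
    }

    /**
     * A reader that encounters an unwritten position (a token was issued but the
     * appending client may have crashed) waits briefly, then fills the hole so
     * that the log can keep advancing with a defined value at every position.
     */
    static ReadStatus readOrFill(LoggingUnit unit, long position, long graceMillis)
            throws InterruptedException {
        ReadStatus status = unit.read(position);
        if (status != ReadStatus.EMPTY) {
            return status;                       // data or an earlier fill is already there
        }
        Thread.sleep(graceMillis);               // give the original writer a brief grace period
        if (unit.read(position) == ReadStatus.EMPTY && unit.fill(position)) {
            return ReadStatus.FILLED;            // the position is now permanently junk
        }
        return unit.read(position);              // the writer (or another filler) won the race
    }
}
```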