The system takes about a millisecond to detect this failure and reconfigure into degraded mode

Clients may have multiple views of the same object: for example, a client may have a local view of the present state of the object with a remote view of a past version of the object, enabling the client to operate against a snapshot.Transactions enable developers to issue multiple operations which either succeed or fail atomically. Transactions are a pain point for partitioned data stores since a transaction may span across multiple partitions, requiring locking or schemes such as 2PL or MVCC to achieve consistency. vCorfu leverages atomic multi-stream appends and global snapshots provided by the log, and exploits the sequencer as a lightweight transaction manager. Transaction execution is optimistic, similar to transactions in shared log systems, which leverage the total ordering provided by the log. However, since our sequencer supports conditional token issuance, we avoid polluting the log with transactional aborts. To execute a transaction, a client informs the runtime that it wishes to enter a transactional context by calling TXBegin. The client obtains the most recently issued log token once from the sequencer and begins optimistic execution by modifying reads to read from a snapshot at that point.

Writes are buffered into a write buffer. When the client ends the transaction by calling TXEnd,frambuesa y mora the client checks if there are any writes in the write buffer. If there are not, then the client has successfully executed a read-only transaction and ends transactional execution. If there are writes in the write buffer, the client informs the sequencer of the log token it used and the streams which will be affected by the transaction. If the streams have not changed, the sequencer issues log and stream tokens to the client, which commits the transaction by writing the write buffer. Otherwise, the sequencer issues no token and the transaction is aborted by the client without writing an entry into the log. This important optimization ensures only committed entries are written to the log, so that when a client encounters a transactional commit entry, it may treat it as any other update. In other shared log systems, each client must determine whether a commit record succeeds or aborts, either by running the transaction locally or looking for a decision record. In vCorfu, we have designed transactional support to be as general as possible and to minimize the amount of work that clients must perform to determine the result of a transaction. We treat each object as an opaque object, since fine-grained conflict resolution would either require the client resolve conflicts or a much more heavyweight sequencer. Opacity is ensured by always operating against the same global snapshot, leveraging the history provided by the log.

Opacity is a stronger guarantee than strict serializability as opacity prevents programmers from observing inconsistent state. Since global snapshots are expensive in partitioned systems, these systems typically provide only a weaker guarantee, allowing programs to observe inconsistent state but guaranteeing that such transactions will be aborted. Read-own-writes is another property which vCorfu provides: transactional reads will also apply any writes in the write buffer. Many other systems do not provide this property since it requires writes to be applied to data items. The SMR paradigm, however, enables vCorfu to generate the result of a write in-memory, simplifying transactional programming. vCorfu fully supports nested transactions, where a transaction may begin and end within a transaction. Whenever transaction nesting occurs, vCorfu buffers each transaction’s write set and the transaction takes the timestamp of the outermost transaction.vCorfu supports several mechanisms for finding and retrieving objects. First, a developer can use vCorfu like a traditional key-value store just by using the stream id for object as a key. We also support a much richer query model: a set of collections, which resemble the Java collections are provided for programmers to store and access objects in. These collections are objects just like any other vCorfu object, so developers are free to implement their own collection. Developers can take advantage of multiple views on the same collection: for instance a List can be viewed as a Queue or a Stack simultaneously. Some of the collections we provide include a List, Queue, Stack, Map, and RangeMap.

Collections, however, tend to be very large objects which are highly contended. In the next section, we discuss composable state machine replication, a technique which allows vCorfu to build a collection out of multiple objects.In vCorfu, objects may be composed of other objects, a technique which we refer to as composable state machine replication. The simplest example of CSMR is a hash map composed of multiple hash maps, but much more sophisticated objects can be created. Composing SMR objects has several important advantages. First, CSMR divides the state of a single object into several smaller objects, which reduces the amount of state stored at each materialized stream. Second, smaller objects reduce contention and false sharing, providing for higher concurrency. Finally, CSMR resembles how data structures are constructed in memory – this allows us to apply standard data structure principles to vCorfu. For example, a B-tree constructed using CSMR would result in a structure with O time complexity for search, insert and delete operations. This opens a plethora of familiar data structures to developers. Programmers manipulate CSMR objects just as they would any other vCorfu object. A CSMR object starts with a base object, which defines the interface that a developer will use to access the object. An example of a CSMR hash map is shown in Figure 5.5. The base object manipulates child objects, which usually store the actual data. Child objects may just reuse standard vCorfu objects, like a hash map, or they may be custom-tailored for the CSMR object, like a B-tree node. In the example CSMR map shown in Figure 5.5,frambueso maceta the object shown is the base object and the child objects are standard SMR maps. The number of buckets is set at creation time in the num Buckets variable. Two helper functions, get Child Number and get Child help the base object locate child objects deterministically. In our CSMR map, we use the Luby-Rakoff algorithm to obtain an improved key distribution over the standard Java hashCode function. Most operations such as get and put operate as before, and the base object needs to only select the correct child to operate on. However, some operations such as size and clear touch all child objects. The vCorfu log provides fast access to snapshots of arbitrary objects, and the ability to open remote views, which avoids the cost of playback, enables clients to quickly traverse CSMR objects without reading many updates or storing large amounts of local state. In a more complex CSMR object, such as our CSMR B-tree, the base object and the child object may have completely different interfaces. In the case of the B-tree, the base object presents a map-like interface, while the child objects are nodes which contain either keys or references to other child objects. Unlike a traditional B-tree, every node in the CSMR B-tree is versioned like any other object in vCorfu. CSMR takes advantage of this versioning when storing a reference to a child object: instead of storing a static pointer to particular versions of node, as in a traditional B-tree, references in vCorfu are dynamic. Normally, references point to the latest version of an object, but they may point to any version during a snap shotted read, allowing the client to read a consistent version of even the most sophisticated CSMR objects. With dynamic pointers, all pointers are implicitly updated when an object is updated, avoiding a problem in traditional trees, where an update to a single child node can cause an update cascade requiring all pointers up to the root to be explicitly updated, known as the recursive update problem.

The design of vCorfu relies on performant materialization. To show that materializing streams is efficient, we implement streams using back pointers in vCorfu with chain replication, similar to the implementation described in Tango. For these tests, in order to compare vCorfu with a with a chain replication-based protocol, we use a symmetrical configuration for vCorfu, with an equal number of log replicas and stream replicas. For the back pointer implementation, we use chain replication , with two replicas per chain and two chains. Our back pointer implementation only stores a single back pointer per entry – Tango uses 4 back pointers. Multiple back pointers are only used to reduce the probability that a linear scan will be required due to hole-filling and should not have an effect on our evaluation. The primary drawback of materialization is that it requires writing a commit message, which results in extra messages proportional to the number of streams affected. We characterize the overhead with a micro-benchmark that appends 32B entries, varying the number of streams and logging units. Figure 5.6 shows that writing a commit bit imposes about a 40% penalty on writes, compared to a back pointer based protocol which does not have to send commit messages. However, write throughput continues to scale as we increase the number of replicas, so the bottleneck in both schemes is the rate in which the sequencer can hand out tokens, not the commit protocol. Now we highlight the power of materializing streams. Figure 5.7 shows the performance of reading an entire stream with a varying number of 32B entries and streams in the log. The 100K stream case uses significantly fewer entries, reflecting our expectation that CSMR objects will increase the number of streams while decreasing the number of entries per stream. As the number of streams and entries increase, vCorfu greatly outperforms back pointers thanks to the ability to perform a single bulk read, whereas back pointers must traverse the log backwards before being able to serve a single read. When hole-filling occurs due to client timeouts, back pointers perform very poorly, falling back to a scan because the hole fill does not contain back pointers resulting in a linear scan of the log. Figure 5.8 examines the number of log entries a back pointer implementation may have to read as a result of a hole. To populate this test, we use 256 clients which randomly select a stream to append a 32B entry to. We then generate a hole, varying the number of streams in the log, and measure the number of entries that the client must seek through. The back pointer implementation cannot do bulk reads, and reading each entry takes about 0.3 ms. The median time to read a stream with a hole takes only 210ms with 32 streams, but jumps to 14.8 and 39.6 seconds with 100K and 500k streams, respectively. vCorfu avoids this issue altogether because stream replicas manage holes. Finally, Figure 5.9 shows that vCorfu performance degrades gracefully when a stream replica fails, and vCorfu switches to using the log replicas instead. We instantiate two local views on the same object, and fail the stream replica hosting the object at t = 29.5s.The append throughput almost doubles, since the replication factor has decreased while the read throughput stays about the same, falling back to using back pointers. Since the local view contains all the previous updates in the stream, reading the entire stream is not necessary. If a remote view was used, however, vCorfu would have to read the entire stream to restore read access to the object. Next, we examine the power of remote views. We first show that remote views address the playback bottleneck: In figure 5.10, we append to a single local view and increase the number of clients reading from their own local views. As the number of views increases, read throughput decreases because each view must playback the stream and reade very update. Once read throughput is saturated readers are unable to keep up with the updates in the system and read latency skyrockets: with just 32 clients, the read latency jumps to above one second. With a remote view, the stream replica takes care of playback and even with 1K clients is able to maintain read throughput and millisecond latency.