[FoRK] Joyent, cloud service evolution (J. Andrew Rogers)

J. Andrew Rogers andrew at jarbox.org
Wed Jun 22 15:43:04 PDT 2016

> On Jun 21, 2016, at 6:52 PM, Ken Meltsner <meltsner at alum.mit.edu> wrote:
> You can assume that the databases will be "eventually consistent" and
> get much better performance, especially if it's unlikely that multiple
> users will be competing for the same items; at the bleeding edge (not
> sure how much has been published), the databases proceed
> optimistically, but can back out of transactions that failed due to
> conflicts relatively long after those were committed.

I don’t think that eventually consistent systems are that common anymore, it is rarely necessary for high performance and fault tolerance. Instead, large-scale systems tend to put strict constraints on transaction characteristics while still supporting the intended use case, which makes it much easier to scale guarantees than allowing all possible transaction structures. You need to warn the user about those constraints but they often aren’t a limitation for the application at hand. Some systems have much stronger internal transactional models than they expose to the user so they don’t have to account for user abuse of the facility.  

However, many (most?) analytical data models have little meaningful transactional structure in their data and operations. Providing consistent, repeatable views at extremely large scales under mixed workloads when you have record-level atomicity on insert and statement-level atomicity on update/delete is straightforward and often sufficient. However, this means global secondary indexes and similar are verboten, which puts the burden of high selectivity for diverse queries on your data model sharding mechanism.

> That's my limited understanding of the situation -- most of the
> methods JAR deprecates are pretty close to the state of the art for
> mid-range commercial products, so I'm curious how far the new stuff
> has gotten beyond the lab -- does something like
> https://www.cockroachlabs.com/ count as The Old Show or Next
> Generation?

First, there is no “one weird trick” to designing scalable real-time database engines. There are actually a few exotic new design elements that massively boost scalability and performance, and a few critical conventional design elements that have been around for years but are inexplicably missing from many new designs. No production system currently implements all of them.

I’ve been having many interesting conversations about this recently, there is some recognition that new database designs are stuck in rut because they keep mindlessly copying what has already been done. The list of design elements that are qualitative in terms of scalability and performance in 2016 is pretty short but I informally use it as a checklist for a state-of-the-art design. 

Three elements that should be common knowledge:

1.) A proper I/O and cache replacement scheduler. Open source tends to either let the OS schedule I/O or implement the most naive design possible (e.g. LRU). Engineers that design these things for a living use algorithms not in the literature, but most open source doesn’t even use what is in the literature. Open source having almost universally poor I/O throughput can be blamed on this. Contrary to popular belief, SSDs don’t fix this.

2.) Dynamic query compilation (usually LLVM). For the price of a bit more implementation effort, you get a free order of magnitude boost in query performance. This has been slowly seeping into open source implementations but many popular systems don’t do it. 

3.) Don’t write your engine in a language that requires a GC. You lose the ability to implement important optimizations which permanently gives you an integer factor performance penalty. This lesson has mostly been learned; many new open source database engines are being written in C++11 (maybe Rust in the future). There are still a lot legacy Java-based platforms being used though.

Three exotic new elements that aren’t common knowledge:

4.) Discrete topology sharding. At the limit, this provides content-addressable selectivity similar to having an index on every column without the index. Because the sharding is not order-based (more information theoretic), it can directly represent spatial relationships (not possible in order-based systems), and if properly designed, directly express and parallelize join relationships. Only a few production systems have been built this way but more are coming. It violates intuitions and takes a while to grok the theory behind the mechanics, which partly explains the slow spread.

5.) Game theoretic decentralized metadata protocols. The abstract idea actually comes from HPC, and it allows you to design consistent distributed systems with no metadata oracle. This lets your metadata mutation rate scale with the size of your cluster since you don't need global consensus on most metadata changes to get globally optimal behavior. Some similarities to gossip protocols but without the poor scaling characteristics. I find the Nash equilibrium thing to be very elegant but some people think it is more difficult than multi-Paxos. A few pre-production implementations are in the works at tech companies I am aware of. I think I’ve seen a couple theory papers that touch on these concepts as well.

6.) Bélády optimal scheduler. Bélády developed a lot of the theory on cache replacement algorithms back in the 1970s; the concepts can be applied to scheduling variable latency operations generally. The optimal solution is in NP generally, and many “best practice" software architectures have no solution at all. However, efficient approximations are possible by design. A subset of recent “single process locked to a core” database architectures are *halfway* to an optimal approximation and they already exhibit 10x the throughput of equivalent lock-free multithreaded architectures. The full version will probably add a small integer to that. Optimal schedulers don’t exist yet, it is too new, though I am currently working on one.

Nothing that I know of currently has all six architectural elements. But if something did, no other architectural decisions would move the needle in terms of adding fundamental capability, performance, or flexibility. Not implementing one of them won’t materially impact every workload profile, so the relevance is at least partly based on what you need the system to do. Transaction processing is somewhat orthogonal to all of this. 

On CockroachDB specifically, some of their ideas are good, particularly from an operations and management perspective, which is important and often overlooked. Technically as a database engine though, it is generic and par for the course for open source. They are using a mediocre ordered key-value storage engine (RocksDB) and the implementation on top of it is written in Go. These two design decisions alone put some limits on where it can go and what it can do.

The CockroachDB roadmap has two prominent items on it: real-time analysis and joins. The architecture doesn’t support good implementations of those features at scale. 

More information about the FoRK mailing list