[FoRK] Spanner: synchronizing the largest global database
Ken Ganshirt @ Yahoo
ken_ganshirt at yahoo.ca
Thu Nov 29 07:07:09 PST 2012
Indeed, it's about time. The sync technology discussed is very mature: telephone companies have used it since at least the 1970s to lock voice and data switches together in global networks.
By the early 1990s, the prices of Stratum 1 and 2 devices had dropped to the point where even small companies like the one I worked for could easily afford them. If the gear is affordable in a telco central office, it's in the same price range as comparable equipment in data centres.
Computer professionals are way late to the party. Google might be applauded for coming up with an API but mainstream computer professionals need their asses kicked for taking so long to "discover" the technology.
--- On Tue, 11/27/12, Eugen Leitl <eugen at leitl.org> wrote:
> From: Eugen Leitl <eugen at leitl.org>
> Subject: [FoRK] Spanner: synchronizing the largest global database
> To: tt at postbiota.org, Beowulf at beowulf.org, "forkit!" <fork at xent.com>
> Received: Tuesday, November 27, 2012, 7:25 AM
> (it's about time)
> Exclusive: Inside Google Spanner, the Largest Single
> Database on Earth
> By Cade Metz
> 6:30 AM
> Each morning, when Andrew Fikes sat down at his desk inside
> Google headquarters in Mountain View, California, he turned
> on the “VC” link to New York.
> VC is Google shorthand for video conference. Looking up at
> the screen on his desk, Fikes could see Wilson Hsieh sitting
> inside a Google office in Manhattan, and Hsieh could see
> him. They also ran VC links to a Google office in Kirkland,
> Washington, near Seattle. Their engineering team spanned
> three offices in three different parts of the country, but
> everyone could still chat and brainstorm and troubleshoot
> without a moment’s delay, and this is how Google built Spanner.
> “You walk into our cubes, and we’ve got VC on — all
> the time,” says Fikes, who joined Google in 2001 and now
> ranks among the company’s distinguished software
> engineers. “We’ve been doing this for years. It lowers
> all the barriers to communication that you typically have.”
> ‘As a distributed-systems developer, you’re taught from
> — I want to say childhood — not to trust time. What we
> did is find a way that we could trust time — and
> understand what it meant to trust time.’
> — Andrew Fikes
> The arrangement is only appropriate. Much like the
> engineering team that created it, Spanner is something that
> stretches across the globe while behaving as if it’s all
> in one place. Unveiled this fall after years of hints and
> rumors, it’s the first worldwide database worthy of the
> name — a database designed to seamlessly operate across
> hundreds of data centers and millions of machines and
> trillions of rows of information.
> Spanner is a creation so large, some have trouble wrapping
> their heads around it. But the end result is easily
> explained: With Spanner, Google can offer a web service to a
> worldwide audience, but still ensure that something
> happening on the service in one part of the world doesn’t
> contradict what’s happening in another.
> Google’s new-age database is already part of the
> company’s online ad system — the system that makes its
> millions — and it could signal where the rest of the web
> is going. Google caused a stir when it published a research
> paper detailing Spanner in mid-September, and the buzz was
> palpable among the hard-core computer systems engineers when
> Wilson Hsieh presented the paper at a conference in
> Hollywood, California, a few weeks later.
> “It’s definitely interesting,” says Raghu Murty, one
> of the chief engineers working on the massive software
> platform that underpins Facebook — though he adds that
> Facebook has yet to explore the possibility of actually
> building something similar.
> Google’s web operation is significantly more complex than
> most, and it’s forced to build custom software that’s
> well beyond the scope of most online outfits. But as the web
> grows, its creations so often trickle down to the rest of
> the world.
> Before Spanner was revealed, many didn’t even think it was
> possible. Yes, we had “NoSQL” databases capable of
> storing information across multiple data centers, but they
> couldn’t do so while keeping that information
> “consistent” — meaning that someone looking at the
> data on one side of the world sees the same thing as someone
> on the other side. The assumption was that consistency was
> barred by the inherent delays that come when sending
> information between data centers.
> But in building a database that was both global and
> consistent, Google’s Spanner engineers did something
> completely unexpected. They have a history of doing the
> unexpected. The team includes not only Fikes and Hsieh, who
> oversaw the development of BigTable, Google’s seminal
> NoSQL database, but also legendary Googlers Jeff Dean and
> Sanjay Ghemawat and a long list of other engineers who
> worked on such groundbreaking data-center platforms as
> Megastore and Dremel.
> This time around, they found a new way of keeping time.
> “As a distributed systems developer, you’re taught from
> — I want to say childhood — not to trust time,” says
> Fikes. “What we did is find a way that we could trust time
> — and understand what it meant to trust time.”
> Time Is of the Essence
> On the net, time is of the essence. Yes, in running a
> massive web service, you need things to happen quickly. But
> you also need a means of accurately keeping track of time
> across the many machines that underpin your service. You
> have to synchronize the many processes running on each
> server, and you have to synchronize the servers themselves,
> so that they too can work in tandem. And that’s easier
> said than done.
> Typically, data-center operators keep their servers in sync
> using what’s called the Network Time Protocol, or NTP.
> This is essentially an online service that connects machines
> to the official atomic clocks that keep time for
> organizations across the world. But because it takes time to
> move information across a network, this method is never
> completely accurate, and sometimes, it breaks altogether. In
> July, several big-name web operations experienced problems
> — including Reddit, Gawker, and Mozilla — because their
> software wasn’t prepared to handle a “leap second”
> that was added to the world’s atomic clocks.
> ‘We wanted something that we were confident in. It’s a
> time reference that’s owned by Google.’
> — Andrew Fikes
> But with Spanner, Google discarded NTP in favor of its
> own time-keeping mechanism. It’s called the TrueTime API.
> “We wanted something that we were confident in,” Fikes
> says. “It’s a time reference that’s owned by Google.”
> Rather than rely on outside clocks, Google equips its
> Spannerized data centers with its own atomic clocks and GPS
> (global positioning system) receivers, not unlike the one in
> your iPhone. Tapping into a network of satellites orbiting
> the Earth, a GPS receiver can pinpoint your location, but it
> can also tell time.
> These time-keeping devices connect to a certain number of
> master servers, and the master servers shuttle time readings
> to other machines running across the Google network.
> Basically, each machine on the network runs a daemon — a
> background software process — that is constantly checking
> with masters in the same data center and in other Google
> data centers, trying to reach a consensus on what time it
> is. In this way, machines across the Google network can come
> pretty close to running a common clock.
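The article doesn't spell out the interface, but Google's published Spanner paper describes TrueTime as returning not a single timestamp but an uncertainty interval guaranteed to contain the true time. A minimal sketch of that idea in Python (the names TTInterval, tt_now, tt_after, tt_before, and the EPSILON bound are illustrative stand-ins, not Google's actual API):

```python
import time
from dataclasses import dataclass

# Hypothetical uncertainty bound in seconds. In Spanner, each machine's
# daemon derives this bound from its GPS/atomic-clock masters; the paper
# reports it is typically on the order of milliseconds.
EPSILON = 0.007

@dataclass
class TTInterval:
    earliest: float  # true time is guaranteed to be >= earliest
    latest: float    # true time is guaranteed to be <= latest

def tt_now(epsilon: float = EPSILON) -> TTInterval:
    """Return an interval that brackets the true absolute time."""
    t = time.time()
    return TTInterval(t - epsilon, t + epsilon)

def tt_after(t: float) -> bool:
    """True only if absolute time is definitely past t."""
    return tt_now().earliest > t

def tt_before(t: float) -> bool:
    """True only if absolute time is definitely before t."""
    return tt_now().latest < t
```

The key design point: by admitting its own uncertainty explicitly, the clock lets callers reason about when an event is *provably* in the past, which is what makes global ordering possible.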
> ‘The System Responds — And Not a Human’
> How does this bootstrap a worldwide database? Thanks to the
> TrueTime service, Google can keep its many machines in sync
> — even when they span multiple data centers — and this
> means they can quickly store and retrieve data without
> stepping on each other’s toes.
> “We can commit data at two different locations — say the
> West Coast [of the United States] and Europe — and still
> have some agreed upon ordering between them,” Fikes says,
> “So, if the West Coast write happens first and then the
> one in Europe happens, the whole system knows that — and
> there’s no possibility of them being viewed in a different order.”
> ‘By using highly accurate clocks and a very clever time
> API, Spanner allows server nodes to coordinate without a
> whole lot of communication.’
> — Andy Gross
> According to Andy Gross — the principal architect at
> Basho, an outfit that builds an open source database called
> Riak that’s designed to run across thousands of servers
> — database designers typically seek to synchronize
> information across machines by having them talk to each
> other. “You have to do a whole lot of communication to
> decide the correct order for all the transactions,” he
> told us this fall, when Spanner was first revealed.
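How accurate clocks let nodes agree on an order without all that chatter can be sketched with the "commit wait" rule from the Spanner paper: stamp a write with the latest edge of the clock's uncertainty interval, then hold the commit until every clock in the fleet must agree that timestamp is in the past. The commit helper and EPSILON bound below are hypothetical, not Google's code:

```python
import time

EPSILON = 0.007  # assumed worst-case clock uncertainty, in seconds

def commit(apply_write):
    """Apply a write, then block until its timestamp is safely in the past."""
    # Stamp the write with the latest edge of the uncertainty interval,
    # so no correct clock anywhere can already be past this value.
    ts = time.time() + EPSILON
    apply_write(ts)
    # Commit wait: spin until even the earliest edge of our interval has
    # passed ts. After this, any later commit on any machine is
    # guaranteed to pick a larger timestamp -- with no messages exchanged.
    while time.time() - EPSILON <= ts:
        time.sleep(EPSILON / 10)
    return ts
```

The cost is a pause of roughly twice the clock uncertainty per commit, which is why shrinking that uncertainty with atomic clocks and GPS matters: the tighter the interval, the shorter the wait.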
> The problem is that this communication can bog down the
> network — and the database. As Max Schireson — the
> president of 10gen, maker of the NoSQL database MongoDB —
> told us: “If you have large numbers of people accessing
> large numbers of systems that are globally distributed so
> that the delay in communications between them is relatively
> long, it becomes very hard to keep everything synchronized.
> If you increase those factors, it gets even harder.”
> So Google took a completely different tack. Rather than
> struggle to improve communication between servers, it gave
> them a new way to tell time. “That was probably the
> coolest thing about the paper: using atomic clocks and GPS
> to provide a time API,” says Facebook’s Raghu Murty.
> In harnessing time, Google can build a database that’s
> both global and consistent, but it can also make its
> services more resistant in the face of network delays,
> data-center outages, and other software and hardware snafus.
> Basically, Google uses Spanner to accurately replicate its
> data across multiple data centers — and quickly move
> between replicas as need be. In other words, the replicas
> are consistent too.
> When one replica is unavailable, Spanner can rapidly shift
> to another. But it will also move between replicas simply to
> improve performance. “If you have one replica and it gets
> busy, your latency is going to be high. But if you have four
> other replicas, you can choose to go to a different one, and
> trim that latency,” Fikes says.
> One effect, Fikes explains, is that Google spends less money
> managing its system. “When there are outages, things just
> sort of flip — client machines access other servers in the
> system,” he says. “It’s a much easier service
> story…. The system responds — and not a human.”
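The failover behavior Fikes describes amounts to routing each request to the lowest-latency replica that is still reachable, so an outage "just sort of flips" traffic elsewhere with no human involved. A toy sketch of that selection logic (pick_replica and the probe callback are invented names, not Spanner internals):

```python
def pick_replica(replicas, probe):
    """Choose the reachable replica with the lowest measured latency.

    replicas: list of replica identifiers (hypothetical region names).
    probe: callback returning latency in ms, or None if unreachable.
    """
    best, best_latency = None, float("inf")
    for r in replicas:
        latency = probe(r)
        # Skip replicas that are down; prefer the fastest of the rest.
        if latency is not None and latency < best_latency:
            best, best_latency = r, latency
    if best is None:
        raise RuntimeError("no replica available")
    return best
```

For example, with measured latencies of {"us-west": 12.0, "europe": None, "asia": 95.0}, the unreachable European replica is skipped and traffic goes to us-west; the same logic trims latency when a busy replica merely slows down rather than fails.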
> Spanning Google’s Footsteps
> Some have questioned whether others can follow in Google’s
> footsteps — and whether they would even want to. When we
> spoke to Andy Gross, he guessed that even Google’s atomic
> clocks and GPS receivers would be prohibitively expensive
> for most operations.
> Yes, rebuilding the platform would be a massive undertaking.
> Google has already spent four and a half years on the project,
> and Fikes — who helped build Google’s web history tool,
> its first product search service, and Google Answers, as
> well as BigTable — calls Spanner the most difficult thing
> he has ever worked on. What’s more, there are countless
> logistical issues that need dealing with.
> ‘The important thing to think about is that this is a
> service that is provided to the data center. The costs of
> that are amortized across all the servers in your fleet. The
> cost per server is some incremental amount — and you weigh
> that against the types of things we can do for that.’
> — Andrew Fikes
> As Fikes points out, Google had to install GPS antennas on
> the roofs of its data centers and connect them to the
> hardware below. And, yes, you do need two separate types of
> time keepers. Hardware always fails, and your time keepers
> must fail at, well, different times. “The atomic clocks
> provide stability if there is a GPS issue,” he says.
> But according to Fikes, these are relatively inexpensive
> devices. The GPS units aren’t as cheap as those in your
> iPhone, but like Google’s atomic clocks, they cost no more
> than a few thousand dollars apiece. “They’re sort of in
> the order of the cost of an enterprise server,” he says,
> “and there are a lot of different vendors of these
> devices.” When we discussed the matter with Jeff Dean —
> one of Google’s primary infrastructure engineers and another
> name on the Spanner paper — he indicated much the same.
> Fikes also makes a point of saying that the TrueTime service
> does not require specialized servers. The time keepers are
> kept in racks alongside the servers, and again, they need only
> connect to some machines in the data center.
> “You can think of it as only a handful of these devices
> being in each data center. They’re boxes. You buy them.
> You plug them into your rack. And you’re going to connect
> to them over Ethernet,” Fikes says. “The important thing
> to think about is that this is a service that is provided to
> the data center. The costs of that are amortized across all
> the servers in your fleet. The cost per server is some
> incremental amount — and you weigh that against the types
> of things we can do for that.”
> No, Spanner isn’t something every website needs today. But
> the world is moving in its general direction. Though
> Facebook has yet to explore something like Spanner, it is
> building a software platform called Prism that will run the
> company’s massive number crunching tasks across multiple
> data centers.
> Yes, Google’s ad system is enormous, but it benefits from
> Spanner in ways that could benefit so many other web
> services. The Google ad system is an online auction —
> where advertisers bid to have their ads displayed as someone
> searches for a particular item or visits particular websites
> — and the appearance of each ad depends on data describing
> the behavior of countless advertisers and web surfers across
> the net. With Spanner, Google can juggle this data on a
> global scale, and it can still keep the whole system in sync.
> As Fikes put it, Spanner is just the first example of Google
> taking advantage of its new hold on time. “I expect there
> will be many others,” he says. He means other Google
> services, but there’s a reason the company has now shared
> its Spanner paper with the rest of the world.
> Illustration by Ross Patton