Reliability at massive scale
At large scale, small and large components fail continuously, and the way persistent state is managed in the face of these failures drives the reliability and scalability of software systems.
- Reliability: even the slightest outage has significant financial consequences and impacts customer trust.
- Scalability: support continuous growth.
- Amazon uses a highly decentralized, loosely coupled, service-oriented architecture consisting of hundreds of services. There is a particular need for storage technologies that are always available (even when disks are failing, network routes are flapping, or data centers are being destroyed by tornadoes).
- Applications must always be able to write to and read from their data store.
- The system needs to deal with failures in an infrastructure comprising millions of components.
- A select set of applications requires a storage technology that is flexible enough to let application designers configure their data store to suit their own tradeoffs.
- Writes are never rejected.
- A key design decision is when to resolve update conflicts: during reads or during writes.
Dynamo is a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience.
- It is used to manage the state of services that have very high reliability requirements and need tight control over the tradeoffs between availability, consistency, cost-effectiveness and performance.
- The common pattern of using a relational database would lead to inefficiencies and limit scale and availability.
- Dynamo provides a simple primary-key only interface.
- Data is partitioned and replicated using consistent hashing.
- Consistency is facilitated by object versioning.
- Consistency among replicas during updates is maintained by a quorum-like technique and a decentralized replica synchronization protocol.
- It uses a gossip-based distributed failure detection and membership protocol.
- It is a completely decentralized system with minimal need for manual administration.
- Storage nodes can be added and removed from Dynamo without requiring any manual partitioning or redistribution.
It was able to scale to extreme peak loads efficiently without any downtime.
- Dynamo sacrifices consistency under certain failure scenarios.
- The reliability and scalability of a system depend on how its application state is managed.
Traditional replicated relational database systems focus on guaranteeing strong consistency for replicated data. Previous systems are not capable of handling network partitions because they typically provide strong consistency guarantees.
- Always writeable
- Built for an infrastructure within a single administrative domain where all nodes are assumed to be trusted.
- Applications that use Dynamo do not require support for hierarchical namespaces or complex relational schemas.
- Dynamo is built for latency-sensitive applications.
- Evaluation of how different techniques can be combined to provide a single highly-available system.
- It demonstrates that an eventually-consistent storage system can be used in production with demanding applications.
- It also provides insight into the tuning of these techniques to meet the requirements of production systems with very strict performance demands.
For these services, a relational database is far from an ideal solution.
System Assumptions and Requirements
- Query Model: simple read and write operations to a data item that is uniquely identified by a key. State is stored as binary objects identified by unique keys.
- ACID Properties: ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee that database transactions are processed reliably.
- Efficiency: The system needs to function on a commodity hardware infrastructure.
Service Level Agreements (SLA)
Clients and services engage in a Service Level Agreement (SLA), a formally negotiated contract where a client and a service agree on several system-related characteristics.
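Amazon’s engineering and optimization efforts are not focused on averages: SLAs are expressed and measured at the 99.9th percentile of the latency distribution, so that nearly all customers get a good experience rather than just the typical one.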
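To see why an average can hide the requests that matter, here is a minimal sketch comparing the mean with a high percentile. The latency samples are made up for illustration, and the nearest-rank `percentile` helper is a hypothetical name, not something taken from Dynamo.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Invented data: 995 fast requests and 5 very slow ones (milliseconds).
latencies_ms = [10] * 995 + [800] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)  # dominated by the fast requests
median_ms = percentile(latencies_ms, 50)         # also looks healthy
p999_ms = percentile(latencies_ms, 99.9)         # exposes the slow tail
```

With this data the mean and median stay close to the fast requests, while the high percentile lands on the slow ones, which is the part of the distribution a tail-focused SLA is meant to constrain.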
- When to resolve conflicts: Dynamo targets the design space of an “always writeable” data store, so conflict resolution is pushed to reads.
- Who performs the process of conflict resolution: the data store or the application.
- Data store: its choices are rather limited.
- Application: it can decide on the conflict resolution method that is best suited for its client’s experience.
Other key principles embraced in the design
- Incremental scalability
- Failure handling
get(key) locates the object replicas associated with the key in the storage system and returns a single object or a list of objects with conflicting versions, along with a context.
put(key, context, object) determines where the replicas of the object should be placed based on the associated key, and writes the replicas to disk. The context encodes system metadata about the object that is opaque to the caller and includes information such as the version of the object.
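The shape of this interface can be illustrated with a small single-process sketch. It is not Dynamo's implementation: there is no replication or disk, the `ToyStore` class is a hypothetical name, and the context is simplified to the set of version ids the caller has seen (an assumption made here to keep the example short).

```python
from dataclasses import dataclass, field

@dataclass
class ToyStore:
    """In-memory stand-in for the get/put interface (no replication)."""
    _data: dict = field(default_factory=dict)  # key -> list of (version_id, object)
    _next_version: int = 0

    def get(self, key):
        """Return (objects, context). More than one object means conflicting versions."""
        versions = self._data.get(key, [])
        objects = [obj for _, obj in versions]
        # Assumption: the context is just the set of version ids that were read.
        context = frozenset(version_id for version_id, _ in versions)
        return objects, context

    def put(self, key, context, obj):
        """Write a new version that supersedes every version named in the context."""
        self._next_version += 1
        survivors = [(v, o) for v, o in self._data.get(key, []) if v not in context]
        self._data[key] = survivors + [(self._next_version, obj)]

store = ToyStore()
_, empty_ctx = store.get("cart")
store.put("cart", empty_ctx, ["book"])
store.put("cart", empty_ctx, ["pen"])         # stale context: creates a conflict
conflicting, ctx = store.get("cart")          # both versions come back
store.put("cart", ctx, ["book", "pen"])       # caller reconciles and writes back
reconciled, _ = store.get("cart")
```

The caller never inspects the context; it only passes back what the last get returned, which is what lets the store tell which versions a write replaces.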
Dynamo’s partitioning scheme relies on consistent hashing. Each data item identified by a key is assigned to a node by hashing its key to yield its position on the ring, then walking clockwise to find the first node with a position larger than the item’s position.
Each node becomes responsible for the region in the ring between it and its predecessor node on the ring.
Advantage of consistent hashing: departure or arrival of a node only affects its immediate neighbors and other nodes remain unaffected.
Challenges of the basic algorithm:
- The random position assignment of each node on the ring leads to non-uniform data and load distribution.
- The basic algorithm is oblivious to the heterogeneity in the performance of nodes.
Solution to these challenges:
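- Each node is assigned to multiple points on the ring.
- Dynamo uses the concept of “virtual nodes”: a virtual node looks like a single node in the system, but each physical node can be responsible for more than one virtual node.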
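The ring lookup and the virtual-node idea described above can be sketched as follows. This is a minimal illustration, not Dynamo's code: the `Ring` class, the `tokens` parameter, and the `"node#i"` token labels are hypothetical, and MD5 is used only as a convenient stable hash.

```python
import bisect
import hashlib
from typing import Optional

def ring_position(label: str) -> int:
    """Map a string to an integer position on the ring."""
    return int.from_bytes(hashlib.md5(label.encode("utf-8")).digest(), "big")

class Ring:
    def __init__(self, tokens_per_node: int = 8):
        self.tokens_per_node = tokens_per_node
        self._positions = []  # sorted positions of all virtual nodes
        self._owner = {}      # position -> physical node name

    def add_node(self, node: str, tokens: Optional[int] = None) -> None:
        """Place a physical node at several ring positions (its virtual nodes).
        A more capable machine can be given more tokens."""
        for i in range(tokens or self.tokens_per_node):
            pos = ring_position(f"{node}#{i}")
            self._owner[pos] = node
            bisect.insort(self._positions, pos)

    def remove_node(self, node: str) -> None:
        self._positions = [p for p in self._positions if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, key: str) -> str:
        """Walk clockwise to the first position larger than the key's position."""
        idx = bisect.bisect_right(self._positions, ring_position(key))
        return self._owner[self._positions[idx % len(self._positions)]]

ring = Ring()
for name in ["A", "B", "C"]:
    ring.add_node(name)
keys = [f"key-{i}" for i in range(200)]
before = {k: ring.node_for(k) for k in keys}
ring.add_node("D")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` is now owned by the new node `D`; keys never shuffle between the nodes that were already present, which is the property that makes adding and removing nodes cheap.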
Each key, k, is assigned to a coordinator node. The coordinator is in charge of the replication of the data items that fall within its range. In addition to locally storing each key within its range, the coordinator replicates these keys at the N-1 clockwise successor nodes in the ring.
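As a rough sketch of this replica placement, the function below returns the coordinator followed by its N-1 clockwise successors. It assumes a simplified ring with one position per node (no virtual nodes) and made-up integer positions; `preference_list` is a name chosen for this example.

```python
import bisect

def preference_list(ring, key_position, n):
    """ring: list of (position, node) pairs sorted by position, one per node.
    Returns the coordinator for key_position, then its clockwise successors,
    for a total of n nodes (or fewer if the ring is smaller than n)."""
    positions = [pos for pos, _ in ring]
    # The coordinator is the first node whose position is larger than the key's.
    start = bisect.bisect_right(positions, key_position) % len(ring)
    count = min(n, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]

ring = [(10, "A"), (20, "B"), (30, "C"), (40, "D")]
replicas = preference_list(ring, 25, 3)   # ['C', 'D', 'A']
wrapped = preference_list(ring, 45, 3)    # ['A', 'B', 'C']
```

With virtual nodes, successive ring positions may belong to the same physical machine, so a fuller version would skip positions until it has collected N distinct physical nodes.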
Dynamo provides eventual consistency, which allows for updates to be propagated to all replicas asynchronously.
When a customer wants to add an item to (or remove from) a shopping cart and the latest version is not available, the item is added to (or removed from) the older version and the divergent versions are reconciled later.
Dynamo treats the result of each modification as a new and immutable version of the data.
Multiple versions of an object may be present in the system at the same time. Dynamo uses vector clocks to capture causality between different versions of the same object. A vector clock is effectively a list of (node, counter) pairs. One vector clock is associated with every version of every object. One can determine whether two versions of an object are on parallel branches or have a causal ordering by examining their vector clocks.
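The comparison rule can be sketched in a few lines. Here a vector clock is represented as a dict from node name to counter, which is one possible encoding of the (node, counter) list; the function names and the node names `Sx`, `Sy`, `Sz` are illustrative assumptions.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen every event recorded in clock b."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def compare(a: dict, b: dict) -> str:
    """Classify the relationship between two versions by their clocks."""
    a_sees_b, b_sees_a = descends(a, b), descends(b, a)
    if a_sees_b and b_sees_a:
        return "equal"
    if a_sees_b:
        return "a is newer"       # b is an ancestor of a and can be discarded
    if b_sees_a:
        return "b is newer"
    return "concurrent"           # parallel branches: needs reconciliation

def increment(clock: dict, node: str) -> dict:
    """Return a new clock with the writing node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated

d1 = increment({}, "Sx")    # {'Sx': 1}
d2 = increment(d1, "Sx")    # {'Sx': 2}
d3 = increment(d2, "Sy")    # {'Sx': 2, 'Sy': 1}
d4 = increment(d2, "Sz")    # {'Sx': 2, 'Sz': 1}
```

Here `d2` supersedes `d1`, while `d3` and `d4` both descend from `d2` but not from each other, so they are the kind of divergent versions that must be reconciled later.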
Selecting a node:
- Route the request through a generic load balancer that will select a node based on load information.
- Use a partition-aware client library that routes requests directly to the appropriate coordinator nodes.