About Distributed Systems Engineering
The mission of Distributed Systems Engineering (DSE) is to provide world-class
software components that meet Amazon's general business needs, with
excellent support and at low cost. The products and technologies that
make up the DSE components toolkit (services, application server
environment, frameworks, middleware and routing solutions) make
developing distributed systems easier. See below for more information
about our teams and our design principles.
Distributed Systems Engineering Programs and Teams
- Persistent Systems Engineering
The Persistent Systems Engineering group takes on the challenge of designing
and implementing highly scalable storage systems. Techniques such as
consistent hashing, in-memory databases, group membership protocols, vector
clocks, etc., are coherently stitched together to address large (multiple
terabytes) and small (gigabytes) storage needs. Expertise in software
development, a grounding in distributed systems theory, and a highly
analytical mind are valued assets. [open positions]
- Messaging Technologies
The messaging group believes that event-driven design coupled
with asynchronous messaging is a great way to build scalable distributed
software. We consider messaging a superior way to design and implement
distributed software, because the network is inherently unreliable and
non-deterministic, and because loose coupling has great advantages over
explicit, tightly coupled distributed software systems.
Our mission is to provide scalable and reliable message-oriented
middleware that service owners can use as basic building blocks
for inter-system communication, high-performance asynchronous
processing, and loose coupling. [open positions]
- Request Reply Frameworks
The mission of the Request Reply team is to provide libraries, components and
tools that enable low-cost, decoupled, extremely parallel, high-throughput,
highly reliable request-reply services. The team owns production-critical
request routing, system availability management, and service definition
functionality that can be used to create services that are massively
scalable and robust enough to work consistently in the face of
unpredictable failures (of the network, agents and hosts). Request Reply also
provides productivity tools that allow other teams to understand the performance
and behavioral characteristics of their systems. The team's products form the
basis of a great deal of the distributed system architecture at Amazon.com.
[open positions]
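Consistent hashing, one of the techniques named in the Persistent Systems Engineering description above, can be illustrated with a small sketch. This is not DSE's implementation; the class name `HashRing`, the `vnodes` parameter, and the choice of SHA-256 for ring positions are all assumptions made for the example. It shows the property that makes the technique attractive for scalable storage: when a node joins, the only keys that change owner are the ones the new node takes over.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (hypothetical sketch)."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes          # virtual nodes per physical node (assumed default)
        self._ring = []               # sorted list of (token, node) pairs
        for node in nodes:
            self.add(node)

    @staticmethod
    def _token(key):
        # Map a string to a position on a 64-bit ring.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, node):
        # Place several tokens per node so load spreads more evenly.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._token(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(t, n) for t, n in self._ring if n != node]

    def lookup(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # Find the first token at or after the key's position, wrapping around.
        idx = bisect.bisect_left(self._ring, (self._token(key),))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"item-{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}

ring.add("node-d")
after = {k: ring.lookup(k) for k in keys}

# Keys whose owner changed when node-d joined.
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` now belongs to `node-d`; keys owned by the other nodes stay where they were, and removing `node-d` restores the original assignment.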
Distributed Systems Design Principles
The following principles of distributed system design are applied to many of the systems DSE builds:
- Decentralization
Use fully decentralized techniques to remove scaling bottlenecks
and single points of failure.
- Asynchrony
The system makes progress under all circumstances.
- Autonomy
The system is designed such that individual components
can make decisions based on local information.
- Local responsibility
Each individual component is responsible
for achieving its own consistency; this is never the burden of its
peers.
- Controlled Concurrency
Operations are designed such that little
or no concurrency control is required.
- Failure Tolerant
The system considers the failure of components
to be a normal mode of operation, and continues operation with
little or no interruption.
- Controlled Parallelism
Abstractions used in the system are
of such granularity that parallelism can be used to improve performance
and robustness of recovery or the introduction of new nodes.
- Decomposition into small, well-understood building blocks
Do not
try to provide a single service that does everything for everyone,
but instead build small components that can be used as building
blocks for other services.
- Symmetry
Nodes in the system are identical in terms of functionality,
and require little or no node-specific configuration to function.
- Simplicity
The system should be made as simple as possible
(but no simpler).
If you're interested in joining the DSE team, take a look at our
open positions.