Tuesday, October 31, 2006

Paper: "Experiences Building PlanetLab"

Experiences Building PlanetLab
Larry Peterson, Andy Bavier, Marc E. Fiuczynski, and Steve Muir, Princeton University

Abstract

This paper reports our experiences building PlanetLab over the last four years. It identifies the requirements that shaped PlanetLab, explains the design decisions that resulted from resolving conflicts among these requirements, and reports our experience implementing and supporting the system. Due in large part to the nature of the "PlanetLab experiment," the discussion focuses on synthesis rather than new techniques, balancing system-wide considerations rather than improving performance along a single dimension, and learning from feedback from a live system rather than controlled experiments using synthetic workloads.

2 Comments:

Blogger Shanth said...

TALK SCRIBE:
~~~~~~~~~~~

This talk, given by Larry Peterson, was
about the authors' experience building
PlanetLab (PL). PL is a global platform
for deploying and evaluating
planetary-scale network services. PL has
machines spread around the world with
users' services running in a slice of
PL's global resources.

The PL design was a synthesis of existing
ideas to produce a fundamentally new
system: it was experience- and
conflict-driven.

Larry listed the requirements identified
at the time PL was conceived and the
design challenges they faced. Given its
scale, PL had to rely on site autonomy
and decentralized control for
sustainability, while also managing the
trust relationships between the users of
PL and the owners of the machines. Next,
it had to balance the need for resource
isolation against the need to support
many users with minimal resources.
Finally, PL had to be a stable, usable
system, supporting both long-running
services and short experiments, while
continuously evolving based on feedback.

PL's management architecture has the
following key features to address the
design challenges. PlanetLab Control
(PLC), a centralized front-end, acts as
the trusted intermediary between PL users
and node owners. To support long-lived
slices and accommodate scarce resources,
PL decouples slice creation from resource
allocation. Node-owner autonomy is
achieved by making sure that only owners
generate resources on their nodes and
that they can directly allocate a
fraction of their node's resources to
the virtual machines of a specific
slice. To support slice management
through third-party services, PLC allows
delegation of slice creation by granting
tickets to such services, as sketched
below.
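
To make the ticket mechanism concrete,
here is a minimal sketch of granting and
redeeming a slice-creation ticket. The
names (grant_ticket, redeem_ticket,
PLC_KEY) are invented for illustration,
and an HMAC stands in for whatever
signing scheme PLC actually uses; the
point is only the shape of the idea: PLC
signs a statement naming a slice and
some nodes, and each node verifies that
statement before creating the slice's VM
locally.

    import hmac, hashlib, json, time

    # Stand-in for PLC's signing key; the real
    # system uses proper cryptographic credentials.
    PLC_KEY = b"plc-demo-key"

    def grant_ticket(slice_name, nodes, lifetime=3600):
        # PLC side: issue a ticket authorizing a
        # third-party service to create slice_name
        # on the listed nodes, valid for `lifetime`
        # seconds.
        body = {"slice": slice_name, "nodes": nodes,
                "expires": time.time() + lifetime}
        payload = json.dumps(body, sort_keys=True).encode()
        sig = hmac.new(PLC_KEY, payload, hashlib.sha256).hexdigest()
        return {"body": body, "sig": sig}

    def redeem_ticket(ticket, node):
        # Node side: check the signature, expiry,
        # and node list before instantiating the
        # slice's VM.
        payload = json.dumps(ticket["body"], sort_keys=True).encode()
        want = hmac.new(PLC_KEY, payload, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(want, ticket["sig"]):
            raise ValueError("bad signature")
        if time.time() > ticket["body"]["expires"]:
            raise ValueError("ticket expired")
        if node not in ticket["body"]["nodes"]:
            raise ValueError("node not covered by ticket")
        return ticket["body"]["slice"]  # OK to create the VM

(With a shared-key HMAC the nodes would
have to hold the secret too; a real
deployment would use public-key
signatures so nodes only need PLC's
public key.)
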
For scalability, PL was designed so that
multiple PL-like systems can co-exist
and federate with each other. Following
the principle of least privilege,
management functionality has been
factored into self-contained services,
each isolated in its own VM and granted
minimal privileges. To address the
resource allocation issues, PL provides
fair sharing of CPU and network
bandwidth, plus simple mechanisms to
protect against thrashing and overuse
(see the sketch after this paragraph).
Finally, keeping PL's control plane
orthogonal to the VMM, leveraging
existing software, and rolling out
upgrades incrementally helped PL evolve
while remaining operational.
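
As a toy illustration of the overuse
protection, here is a token-bucket rate
limiter of the kind that could cap a
slice's outbound bandwidth. This is not
PlanetLab's actual mechanism (those
limits live in the node's kernel/VMM);
the class and its parameters are
invented for the example.

    import time

    class TokenBucket:
        # Toy per-slice limiter: a slice may burst
        # up to `burst` bytes and is refilled at
        # `rate` bytes per second.
        def __init__(self, rate, burst):
            self.rate, self.burst = rate, burst
            self.tokens = burst
            self.last = time.monotonic()

        def allow(self, nbytes):
            now = time.monotonic()
            elapsed = now - self.last
            self.tokens = min(self.burst,
                              self.tokens + elapsed * self.rate)
            self.last = now
            if nbytes <= self.tokens:
                self.tokens -= nbytes
                return True
            return False  # over the share: delay or drop

For instance, TokenBucket(rate=1_000_000,
burst=5_000_000) would cap a slice at
roughly 1 MB/s sustained with 5 MB
bursts. Fair CPU sharing is analogous in
spirit: each active slice receives a
proportional share of cycles rather than
a fixed reservation.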

Larry concluded with lessons learnt from
their experience. Key among them was the
observation that decentralization
follows centralization: a centralized
model is important for a system to
achieve critical mass, and it is only by
federation that the system can then
scale.

During the Q&A session, Sean Rhea (Intel
Research Berkeley) asked about Larry's
comments on the proposal to partition
the system between short-running
experiments and long-running services.
Larry said he was not convinced about
reserving a portion of the resources for
long-running services, although in case
of conflict, services were given
priority over experiments. He also said
that some measurements were being taken
regarding this. David Andersen (CMU)
stated that Larry's talk presented a
rosy picture of PL, in contrast to the
PL panel at WORLDS '06 that discussed
problems with PL. David asked about the
observed problems with running
latency-sensitive services, disk
thrashing, and scheduling. Larry said
that there was room for improvement in
scheduling and that they were working on
it. He also hinted that there might be a
scheduling bug in their code.

1:27 PM  
Blogger Shanth said...

Corrections:

During the Q&A session, Sean Rhea (Intel Research Berkeley) asked about Larry's comments on the proposal to set aside physical boxes for measurements. Larry said he was not convinced about reserving physical resources, but rather thought that logical isolation was sufficient. David Andersen (CMU) noted that Larry's talk presented a rosy picture of PL, in contrast to the PL panel at WORLDS '06 that discussed problems with PL. David asked about the observed problems with running latency-sensitive services, disk thrashing, and scheduling. Larry said that there was room for improvement in scheduling. He also noted that since the PL code is available, the community was welcome to track down bugs that hampered their research and report patches. He said that there was a known kernel bug that could cause problems for latency-sensitive slices and that things would improve when the next kernel upgrade was rolled out.

12:00 PM  
