Ruth Pordes, Vicky White, Lothar Bauerdick, Amber Boehnlein, Wyatt Merritt, Rick Snider, Rick St. Denis,
Frank Wuerthwein, Liz Sexton-Kennedy
The U.S.LHC Tier-1 and Tier-2 laboratories and universities are
developing production Grids to support LHC applications running across
a worldwide Grid computing system. Together with partners in computer
science, physics grid projects and running experiments, we will build a
common national production grid infrastructure which is open in its
architecture, implementation and use.
The OSG model builds upon the successful approach of last year's joint
Grid2003 project. The Grid3 shared infrastructure has for over eight
months given significant computational resources and throughput to more
than six applications, including ATLAS and CMS data challenges, SDSS,
LIGO and Biology analyses and computer science demonstrators.
To move towards LHC-scale data management, access and analysis
capabilities, we will need to increase the scale, services, and
sustainability of the current infrastructure by an order of magnitude.
This requires a significant upgrade in its functionalities and
technologies.
The OSG roadmap is a strategy and work plan to build the U.S.LHC
computing enterprise as a fully usable, sustainable and robust grid,
which is part of the LHC global computing infrastructure and open to
partners. The approach is to federate with other application
communities in the U.S. to build a shared infrastructure open to other
sciences and capable of being modified and improved to respond to needs
of other applications, including CDF, D0, BaBar and RHIC experiments.
We describe the application driven engineered services of the OSG,
short term plans and status, and the roadmap for a consortium, its
partnerships and national focus.
SAMGrid is the shared data handling framework of the two large Fermilab
Run II collider experiments: DZero and CDF. In production since 1999 at D0, and
since mid-2004 at CDF, the SAMGrid framework has been adapted over time to
accommodate a variety of storage solutions and configurations, as well as the
differing data processing models of DZero and CDF. Backed by primary data
repositories of approximately 1 PB in size for each experiment, the SAMGrid
framework delivers over 100 TB/day to DZero and CDF analyses at Fermilab and
around the world, a remarkable success. Each of the storage systems used with
SAMGrid, however, has distinct interfaces, protocols, and behaviors. This led
to different levels of integration of the various storage devices into the
framework, which complicated the exploitation of their functionality.
In an effort to simplify the SAMGrid storage interfaces, SAMGrid has
adopted the Storage Resource Manager (SRM) concept as the universal interface
to all storage devices. This has simplified the SAMGrid framework, especially
the implementation of storage device interactions. It prepares the SAMGrid
framework for future storage solutions equipped with SRM interfaces, without
the need for long and risky software integration projects. In principle, any
storage device with an SRM interface can be used now with the SAMGrid
framework. The integration of SRMs is an important further step towards evolving the
SAMGrid framework into a co-operating collection of distinct, modular
grid-oriented services. To date, SRMs for Enstore, dCache, local caches, and
permanent disk locations are tested and in production use. This report outlines
how SRMs have been integrated into the existing SAMGrid framework without
disturbing on-going operations, and describes our operational experience with
SAMGrid and SRMs in the field.
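
The idea of a single interface in front of dissimilar storage devices can be illustrated with a minimal sketch. This is not the SRM specification or the SAMGrid code: the class names, the stage method and the URL schemes below are all hypothetical, chosen only to show how framework code can stay unaware of which backend it talks to.

```python
from abc import ABC, abstractmethod


class StorageInterface(ABC):
    """Hypothetical uniform storage interface, in the spirit of SRM."""

    @abstractmethod
    def stage(self, logical_name: str) -> str:
        """Make a file available and return a transfer URL for it."""


class TapeBackend(StorageInterface):
    def stage(self, logical_name: str) -> str:
        # A real tape system would queue a mount and a read here.
        return f"tape://archive/{logical_name}"


class DiskCacheBackend(StorageInterface):
    def stage(self, logical_name: str) -> str:
        # A disk cache can serve the file directly.
        return f"file:///cache/{logical_name}"


def deliver(storage: StorageInterface, logical_name: str) -> str:
    """Framework-side code: it never inspects the backend type."""
    return storage.stage(logical_name)


# The same call works for every backend that implements the interface.
urls = [deliver(b, "run1234.raw") for b in (TapeBackend(), DiskCacheBackend())]
```

Under this assumption, supporting a new storage device means writing one more subclass, while deliver and everything above it stay unchanged.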
The SAMGrid team is in the process of implementing a monitoring and
information service, which fulfills several important roles in the
operation of the SAMGrid system, and will replace the first generation of
monitoring tools in the current deployments. The first generation tools
are in general based on text logfiles and represent solutions which are
not scalable or maintainable.
The roles of the monitoring and information service are: 1) providing
diagnostics for troubleshooting the operation of SAMGrid services; 2)
providing support for monitoring at the level of user jobs; 3) providing
runtime support for local configuration and other information
which currently must be stored centrally (thus moving the system toward
greater autonomy for the SAM station services, which include cache
management and job management services); 4) providing intelligent
collection of statistics in order to enable performance monitoring and
tuning.
The architecture of this service is quite flexible, permitting input from
any instrumented SAM application or service. It will allow multiple
backend stores for archiving of (possibly) filtered monitoring
events, as well as real-time information displays and an active
notification service for alarm conditions.
This service will be able to export, in a configurable manner,
information to higher-level Grid monitoring services, such as MonALISA.
We describe our experience to date with using a prototype version
together with MonALISA.
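
The architecture described above (instrumented sources, several backends, per-backend filtering, alarm notification) might be sketched as follows. This is an illustration under stated assumptions, not the SAMGrid monitoring service: the Event, Backend and Dispatcher names and the two event levels are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Event:
    source: str   # name of the instrumented service (hypothetical)
    level: str    # "info" or "alarm" in this sketch
    message: str


@dataclass
class Backend:
    """A sink that stores only the events accepted by its filter."""
    accept: Callable[[Event], bool]
    stored: List[Event] = field(default_factory=list)

    def handle(self, event: Event) -> None:
        if self.accept(event):
            self.stored.append(event)


class Dispatcher:
    def __init__(self) -> None:
        self.backends: List[Backend] = []

    def register(self, backend: Backend) -> None:
        self.backends.append(backend)

    def emit(self, event: Event) -> None:
        # Every backend sees every event; each applies its own filter.
        for backend in self.backends:
            backend.handle(event)


archive = Backend(accept=lambda e: True)              # keeps everything
alarms = Backend(accept=lambda e: e.level == "alarm") # alarm conditions only

dispatcher = Dispatcher()
dispatcher.register(archive)
dispatcher.register(alarms)
dispatcher.emit(Event("station", "info", "file delivered"))
dispatcher.emit(Event("station", "alarm", "cache disk full"))
```

An exporter to a higher-level monitoring system could be modelled as one more backend with its own filter.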
Using a Relational Database to House Metadata for the Common Physicist
Wyatt Merritt, Julie Trumbo, Rick St. Denis
SAM was developed as a data handling system for Run II at Fermilab. SAM is a
collection of services, each described by metadata. The metadata are modeled
on a relational database, and implemented in ORACLE. SAM, originally
deployed in production for the D0 Run II experiment, has now been also
deployed at CDF and is in testing at MINOS. This illustrates that the
metadata decomposition of its services has broader applicability than at
just one experiment. We believe this is the first example of such a
unification, where two complex collider experiments are sharing a schema for
the complete description of file contents, file locations, and processing
descriptions. Metadata for several million files are now stored for each
experiment. Over the last five years, greater understanding of the required
services in a performant data handling system has emerged. The collection of
metadata to support these services forms the core of the SAM system. We
describe this schema and the commonalities and differences that emerge from
the need to support two experiments. We also describe the support structure
required for schema updates: the use of development, integration, and
production instances. This talk will focus on the four categories of SAM
services, the functionality currently implemented in those services, and the
supporting metadata we collect for these services. We will also explore the
SAM Entity Relationship diagram as a visual means of understanding SAM and
its functions, together with some of the query structure needed and some of
the performance issues.
The SAMGrid team has recently refactored its test harness suite for
greater flexibility and easier configuration. This makes possible more
interesting applications of the test harness, for component tests,
integration tests, and stress tests. We report on the architecture of the
test harness and its recent application to stress tests of a new analysis
cluster at Fermilab, to explore the extremes of analysis use cases and
the relevant parameters for tuning in the SAMGrid station services.
This reimplementation of the test harness is a python framework which
uses XML for configuration and small plug-in python modules for specific
test purposes.
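
A harness of this general shape (XML configuration that selects small Python plug-ins) could look like the sketch below. It is not the SAMGrid test harness: the XML element and attribute names, the plugin decorator and the two example tests are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration; element and attribute names are invented.
CONFIG = """
<harness>
  <test plugin="count_files" files="3"/>
  <test plugin="check_size" size_mb="10" limit_mb="5"/>
</harness>
"""

PLUGINS = {}


def plugin(func):
    """Register a function as a test plug-in under its own name."""
    PLUGINS[func.__name__] = func
    return func


@plugin
def count_files(params):
    # Passes when at least one file is configured.
    return int(params["files"]) > 0


@plugin
def check_size(params):
    # Passes when the size does not exceed the configured limit.
    return float(params["size_mb"]) <= float(params["limit_mb"])


def run(config_text):
    """Parse the XML and run each configured test; return name -> passed."""
    results = {}
    for node in ET.fromstring(config_text.strip()).findall("test"):
        params = dict(node.attrib)
        name = params.pop("plugin")
        results[name] = PLUGINS[name](params)
    return results


results = run(CONFIG)
```

Keeping the test logic in plug-ins and the parameters in XML lets the same plug-in serve component, integration and stress tests by changing only the configuration.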
One current testing application is running on a 128-CPU analysis cluster
with access to 6 TB distributed cache and also to a 2 TB centralized
cache, permitting studies of different cache strategies. We have studied
the service parameters which affect the performance of retrieving data
from tape storage as well. The use cases studied vary from those which
will require rapid file delivery with short processing time per file, to
the opposite extreme of long processing time per file. We also show how the
same harness can be used to run regular unit tests on a
production system to aid early fault detection and diagnosis. These results are
interesting for their implications with regard to Grid operations, and
illustrate the type of monitoring and test facilities required to
accomplish such performance tuning.
SAMGrid Deployment for Production Simulation and
Reconstruction
Art Kreymer
The Fermilab CDF Run-II experiment is now providing official support for
remote computing, expanding this to about 1/4 of the total CDF computing
during the Summer of 2004.
I will discuss in detail the extensions to CDF software distribution
and configuration tools and procedures, in support of CDF GRID/DCAF
computing for Summer 2004. We face the challenge of unreliable networks,
time differences, and remote managers with little experience with
this particular software.
We have made the first deployment of the SAM data handling system
outside its original home in the D0 experiment.
We have deployed to about 20 remote CDF sites.
We have created lightweight testing and monitoring tools
to assure that these sites are in fact functional when installed.
We are distributing and configuring both client code within CDF code releases,
and the SAM servers to which the clients connect.
Procedures which once took days are now performed in minutes.
These tools can be used to install SAM servers for D0 and other experiments.
Networks permitting, we will give a live SAM installation demonstration.
We have separated the data handling components from the main CDF offline
code releases by means of shared libraries, permitting live upgrades
to otherwise frozen code.
We now use a special 'development lite' release to ensure that all sites
have the latest tools available.
We have put substantial effort into revision control,
so that essentially all active CDF sites are running exactly the same code.
Experience Producing Simulated Events for the D0
Experiment on the SAMGrid
Gabriele, Igor
Most of the simulated events for the DZero experiment at Fermilab have
been historically produced by the remote collaborating
institutions. One of the principal challenges reported concerns the
maintenance of the local software infrastructure, which is generally
different from site to site. As the community's understanding of
distributed computing over distributively owned and shared resources
progresses, the adoption of grid technologies to address the production
of Monte Carlo events for high energy physics experiments becomes
increasingly interesting. The SAM-Grid is a software system developed
at Fermilab, which integrates standard grid technologies for job and
information management with SAM, the data handling system of the DZero
and CDF experiments. During the past few months, this grid system has
been tailored for the Monte Carlo production of DZero. Since the initial
phase of deployment, this experience has exposed an interesting series of
requirements on the SAM-Grid services, the standard middleware, the
resources and their management, and on the analysis framework of the
experiment. As of today, the inefficiency due to the grid infrastructure
has been reduced to as little as 1%. In this paper, we present our
statistics and the lessons learned in running large high
energy physics applications on a grid infrastructure.
Testing the CDF Distributed Computing
Framework
Valeria Bartsch
To distribute computing for CDF (Collider Detector at Fermilab) a system
managing local compute and storage resources is needed. For this purpose
CDF will use the DCAF (Decentralized CDF Analysis Farms) system which is
already at Fermilab. DCAF has to work with the data handling system SAM
(Sequential Access to data via Metadata). However, both DCAF and SAM are
mature systems which have not yet been used in combination, and on top
of this DCAF has only been installed at Fermilab and not on local sites.
Therefore tests of the systems are necessary to test the interplay of
the data handling with the farms, the behaviour of the off-site DCAFs
and the user friendliness of the whole system. The tests are focussed on
the main tasks of the DCAFs, like Monte Carlo generation and stores, as
well as the readout of data files and connected data handling. To
achieve user friendliness the SAM station environment has to be common
to all stations and adaptations to the environment have to be made.
SAMGrid Experiences with the Condor Technology in Run II
Computing
Igor Terekhov
SAMGrid is a globally distributed system for data handling and job
management, developed at Fermilab for the D0 and CDF experiments in Run
II. The Condor system is being developed at the University of Wisconsin
for management of distributed resources, computational and otherwise. We
briefly review the SAMGrid architecture and its interaction with Condor,
which was presented earlier. We then present our experiences using the
system in production, which have two distinct aspects.
At the global level, we deployed Condor-G, the Grid-extended Condor, for
the resource brokering and global scheduling of our jobs. At the heart of
the system is Condor's Matchmaking Service. In more recent work at the
computing element level, we have been benefitting from the large computing
cluster at the University of Wisconsin campus. The architecture of
the computing facility and the philosophy of Condor's resource management
have prompted us to improve the application infrastructure for D0 and CDF,
in aspects such as parting with the shared file system or reliance on
resources being dedicated. As a result, we have increased productivity
and made our applications more portable and Grid-ready. Our fruitful
collaboration with the Condor team has been made possible by the
Particle Physics Data Grid.
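
The matchmaking idea (jobs and resources each state requirements, and the job ranks the acceptable resources) can be illustrated with a toy sketch. It does not use Condor's ClassAd language or any Condor API: the dictionaries, field names and the match function are hypothetical and only mimic the two-sided requirements-plus-rank pattern.

```python
def match(job, resources):
    """Return the name of the best mutually acceptable resource, or None."""
    candidates = [
        r for r in resources
        if job["requirements"](r) and r["accepts"](job)  # both sides must agree
    ]
    if not candidates:
        return None
    # Among acceptable resources, prefer the one the job ranks highest.
    return max(candidates, key=job["rank"])["name"]


# Hypothetical resource descriptions.
resources = [
    {"name": "site_a", "free_cpus": 4, "has_cache": True,
     "accepts": lambda job: job["owner"] == "d0"},
    {"name": "site_b", "free_cpus": 16, "has_cache": False,
     "accepts": lambda job: True},
    {"name": "site_c", "free_cpus": 8, "has_cache": True,
     "accepts": lambda job: True},
]

# Hypothetical job: needs a local cache and prefers more free CPUs.
job = {
    "owner": "cdf",
    "requirements": lambda r: r["has_cache"],
    "rank": lambda r: r["free_cpus"],
}

chosen = match(job, resources)
```

Here site_a is excluded by its own policy and site_b by the job's requirement, so the sketch selects site_c.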
The SAMGrid Database Server Component: Its upgraded
infrastructure and future development path
Lauri, Steve, Sinisa
The SAMGrid Database Server encapsulates several important services, such as
accessing file metadata and replica catalog, keeping track of the processing
information, as well as providing the runtime support for SAMGrid station
services. Recent deployment of the SAMGrid system for CDF has resulted in
unification of the database schema used by CDF and D0, and the complexity
of changes required for the unified metadata catalog has warranted a
complete redesign of the DB Server.
We describe here the architecture and features of the new server. In particular,
we discuss the new CORBA infrastructure that utilizes python wrapper classes
around IDL structs and exceptions. Such infrastructure allows us to
use the same code on both server and client sides, which in turn results
in significantly improved code maintainability and easier development.
We also discuss future integration of the new server with an SBIR II
project which is directed toward allowing the dbserver to access distributed
databases, implemented in different DB systems and possibly using different
schema.
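
The benefit of wrapping generated structs in ordinary Python classes can be shown without any CORBA library. In the sketch below, FileInfoStruct merely stands in for a class that an IDL compiler might generate, and FileInfo is a hypothetical wrapper; neither name comes from the SAMGrid code.

```python
class FileInfoStruct:
    """Stand-in for a class an IDL compiler might generate (hypothetical)."""

    def __init__(self, name, size):
        self.name = name
        self.size = size


class FileInfo:
    """Wrapper adding behaviour; the same class serves client and server."""

    def __init__(self, struct):
        self._struct = struct

    @classmethod
    def create(cls, name, size):
        # Validation lives in one place instead of in two code bases.
        if size < 0:
            raise ValueError("size must be non-negative")
        return cls(FileInfoStruct(name, size))

    def to_struct(self):
        """Return the raw struct to hand to the transport layer."""
        return self._struct

    @property
    def name(self):
        return self._struct.name

    @property
    def size_mb(self):
        return self._struct.size / (1024 * 1024)


# "Server" side builds a wrapper and ships the bare struct.
wire = FileInfo.create("run1234.raw", 2 * 1024 * 1024).to_struct()
# "Client" side rewraps the received struct with the same class.
received = FileInfo(wire)
```

Because both ends import the same wrapper, a change to validation or derived properties is made once.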
Deployment of SAM for the CDF Experiment
Stefan Stonjek
CDF is an experiment at the Tevatron at Fermilab. One dominating
factor of the experiment's computing model is the high volume of raw,
reconstructed and generated data. The distributed data handling services
within SAM move these data to physics analysis applications.
The SAM system was already in use at
the D-Zero experiment. Due to differences in the computing models of the
two experiments some aspects of the SAM system had to be adapted. We
will present experiences from the adaptation and the deployment
phase. This includes the behavior of the SAM system on batch systems
of very different sizes and types as well as the interaction between
the data handling and the storage systems, ranging from disk pools to
tape systems. In particular we will cover the problems faced on large
scale compute farms. To accommodate the needs of Grid computing, CDF
deployed installations consisting of SAM for data handling and CAF for
high throughput batch processing. The CDF experiment already had
experience with the CAF system. We will report on the deployment of
the combined system.
JIM Deployment for the CDF Experiment
Morag Burgon-Lyon
JIM (Job and Information Management) is a grid extension to the mature
data handling system called SAM (Sequential Access via Metadata) used by
the CDF, DZero and MINOS experiments based at Fermilab. JIM uses a thin
client to allow job submissions from any computer with Internet access,
provided the user has a valid certificate or Kerberos ticket. On
completion the job output can be downloaded using a web interface. The
JIM execution site software can be installed on shared resources, such as
ScotGRID, as it may be configured for any batch system and does not
require exclusive control of the hardware. Resources that do not belong
entirely to CDF and thus cannot run DCAF (Decentralised CDF Analysis
Farm), may therefore be accessed using JIM. We will report on the initial
deployment of JIM for CDF and the steps taken to integrate JIM with DCAF.
A working group on metadata with representatives from ATLAS, BaBar, CDF,
CMS, D0, and LHCb in cooperation with EGEE has identified overlapping
user requirements that may be supported by common service implementations.
Classes of metadata specific to each service and their relations are
described. These include a set of use cases based on compilation of
various HEP documents. These documents are used to inform interfaces in
existing and planned services as described in metadata schema.
Emphasis is placed on the evolution of schema using keyword-value pairs
that are then transformed into a normalised performant database schema. A
report is made of self-description mechanisms, which coupled with updating
processes, allow the APIs to remain static as the schema evolves. A
presentation is made of the way use cases drive performance. Requirements
are presented for the physical and logical arrangement of service
implementations, dictating the degree to which the databases containing
the metadata may be distributed or centralised. A set of existing
monitoring tools expose the validity and completeness of the use cases for
experiments in various stages of maturity. A survey of the query
languages, web service interfaces and tools in use across the experiments
is presented.
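
The step from keyword-value pairs to a normalised schema can be sketched with an in-memory SQLite database. The table and column names (kv, files, run, tier) and the sample rows are invented for illustration and do not reflect any experiment's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Flexible starting point: one row per (file, keyword, value).
conn.execute("CREATE TABLE kv (file_id INTEGER, keyword TEXT, val TEXT)")
conn.executemany(
    "INSERT INTO kv VALUES (?, ?, ?)",
    [
        (1, "run", "1234"), (1, "tier", "raw"),
        (2, "run", "1235"), (2, "tier", "reco"),
    ],
)

# Normalised target: one typed column per keyword that has stabilised.
conn.execute(
    "CREATE TABLE files (file_id INTEGER PRIMARY KEY, run INTEGER, tier TEXT)"
)
conn.execute(
    """
    INSERT INTO files (file_id, run, tier)
    SELECT file_id,
           CAST(MAX(CASE WHEN keyword = 'run' THEN val END) AS INTEGER),
           MAX(CASE WHEN keyword = 'tier' THEN val END)
    FROM kv
    GROUP BY file_id
    """
)

rows = conn.execute(
    "SELECT file_id, run, tier FROM files ORDER BY file_id"
).fetchall()
```

Queries against the files table can use typed comparisons and indexes, which is one reason a normalised form tends to perform better than the keyword-value form once the set of keywords is stable.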
To maximize the physics potential of the data currently being taken,
the CDF collaboration at Fermi National Accelerator Laboratory has
started to deploy user analysis computing facilities at several
locations throughout the world. Over 600 users are signed up and able
to submit their physics analysis and simulation applications directly
from their desktop or laptop computers to these facilities. These
resources consist of a mix of customized computing centers and a
decentralized version of our Central Analysis Facility (CAF) initially
used at Fermilab, which we have designated Decentralized CDF Analysis
Facilities (DCAFs). We report on experience gained during the initial
deployment and use of these resources for the summer conference season
2004. During this period, we allowed MC generation as well as data
analysis of selected data samples at several globally distributed
centers. In addition, we discuss a migration path from this first
generation distributed computing infrastructure towards a more open
implementation that will be interoperable with LCG, OSG and other
general-purpose grid installations at the participating sites.