Fermilab

SAMGrid

Plenary title: Run II Distributed Computing Moving onto the Grid


Open Science Grid

Ruth Pordes, Vicky White, Lothar Bauerdick, Amber Boehnlein, Wyatt Merritt, Rick Snider, Rick St. Denis, Frank Wuerthwein, Liz Sexton-Kennedy

The U.S.LHC Tier-1 and Tier-2 laboratories and universities are developing production Grids to support LHC applications running across a worldwide Grid computing system. Together with partners in computer science, physics grid projects and running experiments, we will build a common national production grid infrastructure which is open in its architecture, implementation and use. The OSG model builds upon the successful approach of last year's joint Grid2003 project. The Grid3 shared infrastructure has for over eight months given significant computational resources and throughput to more than six applications, including ATLAS and CMS data challenges, SDSS, LIGO and Biology analyses and computer science demonstrators. To move towards LHC-scale data management, access and analysis capabilities, we will need to increase the scale, services, and sustainability of the current infrastructure by an order of magnitude. This requires a significant upgrade in its functionalities and technologies. The OSG roadmap is a strategy and work plan to build the U.S.LHC computing enterprise as a fully usable, sustainable and robust grid, which is part of the LHC global computing infrastructure and open to partners. The approach is to federate with other application communities in the U.S. to build a shared infrastructure open to other sciences and capable of being modified and improved to respond to needs of other applications, including CDF, D0, BaBar and RHIC experiments. We describe the application driven engineered services of the OSG, short term plans and status, and the roadmap for a consortium, its partnerships and national focus.

SAMGrid Integration with SRMs

Robert Kennedy, Andrew Baranovski

SAMGrid is the shared data handling framework of the two large Fermilab Run II collider experiments: DZero and CDF. In production since 1999 at D0, and since mid-2004 at CDF, the SAMGrid framework has been adapted over time to accommodate a variety of storage solutions and configurations, as well as the differing data processing models of DZero and CDF. Backed by primary data repositories of approximately 1 PB in size for each experiment, the SAMGrid framework delivers over 100 TB/day to DZero and CDF analyses at Fermilab and around the world. Each of the storage systems used with SAMGrid, however, has distinct interfaces, protocols, and behaviors. This led to different levels of integration of the various storage devices into the framework, which complicated the exploitation of their functionality. In an effort to simplify the SAMGrid storage interfaces, SAMGrid has adopted the Storage Resource Manager (SRM) concept as the universal interface to all storage devices. This has simplified the SAMGrid framework, especially the implementation of storage device interactions. It prepares the SAMGrid framework for future storage solutions equipped with SRM interfaces, without the need for long and risky software integration projects. In principle, any storage device with an SRM interface can now be used with the SAMGrid framework. The integration of SRMs is an important further step towards evolving the SAMGrid framework into a co-operating collection of distinct, modular grid-oriented services. To date, SRMs for Enstore, dCache, local caches, and permanent disk locations are tested and in production use. This report outlines how SRMs have been integrated into the existing SAMGrid framework without disturbing on-going operations, and describes our operational experience with SAMGrid and SRMs in the field.
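
The idea of one universal interface in front of dissimilar storage systems can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the class names, the transfer_url method, and the URL formats are hypothetical and do not reproduce the SRM specification or the SAMGrid code.

```python
from abc import ABC, abstractmethod


class StorageResource(ABC):
    """Hypothetical uniform interface; not the actual SRM specification."""

    @abstractmethod
    def transfer_url(self, file_name: str) -> str:
        """Return a URL from which the named file can be read."""


class TapeStore(StorageResource):
    """Stand-in for a tape-backed system that stages files before reading."""

    def __init__(self) -> None:
        self.staged = set()

    def transfer_url(self, file_name: str) -> str:
        self.staged.add(file_name)  # record the staging request
        return f"tape-stage://{file_name}"


class DiskCache(StorageResource):
    """Stand-in for a plain disk location; no staging step is needed."""

    def __init__(self, root: str) -> None:
        self.root = root

    def transfer_url(self, file_name: str) -> str:
        return f"file://{self.root}/{file_name}"


def deliver(storage: StorageResource, file_name: str) -> str:
    """Framework-side code: identical for every storage backend."""
    return storage.transfer_url(file_name)
```

In this sketch, supporting a new storage system means writing one more subclass; the deliver function, which stands for the framework side, does not change.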

SAMGrid Monitoring and Information Service and its Integration with MonALISA

Sinisa Veseli, Adam Lyon

The SAMGrid team is in the process of implementing a monitoring and information service, which fulfills several important roles in the operation of the SAMGrid system, and will replace the first generation of monitoring tools in the current deployments. The first generation tools are in general based on text logfiles and represent solutions which are not scalable or maintainable. The roles of the monitoring and information service are: 1) providing diagnostics for troubleshooting the operation of SAMGrid services; 2) providing support for monitoring at the level of user jobs; 3) providing runtime support for local configuration and other information which currently must be stored centrally (thus moving the system toward greater autonomy for the SAM station services, which include cache management and job management services); 4) providing intelligent collection of statistics in order to enable performance monitoring and tuning. The architecture of this service is quite flexible, permitting input from any instrumented SAM application or service. It will allow multiple storage backends for archiving of (possibly) filtered monitoring events, as well as real time information displays and an active notification service for alarm conditions. This service will be able to export, in a configurable manner, information to higher level Grid monitoring services, such as MonALISA. We describe our experience to date with using a prototype version together with MonALISA.
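
The event flow described above (instrumented services publish events, each backend archives a possibly filtered subset, and alarm conditions trigger notification) might be sketched as follows. This is a minimal sketch under stated assumptions: the Event fields, the Backend and MonitoringService classes, and the rule that an "error" severity raises an alarm are all hypothetical and are not taken from the SAMGrid implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Event:
    source: str    # name of the instrumented service (hypothetical field)
    severity: str  # "info", "warning" or "error" (hypothetical levels)
    message: str


@dataclass
class Backend:
    """One archive or export target with its own event filter."""
    name: str
    accept: Callable[[Event], bool]
    stored: List[Event] = field(default_factory=list)

    def offer(self, event: Event) -> None:
        if self.accept(event):
            self.stored.append(event)


class MonitoringService:
    def __init__(self) -> None:
        self.backends: List[Backend] = []
        self.alarms: List[str] = []

    def add_backend(self, backend: Backend) -> None:
        self.backends.append(backend)

    def publish(self, event: Event) -> None:
        # Every backend sees every event and applies its own filter.
        for backend in self.backends:
            backend.offer(event)
        # Assumed alarm rule: any "error" event triggers a notification.
        if event.severity == "error":
            self.alarms.append(f"{event.source}: {event.message}")
```

A backend that exports to a higher-level Grid monitoring service would, in this picture, be just another Backend with a stricter filter than the local archive.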

Using a Relational Database to House Metadata for the Common Physicist

Wyatt Merritt, Julie Trumbo, Rick St. Denis

SAM was developed as a data handling system for Run II at Fermilab. SAM is a collection of services, each described by metadata. The metadata are modeled on a relational database, and implemented in Oracle. SAM, originally deployed in production for the D0 Run II experiment, has now also been deployed at CDF and is in testing at MINOS. This illustrates that the metadata decomposition of its services has broader applicability than at just one experiment. We believe this is the first example of such a unification, where two complex collider experiments are sharing a schema for the complete description of file contents, file locations, and processing descriptions. Metadata for several million files are now stored for each experiment. Over the last five years, greater understanding of the required services in a performant data handling system has emerged. The collection of metadata to support these services forms the core of the SAM system. We describe this schema and the commonalities and differences that emerge from the need to support two experiments. We also describe the support structure required for schema updates: the use of development, integration, and production instances. This talk will focus on the four categories of SAM services, the functionality currently implemented in those services, and the supporting metadata we collect for these services. We will also explore the SAM Entity Relationship diagram as a visual means of understanding SAM and its functions, and discuss some of the query structure needed and some of the performance issues.
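
A relational description of file contents and file locations can be illustrated with a small sketch. It uses Python's built-in sqlite3 module in place of Oracle, and the table names, column names, file names, and location strings are all invented for the example; the actual SAM schema is not reproduced here.

```python
import sqlite3


def build_catalog() -> sqlite3.Connection:
    """Create a toy two-table catalog (hypothetical schema, not SAM's)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE data_files (
            file_id     INTEGER PRIMARY KEY,
            file_name   TEXT UNIQUE NOT NULL,
            data_tier   TEXT NOT NULL,
            event_count INTEGER NOT NULL
        );
        CREATE TABLE file_locations (
            file_id  INTEGER NOT NULL REFERENCES data_files(file_id),
            location TEXT NOT NULL,
            PRIMARY KEY (file_id, location)
        );
        """
    )
    return conn


def demo_catalog() -> sqlite3.Connection:
    """Populate the catalog with made-up example rows."""
    conn = build_catalog()
    conn.executemany(
        "INSERT INTO data_files VALUES (?, ?, ?, ?)",
        [(1, "a.raw", "raw", 5000), (2, "a.reco", "reconstructed", 5000)],
    )
    conn.executemany(
        "INSERT INTO file_locations VALUES (?, ?)",
        [(1, "tape"), (2, "tape"), (2, "disk-cache")],
    )
    return conn


def locations_for_tier(conn: sqlite3.Connection, tier: str):
    """List (file_name, location) pairs for every file in one data tier."""
    rows = conn.execute(
        """
        SELECT f.file_name, l.location
        FROM data_files AS f
        JOIN file_locations AS l ON l.file_id = f.file_id
        WHERE f.data_tier = ?
        ORDER BY f.file_name, l.location
        """,
        (tier,),
    )
    return rows.fetchall()
```

Keeping locations in a separate table lets one file have several replicas, and a query such as locations_for_tier(demo_catalog(), "reconstructed") joins the two tables to find them.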

Application of the SAMGrid Test Harness for Performance Evaluation and Tuning of a Distributed Cluster Implementation of Data Handling Services

Adam, Matt

The SAMGrid team has recently refactored its test harness suite for greater flexibility and easier configuration. This makes possible more interesting applications of the test harness, for component tests, integration tests, and stress tests. We report on the architecture of the test harness and its recent application to stress tests of a new analysis cluster at Fermilab, to explore the extremes of analysis use cases and the relevant parameters for tuning in the SAMGrid station services. This reimplementation of the test harness is a Python framework which uses XML for configuration and small plug-in Python modules for specific test purposes. One current testing application is running on a 128-CPU analysis cluster with access to a 6 TB distributed cache and also to a 2 TB centralized cache, permitting studies of different cache strategies. We have also studied the service parameters which affect the performance of retrieving data from tape storage. The use cases studied vary from those which require rapid file delivery with short processing time per file, to the opposite extreme of long processing time per file. We also show how the same harness can be used to run regular unit tests on a production system to aid early fault detection and diagnosis. These results are interesting for their implications with regard to Grid operations, and illustrate the type of monitoring and test facilities required to accomplish such performance tuning.
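
The combination of XML configuration and small plug-in modules might look roughly like the sketch below. It is only an illustration under assumptions: the XML element and attribute names, the plugin registry, and the file_loop plug-in are hypothetical, and the plug-in merely computes the total processing time instead of requesting files from a station.

```python
import xml.etree.ElementTree as ET

PLUGINS = {}


def plugin(name):
    """Register a test function under a name usable from the XML config."""
    def register(func):
        PLUGINS[name] = func
        return func
    return register


@plugin("file_loop")
def file_loop(params):
    # Hypothetical plug-in: models a job that processes N files in turn.
    files = int(params["files"])
    seconds = float(params["seconds_per_file"])
    return {"files": files, "cpu_seconds": files * seconds}


# Two extremes: many quickly processed files versus a few slow ones.
CONFIG = """
<harness>
  <test name="rapid" plugin="file_loop" files="100" seconds_per_file="0.5"/>
  <test name="slow" plugin="file_loop" files="4" seconds_per_file="600"/>
</harness>
"""


def run(config_xml):
    """Run every <test> element with the plug-in it names."""
    results = {}
    for node in ET.fromstring(config_xml.strip()).findall("test"):
        params = dict(node.attrib)
        func = PLUGINS[params.pop("plugin")]
        results[params.pop("name")] = func(params)
    return results
```

With this layout, a new use case is a new line of XML, and a new kind of test is a new registered function; the run loop itself stays the same.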

SAMGrid Deployment for Production Simulation and Reconstruction

Art Kreymer

The Fermilab CDF Run-II experiment is now providing official support for remote computing, expanding this to about 1/4 of the total CDF computing during the Summer of 2004. I will discuss in detail the extensions to CDF software distribution and configuration tools and procedures, in support of CDF GRID/DCAF computing for Summer 2004. We face the challenge of unreliable networks, time differences, and remote managers with little experience with this particular software. We have made the first deployment of the SAM data handling system outside its original home in the D0 experiment. We have deployed to about 20 remote CDF sites. We have created lightweight testing and monitoring tools to assure that these sites are in fact functional when installed. We are distributing and configuring both client code within CDF code releases, and the SAM servers to which the clients connect. Procedures which once took days are now performed in minutes. These tools can be used to install SAM servers for D0 and other experiments. Networks permitting, we will give a live SAM installation demonstration. We have separated the data handling components from the main CDF offline code releases by means of shared libraries, permitting live upgrades to otherwise frozen code. We now use a special 'development lite' release to ensure that all sites have the latest tools available. We have put substantial effort into revision control, so that essentially all active CDF sites are running exactly the same code.

Experience Producing Simulated Events for the D0 Experiment on the SAMGrid

Gabriele, Igor

Most of the simulated events for the DZero experiment at Fermilab have historically been produced by remote collaborating institutions. One of the principal challenges reported concerns the maintenance of the local software infrastructure, which generally differs from site to site. As the community's understanding of distributed computing over distributively owned and shared resources progresses, the adoption of grid technologies for the production of Monte Carlo events for high energy physics experiments becomes increasingly attractive. The SAM-Grid is a software system developed at Fermilab, which integrates standard grid technologies for job and information management with SAM, the data handling system of the DZero and CDF experiments. During the past few months, this grid system has been tailored for the Monte Carlo production of DZero. Since the initial phase of deployment, this experience has exposed an interesting series of requirements on the SAM-Grid services, the standard middleware, the resources and their management, and the analysis framework of the experiment. As of today, the inefficiency due to the grid infrastructure has been reduced to as little as 1%. In this paper, we present our statistics and the lessons learned in running large high energy physics applications on a grid infrastructure.

Testing the CDF Distributed Computing Framework

Valeria Bartsch

To distribute computing for CDF (Collider Detector at Fermilab) a system managing local compute and storage resources is needed. For this purpose CDF will use the DCAF (Decentralized CDF Analysis Farms) system, which is already in use at Fermilab. DCAF has to work with the data handling system SAM (Sequential Access to data via Metadata). However, both DCAF and SAM are mature systems which have not yet been used in combination, and on top of this DCAF has only been installed at Fermilab and not at remote sites. Tests are therefore necessary to examine the interplay of the data handling with the farms, the behaviour of the off-site DCAFs and the user friendliness of the whole system. The tests are focussed on the main tasks of the DCAFs, like Monte Carlo generation and stores, as well as the readout of data files and connected data handling. To achieve user friendliness the SAM station environment has to be common to all stations and adaptations to the environment have to be made.

SAMGrid Experiences with the Condor Technology in Run II Computing

Igor Terekhov

SAMGrid is a globally distributed system for data handling and job management, developed at Fermilab for the D0 and CDF experiments in Run II. The Condor system is being developed at the University of Wisconsin for management of distributed resources, computational and otherwise. We briefly review the SAMGrid architecture and its interaction with Condor, which was presented earlier. We then present our experiences using the system in production, which have two distinct aspects. At the global level, we deployed Condor-G, the Grid-extended Condor, for the resource brokering and global scheduling of our jobs. At the heart of the system is Condor's Matchmaking Service. As a more recent work at the computing element level, we have been benefitting from the large computing cluster at the University of Wisconsin campus. The architecture of the computing facility and the philosophy of Condor's resource management have prompted us to improve the application infrastructure for D0 and CDF, in aspects such as parting with the shared file system or reliance on resources being dedicated. As a result, we have increased productivity and made our applications more portable and Grid-ready. Our fruitful collaboration with the Condor team has been made possible by the Particle Physics Data Grid.
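
Condor matchmaking pairs a job with a resource only when each side's requirements accept the other, and then prefers the candidate the job ranks highest. The sketch below imitates that two-sided matching with plain Python dictionaries and functions. It is a simplified illustration, not the ClassAd language or any Condor API, and the attribute names (memory_mb, mips, owner) are hypothetical.

```python
def matchmake(job, machines):
    """Return the best mutually acceptable machine, or None if there is none."""
    candidates = [
        m for m in machines
        # Two-sided test: the job must accept the machine and vice versa.
        if job["requirements"](m) and m["requirements"](job)
    ]
    if not candidates:
        return None
    # Among acceptable machines, pick the one the job ranks highest.
    return max(candidates, key=job["rank"])


# Made-up example advertisements.
machines = [
    {"name": "a", "memory_mb": 512, "mips": 100,
     "requirements": lambda job: True},
    {"name": "b", "memory_mb": 2048, "mips": 300,
     "requirements": lambda job: job["owner"] == "d0"},
    {"name": "c", "memory_mb": 4096, "mips": 200,
     "requirements": lambda job: True},
]

job = {
    "owner": "cdf",
    "requirements": lambda m: m["memory_mb"] >= 1024,
    "rank": lambda m: m["mips"],
}
```

Here machine "a" is rejected by the job (too little memory) and machine "b" rejects the job (wrong owner), so the job is matched to machine "c" even though "b" is faster. Because resource owners state their own conditions, the scheme suits resources that are shared and not dedicated to one experiment.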

The SAMGrid Database Server Component: Its upgraded infrastructure and future development path

Lauri, Steve, Sinisa

The SAMGrid Database Server encapsulates several important services, such as accessing file metadata and replica catalog, keeping track of the processing information, as well as providing the runtime support for SAMGrid station services. Recent deployment of the SAMGrid system for CDF has resulted in unification of the database schema used by CDF and D0, and the complexity of changes required for the unified metadata catalog has warranted a complete redesign of the DB Server. We describe here the architecture and features of the new server. In particular, we discuss the new CORBA infrastructure that utilizes python wrapper classes around IDL structs and exceptions. Such infrastructure allows us to use the same code on both server and client sides, which in turn results in significantly improved code maintainability and easier development. We also discuss future integration of the new server with an SBIR II project which is directed toward allowing the dbserver to access distributed databases, implemented in different DB systems and possibly using different schema.

Deployment of SAM for the CDF Experiment

Stefan Stonjek

CDF is an experiment at the Tevatron at Fermilab. One dominating factor of the experiment's computing model is the high volume of raw, reconstructed and generated data. The distributed data handling services within SAM move these data to physics analysis applications. The SAM system was already in use at the D-Zero experiment. Due to differences in the computing models of the two experiments, some aspects of the SAM system had to be adapted. We will present experiences from the adaptation and deployment phases. This includes the behavior of the SAM system on batch systems of very different sizes and types, as well as the interaction between the data handling and the storage systems, ranging from disk pools to tape systems. In particular we will cover the problems faced on large scale compute farms. To accommodate the needs of Grid computing, CDF deployed installations consisting of SAM for data handling and CAF for high throughput batch processing. The CDF experiment already had experience with the CAF system. We will report on the deployment of the combined system.

JIM Deployment for the CDF Experiment

Morag Burgon-Lyon

JIM (Job and Information Management) is a grid extension to the mature data handling system called SAM (Sequential Access via Metadata) used by the CDF, DZero and MINOS experiments based at Fermilab. JIM uses a thin client to allow job submissions from any computer with Internet access, provided the user has a valid certificate or Kerberos ticket. On completion the job output can be downloaded using a web interface. The JIM execution site software can be installed on shared resources, such as ScotGRID, as it may be configured for any batch system and does not require exclusive control of the hardware. Resources that do not belong entirely to CDF, and thus cannot run DCAF (Decentralised CDF Analysis Farm), may therefore be accessed using JIM. We will report on the initial deployment of JIM for CDF and the steps taken to integrate JIM with DCAF.

Metadata for the Common Physicist

Rick St. Denis

A working group on metadata with representatives from ATLAS, BaBar, CDF, CMS, D0, and LHCb, in cooperation with EGEE, has identified overlapping user requirements that may be supported by common service implementations. Classes of metadata specific to each service and their relations are described. These include a set of use cases based on a compilation of various HEP documents. These documents are used to inform interfaces in existing and planned services as described in metadata schema. Emphasis is placed on the evolution of schema using keyword-value pairs that are then transformed into a normalised performant database schema. A report is made of self-description mechanisms, which, coupled with updating processes, allow the APIs to remain static as the schema evolves. A presentation is made of the way use cases drive performance. Requirements are presented for the physical and logical arrangement of service implementations, dictating the degree to which the databases containing the metadata may be distributed or centralised. A set of existing monitoring tools exposes the validity and completeness of the use cases for experiments in various stages of maturity. A survey of the query languages, web service interfaces and tools in use across the experiments is presented.

Globally Distributed User Analysis Computing at CDF

Alan Sill

To maximize the physics potential of the data currently being taken, the CDF collaboration at Fermi National Accelerator Laboratory has started to deploy user analysis computing facilities at several locations throughout the world. Over 600 users are signed up and able to submit their physics analysis and simulation applications directly from their desktop or laptop computers to these facilities. These resources consist of a mix of customized computing centers and a decentralized version of our Central Analysis Facility (CAF) initially used at Fermilab, which we have designated Decentralized CDF Analysis Facilities (DCAFs). We report on experience gained during the initial deployment and use of these resources for the summer conference season 2004. During this period, we allowed MC generation as well as data analysis of selected data samples at several globally distributed centers. In addition, we discuss a migration path from this first generation distributed computing infrastructure towards a more open implementation that will be interoperable with LCG, OSG and other general-purpose grid installations at the participating sites.

Last modified: 07-May-2004 14:34 CDT