This design document continues the effort to understand, in order to
implement, the functionality of the station. We develop the ideas born in
discussions with (or expressed by) several people, especially VW, LL, DP, RW.
All these people should assume credit for this document (unless their ideas
have been changed or misinterpreted beyond all recognition). Some of the
previous ideas were captured in Lee's document.
The new ideas are mostly contained in the actual section on cache management.
This revision incorporates (wherever possible) comments from RW, Ruth, and VW.
In the remainder of the introduction, we elaborate on our design goals
and then present our rationale based on the analysis of the resource management
issues. We will also clarify some of the terms we use most frequently. In
Section 2, we present the actual disk management ideas. In Section 3, we
address the issues of administering the SAM station. Section 4 describes what
work has to be done on the Database side of our project. Section 5 concludes
with the anticipated impact on users, the ultimate judges of our work.
As the Run II data handling system, SAM ought to manage (and
optimize whenever possible) hardware resources in order to provide the most
efficient data access for the users. In order to do so, SAM controls (monitors
and regulates) data access by different physics groups and individuals. We
envision a hierarchy in resource management and data access coordination: the
global resource manager called the optimizer is responsible for the
global resources (primarily related to the ATL, such as tape mount rate and
tape drives); the station coordinates allocation of resources to
projects; and projects coordinate data access and resource usage
by their consumers (which are also end users of the resources).
We assume that the network bandwidth as seen by the station is abundant
(thanks to the efforts of our partner projects) so that we can exclude it from
the station design. We will defer CPU management until the batch system is fully
understood. Thus, the station primarily manages the disk (and the files
cached on it). The goal of this design is to suggest a picture for efficient
and convenient access to disk files through SAM. We will not concentrate on
management of disk bandwidth, i.e., we will largely ignore disk contention,
assuming that the disk requests will be randomized well enough so as to provide
natural balancing of disk I/O across the many disks.
Note: it is hard
to define "efficient" until we fully understand the meaning of the
throughput that we aim to maximize. It is almost certain however that
the Station's Cache Manager will try to minimize the number of file
transfers to/from HSM because these are expensive in any reasonably defined
cost metric.
The station will have to cooperate with non-SAM resource management
systems, primarily with the batch system. SAM will not assume the role of
a batch system job scheduler; rather, it will assist the batch system in
resource co-allocation in the following way. The traditional batch system
usually handles only exclusive-access resources such as CPU, physical memory,
scratch disk, etc. In addition, the batch system may not properly manage
global resources such as those related to the ATL, which we currently
believe will be the most scarce among all the resources.
SAM will aim at remedying potential deficiencies of the batch system. For example, SAM may help determine a job's priority by comparing the job's SAM resource requirements with resource availability. SAM will treat a disk-cached file as a resource. (It may not be immediately obvious that a cached file is a resource because, unlike better-known resources such as CPU, this resource is sharable. In computer science, however, sharable resources are a canonical resource type.) Clearly, the availability of a job's files in the disk cache greatly affects its expected turnaround and therefore the extent to which it may be desirable to schedule the job sooner.
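One way to picture a cached file acting as a resource is to rank queued jobs by how much of their input is already on disk. The following is only a sketch under that assumption: the function names and the "fraction of files cached" metric are hypothetical and are not part of the SAM design, which leaves the actual priority computation open.

```python
def cached_fraction(job_files, cached_files):
    """Fraction of a job's requested files that are already in the disk cache."""
    wanted = set(job_files)
    if not wanted:
        return 1.0  # a job that needs no files is never waiting on the cache
    return len(wanted & set(cached_files)) / len(wanted)


def rank_jobs(jobs, cached_files):
    """Order job names so that jobs with more cached input come first.

    jobs: dict mapping a (hypothetical) job name to the files it requests.
    """
    return sorted(jobs, key=lambda j: cached_fraction(jobs[j], cached_files),
                  reverse=True)
```

A batch system could consult such a ranking as one input among many; it does not replace the batch scheduler.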
In building the SAM station, we defer most of the resource management issues until (a) the batch system is completely understood and (b) some global optimization is begun. Development of disk management in the station will, however, be the first step towards the station-batch interface. We assume that, given the ability to allocate and schedule disks, the knowledge of the data file sets requested by projects (both running and queued), combined with a user-supplied CPU per event estimate, will provide the necessary basis to build the station-batch interface.
In summary, the rationale for the station design in the present form,
i.e., largely restricted to disk management, is as follows. We strongly believe
that an intelligent disk (cache) management is (a) a well-defined task of the
station, to be integrated seamlessly into the bigger picture, (b) a natural
step towards efficient overall resource management, rather than a diversion
from the ongoing overall analysis, and (c) a necessity at the present stage of
the SAM evolution as a project.
Specifically in the context of the station design, we will use the
following terms:
A Station is said to manage a disk if and only if:
While these points may seem obvious for some readers, they may not be obvious for everyone, thus requiring an explanation. We have already seen how violation of the second item brings SAM into an inconsistent state. As for the first item, consider an example where a non-SAM entity writes a file onto a SAM-managed disk. If the space had not been allocated, the station may dedicate some or all of the physical blocks for another purpose thus leading to unpredictable consequences.
The station will either manage the cache automatically or provide
administrative tools for direct disk manipulation by the human administrator.
The former encompasses what Lee's document refers to as Short/Long Term Caches
and Buffers and is described in the following subsection. The latter is
primarily based on file locking/unlocking, described at the end of this section
(as well as on the explicit allocate() operation, also described there).
The primary contribution of the present document is given by the
following discussion. The proposed design differs significantly from earlier
ideas.
The distinction between the Cache and the Buffer is too fine and becomes cumbersome when enforced by the design. In many cases, it is simply not possible to predict whether a file will be reused in the near future or not. Treating a part of the disk as a buffer simply means choosing a particular (FIFO) cache replacement algorithm. We are not presenting any particular cache replacement algorithm; moreover, we assume that multiple algorithms will be possible (and dynamically set) for various parts of the total disk on the station. Thus, we erase the boundaries between Buffer, Short Term Cache, and Long Term Cache, while understanding that different parts of the station may be configured to behave effectively as one of these. In short, we treat all the station's disk as THE CACHE.
The Station's Cache Manager (CM) is responsible for coordination of projects requesting files and proper cooperation with the global resource manager (i.e., the optimizer). The cache management algorithm will essentially generalize that in the project master's replenisher: while the replenisher serves only its (directly attached) project master, the station's disk manager serves any number of projects, possibly with overlapping file requests. In other words, the replenisher can be instantiated either within the process space of the project or, in the canonical case, in the process space of the station master.
For backward compatibility with projects that must (or wish to) run without the station master, the cache manager will implement all the interfaces of the replenisher. Thus, every project master will communicate with the same interface implemented either as directly attached replenisher or in the station, with the decision being made at project startup time.
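The idea of one interface with two implementations can be sketched as follows. This is a minimal illustration, not the actual replenisher interface: the class names, the two methods, and the `choose_provider` helper are all hypothetical.

```python
from abc import ABC, abstractmethod


class FileProvider(ABC):
    """Hypothetical common interface seen by every project master."""

    @abstractmethod
    def request_files(self, project, files):
        """Register the files a project wants delivered."""

    @abstractmethod
    def release_file(self, project, name):
        """Tell the provider that a project is done with a file."""


class AttachedReplenisher(FileProvider):
    """Serves exactly one project master, inside that project's process."""

    def __init__(self):
        self.wanted = set()

    def request_files(self, project, files):
        self.wanted.update(files)

    def release_file(self, project, name):
        self.wanted.discard(name)


class StationCacheManager(FileProvider):
    """Serves many projects, whose file requests may overlap."""

    def __init__(self):
        self.wanted = {}  # file name -> set of interested projects

    def request_files(self, project, files):
        for name in files:
            self.wanted.setdefault(name, set()).add(project)

    def release_file(self, project, name):
        projects = self.wanted.get(name, set())
        projects.discard(project)
        if not projects:
            self.wanted.pop(name, None)


def choose_provider(station=None):
    """Decide at project startup: use the station if there is one."""
    return station if station is not None else AttachedReplenisher()
```

The project master only ever holds a `FileProvider`, so it does not need to know which of the two it was given.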
When a project is started, its snapshot files are added to the "requested file" set in the Cache Manager. The CM then requests authorization from the optimizer for all the newly requested files, i.e., those that weren't already known before this project started. (The CM "knows" a file if it is already cached or requested to be cached.) At all times, each file in the "requested" set is associated with at least one project that expressed interest in it.
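The bookkeeping for project startup might look like the sketch below. The class and method names are hypothetical; the only behavior taken from the text is that a file is "known" if it is cached or already requested, and that only unknown files need authorization from the optimizer.

```python
from collections import defaultdict


class RequestedFiles:
    """Tracks which projects want which files (hypothetical helper)."""

    def __init__(self, cached=()):
        self.cached = set(cached)          # files already on disk
        self.requested = defaultdict(set)  # file name -> interested projects

    def knows(self, name):
        # The CM "knows" a file if it is cached or already requested.
        return name in self.cached or name in self.requested

    def start_project(self, project, snapshot):
        """Register a project's snapshot; return files needing authorization."""
        new_files = []
        for name in snapshot:
            if not self.knows(name):
                new_files.append(name)
            if name not in self.cached:
                self.requested[name].add(project)
        return new_files
```

For example, if file `a` is cached and project `p1` starts with snapshot `a, b, c`, only `b` and `c` are sent for authorization; a later project asking for `b` adds itself to the interested set without triggering a second request.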
When the authorization for a file arrives, the file is added to the "can go" file list. This is the list of files, hopefully grouped by volume (if the optimizer has done a good job), whose HSM->disk retrieval can begin as soon as there is enough cache space. Specifically, if the disk requirements for the next delivery group (see below) can be met by erasing some of the disposable files (called "can free" in the replenisher), the CM instructs the stager(s) to erase the disposable files and initiate the deliveries for the group. A delivery group is a sublist of the "can go" list that is a unit of ENCP work; naturally, it is a set of files from one physical volume (tape). If tape mounts are the most scarce resource, a group includes all the files from the tape that are needed by all the known projects. If disk space becomes limited as well, the group size may decrease to a single file (as in the initial implementation of the replenisher).
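The selection of a delivery group can be sketched as below. This is an illustration only: the function name, the tuple layouts, and the policy of serving the volume with the most pending files first are assumptions, and the order in which disposable files are erased is simply the order given.

```python
from collections import defaultdict


def next_delivery_group(can_go, free_space, disposable):
    """Pick one volume's files and the disposable files to erase for them.

    can_go:     list of (file_name, volume, size) already authorized
    free_space: currently unused cache space, in the same units as size
    disposable: list of (file_name, size) that may be erased
    Returns (group, to_erase), or (None, []) if nothing can be delivered.
    """
    by_volume = defaultdict(list)
    for name, volume, size in can_go:
        by_volume[volume].append((name, size))
    if not by_volume:
        return None, []

    # Assumption: serve the volume with the most pending files first,
    # to get the most out of a single tape mount.
    volume = max(by_volume, key=lambda v: len(by_volume[v]))
    group = by_volume[volume]

    # Shrink the group (down to a single file) when disk space is tight.
    reclaimable = free_space + sum(size for _, size in disposable)
    while group and sum(size for _, size in group) > reclaimable:
        group = group[:-1]
    if not group:
        return None, []

    # Erase only as many disposable files as the group actually needs.
    needed = sum(size for _, size in group)
    to_erase = []
    for name, size in disposable:
        if free_space >= needed:
            break
        to_erase.append(name)
        free_space += size
    return [name for name, _ in group], to_erase
```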
When a stager notifies the CM of a successful file retrieval, the file becomes a "cached file" and is served to the projects associated with the file. The newly cached file is marked as being in use. Its new location is added to the database. Each project then serves the file to its consumers in the usual way; when all the consumers are done, the project releases the file by calling the CM. It is important for the CM to be able to limit the time a project takes to process a file, much like projects themselves have time limits for their consumers to process a file.
Finally, when all the projects release a file, the file is added to the "disposable" list (see above) and the CM reviews its chances to deliver the next group, at which point the file may be erased. Exactly which disposable files are selected to be erased is irrelevant for this document; what is important is that the CM possesses enough information about file accesses (both past and near future) in order to execute some intelligent generalization of LRU or another cache algorithm (see the section on persistent variables). When a file is erased from disk, its associated location is erased from the database.
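The release and erase steps can be sketched as follows. This document deliberately does not fix a replacement algorithm, so the sketch assumes the simplest one, erasing files in the order they became disposable; the class and method names are hypothetical.

```python
from collections import OrderedDict


class DisposableList:
    """Files released by all projects, oldest release first."""

    def __init__(self):
        self.sizes = {}                  # file name -> size on disk
        self.in_use = {}                 # file name -> projects still using it
        self.disposable = OrderedDict()  # file name -> size, in release order

    def serve(self, name, size, projects):
        """A cached file is handed to its projects and marked as in use."""
        self.sizes[name] = size
        self.in_use[name] = set(projects)
        self.disposable.pop(name, None)  # a reused file is no longer disposable

    def release(self, name, project):
        """A project is done; the file is disposable once nobody uses it."""
        users = self.in_use[name]
        users.discard(project)
        if not users:
            del self.in_use[name]
            self.disposable[name] = self.sizes[name]

    def evict(self, needed):
        """Erase the oldest disposable files until `needed` space is freed."""
        erased, freed = [], 0
        while self.disposable and freed < needed:
            name, size = self.disposable.popitem(last=False)
            del self.sizes[name]
            erased.append(name)
            freed += size
        return erased
```

A real implementation would also remove each erased file's location from the database, as described above.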
If we want multiple stations to access each other's caches, the decision by CM on when to erase a file may become quite complicated. We assume that the global resource manager will coordinate inter-station file exchange; for now, we can either (1) disallow a station accessing a file from another station, or (2) allow remote cache access but then be prepared for the possibility of delivery errors and ignore them.
The station Cache Manager is required to support the
notion of a locked (AKA pinned) file, i.e., a file that has been
marked as "unerasable" until further notice. We will assume that any cached
file (whether in use or disposable) may be locked on disk by a user with
sufficient privileges. Clearly, uncontrolled use of this facility will
incapacitate the CM by eventually locking all the files, thus leaving
effectively no free space on disk and precluding any intelligent cache
algorithm from execution. Therefore, the locking of files is primarily intended
for specific kinds of data (such as Thumbnail or calibration) and by group
administrators only.
Locked files (and their occupied space) are effectively excluded from
the disk management algorithms above. It is critical, however, that similarly
to any other disk files, locked files are subject to full access history
monitoring. This access history will be provided to the administrators for
their viewing pleasure (well, actually to facilitate decisions to change the
contents of the locked area).
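The two requirements on locked files, exclusion from erasure and full access monitoring, can be sketched together. The names below are hypothetical, and ordering candidates by most recent access is just one possible policy.

```python
import time


class CacheEntry:
    """A cached file with a lock flag and a full access history."""

    def __init__(self, name):
        self.name = name
        self.locked = False
        self.accesses = []  # timestamps of every access, locked or not

    def record_access(self, when=None):
        self.accesses.append(time.time() if when is None else when)


def eviction_candidates(entries):
    """Unlocked entries only, least recently accessed first."""
    unlocked = [e for e in entries if not e.locked]
    return sorted(unlocked,
                  key=lambda e: e.accesses[-1] if e.accesses else 0.0)
```

Note that locking an entry removes it from the candidates but does not stop `record_access` from accumulating history, which is what an administrator would later review.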
It is important that SAM be responsible for controlling
output buffer allocation. Although we will most likely choose to set aside
an output buffer area, ultimately we must treat both input and output areas as
parts of THE DISK for the following reasons. First, we must ensure proper rate
of output buffer flushing and reasonable availability of output buffer area for
user jobs as it affects the overall progress of projects (and we are concerned
with the consumption rate in this design). Second, an "output" file may become
an "input" file soon enough that the distinction between input area and
output area becomes quite artificial. We therefore envision the
aforementioned allocate() method in the station interface.
(In the SAM stub in the analysis framework we already have the place to
call it; the framework will do so right before opening an output file.)
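A toy model of allocate() is sketched below. The signature, the boolean return value, and the `flushed` callback are assumptions made for illustration; the design above only establishes that space must be reserved through the station before an output file is opened.

```python
class OutputArea:
    """Toy stand-in for the station's accounting of output space."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.allocated = {}  # file name -> reserved size

    def free(self):
        return self.capacity - sum(self.allocated.values())

    def allocate(self, name, size):
        """Reserve space for an output file; refuse if it does not fit."""
        if size > self.free():
            return False
        self.allocated[name] = size
        return True

    def flushed(self, name):
        # Assumption of this toy model: once a file has been stored to HSM,
        # its reservation is dropped. A real station might instead keep the
        # file around as a cached input file.
        self.allocated.pop(name, None)
```

In this model the framework would call `allocate` right before opening the output file and open it only if the call succeeds.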
Station configuration is the set of parameters to be
controlled by system and group administrators. The number of parameters
should be neither too small (lest administrators think that SAM is too
simplistic or that they don't have enough control) nor too large (lest
administrators get too confused). These parameters fall into approximately
three categories:
Note: allocations of some global resources, such as tape drives or tape mounts per hour, to a station will likely not be a part of that station configuration; rather, those will define the configuration of the global resource manager (optimizer).
Example activities of administrators changing these parameters include:
The station master is a permanent "stateful" server; therefore, it must store its state persistently in order to recover from software failures and system reboots. Upon startup, the station master reads its state from the database using the interface with the server. The latter is of course driven by what constitutes the state of the station.
In this section, we present the required DB support for the proposed design. It is not the purpose of this document to decide the exact table organization in the database; together with other project developers, we possess ample expertise to do so. Instead, we intend to define what variables must be made persistent.
The quasi-permanent configuration-related variables are based on the following entities and relationships:
The more dynamic objects that are created by the station itself will require the following entities to be added to the database:
We hereby suggest that the remaining information could then be derived from these tables upon station startup. For example, the access history for a particular file is based on the already existing analysis_projects table and analyzed_files table.
The DB server interfaces should be such that they allow storage and
retrieval of the above station variables. In addition, interfaces to record
significant events, which already include project begin/end, should be
extended so as to incorporate file delivery/erasure.
In this section we attempt to predict the change in "look and feel"
of SAM, i.e., give the flavor of new commands and outline benefits for the end
users (aside from performance increase due to extensive caching of files). With
the introduction of the SAM station, and from that time on, a clear distinction
will be made between administrators and end users. Almost all of the new
commands/tools will be for use by administrators for configuring and restarting
the station.
Typical command lines for configuration will feel like:
sam add disk --disk=/sam/cache1 --size=1000000 --station=central-analysis
sam increase/set allocation --group=mcc99 --disk=/sam/cache1 --size=200000
Typical administrative command to lock a file on disk:
sam lock --file=sim.pmc02_01.pythia.zhbbmet_mb1.1av_200evts.292_1753
(This command may involve physical moving of the file.)
As for the end users, the major benefit will be in relieving them of explicit buffer allocation/cleanup for their projects. The sam start project command (or its successor) will be a request to the station, rather than an action of physically starting the project master; therefore, the command may fail if the station rejects the job. Furthermore, as we work towards the integration with the batch system, we will more frequently speak of a user job and less frequently of a project. A single consumer project is a part of the user job, which essentially entails (1) starting a project, (2) running an analysis program, and (3) stopping the project. Our tendency is toward a single command such as one of the following:
sam run XXX.py <params>
sam submit XXX.py <params>
Users will have to deal with SAM-imposed resource restrictions, such as
disk/ATL usage. We are excited to see how we can, by (seemingly) creating
problems for every particular individual, enlighten the life of the
Collaboration as a whole!
=============================================================================
Project : SAM
Package : sam_doc
$Id: station.html,v 1.10 2000/03/24 19:58:27 vranicar Exp $
This work is part of a development project, called SAM, which consists of a
number of coordinated packages each named sam_xxxx .
Notice of authorship, copyright status, and terms and conditions, should
the software eventually become available for use outside Fermilab, can be
found in the README and LICENCE files in the top level directory of the main
sam package.
==============================================================================
Last modified: 11-May-2005 10:43 CDT