Bringing SAM Deployment of v6 Back Under Control
- Step 1:
Define what CDF needs from SAM, on which systems, and why
- Step 2:
Document SAM usage
- 2.1 Use of SAM in AC++
- 2.2 Sam for Storing Files: Sam Upload
- 2.3 SAM on CAF
- 2.4 SAM use on the Desktop
- 2.5 SAM in Predator
- 2.6 SAM in Scavenger
- 2.7 SAM Datasets Off Site
- 2.8 CDF Configuration
- 2.9 DiskCacheManager Local Copy
- Step 3:
Address the SAM team concerns uncovered by step 2 (e.g., make log
files web visible, establish development/integration environments,
etc.)
- 3.1 Log Files
- 3.2 Integration and Development Environments
- Step 4:
Repeat SAM on CAF testing program with configuration environments
established in 1-3.
- 4.1 CAF Monitoring with AC++
- 4.2 DBServer Monitoring
- 4.3 One-Off Integration Tests per Bundle Release
- Step 5:
Write, Review, Approve phased deployment plan to CDF users, SAMGrid
Management, CDF software management.
- A1: Appendix 1: Comments and thoughts on Doc
Step 1:
Define what CDF needs from SAM, on which systems, and why
1.1 Drivers
- Easy storage of MC from remote sites for Winter Conferences
- Use of offsite resources for Winter Conferences for MC and User
analysis
- Reduction in operational load on Data Handling through control of
Caching resources
- Reproducible cataloging of user production
- Cataloging and management of Farms processes with CAF
This translates into the following prioritized list of goals:
- SAM online
- MC Store (sam_store)
- CAF
- SAM on production: farm
- SAM on dCAF
These should be formed into projects that the SAM management guides. The
SAM on dCAF and SAM on CAF are already described on
this web page.
1.2 Systems
fbsng CAF at FNAL: 100 TB of disk on DCache with SAM traffic shaping in
the CAF, 1200 cpus.
Condor CAF at FNAL: Same 100 TB of disk as for fbsng, same
requirements, 2000 cpus.
Together, fbsng and Condor CAF have 5000 segment slots. Typically there
are 50 jobs running, with segments starting in groups of 5 (Condor) or
10 (Fbsng).
Import fileservers and stations: Used to import files for
concatenation:
- MC Import Support: A few TB of SAM cache are needed to handle import
from external MC production. We estimate that 3 such servers are
needed. This can probably be done with one sam station.
fcdfdata014 (cdf-cat) and fcdfdata016 (cdf-sam) are currently
being used. fcdfdata014 is available only to the Italian
collaborators (they paid for the machine). fcdfdata016 is
our main sam station supporting caf. This functionality
should not be combined with import support.
- MC Import and Validation Support: This requires autodest on
durable cache (Doc: ppt or pdf). We currently have this for
testing of user stripping of datasets by the Italians, who
brought files to fcdfdata014 for concatenation and validation(?).
- Farms Support: The farms model with SAM is shown in this picture.
10 TB of file servers need to be given durable storage so that
concatenation can be done. (Question: Still don't understand; the CDF
Note accompanying the picture is due this Friday, October 23.)
Offsite SAM stations
Support for on the order of 20 offsite stations. The cpu and disk
resources are listed here. The CDF datasets that are pinned at these
locations are listed here.
All stations have a private network. Some have nfs mounted SAM cache;
some have to use Fedor's copy code.
Executive summary of configurations:
CNAF and TTU use SAN storage systems (Linux head nodes using Red Hat
Enterprise Linux in each case). The only site that I am sure has tape
and dCache is Karlsruhe. San Diego has a successful deployment of
resilient dCache, with disks distributed over several nodes in their
cluster.
Most others use nfs-mounted RAID-5 disk storage. Off-site storage is
one of the weakest aspects of our DCAF deployments so far, as you know,
with typically only a few TB of disk storage available at each site.
Details on individual sites:
- CNAF
The CDF-CNAF SAM station is actually using NFS mounted disks.
Some are simple servers with RAID5 arrays of IDE disks a la Fermilab;
some are an FC-based SAN, but still accessed via a host that
NFS mounts to the SAM station.
Access from the CAF worker nodes is done via NFS automount,
with all the known troubles. We have thus implemented a local CafExe hack
that, on some of the disk servers, uses the DCM_COPY tools from
DHInput + fcp + anonymous ftp to make a local copy of the
file requested by SAM. While this is an elegant solution,
sociological difficulties have prevented us from using it on
all servers, and the CafExe hack to do this is imperfect.
Migration to CondorCaf during the week of October 25 will include a
cleaner implementation that will hopefully be imported into CVS.
No tape access is foreseen in the near future. CNAF has a large
tape robot that uses CERN Castor, but there is no point in
looking at it until both SAM and Castor have a reliable SRM interface
to begin with.
- GridKa
One of the (older) fileservers is a 'standard' ide-raid disk box,
mounted via nfs.
The others are fibre-channel based and in a SAN, running GPFS as the
file-system.
CDF at FZK is using
a) NAS (3Ware 6k series RAID 5, 1.9 TB in a box)
b) NAS with 3Ware 7k series RAID 5, 1.9 TB via SAN
c) GPFS with SAN Fibre Channel disks 20 TB (IBM FastT700)
All storage is exported via NFS to the clients.
More info here.
dCache is connected to a TSM backend. We are working on scaling the
connection. Right now a single stream 40 MB/s to tape is available.
Our library has 500 TB and can and will be expanded to 1500 TB.
- Rutgers
NFS exported disks, one server with internal IDE RAID5, another
with external IDE RAID5 connected via SCSI.
Workers in private network with outgoing connectivity.
- Toronto. Has a mix of SCSI servers and one IDE disk
server that are serving the SAM disks (total about 4 TB), all via
automounted NFS. No tape. Worker nodes are hidden behind a firewall with
no external IPs but have outbound connectivity.
Test stations and setups
There are a variety of test stations and setups. The most notable are
- nglas09, nglas12, nglas13, nglas14, nglas15: the mini caf.
This has 4 old farm nodes and a dual-processor
head node (nglas09). It is located in the trailers (worker
nodes in Rick StD's office; nglas09 is in the
outback). Valiera coordinates usage, and it is currently being
used for JIM and Grid3 testing.
- cdfsamint and cdfsamdev, described below.
- nglas08 plus
nglas03, nglas04, nglas05, nglas06, nglas07, nglas10, nglas16:
a CAB-like cluster of stations for testing of SAM by the Glasgow Group,
located mainly in Rick's office, but also in the outback.
- lf7.ph.gla.ac.uk : Small tests for deployment in Glasgow.
- cdf-testharness: Used by Matt Leslie for Test Harness
Step 2:
Document SAM usage
2.1 Use of SAM in AC++
Flow charts for the use of SAM in AC++ are given in this set of files.
2.2 Sam for Storing Files: Sam Upload
Sam Upload was presented to the SamDesign meeting. The transparencies
can be found here: ppt or pdf. There are detailed instructions
for using sam_upload.
For convenience, here is a short summary of sam_upload:
sam_upload is meant to let a user take a file on
a local node anywhere (for example a node of a farm such as
fnal's CAF or an EGEE farm, but definitely
not a sam station) and upload it to a
sam location while describing it in the sam db along with
its metadata, i.e. a friendly and easy-to-install wrapper
around "copy to a sam station + sam store".
The first part (copy a file from my computer to a
sam station local disk, using the best matching protocol
between what my computer can do and what the sam station can do)
is where sam_cp was hoped to provide a great wrapper
around many possible protocols. As Lauri points out,
deploying gridftp clients/servers everywhere may have
administrative drawbacks, in addition to currently not
being as easily installable or as well known to sys-admins
as rcp or scp. However, this is what is needed. This would be solved in
a JIM deployment.
2.3 SAM on CAF
SAM on CAF interfaces to AC++ through the setting of environment
variables.
The critical environment variables are:
SAM_USER_NAME
SAM_PROJECT
SAM_CONSUMER_ID
SAM_CONSUMER_PID
SAM_DATASET
(SAM_FILE_LIMIT is needed with offline v5_3_3 or earlier)
SAM_STATION
By setting SAM_PROJECT, CAF signals to AC++ that it need not start the
project.
By setting SAM_CONSUMER_ID and SAM_CONSUMER_PID, CAF may then control
the relationship of the caf segment number to the SAM_CONSUMER_PID.
If these are not set, AC++ handles starting these.
CAF must set SAM_USER_NAME so that, whether CAF or AC++ starts them,
all the segments join the same consumer process and project.
SAM_DATASET is set in the (G)UI for CAF submission.
A full description of how SAM is used on CAF may be found
here.
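The hand-off between CAF and AC++ described above can be sketched in Python. This is only an illustration of the decision logic, assuming the variables are read from the process environment; the function name sam_startup_plan and the keys of the returned dictionary are hypothetical and do not come from AC++ or CAF.

```python
import os


def sam_startup_plan(env=None):
    """Report what AC++ would still have to start, given what CAF set.

    Hypothetical helper: it only mirrors the rules stated in the text.
    """
    env = os.environ if env is None else env

    # The text says CAF must set SAM_USER_NAME so all segments join
    # the same consumer process and project.
    if not env.get("SAM_USER_NAME"):
        raise KeyError("SAM_USER_NAME must be set")

    # SAM_PROJECT set: CAF signals that AC++ need not start the project.
    project_set = bool(env.get("SAM_PROJECT"))

    # Both consumer variables set: CAF controls the consumer itself.
    consumer_set = bool(env.get("SAM_CONSUMER_ID")) and bool(
        env.get("SAM_CONSUMER_PID")
    )

    return {
        "start_project": not project_set,
        "start_consumer": not consumer_set,
    }
```

For example, passing a dictionary with only SAM_USER_NAME and SAM_PROJECT set reports that the project is already started but the consumer is not.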
2.4 SAM use on the Desktop
A user at FNAL may run sam on his desktop in the trailers or on fcdflnx2
in order to debug. By using the cdf-sam station, the user is able
to access the data with the dcap protocol.
Instructions to do this are
here.
The user is responsible for setting the SAM_STATION environment
variable. The dataset may be conveyed to AC++ using TCL or by
setting the SAM_DATASET environment variable.
2.5 SAM in Predator
The complete description of SAM and Predator is found in
CDF Note 6169.
Some further modifications since the publication of the note are
indicated here (relative to CDF note 6169):
- predator now runs hourly in "all files" mode.
- Enstore data is obtained by first trying tables built
by another program, which reads the Enstore Complete File Listing;
if that fails, predator uses the encp -q cdfen enstore commands.
- The Enstore Complete File Listing is read every two hours, as that seems
to be the interval at which it is generated.
- The files are now typed by their data type as recorded in
FILECATALOG.CDF2_DATASET_REGISTRIES if available, otherwise by
the first letter of the file names as mentioned in CDF6169.
- There have been minor changes for SAM schema changes: the use of
id fields instead of character strings directly encoded in tables
(e.g., parameter values and crc types) and the dropping of some
no longer used columns (e.g., DATA_FILES.KBYTE_FILE_SIZE, which
is represented now by FILE_SIZE_IN_BYTES).
- No attempt is made to maintain virtual dataset or fileset files
to represent files' dataset or fileset associations. These are
now only represented by cdf.dataset and cdf.fileset parameter values.
Both predator and scavenger use the (run<<16)|(0xFFFF&section) computation.
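As a worked illustration of that computation, the following Python sketch packs a run number and a section number into one integer. The function names are hypothetical, and the unpacking function is an assumed inverse that the text does not mention.

```python
def pack_runsection(run, section):
    """Combine run and section as (run << 16) | (0xFFFF & section)."""
    return (run << 16) | (0xFFFF & section)


def unpack_runsection(key):
    """Assumed inverse: recover (run, section) from a packed key."""
    return key >> 16, key & 0xFFFF
```

The mask keeps only the low 16 bits of the section, so sections range from 0 to 65535, which matches the section limit used by scavenger.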
2.6 SAM in Scavenger
Scavenger:
- Queries SAM data files and associated parameter values for files with
`cdf.runsections' file parameters.
- Loops over the result set from the query, with one row per file and
file parameter association.
- Tokenizes each file and file parameter association
on white space (as multiple fields are permitted and space separated).
- For each runsection specification (run/section or run/section:section),
standardizes the value to a run and a pair of low and high sections.
(The colon-less form is a single section that is normalized to a
low and high section of the same value. If the low section
is zero length, it is set to 0, and if the high section is zero
length, it is set to 65535.)
- If necessary, associates the runs with data files in the
DATA_FILES_RUNS table. Any extant associations are not removed.
- Merges the normalized run and section associations with any
existing lumblock associations for the file such that the number of
lumblock associations is minimized. (The ranges are combined with
any extant lumblock ranges to which they are adjacent or overlapping,
and if a gap is filled between extant ranges, the number of
associations (not the number of ranges that are associated) is reduced.)
- If the runsection file parameter is completely processed without
problems, removes it as an associated file parameter.
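The normalization and merging steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not the Scavenger code: the function names are hypothetical, ranges are held as plain (low, high) tuples for a single run, and no database access is shown.

```python
MAX_SECTION = 65535  # value used when the high section is empty


def parse_runsections(value):
    """Turn 'run/section' or 'run/low:high' tokens into (run, low, high)."""
    result = []
    for token in value.split():  # fields are separated by white space
        run_text, _, sections = token.partition("/")
        if ":" in sections:
            low_text, _, high_text = sections.partition(":")
        else:
            # Colon-less form: one section used as both low and high.
            low_text = high_text = sections
        low = int(low_text) if low_text else 0
        high = int(high_text) if high_text else MAX_SECTION
        result.append((int(run_text), low, high))
    return result


def merge_ranges(ranges):
    """Merge overlapping or adjacent (low, high) ranges of one run."""
    merged = []
    for low, high in sorted(ranges):
        if merged and low <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], high)
        else:
            merged.append([low, high])
    return [tuple(r) for r in merged]
```

Merging new ranges together with the extant ones in a single sorted pass is one simple way to keep the number of associations minimal; a range that fills the gap between two extant ranges collapses all three into one.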
2.7 SAM Datasets Offsite
A page of datasets pinned offsite is kept up to date by a process
running at Rutgers.
The procedure works as follows.
- For every requested station, an SQL query to the SAM database is
made. This returns a list of files located on the station,
accompanied by the "lock" and "tape" status on the station.
- The list of all cdf.dataset parameters for files
located on the station is obtained. This is another SQL query,
independent of the first one.
- Every dataset found in the second query is
translated into a list of files.
The list of local files is checked against the list of all contributing
files for every dataset, and statistics are obtained.
- For every dataset, a dataset description is requested.
Note that here we assume dataset duality: the description is queried from
the SAM dataset, but files are grouped by the CDF.DATASET parameter.
- The rest is formatting of the output, filtering out datasets with few
local files, etc.
The script is run once for all stations in one table with a higher
threshold, and then for every station separately with lower thresholds.
There is also a script displaying the status of all "rutgers" datasets. The
general philosophy is the same,
but all datasets matching the pattern ??8v2* are considered regardless of
whether they have local files.
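The comparison of local files against the contributing files of each dataset can be sketched as follows. This is an illustrative sketch only: the function name, the min_local threshold parameter, and the shape of the result are assumptions, and the two SQL queries are replaced by plain Python inputs.

```python
def dataset_coverage(local_files, dataset_files, min_local=1):
    """Return {dataset: (n_local, n_total, fraction)} for one station.

    local_files   -- files found on the station (first query)
    dataset_files -- {dataset name: all contributing files}
    min_local     -- hypothetical threshold; datasets with fewer
                     local files are filtered out of the report
    """
    local = set(local_files)
    report = {}
    for dataset, files in dataset_files.items():
        files = set(files)
        present = len(files & local)
        if present < min_local:
            continue  # too few local files to be worth listing
        fraction = present / len(files) if files else 0.0
        report[dataset] = (present, len(files), fraction)
    return report
```

Running it once with a high min_local over all stations and again per station with a lower value would correspond to the two passes described above.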
2.8 CDF Configuration
The cdf configuration
for SAM is described in detail on a separate web page.
2.9 DiskCacheManager Local Copy: Moving one file from SAM Cache to local
node
Implemented for CAF use by Stefano Belforte.
The worker nodes cannot access SAM cache.
Maintained by Fedor.
Specify two environment variables:
DCM_COPY_SCRATCH
The local scratch area.
DCM_COPY_COMMAND
The copy command to use.
DCM_COPY_COMMAND is, for example,
"rcp sam_station:%s %s" or "dccp %s %s"
The two %s are substituted by the DH (remote) file reference and the local copy.
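The substitution of the two %s placeholders can be illustrated with a short Python sketch. It is not the DiskCacheManager code: the function name is hypothetical, the example path is made up, and placing the local copy under the scratch area with the same base name as the remote file is an assumption.

```python
import posixpath
import shlex


def build_copy_command(template, remote_ref, scratch_dir):
    """Fill a DCM_COPY_COMMAND-style template such as 'dccp %s %s'.

    The first %s receives the remote file reference and the second
    the local copy, following the description in the text.
    """
    # Assumption: the local copy keeps the remote file's base name.
    local_copy = posixpath.join(scratch_dir, posixpath.basename(remote_ref))
    command = template % (remote_ref, local_copy)
    # Split into an argument list suitable for subprocess.run().
    return shlex.split(command), local_copy
```

For instance, build_copy_command("dccp %s %s", "/pnfs/cdf/file.root", "/scratch") yields the argument list for copying the (hypothetical) file into /scratch/file.root.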
Step 3:
Address the SAM team concerns uncovered by step 2 (e.g., make log
files web visible, establish development/integration environments,
etc.)
3.1 Log files
The CDF log files are visible.
3.2 Integration and Development Environments
cdfsamint and cdfsamdev are used for integration and development.
The cdfint process type on the CAF can be used for job submission that
uses the sam-int station.
Krzysztof Genser is coordinator for cdfsamdev and Stefan Stonjek
coordinates cdfsamint.
For cdfsamint the following sam qualifiers give the following action:
- int_new_dbserv: connects to port 9005 of cdfsam02. This is where the
dbserver connected to the cdf integration database is expected to run.
It has the dbserver name for the v6 dbserver: newdbsrv:SAMDbServer.
- int: connects to port 9005 of cdfsam02. This is where the
dbserver connected to the cdf integration database is expected to run.
It has the dbserver name SAMDbServer.user_int:SAMDbServer, a v5 dbserver.
- prd: connects to port 9010 of cdfsam01. This is where the dbserver
connected to the cdf production database is expected to run. It has
the dbserver name SAMDbServer.user_prd:SAMDbServer, a v5 dbserver.
This is used for tests of the integration sam station against the
production database so as not to interfere with the sam station
in production.
- cdfint-prddb: connects to port 9010 of cdfsam01. This is where the
dbserver connected to the cdf production database is expected to run.
It has the dbserver name for the v6 dbserver: newdbsrv:SAMDbServer.
This is used for tests of the integration sam station against the
production database so as not to interfere with the sam station
in production.
This uses dcache for its cache. Hence it is for DCAF/DCache
integration.
We have no integration for sam_cache, relying on Alan Sill to test in
Texas when we have integrated here.
For cdfsamdev the following sam qualifiers give the following action:
- dev: Development dbserver to development database
- station-prd: Production dbserver to production database
- prd: Production dbserver to production database: presumably suitable for sam
client setup. No idea why we need this, but Alan has this in his
current instructions.
sam_cp does not work.
This has a sam cache -- we do not have a dev (just now) with dcache.
Shall we have another station on this machine and choose to run one or
the other for testing?
Step 4:
Repeat SAM on CAF testing program with configuration environments
established in 1-3.
4.1 CAF Monitoring with AC++
Jobs are continuously submitted to CAF so that we can catch failures
first hand. These are AC++ jobs. Columns are:
- dataset
- CAF Name (CAF, CAF-Condor, etc)
- Submission time
- job start time
- job end time
- project name
- CID (Consumer ID)
- prfls (number of files in the project)
- sgmnt (number of segments for the project, i.e. number of consumer
processes)
- logs (number of log files returned: should be the same as the number
of segments)
- ecpid (number of consumer process ids in log files: should be
the same as the number of segments)
- rqsd (number of files requested from SAM)
- eos (number of End of Stream returns from a file request)
- dlvd (number of files delivered)
- opnd (number of files opened)
- clsd (number of files closed)
- rlsd (number of files released)
- opcl (number of files opened - number closed: should be 0)
- eoss (number of files requested - number of files delivered -
number with EOS: should be 0)
- dlop (number of files delivered - number opened: should be 0)
- dlrl (number of files opened - number released: should be 0)
- good (number of files analyzed ok: successfully closed)
- prec (predicted number of files to be recovered based on log files)
- torc (number of files to recover according to the SAM database via sam
generate recovery project)
- <--- is printed if any failure has occurred.
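The derived columns and the failure flag can be sketched in Python from the definitions above. The function names and the dictionary representation of a table row are hypothetical, and treating "any nonzero difference or any count mismatch" as the failure condition for the <--- flag is an assumption; the text does not say exactly which conditions the monitoring uses.

```python
def derived_columns(row):
    """Compute the difference columns from the raw counts in one row."""
    return {
        "opcl": row["opnd"] - row["clsd"],
        "eoss": row["rqsd"] - row["dlvd"] - row["eos"],
        "dlop": row["dlvd"] - row["opnd"],
        # As defined in the column list: opened minus released.
        "dlrl": row["opnd"] - row["rlsd"],
    }


def has_failure(row):
    """Assumed rule: flag the row if any consistency check fails."""
    counts_ok = (
        row["logs"] == row["sgmnt"] and row["ecpid"] == row["sgmnt"]
    )
    differences_ok = all(v == 0 for v in derived_columns(row).values())
    return not (counts_ok and differences_ok)
```

A row for which has_failure returns True would be the one marked with <--- in the table.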
4.2 DBServer Monitoring
The health of the DBServer is monitored by doing a sam locate foo every
15 minutes and examining the response time. Results are
shown here.
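A probe of this kind can be sketched in Python as a function that times an external command. This is a sketch under assumptions, not the monitoring script itself: the function name and the timeout value are hypothetical, and scheduling every 15 minutes (for example from cron) is left outside the code.

```python
import subprocess
import time


def time_command(argv, timeout=60.0):
    """Run argv and return (returncode or None, elapsed seconds).

    None is returned as the code if the command could not be run
    or did not finish within the (hypothetical) timeout.
    """
    start = time.monotonic()
    try:
        completed = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        code = completed.returncode
    except (subprocess.TimeoutExpired, OSError):
        code = None
    return code, time.monotonic() - start
```

The probe in the text would then correspond to time_command(["sam", "locate", "foo"]), with the elapsed time being the quantity that is examined.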
4.3 Integration Tests done Once per Bundle Release
Stress tests and memory tests are integration tests that need to be
repeated whenever the software has changed.
Step 5:
Write, Review, Approve phased deployment plan to CDF users, SAMGrid
Management, CDF software management.
This plan was presented to the SAM design meeting, and a
readiness plan was presented
to the CDF Data Handling Design meeting in May 2004.
The action plan for deployment of SAM on CAF has been maintained in
this web page. This has been copied into a new web page suitable for
updating the bundle of products tested in integration. This page
also contains the sam_products (station server code bundle running
on any station), the sam_client (station client code running on any
cdfsoft installation where user analysis or user
queries to sam take place), and the sam_dbserver. The major bundle
version can be listed, and any mods put down as a minor version
number with an explanation of the mod.
The deployment to users is:
- Give CPU preference to users on fbsng caf only
- Choose users from those requesting tapes plus volunteers
- Convert all of fbsng-caf to sam only
- Move to do the same on Condor-CAF
This assumes that testing of the deployment has finished and that
any changes follow the integration tests.
A1: Appendix 1: Comments and thoughts on Doc
Here are some comments and thoughts that are being incorporated into the
document.
A1.1 Art's comments (With Alan's mod)
CDF Drivers ( v6 with dropping of v5 implied throughout )
Sam On Caf in production
Sam On DCaf with pinning
SAM on DCAF without pinning.
(Implies much more robustness in retries, service failures, latency
reduction, plus existence of recovery projects and the other features
promised / implemented in v5.3.4+ of CDF software.)
Sam On Farm
Sam on Desktop - shares server with SOC
Sam Upload for data import
Autodest - to Enstore with correct file families
Autodest - to local Durable cache
Dcache writes - being done at Karlsruhe and D0, but with hacked python
Predator - will retire soon
Scavenger - sorry, I missed this, don't understand it
AC++ integration - need to move to GCC first, and understand upgrade issues
Monitoring - ( NextSamTV, dbserver monitoring, MonaLisa, etc )
need UTC time stamps in logs, for grid support
Hardware support
cdf-sam hardware support 24x7 - switch to new server, need to find/do
dbserver with failover - have cdfsam1/cdfsam2 for this, need SW test
failover
using cdfsam02 for both dev/int and failover
cdf-store - for central storage support, have HW fcdfdata053 , need to
install
cdf-cat serving INFN customers, can add more such as needed/provided
three test stations ( cdfsamdev/cdfdevint/cdfsamth )
web server ( fcdfsun1 -> cdfsamweb , have HW, need to deploy )
monitor server - cdfsammon, have HW, need to deploy , will replace SamTV on
151
cdf-farm station - will get from existing farm systems, prototyping
Deployment
Test v6 without v5 components ( just need new sam_dcache_cp today ? )
Redeploy in production to
cdf-sam
cdf-cat
cdf-fzkaa
cdf-cnaf
cdf-toronto
Then deploy to all the rest
Software projects - driven by the above ?
v6 DBserver memory usage - still with us, I'm afraid.
dbserver per-query memory/time limits (select * from data_files)
Autodest
Python 2.3 to get UTC times
Dcache write support
SRM for support/grid deployment/DCache scaling
SL 3.0.3+ migration
End Of Development Phase !!!
Rick St. Denis
Last modified: Thu Apr 15 18:51:36 CDT 2004