Bringing SAM Deployment of v6 Back Under Control


Step 1: Define what CDF needs from SAM, on which systems, and why

1.1 Drivers

This translates into the following prioritized list of goals: These should be formed into projects that the SAM management guides. The SAM on dCAF and SAM on CAF are already described on this web page.


1.2 Systems

  • fbsng CAF at FNAL: 100 TB of disk on dCache with SAM traffic shaping in the CAF, 1200 CPUs.
  • Condor CAF at FNAL: same 100 TB of disk as for fbsng, same requirements, 2000 CPUs.
  • Together, fbsng and Condor CAF have 5000 segment slots. Typically there are 50 jobs running, with segments starting in groups of 5 (Condor) or 10 (fbsng).
  • Import fileservers and stations: Used to import files for concatenation:
  • Offsite SAM stations
  • Support for on the order of 20 offsite stations. The CPU and disk resources are listed here. The CDF datasets that are pinned at these locations are listed here. All stations have a private network. Some have an NFS-mounted SAM cache; some have to use Fedor's copy code.

    Executive summary of configurations:

    CNAF and TTU use SAN storage systems (Linux head nodes using Red Hat Enterprise Linux in each case). The only site I am sure has tape and dCache is Karlsruhe. San Diego has a successful deployment of resilient dCache, with disks distributed over several nodes in their cluster.

    Most others use NFS-mounted RAID-5 disk storage. Off-site storage is one of the weakest aspects of our DCAF deployments so far, as you know, with typically only a few TB of disk storage available at each site.

    Details on individual sites:

  • Test stations and setups

    There are a variety of test stations and setups. The most notable are



    Step 2: Document SAM usage


    2.1 Use of SAM in AC++

    Flow charts for the use of SAM in AC++ are given in this set of files.


    2.2 Sam for Storing Files: Sam Upload

    Sam Upload was presented to the SamDesign meeting. The transparencies can be found here: ppt or pdf. There are detailed instructions for using sam_upload. For convenience, here is a short summary of sam_upload:
    sam_upload is intended to let a user take a file on
    a local node anywhere (for example a node of a farm that
    could be fnal's CAF or an EGEE farm, but definitely
    not a sam station) and upload it to a
    sam location while describing it in the sam db along with
    its metadata, i.e. a friendly and easy-to-install wrapper
    around "copy to a sam station + sam store".
    The first part (copying a file from my computer to a
    sam station's local disk, using the best matching protocol
    between what my computer can do and what the sam station can do)
    is where sam_cp was hoped to provide a wrapper
    around many possible protocols. Lauri points out that
    deploying gridftp clients/servers everywhere may have
    administrative drawbacks, in addition to gridftp currently not
    being as easily installable or as well known to sys-admins
    as rcp or scp.  However, this is what is needed. This would be solved in
    a JIM deployment.
    
    


    2.3 SAM on CAF

    SAM on CAF interfaces to AC++ through the setting of environment variables. The critical environment variables are:
    SAM_USER_NAME
    SAM_PROJECT
    SAM_CONSUMER_ID
    SAM_CONSUMER_PID
    SAM_DATASET
    (SAM_FILE_LIMIT is needed with offline v5_3_3  or earlier)
    SAM_STATION
    

    By setting SAM_PROJECT, CAF signals to AC++ that it need not start the project.

    By setting SAM_CONSUMER_ID and SAM_CONSUMER_PID, CAF may then control the relationship of the CAF segment number to the SAM_CONSUMER_PID. If these are not set, AC++ handles starting these.

    CAF must set SAM_USER_NAME so that, whether CAF or AC++ starts them, all the segments join the same consumer process and project.

    SAM_DATASET is set in the (G)UI for CAF Submission.

    A full description of how SAM is used on CAF may be found here.
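
    As an illustration of the variables listed above, here is a minimal Python sketch of how a submission wrapper might assemble the environment for one segment. The function names, the split into required and optional variables, and all values are assumptions made for this example; they are not taken from the CAF code.

```python
import os

# Variable names come from the list above. Treating these three as mandatory
# is an assumption made for this sketch.
REQUIRED = ("SAM_USER_NAME", "SAM_STATION", "SAM_DATASET")
# When these are unset, AC++ is described as handling the project/consumer itself.
OPTIONAL = ("SAM_PROJECT", "SAM_CONSUMER_ID", "SAM_CONSUMER_PID")


def build_segment_env(settings, base_env=None):
    """Return a copy of base_env with the SAM variables from settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    missing = [name for name in REQUIRED if not settings.get(name)]
    if missing:
        raise ValueError("missing SAM settings: " + ", ".join(missing))
    for name in REQUIRED + OPTIONAL:
        if settings.get(name):
            env[name] = str(settings[name])
    return env


def caf_starts_project(env):
    """True when SAM_PROJECT is set, i.e. AC++ need not start the project."""
    return bool(env.get("SAM_PROJECT"))
```

    The returned dictionary could be passed as the env argument when launching the job process; leaving SAM_PROJECT out corresponds to the case where AC++ starts the project.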


    2.4 SAM use on the Desktop

    A user at FNAL may run sam on his desktop in the trailers or on fcdflnx2 in order to debug. By using the cdf-sam station, the user is able to access the data with the dcap protocol.

    Instructions to do this are here.

    The user is responsible for setting the SAM_STATION environment variable. The dataset may be conveyed to AC++ using TCL or by setting the SAM_DATASET environment variable.


    2.5 SAM in Predator

    The complete description of SAM and Predator is found in CDF Note 6169. Some further modifications since the publication of the note are indicated here:
    Looking at CDF note 6169:
    
    predator now runs hourly in "all files" mode.
    
    Enstore data is availed by first trying to avail it from tables built
    by another program which read the Enstore Complete File Listing and
    if that fails, then uses the encp -q cdfen enstore commands.
    
    The Enstore Complete File Listing is read every two hours as that seems
    to be the interval at which it is generated.
    
    The files are now typed by their data type as recorded in the
    FILECATALOG.CDF2_DATASET_REGISTRIES if available, otherwise by
    the first letter of the file names as mentioned in CDF6169.
    
    There have been minor changes for SAM schema changes: the use of
    id fields instead of character strings directly encoded in tables
    (e.g., parameter values and crc types) and the dropping of some
    no longer used columns (e.g., DATA_FILES.KBYTE_FILE_SIZE, which
    is represented now by FILE_SIZE_IN_BYTES).
    
    No attempt is made to maintain virtual dataset or fileset files
    to represent files' dataset or fileset associations.  These are
    now only represented by cdf.dataset and cdf.fileset parameter values.
    
    Both predator and scavenger use the (run<<16)|(0xFFFF&section) computation.
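
    Reading the run/section expression above as (run << 16) | (0xFFFF & section), the packing can be sketched in Python as follows. The function names are hypothetical, and the inverse function is added only for illustration; the note does not say that predator or scavenger unpack the value.

```python
SECTION_BITS = 16
SECTION_MASK = 0xFFFF  # sections occupy the low 16 bits, so 0..65535


def pack_run_section(run, section):
    """Combine a run number and a section number into one integer key."""
    return (run << SECTION_BITS) | (SECTION_MASK & section)


def unpack_run_section(key):
    """Split a packed key back into (run, section)."""
    return key >> SECTION_BITS, key & SECTION_MASK
```

    Because the section is masked to 16 bits, any section value above 65535 wraps around rather than spilling into the run number.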
    
    


    2.6 SAM in Scavenger

    Scavenger ==>
      Queries SAM data files and associated parameter values for files with
      `cdf.runsections' file parameters.
      Loops over the result set from the query with one row per file and
      file parameter association
        Loops over each file and file parameter association tokenizing it
        on white space (as multiple fields are permitted and space separated).
          For each runsection specification (run/section or run/section:section),
          standardizes the value to a run and pair of low and high sections
          (the colon-less form is a single section that is normalized to a
          low and high section of the same value and then if the low section
          is zero length, then it is set 0 and if the high section is zero
          length, then it is set to 65535)
          If necessary, then the runs are associated with data files in the
          DATA_FILES_RUNS table.  Any extant associations are not removed.
          The normalized run and section associations are merged with any
          existing lumblock associations for the file such that the number of
          lumblock associations is minimized. (The ranges are combined with
          any extant lumblock ranges to which they are adjacent or overlapping
          and if a gap is filled between extant ranges, then the number of
          associations (not the number of ranges that are associated) is reduced.)
        If the runsection file parameter is completely processed without
        problems, then it is removed as an associated file parameter.
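
    The normalization and merging steps above can be sketched in Python as follows. This is only an illustration of the rules as described: the function names are hypothetical, the database queries and updates are left out, and grouping the ranges per run before merging is an assumption.

```python
MAX_SECTION = 65535


def parse_runsection(spec):
    """Turn 'run/section' or 'run/low:high' into (run, low, high)."""
    run_text, _, section_text = spec.partition("/")
    if ":" in section_text:
        low_text, _, high_text = section_text.partition(":")
    else:
        # Colon-less form: a single section used as both low and high.
        low_text = high_text = section_text
    low = int(low_text) if low_text else 0             # empty low section -> 0
    high = int(high_text) if high_text else MAX_SECTION  # empty high -> 65535
    return int(run_text), low, high


def merge_ranges(ranges):
    """Merge overlapping or adjacent (low, high) ranges into the fewest ranges."""
    merged = []
    for low, high in sorted(ranges):
        if merged and low <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], high))
        else:
            merged.append((low, high))
    return merged


def normalize_parameter(value, existing=None):
    """Tokenize a parameter value on white space and merge ranges per run.

    existing maps run -> list of (low, high) ranges already associated.
    """
    by_run = {run: list(ranges) for run, ranges in (existing or {}).items()}
    for spec in value.split():
        run, low, high = parse_runsection(spec)
        by_run.setdefault(run, []).append((low, high))
    return {run: merge_ranges(ranges) for run, ranges in by_run.items()}
```

    For example, a new range that fills the gap between two existing ranges collapses all three into one, which is the reduction in associations described above.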
    


    2.7 SAM Datasets Offsite

    A page of datasets pinned offsite is kept up to date by a process running at Rutgers.

    The procedure works as follows.

    The script is run once for all stations together (one table, with a higher threshold) and then separately for every station with lower thresholds. There is also a script that displays the status of all "rutgers" datasets. The general philosophy is the same, but all datasets matching the pattern ??8v2* are considered regardless of whether they have local files.


    2.8 CDF Configuration

    The cdf configuration for SAM is described in detail on a separate web page.


    2.9 DiskCacheManager Local Copy: Moving one file from SAM Cache to local node

    
      Implemented for CAF use by Stefano Belforte
      The worker nodes cannot access SAM cache
      Maintained by Fedor
    
      Specify two environment variables
      DCM_COPY_SCRATCH
         The local scratch area.
      DCM_COPY_COMMAND
         The copy command to use
    
      DCM_COPY_COMMAND is, for example,
        "rcp sam_station:%s %s" or "dccp %s %s"
    
     The %s are replaced by the DH (remote) file reference and the local copy.
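
    A minimal Python sketch of the substitution, assuming the first %s receives the remote file reference and the second the local copy, and that the local copy is placed under DCM_COPY_SCRATCH using the file's base name. The function names and the example paths are hypothetical; the actual DiskCacheManager code may differ.

```python
import posixpath
import shlex


def local_copy_path(remote_ref, scratch):
    """Place the local copy in the scratch area under the remote file's base name."""
    return posixpath.join(scratch, posixpath.basename(remote_ref))


def build_copy_command(template, remote_ref, local_copy):
    """Fill the two %s placeholders and return an argument list.

    Quoting each value before substitution keeps paths that contain spaces
    together as single arguments.
    """
    filled = template % (shlex.quote(remote_ref), shlex.quote(local_copy))
    return shlex.split(filled)
```

    A caller would read the template and scratch area from DCM_COPY_COMMAND and DCM_COPY_SCRATCH in the environment and hand the resulting list to a process launcher.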
    



    Step 3: Address the SAM team concerns uncovered by step 2 (e.g., make log files web visible, establish development/integration environments)


    3.1 Log files

    The CDF log files are visible.


    3.2 Integration and Development Environments

    cdfsamint and cdfsamdev are used for integration and development. The cdfint process type on the CAF can be used for job submission that uses the sam-int station.

    Krzysztof Genser is coordinator for cdfsamdev and Stefan Stonjek coordinates cdfsamint.

    For cdfsamint the following sam qualifiers give the following action:

    For cdfsamdev the following sam qualifiers give the following action: sam_cp does not work. This has a sam cache -- we do not have a dev (just now) with dcache. Shall we have another station on this machine and choose to run one or the other for testing?



    Step 4: Repeat SAM on CAF testing program with configuration environments established in 1-3.


    4.1 CAF Monitoring with AC++

    Jobs are continuously submitted to CAF so that we can catch failures first hand. These are AC++ jobs. Columns are:


    4.2 DBServer Monitoring

    The health of the DBServer is monitored by doing a sam locate foo every 15 minutes and examining the response time. Results are shown here.
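
    A response-time probe of this kind can be sketched in Python as follows. The command to run is passed in by the caller (for the check above it would be the sam locate foo command); the function names, the timeout, and the 10-second "slow" threshold are assumptions for illustration, not values from the actual monitor.

```python
import subprocess
import time


def timed_run(command, timeout_s=60.0):
    """Run command once and return (elapsed_seconds, succeeded)."""
    start = time.monotonic()
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        ok = result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        # A timeout or a missing executable both count as a failed probe.
        ok = False
    return time.monotonic() - start, ok


def classify(elapsed_s, ok, slow_after_s=10.0):
    """Reduce one probe result to a coarse health label."""
    if not ok:
        return "failed"
    return "slow" if elapsed_s > slow_after_s else "healthy"
```

    Scheduling the probe every 15 minutes is left to an external scheduler such as cron, which would call timed_run(["sam", "locate", "foo"]) and record the result.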


    4.3 Integration Tests done Once per Bundle Release

    Stress tests and memory tests are integration tests that need to be repeated whenever the software changes.



    Step 5: Write, Review, Approve phased deployment plan to CDF users, SAMGrid Management, CDF software management.

    This plan was presented to the SAM design meeting and a readiness plan was presented to the CDF Data Handling Design meeting in May 2004.

    The action plan for deployment of SAM on CAF has been maintained in this web page. This has been copied into a new web page suitable for updating the bundle of products tested in integration. This page also covers sam_products (the station server code bundle running on any station), sam_client (the station client code running on any cdfsoft installation where user analysis or user queries to SAM take place), and the sam_dbserver. The major bundle version can be listed, with any modifications recorded as a minor version number along with an explanation of the change.

    The deployment to users is

    This assumes the deployment has been finished in its testing and that any changes follow the integration tests.



    A1: Appendix: Comments and Thoughts

    Here are some comments and thoughts that are being incorporated into the document.


    A1.1 Art's comments (With Alan's mod)

    CDF Drivers ( v6 with dropping of v5 implied throughout )
    
        Sam On Caf in production
        Sam On DCaf with pinning
        SAM on DCAF without pinning.
        (Implies much more robustness in retries, service failures, latency
        reduction, plus existence of recovery projects and the other features
        promised / implemented in v5.3.4+ of CDF software.)
        Sam On Farm
        Sam on Desktop - shares server with SOC
        Sam Upload for data import
        Autodest - to Enstore with correct file families
        Autodest - to local Durable cache
        Dcache writes - being done at Karlsruhe and D0, but with hacked python
        Predator - will retire soon
        Scavenger - sorry, I missed this, don't understand it
        AC++ integration - need to move to GCC first, and understand upgrade issues
        Monitoring - ( NextSamTV, dbserver monitoring, MonaLisa, etc )
            need UTC time stamps in logs, for grid support
    
    Hardware support
        cdf-sam hardware support 24x7 - switch to new server, need to find/do
        dbserver with failover - have cdfsam1/cdfsam2 for this, need SW test failover
             using cdfsam02 for both dev/int and failover
        cdf-store - for central storage support, have HW fcdfdata053 , need to install
          cdf-cat serving INFN customers, can add more such as needed/provided
        three test stations ( cdfsamdev/cdfdevint/cdfsamth )
        web server ( fcdfsun1 -> cdfsamweb , have HW, need to deploy )
        monitor server - cdfsammon, have HW, need to deploy , will replace SamTV on 151
        cdf-farm station - will get from existing farm systems, prototyping
    
    Deployment
        Test v6 without v5 components ( just need new sam_dcache_cp today ? )
    
        Redeploy in production to
           cdf-sam
           cdf-cat
           cdf-fzkaa
           cdf-cnaf
           cdf-toronto
    
        Then deploy to all the rest
    
    Software projects - driven by the above ?
    
        v6 DBserver memory usage - still with us, I'm afraid.
    
        dbserver per-query memory/time limits (select * from data_files)
    
        Autodest
    
        Python 2.3 to get UTC times
    
        Dcache write support
    
        SRM for support/grid deployment/DCache scaling
    
        SL 3.0.3+ migration
    
        End Of Development Phase !!!
    
    


    Rick St. Denis
    Last modified: Thu Apr 15 18:51:36 CDT 2004