RunJob Stakeholders Meeting, 2004-August-6

Attending: Liz, Gustaaf, Gavin, Suen, Amber, Ian, Ruth, Stefano, Wyatt, Rick

Greg: RunJob is, for the most part, keeping up with the schedule proposed in February. Items that have been completed on schedule (or will be completed on schedule soon) include the ScriptObjects API for building containers for compute jobs, the FileMetaBroker architecture for transporting ScriptObjects and their components, CMS integration with the RunJob core code, and XML representation of internal components. The context mechanism for defining logical sets of parameters was also successfully prototyped in CMS MCRunjob and ported back into the RunJob common code. We have also implemented code quality measures, and now require that all code conform to written standards enforced by a PyLint scanner at release time.

We have slipped on DZero integration, though it is proceeding apace (Peter Love); the slip is mainly for reasons at the discretion of DZero. We have also slipped on an initial package for CDF, again for reasons at the discretion of CDF. Internally, we have slipped on redesigning the mechanisms for declaring dependencies among applications and services; the current mechanisms are considered difficult to use. We have also slipped on redefining the syntax of the commands users use to communicate with the system, which is also considered difficult to use. (This was presented in too much technical detail in Greg's slides; this is more or less what it means.)

DZero: A concern was raised that care must be taken in defining what is common code and what is experiment-specific code. For example, experiment-specific batch adaptors should live outside of the RunJob common code. SAMGrid would be responsible for submitting to its own batch adaptors.

Job Monitoring on the Grid: Is the "ShREEK" component of RunJob useful? Some background: ShREEK (ShahKar Runtime Execution Environment Kit) evolved as a DZero extension during the "ShahKar" RunJob Pilot Project phase.
It preceded the RunJob project itself. It is now a logically distinct component of RunJob which uses threads to manage execution of a set of "tasks" and to monitor the tasks using XML-RPC. Some primitive control points can also be introduced. In CMS, this overlaps with Boss, which monitors the execution of a job by filtering its stdout/stderr and storing the information in a MySQL database. CMS is therefore not supporting the ShREEK component. DZero, however, does use ShREEK and is "morally" supporting it.

We will need regular phone conferences between Dugan and Greg. Moving RunJob V7 into production at DZero is not an issue for the RTE but is for JIM/SAMGrid. There is interest in long-term maintenance for when support resources are reduced. Peter Love will be on the project for about 3 years.

CDF: CDF has several different areas where RunJob code could be used. This discussion focuses on the farms. CDF requirements here call for additional development and support for ShREEK. Checkpoint/restart of the jobs might need more states in the SAM status for the files. Moving from the old farm to the SAM farm is one of the top priorities for the fall. Scope the work by taking Eilliot's CAF scripts and seeing what it takes to make the configurators. Monitoring in the present farms is too heavy in terms of communication and fault handling. The CDF farms do not currently have monitoring and recovery; if this is available in the RunJob version, it would be additional functionality.

Greg thinks a pilot integration project would be good. 1.5 FTEs are available to work on the pilot in a week or two. At the moment the Taiwan effort is involved with a SAM integration. Engineer and possibly recruit an engineer for the short term. Start using durable cache on the farm SAM station in a couple of weeks. Greg on vacation on 29th August. Rick will get together with Greg. CDF stakeholders and RunJob meeting in 2 weeks.

CMS: Most priorities are being met with RunJob. New MC processing service in the next 6 weeks.
Continued support for the CMS production environment and interface with LCG. CMS adoption of RunJob is contingent on good support and integration with the LCG. A longer-term issue is the use of RunJob to submit analysis jobs on behalf of the user.

Overlap with other stakeholders: interface to LCG; use for analysis job submissions.

1 FTE integration effort; 25% FTE to core development. If necessary, CMS has agreed to contribute an additional 25% FTE to core development for a defined length of time.