Our goal for this project is to create a programming package that will analyze the data access pattern for the D0 experiments, and provide this information to the public on an updated webpage. For further information on this project visit the fnal projects webpage at projects.fnal.gov. For this analysis we will use sql queries to retrieve information about the data files. Therefore, we would like to find the most expedient and least resource consuming query that outputs the project ids, the dates of the analysis projects, and the difference in the dates between Analysis Project 1 (AP1), and Analysis Project 2 (AP2). The project ids are constrained between the values APlo to APmid and APlo to APhi. We limit the project ids in order to view a smaller segment of the data files rather than the whole. In addition, the data type is defined as reconstructed. We have written 3 different queries that all implement the same task described above.
==========================================================================QUERY1
select distinct AP1.PROJECT_ID, AP2.PROJECT_ID, AP1.CREATE_DATE from analysis_projects AP1, analysis_projects AP2, project_files PF1, project_files PF2, data_files DF1, data_files DF2 where (PF1.proj_snap_id=AP1.proj_snap_id AND PF2.proj_snap_id=AP2.proj_snap_id AND PF1.FILE_ID = DF1.FILE_ID AND PF2.FILE_ID = DF2.FILE_ID) AND (DF1.data_tier = 'reconstructed' AND DF2.data_tier = 'reconstructed') AND (AP1.PROJECT_ID between 9000 and 9010 AND AP2.PROJECT_ID between 9000 and 23000 AND AP2.PROJECT_ID > AP1.PROJECT_ID);Query1 does not use any nested SELECT statements. The query utilizes joins to connect the tables: analysis_ project, project_files, and data_files.
==========================================================================QUERY2
select AP.project_id,Q1.project_id, AP.create_date from analysis_project AP, Project_files PF, data_files DF, ( select AP2.project_id,AP2.Create_date, PF2.File_id from analysis_project AP2, Project_File PF2, data_files DF2 where AP2.Project_id between 9000 and 23000 AND DF2.data_tier = 'reconstructed' AND PF2.Proj_snap_id = AP2.Proj_snap_id AND PF2.File_id=DF2.File_id ) Q1 where AP.project_id between 9000 and 9010 AND DF.data_tier ='reconstructed AND AP.proj_snap_id=PF.proj_snap_id AND PF.File_id=DF.File_id AND AP.project_id < Q1.project_id order by AP.project_id, Q1.project_id ;Query2 uses a nested SELECT statement within the FROM statement of the main SELECT statement. The SELECT statement acts as a table, which contains the columns project_id, create_date and file_id.
==========================================================================QUERY3
select AP1.project_id, AP2.project_id, AP1.create_date from analysis_projects AP1, analysis_projects AP2 where AP1.project_id in ( select analysis_projects.project_id from project_files, analysis_projects, data_files where (analysis_projects.proj_snap_id = project_files.proj_snap_id) AND (project_files.file_id = data_files.file_id) AND data_files.data_tier = 'reconstructed' AND analysis_projects.project_id between 9000 and 9010) AND AP2.project_id in ( select analysis_projects.project_id from project_files, analysis_projects, data_files where (analysis_projects.proj_snap_id = project_files.proj_snap_id) AND (project_files.file_id = data_files.file_id) AND data_files.data_tier = 'reconstructed' AND analysis_projects.project_id between 9000 and 23000) AND AP2.project_id > AP1.project_id;Query3 uses two nested SELECT statements in the WHERE statement. The SELECT statements in this case are used as control variables for AP1.project_id and AP2.project_id.
The following timetable gives the system time in seconds for Query1, Query2, and Query3 at different APlo, APmid and APhi values. APlo to APhi gives the whole range, and APlo to APmid gives us a range in which we can take the average over the quantities we look at APlo to APhi.
| APlo-APmid-APhi | Query1 | Query2 | Query3 |
|---|---|---|---|
| 19000-19010-20000 | 0.826s | 77.742s | 1.579s |
| 19000-19010-21000 | 0.932s | 174.883s | 0.910s |
| 19000-19010-22000 | 1.075s | 215.464s | 0.992s |
| 10000-10010-13000 | 1.126s | 43.244s | 1.085s |
| 10000-10010-14000 | 1.248s | 52.241s | 1.196s |
| 10000-10010-15000 | 1.392s | 70.827s | 1.451s |
| 10000-10010-16000 | 1.678s | 77.926s | 1.542s |
After analyzing the times, and performing the Oracle's explain statement for the 3 queries, it is clear that Query3 has the shortest running time; the server does not perform any cartesian joins, which generally increase the amount of resources to use. Therefore we will use Query3 to retrieve the dates, and the difference of the dates for the analysis projects.