Each SAM analysis project runs on a collection of files (a dataset). We are interested in understanding how many of these files are shared among analysis projects. The following is a study of the files of type "reconstructed". Please follow this link for the monthly statistics.
Given a random analysis project id (1000, 10000, ...), which runs on a dataset
of N1 files, I scan all the subsequent analysis projects (typically up to id
27000) for common files.
For each of them I consider the number of files N2 in the dataset they run on,
and the number NI of common files. I plot the average quantity NI/sqrt(N1*N2)
vs distance in time between analysis projects. The error bars are the RMS of
the distribution of NI/sqrt(N1*N2) within the time bin.
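
The procedure above can be sketched in a few lines of Python. This is only an illustration, not the code used to produce the plots: the function names, the 30-day bin width, and the toy data are assumptions, and the "RMS" of each bin is taken here to mean the standard deviation of the values in that bin.

```python
from collections import defaultdict
from math import sqrt


def overlap(files_a, files_b):
    """Return NI / sqrt(N1 * N2) for two collections of file names."""
    a, b = set(files_a), set(files_b)
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def binned_overlap(start, others, bin_days=30):
    """Bin the overlap with the starting project by distance in time.

    start  : (day, files) for the starting analysis project
    others : iterable of (day, files) for the later analysis projects
    Returns {bin_index: (mean, spread)}, where spread is the standard
    deviation of the overlap values in the bin (assumed meaning of "RMS").
    """
    start_day, start_files = start
    bins = defaultdict(list)
    for day, files in others:
        bins[(day - start_day) // bin_days].append(overlap(start_files, files))
    result = {}
    for index, values in sorted(bins.items()):
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        result[index] = (mean, sqrt(variance))
    return result


# Toy data: day numbers and file names are invented for illustration.
start = (0, ["f1", "f2", "f3", "f4"])
others = [
    (3, ["f1", "f2", "f3", "f4"]),   # identical dataset: overlap 1.0
    (10, ["f1", "f2", "f5", "f6"]),  # 2 common files, N1 = N2 = 4: overlap 0.5
    (45, ["f7", "f8"]),              # no common files: overlap 0.0
]
profile = binned_overlap(start, others)
```

The quantity NI/sqrt(N1*N2) equals 1 when the two datasets are identical and 0 when they have no file in common, so it can be compared across datasets of very different sizes.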
See
plot10000-27000_reco_loc10000.ps
and
plot10010-27000_reco_loc10010.ps
for an example.
In general I consider the average contribution of 10 or 100 consecutive
starting analysis projects.
See
plot10000-27000_reco_loc10010.ps
and
plot10000-27000_reco_loc10100.ps
for an example.
Typically up to 50% of the files are shared among pairs of analysis projects
for a long period of time (months). See for example again
plot10000-27000_reco_loc10010.ps
and
plot17000-27000_reco_loc17010.ps
There are occasional situations where files are analyzed only for a short period
of time and never looked at again (see
plot1000-27000_reco_loc1010.ps).
The full list of plots:
plot1000-27000_reco_loc1010.ps
plot1000-27000_reco_loc1100.ps
plot5000-27000_reco_loc5010.ps
plot8000-27000_reco_loc8010.ps
plot10000-27000_reco_loc10000.ps
plot10000-27000_reco_loc10010.ps
plot10000-27000_reco_loc10100.ps
plot10010-27000_reco_loc10010.ps
plot12000-27000_reco_loc12010.ps
plot12000-27000_reco_loc12100.ps
plot17000-27000_reco_loc17010.ps
plot17000-27000_reco_loc17100.ps
plot20000-27000_reco_loc20010.ps
plot25000-27000_reco_loc25010.ps
The time frames studied are
AP_id   date
------------------
1000    14-DEC-99
1010    14-DEC-99
1100    18-DEC-99
5000    29-MAR-00
5010    29-MAR-00
8000    18-MAY-00
8010    18-MAY-00
10000   26-JUN-00
10010   26-JUN-00
10100   27-JUN-00
12000   02-AUG-00
12010   02-AUG-00
17000   16-NOV-00
17010   17-NOV-00
17100   18-NOV-00
20000   15-JAN-01
20010   15-JAN-01
25000   24-APR-01
25010   24-APR-01
27000   26-MAY-01

and the average size (KB) over the interval of analysis projects considered is
AP1_id  AP2_id  ave_size(KB)
------------------------------
1000    1010    262,019
1000    1100    58,512
5000    5010    1,921,610
8000    8010    5,525,950
10000   -       11,804
10000   10010   371,521
10000   10100   1,322,670
10010   -       1,481,147
12000   12010   3,515,264
17000   17010   171,312,000  (AP 17007 run over 9067 files)
17000   17100   24,035,500
20000   20010   7,857,550
25000   25010   292,171,000  (AP 25002/04/07/10 run over 4367 files)

The following plots are also of interest. In this histogram, the x axis is the number of times a file appears in an analysis project; the histogram reports the frequency of this variable over all the files. The top plot covers the whole range (a test file that appears 9846 times has been excluded here). The bottom plot is a zoom of the top one.

From this plot, it is possible to calculate the percentage of cache hits vs
cache size. The x axis is oriented starting from the most used files
(at the right of the plot above) and going toward the least used.
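
As a rough illustration of how such a curve can be derived from the per-file usage counts, the sketch below ranks the files from most used to least used and accumulates their share of all accesses. It assumes an idealized cache that is pre-loaded with the k most used files and whose size is measured in number of files; the function name and the counts are invented for the example.

```python
def hit_fraction_by_cache_size(access_counts):
    """Cumulative fraction of accesses served by caching the most used files.

    access_counts : number of times each file appears in an analysis project
    Returns a list whose element k-1 is the hit fraction for a cache that
    holds the k most used files (idealized, pre-loaded cache).
    """
    ranked = sorted(access_counts, reverse=True)  # most used files first
    total = sum(ranked)
    if total == 0:
        return [0.0] * len(ranked)
    curve, served = [], 0
    for count in ranked:
        served += count
        curve.append(served / total)
    return curve


counts = [5, 50, 10, 30, 5]  # invented usage counts for five files
curve = hit_fraction_by_cache_size(counts)
```

In this toy case the single most used file already accounts for half of all accesses. To express the cache size in bytes rather than in files, the same accumulation could be carried out over the file sizes in the same ranked order.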

The plot below shows the same "integral" information for the file size vs the number of times a file appears in an AP. The top plot shows the data size as the sum of the sizes of the individual files; the bottom plot shows the information processed by APs as the size of each file times the number of times it appears in an AP. As of June 4, 2001, the total size of the files considered for analysis is of the order of 10 TB (top plot), while the total information processed is of the order of 200 TB (bottom plot).
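
The two totals can be illustrated with a minimal sketch. The function name and the (size, usage count) pairs are invented; only the two definitions, the sum of the sizes and the sum of size times usage count, come from the text.

```python
def size_totals(files):
    """Return (stored_kb, processed_kb) for (size_kb, times_used) pairs.

    stored_kb    : sum of the file sizes (the quantity in the top plot)
    processed_kb : sum of size times number of appearances in an AP
                   (the quantity in the bottom plot)
    """
    files = list(files)  # allow any iterable, since it is traversed twice
    stored = sum(size for size, _ in files)
    processed = sum(size * used for size, used in files)
    return stored, processed


# Invented example: three files with sizes in KB and usage counts.
files = [(1000, 1), (2000, 10), (500, 40)]
stored, processed = size_totals(files)
reuse = processed / stored  # average number of times each KB is read
```

With the totals quoted above, the same ratio is of the order of 200 TB / 10 TB = 20, i.e. on average each byte considered for analysis is processed about 20 times.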
