Study of the number of files shared among pairs of analysis projects in sam

Gabriele Garzoglio with the help of Igor Terekhov, June 5 2001

Each sam analysis project runs on a collection of files (a dataset). We are interested in understanding how many of these files are shared among analysis projects. The following is a study of the files of type "reconstructed". Please follow this link for the monthly statistics.

Given a random analysis project id (1000, 10000, ...), which runs on a dataset of N1 files, I look at all the following analysis projects (typically up to 27000), looking for common files. For each of them I consider the number of files N2 in the dataset it runs on, and the number NI of files in common. I plot the average of the quantity NI/sqrt(N1*N2) vs the distance in time between the analysis projects. The error bars are the RMS of the distribution of NI/sqrt(N1*N2) within the time bin. See plot10000-27000_reco_loc10000.ps and plot10010-27000_reco_loc10010.ps for examples.
In general I consider the average contribution of 10 or 100 consecutive starting analysis projects. See plot10000-27000_reco_loc10010.ps and plot10000-27000_reco_loc10100.ps for examples.
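
The quantity described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the study: the function names, the representation of a dataset as a set of file names, the time expressed in days, and the bin width are all assumptions. The RMS is taken here as the spread of the values about their mean within the bin, which is also an assumption about how the error bars were computed.

```python
import math
from collections import defaultdict


def overlap(files1, files2):
    """Return NI / sqrt(N1 * N2) for two datasets given as sets of file names."""
    n1, n2 = len(files1), len(files2)
    if n1 == 0 or n2 == 0:
        return 0.0
    ni = len(files1 & files2)  # number of files in common
    return ni / math.sqrt(n1 * n2)


def binned_overlap(start, later, bin_days=7):
    """Average the overlap with later projects in bins of time distance.

    start    : (day, set_of_files) for the starting analysis project
    later    : list of (day, set_of_files) for the following projects
    bin_days : hypothetical bin width in days
    Returns {bin_index: (mean, rms)}.
    """
    start_day, start_files = start
    bins = defaultdict(list)
    for day, files in later:
        index = int((day - start_day) // bin_days)
        bins[index].append(overlap(start_files, files))

    result = {}
    for index, values in sorted(bins.items()):
        mean = sum(values) / len(values)
        rms = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        result[index] = (mean, rms)
    return result
```

With this normalization two identical datasets give 1 and two disjoint datasets give 0, whatever their sizes. Averaging over 10 or 100 starting projects would amount to filling the same bins from several values of `start`.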

Typically up to 50% of the files are shared among pairs of analysis projects for a long period of time (months). See for example again plot10000-27000_reco_loc10010.ps and plot17000-27000_reco_loc17010.ps. There are occasional situations where files are only analyzed for a short period of time and never looked at again (see plot1000-27000_reco_loc1010.ps).

plot1000-27000_reco_loc1010.ps
plot1000-27000_reco_loc1100.ps
plot5000-27000_reco_loc5010.ps
plot8000-27000_reco_loc8010.ps
plot10000-27000_reco_loc10000.ps
plot10000-27000_reco_loc10010.ps
plot10000-27000_reco_loc10100.ps
plot10010-27000_reco_loc10010.ps
plot12000-27000_reco_loc12010.ps
plot12000-27000_reco_loc12100.ps
plot17000-27000_reco_loc17010.ps
plot17000-27000_reco_loc17100.ps
plot20000-27000_reco_loc20010.ps
plot25000-27000_reco_loc25010.ps

The time frames studied are

 AP_id      date
------------------
  1000   14-DEC-99
  1010   14-DEC-99
  1100   18-DEC-99
  5000   29-MAR-00
  5010   29-MAR-00
  8000   18-MAY-00
  8010   18-MAY-00
 10000   26-JUN-00
 10010   26-JUN-00
 10100   27-JUN-00
 12000   02-AUG-00
 12010   02-AUG-00
 17000   16-NOV-00
 17010   17-NOV-00
 17100   18-NOV-00
 20000   15-JAN-01
 20010   15-JAN-01
 25000   24-APR-01
 25010   24-APR-01
 27000   26-MAY-01

and the average size (KB) over the interval of analysis projects considered is

 AP1_id  AP2_id  ave_size(KB)
------------------------------
  1000    1010       262,019
  1000    1100        58,512
  5000    5010     1,921,610
  8000    8010     5,525,950
 10000     -          11,804
 10000   10010       371,521
 10000   10100     1,322,670
 10010     -       1,481,147
 12000   12010     3,515,264
 17000   17010   171,312,000  (AP 17007 ran over 9067 files)
 17000   17100    24,035,500
 20000   20010     7,857,550
 25000   25010   292,171,000  (APs 25002/04/07/10 ran over 4367 files)

The following plots are also of interest. In this histogram, the x axis is the number of times a file appears in an analysis project; the histogram reports the frequency of this variable considering all the files. The top plot reports the whole range (a test file which appears 9846 times has been neglected here). The bottom plot is a zoom of the top one.


(ps version)

From this plot, it is possible to calculate the percentage of cache hits vs cache size. The x axis is oriented starting from the most used files (at the right of the plot above) and going toward the least used.



(ps version)
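
The calculation of cache hits vs cache size can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to produce the plot: the function name is hypothetical, the cache is idealized as permanently holding the most used files, each appearance of a file in an analysis project counts as one access, and the cache size is measured in number of files rather than bytes.

```python
def cache_hit_curve(access_counts):
    """Percentage of cache hits as a function of the number of cached files.

    access_counts : list with, for each file, the number of times it
                    appears in an analysis project
    Returns a list of (n_cached_files, hit_percentage), where files are
    added to the cache from the most used to the least used.
    """
    counts = sorted(access_counts, reverse=True)
    total = sum(counts)
    curve = []
    covered = 0  # accesses served by the files cached so far
    for n_cached, count in enumerate(counts, start=1):
        covered += count
        percentage = 100.0 * covered / total if total else 0.0
        curve.append((n_cached, percentage))
    return curve
```

For three files used 1, 5 and 4 times, caching only the most used file would serve 50% of the accesses, and caching the two most used would serve 90%.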

The plot below shows the same "integral" information for the file size vs the number of times a file appears in an AP. The top plot shows the data size as the sum of the sizes of the single files; the bottom plot shows the information processed by APs as the size of each file times the number of times it appears in an AP. As of June 4 2001, the total size of the files considered for analysis is of the order of 10 TB (top plot), while the total information processed is of the order of 200 TB (bottom plot).


(ps version)
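
The two totals quoted above differ only in how each file is weighted. A minimal sketch, assuming each file is described by a hypothetical (size in KB, number of appearances in an AP) pair:

```python
def size_totals(files):
    """Return (data_size_kb, processed_kb) for a list of files.

    files : list of (size_kb, times_used) pairs, where times_used is the
            number of times the file appears in an analysis project
    data_size_kb : each file counted once (top plot)
    processed_kb : each file counted once per appearance (bottom plot)
    """
    data_size_kb = sum(size for size, _ in files)
    processed_kb = sum(size * times for size, times in files)
    return data_size_kb, processed_kb
```

The ratio of the two totals is the average number of appearances weighted by file size. With the figures quoted above (about 200 TB processed over about 10 TB of data) this ratio is of the order of 20.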

Technical notes on how to publish this information via a cron job (Anne Futrell, supported by Gabriele Garzoglio): ( note1 | note2 )