THOUGHTS ON COMPUTING IN THE CDF TRAILERS IN RUN II
08 August 1999 - Stefano Belforte - INFN Trieste

I am sorry I am not able to be at Fnal to discuss this in person; I really think this is a very important issue. In particular we Italians, with our limited connectivity to the US internet, have come to depend very heavily on being able to use the computing resources at the lab effectively, and we regard the trailer workstations with great hopes. So, even if it took me some time to get up to speed, I would like now to share my opinions with you after what I heard on Friday.

It took me a long time, but in the end I realised that it is not at all clear to me what the "centralised system" really is. Is it defined by a central disk pool with no local disks? Is it defined by a central password file (Yellow Pages like) vs. 200 independent authorization files? Is it defined by central system files, versus tools being installed independently on each machine? Is it a central disk for code versus code distribution to 200 nodes (I hope we never come to the need to install the offline package and/or database on each Linux box!!!)? Is it like the reconstruction farm? Rather than trying to sort that out, I will try to come to some conclusions by looking at hard facts. And I will take two points of view: one perspective is what we want to do in terms of data access (the user point of view), the other is what things may look like from the system management point of view.

Data access is sort of easy to put on firm ground; it is really a matter of bandwidth. The bottom line is always Stephan's one: I/O from data storage to CPU is a bottleneck, and the Central Analysis Facility (CAF?) is designed to deal with this via Fibre Channel's promise to give SCSI performance over LAN distances. But when we look at the trailers, there is a very different picture. There are all these machines with 100 Mbit/sec capability connected by non-blocking fast switches to a totally inadequate Gigabit Ethernet backbone. Ten PCs accessing the network at the same time and puff... everything is stuck. Unless something dramatic is done, the trailers are *not* a place where PCs can use the network to efficiently access data that are "beyond the switch they are connected to". And doing something dramatic would mean turning the trailers into a computing center.

Let's look at two scenarios:

GROUP SERVERS IN THE TRAILERS:
------------------------------

All offices in one trailer work out of a mid-size machine, sharing its disk and CPU; traffic is local to one non-blocking switch, and communication needs to/from the CAF at FCC are minimal. As long as data access is thus confined, the Gigabit link to FCC is not saturated. We will need several such systems (10~20); they share the backbone and the uplink to FCC, therefore they really have to require very little access to the main data storage. These servers will look very much like the system one can have at the home institution, with ~1 TByte of disk storage for frequently used small data sets (total PAD data is 20~30 TB); anything smaller is a desktop PC, anything bigger is really computer center equipment.

Problems:

1. data sets will be "private" to one physical trailer or sub-trailer; these servers do not help cases where people of different institutions work together on the same data (often the case in Run I). Also, people in the same institution may have to be physically distant, as now happens.
2. the trailers will be full of "home institution data centers" tied together and to FCC by a poor network (1 Gbit/sec), with very, very poor access to the main data storage and robotic tape storage at the CAF.

3. the server has only a 100 Mbit fast ethernet hookup to talk to the 10 or so PCs that "depend" on it. One would need to put special communication hardware in the room where this machine sits, to support e.g. Gigabit ethernet up to the local switch, a more expensive switch, etc.

4. if there is a task that one institution's group believes can be done effectively on a small server with limited local disk and without a good link to FCC, why do it in the trailer? Why not at home?

5. the trailers do not have "computer rooms": all these servers would have to be put in offices, tapes would be exposed to dust, disk noise is a big nuisance, and even the air conditioning in many trailers is not adequate for a heavy equipment load. Who wants a big machine with 20 disks under the desk?

6. which data traffic will this server serve anyhow? What data does it send to the desktop PCs? Are they just X-terminals? If so, why does it need to be in the trailers? X-terminal connectivity to FCC will be pretty good. Will the desktop PCs continually update their local disk storage? With what? How often? Do I make a new n-tuple every hour? How big is it? Why can't it be ftp-ed from the CAF?

7. what purpose will this machine really serve, other than fulfilling some sense of ownership?

Comments: if some institution wants to deploy at FNAL a workgroup server whose size, connectivity capability, and data transfer needs make it suited for the FCC (for example, Italy is considering buying a few of those), why put it in the trailers rather than in the CAF at FCC? I want to put the Italian ones in the CAF to have fast access to the full data set, which I don't have from Italy. The trailers are near FCC as miles go, but they still sit behind a Gbit/sec link, and imagine dividing that Gbit by ten or twenty institutions... as soon as we leave the CAF Fibre Channel LAN, the data becomes distant, no matter what.

At present I can see only one advantage in the "group server in the trailer" model: secrecy. Since this machine will have good access for a few people and poor access to the group at large, it is ideal for discovering SUSY without anybody else spotting it. I will not personally defend that purpose. There will always be some protected disk area for everybody, and "secure" work places at the remote institutions... the balkanization of the trailers is just too much. It is a political issue, not a technical one anyhow.

SINGLE PC's:
------------

As an alternative to group servers, let's examine just 200 individual desktop PCs, all "identical": a democratic, self-organising system. Data flow is mostly from the CAF (where most n-tuples are created) to the PCs (where they are browsed). Sticking to my estimates from February's CDF Computing Workshop, which all in all started as a Remote Access Workshop (I already discussed how the trailers are as remote as Italy):

- n-tuple size: 5 GBytes (so it takes a few minutes to process)
- n-tuple transfer time: 100 Mbit/sec ==> about seven minutes
- n-tuple updating frequency: every 4 hours (a lot!)
- people updating n-tuples in the same half day: 100 (a lot!)
- total b/w needed to the trailers: 100 * 5 GBytes / 4 hours = ~0.3 Gbit/sec

N-tuples may be (a little) bigger, but usually one will not make a new one twice a day; one will use it to think a bit. So the uplink is just adequate for this, leaving space for X-term sessions, remote editing, mail, web...
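Just to make the arithmetic behind these estimates explicit, here is a minimal back-of-the-envelope check (a Python sketch; the inputs are simply the numbers listed above, the variable names are mine):

  # Trailer uplink load from n-tuple refreshes, using the estimates above.
  NTUPLE_GBYTE  = 5       # size of one n-tuple
  LINK_MBIT_S   = 100     # fast ethernet to the desktop PC
  UPLINK_GBIT_S = 1       # Gigabit uplink between trailers and FCC
  USERS         = 100     # people refreshing an n-tuple in the same half day
  PERIOD_HOURS  = 4       # refresh interval per user

  transfer_s = NTUPLE_GBYTE * 8 * 1000 / LINK_MBIT_S                    # ~400 s
  aggregate_gbit_s = USERS * NTUPLE_GBYTE * 8 / (PERIOD_HOURS * 3600.0)

  print("one n-tuple transfer: %.0f s" % transfer_s)
  print("aggregate need: %.2f Gbit/s of a %d Gbit/s uplink"
        % (aggregate_gbit_s, UPLINK_GBIT_S))                            # ~0.28 Gbit/s

The single transfer works out to roughly 400 seconds (several minutes, not one), and the aggregate load to just under 0.3 Gbit/sec, which is why I say the uplink is just adequate for this traffic pattern.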
Data sharing with other PCs may be useful; the preferred way is just to ftp the n-tuple over and run locally, also because of the higher efficiency of FTP over other protocols (like NFS). While it will be easier to access PCs connected to the same switch, there is nothing dramatic in accessing PCs elsewhere, since data transfer is episodic; the idea is always to keep the n-tuples on the local disks and refresh them only every now and then. In this sense the clustering is self-organising. And of course you can always log into your friend's PC and run PAW using your own PC as an X-terminal, just as we do routinely now. Again network traffic is limited, physical location in the trailers is not a big issue, and access rights may in principle be regulated by the directory owner, allowing a precise, flexible, reconfigurable "clustering" that has no rigid label attached.

For larger n-tuples or data sets (50 GByte e.g.), whatever is inconvenient to copy over to 10 PCs is also inconvenient to run on interactively; it is better to run a low-priority batch job where the data is and get a small histogram file back. Sharing the disk over NFS should not make this faster in most cases. For the special case of largish data sets that for strange reasons are not in the main CAF storage and require CPU-limited access (heavy fitting e.g.) by several people, sharing the disk across several PCs (NFS like) could be desirable. I see this as a particular, sort of occasional, case: once such a sharing need is established for a data set, whether group wide or collaboration wide does not matter, the data set should be moved to the CAF. Again, if you want to buy 2 TB of disk for your data, do not put it in the trailers, put it in the CAF, so you can fill it from tape fast!

Here again I would like a kind of self-organization, by which individual users can temporarily grant other users the right to mount their disk as needed. This is a different subject, but guaranteeing proper mounting of hundreds of disks across hundreds of PCs, in an environment where each machine can be rebooted at random by its user, is probably a terrible management task; if the users can do it just when they need it, they will take care of fixing things after a power-off and will limit its use.

SYSTEM MANAGEMENT ISSUES:
-------------------------

Given that the data access analysis favours the picture of 200 independent PCs with 100 GByte local disk, let's look at system management under that hypothesis. My favourite list of requirements is:

- management should be done by professionals, in particular by laboratory personnel. There are security issues as well as performance issues. There have been cases in which well-meaning but unaware amateurs managed to choke the LAN while trying to set up web/name/disk servers. Some sys.man. experience is a good thing in people's education, but it is better acquired in a controlled environment.
- the task must be made easy, in order to deal with 200 computers.
- same op sys version for everybody, installable via the network.
- only minimal system files locally; all applications must be served from a central disk to allow maintenance. AFS would be great for this (local caching of used files). Central installation in this way for all applications: web, print, mail, videoconf, word processing, display, etc. etc. The system should really look like cdfsga: you log in and have all applications ready, with the same names, parameters, etc. Same print queues and commands, etc. etc.
- centralised .cshrc and .login should be provided, e.g. following the CERN example.
  This is also true for the central system: a new user should not need to spend a day messing around with things he/she does not know about before being able to type "netscape" or "latex" and have it work.
- a common home directory on the CAF file server, plus a local scratch/data disk (this I think is already "decided"... but...).
- local disks are installed and configured by the system managers, who give ownership of the disk directories to the PC owner and then "forget" about them.
- nobody but the system manager should have the root password.
- special cases of trustworthy, skillful users with special needs should be dealt with as needed; the above is the rule, but intelligent exceptions should be allowed. This should take care of web servers e.g.
- the users should be given the possibility to allow/deny read/write access to their local disk areas to other users, pretty much as ACLs used to work on VMS.
- there should be a unique authorization file (Yellow Pages like) allowing everybody to log on to any PC. Experience has shown that there is little to fear: people will anyhow be discouraged from logging into a PC whose disk they cannot use, and will not bug the "owner".
- a limit on interactive logins (3 e.g.) could be imposed anyhow.
- the PC's owner should still be guaranteed privileged use, e.g. by lowering the priority of networked processes vs. console ones.
- users should be given tools to mount other PCs' disks and to tune the permissions for this operation without knowing the root password. It does not need to be an instant operation: submitting a request to a remote process with the proper privileges, which checks the authorization and executes the command after 10 minutes, is acceptable (see the sketch after this list).
- a good batch system should be put in place; two different needs should be addressed:
  1. "CPU farming": a lot of CPU cycles will be idle, and they are perfect for large MCs. A tool exists for this that guarantees that the batch work does not impact the interactive performance; it is called CONDOR, developed at the Univ. of Wisconsin-Madison.
  2. local analysis: some way to allow user X to run a job on Y's PC to access data on Y's disk. This is also important for X to run on X's own PC! A small downgrading of interactive response is acceptable. I do not have a good suggestion here.

Several of the requirements I listed do not have an obvious solution that I know of, but I think tools can be developed to implement them. In particular, the development of this kind of system management tool is something that may even be handed out to remote help, like CD or collaborating institutions; these are the kind of things that local system managers often invent independently in different places. The main issue is probably just the bare system installation management; definitely sysmans should not roam around offices with CDs. I suggest at least having a look at what CERN is doing.

A consequence I like of the "200 PC" scenario versus the "group servers" one is again that the 200 PCs have to be all equal: since they are too many to manage independently, we have to define a list like the above, develop specific tools, and have a unique system management policy. With group servers geographically partitioned (by State or by Trailer), it is hard to cut down the pressure for 20 different customisations, and very hard to do it without 20 different system managers.
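To make the "mount on request" item above concrete, here is a minimal sketch of the kind of privileged helper I have in mind. Everything in it (the spool and authorization file locations, their formats, the use of exportfs) is just an assumption for illustration, not a worked-out design:

  # Sketch of the privileged "mount request" process: users drop small
  # request files in a spool area; this process (running with the proper
  # privileges) checks each request against an authorization list that
  # the disk owner maintains and, if allowed, exports the directory.
  # Paths and file formats are invented for illustration.
  import os
  import subprocess
  import time

  SPOOL = "/var/spool/mount-requests"   # request files contain: "user host directory"
  AUTH  = "/usr/local/etc/mount-auth"   # owner-edited lines: "user directory"

  def authorized(user, directory):
      try:
          for line in open(AUTH):
              if line.split() == [user, directory]:
                  return True
      except IOError:
          pass
      return False

  while True:
      for name in os.listdir(SPOOL):
          path = os.path.join(SPOOL, name)
          fields = open(path).read().split()
          os.remove(path)
          if len(fields) != 3:
              continue                  # malformed request, ignore it
          user, host, directory = fields
          if authorized(user, directory):
              # export read-only, and only to the requesting host
              subprocess.call(["exportfs", "-o", "ro", "%s:%s" % (host, directory)])
      time.sleep(600)                   # the "10 minutes" latency is fine

The point is only that the user never needs the root password: the privileged part is a dumb executor, and all the policy sits in a file that the disk owner edits.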
Finally, summarising along the points mentioned in the charge:

- hardware setup: we may suggest O(100 GB) of disk as a guideline for sizing a "useful" system, but other than that there is nothing particular required by the data access needs. The following is my personal favourite, but hardly any PC would come without it:
  - 100 Mbit/sec fast ethernet
  - audio board
  - video acquisition board and camera (optional)
  - low-end SCSI i/f
  - R/W CD
- root password: should be kept by the lab sys.mans. Occasional and/or specific needs may require a different password on a few systems, shared with a few trustworthy users, but this should be the exception.
- backup: users' local disks should not be backed up centrally. Users will have to back themselves up using 8mm tapes or R/W CDs (a trivial sketch follows at the end of this list). Given the state of flux of most data (executables e.g.), it is probably not necessary to back them up at all. Sources etc. should be on the file server. A few GB of really important stuff can go to CD; for larger sets a local 8mm unit can be used. I do not favour extended deployment of 8mm units "a' la Run I", because there is a lot of dust in the trailers and now both drives and tapes are more expensive.
- disk cross-mounting: should be temporary, limited, flexible. As long as it is limited, it should not be a noteworthy load on the LAN.
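As an illustration of the "back up yourself" point above, something as trivial as the following would do for the few GB of really important stuff; the directory names and the staging area are of course just placeholders:

  # Minimal self-backup sketch: tar up the directories worth keeping and
  # stage the archive for burning to a R/W CD (or copying to a local 8mm
  # tape).  Directory names are invented for illustration.
  import os
  import tarfile
  import time

  KEEP    = ["papers", "macros", "ntuples/final"]   # whatever is really important
  STAGING = "/scratch/backup"                       # area later burned to CD

  os.makedirs(STAGING, exist_ok=True)
  archive = os.path.join(STAGING, "backup-%s.tar.gz" % time.strftime("%Y%m%d"))
  with tarfile.open(archive, "w:gz") as tar:
      for directory in KEEP:
          if os.path.exists(directory):
              tar.add(directory)
  print("wrote", archive)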