THOUGHTS ON COMPUTING IN THE CDF TRAILERS IN RUN II
08 August 1999 - Stefano Belforte - INFN Trieste

I am sorry I am not able to be at Fnal to discuss this in person; I really think this is a very important issue. In particular we Italians, with our limited connectivity to the US internet, have come to depend very heavily on being able to use the computing resources at the lab effectively, and we regard the trailer workstations with great hopes. So, even if it took me some time to get up to speed, I would like now to share my opinions with you after what I heard on Friday.

It took me a long time, but in the end I realised that it is not at all clear to me what the "centralised system" really is. Is it defined by a central disk pool with no local disks? Is it defined by a central password file (Yellow Pages like) vs. 200 independent authorization files? Is it defined by central system files, versus tools being installed independently on each machine? Is it a central disk for code versus code distribution to 200 nodes (I hope we never come to the need to install the offline package and/or database on each Linux box!!!)? Is it like the reconstruction farm? Rather than trying to sort that out, I will try to come to some conclusions by looking at hard facts. And I will take two points of view: one perspective is what we want to do in terms of data access (the user point of view), the other is what things may look like from the system management point of view.

Data access is sort of easy to put on firm ground; it is really a matter of bandwidth. The bottom line is always Stephan's one: I/O from data storage to CPU is a bottleneck, and the Central Analysis Facility (CAF?) is designed to deal with this via Fibre Channel's promise to give SCSI performance over LAN distances. But when we look at the trailers, there is a very different picture. There are all these machines with 100 Mbit/sec capability connected by non-blocking fast switches to a totally inadequate Gigabit Ethernet backbone. Ten PCs accessing the network at the same time and puff... everything is stuck. Unless something dramatic is done, the trailers are *not* a place where PCs can use the network to efficiently access data that are "beyond the switch they are connected to". And doing something dramatic would mean turning the trailers into a computing center.

Let's look at two scenarios:

GROUP SERVERS IN THE TRAILERS:
------------------------------

All offices in one trailer work out of a mid-size machine, sharing its disk and CPU; traffic is local to one non-blocking switch, and communication needs to/from the CAF at FCC are minimal. As long as data access is thus confined, the Gigabit link to FCC is not saturated. We will need several such systems (10~20); they share the backbone and the uplink to FCC, therefore they really have to require very little access to the main data storage. These servers will look very much like the system one can have at the home institution, with ~1 TByte of disk storage for frequently used small data sets (total PAD data is 20~30 TB); anything smaller is a desktop PC, anything bigger is really computer center equipment.

Problems:

1. data sets will be "private" to one physical trailer or sub-trailer; these servers do not help cases where people of different institutions work together on the same data (often the case in Run I). Also, people in the same institution may have to be physically distant, as now happens.
2. the trailers will be full of "home institution data centers" tied together and to FCC by a poor network (1 Gbit/sec), with very, very poor access to the main data storage and robotic tape storage at the CAF.

3. the server has only a 100 Mbit fast ethernet hookup to talk to the 10 or so PCs that "depend" on it. One would need to put special communication hardware in the room where this machine sits, to support e.g. Gigabit ethernet up to the local switch, a more expensive switch, etc.

4. if there is a task that one institution's group believes can be done effectively on a small server with limited local disk and without a good link to FCC, why do it in the trailer? Why not at home?

5. the trailers do not have "computer rooms": all these servers would have to be put in offices, tapes would be exposed to dust, disk noise is a big nuisance, and even the air conditioning in many trailers is not adequate for a heavy equipment load. Who wants a big machine with 20 disks under the desk?

6. which data traffic will this server serve anyhow? What data does it send to the desktop PCs? Are they just X-terminals? If so, why does it need to be in the trailers? X-terminal connectivity to FCC will be pretty good. Will the desktop PCs continually update their local disk storage? With what? How often? Do I make a new n-tuple every hour? How big is it? Why can't it be ftp-ed from the CAF?

7. what purpose will this machine really serve, other than fulfilling some sense of ownership?

Comments: if some institution wants to deploy at FNAL a workgroup server whose size, connectivity capability, and data transfer needs make it suited for the FCC (for example, Italy is considering buying a few of those), why put it in the trailers rather than in the CAF at FCC? I want to put the Italian ones in the CAF to have fast access to the full data set, which I don't have from Italy. The trailers are near FCC as miles go, but they still sit behind a Gbit/sec link, and imagine dividing that Gbit by ten or twenty institutions... as soon as we leave the CAF Fibre Channel LAN, the data becomes distant, no matter what.

At present I can see only one advantage in the "group server in the trailer" model: secrecy. Since this machine will have good access for a few people and poor access to the group at large, it is ideal for discovering SUSY without anybody else spotting it. I will not personally defend that purpose. There will always be some protected disk area for everybody, and "secure" work places at the remote institutions... the balkanization of the trailers is just too much. It is a political issue, not a technical one anyhow.

SINGLE PC's:
------------

As an alternative to group servers, let's examine just 200 individual desktop PCs, all "identical": a democratic, self-organising system. Data flow is mostly from the CAF (where most n-tuples are created) to the PCs (where they are browsed). Sticking to my estimates from February's CDF Computing Workshop, which all in all started as a Remote Access Workshop (I already discussed how the trailers are as remote as Italy):

- n-tuple size: 5 GBytes (so it takes a few minutes to process)
- n-tuple transfer time: 100 Mbit/sec ==> about seven minutes
- n-tuple updating frequency: every 4 hours (a lot!)
- people updating n-tuples in the same half day: 100 (a lot!)
- total b/w needed to the trailers: 100 * 5 GBytes / 4 hours = ~0.3 Gbit/sec

N-tuples may be (a little) bigger, but usually one will not make a new one twice a day; one will use it to think a bit. So the uplink is just adequate for this, leaving space for X-term sessions, remote editing, mail, web...
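Just to make the arithmetic behind these estimates explicit, here is a minimal back-of-the-envelope check (a Python sketch; the inputs are simply the numbers listed above, the variable names are mine):

  # Trailer uplink load from n-tuple refreshes, using the estimates above.
  NTUPLE_GBYTE  = 5       # size of one n-tuple
  LINK_MBIT_S   = 100     # fast ethernet to the desktop PC
  UPLINK_GBIT_S = 1       # Gigabit uplink between trailers and FCC
  USERS         = 100     # people refreshing an n-tuple in the same half day
  PERIOD_HOURS  = 4       # refresh interval per user

  transfer_s = NTUPLE_GBYTE * 8 * 1000 / LINK_MBIT_S                    # ~400 s
  aggregate_gbit_s = USERS * NTUPLE_GBYTE * 8 / (PERIOD_HOURS * 3600.0)

  print("one n-tuple transfer: %.0f s" % transfer_s)
  print("aggregate need: %.2f Gbit/s of a %d Gbit/s uplink"
        % (aggregate_gbit_s, UPLINK_GBIT_S))                            # ~0.28 Gbit/s

The single transfer works out to roughly 400 seconds (several minutes, not one), and the aggregate load to just under 0.3 Gbit/sec, which is why I say the uplink is just adequate for this traffic pattern.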
Data sharing with other PCs may be useful; the preferred way is just to ftp the n-tuple over and run locally, also because of the higher efficiency of FTP over other protocols (like NFS). While it will be easier to access PCs connected to the same switch, there is nothing dramatic in accessing PCs elsewhere, since data transfer is episodic; the idea is always to keep the n-tuples on the local disks and refresh them only every now and then. In this sense the clustering is self-organising. And of course you can always log into your friend's PC and run PAW using your own PC as an X-terminal, just as we do routinely now. Again network traffic is limited, physical location in the trailers is not a big issue, and access rights may in principle be regulated by the directory owner, allowing a precise, flexible, reconfigurable "clustering" that has no rigid label attached.

For larger n-tuples or data sets (50 GByte e.g.), whatever is inconvenient to copy over to 10 PCs is also inconvenient to run on interactively; it is better to run a low-priority batch job where the data is and get a small histogram file back. Sharing the disk over NFS should not make this faster in most cases. For the special case of largish data sets that for strange reasons are not in the main CAF storage and require CPU-limited access (heavy fitting e.g.) by several people, sharing the disk across several PCs (NFS like) could be desirable. I see this as a particular, sort of occasional, case: once such a sharing need is established for a data set, whether group wide or collaboration wide does not matter, the data set should be moved to the CAF. Again, if you want to buy 2 TB of disk for your data, do not put it in the trailers, put it in the CAF, so you can fill it from tape fast!

Here again I would like a kind of self-organization, by which individual users can temporarily grant other users the right to mount their disk as needed. This is a different subject, but guaranteeing proper mounting of hundreds of disks across hundreds of PCs, in an environment where each machine can be rebooted at random by its user, is probably a terrible management task; if the users can do it just when they need it, they will take care of fixing things after a power-off and will limit its use.

SYSTEM MANAGEMENT ISSUES:
-------------------------

Given that the data access analysis favours the picture of 200 independent PCs with 100 GByte local disk, let's look at system management under that hypothesis. My favourite list of requirements is:

- management should be done by professionals, in particular by laboratory personnel. There are security issues as well as performance issues. There have been cases in which well-meaning but unaware amateurs managed to choke the LAN while trying to set up web/name/disk servers. Some sys.man. experience is a good thing in people's education, but it is better acquired in a controlled environment.
- the task must be made easy, in order to deal with 200 computers.
- same op sys version for everybody, installable via the network.
- only minimal system files locally; all applications must be served from a central disk to allow maintenance. AFS would be great for this (local caching of used files). Central installation in this way for all applications: web, print, mail, videoconf, word processing, display, etc. etc. The system should really look like cdfsga: you log in and have all applications ready, with the same names, parameters, etc. Same print queues and commands, etc. etc.
- centralised .cshrc and .login should be provided, e.g. following the CERN example.
  This is also true for the central system: a new user should not need to spend a day messing around with things he/she does not know about before being able to type "netscape" or "latex" and have it work.
- a common home directory on the CAF file server, plus a local scratch/data disk (this I think is already "decided"... but...).
- local disks are installed and configured by the system managers, who give ownership of the disk directories to the PC owner and then "forget" about them.
- nobody but the system manager should have the root password.
- special cases of trustworthy, skillful users with special needs should be dealt with as needed; the above is the rule, but intelligent exceptions should be allowed. This should take care of web servers e.g.
- the users should be given the possibility to allow/deny read/write access to their local disk areas to other users, pretty much as ACLs used to work on VMS.
- there should be a unique authorization file (Yellow Pages like) allowing everybody to log on to any PC. Experience has shown that there is little to fear: people will anyhow be discouraged from logging into a PC whose disk they cannot use, and will not bug the "owner".
- a limit on interactive logins (3 e.g.) could be imposed anyhow.
- the PC's owner should still be guaranteed privileged use, e.g. by lowering the priority of networked processes vs. console ones.
- users should be given tools to mount other PCs' disks and to tune the permissions for this operation without knowing the root password. It does not need to be an instant operation: submitting a request to a remote process with the proper privileges, which checks the authorization and executes the command after 10 minutes, is acceptable (see the sketch after this list).
- a good batch system should be put in place; two different needs should be addressed:
  1. "CPU farming": a lot of CPU cycles will be idle, and they are perfect for large MCs. A tool exists for this that guarantees that the batch work does not impact the interactive performance; it is called CONDOR, developed at the Univ. of Wisconsin-Madison.
  2. local analysis: some way to allow user X to run a job on Y's PC to access data on Y's disk. This is also important for X to run on X's own PC! A small downgrading of interactive response is acceptable. I do not have a good suggestion here.

Several of the requirements I listed do not have an obvious solution that I know of, but I think tools can be developed to implement them. In particular, the development of this kind of system management tool is something that may even be handed out to remote help, like CD or collaborating institutions; these are the kind of things that local system managers often invent independently in different places. The main issue is probably just the bare system installation management; definitely sysmans should not roam around offices with CDs. I suggest at least having a look at what CERN is doing.

A consequence I like of the "200 PC" scenario versus the "group servers" one is again that the 200 PCs have to be all equal: since they are too many to manage independently, we have to define a list like the above, develop specific tools, and have a unique system management policy. With group servers geographically partitioned (by State or by Trailer), it is hard to cut down the pressure for 20 different customisations, and very hard to do it without 20 different system managers.
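To make the "mount on request" item above concrete, here is a minimal sketch of the kind of privileged helper I have in mind. Everything in it (the spool and authorization file locations, their formats, the use of exportfs) is just an assumption for illustration, not a worked-out design:

  # Sketch of the privileged "mount request" process: users drop small
  # request files in a spool area; this process (running with the proper
  # privileges) checks each request against an authorization list that
  # the disk owner maintains and, if allowed, exports the directory.
  # Paths and file formats are invented for illustration.
  import os
  import subprocess
  import time

  SPOOL = "/var/spool/mount-requests"   # request files contain: "user host directory"
  AUTH  = "/usr/local/etc/mount-auth"   # owner-edited lines: "user directory"

  def authorized(user, directory):
      try:
          for line in open(AUTH):
              if line.split() == [user, directory]:
                  return True
      except IOError:
          pass
      return False

  while True:
      for name in os.listdir(SPOOL):
          path = os.path.join(SPOOL, name)
          fields = open(path).read().split()
          os.remove(path)
          if len(fields) != 3:
              continue                  # malformed request, ignore it
          user, host, directory = fields
          if authorized(user, directory):
              # export read-only, and only to the requesting host
              subprocess.call(["exportfs", "-o", "ro", "%s:%s" % (host, directory)])
      time.sleep(600)                   # the "10 minutes" latency is fine

The point is only that the user never needs the root password: the privileged part is a dumb executor, and all the policy sits in a file that the disk owner edits.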
Finally, summarising along the points mentioned in the charge:

- hardware setup: we may suggest O(100 GB) of disk as a guideline for sizing a "useful" system, but other than that there is nothing particular required by the data access needs. The following is my personal favourite, but hardly any PC would come without it:
  - 100 Mbit/sec fast ethernet
  - audio board
  - video acquisition board and camera (optional)
  - low-end SCSI i/f
  - R/W CD
- root password: should be kept by the lab sys.mans. Occasional and/or specific needs may require a different password on a few systems, shared with a few trustworthy users, but this should be the exception.
- backup: users' local disks should not be backed up centrally. Users will have to back themselves up using 8mm tapes or R/W CDs (a trivial sketch follows at the end of this list). Given the state of flux of most data (executables e.g.), it is probably not necessary to back them up at all. Sources etc. should be on the file server. A few GB of really important stuff can go to CD; for larger sets a local 8mm unit can be used. I do not favour extended deployment of 8mm units "a' la Run I", because there is a lot of dust in the trailers and now both drives and tapes are more expensive.
- disk cross-mounting: should be temporary, limited, flexible. As long as it is limited, it should not be a noteworthy load on the LAN.
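As an illustration of the "back up yourself" point above, something as trivial as the following would do for the few GB of really important stuff; the directory names and the staging area are of course just placeholders:

  # Minimal self-backup sketch: tar up the directories worth keeping and
  # stage the archive for burning to a R/W CD (or copying to a local 8mm
  # tape).  Directory names are invented for illustration.
  import os
  import tarfile
  import time

  KEEP    = ["papers", "macros", "ntuples/final"]   # whatever is really important
  STAGING = "/scratch/backup"                       # area later burned to CD

  os.makedirs(STAGING, exist_ok=True)
  archive = os.path.join(STAGING, "backup-%s.tar.gz" % time.strftime("%Y%m%d"))
  with tarfile.open(archive, "w:gz") as tar:
      for directory in KEEP:
          if os.path.exists(directory):
              tar.add(directory)
  print("wrote", archive)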