From - Fri Apr 5 13:48:06 2002
Date: Thu, 4 Apr 2002 14:25:39 -0600 (CST)
From: "William Badgett, Fermilab 1(630)840-6674"
To:
cc: Stefano Belforte, David Waters, "Richard E. Hughes, OSU/CDF", Jack Cranshaw,
    Franco Bedeschi, Al Goshaw, Robert Harris, Frank Wuerthwein, Kevin McFarland,
    Jeff Tseng, Terry Watts, Nigel Lockyer
Subject: Re: Database manpower

We're not in crisis mode; things are under control and we have appropriate plans for the future. Like every other project on CDF, we would like more people. I invite everyone interested to the weekly database meeting to discuss these issues, but I don't think continuing this e-mail thread would be useful.

Cheers,
Bill

On Thu, 4 Apr 2002, Rick St. Denis wrote:

> Hi Stefano
> You are absolutely right.
> As you know, we have tried to address these issues by bringing in Grid manpower. This was after we abandoned efforts to get the manpower needed for the tasks that you suggested from within the collaboration. You did kindly suggest one person, but she was unable to be out here full time to work with the people here and was therefore never able to get integrated. In addition, she came at a time when I was in the midst of changing the paradigm for finding the manpower we needed to address the problems of the database.
>
> Rob's assertion that the proper forum to present needs like this is the database meeting, and that the leaders take requests to the offline management, results in nothing practical. Jack and I did this time and time again, and, in fact, we also took our problems to the online operations. However, the leadership of the collaboration is only as strong as the collaboration itself. Therefore I decided to take an entirely different tack and try to get these problems solved in the context of Grid. It is a gamble, and I hope that the CDF databases can limp along until the issues that Stefano raises can be attacked from the front, and not the side. I am grateful that Rob was supportive of this orthogonal approach, as I am sure he feels the same frustration as I do with regard to motivating the collaboration to solve these kinds of problems. I am trying to do something constructive and I do not see another solution.
>
> An example of the way in which we may be able to use Grid to address the problems was illustrated today when we learned from Vicky that SAM provides for the database constants exported to a file to be kept as a data tier. This can then be allocated space in the cache as user=calibration, separated in quota from user=top-analysis, so that when one runs an analysis the first step is to run a project to get the calibration, then run the project to analyse. This seems a bit awkward but is the kind of developmental work that is definitely "integrate CDF with SAM" and hence allows us to use new resources.
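[The two-project pattern Rick describes could look roughly like the Python sketch below. Every name in it (the CacheGroup class, run_project, the dataset strings and quotas) is invented for illustration; this is not the real SAM interface, only the shape of the workflow: stage the exported calibration tier under its own cache quota first, then run the analysis project against a separately accounted quota.]

# Illustrative sketch only: made-up stand-ins for a SAM-like workflow in which
# calibration constants exported to files form their own data tier, cached
# under a quota separate from the analysis dataset.
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class CacheGroup:
    """A cache area whose disk quota is accounted separately (e.g. 'calibration' vs 'top-analysis')."""
    name: str
    quota_gb: int


def run_project(dataset: str, cache: CacheGroup) -> list[str]:
    """Pretend to stage the files of `dataset` into `cache` and return their local paths."""
    print(f"staging dataset '{dataset}' into cache group '{cache.name}' "
          f"(quota {cache.quota_gb} GB)")
    return [f"/cache/{cache.name}/{dataset}/file_{i:04d}" for i in range(3)]


if __name__ == "__main__":
    # Step 1: a small project that only fetches the calibration-constants tier.
    calib_files = run_project("cdf-calib-constants", CacheGroup("calibration", 50))
    # Step 2: the actual analysis project, charged to a different quota,
    # runs with the calibration files already on local disk.
    event_files = run_project("top-analysis-sample", CacheGroup("top-analysis", 500))
    print(f"analysis job would open {len(event_files)} event files "
          f"with {len(calib_files)} calibration files already local")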
> Stefano, you are right that the Oracle database cannot handle the SAM load. But it is widely known in the Grid community that this has to be solved, and therefore we tap into that.
> cheers,
> rick
>
> *****************************
> Dr. Richard St. Denis, Dept. of Phys. & Astr., Glasgow University
> Glasgow G12 8QQ; United Kingdom; UK Phone: [44] (141) 330 5887
> UK Fax: [44] (141) 330 5881
> ======================================
> FermiLab PO Box 500; MS 318 Batavia, Illinois 60510 USA
> FNAL Phone: [00] 1-630-840-2943   FNAL Fax: [00] 1-630-840-2968
> Sidet: [00] 1-630-840-8630   FCC: [00] 1-630-840-3707
>
> ---------- Forwarded message ----------
> Date: Thu, 04 Apr 2002 10:07:21 -0600
> From: ALANSILL@tthep2.phys.ttu.edu
> To: stefano.belforte@ts.infn.it
> Cc: r.stdenis@physics.gla.ac.uk, badgett@fnal.gov, cranshaw@fnal.gov, bed@fnal.gov, goshaw@fnal.gov, rharris@cdfsga.fnal.gov, fkw@fkw.lns.mit.edu, ksmcf@fnal.gov, t.huffman1@physics.ox.ac.uk, jtseng@fnal.gov, giorgiob@fnal.gov, watts@physics.rutgers.edu, ALANSILL@tthep2.phys.ttu.edu
> Subject: Re: Picking up on database export and offsite needs / grid use
>
> Hi Stefano,
>
> I guess I left out a little bit of preface information, sorry. I have accepted a position as deputy database coordinator for a short while to work with Bill on exactly these concerns. I share your conclusions and concerns, and while I don't think we are in a crisis right at the moment, in my own mind we are not far from one. Distributing computing around the world is a major project even if the software is working well, and we need to think about database and calibration distribution from the beginning. (Even more so, since we are dependent on it for innovative solutions like SAM.)
>
> As Rob says, we have been meeting for weeks in the context of the CDF weekly database meeting run by Bill Badgett to discuss operational problems as well as plans for the future, and are now trying to think as hard as we can about problems like replication and methods to hold off any coming crisis in access to the database. Resources are few and people are hard to come by, but there are the beginnings of a serious attempt to address these problems, including but not limited to specifying new hardware that can be duplicated remotely.
>
> I look forward to getting more outright criticism offered in a friendly way such as this as a mechanism to identify the problems and move forward. Only by being frank with each other can we hope to make progress.
>
> Alan
> ---------------------------------------------------------------------------
>
> > Date: Thu, 04 Apr 2002 16:42:51 +0200
> > From: Stefano Belforte
> > To: ALANSILL@tthep2.phys.ttu.edu
> > CC: r.stdenis@physics.gla.ac.uk, badgett@fnal.gov, cranshaw@fnal.gov, Franco Bedeschi, Al Goshaw, Robert Harris, Frank Wuerthwein, Kevin McFarland, "Todd Huffman (CDF/ATLAS)", Jeff Tseng, bellettini giorgio, Terry Watts
> > Subject: Re: Picking up on database export and offsite needs / grid use
> >
> > Thanks Alan,
> > not your fault, but your message makes me very upset. I think once a year I am entitled to file a serious complaint with the experiment management, and this is it.
> >
> > There is a big problem with the database, but frankly I do not know how to deal with it anymore. It is clear there is a big problem, but it is also too big for me. We do not have the people nor the knowledge to take the DB export on as an Italian project; we tried to follow somebody else's lead, and got nowhere.
> > I have tried to help by pushing Italians to work on this, and got very bad results. From Italy we put one person on this for half a year; she ended up with a few months of wasted time with no clear direction and finally being asked to work on SAM instead, making a few people displeased and making it difficult for us to find more help in the future.
> >
> > If anything, I have realised it is not simply export. As we move to decentralised computing we have to look at distributed DBs and the integration of those. And it is not just calibration constants; it is the data file catalogue that worries me most now. There is no point in you and me talking about the DB; this has to be tackled from the top, seriously and with a lot of manpower, not with occasional help like the kind I can provide.
> >
> > It is not something where we can throw in a student or a part-time newcomer; it needs dedicated thinking, full-time people, and real computing experts. Since we trapped ourselves in a situation where the code is so fucking slow that we need thousands of computers around the world to get the physics out, we need a way to make it if not easy, at least feasible.
> >
> > As far as I can see, the lab pushed us into this Oracle mess, and in spite of all the help they are giving on SAM, they are not even remotely tackling the biggest problem of a single DB server for everybody. They are rather making it more serious with this cute feature of SAM where you can't even start a job without talking to an Oracle server at Fermilab. This is more or less what killed Objectivity. The DB replica is not my problem, or your problem, or CDF's problem; it has to be Fermilab's problem, just as it is one of the biggest Grid problems and more people are working on it than I can keep track of.
> >
> > The new CAF means nothing to me: CD can set up an Oracle replica there, even if it takes 4 weeks of work by all of Vicky White's group. I worry about having 100-200-300 nodes in Italy trying to run jobs. What will happen? What kind of support are we getting? Zero so far. For a few years I have been told that we can simply access the FNAL DB server, that there is a free Oracle licence for all of CDF, and that commercial DB servers are so good that they handle millions of operations a second with no problems. I even remember some highly placed Fermilab person showing the Oracle book from the CHEP2000 desk as the magic wand while explaining to the world how FNAL had found there the panacea to all ills.
> >
> > Now I have just learnt from Frank that already the coming Linux farm at FCC needs a dedicated DB replica to work. Years ago I expressed concerns about how to deal with data sets remotely to the offline manager of the time (MS) and the answer was "we will distribute DIM with the offline". Now DIM is going away and the replacement is as portable as Cheops' pyramid. And as scalable. This is totally nuts; we are locking ourselves in a cage.
> >
> > Regardless of Grids, we have to make it easy to copy data to Italy, run here using standard AC++ input modules by dataset name, do it even if the network to FNAL is down, and be sure we are using the correct files. Anything less than this is a disaster. You can substitute Italy with UK, Chicago, Texas, or whatever. But we have to look beyond the summer conferences, to 8 years of data taking, millions of files, and computers running analysis everywhere, Grid or no Grid. For how long can people keep track by hand of the files they copy home and be confident they have all the events? One thousand files? Two thousand? Five million?
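[Stefano's worry about being confident that a locally copied dataset is complete is exactly the bookkeeping a file catalogue automates. A minimal sketch, with an invented manifest format and hypothetical paths, of checking a local copy against a catalogue listing of (filename, size, checksum):]

# Minimal sketch (invented manifest format) of verifying that files copied to a
# remote site match a catalogue listing of name, size, and checksum, instead of
# keeping track of thousands of files by hand.
import csv
import hashlib
import os


def md5sum(path: str, chunk: int = 1 << 20) -> str:
    """MD5 of a file, read in chunks so large files do not load into memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()


def verify_copy(manifest_csv: str, local_dir: str) -> list[str]:
    """Return a list of problems; an empty list means the local copy is complete."""
    problems = []
    with open(manifest_csv, newline="") as f:
        for name, size, checksum in csv.reader(f):
            path = os.path.join(local_dir, name)
            if not os.path.exists(path):
                problems.append(f"missing: {name}")
            elif os.path.getsize(path) != int(size):
                problems.append(f"wrong size: {name}")
            elif md5sum(path) != checksum:
                problems.append(f"bad checksum: {name}")
    return problems


if __name__ == "__main__":
    # Hypothetical paths: a manifest exported from the file catalogue and the
    # directory holding the files copied home.
    for problem in verify_copy("dataset_manifest.csv", "/data/cdf/top-sample"):
        print(problem)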
> > I beg the DH leaders, the offline managers, and the spokespersons to start doing something serious about this, like making it a real priority of CDF. Can we, just for the fun of doing things differently, tackle this before it explodes into a crisis? Who is in charge of this? Is it a DH task? An offline manager?
> >
> > I am glad to learn that Rutgers has developed a "local DFC" that works, but I cannot put people to work on this kind of thing without the certainty that it is going somewhere. There has to be a plan, and each effort has to be part of that plan.
> >
> > Thanks
> >
> > Stefano
> >
> > --
> > ALANSILL@tthep2.phys.ttu.edu wrote:
> > >
> > > Hi Stefano,
> > >
> > > Picking up the thread from below, I'd like to update you on our current thinking and plans, and to see whether these match what you, Rick, and others who are doing offsite analysis are doing with respect to the Grid.
> > >
> > > We've seen on several occasions that both experiments and inadvertent use (abuse) can overload the current CDF offline "production" db server, fcdfora1.fnal.gov (= cdfofprd). This is a Sun machine; plans call for future replacement and/or augmentation with a multiprocessor Linux-based machine. Bottlenecks exist both in hardware performance (e.g. SCSI bus layout and implementation) and in the number of simultaneous processes allowed; ultimately the latter is controlled mostly by the number of Fermilab licenses for simultaneous average connections, though some allowance is made by Oracle for "spiking."
> > >
> > > The total number of Fermilab licenses for Oracle now stands at 135. The offline machine process limit is 200, and we have seen several cases of it being reached and exceeded, as evidenced by refused connections, monitoring, and complaints from users. Nonetheless most of these cases can be traced to unusual and/or abusive use, for example an experiment by an offsite institution to export the entire database contents to try an import into MySQL (not recommended for a number of reasons right now, but I'll get to this), timed-out processes, etc. Bill Badgett and the Oracle DBAs, Anil, etc. have been trying to track down some of the worst of the non-disconnecting processes and make some changes to the CDF calib db code to cut down on this problem. For the moment things are OK, i.e., good under routine conditions, and people are almost always able to connect from offsite and onsite to get their work done.
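[The "non-disconnecting processes" Alan mentions are the kind of thing client code can guard against by always releasing its connection, and by backing off rather than hammering a server that is already at its process or license limit. A hedged sketch of that pattern using the Python DB-API, with sqlite3 standing in for the Oracle client; the query and retry numbers are arbitrary, and the actual CDF calib DB code is not shown here.]

# Sketch of client-side connection hygiene: always close the connection, and
# retry with a delay if the server refuses us (e.g. because the process limit
# or license count has been reached). sqlite3 stands in for the Oracle client.
import sqlite3
import time
from contextlib import closing


def query_with_retry(dsn: str, sql: str, attempts: int = 3, wait_s: float = 5.0):
    """Run one query, guaranteeing the connection is closed, with simple retries."""
    for attempt in range(1, attempts + 1):
        try:
            # closing(...) ensures the connection is released even on errors,
            # so we never leave an idle session holding a server slot.
            with closing(sqlite3.connect(dsn)) as conn:
                return conn.execute(sql).fetchall()
        except sqlite3.OperationalError as err:
            if attempt == attempts:
                raise
            print(f"connection attempt {attempt} failed ({err}); retrying in {wait_s}s")
            time.sleep(wait_s)


if __name__ == "__main__":
    # In-memory database as a harmless demonstration target.
    print(query_with_retry(":memory:", "SELECT 1"))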
> > > How this might change in the future:
> > >
> > > 1) The CAF is coming up. This will greatly increase the number of concurrently running jobs.
> > >
> > > 2) People are running an increasing number of analysis jobs from their desktop workstations, both remotely and from FNAL on site.
> > >
> > > --> 1 a) and 2 a): both sets of users above will be running on increasingly concentrated data sets.
> > >
> > > 3) The CDF grid projects and CDF SAM are starting up. SAM makes and will make heavy use of the offline database, both to access the usual CDF DB entries and to manage and store its own catalog information.
> > >
> > > 4) CDF is getting more data. (This may seem trivial, and of course is the kind of problem we would like to have at a considerably increased rate, with more luminosity from the accelerator than we are currently getting, but ultimately it will have an effect. Also, reprocessing of existing datasets with new versions of production will put an increased load on the database and will cause us to refine our procedures for storing multiple copies and versions of calibration constants, etc.)
> > >
> > > What are we doing about this?
> > >
> > > So far just talking, but with plans toward the future.
> > >
> > > - We want to implement a more advanced (note: not "Oracle Advanced") version of replication between the online and offline db servers perhaps; between the offline server and (new) copies of the offline server definitely.
> > >
> > > - Some of these copies or duplicates of the offline database, including the data file catalog and/or whatever it grows into in the future, can be housed off-site, and probably should be.
> > >
> > > - Some of these copies can probably be achieved, or at least this is the suggestion from Oracle, by a technology called "Oracle Streams." This would require that we move to a newer version of the Oracle software than we are using; it is a lot easier to implement than the previous technique that everyone seems to hate, "advanced replication", and seems to be made to order for copying to a limited number of off-site and on-site locations.
> > >
> > > - We probably will have to look at some kind of organized export to off-site freeware databases also, if only for license-limitation reasons (we have a limited number of licenses to share between FNAL and off-site institutions). There are, however, some limitations in even the best of the currently available freeware solutions that make it impossible to cover everything that the CDF database does right now, even (as far as I know) in MySQL.
> > >
> > > That's it for the moment. We need volunteers to work on some of the exporting issues -- e.g. to work in a more coordinated way toward solving some of the missing features in MySQL, provide some of the missing features for offsite querying of the database and the insertion needed to support SAM, etc. -- let me know if you need a list, and I can provide it if you are interested. We all want to get to the point of analyzing CDF data more smoothly and efficiently, both from off-site and on; I'm just trying to look ahead and see what we can do to avoid a "crash" of the offline database and its exporting system.
> > >
> > > Let me know if I can do anything more to help,
> > >
> > > Alan
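[The "organized export" Alan asks for volunteers on amounts to copying selected tables from the Oracle server into a freeware database. A rough, hedged sketch of such a table-by-table copy using the Python DB-API; sqlite3 plays both the "Oracle source" and the "MySQL target" here so the example runs anywhere, and the table names are invented. In practice the two connections would come from the Oracle and MySQL client libraries, with their own placeholder syntax and type mapping.]

# Rough sketch of a table-by-table export between two DB-API connections.
import sqlite3


def copy_table(src, dst, table: str, batch: int = 1000) -> int:
    """Copy all rows of `table` from `src` to `dst`, creating the target table naively."""
    cur = src.execute(f"SELECT * FROM {table}")
    cols = ", ".join(d[0] for d in cur.description)
    dst.execute(f"CREATE TABLE IF NOT EXISTS {table} ({cols})")
    placeholders = ", ".join("?" for _ in cur.description)
    total = 0
    while rows := cur.fetchmany(batch):
        dst.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
        total += len(rows)
    dst.commit()
    return total


if __name__ == "__main__":
    source = sqlite3.connect(":memory:")        # stand-in for the Oracle server
    target = sqlite3.connect("export_copy.db")  # stand-in for an off-site freeware DB
    source.execute("CREATE TABLE calib_runs (run INTEGER, version INTEGER)")
    source.executemany("INSERT INTO calib_runs VALUES (?, ?)", [(1001, 1), (1002, 1)])
    for t in ["calib_runs"]:                    # invented table list
        print(t, copy_table(source, target, t), "rows copied")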
> > > ---------------------------------------------------------------------------
> > >
> > > > Date: Thu, 15 Mar 2001 08:57:08 -0600
> > > > To: r.stdenis@physics.gla.ac.uk
> > > > Cc: Stefano Belforte, Igor Gorelov, franco.semeria@bo.infn.it
> > > > From: Jack Cranshaw
> > > > Subject: Re: DB export to Italy
> > > >
> > > > Stefano,
> > > >
> > > > Have you tried just using the Oracle client which you get distributed automatically with the offline software? If you've had problems, then please send us the numbers on performance, or lack thereof. But I would actually push people to first try just using the database at Fermilab.
> > > >
> > > > One indicator that this may serve well is that Vladimir, who works with silicon, now does his work on a machine at Texas Tech because the PC there is faster than anything available at Fermilab, and although the network connection is poor, the database access is not noticeably different from at Fermilab.
> > > >
> > > > We're slowly getting a handle on the export, but I don't see a viable export of all the pieces of the database that you need for analysis for at least one month. Due to the chronically undermanned effort given to the database software, things move slowly. Offers of help are greatly appreciated.
> > > >
> > > > Cheers,
> > > >
> > > > Jack
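[The "numbers on performance" Jack asks for could be gathered with something as simple as timing a representative query a few times from the remote site. A small sketch, again with sqlite3 standing in for the Oracle client distributed with the offline software, and an arbitrary placeholder query.]

# Tiny sketch for collecting timing numbers: run a representative query several
# times and report the spread. sqlite3 stands in for the Oracle client.
import sqlite3
import time


def time_query(dsn: str, sql: str, repeats: int = 5) -> list[float]:
    """Return wall-clock times (seconds) for `repeats` executions of `sql`."""
    times = []
    with sqlite3.connect(dsn) as conn:
        for _ in range(repeats):
            start = time.perf_counter()
            conn.execute(sql).fetchall()
            times.append(time.perf_counter() - start)
    return times


if __name__ == "__main__":
    samples = time_query(":memory:", "SELECT 1")
    print(f"min {min(samples):.6f}s  max {max(samples):.6f}s  "
          f"mean {sum(samples) / len(samples):.6f}s over {len(samples)} tries")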
> > > > "Rick St. Denis" wrote:
> > > > >
> > > > > Dear Stefano
> > > > > I have not heard from Mark for some time and have lost touch with operations of the database, as I have been teaching since February. So the proposals and intentions for export were fine, but needed action from proponents. Jack has had to put the second commissioning run (er, I mean Run II) as his first and only priority. So I am frankly not sure what happened to export and could not chase it.
> > > > >
> > > > > We can mumble lots about who has responsibility to provide what, and which committee wants what thing, but if you really want to get the full CDF DB, then I think we need someone who is doing what Igor Gorelov of New Mexico has done. He has studied Oracle, having taken official courses, set up a server on a Linux box at New Mexico and started replication to it. He is now going to FNAL for 6 months, having realized how much he needs to know to actually get a server working and how much he can learn from the Fermilab Computing Division people, like Nelly Stanfield and others.
> > > > >
> > > > > But even then, you have a whole new problem: the network connectivity. The fact is that when we modify the schema -- and we do this more than we would have liked -- we have to wipe out the replication and refresh it all. As the database has grown, this becomes quite demanding!
> > > > >
> > > > > Also, when you talk about the full database, you probably don't want every silicon raw calibration. But to pick and choose, you need Mark's solution. If you go with the Oracle solution, you are looking at a lot of work and resource usage. Kinda like buying a Jaguar.
> > > > >
> > > > > I will therefore try to find out what Mark is up to, but suspect I will not get a lot of happiness.
> > > > > cheers
> > > > > rick
> > > > >
> > > > > *****************************
> > > > > Dr. Richard St. Denis, Dept. of Phys. & Astr., Glasgow University
> > > > > Glasgow G12 8QQ; United Kingdom; UK Phone: [44] (141) 330 5887
> > > > > UK Fax: [44] (141) 330 5881
> > > > > ======================================
> > > > > FermiLab PO Box 500; MS 318 Batavia, Illinois 60510 USA
> > > > > FNAL Phone: [00] 1-630-840-2943   FNAL Fax: [00] 1-630-840-2968
> > > > > Sidet: [00] 1-630-840-8630   FCC: [00] 1-630-840-3707
> > > > >
> > > > > On Wed, 14 Mar 2001, Stefano Belforte wrote:
> > > > >
> > > > > > Rick,
> > > > > > given the kind of network connectivity we have among Italian sites, and between Italy and FNAL, it makes sense for us to look for a solution that exports the full CDF DB to just one place; processes running all over Italy could then access this common location rather than having local copies. This would hopefully simplify our work over there. While this sort of naturally blends (I think) with a full-blown Oracle replica, it may still be a work-saving solution also for the freeware export. Franco kindly offered to set this up on a central server in Bologna if/when a proper technical proposal is singled out.
> > > > > > We very much need your opinion and input on this and would definitely like to talk about it sometime in the next couple of weeks, when Franco will also be at Fermilab (I am at FNAL all month). We can certainly start by e-mail, whichever you find more convenient.
> > > > > >
> > > > > > Looking forward to hearing from you
> > > > > > Stefano
> > > > > > --
> > > > > > Stefano Belforte - I.N.F.N.        tel: +39 040 375-6261 (fax: 375-6258)
> > > > > > Area di Ricerca - Padriciano 99    e-mail: Stefano.Belforte@ts.infn.it
> > > > > > 34012 TRIESTE TS - Italy           Web: http://www.ts.infn.it/~belforte
> > > > > > at Fermilab: CDF trailers 169-N    tel: (630)840-8698