HB repair in TS


11 Jan 2001:
HB #9. Start from SS/AM map failure at FNAL. Found by Marco in Dec 2000. Shipped to Trieste around Christmas. Arrived 11 Jan 2001. At FNAL (tested from TS) it was failing hb_quick_test because of an SSMAP test error. Repeated in TS with the same result. The FNAL tests ended with the conclusion in this e-mail:
Investigation on FNAL crate:
   Date:        Tue, 19 Dec 2000 18:35:32 +0100
  From:        Stefano Belforte 
    To:        belforte@ts.infn.it

It looks to me as if address 400200 of the SSmap overwrites 400000
(n.b. 400000 is the base address of the SSmap). Other addresses may
have the same mess.
Using a modified version of hb_menu at b0 in
~belforte/vme/hb/quick_test
in which rnf writes content = offset (in words),

I get:

Hit Buffer: b0svt04 slot 12 : rnf 400000 80
VME address and number of words to write (hex hex): word to write (hex):
0
 word 0  written 128 times. 
Hit Buffer: b0svt04 slot 12 : rnr 400000 10
VME address and number of words to read (hex dec):  Read with status: 0
address 400000: 0 = 00000000 00000000 00000000 00000000
address 400004: 1 = 00000000 00000000 00000000 00000001
address 400008: 2 = 00000000 00000000 00000000 00000010
address 40000c: 3 = 00000000 00000000 00000000 00000011
address 400010: 4 = 00000000 00000000 00000000 00000100
address 400014: 5 = 00000000 00000000 00000000 00000101
address 400018: 6 = 00000000 00000000 00000000 00000110
address 40001c: 7 = 00000000 00000000 00000000 00000111
address 400020: 8 = 00000000 00000000 00000000 00001000
address 400024: 9 = 00000000 00000000 00000000 00001001
Hit Buffer: b0svt04 slot 12 : rnw 400200 4
VME address and number of words to write (hex dec): address 400200: data word (hex)? 22
address 400204: data word (hex)? 23
address 400208: data word (hex)? 24
address 40020c: data word (hex)? 25
 4 words written. 
Hit Buffer: b0svt04 slot 12 : rnr 400000 10
VME address and number of words to read (hex dec):  Read with status: 0
address 400000: 22 = 00000000 00000000 00000000 00100010
address 400004: 23 = 00000000 00000000 00000000 00100011
address 400008: 24 = 00000000 00000000 00000000 00100100
address 40000c: 25 = 00000000 00000000 00000000 00100101
address 400010: 4 = 00000000 00000000 00000000 00000100
address 400014: 5 = 00000000 00000000 00000000 00000101
address 400018: 6 = 00000000 00000000 00000000 00000110
address 40001c: 7 = 00000000 00000000 00000000 00000111
address 400020: 8 = 00000000 00000000 00000000 00001000
address 400024: 9 = 00000000 00000000 00000000 00001001
Hit Buffer: b0svt04 slot 12 :

Repeated Jan 11 in Trieste:
Copied the modified hb_menu etc. here in ~vme/hb/quick_test/: same result. It now looks as if address 400200 overwrites 400000, i.e. VME address bit 9 is bad, i.e. internal address bit 7 to the RAMs (AMAPADD 7?). Will have to check all bits, and the AMMAP as well (800000).
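The bit arithmetic behind this conclusion can be checked with a short sketch. It assumes the maps are addressed as 4-byte words (as the 400000, 400004, ... listing suggests), so the internal address is the VME byte offset shifted right by 2. The function names and the toy RAM model are hypothetical illustrations, not part of hb_menu or svtvme.

```python
def differing_bits(addr_a, addr_b):
    """Return the bit positions where two addresses differ."""
    diff = addr_a ^ addr_b
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


def internal_bit(vme_bit, bytes_per_word=4):
    """Map a VME byte-address bit to the word-address bit seen by the RAM.

    Assumption: 4-byte words, so the two lowest byte-address bits are dropped.
    """
    shift = bytes_per_word.bit_length() - 1
    return vme_bit - shift


class StuckAddressRam:
    """Toy RAM in which one word-address bit always arrives as 0."""

    def __init__(self, base, stuck_bit):
        self.base = base
        self.stuck_bit = stuck_bit
        self.cells = {}

    def _index(self, vme_addr):
        word = (vme_addr - self.base) >> 2
        return word & ~(1 << self.stuck_bit)

    def write(self, vme_addr, value):
        self.cells[self._index(vme_addr)] = value

    def read(self, vme_addr):
        return self.cells.get(self._index(vme_addr), 0)


bits = differing_bits(0x400200, 0x400000)
print("VME bits:", bits, "internal bits:", [internal_bit(b) for b in bits])

# Replay the session from the e-mail: content = offset, then 4 words at 400200.
ram = StuckAddressRam(base=0x400000, stuck_bit=7)
for offset in range(0x80):
    ram.write(0x400000 + 4 * offset, offset)
for i, word in enumerate([0x22, 0x23, 0x24, 0x25]):
    ram.write(0x400200 + 4 * i, word)
readback = [ram.read(0x400000 + 4 * i) for i in range(6)]
print([hex(w) for w in readback])
```

With address bit 7 forced to 0 the toy RAM gives back 22, 23, 24, 25, 4, 5 at 400000, the same pattern seen in the log.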

Write fffffff to 400200, then read 00ffff: all 16 bits of SS_map are affected.
Try AM_MAP: same thing. Writing to 800200 overwrites 800000 (first words at least) on all 16 bits. Same effect on layer 1 (880000), layer 2 (900000) and layer 3 (980000) of the AMmap, since the address buses are:

SSmap+AMly1/3 = Amapad
AMly0/5/6     = Bmapad
AMly2/4/7     = Cmapad
All the mapad buses have problems on bit 7. But it cannot be IVADD 7, since the IDPROM is OK.
mapads come from HRVME: ivadd7 = pin 98
                        Amapad7 = pin 102, Bmapad7=117, Cmapad7=14
test ivadd7 on via next to hrvme pin 98  ==> OK
test Amapad on pin 19 of ssmap low (U26) ==> NO
test Cmapad on via next to hrvme pin 14  ==> NO
test ivadd7 on pin  98 ==> OK
test Cmapad on pin  14 ==> NO
test Amapad on pin 102 ==> NO
test Bmapad on pin 117 ==> NO

==> input port on pin 98 of HRVME is broken. Need to replace HRVME.
Jan 16
Modify test programs to ignore mapad7 problem. In ~belforte/code/svtvme/src/svtvme.c add svtvme_testMemoryHB9 where the same word is written for ivadd7=1 or 0. Now RAM test is OK.
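A minimal sketch of the idea behind such a test, in Python rather than C. The actual svtvme_testMemoryHB9 code is not reproduced here, so the pattern function, the toy RAM and all names below are assumptions. The point is only that if the test word does not depend on address bit 7, two aliased locations receive the same word and the readback comparison passes.

```python
def masked_pattern(word_offset, ignore_bit=7):
    """Test word that is identical for ignore_bit = 0 or 1 in the offset."""
    masked = word_offset & ~(1 << ignore_bit)
    return (masked * 0x9E37) & 0xFFFF  # arbitrary 16-bit scramble of the offset


def plain_pattern(word_offset):
    """Ordinary test word: content = offset, truncated to 16 bits."""
    return word_offset & 0xFFFF


def memory_test(write, read, n_words, make_word):
    """Write the whole range, read it back, return the failing offsets."""
    for offset in range(n_words):
        write(offset, make_word(offset))
    return [off for off in range(n_words) if read(off) != make_word(off)]


# Toy RAM with word-address bit 7 stuck at 0 (hypothetical model of HB #9).
cells = {}


def write(offset, value):
    cells[offset & ~0x80] = value


def read(offset):
    return cells.get(offset & ~0x80, 0)


plain_errors = memory_test(write, read, 0x200, plain_pattern)
masked_errors = memory_test(write, read, 0x200, masked_pattern)
print(len(plain_errors), "errors with plain pattern,",
      len(masked_errors), "with masked pattern")
```

Such a test no longer checks address bit 7 at all, so it is only useful to look for further faults while that bit is known to be broken.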
Try random test: get strange things from HRVME and MLDATA.
Jan 17
Definitely there are other problems. Amapadd<6> in output from HRVME to MLDATA is apparently always 0. Do not understand why this does not affect the SS/AM map test. Also get a wrong calculation of the road parity when Amapadd<6> is set, unless Amapadd<15> is also set. Also get some missing hits in output, in small numbers but in a non-reproducible way. Since all of this could be explained by odd behaviour of HRVME, will replace that chip before spending more time.
2001:
Feb 10:
Board returns from Eclipse after replacing HRVME. SS/AM maps now OK. Passes quick test and HLM test. Intermittent problems in reading Hit and Road Spy Buffers. Reading again gives the same spy buffer content, so it appears that sometimes the write is wrong. After disabling the spy check, passes 150K iterations of random test. Spy pointers are OK, only the content is wrong. Probable cause is in the strobe to the RAM.
Feb 11:
Investigate Spy Buffers. Modify mhb_test_random to skip hb1 hit and road spy buffers and re-enable spy check (mhb_test_random_9.c). No problems. So it really appears that Hit/Road Spy in HB9 have problems. Can't figure out a "rule" for H/R spy errors. Remove board, put clip on Hspy RAM U55 (briefly) and U57 (lengthy) and check with tester the continuity of the control lines (CS_ WE_ OE_) from the Xilinxes to Hit spy RAM U55. All is OK. Put board back in crate. Several errors, also hlm_test fails. Remove clip from U57. All simple tests OK. mhb_test_random_9 still OK. Investigate Hit/Road spy buffers intensively with new ad hoc test (~/vme/hb/quick_test/hb_test_hspy(rspy)(hrspy) ) but find no problem.
Go back to "normal" random test and now it all works, even checking all spy buffers at each iteration. Set to check spy every 2 iterations and let run. Apparently some connection was bad, either on the RAMs or on the Xilinxes or on the vias, and was fixed by pushing with the tester probes. At 19:00 got 3500 iterations OK.
Feb 12:
Halt random test after 217K iterations OK. Remove HB9 to "reheat" solders. All solders on H/R spy control lines on HBSPYBUF and VMEPLD look fine; touch with solder gun the solders for CS_ on HBSPYBUF and the right side (pins 1-14) of U57 (Hspy data 0-6,22). Board still OK. Restart random test on HB#1 + HB#9.
Get intermittent errors on output bit 4 of HB#1, same in output and Ospy, but wrong. Remove HB#1 to "heat solders". Put solder on pins 7-8 of XU4. All solders on HLM chips seem OK. Still, the situation worsens. Now get continuous errors. A simple test sending just one road at a time shows stickiness of out<0>: it takes 2 words to change. HLM test and random test show errors on many more bits as well. Always out=ospy.
More tests show that the stickiness is more general and gets worse for lower data bits:
send just EE on hits, then send roads and look at output:
ROADS      OUT
200000  2001ff  (the first after init)
200000  200000
2fffff  2ff000
2fffff  2ffffe
2fffff  2fffff
2fffff  2fffff
200000  200fff
200000  200001
200000  200000
Also get similar results writing output from VME !!! Lower 12 bits take two writes to change. So it does not seem to be a problem with the GDATA bus. But then, why OUT=SPY ??
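The sticky bits in the table above can be read off by XOR-ing each road sent with the output received. This is only a sketch for reading the table; the values are copied from it and the names are hypothetical.

```python
# (road sent, output read), hex values copied from the ROADS / OUT table.
ROWS = [
    (0x200000, 0x2001FF),  # the first after init
    (0x200000, 0x200000),
    (0x2FFFFF, 0x2FF000),
    (0x2FFFFF, 0x2FFFFE),
    (0x2FFFFF, 0x2FFFFF),
    (0x2FFFFF, 0x2FFFFF),
    (0x200000, 0x200FFF),
    (0x200000, 0x200001),
    (0x200000, 0x200000),
]


def wrong_bits(sent, out):
    """Mask of the bits where the output differs from the road sent."""
    return sent ^ out


masks = [wrong_bits(sent, out) for sent, out in ROWS]
for (sent, out), mask in zip(ROWS, masks):
    print(f"{sent:06x}  {out:06x}  wrong bits: {mask:06x}")

# Every wrong bit falls inside the lower 12 bits.
assert all(mask & ~0xFFF == 0 for mask in masks)
```

After each change of the road, the first output has all lower 12 bits still at the old value, the second has only bit 0 wrong, and the third is correct.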
Feb 13:
Problem seems reproducible now: load 3000 EE on hits. Then send one road at a time and read it. Send 20000 alternating with 200001 and/or 2fffff. It looks like bit 0 always takes 2 cycles to change. Look at Gdata0: always same as output. Look at HRIN4: always OK. Look at DATA0: briefly it looks like putting the probe on that pin fixes the bug, but then the bug stays fixed also after removing the probe. Solders look OK. Look at CLK/NCLK at MLDATA: OK. NRLE, HLE, CLRTAG all seem OK. Can't figure out a cause. Maybe a glitch inside MLDATA? Vcc pin? Problem goes away when pushing the board with a finger; maybe just touching with the probes changes things... At the end of the day the board is usually OK but goes bad when pushed with a finger (any location) or when the rack is gently kicked. Also still have to explain why it was failing also in writing from VME.
Feb 14:
Same as yesterday night. Random test usually runs OK, then fails after 10-20 min. Random test fails as soon as the board is pushed from the side.
Give up hope on HB1 till May. Today tested HB#9 in parallel with #1. It gave no errors.
20:15: start random test on HB#17 and #18.
Feb 14:
Stop after ~200K iterations OK. Go back to HB #9. Start random test with HB #17 and #9. Still read HB spies every 2 iterations.
Feb 19:
Stop tests. 1.32M iterations OK on HB #9 and #17. Will bring them to Fermilab.

18-19 Apr 2002:
Franco Spinella + Stefano
Test new HBs assembled this winter, #19 and #20, with non-IDT tagram, at 20 MHz. HB#20 fails SS_MAP+AM_MAP test. Traced to an address bit stuck at 1. HB #20 has a short on Xilinx HRVME pins 95-94: a small solder ball behind the pins shorts line AMAPADD14 (pin 95) to Vcc (pin 94). Franco manages to lift pin 95 from the PCB, then we removed the short and resoldered the pin. The new soldering is not very strong, needs more professional work. After this hb_quick_test OK.
Both HBs fail hb_hlm_test and hb_random_test. Traced to ENMAP not working: all ENAMx are stuck at 1, as happened before the new enmap firmware. Found no possibility of a programming error: the directory with the PRG files used by Franco to program the Xilinxes in Pisa has only one programming file for ENMAP (the right one: enmap_sb1.prg). Tried changing clock timing to ENMAP:
change DNCLK3-4 delay from -4Tu to 0 ==> WORKS !
Also try changing to -2Tu ==> also works.
Run random test overnight: HB#20 has 0 delay (no jumpers on U66 4F0-1), HB#19 has -2Tu (U66 4F1 mid, 4F0 low) (cfr. HB timing). Runs OK for 380K iterations.
Try 22 MHz, works OK for a few minutes. Stop test to measure clocks.
On roboclock pins (as in HB timing) but also on ENMAP pins: Back to 20 MHz.
HB #20. No jumpers (0 delay)
DNCLK3: U66 pin 11 : -6.5  ENMAP pin  6 : -3
DNCLK4: U66 pin 10 : -6.5  ENMAP pin 28 : -3
Jumper low on 4F0 (-2Tu delay)
DNCLK3: U66 pin 11 : -8.1  ENMAP pin  6 : -4.7
DNCLK4: U66 pin 10 : -8.1  ENMAP pin 28 : -4.7
Also look at HB#19:
-2Tu ==> on ENMAP -5.3 instead of -4.7 on U66 : -8.7 
So HB#19 has DNCLK 0.6 nsec earlier than HB#20.
Both HB 19 and 20 pass random test with -2Tu.
Also record CLK at ENMAP (pin 5): HB#19 ==> -3.6
==> HB#19 On ENMAP CLK(pin5) --> DNCLK (pin 6,28) = -5.3+3.6 = -1.7 nsec 
same measure on HB#20: pin 5 = -3.1
==> HB#20 On ENMAP CLK(pin5) --> DNCLK (pin 6,28) = -4.7+3.1 = -1.6 nsec 
So HB19 and HB20 have the same timing on ENMAP; the previous difference must be on roboclock -> test point. Will only measure on ENMAP.
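The subtraction used above (DNCLK time minus CLK time, both measured at the ENMAP pins) can be written out as a small sketch. The numbers are the -2Tu measurements quoted above, in nsec; the dictionary layout and function name are hypothetical.

```python
# Times in nsec measured at the ENMAP pins with the -2Tu jumper setting.
MEASURED_NS = {
    "HB#19": {"clk_pin5": -3.6, "dnclk_pin6_28": -5.3},
    "HB#20": {"clk_pin5": -3.1, "dnclk_pin6_28": -4.7},
}


def dnclk_relative_to_clk(measurement):
    """DNCLK edge time relative to the CLK edge, both taken at ENMAP."""
    delta = measurement["dnclk_pin6_28"] - measurement["clk_pin5"]
    return round(delta, 1)  # measurements are quoted to 0.1 nsec


relative = {board: dnclk_relative_to_clk(m) for board, m in MEASURED_NS.items()}
for board, delta in relative.items():
    print(f"{board}: CLK -> DNCLK = {delta} nsec")
```

The absolute DNCLK times differ by 0.6 nsec between the two boards, but relative to CLK the difference shrinks to 0.1 nsec, which is why only the ENMAP-relative number is used from here on.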
HB#20 with -4Tu ==> -3.4  fails test
HB#20 with -2Tu ==> -1.6  works
HB#20 with  0Tu ==>   0   works
Now measure on an old HB, that works with -4Tu
HB#9  with -4Tu ==> -3
HB#9  with -2Tu ==> -1.6
In the end cannot find an explanation! Also difficult because HB9 is broken and always fails the test (same with HB1). Try HB20 again to make sure software/cable is OK. HB#20 fails the SS_MAP test again. Traced to a bad DLKO2 chip. Also the HITOVFL error bit is stuck, traced to W0_ and W1_ to tagram being continuously sent, traced to HITWE2 in input to TAG_CTR stuck at 1, again a wrong signal generated by DLKO2 (all DLKO2 inputs are OK). Will replace DLKO2.
9 May 2002:
Franco Spinella + Stefano
Carlo Magazzu' replaced DLKO2 in Pisa. Test HB20 again. SS/AM maps OK. HLM test OK. Fails QuickTest and RandomTest because bit 11 of the Hit/Road Spy Buffer is stuck at 0 when read from VME. Out Spy is OK. Checked the RAM data line with the scope while writing alternating 0/1: it is OK. I.e. when processing data the Spy Buffer RAM data bit is OK, and when reading Spy from VME the data is read OK from the RAM. So the RAM is good and the line is OK. Look at the HRVME pin that drives the IVDATA11 line (pin 44): it is always 0. Data bit 11 is also stuck at 0 when reading the FIFO from VME. The IVDATA11 line in HRVME is only used for reading the Hit/Road FIFO or spy. Test for shorts with tester: none. Carlo Magazzu' lifts pin 44: still it is always 0, while the input from the HRSPY11 line to the HRVME chip is OK. Declare HRVME broken.
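A stuck data bit like this one can be spotted from the readback words alone by AND-ing and OR-ing them together. The sketch below is a generic illustration: the word width of 24 bits and the sample words are made-up assumptions, not values read from HB20.

```python
def stuck_bits(words, width):
    """Return (stuck_at_0, stuck_at_1) bit positions over a set of words.

    A bit never seen at 1 is a stuck-at-0 candidate; a bit never seen
    at 0 is a stuck-at-1 candidate. Only meaningful if the data written
    exercises both values on every bit.
    """
    and_all = (1 << width) - 1
    or_all = 0
    for word in words:
        and_all &= word
        or_all |= word
    stuck_at_0 = [b for b in range(width) if not (or_all >> b) & 1]
    stuck_at_1 = [b for b in range(width) if (and_all >> b) & 1]
    return stuck_at_0, stuck_at_1


# Made-up patterns that toggle every bit, then a simulated readback
# through a path whose bit 11 is forced to 0.
written = [0x000000, 0xFFFFFF, 0x555555, 0xAAAAAA]
read_back = [word & ~(1 << 11) for word in written]

print(stuck_bits(written, 24))    # nothing stuck in the data as written
print(stuck_bits(read_back, 24))  # bit 11 reported as stuck at 0
```

This only locates the bit; whether the fault is in the RAM, the line or the chip driving the VME side still has to be settled with the scope, as done above.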
10 May 2002:
Stefano
Carlo Magazzu' replaced the HRVME chip on HB 20.
11 May 2002:
Stefano
HB 20 now passes all tests in Pisa (run from Trieste, slowly), about 10K iterations of random test OK. Stop and move to Trieste.
13-16 May 2002:
Stefano
Long test of HB20: room temperature, 20 MHz: passes 1.4 million iterations of random test.
16 May 2002:
Stefano
Start double test of HB 19 and HB 20, also set both to 23MHz. Stop by hand on May 21 after 2.8 million iterations. Room temperature, with fans. HB's in slots 18-20.
29 Aug 2002 - at FNAL
Stefano
Try HB 19 at 25 and 24 MHz in SVT test stand. Random test fails after about 100 iterations at 25 MHz and 1000 iterations at 24 MHz, with a few extra or swapped hits in output. The extra output hits are indeed present in input. Everything indicates tag RAM problems. Errors, when they happen, are reproducible: random test gets stuck there. Go back to 23 MHz.
Stefano Belforte
Last modified: Thu Aug 29 17:53:09 CEST 2002