Updated: 24-SEP-2003 (Use your browser's Reload button to ensure you're viewing the most recent version)
VMS73_MEM_CHAN-V0200 Alpha V7.3 Memory Channel ECO Summary
New Kit Date: 22-SEP-2003
Modification Date: None
Modification Type: NEW KIT
Copyright (c) Hewlett-Packard Company 2002,2003. All rights reserved.
OP/SYS: OpenVMS Alpha V7.3
COMPONENT: Memory Channel
SOURCE: Hewlett-Packard Company
ECO INFORMATION:
ECO Kit Name: VMS73_MEM_CHAN-V0200
DEC-AXPVMS-VMS73_MEM_CHAN-V0200--4.PCSI
ECO Kits Superseded by This ECO Kit: Yes
ECO Kit Approximate Size: 720 Blocks
Kit Applies To: OpenVMS Alpha V7.3
System/Cluster Reboot Necessary: Yes
Rolling Re-boot Supported: Yes
Installation Rating: INSTALL_2
2 : To be installed by all customers using the following
feature(s):
Memory Channel
Kit Dependencies:
The following remedial kit(s), or later, must be installed BEFORE
installation of this, or any required kit:
VMS73_PCSI-V0100
VMS73_UPDATE-V0100
In order to receive all the corrections listed in this
kit, the following remedial kits should also be installed:
None
ECO KIT SUMMARY:
FILES PATCHED OR REPLACED:
o [SYS$LDR]SYS$MCDRIVER.EXE (new image)
Image Identification Information
image name: "SYS$MCDRIVER"
image file identification: "X-59"
image file build identification: "X91Y-0060010000"
link date/time: 14-MAY-2002 07:04:33.89
linker identification: "A11-50"
o [SYS$LDR]SYS$PMDRIVER.EXE (new image)
Image Identification Information
image name: "SYS$PMDRIVER"
image file identification: "X-31"
image file build identification: "X91Y-0060010010"
link date/time: 12-MAY-2003 15:51:28.65
linker identification: "A11-50"
PROBLEMS ADDRESSED IN THIS KIT
New problems addressed in the VMS73_MEM_CHAN-V0200 kit
o PMDRIVER FORK_THREAD TQE DOUBLE-INSERT FIX
After installation of the VMS73_MEM_CHAN-V0100 ECO kit,
systems may hang when using the Memory Channel SCS-port.
The system will hang and not crash, requiring manual
intervention and a system-HALT (Console ^P) to recover.
This hang only occurs if there is high SCS-data-transfer
activity (MSCP/TMSCP disk/tape serving) with high IPL-8
fork latency on the Memory Channel target node.
A forced operator crash-dump and analysis will reveal the
OpenVMS EXEC looping within the following routines:
+ SYSTEM_PRIMITIVE*.EXE: EXE$SWTIMER_FORK
Primary SMP CPU stuck scanning EXE$GL_TQFL
TQE-queue; check PCs on CPU-0 stack.
+ SYS$PMDRIVER.EXE: PM$COMQ_RETRY
V7.2-2: TQE$L_FPC: SYS$PMDRIVER+13CC0
SDA> FORMAT/TYPE=TQE @EXE$GL_TQFL
SDA> FORMAT/TYPE=TQE @.
SDA> REPEAT ..........
The OpenVMS EXE$GL_TQFL TQE-timer-queue will be
corrupted, typically with the first TQE linked back to
itself:
+ SDA> VAL QUE EXE$GL_TQFL
Occasionally, there will be an ACCVIO within
TIMESCHDL_xxx (SYSTEM_PRIMITIVES) while servicing
TQE-queue.
Images Affected:[SYS$LDR]SYS$PMDRIVER.EXE
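The effect of a TQE linked back to itself can be illustrated with a
small model. The sketch below is not SDA and does not read a dump; it
assumes a hypothetical dictionary that maps each queue entry address to
its forward link, and shows why a scan that follows forward links until
it returns to the queue header never terminates on such a queue.

```python
def validate_queue(header, flink, max_entries=10000):
    """Walk forward links from the queue header (toy model, not SDA).

    header      -- address of the queue header (hypothetical value)
    flink       -- dict mapping each address to its forward link
    max_entries -- safety limit on the number of entries walked

    Returns the number of entries when the walk gets back to the header.
    Raises ValueError when an entry is visited twice, which means the
    queue contains a loop that bypasses the header.
    """
    seen = set()
    count = 0
    addr = flink[header]
    while addr != header:
        if addr in seen:
            raise ValueError(f"queue corrupted: entry {addr:#x} revisited")
        if count >= max_entries:
            raise ValueError("queue too long or corrupted")
        seen.add(addr)
        count += 1
        addr = flink[addr]
    return count


# Healthy queue: header -> 0x2000 -> 0x3000 -> header
HEALTHY = {0x1000: 0x2000, 0x2000: 0x3000, 0x3000: 0x1000}

# Corrupted queue: the first entry is linked back to itself, so a scan
# that only tests "addr != header" would spin on 0x2000 forever.
CORRUPTED = {0x1000: 0x2000, 0x2000: 0x2000}
```

With these made-up addresses, validate_queue(0x1000, HEALTHY) returns 2,
while validate_queue(0x1000, CORRUPTED) raises ValueError at entry
0x2000.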
Problems addressed in the VMS73_MEM_CHAN-V0100 kit
o Memory Channel virtual-hub (VHUB) can fail to come
"ONLINE"
1. A Memory Channel virtual-hub (VHUB) will fail to come
"ONLINE" and complete SCS virtual-circuit link-up if the
Memory Channel VHUB VH0/Master node is not booted
before the VHUB VH1/Slave Memory Channel node.
2. If a VH0/Master Memory Channel node crashes and/or
reboots while the VH1/Slave Memory Channel node
remains running, the Memory Channel link will fail
and MCA0 (and MCB0, if applicable) will remain
"OFFLINE" on both VHUB Memory Channel nodes.
This MCx0 "OFFLINE" problem may also occur during
MCA0/MCB0 adapter/link error-handling/recovery.
The following symptoms are manifestations of this MC VHUB
BOOT "OFFLINE" problem:
OPA0: console errors:
--------------------
%MCA0 CPU00: 19-SEP-2000 04:17:50 Slave but adapter_ok
off, retrying.
%MCA0 CPU00: 19-SEP-2000 04:17:50 MC re-init 5 second timer.
%MCA0 CPU00: 19-SEP-2000 04:17:55 Slave but adapter_ok
off, retrying.
%MCA0 CPU00: 19-SEP-2000 04:17:55 MC re-init 5 second timer.
.
.
.
ON REMOTE NODE ATTEMPTING MC SW INIT .........
%MCA0 CPU00: 19-SEP-2000 04:27:50 node state retries exceeded
DCL SHOW DEVICE command output:
-------------------------------
DCL SHOW DEVICE reports MCA0: and PMA0: (and MCB0:/PMB0:, if
present) as OFFLINE:
$ SHOW DEVICE MC
Device                  Device           Error
 Name                   Status           Count
MCA0:                   Offline              2
MCB0:                   Offline             16
$ SHOW DEVICE PM
Device                  Device           Error
 Name                   Status           Count
PMA0:                   Offline              0
PMB0:                   Offline              0
Images Affected:[SYS$LDR]SYS$MCDRIVER.EXE
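When SHOW DEVICE output has been captured to a file, the offline
Memory Channel devices can be picked out mechanically. The following is
a minimal sketch that assumes the three-column layout shown above
(device name ending in a colon, status, error count); the function name
and the sample text are illustrative only.

```python
def offline_devices(show_device_output):
    """Return the names of devices whose status column reads Offline.

    Assumes each device row looks like "MCA0: Offline 2"; header rows
    are skipped because their first field does not end with a colon.
    """
    offline = []
    for line in show_device_output.splitlines():
        fields = line.split()
        if (len(fields) >= 2
                and fields[0].endswith(":")
                and fields[1].lower() == "offline"):
            offline.append(fields[0].rstrip(":"))
    return offline


# Sample text copied from the SHOW DEVICE MC output in this summary.
SAMPLE_OUTPUT = """\
Device Device Error
Name Status Count
MCA0: Offline 2
MCB0: Offline 16
"""

if __name__ == "__main__":
    print(offline_devices(SAMPLE_OUTPUT))
```

For the sample above the function returns ['MCA0', 'MCB0'].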
o MC_INCONSTATE (SYS$MCDRIVER) bugcheck
An MC_INCONSTATE (SYS$MCDRIVER) bugcheck may occur during
local/remote Memory Channel node reboot or Memory Channel
adapter/link error-recovery. This bugcheck can occur
regardless of the Memory Channel hub configuration: VHUB
or real-HUB. The MC_INCONSTATE bugcheck will typically
occur when a "nested error (MCDRIVER-internal or
MC-adapter HW-error)" is encountered while recovering
from a Memory Channel link error or a local/remote
Memory Channel node crash/reboot.
The "MC_INCONSTATE" bugcheck is easy to recognize and is
nearly always caused by this "nested error-handling" bug.
A typical MCx0: error-log event sequence and SDA> crash
summary are shown below:
MCx0: ERROR-LOG SUMMARY: Unsuccessful events:
---------------------------------------------
MCB0 - Hardware error, reinitializing.
MCB0 -
Node 0: State: Uninitialized
Node 1: State: Uninitialized
MCB0 - Memory channel link online failure 2
MCB0 - We shouldn't be here.
CRASH - MC_INCONSTATE
Crashdump Summary Information:
------------------------------
Bugcheck Type: MC_INCONSTATE, Fatal error
detected by Memory Channel
Failing PC: FFFFFFFF.E2983A44 SYS$MCDRIVER+0BA44
Failing PS: 30000000.00000804
Module: SYS$MCDRIVER (Link Date/Time:
29-DEC-1999 04:09:37.99)
Offset: 0000BA44
Images Affected:[SYS$LDR]SYS$MCDRIVER.EXE
o Memory Channel Receive channel (RX_MESS_CHAN) message
processing may hang
Memory Channel Receive channel (RX_MESS_CHAN) message
processing may hang after processing 512 RX_MESS_CHAN
messages during a single fork-thread
([MEM_CHAN]MC$HANDLE_MESS_CHAN_INT routine). This could
occur with heavy Memory Channel SCS-traffic and high
IPL-8 fork-thread scheduling latency. A Memory Channel
RX_MESS_CHAN message-handling hang will lead to
CNXMGR/LOCK_MGR stalls (and potential cluster hangs) as
well as SCS "virtual-circuit timeouts".
OPA0: CONSOLE PM/MC ERROR MESSAGES:
-----------------------------------
%PMA0 CPU00: ... MC$_CHAN_QUE_EMPTY
channel = 541C8 ppd = 83DD4CC0
%PMA0 CPU00: ... stall state CLEAR
channel = 541C8 ppd = 83DD4CC0
%MCA0 CPU00: ... Timeslice exceeded
while in workque for node RM763A
%MCA0 CPU00: ... Timeslice exceeded while in workque
for node RM763A
%MCA0 CPU00: ... Timeslice exceeded while in workque
for node RM763A
%PMA0, Virtual Circuit Timeout - REMOTE PORT xxxx
SCS VC-TIMEOUT ERRLOG ENTRY:
----------------------------
.
.
.
Error Type/SubType x4009 Signaled via Packet, Virtual
Circuit Timeout.
The "... Timeslice exceeded" error may continue to occur
after this fix is applied. However, MC RX_MESS_CHAN
processing will no longer hang after this event.
Images Affected:[SYS$LDR]SYS$MCDRIVER.EXE
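Because "Timeslice exceeded" messages may still appear after the fix,
it can help to separate them from the virtual circuit timeouts that
indicate an actual hang. The sketch below counts both message types per
device in a captured console log. It assumes each message starts with
"%" followed by the device name, as in the excerpt above; the function
name is hypothetical.

```python
import re
from collections import Counter

# A console message line starts with "%" and a device name such as MCA0.
MESSAGE = re.compile(r"^%(?P<device>[A-Z]+\d+)\b")


def summarize_console(log_text):
    """Count timeslice warnings and VC timeouts per device."""
    timeslice = Counter()
    vc_timeouts = Counter()
    for line in log_text.splitlines():
        match = MESSAGE.match(line.strip())
        if not match:
            continue  # continuation line of a wrapped message
        device = match.group("device")
        if "Timeslice exceeded" in line:
            timeslice[device] += 1
        elif "Virtual Circuit Timeout" in line:
            vc_timeouts[device] += 1
    return timeslice, vc_timeouts


# Sample text copied from the console excerpt in this summary.
SAMPLE_LOG = """\
%PMA0 CPU00: ... MC$_CHAN_QUE_EMPTY
channel = 541C8 ppd = 83DD4CC0
%PMA0 CPU00: ... stall state CLEAR
channel = 541C8 ppd = 83DD4CC0
%MCA0 CPU00: ... Timeslice exceeded
while in workque for node RM763A
%MCA0 CPU00: ... Timeslice exceeded while in workque
for node RM763A
%MCA0 CPU00: ... Timeslice exceeded while in workque
for node RM763A
%PMA0, Virtual Circuit Timeout - REMOTE PORT xxxx
"""

if __name__ == "__main__":
    warnings, timeouts = summarize_console(SAMPLE_LOG)
    print(dict(warnings), dict(timeouts))
```

For the sample excerpt this yields three timeslice warnings on MCA0 and
one virtual circuit timeout on PMA0.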
o MCDRIVER enters an infinite Hardware/Software
initialization error-retry loop
Following a boot-time Memory Channel C
unit-init/self-test "LOOPBACK WRITE TEST" failure, which
indicates a Memory Channel adapter PCI-DMA error, the
MCDRIVER will enter an infinite HW/SW initialization
error-retry loop. The following OPA0:/console errors
will be issued at 5-second intervals, changing to 10-minute
intervals after 20 retries:
%MCA0 CPU00: ... MC loopback write interrupt test failed.
%MCA0 CPU00: ... Couldn't get mgmt lock.
%MCA0 CPU00: ... ERR - ucb offline and adapter not crashing .
%MCA0 CPU00: ... Couldn't get mgmt lock.
%MCA0 CPU00: ... ERR - ucb offline and adapter not crashing .
%MCA0 CPU00: ... Couldn't get mgmt lock.
%MCA0 CPU00: ... ERR - ucb offline and adapter not crashing .
Note: The first error message occurs on the first pass
only.
Images Affected:[SYS$LDR]SYS$MCDRIVER.EXE
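The retry cadence described above gives a rough way to estimate how
long a node has been stuck in the loop from the number of retries seen
on the console. This is a sketch under one assumption that the text
does not spell out: the first 20 retries are taken to be 5 seconds
apart and every later retry 10 minutes apart.

```python
def seconds_until_retry(n, fast_interval=5, fast_retries=20,
                        slow_interval=600):
    """Approximate seconds elapsed when retry number n is reported.

    Assumes the first fast_retries retries are fast_interval seconds
    apart and each later retry is slow_interval seconds apart.
    """
    if n < 0:
        raise ValueError("retry count must not be negative")
    fast = min(n, fast_retries)
    slow = max(n - fast_retries, 0)
    return fast * fast_interval + slow * slow_interval


if __name__ == "__main__":
    for retries in (1, 20, 21, 26):
        print(retries, seconds_until_retry(retries))
```

Under that assumption, 20 retries correspond to about 100 seconds and
26 retries to about 3700 seconds (a little over an hour).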
o System crashes with a CPUSPINWAIT, CPU spinwait timer
expired bugcheck.
CPUSPINWAIT bugchecks may occur on any GSxxx AlphaServer
platform (GS140, GS80/160/320) with a Memory Channel
adapter. The bugchecks occur due to an error in the
SYS$MCDRIVER "MC$ALLOCATE_MESSAGE" routine performing
Memory Channel message free-queue-header "loopback
WRITE", and an incorrect timer implementation. The
CPUSPINWAIT bugcheck will always involve an SMP$TIMEOUT
acquiring the SCS-spinlock while another SMP-CPU is
holding the SCS-spinlock within the SYS$MCDRIVER /
[MEM_CHAN]MCCHANNELS.C MC$ALLOCATE_MESSAGE routine.
Crashdump Summary Information:
------------------------------
Bugcheck Type: CPUSPINWAIT, CPU spinwait timer expired
Failing PC: FFFFFFFF.8007A384 SMP$TIMEOUT_C+00064
Failing PS: 28000000.00000804
Module: SYSTEM_SYNCHRONIZATION_MIN
Offset: 00000384
NOTE: The "MC loopback write interrupt test failed"
error is typically due to a leftover/stale Memory Channel
adapter PCI-logic error-state that will only clear with a
CONSOLE >>> INIT operation (to perform PCI-bus RESET).
Users who frequently reboot without using the CONSOLE >>>
BOOT_RESET = ON switch (Environment Variable) or without
performing a CONSOLE >>> INIT command are susceptible to
this "MC loopback write test" error.
Images Affected:[SYS$LDR]SYS$MCDRIVER.EXE
o System can crash with an INVPTEFMT, Invalid page table
entry format bugcheck
Any SCS-data-transfer of "0-length", using the
Memory-Channel/MC SCS-port will result in an "INVPTEFMT,
Invalid page table entry format" bugcheck. The bugcheck
occurs within IOC_STD$PTETOPFN, as a result of a call to
IOC_STD$FILSPT from PMDRIVER.C/SETUP_COPY.
Crashdump Summary Information:
------------------------------
Bugcheck Type: INVPTEFMT, Invalid page table
entry format
Current Process: NULL
Current Image: <not available>
Failing PC: FFFFFFFF.800B88FC
IOC_STD$PTETOPFN_C+0008C
Failing PS: 38000000.00000804
Module: IO_ROUTINES (Link Date/Time:
13-DEC-2000 00:39:37.49)
Offset: 000048FC
Images Affected:[SYS$LDR]SYS$PMDRIVER.EXE
o SCS "SEND MESSAGE" and SCS data transfer commands can
stall or hang
SCS "SEND MESSAGE" (typically LOCK_MGR and MSCP disk
commands) and SCS data transfer commands, issued over a
PM/MC SCS virtual circuit (VC), can stall or hang
following exhaustion of Memory Channel
"channel-free-queue" entries. The duration of this stall
or hang is entirely dependent on SCS-sysap traffic and
flow-control (SCS "credit") patterns and will persist
until one of the following occurs:
o SCS VC timeout error closes the VC
o SCS-sysap sends a message that breaks the stalemate
o SCS VC timeout mechanism sends a message that breaks
the stalemate
o PMx0: SCS-port timeout occurs, crashing the MC port
This SYS$PMDRIVER MC-SCS-command processing hang/stall
can occur under the following two conditions:
- HANG: Under heavy and primarily unidirectional
loads.
- STALL: Under more bidirectional loads, stalls
drastically reduce performance over the Memory
Channel VC.
Because this hang/stall will block internode SCS-sysap
cluster communications, symptoms can be obscure and
numerous, or may manifest as:
o Performance degradation over Memory Channel based SCS
VCs
o An SCS VC-timeout
o A LOCK_MGR stall/hang or performance loss
o MSCP served disk command timeouts or disk I/O
slowdowns
o Customer LOCK_MGR-dependent application stalls,
hangs, or slowdowns
Images Affected:[SYS$LDR]SYS$PMDRIVER.EXE
RELATED ARTICLES:
Detailed articles describing the problems listed above may exist in
the OPENVMS database(s). To view these articles,
open the appropriate product database and perform a query using either
of the following search strings: 'VMS73_MEM_CHAN-V0200' or
'VMS73_MEM_CHAN'.
ECO KIT ORDERING INSTRUCTIONS:
If after an evaluation you wish to obtain this kit, request it
electronically using the appropriate Advanced Electronic Services
(AES) Service Tool. If you are not familiar with how to request
kits electronically, open the DIA, WIS or DSNLINK database and
review the article entitled:
[AES] How To Electronically Request ECO Kits Using Service Tools
INSTALLATION INSTRUCTIONS:
Install this kit with the POLYCENTER Software Installation utility
by logging into the SYSTEM account, and typing the following at the
DCL prompt:
PRODUCT INSTALL VMS73_MEM_CHAN /SOURCE=[location of Kit]
The kit location may be a tape drive, CD, or a disk directory that
contains the kit.
Additional help on installing PCSI kits can be found by typing
HELP PRODUCT INSTALL at the system prompt.
o Scripting of Answers to Installation Questions
During installation, this kit asks several questions that require a
user response. To automate the installation and avoid answering
these questions interactively, create a DCL command procedure that
includes the following definitions and commands:
- $ DEFINE/SYS NO_ASK$BACKUP TRUE
- $ DEFINE/SYS NO_ASK$REBOOT TRUE
- Add the following qualifiers to the PRODUCT INSTALL command and
add that command to the DCL procedure.
/PROD=DEC/BASE=AXPVMS/VER=V2.0
- Deassign the logical names defined above
For example, a sample command file to install the VMS73_MEM_CHAN
kit would be:
$
$ DEFINE/SYS NO_ASK$BACKUP TRUE
$ DEFINE/SYS NO_ASK$REBOOT TRUE
$!
$ PROD INSTALL VMS73_MEM_CHAN/PROD=DEC/BASE=AXPVMS/VER=V2.0
$!
$ DEASSIGN/SYS NO_ASK$BACKUP
$ DEASSIGN/SYS NO_ASK$REBOOT
$!
$ exit
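The /PROD, /BASE and /VER values used above appear to correspond to
fields of the kit file name, DEC-AXPVMS-VMS73_MEM_CHAN-V0200--4.PCSI.
The sketch below splits the name on hyphens and derives the version
string; the field names, and the reading of V0200 as a letter followed
by a two-digit major and two-digit minor number, are assumptions based
on this one file name rather than a statement of the PCSI naming rules.

```python
def parse_pcsi_kit_name(filename):
    """Split a PCSI kit file name into its hyphen-separated fields.

    Assumed layout (inferred from the kit name in this summary):
    producer-base-product-version-update-kittype.PCSI
    """
    stem, _, extension = filename.rpartition(".")
    fields = stem.split("-")
    if len(fields) != 6:
        raise ValueError(f"unexpected kit name layout: {filename}")
    producer, base, product, version, update, kit_type = fields
    # "V0200" is read as letter "V", major "02", minor "00" -> "V2.0"
    display_version = f"{version[0]}{int(version[1:3])}.{int(version[3:5])}"
    return {
        "producer": producer,
        "base": base,
        "product": product,
        "version": display_version,
        "update": update,
        "kit_type": kit_type,
        "extension": extension,
    }


def install_qualifiers(filename):
    """Build the qualifier string used in the sample command file."""
    kit = parse_pcsi_kit_name(filename)
    return f"/PROD={kit['producer']}/BASE={kit['base']}/VER={kit['version']}"


if __name__ == "__main__":
    print(install_qualifiers("DEC-AXPVMS-VMS73_MEM_CHAN-V0200--4.PCSI"))
```

For this kit the helper reproduces the qualifiers shown in the sample
command file: /PROD=DEC/BASE=AXPVMS/VER=V2.0.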
COPYRIGHT AND DISCLAIMER:
(C) Copyright 2003 Hewlett-Packard Development Company, L.P.
Confidential computer software. Valid license from HP and/or its
subsidiaries required for possession, use, or copying.
Consistent with FAR 12.211 and 12.212, Commercial Computer
Software, Computer Software Documentation, and Technical Data for
Commercial Items are licensed to the U.S. Government under
vendor's standard commercial license.
Neither HP nor any of its subsidiaries shall be liable for
technical or editorial errors or omissions contained herein. The
information in this document is provided "as is" without warranty
of any kind and is subject to change without notice. The
warranties for HP products are set forth in the express limited
warranty statements accompanying such products. Nothing herein
should be construed as constituting an additional warranty.
DISCLAIMER OF WARRANTY AND LIMITATION OF LIABILITY
THIS PATCH IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND. ALL
EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES,
INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR
PARTICULAR PURPOSE, OR NON-INFRINGEMENT, ARE HEREBY EXCLUDED TO THE
EXTENT PERMITTED BY APPLICABLE LAW. IN NO EVENT WILL COMPAQ BE
LIABLE FOR ANY LOST REVENUE OR PROFIT, OR FOR SPECIAL, INDIRECT,
CONSEQUENTIAL, INCIDENTAL OR PUNITIVE DAMAGES, HOWEVER CAUSED AND
REGARDLESS OF THE THEORY OF LIABILITY, WITH RESPECT TO ANY PATCH
MADE AVAILABLE HERE OR TO THE USE OF SUCH PATCH.
All trademarks are the property of their respective owners.