

# System Event Log Troubleshooting Guide for EPSD Platforms Based on Intel<sup>®</sup> Xeon<sup>®</sup> Processor E5 4600/2600/2400/1600/1400 Product Families

Intel order number G90620-002

**Revision 1.1** 

September 2013

**Enterprise Platforms and Services Division – Marketing** 



# **Revision History**

| Date           | Revision<br>Number | Modifications                                             |
|----------------|--------------------|-----------------------------------------------------------|
| January 2013   | 1.0                | Initial release                                           |
| September 2013 | 1.1                | Added MIC Thermal Margin sensors C4 through C7.           |
|                |                    | Added MIC Status sensors A2, A3, A6, and A7.              |
|                |                    | Added voltage sensors EA, EB, EC, ED, and EF.             |
|                |                    | Corrected typographical errors.                           |
|                |                    | Made corrections to Firmware Update Status table.         |
|                |                    | Made corrections to Catastrophic Error Sensor table.      |
|                |                    | Added support for S1400FP, S1400SP, S1600JP, and S4600LH. |

## Disclaimers

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: <u>http://www.intel.com/design/literature</u>.

# Table of Contents

| Section | Title                                                                              |
|---------|------------------------------------------------------------------------------------|
| 1.      | Introduction                                                                       |
| 1.1     | Purpose                                                                            |
| 1.2     | Industry Standard                                                                  |
| 1.2.1   | Intelligent Platform Management Interface (IPMI)                                   |
| 1.2.2   | Baseboard Management Controller (BMC)                                              |
| 1.2.3   | Intel<sup>®</sup> Intelligent Power Node Manager Version 2.0                       |
| 2.      | Basic Decoding of a SEL Record                                                     |
| 2.1     | Default Values in the SEL Records                                                  |
| 2.2     | Notes on SEL Logs and Collecting SEL Information                                   |
| 2.2.1   | Examples of Decoding BIOS Timestamp Events                                         |
| 2.2.2   | Example of Decoding a PCI Express* Correctable Error Events                        |
| 2.2.3   | Example of Decoding a Power Supply Predictive Failure Event                        |
| 3.      | Sensor Cross Reference List                                                        |
| 3.1     | BMC owned Sensors (GID = 0020h)                                                    |
| 3.2     | BIOS POST owned Sensors (GID = 0001h)                                              |
| 3.3     | BIOS SMI Handler owned Sensors (GID = 0033h)                                       |
| 3.4     | Node Manager / ME Firmware owned Sensors (GID = 002Ch or 602Ch)                    |
| 3.5     | Microsoft* OS owned Events (GID = 0041)                                            |
| 3.6     | Linux* Kernel Panic Events (GID = 0021)                                            |
| 4.      | Power Subsystems                                                                   |
| 4.1     | Threshold-based Voltage Sensors                                                    |
| 4.2     | Voltage Regulator Watchdog Timer Sensor                                            |
| 4.2.1   | Voltage Regulator Watchdog Timer Sensor – Next Steps                               |
| 4.3     | Power Unit                                                                         |
| 4.3.1   | Power Unit Status Sensor                                                           |
| 4.3.2   | Power Unit Redundancy Sensor                                                       |
| 4.3.3   | Node Auto Shutdown Sensor                                                          |
| 4.4     | Power Supply                                                                       |
| 4.4.1   | Power Supply Status Sensors                                                        |
| 4.4.2   | Power Supply Power In Sensors                                                      |
| 4.4.3   | Power Supply Current Out % Sensors                                                 |
| 4.4.4   | Power Supply Temperature Sensors                                                   |
| 4.4.5   | Power Supply Fan Tachometer Sensors                                                |
| 5.      | Cooling Subsystem                                                                  |
| 5.1     | Fan Sensors                                                                        |
| 5.1.1   | Fan Tachometer Sensors                                                             |
| 5.1.2   | Fan Presence and Redundancy Sensors                                                |
| 5.2.1   | Threshold-based Temperature Sensors                                                |
| 5.2.2   | Thermal Margin Sensors                                                             |
| 5.2.3   | Processor Thermal Control Sensors                                                  |
| 5.2.4   | Processor DTS Thermal Margin Sensors                                               |
| 5.2.5   | Discrete Thermal Sensors                                                           |
| 5.2.6   | DIMM Thermal Trip Sensors                                                          |
| 5.3     | System Air Flow Monitoring Sensor                                                  |
| 6.      | Processor Subsystem                                                                |
| 6.1     | Processor Status Sensor                                                            |
| 6.2     | Catastrophic Error Sensor                                                          |
| 6.3     | CPU Missing Sensor                                                                 |
| 6.3.1   | CPU Missing Sensor – Next Steps                                                    |
| 6.4     | Quick Path Interconnect Sensors                                                    |
| 6.4.1   | QPI Link Width Reduced Sensor                                                      |
| 6.4.2   | QPI Correctable Error Sensor                                                       |
| 6.4.3   | QPI Fatal Error and Fatal Error #2                                                 |
| 6.5     | Processor ERR2 Timeout Sensor                                                      |
| 6.5.1   | Processor ERR2 Timeout – Next Steps                                                |
| 6.6     | Processor MSID Mismatch Sensor                                                     |
| 6.6.1   | Processor MSID Mismatch Sensor – Next Steps                                        |
| 7.      | Memory Subsystem                                                                   |
| 7.1     | Memory RAS Configuration Status                                                    |
| 7.2     | Memory RAS Mode Select                                                             |
| 7.3     | Mirroring Redundancy State                                                         |
| 7.3.1   | Mirroring Redundancy State Sensor – Next Steps                                     |
| 7.4     | Sparing Redundancy State                                                           |
| 7.4.1   | Sparing Redundancy State Sensor – Next Steps                                       |
| 7.5     | ECC and Address Parity                                                             |
| 7.5.1   | Memory Correctable and Uncorrectable ECC Error                                     |
| 7.5.2   | Memory Address Parity Error                                                        |
| 8.      | PCI Express* and Legacy PCI Subsystem                                              |
| 8.1     | PCI Express* Errors                                                                |
| 8.1.1   | Legacy PCI Errors                                                                  |
| 8.1.2   | PCI Express* Fatal Errors and Fatal Error #2                                       |
| 8.1.3   | PCI Express* Correctable Errors                                                    |
| 9.      | System BIOS Events                                                                 |
| 9.2.1   | System Firmware Progress (Formerly Post Error) – Next Steps                        |
| 10.     | Chassis Subsystem                                                                  |
| 10.1    | Physical Security                                                                  |
| 10.1.1  | Chassis Intrusion                                                                  |
| 10.1.2  | LAN Leash Lost                                                                     |
| 10.2    | FP (NMI) Interrupt                                                                 |
| 10.2.1  | FP (NMI) Interrupt – Next Steps                                                    |
| 10.3    | Button Sensor                                                                      |
| 11.     | Miscellaneous Events                                                               |
| 11.1    | IPMI Watchdog                                                                      |
| 11.2    | SMI Timeout                                                                        |
| 11.2.1  | SMI Timeout – Next Steps                                                           |
| 11.3    | System Event Log Cleared                                                           |
| 11.4    | System Event – PEF Action                                                          |
| 11.4.1  | System Event – PEF Action – Next Steps                                             |
| 11.5    | BMC Watchdog Sensor                                                                |
| 11.5.1  | BMC Watchdog Sensor – Next Steps                                                   |
| 11.6    | BMC FW Health Sensor                                                               |
| 11.6.1  | BMC FW Health Sensor – Next Steps                                                  |
| 11.7    | Firmware Update Status Sensor                                                      |
| 11.8    | Add-In Module Presence Sensor                                                      |
| 11.8.1  | Add-In Module Presence – Next Steps                                                |
| 11.9    | Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor Management Sensors              |
| 11.9.1  | Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor (MIC) Thermal Margin Sensors    |
| 11.9.2  | Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor (MIC) Status Sensors            |
| 12.     | Hot-Swap Controller Backplane Events                                               |
| 12.1    | HSC Backplane Temperature Sensor                                                   |
| 12.2    | Hard Disk Drive Monitoring Sensor                                                  |
| 12.3    | Hot-Swap Controller Health Sensor                                                  |
| 12.3.1  | HSC Health Sensor – Next Steps                                                     |
| 13.     | Manageability Engine (ME) Events                                                   |
| 13.1    | ME Firmware Health Event                                                           |
| 13.1.1  | ME Firmware Health Event – Next Steps                                              |
| 13.2    | Node Manager Exception Event                                                       |
| 13.2.1  | Node Manager Exception Event – Next Steps                                          |
| 13.3    | Node Manager Health Event                                                          |
| 13.3.1  | Node Manager Health Event – Next Steps                                             |
| 13.4    | Node Manager Operational Capabilities Change                                       |
| 13.4.1  | Node Manager Operational Capabilities Change – Next Steps                          |
| 13.5    | Node Manager Alert Threshold Exceeded                                              |
| 13.5.1  | Node Manager Alert Threshold Exceeded – Next Steps                                 |
| 14.     | Microsoft Windows* Records                                                         |
| 14.1    | Boot up Event Records                                                              |
| 14.2    | Shutdown Event Records                                                             |
| 14.3    | Bug Check / Blue Screen Event Records                                              |
| 15.     | Linux* Kernel Panic Records                                                        |

# List of Tables

| Table | Title                                                                            |
|-------|----------------------------------------------------------------------------------|
| 1     | SEL Record Format                                                                |
| 2     | Event Request Message Event Data Field Contents                                  |
| 3     | OEM SEL Record (Type C0h-DFh)                                                    |
| 4     | OEM SEL Record (Type E0h-FFh)                                                    |
| 5     | BMC owned Sensors                                                                |
| 6     | BIOS POST owned Sensors                                                          |
| 7     | BIOS SMI Handler owned Sensors                                                   |
| 8     | Management Engine Firmware owned Sensors                                         |
| 9     | Microsoft* OS owned Events                                                       |
| 10    | Linux* Kernel Panic Events                                                       |
| 11    | Threshold-based Voltage Sensors Typical Characteristics                          |
| 12    | Threshold-based Voltage Sensors Event Triggers – Description                     |
| 13    | Threshold-based Voltage Sensors – Next Steps                                     |
| 14    | Voltage Regulator Watchdog Timer Sensor Typical Characteristics                  |
| 15    | Power Unit Status Sensors Typical Characteristics                                |
| 16    | Power Unit Status Sensor – Sensor Specific Offsets – Next Steps                  |
| 17    | Power Unit Redundancy Sensors Typical Characteristics                            |
| 18    | Power Unit Redundancy Sensor – Event Trigger Offset – Next Steps                 |
| 19    | Node Auto Shutdown Sensor Typical Characteristics                                |
| 20    | Power Supply Status Sensors Typical Characteristics                              |
| 21    | Power Supply Status Sensor – Sensor Specific Offsets – Next Steps                |
| 22    | Power Supply Power In Sensors Typical Characteristics                            |
| 23    | Power Supply Power In Sensor – Event Trigger Offset – Next Steps                 |
| 24    | Power Supply Current Out % Sensors Typical Characteristics                       |
| 25    | Power Supply Current Out % Sensor – Event Trigger Offset – Next Steps            |
| 26    | Power Supply Temperature Sensors Typical Characteristics                         |
| 27    | Power Supply Temperature Sensor – Event Trigger Offset – Next Steps              |
| 28    | Power Supply Fan Tachometer Sensors Typical Characteristics                      |
| 29    | Fan Tachometer Sensors Typical Characteristics                                   |
| 30    | Fan Tachometer Sensor – Event Trigger Offset – Next Steps                        |
| 31    | Fan Presence Sensors Typical Characteristics                                     |
| 32    | Fan Presence Sensors – Event Trigger Offset – Next Steps                         |
| 33    | Fan Redundancy Sensors Typical Characteristics                                   |
| 34    | Fan Redundancy Sensor – Event Trigger Offset – Next Steps                        |
| 35    | Temperature Sensors Typical Characteristics                                      |
| 36    | Temperature Sensors Event Triggers – Description                                 |
| 37    | Temperature Sensors – Next Steps                                                 |
| 38    | Thermal Margin Sensors Typical Characteristics                                   |
| 39    | Thermal Margin Sensors Event Triggers – Description                              |
| 40    | Thermal Margin Sensors – Next Steps                                              |
| 41    | Processor Thermal Control Sensors Typical Characteristics                        |
| 42    | Processor Thermal Control Sensors Event Triggers – Description                   |
| 43    | Processor DTS Thermal Margin Sensors Typical Characteristics                     |
| 44    | Discrete Thermal Sensors Typical Characteristics                                 |
| 45    | Discrete Thermal Sensors – Next Steps                                            |
| 46    | DIMM Thermal Trip Typical Characteristics                                        |
| 47    | Processor Status Sensors Typical Characteristics                                 |
| 48    | Processor Status Sensors – Next Steps                                            |
| 49    | Catastrophic Error Sensor Typical Characteristics                                |
| 50    | Catastrophic Error Sensor – Event Data 2 Values – Next Steps                     |
| 51    | CPU Missing Sensor Typical Characteristics                                       |
| 52    | QPI Link Width Reduced Sensor Typical Characteristics                            |
| 53    | QPI Correctable Error Sensor Typical Characteristics                             |
| 54    | QPI Fatal Error Sensor Typical Characteristics                                   |
| 55    | QPI Fatal #2 Error Sensor Typical Characteristics                                |
| 56    | Processor ERR2 Timeout Sensor Typical Characteristics                            |
| 57    | Processor MSID Mismatch Sensor Typical Characteristics                           |
| 58    | Memory RAS Configuration Status Sensor Typical Characteristics                   |
| 59    | Memory RAS Configuration Status Sensor – Event Trigger Offset – Next Steps       |
| 60    | Memory RAS Mode Select Sensor Typical Characteristics                            |
| 61    | Mirroring Redundancy State Sensor Typical Characteristics                        |
| 62    | Sparing Redundancy State Sensor Typical Characteristics                          |
| 63    | Correctable and Uncorrectable ECC Error Sensor Typical Characteristics           |
| 64    | Correctable and Uncorrectable ECC Error Sensor Event Trigger Offset – Next Steps |
| 65    | Address Parity Error Sensor Typical Characteristics                              |
| 66    | Legacy PCI Error Sensor Typical Characteristics                                  |
| 67    | PCI Express* Fatal Error Sensor Typical Characteristics                          |
| 68    | PCI Express* Fatal Error #2 Sensor Typical Characteristics                       |
| 69    | PCI Express* Correctable Error Sensor Typical Characteristics                    |
| 70    | System Event Sensor Typical Characteristics                                      |
| 71    | POST Error Sensor Typical Characteristics                                        |
| 72    | POST Error Codes                                                                 |
| 73    | Physical Security Sensor Typical Characteristics                                 |
| 74    | Physical Security Sensor Event Trigger Offset – Next Steps                       |
| 75    | FP (NMI) Interrupt Sensor Typical Characteristics                                |
| 76    | Button Sensor Typical Characteristics                                            |
| 77    | IPMI Watchdog Sensor Typical Characteristics                                     |
| 78    | IPMI Watchdog Sensor Event Trigger Offset – Next Steps                           |
| 79    | SMI Timeout Sensor Typical Characteristics                                       |
| 80    | System Event Log Cleared Sensor Typical Characteristics                          |
| 81    | System Event – PEF Action Sensor Typical Characteristics                         |
| 82    | BMC Watchdog Sensor Typical Characteristics                                      |
| 83    | BMC FW Health Sensor Typical Characteristics                                     |
| 84    | Firmware Update Status Sensor Typical Characteristics                            |
| 85    | Add-In Module Presence Sensor Typical Characteristics                            |
| 86    | MIC Status Sensors Typical Characteristics                                       |
| 87    | HSC Backplane Temperature Sensor Typical Characteristics                         |
| 88    | HSC Backplane Temperature Sensor – Event Trigger Offset – Next Steps             |
| 89    | Hard Disk Drive Monitoring Sensor Typical Characteristics                        |
| 90    | Hard Disk Drive Monitoring Sensor – Event Trigger Offset – Next Steps            |
| 91    | HSC Health Sensor Typical Characteristics                                        |
| 92    | ME Firmware Health Event Sensor Typical Characteristics                          |
| 93    | ME Firmware Health Event Sensor – Next Steps                                     |
| 94    | Node Manager Exception Sensor Typical Characteristics                            |
| 95    | Node Manager Health Event Sensor Typical Characteristics                         |
| 96    | Node Manager Operational Capabilities Change Sensor Typical Characteristics      |
| 97    | Node Manager Alert Threshold Exceeded Sensor Typical Characteristics             |
| 98    | Boot up Event Record Typical Characteristics                                     |
| 99    | Boot up OEM Event Record Typical Characteristics                                 |
| 100   | Shutdown Reason Code Event Record Typical Characteristics                        |
| 101   | Shutdown Reason OEM Event Record Typical Characteristics                         |
| 102   | Shutdown Comment OEM Event Record Typical Characteristics                        |
| 103   | Bug Check/Blue Screen – OS Stop Event Record Typical Characteristics             |
| 104   | Bug Check/Blue Screen code OEM Event Record Typical Characteristics              |
| 105   | Linux* Kernel Panic Event Record Characteristics                                 |
| 106   | Linux* Kernel Panic String Extended Record Characteristics                       |

# 1. Introduction

The server management hardware on Intel<sup>®</sup> Server Boards and Intel<sup>®</sup> Server Platforms is a vital part of the overall server management strategy. It provides essential information to the system administrator and lets the administrator control the server remotely, even when the operating system is not running.

Intel<sup>®</sup> Server Boards and Intel<sup>®</sup> Server Platforms offer comprehensive hardware- and software-based server management solutions. These features make the servers simple to manage and provide alerts on system events. From entry-level to enterprise systems, good server management is essential to reducing the total cost of ownership.

This *Troubleshooting Guide* is intended to help users better understand the events logged in the System Event Log (SEL) of the Baseboard Management Controller (BMC) on these Intel<sup>®</sup> Server Boards.

A separate *User's Guide* covers general server management and the server management software offered on Intel<sup>®</sup> Server Boards and Intel<sup>®</sup> Server Platforms.

This document currently covers the following boards:

- Intel<sup>®</sup> S1400FP Server Boards
- Intel<sup>®</sup> S1400SP Server Boards
- Intel<sup>®</sup> S1600JP Server Boards
- Intel<sup>®</sup> S2400BB Server Boards
- Intel<sup>®</sup> S2400EP Server Boards
- Intel<sup>®</sup> S2400GP Server Boards
- Intel<sup>®</sup> S2400LP Server Boards
- Intel<sup>®</sup> S2400SC Server Boards
- Intel<sup>®</sup> S2600CO Server Boards
- Intel<sup>®</sup> S2600CP Server Boards
- Intel<sup>®</sup> S2600GZ/S2600GL Server Boards
- Intel<sup>®</sup> S2600IP Server Boards
- Intel<sup>®</sup> S2600JF Server Boards
- Intel<sup>®</sup> S2600WP Server Boards
- Intel<sup>®</sup> S4600LH Server Boards
- Intel<sup>®</sup> W2600CR Workstation Boards

### 1.1 Purpose

This document lists all possible events generated by the Intel platform. Other sources not under Intel's control may also generate events; those events are not described in this document.

### 1.2 Industry Standard

#### 1.2.1 Intelligent Platform Management Interface (IPMI)

The key characteristic of the Intelligent Platform Management Interface (IPMI) is that the inventory, monitoring, logging, and recovery control functions are available independently of the main processors, BIOS, and operating system. Platform management functions can also be made available when the system is in a power-down state.

IPMI works through the BMC, which extends the management capabilities of the server system, operates independently of the main processor, and monitors the on-board instrumentation. Through the BMC, IPMI also allows administrators to control power to the server and to remotely access BIOS configuration and operating system console information.

IPMI defines a common platform instrumentation interface to enable interoperability between:

- The baseboard management controller and chassis
- The baseboard management controller and systems management software
- Servers

IPMI enables the following:

- Common access to platform management information, consisting of:
  - Local access from systems management software
  - Remote access from LAN
  - Inter-chassis access from Intelligent Chassis Management Bus
  - Access from LAN, serial/modem, IPMB, PCI SMBus\*, or ICMB, available even if the processor is down
- The IPMI interface isolates systems management software from hardware.
- Hardware advancements can be made without impacting the systems management software.
- IPMI facilitates cross-platform management software.

You can find more information on IPMI at the following URL: <u>http://www.intel.com/design/servers/ipmi</u>

#### 1.2.2 Baseboard Management Controller (BMC)

A baseboard management controller (BMC) is a specialized microcontroller embedded on most Intel<sup>®</sup> Server Boards. The BMC is the heart of the IPMI architecture and provides the intelligence behind intelligent platform management, that is, the autonomous monitoring and recovery features implemented directly in platform management hardware and firmware.

Different types of sensors built into the computer system report to the BMC on parameters such as temperature, cooling fan speeds, power mode, operating system status, and so on. The BMC monitors the system for critical events by communicating with various sensors on the system board; it sends alerts and logs events when certain parameters exceed their preset thresholds, indicating a potential failure of the system. The administrator can also communicate with the BMC remotely to take corrective action, such as resetting or power cycling the system to get a hung OS running again. These abilities reduce the total cost of ownership of a system.

For Intel<sup>®</sup> Server Boards and Intel<sup>®</sup> Server Platforms, the BMC supports the industry standard *IPMI 2.0 Specification*, enabling you to configure, monitor, and recover systems remotely.

#### 1.2.2.1 System Event Log (SEL)

The BMC provides a centralized, non-volatile repository for critical, warning, and informational system events called the System Event Log or SEL. Having the BMC manage the SEL and logging functions helps ensure that "post-mortem" logging information is available if a failure occurs that disables the system processor(s).

The BMC allows access to the SEL through both in-band and out-of-band mechanisms. Various tools and utilities can be used to access the SEL, including the Intel<sup>®</sup> SELView utility and multiple open-source IPMI tools.

#### 1.2.3 Intel<sup>®</sup> Intelligent Power Node Manager Version 2.0

Intel<sup>®</sup> Intelligent Power Node Manager Version 2.0 (NM) is a platform-resident technology that enforces power and thermal policies for the platform. These policies are applied by exploiting subsystem knobs (such as processor P and T states) that can be used to control power consumption. Intel<sup>®</sup> Intelligent Power Node Manager enables data center power and thermal management by exposing an external interface to management software through which platform policies can be specified. It also enables specific data center power management usage models such as power limiting.

The external management software or the BMC uses the configuration and control commands to configure and control the Intel<sup>®</sup> Intelligent Power Node Manager feature. Because Platform Services firmware does not have any external interface, external commands are first received by the BMC over LAN and then relayed to the Platform Services firmware over the IPMB channel. The BMC acts as a relay and transport conversion device for these commands. For simplicity, the commands from the management console might be encapsulated in a generic CONFIG packet format (configuration data length, configuration data blob) to the BMC so that the BMC does not have to parse the actual configuration data.

The BMC provides the access point for remote commands from external management software and generates alerts to that software. Intel<sup>®</sup> Intelligent Power Node Manager on the Intel<sup>®</sup> Manageability Engine (Intel<sup>®</sup> ME) is an IPMI satellite controller. A mechanism exists to forward commands to Intel<sup>®</sup> ME and to send the response back to the originator. Similarly, events from Intel<sup>®</sup> ME are sent as alerts outside of the BMC.

# 2. Basic Decoding of a SEL Record

The System Event Log (SEL) record format is defined in the *IPMI Specification*. The following section provides a basic definition for each of the fields in a SEL record. For more details, see the *IPMI Specification*.

The definitions for the standard SEL can be found in Table 1.

The definitions for the OEM defined event logs can be found in Table 3 and Table 4.

### 2.1 Default Values in the SEL Records

Unless otherwise noted in the event record descriptions, the following are the default values in all SEL entries:

- Byte [3] = Record Type (RT) = 02h = System event record
- Byte [9:8] = Generator ID = 0020h = BMC Firmware
- Byte [10] = Event Message Revision (ER) = 04h = IPMI 2.0

Table 1. SEL Record Format

| Byte | Field | Description |
|---|---|---|
| 1<br>2 | Record ID<br>(<b>RID</b>) | ID used for SEL Record access. |
| 3 | Record Type<br>(<b>RT</b>) | <ul> <li>[7:0] – Record Type</li> <li>02h = System event record</li> <li>C0h-DFh = OEM timestamped, bytes 8-16 OEM defined (See Table 3)</li> <li>E0h-FFh = OEM non-timestamped, bytes 4-16 OEM defined (See Table 4)</li> </ul> |
| 4<br>5<br>6<br>7 | Timestamp<br>(<b>TS</b>) | Time when event was logged. LS byte first.<br>Example: TS:[29][76][68][4C] = 4C687629h = 1281914409 = Sun, 15 Aug 2010 23:20:09 UTC<br>Note: There are various websites that will convert the raw number to a date/time. |
| 8<br>9 | Generator ID<br>(<b>GID</b>) | RqSA and LUN if event was generated from IPMB. Software ID if event was generated from system software.<br><u>Byte 1</u><br>[7:1] - 7-bit I<sup>2</sup>C Slave Address, or 7-bit system software ID<br>[0] - 0b = ID is IPMB Slave Address; 1b = System software ID<br>Software ID values:<br>0001h - BIOS POST for POST errors, RAS Configuration/State, Timestamp Synch, OS Boot events<br>0033h - BIOS SMI Handler<br>0020h - BMC Firmware<br>002Ch - ME Firmware<br>00C0h - HSC Firmware - HSBP A<br>00C2h - HSC Firmware - HSBP B<br><u>Byte 2</u><br>[7:4] - Channel number. Channel that event message was received over. 0h if the event message was received from the system interface, primary IPMB, or internally generated by the BMC.<br>[3:2] - Reserved. Write as 00b.<br>[1:0] - IPMB device LUN if byte 1 holds Slave Address. 00b otherwise. |
| 10 | EvM Rev<br>(<b>ER</b>) | Event Message format version. 04h = IPMI v2.0; 03h = IPMI v1.0 |
| 11 | Sensor Type<br>(<b>ST</b>) | Sensor Type Code for sensor that generated the event |
| 12 | Sensor #<br>(<b>SN</b>) | Number of sensor that generated the event (From SDR) |
| 13 | Event Dir<br>Event Type<br>(<b>EDIR</b>) | <u>Event Dir</u><br>[7] - 0b = Assertion event.<br>1b = Deassertion event.<br><u>Event Type</u><br>Type of trigger for the event, for example, critical threshold going high, state asserted, and so on. Also indicates class of the event, for example, discrete, threshold, or OEM. The Event Type field is encoded using the Event/Reading Type Code.<br>[6:0] – Event Type Codes<br>01h = Threshold (States = 0x00-0x0b)<br>02h-0Ch = Discrete<br>6Fh = Sensor-Specific<br>70h-7Fh = OEM |
| 14 | Event Data 1<br>(<b>ED1</b>) | Per Table 2 |
| 15 | Event Data 2<br>(<b>ED2</b>) | Per Table 2 |
| 16 | Event Data 3<br>(<b>ED3</b>) | Per Table 2 |
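The byte layout in Table 1 maps directly onto a small decoder. The following Python sketch assumes the record is available as 16 raw bytes in table order (for example, taken from a hex SEL dump) and that the timestamp counts seconds since 1970-01-01 UTC, as in the Timestamp example in Table 1. The `SelRecord` class and `decode_system_event` function are illustrative names, not part of any Intel utility.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class SelRecord:
    """Fields of a system event record (Record Type 02h), named as in Table 1."""
    record_id: int        # RID, bytes 1-2
    record_type: int      # RT, byte 3
    timestamp: datetime   # TS, bytes 4-7
    generator_id: int     # GID, bytes 8-9
    evm_rev: int          # ER, byte 10
    sensor_type: int      # ST, byte 11
    sensor_number: int    # SN, byte 12
    deassertion: bool     # EDIR bit [7]
    event_type: int       # EDIR bits [6:0], the Event/Reading Type Code
    event_data: tuple     # ED1, ED2, ED3 (bytes 14-16), interpreted per Table 2


def decode_system_event(raw: bytes) -> SelRecord:
    """Decode a 16-byte system event record; multi-byte fields are LS byte first."""
    if len(raw) != 16:
        raise ValueError(f"expected 16 bytes, got {len(raw)}")
    if raw[2] != 0x02:
        raise ValueError(f"record type {raw[2]:02X}h is not a system event record")
    # Python indexes are zero-based, so table byte N lives at index N-1.
    seconds = int.from_bytes(raw[3:7], "little")
    return SelRecord(
        record_id=int.from_bytes(raw[0:2], "little"),
        record_type=raw[2],
        timestamp=datetime.fromtimestamp(seconds, tz=timezone.utc),
        generator_id=int.from_bytes(raw[7:9], "little"),
        evm_rev=raw[9],
        sensor_type=raw[10],
        sensor_number=raw[11],
        deassertion=bool(raw[12] & 0x80),
        event_type=raw[12] & 0x7F,
        event_data=(raw[13], raw[14], raw[15]),
    )


if __name__ == "__main__":
    # The BIOS POST timestamp record from section 2.2.1.1, byte for byte.
    rec = decode_system_event(bytes.fromhex("1901 02 57496A4E 0100 04 12 83 6F 05 00 FF"))
    print(f"RID={rec.record_id:04X}h GID={rec.generator_id:04X}h "
          f"ST={rec.sensor_type:02X}h time={rec.timestamp:%Y-%m-%d %H:%M:%S} UTC")
```

Running it prints `RID=0119h GID=0001h ST=12h time=2011-09-09 17:13:59 UTC`; the RID and GID match the values decoded by hand in section 2.2.1.1.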

Table 2: Event Request Message Event Data Field Contents

| Sensor<br>Class | Event Data |
|---|---|
| Threshold | Event Data 1<br>[7:6] - 00b = Unspecified Event Data 2<br>01b = Trigger reading in Event Data 2<br>10b = OEM code in Event Data 2<br>11b = Sensor-specific event extension code in Event Data 2<br>[5:4] - 00b = Unspecified Event Data 3<br>01b = Trigger threshold value in Event Data 3<br>10b = OEM code in Event Data 3<br>11b = Sensor-specific event extension code in Event Data 3<br>[3:0] - Offset from Event/Reading Code for threshold event.<br>Event Data 2 - Reading that triggered event, FFh or not present if unspecified.<br>Event Data 3 - Threshold value that triggered event, FFh or not present if unspecified. If present, Event Data 2 must be present. |
| Discrete | Event Data 1<br>[7:6] - 00b = Unspecified Event Data 2<br>01b = Previous state and/or severity in Event Data 2<br>10b = OEM code in Event Data 2<br>11b = Sensor-specific event extension code in Event Data 2<br>[5:4] - 00b = Unspecified Event Data 3<br>01b = Reserved<br>10b = OEM code in Event Data 3<br>11b = Sensor-specific event extension code in Event Data 3<br>[3:0] - Offset from Event/Reading Code for discrete event state<br>Event Data 2<br>[7:4] - Optional offset from "Severity" Event/Reading Code (0Fh if unspecified).<br>[3:0] - Optional offset from Event/Reading Type Code for previous discrete event state (0Fh if unspecified).<br>Event Data 3 - Optional OEM code. FFh or not present if unspecified. |
| OEM | Event Data 1<br>[7:6] – 00b = Unspecified Event Data 2<br>01b = Previous state and/or severity in Event Data 2<br>10b = OEM code in Event Data 2<br>11b = Reserved<br>[5:4] – 00b = Unspecified Event Data 3<br>01b = Reserved<br>10b = OEM code in Event Data 3<br>11b = Reserved<br>[3:0] – Offset from Event/Reading Type Code<br>Event Data 2<br>[7:4] – Optional OEM code bits or offset from "Severity" Event/Reading Type Code (0Fh if unspecified).<br>[3:0] – Optional OEM code or offset from Event/Reading Type Code for previous event state (0Fh if unspecified).<br>Event Data 3 – Optional OEM code. FFh or not present if unspecified. |
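Table 2 works as a lookup keyed on the sensor class and on Event Data 1 bits [7:6] and [5:4]. The Python sketch below transcribes those columns. It assumes the sensor-specific type code 6Fh follows the discrete rules, which matches how the Power Supply example in section 2.2.3 is decoded. `decode_ed1`, `sensor_class`, and the lookup tables are hypothetical names.

```python
# What Event Data 2 holds, selected by Event Data 1 bits [7:6]; tuples are
# indexed by 00b, 01b, 10b, 11b. Transcribed from Table 2.
ED2_CONTENT = {
    "threshold": ("unspecified", "trigger reading", "OEM code", "sensor-specific extension code"),
    "discrete": ("unspecified", "previous state and/or severity", "OEM code", "sensor-specific extension code"),
    "oem": ("unspecified", "previous state and/or severity", "OEM code", "reserved"),
}
# What Event Data 3 holds, selected by Event Data 1 bits [5:4].
ED3_CONTENT = {
    "threshold": ("unspecified", "trigger threshold value", "OEM code", "sensor-specific extension code"),
    "discrete": ("unspecified", "reserved", "OEM code", "sensor-specific extension code"),
    "oem": ("unspecified", "reserved", "OEM code", "reserved"),
}


def sensor_class(event_type: int) -> str:
    """Map an Event/Reading Type Code (EDIR bits [6:0]) to a Table 2 sensor class."""
    if event_type == 0x01:
        return "threshold"
    # Assumption: sensor-specific (6Fh) is decoded with the discrete rules.
    if 0x02 <= event_type <= 0x0C or event_type == 0x6F:
        return "discrete"
    if 0x70 <= event_type <= 0x7F:
        return "oem"
    raise ValueError(f"event type {event_type:02X}h is not covered by Table 2")


def decode_ed1(event_type: int, ed1: int) -> dict:
    """Split Event Data 1 into the meaning of ED2/ED3 and the event offset [3:0]."""
    cls = sensor_class(event_type)
    return {
        "class": cls,
        "ed2": ED2_CONTENT[cls][(ed1 >> 6) & 0b11],
        "ed3": ED3_CONTENT[cls][(ed1 >> 4) & 0b11],
        "offset": ed1 & 0x0F,
    }
```

For the PCI Express\* record in section 2.2.2 (EDIR 71h, ED1 A0h), `decode_ed1(0x71, 0xA0)` reports an OEM code in both Event Data 2 and Event Data 3 with offset 0h; for the Power Supply record in section 2.2.3 (EDIR 6Fh, ED1 A2h) it reports the same split with offset 2h.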

#### Table 3: OEM SEL Record (Type C0h-DFh)

| Byte | Field | Description |
|---|---|---|
| 1<br>2 | Record ID<br>(<b>RID</b>) | ID used for SEL Record access. |
| 3 | Record Type<br>(<b>RT</b>) | [7:0] – Record Type<br>C0h-DFh = OEM timestamped, bytes 8-16 OEM defined |
| 4<br>5<br>6<br>7 | Timestamp<br>(<b>TS</b>) | Time when event was logged. LS byte first.<br>Example: TS:[29][76][68][4C] = 4C687629h = 1281914409 = Sun, 15 Aug 2010 23:20:09 UTC<br>Note: There are various websites that will convert the raw number to a date/time. |
| 8<br>9<br>10 | Manufacturer ID | LS Byte first. The manufacturer ID is a 20-bit value that is derived from the IANA "Private Enterprise" ID.<br>Most significant four bits = Reserved (0000b).<br>000000h = Unspecified. 0FFFFFh = Reserved.<br>This value is binary encoded.<br>For example, the ID for the IPMI forum is 7154 decimal, which is 1BF2h, which will be stored in this record as F2h, 1Bh, 00h for bytes 8 through 10, respectively. |
| 11<br>12<br>13<br>14<br>15<br>16 | OEM Defined | OEM Defined. This is defined according to the manufacturer identified by the Manufacturer ID field. |

#### Table 4: OEM SEL Record (Type E0h-FFh)

| Byte                                                                 | Field                        | Description                                              |
|----------------------------------------------------------------------|------------------------------|----------------------------------------------------------|
| 1<br>2                                                               | Record ID<br>( <b>RID</b> )  | ID used for SEL Record access.                           |
| 3                                                                    | Record Type<br>( <b>RT</b> ) | [7:0] – Record Type<br>E0h-FFh = OEM system event record |
| 4<br>5<br>6<br>7<br>8<br>9<br>10<br>11<br>12<br>13<br>14<br>15<br>16 | OEM                          | OEM Defined. This is defined by the system integrator.   |
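Tables 3 and 4 differ only in how much of the record stays standard. A minimal Python sketch for splitting an OEM record follows; it assumes the same 16-byte, LS-byte-first layout as Table 1 and masks the manufacturer ID down to its 20 defined bits. `decode_oem_record` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def decode_oem_record(raw: bytes) -> dict:
    """Split a 16-byte OEM SEL record (Table 3 or Table 4) into standard and OEM-defined parts."""
    if len(raw) != 16:
        raise ValueError(f"expected 16 bytes, got {len(raw)}")
    record_id = int.from_bytes(raw[0:2], "little")  # bytes 1-2, LS byte first
    record_type = raw[2]                            # byte 3
    if 0xC0 <= record_type <= 0xDF:
        # Table 3: timestamp in bytes 4-7, manufacturer ID in bytes 8-10, OEM data in bytes 11-16.
        seconds = int.from_bytes(raw[3:7], "little")
        manufacturer_id = int.from_bytes(raw[7:10], "little") & 0xFFFFF  # top four bits reserved
        return {
            "record_id": record_id,
            "kind": "OEM timestamped",
            "timestamp": datetime.fromtimestamp(seconds, tz=timezone.utc),
            "manufacturer_id": manufacturer_id,
            "oem_data": raw[10:16],
        }
    if record_type >= 0xE0:
        # Table 4: bytes 4-16 are defined by the system integrator.
        return {"record_id": record_id, "kind": "OEM non-timestamped", "oem_data": raw[3:16]}
    raise ValueError(f"record type {record_type:02X}h is not an OEM record type")


if __name__ == "__main__":
    # Hypothetical C0h record carrying the IPMI forum ID (F2h, 1Bh, 00h) from Table 3.
    rec = decode_oem_record(bytes.fromhex("0500 C0 2976684C F21B00 000000000000"))
    print(rec["manufacturer_id"])  # 7154
```

The sample record is made up for illustration; only the manufacturer ID bytes and the timestamp bytes come from the tables above.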

### 2.2 Notes on SEL Logs and Collecting SEL Information

Whenever you capture the SEL, always collect both the text (human-readable) version and the hex version. Because some of the data is OEM-specific, some utilities cannot decode the information correctly. In addition, some OEM-specific data contains additional variables that are not decoded at all.

One example of incomplete decoding is the BIOS timestamp synchronization event. This event can be logged by the BIOS during POST, or by the BIOS SMI Handler when the operating system (OS) requests a shutdown or restart. See section 2.2.1 for examples. Most utilities report both cases simply as a BIOS event and do not differentiate between them. Distinguishing them is sometimes useful, however, because it clarifies the sequence of events. For example, if there are multiple sequences of timestamp synchronization events, the Generator ID (0001h for BIOS POST, 0033h for the BIOS SMI Handler) shows whether power was lost after booting to the OS and the system then restarted, whether there were multiple POST events, or whether the OS initiated a restart.

Another example of incomplete decoding involves PCI Express\* errors and some Power Supply events. For PCI Express\* errors, the type of error and the PCI Bus, Device, and Function are all encoded in Event Data 1 through Event Data 3; see section 2.2.2. For Power Supply failure, predictive failure, or configuration error events, Event Data 2 and Event Data 3 hold additional information that describes the power supply's PMBus\* command registers and their values for that particular event; see section 2.2.3.

#### 2.2.1 Examples of Decoding BIOS Timestamp Events

The following are some samples of BIOS timestamp events during POST and during an OS shutdown.

#### 2.2.1.1 BIOS POST Timestamp Events

```
RID[19][01] RT[02] TS[57][49][6A][4E] GID[01][00] ER[04] ST[12] SN[83] EDIR[6F] ED1[05] ED2[00] ED3[FF]

RID (Record ID) = 0119h
RT (Record Type) = 02h = system event record
TS (Timestamp) = 4E6A4957h
GID (Generator ID) = 0001h = BIOS POST
ER (Event Message Revision) = 04h = IPMI v2.0
ST (Sensor Type) = 12h = System Event (From IPMI Specification Table 42-3, Sensor Type Codes)
SN (Sensor Number) = 83h
EDIR (Event Direction/Event Type) = 6Fh; [7] = 0 = Assertion Event
                                         [6:0] = 6Fh = Sensor specific
ED1 (Event Data 1) = 05h = Timestamp Clock Synchronization
ED2 (Event Data 2) = 00h = First in pair

RID[1A][01] RT[02] TS[57][49][6A][4E] GID[01][00] ER[04] ST[12] SN[83] EDIR[6F] ED1[05] ED2[80] ED3[FF]

RID (Record ID) = 011Ah
RT (Record Type) = 02h = system event record
TS (Timestamp) = 4E6A4957h
GID (Generator ID) = 0001h = BIOS POST
ER (Event Message Revision) = 04h = IPMI v2.0
ST (Sensor Type) = 12h = System Event (From IPMI Specification Table 42-3, Sensor Type Codes)
SN (Sensor Number) = 83h
EDIR (Event Direction/Event Type) = 6Fh; [7] = 0 = Assertion Event
                                         [6:0] = 6Fh = Sensor specific
ED1 (Event Data 1) = 05h = Timestamp Clock Synchronization
ED2 (Event Data 2) = 80h = Second in pair
```

#### 2.2.1.2 BIOS SMI Handler Timestamp Events

```
RID[1F][00] RT[02] TS[C3][70][8D][4F] GID[33][00] ER[04] ST[12] SN[83] EDIR[6F] ED1[05] ED2[00] ED3[FF]

RID (Record ID) = 001Fh
RT (Record Type) = 02h = system event record
TS (Timestamp) = 4F8D70C3h
GID (Generator ID) = 0033h = BIOS SMI Handler
ER (Event Message Revision) = 04h = IPMI v2.0
ST (Sensor Type) = 12h = System Event (From IPMI Specification Table 42-3, Sensor Type Codes)
SN (Sensor Number) = 83h
EDIR (Event Direction/Event Type) = 6Fh; [7] = 0 = Assertion Event
                                         [6:0] = 6Fh = Sensor specific
ED1 (Event Data 1) = 05h = Timestamp Clock Synchronization
ED2 (Event Data 2) = 00h = First in pair

RID[20][00] RT[02] TS[C4][70][8D][4F] GID[33][00] ER[04] ST[12] SN[83] EDIR[6F] ED1[05] ED2[80] ED3[FF]

RID (Record ID) = 0020h
RT (Record Type) = 02h = system event record
TS (Timestamp) = 4F8D70C4h
GID (Generator ID) = 0033h = BIOS SMI Handler
ER (Event Message Revision) = 04h = IPMI v2.0
ST (Sensor Type) = 12h = System Event (From IPMI Specification Table 42-3, Sensor Type Codes)
SN (Sensor Number) = 83h
EDIR (Event Direction/Event Type) = 6Fh; [7] = 0 = Assertion Event
                                         [6:0] = 6Fh = Sensor specific
ED1 (Event Data 1) = 05h = Timestamp Clock Synchronization
ED2 (Event Data 2) = 80h = Second in pair
```
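Because most utilities collapse both pairs into a generic BIOS event, a small script working from the hex dump can recover the distinction. The Python sketch below assumes the bracketed `FIELD[xx][yy]` notation used in these examples (multi-byte fields LS byte first). It identifies a timestamp synchronization event by Sensor Type 12h, Event Type 6Fh, and an Event Data 1 offset of 05h, and it deliberately ignores the sensor number, which comes from the platform SDR. The function names are hypothetical.

```python
import re

# One field such as "RID[19][01]": a name followed by one or more [xx] byte groups.
FIELD_RE = re.compile(r"([A-Z][A-Z0-9]*)((?:\[[0-9A-Fa-f]{2}\])+)")
BYTE_RE = re.compile(r"\[([0-9A-Fa-f]{2})\]")

# Generator IDs that log timestamp synchronization events (Table 1).
GENERATORS = {0x0001: "BIOS POST", 0x0033: "BIOS SMI Handler"}


def parse_bracket_record(line: str) -> dict:
    """Turn 'RID[19][01] RT[02] ...' into {'RID': 0x0119, 'RT': 0x02, ...}."""
    fields = {}
    for name, groups in FIELD_RE.findall(line):
        raw = bytes(int(b, 16) for b in BYTE_RE.findall(groups))
        fields[name] = int.from_bytes(raw, "little")  # LS byte first
    return fields


def describe_timestamp_sync(line: str) -> str:
    """Report which BIOS component logged a timestamp sync event and which half of the pair it is."""
    f = parse_bracket_record(line)
    if (f.get("ST") != 0x12
            or (f.get("EDIR", 0) & 0x7F) != 0x6F
            or (f.get("ED1", 0) & 0x0F) != 0x05):
        return "not a timestamp synchronization event"
    gid = f.get("GID", 0)
    source = GENERATORS.get(gid, f"generator {gid:04X}h")
    half = "second in pair" if f.get("ED2", 0) & 0x80 else "first in pair"
    return f"{source}: {half}"
```

Applied to the four records above, it labels them as the first and second halves of a BIOS POST pair and of a BIOS SMI Handler pair, reading Event Data 2 directly from the raw bytes.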

#### 2.2.2 Example of Decoding a PCI Express\* Correctable Error Event

The following is an example of decoding a PCI Express\* correctable error event. This particular event records a receiver error on Bus 0, Device 2, Function 2. Note that correctable errors are normal and acceptable when they occur at a low rate.

```
RID[27][00] RT[02] TS[0A][9B][2E][50] GID[33][00] ER[04] ST[13] SN[05] EDIR[71] ED1[A0] ED2[00] ED3[12]

RID (Record ID) = 0027h
RT (Record Type) = 02h = system event record
TS (Timestamp) = 502E9B0Ah
GID (Generator ID) = 0033h = BIOS SMI Handler
ER (Event Message Revision) = 04 = IPMI v2.0
ST (Sensor Type) = 13h = Critical Interrupt (From IPMI Specification Table 42-3, Sensor Type Codes)
SN (Sensor Number) = 05h
EDIR (Event Direction/Event Type) = 71h; [7] = 0 = Assertion Event
                                         [6:0] = 71h = OEM Specific for PCI Express* correctable errors
ED1 (Event Data 1) = A0h; [7:6] = 10b = OEM code in Event Data 2
                          [5:4] = 10b = OEM code in Event Data 3
                          [3:0] = Event Trigger Offset = 0h = Receiver Error
ED2 (Event Data 2) = 00h; PCI Bus number = 0
ED3 (Event Data 3) = 12h; [7:3] = PCI Device number = 02h
                          [2:0] = PCI Function number = 2
```
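The bit-field decode above can be scripted. The following is a minimal sketch that splits Event Data 1 into its three fields and packs the bus, device, and function into the bus:device.function notation used by tools such as `lspci`. Only offset 0h (Receiver Error) is named, because that is the only offset this example defines. The function name and the returned dictionary keys are hypothetical.

```python
from typing import Dict

# ED1 [3:0] offsets named in this guide's example; others stay numeric.
PCIE_CORRECTABLE_OFFSETS: Dict[int, str] = {0x0: "Receiver Error"}


def decode_pcie_correctable(ed1: int, ed2: int, ed3: int) -> Dict[str, object]:
    """Decode Event Data 1-3 of a PCI Express correctable error (EDIR 71h)."""
    offset = ed1 & 0x0F                # ED1 [3:0] event trigger offset
    bus = ed2                          # ED2 = PCI bus number
    device = (ed3 >> 3) & 0x1F         # ED3 [7:3] = device number
    function = ed3 & 0x07              # ED3 [2:0] = function number
    return {
        "ed2_usage": (ed1 >> 6) & 0x3,  # 10b = OEM code in Event Data 2
        "ed3_usage": (ed1 >> 4) & 0x3,  # 10b = OEM code in Event Data 3
        "offset": offset,
        "error": PCIE_CORRECTABLE_OFFSETS.get(offset, f"offset {offset:X}h"),
        "location": f"{bus:02X}:{device:02X}.{function:X}",
    }


if __name__ == "__main__":
    print(decode_pcie_correctable(0xA0, 0x00, 0x12))
```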

#### 2.2.3 Example of Decoding a Power Supply Predictive Failure Event

The following is an example of decoding a Power Supply predictive failure event. In this example, Power Supply 1 detected an AC power loss event, with both the input under-voltage warning and the input under-voltage fault bits set. In most cases, this means that the AC input dipped below the minimum warning and fault thresholds for over 20 milliseconds, but the system remained powered on. If these events continue to occur, check the power source.

```
RID[5D][00] RT[02] TS[D3][B1][AE][4E] GID[20][00] ER[04] ST[08] SN[50] EDIR[6F] ED1[A2] ED2[06] ED3[30]
         RID (Record ID) = 005Dh
         RT (Record Type) = 02h = system event record
         TS (Timestamp) = 4EAEB1D3h
         GID (Generator ID) = 0020h = BMC
         ER (Event Message Revision) = 04 = IPMI v2.0
         ST (Sensor Type) = 08h = Power Supply (From IPMI Specification Table 42-3, Sensor Type Codes)
         SN (Sensor Number) = 50h = Power Supply 1
         EDIR (Event Direction/Event Type) = 6Fh; [7] = 0 = Assertion Event
                                                  [6:0] = 6Fh = Sensor specific
         ED1 (Event Data 1) = A2h; [7:6] = 10b = OEM code in Event Data 2
                                   [5:4] = 10b = OEM code in Event Data 3
                                   [3:0] = Event Trigger Offset = 2h = Predictive Failure
         ED2 (Event Data 2) = 06h = Input under-voltage warning
         ED3 (Event Data 3) = 30h; From PMBus* Specification STATUS_INPUT command
                                   [5] = VIN_UV_WARNING (Input Under-voltage Warning) = 1
                                   [4] = VIN_UV_FAULT (Input Under-voltage Fault) = 1
```
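Event Data 3 in this record is a raw copy of the PMBus\* STATUS_INPUT byte, so it can be expanded bit by bit. The following is a minimal sketch that does so. Bits 5 and 4 are confirmed by the example above. The other bit names are taken from our reading of the PMBus Specification Part II STATUS_INPUT definition and should be checked against that specification before use. The function name `decode_status_input` is hypothetical.

```python
from typing import List

# PMBus STATUS_INPUT bit assignments (assumed from PMBus Specification Part II).
STATUS_INPUT_BITS = [
    (7, "VIN_OV_FAULT"),
    (6, "VIN_OV_WARNING"),
    (5, "VIN_UV_WARNING"),
    (4, "VIN_UV_FAULT"),
    (3, "Unit off for insufficient input voltage"),
    (2, "IIN_OC_FAULT"),
    (1, "IIN_OC_WARNING"),
    (0, "PIN_OP_WARNING"),
]


def decode_status_input(value: int) -> List[str]:
    """Return the names of all STATUS_INPUT bits set in value, MSB first."""
    return [name for bit, name in STATUS_INPUT_BITS if value & (1 << bit)]


if __name__ == "__main__":
    ed1, ed3 = 0xA2, 0x30               # from the Power Supply 1 record above
    print("offset:", ed1 & 0x0F)        # 2 = Predictive Failure
    print("STATUS_INPUT:", decode_status_input(ed3))
```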

# 3. Sensor Cross Reference List

This section provides a cross-reference to help locate the details of any specific SEL entry.
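Each table in this section is effectively a lookup keyed on the Generator ID and the sensor number, and some rows cover a range of sensor numbers. The following is a minimal sketch of that lookup. It uses a handful of BMC-owned (GID 0020h) rows transcribed from the tables in this section and is an illustrative subset, not the full list. The `SensorRow` type, the `lookup_sensor` function, and the subset chosen are hypothetical.

```python
from typing import NamedTuple, Optional


class SensorRow(NamedTuple):
    first: int    # first sensor number covered by the row
    last: int     # last sensor number (equal to first for single rows)
    name: str
    details: str  # "Details Section" column


BMC_GID = 0x0020

# Illustrative subset of BMC-owned rows from the cross-reference tables.
BMC_SENSORS = [
    SensorRow(0x01, 0x01, "Power Unit Status", "Power Unit Status Sensor"),
    SensorRow(0x30, 0x3F, "Fan Tachometer Sensors", "Fan Tachometer Sensors"),
    SensorRow(0x50, 0x50, "Power Supply 1 Status", "Power Supply Status Sensors"),
    SensorRow(0x80, 0x80, "Catastrophic Error", "Catastrophic Error Sensor"),
    SensorRow(0x83, 0x86, "Processor 1-4 DTS Thermal Margin",
              "Processor DTS Thermal Margin Sensors"),
    SensorRow(0xBA, 0xBF, "Fan Tachometer Sensors", "Fan Tachometer Sensors"),
]


def lookup_sensor(gid: int, sensor_number: int) -> Optional[SensorRow]:
    """Return the cross-reference row for a SEL entry, or None if unknown."""
    if gid != BMC_GID:
        return None  # sensor numbers from other generators mean other things
    for row in BMC_SENSORS:
        if row.first <= sensor_number <= row.last:
            return row
    return None


if __name__ == "__main__":
    print(lookup_sensor(0x0020, 0x35))
```

Keying on the GID as well as the sensor number matters. The BIOS timestamp events in section 2.2.1 use sensor number 83h with GIDs 0001h and 0033h. For GID 0020h, however, the same number belongs to the Processor 1 DTS Thermal Margin sensor.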

### 3.1 BMC owned Sensors (GID = 0020h)

The following table can be used to find the details of sensors owned by the BMC.

| Sensor<br>Number | Sensor Name                                | Details Section              | Next Steps                                                                                 |
|------------------|--------------------------------------------|------------------------------|--------------------------------------------------------------------------------------------|
| 01h              | Power Unit Status<br>(Pwr Unit Status)     | Power Unit Status Sensor     | Table 16: Power Unit Status Sensor – Sensor Specific Offsets – Next<br>Steps               |
| 02h              | Power Unit Redundancy<br>(Pwr Unit Redund) | Power Unit Redundancy Sensor | Table 18: Power Unit Redundancy Sensor – Event Trigger Offset – Next<br>Steps              |
| 03h              | IPMI Watchdog<br>(IPMI Watchdog)           | IPMI Watchdog                | Table 78: IPMI Watchdog Sensor Event Trigger Offset – Next Steps                           |
| 04h              | Physical Security<br>(Physical Scrty)      | Physical Security            | Table 74: Physical Security Sensor Event Trigger Offset – Next Steps                       |
| 05h              | FP Interrupt<br>(FP NMI Diag Int)          | FP (NMI) Interrupt           | FP (NMI) Interrupt – Next Steps                                                            |
| 06h              | SMI Timeout<br>(SMI Timeout)               | SMI Timeout                  | SMI Timeout – Next Steps                                                                   |
| 07h              | System Event Log<br>(System Event Log)     | System Event Log Cleared     | Not applicable                                                                             |
| 08h              | System Event<br>(System Event)             | System Event – PEF Action    | System Event – PEF Action – Next Steps                                                     |
| 09h              | Button Sensor<br>(Button)                  | Button Sensor                | Not applicable                                                                             |

Table 5: BMC owned Sensors

| Sensor<br>Number | Sensor Name                                    | Details Section                            | Next Steps                                                                 |
|------------------|------------------------------------------------|--------------------------------------------|----------------------------------------------------------------------------|
| 0Ah              | BMC Watchdog<br>(BMC Watchdog)                 | BMC Watchdog Sensor                        | BMC Watchdog Sensor – Next Steps                                           |
| 0Bh              | Voltage Regulator Watchdog<br>(VR Watchdog)    | Voltage Regulator Watchdog<br>Timer Sensor | Voltage Regulator Watchdog Timer Sensor – Next Steps                       |
| 0Ch              | Fan Redundancy<br>(Fan Redundancy)             | Fan Presence and Redundancy<br>Sensors     | Table 34: Fan Redundancy Sensor – Event Trigger Offset – Next Steps        |
| 0Dh              | SSB Thermal Trip<br>(SSB Thermal Trip)         | Discrete Thermal Sensors                   | Table 45: Discrete Thermal Sensors – Next Steps                            |
| 0Eh              | IO Module Presence<br>(IO Mod Presence)        | Add-In Module Presence Sensor              | Add-In Module Presence – Next Steps                                        |
| 0Fh              | SAS Module Presence<br>(SAS Mod Presence)      | Add-In Module Presence Sensor              | Add-In Module Presence – Next Steps                                        |
| 10h              | BMC Firmware Health<br>(BMC FW Health)         | BMC FW Health Sensor                       | BMC FW Health Sensor – Next Steps                                          |
| 11h              | System Airflow<br>(System Airflow)             | System Air Flow Monitoring<br>Sensor       | Not applicable                                                             |
| 12h              | Firmware Update Status<br>(FW Update Status)   | Firmware Update Status Sensor              | Not applicable                                                             |
| 13h              | IO Module2 Presence<br>(IO Mod2 Presence)      | Add-In Module Presence Sensor              | Add-In Module Presence – Next Steps                                        |
| 14h              | Baseboard Temperature 5<br>(Platform Specific) | Threshold-based Temperature<br>Sensors     | Table 37: Temperature Sensors – Next Steps                                 |
| 15h              | Baseboard Temperature 6<br>(Platform Specific) | Threshold-based Temperature<br>Sensors     | Table 37: Temperature Sensors – Next Steps                                 |
| 16h              | IO Module2 Temperature<br>(I/O Mod2 Temp)      | Threshold-based Temperature<br>Sensors     | Table 37: Temperature Sensors – Next Steps                                 |
| 17h              | PCI Riser 3 Temperature<br>(PCI Riser 3 Temp)  | Threshold-based Temperature<br>Sensors     | Table 37: Temperature Sensors – Next Steps                                 |
| 18h              | PCI Riser 4 Temperature<br>(PCI Riser 4 Temp)             | Threshold-based Temperature<br>Sensors           | Table 37: Temperature Sensors – Next Steps                                                      |
| 19h              | Baseboard +1.05V Processor3<br>Vccp<br>(BB +1.05Vccp P3)  | Threshold-based Voltage<br>Sensors               | Table 13: Threshold-based Voltage Sensors – Next Steps                                          |
| 1Ah              | Baseboard +1.05V Processor4<br>Vccp<br>(BB +1.05Vccp P4)  | Threshold-based Voltage<br>Sensors               | Table 13: Threshold-based Voltage Sensors – Next Steps                                          |
| 20h              | Baseboard Temperature 1<br>(Platform Specific)            | Threshold-based Temperature<br>Sensors           | Table 37: Temperature Sensors – Next Steps                                                      |
| 21h              | Front Panel Temperature<br>(Front Panel Temp)             | Threshold-based Temperature<br>Sensors           | Table 37: Temperature Sensors – Next Steps                                                      |
| 22h              | SSB Temperature<br>(SSB Temp)                             | Threshold-based Temperature<br>Sensors           | Table 37: Temperature Sensors – Next Steps                                                      |
| 23h              | Baseboard Temperature 2<br>(Platform Specific)            | Threshold-based Temperature<br>Sensors           | Table 37: Temperature Sensors – Next Steps                                                      |
| 24h              | Baseboard Temperature 3<br>(Platform Specific)            | Threshold-based Temperature<br>Sensors           | Table 37: Temperature Sensors – Next Steps                                                      |
| 25h              | Baseboard Temperature 4<br>(Platform Specific)            | Threshold-based Temperature<br>Sensors           | Table 37: Temperature Sensors – Next Steps                                                      |
| 26h              | IO Module Temperature<br>(I/O Mod Temp)                   | Threshold-based Temperature<br>Sensors           | Table 37: Temperature Sensors – Next Steps                                                      |
| 27h              | PCI Riser 1 Temperature<br>(PCI Riser 1 Temp)             | Threshold-based Temperature<br>Sensors           | Table 37: Temperature Sensors – Next Steps                                                      |
| 28h              | IO Riser Temperature<br>(IO Riser Temp)                   | Threshold-based Temperature<br>Sensors           | Table 37: Temperature Sensors – Next Steps                                                      |
| 29h–2Bh          | Hot-Swap Back Plane 1-3<br>Temperature<br>(HSBP 1-3 Temp) | HSC Backplane Temperature<br>Sensor              | Table 88: HSC Backplane Temperature Sensor – Event Trigger Offset –<br>Next Steps               |
| 2Ch              | PCI Riser 2 Temperature<br>(PCI Riser 2 Temp)                          | Threshold-based Temperature<br>Sensors               | Table 37: Temperature Sensors – Next Steps                                                  |
| 2Dh              | SAS Module Temperature<br>(SAS Mod Temp)                               | Threshold-based Temperature<br>Sensors               | Table 37: Temperature Sensors – Next Steps                                                  |
| 2Eh              | Exit Air Temperature<br>(Exit Air Temp)                                | Threshold-based Temperature<br>Sensors               | Table 37: Temperature Sensors – Next Steps                                                  |
| 2Fh              | Network Interface Controller<br>Temperature<br>(LAN NIC Temp)          | Threshold-based Temperature<br>Sensors               | Table 37: Temperature Sensors – Next Steps                                                  |
| 30h–3Fh          | Fan Tachometer Sensors<br>(Chassis specific sensor names)              | Fan Tachometer Sensors                               | Table 30: Fan Tachometer Sensor – Event Trigger Offset – Next Steps                         |
| 40h–4Fh          | Fan Present Sensors<br>(Fan x Present)                                 | Fan Presence and Redundancy<br>Sensors               | Table 32: Fan Presence Sensors – Event Trigger Offset – Next Steps                          |
| 50h              | Power Supply 1 Status<br>(PS1 Status)                                  | Power Supply Status Sensors                          | Table 16: Power Unit Status Sensor – Sensor Specific Offsets – Next<br>Steps                |
| 51h              | Power Supply 2 Status<br>(PS2 Status)                                  | Power Supply Status Sensors                          | Table 16: Power Unit Status Sensor – Sensor Specific Offsets – Next<br>Steps                |
| 54h              | Power Supply 1 AC Power Input<br>(PS1 Power In)                        | Power Supply Power In Sensors                        | Table 23: Power Supply Power In Sensor – Event Trigger Offset – Next<br>Steps               |
| 55h              | Power Supply 2 AC Power Input<br>(PS2 Power In)                        | Power Supply Power In Sensors                        | Table 23: Power Supply Power In Sensor – Event Trigger Offset – Next<br>Steps               |
| 58h              | Power Supply 1 +12V % of<br>Maximum Current Output<br>(PS1 Curr Out %) | Power Supply Current Out %<br>Sensors                | Table 25: Power Supply Current Out % Sensor – Event Trigger Offset –<br>Next Steps          |
| 59h              | Power Supply 2 +12V % of<br>Maximum Current Output<br>(PS2 Curr Out %) | Power Supply Current Out %<br>Sensors                | Table 25: Power Supply Current Out % Sensor – Event Trigger Offset –<br>Next Steps          |
| 5Ch              | Power Supply 1 Temperature<br>(PS1 Temperature)                        | Power Supply Temperature<br>Sensors                  | Table 27: Power Supply Temperature Sensor – Event Trigger Offset – Next Steps               |
| 5Dh     | Power Supply 2 Temperature<br>(PS2 Temperature)             | Power Supply Temperature<br>Sensors  | Table 27: Power Supply Temperature Sensor – Event Trigger Offset – Next Steps   |
| 60h-68h | Hard Disk Drive 15 – 23 Status<br>(HDD 15 – 23 Status)      | Hard Disk Drive Monitoring<br>Sensor | Table 90: Hard Disk Drive Monitoring Sensor - Event Trigger Offset – Next Steps |
| 69h-6Bh | Hot-Swap Controller 1-3 Status<br>(HSC1 – 3 Status)         | Hot-Swap Controller Health<br>Sensor | HSC Health Sensor – Next Steps                                                  |
| 70h     | Processor 1 Status<br>(P1 Status)                           | Processor Status Sensor              | Table 48: Processor Status Sensors – Next Steps                                 |
| 71h     | Processor 2 Status<br>(P2 Status)                           | Processor Status Sensor              | Table 48: Processor Status Sensors – Next Steps                                 |
| 72h     | Processor 3 Status<br>(P3 Status)                           | Processor Status Sensor              | Table 48: Processor Status Sensors – Next Steps                                 |
| 73h     | Processor 4 Status<br>(P4 Status)                           | Processor Status Sensor              | Table 48: Processor Status Sensors – Next Steps                                 |
| 74h     | Processor 1 Thermal Margin<br>(P1 Therm Margin)             | Thermal Margin Sensors               | Table 40: Thermal Margin Sensors – Next Steps                                   |
| 75h     | Processor 2 Thermal Margin<br>(P2 Therm Margin)             | Thermal Margin Sensors               | Table 40: Thermal Margin Sensors – Next Steps                                   |
| 76h     | Processor 3 Thermal Margin<br>(P3 Therm Margin)             | Thermal Margin Sensors               | Table 40: Thermal Margin Sensors – Next Steps                                   |
| 77h     | Processor 4 Thermal Margin<br>(P4 Therm Margin)             | Thermal Margin Sensors               | Table 40: Thermal Margin Sensors – Next Steps                                   |
| 78h-7Bh | Processor 1 – 4 Thermal Control %<br>(P1 – P4 Therm Ctrl %) | Processor Thermal Control<br>Sensors | Processor Thermal Control % Sensors – Next Steps                                |
| 7Ch     | Processor 1 ERR2 Timeout<br>(P1 ERR2)                       | Processor ERR2 Timeout Sensor        | Processor ERR2 Timeout – Next Steps                                             |
| 7Dh     | Processor 2 ERR2 Timeout<br>(P2 ERR2)                       | Processor ERR2 Timeout Sensor        | Processor ERR2 Timeout – Next Steps                                             |
| 7Eh              | Processor 3 ERR2 Timeout<br>(P3 ERR2)                            | Processor ERR2 Timeout Sensor           | Processor ERR2 Timeout – Next Steps                                    |
| 7Fh              | Processor 4 ERR2 Timeout<br>(P4 ERR2)                            | Processor ERR2 Timeout Sensor           | Processor ERR2 Timeout – Next Steps                                    |
| 80h              | Catastrophic Error<br>(CATERR)                                   | Catastrophic Error Sensor               | Table 50: Catastrophic Error Sensor – Event Data 2 Values – Next Steps |
| 81h              | Processor 1 MSID Mismatch<br>(P1 MSID Mismatch)                  | Processor MSID Mismatch<br>Sensor       | Processor MSID Mismatch Sensor – Next Steps                            |
| 82h              | Processor Population Fault<br>(CPU Missing)                      | CPU Missing Sensor                      | CPU Missing Sensor – Next Steps                                        |
| 83h-86h          | Processor 1 – 4 DTS Thermal<br>Margin<br>(P1 – P4 DTS Therm Mgn) | Processor DTS Thermal Margin<br>Sensors | Not applicable                                                         |
| 87h              | Processor 2 MSID Mismatch<br>(P2 MSID Mismatch)                  | Processor MSID Mismatch<br>Sensor       | Processor MSID Mismatch Sensor – Next Steps                            |
| 88h              | Processor 3 MSID Mismatch<br>(P3 MSID Mismatch)                  | Processor MSID Mismatch<br>Sensor       | Processor MSID Mismatch Sensor – Next Steps                            |
| 89h              | Processor 4 MSID Mismatch<br>(P4 MSID Mismatch)                  | Processor MSID Mismatch<br>Sensor       | Processor MSID Mismatch Sensor – Next Steps                            |
| 90h              | Processor 1 VRD Temp<br>(P1 VRD Hot)                             | Discrete Thermal Sensors                | Table 45: Discrete Thermal Sensors – Next Steps                        |
| 91h              | Processor 2 VRD Temp<br>(P2 VRD Hot)                             | Discrete Thermal Sensors                | Table 45: Discrete Thermal Sensors – Next Steps                        |
| 92h              | Processor 3 VRD Temp<br>(P3 VRD Hot)                             | Discrete Thermal Sensors                | Table 45: Discrete Thermal Sensors – Next Steps                        |
| 93h              | Processor 4 VRD Temp<br>(P4 VRD Hot)                             | Discrete Thermal Sensors                | Table 45: Discrete Thermal Sensors – Next Steps                        |
| 94h              | Processor 1 Memory VRD Hot 0-1<br>(P1 Mem01 VRD Hot)                               | Discrete Thermal Sensors                                                     | Table 45: Discrete Thermal Sensors – Next Steps                                      |
| 95h              | Processor 1 Memory VRD Hot 2-3<br>(P1 Mem23 VRD Hot)                               | Discrete Thermal Sensors                                                     | Table 45: Discrete Thermal Sensors – Next Steps                                      |
| 96h              | Processor 2 Memory VRD Hot 0-1<br>(P2 Mem01 VRD Hot)                               | Discrete Thermal Sensors                                                     | Table 45: Discrete Thermal Sensors – Next Steps                                      |
| 97h              | Processor 2 Memory VRD Hot 2-3<br>(P2 Mem23 VRD Hot)                               | Discrete Thermal Sensors                                                     | Table 45: Discrete Thermal Sensors – Next Steps                                      |
| 98h              | Processor 3 Memory VRD Hot 0-1<br>(P3 Mem01 VRD Hot)                               | Discrete Thermal Sensors                                                     | Table 45: Discrete Thermal Sensors – Next Steps                                      |
| 99h              | Processor 3 Memory VRD Hot 2-3<br>(P3 Mem23 VRD Hot)                               | Discrete Thermal Sensors                                                     | Table 45: Discrete Thermal Sensors – Next Steps                                      |
| 9Ah              | Processor 4 Memory VRD Hot 0-1<br>(P4 Mem01 VRD Hot)                               | Discrete Thermal Sensors                                                     | Table 45: Discrete Thermal Sensors – Next Steps                                      |
| 9Bh              | Processor 4 Memory VRD Hot 2-3<br>(P4 Mem23 VRD Hot)                               | Discrete Thermal Sensors                                                     | Table 45: Discrete Thermal Sensors – Next Steps                                      |
| A0h              | Power Supply 1 Fan Tachometer 1<br>(PS1 Fan Tach 1)                                | Power Supply Fan Tachometer<br>Sensors                                       | Power Supply Fan Tachometer Sensors – Next Steps                                     |
| A1h              | Power Supply 1 Fan Tachometer 2<br>(PS1 Fan Tach 2)                                | Power Supply Fan Tachometer<br>Sensors                                       | Power Supply Fan Tachometer Sensors – Next Steps                                     |
| A2h              | Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor<br>Status 1<br>(MIC 1 Status) | Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor<br>(MIC) Status Sensors | Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor (MIC) Status Sensors – Next Steps |
| A3h              | Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor<br>Status 2<br>(MIC 2 Status) | Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor<br>(MIC) Status Sensors | Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor (MIC) Status Sensors – Next Steps |
| A4h              | Power Supply 2 Fan Tachometer 1<br>(PS2 Fan Tach 1)                                | Power Supply Fan Tachometer<br>Sensors                                       | Power Supply Fan Tachometer Sensors – Next Steps                                     |
| A5h              | Power Supply 2 Fan Tachometer 2<br>(PS2 Fan Tach 2)                                | Power Supply Fan Tachometer<br>Sensors                                       | Power Supply Fan Tachometer Sensors – Next Steps                                     |
| A6h              | Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor<br>Status 3<br>(MIC 3 Status) | Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor<br>(MIC) Status Sensors | Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor (MIC) Status Sensors – Next Steps |
| A7h              | Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor<br>Status 4<br>(MIC 4 Status) | Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor<br>(MIC) Status Sensors | Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor (MIC) Status Sensors – Next Steps |
| B0h              | Processor 1 DIMM Aggregate<br>Thermal Margin 1<br>(P1 DIMM Thrm Mrgn1)             | Thermal Margin Sensors                                                       | Table 40: Thermal Margin Sensors – Next Steps                                        |
| B1h              | Processor 1 DIMM Aggregate<br>Thermal Margin 2<br>(P1 DIMM Thrm Mrgn2)             | Thermal Margin Sensors                                                       | Table 40: Thermal Margin Sensors – Next Steps                                        |
| B2h              | Processor 2 DIMM Aggregate<br>Thermal Margin 1<br>(P2 DIMM Thrm Mrgn1)             | Thermal Margin Sensors                                                       | Table 40: Thermal Margin Sensors – Next Steps                                        |
| B3h              | Processor 2 DIMM Aggregate<br>Thermal Margin 2<br>(P2 DIMM Thrm Mrgn2)             | Thermal Margin Sensors                                                       | Table 40: Thermal Margin Sensors – Next Steps                                        |
| B4h              | Processor 3 DIMM Aggregate<br>Thermal Margin 1<br>(P3 DIMM Thrm Mrgn1)             | Thermal Margin Sensors                                                       | Table 40: Thermal Margin Sensors – Next Steps                                        |
| B5h              | Processor 3 DIMM Aggregate<br>Thermal Margin 2<br>(P3 DIMM Thrm Mrgn2)             | Thermal Margin Sensors                                                       | Table 40: Thermal Margin Sensors – Next Steps                                        |
| B6h              | Processor 4 DIMM Aggregate<br>Thermal Margin 1<br>(P4 DIMM Thrm Mrgn1)             | Thermal Margin Sensors                                                       | Table 40: Thermal Margin Sensors – Next Steps                                        |

#### System Event Log Troubleshooting Guide for EPSD Platforms Based on Intel<sup>•</sup> Xeon<sup>•</sup> Processor E5 4600/2600/2400/1600/1400 Product Families

| Sensor<br>Number | Sensor Name | Details Section | Next Steps |
|------------------|-------------|-----------------|------------|
| B7h | Processor 4 DIMM Aggregate<br>Thermal Margin 2<br>(P4 DIMM Thrm Mrgn2) | Thermal Margin Sensors | Table 40: Thermal Margin Sensors – Next Steps |
| B8h | Node Auto-Shutdown Sensor<br>(Auto Shutdown) | Node Auto Shutdown Sensor | Node Auto Shutdown Sensor – Next Steps |
| BAh-BFh | Fan Tachometer Sensors<br>(Chassis-specific sensor names) | Fan Tachometer Sensors | Table 30: Fan Tachometer Sensor – Event Trigger Offset – Next Steps |
| C0h-C3h | Processor 1 – 4 DIMM Thermal Trip | DIMM Thermal Trip Sensors | DIMM Thermal Trip Sensors – Next Steps |
| C4h | Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor<br>Thermal Margin 1<br>(MIC 1 Margin) | Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor<br>(MIC) Thermal Margin Sensors | Not applicable |
| C5h | Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor<br>Thermal Margin 2<br>(MIC 2 Margin) | Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor<br>(MIC) Thermal Margin Sensors | Not applicable |
| C6h | Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor<br>Thermal Margin 3<br>(MIC 3 Margin) | Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor<br>(MIC) Thermal Margin Sensors | Not applicable |
| C7h | Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor<br>Thermal Margin 4<br>(MIC 4 Margin) | Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor<br>(MIC) Thermal Margin Sensors | Not applicable |
| C8h-CFh | Global Aggregate Temperature<br>Margin 1 – 8<br>(Agg Therm Mrgn 1 – 8) | Thermal Margin Sensors | Table 40: Thermal Margin Sensors – Next Steps |
| D0h | Baseboard +12V<br>(BB +12.0V) | Threshold-based Voltage<br>Sensors | Table 13: Threshold-based Voltage Sensors – Next Steps |
| D1h | Baseboard +5V<br>(BB +5.0V) | Threshold-based Voltage<br>Sensors | Table 13: Threshold-based Voltage Sensors – Next Steps |
| D2h | Baseboard +3.3V<br>(BB +3.3V) | Threshold-based Voltage<br>Sensors | Table 13: Threshold-based Voltage Sensors – Next Steps |
| D3h | Baseboard +5V Stand-by<br>(BB +5.0V STBY) | Threshold-based Voltage<br>Sensors | Table 13: Threshold-based Voltage Sensors – Next Steps |
| D4h | Baseboard +3.3V Auxiliary<br>(BB +3.3V AUX) | Threshold-based Voltage<br>Sensors | Table 13: Threshold-based Voltage Sensors – Next Steps |
| D6h | Baseboard +1.05V Processor1<br>Vccp<br>(BB +1.05Vccp P1) | Threshold-based Voltage<br>Sensors | Table 13: Threshold-based Voltage Sensors – Next Steps |
| D7h | Baseboard +1.05V Processor2<br>Vccp<br>(BB +1.05Vccp P2) | Threshold-based Voltage<br>Sensors | Table 13: Threshold-based Voltage Sensors – Next Steps |
| D8h | Baseboard +1.5V P1 Memory AB<br>VDDQ<br>(BB +1.5 P1MEM AB) | Threshold-based Voltage<br>Sensors | Table 13: Threshold-based Voltage Sensors – Next Steps |
| D9h | Baseboard +1.5V P1 Memory CD<br>VDDQ<br>(BB +1.5 P1MEM CD) | Threshold-based Voltage<br>Sensors | Table 13: Threshold-based Voltage Sensors – Next Steps |
| DAh | Baseboard +1.5V P2 Memory AB<br>VDDQ<br>(BB +1.5 P2MEM AB) | Threshold-based Voltage<br>Sensors | Table 13: Threshold-based Voltage Sensors – Next Steps |
| DBh | Baseboard +1.5V P2 Memory CD<br>VDDQ<br>(BB +1.5 P2MEM CD) | Threshold-based Voltage<br>Sensors | Table 13: Threshold-based Voltage Sensors – Next Steps |
| DCh | Baseboard +1.8V Aux<br>(BB +1.8V AUX) | Threshold-based Voltage<br>Sensors | Table 13: Threshold-based Voltage Sensors – Next Steps |
| DDh | Baseboard +1.1V Stand-by<br>(BB +1.1V STBY) | Threshold-based Voltage<br>Sensors | Table 13: Threshold-based Voltage Sensors – Next Steps |
| DEh | Baseboard CMOS Battery<br>(BB +3.3V Vbat) | Threshold-based Voltage<br>Sensors | Table 13: Threshold-based Voltage Sensors – Next Steps |
| E4h | Baseboard +1.35V P1 Low Voltage<br>Memory AB VDDQ<br>(BB +1.35 P1LV AB) | Threshold-based Voltage<br>Sensors | Table 13: Threshold-based Voltage Sensors – Next Steps |
| E5h | Baseboard +1.35V P1 Low Voltage<br>Memory CD VDDQ<br>(BB +1.35 P1LV CD) | Threshold-based Voltage<br>Sensors | Table 13: Threshold-based Voltage Sensors – Next Steps |
| E6h | Baseboard +1.35V P2 Low Voltage<br>Memory AB VDDQ<br>(BB +1.35 P2LV AB) | Threshold-based Voltage<br>Sensors | Table 13: Threshold-based Voltage Sensors – Next Steps |
| E7h | Baseboard +1.35V P2 Low Voltage<br>Memory CD VDDQ<br>(BB +1.35 P2LV CD) | Threshold-based Voltage<br>Sensors | Table 13: Threshold-based Voltage Sensors – Next Steps |
| EAh | Baseboard +3.3V Riser 1 Power Good<br>(BB +3.3 RSR1 PGD) | Threshold-based Voltage<br>Sensors | Table 13: Threshold-based Voltage Sensors – Next Steps |
| EBh | Baseboard +3.3V Riser 2 Power Good<br>(BB +3.3 RSR2 PGD) | Threshold-based Voltage<br>Sensors | Table 13: Threshold-based Voltage Sensors – Next Steps |
| ECh | Baseboard +0.9V<br>(BB 0.9V Core IB) | Threshold-based Voltage<br>Sensors | Table 13: Threshold-based Voltage Sensors – Next Steps |
| EDh | Baseboard +1.8V<br>(BB 1.8V IB I/O) | Threshold-based Voltage<br>Sensors | Table 13: Threshold-based Voltage Sensors – Next Steps |
| EEh | Baseboard +1.1V<br>(BB 1.1V PCH) | Threshold-based Voltage<br>Sensors | Table 13: Threshold-based Voltage Sensors – Next Steps |
| EFh | Baseboard +1.2V<br>(BB +1.2V IB) | Threshold-based Voltage<br>Sensors | Table 13: Threshold-based Voltage Sensors – Next Steps |
| F0h-FEh | Hard Disk Drive 0 – 14 Status<br>(HDD 0 – 14 Status) | Hard Disk Drive Monitoring<br>Sensor | Table 90: Hard Disk Drive Monitoring Sensor – Event Trigger Offset – Next Steps |
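Several rows in the cross-reference list above cover a range of sensor numbers. For example, F0h-FEh covers HDD 0 – 14. The following hypothetical Python helper, which is not part of any Intel tool, sketches the arithmetic for mapping a sensor number inside such a range to its instance. It assumes that instances are numbered consecutively from the first sensor number in the range, which agrees with the instance counts listed. The fan tachometer range (BAh-BFh) is left out because its sensor names are chassis-specific.

```python
from typing import Optional

# Ranged rows from the BMC sensor cross-reference list:
# (first sensor number, last sensor number, name template, first instance index).
# Assumption: instances are numbered consecutively across each range.
RANGED_SENSORS = [
    (0xC0, 0xC3, "Processor {n} DIMM Thermal Trip", 1),
    (0xC4, 0xC7, "MIC {n} Margin", 1),
    (0xC8, 0xCF, "Agg Therm Mrgn {n}", 1),
    (0xF0, 0xFE, "HDD {n} Status", 0),
]


def resolve_ranged_sensor(sensor_number: int) -> Optional[str]:
    """Return the instance name for a ranged sensor number, or None if not ranged."""
    for first, last, template, first_index in RANGED_SENSORS:
        if first <= sensor_number <= last:
            # Offset from the start of the range gives the instance number.
            return template.format(n=first_index + (sensor_number - first))
    return None


if __name__ == "__main__":
    for number in (0xC2, 0xCA, 0xF3, 0xD0):
        print(f"{number:02X}h -> {resolve_ranged_sensor(number)}")
```

For example, sensor F3h resolves to "HDD 3 Status". D0h is a single-sensor row, so the helper returns `None` for it.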

### 3.2 BIOS POST owned Sensors (GID = 0001h)

The following table can be used to find the details of sensors owned by BIOS POST.

| Sensor<br>Number | Sensor Name | Details Section | Next Steps |
|------------------|-------------|-----------------|------------|
| 02h | Memory RAS Configuration Status | Memory RAS Configuration Status | Table 58: Memory RAS Configuration Status Sensor Typical Characteristics |
| 06h | POST Error | System Firmware Progress (Formerly POST Error) | System Firmware Progress (Formerly POST Error) – Next Steps |
| 09h | Intel<sup>®</sup> QuickPath Interconnect Link<br>Width Reduced | QPI Link Width Reduced Sensor | QPI Link Width Reduced Sensor – Next Steps |
| 12h | Memory RAS Mode Select | Memory RAS Mode Select | Not applicable |
| 83h | System Event | System Events | Not applicable |

Table 6: BIOS POST owned Sensors

### 3.3 BIOS SMI Handler owned Sensors (GID = 0033h)

The following table can be used to find the details of sensors owned by BIOS SMI Handler.

#### Table 7: BIOS SMI Handler owned Sensors

| Sensor<br>Number | Sensor Name                    | Details Section                                   | Next Steps                                                                                           |
|------------------|--------------------------------|---------------------------------------------------|------------------------------------------------------------------------------------------------------|
| 01h              | Mirroring Redundancy State     | Mirroring Redundancy State                        | Mirroring Redundancy State Sensor – Next Steps                                                       |
| 02h              | Memory ECC Error               | Memory Correctable and Uncorrectable<br>ECC Error | Table 64: Correctable and Uncorrectable ECC Error Sensor Event Trigger Offset – Next Steps |
| 03h              | Legacy PCI Error               | Legacy PCI Errors                                 | Legacy PCI Error Sensor – Next Steps |
| 04h              | PCI Express* Fatal Error       | PCI Express* Fatal Errors and Fatal Error #2      | PCI Express* Fatal Error and Fatal Error #2 Sensor – Next Steps |
| 05h | PCI Express* Correctable Error | PCI Express* Correctable Errors | PCI Express* Correctable Error Sensor – Next Steps |
| 06h | Intel<sup>®</sup> QuickPath Interconnect<br>Correctable Error | QPI Correctable Error Sensor | QPI Correctable Error Sensor – Next Steps |
| 07h | Intel<sup>®</sup> QuickPath Interconnect Fatal Error | QPI Fatal Error and Fatal Error #2 | QPI Fatal Error and Fatal Error #2 – Next Steps |
| 11h | Sparing Redundancy State | Sparing Redundancy State | Sparing Redundancy State Sensor – Next Steps |
| 13h | Memory Parity Error | Memory Address Parity Error | Memory Address Parity Error Sensor – Next Steps |
| 14h | PCI Express* Fatal Error #2<br>(continuation of Sensor 04h) | PCI Express* Fatal Errors and Fatal Error #2 | PCI Express* Fatal Error and Fatal Error #2 Sensor – Next Steps |
| 17h | Intel<sup>®</sup> QuickPath Interconnect Fatal Error #2<br>(continuation of Sensor 07h) | QPI Fatal Error and Fatal Error #2 | QPI Fatal Error and Fatal Error #2 – Next Steps |
| 83h | System Event | System Events | Not applicable |

### 3.4 Node Manager / ME Firmware owned Sensors (GID = 002Ch or 602Ch)

The following table can be used to find the details of sensors owned by the Node Manager / Management Engine (ME) firmware.

#### Table 8: Management Engine Firmware owned Sensors

| Sensor<br>Number | Sensor Name                                            | Details Section                       | Next Steps                                          |
|------------------|--------------------------------------------------------|---------------------------------------|-----------------------------------------------------|
| 17h              | ME Firmware Health Events                              | ME Firmware Health Event              | ME Firmware Health Event – Next Steps               |
| 18h              | Node Manager Exception Events                          | Node Manager Exception Event          | Node Manager Exception Event – Next Steps           |
| 19h              | Node Manager Health Events                             | Node Manager Health Event             | Node Manager Health Event – Next Steps              |
| 1Ah              | Node Manager Operational Capabilities<br>Change Events | Node Manager Operational Capabilities | Node Manager Operational Capabilities Change – Next Steps |
| 1Bh              | Node Manager Alert Threshold<br>Exceeded Events        | Node Manager Alert Threshold Exceeded  | Node Manager Alert Threshold Exceeded – Next Steps   |

### 3.5 Microsoft\* OS owned Events (GID = 0041h)

The following table can be used to find the details of records that are owned by the Microsoft\* Operating System (OS).

| Sensor Name           | Record<br>Type | Sensor Type            | Details Section                                                                                                                                    | Next Steps     |
|-----------------------|----------------|------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------|----------------|
| Boot Event            | 02h            | 1Fh = OS Boot          | Table 98: Boot up Event Record Typical Characteristics | Not applicable |
| Boot Event            | DCh            | Not applicable         | Table 99: Boot up OEM Event Record Typical Characteristics | Not applicable |
| Shutdown Event        | 02h            | 20h = OS Stop/Shutdown | Table 100: Shutdown Reason Code Event Record Typical Characteristics | Not applicable |
| Shutdown Event        | DDh            | Not applicable         | Table 101: Shutdown Reason OEM Event Record Typical Characteristics<br>Table 102: Shutdown Comment OEM Event Record Typical Characteristics | Not applicable |
| Bug Check/Blue Screen | 02h            | 20h = OS Stop/Shutdown | Table 103: Bug Check/Blue Screen – OS Stop Event Record Typical Characteristics | Not applicable |
| Bug Check/Blue Screen | DEh            | Not applicable         | Table 104: Bug Check/Blue Screen code OEM Event Record Typical Characteristics | Not applicable |

#### Table 9: Microsoft\* OS owned Events
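The Record Type column separates standard IPMI system event records (02h) from OEM records. In this guide the OEM records are DCh, DDh, and DEh, plus F0h for the Linux\* records in Section 3.6. The following minimal Python sketch classifies byte 3 of a raw 16-byte SEL entry. It assumes the record-type ranges from the IPMI v2.0 specification: C0h to DFh are OEM timestamped, E0h to FFh are OEM non-timestamped, and only 02h is defined below C0h. The function names are illustrative only.

```python
def classify_record_type(record_type: int) -> str:
    """Classify the SEL Record Type byte using the IPMI v2.0 ranges."""
    if record_type == 0x02:
        return "system event record"
    if 0xC0 <= record_type <= 0xDF:
        return "OEM timestamped record"
    if 0xE0 <= record_type <= 0xFF:
        return "OEM non-timestamped record"
    # Every other value below C0h is reserved for future standard types.
    return "reserved"


def record_type_of(entry: bytes) -> str:
    """Read byte 3 (1-based) of a raw 16-byte SEL entry and classify it."""
    if len(entry) != 16:
        raise ValueError("SEL entries are 16 bytes long")
    return classify_record_type(entry[2])  # byte 3 is index 2


if __name__ == "__main__":
    for rtype in (0x02, 0xDC, 0xDD, 0xDE, 0xF0):
        print(f"{rtype:02X}h: {classify_record_type(rtype)}")
```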

### 3.6 Linux\* Kernel Panic Events (GID = 0021h)

The following table can be used to find the details of records that can be generated when there is a Linux\* Kernel panic.

#### Table 10: Linux\* Kernel Panic Events

| Sensor Name         | Record<br>Type | Sensor Type            | Details Section                                                       | Next Steps     |
|---------------------|----------------|------------------------|-----------------------------------------------------------------------|----------------|
| Linux* Kernel Panic | 02h            | 20h = OS Stop/Shutdown | Table 105: Linux* Kernel Panic Event Record Characteristics           | Not applicable |
| Linux* Kernel Panic | F0h            | Not applicable         | Table 106: Linux* Kernel Panic String Extended Record Characteristics | Not applicable |
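Sections 3.2 through 3.6 select the owning table by the Generator ID (GID) of the SEL record. The following Python sketch shows that lookup. The helper names are hypothetical. It assumes the GID occupies SEL bytes 8 and 9 with byte 8 as the low byte, so a GID written as 602Ch is stored as 2Ch followed by 60h. GIDs from other sections of this guide are not included.

```python
# GIDs listed in Sections 3.2-3.6, mapped to the table that describes them.
GID_OWNERS = {
    0x0001: "BIOS POST (Table 6)",
    0x0033: "BIOS SMI Handler (Table 7)",
    0x002C: "Node Manager / ME firmware (Table 8)",
    0x602C: "Node Manager / ME firmware (Table 8)",
    0x0041: "Microsoft OS (Table 9)",
    0x0021: "Linux kernel panic (Table 10)",
}


def generator_id(entry: bytes) -> int:
    """Return the 16-bit GID from bytes 8-9 (1-based), assuming byte 8 is the low byte."""
    if len(entry) != 16:
        raise ValueError("SEL entries are 16 bytes long")
    return entry[7] | (entry[8] << 8)


def owner_table(entry: bytes) -> str:
    """Name the owner table for an entry, or report that its GID is not listed here."""
    gid = generator_id(entry)
    return GID_OWNERS.get(gid, f"GID {gid:04X}h is not covered by Sections 3.2-3.6")


if __name__ == "__main__":
    entry = bytearray(16)
    entry[7], entry[8] = 0x2C, 0x60  # GID 602Ch
    print(owner_table(bytes(entry)))
```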

## 4. Power Subsystems

The BMC monitors the power subsystem, including power supplies, select onboard voltages, and related sensors.

### 4.1 Threshold-based Voltage Sensors

The BMC monitors the main voltages in the system, including baseboard, memory, and processor voltages, using IPMI-compliant analog/threshold sensors. Some voltages are present only on specific platforms; for details, check your platform's *Technical Product Specification* (TPS).

**Note**: A voltage error can be caused either by the device supplying the voltage or by the device using it. For each sensor, the next-steps table notes which device supplies the voltage and which device uses it.

| Byte | Field                             | Description                                                                                                                                                                         |
|------|-----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 02h = Voltage                                                                                                                                                                       |
| 12   | Sensor Number                     | See Table 13                                                                                                                                                                        |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 01h (Threshold)</li> </ul>                                   |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 01b = Trigger reading in Event Data 2</li> <li>[5:4] - 01b = Trigger threshold in Event Data 3</li> <li>[3:0] - Event Triggers as described in Table 12</li> </ul> |
| 15   | Event Data 2                      | Reading that triggered event                                                                                                                                                        |
| 16   | Event Data 3                      | Threshold value that triggered event                                                                                                                                                |

#### Table 11: Threshold-based Voltage Sensors Typical Characteristics
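The byte layout in Table 11 can be decoded mechanically. The following Python sketch unpacks bytes 11 through 16 of a raw 16-byte SEL entry and applies the checks that the table implies. The class and function names are hypothetical. Event Data 2 and Event Data 3 are returned as raw 8-bit values. Converting them to volts requires the sensor's SDR conversion factors, which this guide does not cover.

```python
from dataclasses import dataclass


@dataclass
class ThresholdVoltageEvent:
    sensor_number: int
    deassertion: bool    # byte 13 bit 7: 0b = assertion, 1b = deassertion
    trigger_offset: int  # Event Data 1 [3:0], see Table 12
    raw_reading: int     # Event Data 2: reading that triggered the event
    raw_threshold: int   # Event Data 3: threshold that triggered the event


def decode_threshold_voltage_event(entry: bytes) -> ThresholdVoltageEvent:
    """Decode bytes 11-16 of a 16-byte SEL entry laid out as in Table 11."""
    if len(entry) != 16:
        raise ValueError("SEL entries are 16 bytes long")
    # Table 11 numbers bytes from 1, so bytes 11-16 are Python indexes 10-15.
    sensor_type, sensor_number, dir_type, data1, data2, data3 = entry[10:16]
    if sensor_type != 0x02:
        raise ValueError("byte 11 is not 02h (Voltage)")
    if (dir_type & 0x7F) != 0x01:
        raise ValueError("event type is not 01h (Threshold)")
    # [7:6] and [5:4] must both be 01b for Data 2/3 to hold reading/threshold.
    if ((data1 >> 6) & 0b11) != 0b01 or ((data1 >> 4) & 0b11) != 0b01:
        raise ValueError("Event Data 2/3 do not carry the trigger reading/threshold")
    return ThresholdVoltageEvent(
        sensor_number=sensor_number,
        deassertion=bool(dir_type & 0x80),
        trigger_offset=data1 & 0x0F,
        raw_reading=data2,
        raw_threshold=data3,
    )


if __name__ == "__main__":
    # Hypothetical entry: BB +12.0V (D0h) asserting offset 02h.
    sample = bytes(10) + bytes([0x02, 0xD0, 0x01, 0x52, 0xB0, 0xB4])
    print(decode_threshold_voltage_event(sample))
```

In the sample, Event Data 1 = 52h carries 01b in both [7:6] and [5:4], and the trigger offset 2h sits in [3:0].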

The following table describes the severity of each of the event triggers for both assertion and deassertion.

| Event Trigger (Hex) | Event Trigger Description | Assertion Severity | Deassertion Severity | Description |
|-----|-------------------------------|-----------|----------|-----------------------------------------------------------------|
| 00h | Lower non-critical going low  | Degraded  | OK       | The voltage has dropped below its lower non-critical threshold. |
| 02h | Lower critical going low      | non-fatal | Degraded | The voltage has dropped below its lower critical threshold.     |
| 07h | Upper non-critical going high | Degraded  | OK       | The voltage has gone over its upper non-critical threshold.     |
| 09h | Upper critical going high     | non-fatal | Degraded | The voltage has gone over its upper critical threshold.         |

Table 12: Threshold-based Voltage Sensors Event Triggers – Description
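Table 12 works as a lookup keyed by the trigger offset in Event Data 1 [3:0] and the direction bit in byte 13. The following minimal Python sketch implements that lookup. The dictionary and function names are illustrative and not part of any Intel tool. Offsets that Table 12 does not list are treated as errors rather than guessed.

```python
# Event trigger offsets from Table 12:
# offset -> (description, assertion severity, deassertion severity)
VOLTAGE_TRIGGERS = {
    0x00: ("Lower non-critical going low", "Degraded", "OK"),
    0x02: ("Lower critical going low", "non-fatal", "Degraded"),
    0x07: ("Upper non-critical going high", "Degraded", "OK"),
    0x09: ("Upper critical going high", "non-fatal", "Degraded"),
}


def voltage_event_severity(trigger_offset: int, deassertion: bool) -> str:
    """Return the Table 12 severity for a threshold-based voltage event."""
    try:
        _, on_assert, on_deassert = VOLTAGE_TRIGGERS[trigger_offset]
    except KeyError:
        raise ValueError(f"offset {trigger_offset:02X}h is not listed in Table 12") from None
    return on_deassert if deassertion else on_assert


if __name__ == "__main__":
    print(voltage_event_severity(0x02, deassertion=False))
    print(voltage_event_severity(0x02, deassertion=True))
```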

Table 13: Threshold-based Voltage Sensors – Next Steps

| Sensor<br>Number | Sensor Name | Next Steps |
|------------------|-------------|------------|
| 19h | Baseboard +1.05V Processor3 Vccp<br>(BB +1.05Vccp P3) | <ul> <li>This 1.05V line is supplied by the main board.</li> <li>This 1.05V line is used by processor 3.</li> <li>1. Ensure all cables are connected correctly.</li> <li>2. Check that the processor is seated properly.</li> <li>3. Cross-test the processors. If the issue stays with the processor socket, replace the main board; otherwise, replace the processor.</li> </ul> |
| 1Ah | Baseboard +1.05V Processor4 Vccp<br>(BB +1.05Vccp P4) | <ul> <li>This 1.05V line is supplied by the main board.</li> <li>This 1.05V line is used by processor 4.</li> <li>1. Ensure all cables are connected correctly.</li> <li>2. Check that the processor is seated properly.</li> <li>3. Cross-test the processors. If the issue stays with the processor socket, replace the main board; otherwise, replace the processor.</li> </ul> |
| D0h | Baseboard +12V<br>(BB +12.0V) | <ul> <li>+12V is supplied by the power supplies.</li> <li>+12V is used by SATA drives, fans, and PCI cards. It is also used to generate various processor voltages.</li> <li>1. Ensure all cables are connected correctly.</li> <li>2. Check connections on the fans and HDDs.</li> <li>3. If the issue follows the component, replace it; otherwise, replace the board.</li> <li>4. If the issue remains, replace the power supplies.</li> </ul> |
| D1h | Baseboard +5V<br>(BB +5.0V) | <ul> <li>+5.0V is supplied by the power supplies on pedestal systems and by the main board on rack-optimized systems.</li> <li>+5.0V is used by the PCI slots.</li> <li>1. Ensure all cables are connected correctly.</li> <li>2. Reseat any PCI cards.</li> <li>3. Try PCI cards in other PCI slots.</li> <li>4. If the issue follows the card, replace the card; otherwise, replace the main board.</li> <li>5. If the issue remains, replace the power supplies.</li> </ul> |
| D2h | Baseboard +3.3V<br>(BB +3.3V) | <ul> <li>+3.3V is supplied by the power supplies on pedestal systems and by the main board on rack-optimized systems.</li> <li>+3.3V is used by the PCIe and PCI-X slots.</li> <li>1. Ensure all cables are connected correctly.</li> <li>2. Reseat any PCI cards.</li> <li>3. Try PCI cards in other PCI slots.</li> <li>4. If the issue follows the card, replace the card; otherwise, replace the main board.</li> <li>5. If the issue remains, replace the power supplies.</li> </ul> |
| D3h | Baseboard +5V Stand-by<br>(BB +5.0V STBY) | <ul> <li>+5.0V STBY is supplied by the power supplies on pedestal systems and by the main board on rack-optimized systems.</li> <li>+5.0V STBY is used to generate other standby voltages.</li> <li>1. Ensure all cables are connected correctly.</li> <li>2. If the issue remains, replace the board.</li> <li>3. If the issue remains, replace the power supplies.</li> </ul> |
| D4h | Baseboard +3.3V Auxiliary<br>(BB +3.3V AUX) | <ul> <li>+3.3V AUX is supplied by the main board.</li> <li>+3.3V AUX is used by the BMC, clock chips, PCIe slot, on-board NIC, Intel<sup>®</sup> C600 series Chipset, and ICH.</li> <li>1. Ensure all cables are connected correctly.</li> <li>2. If the issue remains, replace the board.</li> <li>3. If the issue remains, replace the power supplies.</li> </ul> |
| D6h | Baseboard +1.05V Processor1 Vccp<br>(BB +1.05Vccp P1) | <ul> <li>This 1.05V line is supplied by the main board.</li> <li>This 1.05V line is used by processor 1.</li> <li>1. Ensure all cables are connected correctly.</li> <li>2. Check that the processor is seated properly.</li> <li>3. Cross-test the processors. If the issue stays with the processor socket, replace the main board; otherwise, replace the processor.</li> </ul> |
| D7h | Baseboard +1.05V Processor2 Vccp<br>(BB +1.05Vccp P2) | <ul> <li>This 1.05V line is supplied by the main board.</li> <li>This 1.05V line is used by processor 2.</li> <li>1. Ensure all cables are connected correctly.</li> <li>2. Check that the processor is seated properly.</li> <li>3. Cross-test the processors. If the issue stays with the processor socket, replace the main board; otherwise, replace the processor.</li> </ul> |
| D8h | Baseboard +1.5V P1 Memory AB<br>VDDQ<br>(BB +1.5 P1MEM AB) | <ul> <li>This 1.5V line is supplied by the main board.</li> <li>This 1.5V line is used by processor 1 memory slots A and B.</li> <li>1. Ensure all cables are connected correctly.</li> <li>2. Check that the DIMMs are seated properly.</li> <li>3. Cross-test the DIMMs. If the issue stays with the DIMM slots on this socket, replace the main board; otherwise, replace the DIMM.</li> </ul> |
| D9h              | Baseboard +1.5V P1 Memory CD<br>VDDQ<br>(BB +1.5 P1MEM CD) | <ul> <li>This 1.5V line is supplied by the main board.</li> <li>This 1.5V line is used by processor 1 memory slots C and D.</li> <li>1. Ensure all cables are connected correctly.</li> <li>2. Check the DIMMs are seated properly.</li> <li>3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main board, otherwise the DIMM.</li> </ul> |  |

Power Subsystems

| Sensor<br>Number | Sensor Name                                                             | Next Steps                                                                                                                                                                                                                                                                                                                                                                         |  |
|------------------|-------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|
| DAh              | Baseboard +1.5V P2 Memory AB<br>VDDQ<br>(BB +1.5 P2MEM AB)              | <ul> <li>This 1.5V line is supplied by the main board.</li> <li>This 1.5V line is used by processor 2 memory slots A and B.</li> <li>1. Ensure all cables are connected correctly.</li> <li>2. Check the DIMMs are seated properly.</li> <li>3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main board, otherwise the DIMM.</li> </ul>   |  |
| DBh              | Baseboard +1.5V P2 Memory CD<br>VDDQ<br>(BB +1.5 P2MEM CD)              | <ul> <li>This 1.5V line is supplied by the main board.</li> <li>This 1.5V line is used by processor 2 memory slots C and D.</li> <li>1. Ensure all cables are connected correctly.</li> <li>2. Check the DIMMs are seated properly.</li> <li>3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main board, otherwise the DIMM.</li> </ul>   |  |
| DCh              | Baseboard +1.8V Aux<br>(BB +1.8V AUX)                                   | <ul> <li>+1.8V AUX is supplied by the main board.</li> <li>+1.8V AUX is used by the BMC and on-board NIC.</li> <li>1. Ensure all cables are connected correctly.</li> <li>2. If the issue remains, replace the board.</li> <li>3. If the issue remains, replace the power supplies.</li> </ul>                                                                                     |  |
| DDh              | Baseboard +1.1V Stand-by<br>(BB +1.1V STBY)                             | <ul> <li>+1.1V STBY is supplied by the main board.</li> <li>+1.1V STBY is used by the Intel<sup>®</sup> C600 series Chipset.</li> <li>1. Ensure all cables are connected correctly.</li> <li>2. If the issue remains, replace the board.</li> <li>3. If the issue remains, replace the power supplies.</li> </ul>                                                                  |  |
| DEh              | Baseboard CMOS Battery<br>(BB +3.3V Vbat)                               | <ul> <li>+3.3V Vbat is supplied by the CMOS battery when power is off and by the main board when power is on.</li> <li>+3.3V Vbat is used by the CMOS and related circuits.</li> <li>1. Replace the CMOS battery. Any CR2032 battery can be used.</li> <li>2. If the error remains (unlikely), replace the board.</li> </ul>                                                    |  |
| E4h              | Baseboard +1.35V P1 Low Voltage<br>Memory AB VDDQ<br>(BB +1.35 P1LV AB) | <ul> <li>This 1.35V line is supplied by the main board.</li> <li>This 1.35V line is used by processor 1 memory slots A and B.</li> <li>1. Ensure all cables are connected correctly.</li> <li>2. Check that the DIMMs are seated properly.</li> <li>3. Cross test the DIMMs. If the issue remains with the DIMM socket, replace the main board; otherwise, replace the DIMM.</li> </ul> |  |

| Sensor<br>Number | Sensor Name                                                             | Next Steps                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|------------------|-------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| E5h              | Baseboard +1.35V P1 Low Voltage<br>Memory CD VDDQ<br>(BB +1.35 P1LV CD) | <ul> <li>This 1.35V line is supplied by the main board.</li> <li>This 1.35V line is used by processor 1 memory slots C and D.</li> <li>1. Ensure all cables are connected correctly.</li> <li>2. Check that the DIMMs are seated properly.</li> <li>3. Cross test the DIMMs. If the issue remains with the DIMM socket, replace the main board; otherwise, replace the DIMM.</li> </ul> |
| E6h              | Baseboard +1.35V P2 Low Voltage<br>Memory AB VDDQ<br>(BB +1.35 P2LV AB) | <ul> <li>This 1.35V line is supplied by the main board.</li> <li>This 1.35V line is used by processor 2 memory slots A and B.</li> <li>1. Ensure all cables are connected correctly.</li> <li>2. Check that the DIMMs are seated properly.</li> <li>3. Cross test the DIMMs. If the issue remains with the DIMM socket, replace the main board; otherwise, replace the DIMM.</li> </ul> |
| E7h              | Baseboard +1.35V P2 Low Voltage<br>Memory CD VDDQ<br>(BB +1.35 P2LV CD) | <ul> <li>This 1.35V line is supplied by the main board.</li> <li>This 1.35V line is used by processor 2 memory slots C and D.</li> <li>1. Ensure all cables are connected correctly.</li> <li>2. Check that the DIMMs are seated properly.</li> <li>3. Cross test the DIMMs. If the issue remains with the DIMM socket, replace the main board; otherwise, replace the DIMM.</li> </ul> |
| EAh              | Baseboard +3.3V Riser 1 Power Good<br>(BB +3.3 RSR1 PGD)                | <ul> <li>+3.3V Riser 1 Power Good is supplied by Riser 1 on specific platforms.</li> <li>+3.3V Riser 1 Power Good indicates the status of the +3.3V rail on Riser 1.</li> <li>1. Ensure that the riser is seated correctly.</li> <li>2. If the issue remains, replace the riser.</li> <li>3. If the issue remains, replace the main board.</li> <li>4. If the issue remains, replace the power supplies.</li> </ul> |
| EBh              | Baseboard +3.3V Riser 2 Power Good<br>(BB +3.3 RSR2 PGD)                | <ul> <li>+3.3V Riser 2 Power Good is supplied by Riser 2 on specific platforms.</li> <li>+3.3V Riser 2 Power Good indicates the status of the +3.3V rail on Riser 2.</li> <li>1. Ensure that the riser is seated correctly.</li> <li>2. If the issue remains, replace the riser.</li> <li>3. If the issue remains, replace the main board.</li> <li>4. If the issue remains, replace the power supplies.</li> </ul> |

| Sensor<br>Number | Sensor Name                          | Next Steps |
|------------------|--------------------------------------|------------|
| ECh              | Baseboard +0.9V<br>(BB 0.9V Core IB) | <ul> <li>+0.9V Core IB is supplied by the main board on specific platforms.</li> <li>+0.9V Core IB is used by the on-board InfiniBand* controller on those specific platforms.</li> <li>1. Ensure all cables are connected correctly.</li> <li>2. If the issue remains, replace the board.</li> <li>3. If the issue remains, replace the power supplies.</li> </ul> |
| EDh              | Baseboard +1.8V<br>(BB 1.8V IB I/O)  | <ul> <li>+1.8V IB I/O is supplied by the main board on specific platforms.</li> <li>+1.8V IB I/O is used by the on-board InfiniBand* controller on those specific platforms.</li> <li>1. Ensure all cables are connected correctly.</li> <li>2. If the issue remains, replace the board.</li> <li>3. If the issue remains, replace the power supplies.</li> </ul> |
| EEh              | Baseboard +1.1V<br>(BB 1.1V PCH)     | <ul> <li>This 1.1V line is supplied by the main board.</li> <li>This 1.1V line is used by the Intel<sup>®</sup> C600 series Chipset.</li> <li>1. Ensure all cables are connected correctly.</li> <li>2. If the issue remains, replace the board.</li> </ul> |
| EFh              | Baseboard +1.2V<br>(BB +1.2V IB)     | <ul> <li>+1.2V is supplied by the main board on specific platforms.</li> <li>+1.2V is used by the on-board InfiniBand* controller on those specific platforms.</li> <li>1. Ensure all cables are connected correctly.</li> <li>2. If the issue remains, replace the board.</li> <li>3. If the issue remains, replace the power supplies.</li> </ul> |
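Several memory VDDQ sensors share the same next steps: reseat and cross test the DIMMs on a given processor and slot pair. To make that mapping easy to apply in a script, the rows above can be encoded as a lookup table. The sketch below is a hypothetical helper, not Intel tooling. The name `dimm_slots_for_sensor` is illustrative, and the table includes only the memory rails shown in this excerpt.

```python
from typing import Optional

# Memory VDDQ rails from the sensor rows above:
# sensor number -> (sensor label, processor, DIMM slot pair).
# Only the rows shown in this excerpt are included; this is an assumption-level sketch.
MEMORY_VDDQ_SENSORS = {
    0xDA: ("BB +1.5 P2MEM AB", 2, "A/B"),
    0xDB: ("BB +1.5 P2MEM CD", 2, "C/D"),
    0xE4: ("BB +1.35 P1LV AB", 1, "A/B"),
    0xE5: ("BB +1.35 P1LV CD", 1, "C/D"),
    0xE6: ("BB +1.35 P2LV AB", 2, "A/B"),
    0xE7: ("BB +1.35 P2LV CD", 2, "C/D"),
}


def dimm_slots_for_sensor(sensor_number: int) -> Optional[str]:
    """Hypothetical helper: name the DIMMs to reseat and cross test.

    Returns None when the sensor number is not one of the memory rails above.
    """
    entry = MEMORY_VDDQ_SENSORS.get(sensor_number)
    if entry is None:
        return None
    label, processor, slots = entry
    return f"{label}: processor {processor} memory slots {slots}"
```

For example, `dimm_slots_for_sensor(0xE5)` returns `"BB +1.35 P1LV CD: processor 1 memory slots C/D"`. The riser rail `0xEA` returns `None` because its next steps concern the riser rather than the DIMMs.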

## 4.2 Voltage Regulator Watchdog Timer Sensor

When a DC power-on is initiated, the BMC firmware (FW) monitors whether the power sequence for the board voltage regulator (VR) controllers completes. Failure to complete the sequence indicates a board problem, in which case the FW powers down the system.

The sequence is as follows:

1. The BMC FW monitors the *PowerSupplyPowerGood* signal for assertion, which indicates that a DC power-on has been initiated, and starts a timer (the VR Watchdog Timer). For EPSD Platforms Based on Intel<sup>®</sup> Xeon<sup>®</sup> Processor E5 4600/2600/2400/1600 Product Families, this timeout is 500 ms.
2. If the *SystemPowerGood* signal has not asserted by the time the VR Watchdog Timer expires, the FW powers down the system, logs a SEL entry, and emits a beep code (1-5-1-2). This failure is termed a VR Watchdog Timeout.
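The decision the BMC makes in the sequence above can be modeled as a small function. This is a simulation for illustration only, not BMC firmware. The function name `vr_watchdog_check` and the action strings are hypothetical. The function treats assertion at exactly 500 ms as "in time", which is an assumption because the text does not define the boundary.

```python
from typing import List, Optional

VR_WATCHDOG_TIMEOUT_MS = 500       # timeout stated for E5 4600/2600/2400/1600 platforms
VR_WATCHDOG_BEEP_CODE = "1-5-1-2"  # beep code emitted on a VR Watchdog Timeout


def vr_watchdog_check(system_power_good_ms: Optional[float]) -> List[str]:
    """Hypothetical model of the VR Watchdog Timer decision.

    system_power_good_ms is the delay between PowerSupplyPowerGood asserting
    (the timer starts) and SystemPowerGood asserting; None means
    SystemPowerGood never asserted.
    """
    # Assumption: assertion at exactly the timeout still counts as in time.
    if system_power_good_ms is not None and system_power_good_ms <= VR_WATCHDOG_TIMEOUT_MS:
        return []  # power sequence completed; no action
    # Timer expired first: VR Watchdog Timeout.
    return [
        "power down system",
        "log SEL entry (sensor 0Bh)",
        f"beep {VR_WATCHDOG_BEEP_CODE}",
    ]
```

`vr_watchdog_check(120)` returns an empty list. `vr_watchdog_check(None)` returns the three timeout actions in the order the text lists them.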

| Byte | Field                             | Description |
|------|-----------------------------------|-------------|
| 11   | Sensor Type                       | 02h = Voltage |
| 12   | Sensor Number                     | 0Bh |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 03h ("digital" Discrete)</li> </ul> |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset = 1h = State Asserted</li> </ul> |
| 15   | Event Data 2                      | Not used |
| 16   | Event Data 3                      | Not used |

Table 14: Voltage Regulator Watchdog Timer Sensor Typical Characteristics
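The bit fields in Table 14 can be decoded mechanically. The sketch below assumes the caller already holds SEL record bytes 11 through 16 as a six-byte `bytes` value. It splits byte 13 into the direction bit [7] and the event type [6:0], and it reads the trigger offset from bits [3:0] of Event Data 1. The function name `decode_vr_watchdog_event` is hypothetical.

```python
def decode_vr_watchdog_event(b: bytes) -> dict:
    """Hypothetical decoder for SEL bytes 11-16 of a VR Watchdog Timer event."""
    if len(b) != 6:
        raise ValueError("expected SEL bytes 11-16 (six bytes)")
    sensor_type, sensor_number, dir_type, ed1, _ed2, _ed3 = b  # ED2/ED3 are not used
    if sensor_type != 0x02 or sensor_number != 0x0B:
        raise ValueError("not the VR Watchdog Timer sensor (type 02h, number 0Bh)")
    return {
        # Bit 7 of byte 13: 0b = assertion, 1b = deassertion.
        "direction": "deassertion" if dir_type & 0x80 else "assertion",
        # Bits 6:0 of byte 13: expected 03h ("digital" discrete).
        "event_type": dir_type & 0x7F,
        # Bits 3:0 of Event Data 1: offset 1h = State Asserted.
        "state_asserted": (ed1 & 0x0F) == 0x1,
    }
```

For example, the bytes `02 0B 03 01 FF FF` decode to an assertion of event type `03h` with State Asserted set, which corresponds to a VR Watchdog Timeout.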

### 4.2.1 Voltage Regulator Watchdog Timer Sensor – Next Steps

1. Ensure that all the connectors from the power supply are well seated.
2. Cross test the baseboard. If the issue remains with the baseboard, replace the baseboard.

## 4.3 Power Unit

The power unit monitors the power state of the system and logs the state changes in the SEL.

### 4.3.1 Power Unit Status Sensor

The power unit status sensor monitors the power state of the system and logs state changes. Expected power events, such as DC ON/OFF, are logged, as are unexpected events, such as AC loss and power good loss.

| Byte | Field                             | Description                                                                                                                                                                 |
|------|-----------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 09h = Power Unit                                                                                                                                                            |
| 12   | Sensor Number                     | 01h                                                                                                                                                                         |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 6Fh (Sensor Specific)</li> </ul>                     |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] = Sensor Specific offset as described in Table 16</li> </ul> |
| 15   | Event Data 2                      | Not used                                                                                                                                                                    |
| 16   | Event Data 3                      | Not used                                                                                                                                                                    |

#### Table 15: Power Unit Status Sensors Typical Characteristics

Table 16: Power Unit Status Sensor – Sensor Specific Offsets – Next Steps

| Offset (Hex) | Sensor Specific Offset        | Description                                                               | Next Steps |
|--------------|-------------------------------|---------------------------------------------------------------------------|------------|
| 00h          | Power down                    | System is powered down.                                                   | Informational Event |
| 02h          | 240 VA power down             | The 240 VA power limit was exceeded and the hardware forced a power down. | Many components can cause this event. <ol> <li>If you recently added hardware, try removing it.</li> <li>Remove/replace any add-in adapters.</li> <li>Remove/replace the power supply.</li> <li>Remove/replace the processors, DIMMs, and/or hard drives.</li> <li>Remove/replace the boards in the system.</li> </ol> |
| 04h          | A/C Lost                      | A/C power was removed.                                                    | Informational Event |
| 05h          | Soft Power Control<br>Failure | Generally means power good was lost in the system, causing a shutdown.    | This could be caused by the power supply subsystem or system components. <ol> <li>Verify all power cables and adapters are connected properly (AC cables as well as the cables between the PSU and system components).</li> <li>Cross test the PSU if possible.</li> <li>Replace the power subsystem.</li> </ol> |
| 06h          | Power Unit Failure            | Power subsystem experienced a failure.                                    | Indicates a power supply failed. <ol> <li>Remove and reapply AC power.</li> <li>If the power supply still fails, replace it.</li> </ol> |
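The power unit status offsets above split into informational events and events that need action. The sketch below shows how a script could triage them from Event Data 1 bits [3:0] of a Power Unit Status event (sensor 01h). The function and table names are hypothetical, and only the offsets listed in Table 16 are included.

```python
# Offsets from Table 16: offset -> (name, is_informational).
POWER_UNIT_STATUS_OFFSETS = {
    0x00: ("Power down", True),
    0x02: ("240 VA power down", False),
    0x04: ("A/C Lost", True),
    0x05: ("Soft Power Control Failure", False),
    0x06: ("Power Unit Failure", False),
}


def classify_power_unit_status(ed1: int) -> str:
    """Hypothetical triage of a Power Unit Status event from Event Data 1."""
    offset = ed1 & 0x0F  # bits 3:0 hold the sensor-specific offset
    if offset not in POWER_UNIT_STATUS_OFFSETS:
        return f"offset {offset:02X}h not listed in Table 16"
    name, informational = POWER_UNIT_STATUS_OFFSETS[offset]
    return f"{name}: " + ("informational" if informational else "follow Next Steps")
```

With this sketch, an Event Data 1 value of `05h` yields `"Soft Power Control Failure: follow Next Steps"`, and `04h` is reported as informational.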

### 4.3.2 Power Unit Redundancy Sensor

This sensor is enabled on systems that support redundant power supplies. When AC power is applied to the system, or when the power supplies lose redundancy, a message is logged in the SEL.

#### Table 17: Power Unit Redundancy Sensors Typical Characteristics

| Byte | Field                             | Description                                                                                                                                                               |
|------|-----------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 09h = Power Unit                                                                                                                                                          |
| 12   | Sensor Number                     | 02h                                                                                                                                                                       |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 0Bh (Generic Discrete)</li> </ul>                  |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset as described in Table 18</li> </ul> |
| 15   | Event Data 2                      | Not used                                                                                                                                                                  |
| 16   | Event Data 3                      | Not used                                                                                                                                                                  |

Table 18: Power Unit Redundancy Sensor – Event Trigger Offset – Next Steps

| Offset (Hex) | Event Trigger Offset                         | Description                                           | Next Steps |
|--------------|----------------------------------------------|-------------------------------------------------------|------------|
| 00h          | Fully redundant                              | System is fully operational.                          | Informational Event |
| 01h          | Redundancy lost                              | System is not running in redundant power supply mode. | This event is accompanied by specific power supply errors (AC lost, PSU failure, and so on). Troubleshoot these events accordingly. |
| 02h          | Redundancy degraded                          | Same as 01h.                                          | Same as 01h. |
| 03h          | Non-redundant, sufficient from redundant     | Same as 01h.                                          | Same as 01h. |
| 04h          | Non-redundant, sufficient from insufficient  | Same as 01h.                                          | Same as 01h. |
| 05h          | Non-redundant, insufficient                  | Same as 01h.                                          | Same as 01h. |
| 06h          | Non-redundant, degraded from fully redundant | Same as 01h.                                          | Same as 01h. |
| 07h          | Redundant, degraded from non-redundant       | Same as 01h.                                          | Same as 01h. |
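In the redundancy offsets above, only offset 00h is purely informational. Every other offset points the technician at the accompanying power supply events. A minimal sketch of that rule follows, assuming Event Data 1 from a Power Unit Redundancy event (sensor 02h) as input. The names `REDUNDANCY_OFFSETS` and `redundancy_next_step` are hypothetical.

```python
# Event trigger offsets from Table 18.
REDUNDANCY_OFFSETS = {
    0x00: "Fully redundant",
    0x01: "Redundancy lost",
    0x02: "Redundancy degraded",
    0x03: "Non-redundant, sufficient from redundant",
    0x04: "Non-redundant, sufficient from insufficient",
    0x05: "Non-redundant, insufficient",
    0x06: "Non-redundant, degraded from fully redundant",
    0x07: "Redundant, degraded from non-redundant",
}


def redundancy_next_step(ed1: int) -> str:
    """Hypothetical mapping from Event Data 1 to the Table 18 next step."""
    offset = ed1 & 0x0F  # bits 3:0 hold the event trigger offset
    name = REDUNDANCY_OFFSETS.get(offset)
    if name is None:
        return f"unknown offset {offset:02X}h"
    if offset == 0x00:
        return f"{name}: informational"
    # Offsets 01h-07h: troubleshoot the accompanying power supply events.
    return f"{name}: check accompanying power supply events (AC lost, PSU failure)"
```

Offset `00h` returns `"Fully redundant: informational"`. Offsets `01h` through `07h` all return the instruction to check the accompanying power supply events.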

### 4.3.3 Node Auto Shutdown Sensor

The BMC supports a Node Auto Shutdown sensor, which logs a SEL event when a node undergoes an emergency shutdown. Such a shutdown is caused by loss of power supply redundancy or by PSU CLST throttling in response to an over-current warning condition. This sensor applies only to multi-node systems.

The sensor is rearmed on power-on (AC or DC power-on transitions).

This sensor is used only to trigger a SEL event indicating the assertion or deassertion of a node or power auto shutdown.

Table 19: Node Auto Shutdown Sensor Typical Characteristics

| Byte | Field                             | Description |
|------|-----------------------------------|-------------|
| 11   | Sensor Type                       | 09h = Power Unit |
| 12   | Sensor Number                     | B8h |
| 13   | Event Direction and<br>Event Type | <ul><li>[7] Event direction</li><li>0b = Assertion Event</li><li>1b = Deassertion Event</li><li>[6:0] Event Type = 03h ("digital" discrete)</li></ul> |
| 14   | Event Data 1                      | <ul><li>[7:6] – 00b = Unspecified Event Data 2</li><li>[5:4] – 00b = Unspecified Event Data 3</li><li>[3:0] – Event Trigger Offset: 1h = State Asserted</li></ul> |
| 15   | Event Data 2                      | Not used |
| 16   | Event Data 3                      | Not used |

#### 4.3.3.1 Node Auto Shutdown Sensor – Next Steps

This event is accompanied by specific power supply errors (AC lost, PSU failure, and so on) or other system events. Troubleshoot these events accordingly.
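Three Power Unit sensors (sensor type 09h) are described in this section: Power Unit Status (01h), Power Unit Redundancy (02h), and Node Auto Shutdown (B8h). A script reading the SEL first needs to tell them apart. The sketch below routes an event using SEL bytes 11, 12, and 13. The function `route_power_unit_event` and its output strings are hypothetical.

```python
# Power Unit sensors described in section 4.3: sensor number -> (name, where to look).
POWER_UNIT_SENSORS = {
    0x01: ("Power Unit Status", "Table 16"),
    0x02: ("Power Unit Redundancy", "Table 18"),
    0xB8: ("Node Auto Shutdown", "Section 4.3.3.1"),
}


def route_power_unit_event(sensor_type: int, sensor_number: int, dir_type: int) -> str:
    """Hypothetical router for SEL bytes 11 (type), 12 (number), 13 (direction/type)."""
    if sensor_type != 0x09:
        return "not a Power Unit event"
    entry = POWER_UNIT_SENSORS.get(sensor_number)
    if entry is None:
        return f"unknown Power Unit sensor {sensor_number:02X}h"
    name, where = entry
    # Bit 7 of byte 13 distinguishes assertion (0b) from deassertion (1b).
    direction = "deasserted" if dir_type & 0x80 else "asserted"
    return f"{name} {direction}; see {where}"
```

For example, an event with bytes `09 B8 03` is routed to the Node Auto Shutdown next steps. An event with bytes `09 02 8B` is reported as a Power Unit Redundancy deassertion.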

## 4.4 Power Supply

The BMC monitors the power supply subsystem.

### 4.4.1 Power Supply Status Sensors

These sensors report the status of the power supplies in the system. An event can be logged when AC power is first applied to or removed from the system, and when a failure, predictive failure, or configuration error occurs.

| Byte | Field                             | Description |
|------|-----------------------------------|-------------|
| 11   | Sensor Type                       | 08h = Power Supply |
| 12   | Sensor Number                     | 50h = Power Supply 1 Status<br>51h = Power Supply 2 Status |
| 13   | Event Direction and<br>Event Type | [7] Event direction<br>0b = Assertion Event<br>1b = Deassertion Event<br>[6:0] Event Type = 6Fh (Sensor Specific) |
| 14   | Event Data 1                      | [7:6] – ED2 data in Table 21<br>[5:4] – ED3 data in Table 21<br>[3:0] – Sensor Specific offset as described in Table 21 |
| 15   | Event Data 2                      | As described in Table 21 |
| 16   | Event Data 3                      | As described in Table 21 |
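For power supply status events, Event Data 1 carries more than the offset. Bits [7:6] and [5:4] state whether Event Data 2 and Event Data 3 hold OEM codes (10b) or are unspecified (00b). The sketch below splits those fields. `decode_psu_ed1` is a hypothetical name, and other bit patterns are reported as not listed rather than interpreted.

```python
# ED2/ED3 field encodings used in Table 21 (other values are not listed there).
ED_FIELD_MEANING = {0b00: "unspecified", 0b10: "OEM code"}


def decode_psu_ed1(ed1: int) -> dict:
    """Hypothetical split of Event Data 1 for sensors 50h/51h."""
    ed2_kind = (ed1 >> 6) & 0b11  # bits 7:6 describe Event Data 2
    ed3_kind = (ed1 >> 4) & 0b11  # bits 5:4 describe Event Data 3
    return {
        "offset": ed1 & 0x0F,     # bits 3:0: sensor-specific offset
        "ed2": ED_FIELD_MEANING.get(ed2_kind, f"{ed2_kind:02b}b (not listed in Table 21)"),
        "ed3": ED_FIELD_MEANING.get(ed3_kind, f"{ed3_kind:02b}b (not listed in Table 21)"),
    }
```

A Failure event logged with Event Data 1 = `A1h` (binary `1010 0001`) decodes to offset 1 with OEM codes in both Event Data 2 and Event Data 3. A Presence event with `00h` decodes to offset 0 with both fields unspecified.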

#### Table 21: Power Supply Status Sensor – Sensor Specific Offsets – Next Steps

| Senso | or Specific Offset | Description                                                               | ED2                                                                                                                                                                                                                              | ED3                                                                                                                                                                                                                                                                                                                                 | Next Steps                                                                                                                |
|-------|--------------------|---------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
| Hex   | Description        |                                                                           |                                                                                                                                                                                                                                  |                                                                                                                                                                                                                                                                                                                                     |                                                                                                                           |
| 00h   | Presence           | Power supply detected                                                     | 00b = Unspecified Event Data 2                                                                                                                                                                                                   | 00b = Unspecified Event Data 3                                                                                                                                                                                                                                                                                                      | Informational Event                                                                                                       |
| 01h   | Failure            | Power supply failed<br>Check the data in ED2<br>and ED3 for more details. | <ul> <li>10b = OEM code in Event Data 2</li> <li>01h – Output voltage fault</li> <li>02h – Output power fault</li> <li>03h – Output over-current fault</li> <li>04h – Over-temperature fault</li> <li>05h – Fan fault</li> </ul> | 10b = OEM code in Event Data 3<br>Will have the contents of the<br>associated PMBus* Status<br>register. For example, Data 3 will<br>have the contents of the<br>VOLTAGE_STATUS register at<br>the time an Output Voltage fault<br>was detected. Refer to the<br>PMBus* Specification for details<br>on specific register contents. | Indicates a power supply<br>failed.<br>1. Remove and reapply<br>AC.<br>2. If the power supply<br>still fails, replace it. |

System Event Log Troubleshooting Guide for EPSD Platforms Based on Intel® Xeon® Processor E5 4600/2600/2400/1600/1400 Product Families

| Sensor-Specific Offset (Hex) | Offset Description | Description | ED2 | ED3 | Next Steps |
|------------------------------|--------------------|-------------|-----|-----|------------|
| 02h | Predictive Failure | Check the data in ED2<br>and ED3 for more details. | <ul> <li>10b = OEM code in Event Data 2</li> <li>01h – Output voltage warning</li> <li>02h – Output power warning</li> <li>03h – Output over-current warning</li> <li>04h – Over-temperature warning</li> <li>05h – Fan warning</li> <li>06h – Input under-voltage warning</li> <li>07h – Input over-current warning</li> <li>08h – Input over-power warning</li> </ul> | 10b = OEM code in Event Data 3<br>Contains the value of the associated PMBus* status register at the time the warning was detected. For example, for an Output Voltage warning, Data 3 contains the contents of the VOLTAGE_STATUS register. Refer to the PMBus* Specification for details on specific register contents. | <ul> <li>Depends on the warning event.</li> <li>1. Replace the power supply.</li> <li>2. Verify proper airflow to the system.</li> <li>3. Verify the power source.</li> <li>4. Replace the system boards.</li> </ul> |
| 03h | AC lost | AC power removed. | 00b = Unspecified Event Data 2 | 00b = Unspecified Event Data 3 | Informational event. |
| 06h | Configuration error | Power supply configuration is not supported.<br>Check the data in ED2 for more details. | <ul> <li>10b = OEM code in Event Data 2</li> <li>01h – The BMC cannot access the PMBus* device on the PSU, but its FRU device is responding.</li> <li>02h – The PMBUS*_REVISION command returns a version number that is not supported (only versions 1.1 and 1.2 are supported).</li> <li>03h – The PMBus* device does not respond successfully to the PMBUS*_REVISION command.</li> <li>04h – The PSU is incompatible with one or more PSUs that are present in the system.</li> <li>05h – The PSU firmware is operating in a degraded mode (likely due to a failed firmware update).</li> </ul> | 00b = Unspecified Event Data 3 | <ul> <li>Indicates that at least one of the power supplies is not correct for your system configuration.</li> <li>1. Remove the power supply and verify compatibility.</li> <li>2. If the power supply is compatible, it may be faulty. Replace it.</li> </ul> |
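
For scripted triage, the ED2 codes for offsets 02h and 06h can be turned into a lookup. The following is a minimal Python sketch, not an Intel tool: it assumes Event Data 1 for this sensor follows the IPMI layout used by the other sensor tables in this section ([7:6] = how ED2 is used, [3:0] = sensor-specific offset), and the function and table names are hypothetical.

```python
# Hypothetical decoder for the OEM code in Event Data 2 of a Power Supply
# Status event. Assumes the IPMI Event Data 1 layout:
#   [7:6] = how ED2 is used (10b = OEM code), [3:0] = sensor-specific offset.

PREDICTIVE_FAILURE_ED2 = {   # offset 02h
    0x01: "Output voltage warning",
    0x02: "Output power warning",
    0x03: "Output over-current warning",
    0x04: "Over-temperature warning",
    0x05: "Fan warning",
    0x06: "Input under-voltage warning",
    0x07: "Input over-current warning",
    0x08: "Input over-power warning",
}

CONFIG_ERROR_ED2 = {         # offset 06h
    0x01: "BMC cannot access the PMBus device, but the FRU device responds",
    0x02: "PMBUS_REVISION returned an unsupported version",
    0x03: "PMBus device did not respond to PMBUS_REVISION",
    0x04: "PSU is incompatible with another installed PSU",
    0x05: "PSU firmware is operating in a degraded mode",
}

ED2_TABLES = {0x02: PREDICTIVE_FAILURE_ED2, 0x06: CONFIG_ERROR_ED2}


def decode_psu_status_ed2(ed1: int, ed2: int) -> str:
    """Return a readable description of the ED2 OEM code, if any."""
    ed2_usage = (ed1 >> 6) & 0b11   # ED1 [7:6]
    offset = ed1 & 0x0F             # ED1 [3:0]
    if ed2_usage != 0b10:
        return "Event Data 2 not used"
    table = ED2_TABLES.get(offset)
    if table is None:
        return f"No OEM code table for offset {offset:02X}h"
    return table.get(ed2, f"Unknown OEM code {ed2:02X}h")


# Predictive Failure (offset 02h, ED1 = A2h) with ED2 = 02h
print(decode_psu_status_ed2(0xA2, 0x02))  # Output power warning
```

ED3 is left undecoded here because, per the table, it carries a raw PMBus* status register whose bit meanings are defined in the PMBus* Specification.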

## 4.4.2 Power Supply Power In Sensors

These sensors log an event when the AC input power of a power supply in the system exceeds its threshold.

| Byte | Field                             | Description                                                                                                                                                                               |
|------|-----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 0Bh = Other Units                                                                                                                                                                         |
| 12   | Sensor Number                     | 54h = Power Supply 1 Power In<br>55h = Power Supply 2 Power In                                                                                                                            |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 01h (Threshold)</li> </ul>                                         |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 01b = Trigger reading in Event Data 2</li> <li>[5:4] - 01b = Trigger threshold in Event Data 3</li> <li>[3:0] - Event Trigger Offset as described in Table 23</li> </ul> |
| 15   | Event Data 2                      | Reading that triggered event                                                                                                                                                              |
| 16   | Event Data 3                      | Threshold value that triggered event                                                                                                                                                      |

#### Table 22: Power Supply Power In Sensors Typical Characteristics
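
All threshold sensors in this section share the same byte 13 and byte 14 layout, so one parser can extract the direction, offset, raw reading, and raw threshold. A minimal Python sketch, assuming a complete 16-byte SEL record with 1-based byte numbering as in the table (byte 11 is index 10); the class, function, and sample record are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdEvent:
    sensor_type: int
    sensor_number: int
    deassertion: bool
    offset: int                    # event trigger offset, ED1 [3:0]
    reading_raw: Optional[int]     # ED2 when ED1 [7:6] = 01b
    threshold_raw: Optional[int]   # ED3 when ED1 [5:4] = 01b


def parse_threshold_event(record: bytes) -> ThresholdEvent:
    """Parse bytes 11-16 of a 16-byte SEL record (table bytes are 1-based)."""
    if len(record) != 16:
        raise ValueError("expected a 16-byte SEL record")
    dir_type = record[12]              # byte 13
    ed1, ed2, ed3 = record[13:16]      # bytes 14-16
    event_type = dir_type & 0x7F       # [6:0]
    if event_type != 0x01:
        raise ValueError(f"not a threshold event (type {event_type:02X}h)")
    return ThresholdEvent(
        sensor_type=record[10],                 # byte 11
        sensor_number=record[11],               # byte 12
        deassertion=bool(dir_type & 0x80),      # byte 13, bit 7
        offset=ed1 & 0x0F,
        reading_raw=ed2 if ((ed1 >> 6) & 0b11) == 0b01 else None,
        threshold_raw=ed3 if ((ed1 >> 4) & 0b11) == 0b01 else None,
    )


# Hypothetical record: bytes 1-10 zeroed, then a sensor type 0Bh, sensor 54h
# assertion of offset 07h with raw reading 80h and raw threshold 70h.
sample = bytes(10) + bytes([0x0B, 0x54, 0x01, 0x57, 0x80, 0x70])
print(parse_threshold_event(sample))
```

The raw values are left unconverted on purpose; turning them into watts, amps, or degrees requires the sensor's SDR factors.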

The following table describes the severity of each of the event triggers for both assertion and deassertion.

#### Table 23: Power Supply Power In Sensor – Event Trigger Offset – Next Steps

| Event Trigger Offset (Hex) | Event Trigger Description     | Assertion Severity | Deassertion Severity | Description                                                  | Next Steps |
|----------------------------|-------------------------------|--------------------|----------------------|--------------------------------------------------------------|------------|
| 07h                        | Upper non-critical going high | Degraded           | OK                   | PMBus* feature to monitor power supply power consumption.    | If you see this event, the system is drawing more input power than the PSU is rated for.<ol> <li>Verify the power budget is within the specified range.</li> <li>Check <u>http://www.intel.com/p/en_US/support/</u> for the power budget tool for your system.</li> </ol> |
| 09h                        | Upper critical going high     | Non-fatal          | Degraded             | PMBus* feature to monitor power supply power consumption.    | If you see this event, the system is drawing more input power than the PSU is rated for.<ol> <li>Verify the power budget is within the specified range.</li> <li>Check <u>http://www.intel.com/p/en_US/support/</u> for the power budget tool for your system.</li> </ol> |
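
The severities in Table 23 can be captured as a lookup keyed by event trigger offset and event direction (byte 13, bit 7). A minimal Python sketch; the names are hypothetical. Tables 25 and 27 list identical severities for 07h and 09h, so the same mapping would apply to those sensors.

```python
from typing import Optional

# Severity per event trigger offset as (assertion, deassertion).
THRESHOLD_SEVERITY = {
    0x07: ("Degraded", "OK"),         # Upper non-critical going high
    0x09: ("Non-fatal", "Degraded"),  # Upper critical going high
}


def event_severity(offset: int, deassertion: bool) -> Optional[str]:
    """Look up the severity; return None for offsets this table does not list."""
    pair = THRESHOLD_SEVERITY.get(offset)
    if pair is None:
        return None
    return pair[1] if deassertion else pair[0]


print(event_severity(0x09, deassertion=False))  # Non-fatal
```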

## 4.4.3 Power Supply Current Out % Sensors

PMBus\*-compliant power supplies may monitor the current output of the main 12 V rail and report the current usage as a percentage of the maximum rated current for that rail.
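
As a worked illustration of what the sensor reports, here is a minimal Python sketch, assuming the PSU's rated maximum 12 V rail current is known. The 62.5 A rating below is a hypothetical value, not a figure for any specific Intel power supply. Because the rail voltage is nominally fixed at 12 V, the current percentage also tracks the rail's share of its rated output power.

```python
def current_out_percent(output_amps: float, rated_max_amps: float) -> float:
    """12 V rail output current as a percentage of the rail's rated maximum."""
    if rated_max_amps <= 0:
        raise ValueError("rated maximum current must be positive")
    return 100.0 * output_amps / rated_max_amps


# Hypothetical PSU whose 12 V rail is rated for 62.5 A, delivering 50 A.
print(current_out_percent(50.0, 62.5))  # 80.0
```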

| Byte | Field                             | Description                                                                                                                                                                               |
|------|-----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 03h = Current                                                                                                                                                                             |
| 12   | Sensor Number                     | 58h = Power Supply 1 Current Out %<br>59h = Power Supply 2 Current Out %                                                                                                                  |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 01h (Threshold)</li> </ul>                                         |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 01b = Trigger reading in Event Data 2</li> <li>[5:4] - 01b = Trigger threshold in Event Data 3</li> <li>[3:0] - Event Trigger Offset as described in Table 25</li> </ul> |
| 15   | Event Data 2                      | Reading that triggered event                                                                                                                                                              |
| 16   | Event Data 3                      | Threshold value that triggered event                                                                                                                                                      |

#### Table 24: Power Supply Current Out % Sensors Typical Characteristics

The following table describes the severity of each of the event triggers for both assertion and deassertion.

#### Table 25: Power Supply Current Out % Sensor – Event Trigger Offset – Next Steps

| Event Trigger Offset (Hex) | Event Trigger Description     | Assertion Severity | Deassertion Severity | Description                                                  | Next Steps |
|----------------------------|-------------------------------|--------------------|----------------------|--------------------------------------------------------------|------------|
| 07h                        | Upper non-critical going high | Degraded           | OK                   | PMBus* feature to monitor power supply power consumption.    | If you see this event, the system is drawing more output power than the PSU is rated for.<ol> <li>Verify the power budget is within the specified range.</li> <li>Check <u>http://www.intel.com/p/en_US/support/</u> for the power budget tool for your system.</li> </ol> |
| 09h                        | Upper critical going high     | Non-fatal          | Degraded             | PMBus* feature to monitor power supply power consumption.    | If you see this event, the system is drawing more output power than the PSU is rated for.<ol> <li>Verify the power budget is within the specified range.</li> <li>Check <u>http://www.intel.com/p/en_US/support/</u> for the power budget tool for your system.</li> </ol> |

## 4.4.4 Power Supply Temperature Sensors

The BMC monitors one or two power supply temperature sensors for each installed PMBus\*-compliant power supply.

| Byte | Field                             | Description                                                                                                                                                                               |
|------|-----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 01h = Temperature                                                                                                                                                                         |
| 12   | Sensor Number                     | 5Ch = Power Supply 1 Temperature<br>5Dh = Power Supply 2 Temperature                                                                                                                      |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 01h (Threshold)</li> </ul>                                         |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 01b = Trigger reading in Event Data 2</li> <li>[5:4] - 01b = Trigger threshold in Event Data 3</li> <li>[3:0] - Event Trigger Offset as described in Table 27</li> </ul> |
| 15   | Event Data 2                      | Reading that triggered event                                                                                                                                                              |
| 16   | Event Data 3                      | Threshold value that triggered event                                                                                                                                                      |

#### Table 26: Power Supply Temperature Sensors Typical Characteristics
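
Event Data 2 and 3 hold raw one-byte values, not degrees. To interpret them, apply the sensor's SDR conversion factors. The sketch below assumes the IPMI linear conversion y = (M*x + B*10^K1) * 10^K2 with an unsigned raw reading; the factor values in the usage lines are hypothetical, and the real M, B, K1 (B exponent), and K2 (result exponent) must be read from the sensor's SDR.

```python
def raw_to_units(raw: int, m: int, b: int, k1: int, k2: int) -> float:
    """IPMI linear conversion: y = (M*raw + B*10**K1) * 10**K2.

    m, b -- SDR multiplier and offset (signed)
    k1   -- B exponent; k2 -- result exponent
    Assumes an unsigned one-byte raw reading and a linear sensor.
    """
    if not 0 <= raw <= 0xFF:
        raise ValueError("raw reading must fit in one byte")
    return float((m * raw + b * 10 ** k1) * 10 ** k2)


# Hypothetical factors M=1, B=0, K1=0, K2=0: raw 3Ch (ED2) and 37h (ED3).
reading_c = raw_to_units(0x3C, m=1, b=0, k1=0, k2=0)
threshold_c = raw_to_units(0x37, m=1, b=0, k1=0, k2=0)
print(reading_c, threshold_c)  # 60.0 55.0
```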

The following table describes the severity of each of the event triggers for both assertion and deassertion.

#### Table 27: Power Supply Temperature Sensor – Event Trigger Offset – Next Steps

| Event Trigger Offset (Hex) | Event Trigger Description     | Assertion Severity | Deassertion Severity | Description                                                     | Next Steps |
|----------------------------|-------------------------------|--------------------|----------------------|-----------------------------------------------------------------|------------|
| 07h                        | Upper non-critical going high | Degraded           | OK                   | An upper non-critical temperature threshold has been crossed.   | <ol> <li>Check for clear and unobstructed airflow into and out of the chassis.</li> <li>Ensure the SDR is programmed and the correct chassis has been selected.</li> <li>Ensure there are no fan failures.</li> <li>Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).</li> </ol> |
| 09h                        | Upper critical going high     | Non-fatal          | Degraded             | An upper critical temperature threshold has been crossed.       | <ol> <li>Check for clear and unobstructed airflow into and out of the chassis.</li> <li>Ensure the SDR is programmed and the correct chassis has been selected.</li> <li>Ensure there are no fan failures.</li> <li>Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).</li> </ol> |

## 4.4.5 Power Supply Fan Tachometer Sensors

The BMC polls each installed power supply using the PMBus\* fan status commands to check for failure conditions for the power supply fans.

| Byte | Field                             | Description                                                                                                                                                            |
|------|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 04h = Fan                                                                                                                                                              |
| 12   | Sensor Number                     | A0h = Power Supply 1 Fan Tachometer 1<br>A1h = Power Supply 1 Fan Tachometer 2<br>A4h = Power Supply 2 Fan Tachometer 1<br>A5h = Power Supply 2 Fan Tachometer 2       |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 03h ("digital" Discrete)</li> </ul>             |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset = 1h = State Asserted</li> </ul> |
| 15   | Event Data 2                      | Not used                                                                                                                                                               |
| 16   | Event Data 3                      | Not used                                                                                                                                                               |

#### Table 28: Power Supply Fan Tachometer Sensors Typical Characteristics
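
Because these are "digital" discrete events with a fixed offset, a PSU fan event can be recognized purely from the sensor type, sensor number, event type, and offset listed in the table. A minimal Python sketch, assuming a complete 16-byte SEL record; the function name and sample record are hypothetical.

```python
from typing import Optional

# Sensor numbers from the table: (power supply, fan tachometer).
PSU_FAN_SENSORS = {
    0xA0: (1, 1),
    0xA1: (1, 2),
    0xA4: (2, 1),
    0xA5: (2, 2),
}


def describe_psu_fan_event(record: bytes) -> Optional[str]:
    """Return a summary for a PSU fan tachometer event, else None."""
    if len(record) != 16:
        raise ValueError("expected a 16-byte SEL record")
    sensor_type, sensor_number, dir_type, ed1 = record[10:14]  # bytes 11-14
    if sensor_type != 0x04 or (dir_type & 0x7F) != 0x03:
        return None                      # not a "digital" discrete fan event
    location = PSU_FAN_SENSORS.get(sensor_number)
    if location is None or (ed1 & 0x0F) != 0x01:
        return None                      # not a PSU fan, or not State Asserted
    psu, fan = location
    state = "cleared" if dir_type & 0x80 else "asserted"
    return f"Power Supply {psu} Fan Tachometer {fan}: failure {state}"


# Hypothetical record: bytes 1-10 zeroed, ED2/ED3 set to FFh (not used).
sample = bytes(10) + bytes([0x04, 0xA4, 0x03, 0x01, 0xFF, 0xFF])
print(describe_psu_fan_event(sample))  # Power Supply 2 Fan Tachometer 1: failure asserted
```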

#### 4.4.5.1 Power Supply Fan Tachometer Sensors – Next Steps

These events are generated only on systems with PMBus\*-capable power supplies, typically when airflow to the power supply is obstructed. To troubleshoot:

1. Remove and then reinstall the power supply to see whether something temporarily caused the fan failure.
2. Swap the power supply with another one to see whether the problem stays with the location or follows the power supply.
3. Depending on the outcome of steps 1 and 2, replace the power supply.
4. Ensure the latest FRUSDR update has been run and the correct chassis is detected or selected.

## 5.1 Fan Sensors

There are three types of fan sensors that can be present on Intel<sup>®</sup> Server Systems: speed, presence, and redundancy. The last two types exist only on systems with hot-swap redundant fans.

## 5.1.1 Fan Tachometer Sensors

Fan tachometer sensors monitor the RPM signal on the relevant fan headers on the platform. They are threshold-based sensors. Usually only lower thresholds are set, so a SEL entry is generated only if the fan spins too slowly.
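
The lower-threshold behavior described above can be sketched as a function that returns which event trigger offsets a reading would assert. This is a simplified Python sketch, not BMC firmware logic: it assumes a reading strictly below a threshold counts as crossed and ignores hysteresis, and the threshold values in the usage lines are hypothetical (real values come from the chassis-specific SDR). A reading of 0 RPM, as when no fan is connected to the expected header, crosses both thresholds.

```python
from typing import List, Optional


def lower_threshold_offsets(rpm: int, lnc: Optional[int], lc: Optional[int]) -> List[int]:
    """Return the event trigger offsets a fan reading would assert.

    lnc / lc are the lower non-critical and lower critical thresholds in RPM;
    pass None for a threshold that is not set. Simplification: a reading
    strictly below a threshold counts as crossed; hysteresis is ignored.
    """
    offsets = []
    if lnc is not None and rpm < lnc:
        offsets.append(0x00)   # Lower non-critical going low
    if lc is not None and rpm < lc:
        offsets.append(0x02)   # Lower critical going low
    return offsets


# Hypothetical thresholds: LNC = 1000 RPM, LC = 800 RPM.
for rpm in (1500, 900, 700, 0):
    print(rpm, [f"{o:02X}h" for o in lower_threshold_offsets(rpm, 1000, 800)])
```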

| Byte | Field                             | Description                                                                                                                                                                               |
|------|-----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 04h = Fan                                                                                                                                                                                 |
| 12   | Sensor Number                     | 30h-3Fh (Chassis specific)<br>BAh-BFh (Chassis specific)                                                                                                                                  |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 01h (Threshold)</li> </ul>                                         |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 01b = Trigger reading in Event Data 2</li> <li>[5:4] - 01b = Trigger threshold in Event Data 3</li> <li>[3:0] - Event Trigger Offset as described in Table 30</li> </ul> |
| 15   | Event Data 2                      | Reading that triggered event                                                                                                                                                              |
| 16   | Event Data 3                      | Threshold value that triggered event                                                                                                                                                      |

#### Table 29: Fan Tachometer Sensors Typical Characteristics

The following table describes the severity of each of the event triggers for both assertion and deassertion.


| Event Trigger Offset (Hex) | Event Trigger Description    | Assertion Severity | Deassertion Severity | Description                                                                 | Next Steps |
|----------------------------|------------------------------|--------------------|----------------------|-----------------------------------------------------------------------------|------------|
| 00h                        | Lower non-critical going low | Degraded           | OK                   | The fan speed has dropped<br>below its lower non-critical<br>threshold.     | <ul> <li>A fan speed error on a new system build is typically not caused by the fan spinning too slowly; instead, it is caused by the fan being connected to the wrong header (the BMC expects fans on specific headers for each chassis and logs this event if there is no fan on an expected header).</li> <li>1. Refer to the <i>Quick Start Guide</i> or the <i>Service Guide</i> to identify the correct fan headers to use.</li> <li>2. Ensure the latest FRUSDR update has been run and the correct chassis is detected or selected.</li> <li>3. If you are sure this was done, the event may be a sign of impending fan failure (although this normally applies only if the system has been in use for a while). Replace the fan.</li> </ul> |
| 02h                        | Lower critical going low     | Non-fatal          | Degraded             | The fan speed has dropped<br>below its lower critical<br>threshold.         | <ul> <li>A fan speed error on a new system build is typically not caused by the fan spinning too slowly; instead, it is caused by the fan being connected to the wrong header (the BMC expects fans on specific headers for each chassis and logs this event if there is no fan on an expected header).</li> <li>1. Refer to the <i>Quick Start Guide</i> or the <i>Service Guide</i> to identify the correct fan headers to use.</li> <li>2. Ensure the latest FRUSDR update has been run and the correct chassis is detected or selected.</li> <li>3. If you are sure this was done, the event may be a sign of impending fan failure (although this normally applies only if the system has been in use for a while). Replace the fan.</li> </ul> |
                                                                                                                                                                                                                                            |  |

#### Table 30: Fan Tachometer Sensor – Event Trigger Offset – Next Steps

#### 5.1.2 Fan Presence and Redundancy Sensors

Fan presence sensors are implemented only for hot-swap fans and require an additional pin on the fan header. Fan redundancy is an aggregate of the fan presence sensors and warns when redundancy is lost. Typically, the redundancy mode on Intel<sup>®</sup> servers is n+1: if one fan fails, enough fans remain to cool the system, but it is no longer redundant. Other modes are also possible.

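The n+1 rule described above can be illustrated with a short Python sketch. The function name and its inputs are hypothetical; on a real system the BMC firmware derives the redundancy state from the fan presence sensors and its SDR configuration, not from a simple fan count.

```python
def fan_redundancy_state(working_fans: int, required_fans: int) -> str:
    """Classify cooling capacity under a simple n+1 fan redundancy scheme.

    required_fans is n, the number of fans needed to cool the chassis.
    The system is redundant only while at least one spare fan is working.
    """
    if required_fans < 1 or working_fans < 0:
        raise ValueError("need required_fans >= 1 and working_fans >= 0")
    if working_fans >= required_fans + 1:
        return "redundant"
    if working_fans == required_fans:
        # Enough fans to cool the system, but no spare remains.
        return "non-redundant, sufficient"
    # Too few fans: the system may overheat if this persists.
    return "non-redundant, insufficient"


# Example: a hypothetical chassis that needs 5 fans and has 6 installed.
print(fan_redundancy_state(6, 5))  # redundant
print(fan_redundancy_state(5, 5))  # non-redundant, sufficient
```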
| Byte | Field                             | Description                                                                                                                                                               |
|------|-----------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 04h = Fan                                                                                                                                                                 |
| 12   | Sensor Number                     | 40h-4Fh (chassis-specific)                                                                                                                                                |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 08h (Generic "digital" Discrete)</li> </ul>        |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset as described in Table 32</li> </ul> |
| 15   | Event Data 2                      | Not used                                                                                                                                                                  |
| 16   | Event Data 3                      | Not used                                                                                                                                                                  |

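The bit layout of bytes 13 and 14 in the table above can be decoded with simple masks and shifts. This Python sketch assumes the two raw byte values have already been extracted from the 16-byte SEL record (byte numbers as in the table); `EventFields` and `decode_event_bytes` are illustrative names, not part of any Intel tool.

```python
from dataclasses import dataclass


@dataclass
class EventFields:
    deassertion: bool    # byte 13, bit 7: 0b = assertion, 1b = deassertion
    event_type: int      # byte 13, bits [6:0], e.g. 08h for fan presence
    ed2_usage: int       # byte 14, bits [7:6]: how Event Data 2 is used
    ed3_usage: int       # byte 14, bits [5:4]: how Event Data 3 is used
    trigger_offset: int  # byte 14, bits [3:0]: event trigger offset


def decode_event_bytes(byte13: int, byte14: int) -> EventFields:
    """Split SEL record bytes 13 and 14 into their bit fields."""
    for value in (byte13, byte14):
        if not 0 <= value <= 0xFF:
            raise ValueError("each byte must be in the range 0x00-0xFF")
    return EventFields(
        deassertion=bool(byte13 & 0x80),
        event_type=byte13 & 0x7F,
        ed2_usage=(byte14 >> 6) & 0x3,
        ed3_usage=(byte14 >> 4) & 0x3,
        trigger_offset=byte14 & 0x0F,
    )


# Example: a deasserted fan presence event (byte 13 = 88h, byte 14 = 01h).
fields = decode_event_bytes(0x88, 0x01)
print(fields.deassertion, hex(fields.event_type), fields.trigger_offset)
```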
The following table describes the severity of each of the event triggers for both assertion and deassertion.

| Hex | Event Trigger<br>Description | Assertion<br>Severity | Deassertion<br>Severity | Description | Next Steps |
|-----|------------------------------|-----------------------|-------------------------|-------------|------------|
| 01h | Device Present | OK | Degraded | Assertion – A fan was inserted. This event may also be logged when the BMC initializes after AC power is applied. | Informational only. |
|     |                |    |          | Deassertion – A fan was removed, or was not present at the expected location when the BMC initialized. | These events are generated only in systems with hot-swappable fans, and normally only when a fan is physically inserted or removed. If no fans were physically removed:<ol><li>Use the Quick Start Guide to check whether the correct fan headers were used.</li><li>Swap the fans around to see whether the problem stays with the location or follows the fan.</li><li>Replace the fan or the fan wiring/housing, depending on the outcome of step 2.</li><li>Ensure the latest FRUSDR update has been run and the correct chassis is detected or selected.</li></ol> |

#### Table 32: Fan Presence Sensors – Event Trigger Offset – Next Steps

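The severities in Table 32 depend only on the event direction, because the fan presence sensor uses a single trigger offset (01h). The Python sketch below transcribes that table into a lookup; the function name is hypothetical, and the inputs are assumed to be the raw values of record bytes 13 and 14.

```python
from typing import Tuple

# (severity, meaning) for event type 08h, offset 01h ("Device Present"),
# keyed by whether the event is a deassertion. Transcribed from Table 32.
FAN_PRESENCE = {
    False: ("OK", "A fan was inserted (may also be logged when the BMC "
                  "initializes after AC power is applied)."),
    True: ("Degraded", "A fan was removed, or was not present at the "
                       "expected location when the BMC initialized."),
}


def interpret_fan_presence(byte13: int, byte14: int) -> Tuple[str, str]:
    """Return (severity, meaning) for a fan presence SEL event."""
    event_type = byte13 & 0x7F
    offset = byte14 & 0x0F
    if event_type != 0x08 or offset != 0x01:
        raise ValueError("not a fan presence 'Device Present' event")
    return FAN_PRESENCE[bool(byte13 & 0x80)]


severity, meaning = interpret_fan_presence(0x88, 0x01)
print(severity)  # Degraded
```

A Degraded result with no physical fan removal is the case that calls for the troubleshooting steps in Table 32.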
#### Table 33: Fan Redundancy Sensors Typical Characteristics

| Byte | Field                             | Description                                                                                                                                                               |
|------|-----------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 04h = Fan                                                                                                                                                                 |
| 12   | Sensor Number                     | 0Ch                                                                                                                                                                       |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 0Bh (Generic Discrete)</li> </ul>                  |
| 14   | Event Data 1                      | <ul> <li>[7:6] – 00b = Unspecified Event Data 2</li> <li>[5:4] – 00b = Unspecified Event Data 3</li> <li>[3:0] – Event Trigger Offset as described in Table 34</li> </ul> |
| 15   | Event Data 2                      | Not used                                                                                                                                                                  |
| 16   | Event Data 3                      | Not used                                                                                                                                                                  |

The following table describes the severity of each of the event triggers for both assertion and deassertion.

| Hex | Event Trigger Description                    | Description |
|-----|----------------------------------------------|-------------|
| 00h | Fully redundant                              |             |
| 01h | Redundancy lost                              | The system has lost one or more fans and is running in non-redundant mode. There are enough fans to keep the system properly cooled, but fan speeds will boost. |
| 02h | Redundancy degraded                          |             |
| 03h | Non-redundant, sufficient from redundant     |             |
| 04h | Non-redundant, sufficient from insufficient  |             |
| 05h | Non-redundant, insufficient                  | The system has lost fans and may no longer be able to cool itself adequately. Overheating may occur if this situation persists for an extended period. |
| 06h | Non-redundant, degraded from fully redundant | The system has lost one or more fans and is running in non-redundant mode. There are enough fans to keep the system properly cooled, but fan speeds will boost. |
| 07h | Redundant, degraded from non-redundant       | The system has lost one or more fans and is running in a degraded mode, but is still redundant. There are enough fans to keep the system properly cooled. |

**Next steps:** Fan redundancy loss indicates failure of one or more fans. Look in the SEL for lower non-critical or lower critical fan errors, or for fan removal errors, to identify which fan is causing the problem, and follow the troubleshooting steps for those event types.

#### Table 34: Fan Redundancy Sensor – Event Trigger Offset – Next Steps

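A decoder for the fan redundancy sensor only needs the offsets from Table 34 and the direction bit. The following Python sketch is illustrative (the names are hypothetical) and assumes the raw values of record bytes 13 and 14 as input.

```python
# Event trigger offsets for the fan redundancy sensor (event type 0Bh),
# transcribed from Table 34.
REDUNDANCY_OFFSETS = {
    0x00: "Fully redundant",
    0x01: "Redundancy lost",
    0x02: "Redundancy degraded",
    0x03: "Non-redundant, sufficient from redundant",
    0x04: "Non-redundant, sufficient from insufficient",
    0x05: "Non-redundant, insufficient",
    0x06: "Non-redundant, degraded from fully redundant",
    0x07: "Redundant, degraded from non-redundant",
}


def describe_redundancy_event(byte13: int, byte14: int) -> str:
    """Render a fan redundancy SEL event as 'Asserted/Deasserted: <state>'."""
    if (byte13 & 0x7F) != 0x0B:
        raise ValueError("not a redundancy (event type 0Bh) event")
    offset = byte14 & 0x0F
    name = REDUNDANCY_OFFSETS.get(offset, f"Unknown offset {offset:02X}h")
    direction = "Deasserted" if byte13 & 0x80 else "Asserted"
    return f"{direction}: {name}"


print(describe_redundancy_event(0x0B, 0x01))  # Asserted: Redundancy lost
```

An asserted 01h or 05h event is the cue to search the SEL for the individual fan events mentioned in the next steps.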
## 5.2 Temperature Sensors

A variety of temperature sensors can be implemented on Intel<sup>®</sup> Server Systems. They are divided into several types, each with its own events that can be logged:

- Threshold-based Temperature
- Thermal Margin
- Processor Thermal Control %
- Processor DTS Thermal Margin (Monitor only)
- Discrete Thermal
- DIMM Thermal Trip

#### 5.2.1 Threshold-based Temperature Sensors

Threshold-based temperature sensors report an actual temperature; they are linear, threshold-based sensors. Most Intel<sup>®</sup> Server Systems define at least a front panel temperature sensor and baseboard temperature sensors, and many other platform-specific sensors may also be defined. Most of these sensors have both upper and lower thresholds set: the upper thresholds warn of an over-temperature condition, and the lower thresholds warn of sensor failure, because a temperature sensor that stops working typically reads 0.

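The role of the upper and lower thresholds described above can be sketched in Python. The threshold values are hypothetical (the real ones are stored in the SDR). The sketch assumes a reading must be strictly above or below a threshold to trip it, and it returns only the most severe Table 36 offset that would assert.

```python
from typing import Optional


def threshold_offset(reading: float, lower_crit: float, lower_noncrit: float,
                     upper_noncrit: float, upper_crit: float) -> Optional[int]:
    """Return the most severe Table 36 trigger offset for a reading, or None."""
    if not lower_crit <= lower_noncrit <= upper_noncrit <= upper_crit:
        raise ValueError("thresholds must be ordered LC <= LNC <= UNC <= UC")
    if reading > upper_crit:
        return 0x09  # upper critical going high
    if reading > upper_noncrit:
        return 0x07  # upper non-critical going high
    if reading < lower_crit:
        return 0x02  # lower critical going low
    if reading < lower_noncrit:
        return 0x00  # lower non-critical going low
    return None      # within all thresholds


def looks_like_failed_sensor(reading: float) -> bool:
    """A reading of exactly 0 typically means the sensor stopped working."""
    return reading == 0


# Hypothetical thresholds in degrees C: LC=5, LNC=10, UNC=45, UC=55.
print(threshold_offset(50, 5, 10, 45, 55))  # 7
print(threshold_offset(0, 5, 10, 45, 55))   # 2
```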
#### Table 35: Temperature Sensors Typical Characteristics

| Byte | Field                             | Description                                                                                                                                                                               |
|------|-----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 01h = Temperature                                                                                                                                                                         |
| 12   | Sensor Number                     | See Table 37                                                                                                                                                                              |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 01h (Threshold)</li> </ul>                                         |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 01b = Trigger reading in Event Data 2</li> <li>[5:4] - 01b = Trigger threshold in Event Data 3</li> <li>[3:0] - Event Trigger Offset as described in Table 36</li> </ul> |
| 15   | Event Data 2                      | Reading that triggered event                                                                                                                                                              |
| 16   | Event Data 3                      | Threshold value that triggered event                                                                                                                                                      |

#### Table 36: Temperature Sensors Event Triggers – Description

| Hex | Event Trigger<br>Description  | Assertion<br>Severity | Deassertion<br>Severity | Description                                                         |
|-----|-------------------------------|-----------------------|-------------------------|---------------------------------------------------------------------|
| 00h | Lower non-critical going low  | Degraded              | OK                      | The temperature has dropped below its lower non-critical threshold. |
| 02h | Lower critical going low      | non-fatal             | Degraded                | The temperature has dropped below its lower critical threshold.     |
| 07h | Upper non-critical going high | Degraded              | OK                      | The temperature has gone over its upper non-critical threshold.     |
| 09h | Upper critical<br>going high  | non-fatal             | Degraded                | The temperature has gone over its upper critical threshold.         |

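Putting Table 35 and Table 36 together, a complete threshold event can be decoded from record bytes 13 through 16. This Python sketch is illustrative, with hypothetical names. It returns the raw Event Data 2 and 3 bytes unchanged, because converting them to degrees requires the sensor's conversion factors from its SDR, which the sketch does not model.

```python
from typing import NamedTuple, Optional

# offset -> (event trigger, assertion severity, deassertion severity),
# transcribed from Table 36.
TEMP_TRIGGERS = {
    0x00: ("Lower non-critical going low", "Degraded", "OK"),
    0x02: ("Lower critical going low", "non-fatal", "Degraded"),
    0x07: ("Upper non-critical going high", "Degraded", "OK"),
    0x09: ("Upper critical going high", "non-fatal", "Degraded"),
}


class ThresholdEvent(NamedTuple):
    trigger: str
    severity: str
    raw_reading: Optional[int]    # Event Data 2, if flagged in bits [7:6]
    raw_threshold: Optional[int]  # Event Data 3, if flagged in bits [5:4]


def decode_threshold_event(b13: int, b14: int, b15: int, b16: int) -> ThresholdEvent:
    if (b13 & 0x7F) != 0x01:
        raise ValueError("not a threshold (event type 01h) event")
    offset = b14 & 0x0F
    if offset not in TEMP_TRIGGERS:
        raise ValueError(f"offset {offset:02X}h is not listed in Table 36")
    trigger, assert_sev, deassert_sev = TEMP_TRIGGERS[offset]
    severity = deassert_sev if b13 & 0x80 else assert_sev
    reading = b15 if ((b14 >> 6) & 0x3) == 0b01 else None
    threshold = b16 if ((b14 >> 4) & 0x3) == 0b01 else None
    return ThresholdEvent(trigger, severity, reading, threshold)


event = decode_threshold_event(0x01, 0x57, 0x30, 0x2D)
print(event.trigger, event.severity)  # Upper non-critical going high Degraded
```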
#### Table 37: Temperature Sensors – Next Steps

| Sensor<br>Number | Sensor Name | Next Steps |
|------------------|-------------|------------|
| 21h | Front Panel Temp | If the front panel temperature reads zero, check that:<ol><li>The sensor is connected properly.</li><li>The SDR has been programmed correctly for your chassis.</li></ol>If the front panel temperature is too high:<ol><li>Check the cooling of your server room.</li></ol> |
| 14h<br>15h<br>16h<br>17h<br>18h<br>20h<br>22h<br>23h<br>24h<br>25h<br>26h<br>27h<br>28h<br>2Ch<br>2Dh<br>2Eh<br>2Fh | Baseboard Temperature 5<br>Baseboard Temperature 6<br>I/O Mod2 Temp<br>PCI Riser 5 Temp<br>PCI Riser 4 Temp<br>Baseboard Temperature 1<br>SSB Temperature<br>Baseboard Temperature 2<br>Baseboard Temperature 3<br>Baseboard Temperature 4<br>I/O Mod Temp<br>PCI Riser 1 Temp<br>IO Riser Temp<br>PCI Riser 2 Temp<br>SAS Mod Temp<br>Exit Air Temp<br>LAN NIC Temp | <ol><li>Check for clear and unobstructed airflow into and out of the chassis.</li><li>Ensure the SDR is programmed and the correct chassis has been selected.</li><li>Ensure there are no fan failures.</li><li>Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).</li></ol> |

#### 5.2.2 Thermal Margin Sensors

Thermal margin sensors are also linear sensors, but they typically report a negative value. This value is not an actual temperature but an offset from a critical temperature: it is the number of degrees below the critical temperature for the particular component.

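Because a margin is normally negative, a raw 8-bit reading has to be interpreted as a signed value. The Python sketch below assumes, hypothetically, that the sensor's SDR describes a two's-complement reading with 1°C per count and no offset. It also uses an illustrative critical temperature of 95°C, which is not taken from this guide; always check the actual SDR and component specification.

```python
def raw_to_margin(raw: int) -> int:
    """Interpret an 8-bit raw reading as a signed two's-complement margin.

    Assumes 1 degree C per count and no offset (hypothetical SDR settings).
    """
    if not 0 <= raw <= 0xFF:
        raise ValueError("raw reading must be in the range 0x00-0xFF")
    return raw - 0x100 if raw & 0x80 else raw


def margin_to_temperature(margin: int, critical_temp: int) -> int:
    """Recover the component temperature from its margin to critical."""
    return critical_temp + margin


margin = raw_to_margin(0xEC)               # 20 degrees below critical
print(margin)                              # -20
print(margin_to_temperature(margin, 95))   # 75
```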
The BMC supports DIMM aggregate temperature margin IPMI sensors. The readings from the physical temperature sensor on each DIMM (the Temperature Sensor on DIMM, or TSOD) are aggregated into IPMI temperature margin sensors for groups of DIMM slots. The grouping is platform- and SKU-specific and generally corresponds to the fan domains.

The BMC also supports global aggregate temperature margin IPMI sensors. There may be as many unique global aggregate sensors as there are fan domains. Each sensor aggregates the readings of multiple other IPMI temperature sensors supported by the BMC firmware, and the mapping of child sensors into each global aggregate sensor is configurable in the SDR. The primary use of these sensors is to trigger turning off fans when a lower threshold is reached.

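The aggregation described above can be sketched in Python. This guide does not state how the BMC combines child readings. The sketch assumes the aggregate reports the worst case, meaning the margin closest to (or above) zero. The `AGGREGATE_CHILDREN` mapping is a hypothetical stand-in for the SDR-configured mapping: the sensor numbers come from Table 40, but the grouping is invented for illustration.

```python
from typing import Dict, List

# Hypothetical child-sensor mapping; on a real system this comes from the SDR.
AGGREGATE_CHILDREN: Dict[int, List[int]] = {
    0xC8: [0xB0, 0xB1, 0x74],  # Agg Therm Mrgn 1 <- P1 DIMM margins, P1 margin
    0xC9: [0xB2, 0xB3, 0x75],  # Agg Therm Mrgn 2 <- P2 DIMM margins, P2 margin
}


def aggregate_margin(agg_sensor: int, margins: Dict[int, int]) -> int:
    """Worst-case (largest) margin among the children that have readings."""
    children = AGGREGATE_CHILDREN[agg_sensor]
    readings = [margins[c] for c in children if c in margins]
    if not readings:
        raise ValueError(f"no child readings for sensor {agg_sensor:02X}h")
    return max(readings)


current = {0xB0: -30, 0xB1: -12, 0x74: -25, 0xB2: -40, 0xB3: -35, 0x75: -20}
print(aggregate_margin(0xC8, current))  # -12
```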
#### Table 38: Thermal Margin Sensors Typical Characteristics

| Byte | Field                             | Description                                                                                                                                                                         |
|------|-----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 01h = Temperature                                                                                                                                                                   |
| 12   | Sensor Number                     | See Table 40                                                                                                                                                                        |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 01h (Threshold)</li> </ul>                                   |
| 14   | Event Data 1                      | <ul> <li>[7:6] – 01b = Trigger reading in Event Data 2</li> <li>[5:4] – 01b = Trigger threshold in Event Data 3</li> <li>[3:0] – Event Triggers as described in Table 39</li> </ul> |
| 15   | Event Data 2                      | Reading that triggered event                                                                                                                                                        |
| 16   | Event Data 3                      | Threshold value that triggered event                                                                                                                                                |

#### Table 39: Thermal Margin Sensors Event Triggers – Description

| Hex | Event Trigger<br>Description  | Assertion<br>Severity | Deassertion<br>Severity | Description                                                        |
|-----|-------------------------------|-----------------------|-------------------------|--------------------------------------------------------------------|
| 07h | Upper non-critical going high | Degraded              | OK                      | The thermal margin has gone over its upper non-critical threshold. |
| 09h | Upper critical<br>going high  | non-fatal             | Degraded                | The thermal margin has gone over its upper critical threshold.     |

#### Table 40: Thermal Margin Sensors – Next Steps

| Sensor<br>Number | Sensor Name | Next Steps |
|------------------|-------------|------------|
| 74h<br>75h<br>76h<br>77h | P1 Therm Margin<br>P2 Therm Margin<br>P3 Therm Margin<br>P4 Therm Margin | Not a logged SEL event. The sensor is used for thermal management of the processor. |
| B0h<br>B1h<br>B2h<br>B3h<br>B4h<br>B5h<br>B6h<br>B7h<br>C8h<br>C9h<br>CAh<br>CBh<br>CCh<br>CDh<br>CEh<br>CFh | P1 DIMM Thrm Mrgn1<br>P1 DIMM Thrm Mrgn2<br>P2 DIMM Thrm Mrgn1<br>P2 DIMM Thrm Mrgn2<br>P3 DIMM Thrm Mrgn1<br>P3 DIMM Thrm Mrgn2<br>P4 DIMM Thrm Mrgn1<br>P4 DIMM Thrm Mrgn2<br>Agg Therm Mrgn 1<br>Agg Therm Mrgn 2<br>Agg Therm Mrgn 3<br>Agg Therm Mrgn 4<br>Agg Therm Mrgn 5<br>Agg Therm Mrgn 6<br>Agg Therm Mrgn 7<br>Agg Therm Mrgn 8 | <ol><li>Check for clear and unobstructed airflow into and out of the chassis.</li><li>Ensure the SDR is programmed and the correct chassis has been selected.</li><li>Ensure there are no fan failures.</li><li>Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).</li></ol> |

#### 5.2.3 Processor Thermal Control Sensors

The BMC FW monitors the percentage of time that a processor has been operationally constrained over a given time window (nominally six seconds) because its internal thermal management algorithms engaged to reduce the temperature of the device. This monitoring is implemented as one IPMI analog/threshold sensor per processor package.

If the cause of this thermal constraint is not addressed, the processor will overheat and shut down the system to protect itself from damage.

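The "percentage of time constrained over a window" idea can be illustrated with a sliding window of samples. This Python sketch is not the BMC firmware's algorithm. It assumes, hypothetically, that one boolean "throttled" sample arrives every `sample_period_s` seconds. The default window of 6 seconds matches the nominal value above.

```python
from collections import deque


class ThermalControlWindow:
    """Percentage of recent samples in which the processor was throttled."""

    def __init__(self, window_s: float = 6.0, sample_period_s: float = 0.5):
        if window_s <= 0 or sample_period_s <= 0:
            raise ValueError("window and sample period must be positive")
        # Oldest samples fall out automatically once the window is full.
        self._samples = deque(maxlen=max(1, round(window_s / sample_period_s)))

    def add_sample(self, throttled: bool) -> None:
        self._samples.append(throttled)

    def percent(self) -> float:
        if not self._samples:
            return 0.0
        return 100.0 * sum(self._samples) / len(self._samples)


window = ThermalControlWindow()  # 12 samples of 0.5 s each
for throttled in [True] * 3 + [False] * 9:
    window.add_sample(throttled)
print(window.percent())  # 25.0
```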
#### Table 41: Processor Thermal Control Sensors Typical Characteristics

| Byte | Field                             | Description                                                                                                                                                                         |
|------|-----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 01h = Temperature                                                                                                                                                                   |
| 12   | Sensor Number                     | 78h = Processor 1 Thermal Control %<br>79h = Processor 2 Thermal Control %<br>7Ah = Processor 3 Thermal Control %<br>7Bh = Processor 4 Thermal Control %                            |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 01h (Threshold)</li> </ul>                                   |
| 14   | Event Data 1                      | <ul> <li>[7:6] – 01b = Trigger reading in Event Data 2</li> <li>[5:4] – 01b = Trigger threshold in Event Data 3</li> <li>[3:0] – Event Triggers as described in Table 42</li> </ul> |
| 15   | Event Data 2                      | Reading that triggered event                                                                                                                                                        |
| 16   | Event Data 3                      | Threshold value that triggered event                                                                                                                                                |

#### Table 42: Processor Thermal Control Sensors Event Triggers – Description

| Event Trigger (Hex) | Event Trigger Description        | Assertion Severity | Deassertion Severity | Description                                                                                |
|---------------------|----------------------------------|--------------------|----------------------|--------------------------------------------------------------------------------------------|
| 07h                 | Upper non-critical<br>going high | Degraded           | OK                   | The processor thermal control percentage has gone over its upper non-critical threshold. |
| 09h                 | Upper critical<br>going high     | Non-fatal          | Degraded             | The processor thermal control percentage has gone over its upper critical threshold.     |
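As an illustration of how the fields in Tables 41 and 42 fit together, the following Python sketch decodes bytes 11 through 16 of a Processor Thermal Control % event. It assumes the 16-byte SEL record is already available as raw `bytes`, with table byte N at index N - 1; the function and class names are hypothetical. Event Data 2 and 3 are returned as raw values, because converting them to a percentage needs the sensor's SDR conversion factors, which this sketch does not model.

```python
from dataclasses import dataclass
from typing import Optional

# Sensor numbers 78h-7Bh map to processors 1-4 (Table 41).
THERMAL_CONTROL_SENSORS = {0x78: 1, 0x79: 2, 0x7A: 3, 0x7B: 4}

# Event trigger offset -> (trigger, assertion severity, deassertion severity) (Table 42).
THRESHOLD_TRIGGERS = {
    0x07: ("Upper non-critical going high", "Degraded", "OK"),
    0x09: ("Upper critical going high", "Non-fatal", "Degraded"),
}


@dataclass
class ThermalControlEvent:
    processor: int
    asserted: bool
    trigger: str
    severity: str
    raw_reading: Optional[int]    # Event Data 2, present when Event Data 1 [7:6] = 01b
    raw_threshold: Optional[int]  # Event Data 3, present when Event Data 1 [5:4] = 01b


def decode_thermal_control_event(record: bytes) -> ThermalControlEvent:
    """Decode a Processor Thermal Control % SEL record (table byte N is record[N - 1])."""
    if len(record) != 16:
        raise ValueError("expected a 16-byte SEL record")
    sensor_type, sensor_num, dir_type, ed1, ed2, ed3 = record[10:16]
    if sensor_type != 0x01 or sensor_num not in THERMAL_CONTROL_SENSORS:
        raise ValueError("not a Processor Thermal Control % event")
    if (dir_type & 0x7F) != 0x01:
        raise ValueError("expected event type 01h (Threshold)")
    trigger = THRESHOLD_TRIGGERS.get(ed1 & 0x0F)
    if trigger is None:
        raise ValueError(f"unexpected event trigger {ed1 & 0x0F:02X}h")
    name, assert_severity, deassert_severity = trigger
    asserted = (dir_type & 0x80) == 0  # bit 7: 0b = assertion, 1b = deassertion
    return ThermalControlEvent(
        processor=THERMAL_CONTROL_SENSORS[sensor_num],
        asserted=asserted,
        trigger=name,
        severity=assert_severity if asserted else deassert_severity,
        raw_reading=ed2 if ((ed1 >> 6) & 0b11) == 0b01 else None,
        raw_threshold=ed3 if ((ed1 >> 4) & 0b11) == 0b01 else None,
    )
```

For example, a record whose bytes 11 through 16 are `01 79 01 57 60 5A` decodes as an assertion of "Upper non-critical going high" (Degraded) on processor 2.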

#### 5.2.3.1 Processor Thermal Control % Sensors – Next Steps

These events normally occur due to failures of the thermal solution:

1. Verify the heatsink is properly attached and has thermal grease.
2. If the system has a heatsink fan, ensure the fan is spinning.
3. Check that all system fans are operating properly.
4. Check that the air used to cool the system is within limits (typically below 35°C).

## 5.2.4 Processor DTS Thermal Margin Sensors

Intel<sup>®</sup> Xeon<sup>®</sup> processor E5-4600/2600/2400/1600 v2 product families incorporate a DTS-based thermal specification. This allows much more accurate control of the thermal solution and enables lower fan speeds and lower fan power consumption. Intel<sup>®</sup> Xeon<sup>®</sup> processor E5-4600/2600/2400/1600 v2 product families are the follow-on processors to Intel<sup>®</sup> Xeon<sup>®</sup> processor E5-4600/2600/2400/1600 product families. For Intel<sup>®</sup> Xeon<sup>®</sup> processor E5-4600/2600/2400/1600 product families, significant BMC FW calculations are required to derive the sensor value. For Intel<sup>®</sup> Xeon<sup>®</sup> processor E5-4600/2600/2400/1600 v2 product families.

The main usage of this sensor is as an input to the BMC's fan control algorithms. The BMC implements this as a threshold sensor. There is one DTS sensor for each installed physical processor package. Thresholds are not set and alert generation is not enabled for these sensors.

| Byte | Field                             | Description                                                                                                                                                  |
|------|-----------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 01h = Temperature                                                                                                                                            |
| 12   | Sensor Number                     | 83h = Processor 1 DTS Thermal Margin<br>84h = Processor 2 DTS Thermal Margin<br>85h = Processor 3 DTS Thermal Margin<br>86h = Processor 4 DTS Thermal Margin |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 01h (Threshold)</li> </ul>            |

#### Table 43: Processor DTS Thermal Margin Sensors Typical Characteristics

## 5.2.5 Discrete Thermal Sensors

Discrete thermal sensors do not report a temperature at all; instead, they report an overheating event of some kind. Examples include VRD Hot (a voltage regulator is overheating) and processor Thermal Trip (the processor became so hot that its over-temperature protection was triggered and the system was shut down to prevent damage).

| Byte | Field                             | Description                                                                                                                                                               |
|------|-----------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 01h = Temperature                                                                                                                                                         |
| 12   | Sensor Number                     | See Table 45                                                                                                                                                              |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = See Table 45</li> </ul>                            |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset as described in Table 45</li> </ul> |
| 15   | Event Data 2                      | Not used                                                                                                                                                                  |
| 16   | Event Data 3                      | Not used                                                                                                                                                                  |

Table 44: Discrete Thermal Sensors Typical Characteristics

Table 45: Discrete Thermal Sensors – Next Steps

| Sensor Number | Sensor Name      | Event Type | Event Trigger Offset (Hex) | Event Trigger Description | Description                                          |
|---------------|------------------|------------|----------------------------|---------------------------|------------------------------------------------------|
| 0Dh           | SSB Thermal Trip | 03h        | 01h                        | State Asserted            | South Side Bridge (SSB) overheated                   |
| 90h           | P1 VRD Hot       | 05h        | 01h                        | Limit Exceeded            | Processor 1 voltage regulator overheated             |
| 91h           | P2 VRD Hot       | 05h        | 01h                        | Limit Exceeded            | Processor 2 voltage regulator overheated             |
| 92h           | P3 VRD Hot       | 05h        | 01h                        | Limit Exceeded            | Processor 3 voltage regulator overheated             |
| 93h           | P4 VRD Hot       | 05h        | 01h                        | Limit Exceeded            | Processor 4 voltage regulator overheated             |
| 94h           | P1 Mem01 VRD Hot | 05h        | 01h                        | Limit Exceeded            | Processor 1 Memory 0/1 voltage regulator overheated  |
| 95h           | P1 Mem23 VRD Hot | 05h        | 01h                        | Limit Exceeded            | Processor 1 Memory 2/3 voltage regulator overheated  |
| 96h           | P2 Mem01 VRD Hot | 05h        | 01h                        | Limit Exceeded            | Processor 2 Memory 0/1 voltage regulator overheated  |
| 97h           | P2 Mem23 VRD Hot | 05h        | 01h                        | Limit Exceeded            | Processor 2 Memory 2/3 voltage regulator overheated  |
| 98h           | P3 Mem01 VRD Hot | 05h        | 01h                        | Limit Exceeded            | Processor 3 Memory 0/1 voltage regulator overheated  |
| 99h           | P3 Mem23 VRD Hot | 05h        | 01h                        | Limit Exceeded            | Processor 3 Memory 2/3 voltage regulator overheated  |
| 9Ah           | P4 Mem01 VRD Hot | 05h        | 01h                        | Limit Exceeded            | Processor 4 Memory 0/1 voltage regulator overheated  |
| 9Bh           | P4 Mem23 VRD Hot | 05h        | 01h                        | Limit Exceeded            | Processor 4 Memory 2/3 voltage regulator overheated  |

Next steps for all sensors in Table 45:

1. Check for clear and unobstructed airflow into and out of the chassis.
2. Ensure the SDR is programmed and the correct chassis has been selected.
3. Ensure there are no fan failures.
4. Ensure the air used for cooling the system is within the thermal specifications for the system (typically below 35°C).
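A lookup-based sketch of Table 45 in Python follows, assuming a raw 16-byte SEL record with table byte N at index N - 1 (the names are hypothetical). Sensor numbers are listed in their numeric sequence, so 99h is the Processor 3 Memory 2/3 regulator, as its description states. Because Event Data 2 and 3 are unused for these sensors, only bytes 11 through 14 are read.

```python
from typing import Tuple

# Sensor number -> sensor name (Table 45).
DISCRETE_THERMAL_SENSORS = {
    0x0D: "SSB Thermal Trip",
    0x90: "P1 VRD Hot", 0x91: "P2 VRD Hot", 0x92: "P3 VRD Hot", 0x93: "P4 VRD Hot",
    0x94: "P1 Mem01 VRD Hot", 0x95: "P1 Mem23 VRD Hot",
    0x96: "P2 Mem01 VRD Hot", 0x97: "P2 Mem23 VRD Hot",
    0x98: "P3 Mem01 VRD Hot", 0x99: "P3 Mem23 VRD Hot",
    0x9A: "P4 Mem01 VRD Hot", 0x9B: "P4 Mem23 VRD Hot",
}

# (event type, event trigger offset) -> trigger description (Table 45).
DISCRETE_TRIGGERS = {(0x03, 0x01): "State Asserted", (0x05, 0x01): "Limit Exceeded"}


def decode_discrete_thermal(record: bytes) -> Tuple[str, str, bool]:
    """Return (sensor name, trigger description, asserted) for a 16-byte SEL record."""
    sensor_type, sensor_num, dir_type, ed1 = record[10:14]
    if sensor_type != 0x01 or sensor_num not in DISCRETE_THERMAL_SENSORS:
        raise ValueError("not a discrete thermal sensor event")
    key = (dir_type & 0x7F, ed1 & 0x0F)  # event type [6:0], trigger offset [3:0]
    trigger = DISCRETE_TRIGGERS.get(key, "Unrecognized trigger %02Xh/%02Xh" % key)
    asserted = (dir_type & 0x80) == 0  # bit 7: 0b = assertion
    return DISCRETE_THERMAL_SENSORS[sensor_num], trigger, asserted
```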

## 5.2.6 DIMM Thermal Trip Sensors

The BMC supports DIMM Thermal Trip monitoring that is instantiated as one aggregate IPMI discrete sensor per CPU. When a DIMM Thermal Trip occurs, the system hardware will automatically power down the server and the BMC will assert the sensor offset and log an event.

| Byte | Field                             | Description                                                                                                                                              |
|------|-----------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 0Ch = Memory                                                                                                                                             |
| 12   | Sensor Number                     | C0h = Processor 1 DIMM Thermal Trip<br>C1h = Processor 2 DIMM Thermal Trip<br>C2h = Processor 3 DIMM Thermal Trip<br>C3h = Processor 4 DIMM Thermal Trip |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 6Fh (Sensor Specific)</li> </ul>  |
| 14   | Event Data 1                      | [7:6] – 00b = Unspecified Event Data 2<br>[5:4] – 00b = Unspecified Event Data 3<br>[3:0] – Event Trigger Offset = 0Ah = Critical over temperature       |
| 15   | Event Data 2                      | Not used                                                                                                                                                 |
| 16   | Event Data 3                      | Not used                                                                                                                                                 |

#### Table 46: DIMM Thermal Trip Typical Characteristics
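Because the DIMM Thermal Trip sensor is an aggregate per processor, an event identifies which processor's memory tripped but not which DIMM. The Python sketch below, with hypothetical names and a raw 16-byte record (table byte N at index N - 1), returns that processor number for a DIMM Thermal Trip assertion and `None` for anything else.

```python
from typing import Optional

# Aggregate DIMM Thermal Trip sensors, one per processor (Table 46).
DIMM_THERMAL_TRIP_SENSORS = {0xC0: 1, 0xC1: 2, 0xC2: 3, 0xC3: 4}
CRITICAL_OVER_TEMPERATURE = 0x0A  # Event Data 1 [3:0]


def dimm_thermal_trip_processor(record: bytes) -> Optional[int]:
    """Return the processor (1-4) whose DIMMs tripped, or None if the record
    is not a DIMM Thermal Trip assertion."""
    sensor_type, sensor_num, dir_type, ed1 = record[10:14]
    if sensor_type != 0x0C or sensor_num not in DIMM_THERMAL_TRIP_SENSORS:
        return None
    # 6Fh = assertion (bit 7 clear) of the sensor-specific event type.
    if dir_type != 0x6F or (ed1 & 0x0F) != CRITICAL_OVER_TEMPERATURE:
        return None
    return DIMM_THERMAL_TRIP_SENSORS[sensor_num]
```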

#### 5.2.6.1 DIMM Thermal Trip Sensors – Next Steps

1. Check for clear and unobstructed airflow into and out of the chassis.
2. Ensure the SDR is programmed and the correct chassis has been selected.
3. Ensure there are no fan failures.
4. Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).

## 5.3 System Air Flow Monitoring Sensor

The BMC provides an IPMI sensor that reports the volumetric system airflow in CFM (cubic feet per minute). The airflow is calculated from the system fan Pulse Width Modulation (PWM) values. Which PWM output or outputs are used to determine the CFM value is SDR-configurable, and the relationship between PWM and CFM is defined by a lookup table in an OEM SDR.
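The PWM-to-CFM lookup can be pictured with a short Python sketch. It rests on several assumptions that the text above does not confirm: the table values are invented placeholders (the real table lives in the platform's OEM SDR), a single PWM input is used, and readings between table points are linearly interpolated. The names are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical (PWM duty %, CFM) points; real values come from the OEM SDR.
EXAMPLE_PWM_TO_CFM: List[Tuple[float, float]] = [(20, 15.0), (50, 40.0), (80, 70.0), (100, 90.0)]


def airflow_cfm(pwm: float, table: List[Tuple[float, float]] = EXAMPLE_PWM_TO_CFM) -> float:
    """Estimate airflow by linear interpolation, clamped to the table's end points."""
    pwms = [p for p, _ in table]
    if pwm <= pwms[0]:
        return table[0][1]
    if pwm >= pwms[-1]:
        return table[-1][1]
    i = bisect_right(pwms, pwm)  # table[i - 1] PWM <= pwm < table[i] PWM
    (p0, c0), (p1, c1) = table[i - 1], table[i]
    return c0 + (c1 - c0) * (pwm - p0) / (p1 - p0)
```

With these placeholder points, a 35% PWM falls halfway between the 20% and 50% entries and yields 27.5 CFM.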

The airflow data is used in the calculation for exit air temperature monitoring. It is exposed as an IPMI sensor to allow a data center management application to access this data for use in rack-level thermal management.

This sensor is informational only and will not log events into the SEL.

# 6. Processor Subsystem

Intel<sup>®</sup> servers report multiple processor-centric sensors in the SEL.

## 6.1 Processor Status Sensor

The BMC provides an IPMI sensor of type processor for monitoring status information for each processor slot. If an event state (sensor offset) has been asserted, it remains asserted until one of the following happens:

- A Re-arm Sensor Events command is executed for the processor status sensor.
- AC or DC power cycle, system reset, or system boot occurs.

CPU presence status is not saved across AC power cycles; therefore, no deassertion event is generated after AC power is cycled.

| Byte | Field                             | Description                                                                                                                                                               |
|------|-----------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 07h = Processor                                                                                                                                                           |
| 12   | Sensor Number                     | 70h = Processor 1 Status<br>71h = Processor 2 Status<br>72h = Processor 3 Status<br>73h = Processor 4 Status                                                              |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 6Fh (Sensor Specific)</li> </ul>                   |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset as described in Table 48</li> </ul> |
| 15   | Event Data 2                      | Not used                                                                                                                                                                  |
| 16   | Event Data 3                      | Not used                                                                                                                                                                  |

Table 47: Processor Status Sensors Typical Characteristics

| Event Trigger<br>Offset | Processor Status                                                   | Next Steps                                                                                                                                                                                                                                                                                                                                                                               |
|-------------------------|--------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0h                      | Internal error (IERR)                                              | <ol> <li>Cross test the processors.</li> <li>Replace the processors depending on the results of the test.</li> </ol>                                                                                                                                                                                                                                                                     |
| 1h                      | Thermal trip                                                       | <ol> <li>This event normally only happens due to failures of the thermal solution:</li> <li>Verify heatsink is properly attached and has thermal grease.</li> <li>If the system has a heatsink fan, ensure the fan is spinning.</li> <li>Check all system fans are operating properly.</li> <li>Check that the air used to cool the system is within limits (typically 35°C).</li> </ol> |
| 2h                      | FRB1/BIST failure                                                  | 1. Cross test the processors.                                                                                                                                                                                                                                                                                                                                                            |
| 3h                      | FRB2/Hang in POST failure                                          | 2. Replace the processors depending on the results of the test.                                                                                                                                                                                                                                                                                                                          |
| 4h                      | FRB3/Processor startup/initialization failure (CPU fails to start) |                                                                                                                                                                                                                                                                                                                                                                                          |
| 5h                      | Configuration error (for DMI)                                      |                                                                                                                                                                                                                                                                                                                                                                                          |
| 6h                      | SM BIOS uncorrectable CPU-complex error                            |                                                                                                                                                                                                                                                                                                                                                                                          |
| 7h                      | Processor presence detected                                        | Informational Event                                                                                                                                                                                                                                                                                                                                                                      |
| 8h                      | Processor disabled                                                 | 1. Cross test the processors.                                                                                                                                                                                                                                                                                                                                                            |
| 9h                      | Terminator presence detected                                       | 2. Replace the processors depending on the results of the test.                                                                                                                                                                                                                                                                                                                          |

#### Table 48: Processor Status Sensors – Next Steps
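The Table 48 offsets can be turned into a small decoder. This Python sketch assumes a raw 16-byte SEL record (table byte N at index N - 1); the names, including the `needs_action` flag, are hypothetical, and the flag is a simplification that treats every asserted offset except 7h (listed as an informational event) as needing follow-up.

```python
from typing import NamedTuple

PROCESSOR_STATUS_SENSORS = {0x70: 1, 0x71: 2, 0x72: 3, 0x73: 4}  # Table 47

# Event trigger offset -> processor status (Table 48).
PROCESSOR_STATUS_OFFSETS = {
    0x0: "Internal error (IERR)",
    0x1: "Thermal trip",
    0x2: "FRB1/BIST failure",
    0x3: "FRB2/Hang in POST failure",
    0x4: "FRB3/Processor startup/initialization failure (CPU fails to start)",
    0x5: "Configuration error (for DMI)",
    0x6: "SM BIOS uncorrectable CPU-complex error",
    0x7: "Processor presence detected",
    0x8: "Processor disabled",
    0x9: "Terminator presence detected",
}
INFORMATIONAL_OFFSETS = {0x7}


class ProcessorStatusEvent(NamedTuple):
    processor: int
    status: str
    asserted: bool
    needs_action: bool


def decode_processor_status(record: bytes) -> ProcessorStatusEvent:
    """Decode a Processor Status sensor event from a 16-byte SEL record."""
    sensor_type, sensor_num, dir_type, ed1 = record[10:14]
    if sensor_type != 0x07 or sensor_num not in PROCESSOR_STATUS_SENSORS:
        raise ValueError("not a Processor Status sensor event")
    if (dir_type & 0x7F) != 0x6F:
        raise ValueError("expected event type 6Fh (Sensor Specific)")
    offset = ed1 & 0x0F
    status = PROCESSOR_STATUS_OFFSETS.get(offset, f"Unrecognized offset {offset:X}h")
    asserted = (dir_type & 0x80) == 0  # bit 7: 0b = assertion
    return ProcessorStatusEvent(
        processor=PROCESSOR_STATUS_SENSORS[sensor_num],
        status=status,
        asserted=asserted,
        needs_action=asserted and offset not in INFORMATIONAL_OFFSETS,
    )
```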

## 6.2 Catastrophic Error Sensor

When the Catastrophic Error signal (CATERR#) stays asserted, it is a sign that something serious has gone wrong in the hardware. The BMC monitors this signal and reports when it stays asserted.

| Byte | Field                             | Description                                                                                                                                                                                      |
|------|-----------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 07h = Processor                                                                                                                                                                                  |
| 12   | Sensor Number                     | 80h                                                                                                                                                                                              |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 03h (Digital Discrete)</li> </ul>                                         |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 10b = OEM code in Event Data 2</li> <li>[5:4] - 10b = OEM code in Event Data 3</li> <li>[3:0] - Event Trigger Offset = 1h (State Asserted)</li> </ul>                           |
| 15   | Event Data 2                      | Event Data 2 values as described in Table 50.                                                                                                                                                    |
| 16   | Event Data 3                      | Bitmap of the CPU that causes the system CATERR.<br>[0]: CPU1<br>[1]: CPU2<br>[2]: CPU3<br>[3]: CPU4<br>Note: If more than one bit is set, the BMC cannot<br>determine the source of the CATERR. |

Table 49: Catastrophic Error Sensor Typical Characteristics

#### Table 50: Catastrophic Error Sensor – Event Data 2 Values – Next Steps

| ED2                                                                                                              | Description | Next Steps                                                                                                           |  |
|------------------------------------------------------------------------------------------------------------------|-------------|----------------------------------------------------------------------------------------------------------------------|--|
| 00h         Unknown         1. Cross test the processors.           2.         Replace the processors dependence |             | <ol> <li>Cross test the processors.</li> <li>Replace the processors depending on the results of the test.</li> </ol> |  |

Processor Subsystem

System Event Log Troubleshooting Guide for EPSD Platforms Based on Intel<sup>®</sup> Xeon<sup>®</sup> Processor E5 4600/2600/2400/1600/1400 Product Families

| ED2 | Description       | Next Steps                                                                                                                                                                                                                                                                                                                  |  |  |
|-----|-------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|
| 01h | CATERR            | <ul> <li>This error is typically caused by other platform components.</li> <li>Check for other errors near the time of the CATERR event.</li> <li>Verify all peripherals are plugged in and operating correctly, particularly Hard Drives, Optical Drives, and I/O.</li> <li>Update system firmware and drivers.</li> </ul> |  |  |
| 2h  | CPU Core<br>Error | <ol> <li>Cross test the processors.</li> <li>Replace the processors depending on the results of the test.</li> </ol>                                                                                                                                                                                                        |  |  |
| 3h  | MSID<br>Mismatch  | Verify the processor is supported by your baseboard. Check your boards <i>Technical Product Specification</i> (TPS).                                                                                                                                                                                                        |  |  |
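The two OEM data bytes of a Catastrophic Error event can be decoded together. In this Python sketch (hypothetical names; raw 16-byte record with table byte N at index N - 1), Event Data 2 is mapped through Table 50 and the Event Data 3 bitmap is expanded into the flagged CPUs. The third return value is `False` whenever the bitmap does not flag exactly one CPU, reflecting the note that the BMC cannot determine the source when more than one bit is set.

```python
from typing import List, Tuple

# Event Data 2 values (Table 50).
CATERR_ED2 = {0x00: "Unknown", 0x01: "CATERR", 0x02: "CPU Core Error", 0x03: "MSID Mismatch"}


def decode_catastrophic_error(record: bytes) -> Tuple[str, List[int], bool]:
    """Return (ED2 description, CPUs flagged in ED3, single source identified)."""
    sensor_type, sensor_num, _dir_type, _ed1, ed2, ed3 = record[10:16]
    if sensor_type != 0x07 or sensor_num != 0x80:
        raise ValueError("not a Catastrophic Error sensor event")
    description = CATERR_ED2.get(ed2, f"Unrecognized ({ed2:02X}h)")
    # ED3 bits [0]..[3] flag CPU1..CPU4 (Table 49).
    cpus = [bit + 1 for bit in range(4) if ed3 & (1 << bit)]
    return description, cpus, len(cpus) == 1
```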

## 6.3 CPU Missing Sensor

The CPU Missing sensor is a discrete sensor that reports a processor that is not installed. The most common cause of this event is a processor installed in the wrong socket.

| Byte | Field                             | Description                                                                                                                                                            |
|------|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 07h = Processor                                                                                                                                                        |
| 12   | Sensor Number                     | 82h                                                                                                                                                                    |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 6Fh (Sensor Specific)</li> </ul>                |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset = 1h (State Asserted)</li> </ul> |
| 15   | Event Data 2                      | Not used                                                                                                                                                               |
| 16   | Event Data 3                      | Not used                                                                                                                                                               |

Table 51: CPU Missing Sensor Typical Characteristics

## 6.3.1 CPU Missing Sensor – Next Steps

Verify the processor is installed in the correct slot.

## 6.4 QuickPath Interconnect Sensors

The Intel<sup>®</sup> QuickPath Interconnect (QPI) bus is the interconnect between processors on Intel<sup>®</sup> EPSD boards based on the Intel<sup>®</sup> Xeon<sup>®</sup> processor E5-4600/2600/2400/1600/1400 product families.

The QPI Link Width Reduced sensor is used by BIOS POST to report that a link width has been reduced, so its Generator ID is 01h.

The QPI error sensors are reported to the BMC by the BIOS SMI handler, so their Generator ID is 33h.
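A Python sketch of telling these two reporters apart by Generator ID follows. It assumes the two-byte Generator ID occupies SEL record bytes 8 and 9 with the least significant byte first, as IPMI multi-byte fields are; `generator_name` is a hypothetical helper, and only the two IDs named in this section are recognized.

```python
# Generator IDs named in this section; anything else is reported as "other".
GENERATOR_IDS = {0x0001: "BIOS POST", 0x0033: "BIOS SMI Handler"}


def generator_name(record: bytes) -> str:
    """Read the Generator ID from SEL record bytes 8-9 (least significant byte first)."""
    gen_id = int.from_bytes(record[7:9], "little")
    return GENERATOR_IDS.get(gen_id, f"other ({gen_id:04X}h)")
```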

## 6.4.1 QPI Link Width Reduced Sensor

BIOS POST has reduced the QPI link width because of an error condition detected during initialization.

| Byte   | Field                             | Description                                                                                                                                                                                                          |
|--------|-----------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 8<br>9 | Generator ID                      | 0001h = BIOS POST                                                                                                                                                                                                    |
| 11     | Sensor Type                       | 13h = Critical Interrupt                                                                                                                                                                                             |
| 12     | Sensor Number                     | 09h                                                                                                                                                                                                                  |
| 13     | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 77h (OEM Discrete)</li> </ul>                                                                 |
| 14     | Event Data 1                      | <ul> <li>[7:6] - 10b = OEM code in Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset <ul> <li>1h = Reduced to ½ width</li> <li>2h = Reduced to ¼ width</li> </ul> </li> </ul> |
| 15     | Event Data 2                      | 0-3 = CPU1-4                                                                                                                                                                                                         |
| 16     | Event Data 3                      | Not used                                                                                                                                                                                                             |

Table 52: QPI Link Width Reduced Sensor Typical Characteristics
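Here is a minimal sketch of turning Event Data 1 and Event Data 2 (bytes 14 and 15) of this event into a readable summary. It assumes both bytes have already been extracted from the record. The function name is hypothetical.

```python
# Event Trigger Offset values (Event Data 1, bits [3:0]) for sensor 09h.
LINK_WIDTH = {0x1: "Reduced to 1/2 width", 0x2: "Reduced to 1/4 width"}


def describe_link_width_reduced(ed1: int, ed2: int) -> str:
    """Summarize a QPI Link Width Reduced event from Event Data 1 and 2."""
    # Bits [7:6] must be 10b: Event Data 2 carries an OEM code (the CPU index).
    if (ed1 >> 6) & 0b11 != 0b10:
        raise ValueError("Event Data 2 does not carry an OEM code")
    if not 0 <= ed2 <= 3:
        raise ValueError(f"unexpected CPU index {ed2}")
    offset = ed1 & 0x0F
    width = LINK_WIDTH.get(offset, f"Unknown offset {offset:X}h")
    return f"CPU{ed2 + 1}: {width}"
```

For example, `describe_link_width_reduced(0x81, 0x00)` reports that the CPU1 link was reduced to half width.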

## 6.4.1.1 QPI Link Width Reduced Sensor – Next Steps

If the error continues:

1. Check that the processor is installed correctly.
2. Inspect the socket for bent pins.
3. Cross-test the processor. If the issue stays with the processor socket, replace the main board; otherwise, replace the processor.

## 6.4.2 QPI Correctable Error Sensor

The system detected an error and corrected it. This is an informational event.

Table 53: QPI Correctable Error Sensor Typical Characteristics

| Byte   | Field                             | Description                                                                                                                                                 |
|--------|-----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 8<br>9 | Generator ID                      | 0033h = BIOS SMI Handler                                                                                                                                    |
| 11     | Sensor Type                       | 13h = Critical Interrupt                                                                                                                                    |
| 12     | Sensor Number                     | 06h                                                                                                                                                         |
| 13     | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 72h (OEM Discrete)</li> </ul>        |
| 14     | Event Data 1                      | <ul> <li>[7:6] - 10b = OEM code in Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset = Reserved</li> </ul> |
| 15     | Event Data 2                      | 0-3 = CPU1-4                                                                                                                                                |
| 16     | Event Data 3                      | Not used                                                                                                                                                    |

## 6.4.2.1 QPI Correctable Error Sensor – Next Steps

This is an informational event only. Correctable errors are acceptable and normal at a low rate of occurrence. If the error continues:

1. Check that the processor is installed correctly.
2. Inspect the socket for bent pins.
3. Cross-test the processor. If the issue stays with the processor socket, replace the main board; otherwise, replace the processor.
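This guide does not define what counts as a "low rate" of correctable errors. One way to decide when "the error continues" is to count QPI Correctable Error events (sensor 06h) per CPU over a sliding time window. In the sketch below, the window length and event limit are hypothetical policy values, not Intel figures. The class name is also hypothetical. The sketch assumes event timestamps arrive in non-decreasing order for each CPU.

```python
from collections import defaultdict, deque

# Hypothetical policy values; this guide does not specify a rate limit.
WINDOW_SECONDS = 24 * 60 * 60
MAX_EVENTS_PER_WINDOW = 10


class CorrectableErrorMonitor:
    """Flags a CPU whose QPI Correctable Error events recur too often."""

    def __init__(self, window: int = WINDOW_SECONDS, limit: int = MAX_EVENTS_PER_WINDOW):
        self.window = window
        self.limit = limit
        self._times = defaultdict(deque)  # CPU index (Event Data 2) -> event timestamps

    def record(self, timestamp: int, cpu_index: int) -> bool:
        """Add one event; return True when the count in the window exceeds the limit."""
        times = self._times[cpu_index]
        times.append(timestamp)
        # Drop events that have fallen out of the sliding window.
        while times and timestamp - times[0] >= self.window:
            times.popleft()
        return len(times) > self.limit
```

Each CPU is tracked separately, so a burst of errors on one processor does not cause a neighboring processor to be flagged.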

## 6.4.3 QPI Fatal Error and Fatal Error #2

The system detected a fatal (non-recoverable) QPI error. This is a fatal event.

| Byte   | Field                             | Description                                                                                                                                                                                                                                                                |
|--------|-----------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 8<br>9 | Generator ID                      | 0033h = BIOS SMI Handler                                                                                                                                                                                                                                                   |
| 11     | Sensor Type                       | 13h = Critical Interrupt                                                                                                                                                                                                                                                   |
| 12     | Sensor Number                     | 07h                                                                                                                                                                                                                                                                        |
| 13     | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 73h (OEM Discrete)</li> </ul>                                                                                                                       |
| 14     | Event Data 1                      | <ul> <li>[7:6] - 10b = OEM code in Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset <ul> <li>0h = Link Layer Uncorrectable ECC Error</li> <li>1h = Protocol Layer Poisoned Packet Reception Error</li> <li>2h = Link/PHY Init Failure with resultant degradation in link width</li> <li>3h = PHY Layer detected drift buffer alarm</li> <li>4h = PHY detected latency buffer rollover</li> <li>5h = PHY Init Failure</li> <li>6h = Link Layer generic control error (buffer overflow/underflow, credit underflow and so on)</li> <li>7h = Parity error in link or PHY layer</li> <li>8h = Protocol layer timeout detected</li> <li>9h = Protocol layer failed response</li> <li>Ah = Protocol layer illegal packet field, target Node ID Error, and so on</li> <li>Bh = Protocol Layer Queue/table overflow/underflow</li> <li>Ch = Viral Error</li> <li>Dh = Protocol Layer parity error</li> <li>Eh = Routing Table Error</li> <li>Fh = Reserved (unused)</li> </ul> </li> </ul> |
| 15     | Event Data 2                      | 0-3 = CPU1-4                                                                                                                                                                                                                                                               |
| 16     | Event Data 3                      | Not used                                                                                                                                                                                                                                                                   |

Table 54: QPI Fatal Error Sensor Typical Characteristics

The QPI Fatal Error #2 sensor is a continuation of the QPI Fatal Error sensor. It covers additional error types.

Table 55: QPI Fatal #2 Error Sensor Typical Characteristics

| Byte   | Field                             | Description                                                                                                                                          |
|--------|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|
| 8<br>9 | Generator ID                      | 0033h = BIOS SMI Handler                                                                                                                             |
| 11     | Sensor Type                       | 13h = Critical Interrupt                                                                                                                             |
| 12     | Sensor Number                     | 17h                                                                                                                                                  |
| 13     | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 74h (OEM Discrete)</li> </ul> |
| 14     | Event Data 1                      | <ul> <li>[7:6] - 10b = OEM code in Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset <ul> <li>0h = Illegal inbound request</li> <li>1h = IIO Write Cache Uncorrectable Data ECC Error</li> <li>2h = IIO CSR crossing 32-bit boundary Error</li> <li>3h = IIO Received XPF physical/logical redirect interrupt inbound</li> <li>4h = IIO Illegal SAD or Illegal or non-existent address or memory</li> <li>5h = IIO Write Cache Coherency Violation</li> </ul> </li> </ul> |
| 15     | Event Data 2                      | 0-3 = CPU1-4                                                                                                                                         |
| 16     | Event Data 3                      | Not used                                                                                                                                             |
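The two QPI fatal sensors share one record layout but differ in sensor number. Sensor 07h and sensor 17h each use their own Event Trigger Offset list. The sketch below chooses the offset table by sensor number. It assumes bytes 12, 14, and 15 have already been extracted, and the function name is hypothetical. The descriptions are copied from Tables 54 and 55.

```python
# Event Trigger Offset descriptions (Event Data 1, bits [3:0]).
QPI_FATAL = {  # sensor 07h, Table 54
    0x0: "Link Layer Uncorrectable ECC Error",
    0x1: "Protocol Layer Poisoned Packet Reception Error",
    0x2: "Link/PHY Init Failure with resultant degradation in link width",
    0x3: "PHY Layer detected drift buffer alarm",
    0x4: "PHY detected latency buffer rollover",
    0x5: "PHY Init Failure",
    0x6: "Link Layer generic control error",
    0x7: "Parity error in link or PHY layer",
    0x8: "Protocol layer timeout detected",
    0x9: "Protocol layer failed response",
    0xA: "Protocol layer illegal packet field, target Node ID Error, and so on",
    0xB: "Protocol Layer Queue/table overflow/underflow",
    0xC: "Viral Error",
    0xD: "Protocol Layer parity error",
    0xE: "Routing Table Error",
}
QPI_FATAL_2 = {  # sensor 17h, Table 55
    0x0: "Illegal inbound request",
    0x1: "IIO Write Cache Uncorrectable Data ECC Error",
    0x2: "IIO CSR crossing 32-bit boundary Error",
    0x3: "IIO Received XPF physical/logical redirect interrupt inbound",
    0x4: "IIO Illegal SAD or Illegal or non-existent address or memory",
    0x5: "IIO Write Cache Coherency Violation",
}
OFFSET_TABLES = {0x07: QPI_FATAL, 0x17: QPI_FATAL_2}


def describe_qpi_fatal(sensor_number: int, ed1: int, ed2: int) -> str:
    """Summarize a QPI Fatal Error (07h) or QPI Fatal Error #2 (17h) event."""
    table = OFFSET_TABLES.get(sensor_number)
    if table is None:
        raise ValueError(f"sensor {sensor_number:02X}h is not a QPI fatal error sensor")
    reason = table.get(ed1 & 0x0F, "Reserved")  # unlisted offsets are reserved
    return f"CPU{ed2 + 1}: {reason}"
```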

## 6.4.3.1 QPI Fatal Error and Fatal Error #2 – Next Steps

This is a fatal error event, not an informational one. If the error recurs:

1. Check that the processor is installed correctly.
2. Inspect the socket for bent pins.
3. Cross-test the processor. If the issue stays with the processor socket, replace the main board; otherwise, replace the processor.

# 6.5 Processor ERR2 Timeout Sensor

The BMC supports an ERR2 Timeout Sensor (one per CPU). This sensor asserts if a CPU's ERR2 signal has been asserted for longer than a fixed time period (more than 90 seconds). ERR[2] is a processor signal that indicates when the IIO (Integrated IO module in the processor) has a fatal error that could not be communicated to the core to trigger an SMI. ERR[2] events are fatal error conditions. The BIOS and OS attempt to handle the error gracefully but may not always do so reliably. A continuously asserted ERR2 signal indicates that the BIOS cannot service the condition that caused the error, usually because that condition prevents the BIOS from running.

When an ERR2 timeout occurs, the BMC asserts/deasserts the ERR2 Timeout Sensor and logs a SEL event for that sensor. By default, the BMC core firmware initiates a system reset when it detects an ERR2 timeout. The BIOS setup utility provides an option to enable or disable this BMC-initiated system reset.

| Byte | Field                             | Description                                                                                                                                                            |
|------|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 07h = Processor                                                                                                                                                        |
| 12   | Sensor Number                     | 7Ch = Processor 1 ERR2 Timeout<br>7Dh = Processor 2 ERR2 Timeout<br>7Eh = Processor 3 ERR2 Timeout<br>7Fh = Processor 4 ERR2 Timeout                                   |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 03h ("digital" discrete)</li> </ul>             |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset = 1h (State Asserted)</li> </ul> |
| 15   | Event Data 2                      | Not used                                                                                                                                                               |
| 16   | Event Data 3                      | Not used                                                                                                                                                               |

Table 56: Processor ERR2 Timeout Sensor Typical Characteristics
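The timeout rule described above can be illustrated with a small model. This is not BMC firmware. It is a sketch that assumes the ERR2 signal level is available as time-ordered samples. The sensor must see ERR2 stay asserted continuously for more than 90 seconds, so any deassertion restarts the timer. The function name is hypothetical. The sensor-number mapping comes from Table 56.

```python
# From the text: ERR2 asserted "for longer than a fixed time period (> 90 seconds)".
ERR2_TIMEOUT_SECONDS = 90

# Sensor numbers from Table 56, mapped to the processor number.
ERR2_TIMEOUT_SENSORS = {0x7C: 1, 0x7D: 2, 0x7E: 3, 0x7F: 4}


def err2_timeout_at(samples):
    """Return the first sample time at which ERR2 has been continuously
    asserted for more than ERR2_TIMEOUT_SECONDS, or None if that never happens.

    samples: iterable of (timestamp_seconds, asserted) pairs in time order.
    """
    asserted_since = None
    for timestamp, asserted in samples:
        if not asserted:
            asserted_since = None          # signal dropped: restart the timer
        elif asserted_since is None:
            asserted_since = timestamp     # first asserted sample starts the timer
        elif timestamp - asserted_since > ERR2_TIMEOUT_SECONDS:
            return timestamp
    return None
```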

## 6.5.1 Processor ERR2 Timeout – Next Steps

1. Check the SEL for any other events around the time of the failure.
2. Take note of all IPMI activity that was occurring around the time of the failure.
3. Capture a System BMC Debug Log as soon as possible after the failure. You can capture this log from the Integrated BMC Web Console or with the Intel<sup>®</sup> Syscfg utility (`syscfg /sbmcdl private filename.zip`).
4. Send the log file to your system manufacturer or Intel representative for failure analysis.

# 6.6 Processor MSID Mismatch Sensor

The BMC supports an *MSID Mismatch* sensor. This sensor monitors for the fault condition that occurs when a processor's power rating is incompatible with the baseboard.

The sensor is rearmed on power-on (AC or DC power-on transitions).

| Byte | Field                             | Description                                                                                                                                                                    |
|------|-----------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 07h = Processor                                                                                                                                                                |
| 12   | Sensor Number                     | <ul> <li>81h = Processor 1 MSID Mismatch</li> <li>87h = Processor 2 MSID Mismatch</li> <li>88h = Processor 3 MSID Mismatch</li> <li>89h = Processor 4 MSID Mismatch</li> </ul> |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 03h ("digital" discrete)</li> </ul>                     |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset = 1h (State Asserted)</li> </ul>         |
| 15   | Event Data 2                      | Not used                                                                                                                                                                       |
| 16   | Event Data 3                      | Not used                                                                                                                                                                       |

Table 57: Processor MSID Mismatch Sensor Typical Characteristics
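The MSID Mismatch sensor numbers in Table 57 are not contiguous: Processor 1 uses 81h, and Processors 2 through 4 use 87h through 89h. A range check such as "81h to 84h" would therefore misidentify processors. Below is a minimal sketch that maps an asserted event to a processor number. It assumes bytes 11 through 14 have already been extracted, and the function name is hypothetical.

```python
# Sensor numbers from Table 57. They are not contiguous: 81h, then 87h-89h.
MSID_MISMATCH_SENSORS = {0x81: 1, 0x87: 2, 0x88: 3, 0x89: 4}


def msid_mismatch_processor(sensor_type: int, sensor_number: int, dir_type: int, ed1: int):
    """Return the processor number (1-4) for an asserted MSID Mismatch event, else None."""
    if sensor_type != 0x07:            # 07h = Processor
        return None
    if dir_type & 0x80:                # bit 7 set = deassertion event
        return None
    if (dir_type & 0x7F) != 0x03:      # 03h = "digital" discrete
        return None
    if (ed1 & 0x0F) != 0x1:            # offset 1h = State Asserted
        return None
    return MSID_MISMATCH_SENSORS.get(sensor_number)
```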

## 6.6.1 Processor MSID Mismatch Sensor – Next Steps

Verify that your baseboard supports the processor. Check your board's Technical Product Specification (TPS).

# 7. Memory Subsystem

Intel<sup>®</sup> servers report memory errors, status, and configuration in the SEL.

# 7.1 Memory RAS Configuration Status

After an AC power-on, a Memory RAS Configuration Status event is logged only if a RAS Mode is currently configured and that RAS Mode is successfully initiated.

This ensures that the SEL records which RAS Mode was in effect when the system started up. The event is logged only after AC power-on, not after DC power-on.

The Memory RAS Configuration Status Sensor also logs an event during POST whenever there is a RAS configuration error. This occurs when a RAS Mode has been selected but the memory configuration cannot support that RAS Mode when the system boots. RAS configuration then fails, and the memory operates in Independent Channel Mode.

In the logged SEL record, the ED1 Offset value is "RAS Configuration Disabled", and ED3 contains the RAS Mode that is currently selected but could not be configured. ED2 gives the reason for the RAS configuration failure. At present, only two "RAS Configuration Error Type" values are implemented:

- 0 = None: Used for an AC power-on log record when the RAS configuration is successfully configured.
- 3 = Invalid DIMM Configuration for RAS Mode: The installed DIMM configuration cannot support the currently selected RAS Mode. This may be due to DIMMs that have failed or been disabled. When this reason is logged, check the preceding SEL events for DIMM error events.

| Byte   | Field                             | Description                                                                                                                                                                  |
|--------|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 8<br>9 | Generator ID                      | 0001h = BIOS POST                                                                                                                                                            |
| 11     | Sensor Type                       | 0ch = Memory                                                                                                                                                                 |
| 12     | Sensor Number                     | 02h                                                                                                                                                                          |
| 13     | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 09h (digital Discrete)</li> </ul>                     |
| 14     | Event Data 1                      | <ul> <li>[7:6] - 10b = OEM code in Event Data 2</li> <li>[5:4] - 10b = OEM code in Event Data 3</li> <li>[3:0] - Event Trigger Offset as described in Table 59</li> </ul>    |
| 15     | Event Data 2                      | RAS Configuration Error Type<br>[7:4] = Reserved<br>[3:0] = Configuration Error<br>0 = None<br>3 = Invalid DIMM Configuration for RAS Mode<br>All other values are reserved. |
| 16     | Event Data 3                      | RAS Mode Configured<br>[7:4] = Reserved<br>[3:0] = RAS Mode<br>0h = None (Independent Channel Mode)<br>1h = Mirroring Mode<br>2h = Lockstep Mode<br>4h = Rank Sparing Mode   |

Table 58: Memory RAS Configuration Status Sensor Typical Characteristics
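Below is a minimal sketch of how ED1, ED2, and ED3 combine for this sensor. Offset 1h reports the configured mode. Offset 0h reports the mode that could not be configured, along with the error type from ED2. The sketch assumes the three Event Data bytes have already been extracted, and the function name is hypothetical. Mode values 3h and 5h through Fh are not listed in the table, so they decode as "Reserved".

```python
RAS_MODES = {
    0x0: "None (Independent Channel Mode)",
    0x1: "Mirroring Mode",
    0x2: "Lockstep Mode",
    0x4: "Rank Sparing Mode",
}
RAS_CONFIG_ERRORS = {0x0: "None", 0x3: "Invalid DIMM Configuration for RAS Mode"}


def describe_ras_config_status(ed1: int, ed2: int, ed3: int) -> str:
    """Summarize a Memory RAS Configuration Status event (sensor 02h)."""
    offset = ed1 & 0x0F                                     # Table 59 offsets
    mode = RAS_MODES.get(ed3 & 0x0F, "Reserved")            # ED3 [3:0]
    error = RAS_CONFIG_ERRORS.get(ed2 & 0x0F, "Reserved")   # ED2 [3:0]
    if offset == 0x1:
        return f"RAS configuration enabled: {mode}"
    if offset == 0x0:
        return f"RAS configuration disabled: {mode} could not be configured ({error})"
    return f"Unknown offset {offset:X}h"
```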

Table 59: Memory RAS Configuration Status Sensor – Event Trigger Offset – Next Steps

| Event Trigger Offset (Hex) | Event Trigger Offset Description | Description | Next Steps |
|-----|-----------------------------|-------------|------------|
| 01h | RAS configuration enabled.  | User enabled mirrored channel mode in setup. | Informational event only. |
| 00h | RAS configuration disabled. | Mirrored channel mode is disabled (either in setup or due to unavailability of memory at POST, in which case POST error 8500 is also logged). | <ol> <li>If this event is accompanied by POST error 8500, there was a problem applying the mirroring configuration to the memory. Check for other errors related to the memory and troubleshoot accordingly.</li> <li>If there is no POST error, mirror mode was simply disabled in BIOS setup, and this event should be considered informational only.</li> </ol> |

# 7.2 Memory RAS Mode Select

Memory RAS Mode Select events are logged to record changes in RAS Mode.

When a RAS Mode selection changes the RAS Mode (including changes from or to Independent Channel Mode), the change is logged to the SEL in a Memory RAS Mode Select event message. This message records the previous RAS Mode (from) and the newly selected RAS Mode (to). The event also includes an Offset value in ED1 that indicates whether the mode change left a RAS Mode active (Enabled) or not (Disabled – Independent Channel Mode selected). This sensor provides the Spare Channel mode RAS Configuration status. Memory RAS Mode Select is an informational event.

| Byte   | Field                             | Description                                                                                                                                                                                                                       |
|--------|-----------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 8<br>9 | Generator ID                      | 0001h = BIOS POST                                                                                                                                                                                                                 |
| 11     | Sensor Type                       | 0ch = Memory                                                                                                                                                                                                                      |
| 12     | Sensor Number                     | 12h                                                                                                                                                                                                                               |
| 13     | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 09h (digital Discrete)</li> </ul>                                                                          |
| 14     | Event Data 1                      | <ul> <li>[7:6] - 10b = OEM code in Event Data 2</li> <li>[5:4] - 10b = OEM code in Event Data 3</li> <li>[3:0] - Event Trigger Offset</li> <li>0h = RAS Configuration Disabled</li> <li>1h = RAS Configuration Enabled</li> </ul> |
| 15     | Event Data 2                      | Prior RAS Mode<br>[7:4] = Reserved<br>[3:0] = RAS Mode<br>0h = None (Independent Channel Mode)<br>1h = Mirroring Mode<br>2h = Lockstep Mode<br>4h = Rank Sparing Mode                                                             |
| 16     | Event Data 3                      | Selected RAS Mode<br>[7:4] = Reserved<br>[3:0] = RAS Mode<br>0h = None (Independent Channel Mode)<br>1h = Mirroring Mode<br>2h = Lockstep Mode<br>4h = Rank Sparing Mode                                                          |

Table 60: Memory RAS Mode Select Sensor Typical Characteristics
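For this sensor, ED2 and ED3 describe a transition from the prior mode to the selected mode. According to the text, the ED1 offset should read Enabled exactly when the selected mode is not Independent Channel Mode. Below is a minimal sketch that prints the transition and flags records where the offset and the selected mode disagree. The function name and the wording of that flag are hypothetical.

```python
RAS_MODES = {
    0x0: "None (Independent Channel Mode)",
    0x1: "Mirroring Mode",
    0x2: "Lockstep Mode",
    0x4: "Rank Sparing Mode",
}
OFFSETS = {0x0: "RAS Configuration Disabled", 0x1: "RAS Configuration Enabled"}


def describe_ras_mode_change(ed1: int, ed2: int, ed3: int) -> str:
    """Summarize a Memory RAS Mode Select event (sensor 12h)."""
    offset = ed1 & 0x0F
    prior = ed2 & 0x0F      # ED2 [3:0] = prior RAS Mode
    selected = ed3 & 0x0F   # ED3 [3:0] = selected RAS Mode
    summary = (f"{RAS_MODES.get(prior, 'Reserved')} -> "
               f"{RAS_MODES.get(selected, 'Reserved')}: "
               f"{OFFSETS.get(offset, 'Unknown offset')}")
    # Enabled should mean a RAS Mode is active; Disabled should mean
    # Independent Channel Mode (0h) was selected.
    if (offset == 0x1) != (selected != 0x0):
        summary += " (offset and selected mode disagree)"
    return summary
```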

# 7.3 Mirroring Redundancy State

Mirroring Mode protects memory data through full redundancy: it keeps complete copies of all data on both channels of a Mirroring Domain (channel pair). An Uncorrectable Error is normally fatal. If one occurs on one channel of a pair while the other channel is still intact and operational, the Uncorrectable Error is "demoted" to a Correctable Error, and the failed channel is disabled. Because the Mirroring Domain is no longer redundant, a Mirroring Redundancy State SEL Event is logged.

| Byte   | Field                             | Description                                                                                                                                                                                                      |
|--------|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 8<br>9 | Generator ID                      | 0033h = BIOS SMI Handler                                                                                                                                                                                         |
| 11     | Sensor Type                       | 0ch = Memory                                                                                                                                                                                                     |
| 12     | Sensor Number                     | 01h                                                                                                                                                                                                              |
| 13     | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 0Bh (Generic Discrete)</li> </ul>                                                         |
| 14     | Event Data 1                      | <ul> <li>[7:6] - 10b = OEM code in Event Data 2</li> <li>[5:4] - 10b = OEM code in Event Data 3</li> <li>[3:0] - Event Trigger Offset</li> <li>0h = Fully Redundant</li> <li>2h = Redundancy Degraded</li> </ul> |
| 15     | Event Data 2                      | Location<br>[7:4] = Mirroring Domain<br>0-1 = Channel Pair for Socket<br>[3:2] = Reserved<br>[1:0] = Rank on DIMM<br>0-3 = Rank Number                                                                           |
| 16     | Event Data 3                      | Location<br>[7:5] = Socket ID<br>0-3 = CPU1-4<br>[4:3] = Channel<br>0-3 = Channel A-D for Socket<br>[2:0] = DIMM<br>0-2 = DIMM 1-3 on Channel                                                                    |

Table 61: Mirroring Redundancy State Sensor Typical Characteristics
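The location fields in Event Data 2 and Event Data 3 are packed bit fields. Below is a minimal sketch of unpacking them with shifts and masks, using the bit positions from Table 61. The function name and dictionary keys are hypothetical. The sketch reports logical CPU, channel, and DIMM numbers; silkscreen slot labels can differ from board to board.

```python
def decode_mirroring_location(ed2: int, ed3: int) -> dict:
    """Unpack the failing-DIMM location from a Mirroring Redundancy State event."""
    return {
        "mirroring_domain": (ed2 >> 4) & 0x0F,   # ED2 [7:4]: channel pair 0-1
        "rank": ed2 & 0x03,                       # ED2 [1:0]: rank 0-3 on the DIMM
        "cpu": ((ed3 >> 5) & 0x07) + 1,           # ED3 [7:5]: 0-3 = CPU1-4
        "channel": "ABCD"[(ed3 >> 3) & 0x03],     # ED3 [4:3]: 0-3 = channel A-D
        "dimm": (ed3 & 0x07) + 1,                 # ED3 [2:0]: 0-2 = DIMM 1-3
    }
```

For example, ED2 = 12h and ED3 = 31h decode to mirroring domain 1, rank 2, CPU2, channel C, DIMM 2.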

## 7.3.1 Mirroring Redundancy State Sensor – Next Steps

This event is accompanied by memory error events that indicate the source of the issue. Troubleshoot accordingly; most likely, the affected DIMM needs to be replaced.

On boards with DIMM Fault LEDs, the Fault LED of the failing DIMM is lit. This is the DIMM whose error triggered the Mirroring Failover action.

# 7.4 Sparing Redundancy State

Rank Sparing Mode is a Memory RAS configuration option that reserves one memory rank per channel as a "spare rank". If any rank on a given channel experiences enough Correctable ECC Errors to cross the Correctable Error Threshold, the data in that rank is copied to the spare rank, and then the spare rank is mapped into the memory array to replace the failing rank.

When a Correctable Error Threshold event triggers this failover, the failing DIMM is disabled. Because the Sparing Domain is then no longer redundant, a Sparing Redundancy State SEL event is logged.
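The following toy Python model of one channel in Rank Sparing Mode makes the failover sequence concrete. It is a sketch under stated assumptions, not the BIOS implementation:

- The class name is hypothetical.
- The default threshold of 10 is borrowed from the correctable ECC threshold described in Table 64.
- The one-spare-per-channel bookkeeping is illustrative.
- The data copy itself is not modeled.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SparingChannel:
    """Toy model of one memory channel in Rank Sparing Mode (hypothetical)."""
    spare_rank: int | None = 3   # rank reserved as the spare; None once used
    threshold: int = 10          # assumed Correctable Error Threshold
    ce_counts: dict = field(default_factory=dict)  # rank -> corrected errors
    remap: dict = field(default_factory=dict)      # failing rank -> spare rank

    def correctable_error(self, rank: int) -> str | None:
        """Count one corrected error; fail over to the spare at the threshold."""
        self.ce_counts[rank] = self.ce_counts.get(rank, 0) + 1
        if self.ce_counts[rank] < self.threshold or self.spare_rank is None:
            return None  # below threshold, or no spare left to fail over to
        # The failing rank's data would be copied to the spare rank here;
        # the spare is then mapped into the memory array in its place.
        self.remap[rank] = self.spare_rank
        self.spare_rank = None  # the Sparing Domain is no longer redundant
        return "Sparing Redundancy State: Redundancy Degraded (offset 2h)"


channel = SparingChannel(threshold=3)
events = [channel.correctable_error(0) for _ in range(3)]
print(events[-1], channel.remap)
```

After the single spare has been used, this model simply returns `None` for any further threshold crossings on that channel.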


| Byte   | Field                             | Description                                                                                                                                                                                                      |
|--------|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 8<br>9 | Generator ID                      | 0033h = BIOS SMI Handler                                                                                                                                                                                         |
| 11     | Sensor Type                       | 0ch = Memory                                                                                                                                                                                                     |
| 12     | Sensor Number                     | 11h                                                                                                                                                                                                              |
| 13     | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 0Bh (Generic Discrete)</li> </ul>                                                         |
| 14     | Event Data 1                      | <ul> <li>[7:6] - 10b = OEM code in Event Data 2</li> <li>[5:4] - 10b = OEM code in Event Data 3</li> <li>[3:0] - Event Trigger Offset</li> <li>0h = Fully Redundant</li> <li>2h = Redundancy Degraded</li> </ul> |
| 15     | Event Data 2                      | Location<br>[7:4] = Sparing Domain<br>0-3 = Channel A-D for Socket<br>[3:2] = Reserved<br>[1:0] = Rank on DIMM<br>0-3 = Rank Number                                                                              |
| 16     | Event Data 3                      | Location<br>[7:5] = Socket ID<br>0-3 = CPU1-4<br>[4:3] = Channel<br>0-3 = Channel A-D for Socket<br>[2:0] = DIMM<br>0-2 = DIMM 1-3 on Channel                                                                    |

### Table 62: Sparing Redundancy State Sensor Typical Characteristics
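This minimal Python sketch decodes the fields that distinguish a Sparing Redundancy State event (Table 62):

- the event direction and type in byte 13;
- the trigger offset in Event Data 1;
- the sparing domain in Event Data 2.

The function name is hypothetical. Event Data 3 uses the same Socket/Channel/DIMM layout as the mirroring event and is left out here.

```python
REDUNDANCY_STATES = {0x0: "Fully Redundant", 0x2: "Redundancy Degraded"}


def decode_sparing_event(byte13: int, ed1: int, ed2: int) -> dict:
    """Decode byte 13, Event Data 1, and Event Data 2 of a sparing event."""
    if (byte13 & 0x7F) != 0x0B:  # [6:0] Event Type must be 0Bh
        raise ValueError(f"unexpected event type {byte13 & 0x7F:#04x}")
    domain = (ed2 >> 4) & 0xF    # ED2 [7:4]: sparing domain, 0-3 = channel A-D
    if domain > 3:
        raise ValueError("reserved sparing domain value in Event Data 2")
    return {
        "asserted": not (byte13 & 0x80),  # [7]: 0b = assertion, 1b = deassertion
        "state": REDUNDANCY_STATES.get(ed1 & 0x0F, "Reserved"),
        "sparing_channel": "ABCD"[domain],
        "rank": ed2 & 0x3,                # ED2 [1:0]: rank on DIMM
    }


print(decode_sparing_event(0x0B, 0xA2, 0x21))
# {'asserted': True, 'state': 'Redundancy Degraded', 'sparing_channel': 'C', 'rank': 1}
```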

## 7.4.1 Sparing Redundancy State Sensor – Next Steps

This event is accompanied by memory error events that indicate the source of the issue. Troubleshoot those errors accordingly; in most cases, this means replacing the affected DIMM.

For boards with DIMM Fault LEDs, the appropriate Fault LED is lit to indicate which DIMM was the source of the error that triggered the rank sparing failover, that is, the failing DIMM.

# 7.5 ECC and Address Parity

- 1. Memory data errors are logged as either correctable or uncorrectable.
- 2. Uncorrectable errors are fatal.
- 3. Memory addresses are protected by parity bits. An address parity error is logged and treated as a fatal error.

## 7.5.1 Memory Correctable and Uncorrectable ECC Error

ECC errors are divided into Uncorrectable ECC Errors and Correctable ECC Errors. A logged "Correctable ECC Error" actually represents a threshold overflow. It means that the number of correctable errors detected at the memory controller level for a given DIMM within a given timeframe has reached the Correctable Error Threshold. In both cases, the error can be narrowed down to particular DIMM(s). The BIOS SMI error handler uses this information to log the event to the BMC SEL and identify the failing DIMM.
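The threshold behavior described above can be sketched as a per-DIMM counter. This is a hypothetical Python model, not the BIOS SMI handler's logic, and it rests on three assumptions:

- It uses the figure of 10 errors since the last boot given in Table 64.
- It ignores the timeframe.
- It logs only one SEL entry per DIMM once the threshold is reached.

```python
from collections import Counter


class CorrectableErrorTracker:
    """Hypothetical per-DIMM correctable ECC error counter."""

    def __init__(self, threshold: int = 10) -> None:
        self.threshold = threshold
        self.counts: Counter = Counter()  # DIMM label -> errors since boot
        self.logged: set = set()          # DIMMs that already produced an event

    def record(self, dimm: str) -> bool:
        """Count one corrected error; return True if a SEL event should be logged."""
        self.counts[dimm] += 1
        if self.counts[dimm] >= self.threshold and dimm not in self.logged:
            self.logged.add(dimm)
            return True
        return False


tracker = CorrectableErrorTracker()
results = [tracker.record("CPU1 Channel A DIMM 1") for _ in range(12)]
print(results.index(True), results.count(True))  # 9 1
```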

### Table 63: Correctable and Uncorrectable ECC Error Sensor Typical Characteristics

| Byte   | Field                             | Description                                                                                                                                             |
|--------|-----------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
| 8<br>9 | Generator ID                      | 0033h = BIOS SMI Handler                                                                                                                                |
| 11     | Sensor Type                       | 0ch = Memory                                                                                                                                            |
| 12     | Sensor Number                     | 02h                                                                                                                                                     |
| 13     | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 6Fh (Sensor Specific)</li> </ul> |
| 14     | Event Data 1                      | [7:6] – 10b = OEM code in Event Data 2<br>[5:4] – 10b = OEM code in Event Data 3<br>[3:0] – Event Trigger Offset as described in Table 64               |
| 15     | Event Data 2                      | [7:2] – Reserved. Set to 0.<br>[1:0] – Rank on DIMM<br>0-3 = Rank number                                                                                |
| 16     | Event Data 3                      | [7:5] – Socket ID<br>0-3 = CPU1-4<br>[4:3] – Channel<br>0-3 = Channel A-D for Socket<br>[2:0] – DIMM<br>0-2 = DIMM 1-3 on Channel                       |
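The first next step in Table 64 is to decode the DIMM location from the hex version of the SEL, which can be done by hand or with a short script. The following Python sketch assumes Event Data 1-3 (bytes 14-16) have already been extracted from the raw record. The function name and output wording are hypothetical.

```python
ECC_OFFSETS = {0x0: "Correctable ECC Error threshold reached",
               0x1: "Uncorrectable ECC Error"}


def decode_ecc_event(ed1: int, ed2: int, ed3: int) -> str:
    """Describe a Correctable/Uncorrectable ECC Error event (Table 63)."""
    socket = (ed3 >> 5) & 0x7   # ED3 [7:5]: 0-3 = CPU1-4
    channel = (ed3 >> 3) & 0x3  # ED3 [4:3]: 0-3 = channel A-D for socket
    dimm = ed3 & 0x7            # ED3 [2:0]: 0-2 = DIMM 1-3 on channel
    if socket > 3 or dimm > 2:
        raise ValueError("reserved Socket ID or DIMM value in Event Data 3")
    trigger = ECC_OFFSETS.get(ed1 & 0x0F, "Reserved offset")
    return (f"{trigger}: CPU{socket + 1}, Channel {'ABCD'[channel]}, "
            f"DIMM {dimm + 1}, rank {ed2 & 0x3}")


print(decode_ecc_event(0xA1, 0x01, 0x42))
# Uncorrectable ECC Error: CPU3, Channel A, DIMM 3, rank 1
```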

### Table 64: Correctable and Uncorrectable ECC Error Sensor Event Trigger Offset – Next Steps

| E١  | vent Trigger Offset                           | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | Next Steps                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|-----|-----------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Hex | Description                                   |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| 01h | Uncorrectable ECC<br>Error                    | An uncorrectable (multi-bit) ECC error has occurred. This<br>is a fatal issue that will typically lead to an OS crash<br>(unless memory has been configured in a RAS mode).<br>The system will generate a CATERR# (catastrophic error)<br>and an MCE (Machine Check Exception Error).<br>While the error may be due to a failing DRAM chip on the<br>DIMM, it can also be cause by incorrect seating or<br>improper contact between socket and DIMM, or by bent<br>pins in the processor socket. | <ol> <li>If needed, decode DIMM location from hex version of SEL.</li> <li>Verify the DIMM is seated properly.</li> <li>Examine gold fingers on edge of the DIMM to verify contacts are clean.</li> <li>Inspect the processor socket this DIMM is connected to for bent pins, and if found, replace the board.</li> <li>Consider replacing the DIMM as a preventative measure. For multiple occurrences, replace the DIMM.</li> </ol>                                                                       |
| 00h | Correctable ECC<br>Error threshold<br>reached | There have been too many (10 or more) correctable ECC<br>errors for this particular DIMM since last boot. This event<br>in itself does not pose any direct problems because the<br>ECC errors are still being corrected. Depending on the<br>RAS configuration of the memory, the IMC may take the<br>affected DIMM offline.                                                                                                                                                                     | <ul> <li>Even though this event doesn't immediately lead to problems, it can indicate one of the DIMM modules is slowly failing. If this error occurs more than once:</li> <li>1. If needed, decode DIMM location from hex version of SEL.</li> <li>2. Verify the DIMM is seated properly.</li> <li>3. Examine gold fingers on edge of the DIMM to verify contacts are clean.</li> <li>4. Inspect the processor socket this DIMM is connected to for bent pins, and if found, replace the board.</li> </ul> |

Memory Subsystem

### System Event Log Troubleshooting Guide for EPSD Platforms Based on Intel<sup>®</sup> Xeon<sup>®</sup> Processor E5 4600/2600/2400/1600/1400 Product Families

| Event Trigger Offset |             | Description |    | Next Steps                                                                                            |
|----------------------|-------------|-------------|----|-------------------------------------------------------------------------------------------------------|
| Hex                  | Description |             |    |                                                                                                       |
|                      |             |             | 5. | Consider replacing the DIMM as a preventative measure.<br>For multiple occurrences, replace the DIMM. |

## 7.5.2 Memory Address Parity Error

Address Parity errors are detected in the memory addressing hardware. Because they affect the addressing of memory contents, they can potentially lead to the same sort of failures as ECC errors. They are logged as a distinct type of error because they affect memory addressing rather than memory contents, but otherwise they are treated exactly the same as Uncorrectable ECC Errors. Address Parity errors are logged to the BMC SEL, with Event Data that identifies the failing address by channel and DIMM to the extent possible.

### Table 65: Address Parity Error Sensor Typical Characteristics

| Byte   | Field                             | Description                                                                                                                                                                                                                                                                                                                               |
|--------|-----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 8<br>9 | Generator ID                      | 0033h = BIOS SMI Handler                                                                                                                                                                                                                                                                                                                  |
| 11     | Sensor Type                       | 0ch = Memory                                                                                                                                                                                                                                                                                                                              |
| 12     | Sensor Number                     | 13h                                                                                                                                                                                                                                                                                                                                       |
| 13     | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 6Fh (Sensor Specific)</li> </ul>                                                                                                                                                                                   |
| 14     | Event Data 1                      | <ul> <li>[7:6] - 10b = OEM code in Event Data 2</li> <li>[5:4] - 10b = OEM code in Event Data 3</li> <li>[3:0] - Event Trigger Offset = 2h</li> </ul>                                                                                                                                                                                     |
| 15     | Event Data 2                      | <ul> <li>[7:5] – Reserved. Set to 0.</li> <li>[4] – Channel Information Validity Check:<br/>0b = Channel Number in Event Data 3 Bits[4:3] is not valid<br/>1b = Channel Number in Event Data 3 Bits[4:3] is valid</li> <li>[3] – DIMM Information Validity Check:<br/>0b = DIMM Slot ID in Event Data 3 Bits[2:0] is not valid</li> </ul> |

Memory Subsystem

| Byte | Field        | Description                                                                                                                                                                                             |  |
|------|--------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|
|      |              | 1b = DIMM Slot ID in Event Data 3 Bits[2:0] is valid                                                                                                                                                    |  |
|      |              | [2:0] – Error Type:                                                                                                                                                                                     |  |
|      |              | 000b = Parity Error Type not known                                                                                                                                                                      |  |
|      |              | 001b = Data Parity Error (not used)                                                                                                                                                                     |  |
|      |              | 010b = Address Parity Error                                                                                                                                                                             |  |
|      |              | All other values are reserved.                                                                                                                                                                          |  |
| 16   | Event Data 3 | [7:5] – Indicates the Processor Socket to which the DDR3 DIMM having the ECC error is attached:                                                                                                         |  |
|      |              | 0-3 = CPU1-4                                                                                                                                                                                            |  |
|      |              | All other values are reserved.                                                                                                                                                                          |  |
|      |              | [4:3] – Channel Number (if valid) on which the Parity Error occurred. This value will be indeterminate and should be ignored if ED2<br>Bit [4] is 0b.                                                   |  |
|      |              | 00b = Channel A                                                                                                                                                                                         |  |
|      |              | 01b = Channel B                                                                                                                                                                                         |  |
|      |              | 10b = Channel C                                                                                                                                                                                         |  |
|      |              | 11b = Channel D                                                                                                                                                                                         |  |
|      |              | [2:0] – DIMM Slot ID (if valid) of the specific DIMM that was involved in the transaction that led to the parity error. This value will<br>be indeterminate and should be ignored if ED2 Bit [3] is 0b. |  |
|      |              | 000b = DIMM Socket 1                                                                                                                                                                                    |  |
|      |              | 001b = DIMM Socket 2                                                                                                                                                                                    |  |
|      |              | 010b = DIMM Socket 3                                                                                                                                                                                    |  |
|      |              | All other values are reserved.                                                                                                                                                                          |  |
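The Address Parity Error event differs from the ECC event: its channel and DIMM fields are meaningful only when the validity bits in Event Data 2 are set. The following minimal Python sketch honors those bits. The function name and the use of `None` for "not valid" are assumptions for illustration.

```python
PARITY_TYPES = {0b000: "Parity Error Type not known",
                0b001: "Data Parity Error (not used)",
                0b010: "Address Parity Error"}


def decode_parity_event(ed2: int, ed3: int) -> dict:
    """Decode Event Data 2/3 of a Memory Address Parity Error (Table 65)."""
    socket = (ed3 >> 5) & 0x7
    result = {
        "error_type": PARITY_TYPES.get(ed2 & 0x7, "Reserved"),
        "cpu": f"CPU{socket + 1}" if socket <= 3 else None,
        "channel": None,  # stays None unless ED2 bit 4 says it is valid
        "dimm": None,     # stays None unless ED2 bit 3 says it is valid
    }
    if ed2 & 0x10:  # ED2 [4] = 1b: channel number in ED3 [4:3] is valid
        result["channel"] = "ABCD"[(ed3 >> 3) & 0x3]
    if ed2 & 0x08:  # ED2 [3] = 1b: DIMM slot ID in ED3 [2:0] is valid
        slot = ed3 & 0x7
        result["dimm"] = slot + 1 if slot <= 2 else None
    return result


print(decode_parity_event(0x12, 0x3B))
# {'error_type': 'Address Parity Error', 'cpu': 'CPU2', 'channel': 'D', 'dimm': None}
```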

## 7.5.2.1 Memory Address Parity Error Sensor – Next Steps

These are bit errors that are detected in the memory addressing hardware. An Address Parity Error implies that the memory address transmitted to the DIMM addressing circuitry has been compromised, and data read or written is compromised in turn. An Address Parity Error is logged as such in SEL but in all other ways is treated the same as an Uncorrectable ECC Error.

While the error may be due to a failing DRAM chip on the DIMM, it can also be caused by incorrect seating or improper contact between the socket and DIMM, or by bent pins in the processor socket.

- 1. If needed, decode DIMM location from hex version of SEL.
- 2. Verify the DIMM is seated properly.
- 3. Examine gold fingers on edge of the DIMM to verify contacts are clean.
- 4. Inspect the processor socket this DIMM is connected to for bent pins, and if found, replace the board.
- 5. Consider replacing the DIMM as a preventative measure. For multiple occurrences, replace the DIMM.

# 8. PCI Express\* and Legacy PCI Subsystem

The *PCI Express\** (*PCIe*) Specification defines standard error types under the Advanced Error Reporting (AER) capabilities. The BIOS logs AER events into the SEL.

The Legacy PCI Specification error types are PERR and SERR. These errors are supported and logged into the SEL.

# 8.1 PCI Express\* Errors

PCIe error events are either correctable (informational) or fatal. In both cases, the bus, device, and function numbers of the source device are included in the extended event data fields to help identify the source of the error. The operating system maps each PCIe device by its bus, device, and function numbers, which uniquely identify the device, so the affected device can be identified from within the operating system.

## 8.1.1 Legacy PCI Errors

Legacy PCI errors include PERR and SERR; both are fatal errors.

### Table 66: Legacy PCI Error Sensor Typical Characteristics

| Byte   | Field                             | Description                                                                                                                                                                               |
|--------|-----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 8<br>9 | Generator ID                      | 0033h = BIOS SMI Handler                                                                                                                                                                  |
| 11     | Sensor Type                       | 13h = Critical Interrupt                                                                                                                                                                  |
| 12     | Sensor Number                     | 03h                                                                                                                                                                                       |
| 13     | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 6Fh (Sensor Specific)</li> </ul>                                   |
| 14     | Event Data 1                      | <ul> <li>[7:6] - 10b = OEM code in Event Data 2</li> <li>[5:4] - 10b = OEM code in Event Data 3</li> <li>[3:0] - Event Trigger Offset</li> <li>4h = PCI PERR</li> <li>5h = PCI SERR</li> </ul> |
| 15     | Event Data 2                      | PCI Bus number                                                                                                                                                                            |
| 16     | Event Data 3                      | [7:3] – PCI Device number<br>[2:0] – PCI Function number                                                                                                                                  |
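The bus, device, and function fields in Table 66 can be combined into the conventional `bus:device.function` notation (hexadecimal bus and device) that tools such as `lspci` display. The following minimal Python sketch shows this. The function name is hypothetical, and it assumes Event Data 1-3 have already been read from the record.

```python
LEGACY_PCI_OFFSETS = {0x4: "PCI PERR", 0x5: "PCI SERR"}


def decode_legacy_pci(ed1: int, ed2: int, ed3: int) -> str:
    """Describe a Legacy PCI Error event (sensor 03h) with its bus:device.function."""
    trigger = LEGACY_PCI_OFFSETS.get(ed1 & 0x0F, "Reserved offset")
    bus = ed2                   # Event Data 2: PCI bus number
    device = (ed3 >> 3) & 0x1F  # Event Data 3 [7:3]: device number
    function = ed3 & 0x7        # Event Data 3 [2:0]: function number
    return f"{trigger} at {bus:02x}:{device:02x}.{function}"


print(decode_legacy_pci(0xA5, 0x03, 0x10))  # PCI SERR at 03:02.0
```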

## 8.1.1.1 Legacy PCI Error Sensor – Next Steps

- 1. Decode the bus, device, and function to identify the card.
- 2. If this is an add-in card:
  - a. Verify the card is inserted properly.
  - b. Install the card in another slot and check whether the error follows the card or stays with the slot.
  - c. Update all firmware and drivers, including non-Intel components.
- 3. If this is an on-board device:
  - a. Update all BIOS, firmware, and drivers.
  - b. Replace the board.
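On a Linux host, the first step can be completed by looking up the bus, device, and function in sysfs, or with `lspci -s 03:02.0`. The following sketch makes two assumptions:

- the PCI domain is 0000;
- the operating system has kept the bus numbers assigned by the BIOS.

Both helper names are hypothetical. `describe_pci_device` returns `None` if no device exists at that address.

```python
from __future__ import annotations

from pathlib import Path


def pci_sysfs_path(bus: int, device: int, function: int, domain: int = 0) -> Path:
    """Build the Linux sysfs directory for a PCI bus/device/function."""
    return Path(f"/sys/bus/pci/devices/{domain:04x}:{bus:02x}:{device:02x}.{function:x}")


def describe_pci_device(bus: int, device: int, function: int) -> str | None:
    """Return the vendor and device IDs at that address, if it exists."""
    path = pci_sysfs_path(bus, device, function)
    if not path.exists():
        return None
    vendor = (path / "vendor").read_text().strip()  # hex ID, e.g. "0x8086"
    dev_id = (path / "device").read_text().strip()
    return f"{path.name}: vendor {vendor}, device {dev_id}"


print(pci_sysfs_path(0x03, 0x02, 0x0))  # /sys/bus/pci/devices/0000:03:02.0
```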

## 8.1.2 PCI Express\* Fatal Errors and Fatal Error #2

When a PCI Express\* fatal error is reported to the BIOS SMI handler, the handler records it in the SEL using the following format.

| Byte   | Field                             | Description                                                                                                                                          |
|--------|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|
| 8<br>9 | Generator ID                      | 0033h = BIOS SMI Handler                                                                                                                             |
| 11     | Sensor Type                       | 13h = Critical Interrupt                                                                                                                             |
| 12     | Sensor Number                     | 04h                                                                                                                                                  |
| 13     | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 70h (OEM Specific)</li> </ul> |

Table 67: PCI Express\* Fatal Error Sensor Typical Characteristics

System Event Log Troubleshooting Guide for EPSD Platforms Based on Intel<sup>®</sup> Xeon<sup>®</sup> Processor E5 4600/2600/2400/1600/1400 Product Families PCI Express<sup>\*</sup> and Legacy PCI Subsystem

| Byte | Field        | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
|------|--------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 14   | Event Data 1 | [7:6] – 10b = OEM code in Event Data 2         [5:4] – 10b = OEM code in Event Data 3         [3:0] – Event Trigger         0h = Data Link Layer Protocol Error         1h = Surprise Link Down Error         2h = Completer Abort         3h = Unsupported Request         4h = Poisoned TLP         5h = Flow Control Protocol         6h = Completion Timeout         7h = Receiver Buffer Overflow         8h = ACS Violation         9h = Malformed TLP         Ah = ECRC Error         Bh = Received Fatal Message From Downstream         Ch = Unexpected Completion         Dh = Received ERR_NONFATAL Message         Eh = Uncorrectable Internal |
| 15   | Event Data 2 | Fh = MC Blocked TLP<br>PCI Bus number                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| 16   | Event Data 3 | [7:3] – PCI Device number<br>[2:0] – PCI Function number                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
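Event Data 1 bits [3:0] of the PCI Express\* Fatal Error sensor define all 16 values, so the trigger can be looked up directly by index. The following minimal Python sketch does this. The function name is hypothetical.

```python
PCIE_FATAL_TRIGGERS = (
    "Data Link Layer Protocol Error", "Surprise Link Down Error",
    "Completer Abort", "Unsupported Request", "Poisoned TLP",
    "Flow Control Protocol", "Completion Timeout", "Receiver Buffer Overflow",
    "ACS Violation", "Malformed TLP", "ECRC Error",
    "Received Fatal Message From Downstream", "Unexpected Completion",
    "Received ERR_NONFATAL Message", "Uncorrectable Internal", "MC Blocked TLP",
)  # index = Event Data 1 [3:0], 0h through Fh


def decode_pcie_fatal(ed1: int, ed2: int, ed3: int) -> str:
    """Describe a PCI Express Fatal Error event (sensor 04h, Table 67)."""
    trigger = PCIE_FATAL_TRIGGERS[ed1 & 0x0F]
    device, function = (ed3 >> 3) & 0x1F, ed3 & 0x7
    return f"{trigger} at {ed2:02x}:{device:02x}.{function}"


print(decode_pcie_fatal(0xA6, 0x05, 0x00))  # Completion Timeout at 05:00.0
```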

The PCI Express\* Fatal Error #2 is a continuation of the PCI Express\* Fatal Error.

| Byte   | Field                             | Description                                                                                                                                                                                                                                                                               |
|--------|-----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 8<br>9 | Generator ID                      | 0033h = BIOS SMI Handler                                                                                                                                                                                                                                                                  |
| 11     | Sensor Type                       | 13h = Critical Interrupt                                                                                                                                                                                                                                                                  |
| 12     | Sensor Number                     | 14h                                                                                                                                                                                                                                                                                       |
| 13     | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 76h (OEM Specific)</li> </ul>                                                                                                                                      |
| 14     | Event Data 1                      | <ul> <li>[7:6] - 10b = OEM code in Event Data 2</li> <li>[5:4] - 10b = OEM code in Event Data 3</li> <li>[3:0] - Event Trigger Offset <ul> <li>0h = Atomic Egress Blocked</li> <li>1h = TLP Prefix Blocked</li> <li>Fh = Unspecified Non-AER Fatal Error</li> </ul></li> </ul>             |
| 15     | Event Data 2                      | PCI Bus number                                                                                                                                                                                                                                                                            |
| 16     | Event Data 3                      | [7:3] – PCI Device number<br>[2:0] – PCI Function number                                                                                                                                                                                                                                  |

### Table 68: PCI Express\* Fatal Error #2 Sensor Typical Characteristics
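In a log-analysis script, bytes 13 and 14 of this record might be decoded as follows. This is a minimal Python sketch that assumes the individual bytes have already been extracted from the SEL entry; the function names are hypothetical and not part of any Intel tool.

```python
# Event Trigger Offsets for the PCI Express* Fatal Error #2 sensor (14h).
FATAL2_OFFSETS = {
    0x0: "Atomic Egress Blocked",
    0x1: "TLP Prefix Blocked",
    0xF: "Unspecified Non-AER Fatal Error",
}


def decode_direction_and_type(byte13):
    """Byte 13: bit 7 is the event direction, bits [6:0] the event type."""
    direction = "Deassertion" if byte13 & 0x80 else "Assertion"
    return direction, byte13 & 0x7F


def decode_event_data1(ed1):
    """Byte 14 (Event Data 1) split into its three bit fields."""
    return {
        "ed2_usage": (ed1 >> 6) & 0b11,  # [7:6], 10b = OEM code in Event Data 2
        "ed3_usage": (ed1 >> 4) & 0b11,  # [5:4], 10b = OEM code in Event Data 3
        "offset": ed1 & 0x0F,            # [3:0], Event Trigger Offset
    }


def describe_fatal2(ed1):
    """Map the Event Trigger Offset in Event Data 1 to its name."""
    offset = decode_event_data1(ed1)["offset"]
    return FATAL2_OFFSETS.get(offset, f"Offset {offset:X}h (not listed in Table 68)")


print(decode_direction_and_type(0xF6))  # ('Deassertion', 118)
print(describe_fatal2(0xA1))            # TLP Prefix Blocked
```

Event type 118 is 76h (OEM Specific), matching byte 13 in the table.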

## 8.1.2.1 PCI Express\* Fatal Error and Fatal Error #2 Sensor – Next Steps

- 1. Decode the bus, device, and function numbers (Event Data 2 and Event Data 3) to identify the card.
- 2. If this is an add-in card:
  - a. Verify the card is inserted properly.
  - b. Install the card in another slot and check whether the error follows the card or stays with the slot.
  - c. Update all firmware and drivers, including non-Intel components.
- 3. If this is an on-board device:
  - a. Update all BIOS, firmware, and drivers.
  - b. Replace the board.
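Step 1 can be scripted. The sketch below assumes Event Data 2 and Event Data 3 have already been read from the SEL entry (the function name is hypothetical) and formats them in the hexadecimal bus:device.function notation that tools such as `lspci` print, which makes it easier to match the error to a slot or on-board device.

```python
def decode_bdf(ed2, ed3):
    """Format Event Data 2/3 as bus:device.function in hexadecimal."""
    bus = ed2 & 0xFF            # Event Data 2: PCI bus number
    device = (ed3 >> 3) & 0x1F  # Event Data 3 [7:3]: PCI device number
    function = ed3 & 0x07       # Event Data 3 [2:0]: PCI function number
    return f"{bus:02x}:{device:02x}.{function:x}"


print(decode_bdf(0x03, 0x10))  # 03:02.0
print(decode_bdf(0x81, 0x0A))  # 81:01.2
```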

## 8.1.3 PCI Express\* Correctable Errors

When a PCI Express\* correctable error is reported to the BIOS SMI handler, the handler records the error in the following format.

| Byte   | Field                             | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|--------|-----------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 8<br>9 | Generator ID                      | 0033h = BIOS SMI Handler                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| 11     | Sensor Type                       | 13h = Critical Interrupt                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| 12     | Sensor Number                     | 05h                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| 13     | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 71h (OEM Specific)</li> </ul>                                                                                                                                                                                                                                                                                                                                                      |
| 14     | Event Data 1                      | <ul> <li>[7:6] - 10b = OEM code in Event Data 2</li> <li>[5:4] - 10b = OEM code in Event Data 3</li> <li>[3:0] - Event Trigger Offset         <ul> <li>0h = Receiver Error</li> <li>1h = Bad DLLP</li> <li>2h = Bad TLP</li> <li>3h = Replay Num Rollover</li> <li>4h = Replay Timer timeout</li> <li>5h = Advisory Non-fatal</li> <li>6h = Link BW Changed</li> <li>7h = Correctable Internal</li> <li>8h = Header Log Overflow</li> <li>Fh = Unspecified Non-AER Correctable Error</li> </ul></li></ul> |
| 15     | Event Data 2                      | PCI Bus number                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
| 16     | Event Data 3                      | [7:3] – PCI Device number<br>[2:0] – PCI Function number                                                                                                                                                                                                                                                                                                                                                                                                                                                  |

Table 69: PCI Express\* Correctable Error Sensor Typical Characteristics
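A script that scans the SEL might recognize these records by sensor type (13h), sensor number (05h), and event type (71h) before looking up the trigger offset. A minimal sketch, assuming the relevant bytes are already extracted; the names are hypothetical.

```python
# Event Trigger Offsets for the PCI Express* correctable error sensor (05h).
CORRECTABLE_OFFSETS = {
    0x0: "Receiver Error",
    0x1: "Bad DLLP",
    0x2: "Bad TLP",
    0x3: "Replay Num Rollover",
    0x4: "Replay Timer timeout",
    0x5: "Advisory Non-fatal",
    0x6: "Link BW Changed",
    0x7: "Correctable Internal",
    0x8: "Header Log Overflow",
    0xF: "Unspecified Non-AER Correctable Error",
}


def describe_correctable(sensor_type, sensor_number, byte13, ed1):
    """Return a description, or None if the record is not a PCIe correctable error."""
    if sensor_type != 0x13 or sensor_number != 0x05 or (byte13 & 0x7F) != 0x71:
        return None
    direction = "Deassertion" if byte13 & 0x80 else "Assertion"
    offset = ed1 & 0x0F  # Event Data 1 [3:0]: Event Trigger Offset
    name = CORRECTABLE_OFFSETS.get(offset, "offset %Xh (not listed)" % offset)
    return f"{direction}: {name}"


print(describe_correctable(0x13, 0x05, 0x71, 0xA6))  # Assertion: Link BW Changed
```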

## 8.1.3.1 PCI Express\* Correctable Error Sensor – Next Steps

This is an informational event only. Correctable errors are acceptable and normal at a low rate of occurrence. If the errors persist:

- 1. Decode the bus, device, and function numbers (Event Data 2 and Event Data 3) to identify the card.
- 2. If this is an add-in card:
  - a. Verify the card is inserted properly.
  - b. Install the card in another slot and check whether the error follows the card or stays with the slot.
  - c. Update all firmware and drivers, including non-Intel components.

- 3. If this is an on-board device:
  - a. Update all BIOS, firmware, and drivers.
  - b. Replace the board.
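This guide does not define what counts as a "low rate" of correctable errors. As one possible way to decide whether the errors persist, the sketch below counts correctable errors per bus:device.function inside a sliding time window. The 24-hour window and the threshold of 10 are arbitrary placeholders, not Intel guidance, and the function name is hypothetical.

```python
from collections import defaultdict, deque

# Placeholder values, not from this guide: tune for your environment.
WINDOW_SECONDS = 24 * 3600
THRESHOLD = 10


def persistent_sources(events, window=WINDOW_SECONDS, threshold=THRESHOLD):
    """events: (timestamp_seconds, bdf) pairs in ascending time order.

    Return the set of devices that logged at least `threshold`
    correctable errors within any `window`-second span.
    """
    recent = defaultdict(deque)  # bdf -> timestamps inside the current window
    flagged = set()
    for ts, bdf in events:
        stamps = recent[bdf]
        stamps.append(ts)
        while ts - stamps[0] > window:  # drop entries older than the window
            stamps.popleft()
        if len(stamps) >= threshold:
            flagged.add(bdf)
    return flagged


burst = [(t, "03:00.0") for t in range(0, 100, 10)]  # 10 errors in 90 s
sparse = [(t * 86400, "04:00.0") for t in range(3)]  # 1 error per day
print(persistent_sources(sorted(burst + sparse)))    # {'03:00.0'}
```

A device flagged this way is a candidate for the card-reseat and slot-swap steps above; an isolated error is not.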

# 9. System BIOS Events

A number of events are owned by the system BIOS. These events can occur during Power On Self Test (POST) or when the system resumes from a sleep state. Not all of these events signify errors. Some BIOS-owned events, such as memory events, are described in other chapters of this document.

# 9.1 System Events

These events can occur during POST or when coming out of a sleep state. These are informational events only.

- 1. Events logged by the BIOS during POST use generator ID 0001h.
- 2. Events logged by the BIOS SMI handler use generator ID 0033h.
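The byte numbers used in the tables of this guide (bytes 8 and 9 for the Generator ID, byte 11 for the sensor type, and so on) follow the 16-byte IPMI SEL record layout. The sketch below unpacks a raw record under that assumption, treating multi-byte fields as little-endian as IPMI does, and names the two BIOS generator IDs. The class and function names are hypothetical.

```python
import struct
from dataclasses import dataclass

GENERATOR_IDS = {0x0001: "BIOS POST", 0x0033: "BIOS SMI Handler"}


@dataclass
class SelRecord:
    record_id: int      # bytes 1-2
    record_type: int    # byte 3
    timestamp: int      # bytes 4-7, seconds
    generator_id: int   # bytes 8-9
    evm_rev: int        # byte 10
    sensor_type: int    # byte 11
    sensor_number: int  # byte 12
    dir_type: int       # byte 13, event direction and event type
    ed1: int            # byte 14, Event Data 1
    ed2: int            # byte 15, Event Data 2
    ed3: int            # byte 16, Event Data 3


def parse_sel(raw: bytes) -> SelRecord:
    """Unpack a 16-byte SEL record (little-endian multi-byte fields assumed)."""
    if len(raw) != 16:
        raise ValueError(f"expected 16 bytes, got {len(raw)}")
    return SelRecord(*struct.unpack("<HBIH7B", raw))


def generator_name(rec: SelRecord) -> str:
    return GENERATOR_IDS.get(rec.generator_id, f"other ({rec.generator_id:04X}h)")


raw = struct.pack("<HBIH7B", 0x0010, 0x02, 1000, 0x0033,
                  0x04, 0x13, 0x05, 0x71, 0xA0, 0x03, 0x10)
rec = parse_sel(raw)
print(generator_name(rec), hex(rec.sensor_number))  # BIOS SMI Handler 0x5
```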

## 9.1.1 System Boot

At the end of POST, just before the OS boot begins, the BIOS logs a System Boot event. This event marks the transition of control from POST to the OS loader. It is an informational event only.

## 9.1.2 Timestamp Clock Synchronization

These events are logged when the BIOS synchronizes its time with the BMC. Two events are logged. The BIOS logs the first event when it sends the time synch message to the BMC. The timestamp on this entry is not meaningful: it is taken from the BMC clock before the update (the "before" timestamp), so it can be any value.

The BIOS therefore sends a second time synch message so that the log contains an entry with a correct "baseline" timestamp. That entry marks the "starting time".

For example, suppose the BMC time is March 1, 2011 21:00, and the BIOS time synch updates it to 21:20 on the same date (the BMC clock was running behind). Without the second time synch message, there is no record that the log time jumped ahead, and the next log entry appears to follow an unexplained 20-minute delay during boot.

Without the second time synch message, the time span to the next logged entry is indeterminate. With the second entry as a baseline, the timestamps of all subsequent entries are determinate.

BIOS POST runs timestamp clock synchronization and logs these events every time the system boots. In addition, during shutdown of some operating systems, the BIOS SMI handler is called to run timestamp clock synchronization and log the events.
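Because the first entry of each pair carries the "before" timestamp and the second carries the corrected one, the difference between them approximates how far the clock jumped (it also includes the brief interval between the two messages). A minimal sketch, assuming the entries have already been decoded into (timestamp, sensor number, Event Trigger Offset, Event Data 2) tuples; the function name is hypothetical.

```python
def clock_jumps(entries):
    """entries: (timestamp, sensor_number, offset, ed2) tuples in log order.

    Return the timestamp difference, in seconds, for each Timestamp Clock
    Synchronization pair (sensor 83h, offset 05h; ED2 00h = 1st, 80h = 2nd).
    """
    jumps = []
    first = None
    for ts, sensor, offset, ed2 in entries:
        if sensor != 0x83 or offset != 0x05:
            continue  # not a time synch event (e.g. System Boot, offset 01h)
        if ed2 == 0x00:
            first = ts
        elif ed2 == 0x80 and first is not None:
            jumps.append(ts - first)
            first = None
    return jumps


# The 21:00 -> 21:20 example from the text, with timestamps in seconds.
t0 = 21 * 3600
log = [(t0, 0x83, 0x05, 0x00), (t0 + 1200, 0x83, 0x05, 0x80), (t0 + 1300, 0x83, 0x01, 0x00)]
print(clock_jumps(log))  # [1200]
```

A positive result means the clock moved forward; elapsed-time calculations for later entries should start from the second entry of the pair.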

| Byte   | Field                             | Description                                                                                                                                                                                                                                                           |
|--------|-----------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 8<br>9 | Generator ID                      | <ul> <li>0001h = BIOS POST</li> <li>0033h = BIOS SMI Handler</li> </ul>                                                                                                                                                                                               |
| 11     | Sensor Type                       | 12h = System Event                                                                                                                                                                                                                                                    |
| 12     | Sensor Number                     | 83h                                                                                                                                                                                                                                                                   |
| 13     | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 6Fh (Sensor Specific)</li> </ul>                                                                                                               |
| 14     | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset <ul> <li>01h = System Boot</li> <li>05h = Timestamp Clock Synchronization</li> </ul></li> </ul>                                   |
| 15     | Event Data 2                      | For Event Trigger Offset 05h (Timestamp Clock Synchronization) only:<br>00h = 1st in pair<br>80h = 2nd in pair                                                                                                                                                         |
| 16     | Event Data 3                      | Not used                                                                                                                                                                                                                                                              |

### Table 70: System Event Sensor Typical Characteristics

# 9.2 System Firmware Progress (Formerly POST Error)

The BIOS logs all POST errors to the SEL. The 2-byte POST error code is logged in the Event Data 2 (low byte) and Event Data 3 (high byte) fields of the SEL entry. This event is logged every time a POST error is displayed. Although this event indicates an error, the error is not necessarily fatal. For a serious error, a separate SEL entry is typically also logged for the underlying cause; that entry may contain more information about what happened than the POST error event.

| Byte | Field                             | Description                                                                                                                                             |
|------|-----------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
| 8<br>9 | Generator ID                    | 0001h = BIOS POST                                                                                                                                       |
| 11   | Sensor Type                       | 0Fh = System Firmware Progress (formerly POST Error)                                                                                                    |
| 12   | Sensor Number                     | 06h                                                                                                                                                     |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 6Fh (Sensor Specific)</li> </ul> |
| 14   | Event Data 1                      | <ul> <li>[7:6] – 10b = OEM code in Event Data 2</li> <li>[5:4] – 10b = OEM code in Event Data 3</li> <li>[3:0] – Event Trigger Offset = 0h</li> </ul>   |
| 15   | Event Data 2                      | Low Byte of POST Error Code                                                                                                                             |
| 16   | Event Data 3                      | High Byte of POST Error Code                                                                                                                            |

### Table 71: POST Error Sensor Typical Characteristics
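Because Event Data 2 holds the low byte and Event Data 3 the high byte of the POST error code, a script can reassemble the code as `(ED3 << 8) | ED2` before looking it up in Table 72. A minimal sketch; the lookup dictionary is a hypothetical excerpt containing only a few codes from Table 72, and the function names are illustrative.

```python
# Excerpt of Table 72 (code -> (message, response)); not a complete list.
POST_ERRORS = {
    0x0141: ("PCI resource conflict", "Major"),
    0x5220: ("BIOS Settings reset to default settings", "Major"),
    0x8180: ("Processor 01 microcode update not found", "Minor"),
    0x84FF: ("System event log full", "Minor"),
}


def post_error_code(ed2, ed3):
    """Event Data 2 is the low byte, Event Data 3 the high byte."""
    return ((ed3 & 0xFF) << 8) | (ed2 & 0xFF)


def describe_post_error(ed2, ed3):
    code = post_error_code(ed2, ed3)
    message, response = POST_ERRORS.get(code, ("not in this excerpt", "unknown"))
    return f"{code:04X}: {message} ({response})"


print(describe_post_error(0x20, 0x52))  # 5220: BIOS Settings reset to default settings (Major)
print(describe_post_error(0xFF, 0x84))  # 84FF: System event log full (Minor)
```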

## 9.2.1 System Firmware Progress (Formerly POST Error) – Next Steps

See the following table for POST Error Codes.

| Error Code | Error Message                                                  | Response |
|------------|----------------------------------------------------------------|----------|
| 0012       | System RTC date/time not set                                   | Major    |
| 0048       | Password check failed                                          | Major    |
| 0140       | PCI component encountered a PERR error                         | Major    |
| 0141       | PCI resource conflict                                          | Major    |
| 0146       | PCI out of resources error                                     | Major    |
| 0191       | Processor core/thread count mismatch detected                  | Fatal    |
| 0192       | Processor cache size mismatch detected                         | Fatal    |
| 0194       | Processor family mismatch detected                             | Fatal    |
| 0195       | Processor Intel(R) QPI link frequencies unable to synchronize  | Fatal    |
| 0196       | Processor model mismatch detected                              | Fatal    |
| 0197       | Processor frequencies unable to synchronize                    | Fatal    |
| 5220       | BIOS Settings reset to default settings                        | Major    |
| 5221       | Passwords cleared by jumper                                    | Major    |
| 5224       | Password clear jumper is Set                                   | Major    |
| 8130       | Processor 01 disabled                                          | Major    |
| 8131       | Processor 02 disabled                                          | Major    |
| 8132       | Processor 03 disabled                                          | Major    |
| 8133       | Processor 04 disabled                                          | Major    |
| 8160       | Processor 01 unable to apply microcode update                  | Major    |
| 8161       | Processor 02 unable to apply microcode update                  | Major    |
| 8162       | Processor 03 unable to apply microcode update                  | Major    |
| 8163       | Processor 04 unable to apply microcode update                  | Major    |
| 8170       | Processor 01 failed Self Test (BIST)                           | Major    |
| 8171       | Processor 02 failed Self Test (BIST)                           | Major    |
| 8172       | Processor 03 failed Self Test (BIST)                           | Major    |
| 8173       | Processor 04 failed Self Test (BIST)                           | Major    |
| 8180       | Processor 01 microcode update not found                        | Minor    |
| 8181       | Processor 02 microcode update not found                        | Minor    |
| 8182       | Processor 03 microcode update not found                        | Minor    |
| 8183       | Processor 04 microcode update not found                        | Minor    |

## Table 72: POST Error Codes


| Error Code | Error Message                                                     | Response |
|------------|-------------------------------------------------------------------|----------|
| 8190       | Watchdog timer failed on last boot                                | Major    |
| 8198       | OS boot watchdog timer failure                                    | Major    |
| 8300       | Baseboard management controller failed self test                  | Major    |
| 8305       | Hot-Swap Controller failure                                       | Major    |
| 83A0       | Management Engine (ME) failed self test                           | Major    |
| 83A1       | Management Engine (ME) failed to respond                          | Major    |
| 84F2       | Baseboard management controller failed to respond                 | Major    |
| 84F3       | Baseboard management controller in update mode                    | Major    |
| 84F4       | Sensor data record empty                                          | Major    |
| 84FF       | System event log full                                             | Minor    |
| 8500       | Memory component could not be configured in the selected RAS mode | Major    |
| 8501       | DIMM Population Error                                             | Major    |
| 8520       | DIMM_A1 failed test/initialization                                | Major    |
| 8521       | DIMM_A2 failed test/initialization                                | Major    |
| 8522       | DIMM_A3 failed test/initialization                                | Major    |
| 8523       | DIMM_B1 failed test/initialization                                | Major    |
| 8524       | DIMM_B2 failed test/initialization                                | Major    |
| 8525       | DIMM_B3 failed test/initialization                                | Major    |
| 8526       | DIMM_C1 failed test/initialization                                | Major    |
| 8527       | DIMM_C2 failed test/initialization                                | Major    |
| 8528       | DIMM_C3 failed test/initialization                                | Major    |
| 8529       | DIMM_D1 failed test/initialization                                | Major    |
| 852A       | DIMM_D2 failed test/initialization                                | Major    |
| 852B       | DIMM_D3 failed test/initialization                                | Major    |
| 852C       | DIMM_E1 failed test/initialization                                | Major    |
| 852D       | DIMM_E2 failed test/initialization                                | Major    |
| 852E       | DIMM_E3 failed test/initialization                                | Major    |
| 852F       | DIMM_F1 failed test/initialization                                | Major    |
| 8530       | DIMM_F2 failed test/initialization                                | Major    |
| 8531       | DIMM_F3 failed test/initialization                                | Major    |
| 8532       | DIMM_G1 failed test/initialization                                | Major    |
| 8533       | DIMM_G2 failed test/initialization                                | Major    |


| Error Code           | Error Message                      | Response |
|----------------------|------------------------------------|----------|
| 8534                 | DIMM_G3 failed test/initialization | Major    |
| 8535                 | DIMM_H1 failed test/initialization | Major    |
| 8536                 | DIMM_H2 failed test/initialization | Major    |
| 8537                 | DIMM_H3 failed test/initialization | Major    |
| 8538                 | DIMM_J1 failed test/initialization | Major    |
| 8539                 | DIMM_J2 failed test/initialization | Major    |
| 853A                 | DIMM_J3 failed test/initialization | Major    |
| 853B                 | DIMM_K1 failed test/initialization | Major    |
| 853C                 | DIMM_K2 failed test/initialization | Major    |
| 853D                 | DIMM_K3 failed test/initialization | Major    |
| 853E                 | DIMM_L1 failed test/initialization | Major    |
| 853F<br>(Go to 85C0) | DIMM_L2 failed test/initialization | Major    |
| 8540                 | DIMM_A1 disabled                   | Major    |
| 8541                 | DIMM_A2 disabled                   | Major    |
| 8542                 | DIMM_A3 disabled                   | Major    |
| 8543                 | DIMM_B1 disabled                   | Major    |
| 8544                 | DIMM_B2 disabled                   | Major    |
| 8545                 | DIMM_B3 disabled                   | Major    |
| 8546                 | DIMM_C1 disabled                   | Major    |
| 8547                 | DIMM_C2 disabled                   | Major    |
| 8548                 | DIMM_C3 disabled                   | Major    |
| 8549                 | DIMM_D1 disabled                   | Major    |
| 854A                 | DIMM_D2 disabled                   | Major    |
| 854B                 | DIMM_D3 disabled                   | Major    |
| 854C                 | DIMM_E1 disabled                   | Major    |
| 854D                 | DIMM_E2 disabled                   | Major    |
| 854E                 | DIMM_E3 disabled                   | Major    |
| 854F                 | DIMM_F1 disabled                   | Major    |
| 8550                 | DIMM_F2 disabled                   | Major    |
| 8551                 | DIMM_F3 disabled                   | Major    |
| 8552                 | DIMM_G1 disabled                   | Major    |


| Error Code           | Error Message                                                       | Response |
|----------------------|---------------------------------------------------------------------|----------|
| 8553                 | DIMM_G2 disabled                                                    | Major    |
| 8554                 | DIMM_G3 disabled                                                    | Major    |
| 8555                 | DIMM_H1 disabled                                                    | Major    |
| 8556                 | DIMM_H2 disabled                                                    | Major    |
| 8557                 | DIMM_H3 disabled                                                    | Major    |
| 8558                 | DIMM_J1 disabled                                                    | Major    |
| 8559                 | DIMM_J2 disabled                                                    | Major    |
| 855A                 | DIMM_J3 disabled                                                    | Major    |
| 855B                 | DIMM_K1 disabled                                                    | Major    |
| 855C                 | DIMM_K2 disabled                                                    | Major    |
| 855D                 | DIMM_K3 disabled                                                    | Major    |
| 855E                 | DIMM_L1 disabled                                                    | Major    |
| 855F<br>(Go to 85D0) | DIMM_L2 disabled                                                    | Major    |
| 8560                 | DIMM_A1 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 8561                 | DIMM_A2 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 8562                 | DIMM_A3 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 8563                 | DIMM_B1 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 8564                 | DIMM_B2 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 8565                 | DIMM_B3 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 8566                 | DIMM_C1 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 8567                 | DIMM_C2 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 8568                 | DIMM_C3 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 8569                 | DIMM_D1 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 856A                 | DIMM_D2 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 856B                 | DIMM_D3 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 856C                 | DIMM_E1 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 856D                 | DIMM_E2 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 856E                 | DIMM_E3 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 856F                 | DIMM_F1 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 8570                 | DIMM_F2 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 8571                 | DIMM_F3 encountered a Serial Presence Detection (SPD) failure       | Major    |

| Error Code           | Error Message                                                                                 | Response |
|----------------------|-----------------------------------------------------------------------------------------------|----------|
| 8572                 | DIMM_G1 encountered a Serial Presence Detection (SPD) failure | Major    |
| 8573                 | DIMM_G2 encountered a Serial Presence Detection (SPD) failure | Major    |
| 8574                 | DIMM_G3 encountered a Serial Presence Detection (SPD) failure | Major    |
| 8575                 | DIMM_H1 encountered a Serial Presence Detection (SPD) failure | Major    |
| 8576                 | DIMM_H2 encountered a Serial Presence Detection (SPD) failure | Major    |
| 8577                 | DIMM_H3 encountered a Serial Presence Detection (SPD) failure | Major    |
| 8578                 | DIMM_J1 encountered a Serial Presence Detection (SPD) failure | Major    |
| 8579                 | DIMM_J2 encountered a Serial Presence Detection (SPD) failure | Major    |
| 857A                 | DIMM_J3 encountered a Serial Presence Detection (SPD) failure | Major    |
| 857B                 | DIMM_K1 encountered a Serial Presence Detection (SPD) failure | Major    |
| 857C                 | DIMM_K2 encountered a Serial Presence Detection (SPD) failure | Major    |
| 857D                 | DIMM_K3 encountered a Serial Presence Detection (SPD) failure | Major    |
| 857E                 | DIMM_L1 encountered a Serial Presence Detection (SPD) failure | Major    |
| 857F<br>(Go to 85E0) | DIMM_L2 encountered a Serial Presence Detection (SPD) failure | Major    |
| 85C0                 | DIMM_L3 failed test/initialization                            | Major    |
| 85C1                 | DIMM_M1 failed test/initialization                            | Major    |
| 85C2                 | DIMM_M2 failed test/initialization                            | Major    |
| 85C3                 | DIMM_M3 failed test/initialization                            | Major    |
| 85C4                 | DIMM_N1 failed test/initialization                            | Major    |
| 85C5                 | DIMM_N2 failed test/initialization                            | Major    |
| 85C6                 | DIMM_N3 failed test/initialization                            | Major    |
| 85C7                 | DIMM_P1 failed test/initialization                            | Major    |
| 85C8                 | DIMM_P2 failed test/initialization                            | Major    |
| 85C9                 | DIMM_P3 failed test/initialization                            | Major    |
| 85CA                 | DIMM_R1 failed test/initialization                            | Major    |
| 85CB                 | DIMM_R2 failed test/initialization                            | Major    |
| 85CC                 | DIMM_R3 failed test/initialization                            | Major    |
| 85CD                 | DIMM_T1 failed test/initialization                            | Major    |
| 85CE                 | DIMM_T2 failed test/initialization                            | Major    |
| 85CF                 | DIMM_T3 failed test/initialization                            | Major    |
| 85D0                 | DIMM_L3 disabled                                              | Major    |


| Error Code | Error Message                                                       | Response |
|------------|---------------------------------------------------------------------|----------|
| 85D1       | DIMM_M1 disabled                                                    | Major    |
| 85D2       | DIMM_M2 disabled                                                    | Major    |
| 85D3       | DIMM_M3 disabled                                                    | Major    |
| 85D4       | DIMM_N1 disabled                                                    | Major    |
| 85D5       | DIMM_N2 disabled                                                    | Major    |
| 85D6       | DIMM_N3 disabled                                                    | Major    |
| 85D7       | DIMM_P1 disabled                                                    | Major    |
| 85D8       | DIMM_P2 disabled                                                    | Major    |
| 85D9       | DIMM_P3 disabled                                                    | Major    |
| 85DA       | DIMM_R1 disabled                                                    | Major    |
| 85DB       | DIMM_R2 disabled                                                    | Major    |
| 85DC       | DIMM_R3 disabled                                                    | Major    |
| 85DD       | DIMM_T1 disabled                                                    | Major    |
| 85DE       | DIMM_T2 disabled                                                    | Major    |
| 85DF       | DIMM_T3 disabled                                                    | Major    |
| 85E0       | DIMM_L3 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 85E1       | DIMM_M1 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 85E2       | DIMM_M2 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 85E3       | DIMM_M3 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 85E4       | DIMM_N1 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 85E5       | DIMM_N2 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 85E6       | DIMM_N3 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 85E7       | DIMM_P1 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 85E8       | DIMM_P2 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 85E9       | DIMM_P3 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 85EA       | DIMM_R1 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 85EB       | DIMM_R2 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 85EC       | DIMM_R3 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 85ED       | DIMM_T1 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 85EE       | DIMM_T2 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 85EF       | DIMM_T3 encountered a Serial Presence Detection (SPD) failure       | Major    |
| 8604       | POST Reclaim of non-critical NVRAM variables                        |          |


| Error Code | Error Message                                                                        | Response |
|------------|--------------------------------------------------------------------------------------|----------|
| 8605       | BIOS Settings are corrupted                                                          | Major    |
| 8606       | NVRAM variable space was corrupted and has been reinitialized                        | Major    |
| 92A3       | Serial port component was not detected                                               | Major    |
| 92A9       | Serial port component encountered a resource conflict error                          | Major    |
| A000       | TPM device not detected.                                                             | Minor    |
| A001       | TPM device missing or not responding.                                                | Minor    |
| A002       | TPM device failure.                                                                  | Minor    |
| A003       | TPM device failed self test.                                                         | Minor    |
| A100       | BIOS ACM Error                                                                       | Major    |
| A421       | PCI component encountered a SERR error                                               | Fatal    |
| A5A0       | PCI Express* component encountered a PERR error                                      | Minor    |
| A5A1       | PCI Express* component encountered an SERR error                                     | Fatal    |
| A6A0       | DXE Boot Services driver: Not enough memory available to shadow a Legacy Option ROM. | Minor    |
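The DIMM rows of Table 72 follow a regular pattern. Each failure type has a block of 32 codes for DIMM_A1 through DIMM_L2. The "Go to" notes on codes 853F, 855F, and 857F point to a second block of 16 codes for DIMM_L3 through DIMM_T3.

The following is a minimal Python sketch that turns a code back into a slot name and failure type. It assumes the slot ordering shown in the tables: letters A through T without I, O, Q, and S, with three slots per letter. `decode_dimm_post_code` is a hypothetical helper, not part of any Intel utility. A given board may not have every slot it can name.

```python
from typing import Optional, Tuple

# Slot letters in the order used by the DIMM rows of Table 72
# (I, O, Q, and S do not appear as slot letters).
SLOT_LETTERS = "ABCDEFGHJKLMNPRT"
SLOTS = [f"DIMM_{letter}{n}" for letter in SLOT_LETTERS for n in (1, 2, 3)]

# First block: 32 codes per failure type, DIMM_A1 through DIMM_L2.
FIRST_BLOCK = {
    0x8520: "failed test/initialization",
    0x8540: "disabled",
    0x8560: "encountered a Serial Presence Detection (SPD) failure",
}
# Second block (the "Go to" targets): 16 codes per failure type,
# DIMM_L3 through DIMM_T3.
SECOND_BLOCK = {
    0x85C0: "failed test/initialization",
    0x85D0: "disabled",
    0x85E0: "encountered a Serial Presence Detection (SPD) failure",
}


def decode_dimm_post_code(code: int) -> Optional[Tuple[str, str]]:
    """Return (slot, failure) for a DIMM POST error code, or None if not a DIMM code."""
    for base, failure in FIRST_BLOCK.items():
        if base <= code < base + 32:
            return SLOTS[code - base], failure
    for base, failure in SECOND_BLOCK.items():
        if base <= code < base + 16:
            # The second block continues the slot list after DIMM_L2 (index 31).
            return SLOTS[32 + code - base], failure
    return None


print(decode_dimm_post_code(0x85CA))  # ('DIMM_R1', 'failed test/initialization')
```

The helper uses lookup tables instead of a hard-coded list of 96 codes. This keeps the mapping visibly tied to the block structure of Table 72.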


# 10. Chassis Subsystem

The BMC monitors several aspects of the chassis. In addition to logging when the power and reset buttons are pressed, the BMC monitors chassis intrusion (if the chassis includes a chassis intrusion switch) and the network connections, logging an event whenever a physical network link is lost.

## 10.1 Physical Security

Two sensors are included in the physical security subsystem: chassis intrusion and LAN leash lost.

### 10.1.1 Chassis Intrusion

Chassis Intrusion is monitored on supported chassis, and the BMC logs corresponding events when the chassis lid is opened and closed.

### 10.1.2 LAN Leash Lost

The LAN Leash Lost sensor monitors the physical connection of the onboard network ports. A LAN Leash Lost event means that a network port lost its physical connection.

| Byte | Field                             | Description                                                                                                                                                               |
|------|-----------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 05h = Physical Security                                                                                                                                                   |
| 12   | Sensor Number                     | 04h                                                                                                                                                                       |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 6Fh (Sensor Specific)</li> </ul>                   |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset as described in Table 74</li> </ul> |
| 15   | Event Data 2                      | Not used                                                                                                                                                                  |
| 16   | Event Data 3                      | Not used                                                                                                                                                                  |

#### Table 73: Physical Security Sensor Typical Characteristics

#### Table 74: Physical Security Sensor Event Trigger Offset - Next Steps

| Event Trigger Offset (Hex) | Event Trigger     | Description | Next Steps |
|----------------------------|-------------------|-------------|------------|
| 00h                        | Chassis intrusion | The chassis has been opened, or the chassis intrusion switch is not connected. | <ol> <li>Use the <i>Quick Start Guide</i> and the <i>Service Guide</i> to determine whether the chassis intrusion switch is connected properly.</li> <li>If it is, make sure the switch makes proper contact when the chassis is closed.</li> <li>If it does, the chassis has been opened. Ensure that only authorized personnel have access to the system.</li> </ol> |
| 04h                        | LAN leash lost    | A LAN cable that was present when the BMC initialized has been unplugged. This event is logged when the electrical connection on the NIC connector is lost. | This is most likely caused by unplugging the cable, but it can also happen if there is an issue with the cable or the switch. <ol> <li>Check the LAN cable and connector for issues.</li> <li>Investigate the switch logs where possible.</li> <li>Ensure that only authorized personnel have access to the server.</li> </ol> |
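The byte numbers in Tables 73 and 74 are 1-based positions within a 16-byte IPMI SEL record.

The following is a minimal Python sketch of decoding those fields. It assumes the record has already been read from the SEL as raw bytes. `parse_sel_event`, `describe_physical_security`, and the sample record are hypothetical and only illustrate the bit layout described in the tables.

```python
from dataclasses import dataclass

# Event trigger offsets from Table 74 (sensor type 05h, sensor number 04h).
PHYSICAL_SECURITY_OFFSETS = {0x00: "Chassis intrusion", 0x04: "LAN leash lost"}


@dataclass
class SelEvent:
    sensor_type: int
    sensor_number: int
    deassertion: bool
    event_type: int
    offset: int


def parse_sel_event(record: bytes) -> SelEvent:
    """Decode bytes 11-14 of a raw 16-byte SEL record (byte numbers are 1-based)."""
    if len(record) != 16:
        raise ValueError("expected a 16-byte SEL record")
    dir_type = record[12]  # byte 13: [7] direction, [6:0] event type
    data1 = record[13]  # byte 14: [3:0] event trigger offset
    return SelEvent(
        sensor_type=record[10],  # byte 11
        sensor_number=record[11],  # byte 12
        deassertion=bool(dir_type & 0x80),
        event_type=dir_type & 0x7F,
        offset=data1 & 0x0F,
    )


def describe_physical_security(event: SelEvent) -> str:
    """Name a Physical Security event using the offsets in Table 74."""
    if event.sensor_type != 0x05:
        raise ValueError("not a Physical Security event")
    name = PHYSICAL_SECURITY_OFFSETS.get(event.offset, f"offset {event.offset:02X}h")
    direction = "deassertion" if event.deassertion else "assertion"
    return f"{name} ({direction})"


# Hypothetical record: bytes 1-10 zeroed, then sensor type 05h, sensor 04h,
# assertion with event type 6Fh, offset 04h, Event Data 2/3 unused.
sample = bytes(10) + bytes([0x05, 0x04, 0x6F, 0x04, 0xFF, 0xFF])
print(describe_physical_security(parse_sel_event(sample)))  # LAN leash lost (assertion)
```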

## 10.2 FP (NMI) Interrupt

The BMC supports an NMI sensor for logging an event when a diagnostic interrupt is generated for the following cases:

- The front panel diagnostic interrupt button is pressed.
- The BMC receives an IPMI Chassis Control command that requests this action.

The front panel interrupt button (also referred to as the NMI button) is a recessed button on the front panel. It lets the user force a critical interrupt, which causes an OS crash error or kernel panic.
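The second case above uses the IPMI Chassis Control command (NetFn 00h, command 02h). In that command, action 4h requests a pulsed diagnostic interrupt.

The following is a minimal Python sketch that only builds the argument list for `ipmitool raw`. It assumes ipmitool is installed and can reach the BMC. `chassis_control_raw_args` is a hypothetical helper. Running the resulting command crashes the OS just as pressing the button does, so use it only where a memory dump is wanted.

```python
from typing import List

# IPMI Chassis Control request: NetFn Chassis (00h), command 02h.
CHASSIS_NETFN = 0x00
CHASSIS_CONTROL_CMD = 0x02

# Request data byte 1, bits [3:0]: chassis control action.
CHASSIS_CONTROL_ACTIONS = {
    "power_down": 0x0,
    "power_up": 0x1,
    "power_cycle": 0x2,
    "hard_reset": 0x3,
    "diagnostic_interrupt": 0x4,  # pulse Diagnostic Interrupt (NMI)
}


def chassis_control_raw_args(action: str) -> List[str]:
    """Build an `ipmitool raw` argument list for a Chassis Control request."""
    code = CHASSIS_CONTROL_ACTIONS[action]
    return [
        "ipmitool",
        "raw",
        f"0x{CHASSIS_NETFN:02x}",
        f"0x{CHASSIS_CONTROL_CMD:02x}",
        f"0x{code:02x}",
    ]


# Only builds the command; it does not run it.
print(" ".join(chassis_control_raw_args("diagnostic_interrupt")))
# ipmitool raw 0x00 0x02 0x04
```

ipmitool also exposes the same request directly as `ipmitool chassis power diag`.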


| Byte | Field                             | Description                                                                                                                                             |
|------|-----------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 13h = Critical Interrupt                                                                                                                                |
| 12   | Sensor Number                     | 05h                                                                                                                                                     |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 6Fh (Sensor Specific)</li> </ul> |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset = 0h</li> </ul>                   |
| 15   | Event Data 2                      | Not used                                                                                                                                                |
| 16   | Event Data 3                      | Not used                                                                                                                                                |

#### Table 75: FP (NMI) Interrupt Sensor Typical Characteristics

### 10.2.1 FP (NMI) Interrupt – Next Steps

This button is intended for diagnosing software issues. When a critical interrupt is generated, the OS typically saves a memory dump. The dump allows exact analysis of the contents of system memory, which is useful to software developers and for troubleshooting OS, software, and driver issues.

If the button was not actually pressed, check the front panel for a physical fault.

This event is logged only when a user presses the NMI button or sends an IPMI *Chassis Control* command requesting this action. Although it causes the OS to crash, it is not an error.

## 10.3 Button Sensor

The BMC logs an event when the front panel power or reset button is pressed. These events are purely informational and do not indicate errors.

| Byte | Field                             | Description                                                                                                                                                                                            |
|------|-----------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 14h = Button / Switch                                                                                                                                                                                  |
| 12   | Sensor Number                     | 09h                                                                                                                                                                                                    |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 6Fh (Sensor Specific)</li> </ul>                                                |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset</li> <li>0h = Power Button</li> <li>2h = Reset Button</li> </ul> |
| 15   | Event Data 2                      | Not used                                                                                                                                                                                               |
| 16   | Event Data 3                      | Not used                                                                                                                                                                                               |

#### Table 76: Button Sensor Typical Characteristics
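Button Sensor events and the FP (NMI) Interrupt event do not indicate errors. A script that scans the SEL for problems might therefore set them aside while still reporting them.

The following is a minimal Python sketch. It assumes raw 16-byte records and the sensor numbers given in Tables 75 and 76, which may differ on other platforms. `split_informational` is a hypothetical helper. An NMI event with no known button press or IPMI request still warrants the front panel check described in Section 10.2.1.

```python
from typing import Iterable, List, Tuple

# (sensor type, sensor number) pairs that this guide describes as not
# indicating an error: Button Sensor (14h/09h) and FP (NMI) Interrupt (13h/05h).
BUTTON = (0x14, 0x09)
FP_NMI = (0x13, 0x05)
BUTTON_OFFSETS = {0x0: "Power Button", 0x2: "Reset Button"}


def split_informational(records: Iterable[bytes]) -> Tuple[List[bytes], List[str]]:
    """Split raw 16-byte SEL records into (records to review, informational notes)."""
    to_review: List[bytes] = []
    notes: List[str] = []
    for rec in records:
        key = (rec[10], rec[11])  # bytes 11 and 12: sensor type, sensor number
        offset = rec[13] & 0x0F  # byte 14, bits [3:0]: event trigger offset
        if key == BUTTON:
            notes.append(BUTTON_OFFSETS.get(offset, f"Button offset {offset:X}h"))
        elif key == FP_NMI:
            notes.append("FP (NMI) Interrupt")
        else:
            to_review.append(rec)
    return to_review, notes
```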

# 11. Miscellaneous Events

This section covers sensors that are not easily grouped with the other sensor types.

## 11.1 IPMI Watchdog

EPSD server systems support an IPMI watchdog timer, which can check whether the OS is still responsive. The timer is disabled by default and must be enabled manually. Once enabled, it requires an IPMI-aware utility in the operating system to reset the timer before it expires. If the timer does expire, the BMC can take action if it is configured to do so (reset, power down, power cycle, or generate a critical interrupt).
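Enabling the timer at the IPMI level means building the request data for the standard Set Watchdog Timer command (IPMI v2.0, NetFn 06h, command 24h).

The following is a minimal Python sketch of that request. It assumes the SMS/OS timer use, and it treats the "critical interrupt" option as the IPMI pre-timeout NMI. `set_watchdog_request` is a hypothetical helper. In practice, the OS utility or driver normally sends these commands.

```python
# IPMI Set Watchdog Timer request: NetFn App (06h), command 24h.
# After it, Reset Watchdog Timer (06h/22h, no data) starts the countdown and
# must be re-sent before the timer expires.
TIMEOUT_ACTIONS = {"none": 0x0, "hard_reset": 0x1, "power_down": 0x2, "power_cycle": 0x3}
PRE_TIMEOUT_NMI = 0x2  # byte 2, bits [6:4]: NMI / diagnostic interrupt
TIMER_USE_SMS_OS = 0x04  # byte 1, bits [2:0]; bit 6 clear, so this request stops the timer


def set_watchdog_request(timeout_s: float, action: str, nmi_pre_timeout_s: int = 0) -> bytes:
    """Build the 6 request data bytes of a Set Watchdog Timer command."""
    count = round(timeout_s * 10)  # initial countdown is in 100 ms units
    if not 0 < count <= 0xFFFF:
        raise ValueError("timeout must be between 0.1 s and 6553.5 s")
    actions = TIMEOUT_ACTIONS[action]
    if nmi_pre_timeout_s:
        actions |= PRE_TIMEOUT_NMI << 4
    return bytes([
        TIMER_USE_SMS_OS,
        actions,
        nmi_pre_timeout_s,  # pre-timeout interval, in seconds
        0x00,  # timer use expiration flags to clear: none
        count & 0xFF,  # initial countdown, LSB
        count >> 8,  # initial countdown, MSB
    ])


print(set_watchdog_request(300, "hard_reset").hex(" "))  # 04 01 00 00 b8 0b
```

These bytes could be sent with, for example, `ipmitool raw 0x06 0x24 0x04 0x01 0x00 0x00 0xb8 0x0b`, followed by periodic `ipmitool raw 0x06 0x22` requests to keep the timer from expiring.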

| Byte | Field                             | Description                                                                                                                                                              |
|------|-----------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 23h = Watchdog 2                                                                                                                                                         |
| 12   | Sensor Number                     | 03h                                                                                                                                                                      |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 6Fh (Sensor Specific)</li> </ul>                  |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset as described in Table 78</li> </ul> |
| 15   | Event Data 2                      | Not used                                                                                                                                                                 |
| 16   | Event Data 3                      | Not used                                                                                                                                                                 |

#### Table 77: IPMI Watchdog Sensor Typical Characteristics


| Hex | Event Trigger Description  |
|-----|----------------------------|
| 00h | Timer expired, status only |
| 01h | Hard reset                 |
| 02h | Power down                 |
| 03h | Power cycle                |
| 08h | Timer interrupt            |

#### Table 78: IPMI Watchdog Sensor – Event Trigger Offset – Next Steps

Next steps for all offsets in Table 78: if this event is being logged, the BMC has been configured to check the watchdog timer.

1. Make sure your OS supports this (typically through a third-party IPMI-aware utility such as ipmitool or ipmiutil, together with the OpenIPMI driver).
2. If it does, your OS has most likely hung. Investigate the OS event logs to determine what may have caused this.
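The byte layout of the watchdog event can be decoded mechanically. The following Python sketch assumes that bytes 11 through 16 have already been sliced out of a 16-byte SEL record (`record[10:16]` with zero-based indexing). The names `SelEvent`, `parse_event_bytes`, and `describe_watchdog` are hypothetical helpers, not part of any Intel or IPMI tool.

```python
from dataclasses import dataclass

# Event Trigger Offsets for the IPMI watchdog sensor (sensor type 23h).
WATCHDOG_OFFSETS = {
    0x0: "Timer expired, status only",
    0x1: "Hard reset",
    0x2: "Power down",
    0x3: "Power cycle",
    0x8: "Timer interrupt",
}


@dataclass
class SelEvent:
    sensor_type: int    # byte 11
    sensor_number: int  # byte 12
    deassertion: bool   # byte 13, bit 7
    event_type: int     # byte 13, bits 6:0
    ed2_usage: int      # byte 14, bits 7:6
    ed3_usage: int      # byte 14, bits 5:4
    offset: int         # byte 14, bits 3:0
    ed2: int            # byte 15
    ed3: int            # byte 16


def parse_event_bytes(b: bytes) -> SelEvent:
    """Split SEL record bytes 11-16 (given as a 6-byte sequence) into fields."""
    if len(b) != 6:
        raise ValueError("expected bytes 11 through 16 (6 bytes)")
    return SelEvent(
        sensor_type=b[0],
        sensor_number=b[1],
        deassertion=bool(b[2] & 0x80),
        event_type=b[2] & 0x7F,
        ed2_usage=(b[3] >> 6) & 0x3,
        ed3_usage=(b[3] >> 4) & 0x3,
        offset=b[3] & 0x0F,
        ed2=b[4],
        ed3=b[5],
    )


def describe_watchdog(ev: SelEvent) -> str:
    """Map a Watchdog 2 event (sensor type 23h, event type 6Fh) to its action."""
    if ev.sensor_type != 0x23 or ev.event_type != 0x6F:
        raise ValueError("not an IPMI watchdog (Watchdog 2) event")
    return WATCHDOG_OFFSETS.get(ev.offset, f"reserved offset {ev.offset:X}h")
```

For example, the bytes `23 03 6F 01 FF FF` decode to an assertion of offset 1h on sensor 03h, which corresponds to a hard reset.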

## 11.2 SMI Timeout

SMI stands for system management interrupt. An SMI is generated so that the processor can service server management events (typically memory or PCI errors, or other forms of critical interrupts) and log them to the SEL. If this interrupt times out, the system is frozen. The BMC logs the event and then resets the system.

| Byte | Field                             | Description                                                                                                                                                            |
|------|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | F3h = SMI Timeout                                                                                                                                                      |
| 12   | Sensor Number                     | 06h                                                                                                                                                                    |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 03h ("digital" Discrete)</li> </ul>             |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset = 1h = State Asserted</li> </ul> |
| 15   | Event Data 2                      | Not used                                                                                                                                                               |
| 16   | Event Data 3                      | Not used                                                                                                                                                               |

#### Table 79: SMI Timeout Sensor Typical Characteristics


#### 11.2.1 SMI Timeout – Next Steps

This event normally occurs only after another, more critical event.

1. Check the SEL for any critical interrupts, memory errors, bus errors, PCI errors, or other serious errors.
2. If none of these are present, the system locked up before it was able to log the original issue. In this case, low-level debugging is normally required.
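The first step above can be partly automated. This Python sketch assumes that the SEL has already been exported and decoded into entries of (timestamp, sensor type, sensor number). Two further assumptions are labeled in the code and should be adjusted for your platform: the 300-second look-back window, and the choice of IPMI sensor types 0Ch (Memory) and 13h (Critical Interrupt) as the "suspect" types. The names are hypothetical.

```python
from typing import List, NamedTuple, Sequence


class SelEntry(NamedTuple):
    timestamp: int      # SEL timestamp in seconds
    sensor_type: int    # byte 11
    sensor_number: int  # byte 12


SMI_TIMEOUT = (0xF3, 0x06)  # sensor type F3h, sensor number 06h
# Assumption: these IPMI sensor types cover the memory, bus, and PCI errors
# that usually precede an SMI timeout. Extend the set as needed.
SUSPECT_TYPES = {0x0C, 0x13}


def suspects_before_smi_timeout(entries: Sequence[SelEntry],
                                window_s: int = 300) -> List[SelEntry]:
    """Return suspect entries logged within window_s seconds before any SMI timeout.

    An empty result suggests that the system hung before it could log the
    original error, so low-level debugging is likely needed.
    """
    found: List[SelEntry] = []
    for smi in entries:
        if (smi.sensor_type, smi.sensor_number) != SMI_TIMEOUT:
            continue
        for e in entries:
            in_window = smi.timestamp - window_s <= e.timestamp <= smi.timestamp
            if e.sensor_type in SUSPECT_TYPES and in_window and e not in found:
                found.append(e)
    return found
```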

## 11.3 System Event Log Cleared

The BMC logs an event whenever the SEL is cleared, so this event only ever appears as the first entry in the SEL. The clear is done either manually, using selview or another IPMI-aware utility, or at the factory as one of the last steps in the manufacturing process.

This is an informational event only.

| Byte | Field                             | Description                                                                                                                                                                    |
|------|-----------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 10h = Event Logging Disabled                                                                                                                                                   |
| 12   | Sensor Number                     | 07h                                                                                                                                                                            |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 6Fh (Sensor Specific)</li> </ul>                        |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset = 2h = Log area reset/cleared</li> </ul> |
| 15   | Event Data 2                      | Not used                                                                                                                                                                       |
| 16   | Event Data 3                      | Not used                                                                                                                                                                       |

#### Table 80: System Event Log Cleared Sensor Typical Characteristics
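The clear event should only ever be the first SEL entry, so a script can flag any log in which it appears elsewhere. The following minimal Python sketch matches the sensor type, sensor number, event type, and offset listed in the table above. The input format (a list of 6-byte slices holding bytes 11 through 16 of each record, in SEL order) and the function names are hypothetical.

```python
from typing import Sequence


def is_sel_clear_event(ev: bytes) -> bool:
    """True if bytes 11-16 describe the 'SEL cleared' event."""
    return (ev[0] == 0x10                # sensor type: Event Logging Disabled
            and ev[1] == 0x07            # sensor number 07h
            and (ev[2] & 0x7F) == 0x6F   # event type: sensor specific
            and (ev[3] & 0x0F) == 0x2)   # offset 2h: log area reset/cleared


def clear_only_first(events: Sequence[bytes]) -> bool:
    """Check that no entry after the first one is a SEL-cleared event."""
    return not any(is_sel_clear_event(ev) for ev in events[1:])
```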

## 11.4 System Event – PEF Action

The BMC can be configured to send alerts (SNMP or other) for any event that is logged to the SEL. This functionality is controlled by Platform Event Filters (PEF). PEF filters are disabled by default and must be configured and enabled manually using Intel<sup>®</sup> Deployment Assistant, the Intel<sup>®</sup> Syscfg utility, or an IPMI-aware utility.

A PEF Action event is logged whenever the BMC takes action because of its PEF configuration. The BMC event that triggered the PEF action is also in the SEL.

| Byte | Field                             | Description                                                                                                                                                        |
|------|-----------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 12h = System Event                                                                                                                                                 |
| 12   | Sensor Number                     | 08h                                                                                                                                                                |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 6Fh (Sensor Specific)</li> </ul>            |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset = 4h = PEF Action</li> </ul> |
| 15   | Event Data 2                      | Not used                                                                                                                                                           |
| 16   | Event Data 3                      | Not used                                                                                                                                                           |

#### Table 81: System Event – PEF Action Sensor Typical Characteristics

#### 11.4.1 System Event – PEF Action – Next Steps

This event is logged when the BMC takes an action because of its PEF configuration. The action can be sending an alert, possibly combined with resetting, power cycling, or powering down the system. Another event in the SEL led to this action. Investigate the SEL and the PEF settings to identify that event, and troubleshoot it accordingly.
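One way to start is to look at the entries logged just before the PEF Action event. The following Python sketch is a hypothetical heuristic, not BMC behavior. It assumes decoded SEL entries in log order and returns the nearest earlier entry that is not itself a PEF Action (sensor type 12h, sensor number 08h, offset 4h). Always confirm the result against the configured PEF filter table.

```python
from typing import NamedTuple, Optional, Sequence


class SelEntry(NamedTuple):
    record_id: int
    sensor_type: int    # byte 11
    sensor_number: int  # byte 12
    offset: int         # byte 14, bits 3:0


def is_pef_action(e: SelEntry) -> bool:
    return e.sensor_type == 0x12 and e.sensor_number == 0x08 and e.offset == 0x4


def likely_trigger(entries: Sequence[SelEntry], pef_index: int) -> Optional[SelEntry]:
    """Heuristic: walk backwards from the PEF Action entry at pef_index and
    return the nearest earlier entry that is not another PEF Action."""
    for e in reversed(entries[:pef_index]):
        if not is_pef_action(e):
            return e
    return None
```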

## 11.5 BMC Watchdog Sensor

The BMC supports an IPMI sensor to report that a BMC reset has occurred due to an action taken by the BMC Watchdog feature. A SEL event will be logged whenever either the BMC FW stack is reset or the BMC CPU itself is reset.

| Byte | Field                             | Description                                                                                                                                                            |
|------|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 28h = Management Subsystem Health                                                                                                                                      |
| 12   | Sensor Number                     | 0Ah                                                                                                                                                                    |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 03h ("digital" Discrete)</li> </ul>             |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset = 1h = State Asserted</li> </ul> |
| 15   | Event Data 2                      | Not used                                                                                                                                                               |
| 16   | Event Data 3                      | Not used                                                                                                                                                               |

#### Table 82: BMC Watchdog Sensor Typical Characteristics

#### 11.5.1 BMC Watchdog Sensor – Next Steps

A SEL event will be logged whenever either the BMC FW stack is reset or the BMC CPU itself is reset.

1. Check the SEL for any other events around the time of the failure.
2. Take note of all IPMI activity that was occurring around the time of the failure. Capture a System BMC Debug Log as soon as you can after experiencing this failure. This log can be captured from the Integrated BMC Web Console or by using the Intel<sup>®</sup> Syscfg utility (syscfg /sbmcdl private filename.zip). Send the log file to your system manufacturer or Intel representative for failure analysis.

## 11.6 BMC FW Health Sensor

The BMC tracks the health of each of its IPMI sensors and reports failures through a "BMC FW Health" sensor of the IPMI 2.0 sensor type Management Subsystem Health, which supports the Sensor Failure offset. Only assertions of the Sensor Failure offset are logged into the SEL. The BMC FW Health sensor asserts for a given sensor when 10 consecutive errors occur while reading that sensor. These are not standard sensor events (that is, threshold crossings or discrete assertions). Instead, they are BMC Hardware Access Layer (HAL) errors, such as I<sup>2</sup>C NAKs or internal errors while attempting to read a register. A successful read of the sensor resets its error counter to zero.
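The counting rule described above can be modeled in a few lines. This Python sketch illustrates the documented behavior and is not BMC firmware. The class name is hypothetical. The sketch also assumes that one SEL assertion is logged per run of failures and that a successful read re-arms the assertion; this guide does not state either point explicitly.

```python
from collections import defaultdict

ERROR_LIMIT = 10  # consecutive failed reads before Sensor Failure asserts


class FwHealthModel:
    """Illustrative model of the per-sensor consecutive-error counter."""

    def __init__(self) -> None:
        self.errors = defaultdict(int)  # sensor number -> consecutive read errors
        self.asserted = set()           # sensors currently flagged as failed
        self.sel = []                   # logged Event Data 2 values (failed sensor numbers)

    def record_read(self, sensor: int, ok: bool) -> None:
        if ok:
            # Documented: a successful read resets the counter to zero.
            # Assumption: it also re-arms the assertion for this sensor.
            self.errors[sensor] = 0
            self.asserted.discard(sensor)
            return
        self.errors[sensor] += 1
        if self.errors[sensor] >= ERROR_LIMIT and sensor not in self.asserted:
            self.asserted.add(sensor)
            self.sel.append(sensor)  # only assertions are logged
```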

| Byte | Field                             | Description                                                                                                                                                                                        |
|------|-----------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 28h = Management Subsystem Health                                                                                                                                                                  |
| 12   | Sensor Number                     | 10h                                                                                                                                                                                                |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 6Fh (Sensor Specific)</li> </ul>                                            |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 11b = Sensor-specific event extension code in Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset = 4h = Sensor failure</li> </ul> |
| 15   | Event Data 2                      | Sensor number of the failed sensor                                                                                                                                                                 |
| 16   | Event Data 3                      | Not used                                                                                                                                                                                           |

#### Table 83: BMC FW Health Sensor Typical Characteristics

#### 11.6.1 BMC FW Health Sensor – Next Steps

1. Check the SEL for any other events around the time of the failure.
2. Take note of all IPMI activity that was occurring around the time of the failure. Capture a System BMC Debug Log as soon as you can after experiencing this failure. This log can be captured from the Integrated BMC Web Console or by using the Intel<sup>®</sup> Syscfg utility (syscfg /sbmcdl private filename.zip). Send the log file to your system manufacturer or Intel representative for failure analysis.
3. If failures keep occurring on a specific sensor, replace the board that contains that sensor.

## 11.7 Firmware Update Status Sensor

The BMC FW supports a single Firmware Update Status sensor. This sensor generates SEL events related to updates of embedded firmware on the platform, including the BMC, BIOS, and ME FW.

This sensor is an event-only sensor that is not readable. Event generation is only enabled for assertion events.

| Byte | Field                             | Description                                                                                                                                                                                                                                             |
|------|-----------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 2Bh (Version Change)                                                                                                                                                                                                                                    |
| 12   | Sensor Number                     | 12h                                                                                                                                                                                                                                                     |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 70h = OEM defined</li> </ul>                                                                                                     |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 10b = OEM code in Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset</li> <li>0h = Update started</li> <li>1h = Update completed successfully</li> <li>2h = Update failure</li> </ul> |
| 15   | Event Data 2                      | [Bits 7:4] Target of update<br>0000b = BMC<br>0001b = BIOS<br>0010b = ME<br>All other values are reserved.<br>[Bits 3:1] Target instance (zero-based)<br>[Bits 0:0] Reserved                                                                            |
| 16   | Event Data 3                      | Not used                                                                                                                                                                                                                                                |

#### Table 84: Firmware Update Status Sensor Typical Characteristics
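Event Data 1 and Event Data 2 together identify what was updated and how the update ended. The following minimal Python sketch decodes them using the bit assignments in the table above. The function name is hypothetical.

```python
from typing import Tuple

STATUS = {0x0: "Update started", 0x1: "Update completed successfully", 0x2: "Update failure"}
TARGETS = {0b0000: "BMC", 0b0001: "BIOS", 0b0010: "ME"}


def decode_fw_update(ed1: int, ed2: int) -> Tuple[str, str, int]:
    """Return (status, target, zero-based instance) for a Firmware Update Status event."""
    status = STATUS.get(ed1 & 0x0F, "reserved")           # Event Data 1, bits 3:0
    target = TARGETS.get((ed2 >> 4) & 0x0F, "reserved")   # Event Data 2, bits 7:4
    instance = (ed2 >> 1) & 0x07                          # Event Data 2, bits 3:1
    return status, target, instance
```

For example, `decode_fw_update(0x82, 0x12)` returns `("Update failure", "BIOS", 1)`.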

## 11.8 Add-In Module Presence Sensor

Some server boards provide dedicated slots for add-in modules/boards (for example, SAS, IO, and PCIe-riser). For these boards the BMC provides an individual presence sensor to indicate whether the module/board is installed.

| Byte | Field                             | Description                                                                                                                                                                                                                               |
|------|-----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 15h = Module/Board                                                                                                                                                                                                                        |
| 12   | Sensor Number                     | 0Eh = IO Module Presence<br>0Fh = SAS Module Presence<br>13h = IO Module2 Presence                                                                                                                                                        |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 08h ("digital" discrete)</li> </ul>                                                                                |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset</li> <li>0h = Device Removed/Device Absent.</li> <li>1h = Device Inserted/Device Present</li> </ul> |
| 15   | Event Data 2                      | Not used                                                                                                                                                                                                                                  |
| 16   | Event Data 3                      | Not used                                                                                                                                                                                                                                  |

#### Table 85: Add-In Module Presence Sensor Typical Characteristics
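The following small Python sketch turns a presence event into a readable message, using the sensor numbers and offsets from the table above. The function name and output format are hypothetical.

```python
MODULES = {0x0E: "IO Module", 0x0F: "SAS Module", 0x13: "IO Module2"}
STATES = {0x0: "removed/absent", 0x1: "inserted/present"}


def describe_presence(sensor_number: int, event_dir_type: int, ed1: int) -> str:
    """Build a message from bytes 12, 13, and 14 of a Module/Board (15h) event."""
    name = MODULES.get(sensor_number, f"sensor {sensor_number:02X}h")
    state = STATES.get(ed1 & 0x0F, "reserved offset")
    kind = "deasserted" if event_dir_type & 0x80 else "asserted"  # byte 13, bit 7
    return f"{name}: {state} ({kind})"
```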

#### 11.8.1 Add-In Module Presence – Next Steps

If a device is unexpectedly removed or inserted, ensure that the module is seated properly.

## 11.9 Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor Management Sensors

The Intel<sup>®</sup> Xeon<sup>®</sup> Processor E5 4600/2600/2400/1600 Product Families BMC supports limited manageability of the Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor adapter as described in this section. The Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor adapter uses the Many Integrated Core (MIC) architecture and the sensors are referred to as MIC sensors.

For each manageable Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor adapter found in the system, the BMC automatically enables the associated thermal margin sensors (0xC4-0xC7) and status sensors (0xA2, 0xA3, 0xA6, 0xA7).

#### 11.9.1 Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor (MIC) Thermal Margin Sensors

The management controller FW of the Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor adapter provides an IPMI sensor that is read to get the temperature data. The BMC then instantiates its own version of this sensor, which is used for fan speed control.

The thermal margin sensor is the difference between the Core Temp sensor value and the TControl value reported by the Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor adapter.

This sensor will not log events into the SEL.

#### 11.9.2 Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor (MIC) Status Sensors

Every time DC power is turned on, the BMC checks for Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor adapters installed in the system. All compatible cards will be enabled for management. The status sensor is a direct copy of the status sensor reported by the Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor adapter.

| Byte | Field         | Description                                                                          |
|------|---------------|--------------------------------------------------------------------------------------|
| 11   | Sensor Type   | C0h = OEM defined                                                                    |
| 12   | Sensor Number | A2h = MIC 1 Status<br>A3h = MIC 2 Status<br>A6h = MIC 3 Status<br>A7h = MIC 4 Status |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 70h (OEM defined)</li> </ul>                                                                                                              |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset: refer to the latest Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor Adapter specification.</li> </ul> |
| 15   | Event Data 2                      | Not used                                                                                                                                                                                                                                                         |
| 16   | Event Data 3                      | Not used                                                                                                                                                                                                                                                         |

#### 11.9.2.1 Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor (MIC) Status Sensors – Next Steps

Refer to the latest Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor Adapter specification for the next steps.

# 12. Hot-Swap Controller Backplane Events

All new backplanes for EPSD Platforms Based on Intel<sup>®</sup> Xeon<sup>®</sup> Processor E5 4600/2600/2400/1600 Product Families follow a hybrid architecture, in which the IPMI functionality previously provided by the HSC is integrated into the BMC FW.

## 12.1 HSC Backplane Temperature Sensor

There is a thermal sensor on the Hot-Swap Backplane to measure the ambient temperature.

| Byte | Field                             | Description                                                                                                                                                                               |
|------|-----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 01h = Temperature                                                                                                                                                                         |
| 12   | Sensor Number                     | 29h = HSBP 1 Temp<br>2Ah = HSBP 2 Temp<br>2Bh = HSBP 3 Temp                                                                                                                               |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 01h (Threshold)</li> </ul>                                         |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 01b = Trigger reading in Event Data 2</li> <li>[5:4] - 01b = Trigger threshold in Event Data 3</li> <li>[3:0] - Event Trigger Offset as described in Table 88</li> </ul> |
| 15   | Event Data 2                      | Reading that triggered event                                                                                                                                                              |
| 16   | Event Data 3                      | Threshold value that triggered event                                                                                                                                                      |

#### Table 87: HSC Backplane Temperature Sensor Typical Characteristics


| Hex | Event Trigger Description     | Assertion Severity | Deassertion Severity | Description                                                          |
|-----|-------------------------------|--------------------|----------------------|----------------------------------------------------------------------|
| 00h | Lower non-critical going low  | Degraded           | OK                   | The temperature has dropped below its lower non-critical threshold. |
| 02h | Lower critical going low      | non-fatal          | Degraded             | The temperature has dropped below its lower critical threshold.     |
| 07h | Upper non-critical going high | Degraded           | OK                   | The temperature has gone over its upper non-critical threshold.     |
| 09h | Upper critical going high     | non-fatal          | Degraded             | The temperature has gone over its upper critical threshold.         |

#### Table 88: HSC Backplane Temperature Sensor – Event Trigger Offset – Next Steps

Next steps for all offsets in Table 88:

1. Check for clear and unobstructed airflow into and out of the chassis.
2. Ensure the SDR is programmed and the correct chassis has been selected.
3. Ensure there are no fan failures.
4. Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).
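Event Data 1 flags that Event Data 2 and 3 carry the trigger reading and the threshold. As a result, the six event bytes are enough to classify an HSBP temperature event. The following Python sketch assumes that bytes 11 through 16 are passed in as one 6-byte sequence. The function name is hypothetical. Event Data 2 and 3 are raw sensor values, and converting them to degrees Celsius requires the sensor's SDR conversion factors, which this sketch does not attempt.

```python
HSBP = {0x29: "HSBP 1 Temp", 0x2A: "HSBP 2 Temp", 0x2B: "HSBP 3 Temp"}
# offset -> (description, assertion severity, deassertion severity)
OFFSETS = {
    0x00: ("Lower non-critical going low", "Degraded", "OK"),
    0x02: ("Lower critical going low", "non-fatal", "Degraded"),
    0x07: ("Upper non-critical going high", "Degraded", "OK"),
    0x09: ("Upper critical going high", "non-fatal", "Degraded"),
}


def decode_hsbp_temp(b: bytes) -> dict:
    """Classify an HSBP temperature threshold event from SEL bytes 11-16."""
    sensor_type, number, dir_type, ed1, ed2, ed3 = b  # raises ValueError unless 6 bytes
    if sensor_type != 0x01 or (dir_type & 0x7F) != 0x01:
        raise ValueError("not a threshold temperature event")
    offset = ed1 & 0x0F
    if offset not in OFFSETS:
        raise ValueError(f"unexpected offset {offset:02X}h")
    desc, sev_assert, sev_deassert = OFFSETS[offset]
    deassert = bool(dir_type & 0x80)
    return {
        "sensor": HSBP.get(number, hex(number)),
        "event": desc,
        "severity": sev_deassert if deassert else sev_assert,
        # Raw values, present only when ED1 bits 7:6 / 5:4 are 01b.
        "raw_reading": ed2 if (ed1 >> 6) & 0x3 == 0b01 else None,
        "raw_threshold": ed3 if (ed1 >> 4) & 0x3 == 0b01 else None,
    }
```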

## 12.2 Hard Disk Drive Monitoring Sensor

The new backplane design for EPSD Platforms Based on Intel<sup>®</sup> Xeon<sup>®</sup> Processor E5 4600/2600/2400/1600 Product Families moves IPMI ownership of the HDD sensors to the BMC. Note that systems may have multiple storage backplanes. Hard Disk Drive status monitoring is supported through disk status sensors owned by the BMC.

#### Table 89: Hard Disk Drive Monitoring Sensor Typical Characteristics

| Byte | Field                             | Description                                                                                                                                                               |
|------|-----------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 0Dh = Drive Slot (Bay)                                                                                                                                                    |
| 12   | Sensor Number                     | 60h-68h = Hard Disk Drive 15-23 Status<br>F0h-FEh = Hard Disk Drive 0-14 Status                                                                                           |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 6Fh (Sensor Specific)</li> </ul>                   |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset as described in Table 90</li> </ul> |
| 15   | Event Data 2                      | Not used                                                                                                                                                                  |
| 16   | Event Data 3                      | Not used                                                                                                                                                                  |
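
The byte layout in Tables 89 and 90 lends itself to a small decoder. The following is a minimal Python sketch, assuming the caller already has SEL bytes 11 through 14 as integers; the function and dictionary names are hypothetical and not part of any Intel tool. It maps the sensor number to a drive index and names only the event trigger offsets that Table 90 lists.

```python
# Hypothetical helper names; offsets are only those listed in Table 90.
HDD_OFFSETS = {
    0x00: "Drive Presence",
    0x01: "Drive Fault",
    0x07: "Rebuild/Remap in progress",
}


def hdd_index(sensor_number: int) -> int:
    """Map the sensor number (byte 12) to a drive index, per Table 89."""
    if 0xF0 <= sensor_number <= 0xFE:
        return sensor_number - 0xF0  # Hard Disk Drive 0-14
    if 0x60 <= sensor_number <= 0x68:
        return sensor_number - 0x60 + 15  # Hard Disk Drive 15-23
    raise ValueError(f"not an HDD status sensor: {sensor_number:02X}h")


def decode_hdd_event(sensor_type: int, sensor_number: int,
                     dir_type: int, event_data1: int) -> dict:
    """Decode SEL bytes 11-14 of a Hard Disk Drive Monitoring event."""
    if sensor_type != 0x0D:
        raise ValueError("sensor type is not 0Dh (Drive Slot)")
    if (dir_type & 0x7F) != 0x6F:
        raise ValueError("event type is not 6Fh (sensor specific)")
    offset = event_data1 & 0x0F  # Event Data 1 bits [3:0]
    return {
        "drive": hdd_index(sensor_number),
        "assertion": (dir_type & 0x80) == 0,  # bit 7: 0b = assertion
        "offset": offset,
        "event": HDD_OFFSETS.get(offset, f"offset {offset:02X}h (not in Table 90)"),
    }
```

For example, bytes 0Dh, F3h, 6Fh, 01h decode to an asserted Drive Fault on drive 3.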

#### Table 90: Hard Disk Drive Monitoring Sensor - Event Trigger Offset - Next Steps

| Event<br>Trigger | Description                  | Next Steps                                                                                                                                                                                                                                                             |
|------------------|------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 00h              | Drive Presence               | If the state changes unexpectedly during normal operation, ensure that the drive is seated properly and the drive carrier is properly latched. If that does not resolve the issue, replace the drive.                                                                   |
| 01h              | Drive Fault                  | If the state changes unexpectedly during normal operation, ensure that the drive is seated properly and the drive carrier is properly latched. If that does not resolve the issue, replace the drive.                                                                   |
| 07h              | Rebuild/Remap<br>in progress | If you have replaced a hard drive, this is expected.<br>If you have a hot spare and one of the drives failed, this is expected. Check the logs to determine which drive failed.<br>If this is seen unexpectedly, it could indicate a drive that is close to failing. |

## 12.3 Hot-Swap Controller Health Sensor

The BMC supports an IPMI sensor to indicate the health of the Hot-Swap Controller (HSC). This sensor indicates that the controller is offline when the BMC either cannot communicate with it, or the controller is stuck in a degraded state from which the BMC cannot restore it to full operation through a firmware update.

#### Table 91: HSC Health Sensor Typical Characteristics

| Byte | Field                             | Description                                                                                                                                                                   |
|------|-----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 11   | Sensor Type                       | 16h = Microcontroller                                                                                                                                                         |
| 12   | Sensor Number                     | 69h = Hot-Swap Controller 1 Status<br>6Ah = Hot-Swap Controller 2 Status<br>6Bh = Hot-Swap Controller 3 Status                                                                |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 0Ah (Discrete)</li> </ul>                              |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset = 4h = Transition to offline</li> </ul> |
| 15   | Event Data 2 | Not used    |
| 16   | Event Data 3 | Not used    |
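
A minimal Python sketch of matching SEL bytes 11 through 14 against Table 91, assuming the bytes are already available as integers. The helper name is hypothetical. It reports which HSC the event refers to and whether the "transition to offline" offset was asserted or deasserted; any other sensor or offset is ignored.

```python
# Sensor numbers from Table 91; helper names are hypothetical.
HSC_SENSORS = {0x69: 1, 0x6A: 2, 0x6B: 3}


def decode_hsc_health(sensor_type: int, sensor_number: int,
                      dir_type: int, event_data1: int):
    """Return (HSC number, is_assertion) for an offline event, else None."""
    if sensor_type != 0x16 or (dir_type & 0x7F) != 0x0A:
        return None  # not a Microcontroller sensor with event type 0Ah
    hsc = HSC_SENSORS.get(sensor_number)
    if hsc is None or (event_data1 & 0x0F) != 0x4:
        return None  # unknown sensor, or offset other than 4h (offline)
    return hsc, (dir_type & 0x80) == 0  # bit 7: 0b = assertion
```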

## 12.3.1 HSC Health Sensor – Next Steps

Ensure that all connections to the HSC are well seated.

Cross-test with another HSC. If the issue follows the HSC, replace the HSC; otherwise, start cross-testing all interconnections.

# 13. Manageability Engine (ME) Events

The Manageability Engine controls the PECI interface and also contains the Node Manager functionality.

## 13.1 ME Firmware Health Event

This sensor is used in Platform Event Messages sent to the BMC that carry health information, including but not limited to firmware upgrade and application errors.

| Byte   | Field                             | Description                                                                                                                                                          |
|--------|-----------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 8<br>9 | Generator ID                      | 002Ch or 602Ch – ME Firmware                                                                                                                                         |
| 11     | Sensor Type                       | DCh = OEM                                                                                                                                                            |
| 12     | Sensor Number                     | 17h                                                                                                                                                                  |
| 13     | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 75h (OEM)</li> </ul>                          |
| 14     | Event Data 1                      | <ul> <li>[7:6] - 10b = OEM code in Event Data 2</li> <li>[5:4] - 10b = OEM code in Event Data 3</li> <li>[3:0] - Health event type - 0h (Firmware Status)</li> </ul> |
| 15     | Event Data 2                      | See Table 93                                                                                                                                                         |
| 16     | Event Data 3                      | See Table 93                                                                                                                                                         |

#### Table 92: ME Firmware Health Event Sensor Typical Characteristics

## 13.1.1 ME Firmware Health Event – Next Steps

In the following table, Event Data 3 (ED3) is listed only for specific errors.

If the issue persists, provide the content of Event Data 3 to the Intel support team for interpretation. Event Data 3 codes are generally not documented because their meaning varies, provides only partial clues, and usually must be interpreted case by case.


| ED2     | ED3 | Description                                                                                                                                                                                             | Next Steps                                                                                                                                                                                                                |
|---------|-----|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 00h     |     | Recovery GPIO forced. Recovery Image loaded due to recovery MGPIO pin asserted. Pin number is configurable in factory presets. Default recovery pin is MGPIO1.                                           | Deassert MGPIO1 and reset the Intel<sup>®</sup> ME.                                                                                                                                                                       |
| 01h     |     | Image execution failed. Recovery Image or backup operational image loaded because operational image is corrupted. This may be caused either by flash device corruption or by a failed upgrade procedure. | Either the flash device must be replaced (if the error is persistent) or the upgrade procedure must be started again.                                                                                                     |
| 02h     |     | Flash erase error. Error during flash erasure procedure.                                                                                                                                                | The flash device must be replaced.                                                                                                                                                                                        |
| 03h     |     | Flash state information.                                                                                                                                                                                | Check the extended info byte in ED3 to determine whether wear-out protection caused this event. If so, wait until wear-out protection expires; otherwise, the flash device probably must be replaced (if the error is persistent). |
|         | 00h | Recovery bootloader image or factory presets image corrupted.                                                                                                                                           |                                                                                                                                                                                                                           |
|         | 01h | Flash erase limit has been reached.                                                                                                                                                                     |                                                                                                                                                                                                                           |
|         | 02h | Flash write limit has been reached; writing to flash has been disabled.                                                                                                                                 |                                                                                                                                                                                                                           |
|         | 03h | Writing to the flash has been enabled.                                                                                                                                                                  |                                                                                                                                                                                                                           |
| 04h     |     | Internal error. Error during firmware execution – FW Watchdog Timeout.                                                                                                                                  | The operational image needs to be updated to a different version, or the hardware board needs repair (if the error is persistent).                                                                                         |
| 05h     |     | BMC did not respond to cold reset request and Intel<sup>®</sup> ME rebooted the platform.                                                                                                               | Verify the Intel<sup>®</sup> Node Manager configuration.                                                                                                                                                                  |
| 06h     |     | Direct Flash update requested by the BIOS. Intel<sup>®</sup> ME firmware will switch to recovery mode to perform a full update from the BIOS.                                                            | This is a transient state. Intel<sup>®</sup> ME firmware will return to operational mode after the BIOS successfully performs the image update.                                                                            |
| 07h     |     | Manufacturing error. Wrong manufacturing configuration detected by Intel<sup>®</sup> ME firmware.                                                                                                       | The flash device must be replaced (if the error is persistent).                                                                                                                                                           |
|         | 04h | Intel<sup>®</sup> ME FW configuration is inconsistent or out of range.                                                                                                                                  |                                                                                                                                                                                                                           |
| 08h     |     | Persistent storage integrity error. Flash file system error detected.                                                                                                                                   | If the error is persistent, restore factory presets using the "Force ME Recovery" IPMI command or by performing an AC power cycle with the Recovery jumper asserted.                                                       |
| 09h     |     | Firmware Exception.                                                                                                                                                                                     | Restore factory presets using the "Force ME Recovery" IPMI command or by performing an AC power cycle with the Recovery jumper asserted. If this does not clear the issue, reflash the SPI flash.                          |
| 10h-FFh |     | Reserved                                                                                                                                                                                                |                                                                                                                                                                                                                           |

#### Table 93: ME Firmware Health Event Sensor – Next Steps
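
The lookup in Tables 92 and 93 can be sketched in Python as follows. This is a minimal illustration, assuming a raw 16-byte SEL record where byte N of this guide is `record[N - 1]`, and assuming the Generator ID in bytes 8 and 9 is stored least significant byte first (byte 8 = 2Ch), as in the IPMI SEL record layout. All names are hypothetical. ED3 values that Table 93 does not list are returned as raw hex so they can be passed to Intel support.

```python
# Descriptions condensed from Table 93; names are hypothetical.
ME_GENERATOR_IDS = {0x002C, 0x602C}

ME_FW_HEALTH_ED2 = {
    0x00: "Recovery GPIO forced",
    0x01: "Image execution failed",
    0x02: "Flash erase error",
    0x03: "Flash state information",
    0x04: "Internal error (FW watchdog timeout)",
    0x05: "BMC did not respond to cold reset request",
    0x06: "Direct flash update requested by the BIOS",
    0x07: "Manufacturing error",
    0x08: "Persistent storage integrity error",
    0x09: "Firmware exception",
}

# Only the (ED2, ED3) pairs that Table 93 lists explicitly.
ME_FW_HEALTH_ED3 = {
    (0x03, 0x00): "Recovery bootloader image or factory presets image corrupted",
    (0x03, 0x01): "Flash erase limit has been reached",
    (0x03, 0x02): "Flash write limit has been reached; writing to flash disabled",
    (0x03, 0x03): "Writing to the flash has been enabled",
    (0x07, 0x04): "Intel ME FW configuration is inconsistent or out of range",
}


def decode_me_fw_health(record: bytes) -> str:
    """Decode a 16-byte SEL record; byte N of the guide is record[N - 1]."""
    if len(record) != 16:
        raise ValueError("expected a 16-byte SEL record")
    generator_id = record[7] | (record[8] << 8)  # bytes 8-9, LS byte first
    if generator_id not in ME_GENERATOR_IDS:
        raise ValueError(f"generator ID {generator_id:04X}h is not ME firmware")
    if record[10] != 0xDC or record[11] != 0x17 or (record[12] & 0x7F) != 0x75:
        raise ValueError("not an ME Firmware Health event (DCh / 17h / 75h)")
    if (record[13] & 0x0F) != 0x0:
        raise ValueError("health event type is not 0h (Firmware Status)")
    ed2, ed3 = record[14], record[15]
    text = ME_FW_HEALTH_ED2.get(ed2, f"ED2 {ed2:02X}h (reserved or not listed)")
    detail = ME_FW_HEALTH_ED3.get((ed2, ed3))
    if detail is not None:
        return f"{text}: {detail}"
    # Undocumented ED3: keep the raw value for Intel support.
    return f"{text} (ED3 = {ed3:02X}h)"
```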

## 13.2 Node Manager Exception Event

A Node Manager Exception Event is sent each time the power limit of a maintained policy is exceeded for longer than the Correction Time Limit.

| Byte   | Field                             | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|--------|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 8<br>9 | Generator ID                      | 002Ch or 602Ch – ME Firmware                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| 11     | Sensor Type                       | DCh = OEM                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| 12     | Sensor Number                     | 18h                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
| 13     | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 72h (OEM)</li> </ul>                                                                                                                                                                                                                                                                                                                      |
| 14     | Event Data 1                      | <ul> <li>[7:6] - 10b = OEM code in Event Data 2</li> <li>[5:4] - 10b = OEM code in Event Data 3</li> <li>[3] - Node Manager Policy event <ul> <li>0 - Reserved</li> <li>1 - Policy Correction Time Exceeded - Policy did not meet the contract for the defined policy. The policy will continue to limit the power or shut down the platform based on the defined policy action.</li> </ul> </li> <li>[2] - Reserved <ul> <li>[1:0] - 00b</li> </ul> </li> </ul> |
| 15     | Event Data 2                      | [4:7] – Reserved<br>[0:3] – Domain Id (Currently, supports only one domain, Domain 0)                                                                                                                                                                                                                                                                                                                                                                            |
| 16     | Event Data 3                      | Policy Id                                                                                                                                                                                                                                                                                                                                                                                                                                                        |

#### Table 94: Node Manager Exception Sensor Typical Characteristics
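
A minimal Python sketch of splitting the fields in Table 94, assuming SEL bytes 12 through 16 are supplied as integers. The function name is hypothetical; it only extracts the bit fields the table defines and does not attempt to look up the policy itself.

```python
def decode_nm_exception(sensor_number: int, dir_type: int,
                        ed1: int, ed2: int, ed3: int) -> dict:
    """Decode SEL bytes 12-16 of a Node Manager Exception event (Table 94)."""
    if sensor_number != 0x18 or (dir_type & 0x7F) != 0x72:
        raise ValueError("not a Node Manager Exception event (18h / 72h)")
    return {
        "assertion": (dir_type & 0x80) == 0,  # bit 7: 0b = assertion
        # Event Data 1 bit [3]: 1 = Policy Correction Time Exceeded
        "correction_time_exceeded": bool(ed1 & 0x08),
        "domain_id": ed2 & 0x0F,  # Event Data 2 bits [3:0]
        "policy_id": ed3,         # Event Data 3
    }
```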

## 13.2.1 Node Manager Exception Event – Next Steps

This is an informational event. Next steps depend on the policy that was set. See the Node Manager Specification for more details.

## 13.3 Node Manager Health Event

A Node Manager Health Event message provides a runtime error indication about Intel<sup>®</sup> Intelligent Power Node Manager's health. Types of service that can send an error are defined as follows:

- Misconfigured policy
- Error reading power data
- Error reading inlet temperature

| Byte | Field                             | Description                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|------|-----------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 8    | Generator ID                      | 002Ch or 602Ch – ME Firmware                                                                                                                                                                                                                                                                                                                                                                                                                 |
| 9    |                                   |                                                                                                                                                                                                                                                                                                                                                                                                                                              |
| 11   | Sensor Type                       | DCh = OEM                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| 12   | Sensor Number                     | 19h                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 73h (OEM)</li> </ul>                                                                                                                                                                                                                                                                                                  |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 10b = OEM code in Event Data 2</li> <li>[5:4] - 10b = OEM code in Event Data 3</li> <li>[3:0] - Health Event Type = 02h (Sensor Node Manager)</li> </ul>                                                                                                                                                                                                                                                                    |
| 15   | Event Data 2                      | <ul> <li>[7:4] – Error type</li> <li>0-9 – Reserved</li> <li>10 – Policy Misconfiguration</li> <li>11 – Power Sensor Reading Failure</li> <li>12 – Inlet Temperature Reading Failure</li> <li>13 – Host Communication error</li> <li>14 – Real-time clock synchronization failure</li> <li>15 – Platform shutdown initiated by NM policy due to execution of action defined by Policy Exception Action</li> <li>[3:0] – Domain Id</li> </ul> |
| 16   | Event Data 3                      | If Error type = 10 or 15 <policy id=""></policy>                                                                                                                                                                                                                                                                                                                                                                                             |

#### Table 95: Node Manager Health Event Sensor Typical Characteristics

| Byte | Field | Description                                                                                                                                 |
|------|-------|---------------------------------------------------------------------------------------------------------------------------------------------|
|      |       | If Error type = 11 <power address="" sensor=""><br/>If Error type = 12 <inlet address="" sensor=""><br/>Otherwise set to 0.</inlet></power> |
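
The dependence of Event Data 3 on the error type in Event Data 2 can be captured in a short Python sketch. It assumes Event Data 2 and Event Data 3 are supplied as integers; the dictionary keys and function name are hypothetical.

```python
# Error types from Table 95, Event Data 2 bits [7:4]; names are hypothetical.
NM_HEALTH_ERRORS = {
    10: "Policy Misconfiguration",
    11: "Power Sensor Reading Failure",
    12: "Inlet Temperature Reading Failure",
    13: "Host Communication error",
    14: "Real-time clock synchronization failure",
    15: "Platform shutdown initiated by NM policy",
}


def decode_nm_health(event_data2: int, event_data3: int) -> dict:
    """Split Event Data 2 and interpret Event Data 3 by error type."""
    error_type = (event_data2 >> 4) & 0x0F  # bits [7:4]
    result = {
        "error": NM_HEALTH_ERRORS.get(error_type, "Reserved"),
        "domain_id": event_data2 & 0x0F,  # bits [3:0]
    }
    if error_type in (10, 15):
        result["policy_id"] = event_data3
    elif error_type == 11:
        result["power_sensor_address"] = event_data3
    elif error_type == 12:
        result["inlet_sensor_address"] = event_data3
    # For other error types Event Data 3 is set to 0 and is not reported.
    return result
```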

## 13.3.1 Node Manager Health Event – Next Steps

A misconfigured policy can occur if, after a hardware reconfiguration, the platform's maximum or minimum power consumption exceeds the values in the policy.

The first occurrence of an event that is not acknowledged is retransmitted no more often than once every 300 milliseconds.

A real-time clock synchronization failure alert is sent when NM is enabled and capable of limiting power but the firmware cannot obtain a valid calendar time from the host within 10 minutes; as a result, NM cannot handle suspend periods.

Next steps depend on the policy that was set. See the Node Manager Specification for more details.

## 13.4 Node Manager Operational Capabilities Change

This message provides a runtime error indication about Intel<sup>®</sup> Intelligent Power Node Manager's operational capabilities. This applies to all domains.

Assertion and deassertion of these events are supported.

| Byte   | Field                             | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
|--------|-----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 8<br>9 | Generator ID                      | 002Ch or 602Ch – ME Firmware                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| 11     | Sensor Type                       | DCh = OEM                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| 12     | Sensor Number                     | 1Ah                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| 13     | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 74h (OEM)</li> </ul>                                                                                                                                                                                                                                                                                                                                         |
| 14     | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Current state of Operational Capabilities. Bit pattern: <ul> <li>[0] - Policy interface capability <ul> <li>0 - Not Available</li> <li>1 - Available</li> </ul> </li> <li>[1] - Monitoring capability <ul> <li>0 - Not Available</li> <li>1 - Available</li> </ul> </li> <li>[2] - Power limiting capability <ul> <li>0 - Not Available</li> <li>1 - Available</li> </ul> </li> </ul> </li> </ul> |
| 15     | Event Data 2                      | Not used                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
| 16     | Event Data 3                      | Not used                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |

Table 96: Node Manager Operational Capabilities Change Sensor Typical Characteristics

## 13.4.1 Node Manager Operational Capabilities Change – Next Steps

Policy interface capability available indicates that Intel<sup>®</sup> Intelligent Power Node Manager can respond to requests on its external interface to query and set Intel<sup>®</sup> Intelligent Power Node Manager policies. This capability is generally available as soon as the microcontroller is initialized.

Monitoring capability available indicates that Intel<sup>®</sup> Intelligent Power Node Manager can monitor power and temperature. This capability is generally available once the firmware is operational.

Power limiting capability available indicates that Intel<sup>®</sup> Intelligent Power Node Manager can perform power limiting. It also indicates that an ACPI-compliant OS has loaded, unless the OEM has indicated support for non-ACPI-compliant operating systems.
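As an illustration of how the bit pattern in Table 96 can be read, the Python sketch below decodes bits [2:0] of Event Data 1 (byte 14) into the three capabilities described above. It assumes the raw Event Data 1 byte has already been extracted from the SEL record; `decode_nm_capabilities` and `CAPABILITY_BITS` are hypothetical names, not part of any Intel tool.

```python
# Hypothetical decoder for Event Data 1 (byte 14) of the Node Manager
# Operational Capabilities Change event (sensor type DCh, event type 74h).
CAPABILITY_BITS = {
    0: "Policy interface",
    1: "Monitoring",
    2: "Power limiting",
}


def decode_nm_capabilities(event_data1: int) -> dict:
    """Map each capability to True (available) or False (not available)."""
    # Bits [7:6] and [5:4] are 00b (unspecified Event Data 2 and 3), so only
    # the defined bits [2:0] of the [3:0] capability field are examined.
    return {
        name: bool((event_data1 >> bit) & 1)
        for bit, name in CAPABILITY_BITS.items()
    }


# Example: 03h reports policy interface and monitoring, but not power limiting.
print(decode_nm_capabilities(0x03))
```

Because the event reports the current state of all capabilities, comparing two successive decoded values shows which capability changed.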

If the capability sensor event is not acknowledged, its current value is retransmitted no more often than once every 300 milliseconds.

Next steps depend on the policy that was set. See the Node Manager Specification for more details.

## 13.5 Node Manager Alert Threshold Exceeded

A Policy Correction Time Exceeded event is sent each time the maintained policy power limit is exceeded for longer than the Correction Time Limit.

| Byte   | Field                             | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|--------|-----------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 8<br>9 | Generator ID                      | 002Ch – ME Firmware                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| 11     | Sensor Type                       | DCh = OEM                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
| 12     | Sensor Number                     | 1Bh                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| 13     | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 72h (OEM)</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                          |
| 14     | Event Data 1                      | <ul> <li>[7:6] - 10b = OEM code in Event Data 2</li> <li>[5:4] - 10b = OEM code in Event Data 3</li> <li>[3] - Node Manager Policy event <ul> <li>0 - Threshold exceeded</li> <li>1 - Policy Correction Time Exceeded. The policy did not meet the contract for the defined policy; it continues to limit power or shuts down the platform, according to the defined policy action.</li> </ul> </li> <li>[2] - Reserved</li> <li>[1:0] - Threshold Number. Valid only if bit [3] of this byte is 0. <ul> <li>0 to 2 - Threshold index</li> </ul> </li> </ul> |
| 15     | Event Data 2                      | [7:4] – Reserved<br>[3:0] – Domain Id (Currently, supports only one domain, Domain 0)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| 16     | Event Data 3                      | Policy ID                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |

Table 97: Node Manager Alert Threshold Exceeded Sensor Typical Characteristics
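The fields in Table 97 can be unpacked with a few masks. This Python sketch assumes the caller already has byte 13 and Event Data 1 through 3 as integers; `NmAlert` and `decode_nm_alert` are hypothetical names. It reports the threshold index only when bit [3] of Event Data 1 is 0, as the table requires.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NmAlert:
    asserted: bool                  # byte 13, bit [7] == 0b
    correction_time_exceeded: bool  # Event Data 1, bit [3] == 1
    threshold_index: Optional[int]  # Event Data 1, bits [1:0]; None when bit [3] == 1
    domain_id: int                  # Event Data 2, bits [3:0]
    policy_id: int                  # Event Data 3


def decode_nm_alert(dir_type: int, ed1: int, ed2: int, ed3: int) -> NmAlert:
    """Decode a Node Manager Alert Threshold Exceeded event (event type 72h)."""
    if (dir_type & 0x7F) != 0x72:
        raise ValueError("not a Node Manager Alert Threshold Exceeded event")
    correction = bool(ed1 & 0x08)
    return NmAlert(
        asserted=not (dir_type & 0x80),
        correction_time_exceeded=correction,
        # The threshold number is valid only for "threshold exceeded" events.
        threshold_index=None if correction else ed1 & 0x03,
        domain_id=ed2 & 0x0F,
        policy_id=ed3,
    )


# Example: assertion (72h), threshold 1 exceeded on Domain 0 by policy 5.
print(decode_nm_alert(0x72, 0xA1, 0x00, 0x05))
```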

## 13.5.1 Node Manager Alert Threshold Exceeded – Next Steps

If the first occurrence of an event is not acknowledged, it is retransmitted no more often than once every 300 milliseconds.

The first occurrence of a Threshold Exceeded event assertion or deassertion is likewise retransmitted no more often than once every 300 milliseconds.

Next steps depend on the policy that was set. See the Node Manager Specification for more details.

# 14. Microsoft Windows\* Records

Starting with Microsoft Windows Server 2003\* R2, Windows includes an Intelligent Platform Management Interface (IPMI) driver that can log some OS events to the SEL. The driver can write multiple records to the SEL for the following events:

- Boot-up
- Shutdown
- Bug Check / Blue Screen

## 14.1 Boot up Event Records

When the system boots into the Microsoft Windows\* OS, two events can be logged: a boot-up record followed by an OEM event. Both are informational records only.

| Byte | Field                             | Description                                                                                                                                                               |
|------|-----------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 8    | Generator ID                      | 0041h – System Software with an ID = 20h                                                                                                                                  |
| 9    |                                   |                                                                                                                                                                           |
| 11   | Sensor Type                       | 1Fh = OS Boot                                                                                                                                                             |
| 12   | Sensor Number                     | 00h                                                                                                                                                                       |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 6Fh (Sensor Specific)</li> </ul>                   |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset = 1h = C: boot completed</li> </ul> |
| 15   | Event Data 2                      | Not used                                                                                                                                                                  |
| 16   | Event Data 3                      | Not used                                                                                                                                                                  |

#### Table 98: Boot up Event Record Typical Characteristics
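The standard event records in Tables 98, 100, and 103 share one layout, so a single parser can classify them. The Python sketch below assumes a raw 16-byte IPMI system event record (record type 02h); per the IPMI specification, bytes 1-2 hold the Record ID, byte 3 the record type, and bytes 4-7 the timestamp, all LS byte first. The function name, the lookup table, and the sample record values are hypothetical.

```python
import struct

# (sensor type, event type, offset in Event Data 1 [3:0]) -> meaning,
# taken from Tables 98, 100 and 103 of this guide.
WINDOWS_OS_EVENTS = {
    (0x1F, 0x6F, 0x1): "OS Boot: C: boot completed",
    (0x20, 0x6F, 0x3): "OS Stop/Shutdown: OS graceful shutdown",
    (0x20, 0x6F, 0x1): "OS Stop/Shutdown: runtime critical stop (bug check)",
}


def parse_system_event(record: bytes) -> dict:
    """Parse a 16-byte IPMI system event record (record type 02h)."""
    if len(record) != 16:
        raise ValueError("SEL records are 16 bytes long")
    # Bytes 1-9: Record ID, record type, timestamp, generator ID (LS byte first)
    record_id, record_type, timestamp, generator_id = struct.unpack_from("<HBIH", record)
    if record_type != 0x02:
        raise ValueError(f"not a system event record (type {record_type:02X}h)")
    # Bytes 10-16: EvM Rev, sensor type/number, direction/type, Event Data 1-3
    evm_rev, sensor_type, sensor_number, dir_type, ed1, ed2, ed3 = record[9:16]
    event_type, offset = dir_type & 0x7F, ed1 & 0x0F
    return {
        "record_id": record_id,
        "timestamp": timestamp,
        "generator_id": generator_id,
        "evm_rev": evm_rev,
        "sensor_type": sensor_type,
        "sensor_number": sensor_number,
        "assertion": not (dir_type & 0x80),
        "description": WINDOWS_OS_EVENTS.get((sensor_type, event_type, offset), "unknown"),
    }


# Hypothetical boot-up record: generator 0041h, sensor type 1Fh, offset 1h.
sample = struct.pack("<HBIHBBBBBBB", 0x0012, 0x02, 0x50F1C2A0, 0x0041,
                     0x04, 0x1F, 0x00, 0x6F, 0x01, 0xFF, 0xFF)
print(parse_system_event(sample)["description"])
```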

| Byte | Field                | Description                                                                                                                                                                                           |
|------|----------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1    | Record ID            | ID used for SEL Record access                                                                                                                                                                         |
| 2    |                      |                                                                                                                                                                                                       |
| 3    | Record Type          | [7:0] – DCh = OEM timestamped, bytes 8-16 OEM defined                                                                                                                                                 |
| 4    | Timestamp            | Time when the event was logged. LS byte first.                                                                                                                                                        |
| 5    |                      |                                                                                                                                                                                                       |
| 6    |                      |                                                                                                                                                                                                       |
| 7    |                      |                                                                                                                                                                                                       |
| 8    | IPMI Manufacturer ID | 0137h (311d) = IANA enterprise number for Microsoft                                                                                                                                                   |
| 9    |                      |                                                                                                                                                                                                       |
| 10   |                      |                                                                                                                                                                                                       |
| 11   | Sequence Number      | Sequential number reflecting the order in which the records are read. The numbers start at 1 for the first entry in the SEL and continue sequentially to <i>n</i>, the number of entries in the SEL. |
| 12   | Boot Time            | Timestamp of when the system booted into the OS                                                                                                                                                       |
| 13   |                      |                                                                                                                                                                                                       |
| 14   |                      |                                                                                                                                                                                                       |
| 15   |                      |                                                                                                                                                                                                       |
| 16   | Reserved             | 00h                                                                                                                                                                                                   |

#### Table 99: Boot up OEM Event Record Typical Characteristics
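In the OEM record in Table 99, multi-byte values are stored LS byte first, including the three-byte IANA manufacturer ID (0137h is stored as 37h 01h 00h). The following Python sketch decodes such a record; it assumes the raw 16 bytes are available, and the function name, dictionary keys, and sample values are hypothetical. Both timestamps are returned as raw 32-bit integers rather than interpreted as dates.

```python
import struct

# IANA enterprise numbers listed in this guide's OEM record tables.
MANUFACTURERS = {0x000137: "Microsoft", 0x000157: "Intel"}


def parse_windows_boot_oem(record: bytes) -> dict:
    """Decode a Windows boot-up OEM record (record type DCh, Table 99)."""
    if len(record) != 16 or record[2] != 0xDC:
        raise ValueError("expected a 16-byte OEM timestamped record of type DCh")
    # Byte numbers in Table 99 are 1-based; Python indexes are 0-based.
    (logged_at,) = struct.unpack_from("<I", record, 3)     # bytes 4-7
    manufacturer = int.from_bytes(record[7:10], "little")  # bytes 8-10
    sequence = record[10]                                  # byte 11
    (boot_time,) = struct.unpack_from("<I", record, 11)    # bytes 12-15
    return {
        "logged_at": logged_at,
        "manufacturer": MANUFACTURERS.get(manufacturer, f"{manufacturer:06X}h"),
        "sequence": sequence,
        "boot_time": boot_time,
    }


# Hypothetical record: Microsoft (0137h stored as 37 01 00), first OEM record.
sample = (struct.pack("<HBI", 0x0013, 0xDC, 0x50F1C2A5)
          + (0x000137).to_bytes(3, "little")
          + struct.pack("<BIB", 1, 0x50F1C2A0, 0x00))
print(parse_windows_boot_oem(sample))
```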

## 14.2 Shutdown Event Records

When the system shuts down from the Microsoft Windows\* OS, multiple events can be logged: an OS Stop/Shutdown event record, optionally followed by a shutdown reason code OEM record and then zero or more shutdown comment OEM records. All of these are informational records only.

| Byte   | Field                             | Description                                                                                                                                                                  |
|--------|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 8<br>9 | Generator ID                      | 0041h – System Software with an ID = 20h                                                                                                                                     |
| 11     | Sensor Type                       | 20h = OS Stop/Shutdown                                                                                                                                                       |
| 12     | Sensor Number                     | 00h                                                                                                                                                                          |
| 13     | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 6Fh (Sensor Specific)</li> </ul>                      |
| 14     | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset = 3h = OS Graceful Shutdown</li> </ul> |
| 15     | Event Data 2                      | Not used                                                                                                                                                                     |
| 16     | Event Data 3                      | Not used                                                                                                                                                                     |

#### Table 100: Shutdown Reason Code Event Record Typical Characteristics

#### Table 101: Shutdown Reason OEM Event Record Typical Characteristics

| Byte                 | Field                | Description                                                                                                                                                                                          |
|----------------------|----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1<br>2               | Record ID            | ID used for SEL Record access                                                                                                                                                                        |
| 3                    | Record Type          | [7:0] – DDh = OEM timestamped, bytes 8-16 OEM defined                                                                                                                                                |
| 4<br>5<br>6<br>7     | Timestamp            | Time when the event was logged. LS byte first.                                                                                                                                                       |
| 8<br>9<br>10         | IPMI Manufacturer ID | 0137h (311d) = IANA enterprise number for Microsoft                                                                                                                                                  |
| 11                   | Sequence Number      | Sequential number reflecting the order in which the records are read. The numbers start at 1 for the first entry in the SEL and continue sequentially to <i>n</i>, the number of entries in the SEL. |
| 12<br>13<br>14<br>15 | Shutdown Reason      | Shutdown Reason code from the registry (LSB first):<br>HKLM/Software/Microsoft/Windows/CurrentVersion/Reliability/shutdown/ReasonCode                                                                |
| 16                   | Reserved             | 00h                                                                                                                                                                                                  |
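The reason code in bytes 12-15 is a 32-bit value stored LSB first. The Python sketch below extracts it; as a labeled assumption not stated in this guide, it also splits the value using the Win32 `SHTDN_REASON_*` layout (planned flag in bit 31, major reason in bits 23:16, minor reason in bits 15:0). Because Table 102 lists the same DDh record type for comment records, the sketch assumes the caller already knows, for example from record order, that the record carries a reason code. The function name and sample values are hypothetical.

```python
def decode_shutdown_reason(record: bytes) -> dict:
    """Decode a shutdown reason OEM record (record type DDh, Table 101)."""
    if len(record) != 16 or record[2] != 0xDD:
        raise ValueError("expected a 16-byte OEM timestamped record of type DDh")
    code = int.from_bytes(record[11:15], "little")  # bytes 12-15, LSB first
    return {
        "sequence": record[10],                     # byte 11
        "reason_code": f"{code:08X}h",
        # ASSUMPTION: Win32 SHTDN_REASON_* layout (planned flag, major, minor)
        "planned": bool(code & 0x80000000),
        "major": (code >> 16) & 0xFF,
        "minor": code & 0xFFFF,
    }


# Hypothetical record whose reason code is 80020003h.
sample = bytearray(16)
sample[2] = 0xDD
sample[10] = 2
sample[11:15] = (0x80020003).to_bytes(4, "little")
print(decode_shutdown_reason(bytes(sample)))
```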

#### Table 102: Shutdown Comment OEM Event Record Typical Characteristics

| Byte                 | Field                | Description                                                                                                                                                                                          |
|----------------------|----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1<br>2               | Record ID            | ID used for SEL Record access                                                                                                                                                                        |
| 3                    | Record Type          | [7:0] – DDh = OEM timestamped, bytes 8-16 OEM defined                                                                                                                                                |
| 4<br>5<br>6<br>7     | Timestamp            | Time when the event was logged. LS byte first.                                                                                                                                                       |
| 8<br>9<br>10         | IPMI Manufacturer ID | 0137h (311d) = IANA enterprise number for Microsoft<br>0157h (343d) = IANA enterprise number for Intel<br>The value logged depends on the Intelligent Management Bus Driver (IMBDRV) that is loaded. |
| 11                   | Sequence Number      | Sequential number reflecting the order in which the records are read. The numbers start at 1 for the first entry in the SEL and continue sequentially to <i>n</i>, the number of entries in the SEL. |
| 12<br>13<br>14<br>15 | Shutdown Comment     | Shutdown Comment from the registry (LSB first):<br>HKLM/Software/Microsoft/Windows/CurrentVersion/Reliability/shutdown/Comment                                                                       |
| 16                   | Reserved             | 00h                                                                                                                                                                                                  |
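Because a shutdown comment can span several records, the four comment bytes from each Table 102 record have to be joined. This Python sketch assumes the comment records are passed in SEL order and that the comment is single-byte (ASCII) text padded with 00h bytes; neither detail is specified in this guide. `reassemble_shutdown_comment` and the helper `make_comment_record` are hypothetical.

```python
def reassemble_shutdown_comment(records: list) -> str:
    """Concatenate bytes 12-15 of shutdown comment records (Table 102).

    ASSUMPTIONS: records are in SEL order; text is ASCII padded with 00h.
    """
    for r in records:
        if len(r) != 16 or r[2] != 0xDD:
            raise ValueError("expected 16-byte OEM timestamped records of type DDh")
    raw = b"".join(r[11:15] for r in records)
    return raw.rstrip(b"\x00").decode("ascii", errors="replace")


def make_comment_record(sequence: int, chunk: bytes) -> bytes:
    """Build a hypothetical comment record carrying up to four comment bytes."""
    record = bytearray(16)
    record[2] = 0xDD
    record[10] = sequence
    record[11:15] = chunk.ljust(4, b"\x00")
    return bytes(record)


text = b"Planned maint."
records = [make_comment_record(n, text[i:i + 4])
           for n, i in enumerate(range(0, len(text), 4), start=1)]
print(reassemble_shutdown_comment(records))  # Planned maint.
```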

## 14.3 Bug Check / Blue Screen Event Records

When the system experiences a bug check (blue screen), multiple records are written to the SEL. The first is a Bug Check / Blue Screen OS Stop/Shutdown event record; it can be followed by multiple Bug Check / Blue Screen code OEM records that contain the Bug Check / Blue Screen codes. This information can help determine the cause of the failure.

| Byte | Field                             | Description                                                                                                                                                                                                         |
|------|-----------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 8    | Generator ID                      | 0041h – System Software with an ID = 20h                                                                                                                                                                            |
| 9    |                                   |                                                                                                                                                                                                                     |
| 11   | Sensor Type                       | 20h = OS Stop/Shutdown                                                                                                                                                                                              |
| 12   | Sensor Number                     | 00h                                                                                                                                                                                                                 |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 6Fh (Sensor Specific)</li> </ul>                                                             |
| 14   | Event Data 1                      | <ul> <li>[7:6] - 00b = Unspecified Event Data 2</li> <li>[5:4] - 00b = Unspecified Event Data 3</li> <li>[3:0] - Event Trigger Offset = 1h = Runtime Critical Stop (that is, "core dump", "blue screen")</li> </ul> |
| 15   | Event Data 2                      | Not used                                                                                                                                                                                                            |
| 16   | Event Data 3                      | Not used                                                                                                                                                                                                            |

Table 103: Bug Check/Blue Screen – OS Stop Event Record Typical Characteristics

| Byte                 | Field                           | Description                                                                                                                                                                                                                                                                                                                                                |
|----------------------|---------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1<br>2               | Record ID                       | ID used for SEL Record access                                                                                                                                                                                                                                                                                                                              |
| 3                    | Record Type                     | [7:0] – DEh = OEM timestamped, bytes 8-16 OEM defined                                                                                                                                                                                                                                                                                                      |
| 4<br>5<br>6<br>7     | Timestamp                       | Time when the event was logged. LS byte first.                                                                                                                                                                                                                                                                                                             |
| 8<br>9<br>10         | IPMI Manufacturer ID            | 0137h (311) = IANA enterprise number for Microsoft<br>0157h (343) = IANA enterprise number for Intel<br>The value logged depends on the Intelligent Management Bus Driver (IMBDRV) that is loaded.                                                                                                                                                         |
| 11                   | Sequence Number                 | Sequential number reflecting the order in which the records are read. The numbers start at 1 for the first entry in the SEL and continue sequentially to <i>n</i> , the number of entries in the SEL.                                                                                                                                                      |
| 12<br>13<br>14<br>15 | Bug Check / Blue Screen<br>Data | The first record of this type contains the Bug Check / Blue Screen stop code; it is followed by records containing the four Bug Check / Blue Screen parameters. All values are LSB first.<br>Each Bug Check / Blue Screen parameter requires two records, and both records for a parameter have the same Record ID.<br>There are nine records in total: one for the stop code and two for each of the four parameters. |
| 16                   | Operating system type           | 00 = 32-bit OS<br>01 = 64-bit OS                                                                                                                                                                                                                                                                                                                           |

#### Table 104: Bug Check/Blue Screen code OEM Event Record Typical Characteristics
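Table 104 implies that nine records must be combined to recover the stop code and its four parameters. The Python sketch below assumes the nine DEh records are passed in SEL order and, as a labeled assumption not stated in this guide, that the first record of each pair carries the low-order 32 bits of the parameter. `decode_bug_check`, the helper `make_bug_check_record`, and the sample values are hypothetical.

```python
def decode_bug_check(records: list) -> tuple:
    """Rebuild the stop code and four parameters from the nine DEh records of Table 104."""
    if len(records) != 9 or any(len(r) != 16 or r[2] != 0xDE for r in records):
        raise ValueError("expected nine 16-byte OEM timestamped records of type DEh")
    # Bytes 12-15 of each record hold four data bytes, LSB first.
    dwords = [int.from_bytes(r[11:15], "little") for r in records]
    stop_code = dwords[0]
    # ASSUMPTION: the first record of each pair carries the low-order 32 bits.
    parameters = [dwords[i] | (dwords[i + 1] << 32) for i in range(1, 9, 2)]
    # Byte 16: 00 = 32-bit OS, 01 = 64-bit OS
    os_type = "64-bit" if records[0][15] == 0x01 else "32-bit"
    return stop_code, parameters, os_type


def make_bug_check_record(dword: int, os_type: int = 0x01) -> bytes:
    """Build a hypothetical DEh record carrying one 32-bit data word."""
    record = bytearray(16)
    record[2] = 0xDE
    record[11:15] = dword.to_bytes(4, "little")
    record[15] = os_type
    return bytes(record)


sample = [make_bug_check_record(d) for d in (
    0x0000000A,              # stop code
    0x12345678, 0xFFFFF800,  # parameter 1 (low, high)
    0x00000002, 0x00000000,  # parameter 2
    0x00000000, 0x00000000,  # parameter 3
    0x00ABCDEF, 0xFFFFF800,  # parameter 4
)]
stop_code, parameters, os_type = decode_bug_check(sample)
print(f"{stop_code:08X}h", [f"{p:016X}h" for p in parameters], os_type)
```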

# 15. Linux\* Kernel Panic Records

The OpenIPMI driver can place semi-custom and custom events in the system event log when a kernel panic occurs. If the "Generate a panic event to all BMCs on a panic" option (CONFIG_IPMI_PANIC_EVENT) is enabled, one event in standard IPMI event format is logged on each panic. If the "Generate OEM events containing the panic string" option (CONFIG_IPMI_PANIC_STRING) is also enabled, a set of OEM events holding the panic string is logged as well.

| Byte | Field                             | Description                                                                                                                                                                                                |
|------|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 8    | Generator ID                      | 0021h – Kernel                                                                                                                                                                                             |
| 9    |                                   |                                                                                                                                                                                                            |
| 10   | EvM Rev                           | 03h = IPMI 1.0 format                                                                                                                                                                                      |
| 11   | Sensor Type                       | 20h = OS Stop/Shutdown                                                                                                                                                                                     |
| 12   | Sensor Number                     | The first byte of the panic string (0 if no panic string)                                                                                                                                                  |
| 13   | Event Direction and<br>Event Type | <ul> <li>[7] Event direction</li> <li>0b = Assertion Event</li> <li>1b = Deassertion Event</li> <li>[6:0] Event Type = 6Fh (Sensor Specific)</li> </ul>                                                    |
| 14   | Event Data 1                      | <ul> <li>[7:6] – 10b = OEM code in Event Data 2</li> <li>[5:4] – 10b = OEM code in Event Data 3</li> <li>[3:0] – Event Trigger Offset = 1h = Runtime Critical Stop (a.k.a. "core dump", "blue screen")</li> </ul> |
| 15   | Event Data 2                      | The second byte of the panic string                                                                                                                                                                        |
| 16   | Event Data 3                      | The third byte of the panic string                                                                                                                                                                         |

Table 105: Linux\* Kernel Panic Event Record Characteristics
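
The Table 105 layout can be checked mechanically. The Python sketch below is illustrative only. `decode_panic_event`, `PanicEvent`, and the sample record are hypothetical. Bytes 1-7 of the sample are placeholder values. Reading the two-byte Generator ID as little-endian is an assumption based on standard IPMI SEL encoding. The sketch maps each table byte N to `rec[N - 1]`, checks the sensor type and event type, and recovers the first three characters of the panic string from the Sensor Number, Event Data 2, and Event Data 3 fields.

```python
from dataclasses import dataclass


@dataclass
class PanicEvent:
    generator_id: int    # bytes 8-9
    evm_rev: int         # byte 10
    sensor_type: int     # byte 11
    deassertion: bool    # byte 13, bit 7
    event_type: int      # byte 13, bits [6:0]
    oem_data2: bool      # byte 14, bits [7:6] == 10b
    oem_data3: bool      # byte 14, bits [5:4] == 10b
    offset: int          # byte 14, bits [3:0]; 1h = Runtime Critical Stop
    panic_prefix: bytes  # first three panic-string bytes (bytes 12, 15, 16)


def decode_panic_event(rec: bytes) -> PanicEvent:
    """Decode a 16-byte SEL record using the Table 105 layout (hypothetical helper).

    Table byte N is rec[N - 1]. The two-byte Generator ID is read
    little-endian (assumption: standard IPMI SEL encoding).
    """
    if len(rec) != 16:
        raise ValueError("SEL records are 16 bytes long")
    sensor_type, dir_type, data1 = rec[10], rec[12], rec[13]
    if sensor_type != 0x20 or (dir_type & 0x7F) != 0x6F:
        raise ValueError("not an OS Stop/Shutdown sensor-specific event")
    return PanicEvent(
        generator_id=int.from_bytes(rec[7:9], "little"),
        evm_rev=rec[9],
        sensor_type=sensor_type,
        deassertion=bool(dir_type & 0x80),
        event_type=dir_type & 0x7F,
        oem_data2=(data1 >> 6) == 0b10,
        oem_data3=((data1 >> 4) & 0b11) == 0b10,
        offset=data1 & 0x0F,
        # Sensor Number (byte 12), Event Data 2 (byte 15), Event Data 3 (byte 16)
        panic_prefix=bytes([rec[11], rec[14], rec[15]]),
    )


# Sample record: bytes 1-7 (Record ID, Record Type, Timestamp) are placeholders.
record = bytes([0x01, 0x00, 0x02, 0, 0, 0, 0,
                0x21, 0x00, 0x03, 0x20, ord("K"), 0x6F, 0xA1, ord("e"), ord("r")])
event = decode_panic_event(record)
print(event.panic_prefix)  # b'Ker'
```

Only three characters of the panic string fit in this record. The full string is carried by the OEM records in Table 106.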


| Byte  | Field              | Description                                                                                                                               |
|-------|--------------------|-------------------------------------------------------------------------------------------------------------------------------------------|
| 1-2   | Record ID          | ID used for SEL record access                                                                                                             |
| 3     | Record Type        | [7:0] – F0h = OEM non-timestamped, bytes 4-16 OEM defined                                                                                 |
| 4     | Slave Address      | The slave address of the card saving the panic                                                                                            |
| 5     | Sequence Number    | A sequence number (starting at zero)                                                                                                      |
| 6-16  | Kernel Panic Data  | These bytes hold the panic string. If the panic string is longer than 11 bytes, it is split across multiple records with increasing sequence numbers. |

Table 106: Linux\* Kernel Panic String Extended Record Characteristics
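
Reassembling the full panic string follows from Table 106:

1. Keep the F0h records.
2. Order them by the Sequence Number in byte 5.
3. Concatenate bytes 6-16.

The Python sketch below makes two assumptions. First, the records belong to a single panic, because sequence numbers start at zero for each one. Second, the final chunk is padded with NUL bytes. Both functions are hypothetical helpers, not part of the Open IPMI driver. This includes the `split_panic_string` encoder, which only builds sample input.

```python
from typing import Dict, Iterable, List, Optional

OEM_NON_TIMESTAMPED = 0xF0  # byte 3 record type (Table 106)
CHUNK = 11                  # bytes 6-16 carry 11 panic-string bytes


def split_panic_string(text: str, slave_address: int) -> List[bytes]:
    """Hypothetical encoder: split a panic string into 16-byte F0h records."""
    raw = text.encode("ascii", "replace")
    records = []
    for seq, start in enumerate(range(0, len(raw), CHUNK)):
        # Assumption: the last chunk is padded with NUL bytes to 11 bytes.
        chunk = raw[start:start + CHUNK].ljust(CHUNK, b"\x00")
        # Bytes 1-2 Record ID (placeholder 0), 3 type, 4 slave address, 5 sequence.
        header = bytes([0x00, 0x00, OEM_NON_TIMESTAMPED, slave_address, seq])
        records.append(header + chunk)
    return records


def reassemble_panic_string(records: Iterable[bytes],
                            slave_address: Optional[int] = None) -> str:
    """Rebuild the panic string from the F0h records of a single panic."""
    chunks: Dict[int, bytes] = {}
    for rec in records:
        if len(rec) != 16 or rec[2] != OEM_NON_TIMESTAMPED:
            continue  # not an OEM non-timestamped record
        if slave_address is not None and rec[3] != slave_address:
            continue  # saved by a different controller
        chunks[rec[4]] = rec[5:16]  # byte 5 = sequence number, bytes 6-16 = data
    raw = b"".join(chunks[seq] for seq in sorted(chunks))
    return raw.rstrip(b"\x00").decode("ascii", "replace")


message = "Kernel panic - not syncing: Fatal exception"
recs = split_panic_string(message, slave_address=0x20)  # example address
print(len(recs))                                          # 4
print(reassemble_panic_string(reversed(recs)) == message)  # True
```

Sorting by sequence number rather than trusting SEL order keeps the reassembly correct even if the records are read back out of order.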