4.11. Reliability, Availability, and Serviceability (RAS) Extensions
This document describes TF-A support for Arm Reliability, Availability, and Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and later CPUs, and also an optional extension to the base Armv8.0 architecture.
For the description of Arm RAS extensions, Standard Error Records, and the precise definition of RAS terminology, please refer to the Arm Architecture Reference Manual and RAS Supplement. The rest of this document assumes familiarity with the architecture and terminology.
There are two philosophies for handling RAS errors from the point of view of the Non-secure world: Firmware First Handling (FFH) and Kernel First Handling (KFH).
4.11.1. Firmware First Handling (FFH)
4.11.1.1. Introduction
External Aborts (EAs) and error interrupts corresponding to NS nodes are handled first in firmware.
Errors are signalled back to the NS world via a suitable mechanism.
The kernel is prohibited from accessing the RAS error records directly.
Firmware creates CPER records for the kernel to navigate and process.
Firmware signals the error back to the kernel via SDEI.
4.11.1.2. Overview
FFH works in conjunction with the Exception Handling Framework (EHF). Exceptions resulting from errors in the Non-secure world are routed to and handled in EL3. These errors are Synchronous External Aborts (SEAs), Asynchronous External Aborts (signalled as SErrors), and Fault Handling and Error Recovery interrupts. The RAS framework in TF-A allows the platform to define an external abort handler and to register RAS nodes and interrupts. It also provides helpers for accessing Standard Error Records as introduced by the RAS extensions.
4.11.2. Kernel First Handling (KFH)
4.11.2.1. Introduction
EAs originating from, or attributed to, the NS world are handled first in NS, and the kernel navigates the standard error records directly.
KFH can be supported on a platform without TF-A being aware of it, but there are a few corner cases where TF-A needs special handling. This handling is currently missing and will be added in the future.
4.11.3. TF-A build options
ENABLE_FEAT_RAS: Manage FEAT_RAS extension when switching the world.
RAS_FFH_SUPPORT: Pull in the necessary framework and platform hooks for Firmware First Handling (FFH) of RAS errors.
RAS_TRAP_NS_ERR_REC_ACCESS: Trap Non-secure access of RAS error record registers.
RAS_EXTENSION: Deprecated macro, equivalent to ENABLE_FEAT_RAS and RAS_FFH_SUPPORT put together.
The RAS feature depends on the following TF-A build flags:
EL3_EXCEPTION_HANDLING: Required for FFH
HANDLE_EA_EL3_FIRST_NS: Required for FFH
FAULT_INJECTION_SUPPORT: Required for testing the RAS feature on the FVP platform
4.11.4. RAS Framework
4.11.4.1. Platform APIs
The RAS framework allows the platform to define handlers for External Aborts, Uncontainable Errors, Double Faults, and errors arising from EL3 execution. Please refer to the RAS Porting Guide.
4.11.4.2. Registering RAS error records
RAS nodes are components in the system capable of signalling errors to PEs through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS nodes contain one or more error records, which are registers through which the nodes advertise various properties of the signalled error. Arm recommends that error records are implemented in the Standard Error Record format. The RAS architecture allows for error records to be accessible via system or memory-mapped registers.
The platform should enumerate the error records providing for each of them:
A handler to probe error records for errors;
When the probing identifies an error, a handler to handle it;
For a memory-mapped error record, its base address and size in KB; for a system register-accessed record, the start index of the record and the number of contiguous records from that index;
Any node-specific auxiliary data.
With this information supplied, when the run time firmware receives one of these notifications, the RAS framework can iterate through the error records, probe each of them for errors, and invoke the appropriate handler for any error found.
The RAS framework provides macros to populate error record information. The
macros are versioned, and the latest version as of this writing is 1. Each
macro creates a structure of type struct err_record_info from its arguments;
these structures are later passed to the probe and error handlers.
For memory-mapped error records:
ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)
And, for system register ones:
ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)
The probe handler must have the following prototype:
typedef int (*err_record_probe_t)(const struct err_record_info *info,
int *probe_data);
The probe handler must return a non-zero value if an error was detected, or 0
otherwise. The probe_data output parameter can be used to pass any useful
information resulting from probe to the error handler (see below). For
example, it could return the index of the record.
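As an illustration of the contract above, a platform-specific probe handler could look like the following sketch. The struct err_record_info shown here is a cut-down stand-in with only an aux_data field, and struct fake_node with its fields is entirely hypothetical; the real definitions come from the TF-A RAS headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cut-down stand-in for TF-A's struct err_record_info (assumption). */
struct err_record_info {
    void *aux_data;             /* node-specific auxiliary data */
};

/* Hypothetical node: one software-visible status word per record. */
struct fake_node {
    const uint32_t *status;     /* non-zero entry means an error is pending */
    int num_records;
};

/* Follows the err_record_probe_t prototype described in the text. */
int fake_node_probe(const struct err_record_info *info, int *probe_data)
{
    const struct fake_node *node;

    assert(info != NULL && probe_data != NULL);
    node = info->aux_data;

    for (int i = 0; i < node->num_records; i++) {
        if (node->status[i] != 0U) {
            *probe_data = i;    /* tell the error handler which record */
            return 1;           /* error detected */
        }
    }
    return 0;                   /* nothing pending; probe_data untouched */
}
```

Passing the record index through probe_data keeps the error handler from having to rescan the node.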
The error handler must have the following prototype:
typedef int (*err_record_handler_t)(const struct err_record_info *info,
int probe_data, const struct err_handler_data *const data);
The data constant parameter describes the various properties of the error,
including the reason for the error, exception syndrome, and also flags,
cookie, and handle parameters from the top-level exception handler.
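A minimal sketch of an error handler with this prototype is shown below. Both structures are simplified stand-ins whose field names and types are assumptions rather than the TF-A layout, struct fake_node_stats is hypothetical, and returning 0 to mean "handled" is an assumed convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the TF-A types (assumed, not the real layout). */
struct err_record_info {
    void *aux_data;
};

struct err_handler_data {
    unsigned int ea_reason;     /* reason the handler was entered */
    uint64_t syndrome;          /* exception syndrome */
    uint64_t flags;             /* from the top-level exception handler */
    void *cookie;
    void *handle;
};

/* Hypothetical per-node bookkeeping reached through aux_data. */
struct fake_node_stats {
    unsigned int errors_seen;
    int last_record;
    uint64_t last_syndrome;
};

/* Follows the err_record_handler_t prototype described in the text. */
int fake_node_handler(const struct err_record_info *info, int probe_data,
                      const struct err_handler_data *const data)
{
    struct fake_node_stats *stats;

    assert(info != NULL && data != NULL);
    stats = info->aux_data;

    /* probe_data carries whatever the probe stored; here, a record index. */
    stats->errors_seen++;
    stats->last_record = probe_data;
    stats->last_syndrome = data->syndrome;

    return 0;                   /* assumed convention: 0 means handled */
}
```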
The platform is expected to populate an array using the macros above, and
register it with the RAS framework using the macro REGISTER_ERR_RECORD_INFO(),
passing it the name of the array describing the records. Note that the macro
must be used in the same file where the array is defined.
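The sketch below shows what such an array could look like. So that it compiles on its own, it carries stand-in definitions of struct err_record_info and of the two macros; these are assumptions modelled on the argument lists given earlier, not TF-A's own definitions. The base address, record counts, and demo_* functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct err_record_info;
struct err_handler_data;        /* left opaque in this sketch */

typedef int (*err_record_probe_t)(const struct err_record_info *info,
                                  int *probe_data);
typedef int (*err_record_handler_t)(const struct err_record_info *info,
                                    int probe_data,
                                    const struct err_handler_data *const data);

/* Stand-in structure and macros (assumptions, not TF-A's definitions). */
struct err_record_info {
    err_record_probe_t probe;
    err_record_handler_t handler;
    void *aux_data;
    union {
        struct { uintptr_t base_addr; unsigned int size_num_k; } memmap;
        struct { unsigned int idx_start; unsigned int num_idx; } sysreg;
    };
    uint8_t version;
};

#define ERR_RECORD_MEMMAP_V1(base, size_k, probe_fn, handler_fn, aux)   \
    { .version = 1, .memmap.base_addr = (base),                         \
      .memmap.size_num_k = (size_k), .probe = (probe_fn),               \
      .handler = (handler_fn), .aux_data = (aux) }

#define ERR_RECORD_SYSREG_V1(start, num, probe_fn, handler_fn, aux)     \
    { .version = 1, .sysreg.idx_start = (start),                        \
      .sysreg.num_idx = (num), .probe = (probe_fn),                     \
      .handler = (handler_fn), .aux_data = (aux) }

/* Hypothetical handlers; a real platform supplies meaningful ones. */
static int demo_probe(const struct err_record_info *info, int *probe_data)
{
    (void)info;
    *probe_data = 0;
    return 0;                   /* never reports an error in this sketch */
}

static int demo_handler(const struct err_record_info *info, int probe_data,
                        const struct err_handler_data *const data)
{
    (void)info; (void)probe_data; (void)data;
    return 0;
}

/* One memory-mapped node (4 KB) and one group of two sysreg records. */
struct err_record_info plat_err_records[] = {
    ERR_RECORD_MEMMAP_V1(0x2a400000UL, 4, demo_probe, demo_handler, NULL),
    ERR_RECORD_SYSREG_V1(0, 2, demo_probe, demo_handler, NULL),
};

/* In a TF-A platform file, this same file would then contain:
 *     REGISTER_ERR_RECORD_INFO(plat_err_records);
 */
```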
4.11.4.2.1. Standard Error Record helpers
The TF-A RAS framework provides probe handlers for Standard Error Records, for both memory-mapped and System Register accesses:
int ras_err_ser_probe_memmap(const struct err_record_info *info,
int *probe_data);
int ras_err_ser_probe_sysreg(const struct err_record_info *info,
int *probe_data);
When the platform enumerates error records, for those records in the Standard Error Record format, these helpers may be used instead of the platform rolling out its own. Both helpers above:
Return a non-zero value when an error is detected in a Standard Error Record;
Set probe_data to the index of the error record upon detecting an error.
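To give a feel for what probing a memory-mapped Standard Error Record group involves, here is a minimal sketch; it is not TF-A's implementation. It relies on the RAS architecture's memory-mapped layout (64 bytes per record, ERR<n>STATUS at offset 0x10 within the record, valid bit V at bit 30). The function name ser_probe_sketch and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Memory-mapped Standard Error Record layout used by this sketch. */
#define SER_RECORD_STRIDE   0x40U               /* bytes per record */
#define SER_STATUS_OFFSET   0x10U               /* ERR<n>STATUS */
#define SER_STATUS_V_BIT    (UINT64_C(1) << 30) /* ERR<n>STATUS.V */

/*
 * Hypothetical helper: scan num_records records starting at base (which
 * must be 8-byte aligned). On the first record whose status is valid,
 * store its index in *probe_data and return 1; otherwise return 0.
 */
int ser_probe_sketch(const volatile uint8_t *base, unsigned int num_records,
                     int *probe_data)
{
    assert(base != NULL && probe_data != NULL);

    for (unsigned int i = 0; i < num_records; i++) {
        const volatile uint64_t *status = (const volatile uint64_t *)
            (base + (i * SER_RECORD_STRIDE) + SER_STATUS_OFFSET);

        if ((*status & SER_STATUS_V_BIT) != 0U) {
            *probe_data = (int)i;
            return 1;
        }
    }
    return 0;
}
```

On real hardware base would be the node's device-memory address; for experimentation it can point at an ordinary array of 64-bit words.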
4.11.4.3. Registering RAS interrupts
RAS nodes can signal errors to the PE by raising Fault Handling and/or Error Recovery interrupts. For the firmware-first handling paradigm for interrupts to work, the platform must set up and register with the EHF. See Interaction with Exception Handling Framework.
For each RAS interrupt, the platform has to provide a structure of type
struct ras_interrupt, containing:
Interrupt number;
The associated error record information (pointer to the corresponding
struct err_record_info);
Optionally, a cookie.
The platform is expected to define an array of struct ras_interrupt, and
register it with the RAS framework using the macro REGISTER_RAS_INTERRUPTS(),
passing it the name of the array. Note that the macro must be used in the same
file where the array is defined.
The array of struct ras_interrupt must be sorted in increasing order of
interrupt number. This allows for fast lookup of handlers in order to service
RAS interrupts.
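A sketch of such an array, together with a helper that checks the ordering requirement, might look as follows. The field names of the stand-in struct ras_interrupt are assumptions based on the three items listed above, the interrupt numbers are made up, and ras_interrupts_sorted() is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct err_record_info;     /* opaque here; provided by the RAS framework */

/* Stand-in mirroring the three items listed above (field names assumed). */
struct ras_interrupt {
    unsigned int intr_number;
    struct err_record_info *err_record;
    void *cookie;
};

/*
 * Hypothetical interrupt numbers in increasing order. In real code each
 * err_record would point at an entry of the platform's error record array.
 */
struct ras_interrupt plat_ras_interrupts[] = {
    { .intr_number = 34U, .err_record = NULL, .cookie = NULL },
    { .intr_number = 35U, .err_record = NULL, .cookie = NULL },
    { .intr_number = 48U, .err_record = NULL, .cookie = NULL },
};

/* Returns true when entries are in strictly increasing interrupt order. */
bool ras_interrupts_sorted(const struct ras_interrupt *ints, size_t count)
{
    assert(ints != NULL || count == 0U);

    for (size_t i = 1; i < count; i++) {
        if (ints[i - 1].intr_number >= ints[i].intr_number)
            return false;
    }
    return true;
}
```

Running a check like this in a debug build can catch a mis-ordered array before a lookup silently misses an interrupt.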
4.11.4.4. Double-fault handling
A Double Fault condition arises when an error is signalled to the PE while handling of a previously signalled error is still underway. When a Double Fault condition arises, the Arm RAS extensions only require the handler to perform an orderly shutdown of the system, as recovery may be impossible.
The RAS extensions in Armv8.4 introduced new architectural features to deal
with Double Fault conditions, specifically the NMEA and EASE bits in the
SCR_EL3 register. These were introduced to assist EL3 software which runs part
of its entry/exit routines with exceptions momentarily masked. In such systems,
External Aborts/SErrors are not handled immediately when they occur, but only
after the exceptions are unmasked again.
TF-A, for legacy reasons, executes the whole of EL3 with all exceptions unmasked. This means that all exceptions routed to EL3 are handled immediately. TF-A is thus able to detect a Double Fault condition in software, without needing the Armv8.4 Double Fault architecture extensions.
Double faults are fatal: they terminate at the platform double fault handler, which does not return.
4.11.4.5. Engaging the RAS framework
Enabling RAS support is a platform choice.
The RAS support in TF-A introduces a default implementation of
plat_ea_handler, the External Abort handler in EL3. When RAS_FFH_SUPPORT is set
to 1, it first calls the ras_ea_handler() function, which is the top-level RAS
exception handler. ras_ea_handler() is responsible for iterating through the
platform-supplied error records, probing them, and, when an error is
identified, looking up and invoking the corresponding error handler.
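The iterate, probe, and handle sequence described above can be sketched as follows. This is an illustration under stated assumptions, not the TF-A source: the structures are cut-down stand-ins, ras_dispatch_sketch() and the demo_* functions are hypothetical, and treating a non-zero handler return as a failure is an assumed convention.

```c
#include <assert.h>
#include <stddef.h>

struct err_record_info;
struct err_handler_data { unsigned int ea_reason; };   /* cut-down stand-in */

typedef int (*err_record_probe_t)(const struct err_record_info *info,
                                  int *probe_data);
typedef int (*err_record_handler_t)(const struct err_record_info *info,
                                    int probe_data,
                                    const struct err_handler_data *const data);

struct err_record_info {                                /* cut-down stand-in */
    err_record_probe_t probe;
    err_record_handler_t handler;
    void *aux_data;
};

/*
 * Probe every record; for each one that reports an error, run its handler.
 * Returns the number of errors handled, or -1 if a handler fails.
 */
int ras_dispatch_sketch(const struct err_record_info *records, size_t count,
                        const struct err_handler_data *data)
{
    int handled = 0;

    assert(records != NULL && data != NULL);

    for (size_t i = 0; i < count; i++) {
        const struct err_record_info *info = &records[i];
        int probe_data = 0;

        if (info->probe(info, &probe_data) == 0)
            continue;           /* this record signalled nothing */

        if (info->handler(info, probe_data, data) != 0)
            return -1;          /* a handler failure aborts the walk */
        handled++;
    }
    return handled;
}

/* Hypothetical probe: aux_data points at an int acting as a pending flag. */
int demo_probe(const struct err_record_info *info, int *probe_data)
{
    const int *pending = info->aux_data;

    *probe_data = 0;
    return *pending;
}

/* Hypothetical handler: clears the pending flag. */
int demo_handler(const struct err_record_info *info, int probe_data,
                 const struct err_handler_data *const data)
{
    (void)probe_data; (void)data;
    *(int *)info->aux_data = 0;
    return 0;
}
```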
Note that, if the platform chooses to override the plat_ea_handler function
and intends to use the RAS framework, it must explicitly call ras_ea_handler()
from within.
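A platform override that still engages the RAS framework could be structured like the sketch below. The ras_ea_handler() here is a stub so that the example compiles alone; its parameter list, the matching plat_ea_handler() parameter list, and the "non-zero means handled" return convention are assumptions that should be checked against the TF-A headers in use. The counter standing in for platform fallback handling is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stub for the framework entry point (signature and return are assumed). */
static int ras_result_stub;

int ras_ea_handler(unsigned int ea_reason, uint64_t syndrome, void *cookie,
                   void *handle, uint64_t flags)
{
    (void)ea_reason; (void)syndrome; (void)cookie; (void)handle; (void)flags;
    return ras_result_stub;     /* non-zero: an error was handled */
}

/* Hypothetical stand-in for platform-specific fallback handling. */
static unsigned int plat_fallback_count;

/* Platform override: give the RAS framework the first chance. */
void plat_ea_handler(unsigned int ea_reason, uint64_t syndrome, void *cookie,
                     void *handle, uint64_t flags)
{
    /* This sketch expects a valid context handle from the caller. */
    assert(handle != NULL);

    if (ras_ea_handler(ea_reason, syndrome, cookie, handle, flags) != 0)
        return;                 /* the RAS framework dealt with the error */

    /* Not an error the framework recognised: apply platform policy. */
    plat_fallback_count++;
}
```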
Similarly, for RAS interrupts, the framework defines ras_interrupt_handler().
The RAS framework arranges for it to be invoked when a RAS interrupt is taken
at EL3. The function bisects the platform-supplied sorted array of interrupts
to look up the error record information associated with the interrupt number.
The error handler for that record is then invoked to handle the error.
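The bisection step can be sketched as an ordinary binary search over the sorted array. The stand-in struct ras_interrupt (field names assumed) and the function ras_find_interrupt() are hypothetical and only illustrate the lookup, not the framework's own code.

```c
#include <assert.h>
#include <stddef.h>

struct err_record_info;     /* opaque here; provided by the RAS framework */

/* Stand-in for the framework's interrupt descriptor (field names assumed). */
struct ras_interrupt {
    unsigned int intr_number;
    struct err_record_info *err_record;
    void *cookie;
};

/*
 * Binary search over an array sorted by increasing intr_number.
 * Returns the matching entry, or NULL if the interrupt is not registered.
 */
const struct ras_interrupt *ras_find_interrupt(const struct ras_interrupt *ints,
                                               size_t count, unsigned int intr)
{
    size_t lo = 0, hi = count;  /* search window is [lo, hi) */

    assert(ints != NULL || count == 0U);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (ints[mid].intr_number == intr)
            return &ints[mid];
        if (ints[mid].intr_number < intr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}
```

The search takes a logarithmic number of comparisons, which is why the array must be kept sorted.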
4.11.4.6. Interaction with Exception Handling Framework
As mentioned in earlier sections, the RAS framework interacts with the EHF to
arbitrate handling of RAS exceptions with others that are routed to EL3. This
means that the platform must partition a priority level for handling RAS
exceptions. The platform must then define the macro PLAT_RAS_PRI to the
priority level used for RAS exceptions. Platforms would typically want to
allocate the highest secure priority for RAS handling.
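As a rough illustration, the sketch below defines PLAT_RAS_PRI next to two other EL3 priority levels and checks that RAS has the highest one. All numeric values are examples only, the two SDEI macro names are assumed to be the other levels a platform partitions, and is_highest_priority() is a hypothetical helper. It relies on the GIC convention that a numerically lower value means a higher priority.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Example values only; each platform chooses its own partitioning. */
#define PLAT_RAS_PRI            0x10U
#define PLAT_SDEI_CRITICAL_PRI  0x60U
#define PLAT_SDEI_NORMAL_PRI    0x70U

/* Priority levels this hypothetical platform routes to EL3. */
const unsigned int plat_el3_priorities[] = {
    PLAT_RAS_PRI, PLAT_SDEI_CRITICAL_PRI, PLAT_SDEI_NORMAL_PRI,
};

/*
 * GIC convention: a numerically lower value is a higher priority.
 * Returns true when no entry in the list outranks pri.
 */
bool is_highest_priority(unsigned int pri, const unsigned int *list,
                         size_t count)
{
    assert(list != NULL);

    for (size_t i = 0; i < count; i++) {
        if (list[i] < pri)
            return false;
    }
    return true;
}
```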
Handling of both interrupt and non-interrupt exceptions follows the sequences outlined in the EHF documentation. That is, for interrupts, priority management is implicit; for non-interrupt exceptions, it is explicit, using EHF APIs.
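For the non-interrupt case, explicit priority management amounts to bracketing the handling with an activate and a deactivate call. The sketch below uses stubs named after the EHF calls ehf_activate_priority() and ehf_deactivate_priority(); the stub bodies, the NO_ACTIVE_PRI marker, the example priority value, and the handle_ras_abort_sketch() and demo_handle_error() functions are assumptions made so that the example runs on its own.

```c
#include <assert.h>

#define PLAT_RAS_PRI    0x10U   /* example value */
#define NO_ACTIVE_PRI   0xffU   /* hypothetical "idle" marker */

/* Stubs standing in for the EHF; they only track one active priority. */
static unsigned int active_pri = NO_ACTIVE_PRI;

void ehf_activate_priority(unsigned int priority)
{
    assert(priority < active_pri);      /* lower value: higher priority */
    active_pri = priority;
}

void ehf_deactivate_priority(unsigned int priority)
{
    assert(priority == active_pri);
    active_pri = NO_ACTIVE_PRI;         /* simplified: no nesting here */
}

/* Hypothetical error handling that expects the RAS priority to be active. */
int demo_handle_error(void)
{
    return (active_pri == PLAT_RAS_PRI) ? 0 : -1;
}

/* Non-interrupt path (e.g. an External Abort): manage priority explicitly. */
int handle_ras_abort_sketch(int (*handle_error)(void))
{
    int ret;

    ehf_activate_priority(PLAT_RAS_PRI);
    ret = handle_error();
    ehf_deactivate_priority(PLAT_RAS_PRI);

    return ret;
}
```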
Copyright (c) 2018-2023, Arm Limited and Contributors. All rights reserved.