12. NVIDIA Tegra

  • T194

T194 has eight NVIDIA Carmel CPU cores in a coherent multi-processor configuration. The Carmel cores support the ARM Architecture version 8.2, executing both 64-bit AArch64 code, and 32-bit AArch32 code. The Carmel processors are organized as four dual-core clusters, where each cluster has a dedicated 2 MiB Level-2 unified cache. A high speed coherency fabric connects these processor complexes and allows heterogeneous multi-processing with all eight cores if required.

  • T186

The NVIDIA® Parker (T186) series system-on-chip (SoC) delivers a heterogeneous multi-processing (HMP) solution designed to optimize performance and efficiency.

T186 has Dual NVIDIA Denver 2 ARM® CPU cores, plus Quad ARM Cortex®-A57 cores, in a coherent multiprocessor configuration. The Denver 2 and Cortex-A57 cores support ARMv8, executing both 64-bit Aarch64 code, and 32-bit Aarch32 code including legacy ARMv7 applications. The Denver 2 processors each have 128 KB Instruction and 64 KB Data Level 1 caches; and have a 2MB shared Level 2 unified cache. The Cortex-A57 processors each have 48 KB Instruction and 32 KB Data Level 1 caches; and also have a 2 MB shared Level 2 unified cache. A high speed coherency fabric connects these two processor complexes and allows heterogeneous multi-processing with all six cores if required.

  • T210

T210 has Quad Arm® Cortex®-A57 cores in a switched configuration with a companion set of quad Arm Cortex-A53 cores. The Cortex-A57 and A53 cores support Armv8-A, executing both 64-bit Aarch64 code, and 32-bit Aarch32 code including legacy Armv7-A applications. The Cortex-A57 processors each have 48 KB Instruction and 32 KB Data Level 1 caches; and have a 2 MB shared Level 2 unified cache. The Cortex-A53 processors each have 32 KB Instruction and 32 KB Data Level 1 caches; and have a 512 KB shared Level 2 unified cache.

  • T132

Denver is NVIDIA’s own custom-designed, 64-bit, dual-core CPU which is fully Armv8-A architecture compatible. Each of the two Denver cores implements a 7-way superscalar microarchitecture (up to 7 concurrent micro-ops can be executed per clock), and includes a 128KB 4-way L1 instruction cache, a 64KB 4-way L1 data cache, and a 2MB 16-way L2 cache, which services both cores.

Denver implements an innovative process called Dynamic Code Optimization, which optimizes frequently used software routines at runtime into dense, highly tuned microcode-equivalent routines. These are stored in a dedicated, 128MB main-memory-based optimization cache. After being read into the instruction cache, the optimized micro-ops are executed, re-fetched and executed from the instruction cache as long as needed and capacity allows.

Effectively, this reduces the need to re-optimize the software routines. Instead of using hardware to extract the instruction-level parallelism (ILP) inherent in the code, Denver extracts the ILP once via software techniques, and then executes those routines repeatedly, thus amortizing the cost of ILP extraction over the many execution instances.

Denver also features new low latency power-state transitions, in addition to extensive power-gating and dynamic voltage and clock scaling based on workloads.

12.1. Directory structure

  • plat/nvidia/tegra/common - Common code for all Tegra SoCs

  • plat/nvidia/tegra/soc/txxx - Chip specific code

12.2. Trusted OS dispatcher

Tegra supports multiple Trusted OS’.

  • Trusted Little Kernel (TLK): In order to include the ‘tlkd’ dispatcher in the image, pass ‘SPD=tlkd’ on the command line while preparing a bl31 image.

  • Trusty: In order to include the ‘trusty’ dispatcher in the image, pass ‘SPD=trusty’ on the command line while preparing a bl31 image.

This allows other Trusted OS vendors to use the upstream code and include their dispatchers in the image without changing any makefiles.

These are the supported Trusted OS’ by Tegra platforms.

  • Tegra132: TLK

  • Tegra210: TLK and Trusty

  • Tegra186: Trusty

  • Tegra194: Trusty

12.3. Scatter files

Tegra platforms currently support scatter files and ld.S scripts. The scatter files help support ARMLINK linker to generate BL31 binaries. For now, there exists a common scatter file, plat/nvidia/tegra/scat/bl31.scat, for all Tegra SoCs. The LINKER build variable needs to point to the ARMLINK binary for the scatter file to be used. Tegra platforms have verified BL31 image generation with ARMCLANG (compilation) and ARMLINK (linking) for the Tegra186 platforms.

12.4. Preparing the BL31 image to run on Tegra SoCs

CROSS_COMPILE=<path-to-aarch64-gcc>/bin/aarch64-none-elf- make PLAT=tegra \
TARGET_SOC=<target-soc e.g. t194|t186|t210|t132> SPD=<dispatcher e.g. trusty|tlkd>
bl31

Platforms wanting to use different TZDRAM_BASE, can add TZDRAM_BASE=<value> to the build command line.

The Tegra platform code expects a pointer to the following platform specific structure via ‘x1’ register from the BL2 layer which is used by the bl31_early_platform_setup() handler to extract the TZDRAM carveout base and size for loading the Trusted OS and the UART port ID to be used. The Tegra memory controller driver programs this base/size in order to restrict NS accesses.

typedef struct plat_params_from_bl2 { /* TZ memory size / uint64_t tzdram_size; / TZ memory base / uint64_t tzdram_base; / UART port ID */ int uart_id; /* L2 ECC parity protection disable flag */ int l2_ecc_parity_prot_dis; /* SHMEM base address for storing the boot logs */ uint64_t boot_profiler_shmem_base; } plat_params_from_bl2_t;

12.5. Power Management

The PSCI implementation expects each platform to expose the ‘power state’ parameter to be used during the ‘SYSTEM SUSPEND’ call. The state-id field is implementation defined on Tegra SoCs and is preferably defined by tegra_def.h.

12.6. Tegra configs

  • ‘tegra_enable_l2_ecc_parity_prot’: This flag enables the L2 ECC and Parity Protection bit, for Arm Cortex-A57 CPUs, during CPU boot. This flag will be enabled by Tegrs SoCs during ‘Cluster power up’ or ‘System Suspend’ exit.