Linux Kernel Analysis: head.S
The following comment sets the stage for the kernel's early initialization: key hardware settings and data structures must be in place before the MMU is enabled and the kernel switches over to virtual addressing and its full memory-management machinery.
```asm
/*
 * Kernel startup entry point.
 * ---------------------------
 *
 * The requirements are:
 *   MMU = off, D-cache = off, I-cache = on or off,
 *   x0 = physical address to the FDT blob.
 *
 * Note that the callee-saved registers are used for storing variables
 * that are useful before the MMU is enabled. The allocations are described
 * in the entry routines.
 */
```

Why does the kernel startup entry point not care about the I-cache on/off state?
During early startup, the primary concern is to get the basic hardware setup, data structures, and essential features such as the MMU correctly initialized. The I-cache state is not critical at this point and can be dealt with later in the initialization process, once virtual addressing and full memory management are in place. The entry code therefore insists on MMU off and D-cache off, but accepts the I-cache in either state.
When the MMU is off and the D-cache is off, the kernel operates in a simple flat memory model, where virtual addresses are the same as physical addresses. In this mode, there is no translation of virtual addresses to physical addresses, and memory accesses are performed directly using physical addresses.
The role of the D-cache is to store frequently accessed data to speed up memory access. However, with the D-cache off and MMU off, memory accesses are not cached, so any changes to memory will be immediately visible to all parts of the kernel and other hardware.
On the other hand, the role of the I-cache is to store frequently accessed instructions to speed up instruction fetches. Whether the I-cache is on or off may not have a significant impact at this very early stage of kernel startup. The kernel's initialization code is relatively small and straightforward at this point, and the code execution mainly involves a sequence of instructions that set up essential data structures, manage hardware, and enable features like MMU and caches.
Since the kernel executes directly from memory at this stage, the I-cache state affects performance rather than correctness: the fetched instructions are exactly the ones the bootloader placed in memory, and the kernel performs the cache maintenance it needs itself before turning on the MMU and caches.
__HEAD is a macro defined in include/linux/init.h:
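```c
/* From include/linux/init.h (for assembly routines): */
#define __HEAD		.section	".head.text","ax"
```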
When the code is assembled, everything following the __HEAD marker is placed in the .head.text section, flagged as allocatable (a) and executable (x) in the final output. The linker script places .head.text at the very start of the kernel image.
The image header provides the information a boot-loader needs to load and execute the kernel properly. The bootloader recognizes the signature (the EFI-signature NOP) and the magic number that identify an ARM64 executable, starts execution at the entry point (primary_entry), and uses the load offset, kernel size, and flags to place the image in memory correctly.

The image header conforms to the format defined in Documentation/arm64/booting.rst, section 4.
```asm
	/*
	 * DO NOT MODIFY. Image header expected by Linux boot-loaders.
	 */
	efi_signature_nop		// special NOP to identify as a PE/COFF executable
	b	primary_entry		// Branch to kernel start. The bootloader will use
					// this address to start executing the kernel code.
	.quad	0			// Image load offset from start of RAM, little-endian
	le64sym	_kernel_size_le		// Effective size of kernel image, little-endian
	le64sym	_kernel_flags_le	// Informative flags, little-endian
	.quad	0			// reserved
	.quad	0			// reserved
	.quad	0			// reserved
	.ascii	ARM64_IMAGE_MAGIC	// Magic number
	.long	.Lpe_header_offset	// Offset to the PE header.
```
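For reference, booting.rst (section 4) describes this 64-byte header field by field; sketched as a C struct, it looks like the following. The field names (code0, text_offset, res2, ...) come from that document, not from head.S itself:

```c
#include <stdint.h>

/* arm64 image header layout per Documentation/arm64/booting.rst, section 4 */
struct arm64_image_header {
	uint32_t code0;		/* Executable code (efi_signature_nop)      */
	uint32_t code1;		/* Executable code (b primary_entry)        */
	uint64_t text_offset;	/* Image load offset, little-endian         */
	uint64_t image_size;	/* Effective Image size, little-endian      */
	uint64_t flags;		/* Kernel flags, little-endian              */
	uint64_t res2;		/* Reserved                                 */
	uint64_t res3;		/* Reserved                                 */
	uint64_t res4;		/* Reserved                                 */
	uint32_t magic;		/* Magic number "ARM\x64", little-endian    */
	uint32_t res5;		/* Reserved (holds the PE header offset)    */
};
```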
efi_signature_nop is a macro defined in the file arch/arm64/kernel/efi-header.S:

```asm
	.macro	efi_signature_nop
#ifdef CONFIG_EFI
.L_head:
	/*
	 * This ccmp instruction has no meaningful effect except that
	 * its opcode forms the magic "MZ" signature required by UEFI.
	 */
	ccmp	x18, #0, #0xd, pl	// this ccmp instruction encodes as 0xFA405A4D,
					// which appears as 4D 5A 40 FA in a kernel image
					// built in little-endian mode; the first two
					// bytes are 'M' and 'Z' respectively
#else
	/*
	 * Bootloaders may inspect the opcode at the start of the kernel
	 * image to decide if the kernel is capable of booting via UEFI.
	 * So put an ordinary NOP here, not the "MZ.." pseudo-nop above.
	 */
	nop				// an ordinary NOP that does nothing meaningful
#endif
	.endm
```
__INIT is a macro defined in include/linux/init.h:
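```c
/* From include/linux/init.h (for assembly routines): */
#define __INIT		.section	".init.text","ax"
```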
When the code is assembled, everything following the __INIT marker is placed in the .init.text section, flagged as allocatable (a) and executable (x) in the final output. Code in .init.text is used only during initialization, and the memory it occupies is freed once boot completes.
The record_mmu_state subroutine records the endianness and MMU state for the current exception level (EL1 or EL2) of the ARM64 processor. If the active endianness does not match the kernel's, it toggles the Endianness Enable bit, clears the MMU enable bit, applies any required workaround, and writes the modified value back to the relevant system control register. This ensures that memory accesses issued before init_kernel_el() occur in the correct byte order and with the appropriate MMU state.
record_mmu_state records the MMU on/off state in x19 upon return:

- If x19 == 0, cache maintenance (e.g., invalidation) will be necessary.
- If x19 == 1, cache maintenance will not be necessary.

Think of the following four possible scenarios:

- Cache on, MMU on (x19 = 1): cache invalidation is not necessary, since the cache already contains valid data.
- Cache on, MMU off (x19 = 0): cache invalidation is necessary, since the validity of the cache contents will not be guaranteed at the moment the MMU turns on.
- Cache off, MMU on (x19 = 0): cache invalidation is necessary, since the validity of the cache contents is not guaranteed.
- Cache off, MMU off (x19 = 0): cache invalidation is necessary, since the validity of the cache contents will not be guaranteed at the moment the MMU turns on.
```asm
SYM_CODE_START_LOCAL(record_mmu_state)
	mrs	x19, CurrentEL			// read the current exception level into x19
	cmp	x19, #CurrentEL_EL2		// is the current exception level EL2 (hypervisor mode)?
	mrs	x19, sctlr_el1
	b.ne	0f				// if not EL2, branch to '0f'
	mrs	x19, sctlr_el2
0:
CPU_LE( tbnz	x19, #SCTLR_ELx_EE_SHIFT, 1f )	// little-endian kernel: branch to '1f'
						// if the EE bit is set (wrong byte order)
CPU_BE( tbz	x19, #SCTLR_ELx_EE_SHIFT, 1f )	// big-endian kernel: branch to '1f'
						// if the EE bit is clear (wrong byte order)
	tst	x19, #SCTLR_ELx_C		// Z := (C == 0)
	and	x19, x19, #SCTLR_ELx_M		// isolate M bit
	csel	x19, xzr, x19, eq		// clear x19 if Z
	ret

	/*
	 * Set the correct endianness early so all memory accesses issued
	 * before init_kernel_el() occur in the correct byte order. Note that
	 * this means the MMU must be disabled, or the active ID map will end
	 * up getting interpreted with the wrong byte order.
	 */
1:	eor	x19, x19, #SCTLR_ELx_EE		// toggle the Endianness Enable bit
	bic	x19, x19, #SCTLR_ELx_M		// clear the MMU enable bit
	b.ne	2f				// if EL1 (flags still set by the cmp above),
						// branch to '2f'
	pre_disable_mmu_workaround		// placeholder for errata workarounds applied
						// before disabling the MMU
	msr	sctlr_el2, x19			// write the modified x19 back to the
						// System Control Register for EL2 (sctlr_el2)
	b	3f
2:	pre_disable_mmu_workaround
	msr	sctlr_el1, x19			// write the modified x19 back to the
						// System Control Register for EL1 (sctlr_el1)
3:	isb
	mov	x19, xzr			// x19 := 0 (the MMU is now known to be off)
	ret					// return from the subroutine
SYM_CODE_END(record_mmu_state)
```
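Following the C-representation style used later in these notes, the logic can be sketched as below. The declarations at the top are illustrative stand-ins for system registers and helpers, not kernel APIs:

```c
#include <stdint.h>

/* Illustrative stand-ins, not kernel APIs */
extern uint64_t CurrentEL, sctlr_el1, sctlr_el2;
extern int endianness_mismatch(uint64_t sctlr);
extern void write_sctlr(uint64_t val);
#define SCTLR_ELx_EE	(1UL << 25)	/* Endianness Enable bit */
#define SCTLR_ELx_C	(1UL << 2)	/* D-cache enable bit */
#define SCTLR_ELx_M	(1UL << 0)	/* MMU enable bit */
#define CurrentEL_EL2	(2UL << 2)

/* Sketch of record_mmu_state; returns the value left in x19. */
uint64_t record_mmu_state(void)
{
	uint64_t x19 = (CurrentEL == CurrentEL_EL2) ? sctlr_el2 : sctlr_el1;

	if (endianness_mismatch(x19)) {		/* EE bit != kernel's byte order */
		x19 ^= SCTLR_ELx_EE;		/* toggle the Endianness Enable bit */
		x19 &= ~SCTLR_ELx_M;		/* clear the MMU enable bit */
		write_sctlr(x19);		/* sctlr_el2 or sctlr_el1, then isb */
		return 0;			/* MMU now off: invalidation needed */
	}

	if (!(x19 & SCTLR_ELx_C))		/* D-cache off? */
		return 0;			/* invalidation needed */
	return x19 & SCTLR_ELx_M;		/* 1 iff D-cache and MMU are both on */
}
```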
x19 is used throughout head.S to hold the recorded MMU state: after record_mmu_state returns, it is 1 if the MMU and D-cache were both enabled at boot, and 0 otherwise.
The preserve_boot_args subroutine saves the bootloader arguments (x0 .. x3) in memory for the kernel's later use. Additionally, when the MMU is off, it invalidates the corresponding cache lines so that the preserved data stays consistent during the early stages of the kernel's execution.
```asm
/*
 * Preserve the arguments passed by the bootloader in x0 .. x3
 */
SYM_CODE_START_LOCAL(preserve_boot_args)
	mov	x21, x0				// x21 = FDT; stash the Flattened-Device-Tree
						// (FDT) start address in x21

	adr_l	x0, boot_args			// load x0 with the address of 'boot_args', an
						// array of four 64-bit elements in which the
						// bootloader arguments are preserved for the
						// kernel's later reference
	stp	x21, x1, [x0]			// store x21 and x1 at the address in x0,
	stp	x2, x3, [x0, #16]		// followed by x2 and x3

	cbnz	x19, 0f				// if x19 != 0 (i.e., MMU on), skip the cache
						// invalidation and jump to '0f'
	dmb	sy				// Data Memory Barrier with the 'sy' (full
						// system) option: all memory accesses before it
						// must complete before the cache invalidation
						// below, since the MMU is off

	add	x1, x0, #0x20			// x1 = x0 + 0x20 (4 x 8 bytes past the start of
						// boot_args), the end address for the tail call
	b	dcache_inval_poc		// tail call to the cache invalidation routine,
						// ensuring cache coherency for the preserved data
0:	str_l	x19, mmu_enabled_at_boot, x0	// record whether the MMU was enabled at boot in
						// the variable 'mmu_enabled_at_boot' (x0 serves
						// as a scratch register for the str_l macro);
						// kept for later reference and debugging
	ret
SYM_CODE_END(preserve_boot_args)
```
adr_l is a macro defined in arch/arm64/include/asm/assembler.h:

```asm
	/*
	 * Pseudo-ops for PC-relative adr/ldr/str <reg>, <symbol> where
	 * <symbol> is within the range +/- 4 GB of the PC.
	 */
	/*
	 * @dst: destination register (64 bit wide)
	 * @sym: name of the symbol
	 */
	.macro	adr_l, dst, sym
	adrp	\dst, \sym
	add	\dst, \dst, :lo12:\sym
	.endm
```

Because the immediate field of a single adr instruction is too short to reach an arbitrary 64-bit address, the adr_l macro is introduced: it combines the "page address" (adrp) with the low 12-bit "offset" (:lo12:) to form the full address. This pattern is commonly seen in arm64 code.
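In rough C terms (an illustrative sketch, not kernel code), the two instructions compute:

```c
#include <stdint.h>

/* Illustrative sketch of 'adr_l dst, sym' (not kernel code). */
static uint64_t adr_l(uint64_t sym)
{
	uint64_t page = sym & ~0xfffULL;	/* adrp: 4 KB page address of the symbol, */
						/* reached PC-relatively within +/- 4 GB  */
	uint64_t lo12 = sym & 0xfffULL;		/* :lo12: offset within that page         */
	return page + lo12;			/* add \dst, \dst, :lo12:\sym             */
}
```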
Can be written as the following C code:
```c
void preserve_boot_args(void) {
	boot_args[0] = x21;	// FDT
	boot_args[1] = x1;
	boot_args[2] = x2;
	boot_args[3] = x3;
	/* only taken when the MMU was off at boot (x19 == 0): */
	dcache_inval_poc(boot_args, boot_args + 4);
	/* only taken when the MMU was on at boot (x19 != 0): */
	mmu_enabled_at_boot = x19;
}
```
The function dcache_inval_poc invalidates the D-cache lines covering the memory range [start, end). It aligns both addresses down to cache-line boundaries, applies dc civac (clean and invalidate) to any partial line at either end of the range so that adjacent data is not lost, and applies dc ivac (invalidate only) to the fully covered lines, iterating one cache line at a time to keep cache and memory coherent.
```asm
/*
 *	dcache_inval_poc(start, end)
 *
 *	Ensure that any D-cache lines for the interval [start, end)
 *	are invalidated. Any partial lines at the ends of the interval are
 *	also cleaned to PoC to prevent data loss.
 *
 *	- start   - kernel start address of region
 *	- end     - kernel end address of region
 */
SYM_FUNC_START(__pi_dcache_inval_poc)
	dcache_line_size x2, x3		// load the D-cache line size into x2 (x3 is used
					// as a scratch register inside the macro to hold
					// ctr_el0 bits [19:16])
	sub	x3, x2, #1		// turn x3 into a bit-mask for the offset within
					// a cache line, used to check/force alignment
	tst	x1, x3			// is x1 (the end address) cache-line aligned?
	bic	x1, x1, x3		// mask off the low bits of x1 to align it down,
					// e.g. with 64-byte lines, 70 becomes 64
	b.eq	1f			// if the end address was already aligned, skip
					// the partial-line maintenance and jump to '1f'
	dc	civac, x1		// clean and invalidate the D-cache line holding
					// the (unaligned) end of the range; 'civac'
					// combines cleaning (write-back) and invalidation
1:	tst	x0, x3			// is x0 (the start address) cache-line aligned?
	bic	x0, x0, x3		// align x0 down to a cache-line boundary
	b.eq	2f			// if already aligned, jump to '2f' and perform
					// invalidation only
	dc	civac, x0		// clean and invalidate the partial line at the
					// start of the range
	b	3f			// branch to '3f' to advance to the next line
2:	dc	ivac, x0		// invalidate the line at x0 without writing its
					// contents back to memory
3:	add	x0, x0, x2		// advance x0 by one cache line
	cmp	x0, x1			// compare the updated start address with the end
	b.lo	2b			// while x0 < x1, loop back to '2b' to continue
					// invalidating the remaining cache lines
	dsb	sy			// Data Synchronization Barrier with the 'sy'
					// option: all cache maintenance above completes
					// before the function returns
	ret
SYM_FUNC_END(__pi_dcache_inval_poc)
SYM_FUNC_ALIAS(dcache_inval_poc, __pi_dcache_inval_poc)
```
SYM_FUNC_START(function_name) is a macro that marks the start of the function_name function and emits the corresponding symbol, which is used for debugging and symbol tracking purposes.
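As a point of reference, the macro comes from include/linux/linkage.h and expands roughly as follows (simplified; worth checking against your kernel version):

```c
/* From include/linux/linkage.h (simplified) */
#define SYM_FUNC_START(name)				\
	SYM_START(name, SYM_L_GLOBAL, SYM_A_ALIGN)

#define SYM_FUNC_START_LOCAL(name)			\
	SYM_START(name, SYM_L_LOCAL, SYM_A_ALIGN)
```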
dcache_line_size is a macro defined in arch/arm64/include/asm/assembler.h:

```asm
	/*
	 * dcache_line_size - get the safe D-cache line size across all CPUs
	 */
	.macro	dcache_line_size, reg, tmp
	read_ctr	\tmp		// read the Cache Type Register into \tmp
	ubfm	\tmp, \tmp, #16, #19	// extract bits [19:16] from \tmp and store the
					// result back in \tmp
	mov	\reg, #4		// bytes per word
	lsl	\reg, \reg, \tmp	// reg <<= tmp (actual cache line size in bytes)
	.endm				// end of the macro definition
```
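The computation the macro performs can be sketched in C as below (a simplified model; in the real macro, ctr_el0 is read via read_ctr, which also accounts for CPU errata):

```c
#include <stdint.h>

/* Simplified model of dcache_line_size: derive the line size from CTR_EL0. */
static uint64_t dcache_line_size(uint64_t ctr_el0)
{
	uint64_t dminline = (ctr_el0 >> 16) & 0xf;	/* bits [19:16]: log2(words per line) */
	return 4ULL << dminline;			/* 4 bytes per word */
}
```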
C representation of dcache_inval_poc:
```c
x2 = dcache_line_size();
x3 = x2 - 1;				// mask for the offset within a cache line
if ((x1 & x3) != 0) {			// end address cache-line aligned?
	x1 &= ~x3;			// not aligned: align down
	dc_civac(x1);			// clean & invalidate the partial line at the end
}
// 1:
if ((x0 & x3) != 0) {			// start address cache-line aligned?
	x0 &= ~x3;			// not aligned: align down
	dc_civac(x0);			// clean & invalidate the partial line at the start
	goto advance;			// 'b 3f': this line is already handled
}
loop:					// 2:
	dc_ivac(x0);			// invalidate a whole line (no write-back)
advance:				// 3:
	x0 += x2;			// x0 += dcache_line_size
	if (x0 < x1)
		goto loop;		// cmp x0, x1; b.lo 2b
dsb_sy();
return;
```
create_idmap is responsible for setting up the identity mapping (ID map) for specific memory regions in the ARM64 Linux kernel. The ID map establishes a 1:1 mapping between physical addresses and virtual addresses, allowing direct access to physical memory using virtual addresses.
```asm
SYM_FUNC_START_LOCAL(create_idmap)
	mov	x28, lr
	/*
	 * The ID map carries a 1:1 mapping of the physical address range
	 * covered by the loaded image, which could be anywhere in DRAM. This
	 * means that the required size of the VA (== PA) space is decided at
	 * boot time, and could be more than the configured size of the VA
	 * space for ordinary kernel and user space mappings.
	 *
	 * There are three cases to consider here:
	 * - 39 <= VA_BITS < 48, and the ID map needs up to 48 VA bits to cover
	 *   the placement of the image. In this case, we configure one extra
	 *   level of translation on the fly for the ID map only. (This case
	 *   also covers 42-bit VA/52-bit PA on 64k pages).
	 *
	 * - VA_BITS == 48, and the ID map needs more than 48 VA bits. This can
	 *   only happen when using 64k pages, in which case we need to extend
	 *   the root level table rather than add a level. Note that we can
	 *   treat this case as 'always extended' as long as we take care not
	 *   to program an unsupported T0SZ value into the TCR register.
	 *
	 * - Combinations that would require two additional levels of
	 *   translation are not supported, e.g., VA_BITS==36 on 16k pages, or
	 *   VA_BITS==39/4k pages with 5-level paging, where the input address
	 *   requires more than 47 or 48 bits, respectively.
	 */
#if (VA_BITS < 48)
#define IDMAP_PGD_ORDER	(VA_BITS - PGDIR_SHIFT)
#define EXTRA_SHIFT	(PGDIR_SHIFT + PAGE_SHIFT - 3)

	/*
	 * If VA_BITS < 48, we have to configure an additional table level.
	 * First, we have to verify our assumption that the current value of
	 * VA_BITS was chosen such that all translation levels are fully
	 * utilised, and that lowering T0SZ will always result in an additional
	 * translation level to be configured.
	 */
#if VA_BITS != EXTRA_SHIFT
#error "Mismatch between VA_BITS and page size/number of translation levels"
#endif
#else
#define IDMAP_PGD_ORDER	(PHYS_MASK_SHIFT - PGDIR_SHIFT)
#define EXTRA_SHIFT
	/*
	 * If VA_BITS == 48, we don't have to configure an additional
	 * translation level, but the top-level table has more entries.
	 */
#endif
	adrp	x0, init_idmap_pg_dir
	adrp	x3, _text
	adrp	x6, _end + MAX_FDT_SIZE + SWAPPER_BLOCK_SIZE
	mov	x7, SWAPPER_RX_MMUFLAGS

	map_memory x0, x1, x3, x6, x7, x3, IDMAP_PGD_ORDER, x10, x11, x12, x13, x14, EXTRA_SHIFT

	/* Remap the kernel page tables r/w in the ID map */
	adrp	x1, _text
	adrp	x2, init_pg_dir
	adrp	x3, init_pg_end
	bic	x4, x2, #SWAPPER_BLOCK_SIZE - 1
	mov	x5, SWAPPER_RW_MMUFLAGS
	mov	x6, #SWAPPER_BLOCK_SHIFT
	bl	remap_region

	/* Remap the FDT after the kernel image */
	adrp	x1, _text
	adrp	x22, _end + SWAPPER_BLOCK_SIZE
	bic	x2, x22, #SWAPPER_BLOCK_SIZE - 1
	bfi	x22, x21, #0, #SWAPPER_BLOCK_SHIFT	// remapped FDT address
	add	x3, x2, #MAX_FDT_SIZE + SWAPPER_BLOCK_SIZE
	bic	x4, x21, #SWAPPER_BLOCK_SIZE - 1
	mov	x5, SWAPPER_RW_MMUFLAGS
	mov	x6, #SWAPPER_BLOCK_SHIFT
	bl	remap_region

	/*
	 * Since the page tables have been populated with non-cacheable
	 * accesses (MMU disabled), invalidate those tables again to
	 * remove any speculatively loaded cache lines.
	 */
	cbnz	x19, 0f				// skip cache invalidation if MMU is on
	dmb	sy

	adrp	x0, init_idmap_pg_dir
	adrp	x1, init_idmap_pg_end
	bl	dcache_inval_poc
0:	ret	x28
SYM_FUNC_END(create_idmap)
```
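At a high level, a loose sketch in the style of the earlier C representations (the helper names mirror the assembly above; fdt_window is a made-up name for the block reserved after the image, not a kernel symbol):

```c
void create_idmap(void) {
	/* 1:1 map [_text, _end + MAX_FDT_SIZE + SWAPPER_BLOCK_SIZE), read-only + exec */
	map_memory(init_idmap_pg_dir, _text,
		   _end + MAX_FDT_SIZE + SWAPPER_BLOCK_SIZE, SWAPPER_RX_MMUFLAGS);
	/* make the region holding the kernel page tables writable in the ID map */
	remap_region(init_pg_dir, init_pg_end, SWAPPER_RW_MMUFLAGS);
	/* map the FDT (address in x21) into the window reserved after the image */
	remap_region(fdt_window, fdt_window + MAX_FDT_SIZE, SWAPPER_RW_MMUFLAGS);
	/* the tables were written with the MMU off: drop stale cache lines */
	if (!mmu_enabled_at_boot)	/* x19 == 0 */
		dcache_inval_poc(init_idmap_pg_dir, init_idmap_pg_end);
}
```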
The C function create_idmap() (in arch/arm64/mm/mmu.c) creates the identity mapping for the __idmap_text region of the kernel, the ID map used at runtime. As before, the identity mapping allows direct access to physical memory through virtual addresses that are equal to the physical ones, without any translation.
```c
static void __init create_idmap(void)
{
	/* '__pa_symbol' is a macro that retrieves the physical address of a symbol. */
	u64 start = __pa_symbol(__idmap_text_start);
	u64 size = __pa_symbol(__idmap_text_end) - start;
	/*
	 * 'pgd' points to the Page Global Directory (PGD) for the ID map,
	 * initialized with the address of 'idmap_pg_dir'.
	 */
	pgd_t *pgd = idmap_pg_dir;
	u64 pgd_phys;	/* physical address of a newly allocated root table, if needed */

	/*
	 * Check whether an additional level of translation is needed for the
	 * ID map:
	 * - 'VA_BITS' is the number of virtual address bits.
	 * - 'idmap_t0sz' holds the T0SZ value (and thus the input address
	 *   size) used for the ID map in TTBR0.
	 */
	if (VA_BITS < 48 && idmap_t0sz < (64 - VA_BITS_MIN)) {
		/* Allocate memory for the extra root table; store its PA in 'pgd_phys'. */
		pgd_phys = early_pgtable_alloc(PAGE_SHIFT);
		/*
		 * Set the entry in 'idmap_pg_dir' covering the start address of
		 * the ID map so that it points to the newly allocated table,
		 * with the type bits marking it as a table entry.
		 */
		set_pgd(&idmap_pg_dir[start >> VA_BITS], __pgd(pgd_phys | P4D_TYPE_TABLE));
		/* Convert 'pgd_phys' to a VA with '__va' and continue with the new table. */
		pgd = __va(pgd_phys);
	}
	/*
	 * '__create_pgd_mapping' is a helper that populates the page tables so
	 * that the [start, start + size) physical range is mapped 1:1
	 * (VA == PA), read-only and executable (PAGE_KERNEL_ROX).
	 */
	__create_pgd_mapping(pgd, start, start, size, PAGE_KERNEL_ROX,
			     early_pgtable_alloc, 0);

	/*
	 * If CONFIG_UNMAP_KERNEL_AT_EL0 (the KPTI, Kernel Page Table Isolation,
	 * feature) is enabled, also map the KPTI synchronization flag
	 * '__idmap_kpti_flag' into the ID map.
	 */
	if (IS_ENABLED(CONFIG_UNMAP_KERNEL_AT_EL0)) {
		extern u32 __idmap_kpti_flag;
		u64 pa = __pa_symbol(&__idmap_kpti_flag);

		/*
		 * Map the KPTI synchronization flag 1:1 at its physical
		 * address, read-write ('PAGE_KERNEL').
		 */
		__create_pgd_mapping(pgd, pa, pa, sizeof(u32), PAGE_KERNEL,
				     early_pgtable_alloc, 0);
	}
}
```