Linux Kernel Analysis: head.S
The following comment from head.S sets the stage for the kernel's early initialization: it spells out the machine state the boot-loader must hand over (MMU off, D-cache off, x0 pointing at the FDT) before the kernel enables the MMU and switches to virtual addressing.
/*
 * Kernel startup entry point.
 * ---------------------------
 *
 * The requirements are:
 *   MMU = off, D-cache = off, I-cache = on or off,
 *   x0 = physical address to the FDT blob.
 *
 * Note that the callee-saved registers are used for storing variables
 * that are useful before the MMU is enabled. The allocations are described
 * in the entry routines.
 */
Why does the kernel startup entry point not care about the I-cache on/off state?
During early startup, the kernel only needs guarantees that affect correctness. The MMU must be off and the D-cache must be off because the early code reads and writes memory through physical addresses and must both see and leave behind coherent data. The I-cache, by contrast, only affects how instructions are fetched, not what data is visible, so its state does not matter for correctness at this point; the kernel can perform any required I-cache maintenance itself later in the initialization process, once more advanced memory management is in place.
With the MMU off, the kernel runs in a flat memory model: virtual addresses equal physical addresses, and every access goes directly to physical memory with no translation. With the D-cache also off, data accesses are not cached, so any change to memory is immediately visible to all parts of the kernel and to other hardware.
The I-cache merely holds copies of recently fetched instructions. The kernel's initialization code is small and straightforward at this stage, is not modified while it runs, and executes straight out of memory, so whether instruction fetches hit a cache or go to memory changes only speed, not behavior. That is why the boot protocol allows the I-cache to be either on or off.
__HEAD is a macro defined in include/linux/init.h:
#define __HEAD		.section	".head.text","ax"
When the code is assembled, the assembly code inside the __HEAD macro will be placed in the .head.text section with the appropriate flags, ensuring that it is both allocatable ("a") and marked as executable ("x") in the final output.
The image header provides essential information for the boot-loader to load and execute the kernel properly. The boot-loader recognizes the signature (the EFI signature NOP) and the magic number to identify this as an ARM64 executable. It then uses the entry point (primary_entry) to start executing the kernel, and considers the load offset, kernel size, and flags to manage memory allocation and other operations correctly.
The image header conforms to the header format defined in Documentation/arm64/booting.rst, section 4.
	/*
	 * DO NOT MODIFY. Image header expected by Linux boot-loaders.
	 */
	efi_signature_nop		@ special NOP to identify as PE/COFF executable
	b	primary_entry		@ Branch to kernel start. The boot-loader will use this
					@ address to start executing the kernel code.
	.quad	0			@ Image load offset from start of RAM, little-endian
	le64sym	_kernel_size_le		@ Effective size of kernel image, little-endian
	le64sym	_kernel_flags_le	@ Informative flags, little-endian
	.quad	0			@ reserved
	.quad	0			@ reserved
	.quad	0			@ reserved
	.ascii	ARM64_IMAGE_MAGIC	@ Magic number
	.long	.Lpe_header_offset	@ Offset to the PE header.
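As a rough illustration of how a boot-loader might recognize this header, consider the following C sketch. The field layout follows Documentation/arm64/booting.rst section 4; the struct and function names here are made up for illustration:

#include <stdint.h>
#include <string.h>

/* Header layout per Documentation/arm64/booting.rst, section 4. */
struct arm64_image_header {
	uint32_t code0;			/* executable code (the "MZ" ccmp, or a plain nop) */
	uint32_t code1;			/* executable code (b primary_entry) */
	uint64_t text_offset;		/* image load offset from start of RAM, little-endian */
	uint64_t image_size;		/* effective image size, little-endian */
	uint64_t flags;			/* informative flags, little-endian */
	uint64_t res2, res3, res4;	/* reserved */
	uint32_t magic;			/* 0x644d5241 == "ARM\x64", little-endian */
	uint32_t res5;			/* offset to the PE header when CONFIG_EFI=y */
};

/* Returns 1 if the blob looks like an arm64 kernel image (little-endian host assumed). */
static int is_arm64_kernel(const void *blob)
{
	struct arm64_image_header h;

	memcpy(&h, blob, sizeof(h));	/* copy out to avoid alignment assumptions */
	return h.magic == 0x644d5241u;	/* "ARM\x64" */
}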
efi_signature_nop is a macro defined in arch/arm64/kernel/efi-header.S:
.macro efi_signature_nop
#ifdef CONFIG_EFI
.L_head:
	/*
	 * This ccmp instruction has no meaningful effect except that
	 * its opcode forms the magic "MZ" signature required by UEFI.
	 */
	ccmp	x18, #0, #0xd, pl	@ The ccmp instruction encodes to 0xFA405A4D, which
					@ appears as 4D 5A 40 FA in a kernel image built in
					@ little-endian mode; the first two bytes are 'M'
					@ and 'Z' respectively.
#else
	/*
	 * Bootloaders may inspect the opcode at the start of the kernel
	 * image to decide if the kernel is capable of booting via UEFI.
	 * So put an ordinary NOP here, not the "MZ.." pseudo-nop above.
	 */
	nop				@ simply consume one clock cycle, doing nothing meaningful
#endif
.endm
__INIT is a macro defined in include/linux/init.h:
#define __INIT		.section	".init.text","ax"
When the code is assembled, the assembly code inside the __INIT macro will be placed in the .init.text section with the appropriate flags, ensuring that it is both allocatable ("a") and marked as executable ("x") in the final output.
The record_mmu_state subroutine handles the endianness and MMU state at the different exception levels (ELs) of the ARM64 architecture. It toggles the endianness if it is wrong, clears the MMU enable bit, and applies any required workaround before writing the modified value back to the relevant system control register. This ensures that memory accesses issued before init_kernel_el() occur in the correct byte order and with the appropriate MMU state.
record_mmu_state records the MMU on/off state in x19 upon return:
If x19 == 0, cache maintenance (e.g., invalidation) will be necessary.
If x19 == 1, cache maintenance will not be necessary.
Think of the following four possible scenarios:
Cache on, MMU on (x19 = 1): cache invalidation is not necessary, since the cache already contains valid data.
Cache on, MMU off (x19 = 0): cache invalidation is necessary, since the validity of the cache contents is not guaranteed at the moment the MMU turns on.
Cache off, MMU on (x19 = 0): cache invalidation is necessary, since the validity of the cache contents is not guaranteed.
Cache off, MMU off (x19 = 0): cache invalidation is necessary, since the validity of the cache contents is not guaranteed at the moment the MMU turns on.
SYM_CODE_START_LOCAL(record_mmu_state)
	mrs	x19, CurrentEL		@ move the current exception level value into x19
	cmp	x19, #CurrentEL_EL2	@ is the current exception level EL2 (hypervisor mode)?
	mrs	x19, sctlr_el1
	b.ne	0f			@ if not EL2, branch to '0f'
	mrs	x19, sctlr_el2
0:
CPU_LE(	tbnz	x19, #SCTLR_ELx_EE_SHIFT, 1f	)	@ On a little-endian kernel, branch to '1f'
							@ if the EE bit is set (wrong byte order).
CPU_BE(	tbz	x19, #SCTLR_ELx_EE_SHIFT, 1f	)	@ On a big-endian kernel, branch to '1f'
							@ if the EE bit is clear (wrong byte order).
	tst	x19, #SCTLR_ELx_C	// Z := (C == 0)
	and	x19, x19, #SCTLR_ELx_M	// isolate M bit
	csel	x19, xzr, x19, eq	// clear x19 if Z
	ret

	/*
	 * Set the correct endianness early so all memory accesses issued
	 * before init_kernel_el() occur in the correct byte order. Note that
	 * this means the MMU must be disabled, or the active ID map will end
	 * up getting interpreted with the wrong byte order.
	 */
1:	eor	x19, x19, #SCTLR_ELx_EE	@ toggle the Endianness Enable (EE) bit
	bic	x19, x19, #SCTLR_ELx_M	@ clear the MMU enable bit
	b.ne	2f			@ if not EL2 (i.e., EL1), branch to '2f'
	pre_disable_mmu_workaround	@ A placeholder for code that handles a workaround
					@ before disabling the MMU.
	msr	sctlr_el2, x19		@ Write the modified x19 back to the
					@ System Control Register for EL2 (sctlr_el2).
	b	3f
2:	pre_disable_mmu_workaround
	msr	sctlr_el1, x19		@ write the modified x19 back to the
					@ System Control Register for EL1 (sctlr_el1)
3:	isb
	mov	x19, xzr		@ move zero into x19 (MMU recorded as off)
	ret				@ return from the subroutine
SYM_CODE_END(record_mmu_state)
Here x19 is a callee-saved register that carries the MMU state recorded by record_mmu_state across the early boot routines that follow.
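Following this document's C-representation convention, the logic of record_mmu_state can be sketched as follows (the register accessors and helper names are illustrative, not real kernel APIs):

uint64_t record_mmu_state(void)
{
	int el2 = (read_CurrentEL() == CurrentEL_EL2);
	uint64_t sctlr = el2 ? read_sctlr_el2() : read_sctlr_el1();
	int ee = (sctlr >> SCTLR_ELx_EE_SHIFT) & 1;

	if (KERNEL_IS_LE ? !ee : ee) {
		/* Endianness already correct: return the M bit, forced to 0 if C is off. */
		return (sctlr & SCTLR_ELx_C) ? (sctlr & SCTLR_ELx_M) : 0;
	}

	/* Wrong endianness: toggle EE, disable the MMU, and report the MMU as off. */
	sctlr ^= SCTLR_ELx_EE;
	sctlr &= ~SCTLR_ELx_M;
	pre_disable_mmu_workaround();
	if (el2)
		write_sctlr_el2(sctlr);
	else
		write_sctlr_el1(sctlr);
	isb();
	return 0;
}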
The preserve_boot_args subroutine ensures that the important boot-loader arguments are preserved in memory for the kernel's use. Additionally, it handles cache invalidation when the MMU is off, to ensure data consistency during the early stages of the kernel's execution.
/*
 * Preserve the arguments passed by the bootloader in x0 .. x3
 */
SYM_CODE_START_LOCAL(preserve_boot_args)
	mov	x21, x0			@ x21=FDT; save the Flattened Device Tree (FDT)
					@ start address in x21

	adr_l	x0, boot_args		@ Load x0 with the address of 'boot_args', the memory
					@ location where the boot-loader arguments are preserved
					@ for the kernel's reference. 'boot_args' is an array
					@ of four 64-bit elements.
	stp	x21, x1, [x0]		@ Store x21 and x1 in memory at the address in x0,
	stp	x2, x3, [x0, #16]	@ followed by x2 and x3.

	cbnz	x19, 0f			@ if x19 != 0 (i.e., MMU=on), skip the cache
					@ invalidation (jump to '0f')
	dmb	sy			@ Data Memory Barrier with the 'sy' (full system)
					@ option. It ensures that the stores above complete
					@ before the cache invalidation that follows with
					@ the MMU off.

	add	x1, x0, #0x20		@ x1 = x0 + 0x20 (4 x 8 bytes past the start), the
					@ end address for the following tail call.
	b	dcache_inval_poc	@ Tail call to the cache invalidation routine, which
					@ ensures cache coherency for the preserved data
					@ (and returns to our caller on our behalf).
0:	str_l	x19, mmu_enabled_at_boot, x0	@ Reached only when the invalidation is skipped:
						@ record that the MMU was enabled at boot in
						@ the variable 'mmu_enabled_at_boot' (x0 is
						@ used as a scratch register by str_l).
	ret
SYM_CODE_END(preserve_boot_args)
adr_l is a macro defined in arch/arm64/include/asm/assembler.h:
/*
 * Pseudo-ops for PC-relative adr/ldr/str <reg>, <symbol> where
 * <symbol> is within the range +/- 4 GB of the PC.
 */
/*
 * @dst: destination register (64 bit wide)
 * @sym: name of the symbol
 */
.macro adr_l, dst, sym
	adrp	\dst, \sym
	add	\dst, \dst, :lo12:\sym
.endm
Because the immediate field of a single instruction cannot encode a full 64-bit address, the adr_l macro composes one from a page address (adrp) plus a 12-bit page offset (:lo12:), giving a PC-relative reach of +/- 4 GB. This is a common arm64 idiom; its net effect is sketched below.
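For illustration, the two instructions together compute (pseudo-C; dst and sym are illustrative):

dst  = sym & ~0xfffull;	/* adrp: PC-relative page (4 KiB) address of the symbol */
dst += sym &  0xfffull;	/* add ..., :lo12: fills in the low 12 bits (page offset) */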
preserve_boot_args itself can be written as the following C code:
void preserve_boot_args(void)
{
	boot_args[0] = x21;	// fdt
	boot_args[1] = x1;
	boot_args[2] = x2;
	boot_args[3] = x3;
	if (!x19) {		// MMU off: invalidate the cached copies and return
		dcache_inval_poc(boot_args, boot_args + 4);
		return;		// (tail call in the assembly)
	}
	mmu_enabled_at_boot = x19;	// MMU was on at boot
}
The function dcache_inval_poc invalidates the data cache for a memory range [start, end). It handles cache-line alignment at both ends of the range: partial lines at the boundaries are cleaned and invalidated with dc civac (so that adjacent data outside the range is not lost), while fully covered lines are simply invalidated with dc ivac. The function iterates over each cache line in the range, ensuring coherency between the cache and memory.
/*
 * dcache_inval_poc(start, end)
 *
 * Ensure that any D-cache lines for the interval [start, end)
 * are invalidated. Any partial lines at the ends of the interval are
 * also cleaned to PoC to prevent data loss.
 *
 * - start - kernel start address of region
 * - end - kernel end address of region
 */
SYM_FUNC_START(__pi_dcache_inval_poc)
	dcache_line_size x2, x3	@ Load the size of a cache line into x2. (x3 is used as a
				@ scratch register inside this macro to hold the value of
				@ ctr_el0 bits [19:16].)
	sub	x3, x2, #1	@ Make x3 a bit-mask for the lower address bits, to be
				@ used for cache-line alignment.
	tst	x1, x3		@ Is x1 (the end address) cache line-aligned?
	bic	x1, x1, x3	@ Mask off the lower bits of x1 using x3 to make it cache
				@ line-aligned. E.g., if the cache line size is 64 bytes
				@ and x1 contained 70, x1 is adjusted to 64 (lower 6 bits
				@ dropped).
	b.eq	1f		@ If the end address was already cache line-aligned, skip
				@ the clean and jump to label 1.
	dc	civac, x1	@ Clean and invalidate the D-cache line containing the end
				@ address. The 'civac' operation combines cleaning
				@ (write-back) and invalidation of a single cache line.
1:	tst	x0, x3		@ Is x0 (the start address) cache line-aligned?
	bic	x0, x0, x3	@ Mask off the lower bits of x0 using x3 to make it cache
				@ line-aligned.
	b.eq	2f		@ If the start address was already aligned, jump to label 2
				@ to perform cache invalidation only.
	dc	civac, x0	@ Clean and invalidate the D-cache line containing the
				@ start address.
	b	3f		@ Branch to label 3 to continue with the remaining range.
2:	dc	ivac, x0	@ Invalidate the D-cache line at x0. This invalidates the
				@ line without writing its contents back to memory.
3:	add	x0, x0, x2	@ Advance x0 by one cache line size to handle the next line.
	cmp	x0, x1		@ Compare the updated start address x0 with the end address x1.
	b.lo	2b		@ If x0 is below x1, loop back to label 2 to invalidate the
				@ remaining cache lines.
	dsb	sy		@ Data Synchronization Barrier with the 'sy' (full system)
				@ option. It ensures that all cache maintenance operations
				@ above complete before returning.
	ret
SYM_FUNC_END(__pi_dcache_inval_poc)
SYM_FUNC_ALIAS(dcache_inval_poc, __pi_dcache_inval_poc)
SYM_FUNC_START(function_name) defines a symbol marking the start of the function_name function; SYM_FUNC_END closes it. These annotations are used for debugging and symbol tracking purposes.
dcache_line_size is a macro defined in arch/arm64/include/asm/assembler.h:
/*
 * dcache_line_size - get the safe D-cache line size across all CPUs
 */
.macro dcache_line_size, reg, tmp
	read_ctr	\tmp		// Read the Cache Type Register into \tmp.
	ubfm	\tmp, \tmp, #16, #19	// Extract bits [19:16] (DminLine) from \tmp and
					// store the result back in \tmp.
	mov	\reg, #4		// Bytes per word.
	lsl	\reg, \reg, \tmp	// reg <<= tmp (actual cache line size in bytes).
.endm
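In C, the macro reduces to roughly the following sketch (read_ctr_el0() is an illustrative stand-in for the read_ctr macro):

unsigned int dcache_line_size(void)
{
	/* CTR_EL0.DminLine (bits [19:16]) is log2 of the smallest line size in words. */
	unsigned int dminline = (read_ctr_el0() >> 16) & 0xf;

	return 4u << dminline;	/* 4 bytes per word */
}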
C representation of the dcache_inval_poc block above:
x2 = dcache_line_size();
x3 = x2 - 1;
if ((x1 & x3) != 0) {	// end address cache line-aligned?
	// not aligned
	x1 &= ~x3;
	dc_civac(x1);	// clean & invalidate D/U line
}
// 1:
if ((x0 & x3) != 0) {	// start address cache line-aligned?
	// not aligned
	x0 &= ~x3;
	dc_civac(x0);	// clean & invalidate D/U line
	x0 += x2;	// 'b 3f': this line is already handled, skip its ivac
}
// 2: 3:
while (x0 < x1) {	// cmp x0, x1; b.lo 2b
	dc_ivac(x0);	// invalidate D/U line
	x0 += x2;	// advance by one cache line
}
dsb_sy();
return;
create_idmap is responsible for setting up the identity mapping (ID map) for specific memory regions in the ARM64 Linux kernel. The ID map establishes a 1:1 mapping between physical and virtual addresses, allowing direct access to physical memory through virtual addresses.
SYM_FUNC_START_LOCAL(create_idmap)
	mov	x28, lr
	/*
	 * The ID map carries a 1:1 mapping of the physical address range
	 * covered by the loaded image, which could be anywhere in DRAM. This
	 * means that the required size of the VA (== PA) space is decided at
	 * boot time, and could be more than the configured size of the VA
	 * space for ordinary kernel and user space mappings.
	 *
	 * There are three cases to consider here:
	 * - 39 <= VA_BITS < 48, and the ID map needs up to 48 VA bits to cover
	 *   the placement of the image. In this case, we configure one extra
	 *   level of translation on the fly for the ID map only. (This case
	 *   also covers 42-bit VA/52-bit PA on 64k pages).
	 *
	 * - VA_BITS == 48, and the ID map needs more than 48 VA bits. This can
	 *   only happen when using 64k pages, in which case we need to extend
	 *   the root level table rather than add a level. Note that we can
	 *   treat this case as 'always extended' as long as we take care not
	 *   to program an unsupported T0SZ value into the TCR register.
	 *
	 * - Combinations that would require two additional levels of
	 *   translation are not supported, e.g., VA_BITS==36 on 16k pages, or
	 *   VA_BITS==39/4k pages with 5-level paging, where the input address
	 *   requires more than 47 or 48 bits, respectively.
	 */
#if (VA_BITS < 48)
#define IDMAP_PGD_ORDER	(VA_BITS - PGDIR_SHIFT)
#define EXTRA_SHIFT	(PGDIR_SHIFT + PAGE_SHIFT - 3)

	/*
	 * If VA_BITS < 48, we have to configure an additional table level.
	 * First, we have to verify our assumption that the current value of
	 * VA_BITS was chosen such that all translation levels are fully
	 * utilised, and that lowering T0SZ will always result in an additional
	 * translation level to be configured.
	 */
#if VA_BITS != EXTRA_SHIFT
#error "Mismatch between VA_BITS and page size/number of translation levels"
#endif
#else
#define IDMAP_PGD_ORDER	(PHYS_MASK_SHIFT - PGDIR_SHIFT)
#define EXTRA_SHIFT
	/*
	 * If VA_BITS == 48, we don't have to configure an additional
	 * translation level, but the top-level table has more entries.
	 */
#endif
	adrp	x0, init_idmap_pg_dir
	adrp	x3, _text
	adrp	x6, _end + MAX_FDT_SIZE + SWAPPER_BLOCK_SIZE
	mov	x7, SWAPPER_RX_MMUFLAGS

	map_memory x0, x1, x3, x6, x7, x3, IDMAP_PGD_ORDER, x10, x11, x12, x13, x14, EXTRA_SHIFT

	/* Remap the kernel page tables r/w in the ID map */
	adrp	x1, _text
	adrp	x2, init_pg_dir
	adrp	x3, init_pg_end
	bic	x4, x2, #SWAPPER_BLOCK_SIZE - 1
	mov	x5, SWAPPER_RW_MMUFLAGS
	mov	x6, #SWAPPER_BLOCK_SHIFT
	bl	remap_region

	/* Remap the FDT after the kernel image */
	adrp	x1, _text
	adrp	x22, _end + SWAPPER_BLOCK_SIZE
	bic	x2, x22, #SWAPPER_BLOCK_SIZE - 1
	bfi	x22, x21, #0, #SWAPPER_BLOCK_SHIFT	// remapped FDT address
	add	x3, x2, #MAX_FDT_SIZE + SWAPPER_BLOCK_SIZE
	bic	x4, x21, #SWAPPER_BLOCK_SIZE - 1
	mov	x5, SWAPPER_RW_MMUFLAGS
	mov	x6, #SWAPPER_BLOCK_SHIFT
	bl	remap_region

	/*
	 * Since the page tables have been populated with non-cacheable
	 * accesses (MMU disabled), invalidate those tables again to
	 * remove any speculatively loaded cache lines.
	 */
	cbnz	x19, 0f		// skip cache invalidation if MMU is on
	dmb	sy

	adrp	x0, init_idmap_pg_dir
	adrp	x1, init_idmap_pg_end
	bl	dcache_inval_poc
0:	ret	x28
SYM_FUNC_END(create_idmap)
The C function create_idmap() (defined in arch/arm64/mm/mmu.c) likewise creates an identity mapping for a specific memory region, the ID map; the identity mapping allows direct access to physical memory using virtual addresses without any translation.
static void __init create_idmap(void)
{
	/* '__pa_symbol' is a macro that retrieves the physical address of a symbol. */
	u64 start = __pa_symbol(__idmap_text_start);
	u64 size = __pa_symbol(__idmap_text_end) - start;
	/*
	 * Declare a pointer 'pgd' to a Page Global Directory (PGD) and
	 * initialize it with 'idmap_pg_dir', the Page Global Directory
	 * for the ID map.
	 */
	pgd_t *pgd = idmap_pg_dir;
	u64 pgd_phys;	/* Physical address of a newly allocated table, if needed. */

	/*
	 * The following code block checks if an additional level of translation
	 * is needed for the ID map.
	 * - 'VA_BITS' is the number of virtual address bits.
	 * - 'idmap_t0sz' holds the TCR.T0SZ value used for the ID map, i.e.,
	 *   64 minus the number of VA bits the ID map must cover.
	 */
	if (VA_BITS < 48 && idmap_t0sz < (64 - VA_BITS_MIN)) {
		/* Allocate memory for the extra table and store its PA into 'pgd_phys'. */
		pgd_phys = early_pgtable_alloc(PAGE_SHIFT);
		/*
		 * Set the entry in 'idmap_pg_dir' corresponding to the start address
		 * of the ID map. The entry points to the new next-level table at
		 * 'pgd_phys', with the type bits marking it as a table entry.
		 */
		set_pgd(&idmap_pg_dir[start >> VA_BITS], __pgd(pgd_phys | P4D_TYPE_TABLE));
		/* Convert the PA 'pgd_phys' to a VA using '__va' and store it in 'pgd'. */
		pgd = __va(pgd_phys);
	}
	/*
	 * '__create_pgd_mapping' is a helper that populates 'pgd' with a mapping
	 * from virtual to physical addresses. Here it sets up the identity mapping
	 * (VA == PA) for the ID map text, read-only and executable.
	 */
	__create_pgd_mapping(pgd, start, start, size, PAGE_KERNEL_ROX, early_pgtable_alloc, 0);

	/*
	 * The following code block creates a mapping for the KPTI (Kernel Page
	 * Table Isolation) synchronization flag if CONFIG_UNMAP_KERNEL_AT_EL0 is
	 * enabled.
	 * - 'CONFIG_UNMAP_KERNEL_AT_EL0' is the kernel configuration option that
	 *   enables the KPTI feature.
	 * - '__idmap_kpti_flag' is a symbol representing the KPTI synchronization
	 *   flag.
	 */
	if (IS_ENABLED(CONFIG_UNMAP_KERNEL_AT_EL0)) {
		extern u32 __idmap_kpti_flag;
		u64 pa = __pa_symbol(&__idmap_kpti_flag);

		/*
		 * '__create_pgd_mapping()' is called again to identity-map the KPTI
		 * synchronization flag, this time read-write ('PAGE_KERNEL').
		 */
		__create_pgd_mapping(pgd, pa, pa, sizeof(u32), PAGE_KERNEL,
				     early_pgtable_alloc, 0);
	}
}