This document gives a broad and complete picture of the Zynq startup process, with particular focus on how the First Stage Bootloader (FSBL) works, including its use in the Program Flash operation. Special emphasis is given to non-standard configurations, particularly those not using DDR RAM and those using execute-in-place (XIP). The discussion focuses on QSPI flash based designs because that is what I am familiar with. However, most of it is directly applicable to other types of NV memory, and QSPI, with its linear mode, probably requires the most detail anyway.
OVERVIEW. Zynq startup begins in a ROM bootloader. A bootmode register is read at power-up and tells the ROM bootloader where to look for boot code. In our systems that is QSPI flash. JTAG can generally pre-empt whatever booting is done, provided that the register configuration performed during boot is compatible with what JTAG needs and has not disabled JTAG.
There are two aspects to creating a bootable system. One is creating the image to go into NV memory. The other is putting it there. Both involve the boot process.
BOOTING. The first stage bootloader is the first program loaded by the ROM bootloader. It is also used by the Program Flash operation (starting with 2017.3). In simplified terms, it does the following:
Initializes the PS hardware. There are registers for everything from clock PLLs to MIO configuration and DDR controller setup. Vivado creates three ps7_init files: one is C source, one is TCL and one is XML documentation. The TCL version is used by JTAG for debugging and running. The C version becomes part of FSBL.
A normal FSBL then checks for DDR memory and tests that it can write and read it. It assumes that the application will be loaded there. That is a bit of a brain-dead and unnecessary assumption, one that must be removed in a system without DDR.
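As a rough mental model of what running the C or TCL initialization amounts to: it is largely a long sequence of masked register writes, where only selected bits of each register change. The sketch below models that on a host with a simulated register file; the struct and function names are hypothetical, and the generated ps7_init.c is organized differently in detail.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry in a register-initialization table. */
typedef struct {
    uint32_t offset; /* byte offset into the simulated register file */
    uint32_t mask;   /* bits this entry is allowed to change */
    uint32_t value;  /* new value for the masked bits */
} mask_write_t;

/* Apply each entry as a read-modify-write: bits set in mask take the new
 * value, all other bits keep their current contents. */
static void apply_mask_writes(uint32_t *regs, size_t nregs,
                              const mask_write_t *table, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        size_t idx = table[i].offset / 4u;
        if (idx >= nregs) {
            continue; /* outside the simulated register file: ignore */
        }
        regs[idx] = (regs[idx] & ~table[i].mask) |
                    (table[i].value & table[i].mask);
    }
}
```

The masking matters because many PS registers mix fields owned by different subsystems; a plain write would clobber neighboring fields.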
Then it uses the bootmode register to determine what the boot source should be. It is noteworthy here that if bootmode is JTAG, no further action is taken. The code ends in an infinite loop waiting for JTAG to disrupt it. It is this feature that allows Program Flash to use the FSBL as a configuration tool.
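The boot source decision can be sketched as a decode of the low bits of the boot mode register. The encodings below are as I recall them from the FSBL's fsbl.h and should be treated as assumptions to check against the header shipped with your tool version; the enum and function names are hypothetical.

```c
#include <stdint.h>

/* Boot mode encodings (assumed, from memory of fsbl.h). */
#define BOOT_MODES_MASK  0x7u
#define JTAG_MODE        0x0u
#define QSPI_MODE        0x1u
#define NOR_FLASH_MODE   0x2u
#define NAND_FLASH_MODE  0x4u
#define SD_MODE          0x5u
#define MMC_MODE         0x6u

/* Hypothetical result type for this sketch. */
typedef enum {
    BOOT_JTAG, BOOT_QSPI, BOOT_NOR, BOOT_NAND, BOOT_SD, BOOT_MMC, BOOT_UNKNOWN
} boot_src_t;

/* Decode the raw register value; only the low three bits are examined. */
static boot_src_t decode_boot_mode(uint32_t boot_mode_reg)
{
    switch (boot_mode_reg & BOOT_MODES_MASK) {
    case JTAG_MODE:       return BOOT_JTAG; /* FSBL parks, waiting for JTAG */
    case QSPI_MODE:       return BOOT_QSPI;
    case NOR_FLASH_MODE:  return BOOT_NOR;
    case NAND_FLASH_MODE: return BOOT_NAND;
    case SD_MODE:         return BOOT_SD;
    case MMC_MODE:        return BOOT_MMC;
    default:              return BOOT_UNKNOWN;
    }
}
```

In the real FSBL a JTAG result leads straight to the JTAG handoff routine, which is the behavior Program Flash relies on.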
It then loads a header structure from the boot source and starts processing the partitions. Normally the first partition is the FSBL itself; since it is already running, it is obviously loaded (or is XIP), so it is skipped in this process. The next partition (if required) must be the bitstream file for the FPGA fabric (PL). The one after that is the application or a second stage loader. Once it is loaded, execution transfers to the entry point, which is generally the .boot vector at offset 0 in the program image.
There are several possible variations in the process, depending on whether there is a second stage loader, whether security is enforced, whether the partitions are RSA authenticated or AES encrypted, etc. There is also a fallback option: if the load fails, a second attempt is made from a different part of the memory that holds a known good (but possibly older) version.
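The partition walk just described can be sketched as a loop over a simplified partition table. Everything here is hypothetical: the real boot image and partition headers carry many more fields, and the loader hooks stand in for the PCAP transfers. A NULL hook is treated as a no-op so the control flow can be exercised on a host.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified partition record. */
typedef enum { PART_FSBL, PART_BITSTREAM, PART_APP } part_kind_t;

typedef struct {
    part_kind_t kind;
    uint32_t    flash_offset; /* where the partition lives in flash */
    uint32_t    load_addr;    /* where an executable partition is copied */
    uint32_t    entry;        /* entry point of an executable partition */
} partition_t;

/* Hypothetical loader hook; returns 0 on success. */
typedef int (*loader_fn)(const partition_t *p);

/* Skip the FSBL, configure the PL from the bitstream, load the
 * application and report its entry point. Returns 0 on success, -1 on a
 * load error or if no application partition is present. */
static int process_partitions(const partition_t *parts, size_t n,
                              loader_fn load_pl, loader_fn load_ps,
                              uint32_t *entry)
{
    for (size_t i = 0; i < n; i++) {
        const partition_t *p = &parts[i];
        switch (p->kind) {
        case PART_FSBL:
            break; /* already running (or XIP): nothing to load */
        case PART_BITSTREAM:
            if (load_pl != NULL && load_pl(p) != 0) return -1;
            break;
        case PART_APP:
            if (load_ps != NULL && load_ps(p) != 0) return -1;
            *entry = p->entry; /* the caller jumps here and never returns */
            return 0;
        }
    }
    return -1;
}
```

The entry point is returned through a pointer rather than as the return value because 0 can be a legitimate entry address.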
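The fallback idea reduces to "use the first image that validates." The sketch below models only that selection; the candidate struct, its header_valid flag (standing in for the real header and checksum checks) and the offsets are hypothetical. As I understand it, the real retry goes through the BootROM and a reset, which is also why FSBL can end up running twice.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical boot image candidate in flash. */
typedef struct {
    uint32_t flash_offset;
    int      header_valid; /* stand-in for the real header/checksum tests */
} image_candidate_t;

/* Try the primary image first, then the golden fallback image. Returns 0
 * and stores the chosen offset, or -1 if nothing bootable is found. */
static int select_boot_image(const image_candidate_t *c, size_t n,
                             uint32_t *offset)
{
    for (size_t i = 0; i < n; i++) {
        if (c[i].header_valid) {
            *offset = c[i].flash_offset;
            return 0;
        }
    }
    return -1;
}
```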
The normal load process is to use PCAP to do a DMA transfer, both to the PL fabric and to the PS memory.
There can be a couple of variations on default behavior, particularly in systems without DDR RAM. One is to use OCM instead of DDR. My preference is to move the entire OCM to the top of memory so it is one contiguous block. To maximize OCM availability, the FSBL can execute directly from QSPI flash in linear address mode, which is slower, but who cares: we’re talking milliseconds after startup! The top of OCM is reserved for data, stack, etc. and the balance is available for the application. Another trick is to lock the application (or part of it) in the L2 cache and let it run from there. The L2 cache is 512 KB in size and the OCM is 256 KB. Between the two, a fairly decent application can be built with no external RAM. Losing the L2 cache as a cache slows things down, but may be acceptable, particularly since much of the code and possibly data already reside in it and therefore run at its speed. The L1 caches are also still available to enhance performance.
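The OCM arrangement above can be sketched as simple address arithmetic. This assumes all four 64 KB OCM blocks are mapped high, giving one contiguous 256 KB block at 0xFFFC0000 through 0xFFFFFFFF; the struct, function and reservation size are hypothetical. Together with the L2 cache, the total on-chip budget is 256 KB + 512 KB = 768 KB.

```c
#include <stdint.h>

#define OCM_SIZE      (256u * 1024u) /* 256 KB of OCM */
#define L2_SIZE       (512u * 1024u) /* 512 KB of L2 cache */
#define OCM_HIGH_BASE 0xFFFC0000u    /* assumed base with all OCM mapped high */

typedef struct {
    uint32_t app_start;  /* first byte available to the application */
    uint32_t app_size;   /* bytes available to the application */
    uint32_t rsvd_start; /* start of the data/stack region at the top */
} ocm_layout_t;

/* Split contiguous high OCM: application at the bottom, reserved_top
 * bytes for data, stack, etc. at the top. Returns 0 on success, -1 if
 * the reservation is empty or larger than the OCM. */
static int plan_ocm(uint32_t reserved_top, ocm_layout_t *out)
{
    if (reserved_top == 0u || reserved_top > OCM_SIZE) return -1;
    out->app_start  = OCM_HIGH_BASE;
    out->app_size   = OCM_SIZE - reserved_top;
    out->rsvd_start = OCM_HIGH_BASE + out->app_size; /* never wraps past 4 GB */
    return 0;
}
```

For example, reserving the top 64 KB leaves 192 KB of application space from 0xFFFC0000 up to 0xFFFF0000.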
PROGRAM FLASH. The Program Flash operation can be done from within SDK, as a stand-alone application or from inside the Hardware Manager window of Vivado (or Vivado Lab Edition). In all cases the same process is carried out.
- A JTAG connection to the hardware is established.
- FSBL is run to configure the PS registers. For a normal FSBL, the bootmode jumpers should be set to JTAG mode to ensure that FSBL doesn’t do anything other than configure the PS registers. Also, an XIP FSBL cannot be used, and if there is no DDR, the FSBL must be modified so that it does not fault on that fact.
- JTAG is used to carry out the erase, write and read operations to program the Flash.
CREATE BOOT IMAGE. For completeness, I will mention Create Boot Image. First, at present the option for an XIP FSBL is not available through the GUI. You can use the GUI to get an initial file, but from there you must add the XIP option in a text editor. (Note: Xilinx tools are Linux centric, so you need an editor that handles Linux line endings if you are working on Windows.) Later revisions to the file should be made manually in the same way. Then, from Create Boot Image, select “Import from existing BIF file” and point it at the manually edited .bif file. The remaining options are filled in from that .bif file, and even though the xip option is not visible, it will be used. Then go to the bottom of the window and click “Create Image”. That’s all there is to it. I generally have it create a “Bootimage” directory directly under <project>.sdk/<project., which is the top level directory of the application I’m making an image for. That directory will contain two files, by convention BOOT.mcs and <project>.bif. The first is what will be programmed into the flash; the second holds the instructions for building it. Here is a sample .bif file with the xip option:
//arch = zynq; split = false; format = MCS
the_ROM_image:
{
[bootloader, xip_mode, offset = 0x2000]C:\WV-Git-Zynq\FTRTS\FTRTS.sdk\FSBL_XIP\Release\FSBL_XIP.elf
[offset = 0x200000]C:\WV-Git-Zynq\FTRTS\FTRTS.sdk\FTRTS_wrapper_hw_platform_0\FTRTS_wrapper.bit
[offset = 0x700000]C:\WV-Git-Zynq\FTRTS\FTRTS.sdk\FTRTS\Debug\FTRTS.elf
}
DEBUGGING. For debug builds, place one of the two debugging options, FSBL_DEBUG or FSBL_DEBUG_INFO, in the Properties/C/C++ Build/Settings/Tool Settings/Symbols/Defined symbols list (-D), and use a debug build if things don’t work the first time. This requires that UART0 or UART1 be accessible on MIO pins, with appropriate level shifters or other connectivity to a terminal emulator program (possibly the one in SDK), and that the UART used be defined as the stdout device. The result is a lot of output as FSBL runs, which can be very useful in determining where and why it didn’t do what was expected (and possibly locked up or exited with an error). Be aware that the fallback capability will generally cause it to run twice. The first run is the one you want to examine. The second one assumes a golden image, which probably doesn’t exist at this point, so it is guaranteed to fail, but probably not for the reasons you are looking for.
Another technique I was not previously aware of is that FSBL can be run from JTAG under the debugger. That makes both debugging and changes easier than putting each new version in flash and then power cycling. It may (particularly for XIP) require allocating memory differently in ldscript.ld.
Finally, as long as SDK is connected to the CPU via JTAG, even running the program some other way (such as being called by Program Flash) will cause it to run in the context of the debugger. Breakpoints are honored, execution can be traced, and variables can be examined. Another neat tool.
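The way a -D symbol turns FSBL output on can be sketched with a leveled print macro. This is modeled loosely on how FSBL gates its messages on FSBL_DEBUG and FSBL_DEBUG_INFO; the macro names, level values and counter here are my own hypothetical choices, and printf stands in for the UART output on target.

```c
#include <stdio.h>

/* Hypothetical message levels. */
#define DBG_GENERAL 0x1u
#define DBG_INFO    0x2u

/* The -D symbol chosen in the build settings selects which levels print;
 * FSBL_DEBUG_INFO enables both levels, FSBL_DEBUG only general messages. */
#if defined(FSBL_DEBUG)
#define DBG_ENABLED (DBG_GENERAL)
#elif defined(FSBL_DEBUG_INFO)
#define DBG_ENABLED (DBG_GENERAL | DBG_INFO)
#else
#define DBG_ENABLED 0u
#endif

static unsigned dbg_lines_printed; /* counts messages actually emitted */

/* Print only if the message level is enabled for this build. */
#define dbg_printf(type, ...)              \
    do {                                   \
        if ((type) & DBG_ENABLED) {        \
            printf(__VA_ARGS__);           \
            dbg_lines_printed++;           \
        }                                  \
    } while (0)
```

Because disabled levels reduce to a constant-false test, a release build without either symbol carries essentially no cost for the debug statements.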
Don’t forget to build a release version when it all works and change the file links (in Program Flash and in the .bif file as appropriate) to the release version.
FSBL SOURCE CODE CUSTOMIZATION. The following steps need to be done to create an FSBL that does not require DDR. Note that some steps are optional or specific to a particular goal. Also, if XIP is used, separate FSBLs must be made for Program Flash and for placing in the boot image, since the one for Program Flash cannot use XIP. The Program Flash FSBL can be much simpler, since it does very little. In fact, a very simple generic Program Flash FSBL could be made from only two or three source files and would work in just about all cases. But where DDR is present there is no need for a second FSBL, so that may not be generically useful. Note that both FSBLs can share the same BSP.
One of the options without DDR is to use L2 cache as memory, by locking code in it. This is documented in the Xilinx Wiki article Zynq-7000 AP SoC Boot - Booting and Running Without External Memory Tech Tip. I have not tried that option. Some of the information contained here (that I have used) came from that article. It has been maintained for various versions of the tools, which is helpful. If that option is selected, it would be wise to refer to that article and download the associated zip file. I found much of the Reference_Design_Files to be of little use to me, but I did go into the software_projects.zip/FSBL_ZIP/src and copy it into my FSBL/src directory, replacing the version that SDK provided by default. Then I did some customizing. Their approach involved playing some tricks with DDR memory addresses, etc. resulting in doing things a bit differently than described below. This may be because they were locking the cache to the region that would normally be occupied by DDR, and needed those addresses. Since I was loading everything into OCM and placing it at the top of the address range, I opted for a different approach.
If the DDRLESS files are used, place DDRLESS_SYSTEM in the Properties/C/C++ Build/Settings/Tool Settings/Symbols/Defined symbols list (-D) for all build environments.
main.c is the heart of the operation. The first executable line in main() calls ps7_init(), which is in ps7_init.c, which is linked into the project from the hardware wrapper project. This is a key element for both types of FSBL.
The line
#ifdef XPAR_PS7_DDR_0_S_AXI_BASEADDR
at about line 297 (version dependent) has two implications that must be fixed in any FSBL for a system without DDR. First, it conditionally skips just about all the remainder of main(), leaving only an error exit behind. Second, the next 15 or so lines attempt to write and then read back from DDR, generating an error exit when that fails. There are several ways to fix this, but I just remove the 15 or so lines down to “PCAP initialization”.
The #ifdef has a matching #else and #endif at the end of main() (line 578 in 2017.4 version). These need to be removed, also.
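For reference, the lines being removed perform a write/read-back check along the lines of the sketch below, modeled here against an ordinary RAM buffer. The pattern and function name are hypothetical and the exact values FSBL writes may differ; the point is that on a board without DDR the same check aimed at DDR addresses cannot succeed.

```c
#include <stddef.h>
#include <stdint.h>

/* Write a position-dependent pattern, then read it back and compare.
 * Returns 0 if every word reads back correctly, -1 otherwise. */
static int mem_write_readback(volatile uint32_t *base, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        base[i] = 0xAA55AA55u ^ (uint32_t)i; /* vary the value per word */
    }
    for (size_t i = 0; i < words; i++) {
        if (base[i] != (0xAA55AA55u ^ (uint32_t)i)) {
            return -1; /* mismatch: memory absent or faulty */
        }
    }
    return 0;
}
```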
The remainder of main() consists of tests for various boot modes and code to execute each. Boot modes that are not available to the hardware platform in question are #ifdef’d out, so they won’t compile into the finished code, but they could be physically removed if desired. In particular, when making an FSBL for Program Flash, nothing but JTAG is needed, and it will work better, without any fiddling with boot mode jumpers, if the rest is left out. Start with the QSPI section: take it and the NAND, NOR, SD and MMC modes out. Also take out the if at the beginning of the JTAG mode, since we want it to always assume JTAG. And since the if has been removed, remove the failure else clause at the end of the JTAG section. (Note: You can do this for the Program Flash FSBL, but you need to leave at least the desired boot mode in the actual FSBL that will be used for booting.)
In practice, each mode calls any necessary functions and ends up leaving main() never to return, so the code for the remaining modes never executes. Take a closer look at the JTAG one used for the Program Flash FSBL. It includes calls to a few simple utility functions followed by a call to FsblHandoffJtagExit(). This function is written in assembly language. It contains a few instructions to invalidate caches, etc. and then ends in a two-line infinite loop: a wait-for-event (wfe) instruction and a branch back to it. It never returns or exits in any manner. The JTAG interface usurps control through the debug logic when it is ready.
The next three paragraphs apply only to the FSBL used in the image file. They do not apply to making a Program Flash FSBL.
The XIP version of FSBL includes a “preload” function at the beginning of main. I think it would have been better placed in image_mover, but either way, it is only used for loading code into the L2 cache. It is not needed to simply run from OCM. It also includes a couple of copy() function calls that are probably not needed if cache isn’t being usurped.
Right after the lines associated with ps7_init is a call to SlcrUnlock(). I placed a usleep(50000) (50 ms) before that to give the PLLs time to lock and settle before going on, and right after it I placed the following line:
Xil_Out32(XPS_SYS_CTRL_BASEADDR + 0x910, 0x1F);
which relocates the entire OCM to the top of the address space.
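The effect of that write can be sketched as decoding the OCM_CFG register at SLCR + 0x910. This assumes the RAM_HI field is bits [3:0], one bit per 64 KB OCM block, with a set bit mapping that block into the 0xFFFC0000 to 0xFFFFFFFF range; the value 0x1F sets all four of those bits. Because OCM_CFG lives in the SLCR, the write has to come after SlcrUnlock(). The function name is hypothetical.

```c
#include <stdint.h>

#define SLCR_BASE    0xF8000000u         /* XPS_SYS_CTRL_BASEADDR */
#define OCM_CFG_ADDR (SLCR_BASE + 0x910u) /* register written above */
#define OCM_BLOCK_SZ 0x10000u             /* 64 KB per OCM block */

/* Base address of OCM block 0..3 for a given OCM_CFG value, assuming
 * RAM_HI bit n selects the high mapping for block n. */
static uint32_t ocm_block_base(uint32_t ocm_cfg, unsigned block)
{
    if ((ocm_cfg >> block) & 1u) {
        return 0xFFFC0000u + block * OCM_BLOCK_SZ; /* mapped high */
    }
    return block * OCM_BLOCK_SZ;                   /* mapped low */
}
```

With 0x1F the four blocks land at 0xFFFC0000, 0xFFFD0000, 0xFFFE0000 and 0xFFFF0000, forming the single contiguous 256 KB region used for the application and its data.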
In the QSPI section, the line
FlashReadBaseAddress = XPS_QSPI_LINEAR_BASEADDR;
needs to be added after the MoveImag = line
qspi.c is where the next change comes. If XIP is used, the QSPI must be configured in linear mode. This is a non-issue for QSPI up to 16 MB total size, but for larger QSPI, only the first 16 MB are available in linear mode. There are a number of choices that can be made in that case. A system with a larger QSPI is likely to have DDR, in which case none of this matters. But if not, the partitions must be arranged so that the parts needed are in the first 16 MB. The options are limited because FSBL is already operating from QSPI in linear mode, so the mode should not be switched. One option would be for the application to load the bitstream, eliminating the need for that partition to be in the first 16 MB. Another is for a second stage bootloader to complete the task, but it would have to live in cache or OCM, somewhat defeating the purpose of XIP mode. Other options, such as having the application finish loading itself or use overlays, may be useful in special cases, but they have their disadvantages, too. The bottom line, though, is that QSPI must remain in linear mode for an XIP FSBL to work, so the line
if (QspiFlashSize <= FLASH_SIZE_16MB) { (line 278 in 2017.4)
must be removed or made true. The simplest way is to replace it with if(1). It should be noted that this is not an issue for the Program Flash FSBL, as it never calls anything in qspi.c at all. In fact, the whole file (and several others) can be eliminated from the project.
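Because an XIP FSBL keeps QSPI in linear mode, every partition it reads must sit inside the first 16 MB. A build-time check of the layout can be sketched as below. The offsets in the usage test follow the sample .bif file; the partition sizes, struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define QSPI_LINEAR_WINDOW 0x01000000u /* 16 MB visible in linear mode */

typedef struct {
    uint32_t offset; /* partition offset in flash */
    uint32_t size;   /* partition size in bytes */
} flash_part_t;

/* Partitions must be sorted by offset. Returns the index of the first
 * partition that runs past the linear window or overlaps the next one,
 * or -1 if the layout is acceptable. */
static int check_linear_layout(const flash_part_t *p, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t end = (uint64_t)p[i].offset + p[i].size; /* no overflow */
        if (end > QSPI_LINEAR_WINDOW) return (int)i;
        if (i + 1 < n && end > p[i + 1].offset) return (int)i;
    }
    return -1;
}
```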
image_mover.c contains a couple of tests for the DDR memory range (starting at line 411 in 2017.4). For the Program Flash FSBL, this is another file that won’t even be used, so this is not an issue. However, for a version that intends to move the application image into OCM, these tests will fail. Either eliminate these two tests, or give them OCM addresses. Since we are creating a custom solution anyway, I favor eliminating them, as the developer should know whether the image is going to fit; it shouldn’t need to be determined at boot time every time.
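If you prefer the second option and keep the tests with OCM addresses, the check could look like the sketch below. It assumes the whole OCM has been relocated high (0xFFFC0000 to 0xFFFFFFFF); the function name is hypothetical and not the one used in image_mover.c. The end address is computed in 64 bits so a region ending exactly at the top of the address space does not wrap.

```c
#include <stdint.h>

#define OCM_HIGH_BASE 0xFFFC0000u /* assumed: all OCM mapped high */
#define OCM_SIZE      0x00040000u /* 256 KB */

/* Return 1 if [load_addr, load_addr + len) lies entirely in high OCM. */
static int fits_in_high_ocm(uint32_t load_addr, uint32_t len)
{
    uint64_t end = (uint64_t)load_addr + len;
    return load_addr >= OCM_HIGH_BASE &&
           end <= (uint64_t)OCM_HIGH_BASE + OCM_SIZE;
}
```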
There are a number of other changes in the version of image_mover.c in the DDRLESS system files. Most of these do not apply if cache is not being usurped.
One line, present in both the default and DDRLESS versions, that needs to be removed is
LoadAddr = DDR_TEMP_START_ADDR;
which simply messes things up when moving code to OCM. This is part of PartitionMove() associated with moving PL to DDR temporarily.
Also, if the DDRLESS version is used as the starting point and the other instructions here are followed, the line
SourceAddr += FlashReadBaseAddress;
should be removed. It is not in the normal FSBL, and it too messes things up.
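One plausible way an extra base-address addition goes wrong, assuming the flash access path already adds FlashReadBaseAddress, is that the base gets added twice and the resulting address points somewhere unrelated. The sketch shows the arithmetic; 0xFC000000 is the Zynq-7000 QSPI linear base (XPS_QSPI_LINEAR_BASEADDR), and the helper name is hypothetical.

```c
#include <stdint.h>

#define QSPI_LINEAR_BASE 0xFC000000u /* XPS_QSPI_LINEAR_BASEADDR */

/* CPU address of a flash offset in linear mode: the base is added once. */
static uint32_t linear_addr(uint32_t flash_offset)
{
    return QSPI_LINEAR_BASE + flash_offset; /* 32-bit arithmetic wraps */
}
```

For the application partition at 0x700000 in the sample .bif, linear_addr(0x700000) is 0xFC700000, but adding the base a second time wraps around to 0xF8700000, which is not in the flash window at all.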
Fsbl_handoff.S is an assembly language source file that handles the various exit conditions for FSBL. The DDRLESS version adds a couple of lines to FsblHandoffExit to jump appropriately into the application. Here is the new and the replaced code, selected with #ifdef:
#ifdef DDRLESS_SYSTEM
mov lr, r0 /* move the destination address into link register */
bx lr /* force the switch, destination should have been in r0 */
#else
mov lr, r0 /* move the destination address into link register */
mcr 15,0,r0,cr7,cr5,0 /* Invalidate Instruction cache */
mcr 15,0,r0,cr7,cr5,6 /* Invalidate branch predictor array */
dsb
isb /* make sure it completes */
ldr r4, =0
mcr 15,0,r4,cr1,cr0,0 /* disable the ICache and MMU */