Contributions to the design of high-performance digital systems using field programmable system-on-chip platforms

Fernández Molanes, Roberto

Supervised by:

Juan J. Rodríguez Andina (Supervisor)
José Fariña Rodríguez (Supervisor)

Defense university: Universidade de Vigo

Defense date: 23 November 2018

Examination committee:

Luis Filipe dos Santos Gomes (Chair)
María Dolores Valdés Peña (Secretary)
Eduardo Torre Arnanz (Member)

Department:

Tecnoloxía electrónica

Type: Thesis

Teseo: 570893

Abstract

Introduction

Nowadays, Systems-on-Chip (SoCs) are the preferred way to keep pace with the ever-increasing performance requirements of digital processing systems. They combine in a single chip general-purpose processors and specialized cores such as Graphics Processing Units (GPUs), accelerators based on reconfigurable logic such as Field-Programmable Gate Arrays (FPGAs), single-task units (e.g., for inertial data processing or voice recognition), or specialized coprocessors such as floating-point or Digital Signal Processing (DSP) units. The advent of SoCs coincides with the deceleration of Moore's Law, which makes it economically reasonable to work on optimized software and hardware structures, as opposed to the trend of the last 30 years, when waiting for the next generation of devices was more profitable than investing in optimization. One of the most interesting SoCs results from the combination of processors and FPGAs in the same chip, which gives rise to the so-called Field Programmable System-on-Chip (FPSoC) paradigm. FPSoCs provide software designers with an efficient way to accelerate the execution of their algorithms, and hardware designers with much more high-level processing power than that provided by soft-cores implemented in standard FPGA logic resources. In addition, there are some new applications for which the FPSoC architecture is the best solution. In the first place, the Internet-of-Things (IoT) market is producing a diversification of applications. If manufacturers offer a wider portfolio of devices to cover the different applications, the market share per device is reduced, increasing manufacturing costs. Similarly, offering complex heterogeneous devices that can be used in several applications implies higher integration of functionality and a waste of silicon, also increasing the overall cost.
In this context, combining FPGAs with processors on the same chip is a way to achieve a reconfigurable processor that the designer can adapt to the needs of each application. Another important factor boosting the usage of FPGAs is the artificial intelligence revolution driven by deep and convolutional neural networks. While the training of these networks is very intensive and will rely on heavy-duty cloud-based GPUs and ASICs, the network usage (inference) will be mainly implemented in processors, ASICs and FPSoCs. While processors will be restricted to very simple solutions and ASICs will be applicable only to very low-power devices or high-performance applications, FPSoCs offer a balance among flexibility, processing power and consumption that is very suitable for a large number of intermediate applications. Despite the advantages that FPSoCs offer, they are still used in a limited number of works. One of the reasons is the difficulty of programming their FPGA fabric using low-level Hardware Description Languages (HDLs). To solve this problem, FPSoC manufacturers nowadays offer different ways to program the FPGA using high-level languages like C/C++ and OpenCL, and graphical languages like Simulink and LabVIEW. In these cases the designer does not need to deal with low-level details and development time is greatly reduced, though the performance is poorer than with the low-level approach. Another reason for the slow penetration of FPSoCs, affecting both the low-level and high-level workflows, is the lack of knowledge about the best way to interconnect the processor and the FPGA fabric in these devices, which is a key factor for developing high-performance heterogeneous computing solutions with them.
A few works have already shown the importance that factors like data size, Operating System (OS) and FPGA operating frequency have on the transfer rate between the processor and the FPGA, but all of them are partial or tied to a specific application, so they are of limited practical use to designers. For this reason, this thesis carries out a complete characterization of FPGA-processor data transfer rates in two important FPSoC families, namely Cyclone V SoC and Zynq-7000, taking into account all the parameters that can affect the transfer rate. The study is generic and relevant to all kinds of applications. From the results and the experience gained from the experiments, a set of design guidelines is also proposed, intended to help all kinds of designers, from beginners in FPGA and FPSoC platforms to application designers using either low-level or high-level workflows. The proposed methodology and numeric results should also be useful for designers of intermediate layers like drivers or compilers for high-level languages (like OpenCL). To demonstrate the usefulness of the characterization experiments, the design methodology, and the guidelines, three application examples using FPSoC devices have been developed in this thesis. These examples address different real-life problems and show the advantages provided by FPSoCs if properly used. They have different processor-FPGA interconnect needs and cover most of the situations an FPSoC programmer will have to face.

FPSoC architecture

The FPSoC architecture consists of a Hard Processor System (HPS) and FPGA fabric, coupled by high-throughput datapaths. The most important element in the HPS is the processor. Most current FPSoCs on the market feature one application processor intended to run a full-featured OS like GNU/Linux, Android or Windows Embedded. They contain one to four processor cores with two cache levels. A wide range of complex applications can be built using this kind of device.
Some FPSoC manufacturers opted for including a simpler real-time processor with one or two cores running at a fraction of the application processor's frequency and with only one level of cache. This kind of device is a better fit for industrial applications where latency must be guaranteed. Apart from processors, the HPS includes other hard peripherals typically found in processor-based systems, like a DMA controller (DMAC), Ethernet and USB controllers, UARTs, and timers. In this thesis the Intel FPGA Cyclone V SoC and Xilinx Zynq-7000 families are used for the characterization and the application examples. These two devices are very similar. They include a dual-core ARM Cortex-A9 32-bit application processor with two levels of cache. A multiport SDRAM controller (SDRAMC) provides access to an external SDRAM memory, which is used by the processor as main memory and shared with the HPS peripherals and the FPGA. The interconnect resources between HPS and FPGA in both devices are:
• Bridges where the HPS acts as a master: memory-mapped bridges providing around 1 GB of address space to implement peripherals in the FPGA. Cyclone V SoC has one HPS-to-FPGA bridge configurable to 32, 64 or 128 bits and a Lightweight 32-bit bridge with lower performance. Zynq-7000 has two fixed 32-bit HPS-to-FPGA bridges.
• Bridges where the FPGA acts as a master: memory-mapped bridges used by the FPGA to access the HPS. The same resources are available in both devices. The FPGA-to-HPS bridges allow the FPGA to access the main SDRAM or to use the HPS peripherals. In Cyclone V SoC they are configurable to 32, 64 or 128 bits. In Zynq-7000 two fixed 32-bit bridges are available. In addition to the FPGA-to-HPS bridges, FPGA-to-SDRAM ports allow direct access to main SDRAM memory. Lastly, an Accelerator Coherency Port (ACP) allows coherent accesses to the cache memory of the processors.
FPSoC Characterization

Due to the complex nature of FPSoC chips, choosing a method to transfer data between processor and FPGA is not straightforward, and several questions arise when deploying an application to an FPSoC device. One thing is clear: as much of the code as possible should go in the processor, because software is easier to develop, modify, maintain and port than hardware. However, which bridge should be used from the available options? Are their speeds enough for the application, or will they undo the acceleration provided by the FPGA, making the hardware implementation a waste of time? Answering these questions may force the designer to change which part of the algorithm goes in each device so that the required data transfer rate is reduced, or may even force the FPSoC option to be discarded. Moreover, should the HPS be used as the master controlling the transfers, or is the FPGA a better option? Using the HPS looks easier and saves FPGA resources, but seems slower because it is software. There are also software-related questions when accessing from the HPS. Should functions from the standard library (like memcpy) be used, or is direct access using pointers more appropriate? Should designers use commercial software and drivers that ease the connection, like those offered by Xillinux? Or should designers write their own drivers? When is the DMA controller worth using? Manufacturers provide rough numbers on the transfer speed of the bridges connecting FPGA and HPS, and they do not explain the conditions under which the measurements were taken. On the other hand, the existing research works analyze only a subset of the parameters that can affect transfer speed or a small range of these parameters, or the study is tied to a specific application. These drawbacks make it difficult for a designer to get a clear overview of the FPSoC inner workings and to select a transfer method for a new application.
For this reason, this thesis presents a complete study that takes into account all the parameters involved in the data transfer rate:
• Two families of devices: Cyclone V SoC and Zynq-7000.
• Platform: OS or baremetal (no OS) implementation.
• Part of the chip controlling the transfer: HPS or FPGA.
• Bridge type: all bridges connecting HPS and FPGA are tested, namely HPS-to-FPGA, Lightweight HPS-to-FPGA, FPGA-to-HPS, FPGA-to-SDRAMC and ACP.
• Master starting the AXI bus transfers: when the HPS controls the transfer, the processor or the DMAC is used; when the FPGA controls the transfer, DMACs in the FPGA are used.
• Influence of the DMAC microcode preparation time on the transfer rate. Ways to minimize this effect are explained.
• CPU transfer method: the for loop method (access using pointers that are incremented inside a for loop) and the memcpy function from the standard library are tested.
• Different cache enablement levels and special configurations of the cache controller.
• Special configurations of the SDRAMC.
• Influence of CPU load.

Some results for Cyclone V SoC and Zynq-7000 confirmed expected behaviors. For example, both platforms achieve faster transfer rates when using a wider bridge and a higher FPGA operating frequency, and when the cache is enabled. In both platforms the processor is also faster for small data sizes, while the DMAC is faster for big data sizes. However, other unexpected behaviors have been uncovered. For example, in both platforms the transfer rate is very dependent on data size. The fastest data rates are obtained for intermediate data sizes (1 kB to 512 kB). For small data sizes, initialization time slows down the net transfer rate, and for big data sizes the cache overflows, which reduces performance in the methods that use the cache.
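The effect of initialization time on small transfers, and the resulting crossover between processor and DMAC, can be illustrated with a very simple model: a fixed setup time followed by a transfer at a constant peak rate. The sketch below is only an illustration under that assumption. The function names and all numeric values are hypothetical and are not measurements from the thesis, and the model does not capture the cache overflow effect seen for big data sizes.

```python
def net_rate(size_bytes, setup_s, peak_bytes_per_s):
    """Net rate (bytes/s) for a fixed setup time followed by a transfer at peak rate."""
    return size_bytes / (setup_s + size_bytes / peak_bytes_per_s)


def crossover_size(low_setup, high_peak):
    """Transfer size (bytes) at which two masters need the same total time.

    Each argument is a (setup_s, peak_bytes_per_s) pair. `low_setup` is the
    master with the shorter setup time, `high_peak` the one with the higher
    peak rate. Below the returned size the first one wins, above it the second.
    """
    setup_a, peak_a = low_setup
    setup_b, peak_b = high_peak
    return (setup_b - setup_a) / (1.0 / peak_a - 1.0 / peak_b)


# Hypothetical figures chosen for illustration only (NOT thesis results).
CPU = (0.2e-6, 100e6)   # short setup, lower peak rate
DMAC = (5.0e-6, 400e6)  # longer setup (e.g. microcode preparation), higher peak rate

for size in (64, 1024, 65536):
    cpu_mb = net_rate(size, *CPU) / 1e6
    dmac_mb = net_rate(size, *DMAC) / 1e6
    print(f"{size:6d} B  CPU {cpu_mb:7.1f} MB/s  DMAC {dmac_mb:7.1f} MB/s")

print("crossover at about", round(crossover_size(CPU, DMAC)), "bytes")
```

With real devices the setup time and peak rate would have to be taken from measurements such as those in the characterization, since they depend on the bridge, the OS and the cache configuration.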
In Cyclone V SoC, coherent DMA accesses through the ACP to read data from the FPGA fabric are faster than direct SDRAM accesses, whereas write accesses may be faster or slower depending on data size. The expected behavior would be that direct access to the SDRAMC is always faster than the ACP for big transfer sizes. In experiments with OS in Zynq-7000, CPU transfers are faster than DMAC ones for all data sizes in read operations, whereas in write operations they are faster only for the smallest data sizes (which is the expected behavior). When comparing the two devices, Cyclone V SoC shows much better performance when using the DMAC, whereas Zynq-7000 is faster with the CPU, in spite of its CPU operating at a lower frequency. From the results obtained and the experience gained through the experiments, a set of design guidelines has been developed, representing the first straightforward design guidelines for FPSoC devices. They contain a series of steps indicating, in order, which decisions a designer has to make during an FPSoC implementation, and tips on how to approach them depending on the characteristics of the application. One of the most important conclusions from the guidelines is that the FPGA should be used only to implement the most time-consuming parts of the algorithm. Keeping as much of the algorithm as possible in software enables a better usage of the silicon area (because the FPGA is reserved for the slower parts) and reduces the development time (because the design time for hardware is higher than for software). Another important conclusion is that the AXI master controlling the data transfers must be located in the part of the device where data are generated. This means that if data are generated in the HPS, the DMAC in the HPS or the processor must be used. In the same way, when data are generated in the FPGA, a DMAC or another kind of master in the FPGA must initiate the transfers.
Placing the master where the data are generated reduces the logic in the FPGA involved in the transfer (interconnect and intermediate storage elements) and eases the programming (because no synchronization elements are needed between the part of the chip generating data and the one transferring them). The guidelines are complemented with online repositories that contain the full set of numerical results, the code used for the characterization, starting guides for beginners and a set of easy examples. These examples, mainly focused on HPS-FPGA transfers, complement the more complex ones usually available on the official web sites.

FPSoC Application Examples

The proposed guidelines have been applied to three application examples with different HPS-FPGA communication needs. These examples served to prove the usefulness of the guidelines and the advantages of the FPSoC architecture over other platforms when properly used. The first application example is the design of a high-accuracy, portable frequency meter to be used in a Quartz Crystal Microbalance (QCM). QCMs are sensors that use a quartz crystal to define the frequency of an oscillator circuit. If the frequency of the oscillator is measured with the required accuracy, mass measurements in the order of nanograms and picograms can be carried out. In this example the FPGA available in the FPSoC is used to perform accurate frequency measurements of the QCM signal. A circuit called Interpolating Reciprocal Counter is implemented. It counts edges of the QCM signal and of a reference clock, and uses a Tapped Delay Line (TDL) to achieve sub-clock accuracy. The HPS is only used for high-level calculations and to provide Ethernet communications. In this example the required processor-FPGA transfer rate is relatively low, so the easiest option from the programming point of view is selected. When a new frequency measurement is available in the FPGA, the processor detects it using an interrupt and proceeds to do the data transfer.
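The principle of an interpolating reciprocal counter can be sketched as follows. This is the generic textbook formulation, not the specific circuit of the thesis: the measurement gate opens and closes on edges of the measured signal, the reference clock edges inside the gate are counted, and the TDL measures the delay from each gate edge to the next reference clock edge as a number of taps. The function name, parameter names and numeric values are assumptions made for illustration.

```python
def reciprocal_frequency(n_signal, n_ref, f_ref_hz,
                         start_taps=0, stop_taps=0, tap_delay_s=0.0):
    """Estimate the signal frequency (Hz) with a reciprocal counter.

    n_signal    -- signal periods elapsed while the gate was open
    n_ref       -- reference clock edges counted while the gate was open
    f_ref_hz    -- reference clock frequency
    start_taps  -- TDL taps between the opening signal edge and the next
                   reference edge (0 means no interpolation)
    stop_taps   -- same measurement for the closing signal edge
    tap_delay_s -- assumed uniform delay of one TDL tap
    """
    t_ref = 1.0 / f_ref_hz
    # Gate time = counted reference periods, corrected by the two
    # sub-clock intervals measured with the delay line.
    gate_s = n_ref * t_ref + (start_taps - stop_taps) * tap_delay_s
    return n_signal / gate_s


# Hypothetical numbers: 100 MHz reference, 50 ps per tap.
coarse = reciprocal_frequency(1000, 10000, 100e6)
fine = reciprocal_frequency(1000, 10000, 100e6,
                            start_taps=100, stop_taps=20, tap_delay_s=50e-12)
print(f"without interpolation: {coarse:.3f} Hz")
print(f"with interpolation:    {fine:.3f} Hz")
```

Without interpolation the gate time is only known to within one reference period. The TDL correction narrows this to roughly one tap delay, which is why the linearity of the delay line matters for the final error.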
This has been, to the best of the author's knowledge, the first implementation of a frequency meter using TDLs in an FPSoC platform. It demonstrates that with FPSoCs it is possible to include all the digital logic for this kind of instrument in a single chip, reducing size and cost and easing the software development with respect to existing multichip solutions. The instrument is made virtual (without a front panel), further reducing the complexity of the design. This implementation can be used as a base to develop other instruments. The frequency error achieved in the proposed implementation has been compared to that of a commercial frequency meter, obtaining similar error with smaller size, cost and power consumption. The compilation techniques developed to control the routing of the TDL circuit produce a smaller linearity error than previous implementations of the circuit. These techniques can be applied to any other application that requires regular global routing in the FPGA. The second application developed is a free-software localization system for Unmanned Ground Vehicles (UGVs) based on FPSoC. In this case the FPGA is used to read and preprocess images from a CMOS camera and to efficiently save them in the HPS. The processor reads the preprocessed images, detects the UGVs and sends their location through Ethernet to a main controller. To the author's knowledge it is the first camera-based localization system implemented in FPSoC. The system is highly modular (localization nodes can be easily added), robust to variable light conditions, fast (up to around 100 fps) and capable of localizing UGVs in a 2D space with an accuracy of around 1 mm / 1°. This system achieves a better combination of accuracy and frame rate than other research platforms and it is only outperformed by commercial devices like Optitrack. Optitrack is a localization system for 3D objects, able to achieve an accuracy similar to that of the proposed system at a higher frame rate.
However, Optitrack is much more expensive than the FPSoC-based system and it is a closed system, which relies on proprietary markers and a proprietary illumination system. In contrast, the proposed system is free software, does not require special illumination and uses cheap markers. Another advantage of the proposed system is that most of the processing, thanks to the FPSoC platform, is moved to the localization nodes. This reduces the requirements of the computer implementing the main controller as well as the network traffic, unlike other localization systems, including Optitrack, which rely on powerful computers and Internet networks. In this application example the data transfer rates from FPGA to processor are very high, since whole images must be written into the processor memory. The key feature to obtain such a high frame rate is the mechanism developed to save images into the processor memory without the intervention of the processor. This method uses a double buffer and a software synchronization mechanism that allows images to be saved in real time without loss of information. This feature can be used in other applications involving image processing at high frame rates and, in general, in any type of intelligent system that generates a large amount of data. The third and last example involves the hardware acceleration of the Particle Swarm Optimization (PSO) algorithm. PSO is an algorithm used to solve optimization problems with non-linear objective functions. In PSO, a swarm of particles is spread through the solution space to perform a quasi-random search for the set of parameters that optimizes (minimizes or maximizes) the objective function. Particles evaluate the objective function at their positions, share the information with other particles and move based on that information. Then the cycle starts again. In the end all particles tend to converge to a single solution.
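The evaluate, share and move cycle described above can be sketched with a minimal global-best PSO. This is a generic software version written for illustration, not the thesis implementation: the `pso` and `sphere` names, the coefficient values (a commonly used constriction-style setting), the swarm size and the toy objective function are all assumptions.

```python
import random


def pso(objective, dim, bounds, n_particles=30, iterations=200,
        w=0.729, c1=1.49445, c2=1.49445, seed=0):
    """Minimize `objective` over a box with a basic global-best PSO."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]            # best position seen by each particle
    best_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]  # best position seen by the swarm

    for _ in range(iterations):
        for i in range(n_particles):
            # Move: inertia plus attraction to the personal and swarm bests.
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            # Evaluate: typically the most time-consuming step of the cycle.
            val = objective(pos[i])
            # Share: update the personal best and the swarm best.
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < g_val:
                    g_pos, g_val = pos[i][:], val
    return g_pos, g_val


def sphere(x):
    """Toy objective with its minimum at the origin."""
    return sum(v * v for v in x)


best_x, best_f = pso(sphere, dim=2, bounds=(-5.0, 5.0))
print("best position:", best_x)
print("best value:", best_f)
```

In this sketch the call to `objective` is the only part that touches the problem-specific function, so it is the natural candidate to be replaced by a hardware evaluation while the rest of the loop stays in software.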
An analysis of the PSO steps shows that the best implementation strategy for FPSoC is to execute most of the algorithm in the processor and move only the most time-consuming part, the evaluation of the objective function, to the FPGA hardware. In this thesis this implementation strategy has been tested on a real application: diagnosing the state of health of a photovoltaic panel. This application has a big objective function with bad characteristics for an FPGA implementation: it cannot be pipelined and it contains some operations not well suited to FPGA implementation, like division and exponentiation. Despite this, the performance achieved in the FPSoC implementation is similar to that of Cubieboard, one of the most powerful embedded boards on the market, and slightly lower than that of a computer (at lower cost, size and power consumption). From the study it is clear that in other applications, especially those with objective functions that can be pipelined and operations that fit the FPGA architecture, like multiplications and additions, an FPSoC implementation would provide better results than processor-based platforms. The analysis of how to split the workload between HPS and FPGA and the design of the HPS-FPGA interconnect can be reused in any application, not necessarily PSO, that involves software acceleration using FPSoC devices. Lastly, this thesis proposes a new variant of the PSO algorithm called Numerically Stable PSO (NSPSO), which has some advantages for the hardware implementation of the PSO algorithm, especially when the objective function is very complex.