How to match FPGA and CPU?

A story to understand pipelining, parallelism, cache, memory, and storage

Eggs spent a day making a toy car for Little Eggs. Unexpectedly, it turned out so well that Little Eggs' classmates all wanted to buy one. Eggs could not keep up alone, so the whole family began producing the cars together. Eggs sorts the raw materials on the table, Little Eggs drives the screws and assembles the parts, and Mother packs the toys into boxes. The three work as an assembly line: nobody is idle, production efficiency rises sharply, and they can make 50 toys a day.

Mr. Wang next door saw that Eggs was making money from toys, so his family of three joined the ranks of toy makers, and gradually every household in the community was busy producing. With fifty families working in parallel, the community makes 2,500 toys a day.

From the start, Eggs' house had a box next to the workbench, called the cache. It holds the raw materials for making toys and the finished toys right at home, close by and fast to reach.

Once every household in the community was producing, the supply of raw materials and the demand for transporting toys grew significantly. The residents built a warehouse in the center of the community, called the memory, which holds a large stock of raw materials and toys. The janitor pushes a trolley to every house to deliver raw materials and collect toys. People nicknamed him the bus.

The mayor saw the toy tax revenue from Eggs' community rising. To increase income, he decided to build a toy town, so every community in the town began producing toys and the scale of production kept growing. A large warehouse was built in the town center to store raw materials and toys. Because the warehouse is so large, handling items one at a time is too inconvenient, so goods must be picked up or delivered in batches of 500 units.

In a computer, the CPU works in a pipelined way: it divides a task into a dozen or so steps and processes them in a pipeline of a dozen or so stages, which raises speed roughly tenfold. Multi-core CPUs also work in parallel, which can raise computing speed several times more. As computation gets faster, the demands on data read and write speed grow, so modern CPUs include sophisticated cache systems. It is fair to say that for a modern general-purpose CPU, the cache is the core.
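The roughly tenfold figure can be checked with a small timing model. This is only a sketch under idealized assumptions: every stage takes one time unit, the pipeline never stalls, and the function names and the stage count of 12 are illustrative choices rather than figures for any particular CPU.

```python
def sequential_time(tasks: int, stages: int) -> int:
    """Time units when each task passes through all stages before the next one starts."""
    return tasks * stages


def pipelined_time(tasks: int, stages: int) -> int:
    """Time units for an ideal pipeline: fill it once, then finish one task per unit."""
    if tasks == 0:
        return 0
    return stages + (tasks - 1)


def speedup(tasks: int, stages: int) -> float:
    """Ratio of sequential time to pipelined time for the same workload."""
    return sequential_time(tasks, stages) / pipelined_time(tasks, stages)


if __name__ == "__main__":
    # 12 stages is an assumed value standing in for "a dozen or so".
    for n in (1, 12, 1000):
        print(n, "tasks ->", round(speedup(n, 12), 2), "x")
```

In this model a single task gains nothing from the pipeline, while a long stream of tasks approaches a speedup equal to the number of stages. That is consistent with a pipeline of a dozen or so stages giving a gain of about ten.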

For some applications, however, the CPU is not a good fit. What it does well is run at high frequency and break sequentially executed tasks into many pipeline stages that execute at high speed. Deep learning and similar workloads instead require a large amount of parallel computation, and there the number of CPU cores becomes the limit.
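Why the core count becomes the limit can be illustrated with a rough model that compares a few fast units against many slower ones. All numbers below (unit counts and clock rates) are hypothetical and chosen only to show the trade-off. The model also assumes that every item is independent and takes one cycle.

```python
import math


def batch_time(items: int, units: int, clock_hz: float, cycles_per_item: int = 1) -> float:
    """Seconds to process `items` on `units` parallel units, one item per unit at a time."""
    if units <= 0 or clock_hz <= 0:
        raise ValueError("units and clock_hz must be positive")
    rounds = math.ceil(items / units)  # each round keeps every unit busy once
    return rounds * cycles_per_item / clock_hz


# Hypothetical configurations, not measurements of real hardware.
FEW_FAST_CORES = {"units": 16, "clock_hz": 3.0e9}
MANY_SLOW_ENGINES = {"units": 2000, "clock_hz": 3.0e8}

if __name__ == "__main__":
    for items in (1, 100_000):
        print(items, "items:",
              "few fast cores", batch_time(items, **FEW_FAST_CORES),
              "| many slow engines", batch_time(items, **MANY_SLOW_ENGINES))
```

Under these assumptions the fast cores win on a single item, but once the workload holds many independent items, the design with far more units finishes sooner despite its lower clock rate.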

FPGA parallelism

Eggs designed a complex machine called the CPU. It reads and executes the instructions Eggs writes one by one, breaks each task into a many-stage assembly line, and can produce all kinds of complex toys.

The toy market always has a new trend. Recently Lego toys have become wildly popular: every building block is similar, yet many blocks together can form complex shapes. Parents now want play that also teaches, so that children use their brains while they play. Eggs sees this trend and wants to produce Lego bricks, but finds that the complex CPU machine is too inefficient for the job. What he needs is a machine that makes hundreds or thousands of bricks in parallel, while the CPU can only make a dozen or so at a time. So Eggs studied the problem for a while and developed a new machine, the FPGA. It contains thousands of small engines, each busily producing its own small block according to the task Eggs sets.

The C language we studied at university is a language for programming the CPU. A C program has a main function containing a lot of code, and after the program starts it executes in order from beginning to end. This is because only one CPU is running it, and the program is converted into instructions that the CPU executes one by one.

An FPGA is a parallel computer. Its basic programming language is Verilog. In this language, each piece of the program is converted into a small computing engine inside the FPGA chip. All the engines run in parallel, each doing its own job.
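The difference from sequential execution can be mimicked in a few lines of Python. This is a rough software model, not Verilog. It assumes clocked logic in which every engine samples the same pre-tick state and all results take effect together, and the names `clock_tick`, `engine_a` and `engine_b` are made up for illustration.

```python
from typing import Callable, Dict, List

State = Dict[str, int]
Engine = Callable[[State], State]


def clock_tick(state: State, engines: List[Engine]) -> State:
    """Advance one clock: every engine reads the old state, then all results commit together."""
    updates: State = {}
    for engine in engines:
        updates.update(engine(state))  # each engine sees the same pre-tick state
    new_state = dict(state)
    new_state.update(updates)
    return new_state


# Two hypothetical engines that exchange values with each other.
def engine_a(s: State) -> State:
    return {"a": s["b"]}


def engine_b(s: State) -> State:
    return {"b": s["a"]}


if __name__ == "__main__":
    print(clock_tick({"a": 1, "b": 2}, [engine_a, engine_b]))  # {'a': 2, 'b': 1}
```

Run as sequential statements, `a = b` followed by `b = a` would leave both variables equal. Because both engines act on the same instant of state, the two values swap without a temporary variable.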

How to match FPGA and CPU

As a reconfigurable computing engine, an FPGA is generally used together with a CPU. As shown in the figure below, the reconfigurable computing engine is connected either directly to the CPU or to the cache. The former is called tight coupling and the latter loose coupling. In a third arrangement, the FPGA is attached to the memory bus as a coprocessor and shares memory with the CPU.

As shown in the figure below, the FPGA acts as a coprocessor: the CPU writes instructions into memory, and the FPGA reads them from memory, executes them, and writes the results back into memory. The advantage of this mode is that it is simple and easy to operate, and the coprocessor and the CPU stay separate. The bottleneck is the shared memory, which limits performance. Because all interaction goes through memory, the communication latency between the CPU and the FPGA is also longer. This mode therefore suits acceleration tasks that the FPGA can carry out independently, such as video encoding and decoding or data encryption and decryption.
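The coprocessor handshake described above can be sketched as a mailbox in shared memory. Everything concrete here is an assumption made for illustration: the memory layout (`CMD`, `ARG_ADDR`, `ARG_LEN`, `RESULT`, `STATUS`), the status codes, and the single `OP_SUM` operation do not come from any real device. The coprocessor is an ordinary Python function standing in for the FPGA.

```python
from typing import Callable, Dict, List

# Hypothetical mailbox layout at the start of shared memory.
CMD, ARG_ADDR, ARG_LEN, RESULT, STATUS = 0, 1, 2, 3, 4
IDLE, PENDING, DONE = 0, 1, 2
OP_SUM = 1  # made-up opcode: add up the operands


def cpu_submit(mem: List[int], op: int, data: List[int], data_addr: int = 16) -> None:
    """CPU side: place the operands and the command in shared memory."""
    mem[data_addr:data_addr + len(data)] = data
    mem[CMD], mem[ARG_ADDR], mem[ARG_LEN] = op, data_addr, len(data)
    mem[STATUS] = PENDING  # written last so the coprocessor sees a complete command


def coprocessor_step(mem: List[int], ops: Dict[int, Callable[[List[int]], int]]) -> bool:
    """Coprocessor side: poll memory, run one pending command, write the result back."""
    if mem[STATUS] != PENDING:
        return False
    start, length = mem[ARG_ADDR], mem[ARG_LEN]
    mem[RESULT] = ops[mem[CMD]](mem[start:start + length])
    mem[STATUS] = DONE
    return True


if __name__ == "__main__":
    memory = [0] * 64  # the shared memory both sides can read and write
    cpu_submit(memory, OP_SUM, [3, 4, 5])
    coprocessor_step(memory, {OP_SUM: sum})
    print(memory[RESULT], memory[STATUS] == DONE)  # 12 True
```

Every exchange in this sketch is a memory read or write, and the CPU has to come back and check `STATUS` later. This is the sense in which shared memory becomes both the bottleneck and the source of extra latency.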

The figure below is an example of loose coupling. The CPU (ARC) and the reconfigurable computing logic are placed on a single chip. The CPU can access the reconfigurable computing engine directly, and the two share memory. The reconfigurable computing engine can read and write memory directly through DMA. The reconfigurable computing engine also has its own data read and write interface, so it can work independently of the CPU and strike out on its own.

Now let's look at an example of tight coupling. As shown in the following figure, the CPU FUs are the basic functional units of a general-purpose CPU, such as the ALU, the multiplier, and the floating-point unit. The RFU sits on the chip alongside these basic units, can be controlled directly through CPU registers, and can also access data in the cache.

To sum up, a coprocessor is like an official posted to the provinces, who can only receive orders from the emperor. Loose coupling is like an official in the capital, who can go to the palace regularly to talk with the emperor. Tight coupling is like the imperial relatives and eunuchs, who are often on duty inside the palace.

Both loose and tight coupling require putting a reconfigurable computing engine on the chip, which is relatively expensive but also very efficient. It amounts to an FPGA inside the CPU that can be reprogrammed at any time to do different calculations: configured as video codec logic when processing video, and as a deep learning calculator when doing AI computation. One chip has both a CPU and a configurable hardware computing engine.
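The idea of one engine taking on different roles over time can be sketched as follows. This is a toy software stand-in, not a model of real reconfiguration. The class `ReconfigurableEngine` and the two kernels (`delta_encode` as a crude codec-style step, `relu` as a deep-learning-style step) are hypothetical placeholders chosen for illustration.

```python
from typing import Callable, List, Optional

Kernel = Callable[[List[float]], List[float]]


class ReconfigurableEngine:
    """Toy stand-in for an on-chip reconfigurable unit: load a kernel, then run it."""

    def __init__(self) -> None:
        self.config_name = "unconfigured"
        self._kernel: Optional[Kernel] = None

    def configure(self, name: str, kernel: Kernel) -> None:
        """Replace whatever logic was loaded before with a new kernel."""
        self.config_name = name
        self._kernel = kernel

    def run(self, data: List[float]) -> List[float]:
        if self._kernel is None:
            raise RuntimeError("engine has not been configured")
        return self._kernel(data)


def delta_encode(samples: List[float]) -> List[float]:
    """Store each sample as its difference from the previous one (first is relative to 0)."""
    return [cur - prev for prev, cur in zip([0.0, *samples], samples)]


def relu(values: List[float]) -> List[float]:
    """Clamp negative values to zero, as a neural-network activation does."""
    return [max(0.0, v) for v in values]


if __name__ == "__main__":
    engine = ReconfigurableEngine()
    engine.configure("video codec", delta_encode)
    print(engine.config_name, engine.run([10.0, 12.0, 11.0]))
    engine.configure("deep learning", relu)
    print(engine.config_name, engine.run([-1.0, 2.5]))
```

The CPU-side code that calls `run` stays the same in both cases. Only the loaded configuration changes, which is the property the paragraph above attributes to an on-chip reconfigurable engine.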
