EU FP7 ICT FET Open project

From REuP Project

General call information

The call was the ICT FET Open call within the European Union's 7th Framework Programme (FP7). The targeted funding scheme was a small or medium-scale focused research project (STREP).

The proposal must be strictly anonymous: it may contain neither the names of the organisations involved in the consortium nor any other information that could identify the applicants. Furthermore, no bibliographic references are allowed in the proposal.

Proposal general information

Project name for the call

Reconfigurable General-Purpose Processor (RGPP)

Abstract of the proposal

This project introduces a completely novel approach to the architecture of general-purpose processors. The concept exploits the parallelism of data processing and the resource reuse of dynamically reconfigurable FPGAs (Field Programmable Gate Arrays), without any impact on the development of new applications or on the migration (porting) of existing ones. The proposed solution is an FPGA-based implementation of an instructionless, general-purpose processor in which there are no sequentially executed instructions; instead, all tasks are implemented in the reconfigurable hardware. The translation of individual tasks (instructions) into a hardware representation is completely automatic. The whole array of reconfigurable blocks can be modified at runtime using a small fixed reconfiguration management unit. The feasibility of the proposed hardware architecture will be confirmed with an Application Specific Integrated Circuit (ASIC). Furthermore, the tools for compilation and application execution, as well as a dedicated operating system, will be developed in the project. The results, if successful, will constitute a breakthrough in the domain of computer architecture. The solution will open new horizons for software developers while remaining fully backward compatible with existing applications and programming techniques.

Proposal details

Targeted breakthrough and its relevance towards a long-term vision

State of the art

A few years ago the maximum processor clock frequency levelled off at about 3.5 GHz and has not increased since. Advances in processor performance are now achieved by means of architectural changes. Dual-core processors are commonplace, and multi-core architectures are gaining an ever larger market share. At the same time, portable devices are becoming more and more powerful, with quad-core ARM processors expected to enter the market by the end of 2012. In devices such as mobile phones, however, energy efficiency becomes the primary objective, and the energy dissipated to perform a specific operation is the most important figure of merit. Reconfigurable computing is a new computing paradigm that bridges the gap between software and hardware. Reconfigurable hardware can be matched to the computational requirements at a given time instant, and it can adapt to the demands of applications by changing its functionality as often as every clock cycle. Since dynamic reconfiguration techniques were introduced and commercially available FPGAs began to support them, various concepts for using them to accelerate computation have appeared. Dynamically reconfigurable resources were initially used as stand-alone coprocessors for demanding computations such as image processing and data encryption. Later, the development of nanometer CMOS technologies allowed sequential processors to be implemented together with dynamically reconfigurable resources on a single die (system on chip). This technological progress gave rise to many different concepts that use dynamically reconfigurable resources in computations and even in custom-made processors. Examples of reconfigurable architectures coupled more or less tightly with a classic processor are Chimaera, PRISC, OneChip, Chameleon and MOLEN.
Some research groups developed systems that acted like independent processors but still needed some external control, for example RaPiD (Reconfigurable Pipelined Datapaths). RaPiD, however, suffered from limited bandwidth and performed well only in computationally intensive applications. The PipeRench architecture is another example of this kind of system. It was able to execute some applications independently, but still needed a host processor for most of them. The architecture was composed of individually configurable pipeline stages: the executed application was first compiled into a set of virtual stages and then mapped onto physical stages in the chip. Active research is currently conducted on Application-Specific Instruction-set Processors (ASIPs), in which the pipeline structure can be customized and used by the program through custom instructions. An extension of this approach is the No-Instruction-Set Computer (NISC). The NISC compiler maps the application directly onto the datapath, and the pipelined datapath structure remains constant throughout the entire execution of the program. Nevertheless, the NISC compiler can obtain better parallelism and resource utilization than conventional instruction-set based compilers; this approach has allowed speedups of up to 70% compared to an instruction-set based compiler. Integrating processor cores with reconfigurable resources raises the question of an operating system (OS) dedicated to such solutions. BORPH, a slightly modified UNIX OS running on the PowerPC of a Virtex device, allows programmers to treat tasks implemented in the reconfigurable hardware in the same way as software tasks running on the PowerPC. Nevertheless, developing a new application or porting an existing one to use such a powerful architecture efficiently requires a quite low-level programming approach, which is far from current standards and trends.
Although none of the above-mentioned architectures achieved commercial success, reconfigurable computing remains a field of thorough research and commercial interest. It is worth emphasizing that in all of these architectures computation was distributed between a sequential processor and an FPGA coprocessor. The solution proposed in the RGPP increases the flexibility of the system by executing all the code exclusively on dynamically reconfigurable fabric.

Objective

This proposal presents a completely new architecture of a general-purpose processor, the RGPP (Reconfigurable General-Purpose Processor), which does not have any fixed instructions and is based on a Dynamically Reconfigurable Field Programmable Gate Array (DFPGA). The processor is divided into several independently reconfigurable blocks called hardware partitions. The software code is partitioned and translated into an optimized hardware representation instead of a predefined (constant) set of assembly instructions. The proposed processor is reconfigured on the fly to match the needs of the part of the program that is currently being executed. This alternative to a multi-core architecture allows dynamically controlled exploitation of fine-grained parallelism in standard software programs. The proposed architecture can be seamlessly integrated into standard off-the-shelf computing hardware systems. The successful implementation of such an approach to data processing may cause a revolution in the domain of information and communication technologies (ICT) once providers of personal and mobile computer platforms, such as Intel and AMD, recognize the features of the proposed solution. The RGPP combines, in an unprecedented way, the idea of an application-specific circuit with hardware virtualization achieved by means of self-managed dynamic reconfiguration.
Besides the development of a novel DFPGA architecture, the RGPP project will deliver a new software-to-hardware conversion technique, a dedicated scheduling technique and other algorithms supporting efficient reconfigurable computing.

Feasibility study

The crucial question is: can a reconfigurable chip really achieve higher performance than a standard general-purpose processor? At first sight it seems improbable, as standard CPUs operate at a much higher frequency and do not suffer from reconfiguration overhead. However, there are two main sources of speedup in the RGPP which may compensate for the lower operating frequency and for the time needed for allocation (reconfiguration):

  • on the RGPP it should be possible to execute tasks faster on average (in terms of clock cycles) than on a CPU, thanks to configurations optimized for each task
  • on the RGPP it should be possible to execute more tasks simultaneously than on a CPU with comparable silicon area

Evaluating RGPP performance and comparing it with a conventional CPU is very difficult because the result depends on many parameters of the tasks to be executed and on the manner in which both processors operate. Therefore, in order to roughly estimate the crucial parameters characterizing RGPP operation (reconfiguration time, maximal operating frequency, average resource usage per task, etc.), a preliminary version of a custom configurable cell was proposed. An architecture based on this cell was compared with a six-core processor. For this purpose dedicated software was developed, which simulates, at an abstract level, the execution of tasks on both the CPU and the RGPP. The software takes several architectural parameters as input, simulates the execution of a randomly generated set of tasks and computes the average total execution time. A sample result of our study is presented in Figure 1.

Figure 1. RGPP vs. CPU comparison.
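
The kind of abstract comparison described above can be illustrated with a small sketch. It is not the project's simulator and does not reproduce Figure 1: the greedy scheduling policy, the task-length distribution, the `reconfig_cycles` parameter and all function names are assumptions made for illustration. Only the six-core CPU, the 10x clock ratio, the cycle speedup k = 4 and the 20 concurrent tasks come from the text.

```python
import random

def makespan(durations, units):
    """Greedy scheduling: each task goes to the unit that becomes free first."""
    finish = [0.0] * units
    for d in durations:
        i = finish.index(min(finish))
        finish[i] += d
    return max(finish)

def cpu_time(task_cycles, cores, clock):
    """Total time on a multi-core CPU; clock is in cycles per time unit."""
    return makespan([c / clock for c in task_cycles], cores)

def rgpp_time(task_cycles, slots, clock, k, reconfig_cycles=0):
    """Total time on the RGPP model.

    k               -- speedup in clock cycles from task-specific hardware
    reconfig_cycles -- hypothetical fixed reconfiguration cost per task
    """
    return makespan([(reconfig_cycles + c / k) / clock for c in task_cycles], slots)

def random_tasks(n, seed=0, lo=1_000, hi=100_000):
    """Randomly generated task lengths in CPU clock cycles (assumed range)."""
    rng = random.Random(seed)
    return [rng.randint(lo, hi) for _ in range(n)]

tasks = random_tasks(1000)
ratio = rgpp_time(tasks, slots=20, clock=1.0, k=4) / cpu_time(tasks, cores=6, clock=10.0)
print(f"RGPP time / CPU time = {ratio:.2f}")
```

A ratio below 1 means the RGPP model finishes the task set sooner. Raising `reconfig_cycles` shows how quickly the reconfiguration overhead can consume the advantage gained from parallelism and cycle-level speedup.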

It was assumed that a classic processor is clocked 10 times faster than the RGPP. The chart shows (see point A) that if we manage to achieve a 4x speedup in terms of clock cycles (k) in our RGPP chip and have enough resources to execute 20 tasks simultaneously on average, then the RGPP performance will be approximately equal to that of a CPU. If a higher level of parallelism or a higher speedup is achieved, the RGPP will outperform classic processors. Preliminary tests of the proposed reconfigurable cell indicate that achieving this goal is challenging but feasible. Moreover, bearing in mind current publications on the energy consumption of reconfigurable devices, an RGPP-like architecture is expected to be competitive in this field as well. Therefore, the idea of reconfigurable computing as an alternative to the general-purpose processor architectures used nowadays is worth pursuing, and more in-depth studies are needed to further evaluate its usefulness. The RGPP approach has the potential to solve some of today’s problems with nanometer-generation multi-core sequential processors. The main problem that engineers and programmers need to cope with is extracting parallelism. Since increasing the clock frequency became impractical due to its impact on power consumption, processor performance has been increased by integrating more and more cores on a single chip. However, it is known that multi-core processors with more than eight cores will be used inefficiently, because standard applications cannot be parallelized efficiently enough to make use of all the cores. With the RGPP this bottleneck should be eliminated in a natural way. Furthermore, the proposed approach to processor architecture can increase energy efficiency thanks to a fine-grained structure that allows flexible resource usage and the deactivation of idle partitions. Another problem of today’s VLSI chips is the increasing rate of physical defects.
The RGPP architecture, with its ability to reconfigure, allows the implementation of a natural fault-tolerance mechanism. Moreover, defect-tolerance techniques are possible at very fine granularity. Without going into technical details, this mechanism is based on the idea that a detected defective part is simply not used when the device is configured and, consequently, the chip operates correctly. Research and development of reconfigurable systems with their inherent fault-tolerance ability may help to find an efficient long-term solution for increasing chip manufacturing yield and contribute to the further advance of microelectronics. Moreover, environmental waste could be reduced by accepting as operational integrated circuits with some faulty partitions. Because the number of partitions is large, not all cells are required for normal operation of the processor. Faulty cells can be invalidated in a similar way as in currently produced FLASH memories. To conclude, the RGPP project results may provide milestones that help in the development of self-contained reconfigurable systems. In the future, owing to their flexibility, easy fault tolerance and ability to exploit parallelism naturally, such systems may become a ground-breaking alternative to mainstream sequential processors. The successful completion of the project can strengthen the reconfigurable computing research network across Europe and ignite a worldwide industrial revolution.
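
The defect-avoidance idea mentioned in this section (a detected defective part is simply not used when the device is configured) can be sketched in a few lines. This is only an illustration under assumed names: `place_bitstreams`, the integer partition indices and the example bitstream labels are hypothetical and not part of the proposal.

```python
def place_bitstreams(bitstreams, num_partitions, defective):
    """Assign each bitstream to a distinct healthy partition.

    defective is a set of partition indices marked as faulty, comparable
    to a bad-block map; those partitions are never used for placement.
    """
    healthy = [p for p in range(num_partitions) if p not in defective]
    if len(bitstreams) > len(healthy):
        raise RuntimeError("not enough healthy partitions")
    return dict(zip(bitstreams, healthy))

# Partitions 1 and 4 are assumed to be defective and are skipped.
placement = place_bitstreams(["fft", "aes", "io"], num_partitions=6, defective={1, 4})
print(placement)
```

The chip remains usable as long as the number of healthy partitions covers the functionality that must be resident at the same time.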

Novelty and foundational character

The project aims to define and prototype a non-conventional NISC (No-Instruction-Set Computer) processor architecture, which implements a paradigm very different from any of the currently prevalent architectures such as CISC (Complex-Instruction-Set Computer), RISC (Reduced-Instruction-Set Computer) or VLIW (Very Long Instruction Word). Numerous attempts to develop a similar architecture, none of which resulted in wide adoption, indicate that the research objective is valid and significant with respect to the state of the art. The processing power of NISC is believed to significantly exceed that of currently available processors based on the von Neumann or Harvard architecture. To the best of our knowledge, the RGPP project is the first attempt to design a general-purpose processor on the basis of fully reconfigurable hardware, without using any predefined hardware structures such as datapaths. The proposed solution is an FPGA-based implementation of a general-purpose processor in which there are no sequentially executed instructions; instead, all tasks are implemented in reconfigurable hardware in a completely automatic way. The whole array of configurable blocks can be modified at runtime, except for a small part responsible for reconfiguration management. The reconfigurable nature of the RGPP allows the general-purpose processor to adapt itself to the currently running software, thus making maximal use of the underlying hardware, with advantages such as inherent parallelisation. The RGPP thus exploits the parallelism of data processing and the resource reuse of DFPGAs, without placing any limits on the development of new applications or the use of existing ones.

S/T methodology

There are several technical issues that need to be addressed during the project realization. Some of them are briefly described below.

Concurrency

Since the proposed processor is intended to be general purpose, it should be able to execute any user application. Note that the processor will be a kind of dynamically reconfigurable FPGA without a typical hard or soft processor core, unlike most custom reconfigurable systems. Applications to be executed should be stored in memory as a set of bitstreams, which provide the required functionality when loaded into the DFPGA in the correct sequence. The same must apply to the operating system, shared libraries, interrupt handlers, drivers, etc. Each bitstream must be loaded into the DFPGA on demand, when the operations corresponding to it are to be performed now or in the near future. When the operations performed by a given part of the DFPGA are finished, this part can be configured with a different bitstream providing new functionality. This means that the DFPGA should be partially reconfigurable and should support very efficient reconfiguration techniques. Devoting a part of the FPGA to each thread allows many threads to run concurrently, without time division (Figure 2). A similar acceleration is obtained in traditional multi-core processors, but the proposed approach is much more flexible and scalable.

Figure 2. Concurrent thread executions.

To manage thread execution (in other words, to load the required bitstreams into the proper FPGA partitions) a special Control Unit is required. This block is the only static module in the FPGA: its configuration cannot be modified at runtime. The required dependencies between bitstreams should be determined during the bitstream set generation process and stored in memory together with these bitstreams. The Control Unit may suspend a thread's execution only if there is no room to load the new bitstream.
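
The behaviour of the Control Unit described above can be sketched as a simple software model. This is an illustration under stated assumptions, not the project's design: the class and method names are hypothetical, partitions are modelled as equally sized slots, and the first-in-first-out policy for resuming suspended threads is an assumption. The only rule taken from the text is that a thread is suspended solely when there is no room for its bitstream.

```python
from collections import deque

class ControlUnit:
    """Toy model of the static block that loads bitstreams into partitions."""

    def __init__(self, num_partitions):
        self.free = list(range(num_partitions))  # indices of empty partitions
        self.loaded = {}                         # partition index -> bitstream id
        self.suspended = deque()                 # (thread id, bitstream id) waiting for room

    def request(self, thread_id, bitstream_id):
        """Load a bitstream for a thread; suspend the thread only if no partition is free.

        Returns the partition index, or None if the thread was suspended.
        """
        if not self.free:
            self.suspended.append((thread_id, bitstream_id))
            return None
        slot = self.free.pop(0)
        self.loaded[slot] = bitstream_id
        return slot

    def release(self, slot):
        """Called when the operations in a partition have finished.

        The partition is reconfigured for the oldest suspended thread, if any.
        Returns the id of the resumed thread, or None.
        """
        del self.loaded[slot]
        if self.suspended:
            thread_id, bitstream_id = self.suspended.popleft()
            self.loaded[slot] = bitstream_id
            return thread_id
        self.free.append(slot)
        return None

cu = ControlUnit(num_partitions=2)
cu.request("t1", "bs_a")           # placed in partition 0
cu.request("t2", "bs_b")           # placed in partition 1
waiting = cu.request("t3", "bs_c") # no room, so t3 is suspended
resumed = cu.release(0)            # partition 0 is reused for t3
```

A real Control Unit would also have to account for the stored dependencies between bitstreams and for partitions of different sizes, which this sketch leaves out.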

Plasticity

The functionality of most threads is too complex to be implemented directly in the DFPGA. Therefore they must be split into a sequence of smaller pieces, called partitions. The partitions can in turn be composed of several sub-units, called operations, optionally finished with a jump. The exact definition of an operation is an open issue and will be worked out experimentally. To obtain satisfactory flexibility, it should be possible to place the partitions in any part of the FPGA. This means that the current commercially available reconfiguration techniques cannot be used efficiently for this purpose because of their limitations (e.g., the techniques provided by Xilinx require the dynamic partition to be placed in a predefined area corresponding to one clock region). Therefore, the project will design a completely new custom reconfigurable architecture that provides the necessary reconfiguration flexibility and is dedicated to the specific project requirements.
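
The thread, partition and operation hierarchy described above can be sketched as a data structure with a simple splitting routine. Since the text leaves the definition of an operation open, everything here is an assumption for illustration: the `cells` resource cost, the `capacity` of a partition, the greedy splitting rule and all names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Operation:
    name: str
    cells: int                     # hypothetical resource cost in configurable cells
    jump_to: Optional[str] = None  # target label if this operation is a jump

@dataclass
class Partition:
    ops: List[Operation] = field(default_factory=list)

    def cells(self) -> int:
        return sum(op.cells for op in self.ops)

def split_thread(ops, capacity):
    """Greedily split a thread into partitions that fit into `capacity` cells.

    A jump always closes the current partition, so each partition is a run of
    operations optionally finished with a jump.
    """
    partitions, current = [], Partition()
    for op in ops:
        if op.cells > capacity:
            raise ValueError(f"operation {op.name} does not fit into a partition")
        if current.ops and current.cells() + op.cells > capacity:
            partitions.append(current)
            current = Partition()
        current.ops.append(op)
        if op.jump_to is not None:
            partitions.append(current)
            current = Partition()
    if current.ops:
        partitions.append(current)
    return partitions

thread = [
    Operation("a", 3), Operation("b", 3), Operation("c", 3),
    Operation("j", 1, jump_to="L1"), Operation("d", 2),
]
parts = split_thread(thread, capacity=6)
names = [[op.name for op in p.ops] for p in parts]
```

With a capacity of 6 cells, operations `a` and `b` fill the first partition, `c` and the jump `j` form the second, and `d` is left for a third.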

Impact on software development

The crucial point of the proposed general-purpose reconfigurable processor is the automatic transformation of any software operation into a dedicated, optimized hardware partition; this optimisation constitutes the main source of potential processing speedup. Assuming that the thread to be partitioned is specified in a high-level programming language, e.g. C++, it can be converted to a set of corresponding bitstreams in different ways. There are many projects based on C++ to HDL conversion, but since this problem is complex, their results are still limited to subsets of the C++ standard and are therefore not suitable for a general-purpose processor implementation. Our novel approach is composed of three phases: conversion of C++ to an intermediate code, partitioning and, finally, synthesis. Such a method, based on an intermediate program representation, seems to be much more efficient. Its main advantage is the use of advanced front-end tools, such as those existing in the GNU Compiler Collection (GCC), to reduce complex high-level programming structures to simple intermediate code. A new processor architecture would be incomplete without a dedicated operating system (OS). Traditionally, an operating system is responsible for memory management, interrupts and hardware control, scheduling and inter-process communication (IPC). Many operating systems share a common programming interface, which makes it possible to run an application written for one OS on a different OS. The best-known operating system interface is POSIX (Portable Operating System Interface), derived from the Unix family. The operating system for the NISC processor should also provide this interface to make application development independent of the underlying system architecture. The proposed platform differs drastically from platforms based on commonly used processors (RISC or CISC), so the architecture of its OS will be very different from those already known.
The required OS will not manage applications written for a sequential processor but will load the bitstreams corresponding to applications. This changes the way in which programs are run and how the OS manages them. The OS has to manage partition bitstreams so that all required functionality (implemented in particular bitstreams) is present at a given time. Hence, scheduling is more complex than on a typical RISC or CISC processor. Therefore, the OS needs to provide explicit functionality for:

  • loading and unloading partition bitstreams
  • scheduling partitions to fully utilize the reprogrammable array
  • arranging the communication channels between partitions if needed
  • serializing memory accesses
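
One ingredient of the loading and scheduling functionality listed above can be sketched as follows. The sketch assumes one possible reading of the requirement that all required functionality be present at a given time: a bitstream may be loaded only after the bitstreams it depends on. The function name `load_order`, the dependency map and the example bitstream names are hypothetical. The ordering uses a standard topological sort (Kahn's algorithm), which is a swapped-in textbook technique and not a method stated in the proposal.

```python
from collections import deque

def load_order(depends_on):
    """Return a load order in which every bitstream follows its dependencies.

    depends_on maps a bitstream id to the set of bitstream ids that must
    already be resident. Raises ValueError on a cyclic dependency.
    """
    nodes = set(depends_on)
    for deps in depends_on.values():
        nodes |= set(deps)
    remaining = {n: set(depends_on.get(n, ())) for n in nodes}
    # Bitstreams without unmet dependencies can be loaded immediately.
    ready = deque(sorted(n for n, deps in remaining.items() if not deps))
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in sorted(remaining):
            if n in remaining[m]:
                remaining[m].discard(n)
                if not remaining[m]:
                    ready.append(m)
    if len(order) != len(nodes):
        raise ValueError("cyclic dependency between bitstreams")
    return order

deps = {
    "app_main": {"libc_io", "driver_uart"},
    "libc_io": {"driver_uart"},
    "driver_uart": set(),
}
order = load_order(deps)
```

In this example the driver is loaded first, then the library that uses it, and the application partition last. A real scheduler would additionally have to consider partition capacity, unloading and communication channels.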

Challenges

As there are no solutions that might be considered the ultimate NISC architecture, the actual work carried out during the project will be exploratory and risky in character. The outcome of the work will detail the completely new architecture and lay out a formal description of the resulting system behaviour. Given the extent of the expected functionality, which should in effect allow the RGPP to be used in general-purpose applications, the research scope goes beyond any of the currently running projects. Due to the complexity of the processor, efficient emulation methods need to be defined, which will require an innovative approach in order to make system simulations feasible for functional verification. General-purpose applications of the RGPP require special handling both at the compiler level, which creates the program in a form suitable for execution on the processor, and at the OS level, which traditionally provides facilities and methods for accessing and sharing the system (main processor and devices). The work on the compiler will focus on methods for efficiently translating a given high-level program into an applicable bitstream, a problem which, due to the conceptual level of execution (direct execution on an FPGA rather than on a CPU that hides the details of the processor implementation), has not been widely explored. Despite its risky character, the proposed research has the potential to change current thinking about general-purpose processors.

The interdisciplinary team

The project idea constitutes a fresh mixture of insights from various disciplines. The project scope includes the development of software tools (compilers, OS, emulators) together with a silicon demonstrator that will exercise some aspects of dynamic reconfiguration. The project team consists of experts from different fields (reconfigurable computing, processor architectures, VLSI design, high-level hardware compilers, programming languages, operating systems, etc.), which, in connection with the specific work plan, ensures a high level of synergy and increases the overall efficiency of the group. Team members involved in different work packages at the same time get a unique opportunity to exchange ideas at different stages of the project. In particular, experience from hardware implementation or system architecture can contribute significantly to the development of the dedicated operating system and the compiler. This course of action is expected to result in a more robust and reliable system. The complementarity of the team should make it possible to apply cutting-edge VLSI techniques to advanced dynamically reconfigurable architectures, wrapped in a programming interface that allows programmers to easily exploit the high-performance computing capabilities offered by the reconfigurable platform. The project touches several research areas of general interest (software-to-hardware conversion, parallelisation of calculations, efficient dynamic reconfiguration, and a dedicated operating system). Therefore, even if the final goal is not achieved or some partial results are not satisfactory, the project can produce many interesting outcomes and conclusions for the future.