Polish-Swiss Research Programme 2010

Proposal details

Justification of the project’s realisation

A few years ago the maximum processor clock frequency levelled off at about 3.5 GHz and has not increased since. Advances in processor performance are now achieved through architectural changes. Dual-core processors are commonplace, and multicore architectures are gaining an ever larger market share. At the same time, portable devices are becoming more and more powerful, with dual-core ARM processors now entering the market, but energy efficiency remains the primary factor in their usefulness. The energy dissipated per unit of work performed is the figure of merit in such applications. This proposal presents a new general-purpose processor architecture that has no fixed instructions and is based on a Dynamically Reconfigurable Field Programmable Gate Array (DFPGA). It is an alternative to the multicore approach, allowing dynamically controlled exploitation of fine-grained parallelism in standard software programs. Fine-grained resource management could also reduce energy consumption by not powering partitions that are currently unused. The architecture can be seamlessly integrated with standard off-the-shelf computing hardware. A successful implementation of this approach to data processing may cause a revolution in the domain of information and communication technologies (ICT) once vendors in the personal and mobile computer market, such as Intel and AMD, recognise the powerful features of the proposed solution. The ICT sector constitutes a very important part of the European Union economy. According to the 2010 report on R&D in ICT in the European Union [1], it is a major sector in terms of R&D costs and labour productivity and plays a significant role in the overall EU economy.
In the case of Poland, taking into account the manufacturing sub-sector of ICT, one can observe a significant increase in employment over the last few years (up to 4.2% of the EU total). However, other figures of merit make it clear that the development of countries such as Poland, Hungary and the Czech Republic is based on rather lower-end activities. Projects like REuP are expected to contribute substantially to increasing the inventiveness of the Polish economy by stimulating the creation of R&D centres and attracting foreign capital to this field. The realization of this project will strengthen the links between the Technical University of Lodz and the Haute Ecole d'Ingénierie et de Gestion du Canton de Vaud (HEIG-VD) established during the successful realization of the EU FP6 project PERPLEXUS. The results of the REuP project can provide a solid foundation for future research in the domain of reconfigurable computing.

[1] G. Turlea, D. Nepelski, G. de Prato, S. Lindmark, A. de Panizza, L. Picci, P. Desruelle, D. Broster, "The 2010 Report on R&D in ICT in the European Union", ISBN 978-92-79-15542-0, European Union, 2010 (available from http://ftp.jrc.es/EURdoc/JRC57808.pdf)

Scientific Quality

Since dynamic reconfiguration techniques were invented and became available in commercial FPGAs, various concepts for using them to accelerate many types of computation have appeared. Dynamically reconfigurable resources were initially used as stand-alone coprocessors for demanding computations such as image processing and data encryption. Later, the development of nanometer CMOS technologies made it possible to implement sequential processors together with dynamically reconfigurable resources on a single die (system on chip). This technological progress enabled many different concepts that utilize dynamically reconfigurable resources in computations and even in custom-made processors [1], [2], [13], [14]. Examples of reconfigurable architectures coupled more or less tightly with a classic processor are Chimaera [7], PRISC [8], OneChip [9], Chameleon [10] and MOLEN [12]. Some research groups developed systems that acted like independent processors but still needed some external control, e.g. RaPiD (Reconfigurable Pipelined Datapaths) [11]. However, RaPiD suffered from limited bandwidth and performed well only in computationally intensive applications. The PipeRench [5] architecture is another example of this kind of system. It was able to execute some applications independently, but still needed a host processor for most of them. The architecture was composed of individually configurable pipeline stages. The executed application was first compiled into a set of virtual stages and then mapped onto physical stages in the chip. Active research is currently being carried out in the domain of Application-Specific Instruction-set Processors (ASIPs), in which the pipeline structure can be customized and used in the program through custom instructions. An extension of this approach is the No-Instruction-Set Computer (NISC), whose compiler maps the application directly to the datapath [4].
The pipelined datapath structure remains constant throughout the entire execution of the program. Nevertheless, the NISC compiler can obtain better parallelism and resource utilization than conventional instruction-set-based compilers. Although none of the above-mentioned architectures achieved commercial success, reconfigurable computing remains a field of both research and commercial interest. It is worth emphasizing that in all of these architectures the computations were distributed between a sequential processor and an FPGA coprocessor. The solution proposed in REuP increases the flexibility of the system by executing all code exclusively on the dynamically reconfigurable fabric. The control module therefore needs to be responsible only for managing the bitstreams and the configuration of partitions, and it remains the only static part of the system. REuP is a direct continuation of the NISC idea, extended with the possibility of dynamic datapath modification and hardware virtualization achieved by means of self-managed dynamic reconfiguration [5]. Besides the development of a novel DFPGA architecture, the REuP project will provide new compilation techniques, specialized physical synthesis tools and dedicated scheduling algorithms. Another problem of today’s VLSI chips is the increasing rate of defects. The REuP architecture and its reconfiguration ability allow a natural fault-tolerance mechanism to be implemented at fine granularity. Defective partitions are simply not used when the device is configured and, consequently, the program operates correctly. Research on and development of reconfigurable systems with inherent fault tolerance may help to find an efficient long-term solution to chip defects [6]. The integration of processor cores with reconfigurable resources raises issues regarding an operating system (OS) dedicated to such solutions.
In [3], a slightly modified UNIX OS running on the PowerPC of a Virtex FPGA allows programmers to treat tasks implemented in the reconfigurable hardware in the same way as software tasks running on the PowerPC. Developing a new application, or porting an existing one, to use such a powerful architecture efficiently requires a rather low-level programming approach, which is far from current standards and trends. This is one of the most important challenges that will have to be overcome. The results of the REuP project (C-to-VHDL compiler, processor architecture with custom dynamic reconfiguration techniques, physical synthesis tool, dedicated OS, system emulator) may become a milestone in the development of self-contained reconfigurable systems. In the future, thanks to their flexibility, fault tolerance and ability to exploit parallelism naturally, such systems may become a revolutionary alternative to mainstream sequential processors.
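
The fault-tolerance idea described above, in which defective partitions are simply skipped at configuration time, can be illustrated with a minimal Python sketch. It is an illustration only: the `Region` model, the `place` function, the first-fit policy and the partition names are hypothetical and are not part of the REuP design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Region:
    """One reconfigurable region of the fabric (hypothetical model)."""
    index: int
    defective: bool = False          # assumed to be set by some self-test
    occupant: Optional[str] = None   # name of the partition loaded here


def place(regions: list[Region], partition: str) -> Optional[int]:
    """Load `partition` into the first healthy, free region (first fit).

    Returns the region index, or None when no usable region is free.
    """
    for region in regions:
        if region.defective or region.occupant is not None:
            continue  # defective or busy regions are skipped
        region.occupant = partition
        return region.index
    return None


# Three regions, the middle one marked defective.
fabric = [Region(0), Region(1, defective=True), Region(2)]
first = place(fabric, "fir_filter")
second = place(fabric, "checksum")
third = place(fabric, "sort")
print(first, second, third)  # 0 2 None
```

In this toy model the defective region is never configured, so the two placed partitions run on healthy regions and the third request has to wait until a region is freed.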

[1] M. Chmiel, J. Mocha, D. Kania, E. Hrynkiewicz, "Dynamic partial reconfiguration of CPU-s for Programmable Logic Controllers executing control programs developed in the Ladder Diagram language", DESDes 2009, Valencia, Spain, October 2009
[2] S. Banerjee, E. Bozorgzadeh, N. Dutt, "Exploiting Application Data-Parallelism on Dynamically Reconfigurable Architectures: Placement and Architectural Considerations", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, pp. 234-247, San Francisco, CA, USA, February 2009
[3] H. K.-H. So, R. Brodersen, "A unified hardware/software runtime environment for FPGA-based reconfigurable computers using BORPH", Transactions on Embedded Computing Systems, vol. 7(2), pp. 1-28, February 2008
[4] M. Reshadi, B. Gorjiara, D. Gajski, "Utilizing Horizontal and Vertical Parallelism with a No-Instruction-Set Compiler for Custom Datapaths", Proceedings of the 2005 International Conference on Computer Design, pp. 69-76, San Jose, CA, USA, October 2005
[5] H. Kagotani, H. Schmit, "Asynchronous PipeRench: Architecture and Performance Evaluations", Proceedings of the 11th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’03), Napa, CA, USA, April 2003
[6] J. H. Collet, M. Psarakis, P. Zając, D. Gizopoulos, A. Napieralski, "Comparison of Fault-Tolerance Techniques for Massively Defective Fine- and Coarse-Grained Nanochips", 16th International Conference Mixed Design of Integrated Circuits and Systems MIXDES 2009, pp. 23-30, Lodz, Poland, 25-27 June 2009
[7] Z. A. Ye, A. Moshovos, S. Hauck, P. Banerjee, "CHIMAERA: A high-performance architecture with a tightly coupled reconfigurable functional unit", Proceedings of the 27th International Symposium on Computer Architecture, June 2000
[8] R. Razdan, M. Smith, "A high-performance microarchitecture with hardware-programmable functional units", Proceedings of the 27th Annual IEEE/ACM International Symposium on Microarchitecture, November 1994
[9] J. E. Carrillo, E. P. Chow, "The effect of reconfigurable units in superscalar processors", Proceedings of the Ninth ACM International Symposium on Field-Programmable Gate Arrays, February 2001
[10] D. Wilson, "Chameleon takes on FPGAs, ASICs", Electronic Business Asia, EDN Online Magazine (http://www.edn.com/article/CA50551.html?partner=enews), October 2000
[11] D. Cronquist, P. Franklin, C. Fisher, M. Figueroa, C. Ebeling, "Architecture design of reconfigurable pipelined datapaths", Proceedings of the 20th Anniversary Conference on Advanced Research in VLSI, March 1999
[12] S. Vassiliadis, S. Wong, G. Gaydadjiev, K. Bertels, G. Kuzmanov, E. M. Panainte, "The MOLEN polymorphic processor", IEEE Transactions on Computers, vol. 53, no. 11, pp. 1363-1375, Nov. 2004
[13] C. Paiz, T. Chinapirom, U. Witkowski, M. Porrmann, "Dynamically Reconfigurable Hardware for Autonomous Mini-Robots", Conference on IEEE Industrial Electronics, IECON 2006 - 32nd Annual, pp. 3981-3986, 6-10 Nov. 2006
[14] Zexin Pan, B. Earl Wells, "Hardware Supported Task Scheduling on Dynamically Reconfigurable SoC Architectures", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 11, pp. 1465-1474, Nov. 2008

Innovative character

The project aims to define and prototype the architecture of a NISC processor, which implements a paradigm very different from any of the currently prevalent architectures such as CISC, RISC or VLIW. Numerous attempts to develop a similar architecture, none of which resulted in wide adoption, show that the research objective is valid and significant with respect to the state of the art. To our knowledge, the REuP project is the first attempt to design a general-purpose processor on the basis of fully reconfigurable hardware, without using any predefined hardware structures such as datapaths. The proposed solution is an FPGA-based implementation of a general-purpose processor in which there are no sequentially executed instructions; instead, all tasks are implemented in reconfigurable hardware in a completely automatic way. The whole array of configurable blocks can be modified at runtime, except for a small part responsible for reconfiguration management. The reconfigurable nature of REuP allows the processor to adapt itself to the particular software being run, thus making maximal use of the underlying hardware, with advantages such as inherent parallelisation. REuP thereby exploits the parallelism of data processing and the resource reuse of DFPGAs without any impact on the development of new applications or on existing ones. Since the proposed processor is intended to be general-purpose, it should be able to execute any user application. Note that the processor will be a kind of fully reconfigurable FPGA, without the hard or soft processor core typical of most custom reconfigurable systems. Applications to be executed should be stored in memory as a set of bitstreams that provide the required functionality when loaded into the FPGA in the correct sequence. The same must apply to the operating system, shared libraries, interrupt handlers, drivers, etc.
Each bitstream must be loaded into the FPGA on demand, when the operations corresponding to it are to be performed immediately or in the near future. When the operations performed by a given part of the FPGA are finished, this part can be configured with a different bitstream providing new functionality. This means that the FPGA should be partially reconfigurable and should support very efficient reconfiguration techniques. Devoting a part of the FPGA to each thread allows many threads to run concurrently, without time division. A similar acceleration is obtained in traditional multi-core processors, but the proposed approach is much more flexible and scalable. The functionality of most threads is too complex to be implemented directly in the DFPGA. It must therefore be split into a sequence of smaller pieces called partitions. The partitions can in turn be composed of several sub-units called operations, optionally finished with a jump. The exact definition of an operation is an open issue and will be worked out experimentally. To obtain satisfactory flexibility, it should be possible to place the partitions in any part of the FPGA. This means that current commercially available reconfiguration techniques, provided e.g. by Xilinx, cannot be used efficiently for this purpose. These techniques require the dynamic partition to be placed in a predefined area, surrounded by special communication blocks (Bus Macros) that provide a predefined I/O interface. Therefore, a completely new custom reconfigurable architecture providing the necessary reconfiguration flexibility will be designed. The crucial point of the proposed general-purpose reconfigurable processor is the automatic transformation of any software operation into a hardware partition. Assuming that the thread to be partitioned is specified in a high-level programming language, e.g. C++, it can be converted to a set of corresponding bitstreams in different ways.
There are many projects based on C++ to HDL conversion [3][4], but since this problem is complex, their results are still limited to subsets of the C++ standard and are therefore not suitable for a general-purpose processor implementation. Our novel approach is composed of three phases: conversion of C++ to an intermediate code, partitioning, and finally synthesis. Such a method, based on an intermediate program representation, appears to be much more efficient. The main advantage is the use of advanced front-end tools, such as those in the GNU Compiler, to reduce complex high-level programming structures to simple intermediate code. A new processor architecture would be incomplete without a dedicated operating system. Traditionally, an operating system is responsible for memory management, interrupts and hardware control, scheduling and inter-process communication (IPC). Many operating systems share a common programming interface, which makes it possible to run an application written for one OS on a different OS. The best-known operating system interface is POSIX (Portable Operating System Interface), derived from the Unix family. The operating system for the instructionless processor, called BitstreamOS, should also provide this interface to make application development independent of the underlying system architecture. The proposed platform differs significantly from platforms based on commonly used processors (RISC or CISC), so the architecture of BitstreamOS will differ from that of known operating systems. BitstreamOS will not manage applications written for a sequential processor but will load the bitstreams corresponding to applications. This changes the way in which programs are run and how BitstreamOS manages them.
BitstreamOS has to manage partition bitstreams so that all required functionality (implemented in particular bitstreams) is present at a given time. Hence, scheduling is more complex than on a typical RISC or CISC processor. BitstreamOS therefore needs to provide explicit functionality for:

loading and unloading partition bitstreams
scheduling partitions to fully utilize the reprogrammable array
arranging the communication channels between partitions if needed
serializing memory accesses
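
The first item of this list, loading and unloading partition bitstreams on demand, might look roughly like the following Python sketch. It is only an illustration under stated assumptions: the `PartitionManager` class, the fixed number of equally sized slots and the least-recently-used eviction policy are hypothetical choices made for the example, not decisions taken in the project. Scheduling, communication channels and memory serialization are not modelled.

```python
from collections import OrderedDict


class PartitionManager:
    """Toy model of on-demand bitstream loading (all names hypothetical)."""

    def __init__(self, slots: int, store: dict[str, bytes]):
        self.slots = slots            # number of reconfigurable regions
        self.store = store            # partition name -> bitstream in memory
        self.loaded: OrderedDict[str, bytes] = OrderedDict()
        self.reconfigurations = 0     # how many bitstream loads happened

    def request(self, name: str) -> bytes:
        """Make partition `name` resident and return its bitstream."""
        if name in self.loaded:
            self.loaded.move_to_end(name)     # mark as most recently used
            return self.loaded[name]
        if len(self.loaded) >= self.slots:
            self.loaded.popitem(last=False)   # unload least recently used
        self.loaded[name] = self.store[name]  # "configure" a free region
        self.reconfigurations += 1
        return self.loaded[name]


# Placeholder bitstreams: the bytes are just the encoded partition names.
store = {name: name.encode() for name in ("main", "parse", "fft", "print")}
manager = PartitionManager(slots=2, store=store)
for step in ["main", "parse", "main", "fft", "print", "main"]:
    manager.request(step)
print(list(manager.loaded), manager.reconfigurations)  # ['print', 'main'] 5
```

With only two slots, five of the six requests trigger a reconfiguration, which illustrates why the text asks for very efficient reconfiguration techniques and for a scheduler that keeps the required partitions resident.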

As no existing solution can be considered the ultimate NISC architecture, the work carried out during the project will have an exploratory character. The outcome of the work will detail the new architecture and lay out a formal description of the resulting system behaviour. The extent of the expected functionality, which in effect shall allow REuP to be used in general-purpose applications, goes beyond that of any currently running project. Due to the complexity of the processor, efficient methods of emulation need to be defined, which will require an innovative approach in order to make system simulations feasible for the verification of functionality. The general-purpose applications of REuP require special handling both at the compiler level, which creates the program in a form suitable for execution on the processor, and at the OS level, which traditionally provides facilities and methods for accessing and sharing the system (main processor and devices). The work on the compiler will focus on methods of efficiently translating a given high-level program into an applicable bitstream, a problem which, due to the conceptual level of execution (direct execution on an FPGA rather than on a CPU, which hides the details of the processor implementation), has not been widely explored.
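
To make the partitioning phase of the compilation flow more concrete, the following Python sketch groups a linear list of intermediate-code operations into partitions, each optionally finished with a jump. It is a simplified illustration under stated assumptions: the textual operation format, the `split_into_partitions` function and the `max_ops` limit (a stand-in for the area available to one partition) are hypothetical, and the exact definition of an operation remains an open issue in the project.

```python
def split_into_partitions(ops: list[str], max_ops: int) -> list[list[str]]:
    """Group intermediate-code operations into partitions.

    A partition is closed when it reaches `max_ops` operations or when
    it ends with a jump (here: any operation starting with "jump").
    """
    partitions: list[list[str]] = []
    current: list[str] = []
    for op in ops:
        current.append(op)
        if op.startswith("jump") or len(current) == max_ops:
            partitions.append(current)
            current = []
    if current:
        partitions.append(current)  # trailing operations without a jump
    return partitions


# Hypothetical three-address intermediate code for a small code fragment.
intermediate = ["t1 = a + b", "t2 = t1 * c", "jump_if t2 > 0 L1",
                "t3 = a - b", "t4 = t3 * t3", "t5 = t4 + 1", "t6 = t5 >> 2"]
parts = split_into_partitions(intermediate, max_ops=3)
print([len(p) for p in parts])  # [3, 3, 1]
```

Each resulting partition would then be passed to the synthesis phase to obtain one bitstream. A real partitioner would also have to account for data dependencies and the actual resource cost of each operation, which this sketch ignores.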
