1980, IEEE Journal of Solid-State Circuits
In the mid-1980s it will be possible to put a million devices (transistors or active MOS gate electrodes) onto a single silicon chip. General trends in the evolution of silicon integrated circuits are reviewed and design constraints for emerging VLSI circuits are analyzed. Desirable architectural features in modern computers are then discussed and consequences for an implementation with large-scale integrated circuits are investigated. The resulting recommended processor design includes features such as an on-chip memory hierarchy, multiple homogeneous caches for enhanced execution parallelism, support for complex data structures and high-level languages, a flexible instruction set, and communication hardware. It is concluded that a viable modular building block for the next generation of computing systems will be a self-contained computer on a single chip. A tentative allocation of the one million transistors to the various functional blocks is given, and the result is a memory-intensive design.
1979
Current trends in the design of general-purpose VLSI chips are analyzed to explore what a truly modular, general-purpose component for digital computing systems might look like in the mid-1980s. It is concluded that such a component would be a complete single-chip computer, in which the hardware for effective interprocessor communication has been designed with the architecture of the overall multiprocessor system in mind. Computation and communication are handled by separate processors in such a manner that both can be performed simultaneously with full efficiency.
In this paper we suggest a different computing environment as a worthy new direction for computer architecture research: personal mobile computing, where portable devices are used for visual computing and personal communications tasks. Such a device supports in an integrated fashion all the functions provided today by a portable computer, a cellular phone, a digital camera, and a video game. The requirements placed on the processor in this environment are energy efficiency, high performance for multimedia and DSP functions, and area-efficient, scalable designs. We examine the architectures that were recently proposed for billion-transistor microprocessors. While they are very promising for stationary desktop and server workloads, we find that most of them are unable to meet the challenges of the new environment and provide the necessary enhancements for multimedia applications running on portable devices. We conclude with Vector IRAM, an initial example of a microprocessor architecture and implementation that matches the new environment.
1974 IEEE International Solid-State Circuits Conference. Digest of Technical Papers, 1974
How can the increasing number of transistors available on a single chip be used effectively while avoiding the wire-delay problem? This is one of the most interesting research questions for the microarchitecture community. We have finally arrived at the point where the time needed for signals to reach the opposite edge of a chip is becoming longer than one clock cycle, which makes it impossible to gain further performance simply by scaling superscalar architectures. One possible solution for using the available transistors efficiently and effectively, while hiding wire delay as much as possible, is to parallelize resource usage through resource clustering and decoupling. For example, using on-chip multiprocessor architectures is the most natural way to increase performance beyond what we can obtain from a single processor core. A generalization of this concept has led to several solutions for chip multiprocessors. The focus of this paper is to review some recent proposals that employ the clustering/tiling paradigm, to different extents, in a comparative fashion, and to highlight their main features and advantages.
2009
Over the past decade, we have witnessed far-reaching changes in the IT field. Semiconductor sales for consumer and communication devices now surpass those for traditional computation. The IT infrastructure is moving away from the desktop and laptop model to centralized servers, communicating with ubiquitously distributed (and often mobile) access devices. Sensor networks and distributed information-capture devices are fundamentally changing the nature of the Internet from download centric to upload rich (see Figure 1). Whereas today a billion mobile phones are sold per year, in the near future perhaps upwards of a trillion sensory nodes per year will be sold and deployed, with the majority of these connected wirelessly. User interfaces and human-machine interactions could become responsible for a large percentage of the computational needs. This has the potential to fundamentally change the ways we interact with and live in this information-rich world. This evolution of the IT platform is bound to have a profound impact on the semiconductor business and its operational models. Although Moore's law will still fuel the development of ever more complex devices at lower cost, the nature of these computational and communication devices will probably be substantially different from what we know today, potentially combining hundreds of processing cores. Moving from the core to the fringes of the network, computational prowess will play a less dominant role, and low-power, small form-factor integration of sensors, communication interfaces, and energy sources will be of the essence. It is safe to presume that the "More than Moore" and "Beyond Moore" paradigms will prevail. [1]
ACM SIGARCH …, 2005
The exponential increase in uniprocessor performance has begun to slow. Designers have been unable to scale performance while managing thermal, power, and electrical effects. Furthermore, design complexity limits the size of monolithic processors that can be designed while keeping costs reasonable. Industry has responded by moving toward chip multiprocessor (CMP) architectures. These architectures are composed of replicated processors utilizing the die area afforded by newer design processes. While this approach mitigates the issues with design complexity, power, and electrical effects, it does nothing to directly improve the performance of contemporary or future single-threaded applications.
IEEE Journal of Solid-state Circuits, 1999
This superscalar microprocessor is the first implementation of a 32-bit RISC architecture specification incorporating a single-instruction, multiple-data vector processing engine. Two instructions per cycle plus a branch can be dispatched to two of seven execution units in this microarchitecture designed for high execution performance, high memory bandwidth, and low power for desktop, embedded, and multiprocessing systems. The processor features an enhanced memory subsystem, 128-bit internal data buses for improved bandwidth, and 32-KB eight-way instruction/data caches. The integrated L2 tag and cache controller with a dedicated L2 bus interface supports L2 cache sizes of 512 KB, 1 MB, or 2 MB with two-way set associativity. At 450 MHz, and with a 2-MB L2 cache, this processor is estimated to have a floating-point and integer performance metric of 20 while dissipating only 7 W at 1.8 V. The 10.5-million-transistor, 83-mm² die is fabricated in a 1.8-V, 0.20-µm CMOS process with six layers of copper interconnect.
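As a rough illustration of the cache geometry quoted above, the short C sketch below computes the set count and address-bit split for a 32-KB eight-way cache and a 2-MB two-way L2. The 32-byte line size is an assumption made only for this example; the abstract does not state it.

#include <stdio.h>
#include <stdint.h>

#define LINE_BYTES 32u   /* assumed line size, for illustration only */

static void describe_cache(const char *name, uint32_t size_bytes, uint32_t ways)
{
    uint32_t lines = size_bytes / LINE_BYTES;   /* total cache lines */
    uint32_t sets  = lines / ways;              /* lines per way = number of sets */

    /* Address bits consumed by the byte offset and the set index. */
    uint32_t offset_bits = 0, index_bits = 0;
    for (uint32_t v = LINE_BYTES; v > 1; v >>= 1) offset_bits++;
    for (uint32_t v = sets; v > 1; v >>= 1) index_bits++;

    printf("%s: %u KB, %u-way -> %u sets, %u offset bits, %u index bits\n",
           name, (unsigned)(size_bytes / 1024), (unsigned)ways,
           (unsigned)sets, (unsigned)offset_bits, (unsigned)index_bits);
}

int main(void)
{
    describe_cache("L1 instruction or data cache", 32u * 1024u, 8u);    /* 32 KB, eight-way */
    describe_cache("L2 (largest option)", 2u * 1024u * 1024u, 2u);      /* 2 MB, two-way */
    return 0;
}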
VLSI Design, 2014
With physical feature sizes in VLSI designs decreasing rapidly, existing efficient architecture designs need to be reexamined. Advanced VLSI architecture designs are required to further reduce power consumption, compress chip area, and speed up operating frequency for high-performance integrated circuits. With time-to-market pressure and rising mask costs in the semiconductor industry, engineering change order (ECO) design methodology plays a central role in advanced chip design. Digital systems such as communication and multimedia applications demand advanced VLSI architecture design methodologies so that low power consumption, small area overhead, high speed, and low cost can be achieved.
2012
This paper starts from the programmer's view and states architectural principles for designing the one-chip many-processor computer. John Backus's and Gary Sabot's old visions about how a parallel computer must be programmed are followed in the context of current technologies. The proposed architectural principles are exemplified with the Connex chip, a 1024-cell SPMD engine able to provide 120 GOPS/W, 6 GOPS/mm², and 60 GOPS/$. Keywords: parallel computation, computer architecture, parallel programming, functional programming, many-cell computer.
Third Caltech Conference on Very Large Scale Integration, 1983
Current VLSI fabrication technology makes it possible to design a 32-bit CPU on a single chip. However, to achieve high performance from that processor, the architecture and implementation must be carefully designed and tuned. The MIPS processor incorporates some new architectural ideas into a single-chip, nMOS implementation. Processor performance is obtained by the careful integration of the software (e.g., compilers), the architecture, and the hardware implementation. This integrated view also simplifies the design, making it practical to implement the processor at a university.
Proceedings of the International Symposium on Memory Systems, 2018
This paper presents the notion of a monolithic computer, a future computer architecture in which a CPU and a high-capacity main memory system are integrated in a single die. Such computers will become possible in the near future due to emerging non-volatile memory technology. In particular, we consider using resistive random access memory, or ReRAM, from Crossbar Incorporated. Crossbar's ReRAM is dense, fast, and consumes zero static power. Also, it can be fabricated in a standard CMOS logic process, allowing it to be integrated into a CPU's die. The ReRAM cells are manufactured in between metal wires and do not employ per-cell access transistors, so the bulk of the transistors underneath the ReRAM arrays are vacant. This means a CPU can be implemented using a die's logic transistors (minus transistors for access circuits), while the ReRAM can be implemented "in the wires" above the CPU. This will enable massive memory parallelism, as well as high performance and power efficiency. We discuss several challenges that must be overcome in order to realize monolithic computers. First, there is a physical design challenge of merging ReRAM access circuits with CPU logic. Second, while Crossbar's ReRAM technology exists today, it is currently targeted for storage. There is a device challenge to redesign Crossbar's ReRAM so that it is more optimized for CPUs. And third, there is an architecture challenge to expose the massive memory-level parallelism that will be available in the on-die ReRAM. This will require highly parallel network-on-chip and memory controller designs, and a CPU architecture that can issue memory requests at a sufficiently high rate to make use of the memory parallelism.
1 INTRODUCTION. The primary means by which computer hardware improves is via increased integration due to Moore's Law scaling. Over the years, Moore's Law has provided a steady improvement in computer hardware, but along the way, there have been notable points of disruptive …
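The architecture challenge noted above, exposing the massive memory-level parallelism of the on-die ReRAM, ultimately comes down to spreading independent requests across many arrays at once. The C sketch below shows one conventional way to think about that, low-order bank interleaving of physical addresses; the bank count, block size, and the interleaving scheme itself are assumptions made for illustration, not details taken from the paper.

#include <stdio.h>
#include <stdint.h>

/* Assumed geometry, for illustration only: 256 independent on-die arrays ("banks")
 * and a 64-byte access granularity. The paper does not give these numbers. */
#define NUM_BANKS   256u
#define BLOCK_BYTES 64u

/* Low-order interleaving: consecutive blocks map to consecutive banks, so a
 * streaming access pattern keeps many banks busy at the same time. */
static uint32_t bank_of(uint64_t paddr)
{
    return (uint32_t)((paddr / BLOCK_BYTES) % NUM_BANKS);
}

int main(void)
{
    /* Sanity check: a sequential stream of NUM_BANKS blocks touches every bank once. */
    uint32_t touched[NUM_BANKS] = {0};
    for (uint64_t addr = 0; addr < (uint64_t)NUM_BANKS * BLOCK_BYTES; addr += BLOCK_BYTES)
        touched[bank_of(addr)]++;

    uint32_t once = 0;
    for (uint32_t b = 0; b < NUM_BANKS; b++)
        if (touched[b] == 1) once++;

    printf("banks touched exactly once: %u of %u\n", (unsigned)once, (unsigned)NUM_BANKS);
    return 0;
}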
The objective of this paper is to provide an overview of the present state of design technology for systems-on-chip (SoC). An attempt has been made to capture the basic issues regarding SoC design: the paper describes SoC components, explores present-day SoC architectures, and discusses the issues involved in the SoC design process. Although SoC design offers many advantages, there are still the familiar challenges of designing a complex system, now on a chip. The ever-shortening time-to-market compounds these challenges. Without a major advance in productivity, designers will be able to consider only a very few high-level system designs and will have to limit their product differentiation to the software running on a standard embedded processor.
Microelectronics Journal, 1984
The trend in modern hardware design (especially in that of processors) has been away from unsophisticated designs, and towards the implementation of highly functional systems. However, there are still some applications for which unrefined systems have a place. In these, ease of using the device is sacrificed for small physical size. Often there is a need to be able to insert an "extra" processor into a small space (for instance on a memory chip); the need for high yields from a wafer of semiconductor is another possible reason for requiring such a small processor, this time as a stand-alone device. This report describes the design of a simple sixteen-bit processor which uses only 600 transistors in its implementation. Understandably, the power of the processor, the "NVDAC", is very limited. However, this report sets out to demonstrate that many of the savings have been made by optimising the use of the hardware which is provided, and that the resultant design has more power than its small size would suggest.
2020
I present the design and evaluation of two new processing elements for reconfigurable computing. I also present a circuit-level implementation of the data paths in static and dynamic design styles to explore the various performance-power tradeoffs involved. When implemented in an IBM 90-nm CMOS process, the 8-b data paths achieve operating frequencies above 1 GHz for both static and dynamic implementations, with each data path supporting single-cycle computational capability. A novel single-precision floating-point processing element (FPPE) using a 24-b variant of the proposed data paths is also presented. The fully dynamic implementation of the FPPE operates at a frequency of 1 GHz with 6.5-mW average power consumption. Comparison with competing architectures shows that the FPPE provides two orders of magnitude higher throughput. Furthermore, to evaluate its feasibility as a soft-processing solution, we also map the floating-point unit onto Virtex 4 and 5 devices, and observe that the unit requires less than 1% of the total logic slices while utilising only around 4% of the available DSP blocks. When compared against popular field-programmable-gate-array-based floating-point units, our design on Virtex 5 showed significantly lower resource utilisation while achieving comparable peak operating frequency.
3D integration of solid-state memories and logic, as demonstrated by the Hybrid Memory Cube (HMC), offers major opportunities for revisiting near-memory computation and gives new hope of mitigating the power and performance losses caused by the "memory wall". Several publications in the past few years demonstrate this renewed interest. In this paper we present the first exploration steps towards the design of the Smart Memory Cube (SMC), a new processor-in-memory (PIM) architecture that enhances the capabilities of the logic-base (LoB) die in the HMC. An accurate simulation environment called SMCSim has been developed, along with a full-featured software stack. The key contribution of this work is a full-system analysis of near-memory computation, spanning high-level software down to low-level firmware and hardware layers, and considering offloading and dynamic overheads caused by the operating system (OS), cache coherence, and memory management. A zero-copy pointer-passing mechanism has been devised to allow low-overhead data sharing between the host and the PIM. Benchmarking results demonstrate up to 2X performance improvement in comparison with the host system-on-chip (SoC), and around 1.5X against a similar host-side accelerator. Moreover, by scaling down the voltage and frequency of the PIM's processor it is possible to reduce energy by around 70% and 55% in comparison with the host and the accelerator, respectively.
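The zero-copy pointer-passing mechanism mentioned above can be pictured as the host handing the PIM a small descriptor (the address and length of a buffer already resident in the cube) instead of copying the data itself. The C sketch below is a hypothetical host-side view of that idea; the pim_task_desc layout and the smc_submit_task name are invented for illustration and are not the SMCSim or SMC API.

#include <stdio.h>
#include <stdint.h>

/* Hypothetical offload descriptor: instead of copying the buffer into the PIM,
 * the host passes only the buffer's PIM-visible address and length. */
struct pim_task_desc {
    uint64_t buffer_addr;   /* address of data already resident in the cube */
    uint64_t buffer_len;    /* length in bytes */
    uint32_t kernel_id;     /* which pre-loaded PIM kernel to run */
};

/* Stub standing in for the real submission path (driver call, doorbell write, ...);
 * the name and behavior are invented for this example. */
static int smc_submit_task(const struct pim_task_desc *desc)
{
    printf("submit kernel %u over %llu bytes at 0x%llx (no payload copy)\n",
           (unsigned)desc->kernel_id,
           (unsigned long long)desc->buffer_len,
           (unsigned long long)desc->buffer_addr);
    return 0;
}

int main(void)
{
    /* Only this small descriptor crosses to the PIM; the 1-MB payload is never copied. */
    struct pim_task_desc desc = {
        .buffer_addr = 0x80000000ull,
        .buffer_len  = 1u << 20,
        .kernel_id   = 1u
    };
    return smc_submit_task(&desc);
}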
Computer, 2005
Multiprocessor system-on-chip (MPSoC) requirements and implementation constraints push developers to build custom, heterogeneous architectures. The applications that SoC designs target exhibit a punishing combination of constraints:
• not simply high computation rates, but real-time performance that meets deadlines;
• low power or energy consumption; and
• low cost.
Each of these constraints is difficult in itself, but the combination is extremely challenging. And, of course, while meeting these requirements, we can't break the laws of physics. MPSoCs balance these competing constraints by adapting the system's architecture to the application's requirements. Putting computational power where it is needed meets performance constraints; removing unnecessary elements reduces both energy consumption and cost. MPSoCs are not chip multiprocessors. Chip multiprocessors are components that take advantage of increased transistor densities to put more processors on a single chip, but they don't try to leverage application needs. MPSoCs, in contrast, are custom architectures that balance the constraints of VLSI technology with an application's needs. Single processors may be sufficient for low-performance applications that are typical of early microcontrollers, but an increasing number of applications require multiprocessors to meet their performance goals.
2000
Looking into the future, when billion-transistor ASICs will become reality, this paper presents the network-on-a-chip (NOC) concept and its associated methodology as a solution to the design productivity problem. A NOC is a network of computational, storage, and I/O resources interconnected by a network of switches. Resources communicate with each other using addressed data packets routed to their destination by the switch fabric. Arguments are presented to justify that in the billion-transistor era the area and performance penalty would be minimal. A concrete topology for the NOC, a honeycomb structure, is proposed and discussed. A methodology to support NOC design is presented, outlining steps from requirements down to implementation. As an illustration of the concepts, a plausible mapping of an entire base station onto a hypothetical NOC is discussed.
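One way to picture the addressed data packets described above is as a minimal header that every switch inspects to choose an output port. The C sketch below assumes a generic table-driven switch with a small address space; the field names, port count, and routing policy are illustrative assumptions and do not reproduce the honeycomb routing proposed in the paper.

#include <stdio.h>
#include <stdint.h>

#define NUM_PORTS     4u    /* e.g., three neighbor links plus one local resource port;
                               assumed for illustration */
#define NUM_RESOURCES 64u   /* assumed size of the resource address space */

/* Illustrative packet header: every resource has a network-wide address and
 * switches forward packets toward the destination address. */
struct noc_packet {
    uint16_t dest;       /* destination resource address */
    uint16_t src;        /* originating resource address */
    uint32_t payload;    /* application data (real packets would carry more) */
};

/* Table-driven routing: destination address -> output port. A real switch would
 * typically use a compact rule (e.g., coordinate comparison) instead of a table. */
struct noc_switch {
    uint8_t route[NUM_RESOURCES];
};

static unsigned forward(const struct noc_switch *sw, const struct noc_packet *p)
{
    unsigned port = sw->route[p->dest % NUM_RESOURCES];
    printf("packet %u -> %u leaves on port %u\n",
           (unsigned)p->src, (unsigned)p->dest, port);
    return port;
}

int main(void)
{
    struct noc_switch sw;
    for (unsigned d = 0; d < NUM_RESOURCES; d++)
        sw.route[d] = (uint8_t)(d % NUM_PORTS);   /* toy routing table */

    struct noc_packet pkt = { .dest = 42, .src = 7, .payload = 0xCAFEu };
    forward(&sw, &pkt);
    return 0;
}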
J. Embed. Comput., 2006
Today's and future embedded and special-purpose systems need a qualitative step forward in research effort rather than continued quantitative improvement of existing designs: it is time for scaling-out architectures instead of scaling-up frequency. While transistor count is still increasing as expected by Moore's law, recent challenges such as wire delay, design complexity, and power requirements are becoming more and more important. These problems are preventing the evolution of chip architecture in the directions followed in previous decades, when clock frequency could also scale up with Moore's law. Many researchers and companies have started to look at building multiprocessors on a single chip, following both past and novel design solutions: no doubt we are all expecting several cores on a single chip in the near future. Such single-chip architectures are also expected to have full success in the embedded domain. In this case, application-specific requirements would also dem...
Proceedings of ASP-DAC'95/CHDL'95/VLSI'95 with EDA Technofair, 1995