Parallel systems, computer architecture, silicon photonics, memory systems, approximate computing, design for dark silicon, data-oriented software architectures.
(With a little effort this might work itself into reasonable shape. Due to perpetual lack of time, for now, the information below should suffice to whet your appetite. If you want more information on any of these projects, please drop me a note.)
The overarching research umbrella at the Parallel Architecture Lab at Northwestern (PARAG@N) is energy-efficient computing. At the macro scale, computers consume inordinate amounts of energy, negatively impacting the economics and environmental footprint of computing. At the micro scale, power constraints prevent us from riding Moore's Law. We attack both problems by identifying sources of energy inefficiency and working on hardware/software techniques for cross-stack energy optimization. Thus, our work extends from circuit and hardware design, through programming languages and OS optimizations, all the way to application software. In a nutshell, our work aims to minimize the overheads of data storage and data transfers (e.g., through adaptive memory hierarchy designs, memory technologies, and silicon photonics), computation (e.g., through specialized computing on dark silicon and approximate computing), and circuits (e.g., through speculative arithmetic units and fused accelerators), and in the long term aims to push back the bandwidth and power walls by designing 1000+-core virtual macro-chips with nanophotonic interconnects and optical memories. An overview of our research at PARAG@N was presented in invited talks at IBM T.J. Watson Research Center and Google Chicago in March 2012. That talk is a little old and many things have happened since then, but it is a good starting point.
More specifically, we work on:
Elastic Caches: In this project we develop adaptive cache designs and memory hierarchy sub-systems that minimize the overheads of storing, retrieving, and communicating data to/from memories and other cores. An incarnation of Elastic Caches for near-optimal data placement was published at ISCA 2009 and won an IEEE Micro Top Picks award in 2010, while newer papers at DATE 2012 and in the IEEE Computer Special Issue on Multicore Coherence in 2013 present an instance of Elastic Caches that minimizes interconnect power by collocating directory metadata with sharer cores. You can also find an interview on Dynamic Directories conducted by Prof. Srini Devadas (MIT) here. This thrust currently focuses on revisiting memory hierarchy designs, optical memories, and new hardware-software co-designs for virtual-to-physical address mapping. This work was partially funded by NSF CCF-1218768.
Elastic Fidelity: At the circuit level, shrinking transistor geometries and the race for energy-efficient computing result in significant error rates at small feature sizes, due to process variation and low operating voltages (especially with near-threshold computing). Traditionally, these errors are handled at the circuit and architectural layers, because computations expect 100% reliability. Elastic Fidelity computing is based on the observation that not all computations and data require 100% fidelity; we can judiciously let errors manifest in the error-resilient data and handle them higher in the stack. We develop programming-language extensions that allow data objects to be instantiated with specific accuracy guarantees; the compiler records these guarantees and communicates them to hardware, which then steers computations and data to separate ALU/FPU blocks and cache/memory regions that relax their guard-bands and run at lower voltage to conserve energy. This work was funded by NSF CCF-1218768 and NSF CCF-1217353. The Elastic Fidelity NSF project website has more information, papers, released software, and datasets.
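As a loose illustration of the Elastic Fidelity idea, a runtime could tag each data object with the fidelity its computation must guarantee and steer only the error-tolerant operations to low-voltage, error-prone units. All names here (`Tagged`, `steer`, `add`) and the error model are invented for this sketch; this is not the project's actual API.

```python
import random

PRECISE, RELAXED = "precise", "relaxed"

class Tagged:
    """A value annotated with the fidelity its computation must guarantee."""
    def __init__(self, value, fidelity=1.0):
        self.value = value
        self.fidelity = fidelity   # 1.0 = bit-exact; < 1.0 tolerates errors

def steer(x):
    """Pick an execution unit: full guard-bands for exact data,
    low-voltage (error-prone) units for error-resilient data."""
    return PRECISE if x.fidelity >= 1.0 else RELAXED

def add(a, b, rng):
    """Add two tagged values on the unit their fidelity demands."""
    fid = min(a.fidelity, b.fidelity)
    result = a.value + b.value
    if fid < 1.0 and rng.random() > fid:
        result += rng.choice([-1, 1])   # model a low-voltage bit error
    return Tagged(result, fid)
```

The key property the sketch preserves is that fully precise data never takes the relaxed path, so errors can only appear where the programmer declared them acceptable.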
SeaFire (Specialized Computing on Dark Silicon): While Elastic Fidelity and Elastic Caches cut energy consumption, they do not push the power wall far enough. To gain another order of magnitude, we must minimize the overheads of modern computing. The idea behind the SeaFire project is that instead of building conventional high-overhead multicores that we cannot fully power, we should repurpose the dark silicon for specialized, energy-efficient cores. A running application powers up only the cores most closely matching its computational requirements, while the rest of the chip remains off to conserve energy. Preliminary results on SeaFire have been published in a highly cited IEEE Micro article in July 2011, an invited USENIX ;login: article in April 2012, the ACLD workshop in 2010, a keynote at ISPDC in 2010, an invited presentation at the NSF Workshop on Sustainable Energy-Efficient Data Management in 2011 (the abstract is here), and an invited presentation at HPTS in 2011. This work was funded by an ISEN Booster award and now continues as part of the Intel Parallel Computing Center at Northwestern (here is the Intel Press release) that I co-founded with faculty from the IEMS department.
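The core-selection step can be sketched as a simple matching problem: given a sea of specialized cores (mostly dark), power on only the one whose capabilities best fit the application's demands. The core names, capability vectors, and scoring function below are invented for illustration.

```python
# Hypothetical chip with three specialized core types; each capability
# score says how well the core serves vector, branchy, or memory-bound code.
CORES = {
    "simd-wide":  {"vector": 0.9, "branch": 0.1, "memory": 0.4},
    "ooo-narrow": {"vector": 0.2, "branch": 0.9, "memory": 0.5},
    "scratchpad": {"vector": 0.3, "branch": 0.2, "memory": 0.9},
}

def best_core(app_profile):
    """Return the specialized core with the highest affinity to the
    application's mix of vector, branch, and memory demands.
    Every other core stays dark (powered off)."""
    def affinity(caps):
        return sum(caps[k] * app_profile.get(k, 0.0) for k in caps)
    return max(CORES, key=lambda name: affinity(CORES[name]))
```

A real system would also weigh power budgets and co-scheduled applications; the point of the sketch is only that selection, not time-multiplexing, decides which silicon is lit.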
Galaxy (Computer Architecture Meets Silicon Photonics): This project combines advances in parallel computer architecture and silicon photonics to develop architectures that break past the power, bandwidth, and utilization walls (dark silicon) that plague modern processors. The Galaxy architecture of optically-connected disintegrated processors argues that instead of building monolithic chips, we should split them into several smaller chiplets and form a "virtual macro-chip" by connecting them with optical links. The optical links provide such high bandwidth that they break the bandwidth wall entirely, and such low latency that the virtual macro-chip behaves as a single tightly-coupled chip. As each chiplet has its own power budget and the optical links eliminate the traditional chip-to-chip communication overheads, the macro-chip behaves as an oversized multicore that scales beyond single-chip area limits, while maintaining high yield and reasonable cost (only faulty chiplets need replacement). Our preliminary results indicate that Galaxy scales seamlessly to 4000 cores, making it possible to shrink an entire rack's worth of computational power onto a single wafer. The full design was presented at an EPFL talk in 2014 and published at ICS 2014. This project has advanced the state of the art in silicon photonic interconnects by designing laser power-gating NoCs, developing the concept further through co-designing the on-chip NoC with the architecture, escalating the laser power-gating to datacenter optical networks, and overcoming the thermal transfer problems of 3D-stacked processor-photonic chips. A full list of publications appears in the energy-proportional photonic interconnects project web page. This work was funded by NSF CCF-1453853.
Computation Affinity: The research steps presented above may allow us to realize 1000+ core virtual macro-chips. However, to make 1000+ core systems practical for more than a narrow set of highly-optimized applications, we need to revisit the fundamentals of data access and sharing. For this last part of our vision we advocate Computation Affinity, an execution system that partitions the data objects among clusters of cores in the system, allowing code to migrate from one cluster to another based on the data it accesses. Computation Affinity is a hybrid of active messages and processing-in-memory that minimizes data transfers and conserves enormous amounts of energy. The first incarnation of this idea was DORA, published in VLDB 2010 with database systems as a proof-of-concept. We specifically chose database systems as a proof-of-concept because they are notoriously hard to optimize, as their arbitrarily complex access and sharing patterns resist most forms of architectural optimization. We aim to extend this paradigm to general-purpose programs.
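A minimal sketch of the Computation Affinity execution model: data is partitioned across core clusters, and work is shipped to the cluster that owns the data it touches, so only a small task descriptor crosses the interconnect instead of the data itself. The partitioning function, queue structure, and class names are illustrative, not DORA's actual implementation.

```python
from collections import defaultdict

NUM_CLUSTERS = 4

def owner(key):
    """Static hash partitioning of the key space across core clusters."""
    return hash(key) % NUM_CLUSTERS

class AffinityExecutor:
    def __init__(self):
        self.queues = defaultdict(list)   # per-cluster work queues
        self.store = defaultdict(dict)    # per-cluster private data

    def submit(self, key, fn):
        # Ship the computation to the data's owner: no cross-cluster
        # data transfer, only a (small) task descriptor moves.
        self.queues[owner(key)].append((key, fn))

    def run(self):
        # Each cluster drains its own queue against its private partition,
        # so accesses are local and need no inter-cluster coherence traffic.
        for cluster, tasks in self.queues.items():
            for key, fn in tasks:
                part = self.store[cluster]
                part[key] = fn(part.get(key))
```

Because each partition is touched by exactly one cluster, sharing patterns become predictable, which is the property DORA exploits in the database setting.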
Our research is sponsored by:
... and by generous equipment and/or software donations by
Past Research – Northwestern University
DRAM Thermal Management: While more than a third of system energy is consumed by memory, and thermal characteristics play an important role in overall DRAM power consumption and reliability, power/thermal optimizations for DRAM have been largely overlooked. Together with fellow faculty S.O. Memik and G. Memik, we recognized the importance of the problem and worked on minimizing the power and thermal profile of DRAMs using OS-level optimizations. We published some of our results on DRAM thermal management at HPCA 2011.
Past Research – Carnegie Mellon University
Three trends motivated this work:
1. Increasing wire delays force processors to become distributed, dispersing the cores and the cache across the die area and making cache block placement a key determinant of the processor's performance.
2. On-chip transistor counts increase exponentially, but this increase does not directly translate into performance; rather, processors must allocate transistors optimally to components (e.g., cores, caches) in a quest to attain maximum performance while remaining within physical constraints (e.g., power, area, bandwidth).
3. Conventional software is still hampered by arbitrarily complex data access and sharing patterns that inhibit most hardware optimizations. To fully realize the potential of modern multicore processors, the software must be redesigned for the new hardware landscape.
To address these challenges, my research proceeds along three synergistic fronts: (a) hardware designs that optimize for fast data access and utilize the on-chip transistors efficiently, (b) scalable parallel software architectures that mitigate the rising on-chip data latencies, and (c) scalable performance evaluation techniques.
a) Scalable Hardware: cache designs, transistor-efficient multicore designs and memory systems.
Cache Designs. To optimize for data placement on chip, we developed R-NUCA (Reactive Non-Uniform Cache Access). R-NUCA is a distributed cache architecture that places blocks on chip based on the observation that cache accesses can be classified into distinct classes, where each class lends itself to a different placement policy. Fast lookup is provided by Rotational Interleaving, an indexing scheme that affords the fast lookup of conventional address interleaving while allowing cache block replication and migration. Finally, through intelligent cache block placement, R-NUCA obviates the need for hardware coherence at the last-level cache, greatly simplifying the design and improving scalability. [IEEE Micro Top Pick 2010]
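The classification-driven placement at the heart of R-NUCA can be sketched as follows. The classification rules, slice counts, and 4-slice "cluster" are simplified stand-ins, and the rotational-interleaving indexing details are omitted.

```python
def classify(block):
    """Classify an access; block = {"kind": "instr"|"data", "sharers": int}."""
    if block["kind"] == "instr":
        return "instructions"   # read-only: replicate near the requesters
    if block["sharers"] == 1:
        return "private-data"   # place at the requesting core's local slice
    return "shared-data"        # address-interleave at a fixed home, no copies

def home_slices(block, requester, num_slices=16, cluster=4):
    """Return the last-level-cache slices that may hold this block."""
    cls = classify(block)
    if cls == "private-data":
        return [requester]                      # always local: fast access
    if cls == "shared-data":
        return [block["addr"] % num_slices]     # unique home: no coherence
    base = requester - requester % cluster      # instructions: replicate
    return list(range(base, base + cluster))    # within the nearby cluster
```

Because shared data has exactly one home slice and private data lives only at its owner, no block is ever cached in two places that could disagree, which is why the last-level cache needs no hardware coherence.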
Transistor-Efficient Multicores. To efficiently utilize the abundant on-chip transistors, we developed ADviSE (Analytic Design-Space Exploration), a collection of performance, area, bandwidth, power, and thermal analytical models for multicore processors. ADviSE suggests design rules across process technologies that optimize for a given metric (e.g., throughput, power efficiency) and allocates transistors judiciously among components, leading to near-optimal designs.
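A toy analytical sweep in this spirit: under a fixed area budget, trade cores against cache and pick the throughput-optimal split. The model constants (the square-root miss-rate rule of thumb, miss penalty, per-core area) are invented for illustration; ADviSE's real models also capture power, bandwidth, and thermal constraints.

```python
def throughput(cores, cache_mb, miss_penalty=200.0):
    """Aggregate instructions per cycle under a simple CPI model."""
    miss_rate = 0.05 / (cache_mb ** 0.5)   # sqrt rule: bigger cache, fewer misses
    cpi = 1.0 + miss_rate * miss_penalty
    return cores / cpi

def best_split(area_budget=64, core_area=4, cache_area_per_mb=1):
    """Sweep core count vs. cache size within the area budget and
    return the (cores, cache_mb, throughput) with maximum throughput."""
    best = None
    for cores in range(1, area_budget // core_area + 1):
        cache_mb = (area_budget - cores * core_area) // cache_area_per_mb
        if cache_mb < 1:
            continue
        t = throughput(cores, cache_mb)
        if best is None or t > best[2]:
            best = (cores, cache_mb, t)
    return best
```

Even this crude model exhibits the characteristic interior optimum: too few cores waste area on cache, too many cores starve for cache and stall on misses.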
Memory Systems. To tame the off-chip data latency, we developed STeMS (Spatio-Temporal Memory Streaming), a memory system in which data move in correlated streams, rather than in individual cache blocks. STeMS is based on the observation that applications execute repetitive code sequences that result in recurring data access sequences ("streams"), which can be used to predict future requests and prefetch their data.
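The temporal-streaming intuition behind STeMS can be sketched in a few lines: log the order of past misses, and when a miss repeats, replay (prefetch) the addresses that followed it last time. Real STeMS also reconstructs spatial patterns and interleaves multiple streams; this toy keeps one global history, and its names are invented.

```python
class TemporalStreamer:
    def __init__(self, depth=3):
        self.history = []    # global miss-order log
        self.index = {}      # addr -> most recent position in the log
        self.depth = depth   # how far ahead to prefetch along the stream

    def miss(self, addr):
        """Record a miss; return the addresses predicted to follow it,
        based on where this address last appeared in the miss stream."""
        prefetches = []
        if addr in self.index:
            pos = self.index[addr]
            prefetches = self.history[pos + 1 : pos + 1 + self.depth]
        self.index[addr] = len(self.history)
        self.history.append(addr)
        return prefetches
```

The payoff is that a single repeated miss unlocks a whole correlated stream, so long recurring traversals are fetched ahead of use instead of one block at a time.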
b) Scalable Parallel Software. To overcome the limitations of conventional software, we develop Data-Oriented Staging, a software architecture that decomposes otherwise single-threaded requests into parallel tasks and partitions their data logically across the cores. The logical data partitioning (a) transforms data that were shared across requests and slow to access into core-private data with fast local access times, and (b) renders data sharing patterns within a parallel request predictable, enabling prefetch mechanisms to hide data access latencies. To show the feasibility of the design, we developed a prototype staged database system as part of the StagedDB-CMP project. [best demonstration award at ICDE 2006]
c) Scalable Performance Evaluation Techniques. The growing size and complexity of modern hardware make software simulators prohibitively slow, barring researchers from evaluating their designs on commercial-grade workloads and large-scale systems. To overcome this limitation, we develop FLEXUS, a cycle-accurate full-system simulation infrastructure that reduces simulation turnaround through statistical sampling. FLEXUS has been adopted by several research groups, has served as the primary infrastructure for computer architecture courses, and is currently in its third public release.
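The statistical-sampling idea can be illustrated with a short sketch: instead of simulating every instruction in detail, detail-simulate a random subset of short intervals and report a mean with a confidence interval. The function below is a stand-in, not FLEXUS code; `detailed_cpi_of_interval` represents whatever the detailed simulator would measure for one interval.

```python
import math
import random

def sample_cpi(detailed_cpi_of_interval, num_intervals, total_intervals, seed=0):
    """Estimate mean CPI by detailed-simulating a random subset of the
    workload's intervals; returns (mean, 95% confidence half-width)."""
    rng = random.Random(seed)
    picks = rng.sample(range(total_intervals), num_intervals)
    xs = [detailed_cpi_of_interval(i) for i in picks]
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)  # sample variance
    ci95 = 1.96 * math.sqrt(var / len(xs))   # normal approximation
    return mean, ci95
```

The confidence interval is what makes the approach rigorous: the sample size can be grown until the reported error bound is tight enough for the design decision at hand.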
Past Research – University of Rochester
Cashmere: a software distributed-shared-memory system over low-latency remote-memory-access networks like DEC's Memory Channel. Cashmere utilizes the hardware coherence within a multiprocessor, provides software coherence across nodes in a cluster and allows memory scaling through remote memory paging.
Carnival: a tool for the characterization and analysis of waiting time and communication patterns of parallel shared-memory applications. Through cause-effect analysis, Carnival detects performance bottlenecks of individual code fragments.
Past Research – Industry (Digital, Compaq, Hewlett-Packard)
Alpha Processors and High-Performance Multiprocessor Systems: While affiliated with Digital Equipment Corp., Compaq Computer Corp., and Hewlett-Packard, I was a member of the design team of high-end enterprise servers. I contributed to the Alpha EV6 (21264), EV7 (21364), and EV8 (21464) generations of microprocessors, and had a fleeting relationship with the Piranha multicore. The rest of my time was spent working on a number of AlphaServers in the Titan, Wildfire, and Marvel families: the Marvel (GS-1280), WildFire (GS-320), and Privateer (ES-45) multiprocessor systems. My work focused on memory hierarchy and multiprocessor system design, including adaptive and multi-level cache coherence protocols, migratory data optimizations, novel caching schemes, RAMbus modeling and optimizations, link retraining, flow control, directory caches, routing, and system topology. Also, I worked on the design and development of full-system execution-driven, trace-driven, and statistical simulators, and tools and techniques for performance analysis.