Research Interests
Parallel computer
architecture, memory systems, runtime environments, optical interconnects,
elastic fidelity computing, design for dark silicon, data-oriented software
architectures.
Current Research
(With a little effort this might work itself into reasonable shape. Due to
perpetual lack of time, for now, the information below should suffice to whet
your appetite. If you want more information on any of these projects, please
drop me a note.)
The overarching research
umbrella at the Parallel Architecture Lab at Northwestern (PARAG@N) is energy-efficient
computing. At the macro scale, computers consume inordinate amounts of
energy, negatively impacting the economics and environmental footprint of computing.
At the micro scale, power constraints prevent us from riding Moore's Law. We
attack both problems by identifying sources of energy inefficiency and
developing hardware/software techniques for cross-stack energy optimization.
Thus, our work extends from circuit and hardware design, through programming
languages and OS optimizations, all the way to application software. In a
nutshell, our work goes from near-term solutions (minimize data-transfer
overheads with adaptive caching and DRAM management), to medium term (eliminate
computational overheads with specialized computing on dark silicon), to
medium-long term (minimize energy at logic circuits through selective
approximate computation), to the long term (push back the bandwidth and power
walls by designing 1000+-core virtual macro-chips with nanophotonics, along
with the runtime environment that goes with it). An overview of our research at
PARAG@N was presented in invited talks at IBM T.J. Watson Research Center and
Google Chicago in March 2012.
More specifically, we work
on:
DRAM Thermal Management: Memory consumes
more than a third of the energy, and thermal characteristics
play an important role in overall DRAM power consumption and reliability,
yet power/thermal optimizations for DRAM have been largely overlooked. Together
with fellow faculty S.O. Memik and G. Memik, we recognize the importance of the
problem, and work in the near term on minimizing the power and thermal profile of DRAMs using
OS-level optimizations. We have already published some of our results on DRAM
thermal management at HPCA 2011 and are working on subsequent publications.
Elastic Caches: As a significant fraction
of the energy is consumed by data transfers and storage, we work on Elastic
Caches alongside SeaFire (described below). In this project we develop
adaptive cache management policies that minimize the overheads of storing and
communicating data among the cores. An incarnation of Elastic Caches for
near-optimal data placement was published at ISCA 2009 and won an IEEE Micro Top Picks award in 2010, while
a newer paper at DATE 2012 presents an instance of Elastic
Caches that minimizes interconnect power. While Elastic Caches are independent of
SeaFire, their combination attains higher benefits than the
sum of its parts.
SeaFire (Specialized
Computing on Dark Silicon): While Elastic Fidelity (described below) cuts back on energy consumption,
it does not push the power wall far enough. To gain another order of magnitude,
we must minimize the overheads of modern computing. The idea behind the SeaFire project (targeting the medium term) is
that instead of building conventional high-overhead multicores that we cannot
power, we should repurpose the dark silicon for specialized energy-efficient
cores. A running application will power up only the cores most closely matching
its computational requirements, while the rest of the chip remains off to
conserve energy. Preliminary results on SeaFire have appeared in an IEEE Micro article (July 2011), an invited USENIX ;login: article (April 2012), the
ACLD workshop (2010), the keynote at ISPDC (2010), a technical report (2010), an invited
presentation at the NSF Workshop on Sustainable Energy-Efficient Data Management (2011), and an invited
presentation at HPTS (2011).
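As a rough illustration of the core-selection idea described above, the following Python sketch picks the specialized core whose operation mix is closest to an application's profile and leaves every other core powered off. The profile format, the L1 mismatch metric, and the core names are all assumptions made for this example; the sketch also powers up a single core for simplicity and is not the SeaFire selection mechanism.

```python
def mismatch(app_profile, core_profile):
    """L1 distance between two operation-mix profiles (dicts of fractions)."""
    ops = set(app_profile) | set(core_profile)
    return sum(abs(app_profile.get(op, 0.0) - core_profile.get(op, 0.0))
               for op in ops)


def power_plan(app_profile, cores):
    """Return (chosen core name, {core name: powered on?}).

    Only the best-matching core is powered; the rest stay dark.
    """
    chosen = min(cores, key=lambda name: mismatch(app_profile, cores[name]))
    return chosen, {name: name == chosen for name in cores}


# Hypothetical catalogue of specialized cores (illustrative values only).
CORES = {
    "fp_vector": {"fp": 0.8, "int": 0.1, "mem": 0.1},
    "crypto": {"int": 0.9, "mem": 0.1},
    "general": {"fp": 0.3, "int": 0.4, "mem": 0.3},
}

if __name__ == "__main__":
    app = {"fp": 0.7, "int": 0.2, "mem": 0.1}  # hypothetical application mix
    print(power_plan(app, CORES))
```

A real system would presumably weigh energy per operation and migration cost as well; the distance metric here only stands in for "most closely matching".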
Elastic Fidelity: At the circuit level, the shrinking
transistor geometries and race for energy-efficient computing result in
significant error rates at smaller technologies due to process variation and
low voltages (especially with near-threshold computing). Traditionally, these
errors are handled at the circuit and architectural layers, as computations
expect 100% reliability. Elastic Fidelity computing targets the near-to-medium
term, and is based on the observation that not all computations and data
require 100% fidelity; we can judiciously let errors manifest in the
error-resilient data, and handle them higher in the stack. We develop
programming language extensions that allow data objects to be instantiated with
certain accuracy guarantees, which are recorded by the compiler and
communicated to hardware, which then steers computations and data to separate
ALU/FPU blocks and cache/memory regions that relax the guardbands and run at
lower voltage to conserve energy. This is a relatively new project; we had a
poster presentation on Elastic Fidelity at ASPLOS 2011, and a Technical Report in February 2011.
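To make the idea of fidelity-annotated data concrete, here is a minimal Python sketch in which each data object carries an accuracy annotation and a toy placement step steers it to a nominal-voltage or low-voltage region. The `Fidelity` classes, the `DataObject` type, and the region names are hypothetical and chosen only for illustration; the actual language extensions and hardware interface are not specified in the text above.

```python
from dataclasses import dataclass
from enum import Enum


class Fidelity(Enum):
    """Hypothetical accuracy classes a programmer could attach to data."""
    EXACT = "exact"      # must never be corrupted (e.g., pointers, indices)
    RELAXED = "relaxed"  # tolerates occasional errors (e.g., pixels, samples)


@dataclass
class DataObject:
    name: str
    size_bytes: int
    fidelity: Fidelity


def place(objects):
    """Steer each object to a memory region according to its annotation.

    Returns a dict mapping region name -> names of the objects placed there.
    """
    regions = {"nominal_voltage": [], "low_voltage": []}
    for obj in objects:
        if obj.fidelity is Fidelity.RELAXED:
            regions["low_voltage"].append(obj.name)
        else:
            regions["nominal_voltage"].append(obj.name)
    return regions


def relaxed_fraction(objects):
    """Fraction of bytes that may live in the low-voltage region."""
    total = sum(o.size_bytes for o in objects)
    relaxed = sum(o.size_bytes for o in objects
                  if o.fidelity is Fidelity.RELAXED)
    return relaxed / total if total else 0.0


if __name__ == "__main__":
    objs = [
        DataObject("frame_pixels", 3000, Fidelity.RELAXED),
        DataObject("frame_index", 1000, Fidelity.EXACT),
    ]
    print(place(objs), relaxed_fraction(objs))
```

The fraction of relaxed bytes gives a first-order feel for how much of the data footprint could run with relaxed guardbands under these assumptions.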
Galaxy (Optically-Connected
Disintegrated Processors): The combined works above offer a respite of a few orders of
magnitude, but a fundamental problem remains unaffected: chips are ultimately
limited by bandwidth, power delivery and cooling constraints. In the Galaxy project (targeting the long term) we
take a step towards pushing back the power, bandwidth, and yield walls. Instead
of building monolithic chips, we advocate splitting them into several smaller
chiplets connected with photonic interconnects (fiber optics across chiplets,
silicon photonics within chiplets for long distances, electrical interconnects
for short distances). Photonics allow communication at such high bandwidth that it breaks
the bandwidth wall entirely (8 TBps/mm bandwidth density demonstrated in lab
prototypes by IBM), and at such low latency that the virtual macro-chip behaves as
a single chip. Yet power delivery is now split among multiple chiplets, which
solves the power delivery problem; the chiplets are spaced far enough apart
to be cooled efficiently with forced air, thereby pushing back the
power wall; and they are sized optimally to maximize yield (another hurdle of
technology scaling). While competing designs (e.g., the Oracle macrochip)
have to resort to liquid cooling and microfluidics to cool 4 kW from a
wafer-size device, our design allows us to space chiplets 8-10 cm apart and
minimize heat transfer. Our preliminary results indicate that Galaxy scales
seamlessly to 4000 cores. It is important to note that all the other research
above is still relevant, as Elastic Caches are required to handle data accesses
in such large scales, specialized computing on dark silicon can maximize
performance and minimize energy consumption on a per-chiplet basis, and Elastic
Fidelity and DRAM optimizations shave off significant overheads at the circuit
layer and the main memory. A preliminary report on the Galaxy design was
presented at WINDS 2010 and in a talk at Google Madison in March 2013.
Computation Affinity: The research steps presented
above have gradually taken us from circuit-level optimizations for the near
term, to new architectures in the medium and long term, that may allow us to
realize 1000+ core virtual macro-chips. However, to make 1000+ core systems
practical for more than just a narrow set of highly-optimized applications, we
need to revisit the fundamentals of data access and sharing. For this last part
of our vision we advocate Computation Affinity, an execution system that
partitions the data objects among clusters of cores in the system, allowing the
code to migrate from one cluster to the other based on the data it accesses.
Computation Affinity is a hybrid of active messages and processing-in-memory that
minimizes data transfers and conserves enormous amounts of energy. The first
incarnation of this idea was DORA, published in VLDB 2010 with database
systems as a proof-of-concept. We specifically chose database systems
because they are notoriously hard to optimize, as their
arbitrarily complex access and sharing patterns resist most forms of
architectural optimization. Our current work extends this paradigm to
general-purpose programs.
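The partition-then-migrate idea can be sketched in a few lines of Python. In this toy model each cluster privately owns a slice of the data, and an action is shipped to the owning cluster instead of pulling the data to the caller. The class name `AffinityRuntime`, the modulo partitioning on integer keys, and the bookkeeping fields are assumptions for illustration only; this is neither DORA nor the actual execution system.

```python
from collections import defaultdict


class AffinityRuntime:
    """Toy model: data is partitioned across clusters; code moves to data."""

    def __init__(self, num_clusters):
        self.num_clusters = num_clusters
        # One private key/value store per cluster.
        self.store = [dict() for _ in range(num_clusters)]
        # cluster id -> keys whose actions were executed on that cluster.
        self.executed_on = defaultdict(list)

    def owner(self, key):
        # Assumption: simple modulo partitioning on integer keys.
        return key % self.num_clusters

    def put(self, key, value):
        self.store[self.owner(key)][key] = value

    def run(self, key, action):
        """Ship `action` to the cluster that owns `key` and apply it there."""
        cluster = self.owner(key)
        local = self.store[cluster]
        local[key] = action(local.get(key))
        self.executed_on[cluster].append(key)
        return cluster


if __name__ == "__main__":
    rt = AffinityRuntime(num_clusters=4)
    rt.put(5, 10)
    where = rt.run(5, lambda v: v + 1)  # executes on the owner of key 5
    print(where, rt.store[where][5])
```

Because every access to a given key executes on the same cluster, the data never moves in this model; only the (much smaller) action does.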
Recent Research
The performance of modern
multicore processors is shaped by three trends:
1. The increasing wire delays force processors to become distributed and disperse the
cores and the cache across the die area, making cache block placement a
determinant of the processor's performance.
2. The on-chip transistor counts increase exponentially, but this increase does not
directly translate into performance improvement; rather, processors must
optimally allocate transistors to components (e.g., cores, caches), in a quest
to attain maximum performance and remain within physical constraints (e.g.,
power, area, bandwidth).
3. Conventional software is still hampered by arbitrarily complex data access and sharing
patterns that inhibit most hardware optimizations. To fully realize the
potential of modern multicore processors, the software must be redesigned for
the new hardware landscape.
To address these
requirements, my research proceeds along three synergistic fronts: (a) hardware
designs that optimize for fast data accesses and efficiently utilize the
transistors on chip, (b) scalable parallel software architectures that mitigate
the rising on-chip data latencies, and (c) scalable performance evaluation
techniques.
a) Scalable Hardware: cache designs, transistor-efficient multicore designs, and memory systems.
Cache Designs.
To optimize for data placement on chip, we developed R-NUCA
(Reactive Non-Uniform Cache Access).
R-NUCA is a distributed cache architecture that places blocks on chip based on
the observation that cache accesses can be classified into distinct classes,
where each class lends itself to a different placement policy. Fast lookup is provided by Rotational Interleaving,
an indexing scheme that affords the fast lookup of conventional address
interleaving while allowing cache block replication and migration.
Finally, through intelligent cache block placement, R-NUCA obviates the need
for hardware coherence at the last-level cache, greatly simplifying the design
and improving scalability. [IEEE Micro Top Pick 2010]
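The class-based placement idea can be illustrated with a small Python sketch. The three access classes used here (instructions, private data, shared data) and the slice-selection rule for each are assumptions made for this example; the text above does not enumerate them. The instruction case uses plain interleaving within fixed clusters as a stand-in, not Rotational Interleaving.

```python
def classify(is_instruction, sharers):
    """Assign an access class from (assumed) page-level information.

    `sharers` is the set of core ids that have touched the data.
    """
    if is_instruction:
        return "instruction"
    return "private" if len(sharers) <= 1 else "shared"


def home_slice(block_addr, access_class, requester, num_slices, cluster_size=4):
    """Pick the cache slice that holds a block under a per-class policy."""
    if access_class == "private":
        # Keep private data in the requesting core's own slice.
        return requester
    if access_class == "shared":
        # One copy chip-wide, address-interleaved across all slices.
        return block_addr % num_slices
    # Instructions: one copy per cluster, interleaved inside the
    # requester's (fixed-boundary) cluster.
    base = (requester // cluster_size) * cluster_size
    return base + block_addr % cluster_size


if __name__ == "__main__":
    cls = classify(is_instruction=False, sharers={1, 2})
    print(cls, home_slice(13, cls, requester=2, num_slices=16))
```

Note that the lookup is a pure function of the address, the class, and the requester, which is what keeps it as fast as conventional address interleaving in this simplified model.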
Transistor-Efficient Multicores.
To efficiently utilize the abundant transistors on chip, we developed ADviSE (Analytic Design-Space
Exploration), a collection of performance, area, bandwidth,
power and thermal analytical models for multicore processors. ADviSE suggests design rules across process technologies
that optimize for a given metric (e.g., throughput, power efficiency) and
allocates transistors judiciously among components, leading to near-optimal
designs.
Memory Systems.
To tame the off-chip data latency, we developed STeMS (Spatio-Temporal Memory Streaming),
a memory system in which data move in correlated streams, rather than in
individual cache blocks. STeMS is based on the
observation that applications execute repetitive code sequences that result in
recurring data access sequences ("streams"), which can be used to
predict future requests and prefetch their data.
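The observation that recurring access sequences can drive prefetching can be sketched as follows. This Python toy logs the order of observed misses and, when an address recurs, predicts the addresses that followed it last time. It covers only the temporal half of the idea, ignores spatial correlation entirely, and is not the STeMS design; the class name `StreamPredictor` and the `depth` parameter are hypothetical.

```python
class StreamPredictor:
    """Toy temporal-stream predictor driven by a log of past misses."""

    def __init__(self, depth=2):
        self.depth = depth   # how many addresses to predict per trigger
        self.history = []    # ordered log of observed miss addresses
        self.last_pos = {}   # address -> index of its latest occurrence

    def access(self, addr):
        """Record a miss and return the addresses predicted to follow it."""
        prediction = []
        pos = self.last_pos.get(addr)
        if pos is not None:
            # Replay what followed this address the last time it was seen.
            prediction = self.history[pos + 1 : pos + 1 + self.depth]
        self.last_pos[addr] = len(self.history)
        self.history.append(addr)
        return prediction


if __name__ == "__main__":
    p = StreamPredictor(depth=2)
    for a in ["A", "B", "C", "D"]:
        p.access(a)          # first pass: nothing to predict yet
    print(p.access("A"))     # second pass: replays the recorded stream
```

On the second pass over the same sequence, the predictor returns the next addresses of the recorded stream, which a prefetcher could then fetch ahead of demand.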
b) Scalable Parallel Software. To overcome the limitations
of conventional software, we develop Data-Oriented Staging, a software architecture that decomposes otherwise
single-threaded requests into parallel tasks and partitions their data
logically across the cores. The logical data partitioning (a) transforms data
that were shared across requests and slow to access into core-private data with
fast local access times, and (b) renders data sharing patterns within a
parallel request predictable, enabling prefetch
mechanisms to hide data access latencies. To show the feasibility of the
design, we developed a prototype staged database system as part of the StagedDB-CMP project. [Best
Demonstration Award at ICDE 2006]
c) Scalable Performance Evaluation Techniques.
The growing size and complexity of modern hardware make software simulators
prohibitively slow, barring researchers from evaluating their designs on
commercial-grade workloads and large-scale systems. To overcome this
limitation, we develop FLEXUS, a cycle-accurate full-system simulation infrastructure that
reduces simulation turnaround through statistical sampling. FLEXUS has
been adopted by several research groups, has served as the primary
infrastructure for computer architecture courses, and is currently in its third
public release.
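To show why sampling can cut simulation time, the Python sketch below estimates a workload's mean CPI from a small random subset of measurement windows and reports a confidence half-width. It uses ordinary simple random sampling with a normal approximation (z = 1.96 for roughly 95% confidence) and ignores the finite-population correction; whether FLEXUS uses this exact procedure is not stated above, and the function name and synthetic data are assumptions for illustration.

```python
import math
import random
import statistics


def sample_estimate(per_window_cpi, num_samples, seed=0, z=1.96):
    """Estimate mean CPI from a random subset of measurement windows.

    Returns (estimated mean, confidence half-width).
    """
    rng = random.Random(seed)
    sample = rng.sample(per_window_cpi, num_samples)
    mean = statistics.mean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(num_samples)
    return mean, z * stderr


if __name__ == "__main__":
    # Synthetic per-window CPI values standing in for detailed simulation.
    population = [1.0 + 0.1 * (i % 5) for i in range(10000)]
    est, half = sample_estimate(population, num_samples=400)
    print(f"CPI ~ {est:.3f} +/- {half:.3f} (true {statistics.mean(population):.3f})")
```

Only the sampled windows would need detailed simulation, so in this example 400 windows stand in for 10,000.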
Past Research
Alpha Processors
and High-Performance Multiprocessor Systems: While affiliated with Digital Equipment Corp.,
Compaq Computer Corp., and Hewlett-Packard, I was a member of the design team
for high-end enterprise servers. I contributed to the Alpha
EV6 (21264), EV7 (21364), and EV8
(21464) generations of microprocessors, and had a fleeting
relationship with the Piranha multicore. The rest of my time was
spent working on the Marvel (GS-1280), WildFire
(GS-320), and Privateer (ES-45) multiprocessor systems. My
work focused on memory hierarchy and multiprocessor system design, including
adaptive and multi-level cache coherence protocols, migratory data
optimizations, novel caching schemes, Rambus modeling
and optimizations, link retraining, flow control, directory caches, routing,
and system topology. Also, I worked on the design and development of
full-system execution-driven, trace-driven, and statistical simulators, and
tools and techniques for performance analysis.
Cashmere:
a software distributed-shared-memory system over low-latency
remote-memory-access networks like DEC's Memory Channel. Cashmere utilizes the hardware
coherence within a multiprocessor, provides software coherence across nodes in
a cluster and allows memory scaling through remote memory paging.
Carnival: a tool for the characterization
and analysis of waiting time and communication patterns of parallel
shared-memory applications. Through cause-effect analysis, Carnival detects performance
bottlenecks of individual code fragments.
