Paul Gratz

Biography

Texas A&M University College Station - Engineering


Resume

  • 8694704

    The present disclosure relates to an example of a method for a first router to adaptively determine status within a network. The network may include the first router, a second router, and a third router. The method for the first router may comprise determining status information regarding the second router located in the network, and transmitting the status information to the third router located in the network. The second router and the third router may be indirectly coupled to one another.

    Method and apparatus for congestion-aware routing in a computer interconnection network

  • 2001

    Associate Professor, Texas A&M University, Bryan/College Station, Texas Area

    Doctor of Philosophy (PhD), Electrical and Computer Engineering, The University of Texas at Austin. Designed the last-level cache and on-chip interconnect for the TRIPS processor system.

    Bachelor of Science (BS), Electrical Engineering, The University of Florida

    Affiliations: IEEE, Tau Beta Pi

  • 1997

    Intel

    Assistant Professor, Texas A&M University, Dept. of Electrical Engineering, College Station, Texas

    Robert McDonald

    Haiming Liu

  • Logic Design

    High Performance Computing

    Algorithms

    Distributed Systems

    Simulations

    Parallel Computing

    Verilog

    Memory Design

    VHDL

    Computer Architecture

    Microprocessors

    LaTeX

    Embedded Systems

    VLSI

    C

    Interconnect

    ModelSim

    Xilinx

    C++

    Memory Test

    Reena Panda

    David Kadjo

    2014 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)

    For decades, the primary tools in alleviating the "Memory Wall" have been large cache hierarchies and data prefetchers. Both approaches become more challenging in modern Chip-Multiprocessor (CMP) design. Increasing the last-level cache (LLC) size yields diminishing returns in terms of performance per Watt; given VLSI power scaling trends, this approach becomes hard to justify. These trends also impact hardware budgets for prefetchers. Moreover, in the context of CMPs running multiple concurrent processes, prefetching accuracy is critical to prevent cache pollution effects. These concerns point to the need for a light-weight prefetcher with high accuracy. Existing data prefetchers may generally be classified as low-overhead and low-accuracy (Next-n, Stride, etc.) or high-overhead and high-accuracy (STeMS, ISB). We propose B-Fetch: a data prefetcher driven by branch prediction and effective address value speculation. B-Fetch leverages control flow prediction to generate an expected future path of the executing application. It then speculatively computes the effective addresses of the load instructions along that path based upon a history of past register transformations. Detailed simulation using a cycle-accurate simulator shows a geometric mean speedup of 23.4% for single-threaded workloads, improving to 28.6% for multi-application workloads over a baseline system without prefetching. We find that B-Fetch outperforms an existing "best-of-class" light-weight prefetcher under single-threaded and multi-programmed workloads by 9% on average, with 65% less storage overhead.

    B-Fetch: Branch Prediction Directed Prefetching for Chip-Multiprocessors
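
    A minimal sketch of the idea behind B-Fetch, assuming a hypothetical front-end interface that exposes the loads expected along the branch-predicted path; the structure names, the single-stride register history, and the table organization are illustrative assumptions, not the paper's actual microarchitecture.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Illustrative per-register transformation history: last committed value
    // plus the most recent update stride (an assumption for this sketch).
    struct RegHistory {
        int64_t last_value = 0;
        int64_t stride     = 0;
    };

    // A load expected on the predicted future path, identified by its base
    // register and immediate offset (hypothetical front-end output).
    struct ExpectedLoad {
        int     base_reg;
        int64_t offset;
    };

    class BFetchSketch {
    public:
        // Walk the loads expected along the predicted path and speculatively
        // form their effective addresses from the register history.
        std::vector<uint64_t> predict_prefetches(const std::vector<ExpectedLoad>& path_loads) {
            std::vector<uint64_t> addrs;
            for (const auto& ld : path_loads) {
                const RegHistory& h = regs_[ld.base_reg];
                int64_t spec_base = h.last_value + h.stride;   // speculate the register's next value
                addrs.push_back(static_cast<uint64_t>(spec_base + ld.offset));
            }
            return addrs;   // candidate prefetch addresses handed to the cache hierarchy
        }

        // Called at commit to update the per-register transformation history.
        void observe_register_write(int reg, int64_t value) {
            RegHistory& h = regs_[reg];
            h.stride = value - h.last_value;
            h.last_value = value;
        }

    private:
        std::unordered_map<int, RegHistory> regs_;
    };

    int main() {
        BFetchSketch bf;
        bf.observe_register_write(/*reg=*/5, /*value=*/0x1000);
        bf.observe_register_write(5, 0x1040);            // stride of 0x40 observed
        auto addrs = bf.predict_prefetches({{5, 8}});    // load [r5 + 8] on the predicted path
        return addrs.empty() ? 1 : 0;                    // addrs[0] == 0x1088
    }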

    Gwan Choi

    This paper presents a bidirectional interconnect design that achieves a significant reduction in area and power by allowing simultaneous transmission and reception of signals on a single interconnect segment. The proposed design achieves twice the throughput with the same link width. We model the bidirectional link in a 7×7 cycle-accurate NoC design and explore latency under synthetic and realistic SPLASH-2 benchmarks. Synthetic benchmark results show that the bidirectional design performs exceedingly well under high congestion, and combinations of realistic benchmarks show that it delivers much better latency whenever the injection level of the combined benchmark is higher.

    Bidirectional interconnect design for low latency high bandwidth NoC

    The energy cost of asymmetric cryptography, a vital component of modern secure communications, inhibits its widespread adoption within ultra-low energy regimes such as Implantable Medical Devices (IMDs), Wireless Sensor Networks (WSNs), and Radio Frequency Identification tags (RFIDs). Consequently, a gamut of hardware/software acceleration techniques exists to alleviate this energy burden. In this paper, we explore this design space, estimating the energy consumption for three levels of acceleration across the commercial security spectrum. First we examine an efficient baseline architecture centered around a pipelined RISC processor. We then include simple, yet beneficial instruction set extensions to our microarchitecture and evaluate the improvement in terms of energy per operation compared to baseline. Finally, we introduce a novel, dedicated accelerator to our microarchitecture and measure the energy per operation against the baseline and the ISA extensions. For ISA extensions, we show a factor of 1.28 to 1.41 improvement in energy efficiency over baseline, while for full acceleration we demonstrate a factor of 4.36 to 6.45 improvement.

    The design space of ultra-low energy asymmetric cryptography

    Umit Ogras

    As the core count in processor chips grows, so do the on-die, shared resources such as the on-chip communication fabric and shared cache, which are of paramount importance for chip performance and power. This paper presents a method for dynamic voltage/frequency scaling of networks-on-chip and last level caches in multicore processor designs, where the shared resources form a single voltage/frequency domain. Several new techniques for monitoring and control are developed, and validated through full system simulations on the PARSEC benchmarks. These techniques reduce energy-delay product by 56% compared to a state-of-the-art prior work.

    Dynamic voltage and frequency scaling for shared resources in multicore processor designs

    With the breakdown of Dennard scaling, future processor designs will be at the mercy of power limits as Chip Multi-Processor (CMP) designs scale out to many-cores. It is critical, therefore, that future CMPs be optimally designed in terms of performance efficiency with respect to power. A characterization analysis of future workloads is imperative to ensure maximum returns of performance per Watt consumed. Hence, a detailed analysis of emerging workloads is necessary to understand their characteristics with respect to hardware in terms of power and performance tradeoffs. In this paper, we conduct a limit study simultaneously analyzing the two dominant forms of parallelism exploited by modern computer architectures: Instruction Level Parallelism (ILP) and Thread Level Parallelism (TLP). This study gives insights into the upper bounds of performance that future architectures can achieve. Furthermore, it identifies the bottlenecks of emerging workloads. To the best of our knowledge, our work is the first study that combines the two forms of parallelism into one study with modern applications. We evaluate the PARSEC multithreaded benchmark suite using a specialized trace-driven simulator. We make several contributions describing the high-level behavior of next-generation applications. For example, we show these applications contain up to a factor of 929X more ILP than what is currently being extracted from real machines. We then show the effects of breaking the application into increasing numbers of threads (exploiting TLP), instruction window size, realistic branch prediction, realistic memory latency, and thread dependencies on exploitable ILP. Our examination shows that these benchmarks differ vastly from one another. As a result, we expect no single, homogeneous micro-architecture will work optimally for all, arguing for reconfigurable, heterogeneous designs.

    ILP and TLP in shared memory applications: a limit study

    Umit Ogras

    Michael Kishinevsky

    Jiang

    H. J. Kim

    Z. Xu

    · Targeted a previously uninvestigated but promising computer architecture for power management
    · Proposed a new but practical monitoring technique
    · Employed a DVFS-based PID control policy to control the system
    · Achieved around 33% dynamic energy savings with negligible performance degradation
    · Implemented in C++

    In-network Monitoring and Control Policy for DVFS of Networks-on-Chip and Last Level Caches in CMPs

    Umit Ogras

    Michael Kishinevsky

    In chip design today and for the foreseeable future, the last-level cache and on-chip interconnect are not only performance critical but also substantial power consumers. This work focuses on employing dynamic voltage and frequency scaling (DVFS) policies for networks-on-chip (NoC) and shared, distributed last-level caches (LLC). In particular, we consider a practical system architecture where the distributed LLC and the NoC share a voltage/frequency domain that is separate from the core domain. This architecture enables the control of the relative speed between the cores and memory hierarchy without introducing synchronization delays within the NoC. DVFS for this architecture is more complex than individual link/core-based DVFS since it involves spatially distributed monitoring and control. We propose an average memory access time (AMAT)-based monitoring technique and integrate it with DVFS based on PID control theory. Simulations on PARSEC benchmarks yield a 27% energy savings with a negligible impact on system performance.

    In-network Monitoring and Control Policy for DVFS of CMP Networks-on-Chip and Last Level Caches
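
    A rough sketch of the control loop described above, assuming a hypothetical AMAT monitor and a small discrete set of voltage/frequency levels for the shared NoC/LLC domain; the gains, the AMAT target, and the one-step-per-interval limit are illustrative placeholders, not values from the paper.

    #include <algorithm>
    #include <cstdio>

    // Discrete-time PID controller acting on the error between a target average
    // memory access time (AMAT) and the AMAT measured over the last interval.
    class PidDvfsController {
    public:
        PidDvfsController(double kp, double ki, double kd, double target_amat)
            : kp_(kp), ki_(ki), kd_(kd), target_(target_amat) {}

        // Returns a new V/F level in [0, max_level]; higher level = faster uncore.
        int update(double measured_amat, int current_level, int max_level) {
            double error = measured_amat - target_;      // positive: uncore too slow
            integral_ += error;
            double derivative = error - prev_error_;
            prev_error_ = error;

            double control = kp_ * error + ki_ * integral_ + kd_ * derivative;

            // Limit the change to one level step per interval (an assumption here,
            // a common way to avoid oscillation between control intervals).
            int next = current_level;
            if (control > deadband_)       next = current_level + 1;
            else if (control < -deadband_) next = current_level - 1;
            return std::clamp(next, 0, max_level);
        }

    private:
        double kp_, ki_, kd_, target_;
        double integral_ = 0.0, prev_error_ = 0.0;
        double deadband_ = 0.05;   // hysteresis around the target (illustrative)
    };

    int main() {
        PidDvfsController ctrl(/*kp=*/0.5, /*ki=*/0.05, /*kd=*/0.1, /*target_amat=*/20.0);
        int level = 3;
        double sample_amat[] = {25.0, 24.0, 21.0, 18.0, 17.5};  // cycles, made up
        for (double amat : sample_amat) {
            level = ctrl.update(amat, level, /*max_level=*/7);
            std::printf("AMAT %.1f -> V/F level %d\n", amat, level);
        }
        return 0;
    }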

    Michael Kishinevsky

    Umit Ogras

    David Kadjo

    2014 27th IEEE International System-on-Chip Conference (SOCC)

    This paper presents a platform-level power management framework for mobile platforms. The proposed framework minimizes the overall platform energy while meeting system-level performance and power budget constraints. To this end, we construct analytical performance and power models using dynamic information collected via performance monitoring counters. Using these models, we design two different closed-loop controllers to ensure that both the performance and the power targets are achieved and maintained in the presence of dynamic workload variations. Experimental evaluations performed on an Android platform show up to 8% energy savings at the platform level and up to 15% CPU energy savings.

    Towards platform level power management in mobile systems

    David Kadjo

    2013 IEEE 31st International Conference on Computer Design (ICCD)

    We propose a novel technique to significantly reduce the leakage energy of last level caches while mitigating any significant performance impact. In general, cache blocks are not ordered by their temporal locality within the sets; hence, simply power gating off a partition of the cache, as done in previous studies, may lead to considerable performance degradation. We propose a solution that migrates the high temporal locality blocks to facilitate power gating, where blocks likely to be used in the future are migrated from the partition being shut down to the live partition at a negligible performance impact and hardware overhead. Our detailed simulations show energy savings of 66% at a low performance degradation of 2.16%.

    Power gating with block migration in chip-multiprocessor last-level caches
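
    A toy sketch of the migration step, assuming a set-associative LLC in which the upper ways of each set are about to be power-gated and approximating "temporal locality" by LRU recency; the victim-selection details are assumptions for illustration, not the paper's exact mechanism.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct CacheBlock {
        uint64_t tag = 0;
        bool     valid = false;
        bool     dirty = false;
        uint32_t lru_age = 0;     // smaller = more recently used
    };

    // Before gating off ways [live_ways, total_ways) of a set, migrate the most
    // recently used valid blocks from the doomed partition into invalid or
    // least-recently-used slots of the live partition.
    void migrate_before_gating(std::vector<CacheBlock>& set, size_t live_ways) {
        // Collect blocks from the partition being shut down, hottest first.
        std::vector<CacheBlock*> doomed;
        for (size_t w = live_ways; w < set.size(); ++w)
            if (set[w].valid) doomed.push_back(&set[w]);
        std::sort(doomed.begin(), doomed.end(),
                  [](const CacheBlock* a, const CacheBlock* b) { return a->lru_age < b->lru_age; });

        for (CacheBlock* blk : doomed) {
            // Find the coldest slot in the live partition (invalid slots first).
            size_t victim = 0;
            for (size_t w = 1; w < live_ways; ++w) {
                if (!set[w].valid) { victim = w; break; }
                if (set[victim].valid && set[w].lru_age > set[victim].lru_age)
                    victim = w;
            }
            // Only displace a live block if the migrating block is hotter.
            if (set[victim].valid && set[victim].lru_age <= blk->lru_age) {
                // A dirty block that cannot be kept would be written back here.
                blk->valid = false;
                continue;
            }
            set[victim] = *blk;       // migrate the block into the live partition
            blk->valid = false;       // its old frame will be power-gated
        }
    }

    int main() {
        std::vector<CacheBlock> set(8);
        set[6] = {0xBEEF, true, false, 1};   // hot block sitting in the partition to be gated
        migrate_before_gating(set, /*live_ways=*/4);
        return (set[0].valid || set[1].valid || set[2].valid || set[3].valid) ? 0 : 1;
    }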

    Moore's Law scaling is continuing to yield ever-higher transistor density with each succeeding process generation, leading to today's multi-core Chip Multi-Processors (CMPs) with tens or even hundreds of interconnected cores or tiles. Unfortunately, deep sub-micron CMOS process technology is marred by increasing susceptibility to wearout. Prolonged operational stress gives rise to accelerated wearout and failure, due to several physical failure mechanisms, including Hot Carrier Injection (HCI) and Negative Bias Temperature Instability (NBTI). Each failure mechanism correlates with different usage-based stresses, all of which can eventually generate permanent faults. While the wearout of an individual core in many-core CMPs may not necessarily be catastrophic for the system, a single fault in the inter-processor Network-on-Chip (NoC) fabric could render the entire chip useless, as it could lead to protocol-level deadlocks, or even partition away vital components such as the memory controller or other critical I/O. In this paper, we develop critical path models for HCI- and NBTI-induced wear due to the actual stresses caused by real workloads, applied onto the interconnect microarchitecture. A key finding from this modeling is that, counter to prevailing wisdom, wearout in the CMP on-chip interconnect is correlated with lack of load observed in the NoC routers, rather than high load. We then develop a novel wearout-decelerating scheme in which routers under low load have their wearout-sensitive components exercised, without significantly impacting cycle time, pipeline depth, area, or power consumption of the overall router. We subsequently show that the proposed design yields a 13.8x-65x increase in CMP lifetime.

    Use it or lose it: wear-out and lifetime in future chip multiprocessors

    With increasing core counts in Chip Multi-Processor (CMP) designs, the size of the on-chip communication fabric and shared Last-Level Caches (LLC), which we term the uncore here, is also growing, consuming as much as 30% of die area and a significant portion of the chip power budget. In this work, we focus on improving the uncore energy-efficiency using dynamic voltage and frequency scaling. Previous approaches are mostly restricted to reactive techniques, which may respond poorly to abrupt workload and uncore utility changes. We find, however, that there are predictable patterns in uncore utility which point towards the potential of a proactive approach to uncore power management. In this work, we utilize artificial intelligence principles to proactively leverage uncore utility pattern prediction via an Artificial Neural Network (ANN). ANNs, however, require training to produce accurate predictions. Architecting an efficient training mechanism without a priori knowledge of the workload is a major challenge. We propose a novel technique in which a simple Proportional Integral (PI) controller is used as a secondary classifier during ANN training, dynamically pulling the ANN up by its bootstraps to achieve accurate predictions. Both the ANN and the PI controller, then, work in tandem once the ANN training phase is complete. The advantage of using a PI controller to initially train the ANN is a dramatic acceleration of the ANN's initial learning phase. Thus, in a real system, this scenario allows quick power-control adaptation to rapid application phase changes and context switches during execution. We show that the proposed technique produces results comparable to those of pure offline training without a need for prerecorded training sets. Full system simulations using the PARSEC benchmark suite show that the bootstrapped ANN improves the energy-delay product of the uncore system by 27% versus existing state-of-the-art methodologies.

    Up by their bootstraps: Online learning in artificial neural networks for CMP uncore power management
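
    A compact sketch of the bootstrapping idea, substituting a one-layer online predictor for the paper's actual ANN: during an initial phase a PI controller produces the decision and also serves as the training target, after which the learned predictor takes over. The features, gains, interval counts, and learning rate are all illustrative assumptions.

    #include <array>
    #include <cstdio>

    // Hypothetical uncore utility features sampled each control interval.
    struct Features { double amat, noc_load, llc_miss_rate; };

    // PI controller used as the "teacher" during the bootstrap phase.
    class PiTeacher {
    public:
        double decide(double amat, double target) {
            double error = amat - target;
            integral_ += error;
            return 0.4 * error + 0.02 * integral_;   // illustrative gains
        }
    private:
        double integral_ = 0.0;
    };

    // One-layer online predictor standing in for the paper's ANN.
    class TinyNet {
    public:
        double predict(const Features& f) const {
            return w_[0] * f.amat + w_[1] * f.noc_load + w_[2] * f.llc_miss_rate + bias_;
        }
        void train(const Features& f, double target, double lr = 0.001) {
            double err = target - predict(f);        // online gradient step toward the teacher
            w_[0] += lr * err * f.amat;
            w_[1] += lr * err * f.noc_load;
            w_[2] += lr * err * f.llc_miss_rate;
            bias_ += lr * err;
        }
    private:
        std::array<double, 3> w_{0.0, 0.0, 0.0};
        double bias_ = 0.0;
    };

    int main() {
        PiTeacher pi;
        TinyNet net;
        const int bootstrap_intervals = 50;
        for (int t = 0; t < 200; ++t) {
            Features f{20.0 + (t % 5), 0.3, 0.1};            // made-up samples
            double pi_decision = pi.decide(f.amat, /*target=*/22.0);
            double decision;
            if (t < bootstrap_intervals) {
                net.train(f, pi_decision);                   // pulled up by the PI teacher
                decision = pi_decision;
            } else {
                decision = net.predict(f);                   // predictor takes over once trained
            }
            if (t % 50 == 0) std::printf("t=%d decision=%.3f\n", t, decision);
        }
        return 0;
    }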

    Jasson Casey

    The Software Defined Networking (SDN) approach has numerous advantages, including the ability to program the network through simple abstractions, provide a centralized view of network state, and respond to changing network conditions. One of the main challenges in designing SDN-enabled switches is efficient packet classification in the data plane. As the complexity of SDN applications increases, the data plane becomes more susceptible to Denial of Service (DoS) attacks, which can result in increased delays and packet loss. Accordingly, there is a strong need for network architectures that operate efficiently in the presence of malicious traffic. In particular, there is a need to protect authorized flows from DoS attacks. In this work we utilize a probabilistic data structure to pre-classify traffic with the aim of decoupling likely legitimate traffic from malicious traffic by leveraging the locality of packet flows. We validate our approach by examining a fundamental SDN application: a software defined network firewall. For this application, our architecture dramatically reduces the impact of unknown/malicious flows on established/legitimate flows. We explore the effect of stochastic pre-classification in prioritizing data plane classification. We show how pre-classification can be used to increase the effective Quality of Service (QoS) for established flows and reduce the impact of adversarial traffic.

    Stochastic Pre-classification for SDN Data Plane Matching
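
    A minimal sketch of the pre-classification idea; the abstract only says "probabilistic data structure", so the Bloom filter, hash choices, and two-queue arrangement here are illustrative assumptions: packets of flows seen before are steered to a fast queue, while unknown (possibly adversarial) flows absorb the cost of full classification on a slow queue.

    #include <bitset>
    #include <cstdint>
    #include <functional>

    // Compact 5-tuple flow key; fields are illustrative.
    struct FlowKey {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint8_t  proto;
    };

    // Small Bloom filter over flow keys: two hash functions, 64K bits.
    class FlowBloom {
    public:
        void insert(const FlowKey& k) {
            bits_.set(h1(k) % kBits);
            bits_.set(h2(k) % kBits);
        }
        bool probably_contains(const FlowKey& k) const {
            return bits_.test(h1(k) % kBits) && bits_.test(h2(k) % kBits);
        }
    private:
        static constexpr size_t kBits = 1 << 16;
        std::bitset<kBits> bits_;

        static size_t h1(const FlowKey& k) {
            return std::hash<uint64_t>{}((uint64_t(k.src_ip) << 32) | k.dst_ip);
        }
        static size_t h2(const FlowKey& k) {
            return std::hash<uint64_t>{}((uint64_t(k.src_port) << 32) |
                                         (uint64_t(k.dst_port) << 8) | k.proto);
        }
    };

    enum class Queue { Fast, Slow };

    // Pre-classifier: likely-established flows go to the fast queue; everything
    // else (including adversarial churn) is isolated on the slow queue.
    Queue pre_classify(FlowBloom& seen, const FlowKey& k) {
        if (seen.probably_contains(k)) return Queue::Fast;
        seen.insert(k);            // remember the flow for subsequent packets
        return Queue::Slow;        // full (expensive) classification happens here
    }

    int main() {
        FlowBloom seen;
        FlowKey flow{0x0A000001, 0x0A000002, 443, 51515, 6};
        Queue first  = pre_classify(seen, flow);   // Slow: never seen before
        Queue second = pre_classify(seen, flow);   // Fast: flow locality kicks in
        return (first == Queue::Slow && second == Queue::Fast) ? 0 : 1;
    }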

    Gwan Choi

    Yoonseok Yang

    ACM Transactions on Design Automation of Electronic Systems (TODAES)

    WaveSync is a network-on-chip architecture for globally asynchronous locally-synchronous (GALS) designs. The WaveSync design facilitates low-latency communication by leveraging the source-synchronous clock sent along with the data to time components in the datapath of a downstream router, reducing the number of synchronizations needed. WaveSync accomplishes this by partitioning the router components at each node into different clock domains, each synchronized with one of the orthogonal incoming source-synchronous clocks in a GALS 2D mesh network. The data and clock subsequently propagate through each node/router synchronously until the destination is reached, regardless of the number of hops this may take. As long as the data travels in the path of clock propagation and no congestion is encountered, it will be propagated without latching, as if in a long combinatorial path, with both the clock and the data accruing delay at the same rate. The result is that the need for synchronization between the mesochronous nodes and/or the asynchronous control associated with a typical GALS network is completely eliminated. To further reduce the latency overhead of synchronization, for those occasions when synchronization is still required (when a flit takes a turn or arrives at the destination), we propose a novel less-than-one-cycle synchronizer. The proposed WaveSync network, synthesized using a 45nm CMOS library, outperforms conventional GALS networks by 87-90% in average latency.

    WaveSync: Low-Latency Source-Synchronous Bypass Network-on-Chip Architecture

    Firewalls are ubiquitous security functions and exist in almost all network-connected devices, whether protecting host stacks or providing transient packet filtering. Firewall performance, which is a key ingredient of network performance, can be greatly degraded by traffic crafted to exploit its filtering algorithms. These attacks can greatly reduce the Quality of Service (QoS) received by existing authorized flows in the firewall. This paper proposes a novel architecture that decouples this linkage between authorized flow QoS and adversarial traffic, marginalizing disruption caused by unauthorized flows, and ultimately improving the overall performance of software defined firewalls. We show substantial improvements in throughput, packet loss, and latency over baseline software defined firewalls with varying ratios of attack traffic. All results are obtained using the cycle-accurate architecture simulator gem5 and Internet packet traces obtained from 10 Gbps interfaces of core Internet routers.

    Stochastic Pre-Classification for Software Defined Firewalls

    Cheng Li

    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

    To meet energy-efficient performance demands, the computing industry has moved to parallel computer architectures, such as chip multiprocessors (CMPs), internally interconnected via networks-on-chip (NoC) to meet growing communication needs. Achieving scaling performance as core counts increase to the hundreds in future CMPs, however, will require high performance, yet energy-efficient interconnects. Silicon nanophotonics is a promising replacement for electronic on-chip interconnect due to its high bandwidth and low latency; however, prior techniques have required high static power for the laser and ring thermal tuning. We propose a novel nanophotonic NoC (PNoC) architecture, LumiNOC, optimized for high performance and power-efficiency. This paper makes three primary contributions: a novel nanophotonic architecture which partitions the network into subnets for better efficiency; a purely photonic, in-band, distributed arbitration scheme; and a channel sharing arrangement utilizing the same waveguides and wavelengths for arbitration as for data transmission. In a 64-node NoC under synthetic traffic, LumiNOC enjoys 50% lower latency at low loads and ~40% higher throughput per Watt versus other reported PNoCs. LumiNOC reduces latencies ~40% versus an electrical 2D mesh NoC on the PARSEC shared-memory, multithreaded benchmark suite.

    LumiNOC: A Power-Efficient, High-Performance, Photonic Network-on-Chip

    As processor chips become increasingly parallel, an efficient communication substrate is critical for meeting performance and energy targets. In this work, we target the root cause of network energy consumption through techniques that reduce link and router-level switching activity. We specifically focus on memory subsystem traffic, as it comprises the bulk of NoC load in a CMP. By transmitting only the flits that contain words predicted useful using a novel spatial locality predictor, our scheme seeks to reduce network activity. We aim to further lower NoC energy through microarchitectural mechanisms that inhibit datapath switching activity for unused words in individual flits. Using simulation-based performance studies and detailed energy models based on synthesized router designs and different link wire types, we show that 1) the prediction mechanism achieves very high accuracy, with an average rate of false-unused prediction of just 2.5 percent; 2) the combined NoC energy savings enabled by the predictor and microarchitectural support is 36 percent on average, and up to 57 percent in the best case; and 3) there is no system performance penalty as a result of this technique.

    Spatial Locality Speculation to Reduce Energy in Chip-Multiprocessor Networks-on-Chip
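
    A small sketch of the flit-filtering idea, assuming a PC-indexed table of used-word bitmasks as the spatial locality predictor (the real predictor organization may differ): on a miss, the predicted mask selects which words of the block are actually packed into reply flits.

    #include <bitset>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    constexpr size_t kWordsPerBlock = 16;   // 64B block / 4B words (illustrative)
    using WordMask = std::bitset<kWordsPerBlock>;

    // Predictor: remembers, per requesting PC, which words of a block were
    // actually touched the last time this PC missed.
    class SpatialLocalityPredictor {
    public:
        WordMask predict(uint64_t pc) const {
            auto it = table_.find(pc);
            if (it != table_.end()) return it->second;
            return WordMask().set();              // unknown PC: conservatively send all words
        }
        void train(uint64_t pc, const WordMask& observed_used) {
            table_[pc] = observed_used;
        }
    private:
        std::unordered_map<uint64_t, WordMask> table_;
    };

    // Pack only the predicted-useful words into the reply payload; unused word
    // slots never toggle the link, which is where the energy saving comes from.
    std::vector<uint32_t> build_reply_payload(const std::vector<uint32_t>& block,
                                              const WordMask& mask) {
        std::vector<uint32_t> payload;
        for (size_t w = 0; w < block.size() && w < kWordsPerBlock; ++w)
            if (mask.test(w)) payload.push_back(block[w]);
        return payload;
    }

    int main() {
        SpatialLocalityPredictor pred;
        WordMask used;
        used.set(0); used.set(1);                 // this PC historically touches two words
        pred.train(0x400123, used);

        std::vector<uint32_t> block(kWordsPerBlock, 0xAB);
        auto payload = build_reply_payload(block, pred.predict(0x400123));
        return payload.size() == 2 ? 0 : 1;       // only the useful words travel the NoC
    }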

    Reena Panda

    IEEE Computer Architecture Letters

    Computer architecture is beset by two opposing trends. Technology scaling and deep pipelining have led to high memory access latencies; meanwhile, power and energy considerations have revived interest in traditional in-order processors. In-order processors, unlike their superscalar counterparts, do not allow execution to continue around data cache misses. In-order processors, therefore, suffer a greater performance penalty in light of the current high memory access latencies. Memory prefetching is an established technique to reduce the incidence of cache misses and improve performance. In this paper, we introduce B-Fetch, a new technique for data prefetching which combines branch-prediction-based lookahead deep path speculation with effective address speculation to efficiently improve performance in in-order processors. Our results show that B-Fetch improves performance by 38.8% on the SPEC CPU2006 benchmarks, beating a current, state-of-the-art prefetcher design at ~1/3 the hardware overhead.

    B-Fetch: Branch Prediction Directed Prefetching for In-Order Processors

    As modern CMPs scale to ever increasing core counts, Networks-on-Chip (NoCs) are emerging as the interconnection fabric enabling communication between components. While NoCs provide high and scalable bandwidth, current routing algorithms, such as dimension-ordered routing, suffer from poor load balance, leading to reduced throughput and high latencies. Improving load balance, hence, is critical in future CMP designs, where increased latency leads to wasted power and energy waiting for outstanding requests to resolve. Adaptive routing is a known technique to improve load balance; however, prior adaptive routing techniques use either local or regionally aggregated information to form their routing decisions. This paper proposes a new, light-weight, adaptive routing algorithm for on-chip routers based on global link state and congestion information: Global Congestion Awareness (GCA). GCA uses a simple, low-complexity route calculation unit to calculate paths to their destination without the myopia of local decisions, nor the aggregation of unrelated status information, found in prior designs. In particular, GCA outperforms local adaptive routing by 26%, Regional Congestion Awareness (RCA) by 15%, and a recent competing adaptive routing algorithm, DAR, by 8% on average on realistic workloads.

    GCA: Global congestion awareness for load balance in networks-on-chip
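
    An illustrative sketch of globally congestion-aware output selection on a 2D mesh, assuming each router already holds a snapshot of every node's congestion (how that state is distributed is the paper's contribution and is not modeled); at each hop the router picks the admissible X or Y direction whose estimated remaining-path congestion is lowest. Deadlock avoidance and the real route-computation hardware are omitted.

    #include <cstdio>
    #include <vector>

    // Global congestion snapshot for a W x H mesh: one value per router giving
    // (for this sketch) the occupancy of its output links, 0 = idle.
    struct MeshState {
        int width, height;
        std::vector<int> congestion;                       // indexed by y * width + x
        int at(int x, int y) const { return congestion[y * width + x]; }
    };

    // Estimated congestion along a dimension-ordered (X then Y) path from (x,y)
    // to (dx,dy), used as the cost of choosing a particular next hop.
    int path_cost(const MeshState& m, int x, int y, int dx, int dy) {
        int cost = 0;
        while (x != dx) { x += (dx > x) ? 1 : -1; cost += m.at(x, y); }
        while (y != dy) { y += (dy > y) ? 1 : -1; cost += m.at(x, y); }
        return cost;
    }

    enum class Dir { X, Y, Local };

    // Pick the admissible direction (toward the destination) with the lowest
    // globally estimated remaining-path congestion.
    Dir select_output(const MeshState& m, int x, int y, int dx, int dy) {
        if (x == dx && y == dy) return Dir::Local;
        if (y == dy) return Dir::X;
        if (x == dx) return Dir::Y;
        int step_x = (dx > x) ? 1 : -1;
        int step_y = (dy > y) ? 1 : -1;
        int cost_x = m.at(x + step_x, y) + path_cost(m, x + step_x, y, dx, dy);
        int cost_y = m.at(x, y + step_y) + path_cost(m, x, y + step_y, dx, dy);
        return (cost_x <= cost_y) ? Dir::X : Dir::Y;
    }

    int main() {
        MeshState m{4, 4, std::vector<int>(16, 1)};
        m.congestion[1] = 9;                               // hotspot at node (1,0)
        Dir d = select_output(m, /*x=*/0, /*y=*/0, /*dx=*/2, /*dy=*/2);
        std::printf("chosen direction: %s\n", d == Dir::X ? "X" : "Y");  // routes around the hotspot
        return 0;
    }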

    Viacheslav Fedorov

    ACM Transactions on Architecture and Code Optimization (TACO)

    Decreasing the traffic from the CPU LLC to main memory is a very important issue in modern systems. Recent work focuses on cache misses, overlooking the impact of writebacks on the total memory traffic, energy consumption, IPC, and so forth. Policies that foster a balanced approach between reducing write traffic to memory and improving miss rates can increase overall performance and improve energy efficiency and memory system lifetime for NVM memory technology, such as phase-change memory (PCM). We propose Adaptive Replacement and Insertion (ARI), an adaptive approach to last-level CPU cache management, optimizing the two parameters (miss rate and writeback rate) simultaneously. Our specific focus is to reduce writebacks as much as possible while maintaining or improving the miss rate relative to a conventional LRU replacement policy. ARI reduces LLC writebacks by 33% on average, while also decreasing misses by 4.7% on average. In a typical system, this boosts IPC by 4.9% on average, while decreasing energy consumption by 8.9%. These results are achieved with minimal hardware overheads.

    ARI: Adaptive LLC-memory traffic management
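
    A rough sketch of the balance ARI aims for, assuming a clean-preferring victim choice within a set plus a simple mode switch driven by the observed miss rate; the actual ARI adaptation of replacement and insertion is more involved than this.

    #include <cstdint>
    #include <vector>

    struct Line {
        bool     valid = false;
        bool     dirty = false;
        uint32_t lru_age = 0;            // larger = older
    };

    // Victim selection for one set. In writeback-saving mode, prefer the oldest
    // *clean* line (avoiding a memory write); otherwise fall back to plain LRU.
    size_t pick_victim(const std::vector<Line>& set, bool save_writebacks) {
        size_t lru = 0, lru_clean = set.size();
        for (size_t w = 0; w < set.size(); ++w) {
            if (!set[w].valid) return w;                         // free slot, no eviction needed
            if (set[w].lru_age >= set[lru].lru_age) lru = w;
            if (!set[w].dirty &&
                (lru_clean == set.size() || set[w].lru_age >= set[lru_clean].lru_age))
                lru_clean = w;
        }
        if (save_writebacks && lru_clean != set.size()) return lru_clean;
        return lru;
    }

    // Simple adaptation: stay in writeback-saving mode only while the miss rate
    // does not degrade beyond a small slack versus a baseline sample (illustrative).
    struct AriMode {
        double baseline_miss_rate = 0.0;
        bool   save_writebacks    = true;
        void adapt(double observed_miss_rate) {
            save_writebacks = observed_miss_rate <= baseline_miss_rate * 1.02;
        }
    };

    int main() {
        std::vector<Line> set(4);
        set[0] = {true, true,  3};       // oldest line, but dirty
        set[1] = {true, false, 2};       // oldest clean line
        set[2] = {true, true,  1};
        set[3] = {true, false, 0};
        size_t v1 = pick_victim(set, /*save_writebacks=*/true);   // -> way 1 (clean, no writeback)
        size_t v2 = pick_victim(set, /*save_writebacks=*/false);  // -> way 0 (plain LRU)
        return (v1 == 1 && v2 == 0) ? 0 : 1;
    }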

    Paul V. Gratz