Efficient Realization of Householder Transform through Algorithm-Architecture Co-design for Acceleration of QR Factorization

Authors:
Farhad Merchant, Tarun Vatwani, Anupam Chattopadhyay, Soumyendu Raha, S K Nandy, and Ranjani Narayan
Journal:
IEEE Transactions on Parallel and Distributed Systems
Publisher:
IEEE
Date:
Mar. 2018
DOI:
10.1109/TPDS.2018.2803820
Language:
English
Copyright:
©2018 IEEE

BibTeX

@article{merchant18,
author = {Farhad Merchant and Tarun Vatwani and Anupam Chattopadhyay and Soumyendu Raha and S K Nandy and Ranjani Narayan},
title = {Efficient Realization of Householder Transform through Algorithm-Architecture Co-design for Acceleration of QR Factorization},
year = {2018},
month = {mar},
journal = {IEEE Transactions on Parallel and Distributed Systems},
publisher = {IEEE},
doi = {10.1109/TPDS.2018.2803820},
}

Abstract

QR factorization is a ubiquitous operation in many engineering and scientific applications. In this paper, we present an efficient realization of Householder Transform (HT) based QR factorization through algorithm-architecture co-design, achieving a performance improvement of 3-90x in terms of Gflops/watt over state-of-the-art multicore, General Purpose Graphics Processing Units (GPGPUs), Field Programmable Gate Arrays (FPGAs), and ClearSpeed CSX700. We analyze classical HT theoretically and experimentally for opportunities to exhibit a higher degree of parallelism, where parallelism is quantified as the number of parallel operations per level in the Directed Acyclic Graph (DAG) of the transform. Based on this analysis, we identify an opportunity to re-arrange computations in classical HT, resulting in Modified HT (MHT), which is shown to exhibit 1.33x higher parallelism than classical HT. Experiments on off-the-shelf multicore and GPGPUs for HT and MHT suggest that MHT is capable of achieving slightly better or equal performance compared to classical HT based QR factorization realizations in the optimized software packages for Dense Linear Algebra (DLA). We implement MHT on a customized platform for DLA and show that MHT achieves 1.3x better performance than the native implementation of classical HT on the same accelerator. For custom realization of HT and MHT based QR factorization, we also identify macro operations in the DAGs of HT and MHT that are realized on a Reconfigurable Data-path (RDP). We also observe that, due to the re-arrangement of computations in MHT, the custom realization of MHT achieves a 12% better performance improvement over multicore and GPGPUs than the improvement reported by General Matrix Multiplication (GEMM) over highly tuned DLA software packages for multicore and GPGPUs, which is counter-intuitive.
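
For readers unfamiliar with the baseline, the following is a minimal pure-Python sketch of textbook classical Householder QR factorization. It is not the paper's MHT and not the authors' implementation; the function name `householder_qr` and the list-of-lists matrix layout are assumptions made purely for illustration. The trailing-matrix update runs as two dependent steps (an inner product, then a rank-1 correction), which is the kind of dependency chain a DAG-level parallelism analysis would examine.

```python
import math


def householder_qr(a):
    """Factor an m x n matrix (list of row lists) as a = q * r.

    Textbook classical Householder QR; an illustrative sketch only.
    """
    m, n = len(a), len(a[0])
    r = [[float(x) for x in row] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]

    for k in range(min(m - 1, n)):
        # Column k from the diagonal down is the vector to be reflected.
        x = [r[i][k] for i in range(k, m)]
        norm_x = math.sqrt(sum(t * t for t in x))
        if norm_x == 0.0:
            continue  # column is already zero below the diagonal

        # v = x - alpha * e1, with the sign of alpha chosen to avoid cancellation.
        alpha = -math.copysign(norm_x, x[0])
        v = x[:]
        v[0] -= alpha
        beta = 2.0 / sum(t * t for t in v)

        # R <- (I - beta v v^T) R, computed column by column in two steps:
        # first the scalar w = beta * v^T r_j, then the update r_j -= w * v.
        for j in range(k, n):
            w = beta * sum(v[i] * r[k + i][j] for i in range(len(v)))
            for i in range(len(v)):
                r[k + i][j] -= w * v[i]

        # Q <- Q (I - beta v v^T), so that Q accumulates P_1 P_2 ... P_k.
        for i in range(m):
            w = beta * sum(q[i][k + l] * v[l] for l in range(len(v)))
            for l in range(len(v)):
                q[i][k + l] -= w * v[l]

    return q, r


if __name__ == "__main__":
    a = [[12.0, -51.0, 4.0], [6.0, 167.0, -68.0], [-4.0, 24.0, -41.0]]
    q, r = householder_qr(a)
    for row in r:
        print(["%.4f" % x for x in row])
```

The sketch favors readability over speed; tuned DLA packages typically block these updates so that most of the work is done by matrix-matrix kernels.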

