Towards a Performance-portable Description of Geometric Multigrid Algorithms using a Domain-specific Language
Journal of Parallel and Distributed Computing (JPDC), 24(12):3191-3201
Keywords: multigrid; multiresolution; image pyramid; domain-specific language; stencil codes; code generation; GPU; CUDA; OpenCL
Abstract: High Performance Computing (HPC) systems are nowadays more and more heterogeneous. Different processor types can be found on a single node including accelerators such as Graphics Processing Units (GPUs). To cope with the challenge of programming such complex systems, this work presents a domain-specific approach to automatically generate code tailored to different processor types. Rather than requiring hand-tuned code for GPU accelerators, the approach generates low-level CUDA and OpenCL code from a high-level description of an algorithm specified in a Domain-Specific Language (DSL). The DSL is part of the Heterogeneous Image Processing Acceleration (HIPAcc) framework and was extended in this work to handle grid hierarchies in order to model different cycle types. Language constructs are introduced to process and represent data at different resolutions. This makes it possible to describe image processing algorithms that work on image pyramids as well as multigrid methods in the stencil domain. By decoupling the algorithm from its schedule, the proposed approach allows efficient stencil code implementations to be generated. Our results show that performance similar to that of hand-tuned codes can be achieved.
shade.js: Adaptive Material Descriptions
Computer Graphics Forum, 33(7):51-60
Code Refinement of Stencil Codes
Parallel Processing Letters (PPL), 24(3):1-16
Keywords: stencil codes; partial evaluation; domain-specific language
Abstract: A straightforward implementation of an algorithm in a general-purpose programming language usually does not deliver peak performance: compilers often fail to automatically tune the code for certain hardware peculiarities like memory hierarchy or vector execution units. Manually tuning the code is error-prone and time-consuming, and it taints the code by exposing those peculiarities to the implementation. A popular method to circumvent these problems is to implement the algorithm in a Domain-Specific Language (DSL). A DSL compiler can then automatically tune the code for the target platform. In this paper we show how to embed a DSL for stencil codes in another language. In contrast to prior approaches we only use a single language for this task. Furthermore, we offer explicit control over code refinement in the language itself, which is used to specialize stencils for particular scenarios. Our first results show that our specialized programs achieve competitive performance compared to hand-tuned CUDA programs.
Progressive Light Transport Simulation on the GPU: Survey and Improvements
ACM Trans. Graph., 33(3):29:1-29:19
Keywords: GPU; Global illumination; bidirectional path tracing; high performance; vertex connection and merging
Abstract: Graphics Processing Units (GPUs) recently became general enough to enable implementation of a variety of light transport algorithms. However, the efficiency of these GPU implementations has received relatively little attention in the research literature and no systematic study on the topic exists to date. The goal of our work is to fill this gap. Our main contribution is a comprehensive and in-depth investigation of the efficiency of the GPU implementation of a number of classic as well as more recent progressive light transport simulation algorithms. We present several improvements over the state-of-the-art. In particular, our Light Vertex Cache, a new approach to mapping connections of sub-path vertices in Bidirectional Path Tracing on the GPU, outperforms the existing implementations by 30-60%. We also describe a first GPU implementation of the recently introduced Vertex Connection and Merging algorithm [Georgiev et al. 2012], showing that even relatively complex light transport algorithms can be efficiently mapped on the GPU. With the implementation of many of the state-of-the-art algorithms within a single system at our disposal, we present a unique direct comparison and analysis of their relative performance.
A Collaborative Virtual Workspace for Factory Configuration and Evaluation
Collaborative Computing
Combined Scanning Transmission Electron Microscopy Tilt- and Focal Series
Microscopy and Microanalysis, :1-13
Keywords: STEM; tomography; 3D; focal series; whole cell; nanoparticle; SART; 3D reconstruction; back projection
Abstract: In this study, a combined tilt- and focal series is proposed as a new recording scheme for high-angle annular dark-field scanning transmission electron microscopy (STEM) tomography. Three-dimensional (3D) data were acquired by mechanically tilting the specimen, and recording a through-focal series at each tilt direction. The sample was a whole-mount macrophage cell with embedded gold nanoparticles. The tilt–focal algebraic reconstruction technique (TF-ART) is introduced as a new algorithm to reconstruct tomograms from such combined tilt- and focal series. The feasibility of TF-ART was demonstrated by 3D reconstruction of the experimental 3D data. The results were compared with a conventional STEM tilt series of a similar sample. The combined tilt- and focal series led to smaller “missing wedge” artifacts, and a higher axial resolution than obtained for the STEM tilt series, thus improving on one of the main issues of tilt series-based electron tomography.
Target-Specific Refinement of Multigrid Codes
Proceedings of the 4th International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC), pages 52-57.
Keywords: multigrid codes; partial evaluation; domain-specific language
Abstract: This paper applies partial evaluation to stage a stencil code Domain-Specific Language (DSL) onto a functional and imperative programming language. Platform-specific primitives such as scheduling or vectorization, and algorithmic variants such as boundary handling, are factored out into a library that makes up the elements of that DSL. We show how partial evaluation can eliminate all overhead of this separation of concerns and create code that resembles hand-crafted versions for a particular target platform. We evaluate our technique by implementing a DSL for the V-cycle multigrid iteration. Our approach generates code for AMD and NVIDIA GPUs (via SPIR and NVVM) as well as for CPUs using AVX/AVX2 from the same high-level DSL program. First results show that we achieve a speedup of up to 3× on the CPU by vectorizing multigrid components and a speedup of up to 2× on the GPU by merging the computation of multigrid components.
Specialization through Dynamic Staging
Proceedings of the 13th International Conference on Generative Programming: Concepts & Experiences (GPCE), pages 103-112.
Keywords: dynamic staging; partial evaluation; code specialization
Abstract: Partial evaluation allows for specialization of program fragments. This can be realized by staging, where one fragment is executed earlier than its surrounding code. However, taking advantage of these capabilities is often a cumbersome endeavor. In this paper, we present a new metaprogramming concept using staging parameters that are first-class citizen entities and define the order of execution of the program. Staging parameters can be used to define MetaML-like quotations, but can also allow stages to be created and resolved dynamically. The programmer can write generic, polyvariant code which can be reused in the context of different stages. We demonstrate how our approach can be used to define and apply domain-specific optimizations. Our implementation of the proposed metaprogramming concept generates code which is on a par with templated C++ code in terms of execution time.
Combined tilt- and focal series scanning transmission electron microscopy: TFS 3D STEM
Proceedings of 18th International Microscopy Congress
TFS: Combined Tilt- and Focal Series for Scanning Transmission Electron Microscopy.
Proceedings of Microscopy & Microanalysis 2014
Optimized patient-specific implants
Proceedings of 11th World Congress on Computational Mechanics
Platform-Specific Optimization and Mapping of Stencil Codes through Refinement
Proceedings of the First International Workshop on High-Performance Stencil Computations (HiStencils), pages 1-6.