Laxmikant Kale
Raising the Level of Abstraction in Parallel Programming
Parallel programming is undoubtedly more difficult than sequential programming, yet it needs to become widespread in the upcoming era of ubiquitous parallel computer. I will review an approach to parallel programming that my group has been following for 15-20 years, summarize how this approach has evolved, and present a view of the future.
At the foundation of our approach is a programming model based on migratable entities that interact with each other, without reference to the processors on which they reside. This model, embodied in Charm++, empowers an adaptive runtime system to automate resource management, often via introspection. In addition, the message-driven execution it engenders leads to several performance benefits. Most importantly, it promotes parallel composition. Raising the level of abstraction further appears to require giving up the idea of a single complete notation. Instead, we can design multiple notations (aka languages), each capable of capturing a specific interaction pattern well. This is similar to the observations behind parallel skeletons. I will describe two mini-languages designed with this philosophy --- Multiphase shared Arrays and Charisma. Parallel composition is now critical to allow multi-paradigm programs to work effectively.
I will illustrate the concepts with examples from production-quality scalable applications we have been working on, including NAMD for Biophysics, OpenAtom for quantum chemistry/nanotechnology, and ChaNGa for astronomy, running on thousands of processors.
Steve Linton
The HPCGAP project: Delivering Usable Scalable Parallelism to Computational Pure Mathematicians
The GAP (Groups, Algorithms, Programming) system (www.gap-system.org) is a large software system widely used in computational algebra and discrete mathematics. Most of the capabilities of the system (and the almost 100 contributed extension packages which are distributed with it) are programmed in the GAP language, which is supported by a run-time system written in C. The 1000-2000 users (estimated) range from those who use GAP as an "algebraic desk calculator" to those who develop and support substantial extension packages and/or have calculations that run for weeks and months.
GAP 4 was originally developed in the early 1990s (at RWTH-Aachen) and (except for one experimental extension package) is entirely sequential. The popular wisdom in the community has been "if you have two processors, use them to solve two problems". Since 2009, supported by EPSRC, we have been undertaking a project to develop GAP to support parallel programming in a range of environments, from multicore laptops to supercomputers.
This project faces a number of challenges which will hopefully lead to developments of wider interest:
* algebaic calculations have noticeably different characteristics from the standard range of numerical algorithms regularly seen in HPC, for instance they are often highly irregular (with subproblems of very different sizes) and dynamic (with subproblems giving rise to one another in hard-to-predict patterns).
* we are parallelising a language and a library, not an application (although we have demonstrator applications). We can't know, in detail, how things we write will be used
* we need existing sequential code to continue to work, and we need well-behaved sequential code to be usable with little or no modification inside parallel programmes.
* we want to present our users and developers with a reasonably small number of parallel frameworks to understand, and, as far as possible for those frameworks to be usable in the same ways for shared-memory and distributed memory computations across a range of scales.
This talk will fill in more of the background of the project, and explain our plans (and some preliminary results) for addressing these challenges.
Andreas Frommer
Beyond Standard Krylov Methods: Desirable Features for Scientific Software
Today, efficient implementations of the Conjugate Gradient method and of the Generalized Minimal Resiudal method as well as some other stan- dard iterative Krylov subspace methods are part of virtually any library for scientific computation. In this talk, we would like to stress the importance of some extensions of these methods an efficient implementation of which seems to be very desirable for many applications. Our particular perspec- tive are the linear algebra needs in numerical simulation methods for lattice QCD, but we will also point to other applications. The extension we will consider are methods for families of shifted linear systems (and their appli- cation for the computation of matrix functions), so-called flexible methods which allow for a varying preconditioner and augmented methods, including deflation approaches.
Vijay Nagarajan
Monitoring Parallel Programs for Performance and Reliability
While parallel architectures like multicores are becoming ubiquitous,
extracting performance from them is contingent on programmers writing
parallel software. Demand for parallel software has sparked off yet
another demand, the demand for monitoring parallel programs for
improving performance and reliability. Debugging tools such as race
detection, techniques such as speculative parallelisation that strive
to expose parallelism and improve performance, are examples of
applications that require a parallel program to be monitored during
runtime.
While the monitoring applications are quite different in their purpose
and implementation, they all share a common requirement in the context
of monitoring a parallel program -- the need to detect and react to
interprocessor shared memory dependences (ISMDs). We propose a simple
architectural extension in the form of support for exposing cache
events to the software, in effect, efficiently exposing the ISMDs to
software. We demonstrate how a wide variety of monitoring
applications, including those that allow speculation past barrier
synchronisations, and those that ensure sequential consistency in a
machine with a relaxed memory model, can be implemented efficiently
using this support.
Ozgur Ergul
Parallelization of the Multilevel Fast Multipole Algorithm for Accurate Solutions of Extremely Large Electromagnetics Problems
Large-scale electromagnetics problems can be solved efficiently with the multilevel fast multipole algorithm (MLFMA), which reduces the complexity of matrix-vector multiplications required by iterative solvers. On the other hand, accurate solutions of many real-life problems require discretizations with millions of elements, leading to dense matrix equations with millions of unknowns, which cannot easily be solved with sequential implementations of MLFMA running on a single processor. To solve such large problems, it is helpful to increase computational resources by assembling parallel computing platforms and, at the same time, by parallelizing MLFMA. Unfortunately, the parallelization of MLFMA is not trivial because of the complicated structure of this algorithm. Simple parallelization techniques usually fail to provide efficient solutions, due to communications among processors, poor load-balancing of the workload, and unavoidable duplications of computations over multiple processors. In this talk, an efficient parallelization of MLFMA using a hierarchical partitioning strategy will be presented. Combining the hierarchical strategy with optimizations and load-balancing algorithms allows for the solution of very large electromagnetics problems involving hundreds of millions of unknowns, which were unsolvable before due to their complexity.
Sebastien Loisel
Condition number estimates for the nonoverlapping optimized Schwarz method and the 2-Lagrange multiplier method for general domains and cross points
The optimized Schwarz method and the closely related 2-Lagrange multiplier method are domain decomposition methods which can be used to parallelize the solution of partial differential equations. Although these methods are known to work well in special cases (e.g., when the domain is a square and the two subdomains are rectangles), the problem has never been systematically stated nor analyzed for general domains with general subdomains. The problem of cross points (when three or more subdomains meet at a single vertex) has been particularly vexing.
We introduce a 2-Lagrange multiplier method for domain decompositions with cross points, and describe its relationship with the nonoverlapping optimized Schwarz method. We estimate the condition number of the iteration and provide an optimized Robin parameter for general domains. We hope that this new systematic theory will allow broader utilization of optimized Schwarz and 2-Lagrange multiplier preconditioners.
Magnus Svard
Finite difference methods for the Navier-Stokes equations
High-order finite difference schemes are notoriously difficult to stabilize at boundaries. A successful remedy for this problem is to use Summation-by-Parts (SBP) finite difference schemes along with Simultaneous Approximation Terms (SAT) for imposing boundary conditions. I will review this theory and indicate how linear stability can be proved via energy estimates for the full Navier-Stokes equations on multi-block grids.
Furthermore, I will discuss the extension of the SBP theory to flows with non-smooth solutions, i.e. supersonic flows. For such flows, entropy plays the same role as energy for linear problems and is the key to stable approximations. I will show how entropy stable high-order finite difference schemes can be constructed for Cauchy problems and also discuss problems that arise when boundaries are introduced.