# Methods » Direct: Parallel numerical solution of Linear Systems

As multiprocessors become commercially available there is a vast need for developing software for such machines. Since the core of most scientific computations is the numerical solution of linear systems of equations the goal of our research will be the development of algorithms for shared and distributed memory multiprocessors. Emphasis will be given on the implementation of such software using PVM or MPI. Direct methods will be considered for solving dense linear systems taking advantage of the particular architecture of the aforementioned parallel machines. We will proceed by identifying independent subtask modules in the known sequential algorithms. Each subtask module can be defined either as an elementary computer operation or as an entire procedure. Next, we will consider the task graph which is produced by the data dependencies of the subtasks imposed by the flow of the sequential algorithms. The final step will consist of devising a scheduling algorithm which will execute these subtasks such that:

(i) The constraints imposed are met and (ii) the execution time is minimum.

From the above description one can clearly identify that: (i) the selection of the subtask module and (ii) the scheduling algorithm, require further research such that the overall performance of the algorithm is maximized. For both the shared memory and distributed memory machines we will assume that the number of processors p is analogous to O(n) or O(n2), where n is the number of equations. Also, the granularity of the subtasks must be such that communication time does not dominate over computation time.

The characteristics of the target architecture critically affect the partitioning of matrices. Several partitionings are plausible, including those by rows, columns, submatrices, or diagonals. Partitionings by diagonals can be effective for matrices having various special structures (e.g. banded, circulant, Toeplitz) but is usually not optimal for general matrices. Partitioning by submatrices tends to minimize the resulting communication volume. Since, it minimizes the perimeter of the individual pieces. However, it also tends to spread this volume over a relatively large number of messages, yielding poor performance on machines having high start-up costs for communication. We will initially restrict our attention to partitioning by rows and by columns. Next, we will focus our attention of finding good heuristic (or optimal if possible) scheduling algorithms by studying the possible ways of executing the defined subtask modules under the given precedence constraints. The performance study of the scheduling algorithms will follow a deterministic approach taking into consideration the communication time analysis also.

The numerical solution of linear systems is one of the central problems in scientific computing. Many numerical methods exist for the solution of such systems and efficient codes for uniprocessor machines have been developed. With the advent of parallel computers, algorithms suitable for such machines were studied and are surveyed in [1], [2] and [3]. Most of this research was based on the modification of the sequential form of the algorithm into a parallel version, suitable for SIMD (Single Instruction Multiple Data) machines and required a very large number of processors (as many as O(n^{2}). It is only recently that research is carried out to develop efficient algorithms for MIMD (Multiple Instruction Multiple Data) machines using only O(n) processors (see e.g. [4]). The research in this area follows two main streamlines. The first considers parallel algorithms suitable for various types of Distributed Memory Message Passing (DMMP) machines [5], [6], [7], [8], [9], [10], [11], [12], [13]. The second develops parallel algorithms for a shared memory multiprocessor [4], [14], [15], [16], [17], [19], [20]. The papers for the DMMP machine study the implementation of the classical Gaussian elimination and the LU factorization method on the hypercube by considering column or row partitionings and various types of mapping to processors (reflection, wrap).

## Objective

To develop software for the numerical solution of Linear Systems for parallel machines. Currently we are interested in developing parallel algorithms for GPUs with CUDA.

## Current Research

Although, typically the CPUs of computer systems have been used to solve computational problems, current trend is to offload heavy computations to accelerators and particularly graphics processing units (GPUs). As commodity GPUs have increased computational power compared to modern CPUs they are proposed as a more efficient compute unit in solving scientific problems with large computational burden. Thus, application programming environments have been developed like the proprietary CUDA (Compute Unified Development Architecture) by NVidia and the OpenCL (Open Computing Language) which is supported by many hardware vendors, including NVidia. CUDA environment is rapidly evolving and a constantly increasing number of researchers is adopting it in order to exploit GPU capabilities. It provides an elegant way for writing GPU parallel programs, by using similar syntax and rules based on C/C++ language, without involving other graphics APIs. In our work we use GPUs for the numerical solution of dense linear systems.

We have started our study with the parallel implementation of the Gauss-Jordan and WZ (or QIF) [18]. Note that: (i) preliminary work for the study of Gauss-Jordan method has already been initiated in [15] and (ii) study of WZ method is also proposed in [2]. In the study of these algorithms we also include the solution of triangular systems which is the final phase of the direct methods, since this process requires a significant portion of the total time.

As far as the research work carried out for the shared memory machines is concerned, the methods examined are the Gaussian elimination, Gauss-Jordan, Doolittle and Cholesky algorithms. It is only in [19] where a communication time study appears, resulting in the modification of the latter method. As a first step, we study the influence of communication costs on the computational complexity for each of the above algorithms. Next, we will consider the LU and WZ factorization methods and attempt to compare all these methods in order to draw some firm conclusions about their overall performance. The study of sparse triangular linear systems will also be considered for shared memory machines.

The consideration of the communication time analysis along with the computational complexity will lead us to the ideal multiprocessor performance. For the achievement of this goal on shared memory machines the primary impediments are: (i) load imbalance and (ii) synchronization delays. Using deterministic and probabilistic strategies we hope to find good heuristic scheduling algorithms, which will provide a reduction in (i) and (ii). In message passing machines communication startups are quite expensive. Techniques that minimize the number of such startups are consequently of crucial importance. The reduction in the number of synchronizations clearly also reduce the number of startups required. For such machines the amount of information communicated also can play an important role and strongly depends on the work mapping onto the processors. A first attempt to tackle the problem of introducing parallelism into iterative methods for solving linear systems for shared memory machines is presented in [17].

We intent to advance this approach by exploring different partitionings of the directed acyclic graph obtained by the discretization of an elliptic partial differential equation using 5 and 9 point stencils. Finally we intend to introduce parallelism in Linear Programming problems (simplex, interior point methods) using similar study as for the linear systems.

Back to top