With growing digitization of the world it is increasingly easier to collect mammoth-size amounts of data. Often, this data is analyzed using an optimization algorithm, and leads to difficult huge-scale optimization problems with millions or billions of variables. Existing optimization algorithms, which are perfectly suited for solving problems of medium size, such as polynomial-time interior-point methods, are often not useful in this new setting due to the bad dependence of their complexity on the problem dimension. Hence, there is a pressing need to devise and analyze new methods, or adapt classical methods, that would be capable of working in the big data setting. An entirely different approach to the scale problem is via acceleration of existing methods on parallel computing architectures such as many-core computers and clusters thereof, or systems based on graphical processing units (GPUs). In this talk we describe a new method that combines the two approaches outlined above. Our method has both i) a good dependence on the problem dimension and b) is parallel in nature, and hence is well-suited for solving certain structured big data optimization problems. In particular, we show that randomized block coordinate descent methods, such as those developed in , can be accelerated by parallelization when applied to the problem of minimizing the sum of a partially block separable smooth convex function and a simple block separable convex function. Many problems of current interest in diverse communities (statistics, optimal experimental design, machine learning, mechanical engineering), can be cast in this form, including least-squares, L1 regularized least-squares (LASSO), group and sparse group LASSO , computing c and A-optimal designs of statistical experiments, training (sparse) linear support vector machines (SVM) and truss topology design (TTD). We describe a generic parallel randomized block coordinate descent algorithm (PR-BCD) and several variants thereof based on the way parallelization is performed. In all cases we prove iteration complexity results, i.e., we give bounds on the number of iterations sufficient to approximately solve the problem with high probability. Our results generalize the intuitive observation that in the separable case the theoretical speedup caused by parallelization should be equal to the number of processors. We show that the speedup increases with the number of processors and with the degree of partial separability of the smooth component of the objective function. Our analysis also works in the mode when the number of blocks being updated at each iteration is random, which allows for modelling situations with variable (busy or unreliable) number of processors. We conclude with some encouraging computational results applied to a LASSO instance involving a matrix with 20 billion nonzeros. All results are based on joint work with Martin Takac (Edinburgh).
Tuesday, December 11, 2012
University of Strathclyde