Declarative Parameterizations of User-Defined Functions for Large-Scale Machine Learning and Optimization
Author(s) -
Zekai J. Gao,
Niketan Pansare,
Christopher Jermaine
Publication year - 2018
Publication title -
IEEE Transactions on Knowledge and Data Engineering
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.36
H-Index - 174
eISSN - 1558-2191
pISSN - 1041-4347
DOI - 10.1109/tkde.2018.2873325
Subject(s) - computer science , join (topology) , parameterized complexity , object (grammar) , context (archaeology) , set (abstract data type) , scale (ratio) , theoretical computer science , data mining , artificial intelligence , programming language , algorithm , mathematics , paleontology , physics , combinatorics , quantum mechanics , biology
Large-scale optimization has become an important application for data management systems, particularly in the context of statistical machine learning. In this paper, we consider how one might implement the join-and-co-group pattern in the context of a fully declarative data processing system. The join-and-co-group pattern is ubiquitous in iterative, large-scale optimization. In the join-and-co-group pattern, a user-defined function $g$ is parameterized with a data object $x$ as well as the subset of the statistical model $\Theta_x$ that applies to that object, so that $g(x \mid \Theta_x)$ can be used to compute a partial update of the model. This is repeated for every $x$ in the full data set $X$. All partial updates are then aggregated and used to perform a complete update of the model. The join-and-co-group pattern has several implementation challenges, including the potential for a massive blow-up in the size of a fully parameterized model. Thus, unless the correct physical execution plan is chosen for implementing the join-and-co-group pattern, an execution can easily take a very long time or even fail to complete. In this paper, we carefully consider the alternatives for implementing the join-and-co-group pattern on top of a declarative system, as well as how the best alternative can be selected automatically. Our focus is on the SimSQL database system, which is an SQL-based system with special facilities for large-scale, iterative optimization. Since it is an SQL-based system with a query optimizer, the choice among these alternatives can be made automatically.
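The pattern described in the abstract can be illustrated with a small in-memory sketch. This is not SimSQL code and not the authors' implementation; it only mirrors the three stages (join each $x$ with $\Theta_x$, evaluate $g(x \mid \Theta_x)$, aggregate the partial updates per model entry). The function names `join_and_cogroup_step`, `g`, and `apply_update`, the dictionary layout of the data objects, and the choice of a sparse least-squares gradient for $g$ are all assumptions made for illustration.

```python
from collections import defaultdict


def g(x, theta_x):
    """Hypothetical UDF: partial gradient of squared error for one data object.

    x is assumed to look like {"features": {key: value, ...}, "y": target};
    theta_x holds only the model entries that x refers to.
    """
    pred = sum(theta_x[k] * v for k, v in x["features"].items())
    err = pred - x["y"]
    return {k: err * v for k, v in x["features"].items()}


def apply_update(value, deltas, lr=0.1):
    """Hypothetical model update: one gradient step on a single model entry."""
    return value - lr * sum(deltas)


def join_and_cogroup_step(X, theta, g, apply_update):
    """One iteration of the pattern over the full data set X."""
    partials = defaultdict(list)
    for x in X:
        # "Join": pick out the subset of the model that applies to x.
        theta_x = {k: theta[k] for k in x["features"]}
        # Evaluate g(x | theta_x) and "co-group" its output by model key.
        for k, delta in g(x, theta_x).items():
            partials[k].append(delta)
    # Aggregate the partial updates into a complete update of the model.
    return {k: apply_update(v, partials.get(k, [])) for k, v in theta.items()}


theta = {"a": 0.0, "b": 0.0, "c": 0.0}
X = [
    {"features": {"a": 1.0}, "y": 1.0},
    {"features": {"a": 1.0, "b": 2.0}, "y": 2.0},
]
new_theta = join_and_cogroup_step(X, theta, g, apply_update)
```

In this toy version each `theta_x` is materialized per data object, which is exactly where the blow-up mentioned in the abstract would arise at scale: if many objects share large portions of the model, the joined result can be far larger than either `X` or `theta` alone.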