A couple of days ago in a short post, I announced duckdb-mlpack as ‘ML quacks’: combining the powerful C++ machine learning library mlpack with the amazing analytical database engine duckdb. See that post for more background.
The duckdb-mlpack package is now a community extension joining an impressive list of existing extensions. This means duckdb builds and distributes duckdb-mlpack for all supported platforms allowing users to just install the resulting (signed) binary. (We currently only support Linux in both arm64 and amd64, adding macOS should be straightforward once we sort one build issue out. Windows and WASM should work too, with a little love and polish, as both duckdb and mlpack support them.) Given the binary build, a simple
INSTALL mlpack FROM community;
LOAD mlpack;installs and loads the package. By the duckdb convention the code is stored per-user and per-version, so the first line needs to be executed only once per duckdb release used. The second line is then per session.
We also extended the capabilities of duckdb-mlpack. While still a MVP stressing minimal viable product, the two supported methods adaBoost and (regularized) linear regression both serialize and store their model object permitting rapid prediction on new data as shown in the adaBoost example:
-- Perform adaBoost (using weak learner 'Perceptron' by default)
-- Read 'features' into 'X', 'labels' into 'Y', use optional parameters
-- from 'Z', and prepare model storage in 'M'
CREATE TABLE X AS SELECT * FROM read_csv("https://eddelbuettel.github.io/duckdb-mlpack/data/iris.csv");
CREATE TABLE Y AS SELECT * FROM read_csv("https://eddelbuettel.github.io/duckdb-mlpack/data/iris_labels.csv");
CREATE TABLE Z (name VARCHAR, value VARCHAR);
INSERT INTO Z VALUES ('iterations', '50'), ('tolerance', '1e-7');
CREATE TABLE M (json VARCHAR);
-- Train model for 'Y' on 'X' using parameters 'Z', store in 'M'
CREATE TEMP TABLE A AS SELECT * FROM mlpack_adaboost("X", "Y", "Z", "M");
-- Count by predicted group
SELECT COUNT(*) as n, predicted FROM A GROUP BY predicted;
-- Model 'M' can be used to predict
CREATE TABLE N (x1 DOUBLE, x2 DOUBLE, x3 DOUBLE, x4 DOUBLE);
-- inserting approximate column mean values
INSERT INTO N VALUES (5.843, 3.054, 3.759, 1.199);
-- inserting approximate column mean values, min values, max values
INSERT INTO N VALUES (5.843, 3.054, 3.759, 1.199), (4.3, 2.0, 1.0, 0.1), (7.9, 4.4, 6.9, 2.5);
-- and this predict one element each
SELECT * FROM mlpack_adaboost_pred("N", "M");Ryan and I have some ideas for where to go from here, ideally towards autogenerating bindings for most (if not all) methods as is done for the mlpack language bindings. Anybody interested and willing to help should reach out to us.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. If you like this or other open-source work I do, you can sponsor me at GitHub.