A new release of the budding duckdb extension for mlpack, the C++ header-only library for machine learning, was merged into the duckdb community extensions repo today, and has been updated at its duckdb ‘mlpack’ extension page.
This release 0.0.4 adds two new methods (random forests, and
regularized logistic regression), reworked the interface a little to now
consistently provide fit (or train) and
predict methods, adds a new internal state variable
mlpack_verbose which can trigger (or suppress) verbose mode
directly from SQL, expanded the documentation and added more unit
tests.
For more details, see the repo for code, issues and more, and the extension page for more about this duckdb community extension.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. If you like this or other open-source work I do, you can sponsor me at GitHub.
A littler two weeks a short post announced the duckdb-mlpack as ‘ML quacks’: combining the powerful C++ machine learning library mlpack with the amazing analytical database engine duckdb. About a week ago another short post covered first extensions. We actually followed-up with release 0.0.3 days later, and never posted about it so this short note catches up.
In release 0.0.3, we provide macOS binaries: following a known issue with one of the components, we apply a simple patch to enable the build. Next up are wasm and windows, if you know your way around these platforms please get in touch. Release 0.0.3 also added first unit tests, serializes the coefficients from the (regularized) linear regression into the output table.
See see two previous posts linked above for details and background, the repo for code, issues and more, and the extension page for more about this duckdb community extension.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. If you like this or other open-source work I do, you can sponsor me at GitHub.
A couple of days ago in a short post, I announced duckdb-mlpack as ‘ML quacks’: combining the powerful C++ machine learning library mlpack with the amazing analytical database engine duckdb. See that post for more background.
The duckdb-mlpack package is now a community extension joining an impressive list of existing extensions. This means duckdb builds and distributes duckdb-mlpack for all supported platforms allowing users to just install the resulting (signed) binary. (We currently only support Linux in both arm64 and amd64, adding macOS should be straightforward once we sort one build issue out. Windows and WASM should work too, with a little love and polish, as both duckdb and mlpack support them.) Given the binary build, a simple
INSTALL mlpack FROM community;
LOAD mlpack;installs and loads the package. By the duckdb convention the code is stored per-user and per-version, so the first line needs to be executed only once per duckdb release used. The second line is then per session.
We also extended the capabilities of duckdb-mlpack. While still a MVP stressing minimal viable product, the two supported methods adaBoost and (regularized) linear regression both serialize and store their model object permitting rapid prediction on new data as shown in the adaBoost example:
-- Perform adaBoost (using weak learner 'Perceptron' by default)
-- Read 'features' into 'X', 'labels' into 'Y', use optional parameters
-- from 'Z', and prepare model storage in 'M'
CREATE TABLE X AS SELECT * FROM read_csv("https://eddelbuettel.github.io/duckdb-mlpack/data/iris.csv");
CREATE TABLE Y AS SELECT * FROM read_csv("https://eddelbuettel.github.io/duckdb-mlpack/data/iris_labels.csv");
CREATE TABLE Z (name VARCHAR, value VARCHAR);
INSERT INTO Z VALUES ('iterations', '50'), ('tolerance', '1e-7');
CREATE TABLE M (json VARCHAR);
-- Train model for 'Y' on 'X' using parameters 'Z', store in 'M'
CREATE TEMP TABLE A AS SELECT * FROM mlpack_adaboost("X", "Y", "Z", "M");
-- Count by predicted group
SELECT COUNT(*) as n, predicted FROM A GROUP BY predicted;
-- Model 'M' can be used to predict
CREATE TABLE N (x1 DOUBLE, x2 DOUBLE, x3 DOUBLE, x4 DOUBLE);
-- inserting approximate column mean values
INSERT INTO N VALUES (5.843, 3.054, 3.759, 1.199);
-- inserting approximate column mean values, min values, max values
INSERT INTO N VALUES (5.843, 3.054, 3.759, 1.199), (4.3, 2.0, 1.0, 0.1), (7.9, 4.4, 6.9, 2.5);
-- and this predict one element each
SELECT * FROM mlpack_adaboost_pred("N", "M");Ryan and I have some ideas for where to go from here, ideally towards autogenerating bindings for most (if not all) methods as is done for the mlpack language bindings. Anybody interested and willing to help should reach out to us.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. If you like this or other open-source work I do, you can sponsor me at GitHub.