Thinking inside the box

duckdb-mlpack 0.0.5: Added kmeans, version helpers, documentation

A new release of the still-recent duckdb extension for mlpack, the C++ header-only library for machine learning, was merged into the duckdb community extensions repo today, and has been updated at its duckdb ‘mlpack’ extension page.

This release 0.0.5 adds one new method: kmeans clustering. We also added two version accessors, for mlpack and armadillo. During the work on random forests (added in 0.0.4) we found that the multithreaded random number generation was not quite right in the respective upstream code. This has since been corrected in armadillo 15.2.2 as well as in the trunk version of mlpack, so if you build with those and set a seed, your forests and classifications will be stable across reruns. We also added a second state variable, mlpack_silent, that can be used to suppress even the minimal prediction quality summary some methods show, and expanded the documentation.

For more details, see the repo for code, issues and more, and the extension page for more about this duckdb community extension.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. If you like this or other open-source work I do, you can sponsor me at GitHub.


duckdb-mlpack 0.0.4: Added random forest and logistic regression

A new release of the budding duckdb extension for mlpack, the C++ header-only library for machine learning, was merged into the duckdb community extensions repo today, and has been updated at its duckdb ‘mlpack’ extension page.

This release 0.0.4 adds two new methods (random forests and regularized logistic regression), reworks the interface a little to now consistently provide fit (or train) and predict methods, and adds a new internal state variable mlpack_verbose which can trigger (or suppress) verbose mode directly from SQL. It also expands the documentation and adds more unit tests.
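
If mlpack_verbose is exposed as a regular duckdb setting (an assumption here; the extension documentation is the authority on the exact mechanism and value type), toggling it from SQL might look like the following sketch:

```sql
-- Assumption: 'mlpack_verbose' is registered as a duckdb setting once the
-- extension is loaded, and takes a boolean value.
LOAD mlpack;
SET mlpack_verbose = true;                 -- switch verbose mode on
SELECT current_setting('mlpack_verbose');  -- inspect the current value
SET mlpack_verbose = false;                -- and switch it off again
```

SET and current_setting() are standard duckdb; only the setting name and its boolean type are taken on assumption.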

For more details, see the repo for code, issues and more, and the extension page for more about this duckdb community extension.


duckdb-mlpack 0.0.3: macOS binaries, unit tests, more outputs

A little over two weeks ago, a short post announced duckdb-mlpack as ‘ML quacks’: combining the powerful C++ machine learning library mlpack with the amazing analytical database engine duckdb. About a week ago, another short post covered first extensions. We actually followed up with release 0.0.3 days later but never posted about it, so this short note catches up.

In release 0.0.3, we provide macOS binaries: following a known issue with one of the components, we apply a simple patch to enable the build. Next up are wasm and windows; if you know your way around these platforms, please get in touch. Release 0.0.3 also added first unit tests, and now serializes the coefficients from the (regularized) linear regression into the output table.

See the two previous posts linked above for details and background, the repo for code, issues and more, and the extension page for more about this duckdb community extension.


duckdb-mlpack 0.0.2: mlpack is now a duckdb community extension

A couple of days ago in a short post, I announced duckdb-mlpack as ‘ML quacks’: combining the powerful C++ machine learning library mlpack with the amazing analytical database engine duckdb. See that post for more background.

The duckdb-mlpack package is now a community extension, joining an impressive list of existing extensions. This means duckdb builds and distributes duckdb-mlpack for all supported platforms, allowing users to just install the resulting (signed) binary. (We currently only support Linux in both arm64 and amd64; adding macOS should be straightforward once we sort one build issue out. Windows and WASM should work too, with a little love and polish, as both duckdb and mlpack support them.) Given the binary build, a simple

INSTALL mlpack FROM community;
LOAD mlpack;

installs and loads the package. By duckdb convention the code is stored per-user and per-version, so the first line needs to be executed only once per duckdb release used. The second line is then needed once per session.
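
To check whether the extension is already installed for the duckdb release in use, and whether it is loaded in the current session, the built-in duckdb_extensions() table function can be queried. A minimal sketch, assuming the extension is registered under the name mlpack as in the INSTALL line above:

```sql
-- Show install and load state of the mlpack extension.
-- 'installed' persists per user and per duckdb version;
-- 'loaded' only reflects the current session.
SELECT extension_name, installed, loaded
FROM duckdb_extensions()
WHERE extension_name = 'mlpack';
```

If installed is true but loaded is false, only the LOAD line is needed.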

We also extended the capabilities of duckdb-mlpack. While still an MVP (with the stress on ‘minimal’ in minimal viable product), the two supported methods, adaBoost and (regularized) linear regression, both serialize and store their model object, permitting rapid prediction on new data as shown in the adaBoost example:

-- Perform adaBoost (using weak learner 'Perceptron' by default)
-- Read 'features' into 'X', 'labels' into 'Y', use optional parameters
-- from 'Z', and prepare model storage in 'M'
CREATE TABLE X AS SELECT * FROM read_csv("https://eddelbuettel.github.io/duckdb-mlpack/data/iris.csv");
CREATE TABLE Y AS SELECT * FROM read_csv("https://eddelbuettel.github.io/duckdb-mlpack/data/iris_labels.csv");
CREATE TABLE Z (name VARCHAR, value VARCHAR);
INSERT INTO Z VALUES ('iterations', '50'), ('tolerance', '1e-7');
CREATE TABLE M (json VARCHAR);

-- Train model for 'Y' on 'X' using parameters 'Z', store in 'M'
CREATE TEMP TABLE A AS SELECT * FROM mlpack_adaboost("X", "Y", "Z", "M");

-- Count by predicted group
SELECT COUNT(*) as n, predicted FROM A GROUP BY predicted;

-- Model 'M' can be used to predict
CREATE TABLE N (x1 DOUBLE, x2 DOUBLE, x3 DOUBLE, x4 DOUBLE);
-- inserting approximate column mean values
INSERT INTO N VALUES (5.843, 3.054, 3.759, 1.199);
-- inserting approximate column mean values, min values, max values
INSERT INTO N VALUES (5.843, 3.054, 3.759, 1.199), (4.3, 2.0, 1.0, 0.1), (7.9, 4.4, 6.9, 2.5);
-- and this predicts one element each
SELECT * FROM mlpack_adaboost_pred("N", "M");

Ryan and I have some ideas for where to go from here, ideally towards autogenerating bindings for most (if not all) methods as is done for the mlpack language bindings. Anybody interested and willing to help should reach out to us.


Tue, 02 Dec 2025

duckdb-mlpack 0.0.5: Added kmeans, version helpers, documentation

Tue, 11 Nov 2025

duckdb-mlpack 0.0.4: Added random forest and logistic regression

Mon, 03 Nov 2025

duckdb-mlpack 0.0.3: macOS binaries, unit tests, more outputs

Sun, 26 Oct 2025

duckdb-mlpack 0.0.2: mlpack is now a duckdb community extension