A side project I have been working on a little since last winter, which explores extending duckdb with mlpack, is now public at the duckdb-mlpack repo.
duckdb is an excellent ‘small’ (as in ‘runs as a self-contained binary’) database engine with both a focus on analytical workloads (OLAP rather than OLTP) and an impressive number of already bolted-on extensions (for example for cloud data access), delivered as a single-build C++ executable (or of course as a library used from other front-ends). mlpack is an excellent C++ library containing many/most machine learning algorithms, also built in a self-contained manner (or as a library), making it possible to build compact yet powerful binaries, or to embed it (as opposed to other ML frameworks accessed from powerful but not lightweight run-times such as Python or R). The compact build aspect as well as the common build tools (C++, cmake) make these two natural candidates for combination. Moreover, duckdb is a champion of data access, management and control, and the machine learning insights and predictions offered by mlpack are fully complementary and hence fit this rather well.
duckdb also has a very robust and active extension system. To use it, one starts from a template repository and its ‘use this template’ button, runs a script and can then start experimenting. I have now grouped my initial start and test functions into a separate repository duckdb-example-extension to keep the duckdb-mlpack one focused on the ‘extend to mlpack’ aspect.
duckdb-mlpack is right now an “MVP”, i.e. a minimal viable product (or demo). It just runs the adaboost classifier but does so on any dataset fitting the ‘rectangular’ setup with columns of features (real valued) and a final column (integer valued) of labels. I had hoped to use two select queries, one for features and one for labels, but it turns out a ‘table’ function (returning a table of data from a query) can only run one select *. So the basic demo, also in the repo README, is now to run the following script (where the SELECT * FROM mlpack_adaboost((SELECT * FROM D)); is the key invocation of the added functionality):
#!/bin/bash
cat <<EOF | build/release/duckdb
SET autoinstall_known_extensions=1;
SET autoload_known_extensions=1; -- for httpfs
CREATE TEMP TABLE Xd AS SELECT * FROM read_csv("https://mlpack.org/datasets/iris.csv");
CREATE TEMP TABLE X AS SELECT row_number() OVER () AS id, * FROM Xd;
CREATE TEMP TABLE Yd AS SELECT * FROM read_csv("https://mlpack.org/datasets/iris_labels.csv");
CREATE TEMP TABLE Y AS SELECT row_number() OVER () AS id, CAST(column0 AS double) as label FROM Yd;
CREATE TEMP TABLE D AS SELECT * FROM X INNER JOIN Y ON X.id = Y.id;
ALTER TABLE D DROP id;
ALTER TABLE D DROP id_1;
CREATE TEMP TABLE A AS SELECT * FROM mlpack_adaboost((SELECT * FROM D));
SELECT COUNT(*) as n, predicted FROM A GROUP BY predicted;
EOF
to produce the following tabulation / group by:
$ ./sampleCallRemote.sh
Misclassified: 1
┌───────┬───────────┐
│   n   │ predicted │
│ int64 │   int32   │
├───────┼───────────┤
│    50 │         0 │
│    49 │         1 │
│    51 │         2 │
└───────┴───────────┘
$
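The ‘rectangular’ layout the demo expects (feature columns followed by one final label column) can also be illustrated outside the database. The following is a minimal shell sketch and not part of the repo: it assumes two local files with the same number of rows in the same order, and glues them together column-wise with paste, which is roughly what the row_number() join in the script achieves in SQL. The function name, the file names and the tiny sample values are made up for illustration (they are not the real iris data).

```shell
#!/bin/bash
# Combine a features file and a labels file line by line into one
# 'rectangular' CSV: feature columns first, the label as the final column.
# Assumption: both files have the same number of rows, in the same order.
combine_features_labels() {
    paste -d, "$1" "$2"
}

# Tiny made-up sample (NOT the real iris data), written to a temp directory.
tmpdir=$(mktemp -d)
printf '5.1,3.5,1.4,0.2\n7.0,3.2,4.7,1.4\n' > "$tmpdir/features.csv"
printf '0\n1\n' > "$tmpdir/labels.csv"

# Write the combined file and show it.
combine_features_labels "$tmpdir/features.csv" "$tmpdir/labels.csv" > "$tmpdir/combined.csv"
cat "$tmpdir/combined.csv"
```

A file combined this way could then be read with a single read_csv call, so that one SELECT * delivers features and labels together.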
(Note that reading the CSV files over https requires the httpfs extension. Prebuilt extensions are generally matched to a duckdb release, so when you build from a freshly created extension repository you may be ‘ahead’ of the most recent release of duckdb by a few commits, and the extension may then not be available for your build. It is easy to check out the most recent release tag (or maybe the one you are running for your local duckdb binary) to take advantage of the extensions you likely already have for that version. So here, and in the middle of October 2025, I picked v1.4.1 as I run duckdb version 1.4.1 on my box.)
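As a rough sketch of that checkout step: the snippet below assumes the duckdb sources sit in a duckdb/ git submodule at the top of the extension repository (as the extension template sets it up) and that duckdb --version prints the release tag as its first whitespace-separated field. Both helper function names are hypothetical, and nothing runs until you call them.

```shell
#!/bin/bash
# Sketch only: align the duckdb submodule of an extension repository with a
# given release, e.g. the one of the locally installed duckdb binary.

# Return the first whitespace-separated field of a version string.
# Assumption: this field is the release tag, e.g. "v1.4.1".
release_tag() {
    printf '%s\n' "${1%% *}"
}

# Fetch tags and check out the given one inside the duckdb submodule.
# Assumption: the submodule lives in ./duckdb relative to the current directory.
checkout_release() {
    git -C duckdb fetch --tags &&
    git -C duckdb checkout "$1"
}

# Example use, from the top of the extension repository:
#   checkout_release "$(release_tag "$(duckdb --version)")"
# or, with the tag picked by hand:
#   checkout_release v1.4.1
```

After switching tags, the extension needs to be rebuilt so that build/release/duckdb matches the chosen release.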
There are many other neat duckdb extensions. The ‘core’ ones are grouped here while lists of community extensions are here and here.
For this (still more minimal) extension, I added a few TODO items to the README.md:
- SELECT from multiple tables, or else maybe SELECT into temp. tables and pass temp. table names into routine
- mlpack as a git submodule
Please reach out if you are interested in working on any of this.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. If you like this or other open-source work I do, you can now sponsor me at GitHub.