Fri, 17 Oct 2025

ML quacks: Combining duckdb and mlpack

A side project I have been working on a little since last winter and which explores extending duckdb with mlpack is now public at the duckdb-mlpack repo.

duckdb is an excellent ‘small’ (as in ‘runs as a self-contained binary’) database engine with a focus on analytical workloads (OLAP rather than OLTP) and an impressive number of already bolted-on extensions (for example for cloud data access), delivered as a single-build C++ executable (or of course as a library used from other front-ends). mlpack is an excellent C++ library containing many/most machine learning algorithms, also built in a self-contained manner (or as a library), making it possible to build compact yet powerful binaries, or to embed it (as opposed to other ML frameworks accessed from powerful but not lightweight run-times such as Python or R). The compact build aspect as well as the common build tools (C++, cmake) make these two natural candidates for combining. Moreover, duckdb is a champion of data access, management and control, and the machine learning insights and predictions offered by mlpack are fully complementary and hence fit this rather well.

duckdb also has a very robust and active extension system. To use it, one starts from a template repository and its ‘use this template’ button, runs a script and can then start experimenting. I have now grouped my initial start and test functions into a separate repository duckdb-example-extension to keep the duckdb-mlpack one focused on the ‘extend to mlpack’ aspect.

duckdb-mlpack is right now an “MVP”, i.e. a minimally viable product (or demo). It just runs the adaboost classifier but does so on any dataset fitting the ‘rectangular’ setup with columns of features (real valued) and a final column (integer valued) of labels. I had hoped to use two select queries, one for features and one for labels, but it turns out a ‘table’ function (returning a table of data from a query) can only run one select *. So the basic demo, also in the repo README, is now to run the following script (where the SELECT * FROM mlpack_adaboost((SELECT * FROM D)); is the key invocation of the added functionality):

#!/bin/bash

cat <<EOF | build/release/duckdb
SET autoinstall_known_extensions=1;
SET autoload_known_extensions=1; # for httpfs

CREATE TEMP TABLE Xd AS SELECT * FROM read_csv("https://mlpack.org/datasets/iris.csv");
CREATE TEMP TABLE X AS SELECT row_number() OVER () AS id, * FROM Xd;
CREATE TEMP TABLE Yd AS SELECT * FROM read_csv("https://mlpack.org/datasets/iris_labels.csv");
CREATE TEMP TABLE Y AS SELECT row_number() OVER () AS id, CAST(column0 AS double) as label FROM Yd;
CREATE TEMP TABLE D AS SELECT * FROM X INNER JOIN Y ON X.id = Y.id;
ALTER TABLE D DROP id;
ALTER TABLE D DROP id_1;
CREATE TEMP TABLE A AS SELECT * FROM mlpack_adaboost((SELECT * FROM D));

SELECT COUNT(*) as n, predicted FROM A GROUP BY predicted;
EOF

to produce the following tabulation / group by:

$ ./sampleCallRemote.sh
Misclassified: 1
┌───────┬───────────┐
│   n   │ predicted │
│ int64 │   int32   │
├───────┼───────────┤
│    50 │         0 │
│    49 │         1 │
│    51 │         2 │
└───────┴───────────┘
$
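
Since the table function takes a single rectangular input, features and labels have to be combined before the call. For data that already sits in two row-aligned CSV files, a minimal shell sketch of the same idea pastes them side by side first. The function name combine_features_labels is hypothetical (it is not part of the repo), and the sketch assumes both files have no header line and list their rows in the same order.

```shell
#!/bin/bash

# Hypothetical helper, not part of duckdb-mlpack: write features and labels
# side by side as one comma-separated 'rectangular' table on stdout.
# Assumes no header lines and identical row order in both files.
combine_features_labels() {
    local features="$1" labels="$2"
    local nf nl
    # Rows are paired by position, so the row counts must agree.
    nf=$(( $(wc -l < "$features") ))
    nl=$(( $(wc -l < "$labels") ))
    if [ "$nf" -ne "$nl" ]; then
        echo "row count mismatch: $nf features vs $nl labels" >&2
        return 1
    fi
    # Label column goes last, matching the layout described above.
    paste -d, "$features" "$labels"
}
```

The combined output could then be saved to a file and loaded with a single read_csv call in place of the two temporary tables and the join; whether the column types inferred that way suit mlpack_adaboost as-is has not been checked here.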

(Note that the script above requires the httpfs extension. When you build from a freshly created extension repository you may be ‘ahead’ of the most recent release of duckdb by a few commits. It is easy to check out the most recent release tag (or the one matching your local duckdb binary) to take advantage of the extensions you likely already have for that version. So here, in the middle of October 2025, I picked v1.4.1 as I run duckdb version 1.4.1 on my box.)
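
Pinning the build to a release tag can be sketched as below. This is an illustration under assumptions, not the repo's own tooling: both function names are hypothetical, the version string is assumed to start with the tag (as in "v1.4.1 ..."), and the duckdb sources are assumed to sit in a git submodule directory named duckdb inside the extension repository.

```shell
#!/bin/bash

# Hypothetical helper: extract the release tag (e.g. v1.4.1) from a version
# string. Assumption: the tag is the first whitespace-separated word.
release_tag() {
    local first="${1%% *}"
    case "$first" in
        v[0-9]*.[0-9]*.[0-9]*) echo "$first" ;;
        *) echo "unrecognized version string: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: check out the tag matching the installed duckdb binary.
# Assumption: the duckdb sources live in ./duckdb as a git submodule.
pin_duckdb_to_installed() {
    local tag
    tag=$(release_tag "$(duckdb --version)") || return 1
    git -C duckdb fetch --tags
    git -C duckdb checkout "$tag"
}
```

Calling pin_duckdb_to_installed from the top of the extension repository and then rebuilding should give a binary whose version matches the extensions already installed for the local duckdb.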

There are many other neat duckdb extensions. The ‘core’ ones are grouped here while a list of community extensions is here and here.

For this (still more minimal) extension, I added a few TODO items to the README.md:

  • More examples of model fitting and prediction
  • Maybe set up model serialization into table to predict on new data
  • Ideally: Work out how to SELECT from multiple tables, or else maybe SELECT into temp. tables and pass temp. table names into routine
  • Maybe add mlpack as a git submodule

Please reach out if you are interested in working on any of this.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. If you like this or other open-source work I do, you can now sponsor me at GitHub.
