Spicule - Data Processing Experts

Pentaho Data Integration as a Snap

Most people won't have heard of snap packages, I suspect, but here's some cool stuff we've been working on with Pentaho Data Integration.

We're big fans of Juju but we'd like a more coherent way of getting software onto our servers. We were asked to spin up a new server with Pentaho Data Integration on it and used the latest 8.0 release as a good excuse to delve further into snaps. Snap packages allow system administrators to deploy confined packages onto Linux servers with access to only the stuff they need. Snaps also update automatically and allow users to roll back if an update fails using a snapshotted config.

Pentaho Data Integration is Java based, so you might be wondering why we'd bother when you can just unzip and run the package. Well, there are a number of reasons.

Why Pentaho Data Integration?

Pentaho Data Integration is a world-class ETL tool. It's easy to design data processes with, easy to test and deploy, and also nice to write plugins for. You don't need to be a software developer to write the ETL scripts, and there are plenty of online resources available.


Snaps are secure by default. They run in a confined model that only allows access to specific parts of the system. This means, for example, that a PDI install isn't necessarily allowed access to users' home directories. It also means that nefarious hackers can't replace jar files without you knowing, or, even worse, that some know-it-all in the IT department can't do you a "favour" and break your database connections.

Automated Updates

Snaps update automatically, so I don't have to `apt update` or, in the case of a standalone package, download a new version.
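To see what the automatic updates are doing, a few standard `snap` commands help. This is only a sketch: the snap name is the one used later in this post, the `command -v` guard is just there so the script does nothing harmful on a machine without snapd, and the rollback line is left commented out on purpose.

```shell
#!/bin/sh
# Snap name as used elsewhere in this post.
SNAP_NAME="pentaho-data-integration-spicule"

if command -v snap >/dev/null 2>&1; then
    snap refresh --time              # show the refresh schedule and the next check
    snap changes                     # list recent installs, refreshes and reverts
    # sudo snap revert "$SNAP_NAME"  # uncomment to roll back to the previous revision
else
    echo "snapd is not installed on this machine"
fi
```

If an automatic refresh does misbehave, `snap revert` is the command that takes you back to the previously installed revision.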


Because the snapd daemon runs on multiple Linux platforms, I can run the identical snap package on Fedora, Arch, SUSE and of course Ubuntu without modifying the package. No debs vs RPMs, etc.

Orchestrated Deployments

We haven't got it deployed yet, but soon we'll have a Juju charm that allows us to deploy Pentaho Data Integration, connect different applications to it and have them wire up services on the fly. Separating orchestration from package versions can make keeping the software up to date much easier for the end user.

Immutable ETL deployments

This is my favourite bit. You can inherit snaps, which means that if I want to build a completely immutable ETL, I can package my transformations and PDI together in the same read-only file system and deploy it to a server, safe in the knowledge that no one can change it.

How to get it?

On Ubuntu Xenial or later, you can simply run `sudo snap install pentaho-data-integration-spicule`. This installs the snap, and you can then run it with the `pentaho-data-integration-spicule.spoon` command.
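The install and launch steps can be sketched as a small script. The snap name and the `.spoon` command come from this post; the `run` helper and the `DRY_RUN` variable are hypothetical conveniences, added so that the script only prints what it would do unless you opt in.

```shell
#!/bin/sh
# Snap name and launcher command as given in this post.
SNAP_NAME="pentaho-data-integration-spicule"
SPOON_CMD="${SNAP_NAME}.spoon"

# Hypothetical helper: print each command, and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

run sudo snap install "$SNAP_NAME"   # install the snap from the store
run snap list "$SNAP_NAME"           # confirm the installed version and channel
run "$SPOON_CMD"                     # launch the Spoon designer
```

Run it as is to preview the commands, or with `DRY_RUN=0` in the environment to actually execute them.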

What's coming next?

Tutorials! We'll be developing some tutorials to show how to build immutable ETLs ready for deployment.

More features. The to-do list still includes:

  • External optional libraries folder
  • More testing
  • Exposing more tools

Juju Charm