December 17, 2019

Building deb/rpm packages with FPM in Docker

Whether you're building on-premise software or just want to use packages as your atomic deployment mechanism of choice in a traditional bare-metal/VM infrastructure, deb/rpm packages are a nice thing to provide.

Unfortunately, building them is super tedious. Try googling for official documentation on how to build Debian packages and you'll find at least five official wiki pages, all with slight variations. Red Hat packages are a bit better but still rather tedious. Luckily, there's a program which has our back: FPM - Effing Package Management.

Looking into how to run this on an engineer's laptop as easily as in a CI/CD pipeline, annoyances keep piling up.

  1. Running FPM with the correct arguments is annoying. Let's put the invocations in a shell script or Makefile.
  2. We don't want to have to install FPM on our main system - it requires ruby, gems and more. Let's run it in a Docker container. Luckily, there exists a Docker image which contains FPM and its dependencies, as well as optional dependencies for extra features: eclecticiq/package
  3. The application runtime (binaries, libraries...) needs to be available in the same container, otherwise FPM doesn't know what to package. Let's use multi-stage Docker builds to build the application and make the resulting files available to the FPM container.
  4. Running docker build, docker run, then copying files out of the docker container is annoying. Let's put the invocations in the Makefile or another shell script.

With all of the above fixed, you end up with a single Makefile target or shell script which builds the project's deb/rpm packages, either on your laptop or in the CI/CD system of your choice.
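As a sketch, such a Makefile target might look like the following. All names here are hypothetical (the image tag myapp-packager, the dist/ output directory, the /opt/myapp path being packaged), and it assumes your Dockerfile's final stage is based on the FPM image with the built application copied in:

```make
# Hypothetical target: build the multi-stage image, then run FPM inside
# it and write the resulting .deb into ./dist on the host.
package:
	mkdir -p dist
	docker build --tag=myapp-packager .
	docker run --rm --volume=$(PWD)/dist:/out myapp-packager \
	    fpm -s dir -t deb --name myapp --version 1.0.0 \
	        --package /out/ /opt/myapp
```

Swap `-t deb` for `-t rpm` (or add a second invocation) to produce both package formats from the same target.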

I put together a complete example of how to build, run, and package an application using this system; it's available on Github: fpm-docker-example

August 6, 2018

Multi-stage Docker builds for Python projects

Multi-stage builds can help reduce your Docker image sizes in production. This has several benefits: development dependencies may expose extra security holes in your system (I've yet to see this happen, but why not be cautious when it's easy to be so?), but mostly, a smaller image is faster for others to docker pull.

The concept of multi-stage builds is simple: Install development dependencies, build all the stuff you need, then copy over just the stuff you need to run in production in a brand new image without installing development dependencies not needed to run the application.

Here's an example Dockerfile using the official Python Docker images, which are based on Debian - but you can easily apply the same principle when building from Debian, Ubuntu, CentOS, or Alpine images: Have one stage where build/development dependencies are installed and the application is built, and another where runtime dependencies are installed and the application is ran.

FROM python:3.7-stretch AS build
RUN python3 -m venv /venv

# example of a development library package that needs to be installed
RUN apt-get -qy update && apt-get -qy install libldap2-dev && \
    rm -rf /var/cache/apt/* /var/lib/apt/lists/*

# install requirements separately to prevent pip from downloading and
# installing pypi dependencies every time a file in your project changes.
# REQS selects the requirements file; override with --build-arg REQS=dev
ARG REQS=base
ADD ./requirements /project/requirements
RUN /venv/bin/pip install -r /project/requirements/$REQS.txt

# install the project, basically copying its code, into the virtualenv.
# this assumes the project has a functional setup.py
ADD . /project
RUN /venv/bin/pip install /project

# this won't have any effect on our production image; it's only needed
# if we want to run commands like pytest in the build image
WORKDIR /project

# the second, production stage can be much more lightweight:
FROM python:3.7-slim-stretch AS production
COPY --from=build /venv /venv

# install runtime libraries (different from development libraries!)
RUN apt-get -qy update && apt-get -qy install libldap-2.4-2 && \
    rm -rf /var/cache/apt/* /var/lib/apt/lists/*

# remember to run python from the virtualenv
CMD ["/venv/bin/python3", "-m", "myproject"]

Copying the virtual environment is by far the easiest approach to this problem. Python purists will say that virtual environments shouldn't be copied, but when the underlying system and the path are the same, it makes literally no difference (plus virtual environments are a dirty hack to begin with, and one more dirty hack doesn't make a difference).

There are a few alternate approaches, the most relevant of which is to build a wheel cache of your dependencies and mount that in as a volume. The problem with this is that Docker doesn't let you mount volumes in the build stage, so you have to make complex shell scripts and multiple Dockerfiles to make it work, and the only major advantage is that you don't always have to re-compile wheels (which should be on pypi anyway, and my dependencies don't change that often).

Another thing of note: In our example, we install both project dependencies and the project itself into the virtualenv. This means we don't even need the project root directory in the production image, which is also nice (no risk of leaking example configuration files, git history etc.).

To build the image and run our project, assuming it's a webserver listening on port 5000, these commands should let you visit http://localhost:5000 in your browser:

$ docker build --tag=myproject .
$ docker run --rm -it -p5000:5000 myproject

Running tests

What if we want to build an image for running tests, which require some extra development dependencies? That's where our ARG REQS comes in. By setting this build argument when running docker build, we can control which requirements file is read. Combine that with the --target argument to docker build and this is how you build a development/testing image:

$ docker build --target=build --build-arg REQS=dev --tag=myproject-dev .

And let's say you want to run some commands using that image:

$ docker run --rm -it myproject-dev /venv/bin/pytest
$ docker run --rm -it myproject-dev bash

Development in Docker

Note that you'll have to re-build the image any time code changes. I don't care too much about this since I do all my development locally anyway, and only use Docker for production and continuous integration, but if it's important to you, you'll have to:

  1. Change pip install /project to pip install -e /project
  2. Copy the entire /project directory into the production image as well
  3. Mount the project's root directory as /project with docker run --volume=$PWD:/project

Example project

If you want a functional example to play around with, I've got a git repository on Github with a sample Python project which has a docker-multistage branch: python-project-examples

April 13, 2018

Streaming and saving subprocess output at the same time in Python

Sometimes, you want to run a subprocess with Python and stream/print its output live to the calling process' terminal, and at the same time save the output to a variable. Here's how:

import subprocess

proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
output = []
for line in proc.stdout:
    print(line, end="")  # stream to the terminal live...
    output.append(line)  # ...while saving for later
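Here's a fully self-contained, runnable version of the same idea. The subprocess here is just a throwaway Python one-liner printing two lines, standing in for your real command:

```python
import subprocess
import sys

# A stand-in command that prints two lines; replace with your real one.
cmd = [sys.executable, "-c", "print('hello'); print('world')"]

proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
lines = []
for line in proc.stdout:
    print(line, end="")  # streamed to our terminal as it arrives
    lines.append(line)   # ...and saved at the same time
proc.wait()

output = "".join(lines)
```

Note that text=True (Python 3.7+) makes proc.stdout yield str lines instead of bytes; on older versions, use universal_newlines=True.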

November 6, 2017

Better ways of managing pip dependencies

Of all the languages I've worked with, Python is one of the most annoying when it comes to managing dependencies - only Go annoys me more. The industry standard is to keep a strict list of your dependencies (and their dependencies) in a requirements.txt file. Handily, this can be auto-generated with pip freeze > requirements.txt.

What's the problem with requirement files? It's not really a problem as long as you only have one requirements file, but if you want to start splitting up dev vs test/staging vs production dependencies, you'll immediately run into problems.

The most common solution is to have a requirements directory with base.txt, dev.txt, prod.txt and so on for whatever environments/contexts you need. The problem with this approach starts showing up when you want to add or upgrade a package and its dependencies - because you no longer have a single requirements file, you can't simply pip freeze > requirements.txt, so you end up carefully updating the file(s) by hand.

There are some existing third-party tools out there written to help with this problem. Two of them create entirely new formats of storing information on dependencies:

  • pipenv/pipfile uses a completely new file format for storing dependencies, inspired by other languages' more modern dependency managers. In the future this may become part of pip core, but it isn't currently. Until then I'm staying far away from the project, as trying to implement it in a real-world project revealed all sorts of bugs. The codebase itself looks super sketchy: it downloads upstream libraries like pip, then applies patches on top of them.
  • poetry is a far more promising project. Its goals are similar to that of pipenv but it just seems to have been developed in a more sane way. It uses the PEP518 pyproject.toml file to store information about which dependencies should be installed.

Some other tools aim to stick closer to the existing workflow of requirement files:

  • pipwrap scans your virtualenv for packages, compares them to what's in your requirements files, and interactively asks you where it should place packages that are in your environment, but not in any requirements file.
  • pip-compile (part of pip-tools) lets you write more minimal files, and auto-generates strict version requirements.txt files based on them. As a bonus you get to see where your nested dependencies are coming from.
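For illustration, a pip-compile round trip might look roughly like this (the package and the pinned versions below are made up for the example):

```
# requirements.in - what you write by hand
django

# requirements.txt - generated by running: pip-compile requirements.in
django==2.2.8
    # via -r requirements.in
pytz==2019.3
    # via django
sqlparse==0.3.0
    # via django
```

The "via" comments are where the tool earns its keep: you can see at a glance why each transitive dependency is pinned.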

There is also an existing solution that works without introducing third-party tools. Since version 7.1, there is a --constraint flag to the pip install command which can be used to solve this problem.

A constraint file is an additional requirements file which won't be used to determine which packages to install, but will be used to lock down versions for any packages that do get installed. This means that you can put your base requirements (that is, you don't need to include dependencies of dependencies) in your requirements file, then store version locks for all environments in a separate constraint file.

First of all, we want to make sure we never forget to add --constraint constraint.txt by adding it to the top of our requirements/base.txt file (and any other requirements file that does not include -r base.txt). Next, generate the constraint file with pip freeze > requirements/constraint.txt. You can now modify all your requirements files, removing or loosening version constraints and removing nested dependencies.
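Concretely, the resulting files might look something like this (the package names and versions are placeholders; note that paths inside a requirements file are resolved relative to that file, so constraint.txt here means requirements/constraint.txt):

```
# requirements/base.txt
--constraint constraint.txt
django
requests

# requirements/dev.txt
--constraint constraint.txt
-r base.txt
pytest

# requirements/constraint.txt - generated with pip freeze, never edited by hand
django==2.0.4
pytest==3.5.0
pytz==2018.4
requests==2.18.4
```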

With that out of the way, let's look at some example workflows. Upgrade an existing package:

pip install 'django >= 2'
# no need to edit requirements/base.txt, "django" is already there
pip freeze > requirements/constraint.txt

Install a new package in dev:

echo 'pytest-cov' >> requirements/dev.txt
pip install -r requirements/dev.txt
pip freeze > requirements/constraint.txt

Installing requirements in a fresh production or development environment works just like before:

pip install -r requirements/base.txt
pip install -r requirements/dev.txt

This isn't perfect. If you don't install every requirements file in development, your constraint file will be missing those files' requirements. A code review would catch an accidentally removed constraint, but how do you detect a package that is entirely missing from the constraint file? pip install doesn't even have a dry-run mode. Still, constraint files (or any of the third-party tools, really) are a nice way of improving and simplifying dependency management with pip.

There's also a shell command you can use as a commit hook or part of your test/CI suite to check that you're not missing anything in your constraint.txt:

! pip freeze | grep -vxiF -f requirements/constraint.txt -

This will output all pip packages that are installed but not present in your constraint file. We use the ! to make sure that the command gives a non-zero exit code if there are any matches.