Home Assistant backups, restic, gzip’s rsyncable flag

tl;dr: if gzip and restic interact anywhere for you, you should consider passing --rsyncable to gzip, or not using gzip at all.

tl;dr2: Skip ahead to the Summary section to read the summary right now.

Like many people, I have a Home Assistant installation. Mine does not control many things, but it collects a lot from various sensors and devices around the house. Besides Home Assistant’s own database with short and long term storage, I use InfluxDB as an HA addon. This means that HA backups are also InfluxDB backups.

Over the course of a few years I have, apparently, collected 354 Home Assistant backups in a restic repository. The oldest one is around 300 MB; the newest, 1.4GB.

Background

Home Assistant backups look like this:

$ ssh root@XX

| |  | |                          /\           (_)   | |            | |  
| |__| | ___  _ __ ___   ___     /  \   ___ ___ _ ___| |_ __ _ _ __ | |_ 
|  __  |/ _ \| '_ \ _ \ / _ \   / /\ \ / __/ __| / __| __/ _\ | '_ \| __|
| |  | | (_) | | | | | |  __/  / ____ \\__ \__ \ \__ \ || (_| | | | | |_ 
|_|  |_|\___/|_| |_| |_|\___| /_/    \_\___/___/_|___/\__\__,_|_| |_|\__|

Welcome to the Home Assistant command line.

[core-ssh ~]$ cd backup
[core-ssh backup]$ ls -al
total 10583052
drwxr-xr-x    3 root     root          4096 Jul  6 00:00 .
drwxr-xr-x    1 root     root          4096 Jul  4 13:22 ..
-rw-r--r--    1 root     root     1559705600 Jul  3 00:00 277e0264.tar
-rw-r--r--    1 root     root     1559388160 Jul  2 00:00 3456e3c7.tar
-rw-r--r--    1 root     root     1558886400 Jul  1 00:00 4303e28c.tar
-rw-r--r--    1 root     root     1560238080 Jul  4 00:00 77a4287d.tar
-rw-r--r--    1 root     root     1557125120 Jun 30 00:00 7b821e1e.tar
-rw-r--r--    1 root     root     1522309120 Jul  6 00:00 d6400715.tar
-rw-r--r--    1 root     root     1519339520 Jul  5 00:00 f0eba8bf.tar
[core-ssh backup]$ tar tvf 277e0264.tar 
-rw-r--r-- 0/0     10510 2024-07-03 00:00:02 core_ssh.tar.gz
-rw-r--r-- 0/0      1800 2024-07-03 00:00:02 a0d7b954_aircast.tar.gz
-rw-r--r-- 0/0      2381 2024-07-03 00:00:02 15ef4d2f_esphome.tar.gz
-rw-r--r-- 0/0 1309290473 2024-07-03 00:00:02 a0d7b954_influxdb.tar.gz
-rw-r--r-- 0/0  36485918 2024-07-03 00:00:41 a0d7b954_tautulli.tar.gz
-rw-r--r-- 0/0     52714 2024-07-03 00:00:43 a0d7b954_grafana.tar.gz
-rw-r--r-- 0/0      2008 2024-07-03 00:00:43 a0d7b954_wireguard.tar.gz
-rw-r--r-- 0/0      1905 2024-07-03 00:00:43 a0d7b954_nodered.tar.gz
-rw-r--r-- 0/0      1560 2024-07-03 00:00:43 a0d7b954_chrony.tar.gz
-rw-r--r-- 0/0      2031 2024-07-03 00:00:43 a0d7b954_ssh.tar.gz
-rw-r--r-- 0/0      2422 2024-07-03 00:00:43 core_mosquitto.tar.gz
-rw-r--r-- 0/0 213812233 2024-07-03 00:00:43 homeassistant.tar.gz
-rw-r--r-- 0/0       205 2024-07-03 00:00:54 share.tar.gz
-rw-r--r-- 0/0       212 2024-07-03 00:00:54 addons_local.tar.gz
-rw-r--r-- 0/0      1961 2024-07-03 00:00:54 ssl.tar.gz
-rw-r--r-- 0/0       288 2024-07-03 00:00:54 media.tar.gz
-rw-r--r-- 0/0      1440 2024-07-03 00:00:53 ./backup.json

Each backup is a tarball, containing a bunch of tar.gz files, including one for each addon, such as influxdb. The oldest influxdb backup is 513MB; the newest, 1.2GB. Clearly Influx is the bulk of my backup data. But Influx is my archive. It barely changes, it just grows a bit. I would love for it to dedup a bit in restic storage!
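To see which inner archive dominates a backup without shelling into the HA box, a few lines of Python are enough. This is a minimal sketch: `list_inner_archives` and `make_demo_backup` are names made up for this example, and the demo archive is a tiny in-memory stand-in with invented contents, not a real HA backup.

```python
import io
import tarfile


def list_inner_archives(outer):
    """Return (name, size_in_bytes) for every regular file in an open TarFile."""
    return [(m.name, m.size) for m in outer.getmembers() if m.isfile()]


def make_demo_backup():
    """Build a tiny in-memory stand-in for an HA backup (made-up contents)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as outer:
        for name, payload in [
            ("a0d7b954_influxdb.tar.gz", b"x" * 5000),
            ("homeassistant.tar.gz", b"y" * 800),
            ("backup.json", b"{}"),
        ]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            outer.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    with tarfile.open(fileobj=make_demo_backup(), mode="r") as outer:
        # Largest member first, similar to eyeballing `tar tvf` output.
        for name, size in sorted(list_inner_archives(outer), key=lambda x: -x[1]):
            print(f"{size:>12}  {name}")
```

For a real backup, `tarfile.open("277e0264.tar")` would take the place of the in-memory demo archive.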

Those 354 backups, all sitting together in a directory as 354 .tar files, take up 291GB of disk space. A restic repository holding those 354 backups, with compression and deduplication enabled... takes up 291GB of disk space. There is effectively no compression (which is expected, because there’s .gz inside already), and no dedup (which is the sad surprise this article is about):

$ restic -r ha-snapshots-all-together-restic/ stats --mode=raw-data
repository ed03b8ed opened (version 2, compression level auto)
Stats in raw-data mode:
     Snapshots processed:  1
        Total Blob Count:  199240
 Total Uncompressed Size:  289.952 GiB
              Total Size:  288.845 GiB
    Compression Progress:  100.00%
       Compression Ratio:  1.00x
Compression Space Saving:  0.38%

(The alert reader will notice that I only have one snapshot. I am assuming that restic dedup works as well within one snapshot as between snapshots, so I simplified the experiment by just backing up the one dir full of tars directly. My actual backup repository - the source of all these tarballs - has these 354 tarballs spread out over a few hundred snapshots, but with similar results: effectively no dedup).

It seems obvious that the .gz in the tarball stack is the reason restic cannot dedup. (Spoiler alert: it is). Let’s find out how we can improve that.
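The effect is easy to reproduce on a small scale. The sketch below is a toy model, not restic's actual chunker: it cuts data into content-defined chunks with a simple rolling hash, then measures how many bytes of a "new" file sit in chunks that also occur in the "old" file. The two versions differ by a single inserted line. All names (`chunks`, `shared_fraction`, `make_versions`) and the synthetic sensor data are invented for this illustration.

```python
import gzip
import hashlib
import random

# Fixed random table for a toy "gear" rolling hash (not restic's real chunker).
_table_rng = random.Random(42)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]


def chunks(data, bits=10):
    """Cut data wherever the top `bits` bits of the rolling hash are zero."""
    out, start, h, shift = [], 0, 0, 32 - bits
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if h >> shift == 0:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out


def shared_fraction(old, new):
    """Fraction of the bytes in `new` that sit in chunks also present in `old`."""
    known = {hashlib.sha256(c).digest() for c in chunks(old)}
    hit = sum(len(c) for c in chunks(new) if hashlib.sha256(c).digest() in known)
    return hit / len(new)


def make_versions():
    """Two synthetic 'databases' that differ by one line inserted near the start."""
    rng = random.Random(0)
    lines = [f"sensor_{rng.randrange(50)},{rng.random():.6f}\n" for _ in range(20000)]
    old = "".join(lines).encode()
    new = "".join(lines[:100] + ["sensor_7,0.500000\n"] + lines[100:]).encode()
    return old, new


if __name__ == "__main__":
    old, new = make_versions()
    gz_old = gzip.compress(old, mtime=0)
    gz_new = gzip.compress(new, mtime=0)
    print("raw    :", round(shared_fraction(old, new), 3))
    print("gzipped:", round(shared_fraction(gz_old, gz_new), 3))
```

On the raw data nearly every chunk is shared, because chunk boundaries realign right after the insertion. On the gzipped data the one-line change shifts the compressed bit stream from that point on, so very little lines up again.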

I have extensive notes on the steps I took to get all these numbers, so please feel free to ask questions.

Unpack the tarballs once

I unpacked the tarballs once. In the process, I found some of the tarballs were corrupted and/or truncated. This has introduced around 1-2GB of noise into these measurements. I believe this to be irrelevant to the end result. (To be clear, I think the tarball truncation is my own fault, and not a problem in HA. I keep forgetting that rsync -P includes --partial and most likely moved some backups from HA to restic via an undersized intermediate filesystem.)

My source now looks like this:

ha-snapshots-unpacked-once$ find . | head
.
./cff234c8.tar
./cff234c8.tar/15ef4d2f_esphome.tar.gz
./cff234c8.tar/a0d7b954_nodered.tar.gz

etc.

This takes up 291GB of disk space, and restic manages to squeeze it into 289GB.

Clearly the outside tar layer is not the problem.

Decompress

I gunzipped all the .gz files. Looks like this (funny how the order of files has changed):

ha-snapshots-unpacked-twice$ find . | head
.
./cff234c8.tar
./cff234c8.tar/core_ssh.tar
./cff234c8.tar/share.tar

etc.

This takes up a whopping 713GB of disk space. Restic packs it down to 40GB.

Forty. Gigabyte.

If I am reading this correctly:

     Snapshots processed:  1
        Total Blob Count:  211179
 Total Uncompressed Size:  140.975 GiB
              Total Size:  39.799 GiB
    Compression Progress:  100.00%
       Compression Ratio:  3.54x
Compression Space Saving:  71.77%

deduplication stored 713GB in 141GB. Compression then gave us another 100GB.
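The arithmetic behind that reading, as a quick sketch. It assumes the 713GB on-disk figure and restic's GiB figures are close enough to the same unit to divide by each other, which is a simplification on my part.

```python
source_size = 713.0   # on-disk size of the unpacked, decompressed data
unique_gib = 140.975  # "Total Uncompressed Size": what is left after dedup
stored_gib = 39.799   # "Total Size": what restic actually stores

dedup_factor = source_size / unique_gib
compression_ratio = unique_gib / stored_gib
space_saving_pct = 100 * (1 - stored_gib / unique_gib)
overall_factor = source_size / stored_gib

print(f"dedup factor:      {dedup_factor:.2f}x")
print(f"compression ratio: {compression_ratio:.2f}x")
print(f"space saving:      {space_saving_pct:.2f}%")
print(f"overall:           {overall_factor:.1f}x")
```

The compression ratio and space saving computed this way agree with the values restic prints, which suggests the reading is right: "Total Uncompressed Size" is the post-dedup, pre-compression size.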

It turns out (unsurprisingly, I guess) that this is the best number we are going to get.

gzip’s rsyncable flag

At some point, I ran into a blog post that mentioned that restic was smart about dedup, and did not need things to be identical at block boundaries. This reminded me of gzip --rsyncable. Martin Pool describes it well:

There is a patch called --rsyncable for gzip that fixes this behaviour: gzip files are basically broken up into blocks so that changes (including insertion or deletion) in the input file affect only the corresponding blocks in the output file. (The blocks are not of fixed size, but rather delimited by marker patterns at which a checksum hits a particular value, so they move as data is inserted or removed.)
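The idea can be imitated in pure Python. The sketch below is not gzip's algorithm: it uses a toy rolling hash to pick content-defined boundaries and calls zlib's `Z_FULL_FLUSH` at each one, which byte-aligns the output and resets the compressor's history. That is blunter than what gzip does and costs more compression, but it shows the property that matters: after an insertion, the compressed output becomes byte-identical again. `rsyncable_gzip`, `tail_shared` and `make_versions` are hypothetical names for this example.

```python
import gzip
import random
import zlib

# Fixed random table for a toy rolling hash; gzip uses its own checksum scheme.
_table_rng = random.Random(1)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]


def rsyncable_gzip(data, level=9, bits=12):
    """Gzip `data`, resetting the compressor at content-defined boundaries."""
    co = zlib.compressobj(level, zlib.DEFLATED, 31)  # wbits=31: gzip container
    out, start, h, shift = [], 0, 0, 32 - bits
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if h >> shift == 0:  # boundary depends only on the last few bytes
            out.append(co.compress(data[start:i + 1]))
            out.append(co.flush(zlib.Z_FULL_FLUSH))  # byte-align, drop history
            start = i + 1
    out.append(co.compress(data[start:]))
    out.append(co.flush())
    return b"".join(out)


def tail_shared(gz_a, gz_b):
    """Fraction of gz_b that is an identical tail of gz_a, ignoring the
    8-byte gzip trailer (CRC and length), which always differs."""
    a, b = gz_a[:-8], gz_b[:-8]
    n = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        n += 1
    return n / len(b)


def make_versions():
    """Two synthetic inputs that differ by one line inserted near the start."""
    rng = random.Random(0)
    lines = [f"sensor_{rng.randrange(50)},{rng.random():.6f}\n" for _ in range(20000)]
    old = "".join(lines).encode()
    new = "".join(lines[:100] + ["sensor_7,0.500000\n"] + lines[100:]).encode()
    return old, new


if __name__ == "__main__":
    old, new = make_versions()
    assert gzip.decompress(rsyncable_gzip(new)) == new  # still a valid gzip file
    plain = tail_shared(gzip.compress(old, mtime=0), gzip.compress(new, mtime=0))
    resync = tail_shared(rsyncable_gzip(old), rsyncable_gzip(new))
    print("identical tail, plain gzip:      ", round(plain, 3))
    print("identical tail, flush-at-boundary:", round(resync, 3))
```

The output is an ordinary gzip stream, so any gunzip can read it. Only the compressor needs to know about the boundaries.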

So, I recompressed all inside tarballs with gzip --rsyncable -9. I used -9 because the couple of HA backups I checked manually reported max compression when inspected with file. I later found out that in late 2023, HA stopped using max compression. Me doing -9 while HA no longer does it will skew these numbers a bit, but not in any interesting way I think.

My source looks like this:

ha-snapshots-unpacked-twice-repack-gzip-rsyncable$ find . | head
.
./cff234c8.tar
./cff234c8.tar/15ef4d2f_esphome.tar.gz
./cff234c8.tar/a0d7b954_nodered.tar.gz

It takes up 289GB of disk space, almost identical to our starting size.

Restic stores it in 58 GB:

Stats in raw-data mode:
     Snapshots processed:  1
        Total Blob Count:  41780
 Total Uncompressed Size:  59.175 GiB
              Total Size:  57.681 GiB
    Compression Progress:  100.00%
       Compression Ratio:  1.03x
Compression Space Saving:  2.52%

In other words: if we could patch Home Assistant to make rsyncable gzips, my backup storage would go from 290GB to 58GB. That’s not my earlier 40GB mark, but it is a very nice improvement, and it would come at only a little cost to users who do not store their backups in restic or something else that dedups and compresses.

Curious about the actual effects of --rsyncable, I did a comparison on a single InfluxDB backup:

| source | size | file |
|---|---|---|
| from HA | 772M | ha-snapshots-unpacked-once/3bf092f5.tar/a0d7b954_influxdb.tar.gz (file says max compression) |
| gzip -9 | 772M | a0d7b954_influxdb.tar.gz |
| gzip -9 --rsyncable | 779M | a0d7b954_influxdb.tar.gz |
| gzip --rsyncable | 787M | ha-snapshots-unpacked-twice-repack-gzip-rsyncable/3bf092f5.tar/a0d7b954_influxdb.tar.gz |
| after gunzip | 1.8G | ha-snapshots-unpacked-twice/3bf092f5.tar/a0d7b954_influxdb.tar |

Observations:

  • gzip -9 matches the original size (this is a backup from 2023)
  • -9 --rsyncable is about 1% bigger than just -9

tar it up again

At this point, somebody asked me about block boundaries, and I said that my understanding was that restic did not care about them that much for dedup. But just to be sure, I did the -actual- experiment of what things would look like if HA made rsyncable .gz.

Source looks like this now:

ha-snapshots-unpacked-twice-repack-gzip-rsyncable-tar$ ls | head
00cceeff.tar
0123b63e.tar
03a06fae.tar

Presumably, HA should be able to restore from these files (except for the fact that I accidentally also gzipped the 2KB backup.json file HA sticks in each backup with some metadata).

This, again, takes up 289 GB.

Restic stores it in 56 GB:

Stats in raw-data mode:
     Snapshots processed:  1
        Total Blob Count:  37018
 Total Uncompressed Size:  57.161 GiB
              Total Size:  55.742 GiB
    Compression Progress:  100.00%
       Compression Ratio:  1.03x
Compression Space Saving:  2.48%

I cannot explain how this is 2 GB smaller than the previous mode.

Table of results

| format | source size | restic size |
|---|---|---|
| 354 original HA backups | 291 GB | 291 GB |
| untar once | 291 GB | 289 GB |
| untar+gunzip | 713 GB | 40 GB |
| untar+gunzip+gzip -9 --rsyncable | 289 GB | 58 GB |
| untar+gunzip+gzip -9 --rsyncable+tar | 289 GB | 56 GB |

Summary

Summarised, my results say:

  • having HA not compress would cost a bit of disk space temporarily, while allowing restic to save a boatload of space for a long time
  • having HA compress in --rsyncable would cost almost no extra space temporarily, and allow restic to still save a lot of space

Where I say “temporarily”, for some users this might be their forever, because they store the backups into non-dedup storage – or maybe even dedup storage that has different magic dust than restic does.

A secondary conclusion, of course, is “a bit of postprocessing of the HA tarball before it goes into restic provides these benefits too”.

Home Assistant uses securetar to generate backups, presumably because backups can optionally be encrypted. securetar itself relies on Python’s tarfile, which uses Python’s gzip module for compression. Sadly, that module does not expose anything like the --rsyncable flag. To my surprise, the periodic reset behind gzip --rsyncable is implemented in the gzip binary itself, not in the zlib library.
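For reference, this is roughly the set of knobs plain Python tarfile offers: no compression (mode "w") or gzip with a compression level (mode "w:gz" plus `compresslevel`). The sketch below only illustrates the standard library; it is not how securetar builds its archives, and `build_archive` and the sample payload are made up for the example.

```python
import io
import tarfile


def build_archive(payload, mode, **kwargs):
    """Write one member into an in-memory tar archive and return its bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode, **kwargs) as tar:
        info = tarfile.TarInfo("data.bin")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_back(archive, mode):
    """Extract the single member again to check the archive is readable."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode=mode) as tar:
        return tar.extractfile("data.bin").read()


payload = b"temperature,room=kitchen value=21.5\n" * 20000

plain = build_archive(payload, "w")                     # no compression
fast = build_archive(payload, "w:gz", compresslevel=1)  # gzip, fastest
best = build_archive(payload, "w:gz", compresslevel=9)  # gzip, smallest

if __name__ == "__main__":
    for label, blob in [("w", plain), ("w:gz level 1", fast), ("w:gz level 9", best)]:
        print(f"{label:<14} {len(blob):>8} bytes")
```

The compression level is the only tuning option here. Anything rsyncable would have to come from a different compressor underneath.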

Next steps

Given that we cannot easily make HA’s archives rsyncable, what remains is optionally making them uncompressed. I added a comment to the HA feature request “Allow backup compression parameters”. Please hit the Vote button!

A bigger job would be to add rsyncable support to (Python’s) zlib. I suspect this would make a lot of people happy - surely HA’s backups are not the only archives people routinely add to restic repositories.

As an alternative to patching HA, whatever method people use to get those backups into restic could do some processing on the backup to make it more suitable for dedup.
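A minimal sketch of such a processing step, producing the same "untar+gunzip" layout I measured above: one directory per backup, with each inner .tar.gz stored as a plain .tar. `unpack_for_dedup` is a hypothetical helper. It assumes unencrypted backups and enough scratch space for the decompressed data, and it has not been tested against a real HA backup.

```python
import gzip
import io
import os
import shutil
import tarfile
import tempfile


def unpack_for_dedup(backup_tar, dest_dir):
    """Unpack an HA-style backup and gunzip its inner .tar.gz members.

    Returns the list of file names written into dest_dir.
    """
    os.makedirs(dest_dir, exist_ok=True)
    written = []
    with tarfile.open(backup_tar, mode="r") as outer:
        for member in outer:
            if not member.isfile():
                continue
            src = outer.extractfile(member)
            # basename() keeps odd member names from escaping dest_dir.
            name = os.path.basename(member.name)
            if name.endswith(".tar.gz"):
                name = name[:-3]                  # foo.tar.gz -> foo.tar
                src = gzip.GzipFile(fileobj=src)  # stream-decompress
            with open(os.path.join(dest_dir, name), "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(name)
    return written


if __name__ == "__main__":
    # Demo on a tiny made-up backup instead of a real one.
    tmp = tempfile.mkdtemp()
    demo = os.path.join(tmp, "demo.tar")
    with tarfile.open(demo, "w") as tar:
        for name, body in [("a0d7b954_influxdb.tar.gz", gzip.compress(b"fake tar")),
                           ("backup.json", b"{}")]:
            info = tarfile.TarInfo(name)
            info.size = len(body)
            tar.addfile(info, io.BytesIO(body))
    print(unpack_for_dedup(demo, os.path.join(tmp, "demo.tar.unpacked")))
```

Pointing restic at the resulting directories instead of the original tarballs is what got me the 40GB figure. Restoring into HA would need the reverse step (gzip the inner tars and tar everything up again).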

Also, somebody should write a tool to “reformat” existing restic repos full of HA tarballs. In fact, once that tool exists, just running it periodically, with no changes to HA, would also yield all the space saving benefits.

Extra note: rvdm pointed out to me that zstd also has a --rsyncable flag, and that one -is- exposed as part of the zstd library. This means that a Python wrapper for that library could offer the flag, and HA could switch to rsyncable zstd, which could be a default that gives us the best of both worlds (decent compression for everybody, and decent dedup for restic users).

Updated: 2024-07-08