tl;dr: if gzip and restic interact anywhere for you, you should consider passing --rsyncable to gzip, or not using gzip at all.
tl;dr2: skip ahead to the Summary section to read the conclusions right now.
Like many people, I have a Home Assistant installation. Mine does not control many things, but it collects a lot from various sensors and devices around the house. Besides Home Assistant’s own database with short- and long-term storage, I use InfluxDB as an HA addon. This means that HA backups are also InfluxDB backups.
Over the course of a few years I have, apparently, collected 354 Home Assistant backups in a restic repository. The oldest one is around 300 MB; the newest, 1.4GB.
Background
Home Assistant backups look like this:
$ ssh root@XX
| | | | /\ (_) | | | |
| |__| | ___ _ __ ___ ___ / \ ___ ___ _ ___| |_ __ _ _ __ | |_
| __ |/ _ \| '_ \ _ \ / _ \ / /\ \ / __/ __| / __| __/ _\ | '_ \| __|
| | | | (_) | | | | | | __/ / ____ \\__ \__ \ \__ \ || (_| | | | | |_
|_| |_|\___/|_| |_| |_|\___| /_/ \_\___/___/_|___/\__\__,_|_| |_|\__|
Welcome to the Home Assistant command line.
[core-ssh ~]$ cd backup
[core-ssh backup]$ ls -al
total 10583052
drwxr-xr-x 3 root root 4096 Jul 6 00:00 .
drwxr-xr-x 1 root root 4096 Jul 4 13:22 ..
-rw-r--r-- 1 root root 1559705600 Jul 3 00:00 277e0264.tar
-rw-r--r-- 1 root root 1559388160 Jul 2 00:00 3456e3c7.tar
-rw-r--r-- 1 root root 1558886400 Jul 1 00:00 4303e28c.tar
-rw-r--r-- 1 root root 1560238080 Jul 4 00:00 77a4287d.tar
-rw-r--r-- 1 root root 1557125120 Jun 30 00:00 7b821e1e.tar
-rw-r--r-- 1 root root 1522309120 Jul 6 00:00 d6400715.tar
-rw-r--r-- 1 root root 1519339520 Jul 5 00:00 f0eba8bf.tar
[core-ssh backup]$ tar tvf 277e0264.tar
-rw-r--r-- 0/0 10510 2024-07-03 00:00:02 core_ssh.tar.gz
-rw-r--r-- 0/0 1800 2024-07-03 00:00:02 a0d7b954_aircast.tar.gz
-rw-r--r-- 0/0 2381 2024-07-03 00:00:02 15ef4d2f_esphome.tar.gz
-rw-r--r-- 0/0 1309290473 2024-07-03 00:00:02 a0d7b954_influxdb.tar.gz
-rw-r--r-- 0/0 36485918 2024-07-03 00:00:41 a0d7b954_tautulli.tar.gz
-rw-r--r-- 0/0 52714 2024-07-03 00:00:43 a0d7b954_grafana.tar.gz
-rw-r--r-- 0/0 2008 2024-07-03 00:00:43 a0d7b954_wireguard.tar.gz
-rw-r--r-- 0/0 1905 2024-07-03 00:00:43 a0d7b954_nodered.tar.gz
-rw-r--r-- 0/0 1560 2024-07-03 00:00:43 a0d7b954_chrony.tar.gz
-rw-r--r-- 0/0 2031 2024-07-03 00:00:43 a0d7b954_ssh.tar.gz
-rw-r--r-- 0/0 2422 2024-07-03 00:00:43 core_mosquitto.tar.gz
-rw-r--r-- 0/0 213812233 2024-07-03 00:00:43 homeassistant.tar.gz
-rw-r--r-- 0/0 205 2024-07-03 00:00:54 share.tar.gz
-rw-r--r-- 0/0 212 2024-07-03 00:00:54 addons_local.tar.gz
-rw-r--r-- 0/0 1961 2024-07-03 00:00:54 ssl.tar.gz
-rw-r--r-- 0/0 288 2024-07-03 00:00:54 media.tar.gz
-rw-r--r-- 0/0 1440 2024-07-03 00:00:53 ./backup.json
Each backup is a tarball containing a bunch of tar.gz files, including one for each addon, such as influxdb.
The oldest influxdb backup is 513MB; the newest, 1.2GB.
Clearly Influx is the bulk of my backup data.
But Influx is my archive.
It barely changes, it just grows a bit.
I would love for it to dedup a bit in restic storage!
Those 354 backups, all sitting together in a directory as 354 .tar files, take up 291GB of disk space.
A restic repository holding those 354 backups, with compression and deduplication enabled, takes up... 291GB of disk space. There is effectively no compression (which is expected, because there’s .gz inside already), and no dedup (which is the sad surprise this article is about):
$ restic -r ha-snapshots-all-together-restic/ stats --mode=raw-data
repository ed03b8ed opened (version 2, compression level auto)
Stats in raw-data mode:
Snapshots processed: 1
Total Blob Count: 199240
Total Uncompressed Size: 289.952 GiB
Total Size: 288.845 GiB
Compression Progress: 100.00%
Compression Ratio: 1.00x
Compression Space Saving: 0.38%
(The alert reader will notice that I only have one snapshot. I am assuming that restic dedup works as well within one snapshot as between snapshots, so I simplified the experiment by just backing up the one dir full of tars directly. My actual backup repository - the source of all these tarballs - has these 354 tarballs spread out over a few hundred snapshots, but with similar results: effectively no dedup).
It seems obvious that the .gz in the tarball stack is the reason restic cannot dedup.
(Spoiler alert: it is).
Let’s find out how we can improve that.
I have extensive notes on the steps I took to get all these numbers, so please feel free to ask questions.
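To make the suspicion concrete, here is a toy sketch of chunk-based deduplication. restic cuts files into variable-size chunks at content-defined boundaries and stores each unique chunk once; the rolling hash, the roughly 1 KiB chunk size, the fake sensor data and all names below are illustrative assumptions, not restic's actual algorithm or parameters. It compares how many chunks of a slightly changed file could be reused, once for the raw bytes and once for the gzipped bytes.

```python
import gzip
import hashlib
import random

_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]  # random value per byte
BOUNDARY_MASK = 0xFFC00000  # cut where the top 10 hash bits are zero (~1 KiB chunks)


def chunk_hashes(data: bytes) -> set[str]:
    """Cut data at content-defined boundaries and return the set of chunk digests."""
    digests = set()
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Rolling "gear" hash: only the last 32 bytes influence the value.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if h & BOUNDARY_MASK == 0:
            digests.add(hashlib.sha256(data[start:i + 1]).hexdigest())
            start = i + 1
    if start < len(data):
        digests.add(hashlib.sha256(data[start:]).hexdigest())
    return digests


def shared_fraction(old: bytes, new: bytes) -> float:
    """Fraction of the chunks in `new` that a store already holding `old` could reuse."""
    old_chunks, new_chunks = chunk_hashes(old), chunk_hashes(new)
    return len(new_chunks & old_chunks) / len(new_chunks)


def fake_archive(n_lines: int, seed: int) -> bytes:
    """Made-up sensor log standing in for an InfluxDB dump."""
    r = random.Random(seed)
    return "".join(
        f"{1700000000 + i},sensor_{r.randrange(20)},{r.random():.3f}\n"
        for i in range(n_lines)
    ).encode()


yesterday = fake_archive(8000, seed=1)
today = yesterday[:5000] + b"one new reading\n" + yesterday[5000:]

raw_shared = shared_fraction(yesterday, today)
gz_shared = shared_fraction(
    gzip.compress(yesterday, mtime=0), gzip.compress(today, mtime=0)
)
print(f"raw: {raw_shared:.0%} of chunks reusable, gzipped: {gz_shared:.0%}")
```

On the raw bytes only the chunk around the insertion is new. After gzip, a small change early in the input alters the compressed stream from that point on, so very few chunks line up.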
Unpack the tarballs once
I unpacked the tarballs once.
In the process, I found some of the tarballs were corrupted and/or truncated.
This has introduced around 1-2GB of noise into these measurements.
I believe this to be irrelevant to the end result.
(To be clear, I think the tarball truncation is my own fault, and not a problem in HA.
I keep forgetting that rsync -P includes --partial and most likely moved some backups from HA to restic via an undersized intermediate filesystem.)
My source now looks like this:
ha-snapshots-unpacked-once$ find . | head
.
./cff234c8.tar
./cff234c8.tar/15ef4d2f_esphome.tar.gz
./cff234c8.tar/a0d7b954_nodered.tar.gz
etc.
This takes up 291GB of disk space, and restic manages to squeeze it into 289GB.
Clearly the outside tar layer is not the problem.
Decompress
I gunzipped all the .gz files.
Looks like this (funny how the order of files has changed):
ha-snapshots-unpacked-twice$ find . | head
.
./cff234c8.tar
./cff234c8.tar/core_ssh.tar
./cff234c8.tar/share.tar
etc.
This takes up a whopping 713GB of disk space. Restic packs it down to 40GB.
Forty. Gigabyte.
If I am reading this correctly:
Snapshots processed: 1
Total Blob Count: 211179
Total Uncompressed Size: 140.975 GiB
Total Size: 39.799 GiB
Compression Progress: 100.00%
Compression Ratio: 3.54x
Compression Space Saving: 71.77%
deduplication stored 713GB in 141GB. Compression then gave us another 100GB.
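As a quick check of that reading, the arithmetic can be redone from the figures restic printed. This is only a sketch: it assumes the 713GB source figure and restic's GiB figures can be compared directly, which glosses over the GB/GiB difference.

```python
source_size = 713          # unpacked source size quoted in the text
after_dedup = 140.975      # restic "Total Uncompressed Size" (GiB)
stored = 39.799            # restic "Total Size" (GiB)

dedup_factor = source_size / after_dedup      # how much dedup alone removed
compression_ratio = after_dedup / stored      # should match restic's "Compression Ratio"
space_saving = 1 - stored / after_dedup       # should match "Compression Space Saving"

print(f"dedup factor:      {dedup_factor:.2f}x")
print(f"compression ratio: {compression_ratio:.2f}x")
print(f"space saving:      {space_saving:.2%}")
print(f"saved by compression: {after_dedup - stored:.0f} GiB")
```

The last two derived values agree with the 3.54x and 71.77% that restic reports, so dedup accounts for most of the reduction and compression for roughly another 100.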
It turns out (unsurprisingly, I guess) that this is the best number we are going to get.
gzip’s rsyncable flag
At some point, I ran into a blog post that mentioned that restic was smart about dedup, and did not need things to be identical at block boundaries.
This reminded me of gzip --rsyncable.
Martin Pool describes it well:
There is a patch called --rsyncable for gzip that fixes this behaviour: gzip files are basically broken up into blocks so that changes (including insertion or deletion) in the input file affect only the corresponding blocks in the output file. (The blocks are not of fixed size, but rather delimited by marker patterns at which a checksum hits a particular value, so they move as data is inserted or removed.)
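The idea in that quote can be imitated with Python's zlib module. The following is a sketch of the principle only, not how gzip implements --rsyncable: the rolling hash, the marker pattern and the function name `rsyncable_pieces` are assumptions made for illustration. Whenever the rolling hash hits the marker, the compressor is fully flushed, so everything after that point is compressed independently of what came before.

```python
import random
import zlib

_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]  # random value per byte
BOUNDARY_MASK = 0xFFC00000  # marker: top 10 bits of the rolling hash are zero


def rsyncable_pieces(data: bytes, level: int = 6) -> list[bytes]:
    """Compress data into one zlib stream, resetting at content-defined markers.

    The stream is returned as a list of pieces (one per reset) so that two
    runs can be compared; b"".join(pieces) is an ordinary zlib stream.
    """
    comp = zlib.compressobj(level)
    pieces = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF  # depends on the last 32 bytes only
        if h & BOUNDARY_MASK == 0:
            # Z_FULL_FLUSH byte-aligns the output and forgets the history window.
            pieces.append(
                comp.compress(data[start:i + 1]) + comp.flush(zlib.Z_FULL_FLUSH)
            )
            start = i + 1
    pieces.append(comp.compress(data[start:]) + comp.flush())  # final block + checksum
    return pieces


# Made-up sensor log, then the same log with one line inserted near the start.
rng = random.Random(1)
old = "".join(
    f"{1700000000 + i},sensor_{rng.randrange(20)},{rng.random():.3f}\n"
    for i in range(8000)
).encode()
new = old[:5000] + b"one new reading\n" + old[5000:]

old_pieces = set(rsyncable_pieces(old))
new_pieces = rsyncable_pieces(new)
reused = sum(1 for piece in new_pieces if piece in old_pieces)
print(f"{reused} of {len(new_pieces)} compressed pieces are byte-identical")
```

The price is the one measured later in this post: every reset throws away compression context, so the output gets slightly larger.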
So, I recompressed all inside tarballs with gzip --rsyncable -9.
I used -9 because the couple of HA backups I checked manually reported max compression when inspected with file.
I later found out that in late 2023, HA stopped using max compression.
Me doing -9 while HA no longer does it will skew these numbers a bit, but not in any interesting way I think.
My source looks like this:
ha-snapshots-unpacked-twice-repack-gzip-rsyncable$ find . | head
.
./cff234c8.tar
./cff234c8.tar/15ef4d2f_esphome.tar.gz
./cff234c8.tar/a0d7b954_nodered.tar.gz
It takes up 289GB of disk space, almost identical to our starting size.
Restic stores it in 58 GB:
Stats in raw-data mode:
Snapshots processed: 1
Total Blob Count: 41780
Total Uncompressed Size: 59.175 GiB
Total Size: 57.681 GiB
Compression Progress: 100.00%
Compression Ratio: 1.03x
Compression Space Saving: 2.52%
In other words: if we could patch Home Assistant to make rsyncable gzips, my backup storage would go from 290GB to 58GB. That’s not my earlier 40GB mark, but it is a very nice improvement, and it would come at only little cost to users who do not store their backups in restic or something else that dedups and compresses.
Curious about the actual effects of --rsyncable, I did a comparison on a single InfluxDB backup:
source | result |
---|---|
from HA | 772M ha-snapshots-unpacked-once/3bf092f5.tar/a0d7b954_influxdb.tar.gz (file says max compression) |
gzip -9 | 772M a0d7b954_influxdb.tar.gz |
gzip -9 --rsyncable | 779M a0d7b954_influxdb.tar.gz |
gzip --rsyncable | 787M ha-snapshots-unpacked-twice-repack-gzip-rsyncable/3bf092f5.tar/a0d7b954_influxdb.tar.gz |
after gunzip | 1.8G ha-snapshots-unpacked-twice/3bf092f5.tar/a0d7b954_influxdb.tar |
Observations:
- gzip -9 matches the original size (this is a backup from 2023)
- -9 --rsyncable is about 1% bigger than just -9
tar it up again
At this point, somebody asked me about block boundaries, and I said that my understanding was that restic did not care about them that much for dedup.
But just to be sure, I did the -actual- experiment of what things would look like if HA made rsyncable .gz.
Source looks like this now:
ha-snapshots-unpacked-twice-repack-gzip-rsyncable-tar$ ls | head
00cceeff.tar
0123b63e.tar
03a06fae.tar
Presumably, HA should be able to restore from these files (except for the fact that I accidentally also gzipped the 2KB backup.json file HA sticks in each backup with some metadata).
This, again, takes up 289 GB.
Restic stores it in 56 GB:
Stats in raw-data mode:
Snapshots processed: 1
Total Blob Count: 37018
Total Uncompressed Size: 57.161 GiB
Total Size: 55.742 GiB
Compression Progress: 100.00%
Compression Ratio: 1.03x
Compression Space Saving: 2.48%
I cannot explain how this is 2 GB smaller than the previous mode.
Table of results
format | source size | restic size |
---|---|---|
354 original HA backups | 291 GB | 291 GB |
untar once | 291 GB | 289 GB |
untar+gunzip | 713 GB | 40 GB |
untar+gunzip+gzip -9 --rsyncable | 289 GB | 58 GB |
untar+gunzip+gzip -9 --rsyncable+tar | 289 GB | 56 GB |
Summary
Summarised, my results say:
- having HA not compress would cost a bit of disk space temporarily, while allowing restic to save a boatload of space for a long time
- having HA compress in --rsyncable would cost almost no extra space temporarily, and allow restic to still save a lot of space
Where I say “temporarily”, for some users this might be their forever, because they store the backups into non-dedup storage – or maybe even dedup storage that has different magic dust than restic does.
A secondary conclusion, of course, is “a bit of postprocessing of the HA tarball before it goes into restic provides these benefits too”.
As suggested above, Home Assistant uses securetar to generate backups, presumably because backups can optionally be encrypted.
securetar itself relies on Python’s tarfile, which uses the Python gzip library for compression.
This library does not expose anything like the --rsyncable flag, sadly.
In gzip --rsyncable, that periodic reset is implemented (and this came as a surprise to me) in the gzip binary, not in the zlib library.
Next steps
Given that we cannot easily make HA’s archives rsyncable, what remains is optionally making them uncompressed. I added a comment to the HA feature request “Allow backup compression parameters”. Please hit the Vote button!
A bigger job would be to add rsyncable support to (Python’s) zlib.
I suspect this would make a lot of people happy - surely HA’s backups are not the only archives people routinely add to restic repositories.
As an alternative to patching HA, whatever method people use to get those backups into restic could do some processing on the backup to make it more suitable for dedup.
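A minimal sketch of such a processing step, assuming an unencrypted backup whose outer layer is a plain tar (as in the listing earlier): it rewrites the backup so that every inner .tar.gz member is stored as an uncompressed .tar. The function name `ungzip_inner_members` is hypothetical, and the result would presumably need its members gzipped again before HA could restore from it.

```python
import gzip
import io
import shutil
import tarfile
import tempfile


def ungzip_inner_members(src, dst):
    """Copy the outer tar from src to dst, storing each *.tar.gz member as a plain *.tar.

    src and dst are binary file objects. Hypothetical helper, not part of HA or restic.
    """
    with tarfile.open(fileobj=src, mode="r:") as outer_in, \
            tarfile.open(fileobj=dst, mode="w:") as outer_out:
        for member in outer_in:
            if not member.isfile():
                outer_out.addfile(member)  # directories etc. carry no payload
                continue
            with outer_in.extractfile(member) as raw:
                if not member.name.endswith(".tar.gz"):
                    outer_out.addfile(member, raw)  # e.g. backup.json: copy as-is
                    continue
                # The tar header needs the final size up front, so spool the
                # decompressed stream to a temporary file first.
                with tempfile.TemporaryFile() as spool:
                    with gzip.GzipFile(fileobj=raw) as gz:
                        shutil.copyfileobj(gz, spool)
                    info = tarfile.TarInfo(name=member.name[: -len(".gz")])
                    info.size = spool.tell()
                    info.mtime = member.mtime
                    info.mode = member.mode
                    spool.seek(0)
                    outer_out.addfile(info, spool)
```

It could be called with two files opened in binary mode, for example the nightly backup as `src` and a scratch file that restic then picks up as `dst`. Going by the numbers above, the scratch copy is more than twice the size of the original, so it needs the disk space to match.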
Also, somebody should write a tool to “reformat” existing restic repos full of HA tarballs. In fact, once that tool exists, just running it periodically, with no changes to HA, would also yield all the space saving benefits.
Extra note: rvdm pointed out to me that zstd also has a --rsyncable flag, and that one -is- exposed as part of the zstd library.
This means that a Python wrapper for that library could offer the flag, and HA could switch to rsyncable zstd, which could be a default that gives us the best of both worlds (decent compression for everybody, and decent dedup for restic users).