- ZFS 101: understanding ZFS storage and performance
- ZFS Guide for starters and advanced users: concepts, pool config, tuning, troubleshooting
- ZFS Administration
- Understanding ZFS vdev Types
- Zero to TrueNAS as Fast as Possible: A highly-distilled TrueNAS guide that gets straight to the point and aims to provide the basic understanding necessary to get started
- Hardware for CORE and SCALE: A guide to optimal hardware selection for any size TrueNAS system from a 1 TiB simple home server to 10 PiB of enterprise-class flash
- OpenZFS: The Final Word in File Systems: A deep-dive into nearly every aspect of OpenZFS administration and a closer look at many of the mechanisms that make ZFS the final word in file systems
- OpenZFS Capacity Calculator: A calculator application and guide on how to precisely determine the usable capacity and several other metrics of a ZFS storage pool
- RAID Reliability Calculator: A calculator and guide that lets you determine and compare the probability of a pool failure given a certain RAID layout and disk failure rate
- RAIDZ Allocation Overhead: A deep dive into the causes and the math behind RAIDZ allocation overhead including advice on how to minimize this on your own OpenZFS pool
- ZFS Performance Overview: A discussion of how different OpenZFS pool designs and vdev layouts impact storage performance, reliability, and capacity
- Choosing The Right ZFS Pool Layout
IOPS = 1000 / (AverageSeekTime + AverageLatency) where values are expressed in milliseconds (ms)
Latency_HDD = 0.5 * (1/RPM) * 60000ms
Latency_SSD ~= 0.1ms
- HDD
IOPS depends on the time the disk platter takes to spin halfway around, e.g.
T/2 = 1/(2ω), where ω is the rotational speed in revolutions per second
- SSD does not have rotating platters
Average Seek Time / Average Latency can be found in the manufacturer datasheet/product manual (e.g. IronWolf)
HDD ~= 55 to 180 IOPS
SATA_SSD ~= 3,000 to 40,000 IOPS
NVMe_SSD ~= 100,000 to 500,000 IOPS
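Worked example (hedged; typical 7200 RPM NAS HDD figures assumed): Latency_HDD = 0.5 * (60000 / 7200) ≈ 4.17ms, and with AverageSeekTime ≈ 8.5ms, IOPS ≈ 1000 / (8.5 + 4.17) ≈ 79, which lands inside the 55 to 180 IOPS range above.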
- Featured resources
- How to: Backup to local disks
- mercenary_sysadmin's comments: /r/zfs moderator/domain expert who frequently answers questions
- ZFS: You should use mirror vdevs, not RAIDZ
- ZFS RAIDZ stripe width, or: How I Learned to Stop Worrying and Love RAIDZ
ZFS on Unraid
- Unraid-zfs 6.12.3: My setup, more things I learned along the way
- Things I have learned with zfs on unraid 6.12
- Unraid-Zfs: what is an dataset?
- Vdev: ZFS virtual device, a group of one or more disks usually with some redundancy like mirroring or RAIDZ.
- Mirror vdev: Every disk in the vdev gets an identical copy of all data. Usually consists of 2 disks but can have more.
- RAIDZ1: Like RAID5, some number of data disks plus one parity disk (N+1).
- RAIDZ2: Like RAID6, some number of data disks plus two parity disks (N+2).
- RAIDZ3: Some number of data disks plus three parity disks (N+3).
- zpool: The logical ZFS volume or array consisting of one or more vdevs.
- Resilver: ZFS term for rebuilding a pool after a drive fails and is replaced.
- Scrub: Automatic scan of a pool to verify checksums and correct data corruption.
- Dataset: The logical container where ZFS stores data. There are four types of datasets: file systems, volumes, snapshots, and bookmarks. "Dataset" can also refer to a file system dataset.
- Zvol: Shorthand for a volume dataset. A zvol acts as a raw block device. ZFS carves out a chunk of disk space to be used by block sharing protocols like iSCSI.
- ARC: Adaptive Replacement Cache, the algorithm used by ZFS to cache data. Also refers to the cache itself which exists in a system's main memory. The ARC is shared by all pools on a system.
- L2ARC: A second tier of cache under the ARC. Despite its name, L2ARC uses a simple ring buffer algorithm and is typically deployed on one or more fast SSDs. L2ARCs are assigned per pool.
- ZIL: The ZFS intent log. Stable storage that acts as a temporary landing zone for incoming sync writes. Every pool has a ZIL regardless of whether the pool has a SLOG.
- SLOG: A separate device for the ZFS intent log (**s**eparate **log** device, hence SLOG device). Added to a pool as a fast SSD if it's handling latency-sensitive sync writes. Like the L2ARC, it is assigned per pool.
- Snapshot: A read-only historical reference copy of a dataset. Only consumes space based on changed data since the snapshot was taken.
- Clone: A mounted, read/write copy of a snapshot. Can be used to recover files from a snapshot or to provide a new, separate working set of data.
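A minimal sketch of the snapshot/clone workflow described in the glossary above; the dataset name `tank/data` is made up:

```sh
# read-only, point-in-time copy; only consumes space for blocks changed afterwards
zfs snapshot tank/data@before-upgrade

# writable copy of the snapshot, mounted as its own dataset
zfs clone tank/data@before-upgrade tank/restore

# list all snapshots on the system
zfs list -t snapshot
```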
zpool: roughly analogous to JBOD with a complex distribution mechanism
> [!danger] redundancy is at the `vdev` level
> losing a `vdev` => losing the entire `zpool`. There is absolutely no redundancy at the `zpool` level
common misconception that ZFS "stripes" writes across the pool
- writes are mostly distributed across `vdevs` w.r.t. their available free space
- ensures all `vdevs` will theoretically become full at the same time
- `vdevs` are not striped with one another; each is an independent unit
vdev: consists of one or more real `devices`
- mostly used for plain storage, but special vdevs exist (e.g. `LOG`, `CACHE`, `SPECIAL`)
single-device: cannot survive any failure; if used as storage or `SPECIAL`, failure will take the entire `zpool` down with it
mirror: each block is stored on every device in the vdev
RAIDz1-3: diagonal parity RAID where the number (1-3) is the max # of disk failures the vdev can survive
device: just a random-access block device, e.g. an SSD/HDD
file: a simple raw file; a useful alternative to block devices for testing/practice (see the sketch below)
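A minimal sketch of how these vdev layouts translate into `zpool` commands; the pool name `tank` and the device paths are made up:

```sh
# mirror vdev: every block is stored on both disks
zpool create tank mirror /dev/sda /dev/sdb

# add a second mirror vdev; writes get distributed across vdevs by free space
zpool add tank mirror /dev/sdc /dev/sdd

# alternative layout: RAIDZ2 (data disks + two parity)
# zpool create tank raidz2 /dev/sde /dev/sdf /dev/sdg /dev/sdh

# file-backed devices for testing/practice
truncate -s 1G /tmp/d1 /tmp/d2
zpool create testpool mirror /tmp/d1 /tmp/d2
zpool status testpool
```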
dataset: roughly analogous to a standard, mounted filesystem
has its own set of underlying properties e.g.
recordsize: upper limit on how big the chunks (blocks) are that ZFS writes to disk
- datasets with purely small stuff benefit from lower recordsize while large media datasets work just fine with 1M or higher
- You can split a 1 MiB file into 256 small 4K pieces or use one large 1M box
- HDD loves large blocks but reading/writing the whole MB just to change a bit somewhere in the middle is far from efficient
compression: sets compression algorithm
`LZ4` is the default; really fast (GB/s per core) for the compression it achieves
`ZSTD` is the new and fancy kid on the block and very efficient
`GZIP` is available too, but slow
- recommended: use compression (see the `zfs` sketch after this property list)
atime: you want to turn this off immediately unless you absolutely need atime
casesensitivity: useful when dealing with Windows clients and everything where case sensitivity may become a problem
sync: consult the SLOG paragraph above on why you may want to change sync
copies: you can store multiple copies of data in a dataset
- like a RAID1 on a single disk but only within this dataset
- useful for very important data or when you don't have redundancy in the first place but still want self-healing of corrupt data
- doesn't protect against drive failure
mountpoint: where you want the dataset to be mounted in the system
secondarycache: you can exclude datasets from cache
- or exclude everything by default and only allow some datasets to use cache
- primary being ARC and secondary being L2ARC
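A minimal sketch of tuning the per-dataset properties listed above; the pool/dataset names (`tank/media`, `tank/photos`) are made up:

```sh
# create a dataset with compression on and atime off
zfs create -o compression=lz4 -o atime=off tank/media

# keep two copies of irreplaceable data, even without a redundant vdev
# (self-healing of corrupt blocks, but no protection against drive failure)
zfs create -o copies=2 tank/photos

# only cache this dataset's metadata in L2ARC
zfs set secondarycache=metadata tank/photos

# verify
zfs get compression,atime,copies,secondarycache tank/photos
```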
zvol: roughly analogous to a raw block device (no filesystem on top)
recordsize property: max block size in a dataset
- files are composed of one or more blocks; a block references only one file's data
ashift property: the binary exponent which represents the sector size, e.g.
ashift=9 => sector size = 2^9 = 512 bytes; ashift=12 => 4096 bytes
> [!danger] many disks lie about their sector size, causing an astronomical read/write amplification penalty if `ashift` is set too low
> e.g. Samsung EVO SSDs, which should have `ashift=13`, lie and report 512-byte sectors
most disks use 4KiB sectors with some SSDs using 8KiB sectors
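A minimal sketch of checking what a disk reports and pinning `ashift` at pool creation; the device and pool names are made up:

```sh
# what the disk claims (many drives report 512 B even with 4 KiB physical sectors)
lsblk -o NAME,PHY-SEC,LOG-SEC /dev/sda

# force 4 KiB sectors regardless of what the disk reports: ashift=12 => 2^12 = 4096 bytes
zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb

# confirm the pool-wide value
zpool get ashift tank
```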
Pro tip: don't go crazy with datasets; they're nifty for snapshots and zfs send/receive, but each one is its own filesystem.
> [!note]+ Moving files/folders across `datasets`
> Even when the `datasets` are nested / in the same pool, a move across them is a full copy + delete rather than an instant rename
> Analogous to moving data across drives in Windows (e.g. a `/media/dataset` whose folders are themselves `datasets` behaves like separate drives)
If you have a lot of files, like hundreds of thousands or millions of photos on hard drives, there's a setting in ZFS you can change to improve performance when indexing or browsing the files.
Going through my 682,000 files went from 64 seconds to only 5 seconds. It's amazing. ZFS usually keeps metadata cached in ARC but it's not very smart when the ARC gets full.
You can set `arc_meta_min` large enough to hold all the metadata and then fill the ARC metadata cache; this improves performance for rsync, find, or anything that scans through all the files and folders.
I can even scan and navigate the entire directory tree with the disks spun down, it's like the Dynamix cache dirs plugin but way better.
On Unraid, you can do this:
- Run a command to scan all files, e.g. `ls -lahR /mnt/zpool/ > /dev/null`, and see how long it takes (minutes)
- Run `cat /proc/spl/kstat/zfs/arcstats` and check how much `arc_meta_used` reports (if it's close to `arc_meta_max` then maybe increase your ARC size and try again)
- Scan all files again and see how much faster it is (few seconds)
- Copy large file (several GB) from the ZFS share to your PC
- Scan all files again; if the times are all about the same then stop here because this won't help, but if it's still slow then continue
- Set `zfs_arc_meta_min` to a bigger number than your `arc_meta_used` from step 2 by doing `echo 3088630576 >> /sys/module/zfs/parameters/zfs_arc_meta_min` with your number
- Try steps 1 - 5 again; scanning should always be fast now
- To make the setting permanent, create/edit the ZFS module options file and put `options zfs zfs_arc_meta_min=3088630576` there with your number
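A condensed sketch of the commands above (assuming an OpenZFS build that still exposes the `zfs_arc_meta_min` tunable; the pool path and the value are the examples from the post):

```sh
# time a full metadata scan of the pool
time ls -lahR /mnt/zpool/ > /dev/null

# how much metadata the ARC currently holds vs. its observed maximum
grep -E '^arc_meta_(used|max)' /proc/spl/kstat/zfs/arcstats

# raise the metadata floor above arc_meta_used (replace with your own number)
echo 3088630576 >> /sys/module/zfs/parameters/zfs_arc_meta_min
```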
recordsize: general rules of thumb
1M: Sequential workloads
1M: General-purpose file sharing/storage
1M: BitTorrent download folders
64K: KVM virtual machines using Qcow2 file-based storage
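A minimal sketch applying these rules of thumb; the dataset names (`tank/media`, `tank/vm`) are made up:

```sh
# large records for sequential / general-purpose / torrent storage
zfs set recordsize=1M tank/media

# smaller records for Qcow2-backed KVM guests
# (recordsize only affects blocks written after the change)
zfs set recordsize=64K tank/vm

zfs get recordsize tank/media tank/vm
```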