Zpools are the underlying device layer for zfs filesystems. Mirrors, RAIDs and concatenated storage are defined here.
For pooling devices, a zpool can be set up as:
- a mirror
- a RAIDz with single or double parity
- a concatenated/striped storage
This worksheet was done with Solaris 10 running in a Parallels virtual machine. The disks are not real; they are virtualized by Parallels, with 8 GB each. Not much, but enough to play with.
First we will look up the disks accessible to our system:
# format
Searching for disks...done
AVAILABLE DISK SELECTIONS:
0. c0d0
/pci@0,0/pci-ide@1f,1/ide@0/cmdk@0,0
1. c1d0
/pci@0,0/pci-ide@1f,1/ide@1/cmdk@0,0
Specify disk (enter its number): ^C
Type CTRL-C to quit "format".
If your disks do not show up, use devfsadm:
# devfsadm
# format
Searching for disks...done
AVAILABLE DISK SELECTIONS:
0. c0d0
/pci@0,0/pci-ide@1f,1/ide@0/cmdk@0,0
1. c0d1
/pci@0,0/pci-ide@1f,1/ide@0/cmdk@1,0
2. c1d0
/pci@0,0/pci-ide@1f,1/ide@1/cmdk@0,0
3. c1d1
/pci@0,0/pci-ide@1f,1/ide@1/cmdk@1,0
Specify disk (enter its number): ^C
You'll notice that the virtual disks are mapped as IDE/ATA drives, so the disk device names don't have a target specification "t".
Let's create our first pool by simply putting together the three free disks (c0d0 is our boot disk holding the root partition, so it is not usable for our example):
# zpool create zfstest c0d1 c1d0 c1d1
That's it. You have just created a zpool named "zfstest" containing the three disks. Your available space is roughly the sum of their sizes:
# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
zfstest 23.8G 91K 23.8G 0% ONLINE -
Use "zpool status" to get detailed status information of the components of your zpool:
# zpool status
pool: zfstest
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
zfstest ONLINE 0 0 0
c0d1 ONLINE 0 0 0
c1d0 ONLINE 0 0 0
c1d1 ONLINE 0 0 0
errors: No known data errors
To destroy a pool, use "zpool destroy":
# zpool destroy zfstest
and your pool is gone.
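You can check with "zpool list", which should now report something along the lines of "no pools available":
# zpool list
no pools available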
Let's try a mirror now:
# zpool create zfstest mirror c1d0 c1d1
You just created a mirror between disk c1d0 and disk c1d1. Available storage is the same as if you used only one of these disks. If disk sizes differ, the smaller size will be your storage size. Data is replicated between these disks.
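A quick "zpool list" confirms this; with our 8 GB disks the mirror shows up with roughly the size of a single disk (output abridged, values illustrative):
# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
zfstest 7.94G 88K 7.94G 0% ONLINE -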
"zpool status" now reads:
# zpool status
pool: zfstest
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
zfstest ONLINE 0 0 0
mirror ONLINE 0 0 0
c1d0 ONLINE 0 0 0
c1d1 ONLINE 0 0 0
errors: No known data errors
So now we have a simple mirror. But how do we put data on it?
When you create a zpool, a zfs filesystem is automatically created in it. The mountpoint defaults to /<poolname>, so your pool "zfstest" is mounted as a zfs filesystem at /zfstest:
# df -k
Filesystem kbytes used avail capacity Mounted on
/dev/dsk/c0d0s0 14951508 5725085 9076908 39% /
/devices 0 0 0 0% /devices
ctfs 0 0 0 0% /system/contract
proc 0 0 0 0% /proc
mnttab 0 0 0 0% /etc/mnttab
swap 2104456 836 2103620 1% /etc/svc/volatile
objfs 0 0 0 0% /system/object
/usr/lib/libc/libc_hwcap1.so.1
14951508 5725085 9076908 39% /lib/libc.so.1
fd 0 0 0 0% /dev/fd
swap 2103624 4 2103620 1% /tmp
swap 2103644 24 2103620 1% /var/run
zfstest 8193024 24 8192938 1% /zfstest
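If you prefer a different mountpoint, you do not have to stick to the default. You can set it at creation time with the -m option of "zpool create", or change it later through the mountpoint property (the path /export/zfstest is just an example):
# zpool create -m /export/zfstest zfstest mirror c1d0 c1d1
# zfs set mountpoint=/export/zfstest zfstest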
We will create a big file on it:
# dd if=/dev/zero bs=128k count=40000 of=/zfstest/bigfile
40000+0 records in
40000+0 records out
It is really there now:
# ls -la /zfstest
total 10241344
drwxr-xr-x 2 root sys 3 Apr 21 11:15 .
drwxr-xr-x 39 root root 1024 Apr 21 11:13 ..
-rw-r--r-- 1 root root 5242880000 Apr 21 11:30 bigfile
Now the differences from classical volume managers begin to show. The underlying zpool "zfstest" actually KNOWS that approx. 5 gigabytes are occupied by zfs:
# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
zfstest 7.94G 4.88G 3.05G 61% ONLINE -
This has enormous advantages: when replacing a mirrored disk, zfs copies only the allocated blocks to the new disk, not all blocks of the pool. The same is true for RAID devices: only allocated data blocks are reconstructed.
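Replacing a disk is done with "zpool replace". As a sketch (not part of this walkthrough), swapping the free disk c0d1 in for c1d0 would look like this, and only the allocated blocks would be resilvered:
# zpool replace zfstest c1d0 c0d1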
Now let's break the mirror. You do that by detaching one drive from it:
# zpool detach zfstest c1d0
This command removed disk c1d0 from the pool. Your mirror is not a mirror any more:
# zpool status
pool: zfstest
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
zfstest ONLINE 0 0 0
c1d1 ONLINE 0 0 0
errors: No known data errors
However, you did not lose a single bit of data! Your zpool is just as available as it was before (as long as disk c1d1 does not fail).
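You can convince yourself that the data is still there:
# ls -l /zfstest/bigfile
The 5 GB file we created earlier is still listed.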
You may attach another disk to your pool to create a new mirror:
# zpool attach zfstest c1d1 c0d1
Now your mirror consists of disks c1d1 and c0d1. Solaris will immediately begin to duplicate any block that's used by zfs from drive c1d1 to drive c0d1:
# zpool status
pool: zfstest
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scrub: resilver in progress, 55.53% done, 0h7m to go
config:
NAME STATE READ WRITE CKSUM
zfstest ONLINE 0 0 0
mirror ONLINE 0 0 0
c1d1 ONLINE 0 0 0
c0d1 ONLINE 0 0 0
errors: No known data errors
The process of replicating data on new or outdated disks is named "resilvering".
A mirror is not limited to two disks. If you are seriously worried about losing your valuable data, just attach another disk to your mirror:
# zpool attach zfstest c0d1 c1d0
Your mirror now has three members. Note that your storage size does not grow by attaching new mirror components. But now two drives may fail completely and you still have all of your data:
# zpool status
pool: zfstest
state: ONLINE
scrub: resilver completed with 0 errors on Mon Apr 21 13:56:16 2008
config:
NAME STATE READ WRITE CKSUM
zfstest ONLINE 0 0 0
mirror ONLINE 0 0 0
c1d1 ONLINE 0 0 0
c0d1 ONLINE 0 0 0
c1d0 ONLINE 0 0 0
errors: No known data errors
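A "zpool list" at this point would still report the size of a single disk, along these lines (values as seen earlier in this worksheet):
# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
zfstest 7.94G 4.88G 3.05G 61% ONLINE -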
Let's detach two disks now:
# zpool detach zfstest c1d1
# zpool detach zfstest c1d0
Your mirror is gone once again. To set up concatenated or striped storage (zfs spreads write operations across ALL pool members, so it behaves more like a striped disk set), you may add these disks to your pool (never mistake "add" for "attach": the former ADDs storage, the latter attaches disks to mirrors):
# zpool add zfstest c1d0 c1d1
Your pool has become the same as the one we created at the beginning of our exercise:
# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
zfstest 23.8G 4.88G 18.9G 20% ONLINE -
# zpool status
pool: zfstest
state: ONLINE
scrub: resilver completed with 0 errors on Mon Apr 21 13:56:16 2008
config:
NAME STATE READ WRITE CKSUM
zfstest ONLINE 0 0 0
c0d1 ONLINE 0 0 0
c1d0 ONLINE 0 0 0
c1d1 ONLINE 0 0 0
errors: No known data errors
The only difference is our file "bigfile", which is still there because we did not destroy the pool. You can see it in the output of "zpool list" above: 4.88G are still in use.
Now we are stuck: it is not possible to remove disks that have been added to our zpool. As writes occur on all members, newly written data ends up on all disks, so there is no way to simply throw one away. Mirror components, on the other hand, can be detached at any time, as long as a disk is not the last one in its mirror.
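Attempting to remove a regular pool member with "zpool remove" will simply be refused; in this ZFS release that command only handles auxiliary devices such as hot spares:
# zpool remove zfstest c1d0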
Let's destroy the pool and set up RAID storage instead. ZFS offers two RAID levels: raidz1 (single parity) and raidz2 (double parity).
# zpool destroy zfstest
# zpool create zfstest raidz1 c0d1 c1d0 c1d1
We have now created a raid group:
# zpool status
pool: zfstest
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
zfstest ONLINE 0 0 0
raidz1 ONLINE 0 0 0
c0d1 ONLINE 0 0 0
c1d0 ONLINE 0 0 0
c1d1 ONLINE 0 0 0
errors: No known data errors
Be aware that "zpool list" is showing the global capacity of your raid set and not the usable capacity:
# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
zfstest 23.9G 157K 23.9G 0% ONLINE -
To see how much space we are able to allocate, use a zfs command (zfs commands are explained in the zfs tutorial text):
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
zfstest 101K 15.7G 32.6K /zfstest
One disk may fail in this scenario.
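You can simulate such a failure by taking one disk offline and bringing it back later; while it is offline, "zpool status" will report the pool as DEGRADED but the data stays accessible (a sketch, not part of the outputs shown here):
# zpool offline zfstest c1d0
# zpool status
# zpool online zfstest c1d0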
To set up the same pool with double parity:
# zpool destroy zfstest
# zpool create zfstest raidz2 c0d1 c1d0 c1d1
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
zfstest 86.7K 7.80G 24.4K /zfstest
Only 7.8 GB are left, compared to 15.7 GB with single parity. But now two drives may fail.
With three disks, we have achieved the same data security as with a three-way mirror, and also the same usable storage.
As a practical example, here is the output of "zpool status" and "zpool list" of a mailserver. The zpool "mail" consists of two mirror pairs added to a pool.
The creation command has been:
# zpool create mail \
mirror c6t600D0230006C1C4C0C50BE5BC9D49100d0 c6t600D0230006B66680C50AB7821F0E900d0 \
mirror c6t600D0230006B66680C50AB0187D75000d0 c6t600D0230006C1C4C0C50BE27386C4900d0
As you can see, it is perfectly legal and possible to combine the storage of two mirrors in one pool.
# zpool status
pool: mail
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
mail ONLINE 0 0 0
mirror ONLINE 0 0 0
c6t600D0230006C1C4C0C50BE5BC9D49100d0 ONLINE 0 0 0
c6t600D0230006B66680C50AB7821F0E900d0 ONLINE 0 0 0
mirror ONLINE 0 0 0
c6t600D0230006B66680C50AB0187D75000d0 ONLINE 0 0 0
c6t600D0230006C1C4C0C50BE27386C4900d0 ONLINE 0 0 0
# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
mail 6.81T 3.08T 3.73T 45% ONLINE -
As you can see, you can also use Sun MPxIO devices; they have LONG device names. You may also use FDISK partitions on x86 machines (...p0, ...p1, ...) and Solaris slices (...s0, ...s1, ...) to set up zpools. Neither is recommended for real use, but both are fine for playing with zpool commands.
The MPxIO names are usable because they show up just like normal block disk devices in /dev/dsk:
# ls -la /dev/dsk/c6t600D0230006C1C4C0C50BE5BC9D49100d0
lrwxrwxrwx 1 root root 65 Dec 11 06:22 /dev/dsk/c6t600D0230006C1C4C0C50BE5BC9D49100d0 -> ../../devices/scsi_vhci/disk@g600d0230006c1c4c0c50be5bc9d49100:wd
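For playing with slices or FDISK partitions as mentioned above, a pool can be created on them just like on whole disks. A hypothetical example using a slice (pool name and slice chosen arbitrarily, never use your boot disk for this):
# zpool create slicetest c1d1s0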