Saturday, October 26, 2013

[SA] ZFS

ZFS Overview

ZFS is a file system developed by Sun Microsystems for the Solaris operating system (a branch of Unix).
ZFS provides logical volume management (through zpools) and also offers snapshots, copy-on-write clones, and self-healing.

Unlike traditional file systems, which are built on a single disk partition, ZFS is built on virtual storage pools called zpools. A zpool can be made up of multiple volumes, and file systems then draw their space from the zpool. Another distinguishing feature of ZFS is its 128-bit design: in theory it can hold 2^48 files, each file can be up to 16 EB, and a single volume can also reach 16 EB.

For data safety and stability, ZFS uses copy-on-write and snapshot techniques to help maintain file system integrity. Put simply, copy-on-write means that when a file changes, the old data is not overwritten in place; instead, the blocks in use are copied and the changes are written to those copies. The biggest benefit of copy-on-write is that the old data survives every change, which makes recovery straightforward. Snapshots are the mechanism that preserves an image of that old data. ZFS also offers another advanced feature: saving storage space by removing duplicate data (deduplication).




ZFS has a three-layer architecture. One or more ZFS file systems live in a ZFS pool, which in turn contains one or more devices (usually disks). The file systems in a pool share its resources and are not constrained to a fixed size.

Devices can be added to a pool while it is running, for example to grow the pool, and new file systems can be created in the pool without taking existing file systems offline.
ZFS supports snapshots of file systems and cloning of existing file systems.

ZFS manages all of the storage: volume management software (such as SVM or Veritas) is not required.

ZFS is managed with just two utility commands:
- zpool manages ZFS pools and the devices within them.
- zfs manages ZFS file systems.
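Both tools have a list subcommand that is safe to run at any time; a quick way to get oriented before doing anything else (output not shown here):

# zpool list
# zfs list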

Storage pools

Unlike traditional file systems, which reside on a single device or need a volume manager to span more than one device, ZFS is built on virtual storage pools called zpools (storage pools first appeared in AdvFS and were later added to Btrfs). Each storage pool is made up of virtual devices (vdevs). A vdev can be a raw disk, a RAID 1 mirror device, or a multi-disk group using a non-standard RAID level. The file systems on a zpool can then use the combined capacity of all of these virtual devices.
Quotas and reservations can be used to limit how much space a single file system in the pool is allowed to consume.
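As a minimal sketch of those two controls using the standard quota and reservation properties (the dataset name teat/data is simply the one created later in this post, and the sizes are arbitrary):

# zfs set quota=10G teat/data
# zfs set reservation=1G teat/data
# zfs get quota,reservation teat/data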


Pools
All ZFS file systems live in a pool, so the first step is to create a pool.
The zpool command is used to manage ZFS pools.

Before creating new pools, check whether any already exist, to avoid getting confused about which is which later on.
To check whether any pools exist, use zpool list:
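On a machine with no pools yet, the command simply reports that none are available:

# zpool list
no pools available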


Hot spare

A hot spare or hot standby is used as a failover mechanism to provide reliability in system configurations. The hot spare is active and connected as part of a working system. When a key component fails, the hot spare is switched into operation. More generally, a hot standby can be used to refer to any device or system that is held in readiness to overcome an otherwise significant start-up delay.
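ZFS supports hot spares at the pool level. As a rough sketch (the pool name mypool and the device name da5 below are placeholders, not devices used elsewhere in this post), a spare can be attached to an existing pool with zpool add and will then show up in zpool status:

# zpool add mypool spare da5
# zpool status mypool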

(The rest of this post follows the ZFS section of the FreeBSD Handbook and works through it hands-on.)

ZFS Tuning

Some of the features provided by ZFS are RAM-intensive, so some tuning may be required to provide maximum efficiency on systems with limited RAM.

Memory

At a bare minimum, the total system memory should be at least one gigabyte. The amount of recommended RAM depends upon the size of the pool and the ZFS features which are used. A general rule of thumb is 1 GB of RAM for every 1 TB of storage. If the deduplication feature is used, a general rule of thumb is 5 GB of RAM per TB of storage to be deduplicated. While some users successfully use ZFS with less RAM, it is possible that when the system is under heavy load, it may panic due to memory exhaustion. Further tuning may be required for systems with less than the recommended amount of RAM.
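As a quick worked example of that rule of thumb, a 4 TB pool would call for roughly 4 GB of RAM, or around 20 GB if the entire pool were to be deduplicated.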

Deduplication
Data deduplication is a technique for saving storage space. Computers hold a lot of duplicate data that takes up large amounts of disk space; with deduplication, only a single copy of that data needs to be stored. Another space-saving technique is data compression, which looks for repeated data over a relatively small range and at a fine granularity, typically a few bits to a few bytes. Deduplication, by contrast, looks for large duplicate blocks over a much wider range, with duplicate blocks typically 1 KB or larger. Deduplication is widely used in online storage services, e-mail systems, and disk-based backup appliances.
(1) Saves disk space: duplicate data does not have to be stored, so a great deal of disk space is saved.
(2) Improves write performance: the main bottleneck when writing to storage is the hard disk; as a mechanical device, a single disk typically delivers only around 100 MB/s of sequential write throughput. With inline deduplication, duplicates are removed before the data ever reaches the disk, so less data is written and write performance improves.
(3) Saves network bandwidth: with source-side deduplication, duplicate blocks are removed before the data is uploaded to the storage device, so they never have to cross the network. For example, Dropbox uses source-side deduplication and therefore consumes very little bandwidth, and the open-source synchronization tool rsync also saves bandwidth in a similar way.
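In ZFS, deduplication is switched on per dataset through the dedup property, and the pool-wide ratio can be read back afterwards; a minimal sketch against the teat pool created later in this post (keep the 5 GB of RAM per deduplicated TB rule above in mind before enabling it):

# zfs set dedup=on teat/compressed
# zpool get dedupratio teat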

Kernel Configuration

Due to the RAM limitations of the i386™ platform, users using ZFS on the i386™ architecture should add the following option to a custom kernel configuration file, rebuild the kernel, and reboot:
options KVA_PAGES=512

(The machine I am using this time is 64-bit, though, so this line is not needed.)

Loader Tunables

The kmem address space can be increased on all FreeBSD architectures. On a test system with one gigabyte of physical memory, success was achieved with the following options added to /boot/loader.conf, and the system restarted:
vm.kmem_size="330M"
vm.kmem_size_max="330M"
vfs.zfs.arc_max="40M"
vfs.zfs.vdev.cache.size="5M"
For a more detailed list of recommendations for ZFS-related tuning, see http://wiki.freebsd.org/ZFSTuningGuide.

Using ZFS

There is a startup mechanism that allows FreeBSD to mount ZFS pools during system initialization. To enable it, issue the following commands:
# echo 'zfs_enable="YES"' >> /etc/rc.conf  
# service zfs start (or # /etc/rc.d/zfs start)

The examples in this section use virtual SCSI disks with the device names da1 through da4. Users of IDE hardware should instead use ad device names.

Single Disk Pool

To create a simple, non-redundant ZFS pool using a single disk device, use zpool:

# zpool create teat /dev/da1

(I added a 5 GB virtual disk, da1, and used it to create a zpool named teat.)
To view the new pool, review the output of df:
# df -h


This output shows that the teat pool has been created and mounted. It is now accessible as a file system. Files may be created on it and users can browse it, as seen in the following example:

# cd /teat
# ls 
# touch testfile 
# ls -al 
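The pool itself can also be inspected directly with zpool (output omitted here; zpool status should show the single da1 device online):

# zpool list
# zpool status teat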


 
However, this pool is not taking advantage of any ZFS features.
To create a dataset on this pool with compression enabled:

# zfs create teat/compressed
# zfs set compression=gzip teat/compressed

The teat/compressed dataset is now a ZFS compressed file system.
Try copying some large files to /teat/compressed.
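To see how effective the compression is, the compressratio property can be read back afterwards (the exact ratio naturally depends on the data that was copied in):

# zfs get compressratio teat/compressed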

Compression can be disabled with:

# zfs set compression=off teat/compressed

To unmount a file system, issue the following command and then verify by using df:

# zfs umount teat/compressed  
# df 

If something is still using /teat/compressed when you try to unmount it, the command fails with:
cannot unmount '/teat/compressed': Device busy
Device busy means a process is still using the file system, for example a shell whose working directory is inside it or files that are still open. Once nothing is using it any more (in my case, after clearing out the files I had put there and leaving the directory), the umount goes through.

To re-mount the file system to make it accessible again, and verify with df:

# zfs mount teat/compressed
# df 

The pool and file system may also be observed by viewing the output from mount:
# mount 
ZFS datasets, after creation, may be used like any file systems. 
However, many other features are available which can be set on a per-dataset basis. In the following example, a new file system, data, is created. Important files will be stored here, so the file system is set to keep two copies of each data block:
# zfs create teat/data 
# zfs set copies=2 teat/data

It is now possible to see the data and space utilization by issuing df:

# df -h
Notice that each file system on the pool has the same amount of available space. This is the reason for using df in these examples, to show that the file systems use only the amount of space they need and all draw from the same pool. The ZFS file system does away with concepts such as volumes and partitions, and allows for several file systems to occupy the same pool.

To destroy the file systems and then destroy the pool as they are no longer needed:

# zfs destroy teat/compressed 
# zfs destroy teat/data  
# zpool destroy teat


ZFS RAID-Z

RAID-Z is not actually a kind of RAID, but a higher-level software technology that implements an integrated redundancy scheme in the ZFS file system similar to RAID 5. RAID-Z is a data-protection technology featured by ZFS in order to reduce the block overhead in mirroring.
RAID-Z avoids the RAID 5 "write hole" using copy-on-write; rather than overwriting data, it writes to a new location and then atomically overwrites the pointer to the old data. It avoids the need for read-modify-write operations for small writes by only ever performing full-stripe writes. Small blocks are mirrored instead of parity protected, which is possible because the file system is aware of the underlying storage structure and can allocate extra space if necessary. RAID-Z2 doubles the parity structure to achieve results similar to RAID 6: the ability to sustain up to two drive failures without losing data. In July 2009, triple-parity RAID-Z3 was added to provide increased redundancy due to the extended resilver times of multi-terabyte disks.
There is no way to prevent a disk from failing. One method of avoiding data loss due to a failed hard disk is to implement RAID. ZFS supports this feature in its pool design.

To create a RAID-Z pool, issue the following command and specify the disks to add to the pool:

# zpool create storage raidz da2 da3 da4

(I added three virtual 1 GB disks: da2, da3, and da4.)
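The layout of the new pool can be checked with zpool status, which should show a single raidz1 vdev containing da2, da3, and da4 (output omitted here):

# zpool status storage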

The zpool create command above creates the storage zpool. This may be verified using mount(8) and df(1). The next command creates a new file system called home in that pool:

# zfs create storage/home

It is now possible to enable compression and keep extra copies of directories and files using the following commands:

# zfs set copies=2 storage/home 
# zfs set compression=gzip storage/home

To make this the new home directory for users, copy the user data to this directory, and create the appropriate symbolic links:

# cp -rp /home/* /storage/home 
# rm -rf /home /usr/home 
# ln -s /storage/home /home  
# ln -s /storage/home /usr/home

Users should now have their data stored on the freshly created /storage/home.

Test by adding a new user and logging in as that user.
Try creating a snapshot which may be rolled back later:

# zfs snapshot storage/home@13-10-26

Note that the snapshot option will only capture a real file system, not a home directory or a file. The @ character is the delimiter between the file system (or volume) name and the snapshot name. When a user's home directory gets trashed, restore it with:

# zfs rollback storage/home@13-10-26

To get a list of all available snapshots, run ls in the file system's .zfs/snapshot directory. For example, to see the previously taken snapshot:

# ls /storage/home/.zfs/snapshot
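Snapshots can also be listed with zfs itself, which is a quick alternative to browsing the hidden directory:

# zfs list -t snapshot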

It is possible to write a script to perform regular snapshots on user data. However, over time, snapshots may consume a great deal of disk space. The previous snapshot may be removed using the following command:

# zfs destroy storage/home@13-10-26

After testing, /storage/home can be made the real /home using this command:

# zfs set mountpoint=/home storage/home

(For some reason this kept failing with "device busy"... most likely because something, such as a shell with its working directory under /storage/home or a logged-in user, was still using the file system at the time.)
Run df and mount to confirm that the system now treats the file system as the real /home:

# mount 
This completes the RAID-Z configuration. To get status updates about the file systems created during the nightly periodic(8) runs, issue the following command:

# echo 'daily_status_zfs_enable="YES"' >> /etc/periodic.conf

 

Recovering RAID-Z

Every software RAID has a method of monitoring its state. The status of RAID-Z devices may be viewed with the following command:

# zpool status -x

If all pools are healthy and everything is normal, the following message will be returned:

all pools are healthy

If there is a problem, for example a disk that has been taken offline, the pool state is instead reported as DEGRADED, with the affected device listed as OFFLINE in the zpool status output.
This indicates that the device was previously taken offline by the administrator using the following command:

# zpool offline storage da2

It is now possible to replace da2 after the system has been powered down. When the system is back online, the following command may be issued to replace the disk:

# zpool replace storage da2

From here, the status may be checked again, this time without the -x flag to get state information:

# zpool status storage 
As shown from this example, everything appears to be normal.

 




References:
Wikipedia
http://www.freebsd.org/doc/handbook/filesystems-zfs.html
http://ithelp.ithome.com.tw/question/10056961 
http://ithelp.ithome.com.tw/question/10057289 
https://wiki.freebsd.org/ZFSQuickStartGuide 

