Sharing a Small-File Merging Scheme

User 1260683

Published on 2020-07-14 08:36:41

This article is included in the column: Ceph对象存储方案 (Ceph Object Storage Solutions)

Existing problems

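Resource utilization & cost: Constrained by disk performance and hardware cost, we need to store massive numbers of small files while keeping hardware cost under control, and to improve resource utilization. In a single cluster holding a large number of small files (240 SATA disks, 600 million files in total, about 100 KB per file), the average disk capacity utilization is only 22%.
Read/write performance: As the number of files in the cluster grows, overall read/write performance drops sharply. There are two main causes. First, filestore uses the xfs file system underneath, and xfs is not suited to storing this many small files. Second, we use SMR SATA disks, which are also a poor fit for Ceph. See the documents below for details.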

https://blog.widodh.nl/2017/02/do-not-use-smr-disks-with-ceph/
https://copyfuture.com/blogs-details/201911061902186294pksqoqhzwcm79x Lessons from ten years of Ceph evolution: disk file systems are not suitable as a distributed storage backend
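To put the cluster figures quoted above in perspective, here is a rough back-of-the-envelope calculation. It uses only the numbers stated in this article (240 disks, 600 million files, about 100 KB each). Reading "100KB" as 100 KiB is an assumption, and replication is deliberately left out because the article does not state the replica count.

```python
# Back-of-the-envelope numbers for the cluster described in this article.
DISKS = 240
FILES = 600_000_000
AVG_FILE_BYTES = 100 * 1024  # assumption: "100KB" read as 100 KiB

# Total user data for a single copy (replicas not counted).
logical_bytes = FILES * AVG_FILE_BYTES

# Files each disk has to hold if a single copy is spread evenly.
files_per_disk = FILES // DISKS

print(f"logical data (one copy): {logical_bytes / 1024**4:.1f} TiB")
print(f"files per disk (one copy): {files_per_disk:,}")
```

Even before replicas are counted, each disk's file system ends up tracking millions of small files, which is the situation the designs below try to avoid by packing many small files into a few large ones.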

Haystack

Facebook's Haystack design paper. https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf

SeaweedFS

SeaweedFS is optimized for small files. Small files are stored as one continuous block of content, with at most 8 unused bytes between files. Small file access is O(1) disk read.

https://github.com/chrislusf/seaweedfs#compared-to-glusterfs-ceph
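The idea quoted above, many small files packed back to back in one large file and located through an in-memory index, can be sketched as follows. This is a minimal illustration and not SeaweedFS's actual on-disk format: the `Volume` class, the 8-byte alignment, and the `(offset, size)` index layout are all assumptions made for the example, and an in-memory buffer stands in for the volume file.

```python
import io

ALIGN = 8  # assumption: records start on 8-byte boundaries, so padding is < 8 bytes


class Volume:
    """Hypothetical volume: many small files stored in one contiguous blob."""

    def __init__(self):
        self.data = io.BytesIO()  # stands in for one large on-disk volume file
        self.index = {}           # file_id -> (offset, size), kept in memory

    def put(self, file_id, content):
        # Always append at the end so existing records are never moved.
        self.data.seek(0, io.SEEK_END)
        offset = self.data.tell()
        self.data.write(content)
        # Pad to the next aligned boundary; this is the only wasted space.
        self.data.write(b"\0" * (-len(content) % ALIGN))
        self.index[file_id] = (offset, len(content))
        return offset

    def get(self, file_id):
        # One dictionary lookup, then a single seek and read.
        offset, size = self.index[file_id]
        self.data.seek(offset)
        return self.data.read(size)
```

Because the index lives in memory, a read costs one lookup plus one positioned read of the volume, with no directory traversal or per-file inode lookup.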

ambry

https://github.com/linkedin/ambry/wiki/Store

The data node maintains a file per replicated store. We call this file the on-disk log. The on-disk log is a pre-allocated file in a standard linux file system (ext4/xfs). In Ambry, we pre-allocate a file for each on-disk log. The basic idea for the replicated store is the following: on put, append blobs to the end of the pre-allocated file so as to encourage a sequential write workload. Any gets that are serviced by the replicated store may incur a random disk IO, but we expect good locality in the page cache. Deletes, like puts, are appended as a record at the end of the file.
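The append-only log described above can be sketched roughly as below. This is an illustration under assumptions, not Ambry's record format: the `OnDiskLog` class, the `HEADER` layout, and the `PUT`/`DELETE` codes are hypothetical, and `truncate` only sets the file size, whereas a real store would reserve the blocks up front.

```python
import struct
import tempfile

# Hypothetical record header: kind (1 byte), blob id (8 bytes), payload length (4 bytes).
HEADER = struct.Struct(">BQI")
PUT, DELETE = 1, 2


class OnDiskLog:
    """Sketch of a fixed-size log file that only ever appends records."""

    def __init__(self, path, capacity):
        self.f = open(path, "w+b")
        # Sets the file size only (usually a sparse file); real preallocation
        # would reserve blocks, e.g. with os.posix_fallocate on Unix.
        self.f.truncate(capacity)
        self.capacity = capacity
        self.end = 0  # offset where the next record will be appended

    def _append(self, kind, blob_id, payload=b""):
        record = HEADER.pack(kind, blob_id, len(payload)) + payload
        if self.end + len(record) > self.capacity:
            raise IOError("log is full")
        start = self.end
        self.f.seek(start)
        self.f.write(record)
        self.end += len(record)
        return start  # the caller would store this offset in an index

    def put(self, blob_id, payload):
        return self._append(PUT, blob_id, payload)

    def delete(self, blob_id):
        # A delete is just another appended record; nothing is rewritten in place.
        return self._append(DELETE, blob_id)

    def read(self, offset):
        self.f.seek(offset)
        kind, blob_id, size = HEADER.unpack(self.f.read(HEADER.size))
        return kind, blob_id, self.f.read(size)
```

Writes always land at `self.end`, so the disk sees a sequential write pattern, while reads jump to whatever offset the caller supplies.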

To be able to service random reads of either user metadata or blobs, the replicated store must maintain an index that maps blob IDs to specific offsets in the on-disk log. We store other attributes as well in this index such as delete flags and ttl values for each blob. The index is designed as a set of sorted files. The most recent index segment is in memory. The older segments are memory mapped and an entry is located by doing a binary search on them. The search moves from the most recent to the oldest. This makes it easy to identify the deleted entry before the put entry.
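The newest-to-oldest lookup over sorted index segments can be sketched as below. The `IndexSegment` class, the `(blob_id, offset, deleted)` entry layout, and the `lookup` helper are hypothetical names for this example. Plain Python lists stand in for the memory-mapped segment files, and the sketch assumes each blob ID appears at most once per segment.

```python
import bisect


class IndexSegment:
    """Immutable sorted run of (blob_id, offset, deleted) entries."""

    def __init__(self, entries):
        self.entries = sorted(entries)            # ordered by blob_id
        self.keys = [e[0] for e in self.entries]  # parallel key list for bisect

    def find(self, blob_id):
        # Binary search inside one segment.
        i = bisect.bisect_left(self.keys, blob_id)
        if i < len(self.keys) and self.keys[i] == blob_id:
            return self.entries[i]
        return None


def lookup(segments, blob_id):
    """Return the log offset of a live blob, or None if missing or deleted.

    `segments` is ordered oldest to newest; the search runs newest first.
    """
    for segment in reversed(segments):
        entry = segment.find(blob_id)
        if entry is not None:
            _, offset, deleted = entry
            # A delete entry in a newer segment hides the older put entry.
            return None if deleted else offset
    return None
```

Searching from the newest segment means a delete marker is always seen before the put it cancels, so the lookup can stop at the first match.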