Pulsar's Tiered Storage feature allows older backlog data to be moved from BookKeeper to long term and cheaper storage, while still allowing clients to access the backlog as if nothing has changed.
Tiered storage uses Apache jclouds to support cloud storage providers, such as Amazon S3 and Google Cloud Storage (GCS), for long term storage. With jclouds, it is easy to add support for more cloud storage providers in the future.
Tiered storage uses Apache Hadoop to support filesystems for long term storage.
With Hadoop, it is easy to add support for more filesystems in the future.
For more information about how to use the filesystem offloader with Pulsar, see here.
When to use tiered storage?
Tiered storage should be used when you have a topic for which you want to keep a very long backlog.
For example, if you have a topic containing user actions which you use to train your recommendation systems, you may want to keep that data for a long time, so that if you change your recommendation algorithm, you can rerun it against your full user history.
How does tiered storage work?
A topic in Pulsar is backed by a log, known as a managed ledger. This log is composed of an ordered list of segments. Pulsar only writes to the final segment of the log; all previous segments are sealed, and the data within them is immutable. This is known as a segment-oriented architecture.
The tiered storage offloading mechanism takes advantage of this segment-oriented architecture. When offloading is requested, the segments of the log are copied to tiered storage one by one. All segments of the log, apart from the segment currently being written to, can be offloaded.
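The sealed-segment rule above can be sketched as a toy model (this is an illustration of the concept, not Pulsar's actual implementation):

```python
# Illustrative model: a managed ledger is an ordered list of segments,
# and only sealed segments are eligible for offload to tiered storage.

def offloadable_segments(segments):
    """Return the segments that may be copied to tiered storage.

    `segments` is an ordered list; the final entry is the open segment
    that Pulsar is still writing to, so it is excluded. All earlier
    segments are sealed and immutable, hence safe to copy one by one.
    """
    return segments[:-1]

ledger = ["segment-0", "segment-1", "segment-2", "segment-3"]
print(offloadable_segments(ledger))  # every segment except the current one
```

Because sealed segments never change, the copy to tiered storage can proceed in the background without coordinating with writers.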
Data written to BookKeeper is replicated to three physical machines by default. However, once a segment is sealed in BookKeeper, it becomes immutable and can be copied to long term storage. Long term storage can reduce storage costs by keeping fewer physical copies of the data, for example by using mechanisms such as Reed-Solomon error correction.
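A back-of-the-envelope comparison makes the cost difference concrete. The 3-replica figure is BookKeeper's default mentioned above; the Reed-Solomon parameters RS(10, 4) are an assumption chosen for illustration, not a Pulsar default:

```python
# Raw bytes stored per logical byte, for full replication versus a
# Reed-Solomon code with k data fragments and m parity fragments.

def replication_overhead(replicas: int) -> float:
    # Every logical byte is stored `replicas` times in full.
    return float(replicas)

def reed_solomon_overhead(k: int, m: int) -> float:
    # k data fragments plus m parity fragments reconstruct the data,
    # so physical usage is (k + m) / k of the logical size.
    return (k + m) / k

print(replication_overhead(3))       # 3.0x raw storage per logical byte
print(reed_solomon_overhead(10, 4))  # 1.4x raw storage per logical byte
```

With these illustrative parameters, erasure-coded long term storage holds less than half the raw bytes that 3-way replication would, while still tolerating fragment loss.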
Before offloading ledgers to long term storage, you need to configure buckets, credentials, and other properties for the cloud storage service. Additionally, Pulsar uses multi-part objects to upload the segment data, and brokers may crash while uploading the data. It is recommended that you add a lifecycle rule to your bucket to expire incomplete multi-part uploads after one or two days, to avoid being charged for incomplete uploads. Moreover, you can trigger the offloading operation manually (via REST API or CLI) or automatically (via CLI).
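As a sketch, configuring the AWS S3 offload driver involves broker.conf properties along these lines (the bucket name and region below are placeholders; check your Pulsar version's configuration reference for the full set of properties):

```
# Select the S3 offload driver and point it at a bucket.
managedLedgerOffloadDriver=aws-s3
s3ManagedLedgerOffloadBucket=pulsar-topic-offload
s3ManagedLedgerOffloadRegion=eu-west-3
```

A manual offload can then be triggered from the CLI with `bin/pulsar-admin topics offload --size-threshold 10M my-tenant/my-namespace/my-topic`, and its progress checked with `bin/pulsar-admin topics offload-status`.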
After offloading ledgers to long term storage, you can still query data in the offloaded ledgers with Pulsar SQL.
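For example, Pulsar SQL exposes each topic as a table in the `pulsar` catalog, keyed by its `tenant/namespace` schema; the topic name below is hypothetical:

```sql
-- Query a topic whose older ledgers may live in tiered storage;
-- offloaded and non-offloaded data are read transparently.
SELECT * FROM pulsar."public/default"."user-actions" LIMIT 10;
```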
For more information about tiered storage for Pulsar topics, see here.