Tiered Storage
Pulsar's Tiered Storage feature allows older backlog data to be offloaded to long term storage, thereby freeing up space in BookKeeper and reducing storage costs. This cookbook walks you through using tiered storage in your Pulsar cluster.
- Tiered storage uses Apache jclouds to support Amazon S3 and Google Cloud Storage (GCS) for long-term storage. With jclouds, it is easy to add support for more cloud storage providers in the future.
- Tiered storage uses Apache Hadoop to support filesystems for long-term storage. With Hadoop, it is easy to add support for more filesystems in the future.
When should I use Tiered Storage?
Tiered storage should be used when you have a topic whose backlog you want to retain for a long time. For example, if you have a topic containing user actions which you use to train your recommendation systems, you may want to keep that data for a long time, so that if you change your recommendation algorithm you can rerun it against your full user history.
The offloading mechanism
A topic in Pulsar is backed by a log, known as a managed ledger. This log is composed of an ordered list of segments. Pulsar only ever writes to the final segment of the log. All previous segments are sealed, and the data within them is immutable. This is known as a segment-oriented architecture.
The Tiered Storage offloading mechanism takes advantage of this segment-oriented architecture. When offloading is requested, the segments of the log are copied, one-by-one, to tiered storage. All segments of the log, apart from the segment currently being written to, can be offloaded.
On the broker, the administrator must configure the bucket and credentials for the cloud storage service. The configured bucket must exist before attempting to offload. If it does not exist, the offload operation will fail.
Pulsar uses multi-part objects to upload the segment data, and it is possible that a broker could crash while uploading. We recommend you add a lifecycle rule to your bucket to expire incomplete multi-part uploads after a day or two, to avoid being charged for them.
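As a sketch, an S3 bucket lifecycle configuration that aborts incomplete multi-part uploads after two days could look like the following (the rule ID is arbitrary; apply it with your usual S3 tooling, for example the aws s3api put-bucket-lifecycle-configuration command):

```json
{
  "Rules": [
    {
      "ID": "abort-incomplete-offload-uploads",
      "Status": "Enabled",
      "Filter": {},
      "AbortIncompleteMultipartUpload": {
        "DaysAfterInitiation": 2
      }
    }
  ]
}
```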
When ledgers are offloaded to long term storage, you can still query data in the offloaded ledgers with Pulsar SQL.
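Offloading is triggered per topic via the pulsar-admin CLI. As a sketch (the tenant, namespace, and topic names below are placeholders), the offload command copies sealed segments until at most the given amount of backlog remains in BookKeeper, and offload-status reports on the operation:

```shell
# Offload segments until at most 10M of backlog remains in BookKeeper
bin/pulsar-admin topics offload --size-threshold 10M \
  persistent://my-tenant/my-namespace/my-topic

# Check on the progress of the offload operation
bin/pulsar-admin topics offload-status \
  persistent://my-tenant/my-namespace/my-topic
```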
Configuring the offload driver
Offloading is configured in broker.conf.
At a minimum, the administrator must configure the driver, the bucket, and the authenticating credentials. There are also other knobs to configure, such as the bucket region and the maximum block size in the backing storage.
Currently the following driver types are supported:
- aws-s3: Simple Storage Service (S3)
- google-cloud-storage: Google Cloud Storage
- filesystem: Filesystem Storage
Driver names are case-insensitive. There is an additional driver type, s3, which is identical to aws-s3 except that it requires you to specify an endpoint URL using s3ManagedLedgerOffloadServiceEndpoint. This is useful when using an S3-compatible data store other than AWS.
managedLedgerOffloadDriver=aws-s3
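For an S3-compatible data store, a sketch of the corresponding broker.conf entries might look like the following (the endpoint URL and bucket name are placeholders for your own deployment):

```conf
managedLedgerOffloadDriver=s3
s3ManagedLedgerOffloadBucket=pulsar-topic-offload
# Endpoint of the S3-compatible store (placeholder URL)
s3ManagedLedgerOffloadServiceEndpoint=http://s3.example.internal:9000
```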
"aws-s3" Driver configuration​
Bucket and Region
Buckets are the basic containers that hold your data. Everything that you store in Cloud Storage must be contained in a bucket. You can use buckets to organize your data and control access to your data, but unlike directories and folders, you cannot nest buckets.
s3ManagedLedgerOffloadBucket=pulsar-topic-offload
The bucket region is the region where the bucket is located. It is not a required configuration, but it is recommended. If it is not configured, the default region will be used.
With AWS S3, the default region is US East (N. Virginia). The AWS Regions and Endpoints page contains more information.
s3ManagedLedgerOffloadRegion=eu-west-3
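Putting the settings above together, a minimal aws-s3 section of broker.conf could look like the following sketch (the last line is optional and assumes the s3ManagedLedgerOffloadMaxBlockSizeInBytes knob for the max block size mentioned earlier; all values are illustrative):

```conf
managedLedgerOffloadDriver=aws-s3
s3ManagedLedgerOffloadBucket=pulsar-topic-offload
s3ManagedLedgerOffloadRegion=eu-west-3
# Maximum size of a block uploaded per multi-part request (64 MB here)
s3ManagedLedgerOffloadMaxBlockSizeInBytes=67108864
```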
Authentication with AWS
To access AWS S3, you need to authenticate. Pulsar does not provide any direct means of configuring authentication for AWS S3; instead, it relies on the mechanisms supported by the DefaultAWSCredentialsProviderChain.
Once you have created a set of credentials in the AWS IAM console, they can be configured in a number of ways.
- Use EC2 instance metadata credentials
If you are on an AWS instance with an instance profile that provides credentials, Pulsar will use those credentials if no other mechanism is provided.
- Set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in conf/pulsar_env.sh.
export AWS_ACCESS_KEY_ID=ABC123456789
export AWS_SECRET_ACCESS_KEY=ded7db27a4558e2ea8bbf0bf37ae0e8521618f366c
"export" is important so that the variables are made available in the environment of spawned processes.
- Add the Java system properties aws.accessKeyId and aws.secretKey to PULSAR_EXTRA_OPTS in conf/pulsar_env.sh.
PULSAR_EXTRA_OPTS="${PULSAR_EXTRA_OPTS} ${PULSAR_MEM} ${PULSAR_GC} -Daws.accessKeyId=ABC123456789 -Daws.secretKey=ded7db27a4558e2ea8bbf0bf37ae0e8521618f366c -Dio.netty.leakDetectionLevel=disabled -Dio.netty.recycler.maxCapacityPerThread=4096"
- Set the access credentials in ~/.aws/credentials.
[default]
aws_access_key_id=ABC123456789
aws_secret_access_key=ded7db27a4558e2ea8bbf0bf37ae0e8521618f366c
- Assume an IAM role
If you want to assume an IAM role, specify the following:
s3ManagedLedgerOffloadRole=<aws role arn>
s3ManagedLedgerOffloadRoleSessionName=pulsar-s3-offload
This will use the DefaultAWSCredentialsProviderChain for assuming this role.
The broker must be restarted for credentials specified in conf/pulsar_env.sh to take effect.