Tiered Storage
Pulsar's segment-oriented architecture allows for topic backlogs to grow very large, effectively without limit. However, this can become expensive over time.
One way to alleviate this cost is to use Tiered Storage. With tiered storage, older messages in the backlog can be moved from BookKeeper to a cheaper storage mechanism, while still allowing clients to access the backlog as if nothing had changed.
How Tiered Storage Works
Tiered storage leverages Pulsar's segment-based architecture where data is stored in immutable segments (ledgers) in BookKeeper. When segments are sealed and become read-only, they can be safely offloaded to external storage systems.
Offloading Process
- Segment Sealing: When a BookKeeper ledger is closed (due to size limits, time limits, or manual triggers), it becomes immutable.
- Eligibility Check: The broker determines which segments are eligible for offloading based on configured policies.
- Data Transfer: Eligible segments are copied to the configured long-term storage backend.
- Metadata Update: BookKeeper metadata is updated to reference the external storage location.
- Local Deletion: After a configurable delay (default: 4 hours), the original data is deleted from BookKeeper.
Transparent Access
Consumers can access offloaded data transparently. When a consumer requests data that has been offloaded:
- The broker checks BookKeeper metadata to determine the storage location
- If data is in external storage, the broker retrieves it seamlessly
- The consumer receives the data as if it were still in BookKeeper
Supported Storage Backends
Pulsar supports multiple storage backends for tiered storage:
Cloud Storage Providers
- Amazon S3: Industry-standard object storage with multiple storage classes
- Google Cloud Storage (GCS): Google's object storage with lifecycle management
- Microsoft Azure Blob Storage: Azure's object storage solution
- Alibaba Cloud OSS: Alibaba's object storage service
On-Premises Solutions
- Filesystem: Local or network-attached storage for on-premises deployments
- S3-Compatible Storage: MinIO, Ceph, and other S3-compatible solutions
Storage Classes and Cost Optimization
Different storage backends offer various storage classes:
- Hot storage: Immediate access, higher cost (S3 Standard, GCS Standard)
- Cool storage: Infrequent access, lower cost (S3 IA, GCS Nearline)
- Cold storage: Archive storage, lowest cost (S3 Glacier, GCS Coldline)
Configuration and Policies
Offloading Triggers
Tiered storage can be triggered by:
- Size-based policies: Offload when topic backlog exceeds a certain size
- Time-based policies: Offload data older than a specified age
- Manual triggers: Administrative commands via REST API or CLI
- Namespace-level policies: Automatic offloading based on namespace configuration
Key Configuration Parameters
- Offload threshold: Minimum backlog size before offloading begins
- Offload deletion lag: Delay before deleting data from BookKeeper after offloading
- Max block size: Maximum size of data blocks uploaded to external storage
- Read buffer size: Buffer size for reading offloaded data
- Offload driver: Backend storage driver configuration
Performance Considerations
Read Performance
- First access: Reading offloaded data is slower than BookKeeper due to network latency
- Caching: Some implementations cache frequently accessed offloaded data
- Prefetching: Brokers may prefetch data based on access patterns
Write Performance
- Asynchronous offloading: Offloading occurs in the background without affecting write performance
- Parallel transfers: Multiple segments can be offloaded concurrently
- Bandwidth management: Configurable limits to prevent offloading from overwhelming network resources
Cost Benefits
Data written to BookKeeper is replicated to 3 physical machines by default. However, once a segment is sealed in BookKeeper it becomes immutable and can be copied to long term storage. Long term storage can achieve cost savings by using mechanisms such as Reed-Solomon error correction to require fewer physical copies of data.
Cost savings come from:
- Reduced replication: External storage typically requires fewer replicas
- Storage class optimization: Use cheaper storage classes for older data
- Operational efficiency: Reduced BookKeeper cluster storage requirements
Use Cases and Benefits
Long-term Data Retention
- Compliance requirements: Meet regulatory requirements for data retention
- Historical analysis: Maintain access to historical data for analytics
- Audit trails: Preserve message history for compliance and auditing
Cost Optimization
- Storage cost reduction: Move cold data to cheaper storage tiers
- Operational efficiency: Reduce BookKeeper storage requirements
- Elastic scaling: Handle varying data volumes without over-provisioning
Management and Operations
Monitoring
- Offloading metrics: Track offloading progress and performance
- Storage usage: Monitor storage consumption across tiers
- Access patterns: Analyze data access patterns for optimization
Automation
- Policy-based management: Automated offloading based on predefined policies
- Lifecycle management: Integration with cloud provider lifecycle policies
- Alerting: Notifications for offloading failures or threshold breaches
For detailed setup instructions and configuration examples, see the Tiered storage cookbook.