Apache Pulsar 2.1.0-incubating

August 6, 2018 · 5 min read

We are glad to present the new 2.1.0-incubating release of Pulsar. This release is the culmination of 2 months of work that has brought multiple new features and improvements to Pulsar.

In Pulsar 2.1 you'll see:

Pulsar IO connector framework and a list of builtin connectors
PIP-17: Tiered Storage
Pulsar Stateful Functions
Go Client
Avro and Protobuf Schemas

For detailed information, please check the release notes and the 2.1.0 documentation.

We'll provide a brief summary of these features in the sections below.

Pulsar IO

In Pulsar 2.0, we introduced Pulsar Functions, a serverless-inspired lightweight computing framework that provides the easiest possible way to implement application-specific in-stream processing logic of any complexity. A lot of developers love Pulsar Functions because they require minimal boilerplate and are easy to reason about.

In Pulsar 2.1, we continued to follow this "simplicity first" principle. We built the Pulsar IO (input/output) connector framework on top of Pulsar Functions to simplify getting data in and out of Apache Pulsar. You don't need to write a single line of code. All you need to do is prepare a configuration file for the system you want to connect to, and use the Pulsar admin CLI to submit a connector to Pulsar. Pulsar takes care of everything else, such as fault tolerance and rebalancing.

The 2.1 release ships with 6 built-in connectors.

You can follow the tutorial to try out Pulsar IO by connecting Pulsar with Apache Cassandra.

More connectors will be coming in future releases. If you are interested in contributing a connector to Pulsar, check out the guide on Developing Connectors. It is as simple as writing a Pulsar function.

Tiered Storage

One of the advantages of Apache Pulsar is its segment-based storage using Apache BookKeeper. You can store a topic backlog as large as you want. When the cluster starts to run out of space, you just add another storage node, and the system automatically picks up the new node and starts using it without rebalancing partitions. However, keeping everything in BookKeeper can get expensive as the backlog grows.

Pulsar mitigates this cost/size trade-off by providing Tiered Storage. Tiered Storage turns your Pulsar topics into truly infinite streams by offloading older segments into long-term storage, such as AWS S3, GCS and HDFS, which is designed for storing cold data. To the end user, there is no perceivable difference between consuming streams whose data is stored in BookKeeper and streams whose data is stored in long-term storage. All the underlying offloading mechanisms and metadata management are transparent to applications.

Currently, S3 is supported in 2.1. More offloaders (such as Google Cloud Storage, Azure Blobstore, and HDFS) are coming in future releases.
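
The idea of a topic whose older segments live in a different tier, while readers still see one continuous stream, can be sketched as follows. This is a simplified in-memory model and not Pulsar's implementation: the class names, the `Tier` enum, and the "keep the newest N segments hot" policy are all hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TieredLogSketch {

    // Hypothetical tiers: hot storage versus cheaper long-term storage.
    enum Tier { BOOKKEEPER, LONG_TERM }

    // A segment is an immutable chunk of the topic; only its tier changes.
    static final class Segment {
        final List<String> entries;
        Tier tier = Tier.BOOKKEEPER;

        Segment(List<String> entries) {
            this.entries = entries;
        }
    }

    static final class TopicLog {
        private final List<Segment> segments = new ArrayList<>();

        void addSegment(List<String> entries) {
            segments.add(new Segment(new ArrayList<>(entries)));
        }

        // Assumed policy: keep the newest `keepHot` segments in the hot tier
        // and mark every older segment as offloaded.
        void offload(int keepHot) {
            int cutoff = Math.max(0, segments.size() - keepHot);
            for (int i = 0; i < cutoff; i++) {
                segments.get(i).tier = Tier.LONG_TERM;
            }
        }

        // Readers iterate segments in order and never need to know the tier.
        List<String> readAll() {
            List<String> out = new ArrayList<>();
            for (Segment s : segments) {
                out.addAll(s.entries);
            }
            return out;
        }

        int countIn(Tier tier) {
            int n = 0;
            for (Segment s : segments) {
                if (s.tier == tier) {
                    n++;
                }
            }
            return n;
        }
    }

    public static void main(String[] args) {
        TopicLog log = new TopicLog();
        log.addSegment(Arrays.asList("a", "b"));
        log.addSegment(Arrays.asList("c"));
        log.addSegment(Arrays.asList("d", "e"));
        log.offload(1);
        System.out.println(log.readAll());                // prints [a, b, c, d, e]
        System.out.println(log.countIn(Tier.LONG_TERM));  // prints 2
    }
}
```

The point of the sketch is that offloading only changes per-segment metadata, so the read path and the order of entries stay the same before and after segments move.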

If you are interested in this feature, you can check out more details here.

Stateful Functions

The greatest challenge that stream processing engines face is managing state, and Pulsar Functions is no exception. Since the goal of Pulsar Functions is to simplify developing stream-native processing logic, we also want to provide an easier way for functions to manage their state. We introduced a State API that Pulsar Functions can use to store their state. It integrates with the table service in Apache BookKeeper, which is where the state is kept.

It is released as a developer preview feature in the Pulsar Functions Java SDK. We would like to collect feedback to improve it in future releases.
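
To give a feel for what a stateful function looks like, here is a minimal word-count sketch. It does not use the Pulsar SDK: `StateContext` and `InMemoryState` are hypothetical stand-ins written for this example, with counter-style methods whose names are assumptions, and the state is held in a plain map rather than in BookKeeper's table service. Check the Pulsar Functions documentation for the actual State API.

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountSketch {

    // Hypothetical stand-in for the state-related part of a function context.
    interface StateContext {
        void incrCounter(String key, long amount);
        long getCounter(String key);
    }

    // In-memory stand-in; a real function's state would be stored durably.
    static final class InMemoryState implements StateContext {
        private final Map<String, Long> counters = new HashMap<>();

        @Override
        public void incrCounter(String key, long amount) {
            counters.merge(key, amount, Long::sum);
        }

        @Override
        public long getCounter(String key) {
            return counters.getOrDefault(key, 0L);
        }
    }

    // The function body: split each message into words and bump one counter per word.
    static void process(String input, StateContext context) {
        for (String word : input.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                context.incrCounter(word, 1);
            }
        }
    }

    public static void main(String[] args) {
        StateContext state = new InMemoryState();
        process("hello pulsar", state);
        process("hello functions", state);
        System.out.println("hello=" + state.getCounter("hello"));   // prints hello=2
        System.out.println("pulsar=" + state.getCounter("pulsar")); // prints pulsar=1
    }
}
```

The function itself holds no fields; everything it remembers between messages goes through the context, which is what lets the runtime decide where and how the state is stored.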

Schemas

Pulsar 2.0 introduced native support for schemas. This means you can declare what message data looks like and have Pulsar enforce that producers can only publish valid data on a topic. In 2.0, Pulsar only supported String, bytes and JSON schemas. This release adds support for Avro and Protobuf.
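
The enforcement idea can be illustrated with a small self-contained sketch. It is a deliberate simplification and does not use the Pulsar client: `RecordSchema` and `TypedTopic` are hypothetical classes invented for this example, and validating each message against a field-to-type map stands in for what Pulsar does with real Avro, Protobuf, or JSON schemas in the client and broker.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaEnforcementSketch {

    // Hypothetical, simplified schema: field name mapped to the required Java type.
    static final class RecordSchema {
        private final Map<String, Class<?>> fields = new LinkedHashMap<>();

        RecordSchema field(String name, Class<?> type) {
            fields.put(name, type);
            return this;
        }

        // Valid means: exactly the declared fields, each holding the declared type.
        boolean accepts(Map<String, Object> message) {
            if (!message.keySet().equals(fields.keySet())) {
                return false;
            }
            for (Map.Entry<String, Class<?>> f : fields.entrySet()) {
                if (!f.getValue().isInstance(message.get(f.getKey()))) {
                    return false;
                }
            }
            return true;
        }
    }

    // Hypothetical topic that rejects any message not matching its schema.
    static final class TypedTopic {
        private final RecordSchema schema;
        private int published = 0;

        TypedTopic(RecordSchema schema) {
            this.schema = schema;
        }

        void publish(Map<String, Object> message) {
            if (!schema.accepts(message)) {
                throw new IllegalArgumentException("message does not match topic schema");
            }
            published++;
        }

        int publishedCount() {
            return published;
        }
    }

    public static void main(String[] args) {
        TypedTopic topic = new TypedTopic(
                new RecordSchema().field("id", Integer.class).field("name", String.class));

        Map<String, Object> good = new HashMap<>();
        good.put("id", 1);
        good.put("name", "sensor-a");
        topic.publish(good);

        Map<String, Object> bad = new HashMap<>();
        bad.put("id", "not-a-number");
        bad.put("name", "sensor-b");
        try {
            topic.publish(bad);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected"); // prints rejected
        }
        System.out.println(topic.publishedCount()); // prints 1
    }
}
```

The schema is attached to the topic rather than to each application, so every producer is held to the same contract and consumers can rely on the shape of what they read.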