一旦訂閱被創建，所有的信息都將被 Pulsar 保留，即使consumer已斷開連接。 只有在 consumer 確認信息被成功處理後，先前保留下來的消息才會被丟棄。
信息是 Pulsar 的基礎單元。 信息就是 producer 發給 topic 的內容，以及 consumer 從 topic 消費的內容（信息處理完成後發送確認）。 信息類似於郵政系統中的信件。
|Value / data payload||The data carried by the message. All Pulsar messages carry raw bytes, although message data can also conform to data schemas|
|Key||信息可以隨附一個信息鍵 (message key)。這對諸如 topic 壓實 之類的事情有特別作用。|
|Producer 名稱||生產信息的 producer 名稱 ( producer 被自動賦予預設名稱，但你也可以自己指定)。|
|序列 ID||Topic 中，每個 Pulsar 信息屬於一個有序的序列。信息的序列 ID 是它在序列中的順序。|
|發佈時間||信息發佈的時間戳( producer 自動附上)。|
Pulsar 信息內容的更深入分解，請參考 Pulsar 的 binary protocol 文檔。
一個 producer 是關聯到topic的程序，它發布信息到 Pulsar 的 broker 上。
Producer 可以以同步（sync）或者異步（async）的方式發佈信息到 broker。
|Sync send||發送信息後，producer 等待 broker 的確認。如果沒有收到確認，producer 會認為發送失敗。|
|Async send||Producer 將會把信息放入它內部的 blocking 佇列，然後馬上返回。 然後，client 將在背景將信息發送給 broker。 If the queue is full (max size configurable, the producer could be blocked or fail immediately when calling the API, depending on arguments passed to the producer.|
Messages published by producers can be compressed during transportation in order to save bandwidth. Pulsar currently supports two types of compression:
If batching is enabled, the producer will accumulate and send a batch of messages in a single request. Batching size is defined by the maximum number of messages and maximum publish latency.
A consumer is a process that attaches to a topic via a subscription and then receives messages.
Messages can be received from brokers either synchronously (sync) or asynchronously (async).
|Async receive||異步接收會立即返回 future 物件 - 例如 java 中的 |
When a consumer has successfully processed a message, it needs to send an acknowledgement to the broker so that the broker can discard the message (otherwise it stores the message).
Messages can be acknowledged either one by one or cumulatively. With cumulative acknowledgement, the consumer only needs to acknowledge the last message it received. All messages in the stream up to (and including) the provided message will not be re-delivered to that consumer.
Cumulative acknowledgement cannot be used with shared subscription mode, because shared mode involves multiple consumers having access to the same subscription.
Client libraries can provide their own listener implementations for consumers. The Java client, for example, provides a MesssageListener
interface. In this interface, the
received method is called whenever a new message is received.
|定義了 topic 類型。 Pulsar 支持兩種不同類型的 topic：持久 和 非持久 （預設是持久類型，如果你沒有指明類型，topic將會是持久類型）。 持久型 topic 的所有信息都會 持久化 地存儲到硬碟上（這意味著多個硬碟，除非是單機模式的 broker ），反之，非持久topic 的數據則不會存儲到硬碟上。|
|Pulsar 集群中 topic 的租戶。 tenant 是 Pulsar 多租戶設計的重要元素，可以被跨集群的傳播。|
|Topic 的管理單元，作為 topic 分組的管理機制。 大多數的 topic 設定都是在 namespace 層級生效。 每個 tenant 可以有多個 namespace。|
|名稱的最後組成部分。Topic 的命名很自由，對於 Pulsar 來說，它沒有什麼特殊的含義。|
No need to explicitly create new topics
You don't need to explicitly create topics in Pulsar. If a client attempts to write or receive messages to/from a topic that does not yet exist, Pulsar will automatically create that topic under the namespace provided in the topic name.
A namespace is a logical nomenclature within a tenant. A tenant can create multiple namespaces via the admin API. 例如，一個對接多個應用的租戶，可以為每個應用創建不同的namespace。 A namespace allows the application to create and manage a hierarchy of topics. The topic
my-tenant/app1 is a namespace for the application
my-tenant. You can create any number of topics under the namespace.
A subscription is a named configuration rule that determines how messages are delivered to consumers. There are three available subscription modes in Pulsar: exclusive, shared, and failover. These modes are illustrated in the figure below.
In exclusive mode, only a single consumer is allowed to attach to the subscription. If more than one consumer attempts to subscribe to a topic using the same subscription, the consumer receives an error.
In the diagram above, only Consumer A-0 is allowed to consume messages.
Exclusive mode is the default subscription mode.
In shared or round robin mode, multiple consumers can attach to the same subscription. Messages are delivered in a round robin distribution across consumers, and any given message is delivered to only one consumer. When a consumer disconnects, all the messages that were sent to it and not acknowledged will be rescheduled for sending to the remaining consumers.
In the diagram above, Consumer-B-1 and Consumer-B-2 are able to subscribe to the topic, but Consumer-C-1 and others could as well.
Limitations of shared mode
There are two important things to be aware of when using shared mode: * Message ordering is not guaranteed. * You cannot use cumulative acknowledgment with shared mode.
In failover mode, multiple consumers can attach to the same subscription. The consumers will be lexically sorted by the consumer's name and the first consumer will initially be the only one receiving messages. This consumer is called the master consumer.
When the master consumer disconnects, all (non-acked and subsequent) messages will be delivered to the next consumer in line.
In the diagram above, Consumer-C-1 is the master consumer while Consumer-C-2 would be the next in line to receive messages if Consumer-C-1 disconnected.
When a consumer subscribes to a Pulsar topic, by default it subscribes to one specific topic, such as
persistent://public/default/my-topic. As of Pulsar version 1.23.0-incubating, however, Pulsar consumers can simultaneously subscribe to multiple topics. You can define a list of topics in two ways:
- 通過明確指定的 topic 列表
When subscribing to multiple topics by regex, all topics must be in the same namespace
When subscribing to multiple topics, the Pulsar client will automatically make a call to the Pulsar API to discover the topics that match the regex pattern/list and then subscribe to all of them. If any of the topics don't currently exist, the consumer will auto-subscribe to them once the topics are created.
No ordering guarantees across multiple topics
When a producer sends messages to a single topic, all messages are guaranteed to be read from that topic in the same order. However, these guarantees do not hold across multiple topics. So when a producer sends message to multiple topics, the order in which messages are read from those topics is not guaranteed to be the same.
Here are some multi-topic subscription examples for Java:
import java.util.regex.Pattern; import org.apache.pulsar.client.api.Consumer; import org.apache.pulsar.client.api.PulsarClient; PulsarClient pulsarClient = // Instantiate Pulsar client object // Subscribe to all topics in a namespace Pattern allTopicsInNamespace = Pattern.compile("persistent://public/default/.*"); Consumer allTopicsConsumer = pulsarClient.subscribe(allTopicsInNamespace, "subscription-1"); // Subscribe to a subsets of topics in a namespace, based on regex Pattern someTopicsInNamespace = Pattern.compile("persistent://public/default/foo.*"); Consumer someTopicsConsumer = pulsarClient.subscribe(someTopicsInNamespace, "subscription-1");
For code examples, see:
Normal topics can be served only by a single broker, which limits the topic's maximum throughput. Partitioned topics are a special type of topic that be handled by multiple brokers, which allows for much higher throughput.
Behind the scenes, a partitioned topic is actually implemented as N internal topics, where N is the number of partitions. When publishing messages to a partitioned topic, each message is routed to one of several brokers. The distribution of partitions across brokers is handled automatically by Pulsar.
The diagram below illustrates this:
Here, the topic Topic1 has five partitions (P0 through P4) split across three brokers. Because there are more partitions than brokers, two brokers handle two partitions a piece, while the third handles only one (again, Pulsar handles this distribution of partitions automatically).
Messages for this topic are broadcast to two consumers. The routing mode determines both which broker handles each partition, while the subscription mode determines which messages go to which consumers.
Decisions about routing and subscription modes can be made separately in most cases. In general, throughput concerns should guide partitioning/routing decisions while subscription decisions should be guided by application semantics.
There is no difference between partitioned topics and normal topics in terms of how subscription modes work, as partitioning only determines what happens between when a message is published by a producer and processed and acknowledged by a consumer.
Partitioned topics need to be explicitly created via the admin API. The number of partitions can be specified when creating the topic.
When publishing to partitioned topics, you must specify a routing mode. The routing mode determines which partition---that is, which internal topic---each message should be published to.
There are three routing modes available by default:
|Key hash||If a key property has been specified on the message, the partitioned producer will hash the key and assign it to a particular partition.||Per-key-bucket ordering|
|Single default partition||If no key is provided, each producer's message will be routed to a dedicated partition, initially random selected||Per-producer ordering|
|Round robin distribution||If no key is provided, all messages will be routed to different partitions in round-robin fashion to achieve maximum throughput.||None|
By default, Pulsar persistently stores all unacknowledged messages on multiple BookKeeper bookies (storage nodes). Data for messages on persistent topics can thus survive broker restarts and subscriber failover.
Pulsar also, however, supports non-persistent topics, which are topics on which messages are never persisted to disk and live only in memory. When using non-persistent delivery, killing a Pulsar broker or disconnecting a subscriber to a topic means that all in-transit messages are lost on that (non-persistent) topic, meaning that clients may see message loss.
Non-persistent topics have names of this form (note the
non-persistent in the name):
For more info on using non-persistent topics, see the Non-persistent messaging cookbook.
In non-persistent topics, brokers immediately deliver messages to all connected subscribers without persisting them in BookKeeper. If a subscriber is disconnected, the broker will not be able to deliver those in-transit messages, and subscribers will never be able to receive those messages again. Eliminating the persistent storage step makes messaging on non-persistent topics slightly faster than on persistent topics in some cases, but with the caveat that some of the core benefits of Pulsar are lost.
With non-persistent topics, message data lives only in memory. If a message broker fails or message data can otherwise not be retrieved from memory, your message data may be lost. Use non-persistent topics only if you're certain that your use case requires it and can sustain it.
Non-persistent messaging is usually faster than persistent messaging because brokers don't persist messages and immediately send acks back to the producer as soon as that message is deliver to all connected subscribers. Producers thus see comparatively low publish latency with non-persistent topic.
Producers and consumers can connect to non-persistent topics in the same way as persistent topics, with the crucial difference that the topic name must start with
non-persistent. All three subscription modes---exclusive, shared, and failover---are supported for non-persistent topics.
Here's an example Java consumer for a non-persistent topic:
PulsarClient client = PulsarClient.create("pulsar://localhost:6650"); String npTopic = "non-persistent://public/default/my-topic"; String subscriptionName = "my-subscription-name"; Consumer consumer = client.subscribe(npTopic, subscriptionName);
Here's an example Java producer for the same non-persistent topic:
Producer producer = client.createProducer(npTopic);
Message retention and expiry
By default, Pulsar message brokers:
- 立即刪除所有已經被 consumer 確認過的信息
- 以信息 backlog 的形式，持久保存所有未被確認的消息
Pulsar has two features, however, that enable you to override this default behavior:
- 信息 retention 讓你可以保存 consumer 確認過的信息
- 信息 expiry 可以給未被確認的信息設置存活時長（TTL）
The diagram below illustrates both concepts:
With message retention, shown at the top, a retention policy applied to all topics in a namespace dicates that some messages are durably stored in Pulsar even though they've already been acknowledged. Acknowledged messages that are not covered by the retention policy are deleted. Without a retention policy, all of the acknowledged messages would be deleted.
With message expiry, shown at the bottom, some messages are deleted, even though they haven't been acknowledged, because they've expired according to the TTL applied to the namespace (for example because a TTL of 5 minutes has been applied and the messages haven't been acknowledged but are 10 minutes old).
Message duplication occurs when a message is persisted by Pulsar more than once. Message deduplication is an optional Pulsar feature that prevents unnecessary message duplication by processing each message only once, even if the message is received more than once.
The following diagram illustrates what happens when message deduplication is disabled vs. enabled:
Message deduplication is disabled in the scenario shown at the top. Here, a producer publishes message 1 on a topic; the message reaches a Pulsar broker and is persisted to BookKeeper. The producer then sends message 1 again (in this case due to some retry logic), and the message is received by the broker and stored in BookKeeper again, which means that duplication has occurred.
In the second scenario at the bottom, the producer publishes message 1, which is received by the broker and persisted, as in the first scenario. When the producer attempts to publish the message again, however, the broker knows that it has already seen message 1 and thus does not persist the message.
Message deduplication is handled at the namespace level. For more instructions, see the message deduplication cookbook.
The other available approach to message deduplication is to ensure that each message is only produced once. This approach is typically called producer idempotency. The drawback of this approach is that it defers the work of message deduplication to the application. In Pulsar, this is handled at the broker level, which means that you don't need to modify your Pulsar client code. Instead, you only need to make administrative changes (see the Managing message deduplication cookbook for a guide).
Deduplication and effectively-once semantics
Message deduplication makes Pulsar an ideal messaging system to be used in conjunction with stream processing engines (SPEs) and other systems seeking to provide effectively-once processing semantics. Messaging systems that don't offer automatic message deduplication require the SPE or other system to guarantee deduplication, which means that strict message ordering comes at the cost of burdening the application with the responsibility of deduplication. With Pulsar, strict ordering guarantees come at no application-level cost.