
I built a high-scale MQTT ingestion system using the MQTT protocol handler for Apache Pulsar (https://github.com/streamnative/mop). I ran a forked version and contributed back some of the non-proprietary bits.

A lot more work than Mosquitto, but it is obviously HA/distributed, with some tradeoffs w.r.t. features. Worth it if you want to run Pulsar anyway for other reasons.



I was going to go for Redpanda. What do you think the pros/cons of Pulsar would be?


With Redpanda you would need to build something external. With Pulsar the protocol handlers run within the Pulsar proxy execution mode and all of your authn/authz can be done by Pulsar etc.

Redpanda might be more resource-efficient, however, and carry less operational overhead than a Pulsar system.

Pulsar has some very distinct advantages over Redpanda when it comes to actually consuming messages though. Specifically it enables both queue-like and streaming consumption patterns (it is still a distributed log underneath but does selective acknowledgement at the subscription level).


I'm not sure what you mean by "queue-like and streaming consumption patterns"?

A stream is a form of queue to me, no?


Definitely not. A stream is an ordered log; a queue is a heap, i.e. an unordered pile of pending work items.

A stream has cumulative acknowledgement, i.e. "I have read up to offset X on partition Y; if I restart unexpectedly, please redeliver all messages since X". This means that if any message on Y is failing, you can't advance the committed offset X without either a) dropping the message into the ether or b) writing it to a retry topic. b) sounds like a solution but it's really just kicking the can down the road, because you face the same choice there until the message ends up in a dead-letter topic, where you send stuff that can't be dealt with automatically. In the literature this is called head-of-line blocking.
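
To make the head-of-line blocking concrete, here is a minimal in-memory sketch of cumulative acknowledgement. It is not Kafka or Redpanda client code; `consume_with_cumulative_ack`, `handle`, and the `"poison"` message are hypothetical names made up for illustration, and a plain Python list stands in for one partition.

```python
def consume_with_cumulative_ack(log, committed, handler):
    """Process log[committed:] in order and return the new committed offset.

    The committed offset is a single number, so it can only advance past a
    message once every earlier message has been handled successfully.
    """
    offset = committed
    for message in log[committed:]:
        if not handler(message):
            # Stuck: moving past this message means dropping it or
            # re-routing it (retry topic), which this sketch does not do.
            break
        offset += 1
    return offset


def handle(message):
    # Hypothetical handler: one message always fails.
    return message != "poison"


log = ["a", "b", "poison", "c", "d"]  # stand-in for a single partition
committed = consume_with_cumulative_ack(log, 0, handle)
print(committed)  # 2: "c" and "d" are healthy but blocked behind "poison"
```

Calling it again from offset 2 returns 2 every time, which is the choice described above: skip the message, park it elsewhere, or stay blocked.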

Queues are completely different. Usually, instead of having a bunch of partitions with exclusive consumers, you want a work-stealing approach where consumers grab whatever work items they can get, stay as well fed as possible, and deal with failing items by nack'ing them and sending them back to the queue. To facilitate this, though, the queue needs to be able to selectively ack(nowledge) messages and keep track of which messages haven't been successfully consumed.

This is easy with a traditional queuing system because they usually don't offer any ordering guarantees (or if they do, they are per key or something and pretty loose), and they store the set of messages "yet to be delivered" rather than "all messages in order" like a streaming system does. This makes it trivial to acknowledge that a message has been processed (delete it) or nack it (remove the processing lock, start a redelivery timer for it). Naturally, though, this means the ability to re-consume already acknowledged messages pretty much doesn't exist in most queue systems, as the messages are long gone once they have been successfully processed.
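
A rough sketch of that ack/nack bookkeeping, assuming a single-process toy queue rather than any real broker. `WorkQueue`, `drain`, and `max_attempts` are hypothetical names, and the redelivery timer is left out: a nack'd item simply goes to the back of the pending set.

```python
from collections import deque


class WorkQueue:
    """Toy queue: stores only messages that are yet to be completed."""

    def __init__(self, items):
        self.pending = deque(items)   # not yet delivered
        self.in_flight = {}           # receipt -> item, the "processing lock"
        self._next_receipt = 0

    def receive(self):
        if not self.pending:
            return None
        item = self.pending.popleft()
        receipt = self._next_receipt
        self._next_receipt += 1
        self.in_flight[receipt] = item
        return receipt, item

    def ack(self, receipt):
        # Successfully processed: the message is simply gone.
        del self.in_flight[receipt]

    def nack(self, receipt):
        # Release the lock and make the message deliverable again.
        self.pending.append(self.in_flight.pop(receipt))


def drain(queue, handler, max_attempts=3):
    """Consume everything; give up on an item after max_attempts failures."""
    attempts, done, dead = {}, [], []
    while (delivery := queue.receive()) is not None:
        receipt, item = delivery
        if handler(item):
            queue.ack(receipt)
            done.append(item)
            continue
        attempts[item] = attempts.get(item, 0) + 1
        if attempts[item] >= max_attempts:
            queue.ack(receipt)        # remove it; it is dead-lettered below
            dead.append(item)
        else:
            queue.nack(receipt)
    return done, dead


queue = WorkQueue(["a", "b", "poison", "c", "d"])
done, dead = drain(queue, lambda item: item != "poison")
print(done, dead)  # ['a', 'b', 'c', 'd'] ['poison']
```

The failing item never holds up "c" and "d", but once an item is ack'd there is nothing left to replay, which is the tradeoff described above.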

Mixing the two is the magic of Pulsar. It has the underlying stream storage approach, which brings ordering properties and a whole bunch of stuff that is good for scaling and reliability, but it layers a queue-based consumption API on top by storing subscription state durably on the cluster, i.e. it tracks which individual messages have been ack'd rather than just offsets like Kafka consumer groups or similar APIs.
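
As an illustration of what "tracking individual acks on top of a log" can look like, here is a small sketch. It is not Pulsar's implementation or API; `Subscription`, `mark_delete`, and `acked_ahead` are names chosen for this example, and everything lives in memory rather than durably on a cluster.

```python
class Subscription:
    """Per-subscription ack state layered over a shared, ordered log."""

    def __init__(self, log):
        self.log = log             # the log itself is never modified by acks
        self.mark_delete = 0       # every offset below this has been ack'd
        self.acked_ahead = set()   # ack'd offsets at or above mark_delete

    def ack(self, offset):
        if offset < self.mark_delete:
            return                 # already covered by the contiguous prefix
        self.acked_ahead.add(offset)
        # Fold any contiguous run of acks into the prefix position.
        while self.mark_delete in self.acked_ahead:
            self.acked_ahead.remove(self.mark_delete)
            self.mark_delete += 1

    def unacked(self):
        """Offsets that would be redelivered after a restart."""
        return [o for o in range(self.mark_delete, len(self.log))
                if o not in self.acked_ahead]


log = ["a", "b", "poison", "c", "d"]
sub = Subscription(log)
for offset in (0, 1, 3, 4):        # everything except the failing message
    sub.ack(offset)
print(sub.mark_delete, sub.unacked())  # 2 [2]

other = Subscription(log)          # a second subscription has its own state
print(other.unacked())             # [0, 1, 2, 3, 4]
```

Only offset 2 is outstanding for `sub`, so a restart redelivers one message instead of everything since the last committed offset, and because the log is untouched a fresh subscription can still read it all in order.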

Building this yourself on Kafka/Redpanda is possible, but it's extremely difficult to do correctly, and you need to be very careful about how you store the subscription state (usually on a set of compacted topics on the cluster). I say this because I took this path in the past and I don't recommend it for anyone who isn't sufficiently brave. :)



