(This is an old topic I wrote on Dev that belongs here)
I wrote this a year and a half ago; keep that in mind when reading.
Background
The "message bus" is a component that allows us to easily publish information to our clients and between the Rails processes in the farm. At its core it provides a few very simple APIs.
# publish a message to all subscribers on the particular channel
MessageBus.publish '/channel_name', data
# then on the server side you can
MessageBus.subscribe '/channel_name' do |msg|
  do_something msg.data
end
Or, on the JS client, we can:
Discourse.MessageBus.subscribe('/channel_name', function(data){ /* do stuff with data*/ });
These simple APIs hide a bunch of intricacies. For example, we can publish messages to just a subset of users:
MessageBus.publish '/private_channel', 'secret', user_ids: [1, 2, 3]
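As a sketch of what that targeting means (illustrative code only, not the actual bus internals): the publisher attaches user_ids to the message, and the delivery layer only forwards the message to those users.

```ruby
# Illustrative sketch: how user-targeted delivery can work.
# The publisher attaches user_ids; delivery filters on them.
Message = Struct.new(:channel, :data, :user_ids)

# nil user_ids means the message is public
def deliver?(message, user_id)
  message.user_ids.nil? || message.user_ids.include?(user_id)
end

secret   = Message.new('/private_channel', 'secret', [1, 2, 3])
announce = Message.new('/channel_name', 'hello', nil)

deliver?(secret, 2)    # => true
deliver?(secret, 4)    # => false
deliver?(announce, 4)  # => true
```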
The message bus transparently "understands" which site a message belongs to. Our Rails apps have the ability to serve multiple web sites (e.g. ember and dev are served in the same process, backed by different dbs).
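A sketch of one way to get that transparency (the real code differs): encode the current site's id into the channel name before it reaches Redis, so each site's messages live in their own keyspace.

```ruby
# Sketch only (not the actual implementation): make channels site-aware
# by prefixing every channel name with the current site's id.
def encode_channel_name(site_id, channel)
  "/#{site_id}#{channel}"
end

encode_channel_name('dev', '/channel_name')    # => "/dev/channel_name"
encode_channel_name('ember', '/channel_name')  # => "/ember/channel_name"
```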
What changed?
I always really liked the API; it is simple enough. However, the bus itself was inherently unreliable: when the server sent messages to the client there was no acking mechanism, and if a server restarted there was no way for clients to "catch up".
To resolve this I created an abstraction I call ReliableMessageBus; at its core it allows you to catch up on any messages in a channel. This involves some fairly tricky Redis code: when something is published on a Redis channel it is also stored in a list:
def publish(channel, data)
  redis = pub_redis
  offset_key = offset_key(channel)
  backlog_key = backlog_key(channel)

  redis.watch(offset_key, backlog_key, global_id_key, global_backlog_key, global_offset_key) do
    offset = redis.get(offset_key).to_i
    backlog = redis.llen(backlog_key).to_i

    global_offset = redis.get(global_offset_key).to_i
    global_backlog = redis.llen(global_backlog_key).to_i

    global_id = redis.get(global_id_key).to_i
    global_id += 1

    too_big = backlog + 1 > @max_backlog_size
    global_too_big = global_backlog + 1 > @max_global_backlog_size

    # message_id is the absolute position in the channel since creation
    message_id = backlog + offset + 1

    redis.multi do
      if too_big
        # trim the channel backlog and bump the offset so ids stay stable
        redis.ltrim backlog_key, (backlog + 1) - @max_backlog_size, -1
        offset += (backlog + 1) - @max_backlog_size
        redis.set(offset_key, offset)
      end

      if global_too_big
        redis.ltrim global_backlog_key, (global_backlog + 1) - @max_global_backlog_size, -1
        global_offset += (global_backlog + 1) - @max_global_backlog_size
        redis.set(global_offset_key, global_offset)
      end

      msg = MessageBus::Message.new global_id, message_id, channel, data
      payload = msg.encode

      redis.set global_id_key, global_id
      redis.rpush backlog_key, payload
      redis.rpush global_backlog_key, message_id.to_s << "|" << channel
      redis.publish redis_channel_name, payload
    end

    return message_id
  end
end
The reliable message bus allows any client to catch up on missed messages (it also caps the size of the backlog for sanity).
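The id arithmetic above is easier to see without Redis in the picture. Here is a plain-Ruby sketch of the same bookkeeping (names are mine, not the actual classes): message_id is the absolute position since channel creation, and the offset records how many messages have been trimmed off the front, which is what makes catch-up possible after trimming.

```ruby
# Sketch of the offset/backlog bookkeeping, using a plain Array in place
# of the Redis list. Ids never shift when the backlog is trimmed.
class ChannelBacklog
  attr_reader :offset

  def initialize(max_size)
    @max_size = max_size
    @list = []     # [[message_id, data], ...]
    @offset = 0    # how many messages have been trimmed off the front
  end

  def publish(data)
    message_id = @list.length + @offset + 1
    @list << [message_id, data]
    if @list.length > @max_size
      trimmed = @list.length - @max_size
      @list.shift(trimmed)
      @offset += trimmed
    end
    message_id
  end

  # everything with an id greater than last_id, via simple index arithmetic
  def after(last_id)
    start = last_id - @offset
    start = 0 if start < 0
    @list[start..] || []
  end
end

backlog = ChannelBacklog.new(3)
(1..5).each { |n| backlog.publish("msg #{n}") }
backlog.offset                  # => 2 (two oldest messages trimmed)
backlog.after(3).map(&:first)   # => [4, 5]
```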
With these bits in place it was fairly straightforward to implement both polling and long polling, two things I had not implemented in the past. The key was that I had a clean way of catching up.
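Polling on top of this catch-up API is then just id arithmetic. A sketch (the function and data shapes here are illustrative, not Discourse's actual endpoints): the client remembers the last message id it saw per channel, and each poll asks the server for anything newer.

```ruby
# Sketch of a single poll cycle (names are illustrative).
# backlogs:      { channel => [[message_id, data], ...] }
# subscriptions: { channel => last_seen_message_id }
def poll(backlogs, subscriptions)
  subscriptions.each_with_object({}) do |(channel, last_id), response|
    missed = (backlogs[channel] || []).select { |id, _| id > last_id }
    response[channel] = missed unless missed.empty?
  end
end

backlogs = { '/channel_name' => [[1, 'a'], [2, 'b'], [3, 'c']] }
poll(backlogs, '/channel_name' => 1)
# => { "/channel_name" => [[2, "b"], [3, "c"]] }
```

Long polling is the same loop; the server simply holds the request open until there is something to return or a timeout fires.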
Why I hate web sockets and why I disabled them
My initial implementation was unreliable, but web sockets made it mostly sort of work. With web sockets you have this false sense that it is simple enough to just hook up a few callbacks on your socket and all is good. You don't worry about backlogs; the socket is always up, and everything else is an edge case.
However, web sockets are just jam packed full of edge cases:
- There are a ton of web socket implementations you need to worry about: multiple framing protocols, multiple handshake protocols, and tons of weird bugs, like needing to hack things so haproxy forgives some insane flavors of the protocol
- Some networks (and mobile networks) decide to disable web sockets altogether, like Telstra in Australia
- Proxies disable them
- Getting web sockets to work over SSL is a big pain, yet if you want them to work reliably you must be on SSL.
- Web sockets don't magically catch up when you open your laptop lid after it has been closed for an hour.
So, if you decide to support web sockets, you carry a large amount of code and configuration around, and you are forced to implement polling anyway, because you cannot guarantee clients support web sockets and because networks can and do disable them.
My call is that all this hoop jumping, and the complex class of bugs that would follow it, is just not worth it. Given that nginx can handle 500k requests a second, our bottleneck is not the network. Our bottleneck for the message bus is Ruby and Redis; we just need to make sure those bits are super fast.
I really hate all the buzz around web sockets. So much of it seems to be about how cool web sockets are, and far less of it is based on evidence that sockets actually improve performance in a way that significantly matters. Gmail is doing just fine without web sockets.
This makes it easier to deploy Discourse
Now that I have rid us of the hard web socket dependency and made "long polling" optional (in site settings), people can deploy Discourse on app servers like Passenger if they wish. It will not perform as well, and updates will not be as instant, but it will work.
I wrote that about a year ago, but it is still pretty much true today.