Hey guys, just got hummingbird.me onto the latest Discourse (w/ Postgres and Redis hosted on infrastructure shared with the rest of our site and just a few client-heavy plugins) but we're having some serious troubles with Unicorns timing out during certain queries. All three servers (web, postgres, redis) are running around 10% CPU and 50-60% RAM, there's plenty of query slots remaining on the postgres server, and the cache hit rates aren't unreasonable (most tables around 99%, with some with more random access like posts hovering around 50%)
Checking the Postgres logs pointed to some pretty hefty queries such as this lovely:
SELECT
a.id,
t.title, a.action_type, a.created_at, t.id topic_id,
t.closed AS topic_closed, t.archived AS topic_archived,
a.user_id AS target_user_id, au.name AS target_name, au.username AS target_username,
coalesce(p.post_number, 1) post_number, p.id as post_id,
p.reply_to_post_number,
pu.username, pu.name, pu.id user_id,
pu.uploaded_avatar_id,
u.username acting_username, u.name acting_name, u.id acting_user_id,
u.uploaded_avatar_id acting_uploaded_avatar_id,
coalesce(p.cooked, p2.cooked) cooked,
CASE WHEN coalesce(p.deleted_at, p2.deleted_at, t.deleted_at) IS NULL THEN false ELSE true END deleted,
p.hidden,
p.post_type,
p.action_code,
p.edit_reason,
t.category_id
FROM user_actions as a
JOIN topics t on t.id = a.target_topic_id
LEFT JOIN posts p on p.id = a.target_post_id
JOIN posts p2 on p2.topic_id = a.target_topic_id and p2.post_number = 1
JOIN users u on u.id = a.acting_user_id
JOIN users pu on pu.id = COALESCE(p.user_id, t.user_id)
JOIN users au on au.id = a.user_id
LEFT JOIN categories c on c.id = t.category_id
WHERE (t.deleted_at is null) AND (p.deleted_at is null and p2.deleted_at is null) AND (NOT COALESCE(p.hidden, false) OR p.user_id = -2) AND (COALESCE(p.post_type, p2.post_type) IN (1,2,3)) AND (t.visible) AND (a.action_type not in (3)) AND (a.action_type <> 14) AND (t.archetype != 'private_message') AND ((c.read_restricted IS NULL OR NOT c.read_restricted)) AND (a.user_id = 48321) AND (a.action_type in (4,5))
ORDER BY a.created_at desc
OFFSET 0
LIMIT 30
Which shows a pretty scary EXPLAIN:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=3523.95..3523.96 rows=1 width=622) (actual time=1735.823..1735.836 rows=30 loops=1)
-> Sort (cost=3523.95..3523.96 rows=1 width=622) (actual time=1735.820..1735.829 rows=30 loops=1)
Sort Key: a.created_at
Sort Method: top-N heapsort Memory: 46kB
-> Nested Loop (cost=2.95..3523.94 rows=1 width=622) (actual time=0.231..1735.316 rows=289 loops=1)
-> Nested Loop (cost=2.53..3515.50 rows=1 width=602) (actual time=0.228..1734.421 rows=289 loops=1)
-> Nested Loop (cost=2.12..3513.57 rows=1 width=582) (actual time=0.224..1733.551 rows=289 loops=1)
-> Nested Loop (cost=1.70..3506.19 rows=1 width=558) (actual time=0.201..1732.336 rows=289 loops=1)
Join Filter: (COALESCE(p.post_type, p2.post_type) = ANY ('{1,2,3}'::integer[]))
-> Nested Loop Left Join (cost=1.27..2873.24 rows=62 width=343) (actual time=0.165..1088.340 rows=289 loops=1)
Filter: ((p.deleted_at IS NULL) AND ((NOT COALESCE(p.hidden, false)) OR (p.user_id = (-2))))
-> Nested Loop Left Join (cost=0.84..1804.84 rows=128 width=91) (actual time=0.157..444.363 rows=289 loops=1)
Join Filter: (c.id = t.category_id)
Rows Removed by Join Filter: 6936
Filter: ((c.read_restricted IS NULL) OR (NOT c.read_restricted))
-> Nested Loop (cost=0.84..1748.15 rows=145 width=91) (actual time=0.102..441.141 rows=289 loops=1)
-> Index Scan using idx_unique_rows on user_actions a (cost=0.56..522.71 rows=217 width=32) (actual time=0.060..438.432 rows=297 loops=1)
Index Cond: ((action_type = ANY ('{4,5}'::integer[])) AND (user_id = 32418))
-> Index Scan using forum_threads_pkey on topics t (cost=0.29..5.64 rows=1 width=59) (actual time=0.007..0.008 rows=1 loops=297)
Index Cond: (id = a.target_topic_id)
Filter: ((deleted_at IS NULL) AND visible AND ((archetype)::text <> 'private_message'::text))
Rows Removed by Filter: 0
-> Materialize (cost=0.00..2.38 rows=25 width=5) (actual time=0.000..0.005 rows=25 loops=289)
-> Seq Scan on categories c (cost=0.00..2.25 rows=25 width=5) (actual time=0.003..0.031 rows=25 loops=1)
-> Index Scan using posts_pkey on posts p (cost=0.43..8.33 rows=1 width=256) (actual time=2.226..2.227 rows=1 loops=289)
Index Cond: (id = a.target_post_id)
-> Index Scan using index_posts_on_topic_id_and_post_number on posts p2 (cost=0.43..10.17 rows=3 width=227) (actual time=2.226..2.227 rows=1 loops=289)
Index Cond: ((topic_id = t.id) AND (post_number = 1))
Filter: (deleted_at IS NULL)
-> Index Scan using users_pkey on users u (cost=0.42..7.37 rows=1 width=28) (actual time=0.002..0.003 rows=1 loops=289)
Index Cond: (id = a.acting_user_id)
-> Index Scan using users_pkey on users pu (cost=0.42..1.92 rows=1 width=28) (actual time=0.001..0.002 rows=1 loops=289)
Index Cond: (id = COALESCE(p.user_id, t.user_id))
-> Index Scan using users_pkey on users au (cost=0.42..8.44 rows=1 width=24) (actual time=0.001..0.002 rows=1 loops=289)
Index Cond: (id = 48321)
Planning time: 9.415 ms
Execution time: 1736.177 ms
(37 rows)
The first time I ran this query through explain, it gave a 20-second execution time I'm not 100% sure I'm reading this correctly (mostly because it's so absurd it made me question my senses) but is that 7 loops nested?
Any tips? I'm thinking of switching our deploy to Puma because a slow response is better than no-response-and-the-server-dies-resulting-in-other-requests-failing, but I'd rather stay vanilla.