Hello folks, DreamObjects did indeed start responding more slowly than usual (hence the 503 errors), but only recently. The investigation we're running points at two concurrent causes. The first is a sharp increase in the cluster's utilization: the API endpoint is receiving many more requests per second. The second is that, just as request volume was growing, we started expanding the cluster, which generates a lot of disk and network activity (because of the way Ceph works: it rebalances data when capacity is added). Together, these two events put more stress on the request queues between haproxy and the RGWs, causing the errors.
The expansion of the cluster is almost complete. From the monitors we can see that things have already improved today, and they should keep getting better.
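In the meantime, if your client doesn't already retry when it gets a 503, backing off and retrying is a reasonable way to ride out a busy request queue. Here is a minimal sketch in Python, not an official recommendation. `ServiceUnavailable` and `call_with_backoff` are hypothetical names I made up for illustration, and the attempt count and delays are guesses rather than tuned values. Many S3 client libraries ship their own retry settings, so check those first.

```python
import random
import time


class ServiceUnavailable(Exception):
    """Hypothetical error: raise this from your own wrapper when a request returns HTTP 503."""


def call_with_backoff(request, max_attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call request() and retry on ServiceUnavailable, using exponential backoff with jitter.

    All default values here are assumptions chosen for illustration.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except ServiceUnavailable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            # The upper bound on the wait doubles each attempt (0.5s, 1s, 2s, ...),
            # capped at max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Pick a random wait below the bound so many clients don't retry in lockstep.
            sleep(random.uniform(0, cap))
```

You would wrap each upload or download in a small function that raises `ServiceUnavailable` on a 503 and pass that function as `request`. The random wait matters: if every client retries at the same instant, the retries themselves add to the load.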
Steve-o: a bucket with millions of objects is more likely to cause issues. There are some best practices to follow, if you're not following them already. Feel free to reach out to me privately (or share here) and describe how you're storing the objects. I'm thinking of writing up those best practices, and a real-life example could be more useful than a general-purpose article.