On Thu, May 02, 2019 at 07:45:16PM +0200, Andre Noll wrote:
> On Thu, May 02, 18:52, Greg Kroah-Hartman wrote:
> > On Thu, May 02, 2019 at 05:27:36PM +0200, Andre Noll wrote:
> > > On Thu, May 02, 16:10, Greg Kroah-Hartman wrote:
> > > > Ok, how about we hold off on this patch for 4.9.y then. "No one" should be using 4.9.y in a "server system" anymore, unless you happen to have an enterprise kernel based on it. So we should be fine, as the users of the older kernels don't run xfs.
> > > Well, we do run xfs on top of bcache on vanilla 4.9 kernels on a few dozen production servers here, mainly because we ran into all sorts of issues with newer kernels (not necessarily related to xfs). 4.9, OTOH, appears to be rock solid for our workload.
> > Great, but what is wrong with 4.14.y or, better yet, 4.19.y? Do those also work for your workload? If not, we should fix that, and soon :)
> Some months ago we tried 4.14 and it was a real disaster: random crashes with nothing in the logs on the file servers, and unkillable hung processes on the compute machines. The trouble is that I can't afford extended downtime on these production systems, nor can I test patches or enable debugging options that slow the systems down too much. Also, 10 of the compute nodes load the nvidia module, so all bets are off there anyway. But we've also seen the hung processes on non-gpu nodes where the nvidia module is not loaded.
> As for 4.19, xfs on bcache was broken until a couple of weeks ago. Meanwhile the fix (e578f90d8a9c) has gone in, so I briefly benchmarked 4.19.x on one system. To my surprise, the results were *worse* than with 4.9. This looks like another cache bypass issue, but I need to take a closer look and collect more reliable numbers.
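
For the unkillable hangs, one low-overhead first step might be to dump the kernel stacks of whatever is stuck in D state when it happens. A minimal sketch (assumes root and a kernel built with CONFIG_STACKTRACE, so that /proc/<pid>/stack is readable):

#!/usr/bin/env python3
# Minimal sketch: list D-state (uninterruptible) tasks and dump their
# kernel stacks. Needs root and CONFIG_STACKTRACE for /proc/<pid>/stack.
import os

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/stat") as f:
            # The field after "(comm)" is the task state; "D" means
            # uninterruptible sleep.
            state = f.read().rsplit(")", 1)[1].split()[0]
        if state != "D":
            continue
        with open(f"/proc/{pid}/comm") as f:
            comm = f.read().strip()
        with open(f"/proc/{pid}/stack") as f:
            print(f"=== {pid} ({comm}) ===\n{f.read()}")
    except OSError:
        continue  # task exited or file not readable

Even a single stack trace from a hung task would tell us a lot more than "nothing in the logs".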
Is this something you can reproduce outside of those 10 magical machines?
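
On the cache-bypass theory: the counters bcache exports under sysfs should tell you quickly whether reads are skipping the cache. Another rough sketch, assuming a single bcache0 device and the stats_total counters described in Documentation/admin-guide/bcache.rst:

#!/usr/bin/env python3
# Rough sketch: print bcache's hit/bypass counters for bcache0.
# Assumes the sysfs layout from Documentation/admin-guide/bcache.rst.
from pathlib import Path

dev = Path("/sys/block/bcache0/bcache")
print("sequential_cutoff:", (dev / "sequential_cutoff").read_text().strip())

for name in ("cache_hits", "cache_misses", "cache_hit_ratio",
             "cache_bypass_hits", "cache_bypass_misses", "bypassed"):
    path = dev / "stats_total" / name
    if path.exists():
        print(f"{name}: {path.read_text().strip()}")

If the bypass counters dominate, it may be worth checking sequential_cutoff first, since sequential I/O beyond that threshold bypasses the cache by design.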
--
Thanks,
Sasha