On Mon, 3 Jul 2023 10:13:22 +0000 Ajay Kaher <akaher@vmware.com> wrote:
+/**
+ * eventfs_down_write - acquire write lock function
+ * @eventfs_rwsem: a pointer to rw_semaphore
+ *
+ * helper function to perform write lock on eventfs_rwsem
+ */
+static void eventfs_down_write(struct rw_semaphore *eventfs_rwsem)
+{
+	while (!down_write_trylock(eventfs_rwsem))
+		msleep(10);
What's this loop for? Something like that needs a very good explanation in a comment. Loops like these are usually a sign of a workaround for a bug in the design, or worse, simply hide an existing bug.
Yes, correct. This logic is there to solve a deadlock:
	Thread 1					Thread 2
	down_read_nested() - read lock acquired
							down_write() - waiting for write lock to acquire
	down_read_nested() - deadlock
The deadlock happens because an rwsem will not allow a read lock to be acquired while a write lock is waiting. down_write_trylock() does not add the writer to the wait queue, and hence helps to prevent the deadlock scenario.
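To spell the interleaving out in code (a minimal sketch only, not code from the patch; the two thread functions and the lock name are hypothetical):

#include <linux/rwsem.h>

static DECLARE_RWSEM(example_rwsem);

/* Thread 1: takes the read lock, then re-enters a path that takes it again. */
static void thread1_lookup(void)
{
	down_read(&example_rwsem);	/* read lock acquired */
	/* ... meanwhile Thread 2 calls down_write() and queues as a waiter ... */
	down_read(&example_rwsem);	/* blocks behind the queued writer: deadlock */
	up_read(&example_rwsem);
	up_read(&example_rwsem);
}

/* Thread 2: a plain writer. */
static void thread2_update(void)
{
	down_write(&example_rwsem);	/* waits for Thread 1's first read lock */
	/* ... modify eventfs entries ... */
	up_write(&example_rwsem);
}

Because down_write_trylock() never joins the wait queue, Thread 1's second down_read() can still succeed; that is what the msleep() loop above is relying on.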
I was stuck with this deadlock, tried a few methods, and finally borrowed the approach from cifs, as it is upstreamed, tested, and working there; please refer to: https://elixir.bootlin.com/linux/v6.3.1/source/fs/cifs/file.c#L438
I just looked at that code and the commit, and I honestly believe that is a horrible hack, and very fragile. It's in the smb code, so it was unlikely to have been reviewed by anyone outside that subsystem. I really do not want to proliferate that solution around the kernel. We need to come up with something else.
I also think it's buggy (yes the cifs code is buggy!) because in the comment above the down_read_nested() it says:
/*
 * nested locking. NOTE: rwsems are not allowed to recurse
 * (which occurs if the same task tries to acquire the same
 * lock instance multiple times), but multiple locks of the
 * same lock class might be taken, if the order of the locks
 * is always the same. This ordering rule can be expressed
 * to lockdep via the _nested() APIs, but enumerating the
 * subclasses that are used. (If the nesting relationship is
 * static then another method for expressing nested locking is
 * the explicit definition of lock class keys and the use of
 * lockdep_set_class() at lock initialization time.
 * See Documentation/locking/lockdep-design.rst for more details.)
 */
So this is NOT a solution (and the cifs code should be fixed too!)
Can you show me the exact backtrace where the reader lock gets taken again? We will have to come up with a way to not take the same lock twice.
We can also look to see if we can implement this with RCU. What exactly is this rwsem protecting?
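For comparison, the usual shape of an RCU conversion (purely a hypothetical sketch; it assumes the rwsem is only protecting lookups of some list of entries, which may not match the real data structure) looks like this:

#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/string.h>

struct eventfs_entry {			/* hypothetical element */
	struct list_head	list;
	const char		*name;
	struct rcu_head		rcu;
};

static LIST_HEAD(eventfs_entries);
static DEFINE_MUTEX(eventfs_mutex);	/* serializes writers only */

/* Readers: lockless, can nest freely, never block behind a writer. */
static bool eventfs_entry_exists(const char *name)
{
	struct eventfs_entry *e;
	bool found = false;

	rcu_read_lock();
	list_for_each_entry_rcu(e, &eventfs_entries, list) {
		if (!strcmp(e->name, name)) {
			found = true;
			break;
		}
	}
	rcu_read_unlock();
	return found;
}

/* Writers: serialize among themselves and defer the free past readers. */
static void eventfs_remove_entry(struct eventfs_entry *e)
{
	mutex_lock(&eventfs_mutex);
	list_del_rcu(&e->list);
	mutex_unlock(&eventfs_mutex);
	kfree_rcu(e, rcu);
}

With the read side lockless, the recursive-read deadlock cannot happen at all, but whether this fits depends on what the rwsem actually protects.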
Looking forward to your input. I will add an explanation in v4.
+}
[..]
+ * This function creates the top of the trace event directory.
+ */
+struct dentry *eventfs_create_events_dir(const char *name,
+					  struct dentry *parent,
+					  struct rw_semaphore *eventfs_rwsem)
OK, I'm going to have to really look at this. Passing in a lock to the API is just broken. We need to solve this another way.
eventfs_rwsem is a member of struct trace_array; I guess we should pass a pointer to the trace_array instead.
No, it should not be part of the trace_array. If we can't do this with RCU, then we need to add a descriptor that contains the dentry that is returned above, and have the lock held there. The caller of the eventfs_create_events_dir() should not care about locking. That's an implementation detail that should *not* be part of the API.
That is, if you need a lock:
	struct eventfs_dentry {
		struct dentry		*dentry;
		struct rw_semaphore	*rwsem;
	};
And then get to that lock by using the container_of() macro. All created eventfs dentries could have this structure, where the rwsem points to the top one. Again, that's only if we can't do this with RCU.
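Roughly, and only as a sketch of the idea (the helper below is illustrative, not a proposed API; it assumes callers hand around the address of the embedded dentry pointer):

#include <linux/container_of.h>
#include <linux/dcache.h>
#include <linux/rwsem.h>

struct eventfs_dentry {
	struct dentry		*dentry;
	struct rw_semaphore	*rwsem;	/* every entry points at the top dir's rwsem */
};

/*
 * Recover the wrapper, and with it the lock, from a pointer to the
 * embedded dentry member, so the rwsem never has to appear in the
 * eventfs_create_events_dir() signature.
 */
static void eventfs_dentry_down_write(struct dentry **dentryp)
{
	struct eventfs_dentry *ed;

	ed = container_of(dentryp, struct eventfs_dentry, dentry);
	down_write(ed->rwsem);
}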
-- Steve
I'm about to board a plane to JFK shortly; I'm hoping to play with this while flying back.
I have replied to the major concerns. All the other minor ones I will take care of in v4.
Thanks a lot for giving time to the eventfs patches.
- Ajay