Dominique Martinet wrote on Wed, Jun 28, 2023 at 08:42:41PM +0900:
If flags already has either MFD_EXEC or MFD_NOEXEC_SEAL, you don't check the sysctl at all. [...repro snipped..]
What am I missing?
(Perhaps the intent is just to force people to use the flag so it is easier to check for memfd_create in seccomp or other LSM? But I don't see why such a check couldn't consider the absence of a flag as well, so I don't see the point.)
BTW I find the current behaviour rather hard to use: setting this to 2 should still set NOEXEC by default in my opinion, just refuse anything that explicitly requested EXEC.
And I just noticed it's not possible to lower the value despite having CAP_SYS_ADMIN: what the heck?! I have never seen such a sysctl and it just forced me to reboot because I willy-nilly tested in the init pid namespace, and quite a few applications that don't require exec broke exactly as I described below.
If the user has CAP_SYS_ADMIN there are more container escape methods than I can count; this is basically a free pass to root on the main namespace anyway, so you're not protecting anything. Please let people set the sysctl to what they want.
Sure there's a warn_once that memfd_create was used without seal, but right now on my system it's "used up" 5 seconds after boot by systemd:
[ 5.854378] memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=1 'systemd'
And anyway, older kernels will barf up EINVAL when calling memfd_create with MFD_NOEXEC_SEAL, so even if userspace wants to adapt, it will need to try calling memfd_create with the flag once and retry without it on EINVAL, which let's face it is going to take a while to happen. (Also, the flag has been added to glibc, but not in any release yet.)
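To make the try-then-retry dance concrete, here is roughly what every caller would have to carry. This is only a sketch: memfd_create_noexec() is a made-up helper name, it goes through syscall() directly so it does not depend on a glibc that knows the flag, and the fallback #define assumes the 0x0008U value from the kernel's linux/memfd.h.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <linux/memfd.h>   /* MFD_* flag values from the kernel UAPI headers */
#include <sys/syscall.h>
#include <unistd.h>

/* Older UAPI headers lack the flag; 0x0008U is the value in linux/memfd.h. */
#ifndef MFD_NOEXEC_SEAL
#define MFD_NOEXEC_SEAL 0x0008U
#endif

/*
 * Hypothetical helper, not part of any library: ask for a non-executable,
 * sealed memfd and fall back to the plain call on kernels that reject the
 * unknown flag.
 */
static int memfd_create_noexec(const char *name, unsigned int flags)
{
	int fd = (int)syscall(SYS_memfd_create, name, flags | MFD_NOEXEC_SEAL);

	/* Kernels that predate the flag reject unknown bits with EINVAL. */
	if (fd < 0 && errno == EINVAL)
		fd = (int)syscall(SYS_memfd_create, name, flags);

	return fd;
}
```

Note that EINVAL is also what you get for other bad arguments, so the retry cannot tell "old kernel" from "caller bug"; in the second case it simply fails again with the same error.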
Making calls default to noexec AND refuse exec does what you want (forbid use of exec in an app that wasn't in a namespace that allows exec) while allowing apps that don't require it to keep working; that sounds better to me than breaking all applications that haven't taken the pain of adding the new flag. Well, I guess an app that did require exec without setting the flag will fail in a weird place instead of failing at memfd_create and having a chance to fall back, so it's not like the current behaviour doesn't make any sense; I don't have such strong feelings about this if the sysctl works, but for my use case I'm more likely to want to take a chance at memfd_create not needing exec than having the flag set. Perhaps a third value if I cared enough...
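In case prose is ambiguous, the semantics I'm arguing for could be modelled like this. It is a standalone userspace sketch, not a patch against mm/memfd.c: sketch_resolve_flags() and the SK_ names are made up, the meaning of sysctl values 0 and 1 (default to exec, default to noexec+seal) is my reading of the sysctl documentation, and -EACCES as the refusal error is an arbitrary pick.

```c
#include <errno.h>

/* Same values as in linux/memfd.h, redefined so the sketch is standalone. */
#define SK_MFD_NOEXEC_SEAL 0x0008U
#define SK_MFD_EXEC        0x0010U

/*
 * Hypothetical model of the proposed policy, not the kernel's code.
 * Returns 0 and stores the effective flags in *out, or -EACCES when
 * the sysctl forbids executable memfds.
 */
static int sketch_resolve_flags(int sysctl_val, unsigned int flags,
				unsigned int *out)
{
	/* No explicit choice from the caller: the sysctl picks the default. */
	if (!(flags & (SK_MFD_EXEC | SK_MFD_NOEXEC_SEAL)))
		flags |= (sysctl_val == 0) ? SK_MFD_EXEC : SK_MFD_NOEXEC_SEAL;

	/* Value 2: refuse exec, but only when it was explicitly requested. */
	if (sysctl_val >= 2 && (flags & SK_MFD_EXEC))
		return -EACCES;

	*out = flags;
	return 0;
}
```

With this, a flagless memfd_create under value 2 quietly becomes noexec+seal, and only callers that explicitly pass MFD_EXEC get an error.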