On Tue, Jul 6, 2021 at 6:29 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
On Tue, Jul 06, 2021 at 05:49:01PM +0200, Daniel Vetter wrote:
The other thing to keep in mind is that one of these drivers supports 25 years of product generations, and the other one doesn't.
Sure, but that is the point, isn't it? To have an actually useful thing you need all of this mess
My argument is that an in-tree open kernel driver is a big help to reverse engineering an open userspace. Having the vendor's collaboration to build that monstrous thing can only help the end goal of an end to end open stack.
Not sure where this got lost, but we're totally fine with vendors using the upstream driver together with their closed stack. And most of the drivers we do have in upstream are actually, at least in parts, supported by the vendor. E.g. if you'd have looked, the drm/arm driver you picked is actually 100% written by ARM engineers. So kinda unfitting example.
So the argument with Habana really boils down to how much do they need to show in the open source space to get a kernel driver? You want to see the ISA or compiler at least?
Yup. We don't care about any of the fancy pieces you build on top, nor does the compiler need to be the optimizing one. Just something that's good enough to drive the hw in some demos to see how it works and all that. Generally that's also not that hard to reverse engineer, if someone is bored enough, the real fancy stuff tends to be in how you optimize the generated code. And make it fit into the higher levels properly.
That at least doesn't seem "extreme" to me.
For instance a vendor with an in-tree driver has a strong incentive to sort out their FW licensing issues so it can be redistributed.
Nvidia has been claiming to try and sort out the FW problem for years. They even managed to release a few things, but I think the last one is 2-3 years late now. Partially the reason is that they don't have a stable api between the firmware and driver, it's all internal from the same source tree, and they don't really want to change that.
Right, companies have no incentive to work in a sane way if they have their own parallel world. I think drawing them part by part into the standard open workflows and expectations is actually helpful to everyone.
Well we do try to get them on board part-by-part generally starting with the kernel and ending with a proper compiler instead of the usual llvm hack job, but for whatever reasons they really like their in-house stuff, see below for what I mean.
I don't think the facts on the ground support your claim here, aside from the practical problem that nvidia is unwilling to even create an open driver to begin with. So there isn't anything to merge.
The internet tells me there is nvgpu, it doesn't seem to have helped.
Not sure which one you mean, but every once in a while they open up a few headers, or a few programming specs, or a small driver somewhere for a very specific thing, and then it dies again or gets obfuscated for the next platform, or just never updated. I've never seen anything that comes remotely close to something complete, aside from tegra socs, which are fully supported in upstream afaik.
I understand nvgpu is the tegra driver that people actually use. nouveau may have good tegra support but is it used in any actual commercial product?
I think it was almost the case. Afaik they still have their internal userspace stack working on top of nvidia, at least last year someone fixed up a bunch of issues in the tegra+nouveau combo to enable format modifiers properly across the board. But also nvidia is never going to sell you that as the officially supported thing, unless your ask comes back with enormous amounts of sold hardware.
And it's not just nvidia, it's pretty much everyone. Like a soc company I don't want to know started collaborating with upstream and the reverse-engineered mesa team on a kernel driver, seems to work pretty well for current hardware. But for the next generation they decided it's going to be again only their in-house tree that completely ignores drivers/gpu/drm, and also tosses all the foundational work they helped build on the userspace side. And this is consistent across all companies, over the last 20 years I know of (often non-public) stories across every single company where they decided that all the time invested into community/upstream collaboration isn't useful anymore, we go all vendor solo for the next one.
Most of those you luckily don't hear about anymore, all it results in is the upstream driver being 1-2 years late or so. But even the good ones where we collaborate well can't seem to help themselves and want to throw it all away every few years.
-Daniel