On Fri, Aug 6, 2010 at 9:53 AM, Alexander Sack asac@linaro.org wrote:
Hi,
On Fri, Aug 6, 2010 at 3:28 AM, Christian Robottom Reis kiko@linaro.org wrote:
Hi there!
I unpacked our minimal release image and ran an xdiskusage on it, mostly to see what we're shipping -- and I was surprised to see that a fourth of the image is actually apt package caches and lists. Can we put into the image generation script something to strip them out before generating the image?
If there are really .debs shipped in the tarball, then this is definitely waste and a bug.
However, if it's just the lists and pkg cache, then I am not so convinced, unless we say we remove apt (and dpkg) from our images (e.g. don't allow easy install/upgrade etc.).
Those files would come back when running apt-get update etc., so the only thing we would win is a smaller initial download, while I think we are really after general/lasting disk footprint savings.
We could remove these files, but I agree it may be a false optimisation: the size of the release filesystem is no longer representative of the steady-state size of the filesystem when it's in use in this case.
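To put a number on how much of an unpacked image apt accounts for, something like the following rough sketch could be run against the extracted tree. The two directories in APT_DIRS are assumed to be where apt keeps its caches and lists; adjust them if the image is laid out differently.

```python
import os

# Assumed locations of apt's caches and index lists, relative to the
# root of the unpacked image.
APT_DIRS = ("var/cache/apt", "var/lib/apt/lists")

def tree_size(path):
    """Sum of apparent sizes (bytes) of regular files below path."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks so nothing is counted twice or followed out of the tree.
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total

def apt_share(rootfs):
    """Return (apt_bytes, total_bytes) for an unpacked root filesystem."""
    apt_bytes = sum(tree_size(os.path.join(rootfs, d)) for d in APT_DIRS)
    return apt_bytes, tree_size(rootfs)
```

Calling apt_share() on the directory where the release tarball was unpacked gives the two figures needed for the ratio. These are apparent file sizes only, so they say nothing about block-level overhead.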
Out of interest, does anyone know why dpkg/apt never migrated from the "massive sequential text file" approach to something more database-oriented? I've often thought that the current system's scalability has been under pressure for a long time, and that there is potential for substantial improvements in footprint and performance - though the Debian and Ubuntu communities would need to give their support for such an approach, unless we wanted to switch to a different packaging system.
One thing we could do is remove universe from our default apt line. This would probably reduce the size of that directory by > 50% ...
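As an illustration of what dropping universe from the apt line would involve, here is a minimal sketch that rewrites the text of a sources.list. It assumes the plain one-line "deb URI suite component..." format; entries that carry a bracketed options field are left untouched, and the function name is made up for this example.

```python
def drop_component(sources_text, component="universe"):
    """Return sources.list text with one component removed from deb lines."""
    out = []
    for line in sources_text.splitlines():
        fields = line.split()
        is_entry = (
            len(fields) >= 4
            and fields[0] in ("deb", "deb-src")
            and not fields[1].startswith("[")  # options field: not handled here
        )
        if is_entry and component in fields[3:]:
            kept = [c for c in fields[3:] if c != component]
            if kept:
                line = " ".join(fields[:3] + kept)
            else:
                # No component left: comment the entry out rather than
                # leave a malformed line behind.
                line = "# " + line
        out.append(line)
    return "\n".join(out) + "\n"
```

Whether this really halves the size of the lists directory would need to be measured after running apt-get update against the reduced line.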
Long term we could have our own archive with fewer packages ... this could reduce the size of those indexes etc. even further.
The untarring also suggests a number of places where we could further trim the image, some of which are probably pretty hard to do:
* stripping /usr/share/doc out (but everybody knew that)
Ack. We plan to do that using pitti's dpkg improvements; last time I looked they hadn't landed in the archive yet, but I will check the status again soon.
It's interesting to note that because /usr/share/doc contains mostly nearly-empty directories and tiny files, the filesystem overhead may be a significant part of the overall consumption here - I estimate about 20-30% of the overall space, assuming a typical filesystem with a 4KB block size.
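That estimate could be checked with a short script along these lines. It is only a rough model under stated assumptions: every file is rounded up to whole 4KB blocks, every directory is charged one block, and inode tables and other metadata are ignored.

```python
import os

BLOCK = 4096  # assumed filesystem block size, as in the estimate above

def allocated(size, block=BLOCK):
    """Bytes occupied when size is rounded up to whole blocks."""
    return -(-size // block) * block

def doc_overhead(path, block=BLOCK):
    """Return (apparent_bytes, allocated_bytes) for the tree under path."""
    apparent = 0
    on_disk = block  # the top-level directory itself
    for dirpath, dirnames, filenames in os.walk(path):
        # Charge one block per real subdirectory; symlinks are not counted.
        on_disk += block * sum(
            1 for d in dirnames if not os.path.islink(os.path.join(dirpath, d))
        )
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue
            size = os.path.getsize(full)
            apparent += size
            on_disk += allocated(size, block)
    return apparent, on_disk
```

The difference between the two returned values, divided by the allocated figure, is the share of space lost to rounding under this model.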
If we have to keep /usr/share/doc/ (for copyright notices and so on), maybe it would be feasible to replace each /usr/share/doc/<package>/ with a tarball? This would eliminate most of the overhead as well as making the actual data smaller. Since /usr/share/doc/ is not accessed often, and not accessed by many automated tools, this might not cause much disruption.
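A minimal sketch of the tarball idea, using only the Python standard library. The <name>.tar.gz naming and the choice to leave symlinked directories alone are assumptions made for illustration, and the function deletes the original directories, so it should only be tried on a scratch copy of the tree.

```python
import os
import shutil
import tarfile

def pack_doc_dirs(doc_root):
    """Replace each real subdirectory of doc_root with <name>.tar.gz.

    Returns the list of archives written. Symlinked directories and
    plain files are left alone.
    """
    archives = []
    for name in sorted(os.listdir(doc_root)):
        src = os.path.join(doc_root, name)
        if os.path.islink(src) or not os.path.isdir(src):
            continue
        dest = src + ".tar.gz"
        with tarfile.open(dest, "w:gz") as tar:
            tar.add(src, arcname=name)
        # Remove the directory only after the archive is written and closed.
        shutil.rmtree(src)
        archives.append(dest)
    return archives
```

Anything that expects to find a file such as the copyright notice at its usual path would no longer find it, so this only works if nothing on the image depends on those paths.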
[...]
* stripping out modules for devices that won't ever be on this ARM device
Yeah, this seems to make sense. However, I am not sure how to draw the line. Maybe this is something the kernel WG can take a look at and come up with a reduced list of modules?
Classifying drivers by bus, and throwing out anything that can't be physically connected, such as PCI/AGP/ISA might be an approach here. Also, peripherals which can only be connected to on-SoC buses, but are not present in a given platform's silicon could be excluded. We would still have to keep a lot though... anything which can be connected via USB, for example.
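One possible way to approximate the bus classification is to read the alias lines that depmod writes to modules.alias, where device aliases carry a bus prefix such as "pci:" or "usb:". The sketch below is an assumption about how that could be used, not an established tool: it lists modules whose every alias is on a bus declared absent. Modules with no alias lines never appear in the result, so they would still need a manual review.

```python
from collections import defaultdict

def buses_per_module(alias_lines):
    """Map module name -> set of bus prefixes seen in its alias lines."""
    buses = defaultdict(set)
    for line in alias_lines:
        line = line.strip()
        if not line.startswith("alias "):
            continue  # comments, blank lines, anything unexpected
        parts = line[len("alias "):].rsplit(None, 1)
        if len(parts) != 2:
            continue
        pattern, module = parts
        # Aliases without a colon (e.g. char-major-*) have no bus prefix.
        bus = pattern.split(":", 1)[0] if ":" in pattern else "(none)"
        buses[module].add(bus)
    return buses

def droppable(alias_lines, absent_buses):
    """Modules all of whose aliases sit on buses the platform lacks."""
    absent = set(absent_buses)
    return {
        module
        for module, seen in buses_per_module(alias_lines).items()
        if seen <= absent
    }
```

For example, droppable(open(path_to_modules_alias), {"pci"}) would list modules that only ever match PCI devices. Which prefixes are safe to treat as absent on a given board is exactly the judgment call discussed above.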
A more ambitious solution might be to allow for dynamic installation of missing modules, but that's probably a separate project since it would impact on the way the kernel is packaged.
Currently we have no choice but to install absolutely everything "just in case" (much like the way /dev used to contain 1000s of device nodes that were never used).
Cheers ---Dave