Hi there!
I unpacked our minimal release image and ran an xdiskusage on it, mostly to see what we're shipping -- and I was surprised to see that a fourth of the image is actually apt package caches and lists. Can we put into the image generation script something to strip them out before generating the image?
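For illustration, roughly the kind of cleanup step that could go into the image generation script, assuming the rootfs is unpacked under ./rootfs just before it is tarred up (only a sketch; the paths are the standard apt ones):

    # Hypothetical cleanup pass run on the unpacked rootfs before packing the image
    rm -rf rootfs/var/lib/apt/lists/*           # package indexes; apt-get update recreates them
    rm -f  rootfs/var/cache/apt/archives/*.deb  # any cached .deb files
    rm -f  rootfs/var/cache/apt/pkgcache.bin rootfs/var/cache/apt/srcpkgcache.bin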
The untarring also suggests a number of places where we could further trim the image, some of which are probably pretty hard to do:
* stripping /usr/share/doc out (but everybody knew that)
* dropping charmaps, zones and locale info that will never really be used
* stripping out modules for devices that won't ever be on this ARM device
* stripping out firmware for peripherals that won't ever be on this ARM device
I've included the xdiskusage output so you can see what I mean. And good job on the release!
Hi,
On Fri, Aug 6, 2010 at 3:28 AM, Christian Robottom Reis kiko@linaro.org wrote:
Hi there!
I unpacked our minimal release image and ran an xdiskusage on it, mostly to see what we're shipping -- and I was surprised to see that a fourth of the image is actually apt package caches and lists. Can we put into the image generation script something to strip them out before generating the image?
if there are really .deb's shipped in the tarball then this is definitely waste and a bug.
However, if it's just the lists and pkg cache then I am not so convinced unless we say we remove apt (and dpkg) from our images (e.g. don't allow easy install/upgrade etc.).
Those files would come back when running apt-get update etc., so the only thing we would win is smaller initial download bandwidth, while I think we are really after general/lasting disk footprint savings.
One thing we could do is remove universe from our default apt line. this probably would reduce the size of that directory by > 50% ...
Long term we could have our own archive with less packages ... this could reduce size of those indexes etc. even further.
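For illustration, a hypothetical trimmed sources.list along the lines of the universe suggestion above; the mirror and release name are assumptions, not necessarily what the images ship:

    # Only main is listed, so apt-get update pulls (and stores) far less index data
    deb http://ports.ubuntu.com/ubuntu-ports maverick main
    deb http://ports.ubuntu.com/ubuntu-ports maverick-updates main
    deb http://ports.ubuntu.com/ubuntu-ports maverick-security main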
The untarring also suggests a number of places where we could further trim the image, some of which are probably pretty hard to do:
- stripping /usr/share/doc out (but everybody knew that)
ack. we plan to do that using pitti's dpkg improvements; last time I checked they hadn't landed in the archive yet, but I will check the status again soon.
- dropping charmaps, zones and locale info that will never really be used
I haven't looked into those. if we ship useless stuff we should definitely make it go away. Any volunteers to look at this aspect a bit closer?
- stripping out modules for devices that won't ever be on this ARM device
yeah, this seems to make sense. However, I am not sure how to draw the line. Maybe this is something the kernel WG can take a look at and come up with a reduced list of modules?
* stripping out firmware for peripherals that won't ever be on this ARM device
right. this could be part of the modules effort from above imo.
--
- Alexander
On Fri, Aug 6, 2010 at 9:53 AM, Alexander Sack asac@linaro.org wrote:
Hi,
On Fri, Aug 6, 2010 at 3:28 AM, Christian Robottom Reis kiko@linaro.org wrote:
Hi there!
I unpacked our minimal release image and ran an xdiskusage on it, mostly to see what we're shipping -- and I was surprised to see that a fourth of the image is actually apt package caches and lists. Can we put into the image generation script something to strip them out before generating the image?
if there are really .deb's shipped in the tarball then this is definitely waste and a bug.
However, if it's just the lists and pkg cache then I am not so convinced unless we say we remove apt (and dpkg) from our images (e.g. don't allow easy install/upgrade etc.).
Those files would come back when running apt-get update etc., so the only thing we would win is smaller initial download bandwidth, while I think we are really after general/lasting disk footprint savings.
We could remove these files, but I agree it may be a false optimisation: the size of the release filesystem is no longer representative of the steady-state size of the filesystem when it's in use in this case.
Out of interest, does anyone know why dpkg/apt never migrated from the "massive sequential text file" approach to something more database-oriented? I've often thought that the current system's scalability has been under pressure for a long time, and that there is potential for substantial improvements in footprint and performance - though the Debian and Ubuntu communities would need to give their support for such an approach, unless we wanted to switch to a different packaging system.
One thing we could do is remove universe from our default apt line. this probably would reduce the size of that directory by > 50% ...
Long term we could have our own archive with less packages ... this could reduce size of those indexes etc. even further.
The untarring also suggests a number of places where we could further trim the image, some of which are probably pretty hard to do:
* stripping /usr/share/doc out (but everybody knew that)
ack. we plan to do that using pitti's dpkg improvements; last time I checked they hadn't landed in the archive yet, but I will check the status again soon.
It's interesting to note that due to the fact that /usr/share/doc contains mostly nearly-empty directories and tiny files, the filesystem overhead may be a significant part of the overall consumption here - I estimate about 20-30% of the overall space, assuming a typical filesystem with 4KB blocksize.
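One quick way to sanity-check that estimate on an unpacked image is to compare allocated size against apparent size with GNU du (a sketch; paths relative to the unpacked rootfs):

    du -sh usr/share/doc                   # allocated blocks, including per-file rounding
    du -sh --apparent-size usr/share/doc   # sum of the actual file sizes
    # the difference between the two is roughly the filesystem overhead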
If we have to keep /usr/share/doc/ (for copyright notices and so on), maybe it would be feasible to replace each /usr/share/doc/<package>/ with a tarball? This would eliminate most of the overhead as well as making the actual data smaller. Since /usr/share/doc/ is not accessed often, and not accessed by many automated tools, this might not cause much disruption.
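Roughly what I have in mind, as a one-off conversion pass over an unpacked rootfs (purely a sketch; symlinked doc directories and dpkg's file lists would need proper handling):

    cd usr/share/doc
    for pkg in */; do
        pkg=${pkg%/}
        # collapse each package's many tiny files into one uncompressed tarball
        tar -cf "$pkg.tar" "$pkg" && rm -rf "$pkg"
    done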
[...]
* stripping out modules for devices that won't ever be on this ARM device
yeah, this seems to make sense. However, I am not sure how to draw the line. Maybe this is something the kernel WG can take a look at and come up with a reduced list of modules?
Classifying drivers by bus, and throwing out anything that can't be physically connected (such as PCI/AGP/ISA), might be an approach here. Also, peripherals which can only be connected to on-SoC buses, but are not present in a given platform's silicon, could be excluded. We would still have to keep a lot though... anything which can be connected via USB, for example.
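As a rough illustration of classifying by bus, something like this could produce a first candidate list (a sketch; the kernel version is a placeholder, and "has a PCI alias" is only one possible way to draw the line):

    KVER=2.6.35-1000-linaro    # placeholder kernel version
    # modules that declare PCI device aliases; on a board with no PCI/AGP/ISA
    # most of these are candidates for dropping from the image
    find "/lib/modules/$KVER/kernel/drivers" -name '*.ko' | while read -r mod; do
        modinfo "$mod" | grep -q '^alias: *pci:' && echo "$mod"
    done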
A more ambitious solution might be to allow for dynamic installation of missing modules, but that's probably a separate project since it would impact on the way the kernel is packaged.
Currently we have no choice but to install absolutely everything "just in case" (much like the way /dev used to contain 1000s of device nodes that were never used).
Cheers ---Dave
On Fri, Aug 6, 2010 at 11:57 AM, Dave Martin dave.martin@linaro.org wrote:
On Fri, Aug 6, 2010 at 9:53 AM, Alexander Sack asac@linaro.org wrote:
Hi,
On Fri, Aug 6, 2010 at 3:28 AM, Christian Robottom Reis <kiko@linaro.org> wrote:
Hi there!
I unpacked our minimal release image and ran an xdiskusage on it, mostly to see what we're shipping -- and I was surprised to see that a fourth of the image is actually apt package caches and lists. Can we put into the image generation script something to strip them out before generating the image?
if there are really .deb's shipped in the tarball then this is definitely waste and a bug.
However, if it's just the lists and pkg cache then I am not so convinced unless we say we remove apt (and dpkg) from our images (e.g. don't allow easy install/upgrade etc.).
Those files would come back when running apt-get update etc., so the only thing we would win is smaller initial download bandwidth, while I think we are really after general/lasting disk footprint savings.
We could remove these files, but I agree it may be a false optimisation: the size of the release filesystem is no longer representative of the steady-state size of the filesystem when it's in use in this case.
ack. this was my line of thinking .... thanks for confirming.
Out of interest, does anyone know why dpkg/apt never migrated from the "massive sequential text file" approach to something more database-oriented? I've often thought that the current system's scalability has been under pressure for a long time, and that there is potential for substantial improvements in footprint and performance - though the Debian and Ubuntu communities would need to give their support for such an approach, unless we wanted to switch to a different packaging system.
CCed mvo who can probably give some background here. From what I understand it's a long standing wishlist bug with potential to break the world :-P
On Fri, Aug 6, 2010 at 11:57 AM, Dave Martin dave.martin@linaro.org wrote:
On Fri, Aug 6, 2010 at 9:53 AM, Alexander Sack asac@linaro.org wrote:
- stripping /usr/share/doc out (but everybody knew that)
ack. we plan to do that using pitti's dpkg improvements; last time I checked they hadn't landed in the archive yet, but I will check the status again soon.
It's interesting to note that due to the fact that /usr/share/doc contains mostly nearly-empty directories and tiny files, the filesystem overhead may be a significant part of the overall consumption here - I estimate about 20-30% of the overall space, assuming a typical filesystem with 4KB blocksize.
If we have to keep /usr/share/doc/ (for copyright notices and so on), maybe it would be feasible to replace each /usr/share/doc/<package>/ with a tarball? This would eliminate most of the overhead as well as making the actual data smaller. Since /usr/share/doc/ is not accessed often, and not accessed by many automated tools, this might not cause much disruption.
CCed Martin who probably has thought about this copyright/space dilemma while implementing the dpkg goody I mentioned above.
[...]
- stripping out modules for devices that won't ever be on this ARM device
yeah, this seems to make sense. However, I am not sure how to draw the line. Maybe this is something the kernel WG can take a look at and come up with a reduced list of modules?
Classifying drivers by bus, and throwing out anything that can't be physically connected (such as PCI/AGP/ISA), might be an approach here. Also, peripherals which can only be connected to on-SoC buses, but are not present in a given platform's silicon, could be excluded. We would still have to keep a lot though... anything which can be connected via USB, for example.
Right. I had something like that in mind too and I think it's the safe way to go for our "standard" linaro kernels. I wonder if we need a process/policy that says what modules are needed to become an official "linaro" BSP kernel :).
Also, once we have support for hardware support packs, vendors or people who want to create kernels etc. with even fewer modules could still do that on their own.
A more ambitious solution might be to allow for dynamic installation of missing modules, but that's probably a separate project since it would impact on the way the kernel is packaged.
Personal feeling on this is that we shouldn't go that far in linaro. But I would like to hear other opinions on this.
On Fri, 2010-08-06 at 12:05 +0200, Alexander Sack wrote:
Out of interest, does anyone know why dpkg/apt never migrated from the "massive sequential text file" approach to something more database-oriented? I've often thought that the current system's scalability has been under pressure for a long time, and that there is potential for substantial improvements in footprint and performance - though the Debian and Ubuntu communities would need to give their support for such an approach, unless we wanted to switch to a different packaging system.
CCed mvo who can probably give some background here. From what I understand it's a long standing wishlist bug with potential to break the world :-P
Michael, if anyone would like to attempt to fix this I'd like to help with the actual dirty work.
ZK
Hello all,
Alexander Sack [2010-08-06 12:15 +0200]:
If we have to keep /usr/share/doc/ (for copyright notices and so on), maybe it would be feasible to replace each /usr/share/doc/<package>/ with a tarball? This would eliminate most of the overhead as well as making the actual data smaller. Since /usr/share/doc/ is not accessed often, and not accessed by many automated tools, this might not cause much disruption.
CCed Martin who probably has thought about this copyright/space dilemma while implementing the dpkg goody I mentioned above.
Replacing the /usr/share/doc/ contents with a tarball, and updating this with every package update is certainly not something dpkg itself should do behind your back IMHO (or even could). It would also mean quite a lot of overhead whenever you touch any package in the system.
For the project I implemented this for, it was good enough (i.e. it saved us some 40 MB), and we don't care about the couple of bytes of extra directory overhead.
However, dpkg does support hooks for things like that. If you want to do this, you might put a --post-invoke hook somewhere in /etc/dpkg/dpkg.cfg.d/ ?
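A sketch of what that might look like; the file name and helper script are made up, and the exact dpkg.cfg.d syntax would need checking against the dpkg that ships the invoke hooks:

    # /etc/dpkg/dpkg.cfg.d/90repack-docs  (hypothetical)
    # options here are read like command-line options without the leading dashes;
    # the helper would re-pack freshly unpacked /usr/share/doc/<package>/ directories
    post-invoke=/usr/local/sbin/repack-usr-share-doc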
Martin
On Fri, Aug 6, 2010 at 1:16 PM, Martin Pitt martin.pitt@ubuntu.com wrote:
Hello all,
Alexander Sack [2010-08-06 12:15 +0200]:
If we have to keep /usr/share/doc/ (for copyright notices and so on), maybe it would be feasible to replace each /usr/share/doc/<package>/ with a tarball? This would eliminate most of the overhead as well as making the actual data smaller. Since /usr/share/doc/ is not accessed often, and not accessed by many automated tools, this might not cause much disruption.
CCed Martin who probably has thought about this copyright/space dilemma while implementing the dpkg goody I mentioned above.
Replacing the /usr/share/doc/ contents with a tarball, and updating this with every package update is certainly not something dpkg itself should do behind your back IMHO (or even could). It would also mean quite a lot of overhead whenever you touch any package in the system.
Just to clarify my meaning--- I expected to have a tarball per package, not one massive tarball for the whole system... the cost of maintaining the latter would certainly get very unpleasant for people.
[...]
However, dpkg does support hooks for things like that. If you want to do this, you might put a --post-invoke hook somewhere in /etc/dpkg/dpkg.cfg.d/ ?
Interesting; I guess quite a lot of flexibility can be built on that... one question related to this: how does dpkg cope with hook scripts which effectively change a package's file list?
I can foresee problems if a post-unpack hook creates a file dpkg doesn't know about --- the tarball (presumably not in the package's file list) could silently be overwritten when installing another package.
Cheers ---Dave
On Fri, Aug 6, 2010 at 11:16 AM, Zygmunt Bazyli Krynicki zygmunt.krynicki@linaro.org wrote:
On Fri, 2010-08-06 at 12:05 +0200, Alexander Sack wrote:
Out of interest, does anyone know why dpkg/apt never migrated from the "massive sequential text file" approach to something more database-oriented? I've often thought that the current system's scalability has been under pressure for a long time, and that there is potential for substantial improvements in footprint and performance - though the Debian and Ubuntu communities would need to give their support for such an approach, unless we wanted to switch to a different packaging system.
CCed mvo who can probably give some background here. From what I understand it's a long standing wishlist bug with potential to break the world :-P
Michael, if anyone would like to attempt to fix this I'd like to help with the actual dirty work.
ZK
Maybe one to discuss with Debian folks and/or raise at next UDS?
Hello,
Dave Martin [2010-08-06 13:35 +0100]:
Just to clarify my meaning--- I expected to have a tarball per package, not one massive tarball for the whole system... the cost of maintaining the latter would certainly get very unpleasant for people.
Hm, but then you wouldn't save a lot -- you'd have a small tarball instead of a small copyright file, and the only thing you'd save would be the directory entry.
I can foresee problems if a post-unpack hook creates a file dpkg doesn't know about --- the tarball (presumably not in the package's file list) could silently be overwritten when installing another package.
Creating files dpkg doesn't know about is not a problem per se -- but of course dpkg wouldn't know how to remove it on removal or upgrade, so you have to cope with all of those in the hook.
Martin
On Fri, Aug 06, 2010 at 10:57:21AM +0100, Dave Martin wrote:
We could remove these files, but I agree it may be a false optimisation: the size of the release filesystem is no longer representative of the steady-state size of the filesystem when it's in use in this case.
Well, I think that assumes that what you are doing with the image is installing more software through apt, which I agree is an important use case. However, there may be other scenarios where it's not as important.
I guess I do see a motivation for making the initial image download smaller, even if it does imply most users will need to apt-get update anyway, because:
- It improves the first impression users get downloading the package
- It is likely the user will have to apt-get update afterwards anyway, since the archive changes continuously
- The image size is a proxy for "installed system size", and I don't want us to be seen as a fat distribution. If the headless linaro image were 50MB I am sure we'd have many users impressed already at how convenient and flexible a platform we would be providing.
Thanks for the commentary,
On Fri, Aug 06, 2010 at 02:16:00PM +0200, Martin Pitt wrote:
Hello all,
Alexander Sack [2010-08-06 12:15 +0200]:
If we have to keep /usr/share/doc/ (for copyright notices and so on), maybe it would be feasible to replace each /usr/share/doc/<package>/ with a tarball? This would eliminate most of the overhead as well as making the actual data smaller. Since /usr/share/doc/ is not accessed often, and not accessed by many automated tools, this might not cause much disruption.
CCed Martin who probably has thought about this copyright/space dilemma while implementing the dpkg goody I mentioned above.
Replacing the /usr/share/doc/ contents with a tarball, and updating this with every package update is certainly not something dpkg itself should do behind your back IMHO (or even could). It would also mean quite a lot of overhead whenever you touch any package in the system.
By touch I think you mean install, upgrade or remove, and of these I guess upgrade is the more common case; do you think it is?
Would the overhead be significant even if the tarball wasn't compressed? I don't understand enough about tar's concatenate and delete performance to risk a guess.
On Fri, Aug 06, 2010 at 12:05:25PM +0200, Alexander Sack wrote:
On Fri, Aug 6, 2010 at 11:57 AM, Dave Martin dave.martin@linaro.org wrote:
On Fri, Aug 6, 2010 at 9:53 AM, Alexander Sack asac@linaro.org wrote:
On Fri, Aug 6, 2010 at 3:28 AM, Christian Robottom Reis <kiko@linaro.org> wrote:
Hi there!
I unpacked our minimal release image and ran an xdiskusage on it, mostly to see what we're shipping -- and I was surprised to see that a fourth of the image is actually apt package caches and lists. Can we put into the image generation script something to strip them out before generating the image?
if there are really .deb's shipped in the tarball then this is definitely waste and a bug.
However, if it's just the lists and pkg cache then I am not so convinced unless we say we remove apt (and dpkg) from our images (e.g. don't allow easy install/upgrade etc.).
Those files would come back when running apt-get update etc., so the only thing we would win is smaller initial download bandwidth, while I think we are really after general/lasting disk footprint savings.
We could remove these files, but I agree it may be a false optimisation: the size of the release filesystem is no longer representative of the steady-state size of the filesystem when it's in use in this case.
You have the following options to make the on-disk file size smaller:
* keep them compressed on disk (needs apt in maverick), you need to set """ Acquire::GzipIndexes "true"; """ in /etc/apt/apt.conf.d/10keepcompressed
* Create the apt caches in memory only, set """ Dir::Cache::srcpkgcache ""; Dir::Cache::pkgcache ""; """ in /etc/apt/apt.conf.d
Given your hardware targets I think it's best to test how well both solutions perform; enabling the first one should be pretty safe.
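Spelled out as configuration snippets, that would be roughly the following (the file names under /etc/apt/apt.conf.d/ are just a convention):

    # /etc/apt/apt.conf.d/10keepcompressed -- keep the downloaded indexes gzipped on disk
    Acquire::GzipIndexes "true";

    # /etc/apt/apt.conf.d/10nocache -- don't write the binary caches to disk
    Dir::Cache::srcpkgcache "";
    Dir::Cache::pkgcache "";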
Out of interest, does anyone know why dpkg/apt never migrated from the "massive sequential text file" approach to something more database-oriented? I've often thought that the current system's scalability has been under pressure for a long time, and that there is potential for substantial improvements in footprint and performance - though the Debian and Ubuntu communities would need to give their support for such an approach, unless we wanted to switch to a different packaging system.
CCed mvo who can probably give some background here. From what I understand it's a long standing wishlist bug with potential to break the world :-P
There is currently no work on this inside apt. The text file is the "canonical" source, the *pkgcache.bin is a fast mmap representation of this information that is rebuilt if needed.
There are various non-apt approaches out there (as you are probably aware) that use different storage. I think this needs careful evaluation and research; the current mmap format is actually not bad, and in the latest maverick apt we can grow the size dynamically. The upside of the text file approach is simplicity and robustness: on failures a sysadmin can inspect/fix with vi.
Cheers, Michael
P.S. Talking about areas where our packaging system needs improvements, I think the biggest problem for an embedded use-case is maintainer scripts (postinst, preinst, postrm, prerm). They make rollback from failed upgrades impossible (as they can alter the system state in ways outside of dpkg's knowledge/control). Having a more declarative approach (triggers are a great step forward) would certainly be a win.
On Fri, Aug 6, 2010 at 3:15 PM, Michael Vogt mvo@ubuntu.com wrote:
You have the following options to make the on-disk file size smaller:
* keep them compressed on disk (needs apt in maverick), you need to set """ Acquire::GzipIndexes "true"; """ in /etc/apt/apt.conf.d/10keepcompressed
* Create the apt caches in memory only, set """ Dir::Cache::srcpkgcache ""; Dir::Cache::pkgcache ""; """ in /etc/apt/apt.conf.d
Given your hardware targets I think it's best to test how well both solutions perform; enabling the first one should be pretty safe.
Hmmm, interesting, we'll have to play with those :)
Out of interest, does anyone know why dpkg/apt never migrated from the "massive sequential text file" approach to something more database-oriented? I've often thought that the current system's scalability has been under pressure for a long time, and that there is potential for substantial improvements in footprint and performance - though the Debian and Ubuntu communities would need to give their support for such an approach, unless we wanted to switch to a different packaging system.
CCed mvo who can probably give some background here. From what I understand it's a long standing wishlist bug with potential to break the world :-P
There is currently no work on this inside apt. The text file is the "canonical" source, the *pkgcache.bin is a fast mmap representation of this information that is rebuilt if needed.
There are various non-apt approaches out there (as you are probably aware) that use different storage. I think this needs careful evaluation and research; the current mmap format is actually not bad, and in the latest maverick apt we can grow the size dynamically. The upside of the text file approach is simplicity and robustness: on failures a sysadmin can inspect/fix with vi.
I hadn't got the impression that there was anything bad about apt's package cache, but it might be nice to consolidate it with the package lists in some way, since the information contained is essentially the same (I think).
For a while, Debian also distributed package list diffs, which can save a lot of download bandwidth in principle, though it gives the receiving host more work to do. I don't know if it still works that way though; I haven't seen any diffs during list updates for a while.
I guess if someone wanted to work on this, it might make sense to create a new (initially experimental) database backend for apt while continuing to support the existing one. The new backend could allow for an alternative representation of the data and some alternative mechanisms, but contents of the data could remain unchanged. You're right that the text file approach is just fine (and convenient) for desktop PC and server environments in particular, as well as being robust and well-tested, so we certainly wouldn't want to get rid of it or risk breaking it.
[...]
P.S. Talking about areas where our packaging system needs improvements, I think the biggest problem for an embedded use-case is maintainer scripts (postinst, preinst, postrm, prerm). They make rollback from failed upgrades impossible (as they can alter the system state in ways outside of dpkg's knowledge/control). Having a more declarative approach (triggers are a great step forward) would certainly be a win.
Part of the problem is that the maintainer scripts can currently run random helper programs and scripts. This is very flexible, but by definition, dpkg can't understand what every possible program will do. The package maintainer could be given the task of describing to dpkg what will happen, but I'm wondering whether dpkg can realistically enforce it, or whether there's a risk of the description having to become as complex as the helper.
So I guess there's a question of whether we can/should reduce that flexibility, and by how much.
If the flexibility has to be maintained, I wonder whether some sort of filesystem checkpointing approach is possible, so that the changes made by the maintainer scripts go into a staging area, to be "committed" together in a safe way when the packager has finished. Maybe something similar to fakeroot could be used, or a FUSE-based versioned filesystem or similar. Not sure how to commit the changes safely though; you kinda end up having to lock the whole filesystem at least for that phase of the process.
In principle, we have the problem even now that if dpkg runs while something else is modifying files in /var, /etc, it's possible for the configuration to get corrupted. Fortunately, that's a rare occurrence, and generally requires the admin to do something silly.
Cheers ---Dave
On Fri, Aug 6, 2010 at 2:30 PM, Christian Robottom Reis kiko@linaro.org wrote:
By touch I think you mean install, upgrade or remove, and of these I guess upgrade is the more common case; do you think it is?
I think you're right. This hits the end user more than first-time installation - install and remove normally imply an explicit request from the user for something to happen, in which case the user expects to have to wait a bit; upgrades are housekeeping that may happen (and use power and potentially annoy the user) at any time (ish).
Would the overhead be significant even if the tarball wasn't compressed? I don't understand enough about tar's concatenate and delete performance to risk a guess.
tar's default internal blocksize is 512 bytes, so there would still be overhead but it would be less. I don't think tar really supports random access though, since tar files are sequential and monolithic; having many tarballs instead of just one may be better.
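For reference, the operations in question look like this with GNU tar (both only work on uncompressed archives, and the member name is just an example):

    tar --concatenate -f docs.tar extra.tar                   # append the members of extra.tar onto docs.tar
    tar --delete -f docs.tar somepackage/changelog.Debian.gz  # remove one member, shifting everything after it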
---Dave
On Fri, Aug 06, 2010 at 05:38:56PM +0100, Dave Martin wrote:
Would the overhead be significant even if the tarball wasn't compressed? I don't understand enough about tar's concatenate and delete performance to risk a guess.
tar's default internal blocksize is 512 bytes, so there would still be overhead but it would be less. I don't think tar really supports random access though, since tar files are sequential and monolithic; having many tarballs instead of just one may be better.
I don't know for sure if it does, but man tar does show --concatenate and --delete options so I think there's something there. Whether or not that's fast is another matter <wink>
Christian Robottom Reis [2010-08-06 10:30 -0300]:
By touch I think you mean install, upgrade or remove, and of these I guess upgrade is the more common case; do you think it is?
install and upgrade are the common ones, right.
Would the overhead be significant even if the tarball wasn't compressed? I don't understand enough about tar's concatenate and delete performance to risk a guess.
As far as I know you can append files; I'm not sure about "inline" deletion, but even if it did exist, then over time (i.e. upgrades) you would dramatically fragment the file, which reduces or even negates the initial space saving.
Martin
On Sun, Aug 8, 2010 at 6:50 PM, Martin Pitt martin.pitt@ubuntu.com wrote:
[...]
As far as I know you can append files; I'm not sure about "inline" deletion, but even if it did exist, then over time (i.e. upgrades) you would dramatically fragment the file, which reduces or even negates the initial space saving.
Fortunately fragmentation is not a problem: tar --delete squashes the deleted entry out of the file by rewriting the entire file contents from the point where the deletion occurred ;) Of course, that could be a bit slow, especially if you delete something near the start of the archive...
[...]
On Fri, Aug 6, 2010 at 7:50 PM, Loïc Minier loic.minier@linaro.org wrote: [...]
Yup, Martin Pitt worked on some APT patches to allow keeping these compressed in the local disk.
Out of interest, since these indexes are designed to be used via mmap, do we need to decompress the files when running apt? If so, that suggests that the transient disk footprint actually increases, since we must store both the compressed and decompressed indexes (in addition to any package files apt then downloads). Alternatively apt could stream the data in and store it in memory, though that would raise the ram/swap footprint significantly compared with mmap.
Cheers ---Dave
Hello Dave,
Dave Martin [2010-08-09 9:48 +0100]:
Fortunately fragmentation is not a problem: tar --delete squashes the deleted entry out of the file by rewriting the entire file contents from the point where the deletion occurred ;) Of course, that could be a bit slow, especially if you delete something near the start of the archive...
Ah, right. I take it that rules out having one big tarball for /usr/share/doc/ then?
On Fri, Aug 6, 2010 at 7:50 PM, Loïc Minier loic.minier@linaro.org wrote: [...]
Yup, Martin Pitt worked on some APT patches to allow keeping these compressed in the local disk.
Out of interest, since these indexes are designed to be used via mmap,
They aren't. apt only read()s /var/lib/apt/lists/*. It only mmaps some dpkg files (/var/lib/dpkg/status, I think) and its internal cache (/var/cache/apt/pkgcache.bin).
do we need to decompress the files when running apt?
Only dynamically in memory when building pkgcache.bin. But that doesn't change the RAM footprint, since without compression it would also read the entire file into RAM while doing that.
Martin
On Mon, Aug 9, 2010 at 10:36 AM, Martin Pitt martin.pitt@ubuntu.com wrote:
Hello Dave,
[...]
Ah, right. I take it that rules out having one big tarball for /usr/share/doc/ then?
Hence my thought of having a tarball per package. I'm still a bit worried about a post-unpack hook changing the package files to be different from dpkg's file list for the package -- that could cause safety problems. Is there a way to tell dpkg that the hook changed the files? (May just be my ignorance)
[...]
Only dynamically in memory when building pkgcache.bin. But that doesn't change the RAM footprint, since without compression it would also read the entire file into RAM while doing that.
Ah, I misunderstood. I was thinking about compressing {,src}pkgcache.bin instead of the package lists... Ignore my comments on that ;)
Cheers ---Dave
Hello Dave,
Dave Martin [2010-08-09 10:47 +0100]:
Hence my thought of having a tarball per package. I'm still a bit worried about a post-unpack hook changing the package files to be different from dpkg's file list for the package -- that could cause safety problems. Is there a way to tell dpkg that the hook changed the files? (May just be my ignorance)
If the hook updates /var/lib/dpkg/info/<package>.{list,md5sum} accordingly, it should be alright.
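So, in principle, a per-package hook could do something like this (very much a sketch: the package name is a placeholder, and error handling and the md5sum file update are omitted):

    pkg=somepackage                       # placeholder
    docdir=/usr/share/doc/$pkg
    list=/var/lib/dpkg/info/$pkg.list

    # re-pack the doc directory, then keep dpkg's file list in sync:
    # drop the old /usr/share/doc/<pkg> entries and record the tarball instead
    tar -cf "$docdir.tar" -C /usr/share/doc "$pkg" && rm -rf "$docdir"
    grep -v "^$docdir" "$list" > "$list.new"
    echo "$docdir.tar" >> "$list.new"
    mv "$list.new" "$list"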
Ah, I misunderstood. I was thinking about compressing {,src}pkgcache.bin instead of the package lists... Ignore my comments on that ;)
srcpkgcache.bin can be dropped entirely. pkgcache.bin can be dropped for most operations [1], but synaptic doesn't work without one [2]. So pkgcache.bin needs to stay as it is for the time being.
Thanks,
Martin
[1] https://wiki.ubuntu.com/ReducingDiskFootprint#Disable%20apt%20caches [2] https://bugs.launchpad.net/ubuntu/+source/synaptic/+bug/596898
+++ Christian Robottom Reis [2010-08-05 22:28 -0300]:
Hi there!
I unpacked our minimal release image and ran an xdiskusage on it, mostly to see what we're shipping -- and I was surprised to see that a fourth of the image is actually apt package caches and lists.
This is typical for a Debian-based minimal system.
Emdebian has spent some time developing tools for minimising the installed size of Debian-compatible images, so I can make a few relevant comments.
Can we put into the image generation script something to strip them out before generating the image?
We could. The tradeoff is having to download them again on first use of apt on the target system vs a smaller installed system until that is done. In cases where that is 'never' then it's a big win.
Making sure that only repos that are actually needed on the target are listed can help. Does it need src repos? Does it need universe/multiverse? Leaving those out makes a huge difference.
I assume there are no .debs in the apt cache? debootstrap-based installers leave all the .debs in because they are needed for second-stage configuration, but I assume we've done the second-staging by some means or other. (multistrap-based image creation does not need the .debs for 'second-stage', so this issue does not arise.)
The untarring also suggests a number of places where we could further trim the image, some of which are probably pretty hard to do:
- stripping /usr/share/doc out (but everybody knew that)
- dropping charmaps, zones and locale info that will never really be used
- stripping out modules for devices that won't ever be on this ARM device
- stripping out firmware for peripherals that won't ever be on this ARM device
This is pretty close to what emdebian grip does - i.e. the set of easy wins which approximately halves your base image size without making any binary-incompatible changes or rebuilding anything. (although emdebian doesn't do anything about kernels - we've left that as out-of-scope)
We could use the em-grip tool (or a variant) to repackage our debs to make smaller images. However the result is not policy-compliant 'ubuntu', but a new repository of packages containing the exact same binaries, but less bloat, on top of which you can install any normal ubuntu packages which have not had this treatment. That may or may not be how we want to proceed? It is a sane and effective way to manage this sort of thing (it is currently trivial to crossgrade Debian to emdebian-grip and save a load of space, or to use the installer to install grip instead of normal Debian). We could pull the same trick for Ubuntu with relatively little effort.
Grip does the following things to compatibly save space:
* reduce all long descriptions to 4 lines in Packages files (makes them approx half the size)
* strip other fields that aren't actually needed (including 'recommends')
* strip all docs, examples, manpages, just leaving copyright files
* set dpkg-vendor so that it can be used to do different stuff in maintainer scripts (or on rebuilds)
* restrict overall archive size to keep apt metadata size down
* remove lintian files, help files
* don't require everything 'essential' so a smaller minimal system can be specified
* split translations out into '.tdebs' - one per lang per package, in a separate pool, i.e. not like the ubuntu or proposed Debian schemes
* tzdata is one thing we left alone in grip, although it would be really good to slim it down a bit. In crush it was shrunk to ~1/3rd the size (2.4MB) by removing the 'right' and 'posix' copies.
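For the tzdata case, the crush-style trim boils down to something like this on the target filesystem, assuming nothing on the image needs the leap-second ('right') or 'posix' zone variants:

    # the default zoneinfo files remain; only the duplicate copies go
    rm -rf /usr/share/zoneinfo/right /usr/share/zoneinfo/posix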
Of course another way to achieve much the same effect is to use dpkg filtering at install time to do the same sorts of stripping. This means you can leave the package files exactly as they were (and downloads don't get any smaller, only final images). That has been implemented as proof of concept a few years back, but there are some complicated issues about what happens on future upgrades/removals and exactly how dpkg should deal with operations on installed-but-filtered files.
If we want to make smaller images we should certainly look at re-using some of the emdebian technology and/or mechanisms as it already works well.
Wookey
On Wed, Aug 11, 2010 at 05:32:24PM +0100, Wookey wrote:
Making sure that only repos that are actually needed on the target are listed can help. Does it need src repos? Does it need universe/multiverse? leaving those out makes a huge difference.
atm, we don't have deb-src lines from what I know. However, we have universe, which I suggested dropping in one of the mails above to save > 50% of that space.
I assume there are no .debs in the apt cache? debootstrap-based installers leave all the .debs in because they are needed for second-stage configuration, but I assume we've done the second-staging by some means or other. (multistrap-based image creation does not need the .debs for 'second-stage', so this issue does not arise.)
right, we don't ship .deb files in the rootfs. If we did, it would be a bug for sure.
We could use the em-grip tool (or a variant) to repackage our debs to make smaller images. However the result is not policy-compliant 'ubuntu', but a new repository of packages containing the exact same binaries, but less bloat, on top of which you can install any normal ubuntu packages which have not had this treatment. That may or may not be how we want to proceed? It is a sane and effective way to manage this sort of thing (it is currently trivial to crossgrade Debian to emdebian-grip and save a load of space, or to use the installer to install grip instead of normal Debian). We could pull the same trick for Ubuntu with relatively little effort.
A prerequisite for this is to have derived archives first. So let's add em-grip to our idea list to evaluate for next cycle (assuming we have derived archives by then). Maybe also talk to Scott so they can keep doors open for such a feature when implementing that.
If we want to make smaller images we should certainly look at re-using some of the emdebian technology and/or mechanisms as it already works well.
agreed; seems like you just volunteered to push our on-disk-footprint spec this cycle? ;-P
- Alexander