Hello Loic,
(cc:ing list)
On Thu, 8 Sep 2011 13:26:57 +0200 Loïc Minier <loic.minier@linaro.org> wrote:
On Thu, Sep 08, 2011, Paul Sokolovsky wrote:
Repo is a bad tool for mirroring. We came to that conclusion, as other folks did before us. So the android-build repo mirror is waiting for its rewrite, left in peace for now since it "mostly works". But for the upstream mirror for Gerrit, I implemented it via previously queued ideas of a "proper" git mirror. It's not deployed in production mode yet - the kernel.org downtime struck right in the middle of it.
The Gerrit upstream mirror is essentially a loop over the existing working repository tree in the FS, and a git pull/push of each repository with suitable ref params (I tried --mirror first, but that doesn't work well with Gerrit). The devel code is here: https://code.launchpad.net/~linaro-infrastructure/linaro-android-gerrit-supp...
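A minimal sketch of such a loop, not the actual code from the branch linked above. The remote names "upstream" and "gerrit", the branch list, and the helper names are all hypothetical, and the sketch uses git fetch plus git push with explicit refspecs rather than git pull:

```python
import os
import subprocess


def find_git_repos(root):
    """Yield paths of git repositories (bare or with a working copy) under root."""
    for dirpath, dirnames, filenames in os.walk(root):
        is_bare = "HEAD" in filenames and "objects" in dirnames and "refs" in dirnames
        has_worktree = ".git" in dirnames
        if is_bare or has_worktree:
            yield dirpath
            dirnames[:] = []  # do not descend into the repository itself


def sync_commands(repo_path, upstream="upstream", gerrit="gerrit", branches=("master",)):
    """Build the fetch and push commands for one repository.

    The refspecs have no leading "+", so only fast-forward updates are accepted.
    """
    fetch = ["git", "-C", repo_path, "fetch", upstream]
    push = ["git", "-C", repo_path, "push", gerrit]
    for branch in branches:
        fetch.append(f"refs/heads/{branch}:refs/remotes/{upstream}/{branch}")
        push.append(f"refs/remotes/{upstream}/{branch}:refs/heads/{branch}")
    return [fetch, push]


def sync_tree(root, run=subprocess.check_call, **kwargs):
    """Run the fetch/push pair for every repository found under root."""
    for repo in sorted(find_git_repos(root)):
        for command in sync_commands(repo, **kwargs):
            run(command)
```

Passing a different `run` callable (for example `print`) gives a dry run that only shows the commands.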
It seems there is an important new problem with the use case of mirroring for Gerrit: detecting new projects and removed projects as to provision/unprovision in Gerrit (gerrit create-project). I guess we're not too strict about removing projects/cleaning up git repos removed from manifests right now.
Yes. Actually there are 2 problems: 1) detecting new upstream projects (there's a script for that, but it's not expected to be run automatically so far); 2) properly migrating our own components to Gerrit - so far we just leave old repos in place to keep old releases fetchable, but I expect this to lead to a mess. Btw, projects in Gerrit are not deletable in the normal way ;-). (One could hack on the DB level, yeah.)
Do I understand correctly that we have a mirror AND the Gerrit copy of the repos? (/srv/git.linaro.org/git/android + /mnt/gerrit-mirror respectively)
Yes, we have Gerrit-served master repositories, plus we have a working-copy checkout to perform merges with upstream. Well, those merges are not really merges, but fast-forwards of a subset of upstream branches. It would be cool if git could do a fast-forward from repository to repository w/o a working copy, but of course that wouldn't work in the general case of merges and conflicts.
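For the pure fast-forward case, one possible approach (an assumption, not something tried in this thread) is a fetch into a bare repository with branch refspecs that have no leading "+": git then refuses any update under refs/heads/ that is not a fast-forward. The function names, paths and URL below are hypothetical:

```python
def ff_only_refspec(branch):
    """Refspec without a leading '+': git accepts only fast-forward updates."""
    return f"refs/heads/{branch}:refs/heads/{branch}"


def ff_fetch_command(bare_repo, upstream_url, branches):
    """Build a git fetch command that fast-forwards branches in a bare repo."""
    refspecs = [ff_only_refspec(branch) for branch in branches]
    return ["git", "--git-dir", bare_repo, "fetch", upstream_url] + refspecs


# Hypothetical paths and URL, for illustration only.
print(ff_fetch_command("/srv/mirror/platform/build.git",
                       "git://upstream.example.org/platform/build.git",
                       ["master"]))
```

This covers only branches that can be fast-forwarded; real merges and conflicts would still need a working copy.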
Also, a general choice with a generic mirroring service is whether we try sharing the mirrored data effectively. Say we want to mirror Cyanogenmod manifests, or if someone wanted to mirror Android upstream + Linaro manifests, could we do so in a way which avoids duplicating the contents of repos. This can quickly jump to github/gitorious-level complexity though, but some of it needs to be considered now (like the fact that Linaro has plenty of manifests and they point at 99% of the same data, so a mirror of Linaro with separate storage per-manifest would be unacceptably costly).
I'm not sure I follow exactly. Are you talking about sharing git pack data across repositories, knowing they're close forks of each other? I never heard about that, nor do I think it's worth pursuing, because there's enough complexity already. It can be handled on a specific project level somehow; for example, the Linaro codebase is a proper superset of AOSP, so by fetching the Linaro tree one would have (easily separable) access to both AOSP and Linaro code (technically, not politically).
Another problem I see with our current usage is that the branch names keep changing, for instance linaro_android_2.3.5, and because this is where we get manifests from, it might be tricky hardcoding this into the mirroring service.
A good mirroring service wouldn't be tied to a specific branch, I guess.
I've solved c) by forking repo for myself, packaging it as a .deb and nuking the repo updating pieces, but a) and b) remain issues with the result. I have other concerns with repo:
- quite complex and hairy for what it does in practice
Yeah, I just faced that with the latest "repo patched to fetch all tags" campaign - seemingly small changes broke mirroring and, I suspect, adversely affected performance.
I didn't pull your patch yet, but my intent was to pull exactly that feature; it's the top three commits at: http://android.git.linaro.org/gitweb?p=tools/repo.git;a=shortlog;h=refs/... right?
Yes. Please note that I personally consider it a flawed experiment. It would have its limited usage, but to claim it's a generally suitable solution, it would need a lot of testing. (With all these experiments, I got the feeling that the magic repo does is very well thought out, and keeps acceptable performance on the truly enormous Android codebase at a shaky equilibrium, so changing a seemingly small thing can, for example, halve checkout speed.)
Yes, that's for sure a way to go. Except that after the patched repo experiments I have a suspicion that running repo against a complete git mirror vs repo's partial mirror may increase sync time and I/O complexity (the latter hits in full in the case of parallel fetches).
You're saying we should filter the contents of the git repos exposed in the resulting mirrored tree so that developers and buildds pointing repo at it get good performance, correct?
No, I'm saying that I have an intuitive suspicion that it may affect performance, so we'd better take the need for performance and load tests for this work seriously. And if performance degradation against the reference design (repo) is proven, think about what to do about it.
Do we know in which ways? avoiding useless branches I guess? If we know about all manifests which point at all branches/tags/shas, we can keep just that in the published repos and garbage collect the rest (might need more than git gc).
But generally, I came to the following observation: 1) repo to fetch the codebase for development; 2) repo for mirroring; 3) repo to fetch for a quick build - are all rather different use cases. Usage 1 for sure stays that way forever, but for usages 2 & 3 we may write specialized tools providing better performance.
Very well put!
I agree on 1 and 2, I'm less sure about 3; would we really want to move to something else than repo for builds? It's good when buildd and developer builds look alike when possible.
Well, I don't know. We for sure would want to (try to) improve build checkout time, and all ways should be considered. After all, repo is just a tool which checks out a list of git repositories at specified revisions, so if we find a more optimal way to do that (though likely more limited in other aspects), why not?
Another advantage is that there is now a python-git which can be leveraged rather than wrapping the git commands in python as repo does.
That sounds too complicated, like another "quite complex and hairy for what it does in practice" ;-).
Ah, so you've played with python-git and it sucks?
No, I didn't mean that. I meant that before moving to the more intricate API level that a language binding would allow, it would make sense to try to implement it in terms of git pull/push; if that's not enough, git fetch; and only if that's still not enough, employ even more complex solutions.
Fair enough, but I did find it's a bit of a pity that repo has so much git wrapping logic, not isolated at all, and probably fragile in the face of changes to the host's git and gitconfig.
I guess you know that Google works (worked) slowly on integrating some kind of submodules support into repo (for a purpose not known to me; I assume ditching manifests, but maybe just to support components with it). But the work goes very slowly or has halted.
The whole repo workflow seems so borken to me that I doubt they have a chance fixing it in incremental updates.
Well, I can't tell at once whether repositories with submodules may need additional attention with mirroring. My guess is yes. After all, a submodule is just a symlink to another repo, and submodules are fetched in the *working copy*. And a git mirror is a bare repo.
You're right; I hadn't thought enough about this. It will definitely need more than a couple of commands.
First, we'll run into the same issue that required rewriting manifests: the sources for the commits are separate repos. So we could rewrite ".gitmodules" files, or we could try using "insteadOf" to map original URLs to mirrored URLs. Would be nice if this is easy to consume. Second, indeed we need to do the actual fetching, but it doesn't seem to be *too* hard; perhaps we could even contribute this to git if it's generic enough.
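A rough sketch of the "insteadOf" option, assuming a simple prefix map from original URLs to mirror URLs. The URLs and function names are made up for illustration; the config key form url.<base>.insteadOf is git's own, and when several prefixes match git uses the longest one, which rewrite_url below imitates:

```python
def instead_of_config(url_map):
    """Return `git config` commands mapping original URL prefixes to mirror prefixes."""
    return [
        ["git", "config", "--global", f"url.{mirror}.insteadOf", original]
        for original, mirror in sorted(url_map.items())
    ]


def rewrite_url(url, url_map):
    """Imitate git's insteadOf rewriting: the longest matching prefix wins."""
    matches = [prefix for prefix in url_map if url.startswith(prefix)]
    if not matches:
        return url
    best = max(matches, key=len)
    return url_map[best] + url[len(best):]


# Hypothetical mapping, for illustration only.
URL_MAP = {
    "git://upstream.example.org/": "git://mirror.example.org/upstream/",
    "git://upstream.example.org/kernel/": "git://mirror.example.org/kernel/",
}

for command in instead_of_config(URL_MAP):
    print(" ".join(command))
print(rewrite_url("git://upstream.example.org/kernel/common.git", URL_MAP))
```

With such a config in place, ".gitmodules" files would not need rewriting, since git applies the mapping whenever it resolves a URL.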
Yes, I guess that makes sense.
On Fri, Sep 09, 2011, Paul Sokolovsky wrote:
Also, a general choice with a generic mirroring service is whether we try sharing the mirrored data effectively. Say we want to mirror Cyanogenmod manifests, or if someone wanted to mirror Android upstream + Linaro manifests, could we do so in a way which avoids duplicating the contents of repos. This can quickly jump to github/gitorious-level complexity though, but some of it needs to be considered now (like the fact that Linaro has plenty of manifests and they point at 99% of the same data, so a mirror of Linaro with separate storage per-manifest would be unacceptably costly).
I'm not sure I follow exactly. Are you talking about sharing git pack data across repositories, knowing they're close forks of each other? I never heard about that, nor do I think it's worth pursuing, because there's enough complexity already. It can be handled on a specific project level somehow; for example, the Linaro codebase is a proper superset of AOSP, so by fetching the Linaro tree one would have (easily separable) access to both AOSP and Linaro code (technically, not politically).
Good example of what I meant: I don't think we want to go as far as sharing git pack data across repositories which are close, that's clearly on the too complex side of things. However we also don't want to keep it too simple, like one mirror data set for each manifest.xml that we mirror. Instead, it's likely going to be something like parsing all manifests that we want to mirror, building a map from them of which git repos and branches and tags and shas we want to mirror, then mirroring that in a flat hierarchy, then finding a way to consume the resulting super-mirror.
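A minimal sketch of the "build a map from all manifests" step, assuming the usual repo manifest layout (remote, default and project elements). It ignores relative fetch URLs and the other manifest features, and the function name and sample data are hypothetical:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def collect_refs(manifest_xml_texts):
    """Map each git repository URL to the set of revisions any manifest wants."""
    wanted = defaultdict(set)
    for text in manifest_xml_texts:
        root = ET.fromstring(text)
        remotes = {r.get("name"): r.get("fetch") for r in root.findall("remote")}
        default = root.find("default")
        default_remote = default.get("remote") if default is not None else None
        default_revision = default.get("revision") if default is not None else None
        for project in root.findall("project"):
            remote = project.get("remote", default_remote)
            revision = project.get("revision", default_revision)
            fetch = remotes.get(remote) or ""
            url = fetch.rstrip("/") + "/" + project.get("name")
            wanted[url].add(revision)
    return wanted


# Two made-up manifests pointing at mostly the same data.
MANIFEST_A = """<manifest>
  <remote name="up" fetch="git://upstream.example.org/"/>
  <default remote="up" revision="master"/>
  <project name="platform/build" path="build"/>
  <project name="platform/bionic" path="bionic"/>
</manifest>"""
MANIFEST_B = """<manifest>
  <remote name="up" fetch="git://upstream.example.org/"/>
  <default remote="up" revision="master"/>
  <project name="platform/build" path="build" revision="refs/tags/release-1"/>
</manifest>"""

for url, revisions in sorted(collect_refs([MANIFEST_A, MANIFEST_B]).items()):
    print(url, sorted(revisions))
```

Each repository appears once in the map however many manifests reference it, which is the point of the flat "super-mirror" layout.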
Another problem I see with our current usage is that the branch names keep changing, for instance linaro_android_2.3.5, and because this is where we get manifests from, it might be tricky hardcoding this into the mirroring service.
A good mirroring service wouldn't be tied to a specific branch, I guess.
Yup; just wanted to point out that this is a bit specific to our choice of workflow, and not generic. If we try to create a *reusable* tool to mirror manifests, we want to be careful not to put in too many smarts which are only relevant to Linaro. I didn't really think about that; maybe it's just a matter of giving Linaro an easy way to provision new branch names to the mirroring service, or maybe it should be done with a manifest_branch_name_regexp="^linaro_android_.*" kind of approach.
Yes. Please note that I personally consider it a flawed experiment.
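The regexp idea could look roughly like this; the function name and the sample branch list are made up, and only the pattern comes from the paragraph above:

```python
import re


def select_manifest_branches(branch_names, pattern=r"^linaro_android_.*"):
    """Keep only the manifest branches whose names match the configured regexp."""
    regexp = re.compile(pattern)
    return [name for name in branch_names if regexp.match(name)]


# Hypothetical branch list, for illustration only.
branches = ["master", "linaro_android_2.3.5", "linaro_android_2.3.6", "linaro-other"]
print(select_manifest_branches(branches))
```

New branches matching the pattern would then be picked up without touching the mirroring service's configuration.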
Ok, understood, also thanks to the detailed list of issues you passed; I'll wait and see a bit more on that one.
It would have its limited usage, but to claim it's a generally suitable solution, it would need a lot of testing. (With all these experiments, I got the feeling that the magic repo does is very well thought out, and keeps acceptable performance on the truly enormous Android codebase at a shaky equilibrium, so changing a seemingly small thing can, for example, halve checkout speed.)
[...]
No, I'm saying that I have an intuitive suspicion that it may affect performance, so we'd better take the need for performance and load tests for this work seriously. And if performance degradation against the reference design (repo) is proven, think about what to do about it.
Noted
Well, I don't know. We for sure would want to (try to) improve build checkout time, and all ways should be considered. After all, repo is just a tool which checks out a list of git repositories at specified revisions, so if we find a more optimal way to do that (though likely more limited in other aspects), why not?
Yup; replacing repo with a compatible interface for our limited cases is fine, but it seems important to preserve the possibility of using repo in the workflow.
Ah, so you've played with python-git and it sucks?
No, I didn't mean that. I meant that before moving to the more intricate API level that a language binding would allow, it would make sense to try to implement it in terms of git pull/push; if that's not enough, git fetch; and only if that's still not enough, employ even more complex solutions.
Oh, it just seemed a better idea to me to use python-git instead of wrapping git pull, git clone, git fetch etc.; this wasn't to do specifically advanced things. I was simply hoping this would draw a nice line in the sand in terms of responsibilities and give us a cleaner interface to do git-ish stuff.