Boost Users :

Subject: Re: [Boost-users] What's happened to Ryppl?
From: Dean Michael Berris (mikhailberis_at_[hidden])
Date: 2011-02-01 04:27:16


On Tue, Feb 1, 2011 at 5:20 AM, Steven Watanabe <watanabesj_at_[hidden]> wrote:
> AMDG
>
> On 1/30/2011 4:35 PM, Dean Michael Berris wrote:
>>
>> With subversion, each commit is a different revision number right?
>> Therefore that means there's only one state of the entire repository
>> including private branches, etc.
>>
>
> Yes, but I don't see how that's relevant.
> How can the repository be in more than one state?
> Now if only we had quantum repositories...
>

That's important because in a DVCS, there's no one repository. Each
and every clone of the original canonical repository out there has its
own state.

However in Git, every commit has its own identity which knows who its
parents are. That means you can then apply many different commits from
many places, conglomerate them into your local repository, and reflect
that tree onto the canonical repo if you're the maintainer. This then
allows everyone else to merge in this tree into their local repo and
therefore you get the distributed and scalable aspect for
multi-developer projects. It's really an important distinction.
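
To make the "every commit has its own identity which knows who its
parents are" point concrete, here is a minimal sketch. It is loosely
modeled on the idea of hashing a commit's content together with its
parents' ids; the field layout and the `commit_id` helper are
assumptions for illustration, not Git's actual object format.

```python
import hashlib


def commit_id(parents, author, message, tree):
    """Derive an id from the commit's content and its parents' ids.

    Hypothetical, simplified layout: not Git's real commit object.
    """
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]
    lines += ["author " + author, "", message]
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()


# A root commit, then the same change recorded in two different clones.
root = commit_id([], "alice", "initial import", "tree-v1")
child_in_my_clone = commit_id([root], "bob", "fix warning", "tree-v2")
child_in_your_clone = commit_id([root], "bob", "fix warning", "tree-v2")

# A different change on top of the same parent gets a different id.
other_child = commit_id([root], "carol", "add test", "tree-v3")
```

Because the id depends only on the content and the parents, the same
commit is recognizable in any clone it ends up in, without a central
server handing out revision numbers.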

>> What then happens when you have two people trying to merge from two
>> different branches into one branch. Can you do this incrementally?
>
> What do you mean by that?  I can merge any subset
> of the changes, so I can split it up if I want to,
> or I can merge everything at once.
>

I mean let's say we have the tree in subversion:

trunk r1
  |---- your branch
  |---- my branch

Suppose I merge in changes from trunk to my branch, you do the same at
a slightly later time, and then we both try to merge back into trunk.
In subversion we would each have to do that in a single commit. With
Git, merging the trees of my branch and your branch is a single
command, and is largely automatic -- if we share commits from trunk in
our branches, Git knows what to do, and the conflicts left over are
mostly just the ones on changes we both made that need resolving.

We do the resolving locally, so even if we keep racing each other to
merge, the canonical repo we're both working on maintains a single
tree.

>> How
>> would you track the single repository's state?
>
> Each commit is guaranteed to be atomic.
>

And so that means the state I have in my working copy is not a unique
state. That means, if I've made a ton of changes locally that I
haven't committed yet and 80% of those changes conflict with changes
in the central repo, I just say "OH FML"? Note that even if I do have
a branch, synchronizing changes back into the source branch will be a
PITA.

>> How do you avoid
>> clobbering each other's incremental merges?
>
> If the merges touch the same files, the
> second person's commit will fail.  This is
> a good thing because /someone/ has to resolve
> the conflict.  Updating and retrying the commit
> will work if the tool can handle the merge
> automatically.  (I personally always re-run
> the tests after updating, to make sure that
> I've tested what will be the new state of
> the branch even if there were no merge conflicts.).
>

We have the same workflow except in Git, I don't have a chance to mess
up the canonical repo if I'm not the maintainer. What happens with Git
is that if I am a co-maintainer of a library's repo with you, then if
I push and the update that happens upstream is not a "fast-forward"
(i.e. one where the upstream tip is already an ancestor of what I'm
pushing, so there is nothing to merge and upstream just moves its
branch pointer) then I have to pull the changes and merge the commits
locally -- this is not the case with Subversion as I have to have
exactly the same revision number in my working copy for me to be able
to commit anything. Note that git works as a series of patches laid
out in a DAG and each commit is unique, meaning each commit can be
transplanted from branch to branch and the identity is maintained even
if you had it in a ton of branches (or repositories).
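
The fast-forward check can be sketched as an ancestry test on the
commit DAG: a push is a fast-forward when the upstream tip is reachable
from the tip being pushed. This is only an illustration under that
assumption; the `parents` mapping and the function names are
hypothetical, not Git internals.

```python
def is_ancestor(parents, ancestor, commit):
    """Return True if `ancestor` is reachable from `commit` via parent links."""
    stack = [commit]
    seen = set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return False


def can_fast_forward(parents, upstream_tip, pushed_tip):
    """A push is a fast-forward when upstream's tip is an ancestor of ours."""
    return is_ancestor(parents, upstream_tip, pushed_tip)


# Hypothetical history: you pushed C on top of B; I committed D on top of B.
parents = {
    "A": [],
    "B": ["A"],
    "C": ["B"],       # your commit, already upstream
    "D": ["B"],       # my local commit
    "M": ["D", "C"],  # my local merge of your commit into mine
}

push_d = can_fast_forward(parents, upstream_tip="C", pushed_tip="D")
push_m = can_fast_forward(parents, upstream_tip="C", pushed_tip="M")
```

Pushing `D` is rejected because it does not contain `C`; once I merge
locally and push `M`, upstream only has to move its pointer.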

>> Remember you're assuming
>> that you're the only one trying to do the merge on the same code in a
>> single repository. Consider the case where you have more than just you
>> merging from different branches into the same branch.
>>
>> In git, merging N remote-tracking branches into a single branch is
>> possible with a single command on a local repo -- if you really wanted
>> to do it that way.
>
> This would require N svn commands.  (Of course
> if I did it a lot I could script it.  It really
> isn't a big deal.).
>

Not only that, it would also require your working copy to be sync'ed
with the repo's state for the branch you want to commit to.
This synchronization is a killer in multi-developer projects touching
the same code base.

>> Of course you already stated that you don't want
>> automated tools so if you *really* wanted to inspect the merge one
>> commit at a time you can actually do it interactively as well.
>>
>
> I didn't say that I didn't want automated tools.
> I said that I didn't trust them.  With svn that
> means that, before I commit I always
> a) run all the relevant tests
> b) review the full diff
>

And that workflow is very much supported in git as well. You can
review full diffs in git as well. And you can commit locally and test
everything locally and when you're satisfied you can then ask the
maintainer to (or if you're the maintainer, just) push the changes up
to the publicly accessible repo. Every other developer then
synchronizes their own local repo by merging in changes from upstream
(the canonical repo) and then stabilizes their local repo at their
own pace.

What's important here is the "at their own pace" part.

> This is regardless of whether I'm committing
> new changes or merging from somewhere else.
>

Agreed. And this is precisely the same workflow that is made a lot
easier by Git.

Not only is it easy, it's crazy fast as well.

>>
>> With subversion what happens is everyone absolutely has to be on the
>> same page all the time and that's a problem.
>>
>
> It isn't a problem unless you're editing the same
> piece of code in parallel.  If you find that you're
> stomping on each other's changes a lot
>
> a) The situation is best avoided to begin with.  The
>   version control tool can only help you so much, no
>   matter how cool it is.  No tool is ever going to
>   be able to resolve true merge conflicts for you.

Right, but in the cases where lots of people are touching the same
code base, a tool that supports this workflow is better suited to
that situation. If we want to keep Boost as a "read, but don't
touch" open source project where only a handful of people get the
privilege of mucking the single repository up, then I guess there's no
point in having a discussion on using Git because subversion is
perfect for that.

If we're in agreement that we don't want more contributors pitching in
to the same codebase then I withdraw the suggestion to use Git and use
it in my projects instead.

> b) Working in branches will buy you about as much as
>   using a DVCS as far as putting off resolving
>   conflicts is concerned.
>
> Honestly, if you assume the worst case, and
> don't use the tool intelligently, you're bound
> to get in trouble.  I'm sure that I could invent
> cases where I get myself in trouble (mis-)using
> git that work fine with svn.
>

I think there's a fundamental impedance mismatch when you compare
developing an open source project with 100 people to having just two
or three people touching the same code.

In the simplest scenario, where fewer than a handful of people are
touching the code, heck, tarballs and exchanging patches should work
just fine -- you're going to resolve conflicts anyway. But if you have
a lot more people doing this then you have a choice: either use a tool
that supports thousands of concurrent developers, or use one that
supports maybe a few tens of developers.

Branches in git are a different beast from branches in subversion.
Basically in subversion, you're copying a snapshot of the
code and making changes on top of that. When merging you take commits
that are made from one branch into another using the repository
version as the identifier for changes made in the code. That's alright
if you only had one repository and just a few developers touching the
code and doing the merging -- now scale that to a hundred people
touching the same code and having one branch each, then you start
seeing how just branches won't cut it. This is true not only for Boost
-- imagine a hundred people working on the containers or algorithms
collections at the same time -- especially if it wants to support a
lot more contributors than it already has.

In Git, branches themselves are basically sub-trees where each and
every sub-branch (a branch from a branch) can be transplanted from one
branch to any other branch. And then you have the distributed nature
of the beast, wherein your branches can be tracking remote branches --
meaning they will be synchronizing with the remote branch's tree. So
there's no "one state" of the whole project except the one that the
developers agree upon is the canonical repo. Then that means anybody
can be working on Boost libraries and porting them to a platform that
nobody else in the current Boost pool of developers has, and making it
stable until such time that they see it fit to contribute changes back
upstream -- this means they didn't need to get anybody's permission to
muck around with Boost to get commit access to the repository so that
they can work on things that matter to them *and keep a record of the
changes locally*. The same goes for people who just want to
maintain local Boost repositories for their own organizations and
would want, for example, to fix all warnings and not have to submit
those changes until they're ready later on.

>> The maintainer can then do the adjustments on the history of the repo
>> -- things like consolidating commits, etc. -- which largely is really
>> what maintainers do,
>
> Is it?  I personally don't want to spend a lot of
> time dealing with version control--and I don't.
> The vast majority of my time is spent writing code
> or reviewing patches or running tests.  All of
> which are largely unaffected by the version control
> tool.
>

Of course in Boost, what happens is that the maintainers are largely
the same people as the developers of the project. Which is odd for an
open source project of the magnitude and importance of Boost.

If you don't want to spend a lot of time dealing with version control
then git is precisely the tool you want. If you spend a couple of
seconds (or maybe a minute) committing things or merging them to the
single Boost subversion repository, then with git you would spend a
fraction of that time (an order of magnitude less).
Benchmarks abound comparing performance of git against subversion in
most of these routine operations showing how git is much more
efficient and better at staying out of your way than subversion is.

>> only with git it's just a lot easier.
>
> It isn't just easier with git, it's basically impossible
> with svn.  In svn, the history is strictly append only.
> (Of course, some including me see this as a good thing...)
>

In publicly-accessible Git repositories, it is encouraged that history
is preserved so that those that clone from it and build upon it see a
"truthful" version of the code. But precisely because you can muck
around with your local commits before submitting patches upstream,
this flexibility allows you to do things like that on your local
repository. Just one of those things that changes the workflow and
allows developers to improve things locally *incrementally* and
synchronize later only when it's necessary.

>>>>> b) Why would I want to try it several different ways?
>>>>>   I always know exactly what I want to merge before
>>>>>   I start.
>>>>
>>>> Which is also the point with git -- because you can choose which
>>>> changesets exactly you want to take from where into your local
>>>> repository. The fact that you *can* do this is a life saver for
>>>> multi-developer projects -- and because it's easy it's something you
>>>> largely don't have to avoid doing.
>>>>
>>>
>>> This doesn't answer the question I asked.
>>>
>>
>> Of course you're looking at the whole thing with centralized VCS in
>> mind. Consider the case that you have multiple remote branches you can
>> pull from. If you're the maintainer and you want to basically
>> consolidate the effort of multiple developers working on different
>> parts of the same system, then you can do this piece-meal.
>>
>> For example, you, Dave Abrahams, and I are working on some extensions
>> to MPL. Let's just say for the sake of example.
>>
>> I can have published changes up on my github fork of the MPL library,
>> and Dave would be the maintainer, and you would have your published
>> changes up on your github fork as well. Now let's say I'm not done yet
>> with what I'm working on but the changes are available already from my
>> fork. Let's say you tell Dave "hey Dave, I'm done, here's a pull
>> request". Dave can then basically do a number of things:
>>
>> 1.) Just merge in what you've done because you're already finished and
>> there's a pull request waiting. He does this on his local repo first
>> to run tests locally -- once he's done with that he can push the
>> changes to the canonical repo.
>>
>> 2.) Pull in my (not yet complete) changes first before he tries to
>> merge your stuff in to see if there's something that I've touched that
>> could potentially break what you've done. In this case Dave can notify
>> you to pull the changes I've already made and see if you can work it
>> out to get things fixed again. Or he can notify me and say "hey fix
>> this!".
>>
>> 3.) Ask me to pull your stuff and ask me to finish up what I'm doing
>> so that I can send a pull request that actually already incorporates
>> your changes when I'm done.
>>
>> ... ad infinitum.
>>
>
> 4.) Dave isn't paying attention, so nothing happens.  A couple
>    years later, after we've both moved on to other things, he
>    notices my changes and decides that they're good and merges
>    them.  ...More time passes...  He sees your changes and
>    they look reasonable, so he tries to merge them.  He gets
>    a merge conflict and then notifies you asking you to update
>    your feature.  You are no longer following Boost development,
>    so the changes get dropped on the floor.  ...A few more years
>    go by...  Another developer finds that he needs your stuff.
>    He resolves the conflicts with the current version and the
>    changes eventually go into the official version.
>
> This is something like how things seem to work in practice now,
> and I don't see how using a different tool is going to change it.
>

And this is so easy to fix with git because then if Dave the
maintainer isn't paying attention, either one of us can ping a release
manager or let everybody know that "hey, we're trying to consolidate
changes here but Dave isn't paying attention!" and thus someone can
pick either one of our repositories as the "canonical" repo for the
library. Of course that promotes either one of us to be the maintainer
-- it's a much more fluid process that is explicitly supported and
encouraged by the git workflow. This is the insurance mechanism and
the "business continuity process" that is built into distributed
version control systems like git, mercurial, bazaar, etc.

>> With subversion, there's no way for something like this to happen with
>> little friction.
>
> Why not?  Replace "github fork" with "branch" and
> subversion supports everything that you've described.
>

If you made your subversion repository publicly accessible without
need for authenticating who the user is to be able to commit changes
then that would be true. Otherwise as it stands at the moment you need
permission to even touch the Boost repository. And this part turns a
lot of people away from wanting to contribute because the other way
around it is to submit a patch in Trac -- which is quite honestly
painful and time consuming as heck.

>> First we can't be working on the same code anyway
>> because every time we try to commit we could be stomping on each
>> other's changes and be spending our time just cursing subversion as we
>> wait for the network traffic and spend most of our time just trying to
>> merge changes when all we want to do is commit our changes so that we
>> can record progress. Second we're going to have to use branches and
>> have "rebasing" done manually anyway just so that we can all stay
>> synchronized all the time --
>
> What do you mean by "rebasing."  Subversion has no
> such concept.  If you want to stay synchronized
> constantly, you can.  If you want to ignore everyone
> else's changes, you can.  If you want to synchronize
> periodically, you can.  If you want to take specific
> changes, you can.  What's the problem?
>

The concept of rebasing is really simple:

1. I branch from trunk revision 1, and make changes until revision 30.
2. Between r1 and r30 some things change in trunk.
3. I want to make my branch up to date with the changes that have been
made in trunk from r1 to r30, so I 're-base' by pulling the changes
from trunk into my branch up to r30.
4. I have to (or subversion has to) remember that I've already merged
in changes up to r30 so the next time I do the same operation, I don't
try to pull in the changes that are already there.
5. When I commit r31, then I have effectively rebased my branch to trunk r30.
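
The bookkeeping in step 4 above can be sketched as follows. This is a
minimal illustration of "remember what has already been merged", with a
hypothetical `BranchMergeTracker` class; it is not how Subversion
itself stores merge history.

```python
class BranchMergeTracker:
    """Remember the last trunk revision merged into a branch (hypothetical)."""

    def __init__(self, branched_at):
        # Everything up to the branch point is already in the branch.
        self.last_merged = branched_at

    def pending(self, trunk_head):
        """Trunk revisions that still need to be pulled into the branch."""
        return list(range(self.last_merged + 1, trunk_head + 1))

    def record_merge(self, trunk_head):
        """Note that trunk has been merged up to and including trunk_head."""
        self.last_merged = trunk_head


tracker = BranchMergeTracker(branched_at=1)
first_sync = tracker.pending(trunk_head=30)    # r2 through r30
tracker.record_merge(trunk_head=30)
repeat_sync = tracker.pending(trunk_head=30)   # nothing left to pull
later_sync = tracker.pending(trunk_head=35)    # only r31 through r35
```

Without that record, the second sync would try to re-apply r2..r30 on
top of a branch that already contains them.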

OTOH with git, we can just be working on our local master tracking the
canonical master, and just keep making changes willy-nilly locally.
When we want to push to the repository only then would we want to
actually merge in changes. That's supported sure, and that's no better
than the subversion approach.

BUT ... with git you and I can work on separate local branches that
fork off from master. We can keep making changes in that branch and
then later on once we're ready to integrate back to master, we do that
locally (we might even squash commits from the local branch so that we
can submit a single big-ass patch to the upstream maintainer). That
doesn't seem enticing at first but then imagine 20 or 100 of us doing
that to the same source code and you'll quickly see why the subversion
approach isn't going to scale and can potentially hold our individual
progress up, not just the progress of the whole project.

>>>>> c) Even if I were merging by trial and error, I
>>>>>   still don't understand what makes a distributed
>>>>>   system so much better.  It doesn't seem like it
>>>>>   should matter.
>>>>>
>>>>
>>>> Because in a distributed system, you can have multiple sources to
>>>> choose from and many different ways of globbing things together.
>>>>
>>>
>>> So, what I'm hearing is the fact that you
>>> have more things to merge makes merging
>>> easier.  But that can't be what you mean,
>>> because it's obviously nonsense.  Come again?
>>>
>>
>> Yes, that's exactly what I mean.
>
> Apparently not, since your answer flips around
> what I said.
>

I meant, with *git* and because merging is almost as painless as
possible most of the time, having more things to merge from makes it
easier. The logic is really simple: if I can pick from more sources of
things to merge in, I can do that all at the same time and if things
fail, back out and exclude a source and see if things go fine. I can
then isolate which things I would merge without having to resolve
manual conflicts, push those to the canonical repo, and as a
maintainer just tell the other sources "hey, synchronize with the
state now and try again". That means it's easier for me as a
maintainer now that I don't have to manually figure everything out, I
can have other sources deal with it for me if they really want to have
their stuff included.

The carrot is that your changes get into Boost, the stick is your pull
request has to apply cleanly.
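
The "merge everything, back out whatever doesn't apply cleanly" loop
can be sketched like this. It is a toy model under loud assumptions:
each source is just a mapping of file path to new content, a conflict
means two sources changed the same file differently, and the names
(`merge_cleanly`, the contributor keys) are made up for the example. A
real merge works on finer-grained changes than whole files.

```python
def merge_cleanly(base, sources):
    """Accept sources in order, skipping any that conflict with accepted ones."""
    merged = dict(base)
    touched = set()  # paths already changed by an accepted source
    accepted, rejected = [], []
    for name, changes in sources.items():
        conflicts = [
            path for path, content in changes.items()
            if path in touched and merged[path] != content
        ]
        if conflicts:
            rejected.append(name)  # tell this source to sync and retry
            continue
        merged.update(changes)
        touched.update(changes)
        accepted.append(name)
    return merged, accepted, rejected


base = {"vector.hpp": "v1", "list.hpp": "v1"}
sources = {
    "dean": {"vector.hpp": "v2-dean"},
    "steven": {"list.hpp": "v2-steven"},
    "dave": {"vector.hpp": "v2-dave"},  # clashes with dean's change
}
merged, accepted, rejected = merge_cleanly(base, sources)
```

The maintainer pushes the accepted result and asks the rejected source
to synchronize with the new state and resubmit.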

>> Because merging is easy with git and
>> is largely an automated process anyway,
>
> If you will recall, the question I started out with
> is: "What about a distributed version control system
> makes merging easier?"  That question remains unanswered.

I've said yes all along here, and that was largely because once
you've tried it and have been in a project where the distributed
thing is actually done, you'd see that having everything be
synchronized is just a waste of time.

> The best I've gotten is "git's automated merge is smart,"
> but it seems to me that this is orthogonal to the fact
> that git is a DVCS.
>

Why? Because it's distributed is precisely why merging is so much easier.

>> merging changes from multiple
>> sources when integrating for example to do a "feature freeze" and
>> "stabilization" by the release engineering group is actually made
>> *fun* and easier than if you had to merge every time you had to commit
>> in an actively changing codebase.
>>
>
> I've never run into this issue.
> a) Boost code in general isn't changing that fast.

Which I suppose is due to:

1. Lack of active contributors.

2. The process of contributing requires all sorts of permissions and
front-loaded work from potential contributors, which means that even
before people start contributing they're turned off by the rigidity
of the process and the toolset, leading to 1 above.

3. See 1 above.

> b) My commits are generally "medium-sized."  i.e.
>   Each commit is a single unit that I consider
>   ready to publish to the world.  For smaller units,
>   I've found that my memory and my editor's undo
>   are good enough.  Now, please don't tell me that
>   I'm thinking like a centralized VCS user.  I know
>   I am, and I don't see a problem with it, when I'm
>   using a centralized VCS.

Now, with a DVCS you don't have to rely on your memory too much or
your editor's undo limit. This also, if I may say so myself, isn't a
scalable way of doing it.

There are local branches for that sort of thing and if you want to
submit a singular patch (a squashed merge into a single commit) then
that's *trivial* to do with git.

> c) There's nothing stopping you from using a branch to
>   avoid this problem.  If you're unwilling to use
>   the means that the tool provides to solve your
>   issue, then the problem is not with the tool.
>

But having to branch in a central repo compared to branching a local
repo is the difference between night in the jungle and day by the
beach, respectively.

>>>
>>> Have you ever heard of branches?  Subversion
>>> does support them, you know.
>>>
>>
>> And have you tried merging in changes from N different branches into
>> your private branch in Subversion to get the latest from other
>> developers working on the same code? Because I have done this with git
>> and it's *trivial*.
>>
>
> I've never wanted to do this, but unless there
> are conflicts, it should work just fine.  If
> there are conflicts, you're going to have to
> resolve them one way or another regardless of
> the version control tool.
>

Unfortunately, that's not as easy as you make it sound with subversion.

Let's take a simple example:

1. I branch off trunk r1
2. Developer B branches off trunk r99
3. Developer C branches off trunk r1000

Now I want to merge changes from Developer B's branch into my branch
so that I can try it out. That's fine because I'll be pulling the
changes from r1 to r99. Now let's take the reverse, Developer C wants
to pull from my branch, what happens? Hell breaks loose because he
doesn't have the history in his branch about the state that was trunk
r1..999.

This kind of thing is what I mean when I say git makes it easy --
because you know the whole history up front, even if two branches were
branched off different points in the tree, there's no problem making
that merge and trying to replay your changes on top of things. Of
course the likelihood that you'll see conflicts depends on the parts
of the code that are being touched, but the fact that *it's possible*
is just powerful.
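
Knowing the whole history up front is what lets a tool find where two
branches diverged, no matter where each was branched from. A minimal
sketch of that lookup, assuming history is a mapping from each commit
to its parents (the commit names and the `merge_bases` helper are
hypothetical):

```python
def ancestors(parents, commit):
    """Return the commit itself plus everything reachable via parent links."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return seen


def merge_bases(parents, a, b):
    """Common ancestors of a and b that are not ancestors of another one."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return {
        c for c in common
        if not any(c != other and c in ancestors(parents, other)
                   for other in common)
    }


# Trunk t1 -> t2 -> t3; my branch forks at t1, Developer C's at t3.
parents = {
    "t1": [],
    "t2": ["t1"],
    "t3": ["t2"],
    "mine": ["t1"],
    "dev_c": ["t3"],
}
base_for_merge = merge_bases(parents, "mine", "dev_c")
```

With the base known, a merge only has to reconcile what each side
changed since that point, in either direction.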

Now scale the above to 10, 20, 50 developers and you'll see why the
centralized model breaks down.

>> Also are you really suggesting that Linux development would work with
>> thousands of developers using subversion to do branches? Do you expect
>> anybody to get anything done in that situation? And no that's not a
>> rhetorical question.
>>
>
> It might overload the server.   That's a legitimate
> concern.  But other than that, I don't see why not.
> (However, since I have nothing to do with Linux
> development, I may be totally wrong.)
>

It's not just that. Nobody would want to be merging anything with
subversion that way. And imagine if everyone asked Linus to do the
merge for them into his branch. That just isn't the scalable way to
go.

Oh, and not to mention the administration nightmare of managing
thousands of usernames and passwords, worrying about backups, the
insane checkouts and switches required, etc.

HTH

-- 
Dean Michael Berris
about.me/deanberris
