Koji multirepo feature

STATUS: unknown

Introduction

Currently, Koji makes a single yum repo for each build tag that contains all the latest rpms of that tag (with inheritance). This feature will open that up, allowing Koji to split that content over multiple repos. This will reduce the repo regen load in Koji, at least if used sensibly.

Folks new to Koji are sometimes surprised that Koji doesn’t simply make one repo for each tag in the inheritance and have the buildroot pull from all of them. There are several reasons we do not do that.

  • This results in a simple union of all that content, and so does not capture the inheritance relationships of the tags.
  • In particular, it prevents overrides from working properly. With individual tag repos, yum/dnf will simply choose the latest NVR for each package, regardless of where in the inheritance it comes from.
  • Build environments are supposed to represent the build tag contents at a particular event id. That means we need each repo in the set to correspond to the same event id. It is difficult to accomplish this without essentially regenerating each repo each time, which destroys all efficiency gains.
  • Very complex inheritance trees are possible in Koji. For some systems a build tag might inherit from many dozens of other tags, making for a very complicated yum/dnf config. Note: this is also a challenge for multirepo.
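
The override problem in the list above can be illustrated with a toy model. This is only a sketch: the tag names, the package versions, and both helper functions are made up for illustration, versions are plain integer tuples rather than real NVRs, and each tag carries at most one build per package.

```python
# Toy model: builds tagged directly into each tag (all names hypothetical).
# Versions are integer tuples; real NVR comparison follows rpm rules.
TAGGED = {
    "f40-build": {},
    "f40-override": {"bash": (5, 1)},  # deliberate override with an older build
    "f40": {"bash": (5, 2), "gcc": (14, 0)},
}

# Inheritance order as seen from the build tag, highest priority first.
INHERITANCE = ["f40-build", "f40-override", "f40"]


def resolve_by_inheritance(order, tagged):
    """The first tag in inheritance order that carries a package wins."""
    result = {}
    for tag in order:
        for name, version in tagged[tag].items():
            result.setdefault(name, (version, tag))
    return result


def resolve_by_union(order, tagged):
    """One repo per tag, all enabled: the highest version wins."""
    result = {}
    for tag in order:
        for name, version in tagged[tag].items():
            if name not in result or version > result[name][0]:
                result[name] = (version, tag)
    return result


by_inheritance = resolve_by_inheritance(INHERITANCE, TAGGED)
by_union = resolve_by_union(INHERITANCE, TAGGED)
print(by_inheritance["bash"])  # the override build
print(by_union["bash"])        # the newer build; the override is ignored
```

In this toy case the inheritance-aware resolution picks the override build of bash, while the plain union picks the newer build from the parent tag, which is exactly the behavior the override was meant to prevent.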

With multirepo, we instead define a way to indicate that a build tag should include the repo for another tag instead of inheriting from it. This means you can still use tag inheritance as before, but you also have the option to combine repos in this way.

Benefits

The key benefit is efficiency. When this feature is used carefully, it will reduce the number of repo regenerations that occur, as well as the time that they take.

Difficulties

This is a fairly significant change to the Koji data model, and it breaks long-held assumptions about Koji behavior. It is no longer sufficient to simply point yum/dnf at the yum repo URL for a tag.

We still want each build environment to come from content at a particular event id. To maintain that and avoid overzealous repo regeneration, Koji will need to know not only the event id that triggered a repo regen, but also the event id span for which that repo is valid.

Approach

We want to preserve the notion that build environment content is determined by the state of the build tag at a given event id.

Furthermore, we want to avoid a client (or user) having to do too much work to figure out which repos to use if they want to replicate a build environment.

Also, we want to avoid unnecessary repo regens.

Dealing with events

Instead of simply tracking the single event id that a repo was regenerated from, we will track the event range that it is valid for.

We will keep the create_event field (perhaps renamed?), but instead of just using the current event, we will use the last change event for the tag content. We will also add new fields.

  • create_time - tracks the actual creation time
  • expire_event - records the first event that changes the tag content (NULL for still valid)

With this data, we will know exactly which range of events a repo is valid for. If we need a repo for a different event in this range than the initial one that prompted repo generation, then we can simply reuse it.
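
The reuse check described above might look roughly like the following. This is a minimal sketch, not the proposed implementation: the RepoRecord class and both functions are hypothetical, and it assumes the validity range is half-open, that is, a repo is valid from create_event up to but not including expire_event.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RepoRecord:
    """Hypothetical in-memory stand-in for a row of the repo table."""
    id: int
    tag: str
    create_event: int                   # last change event of the tag content
    expire_event: Optional[int] = None  # first later change; None = still valid


def repo_valid_at(repo: RepoRecord, event_id: int) -> bool:
    """True if the repo reflects the tag content as of event_id."""
    if event_id < repo.create_event:
        return False
    return repo.expire_event is None or event_id < repo.expire_event


def find_reusable_repo(repos: List[RepoRecord], tag: str,
                       event_id: int) -> Optional[RepoRecord]:
    """Return an existing repo for the tag that covers event_id, if any."""
    for repo in repos:
        if repo.tag == tag and repo_valid_at(repo, event_id):
            return repo
    return None


repos = [
    RepoRecord(id=1, tag="f40-build", create_event=100, expire_event=140),
    RepoRecord(id=2, tag="f40-build", create_event=140),
]
reused = find_reusable_repo(repos, "f40-build", 120)
```

Here a request for event 120 can reuse repo 1 even though that repo was generated for event 100, because nothing changed the tag content between events 100 and 140.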

In order for this to work, we will need to be able to determine precisely which events affected a tag. To do that, we will add a new hub function called tag_last_changed, which will work in a complementary way to the existing tag_changed_since_event function.
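
As a rough illustration of the lookup involved, the sketch below works on a plain sorted list of event ids at which a tag's content changed. It is not the hub function itself: the function names and signatures are assumptions, and a real implementation would have to query the database and account for inheritance.

```python
from bisect import bisect_right
from typing import List, Optional


def last_change_at_or_before(change_events: List[int],
                             event_id: int) -> Optional[int]:
    """Latest change event <= event_id, or None if the tag had not changed yet.

    change_events must be sorted in ascending order.
    """
    idx = bisect_right(change_events, event_id)
    return change_events[idx - 1] if idx else None


def first_change_after(change_events: List[int],
                       event_id: int) -> Optional[int]:
    """Earliest change event > event_id, or None if there is none yet."""
    idx = bisect_right(change_events, event_id)
    return change_events[idx] if idx < len(change_events) else None


# Hypothetical change history for one tag.
changes = [100, 140, 200]
start = last_change_at_or_before(changes, 150)
end = first_change_after(changes, 150)
```

For event 150 this gives the range from 140 to 200, which corresponds to the create_event and expire_event values a repo covering that event would carry.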

Dealing with multiple repos

Currently, we have a one-tag, one-repo system. This is very convenient: users can simply use the latest repo for a tag in their configuration. With the new system, a build tag may draw on multiple repos. This will require code changes in:

  • getMockConfig()
  • anon_handle_mock_config()
  • getRepo()
  • BuildRoot class in kojid

More significantly, the standard_buildroot table references only a single repo id, so we will need to change how this information is stored. Most likely, this will mean moving this data to a separate table so we can support multiple values. E.g.

CREATE TABLE buildroot_repos (
    buildroot_id INTEGER NOT NULL REFERENCES buildroot(id),
    repo_id INTEGER NOT NULL REFERENCES repo (id),
    PRIMARY KEY (buildroot_id, repo_id)
);
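
To show how such a table would be used, here is a small sketch that exercises the same schema in an in-memory SQLite database. The buildroot and repo tables are reduced to bare id columns, and the ids and the function name are invented for the example; Koji's actual schema and database differ.

```python
import sqlite3


def demo_buildroot_repos():
    """Record two repos for one buildroot and read them back."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        -- Minimal stand-ins for the real tables
        CREATE TABLE buildroot (id INTEGER PRIMARY KEY);
        CREATE TABLE repo (id INTEGER PRIMARY KEY);
        CREATE TABLE buildroot_repos (
            buildroot_id INTEGER NOT NULL REFERENCES buildroot(id),
            repo_id INTEGER NOT NULL REFERENCES repo(id),
            PRIMARY KEY (buildroot_id, repo_id)
        );
    """)
    conn.execute("INSERT INTO buildroot (id) VALUES (?)", (1,))
    conn.executemany("INSERT INTO repo (id) VALUES (?)", [(10,), (11,)])
    # One buildroot can now reference any number of repos.
    conn.executemany(
        "INSERT INTO buildroot_repos (buildroot_id, repo_id) VALUES (?, ?)",
        [(1, 10), (1, 11)],
    )
    rows = conn.execute(
        "SELECT repo_id FROM buildroot_repos "
        "WHERE buildroot_id = ? ORDER BY repo_id",
        (1,),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


repo_ids = demo_buildroot_repos()
print(repo_ids)
```

The composite primary key keeps the same repo from being recorded twice for one buildroot, while still allowing a repo to be shared by many buildroots.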

Likewise, the API will have to report these multiple values.