Design for a small-scale self-hosted Git service
In many ways this is a sort of wishlist for a self-hosted Git solution that is much more lightweight than the big players (GitLab etc). I love GitLab and will continue to use it to host my important repos + stuff I expect to collaborate on, but there are many features (PRs + issues + very fancy web view) that I just don’t need for, say, private scripts.[1] Even though I am referring to this hypothetical Git service as if I plan to make it, I make no promises that I actually will.[2]
Additionally, I don’t want private scripts under a centralized hosting service out of my control. It’s not that I’m paranoid about having my private scripts on GitLab, but it just doesn’t feel right.[3]
Also, GitLab’s namespaces are quite dry (at least on GitLab SaaS). One particular point of annoyance is that every user has a namespace. I was very bothered that users could not have subgroups, but that really disguised the fundamental problem I had with GitLab users: users should not have a namespace by default. It’s so clunky that signing up users, which are used for access control, also reserves a namespace! What if I’m just using GitLab to help maintain X software, or what if I sign up and never use the service, period? There are a lot of group/organization names (GitLab and GitHub respectively) that I wanted to use that are some person’s username. Said person invariably has made 0 commits in the history of ever, and the namespace is so much drier for that. Boo!
OK, so that calls for a private Git server. But most of the solutions do not have “subgroups” like GitLab, which is a total deal-breaker for me, and GitLab is way overkill for a private Git server that literally only exists to sync files between my desktop and laptop. But at the same time, I have no plan of SSHing into the server and pulling/pushing to repos that way. Particularly, I want to set up API routes so that other people can see some repos I decide to make public, and so they can create/push to new repos as well.[4]
As far as I know there is no Git hosting service that does all of this for you. These are my plans for building one. Tentative name: Glee.
Filesystem and permissions
Overview
Let me tell you why I like GitLab’s subgroup functionality so much: It’s like a filesystem. That’s it, that’s all I want from a Git hosting service. As far as I know only GitLab supplies that: nesting with a depth greater than 1. And that’s great, but the number one complaint I have about GitLab (sub)groups is that permissions are inherited in an opaque manner. Projects inherit permissions from their group somehow: is it on creation? Is it persistent? I have no clue, even as I’m writing this, and I care so little about finding out that I’d rather write my own Git hosting service.
Permissions should be handled according to the following two rules:
- The most specific permission option should be used.
- If no permission is set, it “looks up” for the default option.
Here’s an example. Say that we have project C, and its path is A/B/C (so A and B are directories, C is a repo). Let us say that for some permission P, a directory/repo can either have `yes`, `no`, or `inherit`. So if A has `yes` and B/C are both `inherit`, then C inherits the status from B, which inherits the status from A, which has permission P. Therefore, C also has it. But if A has `yes` and B has `no`, then C inherits from B, which explicitly does not have permission P. Therefore, C does not have permission P.
That’s how permissions should work: keep going up until you reach a directory with the permission explicitly set (i.e. not `inherit`). Of course, if it’s just `inherit` all the way up, there should be a default value. It doesn’t particularly matter what it is, just as long as it’s sensible and made clear.
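To make the lookup rule concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the flat `SETTINGS` mapping, the `resolve` function name, and the choice of `no` as the fallback default are not part of the design above.

```python
# Hypothetical explicit settings for one permission P, keyed by path.
# "inherit" means "ask the parent directory".
SETTINGS = {
    "A": "yes",
    "A/B": "inherit",
    "A/B/C": "inherit",
}

DEFAULT = "no"  # assumed fallback when it is "inherit" all the way up


def resolve(path, settings, default=DEFAULT):
    """Walk from `path` towards the root until a non-"inherit" setting is found."""
    parts = path.split("/")
    while parts:
        value = settings.get("/".join(parts), "inherit")
        if value != "inherit":
            return value  # the most specific explicit setting wins
        parts.pop()  # move up one directory
    return default
```

With the settings above, `resolve("A/B/C", SETTINGS)` walks C, then B, then A and stops at A's explicit `yes`. Changing `A/B` to `no` makes the walk stop one level earlier.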
Users
This is the design for a small-scale Git server, so every user should be trusted. Hosting services like GitHub and GitLab have intricate user permissions which I have used exactly zero times. What I am about to describe does not scale for large enterprises, because it is not supposed to.
There are three levels of permissions: none (i.e. you can perform this action without being logged in; think public repositories), user, and admin. For any particular repository, you can set `view` to `none`, `user`, or `admin`, and you can set `push` to `user` or `admin`.
Obviously users inherit the permissions of all visitors, and admins
inherit the permissions of users.
The reason GitHub/GitLab need access control is that anyone can sign up for an account. Instead, I think it’s better to authenticate each user during signup. Here I think an O(n) cost (n is the number of users) is better than an O(m) cost (m is the number of projects). This is because I think the number of project-side operations will be far greater than the number of users.
There are a number of ways you can deal with verifying user accounts. One way is just by allowing anyone to make an account (as in GitHub, GitLab, or indeed, any popular public-facing website), and only verifying accounts that come from trusted maintainers. This can get kind of annoying because you have to trudge through potential troll/spam/test accounts[5] to verify the one or two new legitimate users.
The only viable solution that I see is requiring admin intervention to create an account. On the mathadvance.org mail server, because we use the Mailcow suite, the admin has to directly make a user account. I kind of hate this line of approach, because it puts the onus on the user to log in to their account and change their password, and if they don’t then sucks to be you. Forced password resets are sort of a bandaid on this, but the consequence of a user not following through and using their account should not be that a garbage account gets created.
So here is my proposed solution. The signup form has these five fields:
Email
Real Name
Password
Confirm Password
Signup Code
All of the fields are self-explanatory except Signup Code. The signup code is a one-use temporary code that an admin generates that expires in, say, 48 hours (which is perfectly reasonable for any actual contributor to sign up in). The idea is that if you want someone to make an account, you give them a signup code. That way, if they don’t follow through, your temporary code expires in 48 hours anyway and there is no harm no foul.
I’m thinking of storing signup codes in `/tmp`, so they get cleaned up, and putting a timestamp along with the code in the file. So something like this:
bqIApG2okZH2NrAJVWKVQkQpvSIwV86L
1646357721036
where the first line is the token and the second line is the Unix timestamp (in milliseconds).
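A rough sketch of how one-use codes with this file layout might be issued and redeemed, in Python. Several details are assumptions rather than decisions made above: each code gets its own file named after the token, the stored timestamp is the creation time (not the expiry), and the function names `create_signup_code` and `redeem_signup_code` are made up for illustration.

```python
import secrets
import string
import time
from pathlib import Path

CODE_TTL_MS = 48 * 60 * 60 * 1000  # 48 hours, as proposed above
ALPHABET = string.ascii_letters + string.digits


def create_signup_code(directory, now_ms=None):
    """Write a new code file: token on line 1, creation time (ms) on line 2."""
    code = "".join(secrets.choice(ALPHABET) for _ in range(32))
    created = int(time.time() * 1000) if now_ms is None else now_ms
    (Path(directory) / code).write_text(f"{code}\n{created}\n")
    return code


def redeem_signup_code(directory, code, now_ms=None):
    """Return True if the code exists and has not expired; the file is consumed."""
    if not code or any(ch not in ALPHABET for ch in code):
        return False  # also keeps path separators out of the filename
    path = Path(directory) / code
    if not path.is_file():
        return False
    token, created = path.read_text().split()
    path.unlink()  # one-use: the file is removed whether or not it was still valid
    now = int(time.time() * 1000) if now_ms is None else now_ms
    return secrets.compare_digest(token, code) and now - int(created) <= CODE_TTL_MS
```

An admin-only route would call `create_signup_code`, and the signup handler would call `redeem_signup_code` before creating the account. The expiry check means the scheme does not depend on `/tmp` being cleaned on any particular schedule.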
Directories
The project will follow the XDG Base Directory specification, so there will be two directories where stuff[6] is stored. We have `$XDG_DATA_HOME/glee` for data generated by interfacing with Glee and `$XDG_CONFIG_HOME/glee` for manually edited config files.
Here is how `$XDG_DATA_HOME/glee` is going to look:
repos/
  test-repo/
    actual_file.txt
    .git/
  dir/
    nested-repo/
      actual_file.txt
      .git/
users/
  dennisc
repo-data.json
redirects.json
Here is what `repo-data.json` contains.[7]
{
  "test-repo": {
    "perms": {
      "view": "user",
      "push": "admin"
    },
    "history": ["dir/test-repo"]
  },
  "dir": {
    "perms": {
      "view": "any",
      "push": "admin"
    },
    "history": [],
    "subpaths": {
      "nested-repo": {
        "perms": {
          "view": "user",
          "push": "default"
        }
      }
    }
  }
}
If an object (like the value for key `dir`) has the field `subpaths`, then it is a directory, and its subpaths are contained in the object value of key `subpaths`. Otherwise, it is not a directory and is a repository. If you want, you can think of the entire JSON object as listing the subpaths of `$XDG_DATA_HOME/glee/repos`.
In other words, if the `subpaths` field doesn’t exist, then the path itself must be a repo, and otherwise it is a directory.
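As an illustration of the directory-versus-repo rule, here is a small Python sketch that walks a structure shaped like `repo-data.json`. The `lookup` and `kind` helpers are hypothetical names, and the data is the example from above with the JSON syntax made strict.

```python
import json

REPO_DATA = json.loads("""
{
  "test-repo": {
    "perms": {"view": "user", "push": "admin"},
    "history": ["dir/test-repo"]
  },
  "dir": {
    "perms": {"view": "any", "push": "admin"},
    "history": [],
    "subpaths": {
      "nested-repo": {"perms": {"view": "user", "push": "default"}}
    }
  }
}
""")


def lookup(data, path):
    """Return the entry for `path`, or None if no such path is recorded."""
    children = data  # the top-level object lists the subpaths of repos/
    node = None
    for part in path.split("/"):
        if children is None or part not in children:
            return None
        node = children[part]
        children = node.get("subpaths")  # None for repositories
    return node


def kind(node):
    """An entry with a `subpaths` field is a directory; anything else is a repo."""
    return "directory" if "subpaths" in node else "repository"
```

Here `kind(lookup(REPO_DATA, "dir"))` is a directory and `kind(lookup(REPO_DATA, "dir/nested-repo"))` is a repository. Looking up a path below a repository, such as `test-repo/x`, yields `None` because a repository has no `subpaths`.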
The `history` array is the prior locations that a particular path was in. If it proves to be too much of an implementation hassle/I decide it isn’t useful (it’s not how the webserver will determine redirects), I will cut it out. In practice I think the hardest thing to do will be to define a simple, intuitive spec around its behavior when moving stuff. Should the history of `nested-repo` be added to if `dir` is moved? I am inclined to say yes.
The `redirects.json` file is for redirecting from old paths to new paths, provided that the old path is not used by something else. Here’s an example that corresponds with the previous one:
{
  "dir/test-repo": "test-repo"
}
Now if you move A to B and then B to C, you have two approaches: A redirects to B, which then redirects to C, or A’s link is updated to go directly to C. The former makes every redirect lookup more costly, while the latter adds cost only on renames. Since redirects will be far more common than renames (we hope), it is better to make renames more expensive: `GET` is more common than `PUT`/`POST`.
This is why a `history` array might be useful: look through the history and edit anything that appears in `redirects.json`. Then again, when moving B to C, you could just look at all key-value pairs with value B and edit them to C. Since this is a small-scale Git hosting service (why would perms be so broad/users require admin auth otherwise?) I don’t envision such a distinction mattering at all, since moving is a fairly rare operation. So here’s where my head’s at: no `history` array; when moving A to B, scan all redirects with value A and edit the value to B. An entry will be deleted if a new repo is created at, or moved to, its old location (here, `dir/test-repo`).
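The scan-and-rewrite approach can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the redirect table is a plain dict shaped like `redirects.json`, the function names `move` and `create` are invented, and only exact paths are handled (moving a directory that contains nested repos would also need prefix matching).

```python
def move(redirects, old, new):
    """Return an updated copy of the redirect table after moving `old` to `new`."""
    updated = {}
    for source, target in redirects.items():
        if source == new:
            continue  # `new` is occupied now, so it can no longer be a redirect
        # Point anything that used to lead to `old` straight at `new`,
        # so a lookup never has to follow a chain of redirects.
        updated[source] = new if target == old else target
    updated[old] = new
    return updated


def create(redirects, path):
    """Return a copy of the table with any redirect away from `path` dropped."""
    updated = dict(redirects)
    updated.pop(path, None)
    return updated
```

Moving A to B and then B to C leaves both A and B pointing directly at C, which is the "renames are expensive, lookups are cheap" trade-off described above.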
Obviously, redirects will respect view permissions (so the old URL will just return “no repo” if it redirects to a private repo, i.e. one you don’t have permission to view).
The `users` directory contains user info, probably username + hashed password + permissions. I don’t think I know enough about Git to say for sure what should be handled by Git/the OS and what should be handled by the program.
As for `$XDG_CONFIG_HOME/glee`, here is what’s going into it:
perms.toml
(Yeah, that’s it for now; I may add more conf files if the need arises, but if we only need one conf file that’s perfect.)
And inside perms.toml:
# The list of all roles besides `any` and `admin`
# The order that they are defined in is how permissions are inherited
# For instance, if `roles = ["user", "mod"]`, then `mod` inherits
# the permissions of `user` since `mod` comes after `user`.
roles = ["user"]
# The default role given to a new account.
signup_role = "user"
# Permissions assigned to the "default" key
[defaults]
view = "admin"
push = "admin"
You will notice that `any` is not in `roles`. This is because `any` literally means anybody, signed in or not. So it’ll be a reserved keyword, something that you can’t put in `roles`. Same for `admin`: there are special permissions only admins have (like granting signup tokens).
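A short Python sketch of how the ordered role list could translate into an access check. It assumes that `any` ranks below every configured role and `admin` above them, and that a `"default"` permission in the repo data resolves to the `[defaults]` table; the names `rank_table` and `allowed` are hypothetical, and the config values are hard-coded rather than parsed from `perms.toml`.

```python
ROLES = ["user"]  # `roles` from perms.toml
DEFAULTS = {"view": "admin", "push": "admin"}  # `[defaults]` from perms.toml


def rank_table(roles):
    """Map each role to a rank: `any` lowest, configured roles in order, `admin` highest."""
    if "any" in roles or "admin" in roles:
        raise ValueError("`any` and `admin` are reserved and cannot be listed in roles")
    return {name: rank for rank, name in enumerate(["any", *roles, "admin"])}


def allowed(actor_role, required, action, roles=ROLES, defaults=DEFAULTS):
    """True if `actor_role` meets the role `required` for `action` ("view" or "push")."""
    ranks = rank_table(roles)
    if required == "default":
        required = defaults[action]  # fall back to the configured default
    return ranks[actor_role] >= ranks[required]
```

A visitor who is not logged in would be checked as `any`. With `roles = ["user", "mod"]`, a `mod` passes every check a `user` passes, which matches the inheritance order described in the config comments.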
Even though I said before (and probably will say later) that having a bajillion levels of access control is stupid, I think I’ll be keeping roles extensible. It’s lightweight and totally opt-in (just don’t add more roles if you don’t want more). Just because Glee isn’t designed to scale doesn’t mean I won’t nab an easy opportunity to make it scale better.
You will notice that the default permissions are kind of conservative. That’s by design; you don’t want to accidentally expose private repos before you read up on how default permissions work.[8]
By the way, config files are in TOML, data files in JSON.
Webview
I want to make something simplistic like the Linux Kernel’s Git webview.[9] Actually, maybe even more so: I don’t think I need `about`, `diff`, or `stats`, and I probably won’t even implement syntax highlighting. This makes the link for raw content shorter, since there only is raw content. Maybe I’ll allow a formatted view through a URL param like
https://glee.dennisc.net/glee.git/tree/.gitignore?fmt
and have links to files/directories direct to the `fmt` version, plaintext otherwise.[10]
Oh and also, branches via
https://glee.dennisc.net/glee.git/tree/.gitignore?b=dev
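For illustration, here is how a handler might pull the repo, file path, and the two query parameters out of URLs shaped like the ones above, using Python's standard `urllib.parse`. The `parse_tree_url` name and the returned dictionary layout are assumptions, as is treating a missing `b` as "use whatever the default branch is".

```python
from urllib.parse import parse_qs, urlsplit


def parse_tree_url(url):
    """Split a tree URL into repo, file path, format flag, and branch."""
    parts = urlsplit(url)
    # Everything before "/tree/" is the (possibly nested) repo path.
    repo, _, file_path = parts.path.lstrip("/").partition("/tree/")
    # keep_blank_values=True so a bare "?fmt" (no "=") is still reported.
    query = parse_qs(parts.query, keep_blank_values=True)
    return {
        "repo": repo,
        "path": file_path,
        "formatted": "fmt" in query,
        "branch": query.get("b", [None])[0],  # None means the default branch
    }
```

The two parameters combine naturally, so `.../tree/.gitignore?fmt&b=dev` would request the formatted view of the file on the `dev` branch.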
I have no intention of totally eschewing JS, by the way. I am not nearly as militant as some other people about “no JS!!” (Ad banners annoy me as much as anyone else, but to extrapolate that to “all JS bad” is a stretch. Though some contexts, particularly high-security ones, are totally right about no JS.) If I can avoid JS though I will make an effort to, particularly since I care about people using command-line browsers. Currently I plan to have the API return a list of directories and files inside a repo and format accordingly (this includes links, with `fmt` if appropriate); same for non-repo directories, and format with JS appropriately.
But GitWeb, CGit, etc are not really the sort of solution I want, since as far as I know you can’t log in to the web interfaces. Which sucks for collaboration, and also sucks for personal use because what’s the point of a webview if I can’t even see all my projects, most of which are supposed to be private?
Other VCS
I say this is a Git hosting service (and indeed that is what I will support, first and foremost) but in principle everything I have said would also work with something like Pijul (which I have wanted to try for quite a while!). So a Pijul integration is something I might want to consider, depending on my experiences with it.
The Caveat
The thing about this sort of design is that it has so few details, someone must have done it before me. Maybe I am wrong and everyone else decided to use 7 levels of access control but only 1 level of nesting (`username/repo` or `org/repo`). I hope I am not, though, so if you know a self-hosted Git service that sounds something like this, please let me know so I don’t have to build it myself. Because I would really rather not.
[1] The scripts are two or three lines long. Setting “secrets” for them would be incredibly stupid.
[2] For one, I probably won’t start this project until I have finished the mapm gui, which I also want to write about.
[3] It’s not just that putting private scripts on GitLab doesn’t seem “private” enough, but also that it would pollute my GitLab (which ideally should not have bad code, even on my side).
[4] The drafts for MAST units, as well as MAST units I haven’t looked over yet, are also on a private GitLab group right now. This is why I want to set up a rudimentary web view and collaboration abilities, which would require managing users + permissions.
[5] Trust me, even if you explicitly tell people NOT to make testing accounts, they will anyway. Goddamnit.
[6] I use such a generic term because I have to.
[7] My original design was putting a `.glee` file in each repository. This does not work for the obvious reason that it prevents you from making a file called `.glee`. The hassle to validate this is honestly less than the hassle of just writing a parser for `repo-data.json`.
[8] Yes, all this stuff and more will go in the docs, I’m not that terrible at programming. But also I totally know no one will read the docs until they absolutely have to, and it’s better if “absolutely have to” means “no one can see my repositories by default, how do I change this” versus “oh crap my super secret keys got leaked”.
[9] It looks like this is a patched version of cgit.
[10] A directory plaintext might look something like
src/
Cargo.toml
.gitignore
with trailing slashes to indicate directories.