A plan of attack for Glee
As of a few hours ago, my understanding of trees in Postgres has shot up a lot, thanks to a blog post by Leonard Marc. The approach we will be using is #2, i.e. storing the parent directory as a foreign key. (More details soon.) Suffice to say that I feel very stupid and enlightened at the same time, because now I believe I have a perfectly sound model of Glee. In fact, it is sound enough that I am confident that I can begin programming — and I am so averse to programming that my mailserver has not been able to send emails for the last eight months.
Storytime (skippable)
Previously I had a hare-brained idea to store the metadata of a repository (who has permissions to act in this repository?) as an actual file in the filepath. It suffices to say this is a terrible idea, but it would have worked just fine with my previous design.
Then I started thinking about the goals of Glee. At the start, Glee was meant to be a host that didn’t need to scale, but now my goal is for Glee to scale towards large single-organization hosts. (Because otherwise it is utterly useless and you can just use cgit.)
The underlying complexity, however, will still remain very low, somewhere between cgit and sourcehut. It is the goal that
- all of the design decisions can be explained in a single medium-sized document,
- understanding the design is sufficient to use the software proficiently.
But this post is not meant to be that document. I will hold off on that until I have Glee in a somewhat usable state.
Now, if we want large single organizations to use this for enterprise-ish stuff, we really don’t want random people creating accounts. (I also cannot be bothered to implement a “send registration email” feature.) The logical flow is that if you are an admin of a directory or repository (which we will henceforth call a path), then you may give an email address (say dchen@dennisc.net) permissions to (say) write to the path. If dchen@dennisc.net is already associated with an account, then that account will now have permissions to write to said path. If not, an email will be sent containing an invite link, which dchen@dennisc.net can use to register. In the database, we will note that this invite link should also give the new account permissions to the path (say) a/b/c.git. So we will store the permission write: a/b/c.git in the associated database entry for that invite link, and when the invite link is used, that permission will be added for the new user.
Now what if a/b/c.git moves to a/e.git after the invite link is created but before it is used?
Oh.
The correct way to deal with this issue is to instead point to a unique id representing the path, one that doesn’t change. Which calls for storing information about paths in the database.
Directories and Repositories
To motivate this section, I will say what I have said before in many of my previous Glee posts. Glee is about storing your Git repositories as a filesystem.
We are going to use a standard tree structure in Postgres. We will have two tables, Directories and Repositories. The fields of each are going to be
- id: a uuid
- public: whether the directory or repository is publicly viewable, an optional boolean (no value means “inherit”)
- name: the name of the directory or repository, for example c.git
- parent: a nullable foreign key pointing to the uuid of a directory
- permissions: a list of read/write/admin with user ids, e.g. [read, b3994226-6761-456d-879c-7b18facbbd81]
Of course, the root directory / will have no parent. It will be the sole path with no parent. It also will be the sole path we cannot move.
For now I’m thinking the struct representing this unified model in Rust should be DirRepo in backend.rs. Or maybe just Path, but that is not an ideal name because it conflicts with a filesystem path, and Path implies a complete path rather than just one step (i.e. current file plus parent).
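As a rough sketch of what such a struct could look like, here is one possible shape that mirrors the fields listed above. Everything beyond those fields is an assumption for illustration: the `Kind` and `Permission` enums, the use of `u128` as a stand-in for a uuid, the empty name for the root, and the `root`/`is_root` helpers.

```rust
/// Permission levels, declared weakest first so that Admin > Write > Read.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Permission {
    Read,
    Write,
    Admin,
}

/// Which table the row would come from (hypothetical discriminator).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Directory,
    Repository,
}

/// Stand-in for a uuid; a real implementation would likely use a uuid type.
pub type Id = u128;

/// One step of a path: a directory or repository plus a pointer to its parent.
#[derive(Debug, Clone)]
pub struct DirRepo {
    pub id: Id,
    pub kind: Kind,
    /// `None` means "inherit" from the parent directory.
    pub public: Option<bool>,
    /// A single path component such as "c.git"; never contains '/'.
    pub name: String,
    /// `None` only for the root directory.
    pub parent: Option<Id>,
    /// Pairs of (permission, user id), e.g. (Read, some user).
    pub permissions: Vec<(Permission, Id)>,
}

impl DirRepo {
    /// Build the root directory (assumed here to have an empty name).
    pub fn root(id: Id) -> Self {
        DirRepo {
            id,
            kind: Kind::Directory,
            public: None,
            name: String::new(),
            parent: None,
            permissions: Vec::new(),
        }
    }

    /// The root is the sole entry without a parent.
    pub fn is_root(&self) -> bool {
        self.parent.is_none()
    }
}
```

Keeping `parent` as an `Option` makes the "sole path with no parent" rule visible in the type, though nothing in this sketch stops a second parentless entry from being constructed; that would have to be enforced elsewhere.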
We will index the column parent in the tables Directories and Repositories. That way we can emulate ls for the directory b3994226-6761-456d-879c-7b18facbbd81 by simply searching for anything with a parent id of b3994226-6761-456d-879c-7b18facbbd81 and have this query be efficient.
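To see why an index keyed on the parent id makes ls cheap, here is a small in-memory analogue. It is not the Postgres index itself: it uses a `BTreeMap` keyed by (parent id, name), so all children of one directory sit next to each other and can be read with a single range scan. The `ParentIndex` type and the `u64` ids are assumptions made for the sketch.

```rust
use std::collections::BTreeMap;

/// Stand-in for a uuid (assumption: any ordered id type works here).
type Id = u64;

/// In-memory analogue of an index on (parent, name): keys sort by parent
/// first, so all children of one directory are adjacent.
struct ParentIndex {
    by_parent: BTreeMap<(Id, String), Id>,
}

impl ParentIndex {
    fn new() -> Self {
        ParentIndex { by_parent: BTreeMap::new() }
    }

    /// Record that `child` is named `name` inside directory `parent`.
    fn insert(&mut self, parent: Id, name: &str, child: Id) {
        self.by_parent.insert((parent, name.to_string()), child);
    }

    /// Emulate `ls`: names of everything whose parent id is `dir`, sorted.
    fn ls(&self, dir: Id) -> Vec<&str> {
        self.by_parent
            // Start at the smallest possible key for this parent...
            .range((dir, String::new())..)
            // ...and stop as soon as we leave this parent's block of keys.
            .take_while(|((parent, _), _)| *parent == dir)
            .map(|((_, name), _)| name.as_str())
            .collect()
    }
}
```

The scan touches only the entries under the requested directory, which is the property we want from the database index as well.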
We will have to validate that name does not contain any / characters upon any client POST request for obvious reasons.
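A minimal sketch of that check follows. Only the ban on / comes from the design above; rejecting empty names, "." and ".." is an extra assumption, and `is_valid_name` is a hypothetical helper name.

```rust
/// Returns true if `name` is acceptable as a single path component.
/// Rejecting '/' is the rule from the design; rejecting empty names,
/// "." and ".." is an additional assumption made for this sketch.
fn is_valid_name(name: &str) -> bool {
    !name.is_empty() && name != "." && name != ".." && !name.contains('/')
}
```

Running this on every name that arrives in a POST request, before touching the database, keeps each row a single step of a path.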
Handling redirects
Now handling redirects is trivial, which means we will do it. We will have a table Redirects which stores
- parent
- name
- link: which ID the redirect goes to.
Let me give you a concrete example to explain how resolving redirects will work. Suppose we rename a/b/c.git -> a/d.git. As expected, we look at the entry for c.git and
- change its name to d.git,
- change its parent to a (more accurately, the id of a).
Furthermore, we create a Redirect with
- parent: whatever the id of b is (this is the parent of c.git before the rename)
- name: c.git
- link: whatever the id of c.git was, i.e. the id of the new d.git.
Note that the path a/b still exists; we just moved c.git. Here is what happens when we try to navigate to a/b/c.git:
- There is a directory named a with parent /.
- There is a directory named b with parent a.
- There is no directory or file named c.git with parent b. But there is a Redirect with the name c.git with parent b that points to d.git. So now we look at d.git and get a repository.
To be clear, when we say “with parent /”, we really mean that its parent is the id of the root directory, etc.
(Basically this is the idea I had with symlinks, but it solves the problem of changing identifiers because we use a static id.)
To demonstrate the robustness of this idea, suppose we now move a/b to a/e. We still want a/b/c.git to go to the correct repository. What happens?
- There is a directory named a with parent /.
- There is no directory named b with parent a. However, there is a redirect with name b and parent a that goes to e (which has path a/e).
- There is no directory or file named c.git with parent e (remember that when we do parent checks, it is with the id; saying “with parent e” is merely a shorthand). However, there is a redirect with name c.git with parent e that points to d.git. So now we get the same d.git, precisely as desired!
In fact, a/b/c.git will always redirect to that same repository until a new directory or repository is made at that same path.
And these redirects persist until they are “overwritten” by a new path at the same location. When the overwrite occurs, we will delete the redirect. The leading principle here is very simple:
The combination of name and parent id must be unique among all directories, repositories, and redirects.
That means there are no unused redirects lying around, meaning that we never have to prune redirects. So our analogy of a filesystem with repositories and directories can be extended with redirects. Now we just have two types of files, redirects and repositories, alongside directories.
What do we do when we try to create a new directory/repository and there is a conflict?
- If the conflict is with a redirect, simply delete the redirect.
- If the conflict is with another directory/repository, forbid the operation.
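The resolution walk, the move that leaves a redirect behind, and the conflict rule can be sketched together with an in-memory stand-in for the three tables. This is only an illustration under stated assumptions: `Tree`, `Slot`, `create`, `mv` and `resolve` are hypothetical names, ids are plain `u32` values instead of uuids, and checks such as "the parent exists and is a directory" or "a directory is not moved into itself" are left out.

```rust
use std::collections::HashMap;

/// Stand-in for a uuid (assumption; the real design uses uuids).
type Id = u32;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Directory,
    Repository,
}

/// What occupies a (parent id, name) slot.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Slot {
    Node(Id),     // a real directory or repository
    Redirect(Id), // `link`: the id the redirect goes to
}

struct Tree {
    root: Id,
    kinds: HashMap<Id, Kind>,
    // Keyed by (parent id, name), so the uniqueness rule holds by construction.
    slots: HashMap<(Id, String), Slot>,
}

impl Tree {
    fn new(root: Id) -> Self {
        let mut kinds = HashMap::new();
        kinds.insert(root, Kind::Directory);
        Tree { root, kinds, slots: HashMap::new() }
    }

    /// Create a directory or repository. A redirect in the way is replaced;
    /// a real directory or repository in the way forbids the operation.
    fn create(&mut self, parent: Id, name: &str, id: Id, kind: Kind) -> bool {
        let key = (parent, name.to_string());
        if matches!(self.slots.get(&key), Some(Slot::Node(_))) {
            return false;
        }
        self.kinds.insert(id, kind);
        self.slots.insert(key, Slot::Node(id));
        true
    }

    /// Move or rename: the id never changes, and the old slot becomes a redirect.
    fn mv(&mut self, from: (Id, &str), to: (Id, &str)) -> bool {
        let from_key = (from.0, from.1.to_string());
        let to_key = (to.0, to.1.to_string());
        let id = match self.slots.get(&from_key) {
            Some(Slot::Node(id)) => *id,
            _ => return false, // nothing real to move
        };
        if matches!(self.slots.get(&to_key), Some(Slot::Node(_))) {
            return false; // destination is taken by a real entry
        }
        self.slots.insert(to_key, Slot::Node(id)); // overwrites a redirect, if any
        self.slots.insert(from_key, Slot::Redirect(id));
        true
    }

    /// Walk a path like "a/b/c.git" from the root, following redirects.
    fn resolve(&self, path: &str) -> Option<(Id, Kind)> {
        let mut current = (self.root, Kind::Directory);
        for name in path.split('/').filter(|part| !part.is_empty()) {
            if current.1 != Kind::Directory {
                return None; // only directories have children
            }
            let id = match self.slots.get(&(current.0, name.to_string()))? {
                Slot::Node(id) | Slot::Redirect(id) => *id,
            };
            current = (id, *self.kinds.get(&id)?);
        }
        Some(current)
    }
}
```

Replaying the example from this section (create a, a/b and a/b/c.git, move c.git to a/d.git, then move a/b to a/e) should leave a/b/c.git resolving to the same repository id throughout, until something new is created at a/b/c.git and replaces the redirect. Because the root has no slot of its own, `mv` cannot move it, which matches the rule that the root is the sole path we cannot move.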
How are we concretely storing repositories?
Now it would be stupid to actually perform a filesystem move every time we do a “virtual” move in the database. The correct answer is very obvious: we store the repository with uuid b3994226-6761-456d-879c-7b18facbbd81 in the path b3994226-6761-456d-879c-7b18facbbd81.git.[1] That way when we resolve a path to a repository, we merely need to look at uuid.git in the filesystem. Furthermore, because we never move repositories in the actual filesystem and never change uuids, any bug with filepath resolution is fixable. This means we will never corrupt our data with moves, because we are never changing the underlying data; bugs will only appear due to incorrect filepath resolution.
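In code, the mapping from a repository id to its on-disk location could be as small as the following sketch. The `repo_dir` helper and the idea of passing the data directory in as a parameter are assumptions; the uuid is taken in its textual form.

```rust
use std::path::{Path, PathBuf};

/// Where the repository with the given uuid lives on disk.
/// `data_dir` is Glee's data directory (its actual location is an assumption
/// left to the caller); `uuid` is the id in its canonical textual form.
fn repo_dir(data_dir: &Path, uuid: &str) -> PathBuf {
    data_dir.join(format!("{}.git", uuid))
}
```

Since the result depends only on the uuid, no rename or move in the database can ever change where the data lives.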
User permissions, again
Specifically we will talk about
- how users ought to be invited
- how permissions will be managed on the frontend
because those are the only things which I have changed the design of.
A link to a special page to “manage permissions” for a directory/repository will appear if you have admin access. We will not be modifying a raw TOML file because that is a bad idea. Here is our new approach:
When resolving whether a user is an admin, we should also determine whether they have directly been defined to be admin (i.e. in the current directory or repository) or whether they have inherited admin from a parent directory. We will say an admin is an Inherit Admin if they have inherited and a Direct Admin if they have been directly defined as an admin.
An inherit admin will be stronger than a direct admin. So if we have determined a user is a direct admin, we also must check whether they inherit admin as well, since making someone a direct admin on top of being an inherit admin should not reduce their permissions.
A direct admin cannot delete admins in the current directory/repository. An inherit admin can delete direct admins in the current directory/repository.
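The inherit-versus-direct distinction can be sketched as a small function over the chain of directories from the root down to the current path. The representation (one list of admin user ids per level, with the current directory or repository last) and the names `AdminStatus`, `admin_status` and `can_remove_direct_admin` are assumptions for illustration.

```rust
/// Stand-in for a user id (assumption; the real ids are uuids).
type UserId = u32;

/// Declared weakest first, so Inherit > Direct > NotAdmin.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum AdminStatus {
    NotAdmin,
    Direct,
    Inherit,
}

/// `admins_by_level[0]` holds the admins defined on the root, and the last
/// element holds the admins defined on the current directory or repository.
fn admin_status(admins_by_level: &[Vec<UserId>], user: UserId) -> AdminStatus {
    match admins_by_level.split_last() {
        None => AdminStatus::NotAdmin,
        Some((current, ancestors)) => {
            // Check ancestors first: being a direct admin on top of an
            // inherited one must not reduce the user's permissions.
            if ancestors.iter().any(|level| level.contains(&user)) {
                AdminStatus::Inherit
            } else if current.contains(&user) {
                AdminStatus::Direct
            } else {
                AdminStatus::NotAdmin
            }
        }
    }
}

/// Only inherit admins may delete direct admins of the current path.
fn can_remove_direct_admin(actor: AdminStatus) -> bool {
    actor == AdminStatus::Inherit
}
```

For example, a user listed both on a parent directory and on the repository itself comes out as an inherit admin, so they keep the ability to remove direct admins.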
Regardless of what permissions you have (read/write/admin), the main page of the repo will tell you what permissions you have upfront. (Many other sites are awful at doing this.)
Admins, whether inherit or direct, can see both who has direct permissions and who has inherited permissions on the “manage permissions” page. You may think that resolving who has inherited permissions might be complex, and you would be right, but we already do this work when determining whether to show the permissions page. Instead of just resolving permissions for one user, we will create a list of all users and the permissions they have. For example, we might say
- Alice is an admin of the repository
- Bob inherits admin from /gym
- Charles inherits write from /gym
However, it might be privileged information that Dean is an admin of /, and it would be bad if a low-level admin saw “Dean inherits admin from /”. Suppose the highest parent directory we inherit admin from is /path. Then we only want to show users who inherit permissions from /path or lower. So when we traverse the tree to resolve permissions, we will be keeping track of
- the highest parent directory you inherit admin from,
- and the lowest parent directory every other user inherits their strongest permission from,
where “high” means “less deep” and “low” means “deeper”.
If your highest inheriting directory is at least as high as a user’s lowest inheriting directory (that is, the same directory or one above it), then you will see that the user inherits their permissions from said lowest inheriting directory.
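Here is one way the bookkeeping could look, as a sketch rather than a settled design. It assumes the grants on each ancestor directory are available as a list ordered from the root (depth 0) downwards, that the current path's own direct grants are handled separately, and that a viewer who inherits admin from nowhere is shown no inherited entries. The name `visible_inherited` is hypothetical.

```rust
use std::collections::BTreeMap;

/// Stand-in for a user id (assumption; the real ids are uuids).
type UserId = u32;

/// Declared weakest first, so Admin > Write > Read.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Permission {
    Read,
    Write,
    Admin,
}

/// `grants[d]` lists the grants made on the ancestor directory at depth `d`
/// (0 is the root). Returns (user, strongest permission, depth of the lowest
/// directory granting it) for every user the viewer is allowed to see.
fn visible_inherited(
    grants: &[Vec<(UserId, Permission)>],
    viewer: UserId,
) -> Vec<(UserId, Permission, usize)> {
    // Highest (least deep) directory the viewer inherits admin from.
    let viewer_depth = match grants
        .iter()
        .position(|level| level.contains(&(viewer, Permission::Admin)))
    {
        Some(depth) => depth,
        None => return Vec::new(),
    };

    // For every other user: strongest permission, and the lowest (deepest)
    // directory that grants it. Depths are visited in increasing order, so an
    // equally strong grant found later replaces the earlier, higher one.
    let mut best: BTreeMap<UserId, (Permission, usize)> = BTreeMap::new();
    for (depth, level) in grants.iter().enumerate() {
        for &(user, perm) in level {
            if user == viewer {
                continue;
            }
            let keep_existing = matches!(best.get(&user), Some(&(p, _)) if p > perm);
            if !keep_existing {
                best.insert(user, (perm, depth));
            }
        }
    }

    // Hide anyone whose lowest inheriting directory is above the viewer's
    // highest inheriting directory.
    best.into_iter()
        .filter(|&(_, (_, depth))| depth >= viewer_depth)
        .map(|(user, (perm, depth))| (user, perm, depth))
        .collect()
}
```

With Dean as an admin of / and a viewer who inherits admin from /gym, Dean's entry is filtered out, while users whose permissions come from /gym or below remain visible.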
Also, there will be a special “manage permissions” page on / which allows admins of / to delete any user who is not an admin of /.
Displaying the git log as a graph
Here’s a tip that will change your life: try using git log --graph. GitHub, GitLab, and SourceHut’s log views are all linear, meaning they do not show the commit graph. Bitbucket of all places does. We show the commit graph as well because that is the right thing to do, although we will shamelessly fail on unreasonably large octopus merges.
This will require a good understanding of libgit2’s revwalk API and significant thought into the frontend design of the Git log. Of all the things I want to implement in Glee, this seems like it will be the hardest.
Plan of action
Having finally fleshed out the design, here is the plan of action, in the order I plan to implement Glee:
- Revamp the user model.
  - Remove admin, because we are now handling permissions on the filesystem in a more sophisticated manner.
  - Maybe start using Redis (well really ValKey now). Because while scalability was not a goal before, the whole point now of several new ideations is that scalability is actually important. We want big organizations to be able to use this, at least in theory, so Postgres authentication might become a bottleneck. (But I will have to do research into whether this is actually worth doing, though my gut says using Redis is the right thing to do.)
- Create Directory/Repository tables.
  - Use foreign key pointer approach.
  - Make an index on the parent id and set up scaffolding for initializing indices.
- Revamp Invite Token model.
  - Invites should be associated with “what do you want to invite user to”, so path + permission (read/write/admin).
  - Said invites will also modify the appropriate directory/repository entry.
- Implement permissions page for each directory/repository.
- Implement directory main page view.
- Implement repository main page view.
- Implement log/trunk view, etc. for repositories.
  - Need to figure out how to emulate git log --graph, but with a web UI.
- Figure out SSH interceptors to implement write access.
When all this is done, we will have a reasonably complete product. I feel that I finally have the requisite understanding of Postgres to implement the database-side stuff, though I will have to spend some more time understanding Git better. But at the very least, I can implement everything up to the repository main page view without understanding Git one bit more. So the goal will be to get to that point soon.
[1] Really, we store this repository in the data directory of Glee.