
feat(proxy): resolve push identity from token via SCM provider API#1604

Open
coopernetes wants to merge 2 commits into main from feat/token-id-mapping

Conversation

@coopernetes

@coopernetes coopernetes commented Jun 19, 2026

Contributor

Description

parsePush uses the last commit's committer as the push user. This adds a new chain processor that extracts the token from HTTP Basic auth, calls the SCM provider's user API (GitHub GET /user for now), and maps the SCM login to a git-proxy user via the gitAccount field.

  • TokenIdentityProvider interface with hostname-based dispatch
  • GitHubTokenIdentityProvider calling api.github.com/user
  • resolveUserFromToken chain processor (non-blocking on failure)
  • findUserByGitAccount DB lookup (file + mongo)
  • GET/PUT /api/v1/user/:username/git-account endpoints
  • In-memory token→user cache (5 min TTL, SHA-512 keyed) to avoid hitting the SCM API on every push. Only positive resolutions are cached. Cache is evicted per-user when gitAccount is updated via the API.

⚠️ This PR does not store PATs or tokens. The token is never written to disk or any database. The in-memory cache stores only a one-way SHA-512 hash of provider:token as the lookup key, alongside the resolved git-proxy username. The hash is non-reversible — the original token cannot be recovered from it. The cache lives in process memory only and is cleared on restart.
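A minimal sketch of the cache behaviour described above, assuming Node's built-in `crypto` module. The class name `TokenUserCache` and its method names are hypothetical and are not taken from this PR's code; only the 5 minute TTL, the SHA-512 key over `provider:token`, and the per-user eviction come from the description.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical sketch: names below are illustrative, not the PR's actual API.
const TTL_MS = 5 * 60 * 1000; // 5 minute TTL, as stated in the description

interface CacheEntry {
  username: string; // resolved git-proxy username
  expiresAt: number; // epoch milliseconds
}

class TokenUserCache {
  private entries = new Map<string, CacheEntry>();

  // One-way SHA-512 of "provider:token"; the raw token is never stored.
  private key(provider: string, token: string): string {
    return createHash('sha512').update(`${provider}:${token}`).digest('hex');
  }

  get(provider: string, token: string, now: number = Date.now()): string | undefined {
    const k = this.key(provider, token);
    const entry = this.entries.get(k);
    if (!entry) return undefined;
    if (entry.expiresAt <= now) {
      this.entries.delete(k); // expired: drop the entry and report a miss
      return undefined;
    }
    return entry.username;
  }

  // Callers store only positive resolutions.
  set(provider: string, token: string, username: string, now: number = Date.now()): void {
    this.entries.set(this.key(provider, token), { username, expiresAt: now + TTL_MS });
  }

  // Evict every entry for a user, e.g. after their gitAccount is updated.
  evictUser(username: string): void {
    this.entries.forEach((entry, k) => {
      if (entry.username === username) this.entries.delete(k);
    });
  }
}
```

Passing `now` explicitly is only a convenience for testing expiry without waiting; a real implementation could equally read the clock internally.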

This doesn't block a push if the gitAccount isn't mapped, so that gitAccount can be introduced gradually via the UI. It acts as a "soft" check for now, unless the maintainer team wishes to adopt this model and make it a requirement for authorising the "pusher" identity link that is currently missing, as described in #1400.

How it works

  1. resolveUserFromToken runs in the push chain after parsePush, before checkUserPushPermission
  2. Extracts the token from the HTTP Basic auth header (the password field)
  3. Checks in-memory cache (SHA-512 of provider:token as key) — returns immediately on hit
  4. Dispatches to a TokenIdentityProvider based on the upstream hostname (github.com → GitHubTokenIdentityProvider)
  5. Calls GET /user with the token to get the SCM login
  6. Looks up the git-proxy user by gitAccount field — if found, sets action.user and action.userEmail from the DB user and stores in cache
  7. If no gitAccount match, falls back to using the SCM login directly (non-blocking)
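Steps 2, 4, 6 and 7 above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `extractTokenFromBasicAuth`, `providerFor`, `resolveActionUser` and the shape given to `TokenIdentityProvider` here are hypothetical, and the network call in step 5 is left out.

```typescript
// Hypothetical sketch: function and field names are illustrative, not the PR's code.

// Step 2: the token is the password part of "Basic base64(username:password)".
function extractTokenFromBasicAuth(header: string | undefined): string | null {
  if (!header) return null;
  const match = /^Basic\s+(\S+)$/i.exec(header.trim());
  if (!match) return null;
  const decoded = Buffer.from(match[1], 'base64').toString('utf8');
  const sep = decoded.indexOf(':'); // split on the first colon only
  if (sep === -1) return null;
  const token = decoded.slice(sep + 1);
  return token.length > 0 ? token : null;
}

// Step 4: pick a provider from the upstream hostname (assumed interface shape).
interface TokenIdentityProvider {
  name: string;
  supports(hostname: string): boolean;
}

const providers: TokenIdentityProvider[] = [
  { name: 'github', supports: (hostname) => hostname === 'github.com' },
];

function providerFor(upstreamUrl: string): TokenIdentityProvider | undefined {
  const hostname = new URL(upstreamUrl).hostname.toLowerCase();
  return providers.filter((p) => p.supports(hostname))[0];
}

// Steps 6 and 7: prefer the mapped git-proxy user, else fall back to the SCM login.
interface ProxyUser {
  username: string;
  email: string;
}

function resolveActionUser(
  scmLogin: string,
  mappedUser: ProxyUser | null,
): { user: string; userEmail?: string } {
  if (mappedUser) {
    return { user: mappedUser.username, userEmail: mappedUser.email };
  }
  return { user: scmLogin }; // non-blocking fallback
}
```

Keeping the fallback in a small pure function makes the "soft check" behaviour easy to unit test separately from the HTTP and database calls.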

Cache note

The cache of token hashes is intentionally held in process memory. It should not persist across restarts, because it is purely an optimization for API calls (respecting the user's own rate limits and not re-fetching data that doesn't change). A database-backed cache would be the natural next step if horizontal scaling becomes a concern, but for a single-process proxy this is sufficient and avoids a schema migration.

Limitations

  • Does not work for a generic git repository provider that doesn't provide a user API. Forcing this behaviour within Git Proxy would constrain its applicability to only those providers that have an API for identity lookups, so that the result can be matched to a valid Git Proxy user.
  • For specific providers (GitLab, Forgejo/Codeberg/Gitea), an additional scope is needed. Originally documented here: https://github.com/RBC/fogwall/blob/main/docs/CONFIGURATION.md#token-scope-requirements

Token scope requirements

The SCM login check calls GET /user (or equivalent) on the upstream SCM using the pusher's token. The token must carry at least the following scope:

| Provider | API endpoint | Additional scope |
| --- | --- | --- |
| GitHub | GET https://api.github.com/user | No additional scopes required for either classic or fine-grained PATs. |
| GitLab | GET {uri}/api/v4/user | read_user or api (not recommended, prefer read_user) |
| Codeberg | GET https://codeberg.org/api/v1/user | read:user |
| Gitea | GET https://gitea.com/api/v1/user | read:user |
  • Bitbucket is just... weird... It has two separate sets of permissions, one for git and one for the Bitbucket APIs. A user email can be linked between both "realms", but you cannot use your email to push code to that platform. Supporting Bitbucket proper requires some credential rewriting, which is error-prone and brittle. See BitbucketProvider and BitbucketIdentityFilter in RBC/fogwall for details on what is needed in the HTTP flow. It's shared here as prior art/learnings only.

Related Issue

related to #1400

General

Documentation

  • Required user docs for adding their gitAccount (GitHub username in this current iteration)
  • Update any architectural docs with the identity resolution

Configuration

no configuration changes introduced

Tests

  • Tests have been added/updated for new functionality
  • Unit tests pass (npm test)
  • Linting and formatting pass (npm run lint and npm run format:check)
  • Type checks pass (npm run check-types)
  • API route tests for GET/PUT /api/v1/user/:username/git-account (coverage exists but UI integration testing is deferred)

@coopernetes coopernetes requested a review from a team as a code owner June 19, 2026 19:18
@netlify

netlify Bot commented Jun 19, 2026


Deploy Preview for endearing-brigadeiros-63f9d0 canceled.

Name Link
🔨 Latest commit 2a00c7c
🔍 Latest deploy log https://app.netlify.com/projects/endearing-brigadeiros-63f9d0/deploys/6a3c554efed26c00088b45d7

@linux-foundation-easycla

linux-foundation-easycla Bot commented Jun 19, 2026


CLA Signed
The committers listed above are authorized under a signed CLA.

@github-actions

github-actions Bot commented Jun 19, 2026


Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

@coopernetes coopernetes force-pushed the feat/token-id-mapping branch from 9c3d053 to ef788cf Compare June 19, 2026 19:20
@codecov

codecov Bot commented Jun 19, 2026


Codecov Report

❌ Patch coverage is 97.16981% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.69%. Comparing base (ca1d5aa) to head (2a00c7c).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...oxy/processors/push-action/resolveUserFromToken.ts | 96.42% | 3 Missing ⚠️ |
| src/db/file/users.ts | 81.81% | 2 Missing ⚠️ |
| src/db/index.ts | 50.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1604      +/-   ##
==========================================
+ Coverage   85.38%   85.69%   +0.30%     
==========================================
  Files          83       85       +2     
  Lines        7878     8090     +212     
  Branches     1312     1360      +48     
==========================================
+ Hits         6727     6933     +206     
- Misses       1123     1129       +6     
  Partials       28       28              


Comment thread src/proxy/processors/push-action/resolveUserFromToken.ts
Comment thread src/db/mongo/users.ts

export const findUserByGitAccount = async function (gitAccount: string): Promise<User | null> {
  const collection = await connect(collectionName);
  const doc = await collection.findOne({ gitAccount: { $eq: gitAccount.toLowerCase() } });

@coopernetes coopernetes Jun 19, 2026

Contributor Author


Any desire to support a list of accounts here? gitAccount is somewhat of a holdover from v1. It's also singular across the whole user context: there's no shape in the data model today that supports associating git accounts by upstream provider/hostname.

Ideally, we revisit this shape in support of this PR. Something like this:

# MongoDB doc
{
  # existing keys...
  "username": "git-proxy-user",
  "email": "user@corpo-example.com",
  "gitAccounts": {
    "github.com": ["foo", "bar"],
    "gitlab.com": [ "baz" ]
  }
}
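
As a rough illustration of how a lookup against the proposed `gitAccounts` shape might look, here is an in-memory sketch. The `UserDoc` type and the `findUserByHostAndLogin` helper are hypothetical and only mirror the example document above; they are not part of the current data model or of this PR.

```typescript
// Hypothetical shape following the proposal above; not the current data model.
interface UserDoc {
  username: string;
  email: string;
  gitAccounts: Record<string, string[]>; // upstream hostname -> SCM logins
}

// Find the git-proxy user that owns a given SCM login on a given upstream host.
function findUserByHostAndLogin(
  users: UserDoc[],
  hostname: string,
  login: string,
): UserDoc | null {
  const host = hostname.toLowerCase();
  const wanted = login.toLowerCase();
  for (const user of users) {
    const logins = user.gitAccounts[host] ?? [];
    if (logins.some((l) => l.toLowerCase() === wanted)) return user;
  }
  return null;
}

const exampleUsers: UserDoc[] = [
  {
    username: 'git-proxy-user',
    email: 'user@corpo-example.com',
    gitAccounts: { 'github.com': ['foo', 'bar'], 'gitlab.com': ['baz'] },
  },
];
```

With this shape the same login string on two different hosts resolves independently, which the single `gitAccount` field cannot express.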

@coopernetes coopernetes force-pushed the feat/token-id-mapping branch 2 times, most recently from 4b14544 to e443da3 Compare June 21, 2026 04:01
@kriswest

Contributor

We should probably make a decision on whether we're going to start storing PATs/passwords for git accounts in git proxy, or github/gitlab apps etc. It's been brought up multiple times as a necessary change to fulfil the ultimate goals of several Git Proxy contributors (e.g. raising PRs).

Perhaps a topic for the next meeting.

@coopernetes

Contributor Author

@kriswest this PR does not propose storing the PATs. Only an irreversible SHA-512 hash of the token, to avoid excessive calls to the user lookup APIs. Unless I missed something?

@kriswest

Contributor

@coopernetes sorry, I wasn't suggesting it did! I think it's a great approach to solving the pusher validation issue using the current data we have on users. It was just prompting me to raise the fact that other desired features are going to need to store PATs or be authorised applications in order to do some of the other things contributors have made clear they want to achieve with git-proxy, and that we should get on and make a formal decision as to whether we're going to take that on soon or not, as it affects the design of various features (such as this one and proxy format/doing the second push for you/raising the PR). I'm aware you are looking at similar features in fogwall and would love to have a chat about approaches for git proxy soon.

@coopernetes

Contributor Author

Understood, my mistake. I misinterpreted.

Some good candidates for relevant issues worth discussing in a design session:

@jescalada jescalada left a comment

Contributor


LGTM - wondering if any other @finos/git-proxy-maintainers wants to check it out before merging?

Also: Is this a complete fix for #1400 or is there anything else we need to patch up for checkUserPushPermission to work as expected?

);
action.user = identity.login;
if (identity.email) {
action.userEmail = identity.email;

Contributor


Is GitProxy identity guaranteed to be the same as the SCM identity? If not, should we document that proper push identity can only be obtained if the SCM user's email is set to match?

Contributor Author


The only guarantee in this flow is that the identity.login string is reliable insofar as the PAT will always be linked to a real GitHub user (except in odd cases like using a separate GitHub OAuth App to generate an OAuth token then feeding that into a git client... not impossible but likely not a common setup for developers).

This will decouple any email-to-user linkage from this new token resolver. The existing commit metadata in the chain already captures committer/author identity and links it to a GitProxy user account. This new step will just run as a later step to link back to the gitAccount.

gitAccount was previously "overloaded" and set to an email address to link to internal GitProxy user account objects based on conventions established from the original maintainer (as far as I remember, I could be mistaken). As I understand it, that was an internal, organization specific convention. This PR reclaims that field to mean the pusher's SCM profile name / GitHub login making it a reliable identifier for who actually performed the push.

I'm gonna remove the if (identity.email) check. As discussed in #1400, the email address is only ever returned in that GET /user endpoint if a user explicitly goes against the private-by-default setting of hiding it. Almost no one on GitHub enables that setting because of the obvious privacy implication.

Comment thread src/proxy/processors/push-action/tokenIdentity.ts
Comment thread src/service/routes/users.ts

// Get git account (SCM identity) for a user
router.get('/:username/git-account', async (req: Request<{ username: string }>, res: Response) => {
const targetUsername = req.params.username.toLowerCase();

Contributor


I think we might be missing an auth check here:

Suggested change
- const targetUsername = req.params.username.toLowerCase();
+ if (!req.user) {
+   res.status(401).json({ error: 'Authentication required' });
+   return;
+ }
+ const targetUsername = req.params.username.toLowerCase();

Contributor Author


This new endpoint is consistent with the rest of the users Router endpoints. If auth is needed here across the suite of endpoints, let's track that in a separate issue+PR.

@coopernetes coopernetes force-pushed the feat/token-id-mapping branch from b373c9b to a086b6b Compare June 24, 2026 21:53
@coopernetes

coopernetes commented Jun 24, 2026

Contributor Author

Thanks for the review @jescalada, I've addressed those comments, so just one final round of review is needed. On your question regarding 1400, this is a partial fix. When a user has their gitAccount set, checkUserPushPermission will correctly check the actual pusher's permission. Without it, parsePush still falls back to the last committer. This is intentional and documented in the PR description; it's designed as a non-blocking soft check to allow gradual adoption before the decision on changing gitAccount to be an associative map of SCM-hostnames-to-identity (see #1604 (comment)) is sorted. Happy to add those changes here though, so it's a complete solution to 1400.

…1400)

parsePush incorrectly uses the last commit's committer as the push user.
This adds a new chain processor that extracts the token from HTTP Basic
auth, calls the SCM provider's user API (GitHub GET /user for now), and
maps the SCM login to a git-proxy user via the gitAccount field.

- TokenIdentityProvider interface with hostname-based dispatch
- GitHubTokenIdentityProvider calling api.github.com/user
- resolveUserFromToken chain processor (non-blocking on failure)
- findUserByGitAccount DB lookup (file + mongo)
- GET/PUT /api/v1/user/:username/git-account endpoints
… identity resolver

GitHub's GET /user only returns email if the user has explicitly made it
public — effectively never. Remove the if (identity.email) branch and
the email field from ScmUserInfo to avoid the misleading implication
that an email fallback exists.

Add AbortSignal.timeout(5000) to the GitHub API fetch to prevent the
push chain from hanging if the API is slow or unreachable.
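
A minimal sketch of a timeout-guarded lookup along the lines this commit message describes, assuming a runtime with global `fetch` and `AbortSignal.timeout` (for example a recent Node.js). The `fetchGitHubLogin` function, the `FetchLike` type and the choice of request headers are assumptions for illustration; the PR's actual provider code may differ.

```typescript
// Hypothetical sketch; the PR's actual provider code may differ.
// Narrow fetch-like type so a stub can be injected in tests.
type FetchLike = (
  url: string,
  init: { headers: Record<string, string>; signal: AbortSignal },
) => Promise<{ ok: boolean; json(): Promise<unknown> }>;

async function fetchGitHubLogin(
  token: string,
  fetchImpl: FetchLike = fetch,
): Promise<string | null> {
  try {
    const res = await fetchImpl('https://api.github.com/user', {
      headers: {
        Authorization: `Bearer ${token}`,
        Accept: 'application/vnd.github+json',
      },
      // Abort if the API has not answered within 5 seconds.
      signal: AbortSignal.timeout(5000),
    });
    if (!res.ok) return null; // e.g. invalid or expired token
    const body = (await res.json()) as { login?: unknown };
    return typeof body.login === 'string' ? body.login : null;
  } catch {
    // Timeout or network failure: report "unresolved" instead of throwing,
    // so the caller can keep the push chain non-blocking.
    return null;
  }
}
```

Returning `null` on every failure path keeps the caller's logic simple: any non-string result means the identity could not be resolved from the token.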
@coopernetes coopernetes force-pushed the feat/token-id-mapping branch from a086b6b to 2a00c7c Compare June 24, 2026 22:08


3 participants