Skip to content

Extend url2purl/purl2url coverage for Git-based source hosts#223

Open
ziadhany wants to merge 14 commits intopackage-url:mainfrom
ziadhany:url2purl
Open

Extend url2purl/purl2url coverage for Git-based source hosts#223
ziadhany wants to merge 14 commits intopackage-url:mainfrom
ziadhany:url2purl

Conversation

@ziadhany
Copy link
Copy Markdown
Contributor

@ziadhany ziadhany commented Feb 10, 2026

Currently, many fix-commit URLs are not converted to PURLs during collection. Adding support for these URLs would ensure more accurate url2purl and purl2url conversions, as well as generating commit and patch URLs from PURLs.

  • git.kernel.org cgit /repo/commit/?id=HASH
  • gcc.gnu.org Gitweb /?p=repo.git;a=commit;h=HASH
  • sourceforge.net Apache Allura /p/project/repo/ci/HASH/
  • googlesource.com Gitiles /repo/+/HASH
  • codeberg.org / gitea.com gitea / forgejo commit/{HASH}"
  • gitlab-sub gitlab.gnome.org
  • github
  • gitlab
  • bitbucket

Related issue:

Signed-off-by: ziad hany <ziadhany2016@gmail.com>
Signed-off-by: ziad hany <ziadhany2016@gmail.com>
Add more tests

Signed-off-by: ziad hany <ziadhany2016@gmail.com>
Signed-off-by: ziad hany <ziadhany2016@gmail.com>
Add more test

Signed-off-by: ziad hany <ziadhany2016@gmail.com>
Add a test

Signed-off-by: ziad hany <ziadhany2016@gmail.com>
Add a test

Signed-off-by: ziad hany <ziadhany2016@gmail.com>
Signed-off-by: ziad hany <ziadhany2016@gmail.com>
ziadhany added 2 commits March 5, 2026 03:41
Add support for gitbox.apache.org

Signed-off-by: ziad hany <ziadhany2016@gmail.com>
Signed-off-by: ziad hany <ziadhany2016@gmail.com>
@ziadhany
Copy link
Copy Markdown
Contributor Author

ziadhany commented Mar 5, 2026

@TG1999 @keshav-space This PR is ready for review!. Take a look whenever you have a chance.

Comment thread src/packageurl/contrib/purl2url.py Outdated
Comment thread src/packageurl/contrib/purl2url.py
Comment thread src/packageurl/contrib/url2purl.py Outdated
commit_match = re.search(gitea_commit_pattern, url)
if commit_match:
return PackageURL(
type="generic",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a Forgejo PURL! @johnmhoran

return None

return PackageURL(
type="generic",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is in emerging git PURL for that from @darakian


if match := re.search(kernel_shorthand, url):
res = match.groupdict()
namespace = "git.kernel.org/pub/scm/linux/kernel/git/stable/"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs thinking and there is a emerging PURL registry that will cater to the kernel needs.

match = re.search(gitiles_project_pattern, url)
if match:
return PackageURL(
type="generic",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likely also a candidate for the new git PURL type

commit_match = re.search(allura_pattern, url)
if commit_match:
return PackageURL(
type="generic",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may have a sourceforge type? Or this is for a git type

Copy link
Copy Markdown
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great ... we need some PURL type refinements.

…t repositories.

Treat GitLab subdomains as GitLab when using the repository URL instead of a generic type.

Add support for salsa.debian.org, gitlab.eclipse.org, forge.fedoraproject.org domains

Signed-off-by: ziad hany <ziadhany2016@gmail.com>
Signed-off-by: ziad hany <ziadhany2016@gmail.com>
Comment thread tests/contrib/data/url2purl.json Outdated
Signed-off-by: ziad hany <ziadhany2016@gmail.com>
Signed-off-by: ziad hany <ziadhany2016@gmail.com>
@ziadhany
Copy link
Copy Markdown
Contributor Author

@pombredanne, for undefined purl types (like Forgejo, sourceforge, or kernel), we should use generic for this, right?
the purl type registier take time , can we process this PR and once we have a new purl type we can just update the code.

@mjherzog
Copy link
Copy Markdown
Member

We have an active PR for 'git' (package-url/purl-spec#823), but no PR for forejo or sourceforge.
The "kernel" case is pending a bigger decision for the new 'sid" PURL type.
We may want to wait for the 'git' PURL type to be registered, but we should not wait for the others.
BTW - my understanding is that 'git' is an option for SourceForge, but not the only VCS there - I suspect that there are still many projects using Subversion there.

Copy link
Copy Markdown
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! See my review comments. IMHO we can make the code simpler, more functions are OK.

return purl_data.qualifiers.get("commit_url", None)


def get_patch_url(purl):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure it makes sense to put the patch and commit URLs as PURL qualifiers. Why would you want that? To me, parsing a URL to get a commit or patch out of it is not specifically tied to a PURL, this is a URL operation. Now we could also possibly get a commit or patch from a PURL, but this would not be qualifiers on the PURL either.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pombredanne Ok, I thought it was similar to vcs_url in the purl-spec docs. The main goal is to get the patch text faster, since cloning a repo and using git takes a lot of time. I'm not sure if this is the right place for it in the qualifiers, though.

return repo_url


SUB_GITLAB_DOMAINS = [r"^git\.codelinaro\.org", r"^salsa\.debian\.org", r"^gitlab\.(?!com\b)[^/]+"]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this re mean?: (?!com\b)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are likely a large number of gitlab instances, it feels not realistic to list them all here. Why would this need to be known if the repository_url is provided when parsing a PURL? IMHo we should not need that at all.

Copy link
Copy Markdown
Contributor Author

@ziadhany ziadhany Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pombredanne you are right this SUB_GITLAB_DOMAINS should be only part of url2purl

r"^gitlab\.(?!com\b)[^/]+" This is a hack based on my observation about how we can identify if this is a GitLab subdomain. I noticed that they always start with gitlab.*

Ex:

gitlab.freedesktop.org/xorg/xserver
gitlab.xxx.com

The (?!com\b) this part is to filter out the main GitLab domain, gitlab.com. ( negative lookahead )

return get_generic_template_url(namespace, name, version, "patch_url")


@commit_router.route("pkg:gitlab/.*", "pkg:bitbucket/.*", "pkg:github/.*")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not having three separate functions? the rules to build a URL from a PURL are not generic at all, so IMHO the code would be simpler and cleaner with 3 functions, one for each forge.

Comment thread src/packageurl/contrib/purl2url.py
Comment thread src/packageurl/contrib/purl2url.py

@commit_router.route("pkg:generic/.*")
def build_generic_commit_url(purl):
"""
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give some examples as doctests?

if not (namespace and name):
return

return get_generic_template_url(namespace, name, None, "repo_url")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should use kwargs, but in all cases this calls for using an object with separate methods for each url type

return get_generic_template_url(namespace, name, version, "commit_url")


@patch_router.route("pkg:generic/.*")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The router is designed to route like a multidispatch, so you should use it for dispatching based on specific URL/PURL patterns

)


def build_route_regex(domain_patterns, path_suffix="/.*"):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is too complex IMHO, just use instead a proper list of route patterns



@purl_router.route(SUB_GITLAB_ROUTE_REGEX)
def build_gitlab_sub_purl(url):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you are missing also is a function that takes a URL and return details about a patch or commit, not even with PURL in the mix

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain more about what you mean by getting patch/commit details? Do you mean fetching the data and parsing the text? I thought this will be a job for VulnerableCode or ScanCode.io.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants