Extend url2purl/purl2url coverage for Git-based source hosts#223
Extend url2purl/purl2url coverage for Git-based source hosts#223ziadhany wants to merge 14 commits intopackage-url:mainfrom
Conversation
Signed-off-by: ziad hany <ziadhany2016@gmail.com>
Signed-off-by: ziad hany <ziadhany2016@gmail.com>
Add more tests Signed-off-by: ziad hany <ziadhany2016@gmail.com>
Signed-off-by: ziad hany <ziadhany2016@gmail.com>
Add more test Signed-off-by: ziad hany <ziadhany2016@gmail.com>
Add a test Signed-off-by: ziad hany <ziadhany2016@gmail.com>
Add a test Signed-off-by: ziad hany <ziadhany2016@gmail.com>
Signed-off-by: ziad hany <ziadhany2016@gmail.com>
Add support for gitbox.apache.org Signed-off-by: ziad hany <ziadhany2016@gmail.com>
Signed-off-by: ziad hany <ziadhany2016@gmail.com>
|
@TG1999 @keshav-space This PR is ready for review!. Take a look whenever you have a chance. |
| commit_match = re.search(gitea_commit_pattern, url) | ||
| if commit_match: | ||
| return PackageURL( | ||
| type="generic", |
| return None | ||
|
|
||
| return PackageURL( | ||
| type="generic", |
There was a problem hiding this comment.
There is in emerging git PURL for that from @darakian
|
|
||
| if match := re.search(kernel_shorthand, url): | ||
| res = match.groupdict() | ||
| namespace = "git.kernel.org/pub/scm/linux/kernel/git/stable/" |
There was a problem hiding this comment.
This needs thinking and there is a emerging PURL registry that will cater to the kernel needs.
| match = re.search(gitiles_project_pattern, url) | ||
| if match: | ||
| return PackageURL( | ||
| type="generic", |
There was a problem hiding this comment.
Likely also a candidate for the new git PURL type
| commit_match = re.search(allura_pattern, url) | ||
| if commit_match: | ||
| return PackageURL( | ||
| type="generic", |
There was a problem hiding this comment.
We may have a sourceforge type? Or this is for a git type
pombredanne
left a comment
There was a problem hiding this comment.
Looking great ... we need some PURL type refinements.
…t repositories. Treat GitLab subdomains as GitLab when using the repository URL instead of a generic type. Add support for salsa.debian.org, gitlab.eclipse.org, forge.fedoraproject.org domains Signed-off-by: ziad hany <ziadhany2016@gmail.com>
Signed-off-by: ziad hany <ziadhany2016@gmail.com>
Signed-off-by: ziad hany <ziadhany2016@gmail.com>
Signed-off-by: ziad hany <ziadhany2016@gmail.com>
|
@pombredanne, for undefined purl types (like Forgejo, sourceforge, or kernel), we should use |
|
We have an active PR for 'git' (package-url/purl-spec#823), but no PR for forejo or sourceforge. |
pombredanne
left a comment
There was a problem hiding this comment.
Thanks! See my review comments. IMHO we can make the code simpler, more functions are OK.
| return purl_data.qualifiers.get("commit_url", None) | ||
|
|
||
|
|
||
| def get_patch_url(purl): |
There was a problem hiding this comment.
I am not sure it makes sense to put the patch and commit URLs as PURL qualifiers. Why would you want that? To me, parsing a URL to get a commit or patch out of it is not specifically tied to a PURL, this is a URL operation. Now we could also possibly get a commit or patch from a PURL, but this would not be qualifiers on the PURL either.
There was a problem hiding this comment.
@pombredanne Ok, I thought it was similar to vcs_url in the purl-spec docs. The main goal is to get the patch text faster, since cloning a repo and using git takes a lot of time. I'm not sure if this is the right place for it in the qualifiers, though.
| return repo_url | ||
|
|
||
|
|
||
| SUB_GITLAB_DOMAINS = [r"^git\.codelinaro\.org", r"^salsa\.debian\.org", r"^gitlab\.(?!com\b)[^/]+"] |
There was a problem hiding this comment.
What does this re mean?: (?!com\b)
There was a problem hiding this comment.
There are likely a large number of gitlab instances, it feels not realistic to list them all here. Why would this need to be known if the repository_url is provided when parsing a PURL? IMHo we should not need that at all.
There was a problem hiding this comment.
@pombredanne you are right this SUB_GITLAB_DOMAINS should be only part of url2purl
r"^gitlab\.(?!com\b)[^/]+" This is a hack based on my observation about how we can identify if this is a GitLab subdomain. I noticed that they always start with gitlab.*
Ex:
gitlab.freedesktop.org/xorg/xserver
gitlab.xxx.com
The (?!com\b) this part is to filter out the main GitLab domain, gitlab.com. ( negative lookahead )
| return get_generic_template_url(namespace, name, version, "patch_url") | ||
|
|
||
|
|
||
| @commit_router.route("pkg:gitlab/.*", "pkg:bitbucket/.*", "pkg:github/.*") |
There was a problem hiding this comment.
Why not having three separate functions? the rules to build a URL from a PURL are not generic at all, so IMHO the code would be simpler and cleaner with 3 functions, one for each forge.
|
|
||
| @commit_router.route("pkg:generic/.*") | ||
| def build_generic_commit_url(purl): | ||
| """ |
There was a problem hiding this comment.
Can you give some examples as doctests?
| if not (namespace and name): | ||
| return | ||
|
|
||
| return get_generic_template_url(namespace, name, None, "repo_url") |
There was a problem hiding this comment.
you should use kwargs, but in all cases this calls for using an object with separate methods for each url type
| return get_generic_template_url(namespace, name, version, "commit_url") | ||
|
|
||
|
|
||
| @patch_router.route("pkg:generic/.*") |
There was a problem hiding this comment.
The router is designed to route like a multidispatch, so you should use it for dispatching based on specific URL/PURL patterns
| ) | ||
|
|
||
|
|
||
| def build_route_regex(domain_patterns, path_suffix="/.*"): |
There was a problem hiding this comment.
this is too complex IMHO, just use instead a proper list of route patterns
|
|
||
|
|
||
| @purl_router.route(SUB_GITLAB_ROUTE_REGEX) | ||
| def build_gitlab_sub_purl(url): |
There was a problem hiding this comment.
What you are missing also is a function that takes a URL and return details about a patch or commit, not even with PURL in the mix
There was a problem hiding this comment.
Can you explain more about what you mean by getting patch/commit details? Do you mean fetching the data and parsing the text? I thought this will be a job for VulnerableCode or ScanCode.io.
Currently, many fix-commit URLs are not converted to PURLs during collection. Adding support for these URLs would ensure more accurate url2purl and purl2url conversions, as well as generating commit and patch URLs from PURLs.
cgit/repo/commit/?id=HASHGitweb/?p=repo.git;a=commit;h=HASHApache Allura/p/project/repo/ci/HASH/Gitiles/repo/+/HASHgitea / forgejocommit/{HASH}"Related issue: