Skip to content

feat: add frequencies sketches#44

Merged
leerho merged 8 commits into
apache:mainfrom
PsiACE:freq
Dec 30, 2025
Merged

feat: add frequencies sketches#44
leerho merged 8 commits into
apache:mainfrom
PsiACE:freq

Conversation

@PsiACE

@PsiACE PsiACE commented Dec 26, 2025

Copy link
Copy Markdown
Member
  • support FrequentItemsSketch
  • add baisc test cases and serde tests

Signed-off-by: Chojan Shang <psiace@apache.org>
Signed-off-by: Chojan Shang <psiace@apache.org>
Signed-off-by: Chojan Shang <psiace@apache.org>
@PsiACE

PsiACE commented Dec 26, 2025

Copy link
Copy Markdown
Member Author

only 1 serde case failed on windows, ... on windows, long is 32‑bit......

@jmalkin

jmalkin commented Dec 26, 2025

Copy link
Copy Markdown

Probably don’t need to bother with a dedicated implementation for longs. That was mostly for initial testing, I think. And if rust supports templates/generics with primitives then there’s no reason for a separate implementation for longs. We didn’t do one for c++, just provided a default serde for a few basic types.

@PsiACE

PsiACE commented Dec 26, 2025

Copy link
Copy Markdown
Member Author

you're right.

just provided a default serde for a few basic types.

I'll mark this pr as a draft for now. Later, I'll remove longsketch, update the tests, and skip the serde errors on Windows.

@PsiACE PsiACE marked this pull request as draft December 26, 2025 18:20
Signed-off-by: Chojan Shang <psiace@apache.org>
@PsiACE PsiACE marked this pull request as ready for review December 26, 2025 18:37
@leerho

leerho commented Dec 26, 2025

Copy link
Copy Markdown
Member

I need to see some Rust reviewers before I approve & merge this.

@tisonkun

Copy link
Copy Markdown
Member

I'll take a look here before the new year :P

Comment thread datasketches/src/frequencies/mod.rs
Comment thread datasketches/src/frequencies/reverse_purge_item_hash_map.rs Outdated
Comment thread datasketches/src/frequencies/sketch.rs
Comment thread datasketches/src/frequencies/sketch.rs Outdated
Comment thread datasketches/src/frequencies/sketch.rs Outdated
Comment thread datasketches/src/frequencies/mod.rs
Comment thread datasketches/src/frequencies/serde.rs Outdated
Comment thread datasketches/src/frequencies/serde.rs Outdated
Comment thread datasketches/src/frequencies/serde.rs Outdated
Comment thread datasketches/src/frequencies/reverse_purge_item_hash_map.rs Outdated
Comment thread datasketches/src/frequencies/sketch.rs Outdated
Comment thread datasketches/src/frequencies/sketch.rs Outdated

@tisonkun tisonkun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM. Mostly style comments to keep our codebase consistent.

Signed-off-by: Chojan Shang <psiace@apache.org>
Signed-off-by: Chojan Shang <psiace@apache.org>
@leerho leerho requested a review from tisonkun December 30, 2025 01:47

@tisonkun tisonkun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leerho There are several further potential improvements. But those can be follow-ups.

I'd leave it to @PsiACE to decide whether to include it in this patch or in a follow-up PR.

Comment thread datasketches/src/frequencies/sketch.rs Outdated

@tisonkun tisonkun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PsiACE The main branch now includes the new general error type. So you need to merge the main branch to resolve logic conflicts.

@leerho

leerho commented Dec 30, 2025

Copy link
Copy Markdown
Member

Waiting for @PsiACE

@PsiACE

PsiACE commented Dec 30, 2025

Copy link
Copy Markdown
Member Author

I have now completed the updates. This PR should be safe to merge. cc @tisonkun @leerho

@tisonkun tisonkun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. You may leverage the new codec utils in a follow-up PR.

if let Err(err) = sketch {
assert!(matches!(err, SerdeError::InsufficientData(_)));
assert_eq!(err.kind(), ErrorKind::InvalidData);
assert!(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: googletest has a contains matcher that can provide a better error message without manual construct.

https://docs.rs/googletest/latest/googletest/matchers/fn.contains.html

Comment on lines +144 to +145
let value = self.hash_map.get(item);
value.max(0) as u64

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the cpp impl uses u64 for reverse_purge_hash_map's value. But let's review this point as a follow-up.

Read https://github.com/apache/datasketches-cpp/blob/7bb979d3ef8929e235bcd22d67579e1f695f6ecd/fi/include/reverse_purge_hash_map_impl.hpp#L346-L364 especially for how subtract_and_keep_positive_only replaces:

        self.adjust_all_values_by(-median);
        self.keep_only_positive_counts();

This is the only usage where i64 is needed in the current impl.

@leerho leerho merged commit 38fad36 into apache:main Dec 30, 2025
9 checks passed
@tisonkun

Copy link
Copy Markdown
Member

Cross-reference to #35.

I don't know whether this sketch works well in approx_topk scenario. I'll try.

@PsiACE PsiACE deleted the freq branch January 5, 2026 00:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants