Feat: Add label filter support in pgdiskann client #724

Merged
XuanYang-cn merged 7 commits into zilliztech:main from EmumbaOrg:feat/pgdiskann-label-filtering
Apr 21, 2026

@EeshaaKhan
Contributor

@EeshaaKhan EeshaaKhan commented Feb 13, 2026

This PR adds support for label-based filtering in the pgdiskann client.

Changes introduced:

  • Added support for FilterOp.StrEqual to enable filtering by scalar string labels.
  • Introduced an optional with_scalar_labels flag to create tables with an additional label column.
  • Updated table creation logic to conditionally include scalar label support.
  • Extended insert_embeddings to handle label data when enabled.
  • Implemented prepare_filter() to dynamically construct the WHERE clause based on filter type.
  • Unified search query generation through _generate_search_query() to support filtered and unfiltered search paths.
  • Existing numeric filtering (id >= value) and non-filtered search behavior remain unchanged.
  • Includes a vector storage optimization (SET STORAGE PLAIN) that disables TOAST compression on the embedding column for improved query performance.
Screenshot from 2026-02-13 11-53-39
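
For readers following the description above, here is a minimal sketch of how a filter type could be turned into a WHERE clause and fed into a single query builder. It is not the code from this PR: the `FilterOp` enum is a stand-in (only `StrEqual` is named above; `NonFilter` and `NumGE` are assumed member names), `Filter` is a hypothetical container, the `generate_search_query` function only mirrors the role of `_generate_search_query()`, and the `<=>` distance operator and the `id`, `label`, and `embedding` column names are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class FilterOp(Enum):
    # Stand-in enum; member names other than StrEqual are assumptions.
    NonFilter = "NonFilter"
    NumGE = "NumGE"
    StrEqual = "StrEqual"


@dataclass
class Filter:
    # Hypothetical container for a filter type and its comparison value.
    op: FilterOp
    value: object = None


def prepare_filter(flt: Filter, id_col: str = "id", label_col: str = "label") -> tuple[str, tuple]:
    """Return a WHERE fragment plus the parameters to bind to it."""
    if flt.op is FilterOp.NonFilter:
        return "", ()
    if flt.op is FilterOp.NumGE:
        # Existing numeric path: id >= value.
        return f"WHERE {id_col} >= %s", (flt.value,)
    if flt.op is FilterOp.StrEqual:
        # New path: exact match on the scalar string label.
        return f"WHERE {label_col} = %s", (flt.value,)
    raise ValueError(f"unsupported filter type: {flt.op}")


def generate_search_query(table: str, where_clause: str, vec_col: str = "embedding") -> str:
    """Build one SELECT that serves both the filtered and unfiltered paths."""
    parts = [
        f"SELECT id FROM {table}",
        where_clause,  # empty string when no filter is active
        f"ORDER BY {vec_col} <=> %s LIMIT %s",
    ]
    return " ".join(p for p in parts if p)
```

Binding the label value as a query parameter rather than interpolating it into the SQL string keeps the filtered and unfiltered paths identical apart from the optional WHERE fragment.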

@EeshaaKhan
Contributor Author

/assign @XuanYang-cn

@EeshaaKhan
Contributor Author

@XuanYang-cn Could you please review this PR?

@EeshaaKhan
Contributor Author

@alwayslove2013 Hi, this PR has been pending review for a while. Could you please take a look?

@XuanYang-cn
Collaborator

Hi @EeshaaKhan, thanks for the contribution, and apologies for the delayed response! I'll review this PR shortly.

Before I do, could you please rebase onto the latest main? We've since upgraded to Pydantic v2 and updated several checks, so this PR didn't trigger the GitHub Actions workflow and may fail against the current codebase.
Thanks again for your patience!

@EeshaaKhan force-pushed the feat/pgdiskann-label-filtering branch from 4a70c07 to 2e4459c on April 17, 2026 07:18
@EeshaaKhan
Contributor Author

Hi @XuanYang-cn,
Thank you for the feedback! I've rebased the PR onto the latest main branch. All checks are now passing with no conflicts.
Ready for review! 🚀

@sre-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: EeshaaKhan, XuanYang-cn


Collaborator

@XuanYang-cn XuanYang-cn left a comment

Thanks for the PR! The overall approach is clean and consistent with how pgvector and cockroachdb handle label filtering.

Worth noting

  • ALTER TABLE ... SET STORAGE PLAIN runs unconditionally in _create_table, regardless of with_scalar_labels. Disabling TOAST compression affects storage and performance for every PgDiskANN benchmark, so it is worth a brief code comment explaining why it's needed, plus a mention in the PR description.
  • self._scalar_label_field = "label" (singular) while LabelFilter.label_field defaults to "labels" (plural). A short comment on the column name choice would prevent future confusion.

@EeshaaKhan
Contributor Author

EeshaaKhan commented Apr 20, 2026

@XuanYang-cn Thanks for the feedback!

I've addressed both points:

  1. SET STORAGE PLAIN: This runs unconditionally because it targets only the vector column. Disabling TOAST compression on that column is a standard performance optimization for PgDiskANN benchmarks, consistent with how pgvector handles it.

  2. Column naming: label is the column name in the database table, following the same convention used across all other database clients in this codebase. labels is the dataset field name. The mismatch is intentional to stay consistent with the existing standard.
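
As an illustration of the two points above, the following sketch shows where such code comments could sit in a table-creation helper. It is an assumption-laden stand-in, not the merged `_create_table`: the function name `build_create_table_sql`, the `TEXT` type for the label column, and the `id` and `embedding` column names are all hypothetical.

```python
def build_create_table_sql(table: str, dim: int, with_scalar_labels: bool = False) -> list[str]:
    """Return the DDL statements for a benchmark table, in execution order."""
    columns = ["id BIGINT PRIMARY KEY", f"embedding vector({dim})"]
    if with_scalar_labels:
        # The table column is "label" (singular); the dataset field it is
        # populated from is "labels" (plural). The mismatch is intentional.
        columns.append("label TEXT")  # column type is an assumption
    return [
        f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(columns)});",
        # Issued regardless of with_scalar_labels because it only targets the
        # vector column. In PostgreSQL, PLAIN storage disables both compression
        # and out-of-line (TOAST) storage for that column.
        f"ALTER TABLE {table} ALTER COLUMN embedding SET STORAGE PLAIN;",
    ]
```

Returning the statements as a list keeps the storage change visible as its own step, which makes it easier to see that it applies to labeled and unlabeled tables alike.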

@XuanYang-cn XuanYang-cn merged commit 63cc50a into zilliztech:main Apr 21, 2026
4 checks passed
4 participants