Add index scan by sandy-bes · Pull Request #2351 · apache/age

sandy-bes · 2026-03-03T07:14:50Z

Motivation / Problem:
As a result of load testing, a significant performance degradation was found in insertion scenarios. The scenarios used were taken from an open-source benchmark and rewritten in pure SQL. Examples of the queries can be found here:

Perf analysis showed that the main bottleneck is the entity_exists function. The root cause lies in the use of a Sequential Scan (SeqScan) to check for the existence of an entity prior to insertion. The time complexity of a SeqScan is O(N), meaning the search time grows linearly as the number of rows in the table increases. The larger the graph became, the longer each individual insertion took. This led to a drop in TPS regardless of the concurrency level (the issue was consistently reproduced with both 1 and 30 threads).

Profile Analysis (Flame Graphs & perf): I am attaching Flame Graphs for comparison.

Before Patch: The Flame Graph shows a massive "plateau" for the entity_exists function and its underlying sequential scan operations.

After Patch: The "plateau" of entity_exists has disappeared, confirming that the bottleneck is resolved.

Changes Made:

Added Index Scan (IndexScan / time complexity O(log N)) inside the entity_exists function.
Refactored other functions utilizing SeqScan — they were also migrated to use IndexScan wherever applicable.

Performance Impact:
Benchmarks were conducted on a server with 30 CPU cores and 32 GB of RAM, using a graph ranging from 20,000 to 200,000 objects over a 2-minute duration. The transition to index access completely eliminated the performance degradation associated with data volume growth:

Before: ~1,500 TPS (at peak, with subsequent degradation as the table grew).
After: Stable ~15,000 TPS (a 10x speedup).

Here are the scripts and commands for local reproduction:

1. Graph creation and population:
The sf parameter here defines the scale factor of the generated graph.
psql -d your_database -f generate_graph.sql -v sf=1
generate_graph.sql

2. Creating wrapper functions for the workload:
This script sets up the environment and creates functions with the correct parameter mapping for Apache AGE.
psql -d your_database -f setup_func_for_workload.sql
setup_func_for_workload.sql

3. Running the load test:
Running 8 clients for 120 seconds.
pgbench -d your_database -f workload_update.sql -D sf=1 -c 8 -T 120 -P
workload_update.sql

Acknowledgments:

Huge thanks to Daria Barsukova for conducting the load testing and isolating the issue.
Implementation of index scanning: Alexandra Bondar.

This commit fixes performance degradation during insertion scenarios by replacing SeqScan with IndexScan. Co-authored-by: Daria Barsukova <d.barsukova@g.nsu.ru> Co-authored-by: Alexandra Bondar <s6311704@gmail.com>

Copilot

Pull request overview

This PR targets a load-test regression where insertion throughput degraded as label tables grew, by replacing O(N) sequential scans with index-backed lookups (primarily around entity_exists and other hot paths that check entity/label presence).

Changes:

Switch entity_exists() to use an index scan when a suitable index is available (fallback to seq scan otherwise).
Refactor several other table lookups (vertex fetch, label enumeration, SET/DELETE code paths) to prefer index scans.
Add index discovery logic to locate a usable index when a PK index isn’t present.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
`src/backend/utils/adt/agtype.c`	Uses index scan to fetch a vertex tuple by graphid, with fallback to sequential scan.
`src/backend/utils/adt/age_global_graph.c`	Uses index scan to enumerate labels for a graph_oid, filtering `kind` in the heap.
`src/backend/executor/cypher_utils.c`	Reworks `entity_exists()` to prefer index scans over seq scans.
`src/backend/executor/cypher_set.c`	Uses index scan to find the tuple to update (SET/REMOVE) instead of seq scan.
`src/backend/executor/cypher_delete.c`	Uses index scan to find tuples for DELETE and to accelerate connected-edge checks where possible.
`src/backend/catalog/ag_label.c`	Uses `ag_label_graph_oid_index` to accelerate edge-label enumeration for a graph.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/backend/executor/cypher_set.c

Copilot · 2026-03-03T16:22:49Z

src/backend/utils/adt/agtype.c

+        ScanKeyInit(&scan_keys[0], 1, BTEqualStrategyNumber, F_OIDEQ,
+                    Int64GetDatum(graphid));


The sequential-scan fallback still uses F_OIDEQ to compare the graphid column. Since the index path uses F_GRAPHIDEQ (and the column type is GRAPHIDOID), the fallback should also use F_GRAPHIDEQ/GRAPHID_GET_DATUM to avoid incorrect comparisons when no suitable index is found.

Suggested change

ScanKeyInit(&scan_keys[0], 1, BTEqualStrategyNumber, F_OIDEQ,

Int64GetDatum(graphid));

ScanKeyInit(&scan_keys[0], 1, BTEqualStrategyNumber, F_GRAPHIDEQ,

GRAPHID_GET_DATUM(graphid));

This is actually existing code that I didn't write. To keep the PR focused and avoid breaking existing logic, I didn't change its internal mechanics and simply moved it under the else branch. Let me know if you'd still prefer me to fix the F_OIDEQ to F_GRAPHIDEQ comparison.

Let me think about that.

If you could fix it, that would be great. Otherwise, it will just keep reminding us and it will still need to be fixed in a diff. pr.

Done! I've refactored this as requested. I also left a note about fixing this on a similar comment during the second review iteration.

src/backend/catalog/ag_label.c

jrgemignani · 2026-03-03T17:32:27Z

@sandy-bes Please address the above Copilot suggestions. Additionally, please make sure to note any AI usage, if used.

jrgemignani · 2026-03-03T19:56:34Z

@sandy-bes Please see this as well (Opus 4.6 review). All of these items need to be address.

Review of PR #2351 — Add index scan

The core idea — replacing O(N) sequential scans with O(log N) index scans in entity_exists and related hot paths — is excellent, and the performance numbers are compelling. However, there are several correctness, security, and code quality issues that need to be addressed before this can be merged.

Critical Issues

1. RLS bypass in cypher_set.c index-scan path (security bug)

The index-scan branch in process_update_list() calls update_entity_tuple() directly without checking check_security_quals. The seq-scan fallback correctly gates the update on RLS USING-policy evaluation. This means SET/REMOVE will bypass row-level security when an index is present. The same RLS gating logic must be applied in the index-scan branch.

2. heap_freetuple on wrong tuple in cypher_set.c (memory corruption)

heap_tuple = ExecFetchSlotHeapTuple(index_slot, true, &shouldFree);
// ...
heap_tuple = update_entity_tuple(resultRelInfo, slot, estate, heap_tuple);
// ...
if (shouldFree)
    heap_freetuple(heap_tuple);  // <-- frees the UPDATE result, not the fetched tuple

After update_entity_tuple(), heap_tuple points to the new tuple, but shouldFree applies to the originally fetched tuple. This can free the wrong memory. Keep the fetched tuple in a separate variable and only free that one.

Correctness Issues

3. F_OIDEQ vs F_GRAPHIDEQ mismatch in agtype.c fallback path

The index path correctly uses F_GRAPHIDEQ, but the seq-scan fallback at line ~5709 still uses F_OIDEQ. The column type is graphid, so both paths should use F_GRAPHIDEQ / GRAPHID_GET_DATUM.

4. Index attno vs heap attno confusion in ag_label.c

ScanKeyInit for the index scan uses Anum_ag_label_name as the attribute number. For index scans, the attno should be the index column number (typically 1), not a heap attribute-number macro. Additionally, the index is on graph_oid, so referencing Anum_ag_label_name is semantically incorrect — it only works by coincidence if the value happens to equal 1. Consider using ag_label_graph_oid_index_id() instead of get_relname_relid() string lookup as well.

Performance Issue

5. Index discovery runs per-tuple in hot paths

In process_delete_list() and process_update_list(), the index discovery block (RelationGetIndexList + open/close each index) runs inside the per-entity loop. For a DELETE or SET touching N entities of the same label, this runs N times. The index OID should be cached per-label or discovered once outside the inner loop.

Code Quality Issues

6. Massive code duplication

The index discovery pattern (iterate RelationGetIndexList, open each, check indisvalid/indnatts/indkey.values[0]) is copy-pasted 6+ times across the changed files. This should be factored into a shared utility function, e.g.:

Oid find_usable_index_for_attr(Relation rel, AttrNumber attnum);

7. DETACH DELETE duplication in cypher_delete.c

The two-pass connected-edge check (START_ID pass + END_ID pass) duplicates ~120 lines of nearly identical ACL check + RLS check + delete + error logic. A helper function would reduce this significantly.

8. C++ style comments

New code uses // comments in ag_label.c and elsewhere. The AGE/PostgreSQL codebase uses /* ... */ exclusively. Please convert for consistency and C90 portability.

9. Brace style inconsistency

The PR mixes } else { with the existing codebase convention of:

}
else
{

Please follow the existing style.

10. Inconsistent lock levels

Index open calls use RowExclusiveLock in some read-only scan paths (ag_label.c label enumeration, age_global_graph.c) where AccessShareLock would be more appropriate. The lock levels for index access during reads vs writes should be reviewed for consistency.

Summary

#	Issue	Severity
1	RLS bypass in SET index path	Critical
2	`heap_freetuple` on wrong tuple	High
3	`F_OIDEQ` vs `F_GRAPHIDEQ` mismatch	Medium
4	Index attno vs heap attno confusion	Medium
5	Index discovery per-tuple in loops	Medium
6	Index discovery code duplication (6x)	High
7	DETACH DELETE logic duplication	Medium
8	C++ style comments	Low
9	Brace style inconsistency	Low
10	Inconsistent lock levels	Medium

The performance improvement is real and valuable. With the security fix (#1), memory safety fix (#2), correctness fixes (#3-4), and a refactoring pass to deduplicate the index discovery logic (#5-7), this will be in good shape.

sandy-bes · 2026-03-04T05:37:14Z

I've updated the PR description with the steps to reproduce the tests locally.

jrgemignani · 2026-03-04T08:05:19Z

@sandy-bes

I've updated the PR description with the steps to reproduce the tests locally.

I was able to validate the perf claims with Opus earlier.

But, more importantly, I can't do anything with this PR until all of the above issues (Copilot & Opus) have been addressed. Including noting any use of AI. Btw, using AI is fine, it just needs to be noted.

sandy-bes · 2026-03-04T10:04:07Z

Yes, absolutely, I understand that the review process needs to be finalized first. I will do my best to address all the feedback over the next few days.

Regarding the use of AI agents: I did not use any AI agents while writing this code.

sandy-bes · 2026-03-13T11:05:54Z

All review comments and suggestions provided by GitHub Copilot and Opus 4.6 have been addressed and integrated into this patch.

jrgemignani · 2026-03-13T18:49:05Z

@sandy-bes A lot of regression tests failed. Are you running the installcheck locally to verify the build?

sandy-bes · 2026-03-15T04:26:35Z

I forgot to run installcheck for the latest change in the commit, my bad! It turned out there was a lock mismatch. I just fixed it, the tests are passing now, and I force-pushed the update.

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-25T17:27:54Z

src/backend/utils/adt/agtype.c

+        if (curr_idx_rel->rd_index->indisvalid &&
+            curr_idx_rel->rd_index->indnatts >= 1 &&
+            curr_idx_rel->rd_index->indkey.values[0] == attnum) 


find_usable_index_for_attr() can select any valid index whose first key matches attnum, but all call sites build scan keys with BTEqualStrategyNumber (btree strategy numbering). If a non-btree index exists on that column (or an index AM with different strategy semantics), the scan can behave incorrectly or fail. Restrict the helper to only return btree indexes (and optionally skip partial/expression indexes), or change the scan-key construction to be compatible with the selected index AM/opclass.

I've updated the helper function to strictly filter for B-Tree indexes (relam == BTREE_AM_OID) and to skip partial indexes. To make this specific behavior explicit and prevent any future misuse, I also took the liberty of renaming the function to find_usable_btree_index_for_attr.

Copilot · 2026-03-25T17:27:54Z

src/backend/utils/adt/agtype.c

+        /* fallback to sequential scan */
+        ScanKeyInit(&scan_keys[0], 1, BTEqualStrategyNumber, F_OIDEQ,
+                    Int64GetDatum(graphid));
+
+        scan_desc = table_beginscan(graph_vertex_label, snapshot, 1, scan_keys);
+        tuple = heap_getnext(scan_desc, ForwardScanDirection);


The index path uses F_GRAPHIDEQ for the same attribute, but the seq-scan fallback still uses F_OIDEQ. That makes correctness depend on whether an index exists (and may be wrong if the column type/operator is graphid). Use the same operator/function for both paths (e.g., F_GRAPHIDEQ with the appropriate datum conversion) so results are consistent regardless of index availability.

As I mentioned in a similar comment, this was existing code that I simply moved into the else branch, which is why I didn't touch its internal mechanics initially. I'll update the fallback path to use F_GRAPHIDEQ with the correct datum conversion right away.

Copilot · 2026-03-25T17:27:55Z

src/backend/executor/cypher_set.c

+                            update_entity_tuple(resultRelInfo, slot, estate,
+                                                            heap_tuple);


In the seq-scan path, the return value of update_entity_tuple(...) is assigned back to heap_tuple, but in the index-scan path it is ignored. If update_entity_tuple can return a different HeapTuple (or if callers rely on the returned value for ownership/lifetime), this inconsistency can cause incorrect behavior or memory/tuple lifecycle issues. Make the index-scan path match the existing pattern (assign the return value and handle any required freeing/ownership consistently).

Suggested change

update_entity_tuple(resultRelInfo, slot, estate,

heap_tuple);

heap_tuple = update_entity_tuple(resultRelInfo, slot,

estate, heap_tuple);

I had actually removed that assignment in a previous PR iteration to fix a memory bug (calling heap_freetuple on the updated tuple instead of the fetched one caused issues).

But I completely agree that leaving the return value unassigned is bad practice and breaks consistency. I've refactored this block: we now store the fetched tuple in a dedicated original_tuple variable for safe memory freeing at the end, which allows us to correctly assign the update_entity_tuple result back to heap_tuple. Fixed it.

Copilot · 2026-03-25T17:27:55Z

src/backend/executor/cypher_utils.c

+
+        index_rel = index_open(index_oid, RowExclusiveLock);


entity_exists() is a read-only existence check, but it opens the index with RowExclusiveLock. This lock mode is stronger than necessary for scanning and can increase lock contention under concurrent workloads (especially since entity_exists is in the hot insertion path). Prefer AccessShareLock for index_open (while keeping the table lock as required by the surrounding logic).

I originally matched the table's lock level due to the upstream modification paths (create_vertex, etc.), but I see now why AccessShareLock is the correct (and much more concurrent) choice for the index scan itself. Fixed in the latest commit.

Copilot · 2026-03-25T17:27:55Z

src/backend/catalog/ag_label.c

+    ag_label = table_open(ag_label_relation_id(), AccessShareLock);
+
+    index_oid = find_usable_index_for_attr(ag_label, Anum_ag_label_graph);

    resultRelInfo = create_entity_result_rel_info(estate, "ag_catalog",
                                                  "ag_label");



This function opens ag_label via table_open(...) but later closes resultRelInfo->ri_RelationDesc instead of the ag_label relation it opened. This double-open pattern is easy to get wrong and risks leaking a relation reference/lock if the descriptors are not the same instance. Prefer using a single relation handle throughout (e.g., use resultRelInfo->ri_RelationDesc for scanning and close that, or explicitly close ag_label if it remains separately opened).

Suggested change

ag_label = table_open(ag_label_relation_id(), AccessShareLock);

index_oid = find_usable_index_for_attr(ag_label, Anum_ag_label_graph);

resultRelInfo = create_entity_result_rel_info(estate, "ag_catalog",

"ag_label");

resultRelInfo = create_entity_result_rel_info(estate, "ag_catalog",

"ag_label");

ag_label = resultRelInfo->ri_RelationDesc;

index_oid = find_usable_index_for_attr(ag_label, Anum_ag_label_graph);

I see exactly what you mean now: the code explicitly does table_open with AccessShareLock, and then create_entity_result_rel_info opens the exact same relation again via parserOpenTable with RowExclusiveLock.

This is actually existing code that I didn't write. To keep the PR focused and avoid breaking existing logic, I didn't change its internal mechanics. Let me know if you'd still prefer me to fix it.

Copilot · 2026-03-25T17:27:55Z

src/backend/utils/adt/age_global_graph.c


-    /* get all of the label names */
-    while((tuple = heap_getnext(scan_desc, ForwardScanDirection)) != NULL)
+    /* We look for 'ag_label_graph_oid_index' or any index starting with 'graph' */


The comment is inaccurate: the code does not look for an index by name/prefix; it selects the first “usable” index whose first key matches the attribute number. Update the comment to reflect the actual selection logic to avoid misleading future maintainers.

Suggested change

/* We look for 'ag_label_graph_oid_index' or any index starting with 'graph' */

/* Find a usable index whose first key column is ag_label.graph (Anum_ag_label_graph) */

jrgemignani · 2026-03-25T17:35:22Z

@sandy-bes Sorry, you know the drill by now, more Copilot,... I keep hoping it stops complaining :( But, I think we're close.

sandy-bes · 2026-03-31T06:10:43Z

@jrgemignani No worries at all, I completely understand! As much as it complains, I have to admit Copilot actually gives some really good hints and catches subtle things.

Alexandra Bondar and others added 2 commits March 3, 2026 14:03

Add index scan

0ffedcd

This commit fixes performance degradation during insertion scenarios by replacing SeqScan with IndexScan. Co-authored-by: Daria Barsukova <d.barsukova@g.nsu.ru> Co-authored-by: Alexandra Bondar <s6311704@gmail.com>

Merge branch 'master' into add_index_scan

7e07f32

jrgemignani requested a review from Copilot March 3, 2026 16:16

Copilot started reviewing on behalf of jrgemignani March 3, 2026 16:17 View session

Copilot AI reviewed Mar 3, 2026

View reviewed changes

jrgemignani requested review from MuhammadTahaNaveed and jrgemignani March 3, 2026 17:32

Address code review commets

61659f2

sandy-bes force-pushed the add_index_scan branch from 1fdce2c to 61659f2 Compare March 15, 2026 04:22

jrgemignani requested a review from Copilot March 25, 2026 17:20

Copilot started reviewing on behalf of jrgemignani March 25, 2026 17:26 View session

Copilot AI reviewed Mar 25, 2026

View reviewed changes

Address code review commets

54fb996

		ScanKeyInit(&scan_keys[0], 1, BTEqualStrategyNumber, F_OIDEQ,
		Int64GetDatum(graphid));

		update_entity_tuple(resultRelInfo, slot, estate,
		heap_tuple);

	/* We look for 'ag_label_graph_oid_index' or any index starting with 'graph' */
	/* Find a usable index whose first key column is ag_label.graph (Anum_ag_label_graph) */

Conversation

sandy-bes commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

jrgemignani commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jrgemignani commented Mar 3, 2026

Review of PR #2351 — Add index scan

Critical Issues

Correctness Issues

Performance Issue

Code Quality Issues

Summary

Uh oh!

sandy-bes commented Mar 4, 2026

Uh oh!

jrgemignani commented Mar 4, 2026

Uh oh!

sandy-bes commented Mar 4, 2026

Uh oh!

sandy-bes commented Mar 13, 2026

Uh oh!

jrgemignani commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sandy-bes commented Mar 15, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

sandy-bes commented Mar 3, 2026 •

edited

Loading

jrgemignani commented Mar 3, 2026 •

edited

Loading

jrgemignani commented Mar 13, 2026 •

edited

Loading