Copilot AI commented Jan 31, 2026

Implements the cosine_similarity array function following DuckDB semantics. The function computes the cosine similarity between two float arrays: dot(x, y) / (||x|| * ||y||).

Changes

Backend (C++)

  • Added CosineSimilarity class in function_array_distance.h
  • Implemented computation in function_array_distance.cpp using the existing FunctionArrayDistance template
  • Returns 0.0 for zero vectors (consistent with DuckDB)
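
The computation described above can be illustrated with a minimal standalone sketch. It is not the Doris FunctionArrayDistance code: the names cosine_similarity_sketch and check_usage_examples are hypothetical, accumulating in double is an assumption of this sketch, and returning 0.0 for empty arrays is only a side effect of the zero-norm check here, not a documented result.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical standalone helper, not the Doris implementation.
// Computes dot(x, y) / (||x|| * ||y||) and returns 0.0 when either norm is zero.
inline float cosine_similarity_sketch(const std::vector<float>& x,
                                      const std::vector<float>& y) {
    if (x.size() != y.size()) {
        throw std::invalid_argument("arrays must have the same length");
    }
    // Assumption: accumulate in double to limit rounding error.
    double dot = 0.0;
    double sq_x = 0.0;
    double sq_y = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double xi = x[i];
        const double yi = y[i];
        dot += xi * yi;
        sq_x += xi * xi;
        sq_y += yi * yi;
    }
    // Zero vector (or, in this sketch, an empty array): avoid dividing by zero.
    if (sq_x == 0.0 || sq_y == 0.0) {
        return 0.0f;
    }
    return static_cast<float>(dot / (std::sqrt(sq_x) * std::sqrt(sq_y)));
}

// Self-check covering the identical, orthogonal, opposite, and zero-vector cases.
inline void check_usage_examples() {
    const float eps = 1e-6f;
    assert(std::fabs(cosine_similarity_sketch({1.0f, 2.0f, 3.0f}, {1.0f, 2.0f, 3.0f}) - 1.0f) < eps);
    assert(cosine_similarity_sketch({1.0f, 0.0f}, {0.0f, 1.0f}) == 0.0f);
    assert(std::fabs(cosine_similarity_sketch({1.0f, 2.0f, 3.0f}, {-1.0f, -2.0f, -3.0f}) + 1.0f) < eps);
    assert(cosine_similarity_sketch({0.0f, 0.0f, 0.0f}, {1.0f, 2.0f, 3.0f}) == 0.0f);
}
```

Checking the squared norms before taking square roots keeps the zero-vector branch exact and avoids producing NaN from 0/0.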

Frontend (Java)

  • Created CosineSimilarity.java scalar function class
  • Added visitor method in ScalarFunctionVisitor.java
  • Registered in BuiltinScalarFunctions.java

Tests

  • BE unit tests: identical/orthogonal/opposite vectors, zero vectors, empty arrays, known values
  • Regression tests: NULL handling, array size validation, table queries, relationship with cosine_distance

Usage

SELECT cosine_similarity([1, 2, 3], [1, 2, 3]);  -- 1.0 (identical)
SELECT cosine_similarity([1, 0], [0, 1]);        -- 0.0 (orthogonal)
SELECT cosine_similarity([1, 2, 3], [-1, -2, -3]); -- -1.0 (opposite)

-- Relationship: cosine_similarity + cosine_distance = 1.0
SELECT cosine_similarity([1, 2, 3], [3, 5, 7]) + cosine_distance([1, 2, 3], [3, 5, 7]); -- 1.0
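
The relationship in the last query can be worked by hand. The derivation below assumes cosine_distance is defined as 1 minus the cosine similarity, which is what the stated identity implies; the decimal values are rounded, and a floating-point sum may differ from 1.0 by rounding error.

```latex
\begin{align*}
x &= (1, 2, 3), \qquad y = (3, 5, 7) \\
x \cdot y &= 1\cdot 3 + 2\cdot 5 + 3\cdot 7 = 34 \\
\lVert x \rVert &= \sqrt{1 + 4 + 9} = \sqrt{14}, \qquad
\lVert y \rVert = \sqrt{9 + 25 + 49} = \sqrt{83} \\
\text{cosine\_similarity}(x, y) &= \frac{34}{\sqrt{14}\,\sqrt{83}} = \frac{34}{\sqrt{1162}} \approx 0.9974 \\
\text{cosine\_distance}(x, y) &= 1 - \frac{34}{\sqrt{1162}} \approx 0.0026 \\
\text{cosine\_similarity}(x, y) + \text{cosine\_distance}(x, y) &= 1
\end{align*}
```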

Behavior

  • Input: Two ARRAY<FLOAT> of equal length
  • Output: FLOAT
  • NULL arrays or NULL elements throw exceptions (consistent with other distance functions)
  • Different array sizes throw exceptions
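
The input rules above can be sketched as a small validation step. This is only an illustration under stated assumptions: nullable SQL arrays are modelled with std::optional, and the names FloatArray, NullableFloatArray, require_dense, validate_inputs, and rejects are hypothetical, not Doris APIs. The actual error type and message raised by Doris are not shown in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <vector>

// Hypothetical model of a nullable SQL array: the array or any element may be NULL.
using FloatArray = std::vector<std::optional<float>>;
using NullableFloatArray = std::optional<FloatArray>;

// Returns the plain values, or throws if the array or any element is NULL.
inline std::vector<float> require_dense(const NullableFloatArray& arr) {
    if (!arr.has_value()) {
        throw std::invalid_argument("array argument must not be NULL");
    }
    std::vector<float> out;
    out.reserve(arr->size());
    for (const std::optional<float>& v : *arr) {
        if (!v.has_value()) {
            throw std::invalid_argument("array elements must not be NULL");
        }
        out.push_back(*v);
    }
    return out;
}

// Throws unless both inputs are non-NULL, free of NULL elements, and equal in length.
inline void validate_inputs(const NullableFloatArray& x, const NullableFloatArray& y) {
    const std::vector<float> dx = require_dense(x);
    const std::vector<float> dy = require_dense(y);
    if (dx.size() != dy.size()) {
        throw std::invalid_argument("arrays must have the same length");
    }
}

// Returns true if validate_inputs rejects the pair.
inline bool rejects(const NullableFloatArray& x, const NullableFloatArray& y) {
    try {
        validate_inputs(x, y);
        return false;
    } catch (const std::invalid_argument&) {
        return true;
    }
}

// Self-check: one accepted pair and the three rejected cases from the list above.
inline void check_validation_examples() {
    assert(!rejects(FloatArray{1.0f, 2.0f}, FloatArray{3.0f, 4.0f}));
    assert(rejects(std::nullopt, FloatArray{1.0f}));                   // NULL array
    assert(rejects(FloatArray{1.0f, std::nullopt}, FloatArray{1.0f, 2.0f})); // NULL element
    assert(rejects(FloatArray{1.0f, 2.0f}, FloatArray{1.0f}));         // size mismatch
}
```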

Warning

I tried to connect to the following addresses, but was blocked by firewall rules:

  • duckdb.org
    • Triggering command: /home/REDACTED/work/_temp/ghcca-node/node/bin/node /home/REDACTED/work/_temp/ghcca-node/node/bin/node --enable-source-maps /home/REDACTED/work/_temp/copilot-developer-action-main/dist/index.js (dns block)


@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copilot AI changed the title from "[WIP] Add Array function cosine_similarity implementation" to "[Feature] Implement cosine_similarity array function" Jan 31, 2026
Copilot AI requested a review from zclllyybb January 31, 2026 08:36
@zclllyybb
Contributor

run buildall

@doris-robot

BE UT Coverage Report

Increment line coverage 100.00% (15/15) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.52% (19289/36724)
Line Coverage 36.00% (179249/497974)
Region Coverage 32.41% (139048/429043)
Branch Coverage 33.35% (60154/180353)

@doris-robot

TPC-H: Total hot run time: 31615 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit ee407798f640e58bcf140c4eeeed1363113ff6cf, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17620	5381	5052	5052
q2	2067	354	191	191
q3	10154	1288	738	738
q4	10230	852	312	312
q5	7532	2150	1881	1881
q6	188	178	149	149
q7	881	748	621	621
q8	9260	1427	1090	1090
q9	5246	4843	4802	4802
q10	6827	1945	1550	1550
q11	490	293	266	266
q12	344	371	223	223
q13	17797	4076	3236	3236
q14	250	241	224	224
q15	909	808	819	808
q16	706	669	623	623
q17	651	831	436	436
q18	6787	6604	6426	6426
q19	1235	982	604	604
q20	384	342	230	230
q21	2575	1979	1891	1891
q22	355	315	262	262
Total cold run time: 102488 ms
Total hot run time: 31615 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	5336	5318	5306	5306
q2	267	326	253	253
q3	2172	2681	2274	2274
q4	1352	1737	1313	1313
q5	4242	4188	4082	4082
q6	226	186	139	139
q7	2017	2102	1916	1916
q8	2710	2451	2397	2397
q9	7465	7444	7555	7444
q10	2817	3023	2661	2661
q11	559	474	460	460
q12	761	746	600	600
q13	3881	4432	3467	3467
q14	290	352	331	331
q15	891	881	809	809
q16	653	739	659	659
q17	1179	1617	1390	1390
q18	8144	8096	7970	7970
q19	864	840	804	804
q20	2147	2142	2057	2057
q21	4739	4149	4170	4149
q22	545	571	508	508
Total cold run time: 53257 ms
Total hot run time: 50989 ms

@doris-robot

ClickBench: Total hot run time: 28.07 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit ee407798f640e58bcf140c4eeeed1363113ff6cf, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.05	0.05	0.04
query2	0.09	0.04	0.05
query3	0.26	0.08	0.09
query4	1.60	0.11	0.11
query5	0.27	0.24	0.26
query6	1.16	0.68	0.67
query7	0.03	0.02	0.02
query8	0.06	0.04	0.05
query9	0.56	0.50	0.50
query10	0.55	0.54	0.55
query11	0.14	0.10	0.09
query12	0.14	0.10	0.11
query13	0.64	0.61	0.63
query14	1.07	1.07	1.05
query15	0.88	0.86	0.88
query16	0.38	0.39	0.41
query17	1.15	1.12	1.11
query18	0.23	0.20	0.21
query19	2.10	1.88	2.08
query20	0.02	0.01	0.01
query21	15.44	0.23	0.14
query22	5.07	0.05	0.04
query23	15.73	0.27	0.10
query24	2.98	0.63	0.27
query25	0.08	0.11	0.08
query26	0.14	0.13	0.13
query27	0.06	0.05	0.06
query28	4.60	1.13	0.97
query29	12.55	3.98	3.16
query30	0.27	0.13	0.14
query31	2.81	0.66	0.40
query32	3.23	0.60	0.49
query33	3.26	3.23	3.37
query34	16.24	5.42	4.69
query35	4.74	4.74	4.83
query36	0.65	0.52	0.49
query37	0.11	0.07	0.06
query38	0.07	0.04	0.04
query39	0.05	0.03	0.03
query40	0.18	0.17	0.16
query41	0.09	0.04	0.03
query42	0.04	0.03	0.03
query43	0.05	0.04	0.03
Total cold run time: 99.82 s
Total hot run time: 28.07 s

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (15/15) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.58% (25767/35999)
Line Coverage 54.21% (269290/496797)
Region Coverage 51.85% (224741/433453)
Branch Coverage 53.16% (96262/181085)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 92.31% (12/13) 🎉
Increment coverage report
Complete coverage report

@github-actions
Contributor

github-actions bot commented Feb 3, 2026

PR approved by anyone and no changes requested.
