Skip to content

[fix] [broker] Remove the repl cursor if the creation fails and the replication is turned off#20025

Closed
poorbarcode wants to merge 8 commits intoapache:masterfrom
poorbarcode:flaky/testCreateTopicWithZombieReplicatorCursor
Closed

[fix] [broker] Remove the repl cursor if the creation fails and the replication is turned off#20025
poorbarcode wants to merge 8 commits intoapache:masterfrom
poorbarcode:flaky/testCreateTopicWithZombieReplicatorCursor

Conversation

@poorbarcode
Copy link
Copy Markdown
Contributor

@poorbarcode poorbarcode commented Apr 6, 2023

Fixes #20010

Motivation

The Cursorrepl is automatically created when replication is turned on and it will be deleted when replication is turned off. But there has a bug: if the Topic is not online when you turn off replication, the cursor cannot be deleted, and #19972 fixes this bug.

But #19972 cannot solve the following scenario that cause the flaky test #20010:

  1. a replication task starting
  2. get clusters( responses [local_cluster, remote_cluster])
  3. create cursor repl
  4. at this moment, the remote_cluster is removed, and the remote cluster is down now.
  5. then the task start replicator will fail, but the cursor still there

Modifications

Differing with #19972, do remove the repl cursor when the task start replicator is failed.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:


if (topicLevelPolicy) {
admin.topics().setReplicationClusters(topicName, Collections.singletonList(remoteCluster));
admin.topics().setReplicationClusters(topicName, Arrays.asList(remoteCluster, "test"));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you add another cluster "test" to this test? Even without the change of PersistentTopic, this test could still pass.

Copy link
Copy Markdown
Contributor Author

@poorbarcode poorbarcode Apr 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the cluster is not set correctly, it can mislead others: clusters is an empty collection when Replication is disabled.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's another topic about the difference determined by whether the replication is enabled. It does not affect this test.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's another topic about the difference determined by whether the replication is enabled. It does not affect this test.

Correct.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, it's better to revert this unrelated change.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So @poorbarcode Is there any other reason to add the local cluster here? It might be misleading that adding this change could make test better.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted thse changes

@BewareMyPower
Copy link
Copy Markdown
Contributor

It's still flaky.

image

java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.pulsar.metadata.api.MetadataStoreException$NotFoundException: remote

	at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2096)
	at org.apache.pulsar.broker.service.persistent.PersistentTopicTest.testCreateTopicWithZombieReplicatorCursor(PersistentTopicTest.java:587)

@Technoboy- Technoboy- added this to the 3.0.0 milestone Apr 6, 2023
@BewareMyPower
Copy link
Copy Markdown
Contributor

Could you also change the title? The PR title is duplicated with #19972.

Comment on lines +600 to +602
Awaitility.await().atMost(10, TimeUnit.SECONDS)
.until(() -> {
topic.initialize().get(3, TimeUnit.SECONDS);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's meaningless because topic.initialize might throw an exception and it won't be retried.


java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.pulsar.metadata.api.MetadataStoreException$NotFoundException: remote

	at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2096)
	at org.apache.pulsar.broker.service.persistent.PersistentTopicTest.lambda$testCreateTopicWithZombieReplicatorCursor$10(PersistentTopicTest.java:602)
	at org.awaitility.core.CallableCondition$ConditionEvaluationWrapper.eval(CallableCondition.java:99)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps .ignoreExceptionsInstanceOf(MetadataStoreException.NotFoundException.class) could help?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, the root cause is not here. I added a try-catch block to swallow the exception, the test could still fail.

            try {
                topic.initialize().get(3, TimeUnit.SECONDS);
            } catch (ExecutionException e) {
                log.warn("Failed to initialize: {}", e.getCause().getMessage());
            }

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, the root cause is not here. I added a try-catch block to swallow the exception, the test could still fail.

The root cause is like this:

  1. a replication task starting
  2. failed to get cluster meta from ZK
  3. check replication policies
  • Since the policy is updated asynchronously, the cache has not been updated.
  • get clusters( responses [local_cluster, remote_cluster])
  1. Since check replication successes, will not delete the cursor. bingo

Copy link
Copy Markdown
Contributor Author

@poorbarcode poorbarcode Apr 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to guarantee that the cursor will be deleted the first time because the client will try to reload the topic. We just need to guarantee that the cursor will be deleted eventually. So I did this change:

Awaitility.await().atMost(10, TimeUnit.SECONDS)
                .until(() -> {
                    topic.initialize().get(3, TimeUnit.SECONDS);
                    return !topic.getManagedLedger().getCursors().iterator().hasNext();
                });

topic.initialize().get(3, TimeUnit.SECONDS); just mock the load operation by client, and verify that the cursor will be deleted eventually.

Copy link
Copy Markdown
Contributor Author

@poorbarcode poorbarcode Apr 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've run this test 4*200 times with no errors

截屏2023-04-06 21 04 25

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just mock the load operation by client,

Actually, the operation failed at topic.initialize in my local env. An exception will be thrown and it won't retry.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have change the test to Awaitility.await().atMost(10, TimeUnit.SECONDS).ignoreExceptions().untilAsserted(, it will be fixed

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

Merging #20025 (903466c) into master (8c50a6c) will increase coverage by 39.64%.
The diff coverage is 76.66%.

Impacted file tree graph

@@              Coverage Diff              @@
##             master   #20025       +/-   ##
=============================================
+ Coverage     33.17%   72.81%   +39.64%     
- Complexity    12158    31601    +19443     
=============================================
  Files          1499     1860      +361     
  Lines        113832   137406    +23574     
  Branches      12368    15131     +2763     
=============================================
+ Hits          37769   100059    +62290     
+ Misses        71143    29381    -41762     
- Partials       4920     7966     +3046     
Flag Coverage Δ
inttests 24.41% <43.33%> (?)
systests 24.87% <0.00%> (?)
unittests 72.11% <76.66%> (+38.93%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...sar/broker/service/persistent/PersistentTopic.java 79.49% <76.66%> (+29.87%) ⬆️

... and 1530 files with indirect coverage changes

Comment on lines +1737 to +1738
replicationStartFuture.complete(null);
return removeReplicator(remoteCluster).whenComplete((ignore2, remoteCursorEx) -> {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the replicationStartFuture completed before removing the replicator?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already fixed. Thanks

Comment on lines +564 to +569
try {
admin.topics().setReplicationClusters(topicName, Arrays.asList(remoteCluster, "test"));
return true;
} catch (Exception ex) {
return false;
}
Copy link
Copy Markdown
Member

@lhotari lhotari Apr 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use Awaitility.await().ignoreExceptions().untilAsserted( instead of the try / ... / return true / catch Exception ... return false solution.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Thanks, already change to Awaitility.await().ignoreExceptions().untilAsserted

@BewareMyPower
Copy link
Copy Markdown
Contributor

I've fixed it locally. I will open another PR to test.

image

@BewareMyPower
Copy link
Copy Markdown
Contributor

@poorbarcode Could you also change the title? The PR title is duplicated with #19972.

@poorbarcode poorbarcode changed the title [fix] [broker] Ignore and remove the replicator cursor when the remote cluster is absent [fix] [broker] Remove the repl cursor if the creation fails and the replication is turned off Apr 6, 2023
@poorbarcode
Copy link
Copy Markdown
Contributor Author

Could you also change the title? The PR title is duplicated with #19972.

Done

@BewareMyPower
Copy link
Copy Markdown
Contributor

BewareMyPower commented Apr 6, 2023

Here is a failure from my local env: failure.log

I see many Starting replicator logs, we should also check the remote clusters policy before calling startReplicator, which creates a replication cursor and could be called by:

onPoliciesUpdate
  checkReplicationAndRetryOnFailure
    checkReplication

}).whenComplete((ignore, ex) -> {
if (ex == null){
replicationStartFuture.complete(null);
} else {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to print the ex warning here.

@BewareMyPower
Copy link
Copy Markdown
Contributor

After discussing with @poorbarcode offline, let's move to #20037 to fix the flaky test.

@poorbarcode poorbarcode closed this Apr 7, 2023
@michaeljmarshall
Copy link
Copy Markdown
Member

As discussed on the mailing list https://lists.apache.org/thread/w4jzk27qhtosgsz7l9bmhf1t7o9mxjhp, there is no plan to release 2.9.6, so I am going to remove the release/2.9.6 label

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/geo-replication doc-not-needed Your PR changes do not impact docs ready-to-test release/2.10.5 release/2.11.2 type/bug The PR fixed a bug or issue reported a bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky-test: PersistentTopicTest.testCreateTopicWithZombieReplicatorCursor

6 participants