[SolidFire] Fix inter-cluster VM migrations on zone-wide SolidFire pools #11633
base: 4.19
Conversation
@blueorangutan package

@nvazquez a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Codecov Report ❌ — coverage diff for impacted files:

```
@@             Coverage Diff              @@
##               4.19   #11633      +/-  ##
============================================
- Coverage     15.18%   15.17%    -0.01%
+ Complexity    11370    11369        -1
============================================
  Files          5415     5416        +1
  Lines        476082   476208      +126
  Branches      58129    58155       +26
============================================
- Hits          72288    72283        -5
- Misses       395702   395834      +132
+ Partials       8092     8091        -1
```
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 14992

@blueorangutan test matrix

@nvazquez a [SL] Trillian-Jenkins matrix job (EL8 mgmt + EL8 KVM, Ubuntu22 mgmt + Ubuntu22 KVM, EL8 mgmt + VMware 7.0u3, EL9 mgmt + XCP-ng 8.2) has been kicked to run smoke tests
[SF] Trillian test result (tid-14322)

[SF] Trillian test result (tid-14323)

[SF] Trillian test result (tid-14324)

[SF] Trillian test result (tid-14325)
weizhouapache left a comment:

code lgtm

Review thread on ...chestration/src/main/java/org/apache/cloudstack/engine/orchestration/VolumeOrchestrator.java (outdated, resolved)
@harikrishna-patnala @sureshanaparti @vishesh92 @shwstppr

Review thread on ...ns/hypervisors/vmware/src/main/java/com/cloud/hypervisor/vmware/resource/VmwareResource.java (resolved)
harikrishna-patnala left a comment:

code LGTM
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 15068

@blueorangutan test ol8 vmware-80u3

@weizhouapache a [SL] Trillian-Jenkins test job (ol8 mgmt + vmware-80u3) has been kicked to run smoke tests

[SF] Trillian test result (tid-14378)

@blueorangutan test ol8 vmware-80u3

@weizhouapache a [SL] Trillian-Jenkins test job (ol8 mgmt + vmware-80u3) has been kicked to run smoke tests

@blueorangutan test ol8 vmware-80u3

@weizhouapache a [SL] Trillian-Jenkins test job (ol8 mgmt + vmware-80u3) has been kicked to run smoke tests

[SF] Trillian test result (tid-14440)

@blueorangutan package

@weizhouapache @nvazquez, anything we can do to move this forward? (do we still plan to merge this on 4.19?)

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 16031

@blueorangutan test ol8 vmware-80u3

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + vmware-80u3) has been kicked to run smoke tests

[SF] Trillian Build Failed (tid-14981)

Hi @DaanHoogland @weizhouapache, this PR is not ready yet; I have marked it as a draft for now.

@blueorangutan package

@borisstoyanov a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Pull request overview
This PR fixes two bugs in the SolidFire 1:1 plugin when using zone-wide storage for inter-cluster VM migrations within VMware environments:
Changes:
- Implements `zoneWideVolumesAvailableWithoutClusterMotion()` and `zoneWideVolumesDatastoreCleanupOnOriginHostAfterInterClusterMigration()` in SolidFirePrimaryDataStoreDriver to indicate that zone-wide volumes don't require storage migration and need cleanup on the origin host after migration
- Adds logic to mount datastores on the destination cluster during VM migration preparation and unmount them from the origin cluster after successful migration
- Introduces a new `UnmountDatastoresCommand` to handle datastore unmounting on the source host after inter-cluster migration
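The two driver hooks in the first bullet can be pictured with a minimal sketch. The interface and method names follow the PR description; the surrounding scaffolding (`ZoneWideMigrationHooks`, the `main` harness) is invented here for illustration and is not the real CloudStack `PrimaryDataStoreDriver` API.

```java
// Illustrative sketch only: a trimmed-down stand-in for the two boolean hooks
// the PR adds to SolidFirePrimaryDataStoreDriver.
interface ZoneWideMigrationHooks {
    boolean zoneWideVolumesAvailableWithoutClusterMotion();
    boolean zoneWideVolumesDatastoreCleanupOnOriginHostAfterInterClusterMigration();
}

public class SolidFireHooksSketch implements ZoneWideMigrationHooks {
    @Override
    public boolean zoneWideVolumesAvailableWithoutClusterMotion() {
        // Zone-wide SolidFire volumes are reachable from any cluster in the
        // zone, so inter-cluster migration needs no storage motion.
        return true;
    }

    @Override
    public boolean zoneWideVolumesDatastoreCleanupOnOriginHostAfterInterClusterMigration() {
        // The datastore mounted on the origin cluster must be unmounted once
        // the VM has moved to the destination cluster.
        return true;
    }

    public static void main(String[] args) {
        ZoneWideMigrationHooks d = new SolidFireHooksSketch();
        System.out.println(d.zoneWideVolumesAvailableWithoutClusterMotion());
        System.out.println(d.zoneWideVolumesDatastoreCleanupOnOriginHostAfterInterClusterMigration());
    }
}
```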
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| plugins/storage/volume/solidfire/src/main/java/org/apache/cloudstack/storage/datastore/driver/SolidFirePrimaryDataStoreDriver.java | Implements new driver methods to enable zone-wide inter-cluster migrations without storage motion and enable cleanup on origin host |
| plugins/hypervisors/vmware/src/main/java/com/cloud/storage/resource/VmwareStorageProcessor.java | Changes unmountVmfsDatastore visibility from private to public for use in UnmountDatastoresCommand |
| plugins/hypervisors/vmware/src/main/java/com/cloud/hypervisor/vmware/resource/VmwareResource.java | Adds datastore preparation logic during migration and implements UnmountDatastoresCommand handling |
| engine/orchestration/src/main/java/org/apache/cloudstack/engine/orchestration/VolumeOrchestrator.java | Adds logic to detect inter-cluster migrations and populate disk details for datastore mounting/unmounting |
| engine/orchestration/src/main/java/com/cloud/vm/VirtualMachineManagerImpl.java | Integrates post-migration cleanup to unmount datastores from origin host |
| engine/api/src/main/java/org/apache/cloudstack/engine/subsystem/api/storage/PrimaryDataStoreDriver.java | Adds new interface method for drivers to indicate if cleanup is needed on origin host |
| engine/api/src/main/java/org/apache/cloudstack/engine/orchestration/service/VolumeOrchestrationService.java | Adds interface method for post-migration datastore release |
| core/src/main/java/com/cloud/agent/api/UnmountDatastoresCommand.java | New command class for unmounting datastores |
| api/src/main/java/com/cloud/agent/api/to/DiskTO.java | Adds constants for SCOPE and INTER_CLUSTER_MIGRATION details |
```java
private void prepareDatastoreForZoneWideManagedStorageInterClusterMigration(DiskTO disk, VmwareHypervisorHost hyperHost) throws Exception {
    Map<String, String> details = disk.getDetails();
    if (MapUtils.isEmpty(details) || !details.containsKey(DiskTO.SCOPE) ||
            !details.containsKey(DiskTO.MANAGED) && !details.containsKey(DiskTO.INTER_CLUSTER_MIGRATION)) {
```
Copilot AI · Jan 26, 2026:
The boolean logic in this condition is incorrect. The expression !details.containsKey(DiskTO.MANAGED) && !details.containsKey(DiskTO.INTER_CLUSTER_MIGRATION) should be (!details.containsKey(DiskTO.MANAGED) || !details.containsKey(DiskTO.INTER_CLUSTER_MIGRATION)) with OR instead of AND.
The current logic means the method returns only when BOTH MANAGED and INTER_CLUSTER_MIGRATION are missing, but the subsequent checks on lines 4666-4668 explicitly check for the presence of these keys. This means if either key is present (but not both), the method won't return early but will later evaluate false in the checks below, making the early return ineffective.
The intended logic should be: return early if SCOPE is missing OR if either MANAGED or INTER_CLUSTER_MIGRATION is missing.
Suggested change:

```diff
-            !details.containsKey(DiskTO.MANAGED) && !details.containsKey(DiskTO.INTER_CLUSTER_MIGRATION)) {
+            (!details.containsKey(DiskTO.MANAGED) || !details.containsKey(DiskTO.INTER_CLUSTER_MIGRATION))) {
```
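The operator-precedence issue Copilot describes can be checked in isolation. This sketch uses plain string keys and maps as stand-ins for the `DiskTO` constants and disk details; it is not CloudStack code, just a truth-table demonstration of why the AND form fails to return early when exactly one of the two keys is present.

```java
import java.util.HashMap;
import java.util.Map;

public class GuardLogicCheck {
    // Buggy form from the PR: returns early only when BOTH keys are missing.
    static boolean buggyGuard(Map<String, String> d) {
        return d.isEmpty() || !d.containsKey("scope")
                || !d.containsKey("managed") && !d.containsKey("interClusterMigration");
    }

    // Corrected form: returns early when EITHER key is missing.
    static boolean fixedGuard(Map<String, String> d) {
        return d.isEmpty() || !d.containsKey("scope")
                || !d.containsKey("managed") || !d.containsKey("interClusterMigration");
    }

    public static void main(String[] args) {
        Map<String, String> onlyManaged = new HashMap<>();
        onlyManaged.put("scope", "ZONE");
        onlyManaged.put("managed", "true"); // interClusterMigration deliberately missing

        System.out.println(buggyGuard(onlyManaged)); // false: no early return
        System.out.println(fixedGuard(onlyManaged)); // true: early return as intended
    }
}
```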
```java
_storageProcessor.prepareManagedDatastore(context, hyperHost, datastoreName,
        details.get(DiskTO.IQN), details.get(DiskTO.STORAGE_HOST),
        Integer.parseInt(details.get(DiskTO.STORAGE_PORT)));
```
Copilot AI · Jan 26, 2026:
There's a potential NumberFormatException if details.get(DiskTO.STORAGE_PORT) returns null or a non-numeric value. While the comment on line 4672 suggests details should not be null for managed storage, there's no explicit validation that STORAGE_PORT is present or valid before parsing. Consider adding a null check and validation before calling Integer.parseInt().
Suggested change:

```diff
-_storageProcessor.prepareManagedDatastore(context, hyperHost, datastoreName,
-        details.get(DiskTO.IQN), details.get(DiskTO.STORAGE_HOST),
-        Integer.parseInt(details.get(DiskTO.STORAGE_PORT)));
+String storagePortString = details.get(DiskTO.STORAGE_PORT);
+if (StringUtils.isBlank(storagePortString) || !NumberUtils.isDigits(storagePortString)) {
+    s_logger.warn(String.format("Skipping preparation of managed datastore %s due to invalid storage port value: %s",
+            datastoreName, storagePortString));
+    return;
+}
+int storagePort = Integer.parseInt(storagePortString);
+_storageProcessor.prepareManagedDatastore(context, hyperHost, datastoreName,
+        details.get(DiskTO.IQN), details.get(DiskTO.STORAGE_HOST), storagePort);
```
```java
try {
    _agentMgr.send(srcHostId, cmd);
} catch (AgentUnavailableException | OperationTimedoutException e) {
    throw new RuntimeException(e);
```
Copilot AI · Jan 26, 2026:
Throwing a RuntimeException wrapping AgentUnavailableException or OperationTimedoutException may cause the entire migration to fail unexpectedly. Consider logging the error and handling it gracefully instead of propagating it as a RuntimeException, since this is a cleanup operation that happens after the VM has successfully migrated. The migration has already succeeded at this point (line 2887), so failing here could leave the system in an inconsistent state.
Suggested change:

```diff
-    throw new RuntimeException(e);
+    s_logger.warn("Failed to unmount datastores on origin host " + srcHostId +
+            " after migration of VM id=" + vmId + ". Datastores to unmount: " + poolsToReleaseOnOrigin, e);
```
```java
@Inject private VolumeDao volumeDao;
@Inject private VolumeDetailsDao volumeDetailsDao;
@Inject private VolumeDataFactory volumeFactory;
@Inject private AgentManager agentManager;
```
Copilot AI · Jan 26, 2026:
The AgentManager is injected but never used in this class. This appears to be an unnecessary import and injection. Consider removing this if it's not needed for the functionality being implemented.
```java
import javax.inject.Inject;
// ...
import com.cloud.agent.AgentManager;
```
Copilot AI · Jan 26, 2026:
The import for AgentManager is unused. This import was added but the AgentManager field is never actually used in the class. Consider removing this import along with the unused field injection.
Suggested change:

```diff
-import com.cloud.agent.AgentManager;
```
Description
This PR addresses two bugs in the SolidFire 1:1 plugin when using zone-wide storage:
Fixes: #11297
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
Screenshots (if appropriate):
How Has This Been Tested?
How did you try to break this feature and the system with this change?