[Fix-18338] check task if it's waiting for TaskGroup slot when pause/kill#18344
[Fix-18338] check task if it's waiting for TaskGroup slot when pause/kill#18344eye-gu wants to merge 4 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Fixes issue #18338 by ensuring tasks that are waiting for a TaskGroup slot can be paused/killed directly (instead of entering a retry loop), and adds unit coverage for the new behavior.
Changes:
- Add
ITaskGroupCoordinator#isTaskWaitingForTaskGroupSlotand implement it inTaskGroupCoordinator. - Update
TaskSubmittedStateActionpause/kill handling to release TaskGroup queue entries and publish paused/killed events immediately when the task is waiting for a TaskGroup slot. - Add unit tests for both the new coordinator query and the submitted-state pause/kill behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/ITaskGroupCoordinator.java | Adds API to detect whether a task is currently waiting for a TaskGroup slot. |
| dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/TaskGroupCoordinator.java | Implements the runtime “waiting for slot” check used by pause/kill handling. |
| dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/task/statemachine/TaskSubmittedStateAction.java | Pauses/kills tasks immediately when they’re waiting for a TaskGroup slot (avoids retry loop). |
| dolphinscheduler-master/src/test/java/org/apache/dolphinscheduler/server/master/engine/TaskGroupCoordinatorTest.java | Adds unit test coverage for isTaskWaitingForTaskGroupSlot. |
| dolphinscheduler-master/src/test/java/org/apache/dolphinscheduler/server/master/engine/task/statemachine/TaskSubmittedStateActionTest.java | Adds unit tests validating pause/kill behavior for tasks waiting on TaskGroup slots. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
| if (taskGroupCoordinator.isTaskWaitingForTaskGroupSlot(taskExecution.getTaskInstance())) { | ||
| // Release the TaskGroupQueue first to prevent TaskGroupCoordinator | ||
| // from concurrently acquiring the slot and racing with this pause | ||
| taskGroupCoordinator.releaseTaskGroupSlot(taskExecution.getTaskInstance()); | ||
| log.info("Task: {} is waiting for TaskGroup slot, pause it directly", taskExecution.getName()); | ||
| taskExecution.getWorkflowEventBus().publish(TaskPausedLifecycleEvent.of(taskExecution)); | ||
| return; | ||
| } |
There was a problem hiding this comment.
This may result in a state error due to concurrency.
There was a problem hiding this comment.
You're right, there is a concurrency issue with the TaskGroupCoordinator's periodic task. However, the leaked slot will be fixed by amendTaskGroupUseSize eventually, and after the task enters PAUSE state, the dispatch event triggered by TGC's RPC will be safely ignored (just a warn log). So in practice there's no real impact. Is this acceptable? I haven't come up with a clean way to fully fix the concurrency yet.


Was this PR generated or assisted by AI?
Add unit tests.
Purpose of the pull request
fix #18338
Brief change log
Verify this pull request
Pull Request Notice
Pull Request Notice
If your pull request contains incompatible change, you should also add it to
docs/docs/en/guide/upgrade/incompatible.md