* Celery workers have internal queue names derived from the system
hostname, which may differ from the name Tower knows the host by
(Instance.hostname). This adds a mapping so we can convert internal
celery names to Instance names for the purposes of reaping jobs.
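A minimal sketch of that mapping, with made-up helper and field names
(build_celery_to_instance_map and active_queues are illustrative, not
the actual implementation):

    # Hypothetical sketch: translate celery's internal worker names
    # (derived from the system hostname) back to Instance.hostname
    # values so the reaper can match running jobs to known instances.
    def build_celery_to_instance_map(instances, active_queues):
        """instances: iterable of Instance-like objects with .hostname
        active_queues: dict of {celery_worker_name: [queue names]}"""
        mapping = {}
        for instance in instances:
            for worker_name in active_queues:
                # A worker named celery@my-host.example.org maps to the
                # Instance whose hostname matches the suffix.
                if worker_name.split('@', 1)[-1] == instance.hostname:
                    mapping[worker_name] = instance.hostname
        return mapping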
* Randomly choose an instance in the controller instance group to
control the isolated node run, and record the chosen instance in the
job's controller_node field.
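A rough sketch of that selection, assuming Django-style model objects;
assign_controller_node and the instances relation are illustrative
names, not the actual code:

    import random

    # Hypothetical sketch: pick one instance from the controller
    # instance group at submission time and record it on the job so the
    # isolated run can be traced back to its controller.
    def assign_controller_node(job, controller_instance_group):
        instances = list(controller_instance_group.instances)
        if not instances:
            return None
        instance = random.choice(instances)
        job.controller_node = instance.hostname
        job.save(update_fields=['controller_node'])
        return instance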
* Deciding the Instance that a Job runs on at celery task run-time makes
it hard to distribute tasks evenly among Instances. Instead, the task
manager will look at the set of running jobs and choose an instance to
run on, applying a deterministic job distribution algorithm (see the
sketch below).
This also relaxes some of the task manager rules on Instance Groups
down the full stack, so that workflow jobs tend to shortcut the
processing or omit it altogether.
This lets the workflow job spawning logic live outside of the instance
group queues, which it never needed to participate in to begin with.
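One possible shape of that distribution step, as a sketch; the
least-loaded-first rule and the task_impact/capacity fields are
assumptions for illustration, not necessarily the algorithm used here:

    # Hypothetical sketch of choosing an execution node in the task
    # manager rather than at celery run-time: look at the current set
    # of running jobs and pick the instance with the most remaining
    # capacity, breaking ties by hostname so the choice is
    # deterministic.
    def choose_execution_node(instances, running_jobs):
        def remaining_capacity(instance):
            used = sum(j.task_impact for j in running_jobs
                       if j.execution_node == instance.hostname)
            return instance.capacity - used

        candidates = sorted(
            instances,
            key=lambda i: (-remaining_capacity(i), i.hostname))
        return candidates[0] if candidates else None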
Consolidate prompts accept/reject logic in unified models
Break out accept/reject logic for variables
Surface new promptable fields on WFJT nodes, schedules
Make schedules and workflows accurately reject variables that are not
allowed by the prompting rules or the survey rules on the template
(see the sketch after this list)
Validate against unallowed extra_data in system job schedules
Prevent schedule or WFJT node POST/PATCH with unprompted data
Move system job days validation to new mechanism
Add new pseudo-field for WFJT node credential
Add validation for node related credentials
Add related config model to unified job
Use JobLaunchConfig model for launch RBAC check
Support credential overwrite behavior with multi-creds
Change modern manual launch to use merge behavior
Refactor JobLaunchSerializer, self.instance=None
Modularize job launch view to create "modern" data
Auto-create config object with every job
Add create schedule endpoint for jobs
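A sketch of what the consolidated accept/reject logic for variables
could look like; ask_variables_on_launch and survey_variable_names are
assumed names used only for illustration:

    # Hypothetical sketch of the unified accept/reject logic for
    # variables: keep only the keys the template's prompting or survey
    # rules allow, and report everything else as rejected so a schedule
    # or WFJT node POST/PATCH with unprompted data can be refused.
    def accept_or_reject_variables(template, extra_data):
        if getattr(template, 'ask_variables_on_launch', False):
            return dict(extra_data), {}      # everything is accepted
        allowed = set(getattr(template, 'survey_variable_names', []))
        accepted = {k: v for k, v in extra_data.items() if k in allowed}
        rejected = {k: v for k, v in extra_data.items()
                    if k not in allowed}
        return accepted, rejected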
This keeps the instance from re-using a connection pool that might have
already expired and become unusable for the inspector that needs to run
as part of the task manager.
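A sketch of the shape of the fix, assuming the pool in question is the
celery broker connection pool (which may not match the actual change);
the broker URL and retry count are placeholders:

    from celery import Celery

    app = Celery(broker='amqp://guest@localhost//')  # placeholder URL

    # Hypothetical sketch: open a fresh broker connection for each
    # inspector run instead of re-using a long-lived pool that may have
    # gone stale.
    def inspect_active(timeout=5):
        with app.connection_for_read() as conn:
            conn.ensure_connection(max_retries=3)
            inspector = app.control.inspect(connection=conn,
                                            timeout=timeout)
            return inspector.active() or {}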
Relates #264.
This PR proposes and implements a way of defining the workflow failure
state:
A workflow job fails if any of the conditions below is satisfied (see
the sketch below).
* At least one node ends in the `canceled` or `error` state.
* At least one leaf node ends in the `failed` state, but no child node
is spawned to run (no error handler).
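A minimal sketch of evaluating those rules over a workflow's nodes; the
.status and .children attributes are assumptions for illustration:

    # Hypothetical sketch of the failure rules described above.
    def workflow_job_failed(nodes):
        """nodes: iterable of objects with .status and .children (the
        nodes spawned from this one, e.g. via failure/always links)."""
        for node in nodes:
            # Rule 1: any canceled or errored node fails the workflow.
            if node.status in ('canceled', 'error'):
                return True
            # Rule 2: a failed leaf node with no spawned children means
            # there was no error handler, so the workflow fails.
            if node.status == 'failed' and not node.children:
                return True
        return False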
Signed-off-by: Aaron Tan <jangsutsr@gmail.com>
- prevent dual-entry for the first item in running jobs caused by the
setdefault call (see the sketch below)
- fix an issue where queues (celery tasks) only returned the last item
of the inspector output due to a looping problem; this caused reaper
bugs in production
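An illustration of the setdefault pitfall behind the first fix (names
are made up):

    # Buggy pattern: passing ['job-1'] as the default AND appending
    # afterwards double-counts the first job for a hostname.
    running_jobs = {}
    running_jobs.setdefault('node1', ['job-1']).append('job-1')
    print(running_jobs)  # {'node1': ['job-1', 'job-1']}  <- dual entry

    # Fixed pattern: default to an empty list, append exactly once.
    running_jobs = {}
    running_jobs.setdefault('node1', []).append('job-1')
    print(running_jobs)  # {'node1': ['job-1']}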