33 Commits

Author SHA1 Message Date
AlanCoding
e59cb07064 Add wording for control message log 2020-02-11 10:01:25 -05:00
Ryan Petrello
38a08d163c get rid of celery/celerybeat
alternative to https://github.com/ansible/awx/pull/2530 which makes use
of https://pypi.org/project/schedule/

this doesn't have support for any persistence (like how celery beat uses
a shelve file), because all of our periodic jobs run at most every few
minutes
2020-02-10 17:32:02 -05:00
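
A minimal sketch of the idea in the commit above: an in-process periodic scheduler that keeps its state only in memory, so nothing survives a restart. This is a standard-library stand-in written for illustration, not the `schedule` package's API and not AWX's code; `PeriodicRunner`, `every`, and `run_pending` are hypothetical names.

```python
import time


class PeriodicRunner:
    """In-memory periodic scheduler (hypothetical); no state is persisted."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        # Each entry is [interval_seconds, next_run_time, callable].
        self._jobs = []

    def every(self, seconds, func):
        """Register func to run once per `seconds` interval."""
        self._jobs.append([seconds, self._clock() + seconds, func])

    def run_pending(self):
        """Run every job whose interval has elapsed; return how many ran."""
        now = self._clock()
        ran = 0
        for job in self._jobs:
            interval, next_run, func = job
            if now >= next_run:
                func()
                job[1] = now + interval  # schedule the next run
                ran += 1
        return ran
```

Because every job runs at most every few minutes, losing the in-memory schedule on restart only delays the next run by at most one interval, which is the trade-off the commit message describes.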
Ryan Petrello
3c31e0ed16 some more minor callback cleanup and development tweaks 2020-01-27 17:18:09 -05:00
Ryan Petrello
78b00652bd add the ability to enable profiling for the callback receiver workers 2020-01-27 12:03:53 -05:00
Ryan Petrello
8f33f1a6c2 remove another expensive logging lookup in the parent callback process 2020-01-24 16:46:32 -05:00
Bill Nottingham
4e46d5d7cd Fix some lint 2020-01-20 17:15:27 -05:00
Ryan Petrello
8bd9233d2c remove some unnecessary callback receiver debugging code 2020-01-14 14:21:53 -05:00
Ryan Petrello
306f504fb7 optimize the callback receiver to buffer writes on high throughput
additionally, optimize away several per-event host lookups and
changed/failed propagation lookups

we've always performed these (fairly expensive) queries *on every event
save* - if you're processing tens of thousands of events in short
bursts, this is way too slow

this commit also introduces a new command for profiling the insertion
rate of events, `awx-manage callback_stats`

see: https://github.com/ansible/awx/issues/5514
2020-01-14 12:04:26 -05:00
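
The buffering approach described above can be sketched roughly as follows. This is an illustration under assumptions, not the callback receiver's actual code: `BufferedEventWriter`, `bulk_save`, and `max_buffer` are hypothetical names, and `bulk_save` stands in for whatever bulk insert the database layer provides.

```python
class BufferedEventWriter:
    """Collect events and write them in batches instead of one save per event."""

    def __init__(self, bulk_save, max_buffer=1000):
        self._bulk_save = bulk_save    # hypothetical callable taking a list of events
        self._max_buffer = max_buffer  # flush once this many events are queued
        self._buffer = []

    def add(self, event):
        """Queue an event; flush automatically when the buffer is full."""
        self._buffer.append(event)
        if len(self._buffer) >= self._max_buffer:
            self.flush()

    def flush(self):
        """Write all queued events in a single call; no-op when empty."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._bulk_save(batch)
```

A caller would also need to flush on a timer or at shutdown so that a partially filled buffer is not held indefinitely during quiet periods.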
AlanCoding
eec08fdcca Log case of duplicate UUIDs 2020-01-09 07:31:32 -05:00
Ryan Petrello
83550eeba0 make the callback receiver more robust to duplicate UUIDs from ansible 2019-11-01 09:24:52 -04:00
Ryan Petrello
3094b67664 work around a bug in the k8s client that leaves trash in /tmp 2019-10-29 11:24:17 -04:00
Ryan Petrello
d01088d33e Revert "add support for awx-manage run_callback_receiver --status" 2019-10-18 09:49:02 -04:00
Ryan Petrello
ffb1707e74 add support for awx-manage run_callback_receiver --status 2019-10-17 11:10:27 -04:00
Buymov Ivan
f2676064fd Fix error with rejoining node to cluster after lost connection to postgres 2019-09-27 01:17:27 -04:00
Ryan Petrello
40b1e89b67 add the ability to disable RabbitMQ queue durability 2019-05-28 15:49:32 -04:00
Ryan Petrello
17a803f49c remove the old callback plugin import paths and callback-specific tests 2019-04-12 16:11:23 -04:00
Ryan Petrello
32ee9838af use the correct logger for the callback receiver
the callback receiver and dispatcher share several modules, so add logic
to use the correct logger
2019-03-15 08:09:47 -04:00
Ryan Petrello
daeeaf413a clean up unnecessary usage of the six library (awx only supports py3) 2019-01-25 00:19:48 -05:00
Ryan Petrello
4707dc2a05 clean up some unnecessary dispatcher reaping code 2019-01-24 11:11:05 -05:00
Ryan Petrello
b2442d42a3 detect dead DB connections in the dispatcher when reaping jobs 2019-01-22 08:40:26 -05:00
Ryan Petrello
f223df303f convert py2 -> py3 2019-01-15 14:09:01 -05:00
Ryan Petrello
5950f26c69 only allow the task dispatch worker to import and run decorated tasks
this _technically_ prevents a remote code exploit where a user who has
access to publish AMQP messages to the dispatch queue could craft
a special message that would import and run arbitrary Python functions;
that said, the types of user with this privilege level are generally
_already_ the awx user (so they can already do this by hand if they
want)
2018-12-12 17:46:41 -05:00
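
One way to picture the restriction described above is an allowlist populated by a decorator, so a message can only name functions that were explicitly registered. This is a hedged sketch, not the `awx.main.dispatch` implementation; `task`, `task_name`, `run_message`, and the message layout are all assumptions made for illustration.

```python
_REGISTRY = {}  # hypothetical allowlist: dotted name -> decorated callable


def task_name(func):
    """Dotted name under which a function is registered."""
    return f"{func.__module__}.{func.__qualname__}"


def task(func):
    """Decorator that marks a function as runnable by the worker."""
    _REGISTRY[task_name(func)] = func
    return func


def run_message(message):
    """Run the task named in a message, refusing anything undecorated."""
    name = message["task"]
    func = _REGISTRY.get(name)
    if func is None:
        # Never import or call arbitrary dotted paths taken from a message.
        raise ValueError(f"{name} is not a registered task")
    return func(*message.get("args", []), **message.get("kwargs", {}))
```

With this shape, a crafted message naming something like `os.system` is rejected before any import or call happens.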
Ryan Petrello
0391dbc292 add additional DB retry logic to the callback receiver
initially, I implemented this for _only_ the task worker, but it's
probably needed for callback event workers, too
2018-11-29 11:57:46 -05:00
Ryan Petrello
38bf174bda don't reap jobs that aren't running
this is a simple sanity check, but it should help us avoid shooting
ourselves in the foot in complicated scenarios, such as:

1.  A dispatcher worker is running a job, and it's killed with `kill -9`
2.  The dispatcher attempts to reap jobs with a matching celery_task_id
3.  The associated sync project update has the *same* celery_task_id
    (an implementation detail of how we implemented that), and it ends
    up getting reaped _even though_ it's already finished and has
    status=successful
2018-11-28 18:11:12 -05:00
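
The sanity check described above could look something like the sketch below. The `Job` class, the `reap` function, and the choice to mark reaped jobs as `failed` are assumptions for illustration; only the `celery_task_id` field and the `running`/`successful` status values come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class Job:
    celery_task_id: str
    status: str


def reap(jobs, task_id):
    """Mark matching jobs as failed, skipping any that are not running."""
    reaped = []
    for job in jobs:
        if job.celery_task_id != task_id:
            continue
        if job.status != "running":
            # e.g. a finished project update that shares the same task id
            continue
        job.status = "failed"
        reaped.append(job)
    return reaped
```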
Matthew Jones
7330102961 Remove a warning message for dispatcher pool for tests 2018-11-19 11:19:57 -05:00
Ryan Petrello
37234ca66e prevent the dispatcher from using a nonsensical max_workers value 2018-11-16 10:16:39 -05:00
AlanCoding
482395eb6a reduce default verbosity of devel-specific callback logging 2018-10-26 10:03:46 -04:00
Ryan Petrello
3be9113d6b fix a bug that breaks job cancel on single node jobs
1.  Install awx w/ a single node.
2.  Start a long-running job.
3.  Forcibly kill the `awx-manage run_dispatcher` process (e.g.,
    SIGKILL) and do not start it again.
4.  The job remains in the running state; without a second cluster node
    to discover the job, it is never reaped.
5.  This PR allows you to cancel the job from the UI+API.
2018-10-19 09:10:33 -04:00
Ryan Petrello
0d29bbfdc6 make the dispatcher more fault-tolerant to prolonged database outages 2018-10-18 20:00:07 -04:00
Ryan Petrello
53ae05094e use the proper logger for the callback receiver 2018-10-17 10:56:29 -04:00
Ryan Petrello
720a634702 don't attempt to recover special QUIT messages in the worker pool
when `--reload` is sent to the dispatcher, it sends a special QUIT
message to each worker in the pool so that it will exit gracefully at
the next opportunity

when a worker process exits unexpectedly, the dispatcher attempts to
recover its queued messages and sends them to another worker in the
pool; in this scenario, we should _never_ re-enqueue these special
QUIT messages (because the process doesn't need to quit, it's already
gone)

To reproduce this race condition:

1.  Launch an adhoc that does `sleep 60`
2.  Run `awx-manage run_dispatcher --reload` to enqueue a `QUIT` message
    into the worker's queue
3.  Find the pid of the worker running the `sleep 60` and `SIGKILL` it.
4.  Observe that the dispatcher attempts to requeue the `QUIT` message and
    logs a confusing error.
2018-10-15 12:17:52 -04:00
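
The recovery rule described above can be sketched as a filter applied when a dead worker's queue is drained. This is an illustrative sketch only: the `QUIT` sentinel value and the `recover_messages` helper are hypothetical, and messages are modeled as plain Python values.

```python
QUIT = "QUIT"  # hypothetical sentinel asking a worker to exit gracefully


def recover_messages(dead_worker_queue):
    """Return the messages worth re-enqueueing on another worker.

    QUIT sentinels are dropped: the process they targeted is already gone,
    so forwarding them would only make a healthy worker exit.
    """
    return [message for message in dead_worker_queue if message != QUIT]
```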
Ryan Petrello
ff1e8cc356 replace celery task decorators with a kombu-based publisher
this commit implements the bulk of `awx-manage run_dispatcher`, a new
command that binds to RabbitMQ via kombu and balances messages across
a pool of workers that are similar to celeryd workers in spirit.
Specifically, this includes:

- a new decorator, `awx.main.dispatch.task`, which can be used to
  decorate functions or classes so that they can be designated as
  "Tasks"
- support for fanout/broadcast tasks (at this point in time, only
  `conf.Setting` memcached flushes use this functionality)
- support for job reaping
- support for success/failure hooks for job runs (i.e.,
  `handle_work_success` and `handle_work_error`)
- support for an auto-scaling worker pool that scales processes up and
  down on demand
- minimal support for RPC, such as status checks and pool recycle/reload
2018-10-11 10:53:30 -04:00
Ryan Petrello
da74f1d01f refactor and test the callback receiver as a base for a task dispatcher 2018-10-11 10:53:26 -04:00