- this change adds rsyslog (https://github.com/rsyslog/rsyslog) as
a new service that runs on every AWX node (managed by supervisord);
in particular, this feature requires a recent version (v8.38+) of
rsyslog that supports the omhttp module
(https://github.com/rsyslog/rsyslog-doc/pull/750)
- the "external_logger" handler in AWX is now a SysLogHandler that ships
logs to the local UDP port where rsyslog is configured to listen (by
default, 51414)
- every time a LOG_AGGREGATOR_* setting is changed, every AWX node
reconfigures and restarts its local instance of rsyslog so that its
forwarding settings match what has been configured in AWX
- unlike the prior implementation, if the external logging aggregator
(splunk/logstash) goes temporarily offline, rsyslog will retain the
messages and ship them when the log aggregator is back online
- 4xx- or 5xx-level errors are recorded in /var/log/tower/external.err
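A minimal sketch of the SysLogHandler wiring mentioned above; the logger
name is illustrative, while SysLogHandler and its UDP default come from
the python standard library:

    import logging
    from logging.handlers import SysLogHandler

    # ship log records to the local rsyslog UDP listener (51414 is the
    # default port mentioned above); SysLogHandler uses UDP unless told
    # otherwise
    handler = SysLogHandler(address=('localhost', 51414))
    handler.setFormatter(logging.Formatter('%(message)s'))

    logger = logging.getLogger('awx')  # illustrative logger name
    logger.addHandler(handler)
    logger.error('relayed to rsyslog, then forwarded via omhttp')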
Currently, users are allowed to define virtual environments in
`settings.BASE_VENV_PATH` only, because that's the only place
Tower looks for virtual environments. This feature allows users
to define custom directory paths, via the API or UI, in which to
look for virtual environments. Tower aggregates virtual environments
from all of these paths, except environments with the special name `awx`.
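A rough sketch of the aggregation described above; the `CUSTOM_VENV_PATHS`
setting name and the helper are illustrative, not necessarily the shipped
implementation:

    import os

    from django.conf import settings

    def get_custom_venv_choices():
        # look in BASE_VENV_PATH plus any user-defined directories
        paths = [settings.BASE_VENV_PATH]
        paths += list(getattr(settings, 'CUSTOM_VENV_PATHS', []))
        venvs = []
        for base in paths:
            if not os.path.isdir(base):
                continue
            for name in os.listdir(base):
                if name == 'awx':
                    continue  # the special `awx` environment is excluded
                if os.path.exists(os.path.join(base, name, 'bin', 'activate')):
                    venvs.append(os.path.join(base, name))
        return venvs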
Signed-off-by: Vismay Golwala <vgolwala@redhat.com>
under the hood, Host.ansible_facts is a postgres jsonb field which
performs match operations using the JSON containment operator (@>)
this operator _only_ matches on exact containment (i.e.,
"does the `ansible_distribution` jsonb value contain _this exact_ JSON
structure"):
SELECT ...
FROM main_host
WHERE ansible_facts @> '{"ansible_distribution": "centos"}'
SELECT ...
FROM main_host
WHERE ansible_facts @> '{"packages": {"dnsmasq": [{"version": 2}]}}'
postgres does _not_ expose any operator for fuzzy or lookup-based
matching within jsonb containment, so host filter values like these
don't really make sense (postgres can't _filter_ in the way intended in
these examples):
ansible_distribution__startswith="Cent"
ansible_distribution__icontains="CentOS"
ansible_facts__packages__dnsmasq[]__version__startswith="2"
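For contrast, exact containment *is* expressible; a minimal sketch using
Django's postgres JSONField `contains` lookup, which compiles down to the
@> operator shown above:

    from awx.main.models import Host

    # matches hosts whose ansible_facts jsonb contains this exact
    # structure; fuzzy lookups like __startswith cannot be pushed down
    # to @>
    Host.objects.filter(ansible_facts__contains={'ansible_distribution': 'CentOS'})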
this commit implements the bulk of `awx-manage run_dispatcher`, a new
command that binds to RabbitMQ via kombu and balances messages across
a pool of workers that are similar to celeryd workers in spirit.
Specifically, this includes:
- a new decorator, `awx.main.dispatch.task`, which can be used to
decorate functions or classes so that they can be designated as
"Tasks" (see the sketch after this list)
- support for fanout/broadcast tasks (at this point in time, only
`conf.Setting` memcached flushes use this functionality)
- support for job reaping
- support for success/failure hooks for job runs (i.e.,
`handle_work_success` and `handle_work_error`)
- support for an auto-scaling worker pool that scales processes up and
down on demand
- minimal support for RPC, such as status checks and pool recycle/reload
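A sketch of how the decorator might be applied; the function body and the
publish call are illustrative, only the `awx.main.dispatch.task` name
comes from this change:

    from awx.main.dispatch import task

    @task()
    def add(a, b):
        # runs in a dispatcher pool worker, not in the calling process
        return a + b

    # illustrative: enqueue a message for the dispatcher to consume
    add.apply_async(args=[2, 2])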
If database connectivity is lost/interrupted in this block of celery
internals, beat is *not* smart enough to recover, and it gets stuck in
an endless fail loop. We don't _need_ to talk to the database here
anyway; just use settings.CLUSTER_HOST_ID to get what we need.
see: https://github.com/ansible/tower/issues/2957
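A minimal sketch of the substitution, assuming the only thing beat needed
from the database was the local node's hostname:

    from django.conf import settings

    # previously something like Instance.objects.me().hostname, which
    # needs a live database connection inside celery beat internals
    local_hostname = settings.CLUSTER_HOST_ID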
we recently made a change so that instances no longer bind to
instance-group-specific queues; instead, each binds to a direct queue
for its specific hostname
(https://github.com/ansible/tower/pull/1922)
Because of this, we shouldn't *need* to reconfigure the queue binds at
runtime anymore when group membership changes. Under our new model,
every celeryd listens on a queue named after its hostname; when the
scheduler finds a task to run, it picks an Instance in the target
Instance Group and sends the task to the queue for that Instance's
hostname.
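Under this model, submitting a task amounts to publishing to the chosen
Instance's hostname queue; a rough kombu sketch (the broker URL, exchange
name, and payload are all illustrative):

    from kombu import Connection, Exchange, Queue

    hostname = 'awx-node-1'  # the Instance picked by the scheduler
    exchange = Exchange('awx_direct', type='direct')
    queue = Queue(hostname, exchange, routing_key=hostname)

    with Connection('amqp://guest:guest@localhost//') as conn:
        producer = conn.Producer()
        producer.publish(
            {'task': 'run_job', 'job_id': 42},
            exchange=exchange,
            routing_key=hostname,
            declare=[queue],
        )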
* We now target Instances in the task manager when transitioning jobs
from pending to waiting, whereas before we submitted jobs to Instance
Groups to be picked up by Instances in those Instance Groups.
Subscribing Instances to their Instance Groups is no longer needed; this
change removes the Instance Group queue subscription.
* Each time a route is needed (i.e., when a task is submitted to
celery), the router is queried; this is ideal (see the router sketch
after this list). With the previous method we had to consider how a
change in the routes would propagate to all celery workers and nodes.
* fully describe the default awx queue
* Our dynamic queue registration would correct awx_private_queue.
However, we don't want celery to even create an "invalid"/extra
queue-exchange-route. This change makes sure we don't create extraneous
things in rabbitmq.
* reduce the cluster queue registration output. Only output when the
queue registration list changes.
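With celery 4, this per-submission routing can be expressed as a router
callable registered via task_routes; a minimal sketch (the
instance-selection helper is hypothetical):

    # consulted on every apply_async; because routing happens at submit
    # time, route changes take effect immediately, with no propagation
    # step to workers or nodes
    def route_task(name, args, kwargs, options, task=None, **kw):
        hostname = pick_target_instance(name, args, kwargs)  # hypothetical
        return {'queue': hostname}

    # app.conf.task_routes = (route_task,)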
refactor existing handlers into the related
"real" handler classes, which are swapped
out dynamically by an external logger "proxy" handler class
the real handler swapout is only done on a setting change
remove restart_local_services method
get rid of uWSGI fifo file
change TCP/UDP return type contract so that it mirrors
the request futures object
add details to socket error messages
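One way the proxy pattern described here can work; a sketch in which the
proxy delegates emit() to a real handler that is rebuilt only when a
LOG_AGGREGATOR_* setting changes (class and factory names are
illustrative):

    import logging

    class ExternalLoggerProxyHandler(logging.Handler):
        """Delegates to a 'real' handler built from current settings."""

        def __init__(self):
            super(ExternalLoggerProxyHandler, self).__init__()
            self._handler = None

        def rebuild(self, settings):
            # called on setting change only, not on every emit
            self._handler = build_real_handler(settings)  # illustrative factory

        def emit(self, record):
            if self._handler is not None:
                self._handler.emit(record)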
* Kombu's default Broadcast() constructor requires only one
parameter, which defines the exchange name; the queue name
is randomly generated per node.
* This caused problems if/when celery entered an infinite restart
loop: too many rabbit queues got created and rabbit OOM'd
(gracefully).
* To remedy this we tell Broadcast the queue name to use, which is
derived from some constant + the node name so that the per-node queue
name is stable.
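A sketch of the stable per-node naming; the prefix is illustrative, while
Broadcast's queue parameter is part of kombu's API:

    from socket import gethostname

    from kombu.common import Broadcast

    # constant prefix + node name => the same queue is reused across
    # restarts instead of a fresh randomly-named queue each time
    queue_name = 'tower_broadcast_all_{}'.format(gethostname())
    broadcast = Broadcast('tower_broadcast_all', queue=queue_name)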
* Before, we had a special group, tower, that ran any async work that
tower needed done. This allowed users fine-grained control over which
nodes did background work. However, this granularity was too complicated
for users. So now, all tower system work goes to a special celery queue
that is not exposed to users. Tower remains the fallback instance group
to execute jobs on. The tower group will be created upon install and
protected from deletion.
In verbose unified job models (inventory updates, system jobs,
etc.), do not delay dispatch just because the encoded
event data is not part of the data written to the buffer.
This allows output from these commands to be submitted
to the callback queue as it is produced, instead
of waiting until the buffer is closed.