[rsyslog] ElasticSearch Bulk Indexing (was: Load balancing for rsyslog aggregators)
vladg at illinois.edu
Wed Feb 8 16:03:23 CET 2012
After the recent discussion of rsyslog sending logs to ElasticSearch using the bulk indexing API, I did some playing around with the current plugin. First, let me just say that I really appreciate the work that Nathan did on the omelasticsearch plugin, and that it will work fine in many use cases. However, there are a few fundamental limitations with the current omelasticsearch/rsyslog integration:
- omelasticsearch uses curl to make the API calls to ES. The downside of this is that you have to specify a hostname. ES supports auto-discovery of cluster members, as well as fail-over. If the host omelasticsearch is using goes down, the cluster may still be fully functional, but omelasticsearch won't be able to find it. Of course, you could go in and add other cluster members as failover actions, but this would mean a config change every time you change your ES topology.
- curl has a default of only returning 16KB of the HTTP response. This response reports which messages were successfully inserted into ES, and which failed. For a large batch of messages, one could easily get a response over the 16KB limit. Getting around this would require running a custom-compiled version of curl that raises the limit.
- "Pushing" to ES seems to work much less reliably than having ES "pull" messages. For similarly small batches (~250 messages), ES would often take 6-8ms for the bulk insert. However, it would occasionally spike up to 6000ms, which would cause quite a backlog in the queue. Having ES "pull" messages instead (more on this later) seemed to work much more consistently.
- Finally, I'm a bit confused about how rsyslog receives commit errors with the new transactional plugin system. If there's a batch of 5 messages, and only message 4 is successfully committed during endTransaction, how would one convey that information back to rsyslog? I know Radu mentioned calling a program with omprog, and sending messages to ES from there, but in my setup, data integrity is paramount, and I don't want to re-implement rsyslog's reliable message delivery and failover systems.
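To make the partial-failure case concrete, here is a minimal sketch (in Python, not part of the plugin) of matching the items in a bulk response back to the messages of the batch. It assumes the response body is JSON with an "items" array in the same order as the submitted actions, and that a failed item carries an "error" key. The function name and the sample response are made up for illustration, and the exact response fields may differ between ES versions.

```python
import json


def failed_items(bulk_response_body, batch):
    """Return (position, message, error) for every item ES reported as failed.

    Assumes "items" is ordered like the submitted batch and that a failed
    item contains an "error" key (an assumption about the response format).
    """
    response = json.loads(bulk_response_body)
    failures = []
    for position, item in enumerate(response.get("items", [])):
        # Each item is a one-key object keyed by the action, e.g. {"index": {...}}.
        result = next(iter(item.values()))
        if "error" in result:
            failures.append((position, batch[position], result["error"]))
    return failures


# Hypothetical batch of 5 messages where only message 4 was committed.
batch = ["msg1", "msg2", "msg3", "msg4", "msg5"]
sample_response = json.dumps({
    "took": 7,
    "items": [
        {"index": {"_index": "syslog", "error": "rejected"}},
        {"index": {"_index": "syslog", "error": "rejected"}},
        {"index": {"_index": "syslog", "error": "rejected"}},
        {"index": {"_index": "syslog", "ok": True}},
        {"index": {"_index": "syslog", "error": "rejected"}},
    ],
})

for position, message, error in failed_items(sample_response, batch):
    print(position, message, error)
```

The positions returned are what an output plugin would need in order to tell the queue which messages to retry; how that is reported back to rsyslog is exactly the open question above.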
The method that I'm currently stress-testing is using the ElasticSearch River with a RabbitMQ type. With this setup, rsyslog sends messages to a RabbitMQ queue. ElasticSearch is configured with the queue's information, and then it periodically pulls messages from that queue. Once it has the messages, it proceeds to bulk index them. If the master ES node goes down, the new master starts pulling messages from the queue. Overall, it seems to work well, and the indexing throughput seems higher, due to not pushing messages to ES when it's very busy.
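As a rough sketch of what ends up in the queue: as far as I understand the river's README, each RabbitMQ message body is expected to be in the bulk API format, that is, one action line followed by one source line per document, each terminated by a newline. The Python below only builds such a body. The index name, type name, and log fields are hypothetical, and actually publishing to RabbitMQ is left out.

```python
import json


def to_bulk_payload(messages, index="syslog", doc_type="events"):
    """Serialize log records into a newline-delimited bulk body.

    `index` and `doc_type` are placeholder names, not taken from any real setup.
    """
    lines = []
    for record in messages:
        # Action line: tells ES to index the document that follows.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        # Source line: the log record itself.
        lines.append(json.dumps(record))
    # The bulk format requires a newline after the last line as well.
    return "\n".join(lines) + "\n"


# Hypothetical log records.
records = [
    {"host": "web1", "severity": "info", "message": "service started"},
    {"host": "web2", "severity": "err", "message": "disk full"},
]
print(to_bulk_payload(records), end="")
```

Because ES drains the queue at its own pace, the producer side only has to hand off bodies like this one and does not need to wait on the indexing latency.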
Unfortunately, I can't find any rsyslog plugin for RabbitMQ, so I'm currently bouncing my messages through a logstash server. Does anyone know of such a plugin? I suspect the zeromq plugins might be a good starting point, but I'm not sure how much would have to be rewritten to send to RabbitMQ instead.
Those were my experiences - I hope some of that proves useful to others looking into ElasticSearch.
Vlad Grigorescu | IT Security Engineer
University of Illinois at Urbana-Champaign
Office of Privacy and Information Assurance
 - <http://www.elasticsearch.org/guide/reference/river/>
 - <https://github.com/elasticsearch/elasticsearch-river-rabbitmq>
 - <http://logstash.net/docs/1.1.0/outputs/elasticsearch_river>