Like many developers, I've spent countless hours staring at a terminal, using tail -f to monitor log files as events happen. It's a method that works well for quick debugging, but it quickly falls apart when you need to answer more complex questions like "How many times did this specific error occur over the last week?" The limitations became clear: I needed a centralized, searchable, and visual way to understand my application's behavior.
While a solution like Amazon's OpenSearch Service (a managed ELK stack) seemed like an easy answer, its costs are significant and its pricing is geared toward larger enterprises. For a small business like seriesreminder.com, a solution I could run on a spare server at my house offered a far more economical and educational path. My goal was to build a system that was powerful, cost-effective, and, most importantly, completely under my control.
However, a key constraint complicated the architecture: I did not want to expose any ports on my local network to the public internet. This specific requirement forced me to make a less-than-optimal design choice that I'll detail later in this post. Acknowledging trade-offs is a crucial part of the engineering process, and a self-imposed limitation often leads to the most creative problem-solving.
This article will walk you through my journey of building a custom ELK stack from the ground up, detailing the architectural decisions, the problems I solved, and the technical skills I gained along the way. While the technical details are important, the primary goal of this article is to explain the thought process behind each decision, the 'why' behind the 'what'; plenty of articles on the internet already cover the individual pieces of this process in detail.
The standard ELK stack architecture typically involves a simple data flow: Beats (like Filebeat) or Logstash on a server pushes data to a central Elasticsearch instance, which is then visualized by Kibana. However, my core requirement—not exposing any ports on my home network—meant I couldn't simply set up a Logstash instance at home and expect my cloud-based application server to send data to it.
An additional constraint for this project, one that actually made things easier, was that I did not need real-time, minute-by-minute logging. The ability to analyze trends and troubleshoot issues with a delay of an hour or so was perfectly acceptable. This freed me from having to implement a complex real-time log streaming solution and opened up a much simpler, more cost-effective path.
Series Reminder is a Spring Boot application that runs on AWS Elastic Beanstalk and uses SLF4J and Logback to write logs to a file on the filesystem. This is a standard and reliable logging practice.
Elastic Beanstalk's S3 Integration: To get these logs off the server, I leveraged a built-in feature of AWS Elastic Beanstalk. In the environment configuration, I enabled the option to "Save log files to S3." This simple setting automates a process that copies the application logs to a designated S3 bucket once every hour. However, it's worth noting that Elastic Beanstalk renames these files with a non-human-readable, hashed filename, making it difficult to work with them directly in their raw form.
Event-Driven Notifications: Instead of constantly checking the S3 bucket for new files, I configured an S3 event notification. When a new log file is uploaded, S3 automatically sends a message to an Amazon SQS queue. This is a far more efficient and scalable solution than a polling-based approach, as it's triggered only when there's new data.
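For reference, here's roughly what wiring that up looks like with the AWS SDK for Java v2. The bucket name and queue ARN below are placeholders rather than my real resources, and in practice this is a one-time setting you can just as easily apply from the S3 console.

```java
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.Event;
import software.amazon.awssdk.services.s3.model.NotificationConfiguration;
import software.amazon.awssdk.services.s3.model.PutBucketNotificationConfigurationRequest;
import software.amazon.awssdk.services.s3.model.QueueConfiguration;

public class EnableS3EventNotifications {
    public static void main(String[] args) {
        try (S3Client s3 = S3Client.create()) {
            // Send a message to the SQS queue every time a new object lands in the log bucket.
            QueueConfiguration toSqs = QueueConfiguration.builder()
                    .queueArn("arn:aws:sqs:us-east-1:123456789012:app-log-events") // placeholder ARN
                    .events(Event.S3_OBJECT_CREATED)
                    .build();

            s3.putBucketNotificationConfiguration(PutBucketNotificationConfigurationRequest.builder()
                    .bucket("elasticbeanstalk-logs-placeholder")                   // placeholder bucket
                    .notificationConfiguration(NotificationConfiguration.builder()
                            .queueConfigurations(toSqs)
                            .build())
                    .build());
        }
    }
}
```

One gotcha: the SQS queue's access policy has to allow the S3 service to send messages to it, otherwise the notification configuration will be rejected when S3 tries to validate the destination.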
The most significant architectural challenge was securely getting the logs from AWS to my local server without opening any inbound ports. My solution was to create a custom, pull-based system.
Custom Log Downloader Application: I wrote a simple Java application to handle the data transfer. It connects to the SQS queue and reads any messages waiting there; I set the queue to retain messages for the maximum of 14 days, just in case. For each message, it extracts the S3 object key, uses the AWS SDK to download the log file from the S3 bucket, decompresses it, and appends it to the end of a local log file, essentially recreating the original log file on my server's filesystem.
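To make that concrete, here's a stripped-down sketch of the core loop using the AWS SDK for Java v2. The queue URL, bucket, and file path are placeholders, and the sketch cuts a corner by pulling the object key out of the event JSON with a regex; real code should parse the JSON properly and handle errors more carefully.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

/** Drains the SQS queue and appends each referenced S3 log object to a local log file. */
public class LogDownloader {

    // Placeholders, not my real resource names.
    private static final String QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/app-log-events";
    private static final String BUCKET = "elasticbeanstalk-logs-placeholder";
    private static final Path TARGET = Path.of("/data/logs/seriesreminder/application.log");

    // Naive extraction of the object key from the S3 event JSON; use a JSON parser in real code.
    private static final Pattern KEY_PATTERN = Pattern.compile("\"key\"\\s*:\\s*\"([^\"]+)\"");

    public static void main(String[] args) throws IOException {
        try (SqsClient sqs = SqsClient.create(); S3Client s3 = S3Client.create()) {
            while (true) {
                // Pull up to 10 notifications at a time; stop once the queue is drained.
                List<Message> batch = sqs.receiveMessage(ReceiveMessageRequest.builder()
                        .queueUrl(QUEUE_URL)
                        .maxNumberOfMessages(10)
                        .waitTimeSeconds(5)
                        .build()).messages();
                if (batch.isEmpty()) {
                    break;
                }
                for (Message message : batch) {
                    Matcher m = KEY_PATTERN.matcher(message.body());
                    if (m.find()) {
                        // S3 URL-encodes object keys in event notifications.
                        String key = URLDecoder.decode(m.group(1), StandardCharsets.UTF_8);
                        appendObject(s3, key);
                    }
                    // Only delete the message once the file has been appended.
                    sqs.deleteMessage(DeleteMessageRequest.builder()
                            .queueUrl(QUEUE_URL)
                            .receiptHandle(message.receiptHandle())
                            .build());
                }
            }
        }
    }

    private static void appendObject(S3Client s3, String key) throws IOException {
        GetObjectRequest request = GetObjectRequest.builder().bucket(BUCKET).key(key).build();
        // The rotated logs arrive gzipped, so decompress while streaming straight into the local file.
        try (InputStream in = new GZIPInputStream(s3.getObject(request));
             var out = Files.newOutputStream(TARGET, StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            in.transferTo(out);
        }
    }
}
```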
Jenkins for Automation: To automate this process, I leveraged Jenkins, which runs in a Docker container on my home server. I created a scheduled Jenkins job that runs my custom log downloader application periodically, fetching both the application logs and the Nginx access and error logs so they are consistently pulled from AWS and made available for processing. It worked out nicely: the job pulls the latest version of the code from Bitbucket, builds it, and runs it automatically.
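For illustration, a scheduled pipeline along these lines does the trick; the repository URL, schedule, and build commands here are made up rather than copied from my actual job.

```groovy
pipeline {
    agent any
    triggers {
        // Run roughly once an hour; Jenkins picks the exact minute.
        cron('H * * * *')
    }
    stages {
        stage('Checkout') {
            steps {
                // Placeholder repository; the real one lives in my Bitbucket account.
                git url: 'https://bitbucket.org/example/log-downloader.git', branch: 'main'
            }
        }
        stage('Build') {
            steps {
                sh './mvnw -q -DskipTests package'
            }
        }
        stage('Run') {
            steps {
                sh 'java -jar target/log-downloader.jar'
            }
        }
    }
}
```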
Containerized ELK Stack: To manage the ELK stack itself, I'm running Logstash, Elasticsearch, and Kibana all within a single Docker container. This provides a clean, portable, and reproducible environment for the entire stack.
The Docker Benefit: This decision was crucial for simplifying setup and maintenance. Manually installing each component of the ELK stack on a server is a time-consuming and error-prone process involving many command-line steps. With Docker, all of the configuration files and environment variables are defined in a docker-compose.yml file, which means the entire system is disposable: I can tear it down and rebuild it with a single command. Furthermore, these configuration files are checked into version control, making it easy to track changes, revert to previous versions, and ensure that the setup is reproducible.
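As a rough idea of what that file looks like, here's a cut-down sketch rather than my exact compose file. The image tag, ports, and volume paths are illustrative; the sebp/elk image is one well-known way to run all three services inside a single container.

```yaml
services:
  elk:
    image: sebp/elk:latest          # pin a specific version in practice
    ports:
      - "5601:5601"                 # Kibana
      - "9200:9200"                 # Elasticsearch HTTP API
      - "5044:5044"                 # Beats input (Filebeat -> Logstash)
    volumes:
      - ./logstash/conf.d:/etc/logstash/conf.d   # custom pipelines, e.g. the grok filters below
      - elk-data:/var/lib/elasticsearch          # keep indexed data across rebuilds
    restart: unless-stopped

volumes:
  elk-data:
```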
With the logs now on the local server, I could turn my attention to processing them. This is where I made a key distinction between my application logs and the Nginx logs.
Not all logs are created equal, and a one-size-fits-all approach to processing them can be inefficient. While both Logstash and Filebeat can handle log parsing, they excel in different areas. I chose the best tool for each specific job.
Ingesting Application Logs with Logstash
For the Spring Boot application logs from seriesreminder.com, I needed some custom parsing logic, which meant writing a bespoke grok filter. This is where Logstash shines: its rich set of filters gives me complete control over how the log data is structured and enriched before it's sent to Elasticsearch.
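To give a flavour of what that looks like, here's a trimmed-down pipeline. The paths, index name, and grok pattern are illustrative, not my exact filter; the pattern has to match whatever Logback layout your application actually uses, so treat it as a starting point.

```
input {
  file {
    path => "/data/logs/seriesreminder/application.log"   # placeholder path
    start_position => "beginning"
    sincedb_path => "/var/lib/logstash/sincedb-app"
  }
}

filter {
  grok {
    # Assumes a fairly typical Logback layout: timestamp, level, [thread], logger - message
    match => {
      "message" => "%{TIMESTAMP_ISO8601:timestamp}\s+%{LOGLEVEL:level}\s+\[%{DATA:thread}\] %{JAVACLASS:logger} - %{GREEDYDATA:log_message}"
    }
  }
  date {
    match => ["timestamp", "ISO8601"]
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }
}
```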
Ingesting Nginx Logs with Filebeat
On the other hand, Nginx access and error logs have a well-defined and widely used format. The Elastic Stack provides a dedicated Nginx module for Filebeat that is designed specifically to handle it. By simply enabling this module and pointing it at the local Nginx log files, Filebeat parses the logs automatically. It also ships with an out-of-the-box set of dashboards that it installs into Kibana, providing instant, beautiful visualizations of web traffic, error rates, and user agents with very little manual configuration on my part.
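For anyone curious, enabling it really is that simple: run "filebeat modules enable nginx", point the module at the log files in modules.d/nginx.yml, and run "filebeat setup" once to load the bundled dashboards into Kibana. The paths below are placeholders for wherever your downloaded Nginx logs end up.

```yaml
# modules.d/nginx.yml -- paths are placeholders
- module: nginx
  access:
    enabled: true
    var.paths: ["/data/logs/seriesreminder/nginx/access.log*"]
  error:
    enabled: true
    var.paths: ["/data/logs/seriesreminder/nginx/error.log*"]
```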
This approach saved me a significant amount of time and effort while still providing a robust and professional-looking monitoring solution. It's a great example of using the right tool for the job.
Once the initial ELK stack for application and Nginx logs was running smoothly, I saw an opportunity to gain even more insight into Series Reminder's performance and usage. The website hosts over 150,000 images and static assets in an S3 bucket, and I wanted to analyze access patterns for these files.
I enabled S3 access logging and configured it to save logs to a separate, private S3 bucket. Just as with the application logs, I set up an S3 event notification to fire a message into an SQS queue whenever a new log file was created.
However, this new data source presented a unique challenge. Unlike the hourly application log files, S3 access logging generates thousands of small files each day. My original, single-threaded Java log downloader application couldn't process them all in a reasonable amount of time. To solve this, I re-architected my Java application to use multi-threading: with a thread pool, it can now receive SQS messages, download the corresponding S3 objects, and append them to the local log files concurrently, as sketched below.
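The gist of the change looks something like this. It's a skeleton rather than the full application; the per-message work is the same download-and-append logic shown earlier, just guarded so that concurrent workers don't interleave their writes, and the queue URL is again a placeholder.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

/** Skeleton of the multi-threaded downloader: one polling loop, a pool of worker threads. */
public class ParallelLogDownloader {

    private static final String QUEUE_URL =
            "https://sqs.us-east-1.amazonaws.com/123456789012/s3-access-log-events"; // placeholder
    private static final Object APPEND_LOCK = new Object();

    public static void main(String[] args) throws InterruptedException {
        ExecutorService workers = Executors.newFixedThreadPool(8); // tune to taste
        try (SqsClient sqs = SqsClient.create()) {
            while (true) {
                List<Message> batch = sqs.receiveMessage(ReceiveMessageRequest.builder()
                        .queueUrl(QUEUE_URL)
                        .maxNumberOfMessages(10)
                        .waitTimeSeconds(5)
                        .build()).messages();
                if (batch.isEmpty()) {
                    break;
                }
                // Hand each notification to the pool so downloads overlap instead of running one by one.
                for (Message message : batch) {
                    workers.submit(() -> handle(sqs, message));
                }
            }
            workers.shutdown();
            workers.awaitTermination(30, TimeUnit.MINUTES);
        }
    }

    private static void handle(SqsClient sqs, Message message) {
        // Download the object referenced by the message (same idea as the earlier sketch),
        // then append it to the local file; the append is synchronized so lines don't interleave.
        synchronized (APPEND_LOCK) {
            // ... append the downloaded content to the local log file ...
        }
        sqs.deleteMessage(DeleteMessageRequest.builder()
                .queueUrl(QUEUE_URL)
                .receiptHandle(message.receiptHandle())
                .build());
    }
}
```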
This change drastically improved the performance of the data ingestion pipeline, allowing me to process the high volume of S3 access logs efficiently and ensure that the data was available for analysis in a timely manner. I then created a second Logstash pipeline specifically for these logs, but with a twist.
A Dual Approach to S3 Log Analysis
For these S3 access logs, I wanted the best of both worlds: full control over parsing and a pre-built, production-ready dashboard. So, I took a two-pronged approach to ingestion.
Logstash for Custom Filtering and Indexing
First, I created a dedicated Logstash pipeline that reads the S3 access log files. I wrote a custom Grok filter to parse the logs, extracting specific information like the user agent, HTTP status code, and file path into meaningful fields. I then configured Logstash to index this structured data into its own dedicated Elasticsearch index. This gives me the flexibility to run my own queries and create custom visualizations based on very specific business needs.
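Again, a rough sketch rather than my exact filter: Logstash ships a ready-made S3_ACCESS_LOG grok pattern that covers the standard fields (requester, key, HTTP status, user agent, and so on), which makes a convenient starting point before layering on anything custom. The paths and index name here are placeholders.

```
input {
  file {
    path => "/data/logs/seriesreminder/s3-access/*.log"   # placeholder path
    start_position => "beginning"
    sincedb_path => "/var/lib/logstash/sincedb-s3-access"
  }
}

filter {
  grok {
    match => { "message" => "%{S3_ACCESS_LOG}" }
  }
  date {
    match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "s3-access-%{+YYYY.MM.dd}"
  }
}
```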
Filebeat for Speed and Pre-Built Dashboards
At the same time, I used a Filebeat module designed for S3 access logs. By pointing it at the same log files, Filebeat automatically ingests and parses the data using its pre-configured settings and dashboards, just as with the Nginx logs. While my custom Logstash pipeline gives me granular control, the Filebeat module provides a much more polished and insightful starting point, proving that sometimes leveraging the community's pre-built solutions is the most efficient and effective path. This dual strategy lets me validate the data against two sources and provides a richer analytical experience.
The Final Piece: CloudFront CDN Logs
The last component of my logging infrastructure was the CloudFront CDN access logs. All of the S3 images and assets are served through a CDN to improve performance and reduce latency for users around the world. To understand how effective this CDN was, I wanted to analyze its access logs.
Following the same trusted pattern, I configured CloudFront to save its access logs to a dedicated S3 bucket and send a notification to a new SQS queue. My multi-threaded Java application was easily adapted to handle this new queue, downloading the logs to my home server without requiring any external access to my network.
This log data, once ingested, provided the final piece of the puzzle. Using another powerful Filebeat module, I was able to automatically parse the CloudFront logs and get an immediate, detailed view of CDN performance. The preconfigured dashboards allow me to visualize the cache hit and miss ratio.
By systematically building this logging architecture, I created a powerful and cost-effective monitoring solution for my side project. From the core application logs to the fine-grained access patterns of my static assets and CDN, I now have a unified, visual, and searchable view of the entire system. This journey, while challenging, was a fantastic learning experience that honed my skills in everything from cloud architecture and containerization to multi-threading and data analysis. I now have the tools not only to troubleshoot issues quickly but also to proactively identify opportunities for optimization and improvement.