Processing Apache and nginx access logs

Lots of tools are available when you need to get useful data out of your Apache access logs. Sadly, sometimes you don't have the time to set them up. Thankfully, Linux provides some great tools to do this yourself. Below are some commands I used recently to detect misbehaving clients.

You can just output the whole log using `cat`, or `zcat` for compressed (rotated) files.

$ cat access.log # uncompressed file
$ zcat access.log-20150313.gz # compressed files

It is very simple to process multiple files using wildcards.

$ cat access*.log
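
Rotated logs are usually a mix of plain and gzipped files. As a small sketch, assuming GNU gzip (where `zcat -f` passes uncompressed files through unchanged), you can count the requests across all of them in one go:

$ zcat -f access.log* | wc -l # total requests across all rotated logs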

404 requests

Finding broken pages is a quick win: fixing them speeds up your site and prevents your server from doing useless file lookups.

$ cat access.log | awk '($9 ~ /404/)' | awk '{ print $7 }' | sort | uniq -c | sort -rn | head -n 25 

awk is a powerful tool for filtering on specific fields. By default awk splits each line on whitespace. In this example we apply a regex to the 9th field (the status code), print the 7th field (the request path), sort the paths so identical entries are grouped, collapse and count the duplicates with `uniq -c`, sort the counts numerically in descending order, and limit the result to 25 lines.

This displays the most frequently requested missing pages, in descending order.
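
The same pattern works for any status class. A quick variant, assuming the combined log format where the status code is the 9th field, lists the paths that produce the most 5xx server errors:

$ cat access.log | awk '($9 ~ /^5/)' | awk '{ print $7 }' | sort | uniq -c | sort -rn | head -n 25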

Client IPs

$ cat access.log | awk '{ print $1 }' | sort | uniq -c | sort -rn | head -n 25

The above example will display the top 25 IP addresses by total number of requests.
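
Once a suspicious address shows up, you can drill into what it actually requested. A minimal sketch, using `1.2.3.4` as a placeholder address:

$ cat access.log | awk '($1 == "1.2.3.4") { print $4, $7, $9 }' # timestamp, path, status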

$ cat access.log | awk '{ print $1 }' | sort | uniq -c | sort -rn | head -n 25 | awk '{ printf("%5d\t%-15s\t", $1, $2); system("geoiplookup " $2 " | cut -d \\: -f2 ") }'

Additionally, you can use GeoIP to find out which country the requests are coming from.

You will need to install GeoIP.

$ sudo apt-get install libgeoip1 geoip-bin
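
To verify the lookup works you can query a single address by hand (`8.8.8.8` is just an example here):

$ geoiplookup 8.8.8.8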

Realtime

$ tail -f access.log | awk '{ printf("%-15s\t%s\t%s\t%s\n", $1, $6, $9, $7) }'

Watch live as requests hit your server.
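
You can also narrow the live stream down to problem responses. A small variant of the same command, again assuming the status code is the 9th field, shows only 4xx and 5xx requests:

$ tail -f access.log | awk '($9 ~ /^[45]/) { printf("%-15s\t%s\t%s\t%s\n", $1, $6, $9, $7) }'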

$ tail -f access.log | awk '{
    # build the lookup command once so it can be closed after use
    cmd = "geoiplookup " $1 " | cut -d \\: -f2"
    cmd | getline geo
    close(cmd) # without this, every new IP leaks an open pipe
    printf("%-15s\t%s\t%s\t%-20s\t%s\n", $1, $6, $9, geo, $7);
  }'

Or add the location information.

