Skip to main content

Getting rid of serendipity_spamblocklog

This Serendipity installation has a Spamblock feature which is currently logging comment spam to a database. Over time, this database grew quite a bit and while I'd like to keep the data around I don't need the data right away. Since the machine the database is running on is low on memory anyway, I wanted to archive & then purge old records from the spamblock logging table.

This is where we are:

$ ls -lhgo serendipity_spamblocklog*
-rw-rw----. 1 8.7K Nov 21 20:28 serendipity_spamblocklog.frm
-rw-rw----. 1 1.2G Jan 25 07:29 serendipity_spamblocklog.MYD
-rw-rw----. 1  41M Jan 25 07:39 serendipity_spamblocklog.MYI

$ for a in {2008..2015}; do
     printf "year: $a       "
     mysql -B -N -D s9y -e "select count(*) from serendipity_spamblocklog \
           where year(from_unixtime(timestamp)) = $a;"
  done
year: 2008      12
year: 2009      14901
year: 2010      93232
year: 2011      12332
year: 2012      4373
year: 2013      245002
year: 2014      1232742
year: 2015      131898
Yeah, 2014 was really the year-of-the-spam :-)

Export those into CSV files:
$ for a in {2008..2014}; do
     echo "year: $a"
     mysql -D s9y -e "select * from serendipity_spamblocklog \
        where year(from_unixtime(timestamp)) = \"$a\" into outfile \
        \"spamblocklog-$a.csv\" fields terminated by ',' enclosed by '\"' \
        lines terminated by '\n';"
  done
year: 2008
year: 2009
year: 2010
year: 2011
year: 2012
year: 2013
year: 2014
Which gives us (in datadir):
$ ls -hgo
total 1.2G
-r--------. 1 4.7K Jan 25 07:07 serendipity_spamblocklog-2008.csv
-r--------. 1 4.2M Jan 25 07:07 serendipity_spamblocklog-2009.csv
-r--------. 1  91M Jan 25 07:08 serendipity_spamblocklog-2010.csv
-r--------. 1 5.4M Jan 25 07:08 serendipity_spamblocklog-2011.csv
-r--------. 1 6.4M Jan 25 07:09 serendipity_spamblocklog-2012.csv
-r--------. 1 146M Jan 25 07:09 serendipity_spamblocklog-2013.csv
-r--------. 1 860M Jan 25 07:10 serendipity_spamblocklog-2014.csv
To count records, we can't use "wc -l" just like that because comments may contain newlines as well - so let's count timestamps instead:
$ grep -c '^\"1' *
serendipity_spamblocklog-2008.csv:12
serendipity_spamblocklog-2009.csv:14901
serendipity_spamblocklog-2010.csv:93232
serendipity_spamblocklog-2011.csv:12332
serendipity_spamblocklog-2012.csv:4373
serendipity_spamblocklog-2013.csv:245002
serendipity_spamblocklog-2014.csv:1232742
Delete the exported records:
$ for a in {2008..2014}; do
     echo "year: $a"
     mysql -D s9y -e "delete from serendipity_spamblocklog \
        where year(from_unixtime(timestamp)) = $a;"
done
The size of the database file may not decrease until after we run OPTIMIZE TABLE on the table:
$ mysqlcheck --optimize s9y serendipity_spamblocklog
$ ls -lhgo serendipity_spamblocklog.*
-rw-rw----. 1 8.7K Nov 21 20:28 serendipity_spamblocklog.frm
-rw-rw----. 1  88M Jan 25 08:15 serendipity_spamblocklog.MYD
-rw-rw----. 1 3.2M Jan 25 08:15 serendipity_spamblocklog.MYI
And we can still run some stats on the CSV files:
$ awk -F\",\" '/^\"1/ {print $2}' serendipity_spamblocklog-2013.csv | sort | uniq -c | sort -n
    30 "API_ERROR"
   119 "moderate"
 98045 "REJECTED"
146808 "MODERATE"

$ awk -F\",\" '/^\"1/ {print $8}' serendipity_spamblocklog-2014.csv | awk '{print $1}' | sort | uniq -c | sort -n | tail
   251 PHP/5.3.89
   252 PHP/5.3.59
   252 PHP/5.3.94
   261 PHP/5.2.19
   270 PHP/5.2.62
  1125 Opera/9.80
  4595 
 30848 PHP/5.2.10
509019 Mozilla/4.0
646256 Mozilla/5.0