Character collation
So, recently I came across this funny behaviour on a SLES11sp4 machine:
sles11$ netstat -ni | awk '/^[a-z]/' Kernel Interface table Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg eth0 1500 0 3562 0 0 0 1955 0 0 0 BMRU lo 16436 0 20 0 0 0 20 0 0 0 LRUWait, what? Why is the (uppercase) string "
Kernel
" matched against the lowercase "[a-z]
" search expression?
The same command on a SLES12sp1 machine does the Right Thing:
sles12$ netstat -ni | awk '/^[a-z]/' eth0 1500 0 685 0 0 0 438 0 0 0 BMRU lo 65536 0 12 0 0 0 12 0 0 0 LRUApparently, this is not an unknown problem and can indeed be fixed by providing another LC_COLLATE variable:
$ netstat -ni | LC_COLLATE=C awk '/^[a-z]/' eth0 1500 0 3711 0 0 0 2032 0 0 0 BMRU lo 16436 0 20 0 0 0 20 0 0 0 LRUWhile providing a different LC_COLLATE variable did help, this still smells like a bug in SLES11, as the configured
locales
were exactly the same:
sles11$ locale LANG=en_US.UTF-8 LC_CTYPE="en_US.UTF-8" LC_NUMERIC="en_US.UTF-8" LC_TIME="en_US.UTF-8" LC_COLLATE="en_US.UTF-8" LC_MONETARY="en_US.UTF-8" LC_MESSAGES="en_US.UTF-8" LC_PAPER="en_US.UTF-8" LC_NAME="en_US.UTF-8" LC_ADDRESS="en_US.UTF-8" LC_TELEPHONE="en_US.UTF-8" LC_MEASUREMENT="en_US.UTF-8" LC_IDENTIFICATION="en_US.UTF-8" LC_ALL= sles11$ locale -k LC_COLLATE collate-nrules=4 collate-rulesets="" collate-symb-hash-sizemb=2039 collate-codeset="UTF-8" sles11$ locale | md5sum 677d9b3dbdf9759c8b604f294accd102 - sles12$ locale | md5sum 677d9b3dbdf9759c8b604f294accd102 -Interestingly enough, both installations differ greatly in the way they look up locale information:
sles11$ echo | strace -e open awk '/^[a-z]/' open("/etc/ld.so.cache", O_RDONLY) = 3 open("/lib64/libdl.so.2", O_RDONLY) = 3 open("/lib64/libm.so.6", O_RDONLY) = 3 open("/lib64/libc.so.6", O_RDONLY) = 3 open("/usr/lib/locale/locale-archive", O_RDONLY) = 3 open("/usr/lib64/gconv/gconv-modules.cache", O_RDONLY) = 3 sles12$ echo | strace -e open awk '/^[a-z]/' 2>&1 | grep -v ENOENT open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3 open("/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3 open("/lib64/libm.so.6", O_RDONLY|O_CLOEXEC) = 3 open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3 open("/usr/lib/locale/en_US.utf8/LC_CTYPE", O_RDONLY|O_CLOEXEC) = 3 open("/usr/lib64/gconv/gconv-modules.cache", O_RDONLY) = 3 open("/usr/lib/locale/en_US.utf8/LC_COLLATE", O_RDONLY|O_CLOEXEC) = 3 open("/usr/lib/locale/en_US.utf8/LC_MESSAGES", O_RDONLY|O_CLOEXEC) = 3 open("/usr/lib/locale/en_US.utf8/LC_MESSAGES/SYS_LC_MESSAGES", O_RDONLY|O_CLOEXEC) = 3 open("/usr/lib/locale/en_US.utf8/LC_NUMERIC", O_RDONLY|O_CLOEXEC) = 3 open("/usr/lib/locale/en_US.utf8/LC_TIME", O_RDONLY|O_CLOEXEC) = 3 open("/dev/null", O_RDWR) = 3 +++ exited with 0 +++Alas, no bug has been reported yet :-\
While this appears to be documented behaviour, it's still very confusing and may even violate the Principle of Least Surprise. FWIW, GNU/grep behaves as expected on both systems, no matter the collation:
$ echo Abc | egrep --color '[[:lower:]]' Abc
PS: I forgot to mention how cool SUSE Studio is - this SLE12 test VM was up & running in minutes and accessible via SSH too and I didn't even have to fire up my local VirtualBox instance! :-)