Character collation

So, recently I came across this funny behaviour on a SLES11sp4 machine:
sles11$ netstat -ni | awk '/^[a-z]/' 
Kernel Interface table
Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0     3562      0      0      0     1955      0      0      0 BMRU
lo    16436   0       20      0      0      0       20      0      0      0 LRU
Wait, what? Why do lines starting with an uppercase letter ("Kernel", "Iface") match the lowercase-only "[a-z]" expression? The same command on a SLES12sp1 machine does the Right Thing:
sles12$ netstat -ni | awk '/^[a-z]/' 
eth0   1500   0      685      0      0      0      438      0      0      0 BMRU
lo    65536   0       12      0      0      0       12      0      0      0 LRU
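The difference can be reproduced without netstat. The following is only a sketch: range_matches_upper is a hypothetical helper name, and it assumes the locale passed in is actually installed (if it is not, awk will typically fall back to the C locale and the check says nothing about the locale you asked for).

```shell
#!/bin/sh
# Hypothetical helper: succeeds if awk's [a-z] range matches an uppercase "K"
# when awk runs under the locale given as $1.
range_matches_upper() {
    printf 'K\n' | LC_ALL="$1" awk '/^[a-z]$/' | grep -q 'K'
}

# Report the behaviour for the C locale and for the locale used in this post.
for loc in C en_US.UTF-8; do
    if range_matches_upper "$loc"; then
        echo "$loc: [a-z] matches K"
    else
        echo "$loc: [a-z] does not match K"
    fi
done
```

Running this on both machines shows whether the range is interpreted according to the locale's collation order or by plain character value.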
Apparently, this is a known problem, and it can indeed be worked around by setting LC_COLLATE to a different value for the awk process:
$ netstat -ni | LC_COLLATE=C awk '/^[a-z]/' 
eth0   1500   0     3711      0      0      0     2032      0      0      0 BMRU
lo    16436   0       20      0      0      0       20      0      0      0 LRU
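One caveat with this workaround: LC_ALL takes precedence over LC_COLLATE, so setting only LC_COLLATE has no effect in an environment where LC_ALL is already set. A minimal sketch of a wrapper that sidesteps this (awk_c is a hypothetical name, and the input is sample data rather than real netstat output):

```shell
#!/bin/sh
# Hypothetical wrapper: run awk with byte-wise collation, regardless of
# whatever LANG, LC_COLLATE or LC_ALL the caller has set.
awk_c() {
    LC_ALL=C awk "$@"
}

# Sample input standing in for "netstat -ni"; only the lines starting with a
# lowercase ASCII letter should be printed.
printf 'Kernel Interface table\nIface   MTU\neth0   1500\nlo    16436\n' |
    awk_c '/^[a-z]/'
```

Note that LC_ALL=C also switches LC_CTYPE to C, so awk then works on bytes rather than multibyte characters. That is fine for ASCII interface names but may matter for other input.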
While overriding LC_COLLATE did help, this still smells like a bug in SLES11, as the configured locales were exactly the same on both machines:
sles11$ locale 
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

sles11$ locale -k LC_COLLATE
collate-nrules=4
collate-rulesets=""
collate-symb-hash-sizemb=2039
collate-codeset="UTF-8"

sles11$ locale | md5sum 
677d9b3dbdf9759c8b604f294accd102  -

sles12$ locale | md5sum 
677d9b3dbdf9759c8b604f294accd102  -
Interestingly enough, the two installations differ considerably in the way they look up locale information:
sles11$ echo | strace -e open awk '/^[a-z]/' 
open("/etc/ld.so.cache", O_RDONLY)      = 3
open("/lib64/libdl.so.2", O_RDONLY)     = 3
open("/lib64/libm.so.6", O_RDONLY)      = 3
open("/lib64/libc.so.6", O_RDONLY)      = 3
open("/usr/lib/locale/locale-archive", O_RDONLY) = 3
open("/usr/lib64/gconv/gconv-modules.cache", O_RDONLY) = 3



sles12$ echo | strace -e open awk '/^[a-z]/' 2>&1 | grep -v ENOENT
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libm.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/locale/en_US.utf8/LC_CTYPE", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib64/gconv/gconv-modules.cache", O_RDONLY) = 3
open("/usr/lib/locale/en_US.utf8/LC_COLLATE", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/locale/en_US.utf8/LC_MESSAGES", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/locale/en_US.utf8/LC_MESSAGES/SYS_LC_MESSAGES", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/locale/en_US.utf8/LC_NUMERIC", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/locale/en_US.utf8/LC_TIME", O_RDONLY|O_CLOEXEC) = 3
open("/dev/null", O_RDWR)               = 3
+++ exited with 0 +++
Alas, no bug has been reported yet :-\

While this appears to be documented behaviour, it's still very confusing and may even violate the Principle of Least Surprise. FWIW, GNU grep with the "[[:lower:]]" character class behaves as expected on both systems, no matter the collation:
$ echo Abc | egrep --color '[[:lower:]]'
Abc
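When the goal is only to filter lines, the same character class can replace the "[a-z]" range entirely. This is a sketch: lower_lines is a hypothetical helper name and the input is sample data, not real netstat output.

```shell
#!/bin/sh
# Hypothetical helper: keep only lines whose first character is a lowercase
# letter. [[:lower:]] is defined by LC_CTYPE, not by the collation order, so
# uppercase letters are never included.
lower_lines() {
    grep '^[[:lower:]]'
}

printf 'Kernel Interface table\nIface   MTU\neth0   1500\nlo    16436\n' |
    lower_lines
```

With real data this would be used as "netstat -ni | lower_lines".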

PS: I forgot to mention how cool SUSE Studio is: this SLE12 test VM was up & running in minutes and accessible via SSH, and I didn't even have to fire up my local VirtualBox instance! :-)
