Fuzzy uniq & colorized HTML diffs
The other day I came across a file full of these infamous "Alle Kinder..." jokes. But the file was rich in almost-duplicates:
$ grep Liter alle-kinder.txt
Alle Kinder sind besoffen, nur nicht Dieter, der trinkt noch 'n Liter
Alle Kinder sind besoffen, nur nicht Dieter, der trinkt noch nen Liter
Alle Kinder sind besoffen. nur nicht Dieter, der trinkt noch 'n Liter.

So I could not just use sort(1) and filter out the duplicates, because they were not really duplicates. I needed something that would look for similar jokes in that file and filter them out, just like uniq(1) would do for exact matches.
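To make the idea of a "fuzzy uniq" a bit more concrete, here is a minimal sketch in Python. It is my own illustration, not the script mentioned in this post: the function name fuzzy_uniq and the 0.9 similarity threshold are assumptions, and it leans on difflib.SequenceMatcher from the standard library. Unlike uniq(1), it compares each line against every line kept so far, so the input does not need to be sorted, but it gets slow on large files.

```python
from difflib import SequenceMatcher


def fuzzy_uniq(lines, threshold=0.9):
    """Drop near-duplicate lines, keeping the first line of each similar group.

    `threshold` is a hypothetical default: two lines count as duplicates
    when their similarity ratio (0.0 to 1.0) is at least this value.
    """
    kept = []
    for line in lines:
        candidate = line.strip()
        if not candidate:
            continue  # skip blank lines
        is_duplicate = any(
            SequenceMatcher(None, candidate, seen, autojunk=False).ratio() >= threshold
            for seen in kept
        )
        if not is_duplicate:
            kept.append(candidate)
    return kept


if __name__ == "__main__":
    import sys

    # Usage: python fuzzy_uniq.py alle-kinder.txt
    with open(sys.argv[1], encoding="utf-8") as handle:
        for joke in fuzzy_uniq(handle):
            print(joke)
```

Note that this sketch keeps the first variant it sees, so it may pick a different survivor than another fuzzy uniq implementation would.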
Luckily the internet is here to help, as always, and I came across this fantastic script: a fuzzy version of uniq(1). After running the file through the script, only one of the three occurrences of the same joke is left:
$ funiq.sh alle-kinder.txt | grep Liter
Alle Kinder sind besoffen, nur nicht Dieter, der trinkt noch 'n Liter.

Great! Oh, but then it'd be interesting to see which entries got kicked out by the fuzzy uniq script. Sure, diff(1) could do that. But for some reason I wanted diff's output in color. Hm, ColorDiff? But what if I wanted the output to be HTML too? Don't ask what gave me that idea, but it's nice to know that other people are equally crazy and put up a bash script to convert diff output into colorized HTML. Yeah, you got that right:
$ diff -u alle-kinder.txt alle-kinder_fuzzy.txt | diff2html.sh | tidy > diff.html

And out comes something like this :-)
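For the curious: the conversion step is not much magic either. Below is a rough sketch of how unified diff output could be turned into colorized HTML. Again, this is my own illustration and not the bash script used above; the function name diff_to_html and the colors are made up. It escapes each line and wraps it in a span whose style depends on the first character of the line.

```python
import html

# Hypothetical color scheme, keyed by the first character of a diff line.
STYLES = {
    "+": "background:#dfd",  # added line
    "-": "background:#fdd",  # removed line
    "@": "color:#06c",       # hunk header, e.g. "@@ -1,3 +1,2 @@"
}


def diff_to_html(diff_lines):
    """Wrap unified diff lines in styled <span> elements inside a <pre> block."""
    rows = []
    for line in diff_lines:
        text = line.rstrip("\n")
        if text.startswith(("--- ", "+++ ")):
            # File headers. This is a heuristic: a removed line whose content
            # starts with "-- " would be mistaken for a header.
            style = "font-weight:bold"
        else:
            style = STYLES.get(text[:1], "")
        attr = f' style="{style}"' if style else ""
        rows.append(f"<span{attr}>{html.escape(text)}</span>")
    return "<pre>\n" + "\n".join(rows) + "\n</pre>"


if __name__ == "__main__":
    import sys

    # Usage: diff -u old.txt new.txt | python diff_to_html.py > diff.html
    print(diff_to_html(sys.stdin))
```

The output is only a pre fragment, not a full HTML page, so running it through tidy afterwards still makes sense.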