Fuzzy uniq & colorized HTML diffs

The other day I came across a file full of those infamous "Alle Kinder..." jokes. But the file was full of near-duplicates:
$ grep Liter alle-kinder.txt 
Alle Kinder sind besoffen, nur nicht Dieter, der trinkt noch 'n Liter
Alle Kinder sind besoffen, nur nicht Dieter, der trinkt noch nen Liter
Alle Kinder sind besoffen. nur nicht Dieter, der trinkt noch 'n Liter.
So I could not just run the file through sort(1) and filter out the duplicates, because they were not exact duplicates. I needed something that looks for similar jokes in the file and filters them out, just like uniq(1) does for identical lines.

Luckily the internet is here to help, as always, and I came across this fantastic script: a fuzzy version of uniq(1). After running the file through the script, only one of the three occurrences of the same joke is left:
$ funiq.sh alle-kinder.txt | grep Liter
Alle Kinder sind besoffen, nur nicht Dieter, der trinkt noch 'n Liter.
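For the curious, here is a rough idea of how such a fuzzy uniq could work. This is only a minimal Python sketch and not how funiq.sh is actually implemented: it compares each line against the lines already kept using `difflib.SequenceMatcher` and drops it if the similarity ratio reaches a threshold. The function name `fuzzy_uniq`, the threshold of 0.9 and the last sample line are my own assumptions for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_uniq(lines, threshold=0.9):
    """Keep a line only if it is not too similar to a line kept earlier.

    threshold is an assumed cut-off for SequenceMatcher.ratio(),
    which ranges from 0.0 (nothing in common) to 1.0 (identical).
    """
    kept = []
    for line in lines:
        candidate = line.strip()
        if not candidate:
            continue  # ignore empty lines
        is_near_duplicate = any(
            SequenceMatcher(None, candidate.lower(), seen.lower()).ratio() >= threshold
            for seen in kept
        )
        if not is_near_duplicate:
            kept.append(candidate)
    return kept


jokes = [
    "Alle Kinder sind besoffen, nur nicht Dieter, der trinkt noch 'n Liter",
    "Alle Kinder sind besoffen, nur nicht Dieter, der trinkt noch nen Liter",
    "Alle Kinder sind besoffen. nur nicht Dieter, der trinkt noch 'n Liter.",
    # Made-up line, only here to show that unrelated text survives.
    "A completely different line that should survive",
]

unique_jokes = fuzzy_uniq(jokes)
for joke in unique_jokes:
    print(joke)
```

Note that this sketch always keeps the first variant it sees and compares every line against every kept line, so it gets slow on big files. Unlike uniq(1), it does not need sorted input.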
Great! Oh, but then it'd be interesting to see which entries got kicked out by the fuzzy uniq script. Sure, diff(1) could do that. But for some reason I wanted diff's output in color. Hm, ColorDiff? But what if I wanted the output to be HTML too? Don't ask what gave me that idea, but it's nice to know that other people are equally crazy and have put up a bash script that converts diff output into colorized HTML. Yeah, you got that right:
$ diff -u alle-kinder.txt alle-kinder_fuzzy.txt | diff2html.sh | tidy > diff.html
And out comes something like this :-)
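If you'd rather not chain three tools, roughly the same thing can be sketched in Python. Again, this is just an illustration under my own assumptions and not what diff2html.sh does internally: it builds a unified diff with `difflib.unified_diff`, escapes each line with `html.escape` and wraps it in a colored `<span>`. The function name `diff_to_html` and the colors are made up.

```python
import difflib
import html

# Assumed colors: red-ish for removed, green-ish for added, blue-ish for hunks.
COLORS = {"-": "#fdd", "+": "#dfd", "@": "#ddf"}


def diff_to_html(before, after, from_name="before", to_name="after"):
    """Render a unified diff of two lists of lines as colorized HTML."""
    diff = difflib.unified_diff(
        before, after, fromfile=from_name, tofile=to_name, lineterm=""
    )
    rows = []
    for index, line in enumerate(diff):
        if index < 2:
            color = "#eee"  # the two "---" / "+++" file header lines
        else:
            color = COLORS.get(line[:1], "transparent")
        rows.append(
            f'<span style="background:{color}">{html.escape(line)}</span>'
        )
    return "<pre>\n" + "\n".join(rows) + "\n</pre>"


original = [
    "Alle Kinder sind besoffen, nur nicht Dieter, der trinkt noch 'n Liter",
    "Alle Kinder sind besoffen, nur nicht Dieter, der trinkt noch nen Liter",
    "Alle Kinder sind besoffen. nur nicht Dieter, der trinkt noch 'n Liter.",
]
filtered = original[:1]  # pretend the fuzzy uniq kept only the first variant

page = diff_to_html(original, filtered, "alle-kinder.txt", "alle-kinder_fuzzy.txt")
print(page)
```

The two lines that got kicked out show up with a red background, and the one that survived stays uncolored as context.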
