Remove duplicate lines

If the order of lines is not important

Sort the lines alphabetically if they are not already sorted, then run this search and replace:

      Regex: ^(.*)(\n\1)+$
Replacement: $1
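
To try the search and replace above outside an editor, here is a minimal Python sketch. The function name dedupe_sorted is made up for illustration. Two details differ from the editor version: Python writes the replacement as \1 rather than $1, and the MULTILINE flag is needed so that ^ and $ match at line boundaries.

```python
import re

def dedupe_sorted(text: str) -> str:
    """Sort the lines, then collapse each run of identical lines into one."""
    # Sorting places identical lines next to each other.
    joined = "\n".join(sorted(text.splitlines()))
    # ^(.*) captures one line, (\n\1)+ matches one or more identical
    # copies directly below it, and $ ensures the last copy is a whole
    # line rather than the prefix of a longer one.
    return re.sub(r"^(.*)(\n\1)+$", r"\1", joined, flags=re.MULTILINE)

print(dedupe_sorted("b\na\nb\na\nc"))
```

The trailing $ matters: without it, a line "a" followed by a line "ab" would be treated as a duplicate pair.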

In the shell you can do:

sort input_file | uniq > output_file
sort -u input_file > output_file

If the order is important

If the order is important and you do not mind that only the last occurrence of each duplicated line is kept, search for the following regexp and replace it with nothing. This version removes only duplicate non-empty lines:

      Regex: ^(.+\n)(?=(?:.*\n)*?\1)

If you also want to remove duplicate empty lines, use * instead of +.

      Regex: ^(.*\n)(?=(?:.*\n)*?\1)

In the shell you can do:

awk '!seen[$0]++' input_file > output_file
awk '/^[[:space:]]*$/||!seen[$0]++' input_file > output_file # keeps every empty or whitespace-only line
gawk -i inplace '!a[$0]++' input_file
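
Note that the awk commands keep the first occurrence of each line, while the regexes above keep the last. The following Python sketch shows both behaviours side by side. The function names are hypothetical, and it relies on dict preserving insertion order (Python 3.7 and later).

```python
def dedupe_keep_first(lines):
    """Keep the first occurrence of each line, like awk '!seen[$0]++'."""
    # dict.fromkeys keeps one key per distinct line, in first-seen order.
    return list(dict.fromkeys(lines))

def dedupe_keep_last(lines):
    """Keep the last occurrence of each line, like the lookahead regex."""
    # Deduplicate the reversed list, then restore the original direction.
    return list(dict.fromkeys(reversed(lines)))[::-1]

sample = ["a", "b", "a", "c", "b"]
print(dedupe_keep_first(sample))
print(dedupe_keep_last(sample))
```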


xxx(?=...) is a lookahead. It requires that whatever follows “xxx” matches “…”, but it does not consume that text, so the search does not advance past it. (?:...) is a non-capturing group: it groups like ordinary parentheses but is not counted when backreferences are numbered. .*\n matches one (possibly empty) line together with its line break. The * after the group allows any number of such lines, including none, and the ? after that * makes it lazy, so as few lines as possible are skipped. Because \1 follows the group, the lookahead skips over lines until it reaches one identical to the captured line. If such a line exists, the current line has a later copy and is matched, so replacing every match with nothing leaves only the last copy of each line.
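
As a rough way to check this behaviour, the two order-preserving regexes can be run with Python's re module. This is only a sketch: the names NON_EMPTY_ONLY, INCLUDING_EMPTY and remove_earlier_duplicates are made up for illustration, and the MULTILINE flag is assumed to be the equivalent of the editor's line-based ^ matching.

```python
import re

# The two patterns from the text, unchanged.
NON_EMPTY_ONLY = r"^(.+\n)(?=(?:.*\n)*?\1)"
INCLUDING_EMPTY = r"^(.*\n)(?=(?:.*\n)*?\1)"

def remove_earlier_duplicates(text: str, pattern: str = NON_EMPTY_ONLY) -> str:
    """Delete every line that reappears later, keeping the last copy."""
    # MULTILINE lets ^ match at the start of each line, not just the text.
    return re.sub(pattern, "", text, flags=re.MULTILINE)

print(remove_earlier_duplicates("a\nb\na\nc\nb\n"), end="")
```

Because the captured group includes the line break, a final line without a trailing newline is never recognised as the later copy. For example, "a\nb\na" is left unchanged, so make sure the text ends with a newline before running the replacement.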