If the order of lines is not important
Sort the lines alphabetically, if they aren’t already, then run this search and replace:
Regex: ^(.*)(\n\1)+$
Replacement: $1
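To try the sorted-lines regex outside an editor, here is a minimal sketch in Python. Python's re module and the helper name dedupe_sorted are my own assumptions for illustration, not part of the steps above; Python writes the replacement $1 as \1.

```python
import re

def dedupe_sorted(text: str) -> str:
    """Sort the lines, then collapse each run of identical adjacent lines."""
    # Sorting puts identical lines next to each other, which the regex relies on.
    sorted_text = "\n".join(sorted(text.splitlines())) + "\n"
    # ^(.*) captures one line; (\n\1)+ matches one or more repeats of it.
    # re.MULTILINE lets ^ and $ match at every line boundary.
    return re.sub(r"^(.*)(\n\1)+$", r"\1", sorted_text, flags=re.MULTILINE)

print(dedupe_sorted("cherry\napple\ncherry\nbanana\napple\ncherry\n"), end="")
# apple
# banana
# cherry
```

The trailing $ matters: without it, a line such as "app" followed by "apple" would be treated as a partial repeat.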
In the shell you can do:
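sort input_file | uniq -u > output_file # drops every line that has a duplicate, keeping only lines that occur exactly once
sort -u input_file > output_file # keeps one copy of each line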
If the order is important
If the order matters and you don’t mind that only the last of each set of duplicate lines is kept, search for the following regexp and replace it with nothing. This version removes only duplicate non-empty lines:
Regex: ^(.+\n)(?=(?:.*\n)*?\1)
Replacement: (leave empty)
If you also want to remove duplicate empty lines, use * instead of +:
Regex: ^(.*\n)(?=(?:.*\n)*?\1)
Replacement: (leave empty)
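A rough way to check both order-preserving variants is to run them through Python's re module. This is only a sketch: the function name dedupe_keep_last and the include_empty flag are hypothetical, and it assumes every line, including the last one, ends with a newline.

```python
import re

def dedupe_keep_last(text: str, include_empty: bool = False) -> str:
    """Remove each line that occurs again later, so the last copy survives."""
    # (.+\n) ignores empty lines; (.*\n) treats them as candidates too.
    line = r"(.*\n)" if include_empty else r"(.+\n)"
    # The lookahead skips as few whole lines as possible, then requires
    # an exact copy of the captured line (\1).
    pattern = "^" + line + r"(?=(?:.*\n)*?\1)"
    # An empty replacement deletes the earlier copy.
    return re.sub(pattern, "", text, flags=re.MULTILINE)

print(repr(dedupe_keep_last("a\nb\na\nc\nb\n")))
# 'a\nc\nb\n'
print(repr(dedupe_keep_last("a\n\nb\n\na\n", include_empty=True)))
# 'b\n\na\n'
```

Because every candidate line scans ahead through the rest of the text, this approach can get slow on large files.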
In the shell you can do:
awk '!seen[$0]++' input_file > output_file
awk '/^[[:space:]]*$/ || !seen[$0]++' input_file > output_file # keeps every blank or whitespace-only line
gawk -i inplace '!a[$0]++' input_file
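Note that the awk idiom keeps the first occurrence of each line, whereas the regex above keeps the last. For comparison, here is a hedged Python equivalent of the seen-array logic; the function name and the keep_blank parameter are illustrative assumptions, not something awk provides.

```python
def dedupe_keep_first(lines, keep_blank=False):
    """Keep the first occurrence of each line, like awk '!seen[$0]++'."""
    seen = set()
    result = []
    for line in lines:
        # Optionally pass blank or whitespace-only lines through untouched,
        # mirroring the second awk command.
        if keep_blank and not line.strip():
            result.append(line)
            continue
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result

print(dedupe_keep_first(["a", "b", "a", "c", "b"]))
# ['a', 'b', 'c']
print(dedupe_keep_first(["a", "", "a", "", "b"], keep_blank=True))
# ['a', '', '', 'b']
```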
Explanation
xxx(?=...) is a lookahead match: it makes sure that whatever follows “xxx” matches “...”, but it does not advance the search position.
(?:...) is a non-capturing group, so it does not count toward the group numbering.
.*\n is a pattern for one (possibly empty) line. The * after the group means that there may be several such lines, or none at all. The ? after that asterisk makes it lazy, so as few lines as possible are skipped.
Because \1 follows this expression, the effect is that we look ahead past all the lines that do not match \1 until we find a line that does. A line is therefore removed whenever an identical line occurs later, which is why the last copy is the one that survives. I hope this makes it clear.
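To see that the lookahead checks the following lines without consuming them, you can list what the regex actually matches. This is a small sketch using Python's re module (my choice for illustration); the sample text is made up.

```python
import re

PATTERN = re.compile(r"^(.+\n)(?=(?:.*\n)*?\1)", re.MULTILINE)

text = "a\nb\na\nc\nb\n"

# Each match spans exactly one line, even though the lookahead
# had to scan past later lines to find the repeat.
matches = [(m.span(), m.group(1)) for m in PATTERN.finditer(text)]
for span, line in matches:
    print(span, repr(line))
# (0, 2) 'a\n'
# (2, 4) 'b\n'
```

Only the first "a" and the first "b" are matched; the later copies and "c" have no identical line after them, so they are left alone.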