BashScripts.org
http://bashscripts.org/forum/

Remove duplicate in first column
http://bashscripts.org/forum/viewtopic.php?f=8&t=1136

Author:  mattdaddym [ Wed Jun 30, 2010 5:28 pm ]
Post subject:  Remove duplicate in first column

Hi all,

First post...just found the site and it looks great.

I have a test file that is just over 4 million rows. It is tab-delimited. The first column is an ID# for which I want to remove duplicates. To clarify, any ID# should only appear once in this column, and if it appears more than once (it may be repeated many times), then the whole row should be removed. Please note, the whole row will not be a duplicate, just the ID# in the first column. I want to port the output (the unique hits) to a new, tab-delimited text file. The entire row should be preserved. Example:

123456 Matt Red House
987654 Jon Blue Car
123456 Angie Green Bike <-- This row would get deleted.
987654 Val Black Car <-- This row would get deleted.

There are four rows, but two are duplicates in the first column even though the entire rows are not duplicates. The file that is created should have only two rows. I assume it would contain the first instance of the unique ID#s and the rest would be deleted, but it is irrelevant.

It would be sweet if I could also get a count of how many rows were removed on the screen.

I have very little experience with bash scripts, but it feels like the best way to address this issue. I haven't done a lot of work trying to figure this out. My work schedule is out of control and this would help out a lot. I did a few searches on the forum but couldn't come up with exactly what I'm trying to do.

Thanks so much all!

Author:  Watael [ Wed Jun 30, 2010 7:38 pm ]
Post subject:  Re: Remove duplicate in first column

Code:
#!/bin/bash

# Read the file line by line, remembering every ID seen so far in the ids array.
# A row is written to output.file only if its ID has not been seen before.
# Splitting on tabs and printing a tab keeps the output tab-delimited.
while IFS=$'\t' read -r id restoftheline
do exists=0
   for i in "${ids[@]}"
   do [[ $id == "$i" ]] && { exists=1; break; }
   done
   (( exists )) || { ids+=( "$id" ); printf '%s\t%s\n' "$id" "$restoftheline" >> output.file; }
done < input.file

You should consider awk as a better choice for such a big file, though: this loop rescans the whole ids array for every row, which gets slow over 4 million rows.

Author:  Patsie [ Wed Jun 30, 2010 11:04 pm ]
Post subject:  Re: Remove duplicate in first column

Watael wrote:
You should consider awk as a better choice for such a big file.

Your wish is my command ;)
Code:
awk '
BEGIN { oldnumbers=""; duplicates=0; }
{
  # Look for the ID surrounded by spaces, so that an ID such as 1234 is not
  # mistaken for a duplicate of 123456.
  if (!index(oldnumbers" ", " "$1" ")) {
    oldnumbers=oldnumbers" "$1;
    print;                  # first occurrence of this ID: keep the whole row
  } else { duplicates++; }  # repeated ID: drop the row
}
END { printf("Duplicates found: %d\n", duplicates); }'
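
The same check can also be done with an associative array keyed on the first field, which avoids rescanning an ever-growing string on every row. A minimal sketch (it reads stdin, writes the kept rows to stdout, and sends the count to stderr so it still shows on screen when stdout is redirected to a file; the "/dev/stderr" name is understood by gawk and mawk):

Code:
awk -F'\t' '
# seen[] is an associative array: a row is printed only the first time
# its first field appears; every later occurrence is counted as a duplicate.
!seen[$1]++ { print; next }
{ duplicates++ }
END { printf("Duplicates found: %d\n", duplicates) > "/dev/stderr" }'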

Author:  choroba [ Thu Jul 01, 2010 10:27 am ]
Post subject:  Re: Remove duplicate in first column

Or similarly in perl:
Code:
perl -pe '
@item=split "\t";
if($seen{$item[0]}++){
  $removed++;
  undef $_;
}
END { print "Removed: $removed\n" } '

Update: The code shortened ('else' removed).
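
If you redirect the kept rows to a file, note that the "Removed" line is printed to standard output, so it would land in that file too; printing it to STDERR keeps it on the screen. A rough usage sketch with made-up file names:

Code:
# input.file and output.file are just example names.
perl -pe '@item=split "\t";
  if($seen{$item[0]}++){ $removed++; undef $_ }
  END { print STDERR "Removed: ", $removed+0, "\n" } ' input.file > output.file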

Author:  mattdaddym [ Thu Jul 01, 2010 10:53 am ]
Post subject:  Re: Remove duplicate in first column

Wow...what can I say. You all rock. Thank you!

Of course, I'll be a greedy person and ask how can I sort by one of these columns? It will be the last column and it contains text data (not numeric), but I may need to sort by other columns. I am familiar with sort, but not how to specify a specific column of data when the data is tab-delimited.

Thanks again!

Author:  mattdaddym [ Thu Jul 01, 2010 12:28 pm ]
Post subject:  Re: Remove duplicate in first column

I heard Perl is not good for files this size (over 4 million rows). True or no?

Author:  Patsie [ Thu Jul 01, 2010 1:24 pm ]
Post subject:  Re: Remove duplicate in first column

mattdaddym wrote:
Of course, I'll be a greedy person and ask how can I sort by one of these columns? It will be the last column and it contains text data (not numeric), but I may need to sort by other columns. I am familiar with sort, but not how to specify a specific column of data when the data is tab-delimited.

Tab delimiting doesn't matter much to sort: by default it treats runs of blanks (spaces and tabs) as field separators. This can be overridden with sort's -t option. To specify a column number other than the first, use the -k option. If you don't know how, please read sort's manual page.
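
For the example data above, something like this would sort the deduplicated file on its last column; the column number (4) and the file names are only assumptions based on the sample rows, so adjust them to match the real file:

Code:
# Sort on the 4th tab-separated column (plain text sort); column 4 and the
# file names are assumed from the example data in this thread.
sort -t $'\t' -k4,4 output.file > sorted.file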

Author:  choroba [ Thu Jul 01, 2010 3:50 pm ]
Post subject:  Re: Remove duplicate in first column

mattdaddym wrote:
I heard Perl is not good for files this size (over 4 million rows). True or no?

If the data do not contain many duplicates, the program I suggested might take a lot of memory (because it has to remember each ID it encounters). In such a case, you can, for example, split your data according to the first character of the ID and process the smaller parts separately.
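
One way to do that split, as a sketch (the part_*.tsv file names are made up):

Code:
# Append each row to a file named after the first character of its ID,
# e.g. part_1.tsv, part_9.tsv; each part can then be deduplicated on its own.
awk -F'\t' '{ print > ("part_" substr($1, 1, 1) ".tsv") }' input.file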
