
Remove duplicates in first column




 PostPosted: Wed Jun 30, 2010 5:28 pm   

Joined: Wed Jun 30, 2010 5:15 pm
Posts: 4
Hi all,

First post...just found the site and it looks great.

I have a test file that is just over 4 million rows, tab delimited. The first column is an ID# for which I want to remove duplicates. To clarify: any ID# should appear only once in this column; whenever it appears again (it may be repeated many times), that whole row should be removed. Note that the whole row will not be a duplicate, just the ID# in the first column. I want to write the output (the unique rows) to a new, tab-delimited text file, with each row preserved in full. Example:

123456 Matt Red House
987654 Jon Blue Car
123456 Angie Green Bike <-- This row would get deleted.
987654 Val Black Car <-- This row would get deleted.

There are four rows, but two of them duplicate an earlier ID# in the first column even though the entire rows are not duplicates. The resulting file should have only two rows. I assume it would contain the first instance of each unique ID# and the rest would be deleted, but which instance is kept doesn't matter.

It would be sweet if I could also get a count of how many rows were removed printed to the screen.

I have very little experience with bash scripts, but this feels like the best way to address the problem. I haven't had much time to figure it out myself (my work schedule is out of control), so this would help out a lot. I did a few searches on the forum but couldn't find exactly what I'm trying to do.

Thanks so much all!


 PostPosted: Wed Jun 30, 2010 7:38 pm   

Joined: Mon Mar 02, 2009 3:03 am
Posts: 512
Code:
#!/bin/bash

# read the ID and the rest of each tab-delimited line
while IFS=$'\t' read -r id restoftheline
do exists=0
   # linear scan of the IDs seen so far
   for i in "${ids[@]}"
   do [[ $id == "$i" ]] && { exists=1; break; }
   done
   # unseen ID: remember it and keep the row, with the tab restored
   (( exists )) || { ids+=( "$id" ); printf '%s\t%s\n' "$id" "$restoftheline" >> output.file; }
done < input.file
You should consider awk as a better choice for such a big file, though: the inner loop makes this script quadratic in the number of unique IDs.
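For what it's worth, on bash 4.0+ an associative array would avoid that inner loop entirely; a minimal sketch, assuming the same input.file and output.file names:
Code:
#!/bin/bash
# bash 4+ only: an associative array gives constant-time ID lookups
declare -A seen
while IFS=$'\t' read -r id rest
do [[ ${seen[$id]} ]] && continue   # duplicate ID: skip this row
   seen[$id]=1
   printf '%s\t%s\n' "$id" "$rest"
done < input.file > output.file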


 PostPosted: Wed Jun 30, 2010 11:04 pm   

Joined: Sun Jun 27, 2010 12:57 am
Posts: 192
Watael wrote:
You should consider awk as a better choice for such a big file.

Your wish is my command ;)
Code:
awk '
{
  # pad the ID with spaces on both sides so e.g. 123 cannot match inside 1234
  if (!index(oldnumbers, " " $1 " ")) {
    oldnumbers = oldnumbers " " $1 " "
    print
  } else duplicates++
}
# send the count to stderr so it does not end up in a redirected output file
END { printf("Duplicates found: %d\n", duplicates) > "/dev/stderr" }'
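Assuming you save the script body (everything between the single quotes) as dedup.awk, usage would look like:
Code:
# unique rows go to output.file; the count still prints to the terminal via stderr
awk -f dedup.awk input.file > output.file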


Last edited by Patsie on Thu Jul 01, 2010 1:16 pm, edited 1 time in total.

 PostPosted: Thu Jul 01, 2010 10:27 am   

Joined: Tue Apr 27, 2010 2:28 pm
Posts: 172
Location: Czech Republic
Or similarly in perl:
Code:
perl -pe '
@item = split /\t/;          # first field is the ID
if ($seen{$item[0]}++) {     # already seen: count it and suppress the row
  $removed++;
  undef $_;
}
END { print STDERR "Removed: ", $removed + 0, "\n" } '
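Usage follows the same pattern (again assuming you save the body between the quotes, here as dedup.pl):
Code:
# -p prints each kept row; the removal count goes to stderr
perl -p dedup.pl input.file > output.file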

Update: the code has been shortened (the 'else' branch removed).


Last edited by choroba on Thu Jul 01, 2010 3:38 pm, edited 1 time in total.

 PostPosted: Thu Jul 01, 2010 10:53 am   

Joined: Wed Jun 30, 2010 5:15 pm
Posts: 4
Wow...what can I say. You all rock. Thank you!

Of course, I'll be greedy and ask: how can I sort by one of these columns? It will be the last column, which contains text (not numeric) data, but I may need to sort by other columns too. I am familiar with sort, but not with how to specify a particular column when the data is tab-delimited.

Thanks again!


 PostPosted: Thu Jul 01, 2010 12:28 pm   

Joined: Wed Jun 30, 2010 5:15 pm
Posts: 4
I heard Perl is not good for files this size (over 4 million rows). True or not?


 PostPosted: Thu Jul 01, 2010 1:24 pm   

Joined: Sun Jun 27, 2010 12:57 am
Posts: 192
mattdaddym wrote:
Of course, I'll be greedy and ask: how can I sort by one of these columns? It will be the last column, which contains text (not numeric) data, but I may need to sort by other columns too. I am familiar with sort, but not with how to specify a particular column when the data is tab-delimited.

Tab delimiting is no problem for sort: by default it splits fields on runs of blanks (spaces and tabs). You can set a different separator with sort's -t option, and pick a column other than the first with the -k option; sort's manual page has the details. A short example follows.
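A minimal sketch, assuming the deduplicated output.file from above and the four columns of the example data:
Code:
# sort on the 4th (last) column, as text; $'\t' makes tab the field separator
sort -t $'\t' -k4,4 output.file > sorted.file

# or sort numerically on the first column (the ID)
sort -t $'\t' -k1,1n output.file > sorted_by_id.file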


 PostPosted: Thu Jul 01, 2010 3:50 pm   

Joined: Tue Apr 27, 2010 2:28 pm
Posts: 172
Location: Czech Republic
mattdaddym wrote:
I heard Perl is not good for files this size (over 4 million rows). True or not?

If the data do not contain many duplicates, the program I suggested may use a lot of memory, because it has to remember every ID it encounters. In that case you can, for example, split your data according to the first character of the ID and process the smaller parts separately; a sketch of the splitting step is below.
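A minimal sketch of that split, assuming numeric IDs as in the example data (the part.* names are just a choice):
Code:
# write each row to part.X, where X is the first character of its ID
awk -F '\t' '{ print > ("part." substr($1, 1, 1)) }' input.file

Each part.* file can then be deduplicated on its own and the results concatenated.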

