separate text from one long text line


All times are UTC - 6 hours


 PostPosted: Thu Oct 30, 2008 11:43 am   

Joined: Thu Oct 30, 2008 11:39 am
Posts: 8
Hi,

I have one very long text line like this:
<tr bgcolor='#FFFFFF'><td width='50%' class='un_subj'><a href=../result/1237.html>Waitec </a></td><td width='50%' class='un_subj'><a href=../result/455.html>EXP Computers </a></td></tr><tr bgcolor='#FFFFFF'><td width='50%' class='un_subj'><a href=../result/1242.html>Wearnes </a></td><td width='50%' class='un_subj'><a href=../result/475.html>Formosa </a></td></tr><tr bgcolor='#FFFFFF'><td width='50%' class='un_subj'>..........

I need to extract each /result/????.html onto its own line in a file.

How do I do it?

Thanks


 PostPosted: Thu Oct 30, 2008 6:48 pm   

Joined: Wed Sep 24, 2008 11:32 pm
Posts: 7
Hi, I found a solution with sed and grep. Give it a try:

Suppose the long line is in a file named blah, then
Code:
sed 's/\/result\/[0-9]*\.html/\n&\n/g' blah | grep result

That prints what you want. If necessary, redirect the output to a text file by appending ">>file.txt".
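
To see what each stage contributes, here is a small sketch of the same split-then-filter idea. It assumes GNU sed (which turns \n in the replacement into a newline), IDs made only of digits as in the sample, and a shortened sample line; the file names blah and file.txt are just placeholders.

```shell
# Sample data: a shortened version of the long line (placeholder file name "blah").
cat > blah <<'EOF'
<tr><td><a href=../result/1237.html>Waitec </a></td><td><a href=../result/455.html>EXP Computers </a></td></tr>
EOF

# Stage 1 (sed): wrap every /result/NNN.html in newlines so it sits on its own line.
#                Relies on GNU sed treating \n in the replacement as a newline.
# Stage 2 (grep): drop the leftover HTML fragments, keep lines starting with /result/.
sed 's/\/result\/[0-9]*\.html/\n&\n/g' blah | grep '^/result/' > file.txt

cat file.txt
```

With this sample, file.txt should end up with two lines, /result/1237.html and /result/455.html. Note that ">" overwrites file.txt on each run, while ">>" appends to it.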


 PostPosted: Fri Oct 31, 2008 3:01 am   

Joined: Thu Oct 30, 2008 11:39 am
Posts: 8
thanks a lot, that worked


 PostPosted: Sat Nov 01, 2008 9:27 pm   
Moderator

Joined: Wed May 03, 2006 2:05 pm
Posts: 242
For something with fewer moving parts, try this!

Code:
grep -Eo '/result/[0-9]*\.html'


You don't have to escape the slashes in grep, but the dot needs a backslash so that it only matches a literal "."... grep -o is one of my favoritest commands ever for scripting stuff like this! Here's what I get for your sample line:

Code:
$ grep -Eo '/result/[0-9]*\.html' log.txt
/result/1237.html
/result/455.html
/result/1242.html
/result/475.html


Hope this helps!
-J
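
Since the original question asked for the results in a file, here is a minimal end-to-end sketch of the grep -o approach. It assumes a grep that supports -o (GNU and BSD grep do; it is not part of POSIX), and the output file name urls.txt is only a placeholder.

```shell
# Write the sample rows from the first post to log.txt as one long line.
cat > log.txt <<'EOF'
<tr bgcolor='#FFFFFF'><td width='50%' class='un_subj'><a href=../result/1237.html>Waitec </a></td><td width='50%' class='un_subj'><a href=../result/455.html>EXP Computers </a></td></tr><tr bgcolor='#FFFFFF'><td width='50%' class='un_subj'><a href=../result/1242.html>Wearnes </a></td><td width='50%' class='un_subj'><a href=../result/475.html>Formosa </a></td></tr>
EOF

# -E: extended regex, so + means "one or more digits".
# -o: print only the matched text, one match per output line.
# \.: a literal dot; a bare . would match any character.
grep -Eo '/result/[0-9]+\.html' log.txt > urls.txt

# Show how many paths were extracted.
wc -l < urls.txt
```

For the four links in this sample, urls.txt should contain four lines, one path each.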


 PostPosted: Sun Nov 02, 2008 4:41 am   

Joined: Thu Oct 30, 2008 11:39 am
Posts: 8
wow, I use grep a lot but did not know that you can do this with it

thanks


 PostPosted: Mon Nov 03, 2008 5:54 am   

Joined: Wed Sep 24, 2008 11:32 pm
Posts: 7
thx jeo! That's fabulous!


 PostPosted: Mon Nov 03, 2008 2:47 pm   

Joined: Thu Oct 30, 2008 11:39 am
Posts: 8
another question, what if I want to extract everything between html> and </a> from the same line?
Example:
...<a href=../result/158.html>Atapi</a></td><td ...
after extraction I should see Atapi

Note: some names have two words or use special characters like () and '

Thanks in advance


 PostPosted: Mon Nov 03, 2008 3:59 pm   
Moderator

Joined: Wed May 03, 2006 2:05 pm
Posts: 242
Now you're getting fancy... I know there's a better way to do this, using regex and parentheses, but here's the first thing that came to mind...

Code:
grep -Eo 'l>[a-zA-Z ]*<\/a' log.txt |sed 's/l>\|<\/a//g'


If I figure out the right way to do it I'll post it later :)

EDIT: forgot to add sample output... This is on debian, so your results may vary...

Code:
$ grep -Eo 'l>[a-zA-Z ]*<\/a' log.txt |sed 's/l>\|<\/a//g'
Waitec
EXP Computers
Wearnes
Formosa


 PostPosted: Tue Nov 04, 2008 10:01 am   

Joined: Thu Oct 30, 2008 11:39 am
Posts: 8
odd, I've tried something similar but it did not want to work

it does not include the names with special characters, but I can do those manually since there are not many

thanks a lot for your help


 PostPosted: Tue Nov 04, 2008 2:38 pm   
Moderator

Joined: Wed May 03, 2006 2:05 pm
Posts: 242
Doh! I forgot you mentioned special characters... Digging deeper after work...


 PostPosted: Tue Nov 04, 2008 3:05 pm   
Moderator

Joined: Wed May 03, 2006 2:05 pm
Posts: 242
Okay, I still think there's a cleaner way to do it, but you can modify the one that I sent to include the characters that you expect. I added the parentheses () here:

Code:
grep -Eo 'l>[a-zA-Z() ]*<\/a' log.txt |sed 's/l>\|<\/a//g'

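You should be able to add other special characters to the bracket list as well. I just hope you're not expecting any "<" or ">" characters in the names: the bracket list is what stops each match at its own "</a", and loosening it to something like ".*" would match everything between the first "l>" and the last "</a" on each line. Hope this helps!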

-Jeo
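
One possible "cleaner way", offered as a sketch rather than the definitive answer: instead of listing every allowed character, match anything up to the next "<", then use a parenthesized group in sed to keep only the name. This assumes names never contain "<" and that sed accepts -E (GNU and BSD sed do). The file names and the entry O'Neil (Asia) with ID 9 are made up for illustration.

```shell
# Hypothetical sample line; the last entry has an apostrophe and parentheses.
cat > names.txt <<'EOF'
<td><a href=../result/158.html>Atapi</a></td><td><a href=../result/455.html>EXP Computers </a></td><td><a href=../result/9.html>O'Neil (Asia)</a></td>
EOF

# grep: [^<]* is "any run of characters up to the next <", so quotes,
#       parentheses and digits need no special listing.
# sed:  the first command keeps only the text captured by ( ... ),
#       the second strips trailing spaces such as the one after "Computers".
grep -Eo 'html>[^<]*</a' names.txt |
  sed -E 's/^html>(.*)<\/a$/\1/; s/ +$//' > extracted.txt

cat extracted.txt
```

With this sample the output should be Atapi, EXP Computers and O'Neil (Asia), one per line. Because the apostrophe never has to appear inside the single-quoted pattern, there is no shell quoting to work around.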


 PostPosted: Fri Nov 07, 2008 4:55 am   

Joined: Thu Oct 30, 2008 11:39 am
Posts: 8
that did it

thanks

