get text between two tags


All times are UTC - 6 hours


Posted: Wed Feb 23, 2011 4:43 am

Joined: Wed Feb 23, 2011 4:34 am
Posts: 3
Hi,

I have a sample text file:

Code:
<category name="Temp1">something1</category><!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
</TD></TR></TABLE></BODY></HTML>
<category name="Temp2">something2
</category>


Newlines may or may not occur in the file.

I would like to extract only the parts of the file enclosed by each pair of matching 'category' tags, so for this example:

Code:
<category name="Temp1">something1</category><category name="Temp2">something2</category>


I am trying to get awk to do this as follows:

Code:
awk -F "</?category.*>" '{ print $1 }' file.txt


But this command gives me only:

Code:
</TD></TR></TABLE></BODY></HTML>


Could anyone show me how to write the command properly?

Regards,
Robert


Posted: Wed Feb 23, 2011 9:49 am
Moderator

Joined: Wed May 03, 2006 2:05 pm
Posts: 242
Hi Robert, and welcome!

Try something like this:

Code:
sed -n 's/^.*<category.name="[^"]*">\([^<]*\).*/\1/p'


I ran this against your sample text and got:

Code:
$ cat tmp.txt
<category name="Temp1">something1</category><!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
</TD></TR></TABLE></BODY></HTML>
<category name="Temp2">something2
</category>

$ sed -n 's/^.*<category.name="[^"]*">\([^<]*\).*/\1/p' tmp.txt
something1
something2


I hope this helps!
-J


Posted: Wed Feb 23, 2011 9:58 am
Moderator

Joined: Wed May 03, 2006 2:05 pm
Posts: 242
So, a further explanation of the sed expression posted earlier:

We're using sed's substitute command together with its print flag, and capturing the data that we want using escaped parentheses \( \).

sed -n
-> This tells sed to run 'silently', only printing what we specify with the 'p' flag

s/
-> This begins the substitute command; the pattern to search for follows

^.*<category.name="[^"]*">
-> This part of the regex matches anything starting at the beginning of the line, up through the opening 'category' tag. The '.' between 'category' and 'name' matches any single character, here the space

\([^<]*\).*
-> This groups together everything after the opening category tag that *doesn't* match the '<' character, which should mark the beginning of the closing </category> tag. The trailing .* matches the rest of the line so that it gets discarded

/\1/
-> This is the replacement: the whole matched line is replaced with the contents of that first set of parens. If there were multiple sets of parentheses, the next set would be \2, then \3 and so on.

p
-> The "p" flag tells sed to print the line, but only if the substitution actually matched.
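
To make the \1, \2 numbering concrete, here is a small sketch that is not from the original command: it adds a second pair of parentheses around the name attribute, so \1 becomes the name and \2 the text. The scratch file created with mktemp and the "name: text" output format are just assumptions for the illustration.

```shell
# Recreate the sample file from the first post in a scratch file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<category name="Temp1">something1</category><!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
</TD></TR></TABLE></BODY></HTML>
<category name="Temp2">something2
</category>
EOF

# \1 captures the value of the name attribute,
# \2 captures the text after the opening tag up to the next '<'.
result=$(sed -n 's/^.*<category.name="\([^"]*\)">\([^<]*\).*/\1: \2/p' "$sample")
printf '%s\n' "$result"
# Temp1: something1
# Temp2: something2

rm -f "$sample"
```

Lines without an opening category tag do not match, so with -n they are simply not printed.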

I hope this helps!
-J


Posted: Fri Mar 04, 2011 2:51 am

Joined: Wed Feb 23, 2011 4:34 am
Posts: 3
Hi Jeo,

Thanks for your comprehensive answer. You are right: it was my mistake to trim my sample file down too much :/

Imagine a slightly more complex file:

Code:
<category name="Temp1">something1<blah>some<test>aa</test></blah></category>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
</TD></TR></TABLE></BODY></HTML>
<category name="Temp2">something2<cat><test1>aa</test1>ww</cat></category>


I would like to get:

Code:
<category name="Temp1">something1<blah>some<test>aa</test></blah></category><category name="Temp2">something2<cat><test1>aa</test1>ww</cat></category>


from it. Notice that I would like to have my 'category' tags included in my result.

Could you tell me how to rewrite the command?


Posted: Fri Mar 04, 2011 8:03 am
Moderator

Joined: Wed May 03, 2006 2:05 pm
Posts: 242
There are a few ways to do this actually! Let's start with the sed command we already used above, but instead of using (parentheses) to isolate what's between the tags, let's just print the tags and everything between!

Code:
sed -n 's/^.*\(<category.*category>\).*/\1/p'


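This will break if the opening and closing category tags are on separate lines though, because sed works one line at a time. Also, if several category elements share a line, only the last one is printed: the leading .* is greedy and swallows everything up to the last opening tag.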


Posted: Mon Mar 07, 2011 4:15 am

Joined: Wed Feb 23, 2011 4:34 am
Posts: 3
Jeo,

I really appreciate your help. I have one more question.

What if I cannot tell in advance whether the file contains any newline characters? For example, the file could look like this:

Code:
<category name="Temp1">something1<blah>some<test>aa</test></blah></category><!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"></TD></TR></TABLE></BODY></HTML><category name="Temp2">something2<cat><test1>aa</test1>ww</cat></category>


Posted: Fri Mar 25, 2011 12:11 pm

Joined: Tue Apr 27, 2010 2:28 pm
Posts: 172
Location: Czech Republic
I'd use Perl and some of the HTML parsers written for it.

