Use sed/awk to get the value after a particular HTML tag?

Posted: Thu Jan 19, 2012 8:41 am

I am trying to find a way to get the numerical value from a line of HTML code, where the value falls right after a particular HTML tag. Here is an example of what I am talking about:

<TD align="middle" valign="middle"><font face="Arial" size="3">100

The above is just an example, but every value that I need comes right after the size="3">. For the example, I want to grab just the number 100 and remove everything before it, including the size="3"> itself. There are other areas in the HTML that don't necessarily look like this, but all of the numerical values that I need have the size="3"> right before them, so I was thinking I could use that as a search pattern. The quotes around the 3 are screwing me up too.

I was tinkering with something I found on the net about grabbing a single character after a certain word:
sed 's/.*size="3">\(.\).*/\1/' report_file.htm

Also, the size="3"> pattern is found multiple times in the same line, so whatever I end up doing also has to account for that (meaning, not deleting anything past the first found instance of size="3"> on a particular line).
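
A note on the one-liner in the post: `\(.\)` captures exactly one character, and the leading `.*` is greedy, so the match lands on the last size="3"> of a line rather than the first. The sketch below shows that behavior and then one possible awk alternative that splits each line on the tag and prints the leading digits of every piece after it. The sample line is made up for illustration (modeled on the example in the post), and the sketch assumes each value is a plain run of digits directly after size="3">.

```shell
# Hypothetical sample: two cells on one line, modeled on the example in the post.
sample='<TD align="middle" valign="middle"><font face="Arial" size="3">100</font></TD><TD><font face="Arial" size="3">250</font></TD>'

# The original attempt, widened to capture a run of digits. The greedy .* still
# skips ahead to the LAST size="3"> on the line, so only one value comes out.
last_only=$(printf '%s\n' "$sample" | sed -n 's/.*size="3">\([0-9][0-9]*\).*/\1/p')

# awk alternative: use the tag as the field separator, then print the leading
# digits of every field after the first, one value per output line.
all_values=$(printf '%s\n' "$sample" | awk -F'size="3">' '{
  for (i = 2; i <= NF; i++)
    if (match($i, /^[0-9]+/))
      print substr($i, RSTART, RLENGTH)
}')

printf 'sed (greedy): %s\n' "$last_only"
printf 'awk (all):\n%s\n' "$all_values"
```

On this sample the sed line yields only 250, while the awk loop yields 100 and 250 on separate lines. Against the real file, the awk command would take report_file.htm as its argument instead of reading from printf.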


Posted: Thu Jan 19, 2012 2:52 pm

Wanted to report that I figured it out, but I know there MUST be an easier way! Here is what I did, with comments (maybe to help others if they ever googled the same thing). *I got a lot of this from various nooks and crannies on the net...just saying that up front.
cat report_file.htm \
| sed "s/<\/TD>/~/g" \
| sed "s/size=\"3\">/~&/g" \
| tr "~" "\n" \
| grep "size=\"3\">" \
| sed -e :a -e 's/<[^>]*>//g;//ba' \
| sed '1,18d' \
| cut -d">" -f2 \
| sed 's/[^[:digit:]]//g' \
| sed '/./!d' \
| sed -e :a -e '{N; s/\n/\t/g; ta}'

# sed #1 - replacing all of the closing cell tags (</TD>) with ~
# sed #2 - putting a ~ in front of every size="3"> found
# tr - turning every ~ into a newline, so each piece gets its own line
# grep - showing only those lines with size="3"> (the TD tag was screwing me up since multiple size="3"> were in some of those lines, hence the reason for sed #1)
# sed #3 - removing the rest of the html tags
# sed #4 - removing the 1st 18 lines as they had a bunch of crap in them irrelevant to what I was needing
# cut - changing the delimiter to > and printing the 2nd field
# sed #5 - removing all remaining non-numerical junk
# sed #6 - removing the empty lines left over from sed #5
# sed #7 - changing the vertical output to horizontal and adding a tab between each entry

Again, I am MORE than open to hearing of an easier way, as I know there has to be one, but for the time being this works for what I need to do.
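
One possibly shorter route, sketched under some assumptions: `grep -o` (available in GNU and BSD grep, though not required by POSIX) prints every match on its own line, which sidesteps the multiple-matches-per-line problem without the ~ placeholder. The sample data and the `skip` variable below are hypothetical. `skip` stands in for the `sed '1,18d'` step, and the sketch only picks up plain digit runs that sit directly after size="3">, so it is not a drop-in equivalent of the pipeline above.

```shell
# Hypothetical sample: three values spread over two lines.
sample='<TD><font face="Arial" size="3">100</font></TD><TD><font face="Arial" size="3">250</font></TD>
<TD><font face="Arial" size="3">7</font></TD>'

skip=0   # hypothetical: number of leading matches to drop (the pipeline above drops 18)

row=$(printf '%s\n' "$sample" \
  | grep -o 'size="3">[0-9][0-9]*' \
  | sed 's/.*>//' \
  | tail -n +"$((skip + 1))" \
  | paste -s -)

# grep -o : one match per output line, e.g. size="3">100
# sed     : strip everything up to and including the >
# tail    : skip the first $skip values
# paste -s: join the remaining lines into one tab-separated row
printf '%s\n' "$row"
```

With `skip=0` the sample produces the three values 100, 250 and 7 on one line, separated by tabs. For the real report, the printf would be replaced by reading report_file.htm directly in the grep step.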

Posted: Fri Jan 20, 2012 11:04 am

I'd use Perl and its HTML::TableExtract.
