Extract and order some info from an HTML page


All times are UTC - 6 hours


Posted: Fri Mar 20, 2009 10:36 am

Joined: Fri Mar 20, 2009 10:14 am
Posts: 4
Hi :)

From an HTML page I need to extract some info (the link names and their URLs).

Page links example:

Code:
<a href="http://www.localhost/topic.php?f=15&amp;t=12345"
target="_blank">LINK NAME</a><br />
<a href="http://www.localhost/topic.php?f=15&amp;t=13467"
target="_blank">OTHER LINK NAME</a><br />
<a href="http://www.localhost/topic.php?f=15&amp;t=987678"
target="_blank">ANOTHER LINK NAME</a><br />
...



And I need this:

Code:
LINK NAME
http://www.localhost/topic.php?f=15&amp;t=12345

OTHER LINK NAME
http://www.localhost/topic.php?f=15&amp;t=13467

ANOTHER LINK NAME
http://www.localhost/topic.php?f=15&amp;t=987678

...


Thanks for your support!

BR :)


Posted: Sat Mar 21, 2009 3:51 am

Joined: Mon Nov 17, 2008 7:25 am
Posts: 221
Code:
#!/bin/bash
file=$1
IFS="
"
for i in $(cat $file); do
   url=$(echo $i | sed "s/^[^\"]\+\"\([^\"]\+\)\".*$/\1/")
   name$(echo $i | sed "s/^[^>]\+>\([^<]\+\).*$/\1/")
   if [ ! -z $url && ! -z $name ]; then
       echo "$name"
       echo "$url"
       unset name url
   fi
done


Usage: ./script.sh file-containing-links.txt
That should do what you want :)

Best regards
Fredrik Eriksson


Posted: Sat Mar 21, 2009 8:23 am

Joined: Fri Mar 20, 2009 10:14 am
Posts: 4
Thank you Fredrik for your support :)

But output is...

./script.sh links.txt

Code:
./script.sh: line 7: name<a href="http://www.localhost/topic.php?f=15&amp;t=12345": No such file or directory
./script.sh: line 8: [: missing `]'
./script.sh: line 7: nameLINK NAME: command not found
./script.sh: line 8: [: missing `]'
./script.sh: line 7: name<a href="http://www.localhost/topic.php?f=15&amp;t=13467": No such file or directory
./script.sh: line 8: [: missing `]'
./script.sh: line 7: nameOTHER LINK NAME: command not found
./script.sh: line 8: [: missing `]'
./script.sh: line 7: name<a href="http://www.localhost/topic.php?f=15&amp;t=987678": No such file or directory
./script.sh: line 8: [: missing `]'
./script.sh: line 7: nameANOTHER LINK NAME: command not found
./script.sh: line 8: [: missing `]'


Posted: Mon Mar 23, 2009 4:50 am
Moderator

Joined: Thu Oct 11, 2007 7:12 am
Posts: 229
Location: London - UK
Wrong type of 'and' in the if statement (&& does not work inside [ ]; -a does). I have corrected it. I also added some extra quotes and braces to reduce the chances of weird filename issues and empty-string errors.

Code:
#!/bin/bash
file=$1
IFS="
"
for i in $(cat "${file}"); do
   url=$(echo "${i}" | sed "s/^[^\"]\+\"\([^\"]\+\)\".*$/\1/")
   name=$(echo "${i}" | sed "s/^[^>]\+>\([^<]\+\).*$/\1/")
   if [ ! -z "${url}" -a ! -z "${name}" ]; then
       echo "${name}"
       echo "${url}"
       unset name url
   fi
done


DW
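
For readers following along, a minimal sketch of the 'and' fix described above. Inside single brackets, && ends the [ command early, which is what produces the "missing `]'" errors shown earlier in the thread. The two working alternatives are -a inside one test, as in the corrected script, or two complete tests joined by &&, as below. The function name both_set is made up for illustration.

```shell
#!/bin/sh
# both_set is a hypothetical helper: it succeeds only if both arguments are non-empty.
both_set() {
    # Two complete [ ] tests joined by the shell's && operator.
    # Writing [ -n "$1" && -n "$2" ] instead would fail: the shell cuts the
    # command at &&, so [ never sees its closing ].
    [ -n "$1" ] && [ -n "$2" ]
}

if both_set "LINK NAME" "http://www.localhost/topic.php"; then
    echo "both values are set"
fi
```

Either form behaves the same for this script; the two-test form is often preferred because each test stays simple.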


Posted: Mon Mar 23, 2009 4:50 am
Moderator

Joined: Thu Oct 11, 2007 7:12 am
Posts: 229
Location: London - UK
oh yeah, i also corrected the missing '=' :)

DW


Posted: Mon Mar 23, 2009 6:36 pm

Joined: Fri Mar 20, 2009 10:14 am
Posts: 4
thank you @DarthWavy, with your script the result is:


Code:
<a href="http://www.localhost/topic.php?f=15&amp;t=12345"
http://www.localhost/topic.php?f=15&amp;t=12345
LINK NAME
_blank
<a href="http://www.localhost/topic.php?f=15&amp;t=13467"
http://www.localhost/topic.php?f=15&amp;t=13467
OTHER LINK NAME
_blank
<a href="http://www.localhost/topic.php?f=15&amp;t=987678"
http://www.localhost/topic.php?f=15&amp;t=987678
ANOTHER LINK NAME
_blank


With sed '/_blank/d' I can remove the "_blank" lines.

Output result:

Code:
<a href="http://www.localhost/topic.php?f=15&amp;t=12345"
http://www.localhost/topic.php?f=15&amp;t=12345
LINK NAME
<a href="http://www.localhost/topic.php?f=15&amp;t=13467"
http://www.localhost/topic.php?f=15&amp;t=13467
OTHER LINK NAME
<a href="http://www.localhost/topic.php?f=15&amp;t=987678"
http://www.localhost/topic.php?f=15&amp;t=987678
ANOTHER LINK NAME


But how can I obtain this output?
(reverse name/link)


Code:
LINK NAME
http://www.localhost/topic.php?f=15&amp;t=12345

OTHER LINK NAME
http://www.localhost/topic.php?f=15&amp;t=13467

ANOTHER LINK NAME
http://www.localhost/topic.php?f=15&amp;t=987678

...


Posted: Tue Mar 24, 2009 6:19 am
Moderator

Joined: Thu Oct 11, 2007 7:12 am
Posts: 229
Location: London - UK
I'm sorry, I corrected the bash issues, but I don't know sed very well; I tend to use perl for that type of thing. Hopefully Fredrik can tweak that bit :)


Posted: Tue Mar 24, 2009 7:35 am

Joined: Mon Nov 17, 2008 7:25 am
Posts: 221
Well, my seds are just looking for specifically formatted lines.

This is an issue because each a href tag spans multiple lines.
One thing I didn't think of is that sed prints a line unchanged when the pattern doesn't match, so both url and name are set on the very first pass and the script echoes because of it.

That script is flawed; the regexps are as correct as I can make them, but the script isn't.

Sadly I'm a bit too busy atm to write a correction.

Best regards
Fredrik Eriksson
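
One way around the multi-line problem described above is to join the lines before matching. The following is only a sketch, not the script from this thread: it assumes every link has the form <a href="URL" ...>NAME</a> with href as the first attribute and no nested tags in NAME. The function name extract_links is made up for illustration.

```shell
#!/bin/sh
# extract_links FILE  (hypothetical helper name)
# Prints NAME, then URL, then a blank line for each <a href="URL" ...>NAME</a>.
extract_links() {
    # Turn newlines into spaces so a tag split over two lines becomes one string.
    tr '\n' ' ' < "$1" |
    awk '{
        s = $0
        # Take one complete anchor at a time from the front of the string.
        while (match(s, /<a[ \t]+href="[^"]*"[^>]*>[^<]*<\/a>/)) {
            tag = substr(s, RSTART, RLENGTH)
            s = substr(s, RSTART + RLENGTH)

            url = tag
            sub(/^<a[ \t]+href="/, "", url)   # drop everything before the URL
            sub(/".*$/, "", url)              # drop everything after the URL

            name = tag
            sub(/^[^>]*>/, "", name)          # drop the opening tag
            sub(/<\/a>$/, "", name)           # drop the closing tag

            print name
            print url
            print ""
        }
    }'
}
```

Usage would be something like extract_links links.txt. Because the name is printed before the URL, this also gives the name/link order asked for earlier in the thread.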


Posted: Wed Apr 01, 2009 1:30 pm
Moderator

Joined: Wed May 03, 2006 2:05 pm
Posts: 242
Hi!

How're you getting your data? Is it handed to you in a file, or can you choose how it's parsed from the web page? The problem with what we're working with here is that it's all split up on separate lines. If you could parse the page with something like this, it would grab each "a" tag individually (and dump into a file):

Code:
lynx -source google.com | grep -Po '<a.*?/a>' > dump.txt


I'm intrigued by Fredrik's regex in his sed syntax, and I'm going to have to poke at it and see how it works :) In the meantime, I typically use perl for something like this, because sed and awk only do "greedy" matching, which makes it hard to match just the text between two tags (like <a and /a>). That is also why I used "-P" in the grep statement above: it enables Perl-style regex, which supports non-greedy quantifiers like .*? Here's something quick and dirty (the regex would need a lot of refinement for anything beyond a basic example; it even misses some things in my google example, but it parses your output perfectly as long as each tag is all on one line):

Code:
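open(FILE, "dump.txt") or die "Cannot open dump.txt: $!";
while ($line = <FILE>) {
  # $1 is the href value, $2 is the link text; lines without a link are skipped
  if ($line =~ /<a.href="(.*?)".*?>(.*?)<.*$/) {
    print "$2\n$1\n\n";
  }
}
close(FILE);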


Hope this helps... I need to learn more about POSIX regex :)

-J


Posted: Wed Apr 01, 2009 2:12 pm

Joined: Mon Nov 17, 2008 7:25 am
Posts: 221
Jeo, I believe my regexps work in perl as well.

When cleaned up a bit they look like this:
Code:
s/^[^\"]+\"([^\"]+)\".*$/\1/
s/^[^>]+>([^<]+).*$/\1/


[^>]+ is just an expression for everything up to a ">" sign. It's not really necessary, but I like it more than .* since it's a lot more exact about where the match stops.

Best regards
Fredrik Eriksson
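
To make the difference between .* and [^>]+ concrete, here is a small sketch that runs both variants through sed. It assumes a sed that accepts -E for extended regular expressions (current GNU and BSD sed do; older GNU sed spells it -r). The function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers: each prints what the capture group picked up from one line of HTML.

# Greedy: .* runs to the LAST ">" on the line, so the group after it is empty.
name_greedy() {
    printf '%s\n' "$1" | sed -E 's/^.*>([^<]*).*$/\1/'
}

# Bounded: [^>]+ cannot cross a ">", so it stops at the end of the opening tag.
name_bounded() {
    printf '%s\n' "$1" | sed -E 's/^[^>]+>([^<]+).*$/\1/'
}

line='<a href="http://www.localhost/topic.php?f=15&amp;t=12345" target="_blank">LINK NAME</a><br />'
name_greedy "$line"     # prints an empty line
name_bounded "$line"    # prints: LINK NAME
```

The bounded form gets the effect of a non-greedy match without needing Perl-style regex support.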


Posted: Thu Apr 02, 2009 5:16 pm
Moderator

Joined: Wed May 03, 2006 2:05 pm
Posts: 242
Yeah! figured that out last night after some playing. That is a cool one, thanks for the update!


Posted: Fri Apr 03, 2009 1:28 pm

Joined: Fri Mar 20, 2009 10:14 am
Posts: 4
Many thanks for your help :)

So:

@jeo:
lynx -source google.com|grep -Po \<a.*?\/a\> > dump.txt
Output:
Code:
grep: Support for the -P option is not compiled into this --disable-perl-regexp binary

I use the grep from the Debian repository...
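
If grep was built without -P, the non-greedy .*? can usually be replaced by negated bracket expressions, which work in plain grep. A minimal sketch, assuming a grep that supports -o (GNU and BSD grep do) and anchors whose link text contains no nested tags. The function name anchors_only is made up for illustration.

```shell
#!/bin/sh
# anchors_only (hypothetical helper): read HTML on stdin, print one <a ...>...</a> per line.
anchors_only() {
    # Join lines first so that tags split across lines still match,
    # then let grep -o print only the matching parts.
    tr '\n' ' ' | grep -o '<a [^>]*>[^<]*</a>'
}
```

It would slot into the earlier pipeline as: lynx -source google.com | anchors_only > dump.txt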


@fredrik.eriksson
With this code:

Code:
#!/bin/bash
file=dump.txt
IFS="
"
for i in $(cat "${file}"); do
   url=$(echo "${i}" | sed "s/^[^\"]+\"([^\"]+)\".*$/\1/")
   name=$(echo "${i}" | sed "s/^[^>]+>([^<]+).*$/\1/")
   if [ ! -z "${url}" -a ! -z "${name}" ]; then
       echo "${name}"
       echo "${url}"
       unset name url
   fi
done


I get only errors as output...

Thanks :)
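
A likely cause of these errors, offered as an assumption since the actual messages are not shown: the cleaned-up regexps use unescaped + and ( ), which plain sed treats as literal characters, so \1 refers to a group that does not exist. With sed -E they are read as extended regular expressions. The sketch below also uses sed -n with the p flag so that lines without a link produce nothing, and it reads the file with while read instead of for ... in $(cat ...). It assumes dump.txt holds one complete <a> tag per line; print_links is a made-up name.

```shell
#!/bin/sh
# print_links FILE (hypothetical helper): NAME, URL, blank line for each one-line <a> tag.
print_links() {
    while IFS= read -r line; do
        # -n plus the p flag: print only when the substitution matched,
        # so non-matching lines leave the variable empty.
        url=$(printf '%s\n' "$line" | sed -n -E 's/^[^"]+"([^"]+)".*$/\1/p')
        name=$(printf '%s\n' "$line" | sed -n -E 's/^[^>]+>([^<]+).*$/\1/p')
        if [ -n "$url" ] && [ -n "$name" ]; then
            printf '%s\n%s\n\n' "$name" "$url"
        fi
    done < "$1"
}
```

Usage would be something like print_links dump.txt.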

