
downloading pages from an input text file; bookmarks, etc.


All times are UTC - 6 hours


 PostPosted: Tue Jan 20, 2009 1:42 pm   

Joined: Tue Jan 20, 2009 1:24 pm
Posts: 1
hello. it would be of great use to me to have a script of the following sort.

- initially we will have a text file containing URLs. these URLs are yanked from vimperator and placed into the file in the following format:
<URL1>
<URL2>
...

with no trailing spaces and a newline after each URL, so they can be read consistently. this is done manually. let this file be known as "bkurlsrc", placed in a fixed directory.

now, bkurlsrc is to be read (by some means of which i am unaware :P) and taken as input by wget. wget should download those pages only, following no links, along with the files of each page, saved in .html format. then, based on the date only (to day precision, e.g. 01/20/09), a folder named "012009" (or similar format) is to be created if it does not already exist, and the downloaded .html files moved from the current directory into that folder. using a tool such as html2text, all of these html files should then be converted to text, with the original htmls still kept.
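For reference, the "those pages only, following no links, with the files of the page, in .html format" part maps onto a couple of GNU wget flags. A minimal sketch (it prints the command rather than fetching, since the URL here is just a placeholder):

```shell
#!/bin/bash
# Sketch of the wget invocation for a single page with its files.
# -p / --page-requisites: also download files needed to display the page (images, CSS)
# -E / --adjust-extension: save HTML pages with an .html suffix
# DRYRUN=1 (the default here) just prints the command; the URL is a placeholder.
url="http://example.com/"
cmd=(wget -p -E "$url")
if [ "${DRYRUN:-1}" = 1 ]; then
    echo "${cmd[*]}"
else
    "${cmd[@]}"
fi
```

wget follows no links by default (recursion is opt-in via -r), so no extra flag is needed for that part.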

now, the file "bkurlsrc" should be copied twice. one copy is cleared of all text; the other is renamed with the date and hour/minute/second appended to the filename. this renamed file is then copied again: one copy goes into a folder "LOGS", and the other into the folder 012009.
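This copy/clear/log step can be sketched in bash as follows. The base directory, the date formats, and the sample-content line are all placeholder assumptions; in real use bkurlsrc would already exist with your URLs in it:

```shell
#!/bin/bash
# Sketch of the copy/clear/log step. BKBASE stands in for the fixed
# directory holding bkurlsrc; the date formats are assumptions.
base="${BKBASE:-$HOME}"
src="$base/bkurlsrc"
stamp=$(date +%m%d%y_%H%M%S)       # date plus hour/minute/second
datedir="$base/$(date +%m%Y)"      # e.g. 012009 for January 2009
logdir="$base/LOGS"

printf 'http://example.com/\n' > "$src"   # sample content for illustration only

mkdir -p "$logdir" "$datedir"
cp "$src" "$logdir/bkurlsrc.$stamp"       # one timestamped copy into LOGS
cp "$src" "$datedir/bkurlsrc.$stamp"      # the other into the dated folder
: > "$src"                                # clear the original for the next run
```

The `: > file` idiom truncates a file to zero length without deleting it, which matches the "cleared of all text" requirement.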

OK, i think that's it. i'd appreciate it if you could help me. i am new to bash, and besides being useful, this will help me learn some of the basic functions. :)


 PostPosted: Wed Jan 21, 2009 4:12 am   

Joined: Mon Nov 17, 2008 7:25 am
Posts: 221
If I understand you correctly, this is actually quite simple.

Code:
#!/bin/bash
input_file=/path/to/url_file.txt
output_path=/path/to/where/url/data/is/to/be/stored
log_path=/path/to/where/logs/should/be/stored
# Do not change $date or $timestamp unless you want to reformat the dates
date=$(date +%Y%m%d)
timestamp=$(date +%Y%m%d%H%M)

mkdir -p "${output_path}/${date}"
# Read one URL per line; quoting keeps URLs with odd characters intact
while read -r url; do
    [ -n "$url" ] || continue
    ( \
       cd "${output_path}/${date}" || exit 1; \
       wget "${url}"; \
    )
    html2text-or-whatever-program-line-you-want
done < "${input_file}"
# Archive the URL list once, after the loop -- not once per URL
cp "${input_file}" "${output_path}/${date}/$(basename "${input_file}").${timestamp}"
cp "${input_file}" "${log_path}/$(basename "${input_file}").${timestamp}"


This will create a new folder named after today's date in the path specified in $output_path. Then it'll just wget down each URL listed in the text file specified by $input_file.
Also, you will have to change html2text-or-whatever-program-line-you-want to some program that does what you want.
And you should take a look at the variables at the top. Those control where files should be stored and where to look for the input.
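For instance, assuming html2text is installed, that placeholder could be replaced with a loop like this (the CONVERTER override and the function name are my own additions, so you can swap in any html-to-text tool):

```shell
#!/bin/bash
# Sketch: convert every .html file in a folder to text, keeping the originals.
# Assumes html2text is installed; set CONVERTER=... to use another tool.
convert_dir() {
    local dir="$1" converter="${CONVERTER:-html2text}"
    for f in "$dir"/*.html; do
        [ -e "$f" ] || continue              # no .html files matched the glob
        "$converter" "$f" > "${f%.html}.txt" # page.html -> page.txt
    done
}

convert_dir "${1:-.}"   # e.g. convert_dir "$output_path/$date"
```

The `${f%.html}.txt` expansion strips the .html suffix and appends .txt, so the originals are left untouched next to the converted files.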

Best regards
Fredrik Eriksson







Powered by phpBB © 2011 phpBB Group
© 2003 - 2011 USA LINUX USERS GROUP