
Script to download topics from some Russian forums




 PostPosted: Tue Feb 19, 2008 8:09 am   

Joined: Tue Feb 19, 2008 7:52 am
Posts: 1
First of all, I apologize for my bad English :(

The script is written by a newbie and uses wget. It is quite inefficient, since it downloads the pages linked from the main page again even if they have not changed since the last download (see the note after the "download" script below). There are three files:

File "download" is used by other files
Code:
#! /bin/sh
#
# Description:
#   Wrapper over wget that downloads the specified page and all objects it has
#   links to and does its best to convert links and file names for offline viewing
# Arguments:
#   1. site
#   2. page (may be the same as the site)
#   3. output directory
#   4. additional wget options
# Input:
#   None
# Output:
#   None
# Return value:
#   0 if successful, not 0 otherwise
# Examples:
#   None
# Notes:
#   None
# Requirements:
#   wget

# Download
wget $4                         \
  --base="$1"                   \
  --convert-links               \
  --directory-prefix="$3"       \
  --html-extension              \
  --level 1                     \
  --no-verbose                  \
  --page-requisites             \
  --recursive                   \
  --restrict-file-names=windows \
  --span-hosts                  \
  --timeout=10                  \
  "$2"

exit $?
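
I think wget's --timestamping (-N) option could help with the re-downloading mentioned above: wget then compares the server's Last-Modified header with the local copy and skips pages that have not changed. I have not tested how it interacts with --convert-links and --html-extension, so this is only an idea:
Code:
# Untested: the same wget call as in "download", plus --timestamping so that
# unchanged pages are skipped (only works if the server sends Last-Modified)
wget $4                         \
  --base="$1"                   \
  --convert-links               \
  --directory-prefix="$3"       \
  --html-extension              \
  --level 1                     \
  --no-verbose                  \
  --page-requisites             \
  --recursive                   \
  --restrict-file-names=windows \
  --span-hosts                  \
  --timeout=10                  \
  --timestamping                \
  "$2"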

Here is the file "habr", which downloads news from habrahabr.ru:
Code:
#! /bin/bash

CITE=habrahabr.ru
DIR=/tmp/habrahabr.ru
OUTDIR=~/docs/disk/news/habrahabr.ru/

# Do not try to use the --reject option to filter out html pages:
# according to the wget manual it does not work, because wget has to
# download html pages anyway in order to follow the links in them
WGETOPTS=

# Create necessary directories
mkdir -p ${DIR}    || exit 1
mkdir -p ${OUTDIR} || exit 1

echo
echo "DOWNLOADING"
echo

# Download news from main page
download ${CITE} ${CITE} ${DIR} ${WGETOPTS}
[ $? -eq 0 ] || exit 1

echo
echo "PROCESSING DOWNLOADED FILES"
echo

for i in $(find ${DIR}/${CITE} -type f)
do

  NAME=$i
  echo ${NAME}

  # TODO: Process text files only

  # Add information about codepage because Opera for Windows Mobile
  # is not clever enough to detect it
  CHARSET=windows-1251
  META='<meta http-equiv="content-type" content="text/html; charset='${CHARSET}'">'
  sed --in-place 's%<head>%<head>'"${META}"'%g' ${NAME}

done

echo
echo "COPYING TO OUTPUT DIRECTORY"
echo

# Copy everything to the output directory
cp -R ${DIR}/* ${OUTDIR}
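
For the "TODO: Process text files only" in both scripts, something like this at the top of the processing loop might work (untested): ask file(1) for the MIME type and skip everything that is not text, so that sed never touches images or other binary files.
Code:
# Untested: skip non-text files before running sed on ${NAME}
case $(file --brief --mime-type "${NAME}") in
  text/*) ;;       # html and other text: fall through to the sed commands
  *) continue ;;   # images, archives, ...: leave them alone
esac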

And here is the file "lor", which downloads news from linux.org.ru:
Code:
#! /bin/bash

CITE=www.linux.org.ru
DIR=/tmp/linux.org.ru
OUTDIR=~/docs/disk/news/linux.org.ru/
LOCALDIR='/SD-MMC card/news/linux.org.ru/'

# Do not try to use the --reject option to filter out html pages:
# according to the wget manual it does not work, because wget has to
# download html pages anyway in order to follow the links in them
WGETOPTS=-erobots=off

# Create necessary directories
mkdir -p ${DIR}    || exit 1
mkdir -p ${OUTDIR} || exit 1

echo
echo "DOWNLOADING"
echo

# Download news from main page
download ${CITE} ${CITE} ${DIR} ${WGETOPTS}
[ $? -eq 0 ] || exit 1

# Now download screenshots
for i in ${DIR}/${CITE}/gallery/*-icon.jpg
do
  NUM=$(basename $i | sed 's/-icon.jpg$//')
  wget                                         \
    --directory-prefix=${DIR}/${CITE}/gallery/ \
    --no-verbose                               \
    ${CITE}/gallery/${NUM}.jpg
  wget                                         \
    --directory-prefix=${DIR}/${CITE}/gallery/ \
    --no-verbose                               \
    ${CITE}/gallery/${NUM}.png
done

echo
echo "PROCESSING DOWNLOADED FILES"
echo

for i in $(find ${DIR}/${CITE} -type f)
do

  NAME=$i
  echo ${NAME}

  # TODO: Process text files only

  # Delete the html pages that are only used for posting comments,
  # as they are not useful offline
  if echo "${NAME}" | grep -qF 'comment-message.jsp@'
  then
    rm ${NAME}
    continue
  fi

  # Delete html pages whose addresses differ only by "lastmod".
  # I do not know what the purpose of "lastmod" is
  NEWNAME=$(echo ${NAME} | sed 's/&lastmod=[[:digit:]]*//g')
  if test "${NAME}" != "${NEWNAME}"
  then
    if test -f "${NEWNAME}"
    then
      rm ${NAME}
      continue
    else
      mv ${NAME} ${NEWNAME}
      NAME=${NEWNAME}
    fi
  fi

  # Change links accordingly
  sed --in-place 's/&amp;lastmod=[[:digit:]]*//g' ${NAME}

  # Add an html extension because Opera for Windows Mobile requires it
  FILENAME='([^"/]*/)*(([^"/]*\.[^"/.]*[^"/.[:alnum:]][^"/.]*)|([^"/.]*))'
  if echo "${NAME}" | grep -qE '^'"${FILENAME}"'$'
  then
    mv ${NAME} ${NAME}.html
    NAME=${NAME}.html
  fi

  # Change links accordingly
  SEDEXPR='s%href="('${FILENAME}')"%href="\1.html"%g'
  sed --in-place --regexp-extended "${SEDEXPR}" ${NAME}

  # Add information about codepage because Opera for Windows Mobile
  # is not clever enough to detect it
  CHARSET=utf-8
  META='<meta http-equiv="content-type" content="text/html; charset='${CHARSET}'">'
  sed --in-place 's%<head>%<head>'"${META}"'%g' ${NAME}

  # Rewrite the absolute links that wget left unconverted: --convert-links
  # only converts links to files that were actually downloaded, the rest
  # stay as absolute URLs pointing back at the site
  LOCALDIR_="${LOCALDIR}"/${CITE}
  sed --in-place 's%href="http://'${CITE}'%href="'"${LOCALDIR_}"'/%g' ${NAME}
  sed --in-place 's%src="http://'${CITE}'%src="'"${LOCALDIR_}"'/%g' ${NAME}

done

echo
echo "COPYING TO OUTPUT DIRECTORY"
echo

# Copy everything to the output directory
cp -R ${DIR}/* ${OUTDIR}
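
By the way, the screenshot loop in "lor" always asks for both the .jpg and the .png, so one of the two wget calls usually fails. Something like this (untested) would only fall back to the .png when the .jpg is missing:
Code:
# Untested: try the jpg first and ask for the png only if the jpg fails
wget --directory-prefix=${DIR}/${CITE}/gallery/ --no-verbose \
  ${CITE}/gallery/${NUM}.jpg \
|| wget --directory-prefix=${DIR}/${CITE}/gallery/ --no-verbose \
  ${CITE}/gallery/${NUM}.png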

Well, all feedback is greatly appreciated :) I'd also like to learn how to deal with quotes properly...
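
About the quotes: is the right approach to double-quote every variable expansion unless word splitting is really wanted, and to replace "for i in $(find ...)" with a while-read loop so that file names with spaces do not break anything? For example (untested):
Code:
# Untested: quote every expansion and let read handle unusual file names
CHARSET=windows-1251
META='<meta http-equiv="content-type" content="text/html; charset='${CHARSET}'">'
find "${DIR}/${CITE}" -type f | while read -r NAME
do
  echo "${NAME}"
  sed --in-place 's%<head>%<head>'"${META}"'%g' "${NAME}"
done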

