First of all, sorry for my bad English.
The script was written by a newbie and uses wget. It is quite inefficient, because it downloads the pages linked from the main one even if they have not changed since the last download. There are three files.
The file "download" is used by the other two:
Code:
#! /bin/sh
#
# Description:
# Wrapper over wget that downloads the specified page and all objects it has
# links to and does its best to convert links and file names for offline viewing
# Arguments:
# 1. site (base URL for relative links)
# 2. page (may be the same as the site)
# 3. output directory
# 4. additional wget options
# Input:
# None
# Output:
# None
# Return value:
# 0 if successful, not 0 otherwise
# Examples:
# None
# Notes:
# None
# Requirements:
# wget
# Download
# $4 is deliberately left unquoted so that it can hold several options
wget $4 \
--base="$1" \
--convert-links \
--directory-prefix="$3" \
--html-extension \
--level 1 \
--no-verbose \
--page-requisites \
--recursive \
--restrict-file-names=windows \
--span-hosts \
--timeout=10 \
"$2"
# "return" is only valid inside a function or a sourced file
exit $?
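The inefficiency mentioned at the top might be reduced with wget's time-stamping. Below is a rough, untested sketch of a variant of "download" written as a shell function; the name download_newer is made up for illustration. It adds --timestamping, plus --backup-converted, which the wget manual recommends when time-stamping is combined with --convert-links and --html-extension. Pages served without a Last-Modified header will probably still be fetched every time.

```shell
#!/bin/sh
# Hypothetical variant of "download" with the same arguments:
#   download_newer SITE PAGE OUTDIR [EXTRA_WGET_OPTIONS]
download_newer() {
    # $4 is left unquoted on purpose so that it can carry several options.
    wget $4 \
        --timestamping \
        --backup-converted \
        --base="$1" \
        --convert-links \
        --directory-prefix="$3" \
        --html-extension \
        --level 1 \
        --no-verbose \
        --page-requisites \
        --recursive \
        --restrict-file-names=windows \
        --span-hosts \
        --timeout=10 \
        "$2"
}

# Example call (needs network access, so it is left commented out):
# download_newer habrahabr.ru habrahabr.ru /tmp/habrahabr.ru
```

Note that --backup-converted leaves extra *.orig files next to the converted pages, so they would also end up in the output directory unless they are filtered out when copying.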
Here is the file "habr", which downloads news from habrahabr.ru
Code:
#! /bin/bash
CITE=habrahabr.ru
DIR=/tmp/habrahabr.ru
OUTDIR=~/docs/disk/news/habrahabr.ru/
# Do not try to use the --reject option to filter html pages.
# According to the wget manual, it should not work
WGETOPTS=
# Create necessary directories
# (the original test only checked the second mkdir)
mkdir -p "${DIR}" "${OUTDIR}" || exit 1
echo
echo "DOWNLOADING"
echo
# Download news from main page
# ("download" must be somewhere on the PATH)
download "${CITE}" "${CITE}" "${DIR}" "${WGETOPTS}" || exit 1
echo
echo "PROCESSING DOWNLOADED FILES"
echo
# Collect the file list first, one name per line,
# so that names containing spaces are not split
mapfile -t FILES < <(find "${DIR}/${CITE}" -type f)
for NAME in "${FILES[@]}"
do
echo "${NAME}"
# TODO: Process text files only
# Add charset information because Opera for Windows Mobile
# is not clever enough to detect it
CHARSET=windows-1251
META='<meta http-equiv="content-type" content="text/html; charset='"${CHARSET}"'">'
sed --in-place 's%<head>%<head>'"${META}"'%g' "${NAME}"
done
echo
echo "COPYING TO OUTPUT DIRECTORY"
echo
# Copy everything to the output directory
cp -R "${DIR}"/* "${OUTDIR}"
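The "Process text files only" TODO could perhaps be handled by letting find select the files. The sketch below assumes that every HTML page ends in .html (the --html-extension option in "download" should make wget add that suffix) and works on two made-up files in a temporary directory, so it is safe to try. It also assumes GNU sed, as the scripts above do.

```shell
#!/bin/sh
# Work in a throwaway directory with two hypothetical files.
WORK=$(mktemp -d)
printf '<html><head><title>t</title></head></html>\n' > "${WORK}/page one.html"
printf 'binary-ish data <head>\n' > "${WORK}/logo.png"

CHARSET=windows-1251
META='<meta http-equiv="content-type" content="text/html; charset='"${CHARSET}"'">'

# Only *.html files are edited. -exec hands the names to sed directly,
# so names with spaces need no extra care.
find "${WORK}" -type f -name '*.html' \
    -exec sed --in-place 's%<head>%<head>'"${META}"'%g' {} +

cat "${WORK}/page one.html"
```

The image file is left untouched even though it happens to contain the text "<head>".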
And here is the file "lor", which downloads news from linux.org.ru
Code:
#! /bin/bash
CITE=www.linux.org.ru
DIR=/tmp/linux.org.ru
OUTDIR=~/docs/disk/news/linux.org.ru/
LOCALDIR='/SD-MMC card/news/linux.org.ru/'
# Do not try to use the --reject option to filter html pages.
# According to the wget manual, it should not work
WGETOPTS=-erobots=off
# Create necessary directories
# (the original test only checked the second mkdir)
mkdir -p "${DIR}" "${OUTDIR}" || exit 1
echo
echo "DOWNLOADING"
echo
# Download news from main page
# ("download" must be somewhere on the PATH)
download "${CITE}" "${CITE}" "${DIR}" "${WGETOPTS}" || exit 1
# Now download screenshots
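for i in "${DIR}/${CITE}"/gallery/*-icon.jpg
do
# Skip the unexpanded pattern if there are no icons at all
[ -e "$i" ] || continue
# Strip the directory and the "-icon.jpg" suffix
NUM=$(basename "$i" -icon.jpg)
# The extension of the full-size image is not known, so try both
wget \
--directory-prefix="${DIR}/${CITE}/gallery/" \
--no-verbose \
"${CITE}/gallery/${NUM}.jpg"
wget \
--directory-prefix="${DIR}/${CITE}/gallery/" \
--no-verbose \
"${CITE}/gallery/${NUM}.png"
done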
echo
echo "PROCESSING DOWNLOADED FILES"
echo
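# Collect the file list first, one name per line,
# so that names containing spaces are not split
mapfile -t FILES < <(find "${DIR}/${CITE}" -type f)
for NAME in "${FILES[@]}"
do
echo "${NAME}"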
# TODO: Process text files only
# Delete html pages that are used for commenting on messages,
# as they are not useful offline
if echo "${NAME}" | grep -qF 'comment-message.jsp@'
then
rm "${NAME}"
continue
fi
# Delete html pages whose addresses differ by "lastmod" only.
# I do not know what the purpose of "lastmod" is
NEWNAME=$(echo "${NAME}" | sed 's/&lastmod=[[:digit:]]*//g')
if test "${NAME}" != "${NEWNAME}"
then
if test -f "${NEWNAME}"
then
rm "${NAME}"
continue
else
mv "${NAME}" "${NEWNAME}"
NAME=${NEWNAME}
fi
fi
# Change links accordingly
sed --in-place 's/&lastmod=[[:digit:]]*//g' "${NAME}"
# Add the html extension because Opera for Windows Mobile requires it
FILENAME='([^"/]*/)*(([^"/]*\.[^"/.]*[^"/.[:alnum:]][^"/.]*)|([^"/.]*))'
if echo "${NAME}" | grep -qE "^${FILENAME}\$"
then
mv "${NAME}" "${NAME}.html"
NAME=${NAME}.html
fi
# Change links accordingly
SEDEXPR='s%href="('"${FILENAME}"')"%href="\1.html"%g'
sed --in-place --regexp-extended "${SEDEXPR}" "${NAME}"
# Add charset information because Opera for Windows Mobile
# is not clever enough to detect it
CHARSET=utf-8
META='<meta http-equiv="content-type" content="text/html; charset='"${CHARSET}"'">'
sed --in-place 's%<head>%<head>'"${META}"'%g' "${NAME}"
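# Change some links that were not converted by wget for some reason.
# (Such links really exist)
# The original computed LOCALDIR_ but then used LOCALDIR, which drops the
# ${CITE} directory that wget creates; LOCALDIR_ is assumed to be intended
LOCALDIR_="${LOCALDIR%/}/${CITE}"
sed --in-place 's%href="http://'"${CITE}"'%href="'"${LOCALDIR_}"'%g' "${NAME}"
sed --in-place 's%src="http://'"${CITE}"'%src="'"${LOCALDIR_}"'%g' "${NAME}"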
done
echo
echo "COPYING TO OUTPUT DIRECTORY"
echo
# Copy everything to the output directory
cp -R "${DIR}"/* "${OUTDIR}"
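The FILENAME expression in "lor" is hard to read, so here is a small sketch that tries it on a few made-up names. As far as I can tell, it matches a path whose last component either has no dot at all, or has an "extension" containing a character that is not a letter or digit (for example "jsp@msgid=1"). The helper name needs_html and the sample names are only for illustration.

```shell
#!/bin/sh
# Same expression as in "lor".
FILENAME='([^"/]*/)*(([^"/]*\.[^"/.]*[^"/.[:alnum:]][^"/.]*)|([^"/.]*))'

# Hypothetical helper: succeeds if the name would get ".html" appended.
needs_html() {
    printf '%s\n' "$1" | grep -qE "^${FILENAME}\$"
}

for f in 'gallery/123.jpg' 'index.html' 'news' 'view-message.jsp@msgid=1'
do
    if needs_html "$f"
    then
        echo "rename: $f"
    else
        echo "keep:   $f"
    fi
done
```

Names with an ordinary extension such as .jpg or .html are kept as they are; "news" and "view-message.jsp@msgid=1" are the ones that get renamed.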
Well, all feedback is greatly appreciated

I'd also like to learn how to deal with quotes properly...
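
On the quoting question, the usual rule of thumb is to put double quotes around every variable expansion and command substitution, unless word splitting is really wanted (as with the extra wget options). A minimal sketch of the difference follows; the path and the helper count_args are invented for the demonstration, and the default IFS is assumed.

```shell
#!/bin/sh
# A made-up path with spaces, similar to LOCALDIR in "lor".
NAME='/SD-MMC card/news/page one.html'

# Hypothetical helper: prints how many arguments it received.
count_args() { echo $#; }

count_args ${NAME}      # unquoted: the value is split at the spaces
count_args "${NAME}"    # quoted: always exactly one argument

# Single quotes keep the text literal, double quotes expand variables.
CHARSET=utf-8
echo 'charset=${CHARSET}'
echo "charset=${CHARSET}"

# Command substitutions need quotes too, both inside and outside.
ICON='/tmp/gallery/12 3-icon.jpg'
NUM=$(basename "${ICON}" -icon.jpg)
echo "${NUM}"
```

The first call sees three arguments and the second sees one, which is why commands such as rm, mv and sed in the scripts above can misbehave on unquoted names that contain spaces.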