You are missing a few things:
1. There is no guarantee the current template is consistent, though there may be certain delimiters that can be used. Humans are more likely to screw this up and generate invalid pages with typos etc.
2. The encoding and formatting currently used in the story pages is shockingly atrocious. Some of the pages are in windows-125[0-9], some are in UTF-8, some are in ISO-8859-1. The webserver does not send a valid charset in its Content-Type header, and the <meta http-equiv='Content-Type'> tags cannot be relied upon. If you're going to bother with the work of migrating the content, why would you not sort out some of this cluster-@#*& while you're at it?
3. Even if everything was nice and lovely, you're still talking about a manual process for all 700+ files.
4. Computers were *designed* for this kind of repetitive crap. You give them a list of instructions and they execute them. Fast. VERY VERY fast. Like, doing in a matter of seconds what would take millions of humans days. I cannot overstate how damn fast modern computers are at this kind of task compared to a human.
5. It's not actually that hard to get a decent way through the problem with a pretty small script. You could clean up all the corner cases, validate the output with an SGML/XML parser etc., but that might be a diminishing-returns job.
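For the mixed encodings in particular, one possible approach is to try a strict UTF-8 decode first and fall back to windows-1252 when that fails. This is only a sketch and not part of the script below: the helper name decodePageBytes is made up for illustration, and it assumes you have the raw bytes of the page in hand. Treating ISO-8859-1 pages as windows-1252 is a deliberate simplification (the two agree on the printable range 0xA0-0xFF).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn raw page bytes into a Perl character string.
sub decodePageBytes
{
    my ($bytes) = @_;
    # Strict UTF-8 first. Text in a legacy 8-bit encoding that contains
    # any non-ASCII byte will almost never pass this check.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text if defined $text;
    # Assumption: anything that is not valid UTF-8 is windows-1252.
    return decode('cp1252', $bytes);
}

# The same curly apostrophe, once as windows-1252 and once as UTF-8 bytes.
my $asCp1252 = "It\x92s";
my $asUtf8   = "It\xE2\x80\x99s";
foreach my $bytes ($asCp1252, $asUtf8)
{
    # Both inputs come out as the same UTF-8 text.
    print encode('UTF-8', decodePageBytes($bytes)), "\n";
}
```

Once everything is a proper character string, the output can be written back out in a single encoding (UTF-8 here) regardless of what each source page used.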
I can't be arsed to do a full job right now (as per point 5 above), but below is the sum of 20 minutes' work, which does a decent chunk of the job. It is not elegant, it is not efficient, it is not pretty, but it's a LOT faster than any human doing it manually.
#!/usr/bin/perl -w
use LWP::Simple;
use HTML::Entities;
$STRIP_FORMATTING = 'YES';
$pageContent = get($ARGV[0]);
$startOfStoryContent = index($pageContent,'class="style7"');
$endOfStoryContent = index($pageContent, '</td>',$startOfStoryContent);
$storyContent = substr($pageContent,$startOfStoryContent + 15,
($endOfStoryContent - ($startOfStoryContent + 15)));
$storyContent = cleanUpStoryFormatting($storyContent);
print $storyContent;
exit;
sub cleanUpStoryFormatting
{
my ($cleanThis) = @_;
# Handle <pre> tags: turn line breaks inside the block into paragraph breaks
my $preStart = index($cleanThis,'<pre>');
my $preEnd = index($cleanThis,'</pre>');
if ($preStart > -1 && $preEnd > $preStart)
{
# substr() takes a length, not an end offset
my $newCleanThis = substr($cleanThis,$preStart+5,$preEnd-($preStart+5));
#printf "Pre %d %d %d\n", $preStart, $preEnd, length($newCleanThis);
$newCleanThis =~ s#[\r\n]+#</p><p class="storyParagraph">#g;
# Re-attach whatever followed the closing </pre>
$newCleanThis .= substr($cleanThis,$preEnd+6);
$cleanThis = substr($cleanThis,0,$preStart);
$cleanThis .= $newCleanThis;
}
# Remove non-standard (Microsoft Word) HTML tags and custom font faces
$cleanThis =~ s#</?o:[^>]+>##ig; $cleanThis =~ s#</?font[^>]*>##ig;
# Clean up what should be standard HTML tags. The \b stops <b...> from also
# matching <br>, <i...> from matching <img>, and <u...> from matching <ul>.
$cleanThis =~ s#<b\b[^>]*>#<b>#gi; $cleanThis =~ s#<u\b[^>]*>#<u>#gi; $cleanThis =~ s#<i\b[^>]*>#<i>#gi;
if ($STRIP_FORMATTING =~ m/^YES$/i)
{
$cleanThis =~ s#</?i\b[^>]*>##gi; $cleanThis =~ s#</?b\b[^>]*>##gi; $cleanThis =~ s#</?u\b[^>]*>##gi;
$cleanThis =~ s#</?span\b[^>]*>##gi; $cleanThis =~ s#</?div\b[^>]*>#<br /><br />#gi;
$cleanThis =~ s#</?strong\b[^>]*>##gi;
}
#Replace / fix paragraph entities
$cleanThis =~ s#<p( [^>]+)?>#<p class="storyParagraph">#ig;
$cleanThis =~ s#(<br(\s*/)?>)+#</p><p class="storyParagraph">#sig;
$cleanThis =~ s#</p>[\s\r\n]*</p>#</p>#g;
my @paragraphs = split(/<p class="storyParagraph">/,$cleanThis);
$cleanThis = '';
foreach $paragraph (@paragraphs)
{
$paragraph =~ s#</p>##; $paragraph = decode_entities($paragraph);
$paragraph = encode_entities($paragraph); $paragraph =~ s#^[\s\r\n]+##g;
$paragraph =~ s#[\s\r\n]+$##g; $cleanThis .= "<p class=\"storyParagraph\">$paragraph</p>\n\n";
}
# Clean up formatting
$cleanThis =~ s#[\r\n]+# #g; $cleanThis =~ s# +# #g; $cleanThis =~ s#\n\s+#\n#sig;
$cleanThis =~ s#\s+# #g; $cleanThis =~ s#<p#\n\n<p#g; $cleanThis =~ s#\s* \s*# #ig;
$cleanThis =~ s#<p class="storyParagraph">[\s\n\r]*</p>[\s\n\r]*##sig;
#Dodgy Encoding
$cleanThis =~ s/�/'/g; $cleanThis =~ s/>'/>"/g; $cleanThis =~ s/'</"</g;
$cleanThis =~ s/ '/ "/g; $cleanThis =~ s/' /" /g; $cleanThis =~ s/'/'/g;
$cleanThis =~ s/">\s*/">/g;
return $cleanThis;
}
and run it through an execution harness:
$ perl -e 'for (1..999) { $out = sprintf("%03d.html",$_); `perl exportAndClean.pl "http://www.dailydiapers.com/content/stories/$out" > $out`; }'
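The same loop can also live in a small Perl script, which makes it easier to spot the page numbers that do not exist instead of silently leaving empty files behind. This is a rough sketch under a few assumptions: the cleaner is saved as exportAndClean.pl in the current directory (as the one-liner above implies), the helper names storyPageNames and convertPage are made up for illustration, and the --run flag is an invented convention so that nothing is fetched unless you ask for it.

```perl
use strict;
use warnings;

# Base URL and script name copied from the one-liner above.
my $BASE_URL = 'http://www.dailydiapers.com/content/stories';
my $CLEANER  = 'exportAndClean.pl';

# Build the zero-padded page names, e.g. 001.html .. 999.html.
sub storyPageNames
{
    my ($first, $last) = @_;
    return map { sprintf('%03d.html', $_) } $first .. $last;
}

# Convert one page. Returns true only if a non-empty file was written.
sub convertPage
{
    my ($page) = @_;
    my $status = system("perl $CLEANER \"$BASE_URL/$page\" > $page");
    if ($status != 0 || -z $page)
    {
        unlink $page;    # do not leave empty files lying around
        return 0;
    }
    return 1;
}

# Dry run by default; pass --run to actually fetch and convert.
my $reallyRun = grep { $_ eq '--run' } @ARGV;
my @failed;
foreach my $page (storyPageNames(1, 999))
{
    if (!$reallyRun)
    {
        print "would convert $BASE_URL/$page\n";
        next;
    }
    push @failed, $page unless convertPage($page);
}
print STDERR "No usable output for: @failed\n" if @failed;
```

Run it once without arguments to see what it would do, then again with --run. The list printed at the end is the set of pages that need a look by hand.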
P.S. If this post isn't worth a 'like', I have no idea what someone has to do around here to get one.