Mail::Digest::Tools - Tools for digest versions of mailing lists
This document refers to version 2.12 of digest.pl, released May 14, 2011.
use Mail::Digest::Tools qw(
process_new_digests
reprocess_ALL_digests
reply_to_digest_message
repair_message_order
consolidate_threads_multiple
consolidate_threads_single
delete_deletables
);
%config_in and %config_out are two configuration hashes whose setup
is discussed in detail below.
process_new_digests(\%config_in, \%config_out);
reprocess_ALL_digests(\%config_in, \%config_out);
$full_reply_file = reply_to_digest_message(
\%config_in,
\%config_out,
$digest_number,
$digest_entry,
$directory_for_reply,
);
repair_message_order(
\%config_in,
\%config_out,
{
year => 2004,
month => 01,
day => 27,
}
);
consolidate_threads_multiple(
\%config_in,
\%config_out,
$first_common_letters, # optional integer argument; defaults to 20
);
consolidate_threads_single(
\%config_in,
\%config_out,
[
'first_dummy_file_for_consolidation.thr.txt',
'second_dummy_file_for_consolidation.thr.txt',
],
);
delete_deletables(\%config_out);
Mail::Digest::Tools provides useful tools for processing mail which an individual receives in a 'daily digest' version from a mailing list. Digest versions of mailing lists are provided by a variety of mail processing programs and by a variety of list hosts. Within the Perl community, digest versions of mailing lists are offered by such sponsors as ActiveState, SourceForge, Yahoo! Groups and London.pm. However, you do not have to be interested in Perl to make use of Mail::Digest::Tools. Mail from any of the thousands of Yahoo! Groups, for example, may be processed with this module.
If, when you receive e-mail from the digest version of a mailing list, you simply read the digest in an e-mail client and then discard it, you may stop reading here. If, however, you wish to read or store such mail by subject, read on. As printed in a normal web browser, this document contains 40 pages of documentation. You are urged to print this documentation out and study it before using this module.
To understand how to use Mail::Digest::Tools, we will first take a look at a typical mailing list digest. We will then sketch how that digest looks once processed by Mail::Digest::Tools. We will then discuss the module's exportable functions. Next, we will study how to prepare the two configuration hashes which hold the configuration data. Finally, we will provide some tips for everyday use of Mail::Digest::Tools.
Here is a dummied-up version of a typical mailing list digest as it appears once saved to a plain-text file. For illustrative purposes, let us suppose that the file is named: 'Perl-Win32-Users Digest, Vol 1 Issue 9999.txt'
Send Perl-Win32-Users mailing list submissions to
perl-win32-users@listserv.ActiveState.com
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Perl-Win32-Users digest..."
Today's Topics:
1. Introducing Mail::Digest::Tools (James E Keenan)
2. A Different Discussion (steve)
3. Re: Introducing Mail::Digest::Tools (David H Adler)
----------------------------------------------------------------------
Message: 1
From: "James E Keenan" <jkeen@some.web.address.com>
To: <Perl-Win32-Users@listserv.activestate.com>
Subject: Introducing Mail::Digest::Tools
Date: Sat, 31 Jan 2004 14:10:20 -0600
Mail::Digest::Tools is the greatest thing since sliced bread.
Go download it now!
------------------------------
Message: 2
From: "steve" <steve@some.web.address.com>
To: <Perl-Win32-Users@listserv.activestate.com>
Subject: A Different Discussion
Date: Sat, 31 Jan 2004 14:40:20 -0600
This is a new topic. I am not discussing Mail::Digest::Tools in this
submission.
------------------------------
Message: 3
From: "David H Adler" <dha@some.web.address.com>
To: <Perl-Win32-Users@listserv.activestate.com>
Subject: Re: Introducing Mail::Digest::Tools
Date: Sat, 31 Jan 2004 14:50:20 -0600
Jim, what's this nonsense about sliced bread. Weren't you on the Atkins
diet? Unlike beer, sliced bread is Off Topic.
------------------------------
_______________________________________________
Perl-Win32-Users mailing list
Perl-Win32-Users@listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
End of Perl-Win32-Users Digest
Note that the digest has an overall structure, while each message within the digest has its own structure.
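The digest's overall structure consists of:
A list of the subjects of the messages in the digest, introduced by the line:
Today's Topics
A delimiter separating the list of topics from the first message. In the example above this is a line of hyphens:
----------------------------------------------------------------------
followed by two \n newlines (so that the delimiter is a paragraph unto
itself). Other digests may use a two-line delimiter such as:
_______________________________________________________
_______________________________________________________
or a shorter string such as:
--__--__--
A delimiter separating each message from the next. In the example above this is a line of 30 hyphens:
------------------------------
followed by two \n newlines (so that the delimiter is a paragraph unto
itself). Mail::Digest::Tools uses split to break
digests on this delimiter.
Each message within the digest begins with a line giving its number:
Message:
followed by some or all of these headers:
From:
Organization:
Reply-To:
To:
CC:
Date:
Subject:
These are taken up again in the discussion of process_new_digests below.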
Using the dummy messages provided above, typical use of Mail::Digest::Tools would produce (in a bare-bones configuration) the following results:
Thread: Introducing Mail::Digest::Tools
Message: 001_9999_001
From: "James E Keenan" <jkeen@some.web.address.com>
Text:
Mail::Digest::Tools is the greatest thing since sliced bread.
Go download it now!
--__--__--
Thread: Introducing Mail::Digest::Tools
Message: 001_9999_003
From: "David H Adler" <dha@some.web.address.com>
Text:
Jim, what's this nonsense about sliced bread. Weren't you on the Atkins
diet? Unlike beer, sliced bread is Off Topic.
--__--__--
Thread: A Different Discussion
Message: 001_9999_002
From: "steve" <steve@some.web.address.com>
Text:
This is a new topic. I am not discussing Mail::Digest::Tools in this
submission.
--__--__--
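The digest's Today's Topics are also recorded, under the name of the digest file:
Today's Topics
...
Perl-Win32-Users digest, Vol 1 #9999 - 3 msgs.txt
1. Introducing Mail::Digest::Tools (James E Keenan)
2. A Different Discussion (steve)
3. Re: Introducing Mail::Digest::Tools (David H Adler)
Finally, a log entry is written for the digest, giving its number and two timestamps:
001_9999;Fri Feb 6 18:57:41 2004;Fri Feb 6 18:57:41 2004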
Mail::Digest::Tools exports no functions by default. Each of its current seven functions is imported only on request by your script.
In everyday use, you will probably call just one of Mail::Digest::Tools' exportable functions in a particular Perl script. Typically, you will import the function as described in the SYNOPSIS above, populate two configuration hashes, and finally call the one function you have imported.
As will become evident, the most challenging part of using Mail::Digest::Tools is not calling the functions. Rather, it is the initial setup and testing of configuration files from which the two configuration hashes passed as arguments to the various Mail::Digest::Tools functions are drawn.
More on those configuration hashes later. For now, let's look at the exportable functions.
process_new_digests
process_new_digests(\%config_in, \%config_out);
process_new_digests() is the Mail::Digest::Tools function which you will
use most frequently on a daily basis. Based on information supplied in the
two configuration hashes passed to it as arguments, process_new_digests()
does the following:
consolidate_threads_single() function discussed below.)
reprocess_ALL_digests
reprocess_ALL_digests(\%config_in, \%config_out);
reprocess_ALL_digests() is the Mail::Digest::Tools function which you
should use ONLY when you are setting up and fine-tuning Mail::Digest::Tools
to process a given digest -- and you should NEVER use it thereafter!
Why? Read on!
reprocess_ALL_digests() does almost exactly the same things as does
process_new_digests(), but it does them on ALL digest files found in the
directory in which you store such digests -- not just on those which have
not previously been processed. Moreover, it does not merely append new messages to
already existing thread files, leaving older thread files untouched. Instead,
reprocess_ALL_digests() WIPES OUT your entire directory of thread files and
rebuilds it from scratch.
That's cool if you have retained all instances of a given digest which you
wish to process into thread files. But if you've thrown out older instances
of a given digest and call reprocess_ALL_digests(), you will not be able
to process the messages contained in those discarded digests. The message
sources are gone. Discarding older digests is fine once you're certain that
you've got a given digest configured just the way you want it -- but not
until that moment.
Let's make this more concrete. Suppose that you have begun to subscribe to the digest version of the London Perlmongers mailing list. When you receive e-mails from this provider, you store them in a directory whose contents look like this:
london.pm digest, Vol 1 #1856 - 7 msgs.txt
london.pm digest, Vol 1 #1857 - 18 msgs.txt
london.pm digest, Vol 1 #1858 - 15 msgs.txt
london.pm digest, Vol 1 #1859 - 17 msgs.txt
london.pm digest, Vol 1 #1860 - 11 msgs.txt
Initially, you decide that you want to post the messages in these digests to thread files that are discarded after three days. You set up your configuration files to do precisely this. (See below for how this is done.) You then write a script which calls
reprocess_ALL_digests(\%config_in, \%config_out);
Three days go by. One or two new london.pm digests arrive each day. You want to process only the newly arrived files, so each day you simply call:
process_new_digests(\%config_in, \%config_out);
and on Day 4 Mail::Digest::Tools starts to notify you on standard output that it is discarding thread files which have not been changed (i.e., received new postings) in three days.
But then you decide that London.pm's contributors are the most witty and erudite Perlmongers anywhere and you wish to archive their contributions until the end of time (or until the first production release of Perl 6, whichever comes first). Fortunately, you've still got all your London.pm digest files going back to the beginning of your subscription. You make appropriate changes to your configuration setup to say, "Instead of killing these thread files after 3 days of inactivity, archive them." (Again, we'll see how to do this below.) You then call:
reprocess_ALL_digests(\%config_in, \%config_out);
one last time. All your previously existing thread files are wiped out, and all your London.pm digests are reprocessed from scratch. But that's okay, because you've decided to live with your configuration decisions. So you can now begin to discard older digest files and process newly arrived files only with
process_new_digests(\%config_in, \%config_out);
Your London.pm thread archive grows exponentially, and you live happily ever after.
The ALL CAPS in reprocess_ALL_digests() is a little warning that this
Mail::Digest::Tools function is very powerful, but potentially very dangerous.
You are also alerted to this danger by this screen prompt which appears when
you call this function:
By default, this program processes only NEWLY ARRIVED
[London.pm/other digest] files found in this directory. Messages in
these new digests are sorted and appended to the appropriate
'.thr.txt' files in the 'Threads' subdirectory.
However, by choosing method 'reprocess_ALL_digests()' you have
indicated that you wish to process ALL digest files found in this
directory -- regardless of whether or not they have previously been
processed. This is recommended ONLY for initialization and testing
of this program.
Since this will wipe out all threads files ('.thr.txt') as well --
including threads files for which you no longer have their source
digest files -- please confirm that this is your intent by typing
ALL at the prompt.
GOT IT?
To proceed, you must type ALL in ALL CAPS, hit [Enter], then respond to
yet another prompt:
You have chosen to WIPE OUT all '.thr.txt' files currently
existing in the 'Threads' subdirectory and reprocess all
[London.pm/other digest] digest files from scratch.
Please re-confirm your choice by once again typing 'ALL'
and hitting [Enter]:
You must again type ALL in ALL CAPS and hit [Enter] to reprocess all
digests. Should you fail to type ALL at both of these prompts, your
script will default to process_new_digests() and only process newly
arrived digest files.
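The two-prompt safeguard described above can be sketched in a few lines. This is only an illustration of the decision rule, not the module's own code; choose_mode is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: decide which function to run from the user's answers
# to the two prompts. Anything other than 'ALL' (in ALL CAPS) at either
# prompt falls back to processing newly arrived digests only.
sub choose_mode {
    my @answers = @_;
    chomp @answers;
    my $confirmed = grep { $_ eq 'ALL' } @answers;    # count of exact matches
    return $confirmed == 2 ? 'reprocess_ALL_digests' : 'process_new_digests';
}

print choose_mode( "ALL\n", "ALL\n" ), "\n";
print choose_mode( "ALL\n", "all\n" ), "\n";
```

The comparison is deliberately case-sensitive, so a lower-case 'all' is treated the same as no confirmation at all.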
reply_to_digest_message
$full_reply_file = reply_to_digest_message(
\%config_in,
\%config_out,
$digest_number,
$digest_entry,
$directory_for_reply,
);
Once you have begun to follow discussion threads on a mailing list with the aid of Mail::Digest::Tools, you may wish to join the discussion and reply to a message.
If you tried to do this by hitting the 'Reply' button in your e-mail client, you would probably end up with a 'Subject' line in your e-mail that looked like this:
Re: london.pm digest, Vol 1 #1814 - 2 msgs
Needless to say, this is tacky. So tacky that many mailing list digest programs insert this message into each digest's headers:
When replying, please edit your Subject line so it is more specific
than "Re: Contents of london.pm digest, Vol 1, #xxxx..."
You don't want to be tacky; you want to be lazy. You want Perl to do the
work of initiating an e-mail with a meaningful subject header for you.
The reply_to_digest_message() function does just this. It creates
a plain-text file for you that has a meaningful subject line and prefixes
each line of the body of the message with '> '. You then open this
plain-text file, edit it to reply to its contents, copy-and-paste it into
your e-mail client, and send it.
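The arguments passed to reply_to_digest_message() are, in order:
a reference to the 'in' configuration hash;
a reference to the 'out' configuration hash;
the number of the digest containing the message to which you are replying;
the number of that message within the digest;
the directory in which the reply file is to be created.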
Suppose that you wished to reply to message #2 in London.pm digest #1814:
Message: 2
From: James E Keenan <jkeen@some.web.address.com>
To: London Perlmongers <london.pm@london.pm.org>
Date: Fri, 2 Jan 2004 23:41:01 -0500
Subject: re: language courses
Reply-To: london.pm@london.pm.org
On Fri, 2 Jan 2004 22:38:40 +0000 (GMT), Ali Young wrote concerning:
language courses
> Depends what you count as useful. Learning Esperanto means that you
> can read the current London.pm website.
BTW, wasn't the Esperanto on the website supposed to expire on 31 Dec?
Jim Keenan
Brooklyn, NY
You would call the function as follows:
$full_reply_file = reply_to_digest_message(
\%config_in,
\%config_out,
1814,
2,
'/home/jimk/mail/digest/london',
);
Mail::Digest::Tools will then create a plain-text file which you can use as the first draft of your reply. It will print this screen prompt:
To complete reply, edit text in:
/home/jimk/mail/digest/london/language_courses.reply.txt
When you open language_courses.reply.txt in your text editor, it will look like this:
Reply-To:
london.pm@london.pm.org
Subject:
language courses
On Fri, 2 Jan 2004 23:41:01 -0500, James E Keenan
<jkeen@some.web.address.com> wrote:
> On Fri, 2 Jan 2004 22:38:40 +0000 (GMT), Ali Young wrote concerning:
> language courses
>
> > Depends what you count as useful. Learning Esperanto means that you
> can
> > read the current London.pm website.
>
> BTW, wasn't the Esperanto on the website supposed to expire on 31 Dec?
>
> Jim Keenan
> Brooklyn, NY
>
The 'Reply-To' and 'Subject' paragraphs are provided simply to give you
something to cut-and-paste into a GUI e-mail client. The 'Reply-To'
paragraph will only appear if in %config_in the key
reply_to_style_flag is defined for a particular digest.
You edit this plain-text file, pop it into the body of your e-mail window and send it. Not elegant, but it at least gives you a first draft.
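The quoting step itself is simple enough to sketch. The following is a minimal illustration, assuming nothing about how Mail::Digest::Tools implements it internally; quote_body is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: prefix every line of a message body with '> ',
# leaving a bare '>' on empty lines, as in the reply file shown above.
sub quote_body {
    my ($body) = @_;
    my @lines = split /\n/, $body;    # split drops trailing empty lines
    return join '', map { length($_) ? "> $_\n" : ">\n" } @lines;
}

my $quoted = quote_body("BTW, wasn't the Esperanto supposed to expire?\n\nJim Keenan\n");
print $quoted;
```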
repair_message_order
repair_message_order(
\%config_in,
\%config_out,
{
year => 2004,
month => 01,
day => 27,
}
);
From time to time you may receive digest versions of mailing lists out of chronological/numerical sequence. This is especially true when e-mail traffic is being disrupted by worms or viruses. You may discover that you have received and processed
london.pm digest, Vol 1 #1856 - 7 msgs
london.pm digest, Vol 1 #1858 - 15 msgs
before realizing that you were missing
london.pm digest, Vol 1 #1857 - 18 msgs
If you were to now process digest 1857 with process_new_digests(), messages
from that digest would be appended to their respective thread files after
messages from digest 1858. Since the whole point of Mail::Digest::Tools is to
be able to read a discussion thread in chronological order, this would not be
desirable.
Fortunately, you can fix this problem as follows:
1. Call process_new_digests()
Call process_new_digests() as you normally would. In the above example,
go ahead and call it on digest 1857 even though it creates thread files with
messages out of chronological order.
2. Determine the repair date
Examine the timestamps on your digest files for the date of the first digest you received out of sequence. In the above example, that would be the date of digest 1858. Since digest files were received out of proper sequence on or after that date, all thread files generated on or after that date may have out-of-sequence messages and need re-ordering.
3. Call repair_message_order() with the repair date
Call repair_message_order() with the following arguments: a reference to the 'in' configuration hash, a reference to the 'out' configuration hash, and a reference to an anonymous hash with the keys year, month and day,
the values for which keys are the elements of the repair date.
Mail::Digest::Tools will examine all thread files from midnight local time on that date. Where messages have been posted to the thread files out of proper sequence, they will be reposted in the correct order. The thread file with the correct sequence will overwrite the file with the incorrect sequence.
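To make "midnight local time on that date" concrete, here is a small sketch using the core Time::Local module. It is not taken from the module's source; the needs_repair helper and the idea of comparing against a file's modification time are assumptions made for illustration. Note also that a leading zero makes a Perl numeric literal octal: 01 is simply 1, but 08 or 09 would not compile, so the sketch writes the month without a leading zero.

```perl
use strict;
use warnings;
use Time::Local qw(timelocal);

# The repair date, with the same keys as the hash passed to repair_message_order().
my %repair_date = ( year => 2004, month => 1, day => 27 );

# timelocal() takes (sec, min, hour, mday, mon, year) with a zero-based month.
my $midnight = timelocal( 0, 0, 0,
    $repair_date{day}, $repair_date{month} - 1, $repair_date{year} );

# Hypothetical filter: treat a file as a candidate for repair if it was
# modified at or after local midnight on the repair date.
sub needs_repair {
    my ($mtime) = @_;
    return $mtime >= $midnight;
}

print scalar( localtime $midnight ), "\n";
```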
consolidate_threads_multiple
consolidate_threads_multiple(
\%config_in,
\%config_out,
);
or
consolidate_threads_multiple(
\%config_in,
\%config_out,
$first_common_letters, # optional integer argument
);
As described above, Mail::Digest::Tools' process_new_digests() function
will, to the greatest extent possible, delete extraneous words such as 'Re:'
or 'Fwd:' from a message's subject so that all relevant postings on a given
subject can be included in a single thread file. What happens when this is
not sufficient? For example, suppose someone posts a message to a list with a
slightly misspelled or altered subject line:
Help telnetting to remote host through CGI.thr.txt
Help telnetting to remote host thru CGI.thr.txt
Mail::Digest::Tools offers two functions to address this problem.
consolidate_threads_multiple() is the easier to use and will be discussed
first. This function presumes that people who re-type e-mail subject lines
when replying tend to type the first several words correctly, then make errors
or alterations toward the end of the subject line. If the first n letters
of the subject line of two or more messages are identical, there is a strong
chance that the messages are discussing the same topic and should be posted to
the same discussion thread. The default value for n is
20, but you can set a different value for a particular digest by passing an
optional third argument as shown above. consolidate_threads_multiple()
then proceeds as follows:
Where the names of two or more thread files share their first n letters, the function lists them and prompts you:
Candidates for consolidation:
Help telnetting to remote host through CGI.thr.txt
Help telnetting to remote host thru CGI.thr.txt
To consolidate, type YES:
If you type YES in ALL CAPS, the files will be consolidated into a single
thread file whose name will be derived from the Subject line of the very first
posting to the discussion thread. Standard output will display:
Files will be consolidated
If you type anything other than YES in ALL CAPS -- or simply hit [Enter],
then the files will not be consolidated and standard output will display:
Files will not be consolidated
Files which have been consolidated are not deleted automatically. Instead, they are renamed with the extension .DELETABLE:
Help telnetting to remote host through CGI.thr.txt.DELETABLE
Help telnetting to remote host thru CGI.thr.txt.DELETABLE
You can then delete them with the delete_deletables() function discussed below.
If no thread files share their first n letters, standard output will display:
Analysis of the first 20 letters of each file in
[threads directory]
shows no candidates for consolidation. Please hard-code
names of files you wish to consolidate as arguments to
&consolidate_threads_single
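The "first n letters" test can be sketched as follows. This is a minimal illustration of the idea only; consolidation_candidates is a hypothetical helper, and the module's actual comparison may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical helper: group thread file names by their first $n characters
# and return only those groups which contain two or more names.
sub consolidation_candidates {
    my ( $n, @files ) = @_;
    my %groups;
    for my $file (@files) {
        push @{ $groups{ substr( $file, 0, $n ) } }, $file;
    }
    return grep { @$_ > 1 } map { $groups{$_} } sort keys %groups;
}

my @candidates = consolidation_candidates(
    20,
    'Help telnetting to remote host through CGI.thr.txt',
    'Help telnetting to remote host thru CGI.thr.txt',
    'A Different Discussion.thr.txt',
);

for my $group (@candidates) {
    print "Candidates for consolidation:\n";
    print "  $_\n" for @$group;
}
```

With n set to 20, the two 'Help telnetting' files fall into one group, while 'A Different Discussion' stands alone and is not reported.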
consolidate_threads_single
consolidate_threads_single(
\%config_in,
\%config_out,
[
'first_dummy_file_for_consolidation.thr.txt',
'second_dummy_file_for_consolidation.thr.txt',
],
);
Suppose that the thread files which you wish to consolidate have names whose
spelling diverges before the 21st letter. The algorithm which
consolidate_threads_multiple() applies would not detect the potential
rationale for consolidation. This could happen when someone tries to change
the subject of discussion from:
Best book for extreme Newbie to programming
to:
De incunabula nostra (Was Best book for extreme Newbie to programming)
Solution: Hard-code the files to be consolidated as elements of an
anonymous array. Pass a reference to that anonymous array as the third
argument to consolidate_threads_single() as shown above.
As with consolidate_threads_multiple(), the resulting consolidated file
will bear the name of the source file containing the very first posting to
the discussion thread. The files so consolidated will not automatically be
deleted. Rather, they will be renamed with the extension .DELETABLE as a
safety precaution and left for you to delete with delete_deletables().
delete_deletables
delete_deletables(\%config_out);
The Mail::Digest::Tools function delete_deletables() tidies up after use of
either consolidate_threads_multiple() or consolidate_threads_single().
Unlike all other public functions provided by Mail::Digest::Tools,
delete_deletables() needs to be passed a reference to only one of the
two configuration hashes, viz., the 'out' configuration hash. The
function simply changes to the directory where thread files for a given
digest are stored and deletes all files with the extension .DELETABLE.
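What this amounts to can be sketched with core modules alone. The following is an illustration under stated assumptions, not the module's code: it works in a temporary directory standing in for the threads directory, whose real location would come from the 'out' configuration hash.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Stand-in for the threads directory, populated with two dummy files.
my $dir_threads = tempdir( CLEANUP => 1 );
for my $name ( 'old thread.thr.txt.DELETABLE', 'live thread.thr.txt' ) {
    my $path = File::Spec->catfile( $dir_threads, $name );
    open my $fh, '>', $path or die "Cannot create $path: $!";
    close $fh;
}

# Collect every file name carrying the .DELETABLE extension.
opendir my $dh, $dir_threads or die "Cannot open $dir_threads: $!";
my @deletables = grep { /\.DELETABLE$/ } readdir $dh;
closedir $dh;

# unlink() returns the number of files it removed.
my $deleted = 0;
for my $name (@deletables) {
    $deleted += unlink( File::Spec->catfile( $dir_threads, $name ) );
}
print "Deleted $deleted file(s)\n";
```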
To use a Mail::Digest::Tools function, you need to answer two fundamental questions:
What internal structure has the mailing list sponsor provided for a given digest?
How do I want to structure the results of applying Mail::Digest::Tools to a particular digest on my system?
Each of these two questions breaks down into sub-parts. Their answers supply you with the information with which you will construct the two configuration hashes passed to most Mail::Digest::Tools functions. Let us take each in turn.
%config_in: THE INTERNAL STRUCTURE OF A DIGEST
The best way to learn about the internal structure of a mailing list digest
(other than to study the application which created the digest in the first
place) is to accumulate several instances of the digest on your system in a
directory devoted to that purpose. Examine the way the digest's filename is
formed. Then examine the digest file itself. You will soon pick up a feel
for the structure of the digest, which will guide you in configuring
Mail::Digest::Tools for your system. That configuration will take the form
of a Perl hash which, for illustrative purposes, we shall here call
%xxx_config_in where xxx is a short-hand title for a particular digest.
For heuristic purposes we will examine the characteristics of two mailing list digests which the author has been following and archiving for several years: ActiveState's 'Perl-Win32-Users' digest and Yahoo! Groups' Perl Beginners group digest.
We must study a digest's file name in order to be able to write a pattern with which we will be able to distinguish a digest file from any non-digest file sitting in the same directory, as well as to be able to extract the digest number from that file name.
Once saved as plain-text files, Perl-Win32-Users digest files typically look like this in a directory:
Perl-Win32-Users Digest, Vol 1 Issue 1771.txt
Perl-Win32-Users Digest, Vol 1 Issue 1772.txt
Similarly, the Perl Beginners digest files look like this:
[PBML] Digest Number 1491.txt
[PBML] Digest Number 1492.txt
To distinguish Perl-Win32-Users digest files from any other files in
the same directory, we compose a string which will form the core of a Perl
regular expression, i.e., everything in a pattern except the outer
delimiters. Internally, Mail::Digest::Tools passes the file name through a
grep { /regexp/ } pattern, so the first key is called grep_formula.
%pw32u_config_in = (
grep_formula => 'Perl-Win32-Users Digest',
...
);
The equivalent pattern for the Perl Beginners digest would be:
%pbml_config_in = (
grep_formula => '\[PBML\]',
...
);
Note that the [ and ] characters have to be escaped with a \
backslash because they are normally metacharacters inside Perl regular
expressions.
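A quick way to test a grep_formula before committing it to a configuration hash is to run it over a sample directory listing. The sketch below does this for the Perl Beginners pattern; the third file name is made up for the purpose of the test.

```perl
use strict;
use warnings;

my %pbml_config_in = ( grep_formula => '\[PBML\]' );

my @directory_listing = (
    '[PBML] Digest Number 1491.txt',
    '[PBML] Digest Number 1492.txt',
    'notes on PBML.txt',    # hypothetical non-digest file
);

# Interpolating the string into a match treats it as the body of a regex.
my $formula = $pbml_config_in{grep_formula};
my @digests = grep { /$formula/ } @directory_listing;
print "$_\n" for @digests;
```

Only the two names containing the literal string [PBML] are selected; without the backslashes, the brackets would instead form a character class matching any one of the letters P, B, M or L.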
We next have to extract the digest number from the digest's file name.
Certain mailing list programs give individual digests both a 'Volume' number
as well as an individual digest number. Perl-Win32-Users typifies this. In
the example above we need to capture both the 1 as volume number and 1771
as digest number. The next key in our configuration hash is called
pattern_target:
%pw32u_config_in = (
grep_formula => 'Perl-Win32-Users Digest',
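pattern_target => '.*Vol\s(\d+),?\sIssue\s(\d+)\.txt',
...
);
Note the two sets of capturing parentheses. The comma after the volume number is made optional (,?) so that the pattern also matches file names such as those listed above, which have none.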
Other digests, such as those at Yahoo! Groups, dispense with a volume number and simply increment each digest number:
%pbml_config_in = (
grep_formula => '\[PBML\]',
pattern_target => '.*\s(\d+)\.txt$',
...
);
Note that this pattern_target contains only one pair of capturing
parentheses.
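The capturing parentheses can be checked against sample file names in the same way. In this sketch digest_numbers is a hypothetical helper, and the comma in the Perl-Win32-Users pattern is made optional (an assumption) so that it matches the file names listed earlier.

```perl
use strict;
use warnings;

# Hypothetical helper: return whatever the capturing parentheses of a
# pattern_target extract from a file name (an empty list if no match).
sub digest_numbers {
    my ( $file, $pattern_target ) = @_;
    return $file =~ /$pattern_target/;
}

my ( $volume, $number ) = digest_numbers(
    'Perl-Win32-Users Digest, Vol 1 Issue 1771.txt',
    '.*Vol\s(\d+),?\sIssue\s(\d+)\.txt',
);
my ($pbml_number) = digest_numbers(
    '[PBML] Digest Number 1491.txt',
    '.*\s(\d+)\.txt$',
);

print "volume $volume, digest $number; PBML digest $pbml_number\n";
```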
A digest's internal structure is discussed in detail above (see
'A TYPICAL MAILING LIST DIGEST'). Here we need to identify two
characteristics: the way the digest introduces its list of today's topics
and the string it uses to delimit the list of today's topics from the first
individual message in the digest and all subsequent messages from one another.
Continuing with our two examples from above, we provide values for keys
topics_intro and source_msg_delimiter:
%pw32u_config_in = (
grep_formula => 'Perl-Win32-Users Digest',
pattern_target => '.*Vol\s(\d+),?\sIssue\s(\d+)\.txt',
topics_intro => 'Today\'s Topics:',
source_msg_delimiter => "--__--__--\n\n",
...
);
Note the escaped ' apostrophe character in the value for key
topics_intro.
%pbml_config_in = (
grep_formula => '\[PBML\]',
pattern_target => '.*\s(\d+)\.txt$',
topics_intro => 'Topics in this digest:',
source_msg_delimiter => "________________________________________________________________________\n________________________________________________________________________\n\n",
...
);
Note that the values provided for the respective source_msg_delimiter keys
had to be double-quoted strings. That's because all such delimiters include
two or more \n newline characters so that they form paragraphs unto
themselves. Unless indicated otherwise, the values for all other keys in
the configuration hash are single-quoted strings.
Note: In early 2004, while Mail::Digest::Tools was being prepared for its
initial distribution on CPAN, ActiveState changed certain features in the
daily digest versions of its mailing lists. Hence, the code example presented
above should not be 'copied-and-pasted' into a configuration hash with which
you, the user, might follow the current Perl-Win32-Users digest. In
particular, the source message delimiter was changed to a string of 30
hyphens followed by 2 \n newline characters:
"------------------------------\n\n"
However, since it is not unheard of for contributors to a mailing list to use such a string of hyphens within their postings or signatures, using a string of hyphens is not a particularly apt choice for a source message delimiter. In this particular case, the author is getting better (but not fully tested) results by including an additional newline before the hyphen string in order to more uniquely identify the source message delimiter:
"\n------------------------------\n\n"
The internal structure of an individual message within a digest is also discussed in detail above. Here we need to identify patterns with which we can extract the content of the message's headers.
Certain mailing list digest programs allow a wide variety of headers to appear in digested messages. The Perl-Win32-Users digest typifies this. Each message in a Perl-Win32-Users digest must have a message number and headers for the message's author, recipients, subject and date.
Message: 1
From: Chris Smithson <ChrisSmithson@some.web.address.com>
To: "'Carter Kraus'" <carter@some.web.address.com>,
"Perl-Win32-Users (E-mail)" <perl-win32-users@activestate.com>
Subject: RE: OO Perl Issue.
Date: Wed, 4 Feb 2004 14:17:24 -0600
But a message in this digest may have additional headers for the author's organization, reply address and/or carbon-copy recipients.
Message: 5
Date: Wed, 4 Feb 2004 15:15:44 -0800
From: Sam Spade <sspade@some.web.address.com>
Organization: Some Web Address
Reply-To: Sam Spade <sspade@some.web.address.com>
To: "Time" <summers@some.web.address.com>
CC: "Perl List" <perl-win32-users@listserv.activestate.com>
Subject: Re: New IE Update causes script problems
Patterns are easily developed to capture this information and store it in the configuration hash:
%pw32u_config_in = (
grep_formula => 'Perl-Win32-Users Digest',
pattern_target => '.*Vol\s(\d+),?\sIssue\s(\d+)\.txt',
topics_intro => 'Today\'s Topics:',
source_msg_delimiter => "--__--__--\n\n",
message_style_flag => '^Message:\s+(\d+)$',
from_style_flag => '^From:\s+(.+)$',
org_style_flag => '^Organization:\s+(.+)$',
to_style_flag => '^To:\s+(.+)$',
cc_style_flag => '^CC:\s+(.+)$',
subject_style_flag => '^Subject:\s+(.+)$',
date_style_flag => '^Date:\s+(.+)$',
reply_to_style_flag => '^Reply-To:\s+(.+)$',
...
);
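To see what these patterns capture, they can be applied to one of the sample messages above. This sketch assumes the patterns are matched line by line (here via the /m modifier); whether Mail::Digest::Tools applies them in exactly this way is not stated in this document.

```perl
use strict;
use warnings;

my %style_flags = (
    message_style_flag => '^Message:\s+(\d+)$',
    from_style_flag    => '^From:\s+(.+)$',
    cc_style_flag      => '^CC:\s+(.+)$',
    subject_style_flag => '^Subject:\s+(.+)$',
    date_style_flag    => '^Date:\s+(.+)$',
);

my $message = <<'END_MESSAGE';
Message: 5
Date: Wed, 4 Feb 2004 15:15:44 -0800
From: Sam Spade <sspade@some.web.address.com>
CC: "Perl List" <perl-win32-users@listserv.activestate.com>
Subject: Re: New IE Update causes script problems
END_MESSAGE

# /m lets ^ and $ anchor at the start and end of each line of the message.
my %header;
for my $flag ( sort keys %style_flags ) {
    my $pattern = $style_flags{$flag};
    ( my $name = $flag ) =~ s/_style_flag$//;
    $header{$name} = $1 if $message =~ /$pattern/m;
}

print "$_: $header{$_}\n" for sort keys %header;
```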
Other mailing list digest programs allow far fewer headers in digested messages. Yahoo! Groups digests such as Perl Beginners typify this.
Message: 4
Date: Sun, 7 Dec 2003 19:24:03 +1100
From: Philip Streets <phil@some.web.address.com.au>
Subject: RH9.0, perl 5.8.2 and qmail-localfilter question
The patterns developed to capture this information and store it in the configuration hash would be as follows:
%pbml_config_in = (
grep_formula => '\[PBML\]',
pattern_target => '.*\s(\d+)\.txt$',
topics_intro => 'Topics in this digest:',
source_msg_delimiter => "________________________________________________________________________\n________________________________________________________________________\n\n",
message_style_flag => '^Message:\s+(\d+)$',
from_style_flag => '^\s+From:\s+(.+)$',
subject_style_flag => '^Subject:\s+(.+)$',
date_style_flag => '^\s+Date:\s+(.+)$',
...
);
Note that the from_style_flag and date_style_flag patterns are written to
expect one or more whitespace characters at the beginning of the line.
We could -- but do not need to -- add the following key-value pairs to the
%pbml_config_in hash.
org_style_flag => undef,
to_style_flag => undef,
cc_style_flag => undef,
reply_to_style_flag => undef,
Certain mailing lists allow subscribers to post messages in either plain-text or HTML. Certain lists allow subscribers to post attachments; others do not. When it comes to preparing digests of these messages, the approaches which different lists take lead to different results. The most annoying situation occurs when a list allows a subscriber to post in 'multipart MIME format' and then fails to strip out the redundant HTML part after printing the needed plain-text part.
Example: An all too typical example from an older version of an ActiveState list digest. (ActiveState changed the format of its digests in early 2004 to strip out HTML attachments. Hence, the following code no longer accurately represents what a subscriber to an ActiveState digest will see. Other mailing lists still suffer from MIME bloat, however, so treat the following code as illustrative.) The message begins:
Message: 1
To: Perl-Win32-Users@activestate.com
Subject: Can not tie STDOUT to scolled Tk widget
From: John_Wonderman@some.web.address.ca
Date: Thu, 15 Jan 2004 16:25:17 -0500
This is a multipart message in MIME format.
--=_alternative 00750F0485256E1C_=
Content-Type: text/plain; charset="US-ASCII"
Hi;
I am trying to implement a scrolling text widget to capture output for for
at tk app. Without scrolling:
my $text = $mw->Text(-width => 78,
-height => 32,
-wrap => 'word',
-font => ['Courier New','11']
)->pack(-side => 'bottom',
-expand => 1,
-fill => 'both',
);
...
When the plain-text part of the message is finished, it is then repeated in HTML:
--=_alternative 00750F0485256E1C_=
Content-Type: text/html; charset="US-ASCII"
<br><font size=2 face="Tahoma">Hi;</font>
<p><font size=2 face="Tahoma">I am trying to implement a scrolling text
widget to capture output for for at tk app. Without scrolling:</font>
<p><font size=2 face="Bitstream Vera Sans Mono">my $text = $mw->Text(-width
=> 78,</font>
<br><font size=2 face="Bitstream Vera Sans Mono">
-height => 32,</font>
<br><font size=2 face="Bitstream Vera Sans Mono">
-wrap => 'word',</font>
<br><font size=2 face="Bitstream Vera Sans Mono">
-font => ['Courier New','11']</font>
<br><font size=2 face="Bitstream Vera Sans Mono">)->pack(-side =>
'bottom',</font>
<br><font size=2 face="Bitstream Vera Sans Mono">
-expand => 1,</font>
<br><font size=2 face="Bitstream Vera Sans Mono">
-fill => 'both',</font>
There is no reason to retain this bloat in your thread file. The digest providers should have stripped it out, but the program they were using failed to do so. Other digests, such as those at Yahoo! Groups, eliminate all this blather.
Now, with Mail::Digest::Tools, you can eliminate much of the bloat yourself.
After examining 6-10 instances of a particular mailing list digest, you should
be able to determine whether the digest needs a dose of digital castor oil or
not, and you set key MIME_cleanup_flag accordingly. If the digest contains
unnecessary multipart MIME content, you set this flag to 1; otherwise, to
0.
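One rough way to survey a saved digest is to count the HTML parts it contains. The sketch below is only a heuristic for deciding how to set MIME_cleanup_flag; the helper count_html_parts() is hypothetical and is not how Mail::Digest::Tools itself detects or strips multipart MIME content.

```perl
use strict;
use warnings;

# Hypothetical helper: count 'Content-Type: text/html' lines in digest text
sub count_html_parts {
    my ($text) = @_;
    my $count = () = $text =~ m{^Content-Type:\s+text/html}mig;
    return $count;
}

# A short excerpt modelled on the ActiveState example above
my $sample = <<'END_SAMPLE';
--=_alternative 00750F0485256E1C_=
Content-Type: text/plain; charset="US-ASCII"
Hi;
--=_alternative 00750F0485256E1C_=
Content-Type: text/html; charset="US-ASCII"
<br><font size=2 face="Tahoma">Hi;</font>
END_SAMPLE

my $html_parts = count_html_parts($sample);
print "HTML parts found: $html_parts\n";
print "Suggested MIME_cleanup_flag: ", ($html_parts ? 1 : 0), "\n";
```

Running such a check over the 6-10 digest files you have collected gives a quick indication, but reading the files remains the more reliable test.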
And with that you have completed your analysis of the internal structure of a given digest and entered the relevant information into the first configuration hash:
%pw32u_config_in = (
grep_formula => 'Perl-Win32-Users digest',
pattern_target => '.*Vol\s(\d+),\sIssue\s(\d+)\.txt',
topics_intro => 'Today\'s Topics:',
source_msg_delimiter => "--__--__--\n\n",
message_style_flag => '^Message:\s+(\d+)$',
from_style_flag => '^From:\s+(.+)$',
org_style_flag => '^Organization:\s+(.+)$',
to_style_flag => '^To:\s+(.+)$',
cc_style_flag => '^CC:\s+(.+)$',
subject_style_flag => '^Subject:\s+(.+)$',
date_style_flag => '^Date:\s+(.+)$',
reply_to_style_flag => '^Reply-To:\s+(.+)$',
MIME_cleanup_flag => 1,
);
%pbml_config_in = (
grep_formula => '\[PBML\]',
pattern_target => '.*\s(\d+)\.txt$',
topics_intro => 'Topics in this digest:',
source_msg_delimiter => "________________________________________________________________________\n________________________________________________________________________\n\n",
message_style_flag => '^Message:\s+(\d+)$',
from_style_flag => '^\s+From:\s+(.+)$',
subject_style_flag => '^Subject:\s+(.+)$',
date_style_flag => '^\s+Date:\s+(.+)$',
MIME_cleanup_flag => 0,
);
%config_out: HOW TO PROCESS A DIGEST ON YOUR SYSTEM
%config_in holds the answers to the question: What internal structure has
the mailing list sponsor provided for a given digest? In contrast,
%config_out will hold the answer to this question: How do I want to
structure the results of applying Mail::Digest::Tools to a particular digest
on my system?
For purposes of illustration, we will continue to assume that we are processing digest files received from the Perl-Win32-Users and Perl Beginner lists. We will make slightly different choices as to how we process those digest files so as to illustrate different options available from Mail::Digest::Tools.
We shall also assume that we are going to place the scripts from which we call
Mail::Digest::Tools functions in the directory above the directories in
which we store the digest files once they have been saved as plain-text files.
If we call this directory digest and place the scripts in that directory,
then we will have a directory structure that starts out like this:
digest/
    process_new.pl
    process_ALL.pl
    reply_digest_message.pl
    repair_digest_order.pl
    consolidate_threads.pl
    deletables.pl
    pw32u/
        Perl-Win32-Users Digest, Vol 1 Issue 1771.txt
        Perl-Win32-Users Digest, Vol 1 Issue 1772.txt
    pbml/
        [PBML] Digest Number 1491.txt
        [PBML] Digest Number 1492.txt
%config_out Keys
There are 9 keys which are required in %config_out in order for
Mail::Digest::Tools to function properly. They correspond to 9 decisions
which you must make in setting up a Mail::Digest::Tools configuration on
your system.
Each digest must be given a title which is used whenever Mail::Digest::Tools
needs to prompt or warn you on standard output. The key which holds this
information in %config_out must be called title; the value for this
element should be sensible.
%pw32u_config_out = (
title => 'Perl-Win32-Users',
...
);
%pbml_config_out = (
title => 'Perl Beginner',
...
);
For each digest a directory must be designated where individual digest files
are stored in plain-text format. The key which holds this information in
%config_out must be called dir_digest. In the examples below
directories are named relative to the 'current' directory (..),
i.e., the directory where the script invoking a
Mail::Digest::Tools function is stored.
%pw32u_config_out = (
title => 'Perl-Win32-Users',
dir_digest => "../pw32u",
...
);
%pbml_config_out = (
title => 'Perl Beginner',
dir_digest => "../pbml",
...
);
For each digest a directory must be designated where the thread files created
by use of Mail::Digest::Tools functions are stored. The key which holds this
information in %config_out must be called dir_threads. In the examples
below the threads directory is a subdirectory of the digest directory, but
you may make other choices.
%pw32u_config_out = (
title => 'Perl-Win32-Users',
dir_digest => "../pw32u",
dir_threads => "../pw32u/Threads",
...
);
%pbml_config_out = (
title => 'Perl Beginner',
dir_digest => "../pbml",
dir_threads => "../pbml/Threads",
...
);
For each digest a file must be kept which logs whether a given digest file
has already been processed or not and, if so, when. The key which holds this
information in %config_out must be called digests_log. It has been
found convenient to keep this file in the digests directory, but you may make
other choices.
%pw32u_config_out = (
title => 'Perl-Win32-Users',
dir_digest => "../pw32u",
dir_threads => "../pw32u/Threads",
digests_log => "../pw32u/digests_log.txt",
...
);
%pbml_config_out = (
title => 'Perl Beginner',
dir_digest => "../pbml",
dir_threads => "../pbml/Threads",
digests_log => "../pbml/digests_log.txt",
...
);
For each digest a file must be kept which holds an ongoing record of the
list of topics found in each individual digest file. The key which holds this
information in %config_out must be called todays_topics. It has been
found convenient to keep this file in the digests directory, but you may make
other choices.
%pw32u_config_out = (
title => 'Perl-Win32-Users',
dir_digest => "../pw32u",
dir_threads => "../pw32u/Threads",
digests_log => "../pw32u/digests_log.txt",
todays_topics => "../pw32u/todays_topics.txt",
...
);
%pbml_config_out = (
title => 'Perl Beginner',
dir_digest => "../pbml",
dir_threads => "../pbml/Threads",
digests_log => "../pbml/digests_log.txt",
todays_topics => "../pbml/todays_topics.txt",
...
);
For each digest you must choose how to format the number(s) of the individual
digest file being processed when messages from that file are written to a
threads file. What you are doing here is formatting the information captured
by the pattern_target key in a given digest's %config_in (see above).
You express this choice as a single-quoted string which formats the data
captured by the Perl regular expression in pattern_target. This
formatting is done via the Perl sprintf function. The resulting string
is assigned to be the value of %config_out key id_format.
We saw above that digests from the Perl-Win32-Users list carried both a volume number and an individual digest number.
Perl-Win32-Users Digest, Vol 1 Issue 1771.txt
Perl-Win32-Users Digest, Vol 1 Issue 1772.txt
Both numbers were captured by the Perl regular expression in
%pw32u_config_in key pattern_target.
'.*Vol\s(\d+),\sIssue\s(\d+)\.txt',
Here we have chosen to format the volume number as a 3-digit, 0-padded number and the individual digest number as a 4-digit, 0-padded number. We then join these two data with an underscore.
%pw32u_config_out = (
title => 'Perl-Win32-Users',
dir_digest => "../pw32u",
dir_threads => "../pw32u/Threads",
digests_log => "../pw32u/digests_log.txt",
todays_topics => "../pw32u/todays_topics.txt",
id_format => 'sprintf("%03d",$1) . \'_\' . sprintf("%04d",$2)',
...
);
We saw above that digests from the Perl Beginners list carried only a digest number -- no volume number.
[PBML] Digest Number 1491.txt
[PBML] Digest Number 1492.txt
This number was captured by the Perl regular expression in %pbml_config_in
key pattern_target.
'.*\s(\d+)\.txt$'
Here we have chosen to format the digest number as a 5-digit, 0-padded number.
%pbml_config_out = (
title => 'Perl Beginner',
dir_digest => "../pbml",
dir_threads => "../pbml/Threads",
digests_log => "../pbml/digests_log.txt",
todays_topics => "../pbml/todays_topics.txt",
id_format => 'sprintf("%05d",$1)',
...
);
Note that if you allow for a 4-digit number, the highest numbered digest you
can process off a given mailing list will be 9999. If you allow for a
5-digit number, the upper limit will be 99999. The latter should be
sufficient for a lifetime even for a mailing list (e.g., London.pm) which
generates 3 or 4 digest files per day or over 1000 per year.
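The interplay between pattern_target and id_format can be tried out in isolation. The sketch below assumes that the id_format string is evaluated as Perl code immediately after the file name has matched pattern_target, so that $1 and $2 still hold the captured numbers. That is an inference from the form of the string, not a statement about the module's internals, and the helper format_digest_id() is hypothetical.

```perl
use strict;
use warnings;

# Values copied from the sample configuration hashes
my %pw32u = (
    pattern_target => '.*Vol\s(\d+),\sIssue\s(\d+)\.txt',
    id_format      => 'sprintf("%03d",$1) . \'_\' . sprintf("%04d",$2)',
);
my %pbml = (
    pattern_target => '.*\s(\d+)\.txt$',
    id_format      => 'sprintf("%05d",$1)',
);

# Hypothetical helper: match a file name, then evaluate the format string
sub format_digest_id {
    my ($filename, $config_ref) = @_;
    return unless $filename =~ /$config_ref->{pattern_target}/;
    my $id = eval $config_ref->{id_format};   # $1, $2 come from the match
    die "Could not evaluate id_format: $@" if $@;
    return $id;
}

# prints 001_1771
print format_digest_id(
    'Perl-Win32-Users Digest, Vol 1 Issue 1771.txt', \%pw32u), "\n";
# prints 01491
print format_digest_id('[PBML] Digest Number 1491.txt', \%pbml), "\n";
```

Trying a format string this way before processing real digests makes it easy to confirm that the padding widths are what you intended.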
For each digest you must choose how to format the number which the digest
assigns to its individual messages. Experience suggests that 2 digits should
be more than sufficient to format this number, as all digests which the author
has observed have fewer than 100 entries. However, below we have arbitrarily
decided to allow for up to 9999 entries in a given digest. As with the digest
number, the formatting is accomplished via the Perl sprintf function.
The result is stored in a %config_out key which must be called
output_id_format.
%pw32u_config_out = (
title => 'Perl-Win32-Users',
dir_digest => "../pw32u",
dir_threads => "../pw32u/Threads",
digests_log => "../pw32u/digests_log.txt",
todays_topics => "../pw32u/todays_topics.txt",
id_format => 'sprintf("%03d",$1) .
\'_\' . sprintf("%04d",$2)',
output_id_format => 'sprintf("%04d",$1)',
...
);
%pbml_config_out = (
title => 'Perl Beginner',
dir_digest => "../pbml",
dir_threads => "../pbml/Threads",
digests_log => "../pbml/digests_log.txt",
todays_topics => "../pbml/todays_topics.txt",
id_format => 'sprintf("%05d",$1)',
output_id_format => 'sprintf("%04d",$1)',
...
);
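The message number can be formatted in the same way as the digest number. The sketch below assumes that output_id_format is evaluated right after a line has matched message_style_flag. Joining the digest identifier and the message identifier with an underscore is an inference from the identifiers shown in the mimelog sample later in this document. The helper format_message_id() is hypothetical.

```perl
use strict;
use warnings;

my $message_style_flag = '^Message:\s+(\d+)$';
my $output_id_format   = 'sprintf("%04d",$1)';

# Hypothetical helper: format the number found on a 'Message: n' line
sub format_message_id {
    my ($line, $pattern, $format) = @_;
    return unless $line =~ /$pattern/;
    my $id = eval $format;    # $1 still holds the captured message number
    die "Could not evaluate output_id_format: $@" if $@;
    return $id;
}

my $digest_id  = '001_1775';  # as id_format would yield for Vol 1, Issue 1775
my $message_id = format_message_id(
    'Message: 3', $message_style_flag, $output_id_format);

print join('_', $digest_id, $message_id), "\n";    # prints 001_1775_0003
```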
For each digest you must compose a string which will separate one message in
a threads file from its successor. This string must be double-quoted and
assigned to %config_out key thread_msg_delimiter. For readability, this
string should terminate in two or more newline characters (\n\n) so that the
delimiter is always a paragraph unto itself.
This delimiter may -- or may not -- be the same string which the mailing list
provider uses to separate messages in the digest files themselves. In other
words, you may choose to use the same string for thread_msg_delimiter in
%config_out as you reported the list provider used in %config_in key
source_msg_delimiter.
In the example below we make the thread_msg_delimiter for the output from
Perl-Win32-Users to be the same as its source_msg_delimiter.
%pw32u_config_out = (
title => 'Perl-Win32-Users',
dir_digest => "../pw32u",
dir_threads => "../pw32u/Threads",
digests_log => "../pw32u/digests_log.txt",
todays_topics => "../pw32u/todays_topics.txt",
id_format => 'sprintf("%03d",$1) .
\'_\' . sprintf("%04d",$2)',
output_id_format => 'sprintf("%04d",$1)',
thread_msg_delimiter => "--__--__--\n\n",
...
);
Note: In light of the earlier discussion of the changes ActiveState made
to its mailing list digests in early 2004, the reader is cautioned that the
code above should not be directly 'copied-and-pasted' into a configuration
hash with which you might follow an ActiveState mailing list. Treat it as
educational. In particular, the author is now testing the following as a
setting for $pw32u_config_out{'thread_msg_delimiter'}:
"\n--__--__--\n\n",
For threads generated by applying Mail::Digest::Tools to the Perl Beginners list, we choose an output message delimiter which differs from the source message delimiter.
%pbml_config_out = (
title => 'Perl Beginner',
dir_digest => "../pbml",
dir_threads => "../pbml/Threads",
digests_log => "../pbml/digests_log.txt",
todays_topics => "../pbml/todays_topics.txt",
id_format => 'sprintf("%05d",$1)',
output_id_format => 'sprintf("%04d",$1)',
thread_msg_delimiter => "_*_*_*_*_*_\n_*_*_*_*_*_\n\n\n",
...
);
Whatever choice you make for the thread_msg_delimiter it should be a string
unlikely to occur within the text of a message and should terminate in two or
more newlines.
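Both recommendations can be checked mechanically before a delimiter is adopted. The following is a minimal sketch; the helper delimiter_problems() and the sample messages are hypothetical, and the exact way Mail::Digest::Tools writes the delimiter into a thread file is not shown here.

```perl
use strict;
use warnings;

# Hypothetical check: does a proposed delimiter end in two or more newlines,
# and is it absent from the message texts we have on hand?
sub delimiter_problems {
    my ($delimiter, @messages) = @_;
    my @problems;
    push @problems, 'does not end in two or more newlines'
        unless $delimiter =~ /\n{2,}\z/;
    push @problems, 'occurs inside a message'
        if grep { index($_, $delimiter) >= 0 } @messages;
    return @problems;
}

my $delimiter = "_*_*_*_*_*_\n_*_*_*_*_*_\n\n\n";
my @messages  = ("First message text.\n", "Second message text.\n");

my @problems = delimiter_problems($delimiter, @messages);
print @problems ? "Problems: @problems\n" : "Delimiter looks usable\n";

# Joining and re-splitting on the delimiter recovers the original messages
my $thread    = join $delimiter, @messages;
my @recovered = split /\Q$delimiter\E/, $thread;
print scalar(@recovered), " messages recovered\n";
```

The \Q...\E in the split is needed because the delimiter contains characters such as * which are special in a regular expression.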
For each digest you process with Mail::Digest::Tools, you must decide whether
to archive the resulting thread files in a separate
directory after a specified period of time, to delete them from disk
after a specified period of time, or to do neither and allow them to
accumulate indefinitely in the threads directory. Your decision is represented
as the value of %config_out key archive_kill_trigger. This value must
be one of three numerical values:
 0    Thread files are neither archived nor deleted
 1    Thread files are archived in a separate directory (or directories)
      after the number of days specified by key 'archive_kill_days'
      (see below)
-1    Thread files are deleted after n days as specified by key
      'archive_kill_days'
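The three settings can be read as a small decision rule. The sketch below is an illustration only: the helper action_for_thread() is hypothetical, and it assumes that a thread file becomes eligible once its age in days reaches the value of archive_kill_days, whereas the module may draw the line slightly differently.

```perl
use strict;
use warnings;

# Hypothetical decision helper for a single thread file.
# In a real script the age could come from -M $file, which gives the
# number of days since the file was last modified.
sub action_for_thread {
    my ($trigger, $age_in_days, $kill_days) = @_;
    die "archive_kill_trigger must be 0, 1 or -1, not '$trigger'\n"
        unless grep { $trigger == $_ } (0, 1, -1);
    return 'keep' if $trigger == 0 or $age_in_days < $kill_days;
    return $trigger == 1 ? 'archive' : 'delete';
}

for my $case ([0, 30], [1, 30], [-1, 30], [1, 3]) {
    my ($trigger, $age) = @$case;
    printf "trigger %2d, age %2d days: %s\n",
        $trigger, $age, action_for_thread($trigger, $age, 14);
}
```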
In the examples below we have chosen to archive all threads generated by the Perl-Win32-Users list but to kill all threads generated by the Perl Beginner list after a number of days whose specification we shall come to shortly.
%pw32u_config_out = (
title => 'Perl-Win32-Users',
dir_digest => "../pw32u",
dir_threads => "../pw32u/Threads",
digests_log => "../pw32u/digests_log.txt",
todays_topics => "../pw32u/todays_topics.txt",
id_format => 'sprintf("%03d",$1) . \'_\' .
sprintf("%04d",$2)',
output_id_format => 'sprintf("%04d",$1)',
thread_msg_delimiter => "--__--__--\n\n",
archive_kill_trigger => 1,
...
);
%pbml_config_out = (
title => 'Perl Beginner',
dir_digest => "../pbml",
dir_threads => "../pbml/Threads",
digests_log => "../pbml/digests_log.txt",
todays_topics => "../pbml/todays_topics.txt",
id_format => 'sprintf("%05d",$1)',
output_id_format => 'sprintf("%04d",$1)',
thread_msg_delimiter => "_*_*_*_*_*_\n_*_*_*_*_*_\n\n\n",
archive_kill_trigger => -1,
...
);
This completes the 9 required keys for %config_out. We now turn to keys
which are either optional or which are required if you have assigned a value
of 1 or -1 to key archive_kill_trigger.
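Since all 9 keys are required, it can be worth checking a configuration hash before handing it to any Mail::Digest::Tools function. The following is a minimal sketch; the helper missing_required_keys() is hypothetical and is not part of the module.

```perl
use strict;
use warnings;

# The 9 required %config_out keys described above
my @required_keys = qw(
    title dir_digest dir_threads digests_log todays_topics
    id_format output_id_format thread_msg_delimiter archive_kill_trigger
);

# Hypothetical helper: list required keys that are missing or undefined
sub missing_required_keys {
    my ($config_out_ref) = @_;
    return grep { !defined $config_out_ref->{$_} } @required_keys;
}

# A deliberately incomplete hash for demonstration
my %pbml_config_out = (
    title      => 'Perl Beginner',
    dir_digest => "../pbml",
);

my @missing = missing_required_keys(\%pbml_config_out);
print "Missing keys: @missing\n" if @missing;
```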
%config_out Keys
As an option, Mail::Digest::Tools offers a file to log which instances of a
particular digest have previously been processed, one which is more
human-readable than the file named in %config_out key digests_log.
That file logs a digest as follows:
001_9999;Fri Feb 6 18:57:41 2004;Fri Feb 6 18:57:41 2004
It is probably easier to read this data like this:
09999:
first processed at Fri Feb 6 18:57:41 2004
most recently processed at Fri Feb 6 18:57:41 2004
To choose this option you need to set two keys in %config_out:
digests_read_flag
This must be assigned a true value such as 1. This tells
Mail::Digest::Tools that you indeed want a 'digests read' file.
digests_read
This should be assigned the name of the 'digests read' file, but it will
default to a file digests_read.txt placed in the directory named by key
dir_digest.
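The relationship between the two log formats can be sketched as a simple transformation of one record. The code below assumes that the fields of a digests_log record are separated by semicolons, as in the sample line above. The helper readable_log_entry() is hypothetical, and the indentation it produces is a guess at the layout of the 'digests read' file.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one digests_log record into the readable layout
sub readable_log_entry {
    my ($record) = @_;
    chomp $record;
    my ($id, $first, $latest) = split /;/, $record;
    return "$id:\n"
         . "    first processed at $first\n"
         . "    most recently processed at $latest\n";
}

print readable_log_entry(
    '001_9999;Fri Feb 6 18:57:41 2004;Fri Feb 6 18:57:41 2004'
);
```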
Adding these keys to our %config_out, we get:
%pw32u_config_out = (
title => 'Perl-Win32-Users',
dir_digest => "../pw32u",
dir_threads => "../pw32u/Threads",
digests_log => "../pw32u/digests_log.txt",
todays_topics => "../pw32u/todays_topics.txt",
id_format => 'sprintf("%03d",$1) . \'_\' .
sprintf("%04d",$2)',
output_id_format => 'sprintf("%04d",$1)',
thread_msg_delimiter => "--__--__--\n\n",
archive_kill_trigger => 1,
digests_read_flag => 1,
digests_read => "../pw32u/digests_read.txt",
...
);
%pbml_config_out = (
title => 'Perl Beginner',
dir_digest => "../pbml",
dir_threads => "../pbml/Threads",
digests_log => "../pbml/digests_log.txt",
todays_topics => "../pbml/todays_topics.txt",
id_format => 'sprintf("%05d",$1)',
output_id_format => 'sprintf("%04d",$1)',
thread_msg_delimiter => "_*_*_*_*_*_\n_*_*_*_*_*_\n\n\n",
archive_kill_trigger => -1,
digests_read_flag => 1,
digests_read => "../pbml/digests_read.txt",
...
);
If, as discussed above, you have assigned the value 1 to the
archive_kill_trigger key in %config_out, then Mail::Digest::Tools
will archive older thread files, i.e., it will move thread files from the
directory specified in key dir_threads to an archive directory if the
thread file has not been modified in a specified number of days. If new
messages need to be posted to a thread file which has been archived, that
file will be de-archived and brought back to the dir_threads directory.
Thread files which are either archived or de-archived via a call to
process_new_digests() or reprocess_ALL_digests() will be logged in
appropriately named files.
Hence, the keys you will need to define when archiving thread files are:
archive_kill_days
This key must be assigned the number of days after which a thread file sitting
in the dir_threads directory is moved to an archive directory. If not
specified, it defaults to 14 days.
dir_archive_top
This key must be assigned the name of the top archive directory, i.e., the directory at the top of a tree of archive directories.
When you track a particular mailing list digest for a number of years, the number of different thread files can grow to enormous proportions. For example, the author has tracked over 10,000 distinct thread files from the Perl-Win32-Users list over a three-and-a-half year period. 10,000 files in a single directory is completely unwieldy and slows directory read-times tremendously. Mail::Digest::Tools therefore by default provides a tree of archive directories: a top directory which contains no thread files but instead holds 27 subdirectories, one for each letter of the English alphabet and one for thread files which start with any other character (guaranteed to work with ASCII only; not tested with other character sets).
dir_archive_top
    a
    b
    c
    ...
    z
    other
The user gets to choose where to place the top archive directory but the 27
subdirectories are automatically placed beneath that one. The top archive
directory is the value assigned to %config_out key dir_archive_top.
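The mapping from a thread file to its archive subdirectory can be pictured as follows. This is a sketch under two assumptions: that the subdirectory is chosen by the first character of the thread file's name, and that the comparison ignores case. The helper archive_subdir() and the file names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: choose the archive subdirectory for a thread file
sub archive_subdir {
    my ($dir_archive_top, $thread_file) = @_;
    my $first  = lc substr($thread_file, 0, 1);
    my $subdir = $first =~ /\A[a-z]\z/ ? $first : 'other';
    return "$dir_archive_top/$subdir";
}

my $top = '../pw32u/Threads/archive';
print archive_subdir($top, 'Tk scrolling.thr.txt'), "\n";
print archive_subdir($top, '[OT] lunch.thr.txt'), "\n";
```

A file whose name begins with a letter lands in the subdirectory for that letter; anything else lands in other.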
archived_today
This key should be assigned the name of a file which will log any and all
files archived by a single call to process_new_digests() or
reprocess_ALL_digests(). (By 'single' call is meant that this is not
an ongoing log; it only shows what happened today.) If not assigned a value,
it will default to a file called archived_today.txt located in the
directory named by key dir_digest.
de_archived_today
This key should be assigned the name of a file which will log any and all
files de-archived by a single call to process_new_digests() or
reprocess_ALL_digests(). (By 'single' call is meant that this is not
an ongoing log; it only shows what happened today.) If not assigned a value,
it will default to a file called de_archived_today.txt located in the
directory named by key dir_digest.
archive_config
This key is reserved for future use. In the current version of
Mail::Digest::Tools it does not need to be set, but, should you be obsessive
about this, set it to 0.
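The defaults described above can be made explicit in your own configuration if you prefer not to rely on them silently. The sketch below mirrors the documented defaults (14 days, and log files placed in the dir_digest directory). The helper with_archive_defaults() is hypothetical; the module applies its own defaults internally.

```perl
use strict;
use warnings;

# Hypothetical helper: fill in the documented defaults for archive keys
sub with_archive_defaults {
    my (%config_out) = @_;
    my $dir = $config_out{dir_digest};
    $config_out{archive_kill_days} = 14
        unless defined $config_out{archive_kill_days};
    for my $key (qw(archived_today de_archived_today)) {
        $config_out{$key} = "$dir/$key.txt"
            unless defined $config_out{$key};
    }
    return %config_out;
}

my %pw32u_config_out = with_archive_defaults(
    dir_digest           => "../pw32u",
    archive_kill_trigger => 1,
    dir_archive_top      => "../pw32u/Threads/archive",
);

print "$_ => $pw32u_config_out{$_}\n" for sort keys %pw32u_config_out;
```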
Adding these keys to our sample %config_out hashes, we get:
%pw32u_config_out = (
title => 'Perl-Win32-Users',
dir_digest => "../pw32u",
dir_threads => "../pw32u/Threads",
digests_log => "../pw32u/digests_log.txt",
todays_topics => "../pw32u/todays_topics.txt",
id_format => 'sprintf("%03d",$1) . \'_\' .
sprintf("%04d",$2)',
output_id_format => 'sprintf("%04d",$1)',
thread_msg_delimiter => "--__--__--\n\n",
archive_kill_trigger => 1,
digests_read_flag => 1,
digests_read => "../pw32u/digests_read.txt",
archive_kill_days => 14,
dir_archive_top => "../pw32u/Threads/archive",
archived_today => "../pw32u/archived_today.txt",
de_archived_today => "../pw32u/de_archived_today.txt",
...
);
%pbml_config_out = (
title => 'Perl Beginner',
dir_digest => "../pbml",
dir_threads => "../pbml/Threads",
digests_log => "../pbml/digests_log.txt",
todays_topics => "../pbml/todays_topics.txt",
id_format => 'sprintf("%05d",$1)',
output_id_format => 'sprintf("%04d",$1)',
thread_msg_delimiter => "_*_*_*_*_*_\n_*_*_*_*_*_\n\n\n",
archive_kill_trigger => -1,
digests_read_flag => 1,
digests_read => "../pbml/digests_read.txt",
...
);
Note that since in our example we chose not to archive thread files from
the Perl Beginner list -- as evinced by the assignment of -1 to key
archive_kill_trigger -- we do not need to assign any values to
dir_archive_top, archived_today or de_archived_today in
%pbml_config_out.
The keys needed for %config_out when you have chosen to delete thread
files after a specified interval parallel those you would have needed if you
had chosen to archive those files instead.
archive_kill_days
This key must be assigned the number of days after which a thread file sitting
in the dir_threads directory is deleted. If not specified, it defaults
to 14 days.
deleted_today
This key should be assigned the name of a file which will log any and all
files deleted by a single call to process_new_digests() or
reprocess_ALL_digests(). (By 'single' call is meant that this is not
an ongoing log; it only shows what happened today.) If not assigned a value,
it will default to a file called deleted_today.txt located in the
directory named by key dir_digest.
Adding these keys to our sample %config_out hashes, we get:
%pw32u_config_out = (
title => 'Perl-Win32-Users',
dir_digest => "../pw32u",
dir_threads => "../pw32u/Threads",
digests_log => "../pw32u/digests_log.txt",
todays_topics => "../pw32u/todays_topics.txt",
id_format => 'sprintf("%03d",$1) . \'_\' .
sprintf("%04d",$2)',
output_id_format => 'sprintf("%04d",$1)',
thread_msg_delimiter => "--__--__--\n\n",
archive_kill_trigger => 1,
digests_read_flag => 1,
digests_read => "../pw32u/digests_read.txt",
archive_kill_days => 14,
dir_archive_top => "../pw32u/Threads/archive",
archived_today => "../pw32u/archived_today.txt",
de_archived_today => "../pw32u/de_archived_today.txt",
...
);
%pbml_config_out = (
title => 'Perl Beginner',
dir_digest => "../pbml",
dir_threads => "../pbml/Threads",
digests_log => "../pbml/digests_log.txt",
todays_topics => "../pbml/todays_topics.txt",
id_format => 'sprintf("%05d",$1)',
output_id_format => 'sprintf("%04d",$1)',
thread_msg_delimiter => "_*_*_*_*_*_\n_*_*_*_*_*_\n\n\n",
archive_kill_trigger => -1,
digests_read_flag => 1,
digests_read => "../pbml/digests_read.txt",
archive_kill_days => 14,
deleted_today => "../pbml/deleted_today.txt",
...
);
Note that since in our example we chose to archive thread files from
the Perl-Win32-Users list -- as evinced by the assignment of 1 to key
archive_kill_trigger -- we do not need to assign any values to
deleted_today in %pw32u_config_out.
Recall from above that you had to study a given digest to determine whether or
not it contained multipart MIME content in need of stripping out. If a digest,
such as the ActiveState Perl-Win32-Users digest, contained a lot of such bloat,
you set key MIME_cleanup_flag in %config_in to a value of 1. If, on
the other hand, the mailing list provider stripped out the multipart MIME
content before distributing the digest, you set that key to a value of 0.
Mail::Digest::Tools will automatically strip out multipart MIME content once
you have set MIME_cleanup_flag to 1. All that is left for you to decide
is: Do I want to view a log of which messages processed in a single call of
process_new_digests() or reprocess_ALL_digests() had multipart MIME
content stripped out -- or not? If so, you must set two keys in
%config_out:
MIME_cleanup_log_flag
This key must be set to a true value such as 1.
mimelog
This key should be assigned the name of the 'mimelog' file, but if you do not
specify a value it will default to a file mimelog.txt placed in the
directory named by key dir_digest.
The logfile so created looks like this:
Processed Problem
001_1775_0003 CASE C
001_1775_0015 CASE C
001_1775_0018 CASE C
001_1775_0021 CASE E
where each item in the 'Processed' column either (a) was successfully stripped of multipart MIME content by Mail::Digest::Tools as specified by the internal rule denoted by the 'CASE'; or (b) was recognized by Mail::Digest::Tools as containing multipart MIME content that could not be stripped out.
This is relatively esoteric and probably of interest mainly to the module's
developer. So if you are not interested in this feature, set
MIME_cleanup_log_flag to 0 and no mimelog will be created -- but
Mail::Digest::Tools will still do its best to strip out extraneous multipart
MIME content.
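Should you wish to summarize a mimelog, the entries can be tallied by their 'CASE' label. The sketch below assumes that each data line consists of a message identifier, whitespace, and then 'CASE' followed by a label, as in the sample above. The helper tally_mimelog() is hypothetical, and the meaning of the individual labels is internal to the module.

```perl
use strict;
use warnings;

# Hypothetical helper: tally mimelog entries by CASE label
sub tally_mimelog {
    my (@lines) = @_;
    my %count;
    for my $line (@lines) {
        # the header line has no 'CASE' label and is skipped
        next unless $line =~ /^\S+\s+CASE\s+(\w+)/;
        $count{$1}++;
    }
    return %count;
}

my @log = (
    'Processed Problem',
    '001_1775_0003 CASE C',
    '001_1775_0015 CASE C',
    '001_1775_0018 CASE C',
    '001_1775_0021 CASE E',
);

my %tally = tally_mimelog(@log);
print "CASE $_: $tally{$_}\n" for sort keys %tally;
```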
Our sample %config_out hashes are now complete. They look like this:
%pw32u_config_out = (
title => 'Perl-Win32-Users',
dir_digest => "../pw32u",
dir_threads => "../pw32u/Threads",
digests_log => "../pw32u/digests_log.txt",
todays_topics => "../pw32u/todays_topics.txt",
id_format => 'sprintf("%03d",$1) . \'_\' .
sprintf("%04d",$2)',
output_id_format => 'sprintf("%04d",$1)',
thread_msg_delimiter => "--__--__--\n\n",
archive_kill_trigger => 1,
digests_read_flag => 1,
digests_read => "../pw32u/digests_read.txt",
archive_kill_days => 14,
dir_archive_top => "../pw32u/Threads/archive",
archived_today => "../pw32u/archived_today.txt",
de_archived_today => "../pw32u/de_archived_today.txt",
mimelog => "../pw32u/mimelog.txt",
MIME_cleanup_log_flag => 1,
);
%pbml_config_out = (
title => 'Perl Beginner',
dir_digest => "../pbml",
dir_threads => "../pbml/Threads",
digests_log => "../pbml/digests_log.txt",
todays_topics => "../pbml/todays_topics.txt",
id_format => 'sprintf("%05d",$1)',
output_id_format => 'sprintf("%04d",$1)',
thread_msg_delimiter => "_*_*_*_*_*_\n_*_*_*_*_*_\n\n\n",
archive_kill_trigger => -1,
digests_read_flag => 1,
digests_read => "../pbml/digests_read.txt",
archive_kill_days => 14,
deleted_today => "../pbml/deleted_today.txt",
);
Note that %pbml_config_out does not have MIME_cleanup_log_flag or
mimelog keys. It doesn't need them, because in providing the Perl
Beginners mailing list Yahoo! Groups strips out unnecessary multipart
MIME content before sending the digest to you.
... in which the module author shares what he has learned using Mail::Digest::Tools and its predecessors since August 2000.
As mentioned above, if you are considering creating a local archive of threads originating in daily digest versions of a mailing list, you should first accumulate 6-10 instances of such digests and both:
study the internal structure of the digest -- needed to develop a
%config_in for the digest; and
carefully consider how you wish to structure the output from the module's
use on your system -- needed to develop %config_out for the digest
Once you have developed the initial configuration, you should call
reprocess_ALL_digests() on the digests, then open the files created to see
if the results are what you want. If they are not what you want, then you
need to think about what you should change in %config_in and/or
%config_out. Make those changes, then call reprocess_ALL_digests()
again. Repeat as needed, making sure not to delete any of the digest files
you are using as sources until you are completely satisfied with your
configuration.
Once, however, you are satisfied with your configuration, you should call
process_new_digests() on new instances of digests and never call
reprocess_ALL_digests() for that digest again (lest you not be able to
regenerate threads containing messages from digests you have deleted over
time).
As mentioned above, you will probably find it convenient to write separate
Perl scripts to call each one of Mail::Digest::Tools' public functions. You
could code %config_in and %config_out in each of those scripts just
before the respective function calls. But that would violate the principle of
'Repeated Code Is a Mistake' and multiply maintenance problems. It's far
better to code the two configuration hashes in a separate plain-text file and
'require' that file into your script. That way, any changes you make in the
configuration will be automatically picked up by each script that calls a
Mail::Digest::Tools function.
Here is an example of such a file holding the configuration hashes governing use of the Perl-Win32-Users digest, along with a script making use of that file.
# file: pw32u.digest.data
$topdir = "E:/Digest/pw32u";
%config_in = (
grep_formula => 'Perl-Win32-Users digest',
pattern_target => '.*Vol\s(\d+),\sIssue\s(\d+)\.txt',
# next element's value must be double-quoted
source_msg_delimiter => "--__--__--\n\n",
topics_intro => 'Today\'s Topics:',
message_style_flag => '^Message:\s+(\d+)$',
from_style_flag => '^From:\s+(.+)$',
org_style_flag => '^Organization:\s+(.+)$',
to_style_flag => '^To:\s+(.+)$',
cc_style_flag => '^CC:\s+(.+)$',
subject_style_flag => '^Subject:\s+(.+)$',
date_style_flag => '^Date:\s+(.+)$',
reply_to_style_flag => '^Reply-To:\s+(.+)$',
MIME_cleanup_flag => 1,
);
%config_out = (
title => 'Perl-Win32-Users',
dir_digest => $topdir,
dir_threads => "$topdir/Threads",
dir_archive_top => "$topdir/Threads/archive",
archived_today => "$topdir/archived_today.txt",
de_archived_today => "$topdir/de_archived_today.txt",
deleted_today => "$topdir/deleted_today.txt",
digests_log => "$topdir/digests_log.txt",
digests_read => "$topdir/digests_read.txt",
todays_topics => "$topdir/todays_topics.txt",
mimelog => "$topdir/mimelog.txt",
id_format => 'sprintf("%03d",$1) . \'_\' .
sprintf("%04d",$2)',
output_id_format => 'sprintf("%04d",$1)',
MIME_cleanup_log_flag => 1,
# next element's value must be double-quoted
thread_msg_delimiter => "--__--__--\n\n",
archive_kill_trigger => 1,
archive_kill_days => 14,
digests_read_flag => 1,
archive_config => 0,
);
#!/usr/bin/perl
# script: dig.pl
# USAGE: perl dig.pl
use strict;
use warnings;
use Mail::Digest::Tools qw( process_new_digests );
our (%config_in, %config_out);
# leading ./ makes require load the file from the current directory
# instead of searching @INC
my $data_file = './pw32u.digest.data';
require $data_file;
process_new_digests(\%config_in, \%config_out);
print "\nFinished\n";
The module author has maintained local archives of more than a half dozen different mailing list digests over the past several years. He has found it convenient to maintain the configuration information for all the digests he is following at a given time in a single configuration file. The advantage to this approach is that if two digests share a similar internal structure (perhaps due to being generated by the same mailing list program or list provider) and if the user chooses to structure the output from the two digests in similar or identical ways, then writing the configuration hashes becomes much easier and the potential for error is reduced.
Here is a sample directory and file structure for maintaining archives of two different digests on a Win32 system:
digest/
    digest.data
    process_new.pl
    process_ALL.pl
    reply_digest_message.pl
    repair_digest_order.pl
    consolidate_threads.pl
    deletables.pl
    pw32u/
        Perl-Win32-Users Digest, Vol 1 Issue 1771.txt
        Perl-Win32-Users Digest, Vol 1 Issue 1772.txt
        digests_log.txt
        digests_read.txt
        mimelog.txt
        Threads/
    pbml/
        [PBML] Digest Number 1491.txt
        [PBML] Digest Number 1492.txt
        digests_log.txt
        Threads/
File digest.data would look like this:
# digest.data
$topdir = "E:/Digest";
%digest_structure = (
pbml => {
grep_formula => '\[PBML\]',
pattern_target => '.*\s(\d+)\.txt$',
...
},
pw32u => {
grep_formula => 'Perl-Win32-Users digest',
pattern_target => '.*Vol\s(\d+),\sIssue\s(\d+)\.txt',
...
},
);
%digest_output_format = (
pbml => {
title => 'Perl Beginner',
dir_digest => "$topdir/pbml",
dir_threads => "$topdir/pbml/Threads",
...
},
pw32u => {
title => 'Perl-Win32-Users',
dir_digest => "$topdir/pw32u",
dir_threads => "$topdir/pw32u/Threads",
...
},
);
To accommodate this slightly more complex structure in the configuration file, the calling script might be modified as follows:
#!/usr/bin/perl
# script: dig.pl
# USAGE: perl dig.pl [short-name for digest]
use Mail::Digest::Tools qw( process_new_digests );
my ($this_key, %config_in, %config_out);
# variables imported from $data_file
our (%digest_structure, %digest_output_format);
my $data_file = 'digest.data';
require $data_file;
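$this_key = shift @ARGV;
# $! is not meaningful here because no system call has failed
die "\n USAGE: perl dig.pl [short-name for digest]\n" unless defined $this_key;
die "\n The command-line argument you typed: $this_key\n does not call an accessible digest\n"
unless (defined $digest_structure{$this_key}
and defined $digest_output_format{$this_key});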
my ($k,$v);
while ( ($k, $v) = each %{$digest_structure{$this_key}} ) {
$config_in{$k} = $v;
}
while ( ($k, $v) = each %{$digest_output_format{$this_key}} ) {
$config_out{$k} = $v;
}
process_new_digests(\%config_in, \%config_out);
print "\nFinished\n";
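The two while/each loops above copy one digest's settings into %config_in and %config_out. As a minimal sketch only (not part of Mail::Digest::Tools), the lookup and the copy could be wrapped in a single subroutine. The name select_digest_config is a hypothetical helper, and the hash contents below are trimmed placeholders rather than complete configurations.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Mail::Digest::Tools: given a short
# name, return shallow copies of that digest's input and output settings.
sub select_digest_config {
    my ($key, $structure, $output_format) = @_;
    die "No digest short-name supplied\n" unless defined $key;
    die "'$key' does not name a configured digest\n"
        unless exists $structure->{$key} and exists $output_format->{$key};
    # Dereferencing inside a new anonymous hash copies every key/value pair.
    return ( { %{ $structure->{$key} } }, { %{ $output_format->{$key} } } );
}

# Trimmed placeholder data; a real digest.data needs many more keys.
my %digest_structure = (
    pbml  => { grep_formula => '\[PBML\]' },
    pw32u => { grep_formula => 'Perl-Win32-Users digest' },
);
my %digest_output_format = (
    pbml  => { title => 'Perl Beginner' },
    pw32u => { title => 'Perl-Win32-Users' },
);

my ($in, $out) = select_digest_config(
    'pw32u', \%digest_structure, \%digest_output_format );
print "$in->{grep_formula} => $out->{title}\n";
```

The two returned hash references could then be passed directly to process_new_digests($in, $out), since that function takes references to the two configuration hashes.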
For several years the module author used the scripts which were predecessors to Mail::Digest::Tools on a Win32 system where mail was read with Microsoft Outlook Express. He would do a "File/Save As..." on an instance of a digest, select text format (*.txt) and save it to an appropriate directory. Later, the author used the shareware e-mail client Poco, in which the same operation was accomplished by highlighting a file and keying "Ctrl+S".
But as the number of digests the author was tracking grew, this procedure became more and more tedious. Fortunately, about that time the author was assigned to write a review of the second edition of the Perl Cookbook, and he learned how to use the Net::POP3 module to receive his e-mail directly. So now he uses a Perl script to get all his digests and save them as text files to appropriate directories -- and then lets a GUI e-mail client take care of the rest.
Here is a script which more or less accomplishes this:
#!/usr/bin/perl
# script: get_digests.pl
use strict;
use warnings;
use Net::POP3;
use Term::ReadKey;
my ($site, $username, $password);
my ($verref, $pop3, $messagesref, $undeleted, $msgnum, $message);
my ($k,$v);
my ($oldfh, $output);
my %digests = (
'pbml' => "E:/Digest/pbml",
'pw32u' => "E:/Digest/pw32u",
'london' => "E:/Digest/london",
);
$site = 'pop3.someISP.com';
$username = 'myuserid';
$pop3 = Net::POP3->new($site)
or die "Couldn't open connection to $site: $!";
print "Enter password for $username at $site: ";
ReadMode('noecho');
$password = ReadLine(0);
chomp $password;
ReadMode(0);
print "\n";
defined ($pop3->login($username, $password))
or die "Can't authenticate: $!";
$messagesref = $pop3->list
or die "Can't get list of undeleted messages: $!";
while ( ($k,$v) = each %$messagesref ) {
my ($messageref, $line, %headers);
print "$k:\t$v\n";
$messageref = $pop3->top($k);
local $_;
foreach (@$messageref) {
chomp;
last if (/^\s*$/);
next unless (/^\s*(Date:|From:|Subject:|To:)/);
if (/^\s*Date:\s*(.*)/) {
$headers{'Date'} = $1;
}
if (/^\s*From:\s*(.*)/) {
$headers{'From'} = $1;
}
if (/^\s*Subject:\s*(.*)/) {
$headers{'Subject'} = $1;
}
if (/^\s*To:\s*(.*)/) {
$headers{'To'} = $1;
}
}
if ($headers{'Subject'} =~ /^\[PBML\]/) {
get_digest($pop3, $k, 'pbml', $headers{'Subject'});
}
if ($headers{'Subject'} =~ /^Perl-Win32-Users/) {
get_digest($pop3, $k, 'pw32u', $headers{'Subject'});
}
if ($headers{'Subject'} =~ /^london\.pm/) {
get_digest($pop3, $k, 'london', $headers{'Subject'});
}
}
$pop3->quit() or die "Couldn't quit cleanly: $!";
print "Finished!\n";
sub get_digest {
my ($pop3, $msgnum, $digest, $subj) = @_;
print "Retrieving $msgnum: $subj";
my $message =
$pop3->get($msgnum) or die "Couldn't get message $msgnum: $!";
if ($message) {
print "\n";
my $digestfile = "$digests{$digest}/$subj.txt";
_print_message($digestfile, $message);
print "Marking $msgnum for deletion\n";
$pop3->delete($msgnum) or die "Couldn't delete message $msgnum: $!";
} else {
print "Failed: $!\n";
}
}
sub _print_message {
my ($digestfile, $message) = @_;
my @lines = @{$message};
my $counter = 0;
open(my $fh, '>', $digestfile)
or die "Couldn't open $digestfile for writing: $!";
for (my $i = 0; $i<=$#lines; $i++) {
chomp($lines[$i]);
# Identify the first blank line in the digest,
# i.e., the end of the headers
if ($lines[$i] =~ /^$/) {
$counter = $i;
last;
}
}
# Transfer digest to appropriate directory, skipping over digest header
# so as to start just above Today's Topics
foreach my $line (@lines[$counter+1 .. $#lines]) {
chomp($line);
# For some reason the $pop3->get() puts a single whitespace at the
# start of most (all but the first?) lines
# That has to be cleaned up so digest.pl can correctly process
# header info and identify beginning of Today's Topics
if ($line =~ /^\s(.*)/) {
print $fh $1, "\n";
} else {
print $fh $line, "\n";
}
}
close $fh or die "Couldn't close $digestfile after writing: $!";
}
No promise is made that this script or any script contained in this documentation will work correctly on your system. Hack it up to get it to work the way you want it to.
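In that spirit, here is one possible way to hack on get_digests.pl: the header-scanning loop and the chain of Subject tests can be pulled out into two small subroutines that can be tried without a POP3 connection. This is a sketch under stated assumptions, not the author's code. The names parse_headers, classify_subject and @subject_rules are hypothetical, and the sample message lines are invented for illustration.

```perl
use strict;
use warnings;

# Collect selected headers from the lines of a message, stopping at the
# first blank line (the end of the header block).
sub parse_headers {
    my ($lines) = @_;
    my %headers;
    foreach my $line (@$lines) {
        (my $text = $line) =~ s/\r?\n\z//;    # work on a copy of the line
        last if $text =~ /^\s*$/;
        if ( $text =~ /^\s*(Date|From|Subject|To):\s*(.*)/ ) {
            $headers{$1} = $2;
        }
    }
    return \%headers;
}

# Map Subject patterns to the short names used as keys of %digests.
my @subject_rules = (
    [ qr/^\[PBML\]/,         'pbml'   ],
    [ qr/^Perl-Win32-Users/, 'pw32u'  ],
    [ qr/^london\.pm/,       'london' ],
);

# Return the short name of the first matching digest, or undef if none.
sub classify_subject {
    my ($subject) = @_;
    return unless defined $subject;
    foreach my $rule (@subject_rules) {
        my ($pattern, $name) = @$rule;
        return $name if $subject =~ $pattern;
    }
    return;
}

# Invented sample lines standing in for the result of $pop3->top($k).
my @sample = (
    "From: someone\@example.com\n",
    "Subject: [PBML] Digest Number 1491\n",
    "\n",
    "Subject: this line is in the body and is ignored\n",
);
my $headers = parse_headers(\@sample);
print classify_subject( $headers->{Subject} ), "\n";
```

Keeping the patterns in one table means that following a new digest needs only one added row, and a message with no Subject header is skipped rather than triggering an "uninitialized value" warning.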
The main assumption on which Mail::Digest::Tools depends for its success is that the provider of a particular digest continues to use the same mailing list software to produce the digest. If the provider changes his/her software, you must modify Mail::Digest::Tools' configuration data accordingly.
At its current stage of development Mail::Digest::Tools is only applicable to
mailing list digests which arrive as one continuous file. It is not
applicable to digests (e.g., Cygwin, module-authors@perl.org) which are
supplied in a format consisting of (a) one file with instructions and a table
of contents and (b) all the individual messages provided as e-mail attachments.
The program was created with Perl 5.6. Certain features, such as the use of
the our modifier, were not available prior to 5.6. Modifications to
account for pre-5.6 features are left as an exercise for the user.
Mail::Digest::Tools internally uses Perl core extension Time::Local. If at some future point this module is not included as part of a Perl core distribution, you would have to install it manually from CPAN.
ActiveState maintains Perl for Windows-based platforms and also maintains a variety of mailing lists for users of its Windows-compatible versions of Perl. Subscribers to these lists can receive messages either as individual e-mails or as part of a daily digest which contains a listing of the day's topics and the complete text of each message. The messages are often best followed as discussion 'threads' which may extend over several days' worth of digests.
In June of 2000, however, ActiveState had to temporarily take its mailing lists
off-line for technical reasons. When these lists were restored to service,
their archive capacities were not immediately restored. I had just begun my
study of Perl and had come to enjoy reading the Perl-Win32-Users digest. As
I set off for the Yet Another Perl Conference in Pittsburgh, I shouted out,
'I want my Perl-Win32-Users digest!' I wrote a Perl script called digest.pl
to fill that gap.
ActiveState has since restored archiving capacity to their lists. For reasons that would perhaps best be explored in a psychotherapeutic context, however, I had become attached to my local archive of the 'pw32u' list, so I continued to maintain this program and fine-tune its coding.
In early 2001 it became apparent that this program could be applied to a wide
variety of mailing list digests -- not just those provided by ActiveState. In
particular, valuable digests provided by Yahoo Groups (formerly E-groups) such
as NT Emacs Users, Perl 5 Porters and Perl Beginners could also be archived if
digest.pl were modified appropriately. I made those modifications and
began to track several other digests. I was able to use the archive I had
developed as a window into one part of the Perl community in a Lightning Talk
I gave at YAPC::North America in Montreal in June 2001, "An Index of
Incivility in the Perl Community."
Maintaining digest.pl was, to a considerable extent, the way I taught myself
Perl. Along the way I incorporated my first profiler into the script -- and
then discarded it. Some of the subroutines I had written for early versions of
the program had applicability to other scripts -- and thus was born my first
module -- also since discarded. By July 2003 I was up to version 1.3.
Following a suggestion by Uri Guttman at the YAPC::EU conference held in Paris
in July 2003, wherever possible the use of separate
print statements for each line to be printed was eliminated in favor of
concatenating strings to be printed into much larger strings which could be
printed all at once. This revision reduced the number of times filehandles
had to be opened for writing. A given thread file was now opened only once
per call of this program, rather than once for each message in each digest
processed per call of the program.
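The idea behind that revision can be sketched as follows. This is an illustration of the general technique only, not the module's internal code: the thread names and message texts are invented sample data, and the files are written to a temporary directory.

```perl
use strict;
use warnings;
use File::Temp qw( tempdir );
use File::Spec;

# Hypothetical sample data: (thread name, message text) pairs as they
# might be extracted from several digests in one run.
my @messages = (
    [ 'install_problem', "First message\n"  ],
    [ 'regex_question',  "Only message\n"   ],
    [ 'install_problem', "Second message\n" ],
);

# Build one large string per thread instead of printing each message
# as soon as it is seen.
my %thread_text;
foreach my $msg (@messages) {
    my ($thread, $text) = @$msg;
    $thread_text{$thread} .= $text;
}

# Open each thread file exactly once and print its whole string.
my $dir = tempdir( CLEANUP => 1 );
foreach my $thread ( sort keys %thread_text ) {
    my $file = File::Spec->catfile( $dir, "$thread.thr.txt" );
    open my $fh, '>>', $file or die "Couldn't open $file: $!";
    print {$fh} $thread_text{$thread};
    close $fh or die "Couldn't close $file: $!";
}
print scalar( keys %thread_text ), " files written\n";
```

With three messages in two threads, the sketch performs two open calls rather than three; the saving grows with the number of messages per thread.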
Various other improvements, such as the possibility of stripping out unnecessary multipart MIME content and the introduction of subdirectories for archiving, were made in late 2003. At that point I decided to transform the script into a full-fledged Perl module. At first I tried out an object-oriented structure (with which I was familiar from my first two CPAN modules, List::Compare and Data::Presenter). That OO structure necessitated one constructor and one method call per typical script, but since the constructor did nothing but some cursory validation of the configuration data, it was mostly superfluous. Hence, I jettisoned the OO structure in favor of a functional approach. The result: Mail::Digest::Tools.
After these revisions, I was up to version 1.96. There seemed little reason to revert to a lower version number at that point, which is why Mail::Digest::Tools made its CPAN debut at version 2.04.
v1.97 (2/18/2004): Dealing with problem that Win32 and Unix/Linux may create
different thread names for the same set of source messages because they have
different lists of characters forbidden in file names. This became a problem
while writing tests for process_new_digests() because it made the names of
thread files created via that function more difficult to predict.
Tests adjusted appropriately.
v1.98 (2/19/2004): Eliminated suspect uses of /o modifier on regexes.
This was causing problems when I called process_new_digests() on two
different types of digests in the same script. Also, eliminated code
referring to DOS (e.g., code eliminating characters unacceptable in
DOS filenames) as I have no way to test this module on a DOS box.
v1.99 (2/22/2004): ActiveState introduced a new format for its
Perl-Win32-Users digest -- the digest which originally inspired the creation
of this module's predecessor in 2000. One aspect of this new format was a
clear improvement: HTML attachments are now stripped before messages are
posted to the digest, so multipart MIME content has either been reduced
considerably or eliminated altogether. But another aspect of this new
format upset code going back four years: The delimiter immediately
following Today's Topics is now different from the delimiters separating each
message in the digest. Working around this appeared to be surprisingly
difficult, especially since this revision had to be done in the middle of
writing a test suite for CPAN distribution. A new key has been added to the
%config_in hash for each digest:
$config_in{'post_topics_delimiter'}
v2.00 (2/23/2004): Testing conducted after the last revision revealed a bug
going back several versions in the internal subroutine stripping multipart
MIME content. The last paragraph of each message which did not have MIME
content was being stripped off. The offending code was found within
_analyze_message_body(). (The author recently learned of the CPAN
module Email::StripMime. This looks promising as a replacement for
the hand-rolled subroutine used within Mail::Digest::Tools, but a full study
of its possibilities will be deferred to a later version.) Also in this
version, POD was rewritten to reflect the introduction of the post-topics
delimiter.
v2.01 (2/24/2004): Backslashes (except as part of \n newline characters)
are prohibited in %config_out key thread_msg_delimiter. This is
because in the test suite that key's value is used as a variable inside a
regular expression which in turn is used as an argument to split().
Preliminary investigation suggests that to work around the backslash
metacharacter in that situation would be very time-consuming.
v2.02 (2/26/2004): Revised reply_to_digest_message() internal
subroutine _strip_down_for_reply to reflect distinction between post-topics
delimiter and source message delimiter.
v2.03 (3/04/2004): Fixed bug in readdir call in repair_message_order().
Extensive reworking of test suite.
v2.04 (3/05/2004): No changes in module. Refinement of test suite only.
v2.05 (3/07/2004): Fixed accidental deletion of incrementation of
$message_count in _strip_down().
v2.06 (3/10/2004): Correction of errors in test suite. Elimination of use of List::Compare in test suite.
v2.07 (3/11/2004): Correction of error in t/03.t
v2.08 (3/11/2004): Correction in _clean_up_thread_title and in tests.
v2.10 (3/15/2004): Corrections to README and documentation only.
v2.11 (10/23/2004): Fixed several errors which resulted in "Bizarre copy of hash in leave" error when running test suite under Devel::Cover.
v2.12 (05/14/2011): Added 'mirbsd' to list of Unixish-OSes.
James E. Keenan (jkeenan@cpan.org).
Creation date: August 21, 2000. Last modification date: May 14, 2011. Copyright (c) 2000-2011 James E. Keenan. United States. All rights reserved.
This software is distributed with absolutely no warranty, express or implied. Use it at your own risk. This is free software which you may distribute under the same terms as Perl itself.