| RSSycklr documentation | Contained in the RSSycklr distribution. |
RSSycklr - (beta) Highly configurable recycling of syndication (RSS/Atom) feeds into tailored, guaranteed XHTML fragments.
0.12
use strict;
use warnings;
use RSSycklr;
use Encode;
my @feeds = ({ uri => "http://www.xkcd.com/atom.xml",
               max_display => 1, },
             { uri => "http://feeds.theonion.com/theonion/daily" },
             { title_override => "O NOES, IZ TEH DED",
               uri => "http://rss.news.yahoo.com/rss/obits", });
my $rsklr = RSSycklr->new();
$rsklr->config({ feeds => \@feeds,
                 title_only => 1 });
while ( my $feed = $rsklr->next() )
{
    print Encode::encode_utf8( $feed->title_override || $feed->title ), "\n";
    for my $entry ( $feed->entries )
    {
        print "\t* ", Encode::encode_utf8( $entry->title ), "\n";
    }
}
This is more of a mini-app engine than a pure module. RSSycklr is a package that wraps up the best parts of XML::Feed and HTML::Truncate, filters the result through XML::LibXML to guarantee valid XHTML, and adds a side of Template for auto-formatted output of XHTML fragments should you so desire.
This is probably easier to show with examples than explain. This is the part where I show, or maybe explain, someday. For now, take a look at the CONFIGURATION sample below and the source for the tool 'rssycklr' that comes with this distribution.
XHTML validation is currently based on "-//W3C//DTD XHTML 1.0 Transitional//EN" and errors are not fatal; right now they carp (carp in Carp). Eventually you will be able to pick your DTD and decide whether errors are fatal, skip the entry, or just complain.
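The complain-and-continue handling of validation errors can be sketched in plain Perl. This is only an illustration under stated assumptions: validate_fragment and keep_valid are hypothetical helpers, and the tag-count check is a toy stand-in for real DTD validation with XML::LibXML.

```perl
use strict;
use warnings;
use Carp qw( carp );

# Hypothetical stand-in for DTD validation: dies when the number of
# opening tags differs from the number of closing tags.
sub validate_fragment {
    my $html = shift;
    my $open  = () = $html =~ /<[a-z][^>\/]*>/gi;
    my $close = () = $html =~ /<\/[a-z][^>]*>/gi;
    die "unbalanced tags\n" unless $open == $close;
    return 1;
}

# Keep the fragments that validate; carp about the rest and move on.
sub keep_valid {
    my @kept;
    for my $fragment (@_) {
        if ( eval { validate_fragment($fragment); 1 } ) {
            push @kept, $fragment;
        }
        else {
            carp "Skipping fragment: ", $@ || "unknown error";
        }
    }
    return @kept;
}

my @ok = keep_valid( "<p>fine</p>", "<p>broken", "<i>also fine</i>" );
print scalar(@ok), " of 3 fragments kept\n";
```

The failure is reported on STDERR and the loop carries on, so one bad entry does not stop the rest from being processed.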
new - Create an RSSycklr object.
load_config - Takes a YAML file name or string, which must conform to the configuration format. No validation of input is done at this point. More config options will probably be added soon. Because it calls config underneath, loading configuration options adds them to what's already there rather than resetting them.
config - Set or get the hash reference of configuration and raw feed data. Setting config is additive: each new hash reference is merged with the current config hash reference.
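A minimal sketch of what "additive" means here, in core Perl. The module's source imports merge from Hash::Merge::Simple for this; merge_config below is a hypothetical stand-in written for illustration, not the module's code.

```perl
use strict;
use warnings;

# Hypothetical helper: merge $new into a copy of $old.
# Keys in $new win; nested hashes are merged rather than replaced.
sub merge_config {
    my ( $old, $new ) = @_;
    my %merged = %{$old};
    for my $key ( keys %{$new} ) {
        if ( ref( $merged{$key} ) eq 'HASH' and ref( $new->{$key} ) eq 'HASH' ) {
            $merged{$key} = merge_config( $merged{$key}, $new->{$key} );
        }
        else {
            $merged{$key} = $new->{$key};
        }
    }
    return \%merged;
}

my $config = { max_display => 3, timeout => 10 };
$config = merge_config( $config, { title_only  => 1 } );    # adds a key
$config = merge_config( $config, { max_display => 5 } );    # overrides one key

print join( ", ", map { "$_=$config->{$_}" } sort keys %{$config} ), "\n";
```

After both calls the earlier timeout setting is still present, which is the behavior to expect from repeated calls to config.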
add_feeds - Takes an array ref of hash refs of feed info. uri is the only required key in each hash ref. Other possible keys are shown in CONFIGURATION below.
next - Iterates through feeds with delayed execution: feeds are only fetched and cleaned up as they are called. next is destructive and can be used in a while loop.
while ( my $feed = $rssycklr->next() ) {
    print "Title: ", $feed->title, "\n";
}
feeds - If you prefer to get your feeds all at once in a list or an array ref, use feeds. It iterates on next under the hood, so next will be empty after feeds has been called, though feeds may be called repeatedly without refetching or reparsing. If you call add_feeds to add new feeds, next will be able to iterate on those, and feeds will add them to those already fetched and parsed.
Remember that each feed is a web request and the requests are made one after another, not in parallel, so a list of 20 feeds may return slowly, maybe very slowly.
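The interplay between next, feeds, and add_feeds can be pictured with a toy class. FeedQueue and its "fetched(...)" strings are hypothetical; nothing here touches the network or RSSycklr itself.

```perl
use strict;
use warnings;

package FeedQueue;    # hypothetical class, not part of RSSycklr

sub new {
    my ( $class, @uris ) = @_;
    return bless { pending => [@uris], done => [] }, $class;
}

# Destructive iterator: each call removes one pending item and "fetches" it.
sub next {
    my $self = shift;
    my $uri = shift @{ $self->{pending} };
    return unless defined $uri;
    return "fetched($uri)";    # a real implementation would make the web request here
}

# Drain next() into a cache, then return everything gathered so far.
sub feeds {
    my $self = shift;
    while ( defined( my $feed = $self->next ) ) {
        push @{ $self->{done} }, $feed;
    }
    return @{ $self->{done} };
}

sub add_feeds {
    my ( $self, @uris ) = @_;
    push @{ $self->{pending} }, @uris;
    return scalar @uris;
}

package main;

my $queue = FeedQueue->new( "a.xml", "b.xml" );
my @first = $queue->feeds;    # fetches both
my $none  = $queue->next;     # undef: the queue is drained
$queue->add_feeds("c.xml");
my @all = $queue->feeds;      # the cached two plus the new one

print scalar(@first), " then ", scalar(@all), "\n";
```

The second call to feeds does not repeat the first two "fetches"; it only works through what was added since.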
as_string - Sort of does this-
$rssycklr->process($rssycklr->template, { rssycklr => $rssycklr })
or confess $rssycklr->tt2->error();
Can also be called for a return value, like so-
my $output = $rssycklr->as_string;
In void context, it processes/prints to STDOUT.
$rssycklr->as_string;
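The return-or-print behavior rests on Perl's wantarray, which is undefined in void context. A minimal sketch, with render as a hypothetical stand-in for as_string:

```perl
use strict;
use warnings;

# Hypothetical renderer: returns its output when the caller wants a value,
# prints it to STDOUT when called in void context.
sub render {
    my $out = "<div class=\"rssycklr\"></div>\n";
    if ( defined wantarray ) {
        return $out;
    }
    print $out;
    return 1;
}

my $html = render();    # scalar context: nothing is printed
render();               # void context: prints to STDOUT
print "captured ", length($html), " characters\n";
```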
keep_tags - The list (stored as a hash ref) of tags which will be kept when creating ledes from entry bodies. The default list generally comprises the phrasal tags; e.g., <i/>, <q/>, <del/>, <dfn/>, <sup/>, et cetera.
perl -MRSSycklr -MYAML -le '$rsklr = RSSycklr->new; print Dump $rsklr->keep_tags'
Example: dropping images-
delete $rsklr->keep_tags->{img};
Example: drop all tags-
$rsklr->keep_tags({});
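How a lookup hash like keep_tags sorts tag names into kept and stripped can be sketched without the module. The tag lists below are shortened and illustrative, not the real defaults.

```perl
use strict;
use warnings;

# A lookup hash in the style of keep_tags (illustrative subset only).
my %keep = map { $_ => 1 } qw( i b em strong q sup img a );

delete $keep{img};    # same idea as dropping images above

my @seen    = qw( p i img strong div a );
my @kept    = grep {  $keep{$_} } @seen;
my @dropped = grep { !$keep{$_} } @seen;

print "kept: @kept\n";
print "stripped: @dropped\n";
```

Going by the module's source, a stripped element is unwrapped rather than discarded: its children stay in the lede and only the tag itself goes away.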
tt2 - The Template object we may create to do output. It's deferred, so if you never ask for it and never call its methods, it's never created.
template - The template that will be passed to process in Template. It can be a string (scalar ref), a file, or a file handle. The default is a string ref.
perl -MRSSycklr -le '$rsklr = RSSycklr->new; print ${$rsklr->template}'
xml_parser - The XML::LibXML object.
html_to_dom - Passes an HTML fragment through some HTML::TokeParser::Simple sanity cleanup and returns an XML::LibXML::Document. This is an
truncater - The HTML::Truncate object.
BUILD - Internal method that allows config and load_config to be passed as arguments to new. If you pass them, BUILD runs those methods at initialization.
As noted above, an RSSycklr object has a collection of objects it wrangles. You may call methods on it which get delegated to its objects. All the methods below belong to the indicated classes and may be treated exactly as the relevant documents show.
process - This is process in Template.
parse_html_string - parse_html_string in XML::LibXML. You also have access to recover in XML::LibXML::Parser and recover_silently in XML::LibXML::Parser which are set to "1" by default.
Calls from RSSycklr objects to feeds and next return RSSycklr::Feed objects. They are based on XML::Feed objects.
More configuration settings are shown in the config example.
title_override - For some feeds, like say a search generated feed from Google, you might get back a title in the XML which is ridiculous for display; e.g., "bingo cards" +tacos site:example.org. In cases like this it would be nice to provide your own title.
entries - The processed entries from the feed which passed configuration filters.
count - The number of entries a feed has. Note, this is not the number of entries in the actual XML::Feed, but the number of entries which passed your configuration filters.
The following delegate to the underlying XML::Feed object: tagline, link, copyright, modified, author, generator, language.
lede - The excerpted portion of the feed entry's content.
feed - The parent RSSycklr::Feed object.
The following delegate to the underlying XML::Feed::Entry object: title, link, content, category, id, author, issued, modified.
This will eventually be replaced by a native method.
Configuration is a hash in two levels. The top level contains defaults. The key feeds contains per feed settings. You can have max_display => 3 in the top, for example, but have max_display => 1 and max_display => 10 in individual feed data. Leaving max_display out of feed data would mean a feed would fall back to the top default setting 3.
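The two-level fallback can be sketched in a few lines of core Perl. The example.org URIs and the setting_for helper are hypothetical. The helper uses exists so that an explicit false value in a feed still overrides the top level, whereas the module's source uses || for most keys.

```perl
use strict;
use warnings;

my %config = (
    max_display => 3,
    feeds       => [
        { uri => "http://example.org/a.xml", max_display => 1 },
        { uri => "http://example.org/b.xml", max_display => 10 },
        { uri => "http://example.org/c.xml" },    # no per-feed setting
    ],
);

# Hypothetical helper: per-feed value first, then the top-level default.
sub setting_for {
    my ( $feed, $key ) = @_;
    return exists $feed->{$key} ? $feed->{$key} : $config{$key};
}

for my $feed ( @{ $config{feeds} } ) {
    printf "%s => %d\n", $feed->{uri}, setting_for( $feed, "max_display" );
}
```

The third feed has no max_display of its own, so it reports the top-level 3.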
---
# length of entry excerpt to keep as "lede"
excerpt_length: 110
# don't do excerpts, titles, only
title_only: ~
# master setting for oldest entry age
hours_back: 30
# stop fetching at this point
max_feeds: 10
# master setting for entries to keep per feed
max_display: 3
# seconds to try a feed fetch before skipping
timeout: 10
# ellipsis on truncated ledes/titles
ellipsis: " "
# text for "read more" link
read_more: [more]
# css class for top <div> wrapper
css_class: rssycklr
# not implemented
title_length: ~
# not implemented, dl/dt/dd happens now
excerpt_style: dl|p|br|ul
# not implemented, ul/li happens now
title_style: ul|p|br
# this is hardcoded for now
max_images: 1
feed_title_tag: h4
dtd: xhtml1-transitional.dtd
feeds:
- uri: http://green.yahoo.com/rss/blogs/all
max_display: 5
hours_back: 24
- uri: http://sedition.com/feed/atom
title_only: 1
hours_back: 105
timeout: 3
- uri: http://dd.pangyre.org/dd.atom
excerpt_length: 300
hours_back: 48
Caveat: the ellipsis default is utf8 so set it to "..." (three periods) or … if it's going to cause a problem in your handling.
excerpt_length - How long to make ledes. This is passed through HTML::Truncate so it tries to count displayed characters, not real characters; i.e., <p>Oh, Hai!</p> is counted as 8 characters, not 15. The HTML::Truncate object is created with a limit of 170, though the default configuration in this version's source sets excerpt_length to 150.
title_only - If true, don't do excerpts, only pull titles.
hours_back - Maximum age, in hours, of feed entries to include.
max_display - How many entries from a feed to parse and keep.
timeout - How many seconds to wait for a feed fetch to return before skipping it.
read_more - Text for "read more" link.
css_class - The CSS class for the top <div/> wrapper.
max_images - Maximum images to keep in a lede. Hardcoded to 1 right now.
max_feeds - Stop fetching at this point.
dtd - The DTD to validate feed snippets against. The default is xhtml1-transitional.dtd. Also available: xhtml1-frameset.dtd, xhtml1-strict.dtd, and xhtml11.dtd. Because we use XML::LibXML to parse our snippets we cannot, and frankly wouldn't want to, support HTML 4 and earlier.
title_length - Not implemented.
excerpt_style - Not implemented, dl/dt/dd happens in template now.
title_style - Not implemented, ul/li happens now.
feed_title_tag - Settable; h4 in template now.
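The difference between displayed and real characters that excerpt_length relies on can be shown with a crude approximation. displayed_length is a hypothetical helper; HTML::Truncate does real parsing, and this regex is for illustration only.

```perl
use strict;
use warnings;

# Crude approximation: strip anything that looks like a tag, then count
# what is left. Not how HTML::Truncate works internally.
sub displayed_length {
    my $html = shift;
    ( my $text = $html ) =~ s/<[^>]*>//g;
    return length $text;
}

my $html = "<p>Oh, Hai!</p>";
printf "raw: %d, displayed: %d\n", length($html), displayed_length($html);
# prints "raw: 15, displayed: 8"
```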
The image handling is probably the most important part. Feeds might return huge images or several images.
.rssycklr {
    font-family: helvetica, sans-serif;
}
.rssycklr h4 {
    border-bottom: 1px solid #ccc;
    line-height: 100%;
}
.rssycklr h4 a {
    color: #039 !important;
    text-decoration: none;
}
.rssycklr .datetime {
    color: #445;
    font-size: 80%;
}
.rssycklr a.readmore {
    text-decoration: none;
    font-size: 90%;
}
.rssycklr img {
    float: right;
    clear: right;
    width: 60px;
    margin: -3px 0 0 3px;
}
Ashley Pond V, <ashley@cpan.org>.
as_is flag?
Pass through the Pod to make it a bit more useful and less redundant on config stuff.
If abutting stripped tags are flow level, insert a newline...? Define the behavior in the config so even a ¶ or something could be inserted. Turn <br/>s into newlines? Use canTighten and isPhraseMarkup from HTML::Tagset to make these choices. Maybe this is where the DTDs should live too...?
Test timed out feeds.
next should be putting feeds aside for feeds?
Translate tags? To drop blockquote to q and h* to bold, etc?
Text only option for ledes? That would be easier to work with than setting keep_tags to empty.
Put a name field for feeds to override the feed supplied title.
Make the validation controllable.
Make a master timeout vs a feed level timeout? No...
Make utf8 a settable...?
Move all the DTD handling, and all the other historical ones, HTML 1 and up, into a real distribution...? Ikegami's catalog stuff?
Make attribute filter configurable.
More tests.
Throw errors for extraneous or malformed config data.
Implement anything in the configuration example which reads, "not implemented." E.g., make the style/tags configurable for titles/ledes; e.g., dl|p|br|ul.
Submit a patch, or ticket, to Benjamin for a content_type XML::Feed::Entry. We're just assuming it's HTML.
Template->process should probably have a before call to allow the config to be merged into the top of the template data.
Make image count configurable.
Regex filters?
Chance of inclusion: a decimal so that a list of 100 feeds with a level of 0.1 would only load (or rather try to load) approximately 10 feeds.
I love good feedback and bug reports. Please report any bugs or feature requests directly to me via email or through the web interface at http://rt.cpan.org/Public/Dist/Display.html?Name=RSSycklr.
Stevan Little, Shawn M Moore, and Benjamin Trott. I had no idea how cool Moose and Mouse were before I put this together. They make very complicated interactions seem quite natural. I changed design and features three or four times putting this together and if the code had all been by hand it probably would have made me dump the project since I already had a perfectly serviceable program doing what it does. Instead, with Mouse, fairly deep changes were nearly trivial.
XML::Feed, XML::Feed::Entry, Mouse/Moose, XML::LibXML, Template, YAML, HTML::Truncate, DateTime, Scalar::Util, URI, Encode.
Copyright (©) 2008-2009 Ashley Pond V.
This program is free software; you can redistribute it or modify it or both under the same terms as Perl itself.
Because this software is licensed free of charge, there is no warranty for the software, to the extent permitted by applicable law. Except when otherwise stated in writing the copyright holders or other parties provide the software "as is" without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. The entire risk as to the quality and performance of the software is with you. Should the software prove defective, you assume the cost of all necessary servicing, repair, or correction.
In no event unless required by applicable law or agreed to in writing will any copyright holder, or any other party who may modify and/or redistribute the software as permitted by the above licence, be liable to you for damages, including any general, special, incidental, or consequential damages arising out of the use or inability to use the software (including but not limited to loss of data or data being rendered inaccurate or losses sustained by you or third parties or a failure of the software to operate with any other software), even if such holder or other party has been advised of the possibility of such damages.
package RSSycklr;
use Moose;
no warnings "uninitialized";
use Carp qw( carp confess croak );
use YAML ();
use XML::Feed ();
use HTML::Truncate ();
use HTML::TokeParser::Simple ();
use XML::LibXML ();
use DateTime ();
use Scalar::Util qw( blessed );
use URI ();
use File::ShareDir ();
use Hash::Merge::Simple qw( merge );
use Encode qw( decode_utf8 );

our $VERSION = "0.12";

has "keep_tags" =>
    is => "rw",
    isa => "HashRef",
    default => sub {
        return {
            map {; $_ => 1 }
                qw( del ins i u b em strong abbr br img dfn acronym q sub sup
                    cite code kbd samp strong var strike s tt a )
        };
    },
    ;

has "tt2" =>
    is => "ro",
    lazy => 1, # not always used
    isa => "Template",
    default => sub {
        require Template;
        Template->new({
            ENCODING => 'UTF-8',
            DEFAULT_ENCODING => 'UTF-8',
        });
    },
    handles => [qw( process )],
    ;

# No type so it can take any Template takes
has "template" =>
    is => "rw",
    lazy => 1, # not always used
    default => sub {
        \<<"TT_TEMPLATE";
<div class="[% css_class || "rssycklr" %]">
[%-FOR feed IN rssycklr.feeds() %]
[%-NEXT UNLESS feed.count %]
 <div>
  <[% feed_title_tag || "h4" %]>
   <a href="[%-feed.link | html %]">[%-FILTER html; feed.title_override || feed.title; END %]</a>
  </[% feed_title_tag || "h4" %]>
[%~IF feed.entries.0.lede %]
  <dl>
  [%-FOR entry IN feed.entries %]
   <dt><a href="[%-entry.link | html %]">[%-entry.title | html %]</a></dt>
   <dd>
    [% entry.lede %]
    <span class="datetime">[% modified = entry.modified ? entry.modified : entry.feed.modified %]
     [% modified.ymd(".") %]
     [% modified.hour_12 %]:[% modified.min | format('%02d') %][% modified.am_or_pm %]
    </span>
   </dd>
  [%~END %]
  </dl>
[%~ELSE %]
  <ul>
  [%-FOR entry IN feed.entries %]
   <li><a href="[%-entry.link | html %]">[%-entry.title | html %]</a></li>
  [%~END %]
  </ul>
[%~END %]
 </div>
[%~END %]
</div>
TT_TEMPLATE
    },
    ;

has "xml_parser" =>
    is => "rw",
    isa => "XML::LibXML",
    default => sub {
        my $libxml = XML::LibXML->new();
        $libxml->keep_blanks(1);
        $libxml->line_numbers(1);
        $libxml->complete_attributes(1);
        $libxml->clean_namespaces(1);
        $libxml->no_network(1);
        $libxml->recover_silently(1);
        return $libxml;
    },
    handles => [qw( parse_html_string )],
    ;

has "dtd" =>
    is => "rw",
    isa => "XML::LibXML::Dtd",
    ;

has "truncater" =>
    is => "rw",
    isa => "Object", # "HTML::Truncate",
    default => sub {
        HTML::Truncate->new(repair => 1, on_space => 1, chars => 170);
    },
    handles => [ qw( truncate ) ],
    ;

has "feeds" =>
    is => "ro",
    auto_deref => 1,
    isa => "ArrayRef",
    default => sub { [] },
    ;

before "feeds" => sub {
    my $self = shift;
    while ( my $feed = $self->next() )
    {
        push @{$self->{feeds}}, $feed;
    }
};

sub BUILD {
    my ( $self, $args ) = @_;
    $self->config(delete $args->{config}) if $args->{config};
    $self->load_config(delete $args->{load_config}) if $args->{load_config};
}

sub config {
    my $self = shift;
    $self->{_config} ||= $self->_default_config();
    my $hash = shift || return $self->{_config};
    $self->{_config} = merge $self->{_config}, $hash;
    return $self->{_config};
}

sub load_config {
    my $self = shift;
    my $src = shift || return;
    my $info = ref($src) ?
        $src : $src !~ /\n/ ?
        YAML::LoadFile($src) : YAML::Load($src);
    my $feeds = delete $info->{feeds} || [];
    $self->config($info);
    $self->add_feeds($feeds);
    return $self;
}

sub add_feeds {
    my $self = shift;
    my $feeds = shift;
    my $old = scalar @{$self->config->{feeds} || []};
    my $new = scalar @{$feeds};
    for my $info ( @{$feeds} )
    {
        confess "URI is missing from feed data for feed: ", YAML::Dump($info)
            unless $info->{uri};
        push @{$self->config->{feeds}}, $info;
    }
    return ( $old + $new ) == @{$self->config->{feeds}};
}

sub as_string {
    my $self = shift;
    my $out = "";
    $self->process($self->template, { rssycklr => $self }, \$out)
        or confess $self->tt2->error();
    if ( defined wantarray )
    {
        return $out;
    }
    else
    {
        print $out;
        return 1;
    }
}

sub next {
    my $self = shift;

    if ( $self->_maxed_out )
    {
        $self->config->{feeds} = [];
        return;
    }

    my $info = shift @{ $self->config->{feeds} } || return;
    my $uri = blessed($info->{uri}) eq "URI" ? $info->{uri} : URI->new($info->{uri});

    my $xml_feed;
    my $ok = eval {
        local $SIG{ALRM} = sub { die "Feed request timeout: $uri\n" };
        alarm( $info->{timeout} || $self->config->{timeout} || 10 );
        $xml_feed = XML::Feed->parse($uri) or croak("Could not parse $uri, ", XML::Feed->errstr);
        alarm(0);
        1;
    };
    alarm(0); # Racing parsing fatals can happen in the XML::Feed space(?).

    unless ( $ok == 1 )
    {
        carp $@ || ( "Unknown error parsing " . $info->{uri} );
        return $self->next;
    }

    my $hours_back = DateTime
        ->now( time_zone => 'floating' )
        ->subtract( hours => $info->{hours_back} || $self->config->{hours_back} || 170 );

    if ( $xml_feed->modified )
    {
        return $self->next unless 1 == DateTime->compare( $xml_feed->modified, $hours_back );
    }

    my $max_display = $info->{max_display} || $self->config->{max_display} || 10;
    my $excerpt_length = $info->{excerpt_length} || $self->config->{excerpt_length};
    my $title_only = exists($info->{title_only}) ?
        $info->{title_only} # might be undef on purpose to override self->config setting
        : $self->config->{title_only};

    my @entry;
  ENTRY:
    for my $entry ( $xml_feed->entries )
    {
        next ENTRY unless $entry->issued;
        next ENTRY unless 1 == DateTime->compare( $entry->issued, $hours_back );
        my %entry;
        unless ( $title_only )
        {
            next ENTRY if $entry->content->body !~ /\S/;
            my $xhtml = $self->html_to_dom( $entry->content->body )
                or die "Couldn't parse ", $entry->content->body;
            $self->_strip_attributes($xhtml);
            $self->_strip_tags($xhtml);
            $self->_handle_images($xhtml, $entry);

            my ( $body ) = $xhtml->findnodes("body");
            unless ( $xhtml->findnodes("head") )
            {
                my $head = $xhtml->createElement("head");
                my $title = $xhtml->createElement("title");
                my $text = $xhtml->createTextNode(__PACKAGE__ . "/" . $VERSION);
                $title->appendChild($text);
                $head->appendChild($title);
                $xhtml->insertBefore($head,$body);
            }

            # Cache it.
            unless ( $self->dtd )
            {
                $self->config->{dtd} ||= "xhtml1-transitional.dtd";
                my $dtd_file = File::ShareDir::dist_file(__PACKAGE__, $self->config->{dtd});
                $/ = undef;
                open my $fh, "<", $dtd_file
                    or croak "Couldn't open '$dtd_file' for reading: $!";
                $self->{ $self->config->{dtd} } = <$fh>;
                close $fh or carp "Trouble closing '$dtd_file': $!";
                $self->dtd( XML::LibXML::Dtd->parse_string($self->{ $self->config->{dtd} }) );
            }

            unless ( eval { $xhtml->validate($self->dtd); 1; } )
            {
                carp $@ || "Uknown error", " - parsing content of '", $entry->title, "' from ", $xml_feed->link;
                next ENTRY;
            }

            my $content = "";
            $content .= $_->serialize(1) for $body->childNodes();

            my $more = join("",
                            decode_utf8($self->config->{ellipsis}),
                            '<a class="readmore" href="',
                            $entry->link,
                            '">',
                            decode_utf8($self->config->{read_more}),
                            '</a>'
                );

            my $output = $self->truncate( $content, $excerpt_length, $more );
            $output =~ s/\s\s+/ /g;
            $entry{lede} = $output;
        }
        $entry{xml_feed_entry} = $entry;
        $entry{feed} = $xml_feed;
        push @entry, \%entry;
        last ENTRY if @entry >= $max_display;
    }

    return $self->next unless @entry;

    my $feed = RSSycklr::Feed->new( %{$info},
                                    ellipsis => $self->config->{ellipsis}, # not sure, weak ref to parent instead?
                                    xml_feed => $xml_feed,
        );

    $feed->{entries} = [ map { $_->{feed} = $feed; RSSycklr::Feed::Entry->new($_) } @entry ];
    $self->{_feeds_returned}++;
    return $feed;
}

sub html_to_dom {
    my $self = shift;
    my $html = shift || return;
    my $renew = "";
    my $p = HTML::TokeParser::Simple->new(\$html);
    no warnings "uninitialized";
    while ( my $token = $p->get_token )
    {
        if ( $token->is_text
             or
             not $HTML::Tagset::isKnown{ $token->get_tag } )
        {
            my $txt = HTML::Entities::decode_entities($token->as_is);
            $txt =~ s/[^[:print:]]+/ /g; # kill unprintables for a space.
            $renew .= $txt;
        }
        elsif ( $token->get_tag =~ /\Abr\b/i )
        {
            $renew .= "\n";
        }
        elsif ( $HTML::Tagset::canTighten{ $token->get_tag } )
        {
            # Replace block-like tags with \n if we have content
            # already and not more than twice consecutively.
            $renew .= $token->as_is;
        }
        else
        {
            $renew .= $token->as_is;
        }
    }
    $self->parse_html_string(<<"HTML");
<html><head><title>Untitled</title></head><body>$renew</body></html>
HTML
}

sub _maxed_out {
    my $self = shift;
    if ( $self->config->{max_feeds}
         and
         $self->config->{max_feeds} <= $self->{_feeds_returned} )
    {
        return 1;
    }
    return;
}

sub _strip_attributes {
    my ( $self, $root ) = @_;
    for my $node ( $root->findnodes("//*") )
    {
        for my $attr ( $node->attributes )
        {
            next if $node->nodeName eq 'a' and $attr->name eq 'href';
            next if $node->nodeName eq 'img' and $attr->name eq 'src';
            next if $attr->name eq 'title'
                and $node->nodeName =~ /\A(?:acronym|abbr|dfn|a)\z/;
            $node->removeAttribute($attr->name);
        }
    }
}

sub _handle_images {
    my ( $self, $root, $entry ) = @_;
    for my $node ( $root->findnodes("//img") )
    {
        if ( $node->getAttribute("src") !~ m,\Ahttp://, )
        {
            $node->parentNode->removeChild($node);
            return;
        }
        # Don't put a link on images that already have one.
        next if $node->parentNode->tagName eq "a";
        my $link = $node->getOwner->createElement("a");
        $link->setAttribute("href", $entry->link);
        $link->setAttribute("title", $entry->title);
        $node->setAttribute("alt", $entry->title);
        $link->appendChild( $node->cloneNode );
        $node->parentNode->replaceChild( $link, $node );
        return 1; # Just do one for now.
    }
}

sub _strip_tags {
    my ( $self, $root ) = @_;
    my $doc = $root->getOwnerDocument;
    my $keep = $self->keep_tags;
    # Special case, we must have this and don't want it mucking the interface.
    $keep->{body} = 1;
    my @nodes = $root->findnodes("//*");
    for my $node ( @nodes )
    {
        next unless $node;
        next if $keep->{$node->nodeName};
        my $frag = $doc->createDocumentFragment();
        for my $n ( $node->childNodes )
        {
            $frag->appendChild($n);
        }
        $node->replaceNode($frag);
    }

    return 1 unless $keep->{br};

    my @outer = $root->findnodes("body/*");
  FORWARD:
    for my $br ( @outer )
    {
        last FORWARD unless $br and $br->tagName eq "br";
        $br->parentNode->removeChild($br);
    }
  BACKWARD:
    for my $br ( reverse @outer )
    {
        last BACKWARD unless $br and $br->tagName eq "br";
        $br->parentNode->removeChild($br);
    }
    return 1;
}

sub _default_config {
    return {
        excerpt_length => 150,
        ellipsis => "\x{2026}", # chr(8230),
        read_more => "[more]",
        title_only => undef,
        hours_back => 72,
        max_feeds => 10,
        # max_entries => 10,
        max_display => 3,
        timeout => 30,
        css_class => "rssycklr",
        # title_length => undef,
        # excerpt_style => dl|p|br|ul
        # title_style => ul|p|br # not implemented, ul/li happens now
        # max_images => 1 # this is hardcoded for now
        feed_title_tag => "h4",
        dtd => "xhtml1-transitional.dtd",
    };
}

__PACKAGE__->meta->make_immutable();

package RSSycklr::Feed;
use Moose;
use HTML::Entities qw( decode_entities );
use Encode qw( decode_utf8 );

has "xml_feed" =>
    is => "ro",
    required => 1,
    isa => "Object",
    handles => [qw( tagline link copyright modified author generator language )],
    ;

has "entries" =>
    is => "ro",
    lazy => 1,
    default => sub { [] },
    required => 1,
    auto_deref => 1,
    isa => "ArrayRef",
    ;

has "title_override" =>
    is => "ro",
    isa => "Str",
    default => sub { "" },
    ;

sub count {
    scalar @{+shift->entries};
}

sub title {
    my $self = shift;
    return $self->{_title} if $self->{_title};
    # Try to guarantee it doesn't return entities.
    $self->{_title} = decode_entities(decode_entities($self->xml_feed->title));
    $self->{_title} = decode_utf8( $self->{_title} );
}

__PACKAGE__->meta->make_immutable();

package RSSycklr::Feed::Entry;
use Moose;
use DateTime;

has "xml_feed_entry" => (
    is => "ro",
    required => 1,
    isa => "Object", # ::Atom/RSS
    handles => [qw( title link content category id author issued modified )],
);

has "lede" => (
    is => "ro",
    isa => "Str",
    default => sub { "" },
);

has "feed" => (
    is => "ro",
    weak_ref => 1,
    isa => "RSSycklr::Feed",
);

__PACKAGE__->meta->make_immutable();

1;

__END__

Save for title_length stuff...

has "title_length" => (
    is => "ro",
    isa => "Int",
    default => sub { 0 },
);

has "ellipsis" => (
    is => "ro",
    isa => "Str",
    default => sub { "" },
);

if ( $self->title_length and length($self->{_title}) > $self->title_length )
{
    $self->{_title} = substr($self->{_title}, 0, $self->title_length);
    $self->{_title} =~ s/[\s[:punct:]]+\z//; # Trim punctuation and spaces off end.
    $self->{_title} .= $self->ellipsis;
}