Hailo - A pluggable Markov engine analogous to MegaHAL


Hailo documentation Contained in the Hailo distribution.

NAME

Hailo - A pluggable Markov engine analogous to MegaHAL

SYNOPSIS

This is the synopsis for using Hailo as a module. See hailo for command-line invocation.

    # Hailo requires Perl 5.10
    use 5.010;
    use Any::Moose;
    use Hailo;

    # Construct a new in-memory Hailo using the SQLite backend. See
    # backend documentation for other options.
    my $hailo = Hailo->new;

    # Various ways to learn
    my @train_this = ("I like big butts", "and I can not lie");
    $hailo->learn(\@train_this);
    $hailo->learn($_) for @train_this;

    # Heavy-duty training interface. Backends may drop some safety
    # features like journals or synchronous IO to train faster using
    # this mode.
    $hailo->train("megahal.trn");
    $hailo->train($filehandle);

    # Make the brain babble
    say $hailo->reply("hello good sir.");
    # Just say something at random
    say $hailo->reply();

DESCRIPTION

Hailo is a fast and lightweight Markov engine intended to replace AI::MegaHAL. It has a Mouse (or Moose) based core with pluggable storage, tokenizer and engine backends.

It is similar to MegaHAL in functionality; the main differences (with the default backends) are better scalability, drastically lower memory usage, an improved tokenizer, and tidier output.

With this distribution, you can create, modify, and query Hailo brains. To use Hailo in event-driven POE applications, you can use the POE::Component::Hailo wrapper. One example is POE::Component::IRC::Plugin::Hailo, which implements an IRC chat bot.

Etymology

Hailo is a portmanteau of HAL (as in MegaHAL) and failo.

Backends

Hailo supports pluggable storage and tokenizer backends. It also supports a pluggable UI backend, which is used by the hailo command-line utility.

Storage

Hailo can currently store its data in either a SQLite, PostgreSQL or MySQL database. Some NoSQL backends were supported in earlier versions, but they were removed as they had no redeeming quality.

SQLite is the primary target for Hailo. It's much faster and uses fewer resources than the other two. It's highly recommended that you use it.

See "Comparison of backends" in Hailo::Storage for benchmarks showing how the various backends compare under different workloads, and how you can create your own.

Tokenizer

By default Hailo will use the word tokenizer to split up input by whitespace, taking into account things like quotes, sentence terminators and more.

There's also the character tokenizer. It's not generally useful for a conversation bot, but it can be used to, e.g., generate new words given a list of existing words.
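As a rough illustration of the word-generation use case, the sketch below selects the character tokenizer by its short name, Chars (taken from the plugin list Hailo ships with). The word list is made up for the example, and whether a reply comes back depends on what has been learned, so the result is checked before printing.

```perl
use 5.010;
use strict;
use warnings;
use Hailo;

# In-memory brain that tokenizes input into characters instead of words.
my $hailo = Hailo->new(tokenizer_class => 'Chars');

# Made-up word list, purely for illustration.
my @words = qw(markov marker market mark marked marking);
$hailo->learn(\@words);

# reply() may return undef if the engine cannot build anything.
my $new_word = $hailo->reply();
say defined $new_word ? $new_word : '(no reply)';
```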

UPGRADING

Hailo makes no promises that brains generated with earlier versions will be compatible with future versions, and due to the way Hailo works there's no practical way to make that promise. Learning in Hailo is lossy, so an accurate conversion is impossible.

If you're maintaining a Hailo brain that you want to keep using you should save the input you trained it on and re-train when you upgrade.
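One way to follow this advice is to append every line to a plain-text log as it is learned, then replay that log with train after an upgrade. The sketch below assumes a hypothetical log file name (hailo-input.log) and a hypothetical helper (learn_and_log); neither is part of Hailo.

```perl
use 5.010;
use strict;
use warnings;
use autodie qw(open close);
use Hailo;

# Hypothetical log file holding every line the brain has been taught.
my $log_file = 'hailo-input.log';

# Hypothetical helper: learn a line and append it to the log so it can
# be replayed later.
sub learn_and_log {
    my ($hailo, $line) = @_;
    open my $fh, '>>:encoding(UTF-8)', $log_file;
    say {$fh} $line;
    close $fh;
    $hailo->learn($line);
    return;
}

my $hailo = Hailo->new;
learn_and_log($hailo, $_) for "I like big butts", "and I can not lie";

# After upgrading Hailo: build a fresh brain by replaying the log.
my $fresh = Hailo->new;
$fresh->train($log_file);
```

Both brains here are in-memory to keep the sketch short; a persistent brain would pass the brain attribute to the constructor.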

Hailo is always going to lose information present in the input you give it. How input tokens get split up and saved to the storage backend depends on the version of the tokenizer being used and how that input gets saved to the database.

For instance if an earlier version of Hailo tokenized "foo+bar" simply as "foo+bar" but a later version split that up into "foo", "+", "bar", then an input of "foo+bar are my favorite metasyntactic variables" wouldn't take into account the existing "foo+bar" string in the database.

Tokenizer changes like this would cause the brains to accumulate garbage and would leave other parts in a state they wouldn't otherwise have gotten into.
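The effect can be reproduced without Hailo at all. The sketch below uses two hypothetical tokenizers (they are stand-ins, not Hailo's real ones) and a plain hash in place of the brain, to show that tokens stored by the old tokenizer are never matched once the new tokenizer splits the same input differently.

```perl
use strict;
use warnings;

# Hypothetical "old" tokenizer: splits on whitespace only.
sub tokenize_old {
    my ($text) = @_;
    return grep { length } split /\s+/, $text;
}

# Hypothetical "new" tokenizer: words and punctuation become separate
# tokens, so "foo+bar" turns into "foo", "+", "bar".
sub tokenize_new {
    my ($text) = @_;
    return $text =~ /\w+|[^\w\s]/g;
}

# Stand-in for a brain trained with the old tokenizer.
my %brain = map { $_ => 1 } tokenize_old("foo+bar");

my $input = "foo+bar are my favorite metasyntactic variables";

# Tokens from the new input that the existing brain already knows.
my @known_old = grep { $brain{$_} } tokenize_old($input);
my @known_new = grep { $brain{$_} } tokenize_new($input);

print "known with old tokenizer: ", scalar(@known_old), "\n";
print "known with new tokenizer: ", scalar(@known_new), "\n";
```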

There have been more drastic changes to the database format itself in the past.

Having said all that, the database format and the tokenizer are relatively stable. At the time of writing 0.33 is the latest release, and it's compatible with brains down to at least 0.17. If you're upgrading and there isn't a big notice about the storage format being incompatible in the Changes file, your old brains will probably work just fine.

ATTRIBUTES

brain

The name of the brain (file name, database name) to use as storage. There is no default. Whether this gets used at all depends on the storage backend; currently only SQLite uses it.

save_on_exit

A boolean value indicating whether Hailo should save its state before its object gets destroyed. This defaults to true and will simply call save at DEMOLISH time.

See "in_memory" in Hailo::Storage::SQLite for how the SQLite backend uses this option.

order

The Markov order (chain length) you want to use for an empty brain. The default is 2.

engine_class

storage_class

tokenizer_class

ui_class

A short name for the class we use for the engine, storage, tokenizer or ui backends.

By default this is Default for the engine, SQLite for storage, Words for the tokenizer and ReadLine for the UI. The UI backend is only used by the hailo command-line interface.

You can only specify the short name of one of the packages Hailo itself ships with. If you need another class then just prefix the package with a plus (Catalyst style), e.g. +My::Foreign::Tokenizer.
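The lookup rule can be sketched in isolation. The standalone function below (resolve_class is a hypothetical name) mirrors the approach taken by the module's private _new_class method: a leading plus means the rest is used as the full package name, otherwise the short name is matched case-insensitively against the shipped plugin list and the shortest match wins. It only resolves names and does not load any class.

```perl
use strict;
use warnings;
use List::Util qw(first);

# The plugins Hailo ships with, as listed in its PLUGINS constant.
my @plugins = qw(
    Hailo::Engine::Default
    Hailo::Engine::Scored
    Hailo::Storage::MySQL
    Hailo::Storage::PostgreSQL
    Hailo::Storage::SQLite
    Hailo::Tokenizer::Chars
    Hailo::Tokenizer::Words
    Hailo::UI::ReadLine
);

# Hypothetical helper: turn a backend type plus a short or plus-prefixed
# name into a package name, or undef if nothing matches.
sub resolve_class {
    my ($type, $class) = @_;

    # "+My::Foreign::Tokenizer" means: use the package name verbatim.
    return $1 if $class =~ /^\+(.+)$/;

    # Otherwise pick the shortest shipped plugin of this type whose
    # name matches, ignoring case.
    return first { /\Q$type\E:.*:\Q$class\E/i }
           sort  { length $a <=> length $b }
           @plugins;
}

print resolve_class('Storage',   'sqlite'), "\n";
print resolve_class('Tokenizer', '+My::Foreign::Tokenizer'), "\n";
```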

engine_args

storage_args

tokenizer_args

ui_args

A HashRef of arguments for engine/storage/tokenizer/ui backends. See the documentation for the backends for what sort of arguments they accept.

METHODS

new

This is the constructor. It accepts the attributes specified in ATTRIBUTES.
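A minimal sketch of a constructor call that sets several of the documented attributes explicitly. The values shown are the documented defaults except for order, which is raised to 3 for illustration. No brain is given, so the default SQLite backend keeps everything in memory.

```perl
use 5.010;
use strict;
use warnings;
use Hailo;

my $hailo = Hailo->new(
    order           => 3,          # Markov order for an empty brain
    engine_class    => 'Default',  # short names of shipped backends
    storage_class   => 'SQLite',
    tokenizer_class => 'Words',
    save_on_exit    => 1,
);

$hailo->learn("Hailo is a pluggable Markov engine analogous to MegaHAL");

# reply() may return undef, so fall back to a placeholder.
say $hailo->reply("Hailo") // '(no reply)';
```

Because the constructor is strict, misspelled attribute names are rejected rather than silently ignored.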

learn

Takes a string or an array reference of strings and learns from them.

train

Takes a filename, filehandle or array reference and learns from all its lines. If a filename is passed, the file is assumed to be UTF-8 encoded. Unlike learn, this method sacrifices some safety (disables the database journal, fsyncs, etc.) for speed while learning.

You can provide a second parameter which, if true, enables aggressive caching while training. This speeds things up considerably for large inputs but takes up quite a bit of memory.
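For example, training from an array reference with the caching flag switched on might look like the sketch below. The corpus is made up for illustration; a real one would normally be far larger, which is where the caching pays off.

```perl
use 5.010;
use strict;
use warnings;
use Hailo;

my $hailo = Hailo->new;

# Made-up corpus, one training line per element.
my @corpus = (
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps in the sun",
    "a quick fox is hard to catch",
);

# A true second argument enables aggressive caching while training.
$hailo->train(\@corpus, 1);

say $hailo->reply("fox") // '(no reply)';
```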

reply

Takes an optional line of text and generates a reply that might be relevant.

learn_reply

Takes a string argument, learns from it, and generates a reply that might be relevant. This is equivalent to calling learn followed by reply.

save

Tells the underlying storage backend to save its state. Any arguments to this method are passed as-is to the backend.

stats

Takes no arguments. Returns the number of tokens, expressions, previous token links and next token links.
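A short usage sketch, assuming the four counts come back as a list in the order given above (the description does not spell out the return shape, so treat the unpacking as an assumption):

```perl
use 5.010;
use strict;
use warnings;
use Hailo;

my $hailo = Hailo->new;
$hailo->learn("I like big butts and I can not lie");

# Assumed return shape: a four-element list in the documented order.
my ($tokens, $expressions, $prev_links, $next_links) = $hailo->stats;

say "tokens: $tokens, expressions: $expressions";
say "previous links: $prev_links, next links: $next_links";
```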

SUPPORT

You can join the IRC channel #hailo on FreeNode if you have questions.

BUGS

Bugs, feature requests and other issues are tracked in Hailo's RT on rt.cpan.org.

SEE ALSO

* POE::Component::Hailo - A non-blocking POE wrapper around Hailo
* POE::Component::IRC::Plugin::Hailo - A Hailo IRC bot plugin
* http://github.com/hinrik/failo - Failo, an IRC bot that uses Hailo
* http://github.com/bingos/gumbybrain - GumbyBRAIN, a more famous IRC bot that uses Hailo
* Hailo::UI::Web - A Catalyst and jQuery powered web interface to Hailo available at hailo.nix.is and as hailo-ui-web on GitHub
* HALBot - Another Catalyst Dojo powered web interface to Hailo available at bifurcat.es and as halbot-on-the-web at gitorious
* http://github.com/pteichman/cobe - cobe, a Python port of MegaHAL "inspired by the success of Hailo"

AUTHORS

Hinrik Örn Sigurðsson <hinrik.sig@gmail.com>

Ævar Arnfjörð Bjarmason <avar@cpan.org>

LICENSE AND COPYRIGHT


package Hailo;
BEGIN {
  $Hailo::AUTHORITY = 'cpan:AVAR';
}
BEGIN {
  $Hailo::VERSION = '0.69';
}

use 5.010;
use autodie qw(open close);
use Any::Moose;
use Any::Moose 'X::StrictConstructor';
use File::Glob ':glob';
use Class::Load qw(try_load_class);
use Scalar::Util qw(blessed);
use List::Util qw(first);
use namespace::clean -except => 'meta';

use constant PLUGINS => [ qw[
    Hailo::Engine::Default
    Hailo::Engine::Scored
    Hailo::Storage::MySQL
    Hailo::Storage::PostgreSQL
    Hailo::Storage::SQLite
    Hailo::Tokenizer::Chars
    Hailo::Tokenizer::Words
    Hailo::UI::ReadLine
] ];

has brain => (
    isa => 'Str',
    is  => 'rw',
);

has order => (
    isa     => 'Int',
    is      => 'rw',
    default => 2,
    trigger => sub {
        my ($self, $order) = @_;
        $self->_custom_order(1);
    },
);

has _custom_order => (
    isa           => 'Bool',
    is            => 'rw',
    default       => 0,
    init_arg      => undef,
    documentation => "Here so we can differentiate between order being explicitly set and being left at its default value",
);

has _custom_tokenizer_class => (
    isa           => 'Bool',
    is            => 'rw',
    default       => 0,
    init_arg      => undef,
    documentation => "Here so we can differentiate between tokenizer_class being explicitly set and being left at its default value",
);

has save_on_exit => (
    isa     => 'Bool',
    is      => 'rw',
    default => 1,
);

has brain_resource => (
    documentation => "Alias for `brain' for backwards compatibility",
    isa           => 'Str',
    is            => 'rw',
    trigger       => sub {
        my ($self, $brain) = @_;
        $self->brain($brain);
    },
);

sub BUILD {
    my ($self) = @_;
    my $brain = $self->brain;
    return if !defined $brain;
    $self->brain(bsd_glob($brain));
    return;
}

my %has = (
    engine => {
        name => 'Engine',
        default => 'Default',
    },
    storage => {
        name => 'Storage',
        default => 'SQLite',
    },
    tokenizer => {
        name => 'Tokenizer',
        default => 'Words',
    },
    ui => {
        name => 'UI',
        default => 'ReadLine',
    },
);

for my $k (keys %has) {
    my $name          = $has{$k}->{name};
    my $default       = $has{$k}->{default};
    my $method_class  = "${k}_class";
    my $method_args   = "${k}_args";

    # working classes
    has "${k}_class" => (
        isa           => 'Str',
        is            => "rw",
        default       => $default,
        ($k ~~ 'tokenizer'
         ? (trigger => sub {
             my ($self, $class) = @_;
             $self->_custom_tokenizer_class(1);
         })
         : ())
    );

    # Object arguments
    has "${k}_args" => (
        documentation => "Arguments for the $name class",
        isa           => 'HashRef',
        is            => "ro",
        default       => sub { +{} },
    );

    # Working objects
    has "_${k}" => (
        does        => "Hailo::Role::$name",
        lazy_build  => 1,
        is          => 'ro',
        init_arg    => undef,
    );

    # Generate the object itself
    no strict 'refs';
    *{"_build__${k}"} = sub {
        my ($self) = @_;

        my $obj = $self->_new_class(
            $name,
            $self->$method_class,
            {
                arguments => $self->$method_args,
                ($k ~~ [ qw< engine storage > ]
                 ? (order     => $self->order)
                 : ()),
                ($k ~~ [ qw< engine > ]
                 ? (storage   => $self->_storage)
                 : ()),
                (($k ~~ [ qw< storage > ] and defined $self->brain)
                 ? (
                     hailo => do {
                         require Scalar::Util;
                         Scalar::Util::weaken(my $s = $self);

                         my %callback = (
                             has_custom_order           => sub { $s->_custom_order },
                             has_custom_tokenizer_class => sub { $s->_custom_tokenizer_class },
                             set_order => sub {
                                 my ($db_order) = @_;
                                 $s->order($db_order);
                                 $s->_engine->order($db_order);
                             },
                             set_tokenizer_class => sub {
                                 my ($db_tokenizer_class) = @_;
                                 $s->tokenizer_class($db_tokenizer_class);
                             },
                         );

                         \%callback;
                     },
                     brain => $self->brain
                 )
                 : ()),
                (($k ~~ [ qw< storage > ]
                  ? (tokenizer_class => $self->tokenizer_class)
                  : ()))
            },
        );

        return $obj;
    };
}

sub _new_class {
    my ($self, $type, $class, $args) = @_;

    my $pkg;
    if ($class =~ m[^\+(?<custom_plugin>.+)$]) {
        $pkg = $+{custom_plugin};
    } else {
        my @plugins = @{ $self->PLUGINS };
        # Be fuzzy about short names: match case-insensitively (SQLite
        # or sqlite both work) and prefer the shortest matching plugin.
        $pkg = first { / $type : .* : $class /ix }
               sort { length $a <=> length $b }
               @plugins;

        unless ($pkg) {
            local $" = ', ';
            my @p = grep { /$type/ } @plugins;
            die "Couldn't find a class name matching '$class' in plugins '@p'";
        }
    }

    my ($success, $error) = try_load_class($pkg);
    die $error if !$success;

    return $pkg->new(%$args);
}

sub save {
    my ($self, @args) = @_;
    $self->_storage->save(@args);
    return;
}

sub train {
    my ($self, $input, $fast) = @_;

    $self->_storage->start_training();

    given ($input) {
        # With STDIN
        when (not ref and defined and $_ eq '-') {
            die "You must provide STDIN when training from '-'" if $self->_is_interactive(*STDIN);
            $self->_train_fh(*STDIN, $fast);
        }
        # With a filehandle
        when (ref eq 'GLOB') {
            $self->_train_fh($input, $fast);
        }
        # With a file
        when (not ref) {
            open my $fh, '<:encoding(utf8)', $input;
            $self->_train_fh($fh, $fast, $input);
        }
        # With an Array
        when (ref eq 'ARRAY') {
            for my $line (@$input) {
                $self->_learn_one($line, $fast);
                $self->_engine->flush_cache if !$fast;
            }
            $self->_engine->flush_cache if $fast;
        }
        # With something naughty
        default {
            die "Unknown input: $input";
        }
    }

    $self->_storage->stop_training();

    return;
}

sub _train_fh {
    my ($self, $fh, $fast) = @_;

    while (my $line = <$fh>) {
        chomp $line;
        $self->_learn_one($line, $fast);
        $self->_engine->flush_cache if !$fast;
    }
    $self->_engine->flush_cache if $fast;

    return;
}

sub learn {
    my ($self, $input) = @_;
    my $inputs;

    given ($input) {
        when (not defined) {
            die "Cannot learn from undef input";
        }
        when (not ref) {
            $inputs = [$input];
        }
        # With an Array
        when (ref eq 'ARRAY') {
            $inputs = $input;
        }
        default {
            die "Unknown input: $input";
        }
    }

    my $storage = $self->_storage;

    $storage->start_learning();
    $self->_learn_one($_) for @$inputs;
    $storage->stop_learning();
    return;
}

sub _learn_one {
    my ($self, $input, $fast) = @_;
    my $engine  = $self->_engine;

    my $tokens = $self->_tokenizer->make_tokens($input);
    $fast ? $engine->learn_cached($tokens) : $engine->learn($tokens);

    return;
}

sub learn_reply {
    my ($self, $input) = @_;
    $self->learn($input);
    return $self->reply($input);
}

sub reply {
    my ($self, $input) = @_;

    my $storage   = $self->_storage;
    # start_training() hasn't been called so we can't guarantee that
    # the storage has been engaged at this point. This must be called
    # before ->_engine() is called anywhere to ensure that the
    # lazy-loading in the engine works.
    $storage->_engage() unless $storage->_engaged;

    my $engine    = $self->_engine;
    my $tokenizer = $self->_tokenizer;

    my $reply;
    if (defined $input) {
        my $tokens = $tokenizer->make_tokens($input);
        $reply = $engine->reply($tokens);
    }
    else {
        $reply = $engine->reply();
    }

    return unless defined $reply;
    return $tokenizer->make_output($reply);
}

sub stats {
    my ($self) = @_;

    return $self->_storage->totals();
}

sub DEMOLISH {
    my ($self) = @_;
    $self->save() if blessed $self->{_storage} and $self->save_on_exit;
    return;
}

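sub _is_interactive {
    my ($self, $fh) = @_;
    require IO::Interactive;
    # Check the handle the caller passed (train() passes *STDIN) instead
    # of always falling back to the currently selected output handle.
    return IO::Interactive::is_interactive(defined $fh ? $fh : ());
}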

__PACKAGE__->meta->make_immutable;