| Speech-Recognizer-SPX documentation | view source | Contained in the Speech-Recognizer-SPX distribution. |
Speech::Recognizer::SPX - Perl extension for the PocketSphinx speech recognizer
use Speech::Recognizer::SPX qw(:fbs :uttproc);
fbs_init([arg1 => $val, arg2 => $val, ...]);
uttproc_begin_utt();
uttproc_end_utt();
fbs_end();
This module provides a Perl interface to the PocketSphinx speech recognizer library.
use Speech::Recognizer::SPX qw(:fbs :uttproc :lm);
Because most parts of the PocketSphinx library contain a lot of global internal state, it makes no sense to use an object-oriented interface at this time. However, I don't want to clobber your namespace with a billion functions you may or may not use. To make things easier on your typing hands, the available functions have been grouped into tags representing modules inside the library itself. These tags and the functions they import are listed below.
:fbs
This is somewhat of a misnomer - FBS stands for Fast Beam Search, but in actual fact this module (the fbs_main.c file in PocketSphinx) just wraps around the other modules in Sphinx (one of which actually does fast beam search :-) and initializes the recognizer for you. Functions imported by this tag are:
fbs_init fbs_end
:uttproc
This is the utterance processing module. You feed it data (either raw audio data or feature data - which currently means vectors of mel-frequency cepstral coefficients), and it feeds back search hypotheses based on a language model. Functions imported by this tag are:
uttfile_open uttproc_begin_utt uttproc_rawdata uttproc_cepdata uttproc_end_utt uttproc_abort_utt uttproc_stop_utt uttproc_restart_utt uttproc_result uttproc_result_seg uttproc_partial_result uttproc_partial_result_seg uttproc_get_uttid uttproc_set_auto_uttid_prefix uttproc_set_lm uttproc_lmupdate uttproc_set_context uttproc_set_rawlogdir uttproc_set_mfclogdir uttproc_set_logfile search_get_alt
:lm
This is the language model module. It loads and unloads language models. Functions imported by this tag are:
lm_read lm_delete
fbs_init(\@args);
The fbs_init function is the main entry point to the Sphinx
library. If given no arguments, it will snarf options from the global
@ARGV array (because that's what its C equivalent does). To make
life easier, and to entice people to write Sphinx programs in Perl
instead of C, we also give you a way around this: you can pass a
reference to an array whose contents are arranged in the same way
@ARGV might be, i.e. a list of option/value pairs.
To make things pretty, you can use the magical => operator, like this:
fbs_init([samp    => 16000,
          datadir => '/foo/bar/baz']);
Note that you can omit the leading dash from argument names (if you like).
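To make the "same way @ARGV might be" convention concrete, here is a minimal sketch of how a list of option/value pairs maps onto an @ARGV-style array with leading dashes. The normalize_args helper is hypothetical and is not part of Speech::Recognizer::SPX; it only illustrates the calling convention and makes no claim about what the module does internally.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Speech::Recognizer::SPX): turn a list of
# option/value pairs into an @ARGV-style array reference, adding the leading
# dash to any option name that lacks one.
sub normalize_args {
    my @pairs = @_;
    die "expected option/value pairs\n" if @pairs % 2;
    my @argv;
    while (my ($opt, $val) = splice(@pairs, 0, 2)) {
        $opt = "-$opt" unless $opt =~ /^-/;
        push @argv, $opt, $val;
    }
    return \@argv;
}

my $args = normalize_args(samp => 16000, datadir => '/foo/bar/baz');
print join(' ', @$args), "\n";

# With the real module loaded, the resulting reference could then be
# handed to fbs_init (not run here):
#   fbs_init($args);
```

Either spelling of the option names should end up describing the same configuration, so use whichever reads better in your program.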
Calling this function will block your process for a long time and print unbelievable amounts of debugging gunk to STDOUT and STDERR. This will get better eventually.
This function has a large number of options. Someday they will be
documented. Until then, either look in the example code, or go
straight to the source, namely the file include/cmdln_macro.h.
uttproc_begin_utt() or die;
uttproc_rawdata($buf [, $block]) or die;
uttproc_cepdata(\@cepvecs [, $block]) or die;
uttproc_end_utt() or die;
To actually recognize some speech data, you use the functions exported
by the :uttproc tag. Before calling any of them, you must
successfully call uttproc_begin_utt, or Bad Things are certain to
happen (I can't speculate on exactly what things, but I'm sure they're
bad).
You should call uttproc_begin_utt before each distinct utterance
(to the extent that you can predict when individual utterances begin
or end, of course...), and uttproc_end_utt at the end of each.
After calling uttproc_begin_utt, you can pass either raw audio data
or cepstral feature vectors (see Audio::MFCC), using
uttproc_rawdata or uttproc_cepdata, respectively. Due to the
way feature extraction works, you cannot mix the two types of data
within the same utterance.
If live mode is in effect (i.e. -livemode => TRUE was passed to
fbs_init), the optional $block parameter controls whether these
functions will return immediately after processing a single frame of
data, or whether they will process all pending frames of data. If you
need partial results, you probably want to pass a non-zero value
(FIXME: should be a true value but I don't know how to test for truth
in XS code) for $block, though this may increase latency elsewhere
in the system.
Unfortunately, it appears that there is no specific function to flush
all unprocessed frames before getting a partial result. Calling
uttproc_rawdata with an empty $buf and $block non-zero seems to
have the desired effect.
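The feeding loop described above can be sketched as follows. The feed_raw_audio helper, the 4096-byte chunk size, and the in-memory buffer of silence are all assumptions made for illustration; the helper takes a callback in place of uttproc_rawdata so that the sketch runs without the recognizer installed. It finishes with the empty-buffer trick described above, so that a partial result fetched afterwards reflects all the audio fed so far.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Speech::Recognizer::SPX): read raw audio
# from $fh in fixed-size chunks and hand each chunk to $feed, which stands in
# for uttproc_rawdata. Returns the number of audio chunks fed.
sub feed_raw_audio {
    my ($fh, $chunk_bytes, $feed) = @_;
    binmode $fh;
    my $chunks = 0;
    while (1) {
        my $got = read($fh, my $buf, $chunk_bytes);
        die "read failed: $!\n" unless defined $got;
        last if $got == 0;                    # end of input
        $feed->($buf, 0) or die "feeding audio failed\n";
        $chunks++;
    }
    # Flush pending frames: empty buffer, non-zero $block.
    $feed->('', 1) or die "flush failed\n";
    return $chunks;
}

# Demonstration with a stub callback and an in-memory buffer of silence.
my $pcm = "\0" x 10_000;
open my $fh, '<', \$pcm or die "open: $!";
my @sizes;
my $n = feed_raw_audio($fh, 4096, sub { push @sizes, length $_[0]; 1 });
print "fed $n chunks (sizes: @sizes)\n";

# With the real module loaded and fbs_init already called, the same helper
# could be driven like this (not run here):
#   uttproc_begin_utt() or die;
#   feed_raw_audio($audio_fh, 4096, \&uttproc_rawdata);
#   my ($frames, $partial) = uttproc_partial_result();
#   uttproc_end_utt() or die;
```

Passing the feeding function as a callback keeps the file-reading logic separate from the recognizer, which also makes it easy to exercise without audio hardware.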
my ($frames, $hypothesis) = uttproc_result($block);
my ($frames, $hypothesis) = uttproc_partial_result();
my ($frames, $hypseg) = uttproc_result_seg($block);
my ($frames, $hypseg) = uttproc_partial_result_seg();
my $hypothesis = $hypseg->sent;
my $segs = $hypseg->segs;
my @nbest = search_get_alt($n); # Must call uttproc_result first!
foreach my $nhyp (@nbest) {
    my $nsent = $nhyp->sent;
    my $nsegs = $nhyp->segs;
    print "Hypothesis: $nsent\n";
    foreach my $seg (@$nsegs) {
        printf "  Start frame %d end frame %d word %s\n",
            $seg->sf, $seg->ef, $seg->word;
    }
}
At any point during utterance processing, you may call
uttproc_partial_result to obtain the current "best guess". Note
that this function does not flush unprocessed frames, so you might
want to use the trick mentioned above to do so before calling it if
you are operating in non-blocking mode.
By contrast, you may not call uttproc_result until after you have
called uttproc_end_utt (or uttproc_abort_utt or also possibly
uttproc_stop_utt). The $block flag is also optional here, but I
strongly suggest you use it.
The uttproc_result_seg and uttproc_partial_result_seg functions
work similarly, except that instead of returning a string,
they return a Speech::Recognizer::SPX::Hypothesis object which
contains probability and word segmentation information. You can
access its fields with the following accessor functions:
sent senscale ascr lscr segs
The sent field contains the string representation of the
hypothesis, and is equivalent to the string returned by
uttproc_result. The senscale, ascr, and lscr fields are
currently unimplemented.
The segs field contains a reference to an array of
Speech::Recognizer::SPX::Segment objects. Each of these objects
contains fields which can be accessed with the following accessors:
word sf ef ascr lscr conf latden phone_perp fsg_state_to fsg_state_from
The word field contains the string representation of the word. The
sf and ef fields contain the start and end frames for this word.
The ascr and lscr fields contain the acoustic and language model
scores for the word. The fsg_state_from and fsg_state_to fields
indicate the finite-state grammar states in which this entry starts
and terminates, if a finite-state grammar is used. The latden
field contains the average lattice density for this word, while the
phone_perp field contains the average phoneme perplexity. The conf
field contains a confidence score which is an estimated probability
that this word was recognized correctly.
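As a small illustration of the sf and ef fields, the sketch below converts frame indices into approximate times in seconds. The rate of 100 frames per second is an assumption; check your own configuration before relying on it. MockSegment and segment_times are hypothetical names: the mock class stands in for Speech::Recognizer::SPX::Segment, providing only the word, sf and ef accessors, so that the example runs without the recognizer.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Speech::Recognizer::SPX::Segment, with only the
# word, sf and ef accessors.
package MockSegment;
sub new  { my ($class, %f) = @_; return bless {%f}, $class }
sub word { $_[0]{word} }
sub sf   { $_[0]{sf} }
sub ef   { $_[0]{ef} }

package main;

# Hypothetical helper: map each segment to a hash with its word and its
# start and end times in seconds, given an assumed frame rate.
sub segment_times {
    my ($segs, $frames_per_sec) = @_;
    return map {
        +{  word  => $_->word,
            start => $_->sf / $frames_per_sec,
            end   => $_->ef / $frames_per_sec }
    } @$segs;
}

my $segs = [
    MockSegment->new(word => 'hello', sf => 0,  ef => 50),
    MockSegment->new(word => 'world', sf => 51, ef => 120),
];

for my $t (segment_times($segs, 100)) {    # 100 frames/sec is an assumption
    printf "%-6s %.2fs to %.2fs\n", $t->{word}, $t->{start}, $t->{end};
}
```

With real results, the array reference returned by the segs accessor of a hypothesis object could be passed in place of the mock list.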
You can also obtain an N-best list of hypotheses using the
search_get_alt function. This function returns a list of the
number of hypotheses requested, or as many as can be found. Each
element in this list is a Speech::Recognizer::SPX::Hypothesis
object like the above, except that the acoustic/language model score
is not filled in.
Changing language models, etc, etc... This documentation is under construction.
For now there are just some example programs in the distribution.
David Huggins-Daines <dhuggins@cs.cmu.edu>. Support for N-best hypotheses was funded by SingleTouch Interactive, Inc (http://www.singletouch.net/).
perl(1), Speech::Recognizer::SPX::Server, Audio::SPX, Audio::MFCC