README for Text::Positional::Ngram

Text::Positional::Ngram is a perl module to determine variable length non-contiguous and contiguous ngrams from large corpora.

The module:

  1. Provides an easy to use interface to determine ngrams from a corpus. Some of the basic functionality include:

REQUIREMENTS

This module REQUIRES that the following software be download and installed.

--Programming Languages
Perl (version 5.8.5 or better)

INSTALLATION

There are multiple ways to install this package.

  1. You can use CPAN.pm to install Text::Positional::Ngram.

    To install type the following:

perl -MCPAN -e 'install Text-Positional-Ngram'

2. Or you can install this yourself.

To install this module type the following:

      perl Makefile.PL
      make
      make test
      make install

PROGRAM :

text-positional-ngram-driver.pl

This program takes as input a flat ASCII text file and outputs all Ngrams, or token sequences of length 'n', where the value of 'n' can be decided by the user, and the frequency of the ngram.

Using text-positional-ngram-driver.pl

The most basic way of running this program is the following:

% text-positional-ngram-driver.pl output.txt input.txt

            where input.txt is the input text file in which to find 
            the Ngrams and output.txt is the output file into which 
            count.pl will put all the Ngrams with their frequencies. 

Changing the Length of Ngrams

        The default ngram size is 2. This can be changed by using
        the parameter option --ngram N, where N is the number of
        tokens in each ngram. For example, to find all the trigrams
        in the file input.txt, you would running program:    
         
             %count.pl --ngram 3 output.txt input.txt

Using User-Provided Token Definitions:

The default token definitions are:

        \w+         -> this matches a contiguous sequence of 
                       alpha-numeric characters

        [\.,;:\?!]  -> this matches a single punctuation mark

        The default token definitions can be over-ridden by using 
        the option:
        
             --token FILE 
        
        where FILE is the name of the file containing the regular 
        expressions on which the token definitions will be based. 

        Each regular expression in this FILE should be:
              1. on a line of its own
              2. should be delimited by the forward slash '/'. 
              3. should be valid Perl regular expressions

Removing character strings

This option

--nontoken FILE

        allows a user to define regular expressions that 
        will match strings that should not be considered as tokens. 
        These strings will be removed from the data and not counted 
        or included in Ngrams. 

        The --nontoken option is recommended when there are predictable 
        sequences of characters that you know should not be included as 
        tokens for purposes of counting Ngrams, finding collocations, etc. 

        For example, if mark-up symbols like <s>, <p>, [item], [/ptr] 
        exist in text being processed, you may want to include those 
        in your list of nontoken items so they are discarded. If not, 
        a simple regex such as /\w+/ will match with 's', 'p', 'item',  
        'ptr' from these tags, leading to confusing results. 

        The FILE following the nontoken option file should contain Perl 
        regular expressions delimited by forward slashes '/' that define 
        non-tokens. Multiple expressions may be placed on separate lines 
        or be separated via the '|'  (Perl 'or') as in /regex1|regex2|../ 

        The following are some of the examples of valid non-token 
        definitions:

                /<\/?s|p>/ : will remove xml tags like <s>, <p>, </s>, </p>. 

                /\[\w+\]/  : will remove all words which appear in square 
                             brackets like [p], [item], [123] and so on. 

        The program will first remove any string from the input data that 
        matches the non-token regular expression, and only then will match 
        the remaining data against the token definitions. 

The Output Format

        Assume that the following are the contents of the input text file to
        text-positional-ngram-driver.pl; let us call the file test.txt: 

                first line of text
                second line
                and a third line of text

         Assume that text-positional-ngram-driver.pl is run in its most general
         mode:

                 % text-positional-ngram-driver.pl test.out test.txt 

         The output will contain all the bigrams found in the file test.txt
         using the default tokens as specified above. The contents of the 
         output file test.out would be:
        
                11
                line<>of<>2
                of<>text<>2
                second<>line<>1 
                line<>and<>1 
                and<>a<>1 
                a<>third<>1 
                first<>line<>1 
                third<>line<>1 
                text<>second<>1 

         The number on the first line, 11, indicates that there were 
         11 bigrams in test.txt

         Following are the bigrams that were found in the test.txt file
         delimited by the diamond sign, "<>". Therefore the first bigram
         is line<>of<>, make up of the tokens "line" and "of" in that
         order. After the diamond following the last token there is a 
         number, this number denotes how many times this bigram occurred
         in the text. 

The Marginals Option

         To obtain the a partial set of marginal counts for the bigram
         the option:

             --marginals

         must be set. This option outputs the individual frequency counts
         of each token in the ngram. Let us use our example from above
         but run the text-positional-ngram-driver.pl program as follows:

                 % text-positional-ngram-driver.pl --marginals test.out test.txt

         The output will contain all the bigrams found i the file test.txt
         using the default tokens as specified above, their frequency
         counts and the number of times each of the tokens in the bigram
         occurred in their respective positions. The contents of the 
         output file test.out would be:
                11
                line<>of<>2 3 2 
                of<>text<>2 2 2 
                second<>line<>1 1 3 
                line<>and<>1 3 1 
                and<>a<>1 1 1 
                a<>third<>1 1 1 
                first<>line<>1 1 3 
                third<>line<>1 1 3 
                text<>second<>1 1 1 
 
          The first number after the bigram is the frequency of the bigram
          seen in test.out. The second number after the bigram is the
          number of times the first token was seen in the first position 
          of all the bigrams and the second number is the number of times
          the second token was seen in the second position of all the
          bigrams.

Stoplists

          The user may "stop" the Ngrams formed by text-positional-ngram-driver.pl 
          by providing a list of stop-tokens through the option:

              --stop FILE. 
           
          Each stop token in FILE should be a Perl regular expression that 
          occurs on a line by itself. This expression should be delimited 
          by forward slashes, as in /REGEX/. All regular expression 
          capabilities in Perl are supported except for regular expression 
          modifiers (like the "i" /REGEX/i). 

          The following are a few examples of valid entries in the stop list.

                /^\d+$/
                /\bthe\b/
                /\b[Tt][Hh][Ee]\b/
                /^and$/
                /\bor\b/
                /^be(ing)?$/

                There are two modes in which a stop list can be used, 
                AND and OR. The default mode is AND, which means that
                an Ngram must be made up entirely of words from the
                stoplist before it is eliminated. The OR mode eliminates 
                an Ngram if any of the words that make up the Ngram
                are found in the stoplist.

Removing Low Frequency Ngrams:

           We allow the user to either remove or to not display low 
           frequency Ngrams. The user can remove low frequency Ngrams
           by using the option :

                --remove N
 
           by which all Ngrams that occur less than n times are
           removed. The Ngram and the individual frequency counts are
           adjusted accordingly upon the removal of these Ngrams. 
           
           The user can choose not to display low frequency Ngrams by
           using the option :
        
                --frequency N, 
        
           by which Ngrams that occur less than n times are not
           displayed in the output. Note that this differs from the
           remove option above in that the frequency counts are not
           changed. 

Windowing Option

           The Ngrams shown above are all formed from contiguous
           tokens. Although this is the default, we also allow Ngrams
           to be formed from non-contiguous tokens.

           We first need to define a "window" to be a sequence of k
           contiguous tokens, where the value of k is greater than or
           equal to the value of n for the Ngrams. The n tokens occur 
           in the Ngram in exactly the same order as they occur in the 
           window. 

           For example if we take the phrase

                 why      s      the      stock      falling      ?

           Then, the following are all the bigrams with a window size of 3:

                 why<>s<>               why<>the<>         s<>the<>    
                 s<>stock<>             the<>stock<>       the<>falling<>
                 stock<>falling<>       stock<>?<>         falling<>?<>

           The following are all the bigrams with a window size of 4:

                 why<>s<>               why<>the<>         why<>stock<>
                 s<>the<>               s<>stock<>         s<>falling<>
                 the<>stock<>           the<>falling<>     the<>?<>
                 stock<>falling<>       stock<>?<>         falling<>?<>

           The following are all the trigrams with a window size of 4:

                 why<>s<>the<>          why<>s<>stock<>     why<>the<>stock<>
                 s<>the<>stock<>        s<>the<>falling<>   s<>stock<>falling<>
                 the<>stock<>falling<>  the<>stock<>?<>     the<>falling<>?<>
                 stock<>falling<>?<>

COPYRIGHT AND LICENCE

Copyright (C) 2004-2007, Bridget Thomson McInnes

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Note: a copy of the GNU Free Documentation License is available on the web at L<http://www.gnu.org/copyleft/fdl.html> and is included in this distribution as FDL.txt.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.