Start

This walk-through illustrates how to work with PMSE and how to get started. It introduces the basic functions that represent the core functionality of PMSE.

Data Library

Generic software suite and middleware for SNLP. UNIX philosophy: a building kit of small units that can be combined into new toolchains. Language-agnostic. Written in Perl, with an automated test suite of high code coverage; UTF-8 aware. CLI-based. Efficient, parallel processing. Thorough documentation. Interactive mode available.

First, we have to define the working environment. PMSE is designed to process documents independently of the language they are written in, so we adopted a strategy for organizing documents in various languages. The library has a single root, located at:

/data/library/

Once the root of the library is set, directories for specific languages may be added. The directory name is derived from the ISO 639-3 language code, with one directory level per letter. The directory for English (eng) is thus:

 /data/library/e/n/g/
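
The mapping from a language code to its library directory can be sketched in a few lines of Python. This is only an illustration of the layout described above, not part of PMSE; the function name language_dir is hypothetical.

```python
from pathlib import PurePosixPath


def language_dir(root: str, iso_code: str) -> PurePosixPath:
    """Return the library directory for a three-letter ISO 639-3 code."""
    code = iso_code.strip().lower()
    if len(code) != 3 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"expected a three-letter ISO 639-3 code, got {iso_code!r}")
    # One directory level per letter: "eng" becomes e/n/g.
    return PurePosixPath(root).joinpath(*code)


print(language_dir("/data/library", "eng"))  # /data/library/e/n/g
```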

If you want to build your library from scratch, you will need a directory to hold the source files. We call this directory original. Once you have your files, you need to process them (convert them into plain text, tokenise them, extract n-grams, and so on) to get the desired information. Processed files are stored in the derived directory.

Get the file

[Figure: Get File diagram]

The P_daf script provides a framework for automated downloads. P_daf reads an INI file that specifies the URL of the target. Here is an example of a very simple INI file called demo.ini:

     [global]
     lastfetch = 2013-01-30 00:00:00
     interval  = 6 months
     name      = demo

     [Hyperion]
     threads = 1;
     BASE  = http://www.gutenberg.org
     url   = %BASE%/ebooks/5436
     match = a\shref="(?http://www.gutenberg.org/ebooks/(?\d+).(?kindle).noimages)"\stype
     get   = $file
     store = "$ENV{PMCORP_ROOT}/e/n/g/original/Hyperion.mobi"
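
To make the role of the %BASE% placeholder concrete, here is a minimal Python sketch that reads a cut-down copy of the INI file and expands %NAME% placeholders from keys in the same section. It assumes that this is how P_daf treats such placeholders; the helper expand_placeholders is hypothetical, and the PMSE manual remains the authority on the actual behaviour.

```python
import configparser
import re

# Cut-down copy of demo.ini, limited to the keys needed for the illustration.
DEMO_INI = """\
[global]
name = demo

[Hyperion]
BASE = http://www.gutenberg.org
url  = %BASE%/ebooks/5436
"""


def expand_placeholders(section: dict) -> dict:
    """Replace %KEY% placeholders with values defined in the same section."""
    lookup = {key.upper(): value for key, value in section.items()}

    def substitute(match):
        # Unknown placeholders are left untouched rather than guessed.
        return lookup.get(match.group(1).upper(), match.group(0))

    return {key: re.sub(r"%(\w+)%", substitute, value) for key, value in section.items()}


# interpolation=None stops configparser from giving "%" its own meaning.
parser = configparser.ConfigParser(interpolation=None)
parser.optionxform = str  # keep the case of keys such as "BASE"
parser.read_string(DEMO_INI)

hyperion = expand_placeholders(dict(parser["Hyperion"]))
print(hyperion["url"])  # http://www.gutenberg.org/ebooks/5436
```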

The meaning of the sections and arguments in the file is described in the PMSE manual. The demo.ini file has to be stored in the $PMSE_BIN/cfg/daf.d/ directory. Once the INI file is in place, call the script:

 P_daf --fetch demo

After the download, the file Hyperion.mobi should be in the /data/library/e/n/g/original/ directory (assuming PMCORP_ROOT points to /data/library).

Convert the file

First, go to the library and then call the P_dmf script. To convert the MOBI format to txt, you should install the calibre e-book manager, which P_dmf uses as one of its integrated tools. Set the PM_CONVERTOR_WARNINGS=1 environment variable to display the converters that are missing on your system.

        cd /data/library/e/n/g/
        PM_CONVERTOR_WARNINGS=1 P_dmf --in /data/library/e/n/g/original/

If everything went well, you should see the /data/library/e/n/g/derived/ directory. It should contain a text file named Hyperion.txt. Let's display its structure:

 tree derived/
 derived/
 ├── Hyperion.mobi
 │   ├── lvl.last
 │   │   └── Hyperion.mobi
 │   │       └── Hyperion.txt -> /data/library/e/n/g/derived/Hyperion.mobi/./lvl.1/Hyperion.mobi/Hyperion.txt
 │   └── lvl.1
 │       └── Hyperion.mobi
 │           └── Hyperion.txt

Clean the file

The txt file contains a header and footer with Gutenberg info. We will use P_rer in order to 'clean' the file.

        P_rer 's{.+?(Title:\sHyperion)}{$1}xms' Hyperion.txt
        P_rer 's{\*\*\*\sEND\sOF\sTHE\sPROJECT.+}{}xms' Hyperion.txt

Removing these sections ensures that the extracted linguistic data describe only the text itself. (The repeated boilerplate tokens would otherwise affect the frequency distribution.)
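
For readers who want to see what the two substitutions do without running P_rer, the following Python sketch applies equivalent regular expressions to a small made-up sample. The sample text and the function name strip_gutenberg_boilerplate are illustrative assumptions; re.DOTALL plays the role of the Perl /s modifier, and count=1 mirrors the absence of /g.

```python
import re


def strip_gutenberg_boilerplate(text: str) -> str:
    """Drop everything before 'Title: Hyperion' and after the END marker."""
    # Header: lazily match up to the first "Title: Hyperion" and keep only the title.
    text = re.sub(r".+?(Title:\sHyperion)", r"\1", text, count=1, flags=re.DOTALL)
    # Footer: delete from the END marker to the end of the text.
    text = re.sub(r"\*\*\*\sEND\sOF\sTHE\sPROJECT.+", "", text, count=1, flags=re.DOTALL)
    return text


# Invented sample that only mimics the shape of a Project Gutenberg file.
sample = (
    "Some licence header\n"
    "More header\n"
    "Title: Hyperion\n"
    "\n"
    "Body text of the book.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK ***\n"
    "Licence footer\n"
)

cleaned = strip_gutenberg_boilerplate(sample)
print(cleaned)
```

The cleaned sample starts at the title line and ends with the last line of body text.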

Tokenization

PMSE includes a simple predefined tokenizer and sentence segmenter for English. Both use P_rer and take the form of a macro. A macro here is a shell wrapper: a script that is called with specific arguments.

         MAK_tokenize Hyperion.txt eng
         MAK_1s1l -l eng -i Hyperion.txt
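
The rules used by MAK_tokenize and MAK_1s1l are not shown here, so the following Python sketch only illustrates the general idea of regex-based tokenization and one-sentence-per-line segmentation. Both functions and their patterns are assumptions made for illustration, not a port of the PMSE macros.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split text into word tokens (with inner apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


def one_sentence_per_line(text: str) -> str:
    """Put each sentence on its own line, splitting after '.', '!' or '?'."""
    flat = " ".join(text.split())  # collapse line breaks and runs of spaces
    # Split only where the next sentence starts with a capital letter or a quote.
    sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z\"'])", flat)
    return "\n".join(sentences)


print(tokenize("It's late, isn't it?"))
print(one_sentence_per_line("The night is calm.  The moon\nshines. Is it late?"))
```

A rule this simple will split wrongly after abbreviations such as "Mr.", which is one reason a predefined, language-specific macro is useful.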

N-grams Extraction

[Figure: n-grams diagram]

The core functionality of PMSE is generating n-grams and computing various statistics over them. The following command takes all txt files in the derived directory as input, generates bigrams, and computes their MI-score.

 P_gnp --in derived/ --cluster count --ifilter '+token=\A[\w\d]+\z' --out bigrams --measure 'mi=all' --report 3

Note: We used the default n-gram specification. Written out explicitly, the parameter looks like this: --ngrams 2 2 ' ' (n-grams of size 2, taken from a window of size 2, with a whitespace separator between tokens).
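
To show what a bigram MI-score measures, here is a minimal Python sketch that counts adjacent token pairs (size 2, window 2) and scores them with pointwise mutual information, log2(p(x,y) / (p(x) p(y))). This is an assumption about the measure: the exact formula and normalization used by P_gnp are not given here, and the token filter from --ifilter is not modelled. The function name bigram_mi is hypothetical.

```python
import math
from collections import Counter


def bigram_mi(tokens: list[str]) -> dict[tuple[str, str], float]:
    """Score adjacent token pairs with pointwise mutual information (log base 2)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent pairs only
    n_tokens = len(tokens)
    n_bigrams = sum(bigrams.values())
    scores = {}
    for (left, right), count in bigrams.items():
        p_pair = count / n_bigrams
        p_left = unigrams[left] / n_tokens
        p_right = unigrams[right] / n_tokens
        scores[(left, right)] = math.log2(p_pair / (p_left * p_right))
    return scores


scores = bigram_mi("a b a b c d".split())
for pair, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(pair, round(score, 3))
```

Pairs whose members rarely occur apart (here "c d") score higher than pairs built from frequent tokens, which is why MI is often combined with a frequency threshold.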

Visualization & Filtering

The bigrams are stored as an internal PMSE object. The P_dvf script can convert this structure into various formats. It can also filter and sort the results.

 P_dvf --in mi_1\|2 --filter '($value < 9) | ($key =~ m{\b(that|this|was|and|we|she|he|I|a|is|are|the|be)\b}xmsi)' --sort '+val'
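
The effect of such a filter can be sketched in Python. Two assumptions are made here that the text above does not confirm: that the --filter expression marks entries to discard, and that '+val' sorts by value in ascending order. The scores below are invented, the keys are assumed to be bigrams joined by a single space, and filter_and_sort is a hypothetical name.

```python
import re

# Same word list as in the P_dvf call above, matched case-insensitively.
STOPWORD = re.compile(r"\b(that|this|was|and|we|she|he|I|a|is|are|the|be)\b", re.IGNORECASE)


def filter_and_sort(scores: dict[str, float], min_value: float = 9.0) -> list[tuple[str, float]]:
    """Drop low-scoring or stop-word bigrams, then sort the rest by value."""
    kept = [
        (key, value)
        for key, value in scores.items()
        # Assumption: an entry is discarded when the filter expression is true.
        if not (value < min_value or STOPWORD.search(key))
    ]
    return sorted(kept, key=lambda item: item[1])  # assumption: '+val' means ascending


# Invented example values, for illustration only.
example = {"of the": 12.0, "sea shore": 10.5, "old man": 3.2, "dark night": 9.7}
result = filter_and_sort(example)
print(result)
```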