This walk-through illustrates how to work with PMSE and how to get started. It introduces the basic functions that make up the core functionality of PMSE.
PMSE is a generic software suite and middleware for SNLP. It follows the UNIX philosophy: a building kit of small units that can be combined into new toolchains. It is language-agnostic, written in Perl, UTF-8 aware, and covered by an automated test suite with high code coverage. It is CLI-based, supports efficient and parallel processing, comes with thorough documentation, and offers an interactive mode.
We have to define the working environment first. PMSE is designed to process documents independently of the language they are written in. We therefore adopted a strategy for handling documents in various languages. The library has a root directory, located at:
/data/library/
Once the root of the library is set, directories for specific languages may be added. The directory name is derived from the ISO 639-3 language code, with one directory level per letter. The directory for English (eng) is thus:
/data/library/e/n/g/
If you want to build your library from scratch, you will need a directory that holds the source files. We call this directory original. Once you have your files, you need to process them (convert them to plain text, tokenise them, extract n-grams and so on) to obtain the desired information. Processed files are stored in the derived directory.
The P_daf script provides a framework for automated downloads. P_daf reads an INI file that specifies the URL of the target. Here is an example of a very simple INI file called demo.ini:
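The layout described above can be set up with a few lines of shell. This is a minimal sketch and not part of PMSE: the function names lang_dir and init_lang are hypothetical, and the library root is passed in as an argument rather than hard-coded.

```shell
#!/bin/sh
# Hypothetical helpers (not PMSE commands): derive the per-language
# directory from a three-letter ISO 639-3 code and create the
# original/ and derived/ subdirectories beneath it.

# lang_dir ROOT CODE  ->  prints ROOT/c/o/d (one level per letter)
lang_dir() {
    root=$1
    code=$2
    # Split the three-letter code into one directory per letter.
    first=$(printf '%s' "$code" | cut -c1)
    second=$(printf '%s' "$code" | cut -c2)
    third=$(printf '%s' "$code" | cut -c3)
    # ${root%/} drops a single trailing slash from the root, if present.
    printf '%s/%s/%s/%s\n' "${root%/}" "$first" "$second" "$third"
}

# init_lang ROOT CODE  ->  creates both subdirectories, prints the language path
init_lang() {
    dir=$(lang_dir "$1" "$2")
    mkdir -p "$dir/original" "$dir/derived"
    printf '%s\n' "$dir"
}
```

Calling init_lang /data/library eng would create /data/library/e/n/g/original and /data/library/e/n/g/derived and print /data/library/e/n/g.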
[global]
lastfetch = 2013-01-30 00:00:00
interval = 6 months
name = demo

[Hyperion]
threads = 1;
BASE = http://www.gutenberg.org
url = %BASE%/ebooks/5436
match = a\shref="(?http://www.gutenberg.org/ebooks/(?\d+).(?kindle).noimages)"\stype
get = $file
store = "$ENV{PMCORP_ROOT}/e/n/g/original/Hyperion.mobi"
The meaning of sections and arguments in the file is described in the PMSE manual. The demo.ini file has to be stored in the $PMSE_BIN/cfg/daf.d/ directory. If you have placed the INI there, you may now call the script:
P_daf --fetch demo
After the download, the file Hyperion.mobi should be in the /data/library/e/n/g/original/ directory.
First, go to the library and then call the P_dmf script. To convert the MOBI format to plain text, you need to install the calibre e-book manager, which P_dmf uses as one of its integrated tools. You can set the PM_CONVERTOR_WARNINGS=1 flag to list the converters that are missing on your system.
cd /data/library/e/n/g/
PM_CONVERTOR_WARNINGS=1 P_dmf --in /data/library/e/n/g/original/
If everything went well, you should see the /data/library/e/n/g/derived/ directory. It should contain a text file named Hyperion.txt. Let's display its structure:
tree derived/
derived/
├── Hyperion.mobi
│   ├── lvl.last
│   │   └── Hyperion.mobi
│   │       └── Hyperion.txt -> /data/library/e/n/g/derived/Hyperion.mobi/./lvl.1/Hyperion.mobi/Hyperion.txt
│   └── lvl.1
│       └── Hyperion.mobi
│           └── Hyperion.txt
The txt file contains a header and a footer with Project Gutenberg information. We will use P_rer to clean the file.
P_rer 's{.+?(Title:\sHyperion)}{$1}xms' Hyperion.txt
P_rer 's{\*\*\*\sEND\sOF\sTHE\sPROJECT.+}{}xms' Hyperion.txt
Removing these sections ensures that the extracted linguistic data is correct; the repeated boilerplate tokens may otherwise affect the frequency distribution.
PMSE ships with a simple predefined tokenizer and sentence segmenter for English. Both use P_rer and take the form of a macro. A macro here is a shell wrapper: a script that calls the underlying tool with specific arguments.
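To see what the two substitutions accomplish, the same trimming can be approximated with standard sed. This is only an illustration, not how P_rer works internally: the function name strip_gutenberg is hypothetical, and unlike the regular expressions above it works line by line, keeping everything from the first line containing "Title: Hyperion" up to, but excluding, the first line containing "*** END OF THE PROJECT".

```shell
#!/bin/sh
# Hypothetical helper (not a PMSE command): trim the Gutenberg header and
# footer from a text given as file arguments or on standard input.
# Note: if no "Title: Hyperion" line exists, nothing is printed at all,
# whereas a non-matching substitution would leave the text unchanged.
strip_gutenberg() {
    # 1st sed: print from the first "Title: Hyperion" line to the end.
    # 2nd sed: delete from the "*** END OF THE PROJECT" line to the end.
    sed -n '/Title: Hyperion/,$p' "$@" |
        sed '/\*\*\* END OF THE PROJECT/,$d'
}
```

The result is written to standard output, so the original file stays untouched, for example: strip_gutenberg Hyperion.txt > Hyperion.clean.txt (the output file name is just an example).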
MAK_tokenize Hyperion.txt eng
MAK_1s1l -l eng -i Hyperion.txt
The core functionality of PMSE is generating n-grams and computing various statistics. The following command takes all txt files in the derived directory as input, generates bigrams, and computes their MI score (mutual information).
P_gnp --in derived/ --cluster count --ifilter '+token=\A[\w\d]+\z' --out bigrams --measure 'mi=all' --report 3
Note: We used the default n-gram specification. Written out explicitly, the parameter looks like this: --ngrams 2 2 ' ' (n-grams of size 2, taken from a window of size 2, with a whitespace separator between tokens).
The bigrams are stored as an internal PMSE object. The P_dvf script can convert this structure to various formats. It can also filter and sort the results.
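For readers who want to see what a bigram MI score measures, here is a small stand-alone sketch in awk. It is not PMSE code and makes several assumptions: the input has one sentence per line with whitespace-separated tokens, bigrams are adjacent token pairs (size 2, window 2) that do not cross line boundaries, and MI is the common pointwise definition log2(c(xy) * N / (c(x) * c(y))), where N is the total token count. The formula PMSE actually uses for mi=all is described in its manual and may differ. The function name bigram_mi is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not a PMSE command): count adjacent bigrams and
# print "bigram<TAB>count<TAB>MI" for each, sorted by bigram.
bigram_mi() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                uni[$i]++            # unigram count
                n++                  # total number of tokens
                if (i < NF) {
                    bi[$i " " $(i + 1)]++   # adjacent pair within the line
                }
            }
        }
        END {
            for (k in bi) {
                split(k, w, " ")
                # Pointwise MI in bits; N is used for both the unigram and
                # the bigram probability estimates (a simplification).
                mi = log((bi[k] * n) / (uni[w[1]] * uni[w[2]])) / log(2)
                printf "%s\t%d\t%.3f\n", k, bi[k], mi
            }
        }
    ' "$@" | sort
}
```

A high score means the two tokens occur together more often than their individual frequencies would suggest, which is why low-scoring pairs are typically filtered out afterwards.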
P_dvf --in mi_1\|2 --filter '($value < 9) | ($key =~ m{\b(that|this|was|and|we|she|he|I|a|is|are|the|be)\b}xmsi)' --sort '+val'
email: sales@petamem.com
phone: +49 911 894 6455
fax: +420 284 680 110
Believe it or not, now comes the good part. PMSE is a commercial product for the academic sector and PetaMem wants to offer you an "all inclusive, no hassle, no sorrow" package. We also want this software suite to be affordable for everyone. There are two simple licensing models: Per user per workstation, which is 49,- EUR per month or department-wide, which is 499,- EUR per month with no limits on number of users or workstations (including students). All pricing is + 19% VAT, which does not apply if you are outside of Germany and have a VAT-Id.
We promise "no hassle, no sorrow" licensing. With the licensing cost you obtain not only the right to use the software, but also free software upgrades and free support. Any license upgrades/downgrades are intuitive and do-what-I-mean. You had five single-licenses and would like a department license or vice-versa? No problem at all! Simply inform us via email and we adjust the licensing conditions the same working day.
It gets better: our company is committed to open source, and when relying on commercial software there are few things we want to avoid more than vendor lock-in. You might have similar concerns when relying on a commercial product, and we fully understand that. Should you decide to end your PMSE licensing subscription, you keep PMSE and are allowed to keep using it. You lose the free upgrades and free support, but you keep using the software. For free.
Should you have any further questions regarding licensing or would you like to order the product, please do not hesitate to contact sales@petamem.com.
Should you have further detailed technical questions, please contact support@petamem.com.