PetaMem Scripting Environment (PMSE)

Software suite for advanced corpus processing


PMSE Text Diagram


Generic software suite and middleware for SNLP. It follows the UNIX philosophy: a building kit of small units that can be combined into new toolchains. Language-agnostic. Written in Perl, with an automated test suite with high code coverage. UTF-8 aware. CLI-based, with efficient parallel processing. Thorough documentation. Interactive mode available.

PMSE Histogram

Text Categorization

Case Study

A categorization of 250 parallel texts in 20 European languages, sourced from the European Medicines Agency (EMA), was performed. The resulting graphs (trees) show similarities in their structure.


PMSE Binary Tree Estonia
PMSE Binary Tree Lithuania


The general task of the Text Categorization app is to categorize documents in any language. Great care was taken over the following features: high modularity, high performance, and support for parallel processing. The modularity of the source code allows the user to change the behaviour of every procedural step. The whole application is extensible by simple plugins. Areas of possible application: language identification, corpus sorting, forensic linguistics and others.


The categorization process consists of several steps:

1. Extraction of text from all given documents.
2. Filtering of unwanted documents according to given criteria (completely modular).
3. Computation of a vector for each document (completely modular).
4. Computation of the distances between the vectors (completely modular).
5. Hierarchical agglomerative clustering. The final method depends on the deployed module.
6. Visualization of a binary tree representing the relations among the texts (dendrogram).
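The steps of the categorization process can be sketched in a few lines. This is only an illustration in Python, not PMSE code: every function name is hypothetical, the filter criterion and the two-element vector are toy choices, and single linkage is an assumption standing in for whatever method the deployed module provides.

```python
import math
from itertools import combinations

def extract(documents):
    # Step 1: in this sketch the documents are already plain strings keyed by name.
    return dict(documents)

def keep(text, min_words=3):
    # Step 2: hypothetical filter criterion, drop very short texts.
    return len(text.split()) >= min_words

def vectorize(text):
    # Step 3: toy vector (average word length, type-token ratio).
    words = text.lower().split()
    return (sum(len(w) for w in words) / len(words), len(set(words)) / len(words))

def distance(u, v):
    # Step 4: Euclidean distance between two vectors.
    return math.dist(u, v)

def cluster(vectors):
    # Step 5: agglomerative clustering with single linkage (an assumption).
    # A cluster is (tree, member_names); a tree is a name or a (left, right) pair.
    clusters = [(name, [name]) for name in sorted(vectors)]

    def linkage(a, b):
        return min(distance(vectors[x], vectors[y]) for x in a[1] for y in b[1])

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]

def render(tree, depth=0):
    # Step 6: the binary tree as an indented text dendrogram, one line per node.
    if isinstance(tree, str):
        return ["  " * depth + tree]
    left, right = tree
    return ["  " * depth + "+"] + render(left, depth + 1) + render(right, depth + 1)

def categorize(documents):
    # Assumes at least one document survives the filter.
    texts = {name: text for name, text in extract(documents).items() if keep(text)}
    vectors = {name: vectorize(text) for name, text in texts.items()}
    return cluster(vectors)

SAMPLE = {
    "en_a": "the cat sat on the mat",
    "en_b": "the dog sat on the log",
    "long": "internationalization considerations complicate multilingual categorization",
    "tiny": "too short",
}

print("\n".join(render(categorize(SAMPLE))))
```

Because each stage is a separate function, any of them can be replaced without touching the others, which mirrors the "completely modular" remarks in the step list.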


Vector: a list of values that characterize differences among texts. Example of a 4-element vector:

1. frequencies of word occurrences
2. average word count per sentence
3. average word length
4. type-token ratio
Such a vector needs four simple plugins, each computing one element of the vector. Each element can have a different weight in the computation of the distance. Clustering starts once the distances have been measured.
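As a rough illustration of the four plugins and the weighted distance, consider the following Python sketch. It is not the PMSE plugin interface: the function names, the weights, the tokenization by regular expression, and the choice of a weighted sum of per-element distances are all assumptions made for the example.

```python
import math
import re
from collections import Counter

def words_of(text):
    # Naive tokenization; assumes the text contains at least one word.
    return re.findall(r"\w+", text.lower())

def sentences_of(text):
    # Naive sentence split on ., ! and ?.
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]

# Each hypothetical plugin maps a text to one element of the vector.
def word_frequencies(text):
    words = words_of(text)
    return {w: n / len(words) for w, n in Counter(words).items()}

def avg_words_per_sentence(text):
    return len(words_of(text)) / len(sentences_of(text))

def avg_word_length(text):
    words = words_of(text)
    return sum(len(w) for w in words) / len(words)

def type_token_ratio(text):
    words = words_of(text)
    return len(set(words)) / len(words)

# (plugin, weight) pairs; the weights are arbitrary example values.
PLUGINS = [
    (word_frequencies, 1.0),
    (avg_words_per_sentence, 0.5),
    (avg_word_length, 0.5),
    (type_token_ratio, 2.0),
]

def element_distance(a, b):
    # Frequency tables are compared with a Euclidean distance, scalars by absolute difference.
    if isinstance(a, dict):
        keys = set(a) | set(b)
        return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))
    return abs(a - b)

def weighted_distance(text_a, text_b, plugins=PLUGINS):
    # Weighted sum of the per-element distances (one possible definition).
    return sum(w * element_distance(p(text_a), p(text_b)) for p, w in plugins)
```

Changing a weight to 0.0 effectively switches the corresponding plugin off, and adding a fifth element only requires appending another pair to `PLUGINS`.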


Licensing Cost

Now comes the good part. PMSE is a commercial product for the academic sector, and PetaMem wants to offer you an "all inclusive, no hassle, no sorrow" package. We also want this software suite to be affordable for everyone. There are two simple licensing models: per user per workstation, at 49,- EUR per month, or department-wide, at 499,- EUR per month with no limit on the number of users or workstations (including students). All prices are subject to 19% VAT, which does not apply if you are outside of Germany and have a VAT ID.
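For a quick comparison of the two models, the arithmetic can be written down as a small Python sketch. The function and constant names are hypothetical, and the figures are simply the monthly prices quoted above, handled in cents to avoid rounding surprises.

```python
SINGLE_NET_CENTS = 4900       # 49,- EUR per user per workstation, per month
DEPARTMENT_NET_CENTS = 49900  # 499,- EUR department-wide, per month
VAT_PERCENT = 19

def gross(net_cents, vat_applies=True):
    # Monthly price in cents, with 19% VAT added where it applies.
    if not vat_applies:
        return net_cents
    return net_cents * (100 + VAT_PERCENT) // 100

def cheapest_model(single_licenses):
    # Compare the net monthly cost of N single licenses with one department license.
    if DEPARTMENT_NET_CENTS < single_licenses * SINGLE_NET_CENTS:
        return "department"
    return "per-user"

for n in (5, 10, 11):
    print(n, cheapest_model(n), gross(min(n * SINGLE_NET_CENTS, DEPARTMENT_NET_CENTS)) / 100)
```

On these figures, ten single licenses (490 EUR net) are still slightly cheaper than the department license, while eleven (539 EUR net) are not.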

Licensing Conditions

We promise "no hassle, no sorrow" licensing. With the licensing cost you obtain not only the right to use the software, but also free software upgrades and free support. Any license upgrades or downgrades are intuitive and do-what-I-mean. You have five single licenses and would like a department license, or vice versa? No problem at all! Simply inform us via email and we will adjust the licensing conditions the same working day.

It gets better: our company is committed to open source, and when relying on commercial software, there are only a few things we want to avoid more than vendor lock-in. You might have similar concerns when relying on a commercial product, and we fully understand that. Should you decide to end your PMSE licensing subscription, you keep PMSE and are allowed to keep using it. You lose the free upgrades and free support, but you keep using the software. For free.

