TexLexAn.SourceForge.net

Another Open Source Text Summarizer for Linux

TexLexAn: Analyze, Classify and Summarize any text.

TexLexAn is an automatic text analyzer, classifier and summarizer. The software sits at the frontier of artificial intelligence and machine learning and contributes, at its very modest level, to the development of the software of the future. I have a lot of fun developing it, and I hope you will enjoy trying it.

Currently, it works with English, French, German, Italian and Spanish texts.

It can be used to:

Estimate the reading time and the reading difficulty.

Categorize a text (automatic classifier).

List keywords.

Summarize by extraction.

Count repetitions and estimate the ratio of basic words.

Look for possible plagiarism.
Evaluate sentiments.
Archive & retrieve documents.
Knowledge base.

It works with:

URL links (using wget).

Text and HTML files.

PDF, ODT, PPT and DOC files (these require pdftotext, odt2txt, ppthtml and antiword to be installed).

It uses:

A Perceptron, for supervised learning, to add or update text classifications.
A linear classifier to classify (categorize) the text.
Fuzzy logic and case-based reasoning to learn the best summarizing method from past summarizations.
A summarizer based on extracting the most relevant sentences.
A basic re-worder that replaces deadwood expressions with their simplified equivalents.
A knowledge base to extract sentences carrying sentiments.

Sources and Links:

Download the sources

Note: uncompress the file with tar -x -f pack1.xx, compile and link with make, then install the programs with make install.

Programming Languages:

C (10000 lines) and Python (2000 lines)

Why C? I chose it because the classifier and the summarizer require a lot of CPU cycles, so it was important to have a program compiled directly to native binary code. I considered the D language for a while; I found it very interesting and rather powerful, but too few developers use it. I considered Vala, but it is still too young and requires the libc. One of the main drawbacks of C is its intensive use of pointers, which makes writing robust code difficult and time-consuming.

Why Python? Because I love it! It is an easy language to code in, the sources are very readable, many libraries provide very powerful functions covering many fields, and there is no memory to allocate and free. The productivity gain is huge compared to C/C++, but it is an interpreted language and really too slow for intensive computation.

How does it work?

TexLexAn is composed of 5 programs. The first is the graphical user interface, written in Python. Named texlexan.py, it lets you drag and drop a file or an internet link, and it runs the analyzer-classifier-summarizer engine and the learner engine. The second is the analyzer-classifier-summarizer engine, which works from the command line. The third is the learner, which also works from the command line. Two secondary programs are useful to search the base of summaries and to find the original documents.

1 - The graphical interface is named texlexan.py. It provides a small number of options, limited to the most useful. It glues together several CLI applications (wget, antiword, pdftotext, ppthtml, odt2txt, texlexan and lazylearner). It lets you enter feedback and create a new class. Its last function is text archiving.

2 - The analyzer-classifier-summarizer engine is also named texlexan. It does the main job: it detects the charset (UTF-8 and ISO-8859-1 are recognized), the file format (text or HTML) and the language, and then converts the text to plain ASCII (7 bits). The converted text is tokenized and the keywords are collected.

The classifier works as follows: the keywords are looked up in a dictionary of classes (or categories). When a keyword is found, the class is recorded and a score is attributed to it. The same keyword often belongs to several classes, but with different probabilities (a probability is expressed as a score). Finally, a grade is computed for each class, and the class with the highest grade is taken to be the most probable class for the text.
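The scoring idea can be sketched in a few lines of Python. This is only an illustration: the dictionary layout, the class names and the weights below are invented, and the real engine (written in C) may combine the scores differently.

```python
from collections import defaultdict

# Hypothetical mini-dictionary: keyword -> {class label: weight 0..9}.
CLASS_DICTIONARY = {
    "energy": {"speech-energy": 9, "health-therapy": 1},
    "gasoline": {"speech-energy": 8},
    "therapy": {"health-therapy": 9},
}

def classify(keywords):
    """Sum the weights of every keyword found, class by class,
    and return the class with the highest grade plus all grades."""
    grades = defaultdict(int)
    for word in keywords:
        for label, weight in CLASS_DICTIONARY.get(word, {}).items():
            grades[label] += weight
    if not grades:
        return None, {}          # no keyword matched any class
    best = max(grades, key=grades.get)
    return best, dict(grades)

best, grades = classify(["energy", "gasoline", "price"])
print(best, grades)
```

A plain sum of weights is the simplest possible linear classifier; it is used here only because it is easy to read.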

The summarizer extracts the most relevant sentences from the text. It simplifies them a little by removing text between brackets and repeated sentences, and it replaces deadwood expressions with their shortest forms (mode VIII). TexLexAn uses 3 kinds of summarizers. The first is based exclusively on keywords extracted from the text (mode I). The second is based on keywords extracted from the dictionary of classes (mode II): the text classification provides a list of words belonging to the class, these words are searched for in each sentence of the text, and the matches establish a score for each sentence. The most relevant sentences have the highest scores and are included in the summary. The last summarizing method is simply based on a list of cue expressions: sentences containing these cue expressions are extracted and included in the summary (mode IV). Mode I and mode IV, or mode II and mode IV, can be combined to give a shorter summary. The selection between the different modes is done automatically by a routine called 'smartdecision'. This routine uses case-based reasoning: it tries to find the best match between several characteristics of the text and those of previous texts, then uses the results of the previous summarizations to decide on the best mode and compression rate to apply.
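Here is a minimal Python sketch of summarizing by extraction, assuming a sentence score equal to the number of keyword hits. The function name, the sentence splitter and the scoring rule are assumptions made for the example, not the engine's actual code.

```python
import re

def summarize(text, keywords, max_sentences=2):
    """Keep the highest-scoring sentences, in their original order."""
    # Naive sentence splitter: break after '.', '!' or '?'.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    scored = []
    for index, sentence in enumerate(sentences):
        words = re.findall(r"[a-z]+", sentence.lower())
        score = sum(1 for w in words if w in keywords)
        scored.append((score, index, sentence))
    # Best scores first; ties are broken by position in the text.
    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_sentences]
    return [sentence for _, _, sentence in sorted(top, key=lambda item: item[1])]

text = ("Solar energy is cheap. The weather was nice. "
        "Wind energy and solar power grow fast.")
print(summarize(text, {"solar", "energy", "wind"}))
```

Passing keywords taken from the text itself corresponds to the idea of mode I; passing the words of the winning class corresponds to mode II.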

The sentiment analyzer works almost like the classifier, but it uses a static knowledge base and some basic syntax rules to evaluate the opinion expressed in each sentence. Opinions are identified as bipolar sentiments such as 'good' or 'bad'. Target words or expressions can be used to limit the analysis to sentences containing these targets.

3 - The learner engine is named lazylearner because it is simply based on a Perceptron algorithm. It updates the class dictionary with new classes and new keywords, or just updates the scores of existing keywords. The GUI texlexan.py runs the learner immediately after the analyzer-classifier-summarizer has finished, so the dictionaries are updated immediately.
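To give an idea of what a Perceptron update looks like, here is a textbook multiclass Perceptron step in Python. It is a generic sketch, not lazylearner itself: the data layout (nested dictionaries), the function name and the learning rate are assumptions.

```python
def perceptron_update(weights, features, true_class, learning_rate=1.0):
    """One multiclass perceptron step.

    weights:  {class label: {keyword: weight}}
    features: {keyword: count in the text}
    Returns the class predicted before the update.
    """
    def score(label):
        w = weights.get(label, {})
        return sum(w.get(f, 0.0) * v for f, v in features.items())

    weights.setdefault(true_class, {})      # a new class is created on demand
    predicted = max(weights, key=score)
    if predicted != true_class:
        # Reinforce the right class and weaken the wrongly predicted one.
        for f, v in features.items():
            right, wrong = weights[true_class], weights[predicted]
            right[f] = right.get(f, 0.0) + learning_rate * v
            wrong[f] = wrong.get(f, 0.0) - learning_rate * v
    return predicted

weights = {"energy": {"gasoline": 1.0}, "health": {"therapy": 1.0}}
features = {"therapy": 1, "gasoline": 2}
print(perceptron_update(weights, features, "health"))   # prediction before learning
print(perceptron_update(weights, features, "health"))   # prediction after learning
```

The weights only change when the prediction is wrong, which is why the user's confirmation or correction of the class matters.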

4 - The secondary programs, named sis and tlsearch, are used to find terms inside the base of archived summaries and to retrieve the original documents. texlexan.py can compress the analyzed documents (with the help of bzip2) and store them in the folder '~/texlexan_archive'; the summaries, the classification results and links to the original documents are stored in the file classification.lst. tlsearch.py searches for sentences or words inside classification.lst (with the help of sis) and displays the result (links + summaries) in your web browser.

sis is the search engine (CLI) and tlsearch is the graphical front end.

Note: There is a second package (pack2.x) that includes several programs to update the dictionaries in batch mode. This alternative gives more robust dictionaries, but only the English language is implemented.

Installation:

Important: To avoid any risk of data loss, it is recommended to install and run these programs from a live CD.

TexLexAn programs are tested on Ubuntu 8.04/9.04/10.04 and FreeBSD 6.2. The binaries included in the package work on Ubuntu 10.04. For FreeBSD, you will have to compile the sources included in the package (command: make) before running the script install.sh.

The programs are POSIX-compatible and should theoretically run on OpenSolaris (not tested).

1 - Untar the archive.

2 - Go to the folder texlexan.

3 - Run the script install.sh inside the decompressed folder (./install.sh or make install).

=> Note: If the binaries are not included in the package or are not compatible with your kernel, type 'make' to compile the sources and then 'make install' to run the install script. The make command must be executed inside the folder pack1.50.

This script creates the directories ~/texlexan_prog and ~/texlexan_dico in your home directory. It copies the program files texlexan.py and texlexan to ~/texlexan_prog, copies the dictionaries keyworder.en.dicoA, keyworder.en.dicoS, keyworder.en.dico1...keyworder.en.dico3, basic.en.word and excluded.en.word to the folder ~/texlexan_dico, copies the documentation to the folder ~/texlexan_doc and finally starts texlexan.py (the GUI).

During the first run, texlexan.py creates 3 directories: ~/texlexan_cfg, ~/texlexan_result and ~/texlexan_archive.

It writes the configuration file texlexan.cfg to ~/texlexan_cfg and checks that ~/texlexan_prog/texlexan and ~/texlexan_dico exist; finally, it builds the graphical user interface.

Note: texlexan.cfg can be edited with any basic text editor. It contains the default directories and some other parameters.

~/texlexan_result is a directory that is used to store the summary files. The summaries are plain text files (abstract1.txt, abstract2.txt, abstract3.txt) and can be opened with any editor.

~/texlexan_archive is a directory that is used to store compressed documents and a list of summaries, classification results and links to the documents analyzed. This list, 'classification.lst', is used by the search engine sis.

In case of problems:

Check the permissions: install.sh, texlexan.py, tlsearch.py, texlexan and sis must be executable.
Python < 3.0 and the pygtk library are required to run the GUI (sudo apt-get install python-gtk2).
odt2txt, antiword, pdftotext and ppthtml are required to convert some documents (TexLexAn can work without them if you just want to analyze plain text and HTML documents).
The binaries in the package are tested on Ubuntu 9.04 and 9.10. The sources are included in the package too. To compile them, go to the folder (cd texlexan) and type make; type make clean to remove the object files and make install to copy the programs and the dictionaries.

If pygtk, antiword, pdftotext, ppthtml, odt2txt... are not installed, on Debian and derived distributions type:

sudo apt-get install python-gtk2

sudo apt-get install antiword

sudo apt-get install poppler-utils

sudo apt-get install ppthtml

sudo apt-get install odt2txt

You will need wget to download html, pdf, ppt, ..., doc files from the web. To install it:

sudo apt-get install wget

To use it:

The programs are installed in the folder ~/texlexan_prog .

1 - To start texlexan, type in the console: ./texlexan_prog/texlexan.py

Drag and drop the HTTP link or the file to analyze, then validate. After a while, depending on the size of the text, the result window appears. You can confirm the text classification or create a new class. One click on the 'learn' button updates the dictionaries.

If you want to use tlsearch/sis, you have to check the option box 'Archive files'.

2 - To start tlsearch, type in the console: ./texlexan_prog/tlsearch.py

Enter the sentence or the list of words to search for.

Check the box 'separate words' to search for each word individually.

Check 'Relevant sentence' to display only the sentences where the searched terms are found.

3 – You can also launch texlexan or tlsearch from the desktop. Copy the file 'texlexan.desktop' to your Desktop folder (install.sh should do it for you); you then just have to click on the icon to run a very small Python program (tlstart.py), which gives you the choice between texlexan and tlsearch.

Mechanism:

The text analysis and classification goes through 11 steps:

1 – The text format is detected and converted.

2 – The character encoding is detected and the text is converted to ASCII.

3 – The language is detected and the dictionaries are selected.

4 – The text is tokenized.

5 – Syllables and basic words are counted.

6 – Sentiment cue words are searched for, and the corresponding sentences are extracted.

7 – Plagiarized sentences are searched for.

8 – The least significant words are removed.

9 – N-grams are extracted from the text and their occurrences are counted.

10 – The n-grams are searched for in the dictionaries and the classes found are weighted.

11 – The weighted class results are analyzed and the most probable class is extracted.
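Steps 8 and 9 can be illustrated with a short Python sketch. The excluded-word list is a tiny stand-in for excluded.lan.word, and joining the terms with '-' follows the dictionary notation; whether the real engine builds n-grams across removed words is an assumption of this example.

```python
from collections import Counter

# Tiny stand-in for the excluded.lan.word list.
EXCLUDED = {"the", "of", "a", "is"}

def count_ngrams(tokens, n):
    """Drop excluded words, then count every run of n consecutive terms."""
    kept = [t for t in tokens if t not in EXCLUDED]
    grams = ("-".join(kept[i:i + n]) for i in range(len(kept) - n + 1))
    return Counter(grams)

tokens = "the gasoline price is high the gasoline price rises".split()
print(count_ngrams(tokens, 2).most_common(1))
```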

The text summarization goes through 3 steps:

1 – The best summarizing method is selected from a knowledge base.

2 – The most relevant sentences are searched and extracted.

3 – “Deadwood” sentences are simplified or suppressed.

The learning process requires 3 steps:

1 – N-grams are searched inside the dictionaries.

2 – N-grams weights are computed.

3 – Weights are standardized.

Simplified flowchart:

* CONVERSION & TOKENIZING *

[Text input]
|
v
Characters coding 
detection & conversion
|
v
[ Iso converted text ]
|
v
Detect main language
|
v
Extract words (tokenize)
mark sentence
Count syllables
|
v
[ tokenized text ]
+
[ list of  words ]

* REPETITION *

[ tokenized text ]
|
v
Compute repetition
of the same word
Count basic words
|
v
[ Results ]

* SENTIMENTS *

[ tokenized text ]
|
V
Extract sentences 
carrying sentiments
|
v
[ Results ]

* PLAGIARISM *

[ tokenized text ]
|
V
Search similar sentences 
(plagiarism)
|
V
[ Results ]

* CLASSIFIER *

[ list of  words ]
&
[ tokenized text ]
|
V
[ excluded.lan.word ]-->Reject some basic words                
|
V
Extract & count
similar 2...n grams
|
V
Sort result
|
V
[ keyworder.build ]------>Append---------->[ keyworder.build ]
|
V
[ keyworder.lan.dicoN ]-->Search words (strict or Levenshtein)
Weigh Class
|
V
Sort result
|
V
[ Results for 2..ngrams ]



[ list of  words ]
&
[ tokenized text ]
|
V
Extract & count
similar words--->[ list of keywords ]
|
V
Sort result
|
V
[ keywords.build ]------->Append------->[ keywords.build ]
|
V
[keyworder.lan.dicN]-->Search keywords (strict or Levenshtein)
Weigh Class
|
V
Sort result
|
V
[ Results for monogram ]

[ Results for 1-grams...n-grams ]
|
V
Compute the score of each class found
|
V
Search the highest score
|
V
[ Most probable class ]
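The non-strict search mentioned in the flowchart relies on the Levenshtein (edit) distance. For reference, here is the classic dynamic-programming version in Python; how the C engine implements it and which distance threshold it accepts are not documented here, so take this only as the standard algorithm.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("analysis", "analysi"))   # 1
```

A small distance lets a truncated or slightly misspelled word, such as 'analysi', still match its dictionary entry.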

* SUMMARIZERS *

Mode I (use the keywords extracted from the text)

[ Iso converted text ]
+
[ Tokenized text ]
|
V
  [ keywords extracted ]-->Extract and score relevant sentences            
|
V
Keep the most relevant sentences
|
V
Reformat the text
|
V
[ Summary ]

Mode II (use the result of the classifier)

[ Iso converted text ]
+
[ Most probable class ]
+
[ Tokenized text ]
|
V
[keyworder.lan.dic1]-->Extract and score relevant sentences                 
|
V
Keep the most relevant sentences
|
V
Reformat the text
|
V
[ Summary ]

Mode IV (use a list of cue words)

[ Iso converted text 
or summary ]
|
V
[ List of cue words ]-->Extract and score relevant sentences       
|
V
Keep the most relevant sentences
|
V
Reformat the text
|
V
[ Summary ]

Mode VIII (simplify a little bit the text)

[ Iso converted text
or Summary ]
|
V
[ List of deadwood expr. ]---->Search “deadwood expressions”              
Simplify or suppress sentences
|
V
Reformat the text
|
V
[ Summary ]

Note:

Summarizer modes 1 and 2 can be combined with modes 4 and 8.

The normal summarizing modes are 2 + 8 or 2 + 4 + 8.

Mode 4 is used when the text is large.

When the class probability is too low, mode 1 replaces mode 2.

When the text is too short, only mode 8 is used.

The combination of the different modes is automatic. It uses an algorithm that is able to learn (from the size of the summaries, possibly the users' opinions, and some characteristics of the text) and to take fuzzy decisions.
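The rules in this note can be written as a small decision function. The Python sketch below uses fixed thresholds that are pure assumptions (word counts and minimum probability); the real routine learns from past cases and takes fuzzy decisions instead of applying hard limits.

```python
def choose_modes(word_count, class_probability,
                 short_text=100, large_text=2000, min_probability=0.3):
    """Return the list of summarizer modes to combine.
    All three thresholds are illustrative assumptions."""
    if word_count < short_text:
        return [8]                       # text too short: only simplify
    # Mode 2 needs a trustworthy class; otherwise fall back to mode 1.
    modes = [2 if class_probability >= min_probability else 1]
    if word_count > large_text:
        modes.append(4)                  # large text: also filter on cue words
    modes.append(8)
    return modes

print(choose_modes(5000, 0.9))   # [2, 4, 8]
```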

[to be continued]

Dictionaries:

TexLexAn uses several databases called dictionaries; they are language-specific.

IMPORTANT: dictionaries must contain only plain ASCII characters (7 bits), without accents or special characters.

1 – List of excluded words.

File name: excluded.lan.word ( lan= de, en, es, fr, it, ... )

It is simply a list of very common words (in lowercase). Excluding these words speeds up the classifier.

Example of list:

the;some;a;an;of

i;you;it;he;she;we;they

be;am;is;are;been;being

have;has;having;had

do;does;done;doing

go;goes;gone;going

www;http;org;com

2 – Classification n-grams.

File name: keyworder.lan.dicN ( lan= de, en, es, fr, it, ... ) ( N = number of terms in the n-gram )

Each line of this dictionary contains a list of n-grams belonging to a class. Each n-gram has a weight. The weight is a value between 0 and 9 representing the strength of the link between the n-gram and the class. An n-gram that appears in only one class and very frequently has a high weight. An n-gram that appears in only one class but rarely has a low weight. An n-gram that appears in every class also has a low weight.

One dictionary contains all the classes available for one language. The dictionary is automatically updated by the lazylearner program: the weights of existing n-grams are recomputed, new n-grams are added to existing classes, and new classes can be added to the dictionary.

The structure of the dictionary is deliberately simple so that it can be read and modified with any text editor, because texlexan is experimental; the drawback of this approach is a lack of efficiency during the search and update process.

Example:

#<LINE_LG:80527>

#Updated expressions by the lazy learner

1 en.text-speech-energy:3/5\those-who/9\gasoline-price/9\renewable-energy/9\climate-crisi/9\national-security/

…

25 en.text-technic-computer-unclassified:5/1\18th-dear/1\added-such/1\algorithm-design/1\analysi-algorithm/

26 en.text-health-therapy:8/1\about-how/1\abuse-depression/1\abusing-individual/1\academic-press/1\acb-home/1\access-reinforcer/

#<END>

#<LINE_LG:xxxxxxx> gives the length of the largest line. It is used to allocate enough memory.

The following lines starting with # are comments.

'1 en.text-speech-energy:' is the class label.

:3/ is the class constant, which depends on the training size.

/9\ is the weight of the 2-gram \gasoline-price/

'-' separates two words.

Each line must be terminated with a line feed \n or char(10).

#<END> terminates the dictionary.
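Because the format is plain text, a dictionary line is easy to parse. The Python sketch below follows the layout described above (line number, label, class constant, then weight\n-gram/ pairs); the function name and the returned tuple are choices made for this example only.

```python
import re

def parse_class_line(line):
    """Split one dictionary line into (number, label, constant, {n-gram: weight})."""
    head, _, body = line.rstrip("\n").partition(":")
    number, label = head.split(" ", 1)
    constant, _, rest = body.partition("/")
    # Each entry looks like  weight\n-gram/
    ngrams = {gram: int(weight)
              for weight, gram in re.findall(r"(\d+)\\([^/]+)/", rest)}
    return int(number), label, int(constant) if constant else None, ngrams

line = r"1 en.text-speech-energy:3/5\those-who/9\gasoline-price/9\renewable-energy/"
print(parse_class_line(line))
```

Lines starting with # (comments, #<LINE_LG:...> and #<END>) would have to be skipped before calling this function.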

3 – Sentiment cue words.

File name: keyworder.lan.dicE ( lan= de, en, es, fr, it, ... )

The dictionary is structured in 6 classes of words. The inversers, like 'not', invert the sentiment of the sentence: it is good => it is not good. The enhancers, like 'very', increase the strength of the sentiment: it's good => it's very good. The mitigers reduce the strength of the sentiment. The suborders split a sentence: a suborder stops the propagation of an inverser, enhancer or mitiger, as in: it is not very fancy but it does the job pretty well. Finally come the positive words and the negative words.

A value from 1 to 9 gives the strength of the sentiment that the word expresses: exceptional > good.

The value of an enhancer word is a multiplier: I am happy => 6, I am very happy => 2x6

The value of a 'mitiger' word is a divider: I am happy => 6, I would be happy => 1/2*6

Example:

# Lexicon for Emotion 

1 inverser:/1\no/1\not/1\dont/1\isnt/1\arent/1\wasnt/1\werent/1\hasnt/1\hadnt/
2 enhancer:/2\very/2\extremely/2\definitively/2\particularly/2\really/

3 mitiger:/2\should/2\would/2\could/2\may/

4 suborder:/1\if/1\so/1\but/1\that/1\when/1\because/1\which/1\who/

5 positive-opinion:/9\exceptional/9\excellent/8\marvelous/.../6\happy/6\enjoy/6\enjoyed/
6 negative-opinion:/9\disastrous/9\disaster/8\miserable/.../5\unsafe/5\overcome/5\defeated/

#<END>
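The arithmetic of these rules can be tried with a toy scorer in Python. The lexicon below is a small extract of the example above; the way the modifiers are reset after each opinion word is an assumption of this sketch, not a description of the real analyzer.

```python
# Small extract of the lexicon shown above.
LEXICON = {
    "inverser": {"no": 1, "not": 1},
    "enhancer": {"very": 2, "really": 2},
    "mitiger": {"would": 2, "could": 2},
    "suborder": {"but": 1, "because": 1},
    "positive": {"excellent": 9, "happy": 6},
    "negative": {"disaster": 9, "unsafe": 5},
}

def score_sentence(sentence):
    """Positive result = positive opinion, negative result = negative opinion."""
    total, sign, factor = 0.0, 1, 1.0
    for word in sentence.lower().split():
        if word in LEXICON["suborder"]:
            sign, factor = 1, 1.0            # modifiers do not cross a suborder
        elif word in LEXICON["inverser"]:
            sign = -sign
        elif word in LEXICON["enhancer"]:
            factor *= LEXICON["enhancer"][word]
        elif word in LEXICON["mitiger"]:
            factor /= LEXICON["mitiger"][word]
        elif word in LEXICON["positive"]:
            total += sign * factor * LEXICON["positive"][word]
            sign, factor = 1, 1.0            # assumption: reset after use
        elif word in LEXICON["negative"]:
            total -= sign * factor * LEXICON["negative"][word]
            sign, factor = 1, 1.0
    return total

print(score_sentence("I am very happy"))    # 12.0
print(score_sentence("I would be happy"))   # 3.0
```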

4 – Famous sentences (for plagiarism detection).

File name: keyworder.lan.dicS ( lan= de, en, es, fr, it, ... )

# Sentences for plagiarism search

1 text-literature-mythology-BraveNewWorld:/9\the liftman was a small simian creature dressed in the black tunic of an epsilon minus semi moron/

2 text-literature-mythology-NativeAmerican-WhitePlume:/9\the young man was noted throughout the whole nation for his accuracy with the bow and arrow and was given the title of dead shot or he who never misses his mark and the young woman noted for her beauty was named beautiful dove/

#<END>
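A very simple way to use such a list is to normalize the text the same way as the dictionary (lowercase, no punctuation) and look for each famous sentence inside it. The Python sketch below does only that exact search; the real program looks for similar sentences, so it is presumably more tolerant than this example.

```python
import re

# One entry copied from the example above: label -> normalized sentence.
KNOWN_SENTENCES = {
    "BraveNewWorld": "the liftman was a small simian creature dressed in "
                     "the black tunic of an epsilon minus semi moron",
}

def normalize(text):
    """Lowercase the text and keep only letters and digits, separated by spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

def find_sources(text):
    """Return the labels of the known sentences found in the text."""
    plain = normalize(text)
    return [label for label, sentence in KNOWN_SENTENCES.items()
            if sentence in plain]

sample = ("He looked up. The liftman was a small simian creature, dressed in "
          "the black tunic of an Epsilon-Minus Semi-Moron.")
print(find_sources(sample))
```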

5 – Language detection.

File name: languages.word

It is simply a list of common and specific words (in lowercase) for each language.

Example of list:

Identify language with specific words. Use 7bits ASCII, no accent, no special char

fr:/oui/du/des/une/des/le/etes/sommes/sont/avons/avez/ont/

en:/yes/some/an/the/of/is/are/been/has/have/had/she/you/we/they/

es:/los/el/las/unos/unas/yo/nosotros/henos/vosotros/habeis/han/habia/habias/habianos/

it:/al/alle/del/della/gli/io/egli/noi/voi/eosi/sono/sei/sianos/siete/

de:/ich/bin/der/und/den/von/zu/das/mit/sich/nich/dass/auf/wir/haben/auch/einen/schon/

po://
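Detecting the language with such a list amounts to counting hits per language. A minimal Python sketch, using a few words from the lists above (the function name and the tie-breaking rule are assumptions):

```python
# A few words taken from the lists above.
LANGUAGE_WORDS = {
    "fr": {"oui", "du", "des", "une", "le", "sont"},
    "en": {"yes", "some", "an", "the", "of", "is", "are"},
    "de": {"ich", "bin", "der", "und", "den", "von"},
}

def detect_language(text):
    """Return the language whose specific words occur most often, or None."""
    tokens = text.lower().split()
    hits = {lang: sum(t in words for t in tokens)
            for lang, words in LANGUAGE_WORDS.items()}
    best = max(hits, key=hits.get)       # on a tie, the first language wins
    return best if hits[best] > 0 else None

print(detect_language("the cat is on the roof of an old house"))
```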

6 – Reading length estimation.

File name: languages.chrono

It gives the average duration, in milliseconds, of syllables, of silences between two words, and of pauses at commas and at ends of sentences, for each implemented language.

Example of list:

#Average duration (in milliseconds) lang:syllable/word/coma/sentence

fr:180/50/250/500

en:200/10/250/500

es:150/50/250/500

sp:150/50/250/500

it:150/50/250/500

de:200/50/250/500

ge:200/50/250/500

#End
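With these values, a reading time can be estimated by a simple weighted sum. The formula in the Python sketch below is an assumption (the exact computation made by the engine is not described here); the durations are the 'en' and 'fr' values from the list above.

```python
# (syllable, word gap, comma, sentence end) durations in milliseconds.
CHRONO = {"en": (200, 10, 250, 500), "fr": (180, 50, 250, 500)}

def reading_time_ms(lang, syllables, words, commas, sentences):
    """Estimated reading time as a weighted sum of the four counts."""
    syllable_ms, word_ms, comma_ms, sentence_ms = CHRONO[lang]
    return (syllables * syllable_ms + words * word_ms
            + commas * comma_ms + sentences * sentence_ms)

# 300 syllables, 200 words, 10 commas and 12 sentences in English:
print(reading_time_ms("en", 300, 200, 10, 12) / 1000, "seconds")   # 70.5 seconds
```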

Limitations and future developments:

Two limitations inherent to the design of the program are discussed here.

1 - Substring search uses the strstr function. This function is used intensively to search for keywords in the text and in the dictionaries. It is a very basic approach with several advantages, such as simplicity, robustness and low memory requirements, but it has two major drawbacks: it is slow, and it cannot work with characters coded in 16 or 32 bits.

2 – The text is converted into two files. In the first file, the characters are coded in 7 bits (plain ASCII); it is used to extract keywords, to classify/categorize the text and to find the most important/relevant sentences. In the other file, the characters are coded in 8 bits (ISO 8859); it is used to extract sentences for the summary. This solution works very well with English (of course) and with languages like French, Italian and Spanish, where accented characters can easily be simplified to plain characters (é,è,ê => e). It is more complicated with the Greek, Cyrillic, Arabic and Hebrew alphabets, for instance, because they require a transliteration and a simplification to convert them to plain ASCII. Finally, the transliteration of Mandarin is particularly difficult because of the thousands of symbols used.
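The simplification of accented characters (é,è,ê => e) can be reproduced in Python with the standard unicodedata module. This is not how the C engine does it; it is only a compact way to show the idea and its limit.

```python
import unicodedata

def to_plain_ascii(text):
    """Decompose accented letters, then drop everything that is not 7-bit ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_plain_ascii("é è ê"))        # e e e
print(to_plain_ascii("Été à Noël"))   # Ete a Noel
```

Characters that have no ASCII base letter (Greek or Cyrillic letters, for instance) simply disappear with this method, which is exactly why these alphabets need a real transliteration.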

In conclusion: the current versions of TexLexAn work well with small texts (10 pages), small dictionaries (100 classes) and the most common Western languages (English, French, Spanish, German, Italian). A complete rewrite will be required to work with large texts, large classification dictionaries and almost any language.

Future work:

Replace the strstr search with a more efficient method.
Convert text to UTF-32.
Rewrite the code in C++ or D?

Contacts:

Author: Jean-Pierre Redonnet (last update: 2010/08/13)

Personal: texlexan@gmail.com
Work: jredonnet@wcupa.edu

