Counting words

December 8th, 2015

While exploring NLTK, the Natural Language Toolkit, I came across an interesting way of ‘reading’ The Book of Genesis.
This post is a small report. You can find a Dutch version here.

NLTK is a Python library for extensive language processing. It comes with a lot of ‘data’, among them a series of ‘books’. When you load the books, you get a nice numbered list, starting with Melville’s Moby Dick, Jane Austen’s Sense and Sensibility, and the Book of Genesis. Next follow the Inaugural Address Corpus, the Chat Corpus, Monty Python and the Holy Grail, the Wall Street Journal, the Personals Corpus and Chesterton’s The Man Who Was Thursday.
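As a minimal sketch of the loading step: the interactive route is `from nltk.book import *`, which prints the list of books and binds them to names such as `text3`. The variant below reads the same King James text through `nltk.corpus.genesis` instead, so that it fails gracefully when NLTK or its data is missing. The helper name `load_genesis_tokens` is my own and not part of NLTK.

```python
def load_genesis_tokens():
    """Return the tokens of the King James Genesis from NLTK, or None if unavailable."""
    try:
        from nltk.corpus import genesis
        return list(genesis.words("english-kjv.txt"))
    except (ImportError, LookupError):
        # NLTK is not installed, or its data has not been downloaded yet.
        # Running nltk.download("book") once fetches the collection used by nltk.book.
        return None


tokens = load_genesis_tokens()
if tokens is None:
    print("NLTK or its 'book' data is missing")
else:
    print(len(tokens), "tokens, starting with", tokens[:8])
```

The later sketches in this post only need a plain list of tokens, so they also work on any other tokenised text.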

A ‘distant’ algorithmic reading of The Book of Genesis taught me details like those in this dispersion plot:

These are the figures:
* “man” is mentioned 114 times
* “woman” is mentioned 20 times
* “he” is mentioned 648 times
* “she” is mentioned 161 times
FYI, “sex”, “sensual”, “sensuality” and “copulation” are mentioned 0 times; “flesh” is mentioned 26 times, 6 of them in connection with circumcision.
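Figures like these can be obtained in NLTK with `text3.count("man")`, and the plot with `text3.dispersion_plot([...])`, which needs matplotlib. The sketch below shows the same idea in plain Python on a single sample verse, so it runs without any data. The helper names `mention_counts` and `offsets` are hypothetical, and the matching is exact and case-sensitive, which may or may not be how the figures above were produced.

```python
from collections import Counter


def mention_counts(tokens, words):
    """Count exact, case-sensitive occurrences of each word in a token list."""
    counts = Counter(tokens)
    return {w: counts[w] for w in words}


def offsets(tokens, word):
    """Token positions of a word: the x-coordinates a dispersion plot would mark."""
    return [i for i, t in enumerate(tokens) if t == word]


# One sample verse (Genesis 3:12), tokenised by hand for illustration.
sample = ("And the man said , The woman whom thou gavest to be with me , "
          "she gave me of the tree").split()

print(mention_counts(sample, ["man", "woman", "he", "she"]))
print(offsets(sample, "she"))
```

Passing the full token list of Genesis instead of `sample` gives the totals for the whole book.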

The longest words count 15 characters each, and are: ‘Zaphnathpaaneah’ and ‘interpretations’.
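Finding the longest words is a matter of taking the maximum length over the vocabulary and keeping every word that reaches it. A small sketch, with a hypothetical helper `longest_words` and a hand-made token list standing in for the full text:

```python
def longest_words(tokens):
    """Return (length, words) for the longest distinct tokens."""
    vocab = set(tokens)
    if not vocab:
        return 0, []
    longest = max(len(w) for w in vocab)
    return longest, sorted(w for w in vocab if len(w) == longest)


# Illustrative tokens only; use the full Genesis token list in practice.
sample = ["Pharaoh", "called", "Joseph", "Zaphnathpaaneah",
          "interpretations", "belong", "to", "God"]

print(longest_words(sample))
```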
26% of the text consists of three-letter words, 11,599 occurrences in total. Below is the collection of these words, in alphabetical order, arranged for a word mosaic on the wall of a metro station or for some intelligence quiz on TV:

Of the 2615 words in the lexicon of Genesis, the following are capitalised like a title and count more than 10 letters each: Abelmizraim, Allonbachuth, Beerlahairoi, Canaanitish, Chedorlaomer, Girgashites, Hazarmaveth, Hazezontamar, Ishmeelites, Jegarsahadutha, Jehovahjireh, Kirjatharba, Melchizedek, Mesopotamia, Peradventure, Philistines, Zaphnathpaaneah.
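The length-based figures above (the share of three-letter words and the long title words) can be sketched as two small filters. I assume here that ‘title’ refers to Python's `str.istitle()`, which would explain why ‘Peradventure’ shows up among the names, and that punctuation tokens are left out of the percentage. Neither assumption is confirmed by the figures themselves, and the helper names are hypothetical.

```python
def share_of_length(tokens, n):
    """Return (count, fraction) of alphabetic tokens that have exactly n letters."""
    words = [t for t in tokens if t.isalpha()]  # assumption: punctuation is ignored
    hits = [w for w in words if len(w) == n]
    return len(hits), (len(hits) / len(words) if words else 0.0)


def long_title_words(tokens, min_letters=10):
    """Distinct title-case tokens with more than min_letters letters, sorted."""
    return sorted({t for t in tokens if t.istitle() and len(t) > min_letters})


# One sample verse (Genesis 14:18, first half), tokenised by hand.
sample = "And Melchizedek king of Salem brought forth bread and wine".split()

count, share = share_of_length(sample, 3)
print(count, round(share * 100), "%")
print(long_title_words(sample))
```

The set comprehension in `long_title_words` plays the role of the lexicon: each distinct word is kept once, however often it occurs.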

Algorithms like these are useful for writers looking for suitable names. ‘Hazezontamar’ is so exotic. Gamers thought of it before.
