big data blog

anything that drives or relates to big data, social networks, mobile media, analytics, and more.

Literary Word Comparison

Introduction
This is one of the small research projects I am currently conducting. I do not pretend to offer or accomplish any scientific added value for the research community in the field of Natural Language Processing (NLP); I humbly submit my efforts in pursuit of further personal learning. While the research remains unfinished, and until I publish it formally, I will keep this post as a mini-post. As a Universal Man, a Humanist, a Renaissance Man, each individual has an obligation to question and to further his or her knowledge and understanding, as far as it lies within our capacities. Learning is a tool to humble the heart, and above all we should mistrust brave hearts.

Matt Ridley, in his book Nature via Nurture (as quoted by Richard Dawkins in the Mouse Tale chapter of The Ancestor’s Tale), says that “the list of words in David Copperfield is almost the same as the list of words in The Catcher in the Rye.” Starting from this observation, I concluded that it would be an interesting project to create a scatter diagram in which major works of literature (written in, translated into, or edited into modern English for ease of comparison) are plotted as total number of words versus number of distinct words used, and a second, network graph that displays the relative closeness of literary works by the words they use. The scatter diagram is of course the easier of the two to create, so I will start with it and then move on to the network diagram.
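The scatter diagram needs only two numbers per text: the total word count and the distinct word count. A minimal sketch of how those two numbers could be obtained (my own throwaway tokenization, not a finished parser):

```python
import re

def word_counts(text):
    """Return (total_words, distinct_words) for a chunk of text."""
    # Lowercase first so "The" and "the" count as one distinct word;
    # keep letters and apostrophes, treat everything else as a separator.
    words = re.findall(r"[a-z']+", text.lower())
    return len(words), len(set(words))

# Each text then becomes one (x, y) point on the scatter plot, e.g.:
# total, distinct = word_counts(open("ulyss12.txt").read())
```
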

In the network diagram, several pieces of information can weigh into the closeness of one point to another: number of words, word length, the ratio of long words to short words, et cetera. I will create a list of possible factors to include in the closeness calculation, extending the application from a simple calculation to something more complex over time, based on feedback from more educated specialists.
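As one simple candidate for such a closeness score (my own starting point, not a settled choice), the Jaccard overlap of two texts' vocabularies gives a number between 0 and 1:

```python
def jaccard(vocab_a, vocab_b):
    """Share of distinct words the two vocabularies have in common."""
    a, b = set(vocab_a), set(vocab_b)
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

# jaccard({"war", "art", "general"}, {"war", "peace"})  ->  0.25
```

Other factors (word length, long/short ratio) could later be folded into a weighted combination of such scores.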

But in principle I will treat the texts without semantic interpretation: as blind data, as numbers, not as complex thoughts that erupted from a spur of genius.

Planning:
- Background Reading,
- Write thesis and project description,
- Evaluate planning,
- Write simple parser, separating words from a text, eliminating grammar marks,
- Evaluate Planning,
- Use third party tool or compare similar projects.
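The "simple parser" step above might look like the sketch below; the real tokenization rules (hyphenation, abbreviations, possessives) will need refinement as the project progresses:

```python
import re

def parse_words(text):
    """Split raw text into words, dropping punctuation ("grammar marks")."""
    # Keep runs of letters, optionally with one internal apostrophe group
    # (so "Gulliver's" stays one word); everything else is a separator.
    return re.findall(r"[a-zA-Z]+(?:'[a-zA-Z]+)?", text)

# parse_words("Gulliver's Travels -- into several remote nations.")
# -> ["Gulliver's", 'Travels', 'into', 'several', 'remote', 'nations']
```
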

Texts:
- The Koran (koran10.txt),
- Plato, The Republic (repub11.txt),
- Jonathan Swift, Gulliver’s Travels (gltrv10.txt),
- Niccolò Machiavelli, The Prince (1232.txt),
- Sunzi, The Art of War (17405.txt),
- Henrik Ibsen, A Doll’s House : a play (dlshs11.txt),
- James Joyce, Ulysses (ulyss12.txt),
- The Declaration of Independence.

Resources:
- Natural Language Processing @wikipedia.org
- NLP Research @Microsoft,
- Stanford Natural Language Processing Group,
- Cornell Natural Language Processing Group,
- NLP at Brown Laboratory,
- OpenNLP,
- Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999,
- Linguistic Data Consortium (LDC),
- NLP Blog,
- Proxem, resources for NLP.

Lexical Data:
- WordNet @Princeton.

date: December 26th, 2006 | categories: it, science | tags: