USE AS REFERENCE, NO FUTURE DEVELOPMENT PLANNED
This was a play project that I put together with @arogers1 while learning some NLP, and a fun exploration in extracting features from text.
Stylometry is the study of linguistic style. It is usually applied to written language, but it has also been applied successfully to music and to fine-art paintings. Stylometry is often used to attribute authorship to anonymous or disputed documents, and it has legal, academic, and literary applications, including forensic linguistics and the determination of the true authorship of some of Shakespeare's works.
Even after hours of searching through Python libraries, I was unable to find one that seemed to fit my needs.
But soon, I stumbled upon a captivating research paper about applying machine learning techniques to stylometry. I also came across an excellent library that provided the foundation for statistical analysis of raw text data: the Natural Language Toolkit, or NLTK. Building on that foundation, I decided to write a simple library dedicated to stylometry.
The initial version of the software took only about three hours to develop, yet it already let me extract a wide variety of features from text data. The library has since been extended in several directions.
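To give a flavor of the kinds of features stylometry relies on, here is a minimal, standard-library-only sketch of two classic ones. The function names are illustrative, not part of this package's API:

```python
# Illustrative only: two classic stylometric features computed with the
# standard library. This library computes a richer feature set via NLTK.
import re

def lexical_diversity(text):
    """Type-token ratio: unique words divided by total words."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words)) / len(words)

def mean_word_length(text):
    """Average length of the words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(len(w) for w in words) / len(words)

sample = "It was the best of times, it was the worst of times."
print(round(lexical_diversity(sample), 2))  # 0.58
print(round(mean_word_length(sample), 2))   # 3.25
```

Features like these, computed per document, become the rows of the CSV files that the classification and clustering steps below consume.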
```shell
brew install graphviz
pip install -r requirements.txt
```

```python
import nltk
nltk.download('punkt')
```
If you see the error `Couldn't import dot_parser, loading of dot files will not be possible.`, do this:
```shell
pip uninstall pydot
pip uninstall pyparsing
pip install -Iv https://pypi.python.org/packages/source/p/pyparsing/pyparsing-1.5.7.tar.gz
pip install pydot
```
You can split a large text file into 1000-line samples with `split -l 1000 hamlet.txt -d`. To rename the output files, consider using:

```shell
split --numeric-suffixes=1 --additional-suffix=.csv -l 1000 hamlet.txt hamlet_
```

To recombine split files, use `cat`:

```shell
cat book-0.txt book-1.txt book-2.txt > entire-book.txt
```
Make sure that every author has a similar number of lines and samples to analyze. For example:
| Title | Author | Lines | Samples |
|-------|--------|-------|---------|
| Pride and Prejudice | Jane Austen | 13024 | 13 |
| Tale of Two Cities | Charles Dickens | 14496 | 14 |
| Romeo & Juliet, Hamlet | William Shakespeare | 11077 | 11 |
| The Adventures of Huckleberry Finn | Mark Twain | 11433 | 10 |
| War and Peace - Only Books 1 & 2 | Leo Tolstoy | 10517 | 11 |
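A quick way to check that a corpus is balanced is to count the total lines per author directory. This is a self-contained demo with throwaway data (the `corpus/` fixture is hypothetical); point the loop at your own `stylometry-data/` directory instead:

```shell
# Build a throwaway corpus with two "authors" of different sizes
mkdir -p corpus/Austen corpus/Twain
seq 1 13000 > corpus/Austen/book-0.txt
seq 1 11000 > corpus/Twain/book-0.txt

# Report total lines per author so you can spot imbalance
for d in corpus/*/; do
  printf '%s %s\n' "$d" "$(cat "$d"*.txt | wc -l | tr -d ' ')"
done
```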
Download the repositories
```shell
$ git clone git@github.com:jpotts18/stylometry.git
$ git clone git@github.com:jpotts18/stylometry-data.git
$ ipython
```
Extract stylometry features from one document
```python
from stylometry.extract import *

dickens1 = StyloDocument('stylometry-data/Dickens/tale-two-cities-0.txt')
dickens1.text_output()
```
Extract stylometry features from a set of documents called a corpus
```python
from stylometry.extract import *

# Single-author corpus
dickens_corpus = StyloCorpus.from_glob_pattern('stylometry-data/Dickens/*.txt')
dickens_corpus.output_csv('/Users/jpotts18/Desktop/dickens.csv')

# All authors
novel_corpus = StyloCorpus.from_glob_pattern('stylometry-data/*/*.txt')
novel_corpus.output_csv('/Users/jpotts18/Desktop/novels.csv')
```
Decision Tree Classification
```python
from stylometry.classify import *

# Splits the data into training and validation sets
# (default: 80% train, 20% validation)
dtree = StyloDecisionTree('/Users/jpotts18/Desktop/novels.csv')
# Fit the decision tree to the data
dtree.fit()
# Predict the authorship of the validation set
dtree.predict()
# Show the confusion matrix and accuracy of the validation prediction
dtree.confusion_matrix()
# Write the decision tree to an image file
dtree.write_tree('tree.png')
```
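For orientation, the steps above amount to roughly the following scikit-learn workflow. This is a sketch, not the library's actual code, and the synthetic feature matrix stands in for the columns of `novels.csv`:

```python
# Rough equivalent of the classification step, using scikit-learn
# directly. The features are synthetic stand-ins for the stylometry CSV.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
# Two fake "authors" whose feature distributions differ slightly
X = np.vstack([rng.normal(0.0, 1.0, (50, 4)),
               rng.normal(1.5, 1.0, (50, 4))])
y = np.array(['austen'] * 50 + ['dickens'] * 50)

# 80% train / 20% validation, as in StyloDecisionTree's default
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(confusion_matrix(y_test, clf.predict(X_test)))
print(clf.score(X_test, y_test))  # validation accuracy
```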
Clustering and PCA
```python
from stylometry.cluster import *

# Create a KMeans clusterer and run PCA on the data
kmeans = StyloKMeans('/Users/jpotts18/Desktop/novels.csv')
# Cluster the PCA'd data using k-means
kmeans.fit()
# Show the plot of explained variance per principal component
kmeans.stylo_pca.plot_explained_variance()
# Show the plot of the PCA'd data with the cluster centroids
kmeans.plot_clusters()
```
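The clustering step boils down to PCA followed by k-means on the projected data. Here is a sketch of the equivalent scikit-learn calls, again with synthetic features standing in for the stylometry CSV:

```python
# Sketch of the clustering step: reduce to 2 principal components,
# then run k-means on the projected data. Not the library's actual code.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two well-separated synthetic "author" groups
X = np.vstack([rng.normal(0.0, 1.0, (50, 4)),
               rng.normal(4.0, 1.0, (50, 4))])

pca = PCA(n_components=2).fit(X)
X2 = pca.transform(X)
print(pca.explained_variance_ratio_)  # variance explained per principal component

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X2)
print(km.cluster_centers_)            # centroids in PCA space
```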