Document Clustering

The e-ŠNUNNA project analyzes a corpus of letters from the Old Babylonian Kingdom of EŠNUNNA by applying Document Clustering in synergy with other e-tools.
Text Clustering is a data mining technique that identifies sets of documents sharing common features (e.g., the same topic). Documents in the same cluster are more similar or more closely related to each other than to texts belonging to different clusters.
One of the aims of the e-ŠNUNNA Project is to discover homogeneous groups and hidden relations in the data, to support scholars in their Assyriological studies. For this purpose, ENEA developed a data mining tool called ASTEC (ASsyriology Text Clustering), which offers a set of clustering algorithms and features for analyzing and discovering homogeneous groups of transliterated and lemmatized texts.
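The idea that texts in one cluster resemble each other more than texts in other clusters can be sketched in a few lines. The example below is only an illustration: the source does not say which algorithms ASTEC implements, so a simple greedy "leader" clustering over bag-of-lemmas cosine similarity is swapped in here. The function names, the threshold and the toy lemma lists are all hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-lemmas vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def leader_clusters(docs, threshold=0.5):
    """Greedy 'leader' clustering (an assumption, not ASTEC's method):
    each document joins the first cluster whose leader (first member)
    is at least `threshold` similar; otherwise it starts a new cluster."""
    bags = {name: Counter(lemmas) for name, lemmas in docs.items()}
    clusters = []
    for name in docs:
        for cluster in clusters:
            if cosine(bags[name], bags[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Hypothetical lemma lists, invented for illustration (not corpus data).
toy_docs = {
    "letter_A": ["barley", "field", "harvest", "send"],
    "letter_B": ["barley", "harvest", "send", "ox"],
    "letter_C": ["silver", "debt", "witness", "pay"],
    "letter_D": ["silver", "pay", "debt", "house"],
}

print(leader_clusters(toy_docs))
# [['letter_A', 'letter_B'], ['letter_C', 'letter_D']]
```

The result of a leader-style pass depends on the order of the documents and on the threshold; a production tool would more likely offer order-independent algorithms such as hierarchical or k-means clustering.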
The selected texts belong to a collection of 51 tablets, first transliterated by Albrecht Goetze in "Fifty Old Babylonian Letters from Harmal" (Sumer 14), 1958, pp. 3-78, and now slightly revised. These letters offer important elements for a study of the Old Babylonian grammar of EŠNUNNA.
After text processing and grammatical tagging in TaLTaC2, the texts were lemmatized; TaLTaC2 then reconstructed all fragments as sequences of lemmas, following the original word order.
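The reconstruction step can be pictured as replacing each word form with its lemma while preserving the sequence. The sketch below is not TaLTaC2 code and does not reproduce its output format. The lookup table and the sample phrase are hypothetical and serve only to show the order-preserving substitution.

```python
def lemmatize_sequence(tokens, lemma_of):
    """Replace each word form with its lemma, keeping the original order.
    Forms missing from the lookup are left unchanged."""
    return [lemma_of.get(token, token) for token in tokens]


# Hypothetical form-to-lemma lookup, for illustration only.
lemma_of = {"a-na": "ana", "be-lí-ia": "bēlu", "qí-bí-ma": "qabû"}

tokens = "a-na be-lí-ia qí-bí-ma".split()
lemmas = lemmatize_sequence(tokens, lemma_of)
print(" ".join(lemmas))
```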

Goetze lemmatized collection:
IM 51046.pdf .txtASCII
IM 51047.pdf .txtASCII
IM 51048.pdf .txtASCII
IM 51049.pdf .txtASCII
IM 51053.pdf .txtASCII
IM 51062.pdf .txtASCII
IM 51104 (see IM 51108)
IM 51105.pdf .txtASCII Lemm2014byTaLTaC2
IM 51108.pdf .txtASCII Lemm2014byTaLTaC2
IM 51110.pdf .txtASCII
IM 51111.pdf .txtASCII Lemm2013byTaLTaC2
IM 51112.pdf .txtASCII
IM 51113.pdf .txtASCII
IM 51114.pdf .txtASCII
IM 51154.pdf .txtASCII
IM 51155.pdf .txtASCII Lemm2014byTaLTaC2
IM 51156.pdf .txtASCII Lemm2014byTaLTaC2
IM 51180.pdf .txtASCII Lemm2014byTaLTaC2
IM 51182.pdf .txtASCII
IM 51184.pdf .txtASCII Lemm2014byTaLTaC2
IM 51186.pdf .txtASCII Lemm2014byTaLTaC2
IM 51189.pdf .txtASCII Lemm2014byTaLTaC2
IM 51192.pdf .txtASCII Lemm2014byTaLTaC2
IM 51193.pdf .txtASCII Lemm2014byTaLTaC2
IM 51194.pdf .txtASCII Lemm2014byTaLTaC2
IM 51197.pdf .txtASCII
IM 51198.pdf .txtASCII Lemm2014byTaLTaC2
IM 51226.pdf .txtASCII
IM 51229.pdf .txtASCII Lemm2014byTaLTaC2
IM 51234.pdf .txtASCII Lemm2014byTaLTaC2
IM 51235.pdf .txtASCII
IM 51237.pdf .txtASCII
IM 51238a.pdf .txtASCII
IM 51238b.pdf .txtASCII Lemm2014byTaLTaC2
IM 51240.pdf .txtASCII
IM 51251.pdf .txtASCII
IM 51260.pdf .txtASCII
IM 51269.pdf .txtASCII Lemm2014byTaLTaC2
IM 51270.pdf .txtASCII
IM 51272.pdf .txtASCII
IM 51294.pdf .txtASCII
IM 51305.pdf .txtASCII
IM 51310.pdf .txtASCII
IM 51311.pdf .txtASCII Lemm2014byTaLTaC2
IM 51312.pdf .txtASCII
IM 51321.pdf .txtASCII
IM 51365.pdf .txtASCII
IM 51376.pdf .txtASCII
IM 51382.pdf .txtASCII Lemm2014byTALTAC
IM 51490.pdf .txtASCII
IM 51503.pdf .txtASCII
IM 51585.pdf .txtASCII

Corpus Characteristics
The lemmatized corpus is composed of 2,741 word forms (tokens) and has a vocabulary of 500 word types. It shows a lexical richness index of 18% and 50% hapax legomena.
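These figures can be recomputed from a token list. The sketch below assumes that the lexical richness index is the type/token ratio (500 / 2,741 is about 18%, which matches the reported value) and that the hapax percentage is measured against the vocabulary, i.e. the share of word types occurring exactly once. The second assumption is not stated in the source. The function name and the toy tokens are hypothetical.

```python
from collections import Counter


def lexical_stats(tokens):
    """Basic lexical measures for a list of (lemmatized) tokens."""
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    n_hapax = sum(1 for c in counts.values() if c == 1)
    return {
        "tokens": n_tokens,
        "types": n_types,
        # Assumed definition of the lexical richness index: types / tokens.
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
        # Assumed definition of the hapax percentage: hapax / types.
        "hapax_share_of_types": n_hapax / n_types if n_types else 0.0,
    }


# Consistency check on the reported corpus figures.
print(f"{500 / 2741:.1%}")  # 18.2%

# Toy token list, invented for illustration (not corpus data).
toy = ["ana", "bēlu", "qabû", "ana", "umma", "bēlu", "ana"]
stats = lexical_stats(toy)
print(stats["tokens"], stats["types"], stats["hapax_share_of_types"])  # 7 4 0.5
```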
Senders and receivers were explicitly identified, which will be useful for further prosopographic and geographic analysis (see TaLTaC screenshot).
Compared with general corpora, these texts show some peculiarities in the TaLTaC2 text mining phase, such as:
- punctuation is lacking;
- word boundaries are marked only by spaces and carriage returns (punctuation marks were not treated as separators, because they serve a different function in Assyriological transliteration conventions);
- Akkadian personal names have been written in lower case;
- Sumerograms have been written in small caps;
- e-texts with Unicode characters required limited simplification (see text encoding).
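The word-boundary rule in the list above can be illustrated with a short comparison. This is a sketch, not the TaLTaC2 tokenizer: it contrasts splitting on whitespace only with a generic tokenizer that also breaks on punctuation. The sample string is hypothetical and only mimics transliteration conventions, with hyphens joining syllabic signs and a period inside a Sumerogram.

```python
import re


def whitespace_tokens(text):
    """Split on runs of whitespace only (spaces, tabs, CR, LF)."""
    return text.split()


def punctuation_tokens(text):
    """Generic tokenizer that also breaks on punctuation, for contrast."""
    return re.findall(r"\w+", text)


# Hypothetical sample, for illustration only (not a line from the corpus).
sample = "a-na be-lí-ia\r\nqí-bí-ma DUMU.MEŠ"

print(whitespace_tokens(sample))   # four tokens, hyphens and period preserved
print(punctuation_tokens(sample))  # signs split apart into many fragments
```

Splitting on whitespace keeps each transliterated word intact, whereas the punctuation-aware tokenizer breaks words into individual signs and so inflates the token count.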

Goetze, A., "Fifty Old Babylonian Letters from Harmal" (Sumer 14), 1958, pp. 3-78