Welcome to TIGRIS Virtual Lab Project Area

ASTEC (ASsyriology TExt Clustering)





ASTEC (ASsyriology TExt Clustering) is the data mining tool developed by ENEA within the e-ŠNUNNA project to support scholars in their Assyriological studies. Its aim is to discover homogeneous groups and hidden relations in the data by running clustering algorithms and configuring the measures available in the tool.
Text clustering is a data mining technique that identifies sets of documents sharing common features (e.g., the same topic). Documents in the same cluster are more similar, or more closely related, to each other than to texts belonging to different clusters.
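To make the idea concrete, the following is a minimal sketch of K-means, one of the algorithms ASTEC offers. It is not ASTEC's own code: the class name `KMeansSketch`, the use of squared Euclidean distance, and the seeding of centroids with the first k points are all assumptions made for illustration. Each document is assumed to be already represented as a numeric vector of term weights.

```java
import java.util.Arrays;

/** Illustrative K-means sketch (hypothetical class, not part of ASTEC). */
final class KMeansSketch {

    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the centroid closest to the given point. */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /**
     * Alternates assignment and centroid update until nothing changes
     * or maxIterations is reached. Assumption: the first k points seed
     * the centroids, which keeps the sketch deterministic.
     * Returns the cluster index assigned to each point.
     */
    static int[] cluster(double[][] points, int k, int maxIterations) {
        if (k < 1 || k > points.length) {
            throw new IllegalArgumentException("k must be between 1 and the number of points");
        }
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach every point to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int c = nearest(points[i], centroids);
                if (c != assignment[i]) {
                    assignment[i] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break;
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assignment[i]]++;
                for (int j = 0; j < dim; j++) {
                    sums[assignment[i]][j] += points[i][j];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int j = 0; j < dim; j++) {
                        centroids[c][j] = sums[c][j] / counts[c];
                    }
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Toy vectors invented for illustration: two well-separated groups.
        double[][] points = { {0, 0}, {0, 1}, {10, 10}, {10, 11} };
        System.out.println(Arrays.toString(cluster(points, 2, 100)));
    }
}
```

Seeding with the first k points is only a convenience here; in practice the result of K-means depends on the initial centroids, so random or more careful seeding is commonly used.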

ASTEC offers a set of clustering algorithms and features for analyzing and discovering homogeneous groups of transliterated and lemmatized texts. With ASTEC it is possible to bring out hidden relations within the e-ŠNUNNA Corpus, easing the discovery and extraction of new patterns and information from textual data.
The tool is customizable. The user is able to choose several settings:
- clustering algorithms (K-means, UPGMA),
- relevance measures (Document Frequency, Term Frequency, Term Frequency - Inverse Document Frequency),
- clustering quality evaluation measures (Inter-Intra Distance, F-measure).
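
The three relevance measures listed above can be sketched in a few lines of Java. This is an illustration only, not ASTEC's implementation: the class `RelevanceMeasures` and its method names are hypothetical, and the TF-IDF weighting shown (raw term count multiplied by the natural logarithm of N/df) is one common variant, assumed here because the exact formula used by the tool is not stated.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Illustrative relevance measures (hypothetical class, not part of ASTEC). */
final class RelevanceMeasures {

    /** Term Frequency: raw count of each term within a single document. */
    static Map<String, Integer> termFrequency(List<String> doc) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : doc) {
            tf.merge(term, 1, Integer::sum);
        }
        return tf;
    }

    /** Document Frequency: number of documents containing each term at least once. */
    static Map<String, Integer> documentFrequency(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // A set ensures each term is counted once per document.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** TF-IDF, assuming the variant tf * ln(N / df), with N the number of documents. */
    static Map<String, Double> tfIdf(List<String> doc, List<List<String>> docs) {
        Map<String, Integer> df = documentFrequency(docs);
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFrequency(doc).entrySet()) {
            int d = df.getOrDefault(e.getKey(), 0);
            double idf = (d == 0) ? 0.0 : Math.log((double) docs.size() / d);
            weights.put(e.getKey(), e.getValue() * idf);
        }
        return weights;
    }

    public static void main(String[] args) {
        // Toy lemmatized documents, invented for illustration.
        List<List<String>> docs = List.of(
                List.of("sharrum", "alum", "sharrum"),
                List.of("alum", "bitum"));
        System.out.println(tfIdf(docs.get(0), docs));
    }
}
```

With this weighting, a term that occurs in every document receives a weight of zero, while a term concentrated in few documents is weighted up, which is what makes TF-IDF useful for telling clusters apart.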

ASTEC is written in Java and, being independent of the execution platform, can run on any system that provides a Java runtime. This makes ASTEC particularly suitable for web environments. Furthermore, it is modular and designed to be easily extended with other algorithms and relevance measures.


Corpus Characteristics
The lemmatized Corpus is composed of 2741 word forms (tokens) and has a vocabulary of 500 word types. It shows a lexical richness index of 18% and a hapax proportion of 50%.
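
The figures above can be checked with a short sketch. The lexical richness index is consistent with the type/token ratio (500 / 2741 ≈ 18%). The hapax proportion is assumed here to mean the share of word types that occur exactly once; the page does not state whether it is measured against types or tokens. The class `CorpusStats` and its methods are hypothetical names, not part of ASTEC.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative lexical statistics (hypothetical class, not part of ASTEC). */
final class CorpusStats {

    /** Counts how often each word type occurs in the token sequence. */
    static Map<String, Integer> typeFrequencies(List<String> tokens) {
        Map<String, Integer> freq = new HashMap<>();
        for (String token : tokens) {
            freq.merge(token, 1, Integer::sum);
        }
        return freq;
    }

    /** Lexical richness as the type/token ratio: distinct types divided by tokens. */
    static double typeTokenRatio(List<String> tokens) {
        if (tokens.isEmpty()) {
            return 0.0;
        }
        return (double) typeFrequencies(tokens).size() / tokens.size();
    }

    /** Assumed definition: share of word types that occur exactly once. */
    static double hapaxShare(List<String> tokens) {
        Map<String, Integer> freq = typeFrequencies(tokens);
        if (freq.isEmpty()) {
            return 0.0;
        }
        long hapax = freq.values().stream().filter(count -> count == 1).count();
        return (double) hapax / freq.size();
    }

    public static void main(String[] args) {
        // Toy lemma sequence, invented for illustration only.
        List<String> tokens = List.of("sharrum", "ana", "alim", "ana", "sharrum", "bitum");
        System.out.printf("TTR = %.2f, hapax share = %.2f%n",
                typeTokenRatio(tokens), hapaxShare(tokens));
        // Figures reported for the corpus: 500 types over 2741 tokens.
        System.out.printf("Corpus TTR = %.1f%%%n", 100.0 * 500 / 2741);
    }
}
```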


