Scraping Archaeology: A Methodological Approach from the Web Scraping and Text Mining
DOI:
https://doi.org/10.31048/1852.4826.v16.n2.41094
Keywords:
R, Web scraping, Text mining, Data analytics, Digital Archaeology
Abstract
As the amount of information available on the web grows, so does the difficulty of locating and analysing it, and doing so manually is costly in time and effort. Search engines and database engines help to find the required information, but in large digital infrastructures where search results number in the thousands or more, new tools are needed to retrieve the relevant content effectively. This paper proposes Web Scraping and Text Mining as methodological tools for compiling and processing large volumes of data from digital infrastructures in a more automated way. Automating both processes is a considerable advantage when analysing textual corpora of thousands of records, since it greatly simplifies the collection of different types of data. It is hoped that this contribution will offer the archaeological community a novel methodology for collecting and handling structured and unstructured data that can be integrated into its wider research.
License
Copyright (c) 2023 Humberto Aguilar
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Authors who publish in this journal accept the following terms:
a. Authors retain their copyright and grant the journal the right of first publication of their work, which is simultaneously subject to the Creative Commons Attribution License. This license allows third parties to share the work as long as its author and its first publication in this journal are acknowledged.
b. Authors may enter into other non-exclusive licensing agreements for distributing the published version of the work (e.g., depositing it in an institutional electronic archive or publishing it in a monographic volume), provided that its initial publication in this journal is indicated.
c. Authors are allowed and encouraged to disseminate their work on the Internet (e.g., in institutional electronic archives or on their own website) before and during the submission process, which can lead to interesting exchanges and increase citations of the published work. (See The Effect of Open Access)