Machine learning-assisted environmental surveillance of Legionella: A retrospective observational study in Friuli-Venezia Giulia region of Italy in the period 2002–2019


Legionella spp. are considered an important cause of potentially preventable morbidity and mortality, making environmental surveillance a crucial component of risk assessment plans. In this work, historical data on about 24,000 water samples collected during a series of environmental surveys in the Italian region of Friuli-Venezia Giulia in the period 2002–2019 were analysed using machine learning methods. The objective was to determine the most relevant factors linked to the spread of Legionella in this territory, and to compare these findings with those already reported in the literature. Specifically, retrospective unsupervised spatio-temporal analyses of geocoded data were carried out to identify potential clusters with an excess of cases in the higher-risk categories. In addition, tree-based ensemble classification methods were employed to predict the serogroup and contamination level of each sample from climatic and geographical factors, highlighting relationships between the values of these predictors and the presence of specific serogroups. Results based on cross-validation show that it is possible to identify spatially restricted zones with unusual contamination from data collected during routine surveillance, and that the serogroup appears to be influenced by humidity, temperature and latitude, although the interaction with environmental factors is complex and still not entirely clarified. Nevertheless, our analyses highlighted several relationships between environmental factors and the presence of Legionella. Most importantly, there is an indication that some Legionella strains are more tolerant of specific microclimates, suggesting that further work in this direction should be conducted as more data become available.
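To make the idea of a "cluster with an excess of cases" concrete, the following minimal Python sketch scores candidate zones with a Bernoulli log-likelihood ratio, in the style of a spatial scan statistic. The abstract does not name the clustering method that was actually used, so this choice is an assumption made purely for illustration. The function names, the zone labels and the toy counts are likewise hypothetical and are not taken from the study data.

```python
import math


def _xlogy(x, y):
    """Return x * ln(y), using the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def bernoulli_llr(cases_in, n_in, cases_total, n_total):
    """Log-likelihood ratio for an excess of positive samples inside a zone.

    Compares "the positive rate inside the zone is higher than outside"
    against "one common rate everywhere". Returns 0.0 when there is no excess.
    """
    n_out = n_total - n_in
    cases_out = cases_total - cases_in
    if n_in == 0 or n_out == 0:
        return 0.0
    p_in = cases_in / n_in
    p_out = cases_out / n_out
    if p_in <= p_out:
        return 0.0  # only zones with an excess of cases are of interest
    p_all = cases_total / n_total
    ll_alt = (
        _xlogy(cases_in, p_in) + _xlogy(n_in - cases_in, 1 - p_in)
        + _xlogy(cases_out, p_out) + _xlogy(n_out - cases_out, 1 - p_out)
    )
    ll_null = (
        _xlogy(cases_total, p_all) + _xlogy(n_total - cases_total, 1 - p_all)
    )
    return ll_alt - ll_null


def most_unusual_zone(zone_counts):
    """Score every zone; zone_counts maps zone -> (positives, samples)."""
    cases_total = sum(c for c, _ in zone_counts.values())
    n_total = sum(n for _, n in zone_counts.values())
    scores = {
        zone: bernoulli_llr(c, n, cases_total, n_total)
        for zone, (c, n) in zone_counts.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy counts (positives, samples) per zone; NOT study data.
toy_counts = {"zone_A": (30, 100), "zone_B": (10, 100), "zone_C": (10, 100)}
best_zone, zone_scores = most_unusual_zone(toy_counts)
print(best_zone, zone_scores)
```

In a full scan-statistic analysis the highest score would be compared with scores obtained from randomly permuted case labels to obtain a p-value, and the candidate zones would be space-time cylinders rather than fixed areas. Both steps are omitted here for brevity.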