Abstract
People spend over 80% of their lives indoors, which makes indoor air quality (IAQ) research important. This paper presents (i) a structured overview of publicly available IAQ datasets suitable for machine learning (ML) research; (ii) a comparative analysis of the reviewed datasets; (iii) an ML-oriented mapping between tasks and algorithms that outlines the algorithmic families most appropriate for a given dataset structure and prediction target; and (iv) an investigation of IAQ–ML studies that use custom-made solutions with sensors for data acquisition. The methodology comprised an analysis of 1162 papers from the Web of Science, 1536 from Scopus, and 756 from IEEE Xplore, published between 1 January 2020 and 31 December 2025, to capture recent trends in ML-based IAQ research. The findings show that linear regression (132 articles), logistic regression (91), random forest (RF; 77), Long Short-Term Memory (LSTM; 77), Principal Component Analysis (63), and Elastic Net are the most popular methods among researchers. Most studies report accuracy above 90%, with maximum values of 99.37% for LSTM and 99.20% for RF. For regression tasks, R² values range between 82% and 98%, especially for CO₂ and PM2.5 prediction. eXtreme Gradient Boosting or hybrid RF-LSTM architectures achieve R² values of up to 99%. The public and private IAQ datasets analyzed in this study provide a strong foundation for transfer learning, but differences among them require careful preprocessing to ensure consistent comparisons and reliable conclusions. The distribution of articles by sensor type for IAQ parameters shows that linear regression remains the most widely used ML method (26 studies), followed by LSTM (19) and RF (18). The results confirm that there is no universal algorithm for IAQ and that the quality and structure of the data contribute to the success of ML models. This study aims to serve as a foundation for the development of future intelligent IAQ monitoring systems.