Description
The growing availability of heterogeneous clinical, neuropsychological, rehabilitative, and sensor-derived data is creating new opportunities for Artificial Intelligence (AI)-driven decision support in pediatric neurology. However, the development of reliable AI models in this domain is still severely constrained by major data engineering issues, including fragmentation of clinical repositories across healthcare centers, heterogeneity of acquisition protocols, inconsistent semantic representation of variables, lack of interoperability among data sources, and strict privacy constraints associated with sensitive pediatric data. In multicenter scenarios, these limitations directly affect data quality, comparability, and reusability, ultimately reducing the effectiveness of downstream analytics pipelines and hindering the identification of robust digital biomarkers.
Within this context, the TELENEURART project addresses the need for an integrated ICT infrastructure able to support the collection, harmonization, transformation, and analysis of multimodal data related to pediatric patients affected by complex neurological conditions, including congenital and acquired brain injuries, autism spectrum disorders, intellectual disabilities, and neuromuscular diseases. The project is aimed at building a clinical-technological network in which data generated at different hospitals can be made interoperable and analytically exploitable while preserving local autonomy and compliance with privacy regulations. The final objective is to enable the construction of shared, high-quality datasets suitable for longitudinal and cross-sectional analyses, as well as for the application of AI methods targeted at digital biomarker discovery and personalized rehabilitation planning.
A central contribution of TELENEURART lies in the design of an ETL-oriented architecture for multicenter clinical data integration. ETL (Extract, Transform, Load) is a data integration process consisting of three steps: extraction, which pulls raw data from one or more source systems (databases, files, APIs, etc.); transformation, which cleans, converts, enriches, and restructures the data to make it suitable for analysis (e.g., standardizing formats, calculating new fields, removing duplicates); and loading, which inserts the transformed data into a target database, data warehouse, or data lake for querying and reporting. In this framework, REDCap was adopted as the primary platform for structured data collection and management, providing a standardized and controlled environment for case report form design, data entry, and harmonized acquisition across centers. The use of REDCap represents a key enabling factor for the project, as it supports schema consistency, improves traceability of collected variables, and facilitates the downstream implementation of data transformation and quality assurance processes. Nevertheless, because participating centers differ in terms of local workflows, storage systems, and clinical protocols, REDCap alone is not sufficient to guarantee full interoperability. For this reason, the core research effort focused on the definition of an ETL pipeline capable of bridging local heterogeneity and producing a unified analytical layer for AI-ready data exploitation.
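To make the three phases concrete, the minimal Python sketch below illustrates the general ETL pattern under simplifying assumptions: the REDCap API endpoint, project token, field names, and the SQLite staging target are placeholders, and the transformation rules are illustrative rather than those actually used in the project.

```python
import requests
import pandas as pd
import sqlite3

# --- Extract: pull raw records from a REDCap project via its API ---
# URL and token are placeholders; each center would use its own credentials.
REDCAP_URL = "https://redcap.example-hospital.org/api/"
API_TOKEN = "REPLACE_WITH_PROJECT_TOKEN"

def extract_records() -> pd.DataFrame:
    """Export all records of a REDCap project as a flat JSON table."""
    response = requests.post(REDCAP_URL, data={
        "token": API_TOKEN,
        "content": "record",
        "format": "json",
        "type": "flat",
    })
    response.raise_for_status()
    return pd.DataFrame(response.json())

# --- Transform: clean and restructure the raw export ---
def transform(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()
    df = df.drop_duplicates()                              # remove duplicate rows
    df.columns = [c.strip().lower() for c in df.columns]   # standardize field names
    # Example of calculating a new field (assumes hypothetical columns exist).
    if {"weight_kg", "height_m"}.issubset(df.columns):
        df["bmi"] = pd.to_numeric(df["weight_kg"], errors="coerce") / (
            pd.to_numeric(df["height_m"], errors="coerce") ** 2)
    return df

# --- Load: write the transformed table into a target store ---
def load(df: pd.DataFrame, db_path: str = "staging.db") -> None:
    with sqlite3.connect(db_path) as conn:
        df.to_sql("clinical_records", conn, if_exists="replace", index=False)

if __name__ == "__main__":
    load(transform(extract_records()))
```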
The first phase of the work addressed the mapping of data sources and semantic alignment of variables across the participating clinical centers. The analysis considered structured clinical records, neuropsychological and psychometric scales, diagnostic imaging data, physiological signals acquired through sensors, and outputs generated by serious games used in rehabilitative settings. This assessment highlighted substantial discrepancies in variable encoding, temporal granularity, update frequency, sampling rate, measurement units, and storage formats. To mitigate such heterogeneity, a semantic normalization process was designed, partly grounded in shared clinical frameworks such as the International Classification of Functioning, Disability and Health (ICF). This step was essential for defining a common data model capable of supporting integration across sites and reducing ambiguity in the interpretation of clinical variables.
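A minimal sketch of such a normalization step is given below; the local field names, the ICF-style codes, and the unit conversions are hypothetical examples introduced only to illustrate how per-center variables could be mapped onto a common data model, not the project's actual mapping tables.

```python
import pandas as pd

# Per-center mapping from local variable names to the shared data model.
# Field names, units, and ICF-style codes are illustrative placeholders.
VARIABLE_MAP = {
    "center_A": {
        "gmfcs_level": {"target": "gross_motor_function", "icf": "d450", "unit": None},
        "peso":        {"target": "body_weight",          "icf": None,   "unit": "kg"},
    },
    "center_B": {
        "gross_motor": {"target": "gross_motor_function", "icf": "d450", "unit": None},
        "weight_g":    {"target": "body_weight",          "icf": None,   "unit": "g"},
    },
}

# Conversions into the units chosen for the common data model.
UNIT_CONVERSIONS = {("g", "kg"): lambda v: v / 1000.0}
TARGET_UNITS = {"body_weight": "kg"}

def normalize(df: pd.DataFrame, center: str) -> pd.DataFrame:
    """Rename local variables, align them with the common model, convert units."""
    mapping = VARIABLE_MAP[center]
    out = pd.DataFrame(index=df.index)
    for local_name, spec in mapping.items():
        if local_name not in df.columns:
            continue
        values = df[local_name]
        source_unit = spec["unit"]
        target_unit = TARGET_UNITS.get(spec["target"])
        if source_unit and target_unit and source_unit != target_unit:
            values = pd.to_numeric(values, errors="coerce").map(
                UNIT_CONVERSIONS[(source_unit, target_unit)])
        out[spec["target"]] = values
    out["source_center"] = center
    return out
```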
The core of the proposed solution is the ETL pipeline, designed as the main ICT component of the TELENEURART infrastructure. In the Extract phase, data acquisition strategies were defined to support the retrieval of information from local datasets and REDCap-managed collections while taking into account the technological constraints of each hospital. This phase required the identification of data sources, the characterization of source-specific formats, and the specification of access procedures compatible with decentralized governance policies. In the Transform phase, particular emphasis was placed on data quality and harmonization. Transformation rules were defined to perform completeness checks, missing value management, deduplication, normalization of coding systems and measurement units, validation of temporal consistency, and syntactic/semantic standardization of variables. This stage was conceived not as a simple preprocessing layer, but as a methodological component crucial for ensuring that the resulting datasets would be consistent, comparable, and reusable for advanced analytics.
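The sketch below shows, under assumed column names (e.g. assessment_score, height_cm, enrollment_date), how a few of these transformation rules might be expressed with pandas; it is an illustrative fragment, not the project's actual rule set.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Simple data-quality summary: per-field completeness and duplication."""
    return {
        "completeness_per_field": (1 - df.isna().mean()).round(3).to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "n_records": len(df),
    }

def harmonize(df: pd.DataFrame) -> pd.DataFrame:
    """Apply illustrative transformation rules before loading."""
    out = df.drop_duplicates().copy()

    # Missing-value management: flag missing clinical scores rather than impute silently.
    if "assessment_score" in out.columns:
        out["assessment_score_missing"] = out["assessment_score"].isna()

    # Normalization of measurement units (assumes height recorded in centimeters).
    if "height_cm" in out.columns:
        out["height_m"] = pd.to_numeric(out["height_cm"], errors="coerce") / 100.0

    # Temporal consistency: an assessment cannot precede the enrollment date.
    if {"enrollment_date", "assessment_date"}.issubset(out.columns):
        enrol = pd.to_datetime(out["enrollment_date"], errors="coerce")
        assess = pd.to_datetime(out["assessment_date"], errors="coerce")
        out["temporal_inconsistency"] = assess < enrol

    return out
```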
A particularly relevant aspect of the transformation layer concerns privacy-preserving data processing. Since the project operates in a multicenter clinical environment involving sensitive pediatric information, the ETL design explicitly incorporates anonymization and pseudonymization procedures. These include the separation of identifying information from clinical variables, minimization of sensitive attributes, and the assignment of consistent pseudonymous keys to enable longitudinal tracking without direct exposure of patient identity. Such mechanisms were defined in accordance with European data protection requirements and with the internal policies of the participating healthcare institutions. From an ICT perspective, this privacy-aware ETL design is a critical prerequisite for enabling secure data reuse in AI workflows.

In the Load phase, the processed data are organized into a shared data warehouse designed to host both structured and semi-structured information while preserving metadata, provenance, and versioning. This architectural choice is particularly important for AI applications, as model reliability depends heavily on the traceability of source data, reproducibility of preprocessing operations, and explicit management of data lineage. The resulting warehouse is intended not only as a storage layer, but as a computationally meaningful repository supporting future statistical analyses, machine learning pipelines, and digital biomarker extraction tasks.
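As an illustration of how consistent pseudonymous keys, separation of identifiers, and provenance-aware loading can be combined, the following sketch uses a keyed hash (HMAC-SHA256) and a SQLite target; the secret-key handling, column names, and warehouse schema are simplified placeholders and not the project's actual implementation.

```python
import hashlib
import hmac
import sqlite3
from datetime import datetime, timezone

import pandas as pd

# Secret key held by the data controller; never stored alongside clinical data.
PSEUDONYM_KEY = b"REPLACE_WITH_CENTER_SECRET"

def pseudonymize(patient_id: str) -> str:
    """Derive a stable pseudonymous key from a patient identifier (keyed hash)."""
    return hmac.new(PSEUDONYM_KEY, patient_id.encode("utf-8"),
                    hashlib.sha256).hexdigest()[:16]

def split_identifiers(df: pd.DataFrame,
                      id_cols=("patient_id", "name", "date_of_birth")):
    """Separate direct identifiers from clinical variables, keeping only the
    pseudonymous key in the analytical table."""
    df = df.copy()
    df["pseudonym"] = df["patient_id"].astype(str).map(pseudonymize)
    identifiers = df[list(id_cols) + ["pseudonym"]]   # retained locally
    clinical = df.drop(columns=list(id_cols))         # shared for analysis
    return identifiers, clinical

def load_with_provenance(clinical: pd.DataFrame, center: str,
                         db_path: str = "warehouse.db") -> None:
    """Load the de-identified table together with provenance metadata."""
    clinical = clinical.copy()
    clinical["source_center"] = center
    clinical["etl_version"] = "0.1.0"
    clinical["loaded_at"] = datetime.now(timezone.utc).isoformat()
    with sqlite3.connect(db_path) as conn:
        clinical.to_sql("observations", conn, if_exists="append", index=False)
```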
Overall, the work carried out in TELENEURART demonstrates that, in pediatric neurological research, the effectiveness of AI solutions depends critically on the prior availability of a robust ICT infrastructure centered on ETL processes, data standardization, and privacy-aware integration. The project shows how REDCap-based data collection, combined with a carefully designed ETL pipeline and data warehouse architecture, can provide the foundation for transforming fragmented multicenter clinical data into interoperable and AI-ready resources. This represents a necessary step toward the identification of digital biomarkers and the development of more personalized, data-driven rehabilitative pathways in complex pediatric neurology.