AIDAinformazioni Anno 42 – N. 3-4 – luglio-dicembre 2024

AIDAinformazioni

RIVISTA SEMESTRALE DI SCIENZE DELL’INFORMAZIONE

NUMERO 34

ANNO 42

LUGLIODICEMBRE 2024

cacucci

editore

bari

Editrice: Cacucci Editore S.a.s.

Via D. Nicolai, 39 – 70122 Bari (BA)

www.cacuccieditore.it

e-mail: riviste@cacuccieditore.it

Telefono 080/5214220

Proprietario della rivista:

Università della Calabria

Direttore Scientico:

Roberto Guarasci, Università della Calabria

Direttore Responsabile:

Fabrizia Flavia Sernia

Comitato scientico:

Anna Rovella, Università della Calabria;

Maria Guercio, Sapienza Università di Roma;

Giovanni Adamo, Consiglio Nazionale delle Ricerche †;

Claudio Gnoli, Università degli Studi di Pavia;

Ferruccio Diozzi, Centro Italiano Ricerche Aerospaziali;

Gino Roncaglia, Università della Tuscia;

Laurence Favier, Université Charles-de-Gaulle Lille 3;

Madjid Ihadjadene, Université Vincennes-Saint-Dénis Paris 8;

Maria Mirabelli, Università della Calabria;

Agustín Vivas Moreno, Universidad de Extremadura;

Douglas Tudhope, University of South Wales;

Christian Galinski, International Information Centre for Terminology;

Béatrice Daille, Université de Nantes;

Alexander Murzaku, College of Saint Elizabeth, USA;

Federico Valacchi, Università di Macerata.

Comitato di redazione:

Antonietta Folino, Università della Calabria;

Erika Pasceri, Università della Calabria;

Maria Taverniti, Consiglio Nazionale delle Ricerche;

Maria Teresa Chiaravalloti, Consiglio Nazionale delle Ricerche;

Assunta Caruso, Università della Calabria;

Claudia Lanza, Università della Calabria.

Segreteria di Redazione:

Valeria Rovella, Università della Calabria

AIDAinformazioni

R   S ’I

Fondata nel 1983 da Paolo Bisogno

AIDAinformazioni

R   S ’I

«AIDAinformazioni» è una rivista scientiﬁca che pubblica articoli inerenti alle Scienze dell’In-

formazione, alla Documentazione, all’Archivistica, alla Gestione Documentale e all’Organiz-

zazione della Conoscenza ma amplia i suoi conni in ulteriori campi di ricerca ani quali

la Terminologia, la Linguistica Computazionale, la Statistica Testuale, ecc. È stata fondata

nel 1983 quale rivista uciale dell’Associazione Italiana di Documentazione Avanzata e nel

febbraio 2014 è stata acquisita dal Laboratorio di Documentazione dell’Università della Ca-

labria. La rivista si propone di promuovere studi interdisciplinari oltre che la cooperazione e

il dialogo tra proﬁli professionali aventi competenze diverse, ma interdipendenti. I contributi

pubblicati arontano questioni teoriche, metodologie adottate e risultati ottenuti in attività di

ricerca o progettuali, denizione di approcci metodologici originali e innovativi, analisi dello

stato dell’arte, ecc.

«AIDAinformazioni» è riconosciuta dall’ANVUR come rivista di Classe A per l’Area 11 –

Gruppo Scientico Disciplinare 11/HIST-04 – Scienze del libro, del documento e storico-re-

ligiose e come rivista scientica per le Aree 10 – Scienze dell’antichità, ﬁlologico-letterarie e

storico-artistiche; 11 – Scienze storiche, ﬁlosoﬁche, pedagogiche e psicologiche; 12 – Scienze

giuridiche; 14 – Scienze politiche e sociali. È anche annoverata dall’ARES (Agence d’éval-

uation de la recherche et de l’enseignement supérieur) tra le riviste scientiﬁche dell’ambito

delle Scienze dell’Informazione e della Comunicazione. La rivista è, inoltre, indicizzata in:

ACNP – Catalogo Italiano dei Periodici; BASE –Bielefeld Academic Search Engine; ERIH

PLUS – European Reference Index for the Humanities and Social Sciences – EZB – Elektro-

nische Zeitschriftenbibliothek – Universitätsbibliothek Regensburg; Gateway Bayern; KVK

– Karlsruhe Virtual Catalog; e Library Catalog of Georgetown University; SBN – Italian

union catalogue; Ulrich’s; Union Catalog of Canada; LIBRIS – Union Catalogue of Swedish

Libraries; Worldcat.

I contributi sono valutati seguendo il sistema del double blind peer review: gli articoli ricevuti

sono inviati in forma anonima a due referee, selezionati sulla base della loro comprovata espe-

rienza nei topics speciﬁci del contributo in valutazione.

AIDAinformazioni

Anno 42

N. 3-4 – luglio-dicembre 2024

cacucci

editore

bari

  

2024 Cacucci Editore – Bari

Via Nicolai, 39 – 70122 Bari – Tel. 080/5214220

http://www.cacuccieditore.it e-mail: info@cacucci.it

Ai sensi della legge sui diritti d’Autore e del codice civile è vietata la

riproduzione di questo libro o di parte di esso con qualsiasi mezzo,

elettronico, meccanico, per mezzo di fotocopie, microfilms, registra-

zioni o altro, senza il consenso dell’autore e dell’editore.

Sommario

Contributi

A A, Il nuovo regolamento eIDAS e alcune “quisquilie”

archivistiche 9

F B, M T, Exploration du réseau numérique

YouTube autour de la santé des militaires: quelles sont les thématiques des

discours, les sources d’informations et les acteurs de la communication? 29

E C, L F, Assisted morbidity coding: the

SISCO.web use case for identifying the main diagnosis in Hospital

Discharge Records 51

V F, A humanistic approach to datafication 79

R P, Testimonianze di un impegno culturale per

l’Università di Salerno. Le carte di Alfonso Menna 101

F S, A B, E G,

S M, CompL-it: a Computational Lexicon of Italian 119

Rubriche

C G, Non solo libri 151

Contributi

AIDAinformazioni

ISSN 1121–0095

ISBN 979-12-5965-456-4

DOI 10.57574/596545643

pag. 51-78 (luglio-dicembre 2024)

Assisted morbidity coding: the SISCO.web

use case for identifying the main diagnosis in

Hospital Discharge Records

Elena Cardillo

, Lucilla Frattura

Abstract: Coding morbidity data using international standard diagnostic classifications is in-

creasingly important and still challenging. Clinical coders and physicians assign codes to pa-

tient episodes based on their interpretation of case notes or electronic patient records. There-

fore, accurate coding relies on the legibility of case notes and the coders’ understanding of

medical terminology. During the last ten years, many studies have shown poor reproducibility

of clinical coding, even recently, with the application of Artificial Intelligence-based models.

Given this context, the paper aims to present the SISCO.web approach designed to support

physicians in filling in Hospital Discharge Records with proper diagnoses and procedures

codes using the International Classification of Diseases (9

and 10

revisions), and, above all,

in identifying the main pathological condition. The web service leverages NLP algorithms,

specific coding rules, as well as ad hoc decision trees to identify the main condition, showing

promising results in providing accurate ICD coding suggestions.

Keywords: Coding Support Systems, Hospital Discharge Records, ICD, Morbidity coding,

Coding Rules.

1. Introduction

The proper use of standard classifications, such as the International Clas-

sification of Diseases (ICD) and coding of morbidity data has always been

fundamental for all general epidemiological and many health-management

purposes (WHO 2016). One example is the use of the information flow of

the Hospital Discharge Records (SDO) collected in national databases for mo-

nitoring hospitalization episodes provided in public and private hospitals and

thus the provision of hospital assistance. This has become an indispensable

tool for both administrative analyses (i.e., for accurate billing) and clinical

Institute of Informatics and Telematics, National Research Council (IIT-CNR), Ren-

de, Italy. elena.cardillo@iit.cnr.it.

Azienda Sanitaria Universitaria Giuliano Isontina (ASUGI), Udine, Italy. lucilla.frattu-

ra@asugi.sanita.fvg.it.

52 Elena Cardillo, Lucilla Frattura

elaborations (e.g., health quality assessment), which can bring to the planning

of new measures to support healthcare and welfare activities or to more strictly

clinical-epidemiological and outcome analyses.

In this frame, although approaches to coding vary across institutions, cli-

nical coding specialists frequently perform coding retrospectively. The assign-

ment of codes to each patient episode of care during hospitalization is deter-

mined by different factors, among others by the coder’s interpretation of the

available case notes or the completeness of the electronic health records. As a

result, accurate coding is dependent on both the intelligibility of the case notes

and the coders’ knowledge of medical terminology (Sundararajan et al. 2015).

Several studies have indicated poor reproducibility of clinical coding (Ta-

tham 2008) and poor accuracy which seems not dependent on the version of

the standard coding system used, which in the case of SDO is ICD (Quan et

al. 2014).

In recent years, even if the application of artificial intelligence (AI) has

begun to attract and, in some cases, assist clinicians in the practice of medical

coding, the performances achieved by AI models do not meet expectations.

Many studies have proven this, especially concerning inadequate levels of data

coding accuracy (less than 50%) and high computational costs (Falis et al.

2024; Soroush et al. 2024). This means that more reliable and trustworthy

systems are required to support physicians or coders in speeding up the coding

process while retaining the necessary precision.

Given this context, the paper aims to describe the results of the “SISCO.

web” project

, whose scope was to design and implement a Coding Support Sy-

stem (CSS), in the form of a web service, to improve accuracy in coding health

conditions in Italian Hospital Discharge Records (SDO). The main objective

of the service is to support Italian physicians (coders) in morbidity coding, and

more specifically in the coding of diagnoses and procedures/interventions using

ICD-9

revision, Clinical Modifications (ICD-9-CM), mandatory in Italy,

and, more notably, in identifying the “main condition” to be filled in SDOs.

The paper is structured as follows: Section 2 provides background infor-

mation on using and coding SDO, and describes the applied methodology.

Section 3 showcases the results and includes a preliminary evaluation. Section

4 presents some related works, and finally, Section 5 offers conclusions and

future directions.

The “SISCO.web” project, funded by the Friuli Venezia Giulia (FVG) Region and

coordinated by the Italian Collaborating center of the World Health Organization Family

of International Classifications (WHO-FIC) in Udine through the Azienda Sanitaria Bassa

Friulana Isontina n. 2 (incorporated now into the “Azienda sanitaria universitaria Giuliano

Isontina” - ASUGI) was executed from 2017 to 2021 and led to the development of a proto-

type (SISCO.web service) which can assist clinicians in coding SDO data using ICD-9-CM,

but it is also set up to support ICD-10 coding.

Assisted morbidity coding 53

2. Materials and Methods

2.1. Hospital Discharge Records

The Hospital Discharge Record Database was established, in Italy, with the

Decree of the Ministry of Health on 28 December 1991. It serves as a tool for

collecting information about each patient discharged from public and private

hospitalization institutions across the country. The information gathered in

each SDO includes, beyond the patient’s characteristics (e.g., age, sex, etc.),

the peculiarities of the hospitalization (e.g., institution and discharge discipli-

ne, method of discharge, etc.) and, above all, clinical features (e.g., the main

diagnosis, concomitant diagnoses, diagnostic or therapeutic procedures, and

interventions), excluding information relating to drugs administered during

hospitalization

Subsequently, other decrees introduced new regulations for the informa-

tion flow transmission to the Ministry of Health, expanded the information

content of the SDO, and adopted the international classification ICD-9-CM

version 1997 (Italian Ministry of Health 2000) for the coding of diagnoses

and diagnostic and therapeutic procedures, then updating this regulation with

the adoption of the 2007 Italian version and introducing the adoption of the

Diagnosis Related Group classification (DRG), version 24 for hospital admis-

sions (Italian Ministry of Labor, Health and Social Affairs 2008a).

In 2011, the “It.DRG Project”, coordinated by the Ministry of Health, was

launched to develop a new classification and assessment method for inpatient

care, specific and representative to the Italian context (Sforza et al. 2021).

The objective of this project was: the development and testing of an updated

version of the ICD-10 classification (International Classification of Diseases

and Health Related Problems-10

Revision) that incorporates WHO-appro-

ved updates and makes minor changes, finalizing the so-called Italian modi-

fication of ICD-10 (ICD-10-IM); the development and testing of the Italian

classification of Interventions and Procedures (CIPI), a version of the section

on procedures and interventions of ICD-9-CM modified and supplemented,

to adapt it to specific Italian needs and to provide for integration with codes

that allow for the detection of information on: (i) Procedures/treatments pro-

vided (also) in ambulatory care; (ii) Medical-surgical devices; (iii) High-cost

drugs; and iii) finally, a new version of the DRG system (Nonis et al. 2018).

Despite the significant outcomes of the “It.DRG project” for innovating

and improving SDO data management, there is a need to create a roadmap

for implementing the new classifications, especially ICD-10, in a more sim-

plified manner. This involves using crosswalking tables to ICD-10-IM and

Hospital Discharge Records database (HDR/SDO), see for details European Health

Information Portal (2023).

54 Elena Cardillo, Lucilla Frattura

confirming the planned current version of DRG classification. The attention

in this paper is paid primarily to a tool for coding diagnoses and intervention

using ICD-9-CM, with the understanding that the mentioned crosswalking

tables for coding diagnoses in ICD-10-IM can be easily implemented in the

tool’s architecture.

2.1.1. The International Classification of Disease

The International Classification of Disease is the most known and widely

used standardized WHO classification system, which was originally intended

to facilitate the statistical analysis of health data (Moriyama et al. 2011). Each

successive revision to the ICD, typically spanning 10-20 years, has sought to

address new use cases while adapting to advances in medicine and healthcare

and has continued to grow in number of total codes (Williamson et al. 2024).

The tenth version has approximately 14,000 codes for health conditions,

signs, symptoms, and reasons to encounter health services. This revision has

then been renewed with the implementation of the eleventh revision of the

classification, ICD-11 (World Health Organization 2019/2021), developed

thanks to an unprecedented collaboration between WHO working groups,

knowledge engineers and informaticians from Stanford University (USA), and

professionals all over the world to become a global standard for health data,

clinical documentation and statistical aggregation. It presents a new coding

structure compared to previous revisions and is fully digital for the first time.

The basic component is an underlying ontology database containing all ICD

entities (over 55,000 unique entities)

. The new structure, its digital nature,

and the tools provided to support the use of the classification enhanced its

application flexibility. Moreover, it is interoperable with health information

systems and other coding systems.

As mentioned above, in Italy, ICD-9-CM is used for morbidity coding,

containing over 15,000 diagnosis codes. Its use is also recommended in pri-

mary care prescription documents and for diagnoses and problems encoding

in the Italian Patient Summary (Italian Permanent working table for Digital

health in Regions and Autonomous Provinces 2010) each entity within the

ICD-9-CM is encoded by a unique identification string consisting of three to

five digits and an optional single letter prefix corresponding to a supplementa-

ry category. Practical applications of the ICD in healthcare have expanded and

now have come to include the indexing of health record data in hospitals, the

These entities include diseases, injuries, external causes, signs and symptoms, substanc-

es, drugs, anatomy, etc., pointing to about 17,000 categories, for over 120,000 clinical terms

covered, allowing the description of health conditions at any level of detail by combining

codes.

Assisted morbidity coding 55

coding of medical billing claims (Moriyama et al. 2011), and the assessment

of quality of patient care (O’Malley et al. 2005).

2.1.2. The coding of the main condition

A coded health data record can have a varying number of diagnostic codes.

Some authors, considering that there is no uniform definition of “main condi-

tion”, noted that one of these diagnoses must be coded as the main condition,

known also as “main diagnosis”, “primary diagnosis”, “principal diagnosis or

“discharge diagnosis” (Sukanya 2017).

Two definitions have been used for themain condition in ICD-coded he-

alth data: a “resource use” definition and a “reason for admission” definition.

In Italy, the first definition is implemented, as said above, in detecting and co-

ding the discharge diagnosis using ICD-9-CM, 2007 version (Italian Ministry

of Labor, Health and Social Affairs 2008b). In the Italian SDO, it is necessary

to code the main diagnosis, and several other diagnoses related to the hospital

episode of care. The mentioned national database on SDO contains more

than 290 million records (7,957,647 only in 2023). Annual reports are avai-

lable for download from the website of the Italian Ministry of Health (Italian

Ministry of Health 2024). Coding of these records is made directly by clini-

cians and health professionals, with some levels of accuracy monitoring at the

hospital and regional level before the data are sent to the Ministry of Health

periodically. This richness of data must face with its accuracy. Several Italian

studies are available showing low accuracy in coding. Hospital discharge data

were found to be specific but insensitive in many fields. For example, the

reporting of acute ischemic stroke and thrombolysis provides misleading indi-

cations about both thequantity and quality of acute ischemic stroke hospital

care in many studies (Rinaldi et al. 2003; Spolaore et al. 2005). Other studies

show that Hospital discharge records appear to poorly reflect the incidence of

amyotrophic lateral sclerosis and can be used only after clinical verification of

the diagnosis (Chiò et al. 2002). Moreover, looking at (Amodio et al. 2014),

the diagnosis of influenza seems to be overcoded. Nevertheless, based on the

retrieved evidence, administrative databases can be employed to identify pri-

mary breast cancer. The best algorithm suggested is ICD-9 or ICD-10 codes

located in theprimary position (Abraha et al. 2018). At an international level,

many studies confirmed that physicians do not code the disease in SDOs ac-

cording to the main diagnosis principles (Wang et al. 2021). It is observed that

in many cases, the main diagnosis is mistaken for an outpatient diagnosis, ma-

king it more difficult to identify when multiple diseases occur simultaneously

or in cases of complications. These studies reveal that physicians still require

support to collect, classify, analyze, and use medical record information accor-

ding to disease classification criteria.

56 Elena Cardillo, Lucilla Frattura

2.2. The SISCO.web approach

The main scope of the SISCO.web service, as mentioned above, is to sup-

port the coding of SDOs, guiding the physicians to identify and code the

main condition, allowing the most appropriate ICD-9-CM codes, and in the

future ICD-10 codes. This means that its function is to guide the user before

the compilation of the SDOs, to choose and assign appropriate ICD codes to

the diagnostic formulations available in medical record documentation col-

lected during patient hospitalization, and, further, to identify among different

diagnoses, the main one (Cardillo et al. 2019). Peculiarities of this support

system are:

• A knowledge base containing clinical concepts, related terms, and map-

pings to ICD-9-CM for managing the transition from the usual scien-

tific language to the coding language. This means, the integration of

such resources with the ICD-9-CM systematic index, the ICD-9-CM

alphabetical index, and other additional terms (synonyms, acronyms,

linguistic variants, common medical terms, etc.);

• Standardized coding rules (e.g., “diagnostic and procedure codes are

to be used at their highest level of specificity”; “three-digit codes are

to be assigned only if there are no four-digit codes within that code

category”; etc.);

• A rule engine for managing these rules, represented by the Business

Rules Management System (BRMS) “Drools”.

As shown in Fig. 1, the SISCO.web architecture includes three main layers:

1. Presentation layer: handling the interactions that users have with

the software. Here the web component, has a multi-tier architectu-

re, deployed on aTomcat web server, offering two web user interfaces

(WUIs) to support the compilation of SDOs. The WUIs make JSON

calls to the Web Services of the underlying levels, which access the data

resources built by the batch component. The two WUIs allow for two

specific tasks: i) the text encoding WUI (TEM module), which serves

as a coding tool, since it allows for searching clinical terms (diagnoses

and procedures) and suggests the most appropriate ICD-9-CM codes

based on search algorithms and related terms derived from the know-

ledge base; ii) the identification of the main diagnosis WUI (IMDM

module), based on a rule engine that implements a specific decision

tree for choosing and coding the main condition among the multiple

diagnoses selected in the previous step. These two modules will be de-

scribed in detail in the following paragraphs;

2. Application layer: handling the main code definitions and the most

basic functions of the developed application. In SISCO.web this layer

Assisted morbidity coding 57

includes five main functions which will be detailed later (e.g., search,

autocomplete, code Details, use of related Terms for improving search,

coding rules application through the Drools engine);

3. Data layer: which is mainly devoted to data storage. In fact, it houses

not only data but indexes and tables. Here the batch component is

aimed to build the data resources, i.e., the SISCO.web knowledge base,

which is stored on the Apache Lucene Index.

The Apache Lucene Index

, was chosen because it is a valid open-source

tool for retrieving data and information. It provides straightforward Java APIs

for creating text indexes and full-text search with options such as proximity

search, fuzzy search, and score-based sorting, weighted filter search.

To implement the RESTful layer of web services within the system archi-

tecture, we chose Jersey

, an open-source framework based on the JAX-RS

API using annotation-based programming, which simplifies the creation of

RESTful web services. It also facilitates the representation of data in standard

formats such as JSON, XML, and HTML.

Figure 1: SISCO.web architecture.

Apache Lucene is available for download at (Apache Lucene n.d.).

Eclipse Jersey is available for download at (Eclipse Foundation n.d.).

58 Elena Cardillo, Lucilla Frattura

The main process to reach the supported coding of morbidities and proce-

dures and the identification of the main condition is shown in Fig. 2 and can

be briefly described as follows:

1. Using the first module, i.e., TEM, the user starts searching for a dia-

gnosis (one at a time) using the ones reported in the discharge letter

(LDO) of the patient, to look for its ICD-9-CM code;

2. The system applies classic Natural Language Processing (NLP) algori-

thms such as Tokenization, text similarity algorithms to assign the most

appropriate code to the diagnosis plus Decision Trees, and Symbolic

NLP algorithms, i.e., rule-based and knowledge-based algorithms,

relying on predefined linguistic rules and knowledge representations.

For this reason, dictionaries, grammars, and ontologies are used to pro-

cess language;

3. Every time the user searches for a diagnosis and selects one of the results

suggested by the system, a list of coded diagnoses is generated to allow

the user to identify, among these diagnoses the main condition;

4. The same procedure is used to search for procedures and interventions

if reported in the discharge letters, and a second list of coded procedu-

res/interventions will be generated by the system to be used as well by

the IMDM module;

5. These two lists of codes represent the input data for the decision tree

algorithm, which, as described in Subsection 2.2.3., will guide the user

to identify the main pathological condition based on the defined co-

ding rules.

Figure 2: The SISCO.web main process.

To better understand the above-mentioned process, the next subsections

will give details on the knowledge base, the algorithms, the decision tree and

the coding rules used in the two modules to suggest the most appropriate

ICD-9-CM codes.

Assisted morbidity coding 59

2.2.1. The SISCO.web Knowledge Base

The knowledge base (KB) built for the project and used in the TEM inte-

grates a series of terminological resources related to diagnoses and interven-

tions/procedures in EHRs. The main data sources, as shown in Fig. 3, are

represented by the Italian versions of:

• ICD-9-CM (v. 2007), Systematic index of diagnoses and procedures,

considering the codes at themaximum level of specification;

• ICD-9-CM (v. 2007), Alphabetic index of diagnoses, and Alphabetic

index of procedures.

For this project, an ontological version of ICD-9-CM has been created

starting from the available ministerial tables of the classification, bringing to

the development of the ICD-9-CM Ontology in OWL.

The lists of terms present in ICD-9-CM, in some cases inappropriate or

outdated jargon, were supplemented with terms taken from other sources such

as:

• Ad hoc created glossaries of diagnoses derived from physicians’ scienti-

fic language, developed during a previous project (Cardillo et al. 2018);

• A glossary of diagnoses coded in ICD-9-CM extracted from the FVG

Emergency Department (ED) EHRs database;

• Rare Diseases terms (Prime Minister’s Decree 2017);

• Italian MeSH diagnoses and procedures terms (Istituto Superiore di

Sanità n.d.).

All the terms derived from these sources were in most cases already mapped

to the corresponding ICD-9-CM codes and were qualified as exact or approxi-

mate mapping.

Regarding the resource extracted from FVG Emergency Department “SEI

Database”, in the beginning, a list of 425 common pathological conditions in

the ED was proposed by the ED FVG regional working group. On this list,

a further analysis was performed to verify the use of technical/scientific terms

and the correctness of the ICD-9-CM coding associated to these pathological

conditions, bringing in the end to a glossary of 696 diagnoses (2,530 words)

which enriched the SISCO.web KB.

60 Elena Cardillo, Lucilla Frattura

Resources Version N. of Terms

ICD-9-CM systematic index IT- 2007 16,294

ICD-9-CM alphabetical index IT- 2007 289,834

Physicians’ Glossary of diagnoses v. 2017 1,421

Rare Diseases terms v. 2017 683

Emergency physicians’ diagnoses and

pathological conditions (SEI database)

v. 2018 696

MeSH synonyms for diagnoses and

procedures

v. 2017 641

Neoplasms related terms v. 2017 13,290

Total 322,859

Table 1: Knowledge Base SISCO.web (Cardillo et. al 2019).

As observable, the total number of terms in the KB, considering the whole

Italian ICD-9-CM resource and the above-mentioned additional resources, is

about 323,000. It’s important to note that the entire SISCO.web KB, particu-

larly the data extracted from the SEI dataset, is not publicly accessible.

Regarding the ICD-9-CM Ontology, as said above, we created a proces-

sable version of the Ministerial file published online, since the original .xls

file missed important details about each ICD-9-CM code. This information

includes descriptions, inclusion and exclusion criteria, and notes, which are

crucial for giving coding support based on ICD. To solve this problem, we

developed a script that builds a lightweight ontology in OWL which can also

be used to search for inconsistencies in the ICD-9-CM hierarchy or in the

attribute’s association. The ontology classes are based on the structure of the

ICD-9-CM systematic index. At the top level, there are two main classes re-

presenting the ICD-9-CM main sections: Diseases and Injuries, and Procedures

and Interventions. Within the Diseases and Injuries section, there are 17 classes

that correspond to the ICD-9-CM “chapters” in this category, along with two

additional classes for supplementary classifications: one for external causes of

injury and poisoning and another for factors influencing health status and contact

with health services.

Each chapter has its own class hierarchy, following the index structure that

includes blocks, categories, subcategories, and subclassifications. To help with

navigation, we labelled chapter classes with chapter numbers (e.g., Chapter

I, Chapter II) and use E and V for the above mentioned additional classes.

Similarly, in the Procedures and Interventions section, each category is organi-

zed under ranges such as the Nervous System Intervention, which covers codes

01-05. Each class/subclass in the ontology connects to the relevant data type

annotations and, when needed, to Object Properties (i.e., relationships betwe-

Assisted morbidity coding 61

en classes) and axioms. Access to the ICD-9-CM Ontology is currently restri-

cted. However, we are planning to make it available on public repositories or

GitHub shortly.

2.2.2. The Text encoding module

The first module is designed for searching the appropriate code for one or

more diagnoses and procedures/interventions mentioned in the patient’s di-

scharge letter. The user enters a diagnosis in the search box using free text, whi-

ch can be a single word or a multi-word term (T1). As the user starts typing,

the system provides suggestions for autocompletion based on the knowledge

base (KB), drawing from systematic or alphabetical indexes, MeSH synonyms,

glossaries of general practitioners or emergency physicians, rare diseases, etc.

These suggestions are the ones that have the entered text as their prefixes. Sub-

sequently, the system conducts a syntactic search on the description of each

attribute associated with ICD-9-CM classes in all types of resources in the KB.

Different weights are assigned to each attribute based on its source and posi-

tion. The search yields a list of ICD classes (diagnoses/procedures) that meet

the search criteria, i.e., one or more attributes containing T1. The results are

displayed in descending order based on their score.

To enhance the search function for the coder, the system permits filtering

of the results in the list. This is achieved by incorporating the terms used in the

query with related terms suggested by the system. These suggestions are based

on their co-occurrence with the searched term within the ICD descriptors. To

be more specific, the descriptions of the resulting ICD classes are tokenized to

extract the most significant words (stop words are not considered). Moreover,

to facilitate the tokenization and subsequent counting of term occurrences,

the following ICD attributes are to be considered:

• The main description of the ICD class, along with any supplementary

descriptions and inclusion terms in the systematic index;

• The description of the entry terms in the alphabetical index.

The system counts the number of times each token/term appears in the

list of ICD classes resulting from the search. It then arranges the terms in

descending order based on the number of occurrences and presents them to

the user as related terms in a separate box. The user can choose one of the re-

lated terms or continue entering other free text in the search box. The system

provides suggestions for autocompletion as the user enters more terms (T1,

T2, etc.). The result list of ICD codes (diagnoses or procedures, depending on

the user’s initial selection) is updated to consider the search criteria, ensuring

that one or more attributes contain all the input terms (T1, T2, etc.), and

co-occurrences, making the search more precise. A similar approach is used in

62 Elena Cardillo, Lucilla Frattura

the ICD-11 Coding Tool

, which, unlike SISCO.web, allows also to use ICD

chapters and rangesas research filters. From here on the algorithm performs

the same steps, until the user selects a specific diagnosis/procedure among the

ICD classes displayed in the search results which is always a leaf code. Once

the diagnosis/procedure is selected, the system adds it to the list of candidate

diagnoses/procedures used by the decision tree algorithm for identifying the

main condition.

Is worth mentioning that the search and coding algorithm for procedures

follows the same steps as that for diagnoses, but the Knowledge base which

supports the process is smaller. In fact, in the case of procedures, the NLP

algorithm examines only terminological resources related to interventions and

procedures, therefore fewer terms are indexed. Specifically, the search is con-

ducted almost entirely on the classes contained in the systematic index of

ICD-9-CM section procedures, as well as on the procedure terms present in

the ICD-9-CM alphabetical index, and the external resource MeSH.

2.2.3. The Identification of the main diagnosis module

To support physicians in the identification of the main condition, a deci-

sion tree was created to adhere to the WHO guidelines for morbidity coding

in ICD-10 (Zavaroni et al. 2018). This includes following, on one hand, the

WHO ICD-10 rules and guidelines for morbidity coding (WHO 2016)

which are up-to-date compared to ICD-9-CM 2007 rules, and on the other

hand the WHO definition of the main condition, i.e., «the condition, dia-

gnosed at the end of the episode of health care, primarily responsible for the

patient’s need for treatment or investigation» (WHO 2016, 147).

Furthermore, interventions and procedures were also considered in the de-

cision-making process. To manage the extensive array of ICD codes (about

5,000), they were grouped into three sets:

1. “relevant surgery”: encompassing interventions or procedures typical-

ly requiring an operating room, or those with resource consumption

comparable to operating room costs;

2. “selected non-relevant surgical interventions”: encompassing interven-

tions or procedures, other than relevant surgery, that require significant

resources, mostly higher than a non-surgical treatment of a condition;

ICD-11 Coding tool is used to find the correct ICD-11 code for a specific diagnosis

and it is connected to the ICD-11 browser to allow user to see further details for a searched

diagnosis. It is available at (WHO 2024).

This guideline has been updated during the publication of the sixth edition of ICD-10

in 2019 and later with the publication of ICD-11 release.

Assisted morbidity coding 63

3. “residual non-relevant surgical interventions”: encompassing interven-

tions or procedures that necessitate fewer resources than non-surgical

treatments.

Conditions were categorized into “conditions” (including diseases and cli-

nical manifestations or normal physiological changes) and “pathological con-

ditions” (abnormal anatomy or functioning constituting diseases).

The decision tree hierarchy includes: i) specific hospital settings which are

highly specialized by age and changes of particular conditions, such as “ne-

onatology” and “pregnancy, delivery, and puerperium”, foreseen specific or

partially specific paths; ii) paths for the other hospital settings, according to

the general rules and, iii) the interventions/procedures set. Notably, the third

group of interventions/procedures mentioned above is excluded as a viable

option for identifying the main condition.

In Summary, the coding of certain health conditions is driven by the con-

dition itself (pregnancy and related conditions, neonatal health), whereas for

others, resource consumption due to procedures is the primary determinant.

Thus, when a relevant intervention/surgery is identified, it influences the

choice of the targeted condition. The decision tree rules are integrated into

the rule engine module of the SISCO.web service.

The algorithm which determines the main condition, uses a Drools-ba-

sed rule engine. Drools is an open-source Business rule management system

(BRMS), released under the Apache License 2.0., that can easily be embedded

in any Java application, which include an inference engine based on forward

and backward chaining (Proctor 2012). The primary function of the Drools

rule engine is to match incoming data, (i.e., facts), to the conditions outlined

in the rules. It then determines whether and how to execute these rules. Key

components in Drools are the following: rules; facts that are matched against

the conditions of the rules to execute the applicable ones; a production memory

(i.e., where the rules are kept); a working memory (i.e., location for the facts)

In our implementation, the system consists of four components, developed

in Java, and utilizes the RabbitMQ message broker (see Fig. 4). The primary

component is the SISCO Drools Engine, serving as a wrapper for the Drools

engine. It takes input data that triggers the execution of one or more rules

down to a node, corresponding to a decision (leaf node), or the generation

of a request for other parameters. The modules exchange messages in JSON

format. On the web server side, the SISCO Rules Web Service component

implements a servlet for dynamically creating content based on the interaction

with the engine invoked by the main page of the SISCO.web system. The

More details on the Drools key components can be found at Red Hat, Inc., Drools rule

engine. Full documentation section (Drools n.d.).

64 Elena Cardillo, Lucilla Frattura

SISCO Rules Data Receiver and the SISCO Rule Data Sender components,

finally, act as interfaces with the message broker, transforming the asynchro-

nous communication with the broker into the classic synchronous request/

response client/web server communication.

The decision tree represents knowledge in the form of “if P then Q” rules.

In the decision tree diagram, non-leaf nodes have two outgoing arcs: YES and

NO. The rules defined for each node determine the selection of the outgoing

arc and, consequently, the next computed node, based on terminological co-

des and user responses to the engine. The rules defined on two arcs from the

same node are mutually exclusive to ensure the path’s clarity. The decision

algorithm takes two ICD-9-CM code lists as input: Pathological conditions

(PC) and Procedures and Interventions (PI). The selection of the outgoing arc

can be determined in two ways: automatically, based on the KB terminology

codes feeding the engine, or decided by the user if no knowledge is available

in the KB. If the rule engine is unable to ascertain the fulfilment of a rule

based on incoming terminological codes, or when a decision necessitates the

clinician judgment (e.g., Are the pathological conditions related to each other?),

the engine will prompt user intervention by formulating a question within the

web interface. This question may seek a binary YES/NO response (e.g., Has

it caused complications?) or the selection of one or more terminological codes

(e.g., Identify the most complex event). Subsequently, the engine will generate a

JSON message encompassing all requisite details for presenting the question,

including the query text, answer type (binary or selection of codes), and per-

missible response values (e.g., YES/NO, TRUE/FALSE, or specific codes).

Figure 4: The Rules Engine Component Diagram.

Assisted morbidity coding 65

In this way, the WUI content is automatically created by the browser, ge-

nerating fields based on the answer type. For example, radio buttons are used

for exclusive choices and check buttons for multiple choices. Fig. 4 shows two

Drools rules for states 18 and 19 in the decision tree diagram. The “S18_ask”

rule prompts the user to indicate one or more pathologies not related to the

intervention. The “S19_true” rule manages the arrival of the response and

determines the next transition from state 19 (“is it a single pathological con-

dition?”) based on whether the user has selected one or more codes among the

relevant conditions. The result of the rule execution is reaching a leaf node

associated with one or more codes suggested for the main pathological condi-

tion, which is then displayed in the SISCO.web interface.

Figure 5: Drools S18-S19 rules example.

3. Evaluation

After an internal test conducted by the project’s informaticians and termi-

nologists, a more detailed usability test was performed by three physicians:

This evaluation employed a subset of pathological conditions extracted from

the SEI database mentioned in Section 2, along with diseases and interven-

tions drawn from several anonymized patient discharge letters. These LDO

contained multiple diagnosis and interventions/procedures, particularly fo-

cusing on complex cases characterized by comorbidities and intricate diag-

66 Elena Cardillo, Lucilla Frattura

nostic definitions. The aim was to assess the tool’s effectiveness in suggesting

appropriate codes, required for completing the SDO. At this stage, the eval-

uation was more qualitative than quantitative, as the physicians were unable

to access an LDO/SDO database for the project. Nonetheless, initial results

indicate that the system performed well, successfully suggesting the most ap-

propriate ICD-9-CM diagnosis even in instances where the input text in the

search box of the TEM module was complex or included comorbidities. On

average, SISCO.web provided precise ICD-9-CM code suggestions for 80%

of 30 use cases tested by physicians, with improved accuracy when using the

related terms feature. An example of diagnosis coding (in this case “diabe-

tes”) is given in Fig. 6. Here, when a user types “diabete” (diabetes) into the

search box, the system auto-completes with suggestions like “diabete-nanis-

mo-obesità” (diabetes-nanism-obesity) and “pre-diabete” (prediabetes). After

selecting “diabete”, the system displays matching classes in the search results

section (considering all the attributes associated to the class, such as title, other

description, inclusions, exclusions, alphabetic index terms, etc.), ordered by

score. It also suggests related terms (on the left of the page) that co-occur with

“diabete” in the ICD-9-CM descriptors. The user can then select a related

term like “mellito” (mellitus), prompting the system to refine results based on

both selected terms.

At each iteration, the system displays matching classes and related co-oc-

curring terms based on user input. The search progressively narrows down

until the user identifies and selects the correct ICD-9-CM class, which is then

added to the “Selected Diagnoses” section at the bottom left of the page. Be-

fore selecting the proper code, for each code in the results list, the user can

view code details (displayed if present on the right of the page and represented

by symbols), including:

• Leaf nodes: Indicates to select a leaf code from the list presented, being

the selected code not a leaf code;

• Exclusion criteria: Lists conditions excluded by that ICD-9-CM class;

• Basic diseases attribute: Advises coding a basic disease before using the

selected code;

• Use additional codes: Recommends additional codes relevant to the se-

lected class.

These features resulted helpful for avoiding inconsistencies, providing aler-

ts to key ICD-9-CM coding rules, such as the necessity of coding a basic

disease first or using leaf codes instead of general three-digit diagnosis codes

(unknown rule by professionals or, in some cases, taken for granted).

Assisted morbidity coding 67

Figure 6: SISCO.web Interface: An example of coding for “diabetes mellitus” diagnosis.

Not completely known is also the need for the combined use of ICD-

9-CM alphabetical and systematic indexes (both part of the KB) to extend

knowledge about a code, providing references to additional codes related to

the selected one, etc. Another useful feature of the TEM module was consi-

dered the possibility to show, starting from an ICD-9-CM class in the search

results, the hierarchy of the classes, derived from the ICD-9-CM ontology,

including all the details for each code.

Regarding the second module focused on the identification of the main

condition (IMCM) the WUI, illustrated in Fig. 7, consists of three main

sections: the upper section displays the two lists of codes (for diagnoses and

procedures) selected by the user in the TEM module; the central section in-

teracts with the user during the decision tree process, and the lower section

reveals the main diagnosis once it has been identified.

When the user opens the module page, he will see two lists of codes at the

top and a progress bar further down. At this point, the backend navigates

the decision tree until it hits the first node that requires user input. At this

stage, the rule engine requests the user input the necessary parameters to con-

tinue the navigation of the tree. These may include, for instance, the “most

resource-consuming pathological condition during hospitalization” among

the coded diagnoses (in case of multiple diagnoses). The user then selects one

from a combo box, thereby entering the required parameter into the module.

Subsequently, the rule engine resumes the path of the tree until the final node

is reached, i.e., the identification of the main condition, which is finally di-

splayed to the user for confirmation via a dedicated button. The central part

of the page displays only a partial representation of the decision tree structu-

re, including nodes requiring manual input, and the final three stages. This

68 Elena Cardillo, Lucilla Frattura

should help the user understand the operations performed by the rule engine

to determine the main diagnosis.

Nevertheless, the system can autonomously perform certain steps in the

decision tree, utilizing previously provided information, the formalized coding

rules, and inferences derived from the KB.

Furthermore, the WUI provides a button that cancels the rule engine ope-

rations and returns to the text encoding module WUI.

Figure 7: SISCO.web Interface: Rule engine support to identify the main condition.

The SISCO.web system was tested both in terms of the usability and effi-

ciency of the search algorithms, by the doctors involved in the project, and in

terms of functionality and performance, by the team of computer experts and

terminologists who developed the service. The test highlighted that the search

results for ICD-9-CM diagnoses obtained using the mentioned algorithms are

substantially superimposable. However, it is noted that:

• The weights assigned to the various ICD-9-CM attributes associated

with each ICD class in the search results appear inconsistent concer-

ning the relationship between the importance of the various resources

present in the KB and the recurrence of the terms (roots);

• Hierarchical algorithm guarantees greater appropriateness in the se-

lection of ICD-9-CM categories since it maintains the relationship of

importance between the resources present in the KB even in the event

of their enrichment.

Hence, it was necessary to refine the weights assigned to the various attri-

butes

to guarantee appropriateness in the selection of ICD-9-CM categories

In particular, weights ranges from 0 to 10: the main description of the ICD class in

the systematic index was still considered the most important with weight 10, the additional

terms of the ICD class title have weight 7,5; inclusion terms have weight 2,5; alphabetical

index “entry term” has weight 2,5, while its indentations (from the first to the sixth one) were

assigned weight 0,1; neoplasm entry term in the alphabetic were assigned weight 2,5, and

Assisted morbidity coding 69

even in the event of moving KB resources from one step to another. The steps

of the algorithm implemented for the coding activity of a diagnosis were con-

firmed.

The test revealed issues in the identification of the diagnosis module, which

is almost related to the formalization and computerization of the decision tree,

particularly for some steps of the tree where the physician’s input is necessary.

This is especially true when the physician selects multiple interventions, as it’s

crucial, at a certain point of the process to indicate the relevance of each one.

The decision tree is not fully computerized in terms of additional resources

for automating certain steps (as it can be for example alist of relevant inter-

ventions aligned to anatomical sites or mapped to diagnosis categories, which

although available in pdf, is still under elaboration for the integration into the

rule engine) and allowing the physician to select multiple options. Currently,

the computerized decision tree enables the physician to identify the main pa-

thological condition by answering a series of YES/NO questions.

4. Related works

Different coding support systems have been developed in the last two de-

cades. Some of them aimed to support the coding of causes of death, gener-

ally coded using ICD-10. Examples of these tools are MICAR-ACME, of the

US National Center for Health Statistics (Israel 1990), and the IRIS system

developed by a European consortium (Pavillon et al. 2007). The main issue

encountered in these systems is the processing of natural language, which,

in the last twenty years has been faced with developing automated coding

tools based on NLP algorithms (Friedman et al. 2004). Only afew systems

were based on properly defined coding rules, as done by (Farkas and Szarvas

2008) and (Cardillo et al. 2018), both focused on the ICD-9-CM coding.

In recent years, challenges have been encountered, from the perspective of

Artificial Intelligence (AI) and NLP, based on the literature. Many researchers

and companies started applying more sophisticated methods such as Neural

Networks or Large Language Models (LLM) to enable EHR data coding (Rios

and Kavuluru 2018). This trend is confirmed also by the results of the CLEF

ICD10 task

, held in 2020, focused on ICD-10 coding for clinical textual

data in Spanish and including, in particular, two subtasks for evaluating sys-

tems that predict ICD-10-CM (diagnostic) and ICD-10-PCS (procedural)

codes using the Spanish CodiEsp corpus. Here most of the participants used

Machine learning approaches and deep learning language models (prefer-

indentations had 0,1, which is also the weight assigned to the main description of diagnoses /

procedures derived from the other glossaries in the KB.

CLEF eHealth 2020 – Task 1: Multilingual Information Extraction (CLEF eHealth Lab

Series n.d.).

70 Elena Cardillo, Lucilla Frattura

ring fine-tuned Multilingual BERT), but the highest mean average precision

(MAP) for the prediction of ICD-10 diagnostic codes (0.593) resulted by

the combination of a XGBoost classifier and a Jaro Winkler string matching

system (Miranda-Escalada et al. 2020). Other studies focused on the applica-

tion of general-purpose LLMs (e.g., ChatGPT 3.5/4, LLAMA, etc.) to test

their performances in the task of automated coding of diagnoses extracted

from Discharge summaries by using ICD-10. Nevertheless, gaps between the

current deep learning-based approach applied to clinical coding and the need

for explainability and consistency in real-world practice were reported (Dong

et al. 2022). Some studies indicate alternative methods or frameworks specif-

ically designed for automatic ICD coding. For example (Chao-Wei Huang

2022) used apre-trained language model for ICD coding, sharing a similar

idea with BERT-XML, an extension of BERT designed for ICD coding. This

model was pre-trained on a large collection of EHR clinical notes using an

EHR-specific vocabulary (Zhang et al. 2020). Additionally, (Kim and Gana-

pathi 2021) introduced the Read, Attend, and Code (RAC) framework for

accurate ICD code prediction. Another approach involved the use of off-the-

shelf pre-trained generative LLMs to perform ICD coding, without labelled

training examples and leveraging the hierarchical nature of the ICD ontolo-

gy, thus relying on dynamic searches for clinical entities within the ontology

(Boyle et al. 2023).

It is worth observing in this context the lack of available datasets for ICD

coding to train AI-based models, especially in some languages, such as Italian.

Few approaches show how to mitigate this issue. In (Almagro et al. 2019) a

cross-lingual approach based on Machine Translation methods is proposed

to code death certificates with ICD-10 through supervised learning. In brief,

they tried to code Italian death certificates using certificates from another lan-

guage (French), so combining collections of different languages to increase the

availability of coded documents. Improvements in the system performance

here were observed for codes assigned to labels with few occurrences. Silvestri

et al. (2020) conducted a study on cross-lingual XLM fine-tuning aimed at

predicting and classifying ICD-10 codes. A preliminary evaluation of a model

fine-tuned on short medical notes written in English using an Italian test set

was provided, but results indicated the need for further experiments to in-

crease the number of samples in the test set, to better assess the model’s ability

to generalize.

A more recent overview on the topic is provided by the study conducted by

the Icahn School of Medicine at Mount Sinai in New York revealed significant

shortcomings in the performance of LLMs in clinical coding. The analysis

showed that the existing models, including the highest-performing GPT-4,

achieved less than 50% accuracy in matching medical codes to clinical texts.

Such inaccuracies can result in serious billing errors and compliance issues

Assisted morbidity coding 71

within healthcare systems. The study also highlighted varying performance

levels among different LLMs, posing challenges in clinical environments

where precise coding is essential for billing and ensuring accurate patient care

(Soroush et al. 2024).

These results emphasize the need for refinement and validation of these

technologies before considering clinical implementation, thus providing cus-

tomized AI tools specifically designed for medical coding, instead of using

general-purpose LLMs.

Given this overview, we can state that SISCO.web performances are com-

parable with most of the mentioned systems. Unlike existing systems and the

most recent AI-based coding support, SISCO.web offers dual support. Firstly,

it helps in finding the appropriate ICD-9-CM (or in the future ICD-10) code

for a diagnosis or procedure by utilizing NLP techniques combined with the

application of trustworthy coding rules, which are necessary to know when

dealing with the selected classification system. Secondly, it assists in identify-

ing the main diagnosis (the most serious and/or resource-intensive during hos-

pitalization or the inpatient encounter) among multiple diagnoses, which is

often a challenging and underestimated task. The advantages of this approach

also stem from the integration of decision tree algorithms, which expand the

system’s functionalities.

5. Conclusions and future directions

This paper shows the approach used to develop a web service aimed at

supporting physicians in the compilation of the SDO, while coding the main

condition, secondary pathologies, procedures and interventions in ICD-9-CM

and, where necessary, in ICD-10. The system also proposes a module based

on a series of formal rules that represent a decision tree specifically designed

for identifying the main pathological condition, which needs to be indicated

and coded in a separate field in SDO. The evaluation of the TEM module,

allowing for the search and suggestion of ICD-9-CM coding for diseases and

procedures, has reached good performances in terms of theaccuracy of the

coding suggestions, theefficiency of the system, and regarding the usability

of the system. Differently, some limitations are highlighted concerning the

rule engine module, which allows, through a series of steps and interactions

with the user, the identification of the main diagnosis. In this case, the ini-

tial formalization of the rules provided by the decision tree did not yield the

expected results. It has therefore become necessary to update the rules and,

above all, to make available ad hoc terminological resources to be submitted

to the rule engine to automate some steps of the decision tree, thus ensuring

the required performance compared to other support systems available in the

literature. Considering that ICD-9-CM is currently mandatory in Italy for

72 Elena Cardillo, Lucilla Frattura

coding diagnosis into SDO, the prototype and tests of SISCO.web uses this

ICD version to be used in hospital coding. Nevertheless, the system has been

designed to work using also ICD-10, including a decision tree specifically

set for ICD-10 for identifying the main diagnosis. This possibility, recently,

resulted advantageously since, as mentioned in Section 1, the Italian Ministry

of Health, to be aligned to European guidelines on cross-boarding care, is wor-

king on a roadmap to shift from ICD-9-CM to ICD-10-IM for the coding

of morbidities in SDO, leveraging the results of the It.DRG project. For this

reason, future work will be the extension of the system, in terms of integration

of the KB with the Italian version of ICD-10 (the mentioned ICD-10-IM)

and the necessary crosswalking tables as well as the implementation of the alre-

ady defined ICD-10-based decision tree in the rule engine. At the same time,

it will be possible to set up versions of this support system able to manage

classifications of interventions other than those used in Italy. Another possible

future work is the development of aJavaScript library to distribute the service

to interested parties and test it on a large scale (i.e., some hospital wards). As

observed in Section 4, automated clinical coding holds promise for AI despite

the technical and organizational challenges, but coders need to be involved in

the development process, as done in the present work. Given this understan-

ding, it can be argued that SISCO.web could serve as a good compromise,

particularly if focusing on a new research direction that could be pursued over

the next five years. This would involve improving the approach using LLMs

+ Retrieval-Augmented Generation (RAG) to enhance both the text enco-

ding module and the implementation of the decision tree in the rule engine.

Another possible future work could be to use a complementary approach for

the analysis, through NLP/DL, of the diagnostic sections of hospital dischar-

ge letters (LDOs in Italy), which are very detailed reports. In our use case, a

sample of these documents was used to test the performances of SISCO.web

in terms of capacity to support coding for complex search records (e.g., co-

morbidities, very detailed diagnoses, etc.). In the future, it would be valuable

to explore the possibility of providing coding support while registering LDOs’

data, particularly in the diagnostic section.

References

Abraha, Iosief, Alessandro Montedori, Diego Serraino, et al. 2018. “Accuracy

of administrative databases in detecting primary breast cancer diagnoses:

a systematic review.” BMJ Open 8:e019264. https://doi.org/10.1136/bm-

jopen-2017-019264.

Assisted morbidity coding 73

Almagro, Mario, Raquel Martínez, Soto Montalvo, and Victor Fresno. 2019.

“A cross-lingual approach to automatic ICD-10 coding of death certifi-

cates by exploring machine translation.” Journal of biomedical informatics

94 (2019): 103207. https://doi.org/10.1016/j.jbi.2019.103207.

Amodio, Emanuele, Fabio Tramuto, Claudio Costantino, et al. 2014. “Diag-

nosis of influenza: only a problem of coding?”. Med Princ Pract 23:568-

73. https://doi.org/10.1159/000364780.

Apache Lucene. n.d. “Welcome to Apache Lucene.” Last accessed November

10, 2024. https://lucene.apache.org/.

Boyle, Joseph S., Antanas Kascenas, Pat Lok, Maria Liakata, and Alison Q.

O’Neil. 2023. “Automated clinical coding using off-the-shelf large lan-

guage models.” Accepted to the NeurIPS 2023 workshop Deep Generative

Models For Health (DGM4H), arXiv preprint arXiv:2310.06552 (2023).

Cardillo, Elena, Claudio Eccher, Anna Perri, Vincenzo Della Mea, and Fran-

cesco Talin. 2018. “A rule-based Support System for the Validation of

Diagnoses coding in the Patient Summary.” In Proceedings of the Interna-

tional Conference on Medical Informatics Europe 2018 (MIE2018), Gothen-

burg, Sweden, April 24-26, 2018.

Cardillo, Elena, Lucilla Frattura, Salvatore Ciambrini, Claudio Eccher, Elia

Nardo, and Carlo Zavaroni. 2019. “Towards the Development of a Web

Support System for Improving Accuracy in Coding Discharge Diagno-

sis.” In Proceedings of the 2019 IEEE Symposium on Computers and Com-

munications (ISCC), Barcelona, Spain, 1147-52. https://doi.org/10.1109/

ISCC47284.2019.8969649.

Chao-Wei, Huang, Shang-Chi Tsai, and Yun-Nung Chen. 2022. “PLM-ICD:

Automatic ICD coding with pre-trained language models.” In Proceedings

of the 4th Clinical Natural Language Processing Workshop, 10–20, Seattle,

WA: Association for Computational Linguistics.

Chiò, Adriano, Giovannino Ciccone, Andrea Calvo, et al. 2002. “Validity

of hospital morbidity records for amyotrophic lateral sclerosis. A pop-

ulation-based study.” J Clin Epidemiol 55(7): 723-27. https://doi.

org/10.1016/s0895-4356(02)00409-2.

CLEF eHealth Lab Series. n.d. “CLEF eHealth 2020 – Task 1: Multilingual

Information Extraction.” Last Accessed November 10, 2024. http://cle-

fehealth.imag.fr/clefehealth.imag.fr/index135c.html?page_id=187%20

%3E.

Dong, Hang, Matúš Falis, William Whiteley, et al. 2022. “Automated clinical

coding: what, why, and where we are?” NPJ Digit. Med. 5(159). https://

doi.org/10.1038/s41746-022-00705-7.

Drools. n.d. Last Accessed November 10, 2024. https://www.drools.org.

74 Elena Cardillo, Lucilla Frattura

Eclipse Foundation. n.d. “About.” Last Accessed November 10, 2024. https://

eclipse-ee4j.github.io/jersey.

European Health Information Portal. 2023. “Hospital Discharge Records da-

tabase.” Last Updated January 10, 2023. https://www.healthinformation-

portal.eu/health-information-sources/hospital-discharge-database-2.

Falis, Matúš, Gema Aryo Pradipta, Dong Hang, et al. 2024. “Can GPT-3.5

generate and code discharge summaries?” Journal of the American Medical

Informatics Association 31(10): 2284-93. https://doi.org/10.1093/jamia/

ocae132.

Farkas, Richárd, and György Szarvas. 2008. “Automatic construction of

rule-based ICD-9-CM coding systems.” BMC Bioinformatics 9 (3): S10.

https://doi.org/10.1186/1471-2105-9-S3-S10.

Friedman, Carol, Lyudmila Shagina, Yves Lussier, and George Hripcsak.

2004. “Automated Encoding of Clinical Documents Based on Natural

Language Processing.” JAMIA 11(11): 392-402. https://doi.org/10.1197/

jamia.M1552.

Israel, Robert A. 1990. “Automation of mortality data coding and processing

in the United States of America.” World Health Stat Q. 43(4): 259-62.

https://pubmed.ncbi.nlm.nih.gov/2293494/.

Italian Ministry of Health. 2000. “Ministerial Decree October 27, 2000, no.

380 - Regolamento recante norme concernenti l’aggiornamento della di-

sciplina del flusso informativo sui dimessi dagli istituti di ricovero pubbli-

ci e private.” Gazzetta Ufficiale, 19 dicembre 2000, n. 295.

Italian Ministry of Health. 2024. Rapporto sull’attività di ricovero ospedaliero.

Dati SDO Anno 2022. https://www.salute.gov.it/portale/documentazio-

ne/p6_2_2_1.jsp?lingua=italiano&id=3441.

Italian Ministry of Labor, Health and Social Affairs. 2008a. “Ministerial De-

cree December 18, 2008. “Aggiornamento dei sistemi di classificazione

adottati per la codifica delle informazioni cliniche contenute nella scheda

di dimissione ospedaliera e per la remunerazione delle prestazioni ospeda-

liere.” Gazzetta Ufficiale, 9 marzo 2009, n. 56.

Italian Ministry of Labor, Health and Social Affairs. 2008b. Classificazione del-

le malattie, dei traumatismi, degli interventi chirurgici e delle procedure dia-

gnostiche e terapeutiche. Versione italiana della ICD-9-CM, 2007. Roma:

Istituto Poligrafico e Zecca dello Stato.

Italian Permanent working table for Digital health in Regions and Autono-

mous Provinces. 2010. Specifiche tecniche per la creazione del “profilo sa-

nitario sintetico” secondo lo standard HL7-CDA rel. 2. Department for the

Digitization of Public Administration and Technological Innovation.

Assisted morbidity coding 75

Istituto Superiore di Sanità. n.d. “Medical Subject Headings 2019.” Last Ac-

cessed November 10, 2024. https://old.iss.it/site/Mesh/.

Kim, Byung-Hak, and Ganapathi Varun. 2021. “Read, attend, and code:

Pushing the limits of medical codes prediction from clinical notes by ma-

chines.” In Machine Learning for Healthcare Conference, 196-208. PMLR.

Miranda-Escalada, Antonio, Aitor Gonzalez-Agir, Jordi Armengol-Estapé,

and Martin Krallinger. 2020. “Overview of Automatic Clinical Coding:

Annotations, Guidelines, and Solutions for non-English Clinical Cases at

CodiEsp Track of CLEF eHealth 2020.” In Working Notes of CLEF 2020

- Conference and Labs of the Evaluation Forum, CEUR-WS 2696.

Moriyama, Iwao Milton, Ruth M. Loy, Alastair Hamish, Tearloch Robb-

Smith, Harry Michael Rosenberg, and Donna L. Hoyert. 2011. History

of the statistical classification of diseases and causes of death, edited and up-

dated by H. M. Rosenberg, D. L. Hoyert. DHHS publication, no. (PHS)

2011-1125.

Nonis, Marino,Luigi Bertinato,Laura Arcangeli, et al. 2018. “The evolution

of DRG system in Italy: the It-DRG project.” European Journal of Public

Health 28, no. 4 (November), cky218.095.https://doi.org/10.1093/eur-

pub/cky218.095.

O’Malley, Kimberly J., Karon F. Cook, Matt D. Price, Kimberly Raiford Wil-

des, John F. Hurdle, and Carol M. Ashton. 2005. “Measuring diagnoses:

ICD code accuracy.” Health services research 40(5p2): 1620-39. https://

doi.org/10.1111/j.1475-6773.2005.00444.x.

Pavillon, Gérard, Lars A. Johansson, D. Glenn, S. Weber, B. Witting, and S.

Notzon. 2007. “Iris: A Language Independent Coding System For Mor-

tality Data.” In WHO – Family of International Classifications Network –

FIC. Annual Meeting. Trieste, Italy, 28 October - 3 November 2007.

Prime Minister’s Decree 12 January 2017. “Definizione e aggiornamento dei

livelli essenziali di assistenza, di cui all’articolo 1, comma 7, del decreto

legislativo 30 dicembre 1992, n. 502.” Gazzetta Ufficiale, 18 marzo 2017,

no. 65, Allegato 7.

Proctor, Mark. 2012. “Drools: A Rule Engine for Complex Event Process-

ing.” In Applications of Graph Transformations with Industrial Relevance.

AGTIVE 2011, edited by A. Schürr, D. Varró, G. Varró. Lecture Notes

in Computer Science 7233. Berlin, Heidelberg: Springer. https://doi.

org/10.1007/978-3-642-34176-2_2.

Quan, Hude, Łukasz Moskal, Alan J. Forster, et al. 2014. “International vari-

ation in the definition of ‘main condition’ in ICD-coded health data.”

Int J Qual Health Care 26(5): 511-15. https://doi.org/10.1093/intqhc/

mzu064. Epub 2014 Jul 2.

76 Elena Cardillo, Lucilla Frattura

Rinaldi, Rita, Luca Vignatelli, Massimo Galeotti, Giuseppe Azzimondi G.,

and Piero De Carolis. 2003. “Accuracy of ICD-9 codes in identifying

ischemic stroke in the General Hospital of Lugo di Romagna (Italy).”

Neurol Sci 24: 65-69. https://doi.org/10.1007/s100720300074.

Rios, Anthony, and Ramakanth Kavuluru. 2018. “EMR Coding with

Semi-Parametric Multi-Head Matching Networks.” In Proceedings of the

conference. Association for Computational Linguistics. North American Chap-

ter Meeting 2018: 2081-91. https://doi.org/10.18653/v1/N18-1189.

Sforza, Vincenzo, Duilio Carusi, Luigi Bertinato, Marino Nonis, and Silvia

Surricchio. 2021 “L’approccio del PROGETTO IT.DRG per la rilevazio-

ne dei costi standard delle prestazioni ospedaliere. Il modello IT:COST.”

Bilancio Comunità Persona, n. 2: 82-116. https://dirittoeconti.it/artico-

lo-rivista/lapproccio-del-progetto-it-drg-per-la-rilevazione-dei-costi-stan-

dard-delle-prestazioni-ospedaliere-il-modello-itcost/.

Silvestri, Stefano, Francesco Gargiulo, Mario Ciampi, and Giuseppe De Pie-

tro. 2020. “Exploit multilingual language model at scale for icd-10 clini-

cal text classification.” In 2020 IEEE Symposium on Computers and Com-

munications (ISCC), Rennes, France, 2020, 1-7. https://doi.org/10.1109/

ISCC50000.2020.9219640.

Soroush, Ali, Benjamin S. Glicksberg, Eyal Zimlichman, et al. 2024. “Large

Language Models Are Poor Medical Coders - Benchmarking of Medi-

cal Code Querying.” NEJM AI 1(5) (April 19, 2024). https://doi.

org/10.1056/AIdbp2300040.

Spolaore, Paolo, Stefano Brocco, Ugo Fedeli, et al. 2005. “Measuring accu-

racy of discharge diagnoses for a region-wide surveillance of hospitalized

strokes.” Stroke 36, no. 5 (May): 1031-34. https://doi.org/10.1161/01.

STR.0000160755.94884.4a.

Sukanya Chongthawonsatid. 2017. “Validity of Principal Diagnoses in Dis-

charge Summaries and ICD-10 Coding assessments based on national

health data of Thailand.” Health Inform Res 23, no. 4 (October): 293-303.

https://doi.org/10.4258/hir.2017.23.4.293.

Sundararajan, Vijaya, Patricia S. Romano, Hude Quan, et al. 2015. “Captur-

ing diagnosis-timing in ICD-coded hospital data: recommendations from

the WHO ICD-11 topic advisory group on quality and safety.” Int J Qual

Health Care 27(4): 328-33. https://doi.org/10.1093/intqhc/mzv037.

Tatham, Andrew J. 2008. “The increasing importance of clinical coding.”

British Journal of Hospital Medicine 69(7): 372-3.

Assisted morbidity coding 77

Wang, Cheng, Chenlong Yao, Pengfei Chen, Jiamin Shi, Zhe Gu, and Zheying

Zhou. 2021. “Artificial Intelligence Algorithm with ICD Coding Technol-

ogy Guided by the Embedded Electronic Medical Record System in Med-

ical Record Information Management.” J Healthc Eng 30;2021:3293457.

https://doi.org/10.1155/2021/3293457.

Williamson, Ashton, David de Hilster, Amnon Meyers, Nina Hubig, and Amy

Apon. 2024. “Low-resource ICD Coding of Hospital Discharge Summa-

ries.” In Proceedings of the 23rd Workshop on Biomedical Language Process-

ing, August 16, 2024, 548-58. Association for Computational Linguistics.

https://aclanthology.org/2024.bionlp-1.45.pdf.

World Health Organization (WHO). 2016. “International Statistical Classi-

fication of Diseases and Related Health Problems 10th Revision. Volume

2.” Geneva: World Health Organization.

World Health Organization (WHO). 2019/2021. International Classification

of Diseases, Eleventh Revision (ICD-11). https://icd.who.int/browse11.

World Health Organization (WHO). 2024. “ICD-11 Coding Tool.” https://

icd.who.int/ct/icd11_mms/en/release.

Zavaroni, Carlo, Antonia Fanzutto, Elia Nardo, Vincenzo Della Mea, and Lu-

cilla Frattura. 2018. “Morbidity coding in ICD-11 (and ICHI): a deci-

sion tree to identify the main condition.” In WHO-FIC Annual Meeting

Booklet. Seoul, 22-27 October 2018. WHO. #307.

Zhang, Zachariah, Liu Jingshu, and Razavian Narges. 2020. “BERT-XML:

Large scale automated ICD coding using BERT pretraining.” In Pro-

ceedings of the 3rd Clinical Natural Language Processing Workshop, 24-

34, Online. Association for Computational Linguistics. https://doi.

org/10.48550/arXiv.2006.03685.

ISBN 979-12-5965-456-4 ISSN 1121-0095

AIDAinformazioni Anno 42 – N. 3-4 – luglio-dicembre 2024

AIDAinformazioni

Rivista semestrale di Scienze dell’Informazione

Anno 42

N. 3-4 – luglio-dicembre 2024

Contributi

A A

Il nuovo regolamento eIDAS e alcune “quisquilie”

archivistiche

F B, M T

Exploration du réseau numérique YouTube

autour de la santé des militaires: quelles sont les

thématiques des discours, les sources d’informations

et les acteurs de la communication?

E C, L F

Assisted morbidity coding: the SISCO.web

use case for identifying the main diagnosis in

Hospital Discharge Records

V F

A humanistic approach to datafication

R P

Testimonianze di un impegno culturale per

l’Università di Salerno. Le carte di Alfonso

Menna

F S, A B,

E G, S M

CompL-it: a Computational Lexicon of Italian

Rubriche

C G

Non solo libri

In copertina

Disegno di Paul Otlet, Collections Mundaneum, centre d’Archives, Mons (Belgique).