Journal of Postgraduate Medicine, Vol. 46, No. 3, July-September, 2000, pp. 199-204
E-Medicine
Clinical Patient Record Systems Architecture: An Overview
Nadkarni PM
Centre for Medical Informatics, Yale University School of Medicine, New Haven, Connecticut, USA
Code Number: jp00071
Creation of a general-purpose medical record is one of the more difficult problems in database design. In the USA, most medical institutions have much more electronic information on a patient's financial and insurance history than on the patient's medical record. Financial information, like orthodox accounting information, is far easier to computerize and maintain, because the information is fairly standardized. Clinical information, by contrast, is extremely diverse. Signal and image data (X-rays, ECGs) require much storage space and are more challenging to manage. Mainstream relational database engines developed the ability to handle image data less than a decade ago, and the mainframe-style engines that run many medical database systems have lagged technologically. One well-known system has been written in assembly language for an obsolescent class of mainframes that IBM sells only to hospitals that have elected to purchase this system.
Clinical patient record systems (CPRSs) are designed to review clinical information that has been gathered through a variety of mechanisms, and to capture new information. From the perspective of review, which implies retrieval of captured data, CPRSs can retrieve data in two ways. They can show data on a single patient (specified through a patient ID) or they can be used to identify a set of patients (not known in advance) who happen to match particular demographic, diagnostic or clinical parameters. That is, retrieval can be either patient-centric or parameter-centric.
Patient-centric retrieval is important for real-time clinical decision support. "Real time" means that the response should be obtained within seconds (or a few minutes at the most), because the availability of current information may mean the difference between life and death.
Parameter-centric retrieval, by contrast, involves processing large volumes of data: response time is not particularly critical, however, because the results are used for purposes like long-term planning or for research, as in retrospective studies.
In general, on a single machine, it is possible to create a database design that performs either patient-centric retrieval or parameter-centric retrieval, but not both. The challenges are partly logistic and partly architectural. From the logistic viewpoint, in a system meant for real-time patient query, a giant parameter-centric query that processed half the records in the database would not be desirable because it would steal machine cycles from critical patient-centric queries. Many database operations, both business and medical, therefore periodically copy data from a "transaction" (patient-centric) database, which captures primary data, into a parameter-centric "query" database on a separate machine in order to get the best of both worlds. Some commercial patient record systems, such as the 3M Clinical Data Repository (CDR),1 are composed of two subsystems, one that is transaction-oriented and one that is query-oriented.
Patient-centric query is considered more critical for day-to-day operation, especially in smaller or non-research-oriented institutions. Many vendors therefore offer parameter-centric query facilities as an additional package separate from their base CPRS offering.
We now discuss the architectural challenges, and consider why creating an institution-wide patient database poses significantly greater hurdles than creating one for a single department.
The Protocol-Oriented Workup
During a routine check-up, a clinician goes through a standard checklist in terms of history, physical examination and laboratory investigations. When a patient has one or more symptoms suggesting illness, however, a whole series of questions is asked, and investigations performed (by a specialist if necessary), which would not be asked/performed if the patient did not have these symptoms. These are based on the suspected (or apparent) diagnosis/-es. Proformas (protocols) have been devised that simplify the patient's workup for a general examination as well as many disease categories. The clinical parameters recorded in a given protocol have been worked out by experience over years or decades, though the types of questions asked, and the order in which they are asked, vary with the institution (or vendor package, if data capture is electronically assisted). The level of detail is often left to individual discretion: clinicians with a research interest in a particular condition will record more detail for that condition than clinicians who do not. A certain minimum set of facts must be gathered for a given condition, however, irrespective of personal or institutional preferences.
The objective of a protocol is to maximize the likelihood of detection and recording of all significant findings in the limited time available. One records both positive findings and significant negatives (e.g., no history of alcoholism in a patient with cirrhosis). New protocols are continually evolving for emergent disease complexes such as AIDS. While protocols are typically printed out (both for the benefit of possibly inexperienced residents, and to form part of the permanent paper record), experienced clinicians often have them committed to memory.
However, the difference between an average clinician and a superb one is that the latter knows when to depart from the protocol: if departure never occurred, new syndromes or disease complexes would never be discovered. In any case, the protocol is the starting point when we consider how to store information in a CPRS.
Representing Clinical Data in a Database
CPRSs continue to be an area of active research. This paper, however, focuses on the mechanism by which data is stored and retrieved, rather than the ancillary functions provided by the system, such as implementation of clinical guidelines,2 problem-oriented data capture3,4 or therapy support.5
The obvious approach for storing clinical data is to record each type of finding in a separate column in a table. In the simplest example of this, the so-called "flat-file" design, there is only a single value per parameter for a given patient encounter. Systems that capture standardised data related to a particular specialty (e.g., an obstetric examination, or a colonoscopy) often do this. This approach is simple for non-computer-experts to understand, and also easiest to analyse by statistics programs (which typically require flat files as input). A system that incorporates problem-specific clinical guidelines is easiest to implement with flat files, as the software engineering for data management is relatively minimal.
In certain cases, an entire class of related parameters is placed in a group of columns in a separate table, with multiple sets of values. For example, laboratory information systems, which support labs that perform hundreds of kinds of tests, do not use one column for every test that is offered. Instead, for a given patient at a given instant in time, they store pairs of values consisting of a lab test ID and the value of the result for that test. Similarly for pharmacy orders, the values consist of a drug/medication ID, the preparation strength, the route, the frequency of administration, and so on. When one is likely to encounter repeated sets of values, one must generally use a more sophisticated approach to managing data, such as a relational database management system (RDBMS). Simple spreadsheet programs, by contrast, can manage flat files, though RDBMSs are also more than adequate for that purpose.
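The groups-of-columns layout for laboratory results can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module; the table names, column names and sample values (lab_test, lab_result, patient 101) are assumptions made for the example and do not come from any particular laboratory system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lab_test (
    test_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE lab_result (
    patient_id INTEGER NOT NULL,
    drawn_at   TEXT    NOT NULL,   -- time the sample was drawn
    test_id    INTEGER NOT NULL REFERENCES lab_test(test_id),
    value      REAL    NOT NULL
);
""")

# Illustrative (made-up) tests and results.
conn.executemany("INSERT INTO lab_test VALUES (?, ?)",
                 [(1, "serum potassium"), (2, "haemoglobin")])
conn.executemany("INSERT INTO lab_result VALUES (?, ?, ?, ?)", [
    (101, "2000-01-05T08:00", 1, 4.1),
    (101, "2000-01-05T08:00", 2, 13.2),
    (101, "2000-01-06T08:00", 1, 3.9),
])

# Offering a new kind of test means one more row, not one more column.
conn.execute("INSERT INTO lab_test VALUES (3, 'serum sodium')")

# Repeated values of one test for one patient, in time order.
potassium = conn.execute(
    "SELECT drawn_at, value FROM lab_result "
    "WHERE patient_id = ? AND test_id = ? ORDER BY drawn_at",
    (101, 1),
).fetchall()
```

Because each result is a row holding a test ID and a value, the same patient can have any number of results for the same test, which a one-column-per-test flat file cannot express.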
The one-column-per-parameter approach, unfortunately, does not scale up when considering an institutional database that must manage data across dozens of departments, each with numerous protocols. (By contrast, the groups-of-columns approach scales well, as we shall discuss later.) The reasons for this are discussed below.
One obvious problem is the sheer number of tables that must be managed. A given patient may, over time, have any combination of ailments that span specialities: cross-departmental referrals are common even for inpatient admission episodes. In most Western European countries where national-level medical records on patients go back over several decades, using such a database to answer the question, "tell me everything that has happened to this patient in forward/reverse chronological order" involves searching hundreds of protocol-specific tables, even though most patients may not have had more than a few ailments.
Some clinical parameters (e.g., serum enzymes and electrolytes) are relevant to multiple specialities, and, with the one-protocol-per-table approach, they tend to be recorded redundantly in multiple tables. This violates a cardinal rule of database design: a single type of fact should be stored in a single place. If the same fact is stored in multiple places, cross-protocol analysis becomes needlessly difficult because all tables where that fact is recorded must be first tracked down.
The number of tables keeps growing as new protocols are devised for emergent conditions, and the table structures must be altered if a protocol is modified in the light of medical advances. In a practical application, it is not enough merely to modify or add a table: one must alter the user interface to the tables, that is, the data-entry/browsing screens that present the protocol data. While some system maintenance is always necessary, endless redesign to keep pace with medical advances is tedious and undesirable.
A simple alternative to creating hundreds of tables suggests itself. One might attempt to combine all facts applicable to a patient into a single row. Unfortunately, across all medical specialities, the number of possible types of facts runs into the hundreds of thousands. Today's database engines permit a maximum of 256 to 1024 columns per table, and one would require hundreds of tables to allow for every possible type of fact. Further, medical data is time-stamped, i.e., the start time (and, in some cases, the end time) of patient events is important to record for the purposes of both diagnosis and management. Several facts about a patient may have a common time-stamp, e.g., serum chemistry or haematology panels, where several tests are done at a time by automated equipment, all results being stamped with the time when the patient's blood was drawn.
Even if databases did allow a potentially infinite number of columns, there would be considerable wastage of disk space, because the vast majority of columns would be inapplicable (null) for a single patient event. (Even null values use up a modest amount of space per null fact.) Some columns would be inapplicable to particular types of patients, e.g., gyn/obs facts would not apply to males.
The Entity-Attribute-Value Approach: Principles and History
The challenges to representing institutional patient data arise from the fact
that clinical data is both highly heterogeneous and sparse.
The design solution that deals with these problems is called the entity-attribute-value
(EAV) model. In this design, the parameters (attribute is
a synonym of parameter) are treated as data recorded in an attribute definitions
table, so that addition of new types of facts does not require database
restructuring by addition of columns. Instead, more rows are added to
this table. The patient data table (the EAV table) records an entity
(a combination of the patient ID, clinical event, and one or more date/time
stamps recording when the events recorded actually occurred), the attribute/parameter,
and the associated value of that attribute. Each row of such a table
stores a single fact about a patient at a particular instant in time.
For example, a patient's laboratory value may be stored as: ( Attribute-value pairs themselves are used in non-medical areas to manage extremely
heterogeneous data, e.g., in Web "cookies" (text files written by
a Web server to a user's local machine when the site is being browsed), and
the Microsoft Windows registries. The first major use of EAV for clinical data
was in the pioneering HELP system built at LDS Hospital in Utah starting from
the late 70s.6-8 HELP originally stored all data (characters, numbers
and dates) as ASCII text in a pre-relational database. (ASCII, for American Standard
Code for Information Interchange, is the code used by computer hardware almost
universally to represent characters. The range of 256 characters is adequate
to represent the character set of most European languages, but not ideographic
languages such as Mandarin Chinese.) The modern version of HELP, as well as
the 3M CDR, which is a commercialisation of HELP, uses a relational engine.
A team at Columbia University was the first to enhance EAV design to use relational
database technology. The Columbia-Presbyterian CDR9,10 also separated
numbers from text in separate columns. The advantage of storing numeric data
as numbers instead of ASCII is that one can create useful indexes on these numbers.
(Indexes are a feature of database technology that allow fast search for particular
values in a table, e.g., laboratory parameters within or beyond a particular
range. When numbers are stored as ASCII text, an index on such data is useless
for range searches: the text "12.5" is greater than "11000", because it comes
later in alphabetical order.) Some EAV databases therefore segregate data by
data type. That is, there are separate EAV tables for short text, long text
(e.g., discharge summaries), numbers, dates, and binary data (signal and image
data). For every parameter, the system records its data type so that one knows
where it is stored. ACT/DB,11,12 a system for management of clinical
trials data (which shares many features with CDRs) created at Yale University
by a team led by this author, uses this approach.
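The ordering problem with numbers stored as text, and the idea of routing each attribute to a type-specific table, can be illustrated with a short Python sketch. The attribute dictionary and the eav_* table names below are hypothetical and serve only to show the principle.

```python
# Hypothetical attribute dictionary: every parameter records its data type,
# so the system knows which EAV table holds its values.
ATTRIBUTE_TYPES = {
    "haemoglobin": "number",
    "discharge_summary": "long_text",
    "admission_date": "date",
}

def table_for(attribute):
    """Return the (hypothetical) type-specific EAV table for an attribute."""
    return "eav_" + ATTRIBUTE_TYPES[attribute]

values_as_text = ["12.5", "11000", "9"]
values_as_numbers = [float(v) for v in values_as_text]

# Text compares character by character, so "11000" sorts before "12.5".
text_order = sorted(values_as_text)
number_order = sorted(values_as_numbers)

# A range search "between 10 and 20" gives a wrong answer on text.
text_hits = [v for v in values_as_text if "10" <= v <= "20"]
number_hits = [v for v in values_as_numbers if 10 <= v <= 20]
```

Here text_hits wrongly includes "11000", whereas number_hits contains only 12.5. An index built on the text column would order its entries in the same misleading way.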
From the conceptual viewpoint (i.e., ignoring data type issues), one may therefore
think of a single giant EAV table for patient data, containing one row per fact
for a patient at a particular date and time. To answer the question "tell
me everything that has happened to patient X", one simply gathers all rows
for this patient ID (this is a fast operation because the patient ID column
is indexed), sorts them by the date/time column, and then presents this information
after "joining" to the Attribute definitions table. The last operation
ensures that attributes are presented to the user in ordinary language - e.g.,
"haemoglobin," instead of as cryptic numerical IDs.
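The patient-centric retrieval just described (gather the rows for one patient, sort by time, join to the attribute definitions) might look like the following sketch, written with Python's built-in sqlite3 module. The schema, the names eav and attribute_def, and the sample facts are assumptions for illustration; values are kept as text because the conceptual view ignores data types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE attribute_def (
    attribute_id INTEGER PRIMARY KEY,
    name         TEXT NOT NULL
);
CREATE TABLE eav (
    patient_id   INTEGER NOT NULL,
    event_time   TEXT    NOT NULL,
    attribute_id INTEGER NOT NULL REFERENCES attribute_def(attribute_id),
    value        TEXT
);
-- The index on patient ID is what makes the lookup fast.
CREATE INDEX eav_patient ON eav (patient_id, event_time);
""")
conn.executemany("INSERT INTO attribute_def VALUES (?, ?)",
                 [(1, "haemoglobin"), (2, "jaundice")])
conn.executemany("INSERT INTO eav VALUES (?, ?, ?, ?)", [
    (7, "1999-03-01T09:30", 1, "13.2"),
    (7, "1998-11-12T14:00", 2, "present"),
    (8, "1999-01-20T10:00", 1, "11.0"),
])

def patient_history(conn, patient_id):
    """Every recorded fact for one patient, oldest first, with readable names."""
    return conn.execute(
        """
        SELECT e.event_time, a.name, e.value
        FROM eav AS e
        JOIN attribute_def AS a ON a.attribute_id = e.attribute_id
        WHERE e.patient_id = ?
        ORDER BY e.event_time, a.name
        """,
        (patient_id,),
    ).fetchall()

history = patient_history(conn, 7)
```

Only one table is searched regardless of how many protocols exist, which is why this query stays simple as the number of attribute types grows.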
One should mention that EAV database design has been employed primarily in
medical databases because of the sheer heterogeneity of patient data. One hardly
ever encounters it in "business" databases, though these will often
use a restricted form of EAV termed "row modelling." Examples of row
modelling are the tables of laboratory test result and pharmacy orders, discussed
earlier.
Note that most production "EAV" databases also contain
components that are designed conventionally. EAV representation is suitable
only for data that is sparse and highly variable. Certain kinds of data, such
as patient demographics (name, sex, birth date, address, etc.), are standardized
and recorded on all patients, and therefore there is no advantage in storing
them in EAV form.
Physical vs. Conceptual Schema: User Interface Issues
EAV is primarily a means of simplifying the physical schema of a database,
to be used when simplification is beneficial. However, the users conceptualise
the data as being segregated into protocol-specific tables and columns. Further,
external programs used for graphical presentation or data analysis always expect
to receive data as one column per attribute. The conceptual schema of
a database reflects the users' perception of the data. Because it implicitly
captures a significant part of the semantics of the domain being modelled, the
conceptual schema is domain-specific. A user-friendly EAV system completely
conceals its EAV nature from its end-users: its interface conforms to the conceptual
schema and creates the illusion of conventional data organisation. From the
software perspective, this implies on-the-fly transformation of EAV data into
conventional structure for presentation in forms, reports or data extracts that
are passed to an analytic program. Conversely, changes to data by end-users
through forms must be translated back into EAV form before they are saved.
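The two-way transformation can be sketched in a few lines of Python. The functions pivot and unpivot, and the sample attribute names, are illustrative assumptions rather than part of any system described here; an entity is represented as a (patient ID, time-stamp) pair.

```python
from collections import defaultdict

def pivot(eav_rows, columns):
    """Turn (entity, attribute, value) facts into one row per entity,
    with one column per attribute; missing facts become None."""
    by_entity = defaultdict(dict)
    for entity, attribute, value in eav_rows:
        by_entity[entity][attribute] = value
    return [
        {"entity": entity, **{c: facts.get(c) for c in columns}}
        for entity, facts in sorted(by_entity.items())
    ]

def unpivot(rows, columns):
    """Turn conventional rows back into EAV facts, skipping empty cells."""
    return [
        (row["entity"], c, row[c])
        for row in rows
        for c in columns
        if row[c] is not None
    ]

columns = ["haemoglobin", "wbc_count"]
eav_rows = [
    ((7, "1999-03-01T09:30"), "haemoglobin", 13.2),
    ((7, "1999-03-01T09:30"), "wbc_count", 6.1),
    ((8, "1999-01-20T10:00"), "haemoglobin", 11.0),
]

table = pivot(eav_rows, columns)
round_trip = unpivot(table, columns)
```

The pivoted table is what a form, report or statistics package would see; unpivot is the step applied before edited data is saved. Note that empty cells are simply not stored, which is how EAV avoids the wastage of null columns.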
To achieve this sleight-of-hand, an EAV system records the conceptual schema
through metadata - "dictionary" tables whose contents describe
the rest of the system. While metadata is important for any database, it is
critical for an EAV system, which can seldom function without it. ACT/DB, for
example, uses metadata such as the grouping of parameters into forms, their
presentation to the user in a particular order, and validation checks on each
parameter during data entry to automatically generate web-based data entry.
The metadata architecture and the various data entry features that are supported
through automatic generation are described elsewhere.13
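As a rough illustration of how metadata can drive data entry, the sketch below keeps the grouping, display order and validation limits of parameters in a dictionary and derives both the form layout and the checks from it. The structure, the field names and the numeric limits are all hypothetical (the limits are arbitrary plausibility bounds, not clinical reference ranges), and this is not a description of ACT/DB's actual metadata.

```python
# Hypothetical metadata: parameters grouped into a form, in display order,
# each with an arbitrary plausibility range used for validation.
FORM_METADATA = {
    "haematology": [
        {"attribute": "haemoglobin", "label": "Haemoglobin (g/dL)",
         "min": 0.0, "max": 25.0},
        {"attribute": "wbc_count", "label": "WBC count (10^9/L)",
         "min": 0.0, "max": 500.0},
    ],
}

def form_labels(form):
    """Labels to show on the generated form, in the order given by metadata."""
    return [field["label"] for field in FORM_METADATA[form]]

def validate(form, entries):
    """Return the attributes whose entered values fall outside their limits."""
    errors = []
    for field in FORM_METADATA[form]:
        value = entries.get(field["attribute"])
        if value is None:
            continue  # nothing entered for this parameter
        if not field["min"] <= value <= field["max"]:
            errors.append(field["attribute"])
    return errors

labels = form_labels("haematology")
problems = validate("haematology", {"haemoglobin": 130.0, "wbc_count": 6.1})
```

Adding a parameter to a form, or tightening a check, then becomes a change to the metadata rather than to program code.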
Limitations of EAV
EAV is not a panacea. The simplicity and compactness of EAV representation
is offset by a potential performance penalty compared to the equivalent conventional
design. For example, the simple AND, OR and NOT operations on conventional data
must be translated into the significantly less efficient set operations of Intersection,
Union and Difference respectively. For queries that process potentially large
amounts of data across thousands of patients, the impact may be felt in terms
of increased time taken to process queries. A quantitative benchmarking study
performed by the Yale group with microbiology data modelled both conventionally
and in EAV form indicated that parameter-centric queries on EAV data ran anywhere
from 2-12 times as slow as queries on equivalent conventional data.14
Patient-centric queries, on the other hand, run at the same speed or even faster
with EAV schemas, if the data is highly heterogeneous. We have discussed the
reason for the latter.
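The translation of Boolean operators into set operations can be seen in a small Python sketch. The data are invented, and for simplicity each patient is assumed to have a single culture, so that the patient ID alone identifies the entity.

```python
# Conventional design: one row per culture, one column per attribute.
conventional = [
    {"patient_id": 1, "organism": "E. coli", "site": "urine"},
    {"patient_id": 2, "organism": "E. coli", "site": "blood"},
    {"patient_id": 3, "organism": "S. aureus", "site": "urine"},
]

# A single pass over the rows evaluates AND directly on each row.
conv_and = {r["patient_id"] for r in conventional
            if r["organism"] == "E. coli" and r["site"] == "urine"}

# EAV design: each attribute of the same culture is a separate row.
eav = [
    (1, "organism", "E. coli"), (1, "site", "urine"),
    (2, "organism", "E. coli"), (2, "site", "blood"),
    (3, "organism", "S. aureus"), (3, "site", "urine"),
]

def patients_with(attribute, value):
    """One scan of the EAV rows per criterion."""
    return {p for p, a, v in eav if a == attribute and v == value}

ecoli = patients_with("organism", "E. coli")
urine = patients_with("site", "urine")

eav_and = ecoli & urine   # AND becomes Intersection
eav_or = ecoli | urine    # OR becomes Union
eav_not = ecoli - urine   # "E. coli AND NOT urine" becomes Difference
```

Each criterion on EAV data needs its own pass to build a set of patients before the sets can be combined, which is the source of the extra work compared with testing all criteria on one conventional row.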
A more practical problem with parameter-centric query is that the standard
user-friendly tools (such as Microsoft Access's Visual Query-by-Example) that
are used to query conventional data do not help very much for EAV data, because
the physical and conceptual schemas are completely different. Complicating the
issue further is that some tables in a production database are conventionally
designed. Special query interfaces need to be built for such purposes. The general
approach is to use metadata that knows whether a particular attribute has been
stored conventionally or in EAV form: a program consults this metadata, and
generates the appropriate query code in response to a user's query. A query
interface has been built with this approach for the ACT/DB system;12 this
is currently being ported to the Web.
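A minimal sketch of metadata-driven query generation follows, again using Python's sqlite3 module. The STORAGE dictionary, the table layouts and the function names are assumptions for illustration only and are not taken from the ACT/DB query interface.

```python
import sqlite3

# Hypothetical metadata recording how each attribute is physically stored.
STORAGE = {
    "sex": {"kind": "conventional", "table": "demographics", "column": "sex"},
    "culture_organism": {"kind": "eav", "attribute_id": 10},
}

def build_query(attribute):
    """Return (sql, fixed_parameters) for 'patients where attribute = value'.
    Table and column names come from trusted metadata, never from the user."""
    meta = STORAGE[attribute]
    if meta["kind"] == "conventional":
        sql = f'SELECT patient_id FROM {meta["table"]} WHERE {meta["column"]} = ?'
        return sql, ()
    sql = ("SELECT DISTINCT patient_id FROM eav "
           "WHERE attribute_id = ? AND value = ?")
    return sql, (meta["attribute_id"],)

def find_patients(conn, attribute, value):
    sql, fixed = build_query(attribute)
    return sorted(row[0] for row in conn.execute(sql, fixed + (value,)))

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE demographics (patient_id INTEGER PRIMARY KEY, sex TEXT);
CREATE TABLE eav (patient_id INTEGER, attribute_id INTEGER, value TEXT);
INSERT INTO demographics VALUES (1, 'F'), (2, 'M'), (3, 'F');
INSERT INTO eav VALUES (1, 10, 'E. coli'), (3, 10, 'S. aureus'), (2, 10, 'E. coli');
""")

females = find_patients(conn, "sex", "F")
ecoli = find_patients(conn, "culture_organism", "E. coli")
```

The caller asks the same question in the same way for both attributes; only the metadata lookup decides which physical query is generated.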
Departing from the Protocol: the Use of Free Text
and Encoding
So far, we have discussed how EAV systems can create the illusion of conventional
data organization through the use of protocol-specific forms. However, the problem
of how to record information that is not in a protocol (e.g., a clinician's impressions) has
not been addressed. One way to tackle this is to create a "general-purpose"
form that allows the data entry person to pick attributes (by keyword search,
etc.) from the thousands of attributes within the system, and then supply the
values for each. (Because the user must directly add attribute-value pairs,
this form reveals the EAV nature of the system.) In practice, however, locating
an individual attribute would take several seconds to half a minute, which
would be far too tedious for a clinician.
Therefore, clinical patient record systems also allow the storage of "free
text" - narrative in the doctor's own words. Such text, which is of arbitrary
size, may be entered in various ways. In the past, the clinician had to compose
a note comprising such text in its entirety. Today, however, "template"
programs can often provide structured data entry for particular domains (such
as chest X-ray interpretations). These programs will generate narrative text,
including boilerplate for findings that were normal, and can greatly reduce
the clinician's workload. Many of these programs use speech recognition software,
thereby improving throughput even further.
Once the narrative has been recorded, it is desirable to encode the facts captured
in the narrative in terms of the attributes defined within the system. (Among
these attributes may be concepts derived from controlled vocabularies
such as SNOMED, used by pathologists, or ICD-9, used for disease classification
by epidemiologists as well as for billing records.) The advantage of encoding
is that subsequent analysis of the data becomes much simpler, because one can
use a single code to record the multiple synonymous forms of a concept
as encountered in narrative, e.g., hepatic/liver, kidney/renal, vomiting/emesis
and so on. In many medical institutions, there are non-medical personnel who
are trained to scan narrative dictated by a clinician, and identify concepts
from one or more controlled vocabularies by looking up keywords. This process
is extremely labour-intensive, and there is ongoing informatics research focused
on automating part of the process. Currently, it appears that a computer program
cannot replace the human component entirely. This is because certain terms can
match more than one concept. For example, "anaesthesia" refers to
a procedure ancillary to surgery, or to a clinical finding of loss of sensation.
Disambiguation requires some degree of domain knowledge as well as knowledge
of the context where the phrase was encountered. The processing of narrative
text is a computer-science speciality in its own right, and a preceding article15
has discussed it in depth.
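The keyword look-up step, and the reason it cannot be fully automated, can be illustrated with a toy Python sketch. The concept codes below are invented placeholders, not real SNOMED or ICD-9 codes, and genuine narrative processing is far more involved than matching single words.

```python
import re

# Invented placeholder codes; synonyms map to the same concept.
CONCEPTS = {
    "hepatic": ["C-LIVER"], "liver": ["C-LIVER"],
    "kidney": ["C-KIDNEY"], "renal": ["C-KIDNEY"],
    "vomiting": ["C-VOMIT"], "emesis": ["C-VOMIT"],
    # One term, two possible concepts: needs a human (or context) to decide.
    "anaesthesia": ["C-ANAES-PROCEDURE", "C-ANAES-FINDING"],
}

def encode(narrative):
    """Return (codes assigned automatically, ambiguous terms for review)."""
    coded, needs_review = [], []
    for word in re.findall(r"[a-z]+", narrative.lower()):
        candidates = CONCEPTS.get(word)
        if not candidates:
            continue
        if len(candidates) == 1:
            coded.append(candidates[0])
        else:
            needs_review.append((word, candidates))
    return coded, needs_review

coded, needs_review = encode("Renal function normal; emesis after anaesthesia.")
```

Synonyms collapse to one code automatically, while the ambiguous term is set aside for disambiguation by someone who knows the context.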
Using the CPRS for Automated Decision Making: Medical
Logic Modules
Medical knowledge-based consultation programs ("expert systems")
have always been an active area of medical informatics research, and a few of
these, e.g., QMR,16,17 have attained production-level status. Many of these
programs, however, are designed to be stand-alone. While useful for assisting
diagnosis or management, they require that information that may already be in
the patient's electronic record be re-entered through
a dialog between the program and the clinician. In the context of a hospital,
it is desirable to implement embedded knowledge-based systems that can
act on patient data as it is being recorded or generated, rather than after
the fact (when it is often too late). Such a program might, for example, detect
potentially dangerous drug interactions based on a particular patient's prescription
that had just been recorded in the pharmacy component of the CPRS. Alternatively,
a program might send an alert (by pager) to a clinician if a particular patient's
monitored clinical parameters deteriorated severely.
The units of program code that operate on incoming patient data in real-time
are called medical logic modules (MLMs), because they are used to express medical
decision logic. While one could theoretically use any programming language (combined
with a database access language) to express this logic, portability is an important
issue: if you have spent much effort creating an MLM, you would like to share
it with others. Ideally, others would not have to rewrite your MLM to run on
their system, but could install and use it directly. Standardization is therefore
desirable. In 1994, several CPRS researchers proposed a standard MLM language
called the Arden syntax.18-20 Arden resembles BASIC (it is designed
to be easy to learn), but has several functions that are useful to express medical
logic, such as the concepts of the earliest and the latest patient
events. One must first implement an Arden interpreter or compiler for a particular
CPRS, and then write Arden modules that will be triggered after certain events.
The Arden code is translated into specific database operations on the CPRS that
retrieve the appropriate patient data items, and operations implementing the
logic and decision based on that data. As with any programming language, interpreter
implementation is not a simple task, but it has been done for the Columbia-Presbyterian
and HELP CDRs: two of the informaticians responsible for defining Arden, Profs.
George Hripcsak and T. Allan Pryor, are also lead developers for these respective
systems. To assist Arden implementers, the specification of version 2 of Arden,
which is now a standard supported by HL7, is available on-line.20
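The general shape of an event-triggered rule can be sketched in Python. This is deliberately not Arden syntax: the names Observation, latest, earliest, low_potassium_rule and on_new_observation are hypothetical, and the threshold is an arbitrary illustrative number rather than clinical guidance.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Observation:
    patient_id: int
    attribute: str
    time: str      # ISO-8601 text, so string order is time order
    value: float

def latest(observations, attribute):
    """Most recent observation of an attribute, or None if there is none."""
    matches = [o for o in observations if o.attribute == attribute]
    return max(matches, key=lambda o: o.time) if matches else None

def earliest(observations, attribute):
    """Oldest observation of an attribute, or None if there is none."""
    matches = [o for o in observations if o.attribute == attribute]
    return min(matches, key=lambda o: o.time) if matches else None

def low_potassium_rule(observations, threshold=3.0):
    """Hypothetical if-then rule; the threshold is illustrative only."""
    obs = latest(observations, "serum_potassium")
    if obs is not None and obs.value < threshold:
        return (f"ALERT patient {obs.patient_id}: "
                f"serum potassium {obs.value} at {obs.time}")
    return None

def on_new_observation(store, observation, rules):
    """Record the incoming fact, then run every rule on that patient's data."""
    store.append(observation)
    patient_obs = [o for o in store if o.patient_id == observation.patient_id]
    alerts = []
    for rule in rules:
        message = rule(patient_obs)
        if message is not None:
            alerts.append(message)
    return alerts

store = []
rules = [low_potassium_rule]
first = on_new_observation(
    store, Observation(7, "serum_potassium", "2000-01-05T08:00", 4.1), rules)
second = on_new_observation(
    store, Observation(7, "serum_potassium", "2000-01-06T08:00", 2.6), rules)
```

The rule is evaluated as the data arrives rather than after the fact, and helper functions such as latest play the role that the corresponding built-in concepts play in an MLM language.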
Arden-style MLMs, which are essentially "if-then-else" rules, are
not the only way to implement embedded decision logic. In certain situations,
there are more efficient ways of achieving the desired result. For
example, to detect drug interactions in a pharmacy order, a program can generate
all possible pairs of drugs from the list of prescribed drugs in a particular
pharmacy order, and perform database lookups in a table of known interactions,
where information is typically stored against a pair of drugs. (The table of
interactions is typically obtained from sources such as First Data Bank.) This
is a much more efficient (and more maintainable) solution than sequentially
evaluating a large list of rules embodied in multiple MLMs.
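The pair-generation approach can be sketched with Python's itertools.combinations. The drug names and the interaction note are placeholders, not real pharmacological data; a production table would come from a commercial drug knowledge source.

```python
from itertools import combinations

# Placeholder interaction table keyed by an unordered pair of drugs.
KNOWN_INTERACTIONS = {
    frozenset({"drug_a", "drug_b"}): "placeholder interaction note",
    frozenset({"drug_c", "drug_d"}): "another placeholder note",
}

def find_interactions(order):
    """Check every pair of drugs in one pharmacy order against the table."""
    hits = []
    for first, second in combinations(sorted(set(order)), 2):
        note = KNOWN_INTERACTIONS.get(frozenset({first, second}))
        if note is not None:
            hits.append((first, second, note))
    return hits

hits = find_interactions(["drug_c", "drug_a", "drug_b"])
```

An order of n drugs yields n(n-1)/2 look-ups against one table, and adding a newly reported interaction means adding one row instead of writing a new rule.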
Nonetheless, appropriately designed MLMs can be an important part of the CPRS,
and Arden deserves to become more widespread in commercial CPRSs. Its currently
limited support in such systems is more due to the significant implementation
effort than to any flaw in the concept of MLMs.
Data Interchange Issues
Patient management software in a hospital is typically acquired from more than
one vendor: many vendors specialize in niche markets such as picture archiving
systems or laboratory information systems. The patient record is therefore often
distributed across several components, and it is essential that these components
be able to inter-operate with each other. Also, for various reasons, an institution
may choose to switch vendors, and it is desirable that migration of existing
data to another system be as painless as possible. Data exchange/migration is
facilitated by standardization of data interchange between systems created by
different vendors, as well as the metadata that supports system operation. Significant
progress has been made on the former front. The standard formats used for the
exchange of image data and non-image medical data are DICOM (Digital Imaging
and Communications in Medicine) and HL-7 (Health Level 7) respectively. For
example, all vendors who market digital radiography, CT or MRI devices are supposed
to be able to support DICOM, irrespective of what data format their programs
use internally. HL-7 is a hierarchical format that is based on a language specification
syntax called ASN.1 (ASN=Abstract Syntax Notation), a standard originally created
for exchange of data between libraries. HL-7's specification is quite complex,
and HL-7 is intended for computers rather than humans, to whom it can be quite
cryptic. There is a move to wrap HL-7 within (or replace it with) an equivalent
dialect of the more human-understandable XML (eXtensible Markup Language), which
has rapidly gained prominence as a data interchange standard in E-commerce and
other areas. XML also has the advantage that there are a very large number of
third-party XML tools available: for a vendor just entering the medical field,
an interchange standard based on XML would be considerably easier to implement.
Conclusions
CPRSs pose formidable informatics challenges, not all of which have been fully
solved: solutions devised by researchers are not always successful when
implemented in production systems. An issue for further discussion is security
and confidentiality of patient records. In countries such as the US where health
insurers and employers can arbitrarily reject individuals with particular illnesses
as posing too high a risk to be profitably insured or employed, it is important
that patient information should not fall into the wrong hands. Much also depends
on the code of honour of the individual clinician who is authorised to look
at patient data. In their book, "Freedom at Midnight," authors Larry
Collins and Dominique Lapierre cite the example of Mohammed Ali Jinnah's anonymous
physician (supposedly Rustom Jal Vakil) who had discovered that his patient
was dying of lung cancer. Had Nehru and others come to know this, they might
have prolonged the partition discussions indefinitely. Because Dr. Vakil respected
his patient's confidentiality, however, world history was changed.
Grant Support: NIH Grants R01 LM06843-01 from the US National Library
of Medicine and U01 CA78266-03 from the US National Cancer Institute.
References
This article is also available
in full-text from http://www.jpgmonline.com/ Copyright 2000 - Journal of Postgraduate Medicine