Global Journal of Engineering Sciences (GJES)
Some Thoughts on Reliability of Diagnoses by Human Versus by Machine
Authored by Ildikó Ziegler
Opinion
Not long ago the Comment by Tessa S Cook [1] appeared in Digital Health, following the informative article by Xiaoxuan Liu and colleagues [2]. Both papers focused on the question of whether deep learning is more reliable and accurate than human health-care professionals in detecting diseases from medical imaging.
The Comment [1] added much to the discussion, since the author listed several factors that influence the accuracy of a diagnosis made by A.I. or by a human.
The current discussion concerns diagnoses based on medical imaging techniques [1,2]. I believe the consideration of these factors can be continued by estimating their tendencies in the long run, and this may yield some inspiring thoughts, not only in connection with medical imaging but, more generally, regarding the learning capabilities of artificial intelligence.
Theoretically, the following error types can be identified (see e.g. [3,4]):
• Deviation occurring due to the lack of expert consensus
• False positive diagnoses
• False negative diagnoses
• Other accidental errors
• Other systematic errors
We may estimate the tendency of each of these errors over time, based on their statistical nature (see e.g. [5,6]). We suppose that the capacity of A.I. to handle data and to perform repetitive calculations and decisions is always kept greater than the capacity required by the actual tasks, i.e., that no technical limitation on processing the growing data set arises over time during the development of A.I.
Moreover, a single person theoretically cannot grow his or her knowledge without limit, partly because we humans forget things, and partly because other circumstances, e.g. sickness or death, may prevent us from continuing the learning process. However, humankind as a society can also be considered a group of continuous learners, and as such the achievable total knowledge of humankind seems to be unlimited.
Based on the assumptions above, let us take a look at the mentioned influencing factors one by one.
Systematic errors (other than those mentioned in the earlier groups) are characteristics of the given A.I. and of the way it functions. They do not change during the learning process, but they may decrease stepwise if they are detected and handled, for instance by upgrading the software or improving the algorithm. For a given A.I., this error is constant during the learning process, but it will decrease as A.I. technology develops over time.
Accidental errors (other than those mentioned in the earlier groups) are characterized by a Gaussian distribution [3-6]. Because the uncertainty of an estimate averaged over n independent observations shrinks in proportion to 1/√n, the standard deviation characterizing the learning process of a given A.I. will decrease as time elapses and cases accumulate during the learning period.
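To make this tendency concrete, here is a minimal sketch in Python (the true value and spread are my own illustrative assumptions, not figures from the cited papers) showing the σ/√n decrease of the standard error:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
true_value = 1.0   # hypothetical quantity being estimated
sigma = 0.5        # assumed spread of individual accidental errors

# As the number of accumulated independent observations n grows,
# the standard error of the averaged estimate shrinks as sigma / sqrt(n).
for n in (10, 100, 1000, 10000):
    samples = true_value + sigma * rng.standard_normal(n)
    print(f"n={n:6d}  theoretical std of mean = {sigma / np.sqrt(n):.4f}  "
          f"observed mean = {samples.mean():.4f}")
```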
False negative and false positive results mean that the decision about the diagnosis is not correct.
In those cases when results (diagnoses) are not independent of each other, as is typical in biostatistics and clinical decision-making, the deviations are described by Bayesian statistics [7,8]. The situation with this type of error is similar to the previous one: as the number of items in the data set on which the learning is based increases, the probability of error decreases. With elapsed time, the A.I. is still expected to improve its effectiveness in the long run.
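As an illustration of this Bayesian tendency, here is a minimal sketch (a standard Beta-Binomial update; the case and error counts are hypothetical, chosen only to show the shrinking uncertainty):

```python
from scipy import stats

# Uninformative Beta(1, 1) prior over the unknown error rate.
prior_alpha, prior_beta = 1.0, 1.0

# Hypothetical accumulating evidence: (cases reviewed, errors found).
for n_cases, n_errors in [(10, 3), (100, 20), (1000, 150)]:
    posterior = stats.beta(prior_alpha + n_errors,
                           prior_beta + n_cases - n_errors)
    print(f"after {n_cases:5d} cases: estimated error rate "
          f"= {posterior.mean():.3f} +/- {posterior.std():.3f}")
```

With each batch of cases, both the estimated error rate and the posterior uncertainty around it shrink, which is the statistical content of the claim above.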
The most complex problem is to estimate the tendency of those diagnoses on which medical doctors and specialists hold different opinions without consensus. It has to be taken into account that a human solves problems in a different way than A.I. does. Deep learning is based on powerful computers programmed with sophisticated algorithms; thus, it always uses linear deductions based on large data sets and mathematical logic, while we humans tend to use our creative, associative thinking as well, checking the associative ideas against professional rationale. This is mainly expressed, as T. Cook [1] also pointed out, by striving for a holistic approach: “medical practitioners begin with the history of the present illness, the review of systems”.
The use of creative thinking may shorten the time required for problem solving and also makes serendipitous results possible. As long as most studies take the approach of evaluating diagnostic accuracy in isolation, without reflecting clinical practice, improvement in accuracy cannot be expected. Conversely, if diagnoses are combined with subsequent experience in the underlying data set, and this data set contains a sufficiently large number of cases, deep learning may provide a better approach to issues that have previously aroused professional debate.
At this point, A.I. may start to make more accurate diagnoses than human health-care professionals. This may hold until some rare disease appears for which there is little experience. In such a case, however, human professionals are also quite likely to misdiagnose, lacking a basis for comparison. It may seem that this theoretical approach leads to the opposite conclusion from that of Liu et al. [2], but that is not the case.
The reason lies in the fact that our theoretical estimate was made for the long term, assuming a large amount of initial data and ample storage capacity. However, the size of the currently available data sets is nowhere near the desired size; in the present case [1] it meant 25 studies. Consequently, we have not yet reached the reliably large amount of data from which artificial intelligence could statistically infer the uncertainty arising from a lack of expert consensus.
At the present state of the art, deep learning functions as a kind of secondary measurement method that is calibrated to human experience as the “primary method”. As such, a secondary “measurement method” cannot be more accurate than the primary “measurement method” against which it was calibrated.
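This calibration argument can be illustrated with a toy simulation (my own construction, not from the cited papers; it assumes, as an extreme simplification, that the model reproduces its human-provided labels exactly, whereas real training may behave differently, e.g. by averaging out uncorrelated labeling noise):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_cases = 100_000
ground_truth = rng.integers(0, 2, size=n_cases)  # unknown true diagnoses

# "Primary method": human labels with an assumed 10% error rate.
human_error_rate = 0.10
flips = rng.random(n_cases) < human_error_rate
human_labels = np.where(flips, 1 - ground_truth, ground_truth)

# "Secondary method": an idealized model that exactly reproduces
# the human labels it was calibrated against.
model_predictions = human_labels

print("human accuracy vs. truth:", (human_labels == ground_truth).mean())
print("model accuracy vs. truth:", (model_predictions == ground_truth).mean())
# Under this assumption, the model's ceiling equals the human accuracy.
```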
The commentator [1] also mentions that deep learning acts as a black box: one cannot see the algorithm’s decision-making mechanism, “it still cannot tell us why the end result is produced”. I suspect that the everyday practice of the medical profession and the science of programming are so far apart from each other that only intensive teamwork with good communication, as is usual with medical devices, will result in sufficiently reliable future technologies.
Just as a driver does not see how the fuel burns in the cylinder, yet trusts the car mechanic’s judgment when the car is deemed roadworthy, the medical professional will not have a chance to look into the details of the algorithm. It is necessary to trust the multidisciplinary development team that validates the machine learning system it has developed.
It is also interesting to consider Moore’s and Kryder’s laws. Moore’s law [9,10] is an empirical law stating that the number of transistors on a chip doubles roughly every two years (often quoted as 18 months). According to Kryder’s law [10,11], the density and capability of hard-drive storage media double roughly every 18 months. It will be interesting to see in the long run whether the size of the data sets used for machine learning, analogously to the aforementioned laws, will follow an exponential curve as a result of technological advances. I am voting in favor.
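Both laws share the same exponential form, capacity(t) = capacity(0) × 2^(t/T), where T is the doubling period; a minimal sketch with hypothetical numbers:

```python
def capacity_after(initial: float, months: float, doubling_months: float) -> float:
    """Capacity after `months`, if it doubles every `doubling_months`."""
    return initial * 2 ** (months / doubling_months)

# Hypothetical example: a 1 TB data set doubling every 18 months.
for years in (1, 3, 5, 10):
    print(f"{years:2d} years -> {capacity_after(1.0, 12 * years, 18):7.1f} TB")
```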