In the past few years, there has been abundant research on using machine learning to generate high-quality radiology reports with the large MIMIC-CXR chest X-ray dataset. However, little work has focused on evaluating the quality of generated reports from a clinical perspective, where accuracy is the most important factor. Current evaluation metrics score reports along a single dimension. This work proposes the use of multiple dimensions (factual correctness, comprehensiveness, style, and overall quality) to better capture evaluation preferences for a clinical text generation model, since those preferences can differ based on the use case. This work also presents a dataset of radiologist rating annotations for generated and reference chest X-ray radiology reports. Lastly, it introduces an improved metric for the readability dimension by adding context awareness of frequent and acceptable medical terminology.
M.Eng.