In Context: Extracting Relevance from Unstructured Medical Data


By J. Gregory Massey, DVM, MSA

Recently I sat patiently in an examination room while my physician typed notes into a computer terminal. After a few moments, he paused and asked, “You know what electronic medical records are good at?” I smiled politely. “Federal compliance and billing,” he said. I didn’t have to wait long for the follow-up I knew was coming. “You know what they’re bad at? Caring for patients.”

My physician was aware that I have a medical background and work in the healthcare IT industry with a focus on electronic medical records (EMR). So I wasn’t surprised when he voiced his opinion. I also wasn’t surprised by its sentiment—anyone who practices medicine understands the value of a practitioner’s notes. Whether they record a patient’s history, describe his or her progress through the course of a medical problem, or report microscopic findings from a biopsy specimen, this narrative text contains clues that allow healthcare professionals to detect problems, find answers, and practice the art of medicine. Statements from caregivers or patients captured in the record are often as valuable as the clinical pathology reports produced by laboratory tests.

Even so, the challenge isn’t in capturing this data. I was thrilled when my veterinary medical practice moved to electronic records. I can type much faster than I write, and there were no more complaints about my handwriting. The challenge is in using it: when electronic data is analyzed for patient health outcomes or used for automated clinical decision support, the nonstandard, sometimes cryptic comments in these free-form records are overlooked because they lack structure. They are gems that remain buried for lack of tools to mine them effectively. Fortunately, solutions have emerged that allow data professionals and subject matter experts to combine their knowledge, recognize facts and key concepts contained in free text found in medical records, and extract the information for use in analytical models and structured reports.

The goal: Automate text abstraction

Important information exists in the form of unstructured textual health records; the goal is to capture it in a structured format for consumption by advanced analytics and reporting tools. This is traditionally accomplished through manual abstraction (Lin, Jiao, Biskupiak, & McAdam-Marx, 2013), but using text analytics to automatically capture textual information has the potential to dramatically improve efficiency. Increased efficiency is an important goal because reporting delays associated with manual abstraction have led to erroneous conclusions regarding incidence of disease (Midthune, Fay, Clegg, & Feuer, 2005).

Text analytics can be segmented into seven practice areas (Miner et al., 2012):

  • Search and information retrieval
  • Document clustering
  • Document classification
  • Web mining
  • Information extraction (IE)
  • Natural language processing (NLP)
  • Concept extraction (CE)

Applying these techniques to EMRs involves a combination of IE (e.g., extracting relevant facts and relationships), NLP (e.g., tagging parts of speech), and CE (e.g., grouping words and phrases into semantically similar groups).
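
As a rough illustration of the IE piece alone, the sketch below pulls one kind of fact (a recorded body temperature) out of a free-text note and returns it as a structured record. The pattern, the field names, and the sample notes are all hypothetical and chosen only for illustration; a real extraction rule set would cover many more findings and phrasings.

```python
import re

# Hypothetical pattern for a single fact type: a temperature value and its unit.
# It accepts forms such as "Temp 101.5 F" or "Temperature: 38.9C".
TEMP_PATTERN = re.compile(
    r"\btemp(?:erature)?\s*(?:of|is|:)?\s*(\d{2,3}(?:\.\d)?)\s*°?\s*([FC])\b",
    re.IGNORECASE,
)

def extract_temperature(note):
    """Return a structured record for the first temperature found, or None."""
    match = TEMP_PATTERN.search(note)
    if match is None:
        return None
    return {
        "finding": "temperature",
        "value": float(match.group(1)),
        "unit": match.group(2).upper(),
    }

# Illustrative notes, not taken from any real record.
record = extract_temperature("Temp 101.5 F, lethargic, eating poorly.")
missing = extract_temperature("No fever reported.")
```

Here `record` holds the value and unit as separate fields that a reporting tool could filter or chart, while `missing` is `None` because the second note contains no temperature reading.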

Extracting data is the first step toward rendering free-text information useful for reporting and analysis. The next step involves standardizing the findings to make the process more efficient. Synonyms abound in medicine—for example, the words “kidney,” “renal,” and “nephric” all refer to the same organ—and it’s unwieldy to require analysts to include each term in a query. Linking synonymous terms to a single standard is key to maximizing the utility of text analytics.
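
A minimal sketch of that linking step, assuming a small hand-built synonym table, might look like the following. The concept labels (`KIDNEY`, `HEART`) and the sample notes are hypothetical; in practice the table would come from a curated medical vocabulary rather than a hard-coded dictionary.

```python
import re

# Hypothetical synonym table: every surface term maps to one standard label.
SYNONYMS = {
    "kidney": "KIDNEY",
    "renal": "KIDNEY",
    "nephric": "KIDNEY",
    "heart": "HEART",
    "cardiac": "HEART",
}

def standard_concepts(note):
    """Return the set of standard concept labels mentioned in a note."""
    tokens = re.findall(r"[a-z]+", note.lower())
    return {SYNONYMS[token] for token in tokens if token in SYNONYMS}

# Illustrative notes that describe the same organ with different words.
notes = [
    "Renal function stable since last visit.",
    "Left kidney enlarged on palpation.",
    "Cardiac murmur noted, grade II.",
]

# A single query on the standard label finds both kidney-related notes.
kidney_notes = [note for note in notes if "KIDNEY" in standard_concepts(note)]
```

The analyst queries one label instead of listing every synonym, and new synonyms can be added to the table without changing any downstream query.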

Uses: Integrate text analytics and data management

Transforming unstructured data into a readily accessible format enables many different uses for the information. Defining those use cases is critical to identifying appropriate text analytics tools. Possible applications include near-real-time monitoring of patient admissions to recruit for clinical trials, tracking emerging healthcare-associated infections, and efficiently communicating key results to physicians and patients. Predictive analytics could also be used to test whether the symptoms, case-history events, and findings recorded in examination or progress notes contain variables that significantly predict which patients have incident cancer (or another disease).

These are not simply ideas. Text analytics methods have already been tested to varying degrees in healthcare settings. For example, computer programs have been used to automatically assess whether tumors are stable, progressing, or regressing based on reports from imaging studies (Cheng, Zheng, Savova, & Erickson, 2010). Another group evaluated NLP as a tool for reviewing the electronic records of emergency department patients to identify whether similar symptoms were appearing in multiple patients (Gerbier et al., 2011). A third study found that using NLP to examine EMRs detected postoperative complications better than patient safety indicators based on discharge coding (Murff et al., 2011).

Healthcare’s uses for text analytics are clearly numerous, but each requires a team-based approach for implementation. The end user (physician, epidemiologist, administrator, research scientist, etc.) must work with the text analytics professional to clearly define goals for the output and understand how to integrate text analytics with the overall data management system.

The challenge: Disambiguate, don’t obfuscate

Humans are not robots. Even in highly technical, specialized professions like medicine, individuality is reflected in the findings and opinions captured in EMR. Practitioners add further complexity by relying on abbreviations for the sake of efficiency. For example, as of June 30, 2015, the website had collected more than 45,000 medical abbreviations. This number provides a glimpse into the challenge of interpreting whether “AS” refers to ankylosing spondylitis, aortic stenosis, or atherosclerosis.

Solving this riddle relies on word sense disambiguation (WSD). A less technical example of WSD is the homonym “bank”—depending upon its linguistic context, the word might refer to land alongside a river or a financial institution. Distinguishing a word’s meaning using contextual clues seems relatively simple to the human brain, but it is more challenging for a computer.
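
To make the idea concrete, here is a deliberately simple sketch that chooses a sense for “AS” by counting how many cue words for each candidate sense appear in the same sentence. The cue lists are hypothetical and hand-picked for illustration, and this overlap count is only one of many possible approaches, not the method used in any system cited here.

```python
import re

# Hypothetical context cues for each sense of the abbreviation "AS".
# A real system would learn or curate these from annotated records.
SENSE_CUES = {
    "ankylosing spondylitis": {"spine", "back", "stiffness", "sacroiliac", "joint"},
    "aortic stenosis": {"murmur", "valve", "systolic", "echocardiogram", "heart"},
    "atherosclerosis": {"plaque", "artery", "cholesterol", "coronary", "lipid"},
}

def disambiguate_as(sentence):
    """Pick the sense whose cue words overlap most with the sentence.

    Returns None when no cue word appears, rather than guessing.
    Ties go to the sense listed first in SENSE_CUES.
    """
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(words & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

cardiac = disambiguate_as("Systolic murmur consistent with AS; echocardiogram ordered.")
spinal = disambiguate_as("Morning back stiffness, suspect AS.")
unknown = disambiguate_as("History of AS.")
```

With these cue lists, the first sentence resolves to aortic stenosis and the second to ankylosing spondylitis. The third offers no context at all, so the function returns `None` instead of forcing a choice, which mirrors the difficulty a human reader would have with the same sentence.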

Beyond dealing with homonyms, homographs, and homophones, WSD faces an additional test: It must determine whether text mentioning a medical condition is describing a positive or negative diagnosis; whether it’s listing one of several differential diagnoses; or whether it’s simply describing characteristics of the condition and not listing a diagnosis at all. Consequently, automated processing of medical text must be able to recognize the context in which the target term appears.
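
One simple way to approximate this kind of context check is with trigger phrases. The sketch below looks at the text that precedes a condition within a sentence and labels the mention according to the nearest trigger. The trigger lists, the status labels, and the function name are assumptions made for illustration; a usable rule set would be much larger and would be refined with subject matter experts.

```python
import re

# Hypothetical trigger phrases, grouped by the status they signal.
TRIGGERS = {
    "negated": ("no", "denies", "without", "negative for", "no evidence of"),
    "uncertain": ("possible", "suspect", "rule out",
                  "differential includes", "cannot exclude"),
}

def assertion_status(sentence, term):
    """Label a mention of `term` as 'negated', 'uncertain', 'affirmed', or 'absent'.

    Only text before the term is examined, and the trigger closest to the
    term wins. Triggers that follow the term are ignored in this sketch.
    """
    text = sentence.lower()
    position = text.find(term.lower())
    if position == -1:
        return "absent"
    preceding = text[:position]
    status, nearest_end = "affirmed", -1
    for label, phrases in TRIGGERS.items():
        for phrase in phrases:
            pattern = r"\b" + re.escape(phrase) + r"\b"
            for match in re.finditer(pattern, preceding):
                if match.end() > nearest_end:
                    status, nearest_end = label, match.end()
    return status

examples = {
    "No evidence of pneumonia.": None,
    "Differential includes pneumonia and bronchitis.": None,
    "Radiograph confirms pneumonia.": None,
    "Patient denies cough; possible pneumonia.": None,
}
for sentence in examples:
    examples[sentence] = assertion_status(sentence, "pneumonia")
```

The last example shows why the nearest trigger matters: “denies” applies to the cough, while “possible” is the phrase that governs the pneumonia mention. A sketch this small still misses common patterns, such as a trigger that appears after the term, which is one reason rule sets need ongoing refinement.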

The solution: Combine analytical approaches to get superior results

Computer scientists have developed different approaches to address the WSD and text analytics challenges. There are various computational (Berster, Goodwin, & Cohen, 2012) and rules-based (Massey, Myneni, Mattocks, & Brinsfield, 2014) methods available. As expected, each has advantages and disadvantages. In general, computational methods take much longer to process documents, but require less manual refinement. Rules-based models are faster and afford greater customization, but may not perform as well when applied to new document types.

Just because the methods differ doesn’t mean they can’t be used together. Ultimately, choices depend on the user’s resources, requirements, and goals. Like medicine, text analytics is a blend of science and art—and choices must be made to optimize performance. To select the best methods, it’s important to consider a solution that facilitates collaboration between text analysts and subject matter experts. The resulting analyses should be easily integrated into existing analytical environments, and the solution should be well-supported and provide consistent longitudinal results.



Greg Massey is a senior industry consultant in Health Care and Life Sciences for SAS in Cary, North Carolina. He may be contacted at



Berster, B. T., Goodwin, J. C., & Cohen, T. (2012). Hyperdimensional computing approach to word sense disambiguation. AMIA Annual Symposium Proceedings 2012, 1129–1138.

Cheng, L. T., Zheng, J., Savova, G. K., & Erickson, B. J. (2010). Discerning tumor status from unstructured MRI reports—completeness of information in existing reports and utility of automated natural language processing. Journal of Digital Imaging, 23(2), 119-132. doi:

Gerbier, S., Yarovaya, O., Gicquel, Q., Millet, A.-L., Smaldore, V., Pagliaroli, V., … Metzger, M. H. (2011). Evaluation of natural language processing from emergency department computerized medical records for intra-hospital syndromic surveillance. BMC Medical Informatics and Decision Making, 11(50). doi:

Lin, J., Jiao, T., Biskupiak, J. E., & McAdam-Marx, C. (2013). Application of electronic medical record data for health outcomes research: A review of recent literature. Expert Review of Pharmacoeconomics & Outcomes Research, 13(2), 191-200. doi:

Massey, J. G., Myneni, R., Mattocks, M. A., & Brinsfield, E. C. (2014). Extracting key concepts from unstructured medical reports using SAS® Text Analytics and SAS® Visual Analytics. Proceedings of SAS Global Forum 2014, Paper 165-2014. Retrieved from

Midthune, D. N., Fay, M. P., Clegg, L. X., & Feuer, E. J. (2005). Modeling reporting delays and reporting corrections in cancer registry data. Journal of the American Statistical Association, 100(469), 61-70. Retrieved from

Miner, G. D., Delen, D., Elder, J., Fast, A., Hill, T., & Nisbet, R. A. (2012). Practical text mining and statistical analysis for non-structured text data applications. Waltham, MA: Academic Press.

Murff, H. J., FitzHenry, F., Matheny, M. E., Gentry, N., Kotter, K. L., Crimin, K., … Speroff, T. (2011). Automated identification of postoperative complications within an electronic medical record using natural language processing. JAMA, 306(8), 848-855. doi:10.1001/jama.2011.1204.