Business Case Study

NHTSA is charged with writing and enforcing the Federal Motor Vehicle Safety Standards. NHTSA also licenses vehicle manufacturers and importers, allows or blocks the import of vehicles and safety-regulated vehicle parts, administers the vehicle identification number (VIN) system, develops the anthropomorphic dummies used in safety testing as well as the test protocols themselves, and provides vehicle insurance cost information. One of NHTSA's major activities is the creation and maintenance of the data files published by the National Center for Statistics and Analysis. In particular, the Vehicle Recall database has become a resource for traffic safety research, not only in the United States but throughout the world. The objective of this case study is to identify critical safety recalls that could affect life or limb.

The data used for this case study is owned by the National Highway Traffic Safety Administration (NHTSA). The Recall Database can be downloaded from SaferCar.gov. The file used here, FLAT_RCL.zip, was last updated 05/10/2015 04:50:53 AM.

Data

The flat file recall data opens cleanly in a spreadsheet, which makes initial data evaluation straightforward. NHTSA provides a Recall.txt file that defines the fields and their contents. The raw file contains a total of 110,032 records spanning vehicle recalls from 1966 to the present. A data set of that size had to be pared down for this example, so I used time as the selection criterion, keeping only records from January 1, 2015 through May 10, 2015. This produced a manageable yet representative sample of 1,536 records. Because the data is structured and validated by NHTSA, no normalization was required. An initial review of the sample offers immediate insight into potential visualizations.
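As a sketch of that paring-down step, the snippet below loads the flat file with pandas and keeps only the study window. The tab delimiter, the absence of a header row, the extracted file name FLAT_RCL.txt, and the positional mapping of the fields used in this study (POTAFF, INFLUENCED_BY, RCDATE, DESC_DEFECT) are assumptions drawn from the Recall.txt data dictionary and should be verified against your own copy before use.

```python
import pandas as pd

# FLAT_RCL.txt is tab-delimited with no header row; field names come from
# the Recall.txt data dictionary. Only the fields used in this study are
# named here, and the positions are assumptions to confirm against Recall.txt.
recalls = pd.read_csv("FLAT_RCL.txt", sep="\t", header=None,
                      dtype=str, encoding="latin-1")
recalls = recalls.rename(columns={11: "POTAFF", 13: "INFLUENCED_BY",
                                  15: "RCDATE", 19: "DESC_DEFECT"})

# Pare down to the study window: January 1, 2015 through May 10, 2015.
recalls["RCDATE"] = pd.to_datetime(recalls["RCDATE"], format="%Y%m%d",
                                   errors="coerce")
sample = recalls[recalls["RCDATE"].between("2015-01-01", "2015-05-10")]
print(f"{len(sample)} records in the study window")  # 1,536 in this version
```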


Basic Statistics

A great deal of statistical analysis can be performed using the field content alone, and the processes involved are familiar to any business intelligence analyst. The first step is to evaluate what is in the data itself. Each field carries a normative structure. For example, the influenced_by field uses three terms to describe the entity responsible for influencing the recall: the manufacturer, the Office of Vehicle Safety Compliance, or the Office of Defects Investigation. Each has contextual implications. Measuring these fields lets us verify the information and surface errors. A basic dashboard characterizes the nature of the data before text analysis begins.
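A frequency table over a categorical field such as influenced_by is the core of one such dashboard panel. Continuing from the sample built above, a minimal sketch:

```python
# Tabulate who influenced each recall: the manufacturer, OVSC, or ODI.
# value_counts() yields the frequency table behind a basic dashboard panel.
print(sample["INFLUENCED_BY"].value_counts())

# dropna=False surfaces missing values as a quick data-quality check;
# normalize=True converts the counts to proportions.
print(sample["INFLUENCED_BY"].value_counts(dropna=False, normalize=True))
```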

Text-Mining

The question we seek to answer is not found in the statistics alone. The structured fields are not enough to show what the NHTSA dataset actually contains; understanding the nature of the data requires identifying the key information behind the numbers. Using text analysis, we can give the numbers context; in essence, qualitative data augments quantitative data. Following a defined process, we can evaluate the data to determine what is necessary to analyze the text. The nature of the data will determine which processes and actions are required to build a reliable, predictable, and maintainable text-mining capability. The technical imperative is to see the data as it is, not as we want to see it.
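As an illustration of the first pass such a process might take, the sketch below computes simple term frequencies over the defect narratives. DESC_DEFECT is the assumed name of the free-text field, and the stopword list is an arbitrary sample; a production pipeline would add stemming or lemmatization.

```python
import re
from collections import Counter

# A small illustrative stopword list; real pipelines use a curated one.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "may", "or",
             "on", "is", "be", "for", "with", "that", "this", "vehicles"}

# Lowercase, tokenize, and count terms across all defect narratives.
tokens = Counter()
for text in sample["DESC_DEFECT"].dropna():
    for word in re.findall(r"[a-z]+", text.lower()):
        if word not in STOPWORDS and len(word) > 2:
            tokens[word] += 1

# The most frequent terms reveal what the narratives are actually about.
for term, count in tokens.most_common(20):
    print(f"{term:15s} {count}")
```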

Ontology and Taxonomy

As we have seen, a significant amount of information is extractable using standard text-mining processes, but much of it is of little use. In this study we are seeking to answer a fundamental question: "Which recall notices could affect life or limb?" As always, there are several different ways to approach the problem.
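One approach is a lightweight keyword taxonomy that maps hazard categories to trigger terms and tags each recall narrative. The categories and term lists below are illustrative assumptions, not a vetted ontology; a real taxonomy would be derived from the corpus and reviewed by a domain expert.

```python
import re

# Illustrative hazard taxonomy: categories mapped to trigger terms. Both
# the categories and the term lists are assumptions made for this sketch.
TAXONOMY = {
    "fire":    {"fire", "burn", "ignite", "overheat", "smoke"},
    "crash":   {"crash", "collision", "rollover"},
    "control": {"brake", "brakes", "steering", "stall", "accelerate"},
    "injury":  {"injury", "death", "laceration"},
}

def tag_record(text):
    """Return the taxonomy categories whose trigger terms appear in text."""
    words = set(re.findall(r"[a-z]+", str(text).lower()))
    return sorted(cat for cat, terms in TAXONOMY.items() if words & terms)

# A recall matching any category is flagged as potentially affecting
# life or limb -- the fundamental question of this study.
sample = sample.copy()
sample["hazard_tags"] = sample["DESC_DEFECT"].apply(tag_record)
critical = sample[sample["hazard_tags"].str.len() > 0]
print(f"{len(critical)} of {len(sample)} recalls match the hazard taxonomy")
```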


Analysis

Through text-mining we have generated a great deal of information that can be used to quantitatively evaluate the data derived from the narrative text. But other elements can be added to provide additional context and meaning, and the next step in our process is the analysis of those elements.
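For instance, the text-derived hazard tags can be crossed with structured fields such as the number of potentially affected units. POTAFF is the assumed field name for that count, per the same positional mapping used earlier:

```python
import pandas as pd

# One row per (recall, tag) pair, so a recall with two hazards counts in both.
exploded = critical.explode("hazard_tags")
exploded["POTAFF"] = pd.to_numeric(exploded["POTAFF"], errors="coerce")

# Recall counts and total potentially affected units per hazard category.
summary = (exploded.groupby("hazard_tags")
           .agg(recalls=("hazard_tags", "size"),
                units_affected=("POTAFF", "sum")))
print(summary.sort_values("units_affected", ascending=False))
```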

Visualization

We have come to the point where we must visualize the results of the analysis. Visualizations must always be formed to support the objective of the research effort, and we must also determine whether the data supports the visualization. For example, the lack of geospatially relevant data means no map can be rendered.
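Given the summary table built above, a horizontal bar chart of recall counts per hazard category is one visualization the data does support; a minimal matplotlib sketch:

```python
import matplotlib.pyplot as plt

# Bar chart of recall counts per hazard category. No map is attempted,
# since the flat file carries no geospatial fields.
summary["recalls"].sort_values().plot(kind="barh")
plt.xlabel("Number of recalls (Jan 1 - May 10, 2015)")
plt.title("Recalls matching the life-or-limb taxonomy")
plt.tight_layout()
plt.show()
```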