Anomaly Detection and Accuracy Measurement for Categorical Data
Abstract
The Department of Defense (DoD) recently initiated an effort to compile all inter-service maintenance data for equipment and infrastructure, which requires consolidating maintenance records from over 40 different data sources. This research evaluates and improves the accuracy of the resulting maintenance data warehouse by means of value modeling and statistical methods for anomaly detection. The first step was to categorize error-identifying metadata, which was then consolidated into a weighted scoring model. The most novel aspect of the work is an error identification process that uses conditional probability combinations and likelihood measures. The analysis showed promising results: conditional probability tests successfully identified numerous invalid maintenance description labels. This process has the potential both to reduce the manual labor needed to clean DoD maintenance data records and to provide better fidelity on DoD maintenance activities.
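As a rough illustration of what a conditional probability test on categorical records can look like, the sketch below estimates P(label | attribute) from co-occurrence counts and flags records whose label is improbable given the attribute. It is not the authors' implementation: the function name `flag_rare_labels`, the field names `equipment` and `action`, the toy records, and the 0.05 threshold are all assumptions made for the example.

```python
from collections import Counter


def flag_rare_labels(records, given_key, label_key, threshold=0.05):
    """Flag records whose label is improbable given another attribute.

    Estimates P(label | given) from raw counts and returns a list of
    (record, probability) pairs for which the estimate falls below
    `threshold`. The threshold is an illustrative assumption.
    """
    pair_counts = Counter((r[given_key], r[label_key]) for r in records)
    given_counts = Counter(r[given_key] for r in records)

    flagged = []
    for r in records:
        p = pair_counts[(r[given_key], r[label_key])] / given_counts[r[given_key]]
        if p < threshold:
            flagged.append((r, p))
    return flagged


# Hypothetical toy data: one truck record carries a label that
# otherwise appears only on ship records.
records = (
    [{"equipment": "truck", "action": "replace tire"}] * 30
    + [{"equipment": "truck", "action": "change oil"}] * 19
    + [{"equipment": "truck", "action": "repaint hull"}] * 1
    + [{"equipment": "ship", "action": "repaint hull"}] * 25
    + [{"equipment": "ship", "action": "change oil"}] * 25
)

flagged = flag_rare_labels(records, "equipment", "action", threshold=0.05)
for record, p in flagged:
    print(record, round(p, 3))
```

With this toy data, only the single truck record labeled "repaint hull" falls below the threshold. A real data set would likely need smoothing for sparse attribute combinations and a threshold tuned against records already known to be valid or invalid.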