Anomaly Detection and Accuracy Measurement for Categorical Data

  • Kameron Grubaugh
  • Zachary Zimmerman
  • Nicholas McAfee
  • Emily McGowan
  • Paul Evangelista, United States Military Academy

Abstract

The Department of Defense (DoD) recently initiated an effort to compile all inter-service maintenance data for equipment and infrastructure, requiring the consolidation of maintenance records from over 40 different data sources. This research evaluates and improves the accuracy of this maintenance data warehouse by means of value modeling and statistical methods for anomaly detection. The first step in this work was to categorize error-identifying metadata, which was then consolidated into a weighted scoring model. The most novel aspect of the work involved error identification processes using conditional probability combinations and likelihood measures. This analysis showed promising results: conditional probability tests successfully identified numerous invalid maintenance description labels. This process has the potential both to reduce the manual labor needed to clean the DoD maintenance data records and to provide better fidelity on DoD maintenance activities.
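To make the idea of a conditional probability test concrete, the following is a minimal sketch and not the authors' implementation. It estimates P(label | category) from co-occurrence counts and flags records whose label is improbable given the category. The field names (`equipment`, `action`), the sample records, and the 0.05 threshold are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def conditional_probabilities(records, given, target):
    """Estimate P(target | given) for every observed pair from raw counts."""
    given_counts = Counter(r[given] for r in records)
    pair_counts = Counter((r[given], r[target]) for r in records)
    return {pair: n / given_counts[pair[0]] for pair, n in pair_counts.items()}

def flag_rare_labels(records, given, target, threshold=0.05):
    """Return records whose target value is improbable given the other field."""
    probs = conditional_probabilities(records, given, target)
    return [r for r in records if probs[(r[given], r[target])] < threshold]

# Hypothetical maintenance records (illustrative only, not DoD data).
records = (
    [{"equipment": "truck", "action": "replace tire"}] * 30
    + [{"equipment": "truck", "action": "change oil"}] * 19
    + [{"equipment": "truck", "action": "repaint hull"}] * 1
    + [{"equipment": "ship", "action": "repaint hull"}] * 20
)

# Assumed cutoff; a real threshold would need tuning against labeled errors.
flagged = flag_rare_labels(records, "equipment", "action", threshold=0.05)
for record in flagged:
    print(record)
```

In this toy data, "repaint hull" is common for ships but appears in only 1 of 50 truck records, so the test singles out that one record as a candidate for manual review.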

Author Biography

Paul Evangelista, United States Military Academy
Director, Engineering Management Program, Department of Systems Engineering, United States Military Academy, Mahan Hall, Bldg 752, Room 420, West Point, NY 10996, USA

References

Dunham, M. H. (2003). Data mining introductory and advanced topics. Upper Saddle River, NJ: Prentice Hall/Pearson Education.

Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1, 295-311.

Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions, 53, 370-418.

Bhaskaran, R., Palaniswamy, N., Rengaswamy, N. S., & Jayachandran, M. (2005). A review of differing approaches used to estimate the cost of corrosion (and their relevance in the development of modern corrosion prevention and control strategies). Anti-Corrosion Methods and Materials, 52(1), 29-41. Retrieved from https://search.proquest.com/docview/218922069?accountid=15138.

Das, K., & Schneider, J. (2007). Detecting anomalous records in categorical datasets. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 220-229). ACM.

Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137-144.

Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29-36.

Meyer, D., Hornik, K., & Zeileis, A. (2006). The strucplot framework: Visualizing multi-way contingency tables with vcd. Retrieved February 22, 2018, from https://www.jstatsoft.org/.

Office of the Secretary of Defense. (2014). Operating and Support Cost-Estimating Guide. Cost Assessment and Program Evaluation. Retrieved September 25, 2017, from https://www.cape.osd.mil/files/os_guide_v9_march_2014.pdf.

Office of the Under Secretary of Defense (Comptroller). (2018). Operation and Maintenance Overview: Fiscal Year 2019 Budget Estimates. Retrieved October 27, 2018, from https://comptroller.defense.gov/Portals/45/Documents/defbudget/fy2019/fy2019_OM_Overview.pdf.

Office of the Under Secretary of Defense for Acquisition, Technology, and Logistics. (2010). Prevention and Mitigation of Corrosion on DoD Military Equipment and Infrastructure. Retrieved September 25, 2017, from http://www.esd.whs.mil/Portals/54/Documents/DD/issuances/dodi/500067p.pdf.

Shen, D., Ruvini, J. D., & Sarwar, B. (2012). Large-scale item categorization for e-commerce. Retrieved September 25, 2017, from https://www.researchgate.net/publication/262270957_Large-scale_item_categorization_for_e-commerce.

Yung, Chung. (2015). Mining Massive Web Log Data of an Official Tourism Web Site as a Step towards Big Data Analysis in Tourism. Retrieved September 25, 2017, from http://dl.acm.org/citation.cfm?id=2818906&CFID=985971970&CFTOKEN=42446460.

Published
2019-03-07
How to Cite
Grubaugh, K., Zimmerman, Z., McAfee, N., McGowan, E., & Evangelista, P. (2019). Anomaly Detection and Accuracy Measurement for Categorical Data. Industrial and Systems Engineering Review, 6(2), 88-94. Retrieved from http://watsonojs.binghamton.edu/index.php/iser/article/view/98