Scalable Machine Learning Pipelines for Big Telemetry Data in Semiconductor Manufacturing
Keywords:
Telemetry Data, Semiconductor Manufacturing, Fault Detection, Classification Methodology, High-Dimensional Data, Imbalanced Data, SECOM Dataset, SMOTE, Proactive Maintenance, Production Analysis

Abstract
The semiconductor industry faces increasing challenges in maintaining high yields and reducing costs as manufacturing processes grow more complex. Big data analytics has emerged as an effective tool for process optimization, enabling manufacturers to extract valuable insights from vast amounts of production data and make data-driven decisions. This study proposes a comprehensive machine learning (ML) pipeline for analyzing telemetry data, demonstrated on the SECOM dataset from the UCI repository. The methodology includes data cleaning, missing value imputation, feature scaling via Min-Max normalization, dimensionality reduction, and the Synthetic Minority Oversampling Technique (SMOTE) to handle class imbalance. A Decision Tree Classifier (DTC) is used to distinguish good products from defective ones, achieving 88% accuracy along with strong recall, F1-score, and ROC-AUC results. In a comparative evaluation, the proposed DTC model outperforms widely used traditional and deep learning techniques, making it a reliable candidate for detecting and addressing faults in real production environments.
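The pipeline stages named in the abstract (imputation, Min-Max scaling, dimensionality reduction, SMOTE-style oversampling of the minority fault class, and a decision tree) can be sketched as follows. This is a minimal illustration, not the authors' exact implementation: it substitutes a synthetic imbalanced dataset for SECOM so it runs self-contained, uses PCA as an assumed dimensionality-reduction choice, and hand-rolls the core SMOTE interpolation step with scikit-learn's nearest-neighbour search; all hyperparameters (20 components, depth-8 tree, k=5 neighbours) are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in for SECOM: high-dimensional, imbalanced binary data with
# injected missing values (class 1 = "defective", ~7% of samples).
X, y = make_classification(n_samples=1000, n_features=100, n_informative=20,
                           weights=[0.93, 0.07], random_state=0)
X[np.random.default_rng(0).random(X.shape) < 0.05] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Stages 1-3: impute missing values, Min-Max scale, reduce dimensionality.
# Transformers are fit on the training split only, then applied to test.
imputer = SimpleImputer(strategy="mean")
scaler = MinMaxScaler()
pca = PCA(n_components=20, random_state=0)
X_tr_p = pca.fit_transform(scaler.fit_transform(imputer.fit_transform(X_tr)))
X_te_p = pca.transform(scaler.transform(imputer.transform(X_te)))


def smote_like(X_min, n_new, k=5, seed=0):
    """SMOTE's core idea: create synthetic minority samples by linear
    interpolation between a minority point and one of its k nearest
    minority-class neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    idx = rng.integers(0, len(X_min), n_new)
    # Column 0 is the point itself, so keep columns 1..k.
    neigh = nn.kneighbors(X_min[idx], return_distance=False)[:, 1:]
    chosen = neigh[np.arange(n_new), rng.integers(0, k, n_new)]
    gap = rng.random((n_new, 1))  # random position along each segment
    return X_min[idx] + gap * (X_min[chosen] - X_min[idx])


# Stage 4: oversample the minority class (training split only) to balance.
X_min = X_tr_p[y_tr == 1]
n_new = int((y_tr == 0).sum() - (y_tr == 1).sum())
X_bal = np.vstack([X_tr_p, smote_like(X_min, n_new)])
y_bal = np.concatenate([y_tr, np.ones(n_new, dtype=int)])

# Stage 5: train the decision tree and evaluate on the held-out split.
clf = DecisionTreeClassifier(max_depth=8, random_state=0).fit(X_bal, y_bal)
acc = accuracy_score(y_te, clf.predict(X_te_p))
print(f"held-out accuracy: {acc:.2f}")
```

Note that oversampling is applied only after the train/test split: generating synthetic samples before splitting would leak interpolated copies of test-adjacent points into training and inflate the reported metrics.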
