Distribution consistency-based missing value imputation algorithm for large-scale data sets (2024)

Table of Contents
Abstract References FAQs

Abstract

Objective

As a significant research branch in the field of data mining, missing value imputation (MVI) aims to provide high-quality data support for the training of machine learning algorithms.However, MVI results for large-scale data sets are not ideal in terms of restoring data distribution and improving data prognosis accuracy.To improve the performance of the existing MVI algorithms, we propose a distribution consistency-based MVI (DC-MVI) algorithm that attempts to restore the original data structure by imputing the missing values for large-scale data sets.

Methods

First, the DC-MVI algorithm developed an objective function to determine the optimal imputation values based on the principle of probability distribution consistency.Second, the data set is preprocessed by random initialization of missing values and normalization, and a feasible missing value update rule is derived to obtain the imputation values with the closest variance and the greatest consistency with the complete original values.Next, in a distributed environment, the large-scale data set is divided into multiple groups of random sample partition (RSP) data blocks with the same distribution as the entire data set by taking into account the statistical properties of the large-scale data set.Finally, the DC-MVI algorithm is trained in parallel to obtain the imputation value corresponding to the missing value of the large-scale data set and preserve distribution consistency with the non-missing values.The rationality experiments verify the convergence of the objective function and the contribution of DC-MVI to distribution consistency.In addition, the effectiveness experiments assess the performance of DC-MVI and eight other MVI algorithms (mean, KNN, MICE, RF, EM, SOFT, GAIN, and MIDA) through the following three indicators: distribution consistency, time complexity, and classification accuracy.

Results

The experimental results on seven selected large-scale data sets showed that: 1) The objective function of the DC-MVI method was effective, and the missing value update rule was feasible, allowing the imputation values to remain stable throughout the adjustment process; 2) the DC-MVI algorithm obtained the smallest maximum mean discrepancy and Jensen-Shannon divergence on all data sets, showing that the proposed method had a more consistent probability distribution with the complete original values under the given significance level; 3) the running time of the DC-MVI algorithm tended to be stable in the time comparison experiment, whereas the running time of other state-of-the-art MVI methods increased linearly with data volume; 4) the DC-MVI approach could produce imputation values that were more consistent with the original data set compared to existing methods, which was beneficial for subsequent data mining analysis.

Conclusions

Considering the peculiarities and limitations of missing large-scale data, this paper incorporates RSP into the imputation algorithm and derives the update rules of imputation values to restore the data distribution and further confirm the effectiveness and practical performance of DC-MVI in the large-scale data set imputation, such as preserving distribution consistency and increasing imputation quality.The method proposes in this paper achieves the desired result and represents a viable solution to the problem of large-scale data imputation.

References

[1]

SCHEET P, STEPHENS M. A fast and flexible statistical model for large-scale population genotype data: Applications to inferring missing genotypes and haplotypic phase[J]. The American Journal of Human Genetics, 2006, 78(4): 629-644.

[2]

JEREZ J M, MOLINA I, GARCÍA-LAENCINA P J, et al. Missing data imputation using statistical and machine learning methods in a real breast cancer problem[J]. Artificial Intelligence in Medicine, 2010, 50(2): 105-115.

[3]

YUAN H W, XU G M, YAO Z J, et al. Imputation of missing data in time series for air pollutants using long short-term memory recurrent neural networks[C]// Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers. Singapore: ACM, 2018: 1293-1300.

[4]

VAN BUUREN S, GROOTHUIS-OUDSHOORN K. MICE: Multivariate imputation by chained equations in R[J]. Journal of Statistical Software, 2011, 45(3): 1-67.

[5]

NGUYEN C D, CARLIN J B, LEE K J. Model checking in multiple imputation: An overview and case study[J]. Emerging Themes in Epidemiology, 2017, 14(1): 1-12.

[6]

ZHANG Z H. Missing data imputation: Focusing on single imputation[J]. Annals of Translational Medicine, 2016, 4(1): 9.

[7]

KHOSRAVI P, LIANG Y T, CHOI Y J, et al. What to expect of classifiers? Reasoning about logistic regression with missing features[C]// Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19). Macao, China: IJCAI, 2019: 2716-2724.

[8]

SEFIDIAN A M, DANESHPOUR N. Missing value imputation using a novel grey based fuzzy c-means, mutual information based feature selection, and regression model[J]. Expert Systems with Applications, 2019, 115: 68-94.

[9]

ZHU X F, ZHANG S C, JIN Z, et al. Missing value estimation for mixed-attribute data sets[J]. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(1): 110-121.

[10]

LIU X W, ZHU X Z, LI M M, et al. Multiple kernel k-means with incomplete kernels[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(5): 1191-1204.

[11]

CHEN X Y, YANG J M, SUN L J. A nonconvex low-rank tensor completion model for spatiotemporal traffic data imputation[J]. Transportation Research Part C: Emerging Technologies, 2020, 117: 102673.

[12]

ZHU Y M, LI Z, ZHU H Z, et al. A compressive sensing approach to urban traffic estimation with probe vehicles[J]. IEEE Transactions on Mobile Computing, 2013, 12(11): 2289-2302.

[13]

YU J R, STETTLER M E J, ANGELOUDIS P, et al. Urban network-wide traffic speed estimation with massive ride-sourcing GPS traces[J]. Transportation Research Part C: Emerging Technologies, 2020, 112: 136-152.

[14]

HUANG J L, KEUNG J W, SARRO F, et al. Cross-validation based K nearest neighbor imputation for software quality datasets: An empirical study[J]. Journal of Systems and Software, 2017, 132: 226-252.

[15]

AL-HELALI B, CHEN Q, XUE B, et al. A hybrid GP-KNN imputation for symbolic regression with missing values[C]// Australasian Joint Conference on Artificial Intelligence. Cham: Springer, 2018: 345-357.

[16]

CHOUDHURY S J, PAL N R. Imputation of missing data with neural networks for classification[J]. Knowledge-Based Systems, 2019, 182: 104838.

[17]

YOON J, JORDON J, VAN DER SCHAAR M. GAIN: Missing data imputation using generative adversarial nets[C]// Proceeding of the 35th International Conference on Machine Learning. Stockholm, Sweden: PMLR, 2018: 5689-5698.

[18]

MA Q, LEE W C, FU T Y, et al. MIDIA: Exploring denoising autoencoders for missing data imputation[J]. Data Mining and Knowledge Discovery, 2020, 34(6): 1859-1897.

[19]

LIN W C, TSAI C F, ZHONG J R. Deep learning for missing value imputation of continuous data and the effect of data discretization[J]. Knowledge-Based Systems, 2022, 239: 108079.

[20]

HE K M, CHEN X L, XIE S N, et al. Masked autoencoders are scalable vision learners[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE, 2022: 16000-16009.

[21]

BADSHA M B, LI R, LIU B X, et al. Imputation of single-cell gene expression with an autoencoder neural network[J]. Quantitative Biology, 2020, 8(1): 78-94.

[22]

BANSAL P, DESHPANDE P, SARAWAGI S. Missing value imputation on multidimensional time series[J]. Proceedings of the VLDB Endowment, 2021, 14(11): 2533-2545.

[23]

MENG X F, CI X. Big data management: Concept, techniques and challenge[J]. Journal of Computer Research and Development, 2013, 50(1): 146-169. (in Chinese)

[24]

CHEN M, MAO S, ZHANG Y, et al. Big data storage[M]. Cham: Springer, 2014: 33-49.

[25]

SHVACHKO K, KUANG H R, RADIA S, et al. The hadoop distributed file system[C]// 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST). Incline Village, USA: IEEE, 2010: 1-10.

[26]

DEAN J, GHEMAWAT S. MapReduce: Simplified data processing on large clusters[C]// Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation. San Francisco, USA: ACM, 2004: 137-150.

[27]

HUANG Z X, HE Y L, WEI C H, et al. Random sample partition data model and related technologies for big data analysis[J]. Journal of Data Acquisition and Processing, 2019, 34(3): 373-385. (in Chinese)

[28]

STRIKE K, EL-EMAM K, MADHAVJI N. Software cost estimation with incomplete data[J]. IEEE Transactions on Software Engineering, 2001, 27(10): 890-908.

[29]

ZHANG S C, QIN Z, LING C X, et al. "Missing is useful": Missing values in cost-sensitive decision trees[J]. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(12): 1689-1693.

[30]

LITTLE R J A, RUBIN D B. Statistical analysis with missing data[M]. Hoboken: John Wiley & Sons, 2019.

[31]

MORRIS T P, WHITE I R, ROYSTON P. Tuning multiple imputation by predictive mean matching and local residual draws[J]. BMC Medical Research Methodology, 2014, 14(1): 75.

[32]

LI D W, ZHANG H Q, LI T R, et al. Hybrid missing value imputation algorithms using fuzzy C-means and vaguely quantified rough set[J]. IEEE Transactions on Fuzzy Systems, 2022, 30(5): 1396-1408.

[33]

REZVAN P H, LEE K J, SIMPSON J A. The rise of multiple imputation: A review of the reporting and implementation of the method in medical research[J]. BMC Medical Research Methodology, 2015, 15(1): 30.

[34]

HUGHES R A, HERON J, STERNE J A C, et al. Accounting for missing data in statistical analyses: Multiple imputation is not always the answer[J]. International Journal of Epidemiology, 2019, 48(4): 1294-1304.

[35]

ZHANG S C. Nearest neighbor selection for iteratively KNN imputation[J]. Journal of Systems and Software, 2012, 85(11): 2541-2552.

[36]

KEERIN P, KURUTACH W, BOONGOEN T. Cluster-based KNN missing value imputation for DNA microarray data[C]// 2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC). Korea: IEEE, 2012: 445-450.

[37]

VINCENT P, LAROCHELLE H, BENGIO Y, et al. Extracting and composing robust features with denoising autoencoders[C]// Proceedings of the 25th International Conference on Machine Learning. Helsinki, Finland: ACM, 2008: 1096-1103.

[38]

GONDARA L, WANG K. MIDA: Multiple imputation using denoising autoencoders[C]// Proceeding of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining. Melbourne, Australia: Springer, 2018: 260-272.

[39]

SOVILJ D, EIROLA E, MICHE Y, et al. Extreme learning machine for missing data using multiple imputations[J]. Neurocomputing, 2016, 174: 220-231.

[40]

LU C B, MEI Y. An imputation method for missing data based on an extreme learning machine auto-encoder[J]. IEEE Access, 2018, 6: 52930-52935.

[41]

GARCÍA J C F, KALENATIC D, BELLO C A L. Missing data imputation in multivariate data by evolutionary algorithms[J]. Computers in Human Behavior, 2011, 27(5): 1468-1474.

[42]

ZHANG Y, ZHOU B H, CAI X R, et al. Missing value imputation in multivariate time series with end-to-end generative adversarial networks[J]. Information Sciences, 2021, 551: 67-82.

[43]

DEMPSTER A P, LAIRD N M, RUBIN D B. Maximum likelihood from incomplete data via the EM algorithm[J]. Journal of the Royal Statistical Society: Series B (Methodological), 1977, 39(1): 1-22.

[44]

RAHMAN M G, ISLAM M Z. Missing value imputation using a fuzzy clustering-based EM approach[J]. Knowledge and Information Systems, 2016, 46(2): 389-422.

[45]

GOLD M S, BENTLER P M. Treatments of missing data: A Monte Carlo comparison of RBHDI, iterative stochastic regression imputation, and expectation-maximization[J]. Structural Equation Modeling: A Multidisciplinary Journal, 2000, 7(3): 319-355.

[46]

KURUCZ M, BENCZU'R A A, CSALOGÁNY K. Methods for large scale SVD with missing values[C]// Proceedings of KDD Cup and Workshop. San Jose, USA: ACM, 2007, 12: 31-38.

[47]

MAZUMDER R, HASTIE T, TIBSHIRANI R. Spectral regularization algorithms for learning large incomplete matrices[J]. The Journal of Machine Learning Research, 2010, 11: 2287-2322.

[48]

ZHU X S, WANG J Y, SUN B, et al. An efficient ensemble method for missing value imputation in microarray gene expression data[J]. BMC Bioinformatics, 2021, 22(1): 188.

[49]

PETROZZIELLO A, JORDANOV I, SOMMEREGGER C. Distributed neural networks for missing big data imputation[C]// 2018 International Joint Conference on Neural Networks (IJCNN). Rio de Janeiro, Brazil: IEEE, 2018: 1-8.

[50]

HE Y L, HUANG D F, DAI D X, et al. General bounds for maximum mean discrepancy statistics[J]. Mathematica Applicata, 2021, 34(2): 284-288. (in Chinese)

[51]

GRETTON A, BORGWARDT K M, RASCH M J, et al. A kernel two-sample test[J]. The Journal of Machine Learning Research, 2012, 13(1): 723-773.

[52]

TANG F, ISHWARAN H. Random forest missing data algorithms[J]. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2017, 10(6): 363-377.

[53]

OpenIDEA-Yunan University, Ycimpute[EB/OL]. (2020-08-23)[2022-09-25]. https://github.com/OpenIDEA-YunanUniversity/ycimpute.

[54]

RUBINSTEYN A, FELDMAN S. Fancyimpute: An imputation library for python[EB/OL]. (2021-10-22)[2022-09-25]. https://github.com/iskandr/fancyimpute.

[55]

UC Irvine Machine Leaning Repository. UCI machine learning repository[EB/OL]. (2021-05-02)[2022-09-25]. http://archive.ics.uci.edu/ml.

[56]

TRIGUERO I, GONZÁLEZ S, MOYANO J M, et al. KEEL 3.0: An open source software for multi-stage analysis in data mining[J]. International Journal of Computational Intelligence Systems, 2017, 10(1): 1238-1249.

[57]

WHITE I R, ROYSTON P, WOOD A M. Multiple imputation using chained equations: Issues and guidance for practice[J]. Statistics in Medicine, 2011, 30(4): 377-399.

Distribution consistency-based missing value imputation algorithm for large-scale data sets (2024)

FAQs

Which method is most suitable for datasets with many missing values such as this one? ›

Imputation: Imputing missing values with estimates like the mean, median, or mode for numerical data, or a common category for categorical data can help retain data while filling gaps. More advanced methods like regression, k-nearest neighbors, or predictive modeling can also be used for imputation.

Which algorithm can handle missing data? ›

However, algorithms like K-nearest and Naive Bayes support data with missing values.

Which of the following algorithms can be used to impute missing data for both categorical and continuous features? ›

k-NN algorithm can be used for imputing missing value of both categorical and continuous variables.

How you will be handling missing values in this dataset in decision tree? ›

To handle missing values effectively, decision trees use surrogate splits. These surrogate splits act as backup choices when the primary attribute for a split has missing values. The algorithm identifies the next best attribute that can provide a similar separation as the primary attribute.

How do you handle missing data in large datasets? ›

When dealing with missing data, data scientists can use two primary methods to solve the error: imputation or data removal. The imputation method substitutes reasonable guesses for missing data. It's most useful when the percentage of missing data is low.

How will you handle imputation of data for a large dataset? ›

Impute missing values with Mean/Median:

Columns in the dataset which are having numeric continuous values can be replaced with the mean, median, or mode of remaining values in the column. This method can prevent the loss of data compared to the earlier method.

Which algorithm is not suitable for large data sets? ›

Bubble Sort is the simplest sorting algorithm that works by repeatedly swapping the adjacent elements if they are in the wrong order. This algorithm is not suitable for large data sets as its average and worst-case time complexity is quite high.

What is the best method of imputation for missing values? ›

Perhaps the easiest way to impute is to replace each missing value with the mean of the observed values for that variable.

What are the 4 types of missing data? ›

There are four qualitatively distinct types of missing data. Missing data is either: structurally missing, missing completely at random (MCAR), missing at random, or nonignorable (also known as missing not at random).

How to impute missing values in a dataset? ›

The simplest imputation method is replacing missing values with the mean or median values of the dataset at large, or some similar summary statistic. This has the advantage of being the simplest possible approach, and one that doesn't introduce any undue bias into the dataset.

What are the two types of imputation? ›

There are two types of imputation–single or multiple. Usually when people talk about imputation, they mean single. Single refers to the fact that you come up with a single estimate of the missing value, using one of the seven methods listed above.

How to handle missing not at random data? ›

Techniques for Handling the Missing Data
  1. Listwise or case deletion. ...
  2. Pairwise deletion. ...
  3. Mean substitution. ...
  4. Regression imputation. ...
  5. Last observation carried forward. ...
  6. Maximum likelihood. ...
  7. Expectation-Maximization. ...
  8. Multiple imputation.

How do you check if dataset has missing values? ›

Some common indicators of missing values are blanks, zeros, nulls, NA, NaN, or other placeholders. You can also use code to check the percentage and pattern of missing values in your dataset. For example, in Python, you can use the isnull() and sum() methods to count the number of missing values in each column.

How do you handle missing or corrupted data in your dataset? ›

how do you handle missing or corrupted data in a dataset?
  1. Method 1 is deleting rows or columns. We usually use this method when it comes to empty cells. ...
  2. Method 2 is replacing the missing data with aggregated values. ...
  3. Method 3 is creating an unknown category. ...
  4. Method 4 is predicting missing values.

When can missing values in dataset be ignored? ›

Missing data under 10% for an individual case or observation can generally be ignored, except when the missing data is a MAR or MNAR. The number of complete cases i.e. observation with no missing data must be sufficient for the selected analysis technique if the incomplete cases are not considered.

Which method is used to eliminate the missing values in the dataset? ›

Imputing missing values with mean/median

This method can prevent the loss of data compared to the earlier method. Replacing the above two approximations (mean, median) is a statistical approach to handle the missing values. This approach is popularly used when there are small number of missing values in the data.

How to find missing values in a dataset? ›

Finding Missing Values

Let's identify all locations in the survey data that have null (missing or NaN) data values. We can use the isnull method to do this. The isnull method will compare each cell with a null value. If an element has a null value, it will be assigned a value of True in the output object.

Which method is preferred of the four methods for handling missing data? ›

The simulation results suggested that in general MI+ is likely to be the best method. Patrician [11] pointed out that MI is the best approach and should be considered to handle missing data compared with CCA and MS by an empirical investigation of AIDS care longitudinal data outcomes.

Top Articles
Latest Posts
Article information

Author: Jonah Leffler

Last Updated:

Views: 6645

Rating: 4.4 / 5 (45 voted)

Reviews: 84% of readers found this page helpful

Author information

Name: Jonah Leffler

Birthday: 1997-10-27

Address: 8987 Kieth Ports, Luettgenland, CT 54657-9808

Phone: +2611128251586

Job: Mining Supervisor

Hobby: Worldbuilding, Electronics, Amateur radio, Skiing, Cycling, Jogging, Taxidermy

Introduction: My name is Jonah Leffler, I am a determined, faithful, outstanding, inexpensive, cheerful, determined, smiling person who loves writing and wants to share my knowledge and understanding with you.