An Approach towards Record Linkage using Genetic Algorithm along with Hash Algorithm
Keywords:
Cosine similarity, dataset, genetic algorithm, MD5, SHA-1, string distance.
Abstract
Several systems that depend on the integrity of their data to offer high-quality services, such as digital libraries and e-commerce brokers, may be affected by the existence of duplicates in their warehouse. Because of these duplicates, more time is required to retrieve high-quality data. Here, deduplication (record linkage) is performed using hash algorithms, namely MD5 and SHA-1, to compute similarity and detect duplicate records, which are then eliminated using an evolutionary technique, namely a genetic algorithm. This approach removes duplicate samples from the dataset in the system.
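
The hash-based detection step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the digests are turned into a similarity measure, so this sketch assumes the simplest case, in which two records count as duplicates when their normalized forms produce the same MD5 or SHA-1 digest. The helper names (`normalize`, `fingerprint`, `deduplicate`) and the normalization rule are hypothetical, and the genetic algorithm stage is not shown.

```python
import hashlib


def normalize(record):
    # Assumed normalization (not from the paper): lowercase each field
    # and collapse runs of whitespace, then join fields with "|".
    return "|".join(" ".join(str(field).lower().split()) for field in record)


def fingerprint(record, algorithm="md5"):
    # Digest of the normalized record; algorithm may be "md5" or "sha1".
    data = normalize(record).encode("utf-8")
    return hashlib.new(algorithm, data).hexdigest()


def deduplicate(records, algorithm="md5"):
    # Keep the first record seen for each digest and drop later matches.
    seen = set()
    unique = []
    for record in records:
        digest = fingerprint(record, algorithm)
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


records = [
    ("John Smith", "New York"),
    ("john  smith", "new york"),  # same record, different case and spacing
    ("Jane Doe", "Boston"),
]
unique_records = deduplicate(records, algorithm="sha1")
```

Exact digest matching only catches records that are identical after normalization. Near-duplicates with typos or reordered fields would need a string-distance or cosine-similarity comparison, as the keywords suggest.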
