Daegeun Kim

Fall 2025
Python Package
Data Wrangling
Data Pipeline Tooling

mergeprep is a Python package, built on top of the pandas library, that makes row-wise merging of messy tables reliable and repeatable. It handles common real-world problems that cause merges to break or produce incorrect results, such as inconsistent table layouts, mismatched column names, different value formats, and unclear join keys.

The package provides utilities to prepare tables into merge-ready structures, measure similarity between tables and columns, and automatically suggest likely merge keys based on patterns in the data. It also includes diagnostic tools that explain why a merge succeeded or failed, highlighting issues like low key uniqueness, key mismatches, duplicates, or unexpected row expansion, so you can fix problems quickly and merge with confidence.
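To make the diagnostic ideas concrete, here is a minimal sketch using plain pandas only. It is not mergeprep's implementation or API; the tables and variable names are made up for illustration. It checks key uniqueness, counts matched and unmatched rows with `indicator=True`, and detects duplicate keys with `validate="one_to_one"`.

```python
import pandas as pd

# Toy tables (hypothetical data): "id" is duplicated on both sides.
left = pd.DataFrame({"id": [1, 2, 2, 3], "x": ["a", "b", "c", "d"]})
right = pd.DataFrame({"id": [2, 2, 4], "y": [10, 20, 30]})

# Key uniqueness: distinct key values divided by row count (1.0 means unique).
left_uniqueness = left["id"].nunique() / len(left)
right_uniqueness = right["id"].nunique() / len(right)

# An outer merge with indicator=True labels each row as
# "both", "left_only", or "right_only" in the "_merge" column.
merged = left.merge(right, on="id", how="outer", indicator=True)
matched_rows = int((merged["_merge"] == "both").sum())
left_only_rows = int((merged["_merge"] == "left_only").sum())
right_only_rows = int((merged["_merge"] == "right_only").sum())

# Row expansion: duplicated keys on both sides multiply the matched rows.
left_rows_with_match = int(left["id"].isin(right["id"]).sum())
expanded = matched_rows > left_rows_with_match

# validate="one_to_one" raises MergeError when keys are not unique.
try:
    left.merge(right, on="id", validate="one_to_one")
    keys_are_unique = True
except pd.errors.MergeError:
    keys_are_unique = False

print(merged)
print(matched_rows, left_only_rows, right_only_rows, expanded, keys_are_unique)
```

In this toy case the two left rows with `id == 2` each pair with the two right rows with the same key, so two matching left rows become four merged rows. This is the kind of unexpected row expansion a merge diagnostic is meant to surface.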

*mergeprep was created with cookiecutter and the py-pkgs-cookiecutter template.

Details are available on the GitHub page above.

Functions

The package provides five functions:

1. calc_match_rate()

Compares every column pair between two tables and quantifies how much their values overlap to identify likely merge keys.

2. convert_style()

Standardizes values from two columns into a common format so that differently written but equivalent entries can be merged reliably.

3. similarity_mapping()

Analyzes two columns to find and rank similar values, producing a mapping that aligns mismatched labels across tables.

4. merge_with_mapping()

Merges two tables using a shared canonical key derived from optional value mappings, while preserving the original merge context.

5. diagnose_merge()

Summarizes a merge by reporting match rates, row counts, and value conversions to explain why the merge succeeded or failed.
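The column-pair comparison described for calc_match_rate() can be illustrated with a short pandas sketch. This is an assumption about the general idea, not the package's actual code or signature: `overlap_rate_table` is a hypothetical helper, and the match rate here is defined simply as the share of distinct left-column values that also appear in the right column.

```python
import pandas as pd


def overlap_rate_table(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: score every (left column, right column) pair.

    The score is the fraction of distinct non-null left values (compared
    as strings) that also occur in the right column.
    """
    rows = []
    for lcol in left.columns:
        lvals = set(left[lcol].dropna().astype(str))
        for rcol in right.columns:
            rvals = set(right[rcol].dropna().astype(str))
            rate = len(lvals & rvals) / len(lvals) if lvals else 0.0
            rows.append({"left_col": lcol, "right_col": rcol, "match_rate": rate})
    # Highest-overlap pairs first: these are the most likely merge keys.
    return pd.DataFrame(rows).sort_values(
        "match_rate", ascending=False, ignore_index=True
    )


# Toy tables (hypothetical data) whose key columns have different names.
customers = pd.DataFrame({"cust_id": [101, 102, 103], "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"customer": [101, 103, 104], "amount": [5, 7, 9]})

table = overlap_rate_table(customers, orders)
best = table.iloc[0]
print(table)
```

Here the top-ranked pair is `cust_id` and `customer`, since two of the three distinct customer IDs appear in the orders table. Comparing values as strings is a deliberate simplification; real tables usually need value standardization first, which is the role described for convert_style() and similarity_mapping().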