Top 6 Free ML Data Prep Tools of 2022

Stephen
ai-architecture, data-engineering, python, web-development

Data Prep…we all gotta do it

Are you looking for a data cleaning tool for your next data science project? Below are my top 6 picks for free data cleaning tools of 2022. They will help you prep your data for that ML model faster and with less of a headache.

DataPrep

With a few lines of code you can explore your data, clean your data and even connect to databases and APIs.

Features:

  1. Reports look great and, according to the project, generate 10x faster than Pandas Profiling
  2. 140+ cleaning and validating data functions
  3. Built on top of Dask to get the performance bump
  4. Automatically detects and highlights insights from your data (like outliers)
  5. Report generated to summarize changes made to the data (great for audits)

GitHub - sfu-db/dataprep: DataPrep - The easiest way to prepare data in Python DataPrep lets you prepare your data using a single library with a few lines of code. Currently, you can use DataPrep… github.com

Miller

Miller is a command-line tool for querying, shaping, and reformatting CSV, TSV, and JSON. Need something done quickly? This CLI makes quick work of simple jobs.

Features:

  1. Reduce large datasets with one command
  2. Uses a streaming interface, so it holds only one record in memory for most operations
  3. Basic stats on columns
  4. Quick data querying without Python or notebooks
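
To make the streaming point concrete, here is a minimal Python sketch of the idea. It is not Miller itself (Miller is a standalone CLI), and the function name `streaming_mean` and the sample columns are made up for illustration. It reads CSV one record at a time and keeps only running totals, so memory use stays flat no matter how large the file is.

```python
import csv
import io


def streaming_mean(lines, column):
    """Return the mean of `column`, reading one CSV record at a time."""
    count = 0
    total = 0.0
    for record in csv.DictReader(lines):
        value = record[column].strip()
        if value == "":
            continue  # skip missing values instead of failing on them
        total += float(value)
        count += 1
    return total / count if count else None


# Hypothetical sample data; a real file handle from open(...) works the same way.
sample = io.StringIO("city,temp\nOslo,4\nCairo,30\nLima,\nPune,26\n")
mean_temp = streaming_mean(sample, "temp")
print(mean_temp)
```

Only `count` and `total` survive between records, which is the property that lets a streaming tool handle files far bigger than RAM.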

Miller Latest Documentation The big picture: Even well into the 21st century, our world is full of text-formatted data like CSV. Google CSV memes… miller.readthedocs.io

Data Cleaner

Data Cleaner is a great starter tool if you just want a library to take over the basics of data cleaning. It is very easy to use, but you may eventually outgrow it and move to a more advanced tool.

Features:

  1. Can drop any row with a missing value
  2. Replaces missing values with the mode (for categorical variables) or median (for continuous variables) on a column-by-column basis
  3. Encodes non-numerical variables (e.g., categorical variables with strings) with numerical equivalents
  4. Available through CLI or as a Python library
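
The first three features are easy to picture in code. The sketch below is a plain-Python illustration of that behavior, not datacleaner's own implementation. The function name `autoclean_sketch`, the `drop_missing` flag, and the sample rows are assumptions for the example, and `None` stands in for a missing value.

```python
from statistics import median, mode


def autoclean_sketch(rows, drop_missing=False):
    """Clean a list of dict rows; None marks a missing value."""
    if drop_missing:
        # Feature 1: drop any row that has a missing value.
        rows = [r for r in rows if None not in r.values()]
    cleaned = [dict(r) for r in rows]  # work on copies, leave the input alone
    columns = list(cleaned[0]) if cleaned else []
    for col in columns:
        present = [r[col] for r in cleaned if r[col] is not None]
        if not present:
            continue  # nothing to learn a fill value from
        numeric = all(isinstance(v, (int, float)) for v in present)
        # Feature 2: median for continuous columns, mode for categorical ones.
        fill = median(present) if numeric else mode(present)
        for r in cleaned:
            if r[col] is None:
                r[col] = fill
        if not numeric:
            # Feature 3: encode each distinct string as an integer.
            # Assumes non-numeric columns hold strings, so they can be sorted.
            codes = {v: i for i, v in enumerate(sorted(set(present)))}
            for r in cleaned:
                r[col] = codes[r[col]]
    return cleaned


rows = [
    {"age": 30, "color": "red"},
    {"age": None, "color": "blue"},
    {"age": 50, "color": None},
    {"age": 40, "color": "red"},
]
cleaned = autoclean_sketch(rows)
print(cleaned)
```

The fill value is computed per column, which matches the column-by-column description above. A real library also has to deal with mixed types and dates, which this sketch ignores.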

GitHub - rhiever/datacleaner: A Python tool that automatically cleans data sets and readies them… A Python tool that automatically cleans data sets and readies them for analysis. datacleaner works with data in pandas… github.com

BitRook

BitRook is a desktop app that works like a data science Swiss Army knife. It uses ML to analyze your data and help you clean it, and it even generates a Python script to automate the cleaning. Instead of leaving you to hunt for issues, it brings them to your attention, and it can tell you whether a dataset is predictive. It is worth checking out: the free version is robust, and there is a free trial of the pro version.

Features:

  1. Generates a Python data cleaning script for you
  2. Predictive data detection (Correlation Matrix & Predictive Power Score)
  3. Easily handles large data sets
  4. Column type detection & type standardization
  5. Common data cleaning functions built-in
  6. Unique values, missing values
  7. Outlier handling
  8. PII data detection
  9. Splitting data with a click
  10. Data profiling script generation
  11. A lot more…

BitRook - Clean Data 10x faster using AI Spend less time cleaning data by letting Bitrook's AI engine help you clean your data and generate Python code so you… www.bitrook.com

Great Expectations

Great Expectations brings data profiling and data validation to your pipeline. Think of it as unit testing for your data.

Features:

  1. Easy to write assertions for your data validation
  2. Integrates with most major tools (Spark, Snowflake, Postgres, AWS, etc.)
  3. A huge selection of built-in “expectation” assertion functions
  4. Generates data docs automatically
  5. Great documentation!
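
If "unit testing for your data" sounds abstract, the sketch below shows the general shape in plain Python. It does not use the Great Expectations API. The helpers `expect_not_null` and `expect_between` and the `orders` rows are hypothetical, loosely modeled on the library's habit of naming each check as an expectation that returns a success flag.

```python
def expect_not_null(rows, column):
    """Pass if no row has None in `column`."""
    bad = [i for i, r in enumerate(rows) if r.get(column) is None]
    return {
        "expectation": f"{column} is not null",
        "success": not bad,
        "failed_rows": bad,
    }


def expect_between(rows, column, low, high):
    """Pass if every non-missing value in `column` lies in [low, high]."""
    bad = [
        i for i, r in enumerate(rows)
        if r.get(column) is not None and not (low <= r[column] <= high)
    ]
    return {
        "expectation": f"{column} between {low} and {high}",
        "success": not bad,
        "failed_rows": bad,
    }


# Hypothetical batch of data flowing through a pipeline step.
orders = [
    {"order_id": 1, "amount": 19.99},
    {"order_id": 2, "amount": -5.00},
    {"order_id": None, "amount": 42.00},
]

results = [
    expect_not_null(orders, "order_id"),
    expect_between(orders, "amount", 0, 1000),
]
for result in results:
    status = "PASS" if result["success"] else "FAIL"
    print(status, result["expectation"], result["failed_rows"])
```

Returning a result record instead of raising right away lets a pipeline collect every failure in a batch and report them together, which is roughly what a validation report does.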

GitHub - great-expectations/great_expectations: Always know what to expect from your data. Always know what to expect from your data. Great Expectations helps data teams eliminate pipeline debt, through data… github.com

KLib

Klib helps with importing, cleaning, analyzing, and preprocessing data, and a lot more, in just a couple of lines of code.

Features:

  1. Missing-value plot in 3 lines of code
  2. Helps with data cleaning and data aggregation
  3. Creates amazing correlation plots
  4. Simple numerical data distribution plot
  5. Categorical data plot

GitHub - akanz1/klib: Easy to use Python library of customized functions for cleaning and analyzing… klib is a Python library for importing, cleaning, analyzing and preprocessing data. Explanations on key functionalities… github.com

I hope you found these helpful and that they save you time in your data prep. Let me know if I missed any that should be shared! New data science tools appear every day, but these are some to watch for 2022.