Tips for Data Cleaning
Some tips one should keep in mind while preprocessing data.
Data cleaning is as simple as it sounds, but it can cost you a lot of time if not approached properly. As a data enthusiast and a keen learner, I am always intrigued by new findings. Since I consider myself a fresher, this blog is a glimpse of the tips, tricks and little concepts I have encountered in the deep sea of data that grows each day.
Data cleaning is one of the earliest steps in analytics. Let me explain it in simpler terms. Imagine your friend messages you about a football match tomorrow and you agree. Now what’s the first step? You would probably look for your kit in the cupboard, but you rarely organize your clothing shelf, so it takes you a minute to find it. The clothes here are the “data”, the kit is your “target”, and the path that helps you find your target is “cleaning”.
What if your room is covered in clothes, the cupboard is full of clothes too, and there is a rat in the pile that could harm your target? Try finding your kit in a minute now.
When dealing with real datasets, cleaning means fixing the data: you find the potential errors that might harm your target and remove them, and you find missing data and handle it accordingly. This is a detailed discussion with many hows and whys. Let’s keep this writing simple for now.
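To make the idea concrete, here is a minimal sketch of finding missing values and removing one kind of error with pandas. The tiny table and its column names are made up for illustration, and treating duplicate rows as the "error" is just my assumption; real datasets will need their own rules.

```python
import pandas as pd

# Hypothetical messy data, invented for this example
raw = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Chen"],
    "age": [21, None, 19, 19],
})

# Count the missing values in each column
missing_per_column = raw.isna().sum()

# Remove exact duplicate rows (one possible kind of error)
deduplicated = raw.drop_duplicates()

# One possible action for missing data: drop rows with no age
cleaned = deduplicated.dropna(subset=["age"])

print(missing_per_column)
print(cleaned)
```

Dropping rows is only one option. Depending on the data, filling the gaps (for example with fillna) may suit your target better.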
A good approach
Data cleaning can be hectic for large datasets. Never try to look through the whole dataset manually. Limit yourself to the first 10 rows or the last 10 rows (for example with head(10) and tail(10); both return 5 rows by default). The data in between is a void, so use methods to inspect it and code your way through.
Your methods act like a search helicopter hovering over an area to find the suspect. They are responsible for reporting back to you.
Take a look at some of the methods and attributes one can use.
- <dataframe name>.head()
- <dataframe name>.tail()
- <dataframe name>.dtypes
- <dataframe name>.describe()
- <dataframe name>.value_counts()
- <dataframe name>.columns
Take a look at the outputs of these built-in methods. Here the <dataframe name> is df.
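As a rough sketch, the calls above could be tried on a small DataFrame like the one below. The data is invented purely for illustration, and I call value_counts() on a single column, which is the most common way to use it.

```python
import pandas as pd

# Hypothetical sample data, made up for illustration only
df = pd.DataFrame({
    "player": ["Asha", "Ben", "Chen", "Dara", "Eli"],
    "position": ["forward", "keeper", "forward", "defender", "forward"],
    "goals": [12, 0, 7, 1, 9],
})

first_rows = df.head(3)        # first 3 rows (default is 5)
last_rows = df.tail(2)         # last 2 rows (default is 5)
column_types = df.dtypes       # data type of each column (an attribute, no parentheses)
summary = df.describe()        # summary statistics for the numeric columns
position_counts = df["position"].value_counts()  # how often each value appears
column_names = df.columns      # column labels (an attribute, no parentheses)

for result in (first_rows, last_rows, column_types,
               summary, position_counts, column_names):
    print(result)
    print("-" * 40)
```

Note that dtypes and columns are attributes, so they take no parentheses, while head, tail, describe and value_counts are methods and need them.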
Refer to official documentation
Always experiment with data; you’ll discover tips that many don’t know. The libraries hold a number of methods that nobody knows by heart, so referring to the official documentation is always good practice.
Some of the well-known Python libraries are listed: