Data journalism is heavily reliant on the quality of your data. Cleaning data is an essential step in increasing the quality of data.
So how do you clean data in excel?
In this post I’ll first outline a number of key tools to clean data with so as to provide a knowledge bank, and then I’ll go through a simple example of cleaning a data set to show how it can be done it practice.
Data cleaning is all about attention to detail.
Some of it is really easy to do especially if you know what you’re doing, but other aspects of it are tough and time consuming. Hopefully by the time I’ve finished you’ll understand how to clean your data well and make the most of it.
Why shouldn’t I just do it manually?
Cleaning data via excel functions and tools is both quicker and more accurate, it allows you to avoid the tedium of going over the same spreadsheet for several hours making minor alterations. You will make mistakes.
I would argue for somebody learning to always apply the principles I’m showing here however easy your set of data is. It’ll help you learn and get into good practice.
Here a list of tools we are going to use and some which we are not, like conditional formatting, but that I’ve included so if you need to use them in future you can.
1. Autofilter – quickly find out what’s in a column of data, and show which bits you want – these are gods gift when it comes to cleaning data in excel.
2. Conditional formatting – highlight cells of data based on criteria you specify
3. Data Validation – choose what values are allowed in a cell
4. Defined data ranges – create lists of data that can be used to make comparisons
5. Find and Replace – look for one thing, and replace it with another
6. Paste Special – remove unwanted formatting
7. Pivot tables – summarise your data and see it in completely new ways
8. Text-to-Columns – split up cells of data that have more than one entry in them
The recipe also makes use of forumulae that use some different spreadsheet functions, including. You may not have known these existed, but they can speed up a lot of the most difficult and time-consuming parts of removing errors from a dataset.
For the purposes of this lesson I have assumed that a basic level of knowledge exists and as such I will not being covering basics such as the SUM function. If you need a list of these tools it is linked here.
Now for the list of formulae will shall be using and some others I’ve included for general use:
• =LEFT (=RIGHT)
Now to go and clean out data.
The data we will be using is here . This is from the Halifax Mortgage Index and is for house price by age over the last 15 years. I’ve merged two regions, London and Yorks & Humber, to make it more interesting as well as just messed it up in general.
Ok, as you can see the data set is a bit of a mess. There are random columns, empty cells, and plenty of other issues.
The order you can approach this is largely down to individual preference, I’d advise going with what is logical to you.
A slight diversion
This is not part of the cleaning we are doing but its really useful to know when to use the TRIM and CLEAN functions.
Trailing whitespace and new lines are common enough problems in spreadsheets that have just been imported or opened and there are two specialised functions – clean and trim – that can be used to remove them. Here are the two steps to use them:
1. In your spreadsheet, the dataset should be in ‘Sheet1’. Create a new worksheet for your spreadsheet, called Sheet2.
2. In cell A1 of the new worksheet you have just created enter the following formula:=CLEAN(TRIM(Sheet1.A1)) and press enter. This will take the content of cell A1 from Sheet1, which is your original data, and reproduce it in Sheet2 without any invisible character, new lines or trailing whitespace.
Back on track – Getting the date
Our dates are currently stuck combined with a description of the house types which is redundant. To split them there are a few ways we could do it, I will cover text-to-columns or splitting it via formula. We shall start with the latter and use text-to-columns later.
Please note that I will cover more steps than necessary here, but it’ll make it easier on more complex data sets.
You’ll need to add a few columns now to work in.
To start with using =SEARCH we can find out where the start of the year is in the string(hyperlink). This is useful when we don’t know there are only 4 like we have now.
We will search for 0 as that the first instance of it is our unique identifier.
You’ll see the blank cells come up with #VALUE, don’t worry about this for now. We’ll solve that later.
The next step is to subtract 2 from the number we’ve just found as we went too far deliberately.
Then we use =LEN to tell us the length of the D column cell.
Then subtract your F values from your G values to get how many characters we want.
Finally to get the year use the =RIGHT function and our newly acquired number of characters like below, and you’ll get the relevant year for our date column. Easy.
Now we can repeat the same process until we get our Q values with the year string attached, once you’ve done it the spreadsheet should look something like this.
We can now use text-to-columns.
Firstly select your row of data, copy it, go to paste special, and paste as values. This gives us numbers rather than formula to work with.
Then add a new column to the right of I, next simply select your column I values, go to data and select text to columns.
Then follow the steps below.
We now have all our basic data so its time to clean it up a bit.
First use paste special and values in our other year column, and then delete all columns we have created but the two year columns and the quarter column. Deleting the string of type of house, quarter and year is good for neatness as well.
This should leave you with:
Now what about all those empty rows?
If we create a new column to the right of category and then use =ISBLANK, we now have data we can filter.
If we select the data in the column we can see we have unclick all but true, which would then leave us with:
We can then delete these rows. Removing the filter there are no more blank rows to trouble our data.
Note: There is a more round about way of doing this but it isn’t really worth doing, but here is how the formula should be set out:
=IF(COUNTIF(some_names,D1)>0, TRUE, FALSE)
You will notice that many of the cells in the values columns do not have data in, instead they have “0” in them. This isn’t great when it comes to presenting the data, we want a more informative description that doesn’t distort averages as easily.
To do this we will use the IF function. This asks whether the value in a cell is a specific string, then returns a result depending on whether it is or it isn’t.
The formula that should be entered is:
Now drag it down across all columns. The next job is to copy all and use paste special as values again, to get rid of the values, and delete the old colums with “0” in. It should leave you with this:
If you wanted at this stage to know how many were not available you could use countif.
The formula would be =countif(G1:G1000,”NotAvailable”. This would be useful for assessing how useful data would be for a story in case there is significant blank.
The last step we’ll take here is to create a Q and year row in itself – this will make it easier to graph on a year by year and quarter basis.
To do this the CONCATENATE function is of use. It allows you to merge the cells you want – if as below you want to add more text simply insert it as an additional text. You should then delete all the empty columns.
Now select it all and paste special as values to make sure data is lost.
Your data should then look like this:
The last problem is the if you scroll down you will see some labels are Yorks&Humber while others and Yorks and Humber. This is solved with a simple find and replace to make it consistent, see the example below.
The data is now ready to put into pivot tables however before we look at that I will just diverge to one brief point.
The VLOOKUP function is one of the most useful for cleaning and editing your data out there, but for what we are doing it isn’t relevant. However a way you could use it is if you were mapping the data and had the shape polygons, you could use VLOOKUP to pull the polygon for the relevant location from another sheet to our sheet.
To explain how to use it if you put the VLOOKUP function into a cell and pass it one of the unique identifiers from your database, it will return you one of the pieces of information associated with that unique identifier. In the example above, you would pass VLOOKUP an item code, and it would return to you either the corresponding item’s description, its price, or its availability (its “In stock” quantity). Which of these pieces of information will it pass you back? Well, you get to decide this when you’re creating the formula.
To return to the data we can now put it into pivot tables. Select all your data, go to the data tab, and click pivot table.
You then get this screen.
Pivot tables allow you to play with and explore your cleaned data, for example I have look at the average all price for all properties across the two areas.
I hope this brief guide has been useful in introducing the fundamentals of how to clean data in excel. To get to grips with more complex aspects there are a lot of great guides out there on the internet to look at, or leave a comment and I’ll get back to you.
If there are any suggestions on what to add or what I could do another article around please feel free to get in contact.