More noms!
Photo: Holly Vickery
Comparing data
One of the most common tasks is to compare data. Examples:
- Sales of the same product in different regions.
- Product return rates, for older customers, versus younger customers.
- Math grades for students who took an extra prep course, and those who didn't.
- Number of Twitter followers, for goats who can code, versus those who can't.
To do it, you add a few new variables, some If
s, and a little other stuff. There's no new VBA statements, or new concepts.
Another strange example
Let's say we have some data about cows.
No, just kidding, it's about goats again.
I should hope so!
The data is about goats' social media activity. The fields are:
- Goat name
- Twitter followers
- Tweets in the last 30 days
- Can the goat write code? Yes or no.
- Age (months)
Here they are in a worksheet:
There might be errors in the data, though. So let's start out by validating the data, only processing the data that's valid.
Validation
We'll validate the same way as before, using the flag pattern. Like this.
- Make a Range
- For each row
- Data ok = yes
- Get data for row
- If first field data is bad:
- Data ok = no
- If second field data is bad:
- Data ok = no
- ...
- If last field data is bad:
- Data ok = no
- If data ok = yes:
- Process the row
- Next
Ethan
You know, now that we've seen this a few times, it's not as hard as I thought.
Right! No topic is easy or hard in itself. Driving, caring for goats, programming... It's not that it seems easier, it is easier when you know what you're doing.
That's how you tell you're learning. Think back to what you could do at the beginning of this course, and what you can do now.
Ray
Big change.
Aye. And don't forget. These skills can get you...
OK, back to the example. Goats' social media activity
Here's the VBA for creating a Range
, reading each record, and validating it. Everything but the processing. Please go through it.
- 'Set up the Range object.
- Set rGoatDataTable = Range(Cells(4, 1), Cells(4, 1).End(xlToRight).End(xlDown))
- 'Get the number of rows in the range. We use one row for each record.
- sNumberOfRecords = rGoatDataTable.Rows.Count
- 'Initialize totals.
- sNumberValidRecords = 0
- sNumberInvalidRecords = 0
- 'Compute totals
- For sCurrentRecordNumber = 1 To sNumberOfRecords
- tIsRecordOk = "yes"
- 'Get a record
- tNameIn = rGoatDataTable.Cells(sCurrentRecordNumber, NAME_COLUMN)
- tTwitterFollowersIn = rGoatDataTable.Cells(sCurrentRecordNumber, FOLLOWERS_COLUMN)
- tTweets30DaysIn = rGoatDataTable.Cells(sCurrentRecordNumber, TWEETS_COLUMN)
- tWriteCodeIn = rGoatDataTable.Cells(sCurrentRecordNumber, CAN_CODE_COLUMN)
- tAgeIn = rGoatDataTable.Cells(sCurrentRecordNumber, AGE_COLUMN)
- 'Validate
- 'Check input data.
- 'Check that name is not missing.
- If tNameIn = "" Then
- 'Name is missing.
- tIsRecordOk = "no"
- End If
- 'Check that Twitter followers is a reasonable number.
- If Not IsNumeric(tTwitterFollowersIn) Then
- tIsRecordOk = "no"
- Else
- 'Twitter followers is a number.
- sTwitterFollowers = tTwitterFollowersIn
- If sTwitterFollowers < 0 Then
- tIsRecordOk = "no"
- End If
- If sTwitterFollowers > 55000000 Then
- tIsRecordOk = "no"
- End If
- End If
- 'Check that tweets in last 30 days is a reasonable number.
- If Not IsNumeric(tTweets30DaysIn) Then
- tIsRecordOk = "no"
- Else
- 'Is a number.
- sTweets30Days = tTweets30DaysIn
- If sTweets30Days < 0 Then
- tIsRecordOk = "no"
- End If
- If sTweets30Days > 1000 Then
- tIsRecordOk = "no"
- End If
- End If
- 'Check that coder is either Yes or No.
- tWriteCodeIn = LCase(Trim(tWriteCodeIn))
- If tWriteCodeIn <> "yes" And tWriteCodeIn <> "no" Then
- tIsRecordOk = "no"
- End If
- 'Check that age is a reasonable number.
- If Not IsNumeric(tAgeIn) Then
- tIsRecordOk = "no"
- Else
- 'Is a number.
- sAge = tAgeIn
- If sAge < 0 Then
- tIsRecordOk = "no"
- End If
- If sAge > 500 Then
- tIsRecordOk = "no"
- End If
- End If
- 'Process this record?
- If tIsRecordOk = "yes" Then
- sNumberValidRecords = sNumberValidRecords + 1
- PROCESSING GOES HERE
- Else
- sNumberInvalidRecords = sNumberInvalidRecords + 1
- End If
- Next sCurrentRecordNumber
Notice the range checks in the validation. For example, more than 1,000 tweets a month is unlikely, so the program decides the record is invalid. Tests like these are sometimes called sanity checks.
Ray
I just want to check. All that code is just for validation, right? We haven't gotten to any processing yet.
That's right. It's just validation so far. It's common in business analytics to have a lot of validation code.
Review: averages
Let's start with code for working out the average number of tweets, without the filtering for now. We'd add code like this to the code above:
- 'Initialize totals.
- sNumberValidRecords = 0
- sTotalTweets = 0
- ...
- For sCurrentRecordNumber = 1 To sNumberOfRecords
- Validation here
- If tIsRecordOk = "yes" Then
- 'Processing
- sNumberValidRecords = sNumberValidRecords + 1
- sTotalTweets = sTotalTweets + sTweets30Days
- ...
- End If
- Next sCurrentRecordNumber
- 'Compute average.
- sAvgTweets = sTotalTweets / sNumberValidRecords
- ...
- 'Output
- ...
- Cells(8, 8) = sAvgTweets
Let's work backwards through the code.
Adela, could you do that again?
Adela
Sure.
We want to show the average tweets. So, we have to have it in a variable, so we can put it in a cell. That's in line 19. The variable is sAvgTweets
.
OK, to work out sAvgTweets
, we need a total and a count, like in line 15. The variables for that are sTotalTweets
, and sNumberValidRecords
.
To work out sTotalTweets
, we need to loop over the records (lines 5 and 13), and, if the current record is valid (line 7), add the record's number of tweets to the total (line 10).
Same thing for a count of records.
Great! Thanks.
Filtering
Filtering is about taking one or more subsets of the data. Which subsets? Whatever the business needs.
For example, suppose a consumer product has too many returns. Product returns are expensive, so the company wants to reduce them.
Some marketing people think it's older consumers who are returning the product, because they can't figure out how to use it. If they're right, maybe instructions should be rewritten with older consumers in mind.
That's if they're right. If they're not, the company will waste time and money doing things that don't address the underlying problem, while the problem continues to cost money.
This is a typical business analytics project. Analyze a data set, to make a business decision. Better to know what the data says, than to guess.
OK, so how to find out if the marketing people are right?
Filtering.
Write a program that splits data on returns into two different groups, based on age. Are there more records in the older group than the younger group?
Of course, we're dealing with goats and social media. Cthulhu suspects goats who write code, are more prolific tweeters than goats who don't write code. Is it right? Let's see.
We want to split the totals up. For example, instead of sTotalTweets
, we want two variables:
sTotalTweetsCoders
sTotalTweetsNoncoders
Let's do the same for followers, since we're writing the code anyway. Instead of sTotalFollowers
, we want:
sTotalFollowersCoders
sTotalFollowersNoncoders
Same for age.
sTotalAgeCoders
sTotalAgeNoncoders
We also need to count the number of goats in each group:
sNumberCanCode
sNumberCannotCode
Here's the code that will use those variables.
- For sCurrentRecordNumber = 1 To sNumberOfRecords
- 'Process this record?
- If tIsRecordOk = "yes" Then
- ...
- 'Filter by coding ability.
- If tWriteCodeIn = "yes" Then
- 'Accumulate data for coders.
- sNumberCanCode = sNumberCanCode + 1
- sTotalTweetsCoders = sTotalTweetsCoders + sTweets30Days
- sTotalFollowersCoders = sTotalFollowersCoders + sTwitterFollowers
- sTotalAgeCoders = sTotalAgeCoders + sAge
- Else
- 'Accumulate data for noncoders.
- sNumberCannotCode = sNumberCannotCode + 1
- sTotalTweetsNoncoders = sTotalTweetsNoncoders + sTweets30Days
- sTotalFollowersNoncoders = sTotalFollowersNoncoders + sTwitterFollowers
- sTotalAgeNoncoders = sTotalAgeNoncoders + 1
- End If
- Else
- ...
- End If
- Next sCurrentRecordNumber
- ...
- 'Compute averages for coders.
- sAvgFollowersCoders = sTotalFollowersCoders / sNumberCanCode
- sAvgTweetsCoders = sTotalTweetsCoders / sNumberCanCode
- sAvgAgeCoders = sTotalAgeCoders / sNumberCanCode
- 'Compute averages for noncoders.
- sAvgFollowersNoncoders = sTotalFollowersNoncoders / sNumberCannotCode
- sAvgTweetsNoncoders = sTotalTweetsNoncoders / sNumberCannotCode
- sAvgAgeNoncoders = sTotalAgeNoncoders / sNumberCannotCode
This part is the key:
- If tWriteCodeIn = "yes" Then
- 'Accumulate data for coders.
- ...
- Else
- 'Accumulate data for noncoders.
- ...
- End If
The program uses different sets of total variables, depending on the value of tWriteCodeIn
.
Georgina
Oo! The If statement...
- If tWriteCodeIn = "yes" Then
... only works because the whether-can-code data is normalized, right?
That's right! This line...
- tWriteCodeIn = LCase(Trim(tWriteCodeIn))
... is before the If
. It makes sure that "YES", "yes ", " Yes ", and other weird combinations are all converted to "yes". So the If just tests for that.
Notice the normalization code is mixed with the validation code. That's usual, since it's easier to validate data that's normalized first.
Ethan
I've noticed If
statements have many uses, not just validation.
Aye! Like screws in a car. Screws are used to mount headlights, keep the fuse box in place, lock the seats down, lots of things. Screws do one job: hold parts together. But they do that function in lots of different places.
If
statements, variables, expressions... they're components you put together in different ways to get what you want.
Onward!
OK, you've added filtering to your data analytics toolkit, along with validation, normalization, accumulation (totals and counts), and other things.
Now for another thing people will ask you to do: finding the smallest and largest.