Filtering

More noms!
More noms!
Photo: Holly Vickery

Comparing data

One of the most common tasks is to compare data. Examples:

  • Sales of the same product in different regions.
  • Product return rates, for older customers, versus younger customers.
  • Math grades for students who took an extra prep course, and those who didn't.
  • Number of Twitter followers, for goats who can code, versus those who can't.

To do it, you add a few new variables, some Ifs, and a little other stuff. There's no new VBA statements, or new concepts.

Another strange example

Let's say we have some data about cows.

Cows

No, just kidding, it's about goats again.

Goat

I should hope so!

The data is about goats' social media activity. The fields are:

  • Goat name
  • Twitter followers
  • Tweets in the last 30 days
  • Can the goat write code? Yes or no.
  • Age (months)

Here they are in a worksheet:

Social media data

There might be errors in the data, though. So let's start out by validating the data, only processing the data that's valid.

Validation

We'll validate the same way as before, using the flag pattern. Like this.

  • Make a Range
  • For each row
  •     Data ok = yes
  •     Get data for row
  •     If first field data is bad:
  •         Data ok = no
  •     If second field data is bad:
  •         Data ok = no
  •     ...
  •     If last field data is bad:
  •         Data ok = no
  •     If data ok = yes:
  •         Process the row
  • Next
Ethan
Ethan

You know, now that we've seen this a few times, it's not as hard as I thought.

Right! No topic is easy or hard in itself. Driving, caring for goats, programming... It's not that it seems easier, it is easier when you know what you're doing.

That's how you tell you're learning. Think back to what you could do at the beginning of this course, and what you can do now.

Ray
Ray

Big change.

Aye. And don't forget. These skills can get you...

Benjamins

OK, back to the example. Goats' social media activity

Here's the VBA for creating a Range, reading each record, and validating it. Everything but the processing. Please go through it.

  1.   'Set up the Range object.
  2.   Set rGoatDataTable = Range(Cells(4, 1), Cells(4, 1).End(xlToRight).End(xlDown))
  3.   'Get the number of rows in the range. We use one row for each record.
  4.   sNumberOfRecords = rGoatDataTable.Rows.Count
  5.   'Initialize totals.
  6.   sNumberValidRecords = 0
  7.   sNumberInvalidRecords = 0
  8.   'Compute totals
  9.   For sCurrentRecordNumber = 1 To sNumberOfRecords
  10.       tIsRecordOk = "yes"
  11.       'Get a record
  12.       tNameIn = rGoatDataTable.Cells(sCurrentRecordNumber, NAME_COLUMN)
  13.       tTwitterFollowersIn = rGoatDataTable.Cells(sCurrentRecordNumber, FOLLOWERS_COLUMN)
  14.       tTweets30DaysIn = rGoatDataTable.Cells(sCurrentRecordNumber, TWEETS_COLUMN)
  15.       tWriteCodeIn = rGoatDataTable.Cells(sCurrentRecordNumber, CAN_CODE_COLUMN)
  16.       tAgeIn = rGoatDataTable.Cells(sCurrentRecordNumber, AGE_COLUMN)
  17.       'Validate
  18.       'Check input data.
  19.       'Check that name is not missing.
  20.       If tNameIn = "" Then
  21.           'Name is missing.
  22.           tIsRecordOk = "no"
  23.       End If
  24.       'Check that Twitter followers is a reasonable number.
  25.       If Not IsNumeric(tTwitterFollowersIn) Then
  26.           tIsRecordOk = "no"
  27.       Else
  28.           'Twitter followers is a number.
  29.           sTwitterFollowers = tTwitterFollowersIn
  30.           If sTwitterFollowers < 0 Then
  31.               tIsRecordOk = "no"
  32.           End If
  33.           If sTwitterFollowers > 55000000 Then
  34.               tIsRecordOk = "no"
  35.           End If
  36.       End If
  37.       'Check that tweets in last 30 days is a reasonable number.
  38.       If Not IsNumeric(tTweets30DaysIn) Then
  39.           tIsRecordOk = "no"
  40.       Else
  41.           'Is a number.
  42.           sTweets30Days = tTweets30DaysIn
  43.           If sTweets30Days < 0 Then
  44.               tIsRecordOk = "no"
  45.           End If
  46.           If sTweets30Days > 1000 Then
  47.               tIsRecordOk = "no"
  48.           End If
  49.       End If
  50.       'Check that coder is either Yes or No.
  51.       tWriteCodeIn = LCase(Trim(tWriteCodeIn))
  52.       If tWriteCodeIn <> "yes" And tWriteCodeIn <> "no" Then
  53.           tIsRecordOk = "no"
  54.       End If
  55.       'Check that age is a reasonable number.
  56.       If Not IsNumeric(tAgeIn) Then
  57.           tIsRecordOk = "no"
  58.       Else
  59.           'Is a number.
  60.           sAge = tAgeIn
  61.           If sAge < 0 Then
  62.               tIsRecordOk = "no"
  63.           End If
  64.           If sAge > 500 Then
  65.               tIsRecordOk = "no"
  66.           End If
  67.       End If
  68.  
  69.       'Process this record?
  70.       If tIsRecordOk = "yes" Then
  71.           sNumberValidRecords = sNumberValidRecords + 1
  72.  
  73.           PROCESSING GOES HERE
  74.  
  75.       Else
  76.           sNumberInvalidRecords = sNumberInvalidRecords + 1
  77.       End If
  78.   Next sCurrentRecordNumber

Notice the range checks in the validation. For example, more than 1,000 tweets a month is unlikely, so the program decides the record is invalid. Tests like these are sometimes called sanity checks.

Ray
Ray

I just want to check. All that code is just for validation, right? We haven't gotten to any processing yet.

That's right. It's just validation so far. It's common in business analytics to have a lot of validation code.

Review: averages

Let's start with code for working out the average number of tweets, without the filtering for now. We'd add code like this to the code above:

  1. 'Initialize totals.
  2. sNumberValidRecords = 0
  3. sTotalTweets = 0
  4. ...
  5. For sCurrentRecordNumber = 1 To sNumberOfRecords
  6.     Validation here
  7.     If tIsRecordOk = "yes" Then
  8.         'Processing
  9.         sNumberValidRecords = sNumberValidRecords + 1
  10.         sTotalTweets = sTotalTweets + sTweets30Days
  11.         ...
  12.     End If
  13. Next sCurrentRecordNumber
  14. 'Compute average.
  15. sAvgTweets = sTotalTweets / sNumberValidRecords
  16. ...
  17. 'Output
  18. ...
  19. Cells(8, 8) = sAvgTweets

Let's work backwards through the code.

Adela, could you do that again?

Adela
Adela

Sure.

We want to show the average tweets. So, we have to have it in a variable, so we can put it in a cell. That's in line 19. The variable is sAvgTweets.

OK, to work out sAvgTweets, we need a total and a count, like in line 15. The variables for that are sTotalTweets, and sNumberValidRecords.

To work out sTotalTweets, we need to loop over the records (lines 5 and 13), and, if the current record is valid (line 7), add the record's number of tweets to the total (line 10).

Same thing for a count of records.

Great! Thanks.

Filtering

Filtering is about taking one or more subsets of the data. Which subsets? Whatever the business needs.

For example, suppose a consumer product has too many returns. Product returns are expensive, so the company wants to reduce them.

Some marketing people think it's older consumers who are returning the product, because they can't figure out how to use it. If they're right, maybe instructions should be rewritten with older consumers in mind.

That's if they're right. If they're not, the company will waste time and money doing things that don't address the underlying problem, while the problem continues to cost money.

This is a typical business analytics project. Analyze a data set, to make a business decision. Better to know what the data says, than to guess.

OK, so how to find out if the marketing people are right?

Filtering.

Write a program that splits data on returns into two different groups, based on age. Are there more records in the older group than the younger group?

Of course, we're dealing with goats and social media. Cthulhu suspects goats who write code, are more prolific tweeters than goats who don't write code. Is it right? Let's see.

We want to split the totals up. For example, instead of sTotalTweets, we want two variables:

  • sTotalTweetsCoders
  • sTotalTweetsNoncoders

Let's do the same for followers, since we're writing the code anyway. Instead of sTotalFollowers, we want:

  • sTotalFollowersCoders
  • sTotalFollowersNoncoders

Same for age.

  • sTotalAgeCoders
  • sTotalAgeNoncoders

We also need to count the number of goats in each group:

  • sNumberCanCode
  • sNumberCannotCode

Here's the code that will use those variables.

  1. For sCurrentRecordNumber = 1 To sNumberOfRecords
  2.     'Process this record?
  3.     If tIsRecordOk = "yes" Then
  4.         ...
  5.         'Filter by coding ability.
  6.         If tWriteCodeIn = "yes" Then
  7.             'Accumulate data for coders.
  8.             sNumberCanCode = sNumberCanCode + 1
  9.             sTotalTweetsCoders = sTotalTweetsCoders + sTweets30Days
  10.             sTotalFollowersCoders = sTotalFollowersCoders + sTwitterFollowers
  11.             sTotalAgeCoders = sTotalAgeCoders + sAge
  12.         Else
  13.             'Accumulate data for noncoders.
  14.             sNumberCannotCode = sNumberCannotCode + 1
  15.             sTotalTweetsNoncoders = sTotalTweetsNoncoders + sTweets30Days
  16.             sTotalFollowersNoncoders = sTotalFollowersNoncoders + sTwitterFollowers
  17.             sTotalAgeNoncoders = sTotalAgeNoncoders + 1
  18.         End If
  19.     Else
  20.         ...
  21.     End If
  22. Next sCurrentRecordNumber
  23. ...
  24. 'Compute averages for coders.
  25. sAvgFollowersCoders = sTotalFollowersCoders / sNumberCanCode
  26. sAvgTweetsCoders = sTotalTweetsCoders / sNumberCanCode
  27. sAvgAgeCoders = sTotalAgeCoders / sNumberCanCode
  28. 'Compute averages for noncoders.
  29. sAvgFollowersNoncoders = sTotalFollowersNoncoders / sNumberCannotCode
  30. sAvgTweetsNoncoders = sTotalTweetsNoncoders / sNumberCannotCode
  31. sAvgAgeNoncoders = sTotalAgeNoncoders / sNumberCannotCode

This part is the key:

  • If tWriteCodeIn = "yes" Then
  •     'Accumulate data for coders.
  •     ...
  • Else
  •     'Accumulate data for noncoders.
  •     ...
  • End If

The program uses different sets of total variables, depending on the value of tWriteCodeIn.

Georgina
Georgina

Oo! The If statement...

  • If tWriteCodeIn = "yes" Then

... only works because the whether-can-code data is normalized, right?

That's right! This line...

  • tWriteCodeIn = LCase(Trim(tWriteCodeIn))

... is before the If. It makes sure that "YES", "yes ", " Yes ", and other weird combinations are all converted to "yes". So the If just tests for that.

Notice the normalization code is mixed with the validation code. That's usual, since it's easier to validate data that's normalized first.

Ethan
Ethan

I've noticed If statements have many uses, not just validation.

Aye! Like screws in a car. Screws are used to mount headlights, keep the fuse box in place, lock the seats down, lots of things. Screws do one job: hold parts together. But they do that function in lots of different places.

If statements, variables, expressions... they're components you put together in different ways to get what you want.

Onward!

OK, you've added filtering to your data analytics toolkit, along with validation, normalization, accumulation (totals and counts), and other things.

Now for another thing people will ask you to do: finding the smallest and largest.