Case-Sensitive Removal of Duplicates in Excel

By default, Microsoft Excel uses case-insensitive comparison of strings. This is useful in most cases. For instance, when you’re looking up a value in a column and the value you’re trying to find is written in sentence case but all values in the column are written in uppercase, Excel will still find that value without requiring you to

write a complex formula.

However, there are cases when that’s not what you want. Assume, for example, that you were given a large dataset of shipments from terminals to stations and need to calculate some totals for each terminal group. The source of the data is some strange legacy database, where terminal names are written in many different ways. What complicates your task further is that there are no terminal IDs in the dataset. Here’s what you find in the Terminal column of table Data:

Terminal
Simblica
LANTIDOTE
LAROQUE
SIGPLANT
MISPLACE
VISICA
LAROQUE
limericia
QUARWICK
simblica
Visica
Sigplant
VISICA
VISICA
SIGPLANT
LANTIDOTE
QUARWICK
visica
Lantidote
Limericia
Lantidote
sigplant
Lantidote
sigplant
SIGPLANT
Sigplant
MISPLACE
quarwick
Sigplant
visica
Simblica
Visica
laroque
SIGPLANT
Misplace
quarwick
Quarwick
misplace
SIMBLICA
SIGPLANT
simblica
misplace
misplace
SIGPLANT
laroque
simblica
misplace
lantidote
Lantidote
LIMERICIA
Visica
visica
SIGPLANT
QUARWICK
QUARWICK
LIMERICIA
visica
lantidote
Sigplant
Visica
Misplace
LANTIDOTE
limericia
visica
LAROQUE
misplace
limericia
MISPLACE
Simblica
Limericia
sigplant
LANTIDOTE
Simblica
Visica
Simblica
Limericia
misplace
lantidote
MISPLACE
Simblica
visica
Sigplant
Limericia
misplace
Lantidote
Visica
Laroque
sigplant
VISICA
SIGPLANT
LANTIDOTE
lantidote
limericia
misplace
laroque
sigplant
Sigplant
QUARWICK
misplace
Visica

You’re required to create a correspondence matrix between all the ways these terminals are written in the database and terminal groups.

Well, that’s easy, you think, just copy-paste this column and use the Remove Duplicates functionality of Excel. Here’s what you get as a result:

case-sensitive-removal-of-duplicates-in-excel-removedupl

What happened? Well, apparently, the Remove Duplicates functionality is based on Excel’s standard string comparison, which is… correct! — case-insensitive. This isn’t what you need to get as a result. Because there’s no option called Case-Sensitive Remove Duplicates, you’ll have to create a workaround manually.

If you read Excel’s help, you’ll quickly find out that it provides the EXACT function for case-sensitive comparison of strings, and we’ll be using it.

Firstly, copy-paste the whole Terminal column from Data table to a new table named Results. Then create a second column called CopyNo with the following formula:

{=COUNT(IF(EXACT([Terminal], [@Terminal]) * (ROW([Terminal]) < ROW([@Terminal])) = 1, 1, “”))}

Note that this is an array formula applying advanced on-the-fly filtering. The formula looks in the Terminal column of the current table (Results) in all rows above the current row and counts the number of cases when the terminal name written exactly as in the current row was encountered. It is then easy to filter out duplicates: go to the column’s filter and keep a tick only near zero.

We can also as easily create unique identifiers for all non-duplicate entries. We’ll do it in a new UID column with the following formula:

=IF([@CopyNo]= 0, IF(ROW() = ROW(Results[#Headers]) + 1, 1, MAX(INDIRECT(ADDRESS(ROW(Results[#Headers]) + 1, COLUMN()) & “:” & ADDRESS(ROW() – 1, COLUMN()))) + 1), 0)

The logic is as follows. If this is a duplicate, the value in the CopyNo column will be non-zero and, hence, UID will be set to zero. If this is a new terminal, we’re trying to find the maximum UID between the first row below the table’s header and the row exactly above the current row and increment this UID by one. There is a special case to be treated, however: when the current row is the first row below the table’s header. In this case, we’re setting UID to one.

The portion

INDIRECT(ADDRESS(ROW(Results[#Headers]) + 1, COLUMN()) & “:” & ADDRESS(ROW() – 1, COLUMN())))

simply gives the correct address of the range in the given column between the row exactly below the header (with row number ROW(Results[#Headers]) + 1) and the row exactly above the current row (with row number ROW() – 1). This notation is necessary to avoid referencing the current column of the table ([UID]), which will result in a circular reference.

Here’s the subset of results before filtering:

case-sensitive-removal-of-duplicates-in-excel-results-unfiltered

And here’s after the filter is set to keep only rows with CopyNo = 0:

case-sensitive-removal-of-duplicates-in-excel-results-filtered

It remains to add another column and assign terminals to groups. If necessary, duplicates can easily be deleted from this table.

Advertisements

One thought on “Case-Sensitive Removal of Duplicates in Excel

  1. I must thank you for the efforts you’ve put in writing this blog.
    I am hoping to check out the same high-grade
    blog posts by you in the future as well. In fact, your creative writing abilities
    has encouraged me to get my very own site now 😉

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s