Case-Sensitive Removal of Duplicates in Excel

By default, Microsoft Excel uses case-insensitive comparison of strings. This is useful in most cases. For instance, when you’re looking up a value in a column and the value you’re trying to find is written in sentence case but all values in the column are written in uppercase, Excel will still find that value without requiring you to

write a complex formula.

However, there are cases when that’s not what you want. Assume, for example, that you were given a large dataset of shipments from terminals to stations and need to calculate some totals for each terminal group. The source of the data is some strange legacy database, where terminal names are written in many different ways. What complicates your task further is that there are no terminal IDs in the dataset. Here’s what you find in the Terminal column of table Data:

Terminal

Simblica

LANTIDOTE

LAROQUE

SIGPLANT

MISPLACE

VISICA

LAROQUE

limericia

QUARWICK

simblica

Visica

Sigplant

VISICA

SIGPLANT

LANTIDOTE

QUARWICK

visica

Lantidote

Limericia

Lantidote

sigplant

Lantidote

sigplant

SIGPLANT

Sigplant

MISPLACE

quarwick

Sigplant

visica

Simblica

Visica

laroque

SIGPLANT

Misplace

quarwick

Quarwick

misplace

SIMBLICA

SIGPLANT

simblica

misplace

SIGPLANT

laroque

simblica

misplace

lantidote

Lantidote

LIMERICIA

Visica

visica

SIGPLANT

QUARWICK

LIMERICIA

visica

lantidote

Sigplant

Visica

Misplace

LANTIDOTE

limericia

visica

LAROQUE

misplace

limericia

MISPLACE

Simblica

Limericia

sigplant

LANTIDOTE

Simblica

Visica

Simblica

Limericia

misplace

lantidote

MISPLACE

Simblica

visica

Sigplant

Limericia

misplace

Lantidote

Visica

Laroque

sigplant

VISICA

SIGPLANT

LANTIDOTE

lantidote

limericia

misplace

laroque

sigplant

Sigplant

QUARWICK

misplace

Visica

You’re required to create a correspondence matrix between all the ways these terminals are written in the database and terminal groups.

Well, that’s easy, you think, just copy-paste this column and use the Remove Duplicates functionality of Excel. Here’s what you get as a result:

case-sensitive-removal-of-duplicates-in-excel-removedupl

What happened? Well, apparently, the Remove Duplicates functionality is based on Excel’s standard string comparison, which is… correct! — case-insensitive. This isn’t what you need to get as a result. Because there’s no option called Case-Sensitive Remove Duplicates, you’ll have to create a workaround manually.

If you read Excel’s help, you’ll quickly find out that it provides the EXACT function for case-sensitive comparison of strings, and we’ll be using it.

Firstly, copy-paste the whole Terminal column from Data table to a new table named Results. Then create a second column called CopyNo with the following formula:

{=COUNT(IF(EXACT([Terminal], [@Terminal]) * (ROW([Terminal]) < ROW([@Terminal])) = 1, 1, “”))}

Note that this is an array formula applying advanced on-the-fly filtering. The formula looks in the Terminal column of the current table (Results) in all rows above the current row and counts the number of cases when the terminal name written exactly as in the current row was encountered. It is then easy to filter out duplicates: go to the column’s filter and keep a tick only near zero.

We can also as easily create unique identifiers for all non-duplicate entries. We’ll do it in a new UID column with the following formula:

=IF([@CopyNo]= 0, IF(ROW() = ROW(Results[#Headers]) + 1, 1, MAX(INDIRECT(ADDRESS(ROW(Results[#Headers]) + 1, COLUMN()) & “:” & ADDRESS(ROW() – 1, COLUMN()))) + 1), 0)

The logic is as follows. If this is a duplicate, the value in the CopyNo column will be non-zero and, hence, UID will be set to zero. If this is a new terminal, we’re trying to find the maximum UID between the first row below the table’s header and the row exactly above the current row and increment this UID by one. There is a special case to be treated, however: when the current row is the first row below the table’s header. In this case, we’re setting UID to one.

The portion

INDIRECT(ADDRESS(ROW(Results[#Headers]) + 1, COLUMN()) & “:” & ADDRESS(ROW() – 1, COLUMN())))

simply gives the correct address of the range in the given column between the row exactly below the header (with row number ROW(Results[#Headers]) + 1) and the row exactly above the current row (with row number ROW() – 1). This notation is necessary to avoid referencing the current column of the table ([UID]), which will result in a circular reference.

Here’s the subset of results before filtering:

case-sensitive-removal-of-duplicates-in-excel-results-unfiltered

And here’s after the filter is set to keep only rows with CopyNo = 0:

case-sensitive-removal-of-duplicates-in-excel-results-filtered

It remains to add another column and assign terminals to groups. If necessary, duplicates can easily be deleted from this table.

One thought on “Case-Sensitive Removal of Duplicates in Excel”

ig says:

August 2, 2017 at 03:48

I must thank you for the efforts you’ve put in writing this blog.
I am hoping to check out the same high-grade
blog posts by you in the future as well. In fact, your creative writing abilities
has encouraged me to get my very own site now 😉

LikeLike

Non-Trivial Analytic Problems

Non-trivial analytic problems. Short and efficient solutions. Experienced in business environment.