Skip to content

The curious case of missing hyphens

Posted on:September 27, 2023 at 12:13 PM

The curious case of hyphens

So today I was working on a simple search functionality for an application. The job was simple. Given an article and a set of company names, look if the article contains any of the names. If it does, then report a match, otherwise move to the next article. Pretty simple, right?

There I encountered an article containing the following sentence

XXXXXXX Corp. (“XXXXXX”; TSX‐V: XXX) today reported initial
results  ...... (rest of the article)

Names are masked to protect IP (anyways they are irrelevant here).

The interesting thing was, TSX‐V: XXX was there in our wordlist, and still I wasn’t getting any matches. And it wasn’t that I was using any fancy method of searching. I was simply using Python’s

string.find('TSX-V: XXX')

method.

After several brute force attempts, I found out that the hyphen present in news article , is different from the normal hyphen - present on our keyboard! (Now that I have typed these two hyphens independently, it is evident from their widths that both of them are different!)

The normal hyphen - Unicode representation is U+002D, while the Unicode representation of is U+2010.

Similarly in python interpreter,

ord('-') # Normal hyphen

gives 45, while

ord('‐') # The one in the JNM article

gives 8208.

Well, half an hour wasted because news agencies apparently can’t use the common ASCII symbols! Anyways, hope it helps someone else :)