Calculating Word Counts

A couple of days ago, Jason -- a.k.a. -- asked for a word-count feature in 10Cv5. This would be displayed on the Archives page and, potentially, on the dedicated page for a given post. This is a concept that I've played with a couple of times in the past and intentionally left out of v5, primarily because of how complex it is to get right, and getting it right is rather important.

The largest stumbling block that I've generally had when developing word-counting systems involves languages that are not written with space-separated words. English and many other languages make counting easy by conveniently separating words with spaces. Common East Asian languages such as Japanese and Chinese, however, do not do this. One word runs right into the next and it's up to the reader to parse and make sense of the written text. So, if a person wanted to create a complete word-counting solution, what would it look like?

As it stands, 10Cv5 already has a unique-word counting system in place that is used by the search mechanism. When a post of any type is published, the text is split apart on spaces and the distinct words are stored in a lookup table. This means that if someone were to type "bright yellow" into the search field on this site, the API would know to split the query into "bright" and "yellow", search the lookup table to get a list of posts, then pull back the first X results sorted by relevancy score. That score generally ranks exact string matches first, then posts that have all the search terms, then posts that contain only some of them. It's nowhere near Google-level efficiency, but it works rather well. And, because every post has its individual words stored in a table, it's a piece of cake to say how many distinct words a post has.
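To make the idea concrete, here is a minimal Python sketch of that kind of lookup table and three-tier ranking. It is not the 10Cv5 code: the function names, the in-memory dictionary standing in for the database table, and the tie-breaking by post id are all assumptions made for illustration.

```python
from collections import defaultdict

def build_index(posts):
    """Map each distinct word to the set of post ids containing it.

    `posts` is a hypothetical {post_id: text} dict standing in for the
    real lookup table. Text is split on whitespace only, so "yellow,"
    and "yellow" would be stored as different words.
    """
    index = defaultdict(set)
    for post_id, text in posts.items():
        for word in set(text.lower().split()):
            index[word].add(post_id)
    return index

def distinct_word_count(text):
    """Number of distinct space-separated words (not a true word count)."""
    return len(set(text.lower().split()))

def search(posts, index, query, limit=10):
    """Return up to `limit` post ids, best match first."""
    phrase = query.lower()
    terms = phrase.split()

    # Any post containing at least one search term is a candidate.
    candidates = set()
    for term in terms:
        candidates |= index.get(term, set())

    def tier(post_id):
        text = posts[post_id].lower()
        words = set(text.split())
        if phrase in text:
            return 0  # exact string match
        if all(term in words for term in terms):
            return 1  # every term present, but not as a phrase
        return 2      # only some terms present

    # Ties inside a tier are broken by post id (an arbitrary choice).
    return sorted(candidates, key=lambda pid: (tier(pid), pid))[:limit]

posts = {
    1: "a bright yellow sun",
    2: "yellow and bright",
    3: "yellow submarine",
    4: "nothing here",
}
index = build_index(posts)
results = search(posts, index, "bright yellow")
```

With the sample posts above, post 1 ranks first for containing the exact phrase, post 2 second for containing both terms, and post 3 last for containing only one. Post 4 is never a candidate.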

But this isn't a word count, nor does it work for East Asian languages.

One way to solve for the language problem would be to have dictionaries to compare words against. This would make it possible to more accurately determine how many words are in a post so long as the spelling is correct and there is a minimum of jargon or slang. This would be inefficient, but it would work.
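One common way to do dictionary-based splitting is greedy longest-match: at each position, take the longest dictionary entry that fits, and fall back to a single character when nothing matches. The sketch below assumes that technique and uses a tiny, made-up dictionary; a real one would hold tens of thousands of entries, and the function name `segment` is purely illustrative.

```python
def segment(text, dictionary):
    """Split unspaced text into words using greedy longest-match.

    `dictionary` is a set of known words. Characters that match no
    entry are emitted one at a time, which is why jargon, slang, and
    misspellings inflate the resulting count.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest possible chunk first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if size == 1 or chunk in dictionary:
                words.append(chunk)
                i += size
                break
    return words

# Hypothetical four-entry dictionary, for illustration only.
japanese = {"私", "は", "学生", "です"}
tokens = segment("私は学生です", japanese)
word_count = len(tokens)
```

Here `tokens` comes out as the four entries in order, so the word count is 4. The inner loop is also where the inefficiency shows up: every position may require several dictionary lookups, and greedy matching can still pick the wrong split when entries overlap.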

Another option would be to just focus on the alphabet-based languages and split on the white space, counting words that contain at least one character. But this would raise some other questions, such as "what is a word?". Does […] count as a word? Is an emoji a word? Is a series of strung-together emoji a word? How about numbers? Punctuation? Would it make sense to limit the concept of a "word" to text entities consisting of the letters A to Z in both upper and lower case plus the digits 0 through 9? This would certainly simplify things, but it would also be an incomplete solution.
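The difference between the two definitions is easy to see in a few lines of Python. This is a sketch under one possible reading of the narrow rule: a token counts as a word if it contains at least one of A-Z, a-z, or 0-9, so attached punctuation is tolerated. Both function names are made up for the example.

```python
import re

# At least one ASCII letter or digit somewhere in the token.
ALNUM = re.compile(r"[A-Za-z0-9]")

def count_words_loose(text):
    """Every whitespace-separated token counts, whatever it contains."""
    return len(text.split())

def count_words_strict(text):
    """Only tokens containing at least one A-Z, a-z, or 0-9 count."""
    return sum(1 for token in text.split() if ALNUM.search(token))

sample = "Hello, world! 🎉🎉 42 ... café"
loose = count_words_loose(sample)
strict = count_words_strict(sample)
```

For the sample string, the loose count includes the emoji run and the bare ellipsis, while the strict count drops both. The incompleteness shows up immediately with other scripts: `count_words_strict("日本語")` returns 0.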

Rather than a word count, I wonder if it would make more sense to go with a Medium-style "This post will take X minutes to read" informational line, as this can generally be universal across languages. The average person reads an A4 page of text at roughly the same speed regardless of the language, so counting the number of characters would make a time-to-read calculation relatively easy.
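A character-based estimate could be as small as the sketch below. The rate of 1,000 characters per minute is a placeholder assumption, not a measured figure, and it would need tuning (quite possibly per script) before being shown to readers. The function name and the choice to ignore whitespace are also just illustrative.

```python
import math

def minutes_to_read(text, chars_per_minute=1000):
    """Estimate reading time from the number of non-whitespace characters.

    `chars_per_minute` is an assumed placeholder rate. The result is
    rounded up and never drops below one minute.
    """
    chars = sum(1 for ch in text if not ch.isspace())
    return max(1, math.ceil(chars / chars_per_minute))

estimate = minutes_to_read("a" * 2500)
label = f"This post will take {estimate} minutes to read"
```

Rounding up and clamping to a minimum of one minute avoids a "0 minutes to read" line on very short posts.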

Word counts, time-to-read informationals, and other metrics are certainly appealing ideas. I wonder if these are things that readers generally look for, though.