In lodash there is the words method that can be used to quickly perform lexical analysis, or tokenization, of a string. In other words, the lodash words method splits a string into an array of substrings, where each substring is a single word from the given source string.
In some cases this can be done easily with the native split method, but it is not always so cut and dried. Some text samples contain characters, such as punctuation, that need to be cut out or kept as part of the process. So there is a need for some kind of tokenizer method that is better suited to the task of creating an array of words from a text sample.
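To illustrate the problem with split, here is a minimal sketch in plain JavaScript. The sample sentence and the character class used in the match call are just assumptions that I made up for the sake of this example; they are not the pattern that lodash uses.

```javascript
// A made-up text sample with some punctuation in it
const sampleText = 'Hello, world! This is a test.';

// Splitting on a space leaves the punctuation attached to the words
const splitResult = sampleText.split(' ');
console.log(splitResult);
// [ 'Hello,', 'world!', 'This', 'is', 'a', 'test.' ]

// Matching runs of letters, digits, and apostrophes drops the punctuation.
// This character class is only an illustration, not what lodash uses.
const matchResult = sampleText.match(/[A-Za-z0-9']+/g) || [];
console.log(matchResult);
// [ 'Hello', 'world', 'This', 'is', 'a', 'test' ]
```

The match approach is closer to what a tokenizer does, as it describes what counts as a word rather than what separates words.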
If the full version of lodash is part of the stack of the project that you are working on, there is the _.words method that can be used to quickly get an array of words from a text sample in a string. The default pattern should work okay in most situations, but if for some reason it does not, it is possible to override it by passing a pattern as the second argument when calling the method.
In some situations I might want to do some processing of the text beforehand, or use a custom pattern given as the second argument to lodash words. Say that I have some text that has camel case words in it, or in other words some words that start out lower case but then have an upper case letter in them. The default pattern that is used in lodash words will break that kind of word into two or more words, which might not be the result that I want.
So then the solution is to make all the text lowercase before I pass it to lodash words, or use a custom pattern that will not split up words like that.
The lodash words method is then one user space option for splitting a string of words into an array of substrings, where each substring is a single word from the given source string. This method might work okay when it comes to English sentences, but even then I might still want to use some other option in certain situations.
If you enjoyed this post you might want to check out my main post on lodash in general, or one of my many other posts on the topic of lodash. However, there is also a whole lot to be aware of when it comes to libraries outside of lodash, such as one called natural.js. The markdown parser project known as marked also comes to mind when thinking about this sort of thing, as tokenizing is not just a matter of creating an array of words from plain text, but also applies to languages and forms of markup such as markdown.