NLP - Text tagging tools
I have a set of keywords/phrases I would like to find in a user's input, e.g. cities. My set contains about 100 000 items. The most important issue is that I want to recognize them even if the user makes a typo. For example:

The best city i've ever visited is Barceloma, however New Yuork is also pretty cool, because of amazing views.

The expected result should be Barcelona and New York.

I have a basic algorithm, but unfortunately it takes very long to run. It's not possible to run it concurrently by splitting the text on whitespace, because I would like to recognize multi-word names, e.g. New York, San Francisco.

Is there any tool which may help me? I thought NLTK might be one such library, but I'm not able to use it correctly for this. Is it possible to use Elasticsearch or a neural network to achieve it?

Thanks for help
I think using neural networks might not be very effective unless you have access to a really fast computer. It will be hard to train a single neural network on 100000 words.

So here's how I would do it:

    For every word in your text:
        For every search_word in your keywords:
            var diff = calculateDifference(word, search_word);
            if (diff < threshold):
                log('Found a word!')

So basically, instead of checking whether a word exactly matches a word you're searching for (==), you calculate the difference between the word and the word you're searching for. An example of this difference might be:

    word = Text
    search_word = Test
    // Every letter that is different gets added to the difference, divided by the word length
    difference = 1 / 4 = 0.25

    word = Key
    search_word = Bees
    // K vs B, y vs e, missing letter vs s
    difference = (1 + 1 + 1) / 4 = 0.75

However, doing it this way might cause unwanted matches; maybe the user really meant to type Text, and it was not a typo for Test. If you want the matching to take context into account, you need to switch to neural networks (LSTMs), but that is pretty advanced.

As you brought up multi-word searches, here is how you could do it:

    For every word in your text:
        For every search_word in your keywords:
            var diff = calculateDifference(word, search_word);
            if (diff < threshold):
                log('Found a word!')
        For every search_phrase in your keyphrases:
            var level = search_phrase.level;
            var diff = calculateDifference(word, search_phrase[level]);
            if (diff < threshold):
                search_phrase.level++;
                if (level == search_phrase.wordcount):
                    log('Found a phrase!')
                    search_phrase.level = 1;
            else:
                search_phrase.level = 1;

So basically, you keep an array of phrases as well (one-word phrases are also possible, so you can keep everything in one array). Each phrase has a word count and a level, e.g.:

    New York, wordcount = 2
    New York, level = 1 -> New
    New York, level = 2 -> York

As long as you haven't found the first word of a phrase yet, that phrase's level is 1, and you keep looking for the level = 1 word, i.e. the first word.

If you find the first word, you increase the level by 1. Then you look for the word in the phrase with level = 2. If the next word in the text is not that search word, you reset the level to 1. If it IS that word, you increase the level again; however, if level == wordcount, you have found the whole phrase, so you report it and reset the level as well.

I hope that makes sense.
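For illustration, here is a minimal Python sketch of the approach described above. Several things in it are assumptions and not part of the answer: the function names (`positional_difference`, `edit_difference`, `find_phrases`), the 0.3 threshold, the tokenizing regex, and the 0-based levels. `positional_difference` is the letter-by-letter measure from the Text/Test example. Since an inserted letter shifts every later position under that measure (Yuork vs York scores 0.8), the sketch defaults to a normalized Levenshtein edit distance instead, which is a swapped-in technique. After a mismatch it also retries the current word as a possible phrase start, a small deviation from simply resetting the level.

```python
import re


def positional_difference(word, target):
    """Measure from the answer: share of positions whose letters differ."""
    longest = max(len(word), len(target))
    if longest == 0:
        return 0.0
    # Positions past the end of the shorter word count as mismatches.
    mismatches = sum(
        1 for i in range(longest)
        if i >= len(word) or i >= len(target) or word[i] != target[i]
    )
    return mismatches / longest


def edit_difference(word, target):
    """Swapped-in measure: Levenshtein distance divided by the longer length."""
    previous = list(range(len(target) + 1))
    for i, a in enumerate(word, start=1):
        current = [i]
        for j, b in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,             # drop a letter from word
                current[j - 1] + 1,          # insert a letter from target
                previous[j - 1] + (a != b),  # substitute, or keep if equal
            ))
        previous = current
    longest = max(len(word), len(target))
    return previous[-1] / longest if longest else 0.0


def find_phrases(text, phrases, threshold=0.3, difference=edit_difference):
    """Scan the text once, keeping a 0-based 'level' for every phrase."""
    words = re.findall(r"[a-z']+", text.lower())
    split_phrases = [phrase.lower().split() for phrase in phrases]
    levels = [0] * len(phrases)
    found = []
    for word in words:
        for k, parts in enumerate(split_phrases):
            if difference(word, parts[levels[k]]) < threshold:
                levels[k] += 1
                if levels[k] == len(parts):  # last word of the phrase matched
                    found.append(phrases[k])
                    levels[k] = 0
            else:
                # Reset, but let this word start a new attempt (assumption).
                restart = levels[k] > 0 and difference(word, parts[0]) < threshold
                levels[k] = 1 if restart else 0
    return found


text = ("The best city i've ever visited is Barceloma, however New Yuork "
        "is also pretty cool, because of amazing views.")
print(find_phrases(text, ["Barcelona", "New York", "San Francisco"]))
# prints ['Barcelona', 'New York']
```

Note that this still compares every word of the text with every phrase, so on its own it will not make a 100 000 item list fast; it only shows the matching logic.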