elasticsearch


Retrieve docs that contain only allowed tags (exact match)


Each search request comes with a list of allowed tags. For example:
["search", "open_source", "freeware", "linux"]
I want to retrieve only documents whose tags are all in this list. For example, I want to retrieve:
{
"tags": ["search", "freeware"]
}
and exclude
{
"tags": ["search", "windows"]
}
because the allowed list doesn't contain the windows tag.
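Put differently, a document should match when its tags form a subset of the allowed list. A minimal Python sketch of that rule, outside Elasticsearch (the helper name is_allowed is made up for illustration):

```python
def is_allowed(doc_tags, allowed_tags):
    """Return True when every tag of the document is in the allowed list."""
    return set(doc_tags) <= set(allowed_tags)


allowed = ["search", "open_source", "freeware", "linux"]

print(is_allowed(["search", "freeware"], allowed))  # True
print(is_allowed(["search", "windows"], allowed))   # False
```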
There is an example of exact matching in the Elasticsearch documentation:
https://www.elastic.co/guide/en/elasticsearch/guide/current/_finding_multiple_exact_values.html
First, we include a field that maintains the number of tags:
{ "tags" : ["search"], "tag_count" : 1 }
{ "tags" : ["search", "open_source"], "tag_count" : 2 }
Second, we query with the required tag_count:
GET /my_index/my_type/_search
{
"query": {
"filtered" : {
"filter" : {
"bool" : {
"must" : [
{ "term" : { "tags" : "search" } },
{ "term" : { "tags" : "open_source" } },
{ "term" : { "tag_count" : 2 } }
]
}
}
}
}
}
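Keeping tag_count in sync is an indexing-time concern. A minimal sketch of how it might be computed before a document is sent to Elasticsearch; the helper name with_tag_count is hypothetical, and counting distinct tags is an assumption (a term clause can only match each distinct tag once):

```python
def with_tag_count(doc):
    """Return a copy of the document with tag_count set to the number of distinct tags."""
    enriched = dict(doc)
    enriched["tag_count"] = len(set(doc.get("tags", [])))
    return enriched


print(with_tag_count({"tags": ["search", "open_source"]}))
```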
The problem is that I don't know tag_count in advance.
I have also tried writing a query with a script_field tags_count, putting each allowed tag in a terms query and setting minimum_should_match to tags_count, but I can't use a script variable in minimum_should_match.
What else can I look into?
I admit this is not a great solution, but maybe it will inspire better ones.
Suppose the records you are searching look like the ones in your post, with a tag_count field:
"tags" : ["search"],
"tag_count" : 1
or
"tags" : ["search", "open_source"],
"tag_count" : 2
And the list of allowed tags in your request is:
["search", "open_source", "freeware"]
Then you might programmatically generate a query like:
{
"query" : {
"bool" : {
"should" : [
{
"bool" : {
"must" : [
{ "term" : { "tag_count" : 1 } }
],
"should" : [
{ "term" : { "tags" : "search" } },
{ "term" : { "tags" : "open_source" } },
{ "term" : { "tags" : "freeware" } }
],
"minimum_should_match" : 1
}
},
{
"bool" : {
"must" : [
{ "term" : { "tag_count" : 2 } }
],
"should" : [
{ "term" : { "tags" : "search" } },
{ "term" : { "tags" : "open_source" } },
{ "term" : { "tags" : "freeware" } }
],
"minimum_should_match" : 2
}
},
{
"bool" : {
"must" : [
{ "term" : { "tag_count" : 3 } }
],
"should" : [
{ "term" : { "tags" : "search" } },
{ "term" : { "tags" : "open_source" } },
{ "term" : { "tags" : "freeware" } }
],
"minimum_should_match" : 3
}
}
],
"minimum_should_match" : 1
}
}
}
The number of nested bool queries equals the number of tags in the request. That is not great for a number of reasons, but with small tag lists and small indices you can perhaps get away with it. Each clause handles one possible value of tag_count: tag_count is a required clause and minimum_should_match equals tag_count, so a document matches only when all of its tags are among the allowed ones. tag_count has to be required rather than one more should clause; otherwise a document with two allowed tags and one other tag would satisfy the clause for tag_count 1 through its tags alone.
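Generating such a query programmatically could look roughly like the sketch below. The function name build_allowed_tags_query is made up, and the sketch assumes tag_count holds the number of distinct tags. It makes tag_count a required (must) clause, so that a document with extra tags cannot satisfy a clause meant for a lower count.

```python
import json


def build_allowed_tags_query(allowed_tags):
    """Build one bool clause per possible tag_count, from 1 up to len(allowed_tags)."""
    tags = sorted(set(allowed_tags))
    clauses = []
    for count in range(1, len(tags) + 1):
        clauses.append({
            "bool": {
                # The document must have exactly `count` tags ...
                "must": [{"term": {"tag_count": count}}],
                # ... and `count` of the allowed tags must match, i.e. all of its tags.
                "should": [{"term": {"tags": tag}} for tag in tags],
                "minimum_should_match": count,
            }
        })
    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}


query = build_allowed_tags_query(["search", "open_source", "freeware"])
print(json.dumps(query, indent=2))
```

The query grows quadratically with the number of allowed tags (N clauses of N terms each), which is why this only suits short tag lists.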
If the index is of medium size and the cardinality of tags is fairly low, I would just use a terms aggregation to get the distinct tags and build must and must_not filters that exclude docs containing tags you don't "allow". The list of all tags can be cached in an in-memory database such as Redis. A few ways to keep that cache fresh come to mind:
Have a time-to-live of a few minutes or hours, and re-generate the list when the cache has expired
Have a background process refresh the list at regular intervals
Update the list when new docs are inserted; doc deletions then need to be handled as well
A more performant and 100% accurate method could look like this:
Query all documents which have the requested tags but exclude docs with known other tags (as with the first solution)
Go through the list of returned docs
If a doc contains a tag which is not "allowed", that tag wasn't in the known tags cache and must be added there; exclude this doc from the result set
Tags in Redis could have a TTL of, for example, one day or one week. This way old tags are pruned automatically and you get simpler ES queries
This way you don't need a background process to maintain the list of tags, and you avoid the possibly heavy terms aggregation, which hits all docs. You always get the correct result set and fairly performant queries.
This wouldn't work if subsequent aggregations are used, because ES might return false positives which are only pruned on the client side. However, this can be detected by adding a terms aggregation as well and confirming that it doesn't contain unexpected tags. If it does, those tags need to be added to the tag cache and to the must_not filter, and the query has to be re-executed. This isn't ideal if new tags are created frequently.
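The query-then-prune loop described above might be sketched as follows. This is only an illustration under a few assumptions: a plain Python set stands in for the Redis cache, the function names build_filter and prune_hits are invented, and the bool filter / must_not syntax assumes Elasticsearch 2.x or later.

```python
def build_filter(allowed_tags, known_tags):
    """Require at least one allowed tag and exclude every known tag that is not allowed."""
    allowed = set(allowed_tags)
    disallowed = sorted(set(known_tags) - allowed)
    bool_query = {"filter": [{"terms": {"tags": sorted(allowed)}}]}
    if disallowed:
        bool_query["must_not"] = [{"terms": {"tags": disallowed}}]
    return {"query": {"bool": bool_query}}


def prune_hits(docs, allowed_tags, known_tags):
    """Drop docs carrying unexpected tags and remember those tags in the cache."""
    allowed = set(allowed_tags)
    kept = []
    for doc in docs:
        unexpected = set(doc["tags"]) - allowed
        if unexpected:
            known_tags.update(unexpected)  # the next query will exclude these tags
        else:
            kept.append(doc)
    return kept


known = {"search", "freeware", "windows"}  # stand-in for the tag cache
body = build_filter(["search", "freeware"], known)
hits = [{"tags": ["search"]}, {"tags": ["search", "macos"]}]  # pretend ES returned these
print(prune_hits(hits, ["search", "freeware"], known))
```

After the call, "macos" is in the cache, so the next generated query excludes it on the Elasticsearch side.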
Why not use a bool query with windows added to the must_not clause? I hope that's what you are looking for.
@Sergey Shuvalov, another way to do this without scripting is to create another field whose value contains all the tags, sorted and separated by a comma (or whatever separator suits you).
So for instance, if you have a document like this:
{
"tags": ["search", "open_source", "freeware", "linux"]
}
You'd create another field alltags which contains the same tags, but sorted in lexicographical order and separated by commas, like this:
{
"tags": ["search", "open_source", "freeware", "linux"],
"alltags": "freeware,linux,open_source,search"
}
That new alltags field would be not_analyzed and thus have the following mapping:
{
"mappings": {
"doc": {
"properties": {
"tags": {
"type": "string"
},
"alltags": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
Then you can issue a simple term query like the one below. You just need to make sure the tags in the query are sorted the same way, and you'll get the documents whose tags are exactly that set.
{
"query": {
"term": {
"alltags": "freeware,linux,open_source,search"
}
}
}
If you have a long list of tags, you might also decide to produce an MD5 or SHA1 hash of the sorted list of tags, store only that value in the alltags field and use the same value during the search. The bottom line is that you need to produce some kind of "signature" for your tag list and know that the signature will always be the same given the same set of tags. The sky is the limit!
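A small sketch of how such a signature could be produced on the client, both as the plain sorted string and as a SHA1 digest. The function names are invented for this example, and it assumes that no tag contains the separator itself.

```python
import hashlib


def tags_signature(tags, separator=","):
    """Join the distinct tags in lexicographical order, e.g. for the alltags field."""
    return separator.join(sorted(set(tags)))


def tags_digest(tags):
    """SHA1 hex digest of the signature, useful when the tag list is long."""
    return hashlib.sha1(tags_signature(tags).encode("utf-8")).hexdigest()


print(tags_signature(["search", "open_source", "freeware", "linux"]))
# freeware,linux,open_source,search
print(tags_digest(["linux", "search", "freeware", "open_source"]))
```

The same function has to be used at indexing time and at search time, so that the order of the tags in the request does not matter.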
As I mentioned earlier, I combined two nice answers. This is what I have:
"query" : {
"bool":{
"should":[
{"term":{"tag_count":1}},
{
"bool":{
"should":[
{"term":{"tags":"search"}},
{"term":{"tags":"open_source"}},
{"term":{"tags":"freeware"}}
],
"filter":{"term":{"tag_count":2}},
"minimum_should_match":2
}
},
{
"bool":{
"should":[
{"term":{"tags":"search"}},
{"term":{"tags":"open_source"}},
{"term":{"tags":"freeware"}}
],
"filter":{"term":{"tag_count":3}},
"minimum_should_match":3
}
},
{
"script": {
"script": "tags.containsAll(doc['tags'].values)",
"params": {"tags":["search", "open_source", "freeware"]}
}
}
],
"filter":{ "terms" : {"tags" :["search", "open_source", "freeware"]}},
"minimum_should_match":1
}
}
The script condition handles the non-trivial cases, and the other conditions are there to cover the simple cases.
