How to index a pdf file using Elasticsearch ingest-attachment plugin?


I have to implement full-text search over a PDF document using the Elasticsearch ingest-attachment plugin. When I search for the word someword in the PDF document, I get an empty hits array.
//Code for creating pipeline
PUT _ingest/pipeline/attachment
{
"description" : "Extract attachment information",
"processors" : [
{
"attachment" : {
"field" : "data",
"indexed_chars" : -1
}
}
]
}
//Code for indexing the document through the pipeline
PUT my_index/my_type/my_id?pipeline=attachment
{
"filename" : "C:\\Users\\myname\\Desktop\\bh1.pdf",
"title" : "Quick",
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
//Code for searching the word in pdf
GET /my_index/my_type/_search
{
"query": {
"match": {
"data" : {
"query" : "someword"
}
}
}
}
When you index your document with the second command, passing the Base64-encoded content, the attachment processor adds an attachment object and the stored document looks like this:
{
"filename": "C:\\Users\\myname\\Desktop\\bh1.pdf",
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
"attachment": {
"content_type": "application/rtf",
"language": "ro",
"content": "Lorem ipsum dolor sit amet",
"content_length": 28
},
"title": "Quick"
}
So your query needs to search the attachment.content field, which holds the extracted text, and not the data field, which only carries the raw Base64 content during indexing.
Change the field name in your query from data to attachment.content and it will work:
POST /my_index/my_type/_search
{
"query": {
"match": {
"attachment.content": {
"query": "lorem"
}
}
}
}
PS: Prefer POST over GET when sending a request body, since some HTTP clients do not send a body with GET requests.
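
One more thing worth checking is what the data field really contains. The pipeline only sees the Base64 string in data; the filename field is plain metadata and is never read from disk. The sketch below decodes the sample string from the question, which turns out to be a short RTF snippet (hence "content_type": "application/rtf"), so a word from bh1.pdf cannot match until the PDF's own bytes are encoded and sent. The helper names encode_file and build_index_body are hypothetical, made up for this illustration, and the code only builds the JSON body; sending it to Elasticsearch is left to whatever client you use.

```python
import base64
import json
import sys

# Sample payload copied from the question. It is NOT the PDF from "filename".
SAMPLE_DATA = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

# Decoding it shows the bytes the attachment processor actually received.
decoded_sample = base64.b64decode(SAMPLE_DATA)


def encode_file(path):
    """Read a file as raw bytes and return them as a Base64 ASCII string."""
    with open(path, "rb") as handle:
        return base64.b64encode(handle.read()).decode("ascii")


def build_index_body(path, title):
    """Build the JSON body for the indexing request (hypothetical helper).

    Assumes the same field names as the question: filename, title, data.
    """
    return {
        "filename": path,           # metadata only, not read by the pipeline
        "title": title,
        "data": encode_file(path),  # the real file content, Base64-encoded
    }


if __name__ == "__main__":
    # Show the decoded sample: an RTF snippet, not a PDF.
    print(decoded_sample)
    # Optionally pass a file path to print a ready-to-send request body.
    if len(sys.argv) > 1:
        print(json.dumps(build_index_body(sys.argv[1], "Quick"), indent=2))
```

Running the script with the path to bh1.pdf as its argument prints a body you can send with PUT my_index/my_type/my_id?pipeline=attachment. After that, a match query on attachment.content should find words that actually occur in the PDF.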