Configuration "Wikitext Search"

From dataspects::Wiki
C0927114931





This configuration uses two pieces of each page: the MediaWiki page name and the full wikitext.

Content to MongoDB

They are fed to MongoDB as follows:

  • MediaWiki page name → namespace and pagename
  • full wikitext → full.wikitext

Indexed to Elasticsearch

Subsequently these fields are indexed as follows:

  • namespace and pagename → HasEntityName: "#{pageMongodoc['namespace']}#{pageMongodoc['pagename']}"
  • full.wikitext → HasEntityContentTEXT: pageMongodoc['full']["wikitext"]
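
The two mappings above can be sketched as a single Ruby function that turns a MongoDB page document into the document sent to Elasticsearch. The function name `build_index_document` and the sample values are illustrative assumptions. Only the two field expressions come from the list above.

```ruby
# Hypothetical helper: derives the Elasticsearch fields from a MongoDB page
# document, using the two expressions listed above.
def build_index_document(page_mongodoc)
  {
    'HasEntityName'        => "#{page_mongodoc['namespace']}#{page_mongodoc['pagename']}",
    'HasEntityContentTEXT' => page_mongodoc['full']['wikitext']
  }
end

# Sample input with made-up values.
sample = {
  'namespace' => 'Help:',
  'pagename'  => 'Editing',
  'full'      => { 'wikitext' => '== Intro ==' }
}

index_doc = build_index_document(sample)
# index_doc['HasEntityName']        is "Help:Editing"
# index_doc['HasEntityContentTEXT'] is "== Intro =="
```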

Index Settings and Mappings

The index uses the following settings and mappings:

Settings

 {
   "analysis": {
     "analyzer": {
       "camel_case": {
         "type": "pattern",
         "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
       },
       "my_ngram_tokenizer_analyzer": {
         "type": "custom",
         "tokenizer": "my_ngram_tokenizer",
         "filter": "lowercase"
       }
     },
     "tokenizer": {
       "my_ngram_tokenizer": {
         "type": "ngram",
         "min_gram": 3,
         "max_gram": 3
       }
     }
   }
 }

Mappings

 {
   "properties": {
     "HasEntityName": {
       "type": "text",
       "analyzer": "camel_case"
     },
     "HasEntityContentTEXT": {
       "type": "text",
       "fielddata": true,
       "analyzer": "my_ngram_tokenizer_analyzer",
       "term_vector": "with_positions_offsets_payloads",
       "copy_to": [
         "completion"
       ]
     }
   }
 }
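
To get a feel for what the two analyzers emit, the following Ruby sketch approximates them outside Elasticsearch. It is only an approximation: the pattern is ported from Java regex syntax to Ruby's, the capture group is dropped because `String#split` would otherwise return the separators, and lowercasing is applied because the pattern analyzer lowercases by default. The function names are assumptions for illustration.

```ruby
# Ruby port of the camel_case pattern (capture group removed, see lead-in).
# It splits on runs of non-letter/non-digit characters, on letter/digit
# boundaries, and on lower-to-upper case transitions.
CAMEL_CASE_PATTERN = /[^\p{L}\d]+|(?<=\D)(?=\d)|(?<=\d)(?=\D)|(?<=[\p{L}&&[^\p{Lu}]])(?=\p{Lu})|(?<=\p{Lu})(?=\p{Lu}[\p{L}&&[^\p{Lu}]])/

# Approximates the "camel_case" pattern analyzer: split, drop empty tokens,
# lowercase.
def camel_case_tokens(text)
  text.split(CAMEL_CASE_PATTERN).reject(&:empty?).map(&:downcase)
end

# Approximates "my_ngram_tokenizer_analyzer": every window of exactly three
# characters (min_gram = max_gram = 3), lowercased afterwards.
def trigram_tokens(text)
  text.each_char.each_cons(3).map { |gram| gram.join.downcase }
end

p camel_case_tokens('HasEntityName') # prints ["has", "entity", "name"]
p trigram_tokens('Wiki')             # prints ["wik", "iki"]
```

Because HasEntityContentTEXT is indexed as trigrams, a search term can match in the middle of a word, while HasEntityName matches on the individual words of a CamelCase page name.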

Query

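The query matches the search string against both fields in a bool/should clause and boosts matches on HasEntityName by a factor of 10: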

{
  "query": {
    "bool": {
      "should": [
        {
          "query_string": {
            "default_field": "HasEntityName",
            "query": "{{query.params.queryString}}",
            "boost": "10"
          }
        },
        {
          "query_string": {
            "default_field": "HasEntityContentTEXT",
            "query": "{{query.params.queryString}}"
          }
        }
      ]
    }
  }
}
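
A minimal sketch of how the template above could be filled in from Ruby, assuming the `{{query.params.queryString}}` placeholder is simply replaced by the user's search string. The function name `build_search_query` is hypothetical. Building the body as a hash and serializing it with the standard `json` library avoids broken JSON when the search string contains quotes.

```ruby
require 'json'

# Hypothetical helper: returns the query body from above with the search
# string inserted into both query_string clauses.
def build_search_query(query_string)
  {
    'query' => {
      'bool' => {
        'should' => [
          {
            'query_string' => {
              'default_field' => 'HasEntityName',
              'query'         => query_string,
              'boost'         => '10' # same value as in the template
            }
          },
          {
            'query_string' => {
              'default_field' => 'HasEntityContentTEXT',
              'query'         => query_string
            }
          }
        ]
      }
    }
  }
end

# Serialize to the JSON request body; the search term is a made-up example.
puts JSON.pretty_generate(build_search_query('SemanticMediaWiki'))
```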