✒️ Elasticsearch Certified Engineer notes

8.1.2022 155-minute read

✍ Personal notes and exercises created to pass the Elastic Certified Engineer certification

🏅 Certification successfully passed

⚠ The following are my personal notes, so I assume no responsibility or liability for any errors or omissions in the content.
📧 Found an error or have a question? write to me or leave a comment

🗂 Index

🗺️ Course summary
🎓 Topics notes
👨‍🏭 How to
🐳 Deepenings
💊 Pills
🤝 Advices
📔 Dictionary
🙏 Resources

🗺️ Course summary

Official summary Link - Certification FAQ - Stack subscriptions

An Elastic Certified Engineer can deploy a cluster, write precise queries and complex aggregations, create optimized mappings with custom analyzers, manage shard allocation as ingest increases, troubleshoot node issues, and more. - blog

Summary
- ⚠️ Warnings: topics seen on the exam but not covered by this article
  - Runtime fields - doc
  - Enrich processor - doc
- Data Management
  - Define an index that satisfies a given set of requirements
  - Use the Data Visualizer to upload a text file into Elasticsearch *
  - Define and use an index template for a given pattern that satisfies a given set of requirements
  - Define and use a dynamic template that satisfies a given set of requirements
  - Define an Index Lifecycle Management policy for a time-series index *
  - Define an index template that creates a new data stream *
- Searching Data
  - Write and execute a search query for terms and/or phrases in one or more fields of an index
  - Write and execute a search query that is a Boolean combination of multiple queries and filters
  - Write an asynchronous search *
  - Write and execute metric and bucket aggregations
  - Write and execute aggregations that contain sub-aggregations
  - Write and execute a query that searches across multiple clusters
- Developing Search Applications
  - Highlight the search terms in the response of a query
  - Sort the results of a query by a given set of requirements
  - Implement pagination of the results of a search query
  - Define and use index aliases
  - Define and use a search template
- Data Processing
  - Define a mapping that satisfies a given set of requirements
  - Define and use a custom analyzer that satisfies a given set of requirements
  - Define and use multi-fields with different data types and/or analyzers
  - Use the Reindex API and Update By Query API to reindex and/or update documents
  - Define and use an ingest pipeline that satisfies a given set of requirements, including the use of Painless to modify documents
  - Configure an index so that it properly maintains the relationships of nested arrays of objects
- Cluster Management
  - Diagnose shard issues and repair a cluster’s health
  - Backup and restore a cluster and/or specific indices
  - Configure a snapshot to be searchable
  - Configure a cluster for cross-cluster search
  - Implement cross-cluster replication *
  - Define role-based access control using Elasticsearch Security

🎓 Topics notes

📝 - We will explore each exam topic listed on the exam page

🧰 - Exam specs:
∙ Exam ES version: 7.13
∙ Elasticsearch 7.13 guide
∙ Kibana 7.13 guide

Legend:
⭐ - Relevant topic
💡 - Tips
🦂 - Tricky point

Be aware:
🔗 - Original links to sources are often provided
🖱️ - The code sometimes contains invalid inline
comments, included for study purposes

Examples:
🤖 - There are a lot of code examples; I suggest
trying everything on your machine
💻 - All the examples were executed on a laptop
with 16GB RAM, i5 CPU, in Docker containers

🔷 Data Management

Questions

🔹 Define an index that satisfies a given set of requirements

See 🔹 Define a mapping that satisfies a given set of requirements chapter

🔹 Use the Data Visualizer to upload a text file into Elasticsearch

Data visualizer
- [video] How to import data on Data visualizer
  Elasticsearch Data Visualizer for Files
- Section to upload logs data - doc
  Kibana → Analytics → Machine Learning → Data Visualizer
- Basically, you can upload CSV, TSV, JSON, and log files directly from the Kibana UI
  - Want to try? Here is a CSV with info on Italian cities
- Index Pattern
  - After the import, you can check the Create index pattern box
    - Index patterns tell Kibana which Elasticsearch indices you want to explore (for dashboard purposes) - doc
  - How to create an index pattern?
    Stack Management → Index Patterns → Create index pattern - doc

🔹 Define and use an index template for a given pattern that satisfies a given set of requirements

🔗 Official docs

An index template is a way to tell Elasticsearch how to configure an index when it is created.
It can be composed of component templates: reusable building blocks that configure mappings, settings, and aliases
- composable template: the new (since ES 7.8) kind of index template; it replaces the legacy templates - link
💡 If an index is created with explicit settings and also matches an index template, the settings from the create index request take precedence
Changes to index templates do not affect existing indices, including the existing backing indices of a data stream.
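
💡 The precedence rule above can be sketched in plain Python. This is a simplified model, not Elasticsearch code: it assumes flat settings dictionaries, and the `resolve_settings` helper and the sample values are hypothetical. Component templates are applied in `composed_of` order, then the index template's own settings, then the create index request.

```python
# Simplified model of how settings are resolved when an index is created.
# Assumption: flat settings dicts; the real merge also covers mappings and aliases.

def resolve_settings(component_templates, index_template, create_request):
    """Merge settings so that later sources override earlier ones."""
    resolved = {}
    for component in component_templates:  # applied in `composed_of` order
        resolved.update(component)
    resolved.update(index_template)        # the index template's own settings
    resolved.update(create_request)        # explicit request settings win
    return resolved


# Hypothetical values, for illustration only
components = [{"number_of_shards": 2, "number_of_replicas": 1}]
template = {"number_of_shards": 1}
request = {"number_of_replicas": 0}

print(resolve_settings(components, template, request))
# → {'number_of_shards': 1, 'number_of_replicas': 0}
```

The index template overrides the component template's shard count, and the create request overrides the replica count.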

Create a template

API docs

🖱️ Code example

🦂 Note that Kibana autocomplete doesn’t suggest the template field, although it is required as the parent of the mappings and settings fields

GET _cat/indices?v
          
# ---
# Index template
# ---
          
# Create some component templates
PUT _component_template/component_template1
{
  "template": {
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "name": {
          "type": "keyword"
        },
        "bio":{
          "type": "text",
          "analyzer": "simple"
        }
      }
    }
  }
}
          
PUT _component_template/runtime_component_template
{
  "template": {
    "mappings": {
      "runtime": { 
        "day_of_week": {
          "type": "keyword",
          "script": {
            "source": "emit(doc['@timestamp'].value.dayOfWeekEnum.getDisplayName(TextStyle.FULL, Locale.ROOT))"
          }
        }
      }
    }
  }
}
          
GET _component_template/runtime_component_template
          
# Create the template
PUT _index_template/template_1
{
  "index_patterns": ["te*", "bar*"],
  "template": {
    "settings": {
      "number_of_shards": 1
    }, 
    /* The following mappings override those of the component templates */
    "mappings": {
      "_meta": {
        "description": "Generated using `template_1` "
      },
      "_source": {
        "enabled": false
      },
      "properties": {
        "host_name": {
          "type": "keyword"
        },
        "created_at": {
          "type": "date",
          "format": "EEE MMM dd HH:mm:ss Z yyyy"
        }
      }
    },
    "aliases": {
      "mydata": { }
    }
  },
  "priority": 500,
  "composed_of": ["component_template1", "runtime_component_template"], 
  "version": 1,
  "_meta": {
    "description": "Testing templating system"
  }
}
          
# Create an index that uses the template
PUT bar_index_1
GET bar_index_1

🔹 Define and use a dynamic template that satisfies a given set of requirements

🔗 Official docs

Greater control of how Elasticsearch maps your data beyond the default dynamic field mapping rules.
💡 You can create rules to map new fields (dynamically added - so not explicitly declared in the index original mapping) to desired types

🖱️ Code example

# ─────────────────────────────────────────────
# Basic example with dynamic_templates:
# Map all fields whose name starts with `ip` to the IP type
# ─────────────────────────────────────────────
      
# Create the index
PUT my-index-000001/
{
  "mappings": {
    "dynamic": "true",
    "dynamic_templates": [
      {
        "strings_as_ip": {
          "match_mapping_type": "string",
          "match": "ip*",
          "runtime": {
            "type": "ip"
          }
        }
      }
    ]
  }
}
      
# Index one document: only `ip_host` matches the `ip*` pattern
PUT my-index-000001/_doc/1
{
  "ip_host":"0.0.0.0",
  "host_ip":"0.0.0.0"
}
      
GET my-index-000001/_search
{
  "query": {
    "term": {
      "host_ip": "0.0.0.0/16"
    }
  }
}
# > 0 hits found
      
GET my-index-000001/_search
{
  "query": {
    "term": {
      "ip_host": "0.0.0.0/16"
    }
  }
}
# > 1 hit found
      
DELETE my-index-000001
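
💡 Why does only `ip_host` get the `ip` type? The `match` parameter is a simple wildcard pattern on the field name. A rough Python analogy uses `fnmatch` from the standard library. This is an approximation: `fnmatch` supports extra syntax such as `?` and `[...]` that plays no role for the `ip*` pattern, and the `name_matches` helper is hypothetical.

```python
from fnmatch import fnmatchcase


def name_matches(pattern: str, field_name: str) -> bool:
    """Approximate the dynamic template `match` check with a shell-style wildcard."""
    return fnmatchcase(field_name, pattern)


for field in ("ip_host", "host_ip"):
    print(field, name_matches("ip*", field))
# → ip_host True
# → host_ip False
```

`host_ip` does not start with `ip`, so it falls back to the default dynamic mapping for strings, and the CIDR term query on it finds nothing.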

🖱️ Code example

# ─────────────────────────────────────────────
# Dynamic templates example:
# create a `full_name` field with desired format.
# 
# Relevant fields involved: 
#   - path_match/path_unmatch
#   - copy_to
# ─────────────────────────────────────────────
      
PUT my-index-000001
{
  "mappings": {
    "dynamic_templates": [
      {
        "full_name": {
          "path_match": "name.*",
          "path_unmatch": "*.middle",
          "mapping": {
            "type": "keyword",
            "copy_to": "full_name"
          }
        }
      }
    ]
  }
}
      
PUT my-index-000001/_doc/1
{
  "name": {
    "first":  "John",
    "middle": "Winston",
    "last":   "Lennon"
  }
}
      
GET my-index-000001/_search
{
  "query": {
    "match": {
      "full_name": {
        "query": "John Winston",
        "operator" : "and"
      }
    }
  }
}
# > 0 hits
      
GET my-index-000001/_search
{
  "query": {
    "match": {
      "full_name": {
        "query": "John Lennon",
        "operator" : "and"
      }
    }
  }
}
# > 1 hits
      
DELETE my-index-000001
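
💡 The two results above follow from which full field paths the template selects. A minimal Python sketch, again using `fnmatch` as a stand-in for the wildcard matching (an approximation; `template_applies` is a hypothetical helper):

```python
from fnmatch import fnmatchcase


def template_applies(path: str, path_match: str, path_unmatch: str) -> bool:
    """A path is selected when it matches path_match and does not match path_unmatch."""
    return fnmatchcase(path, path_match) and not fnmatchcase(path, path_unmatch)


for path in ("name.first", "name.middle", "name.last"):
    print(path, template_applies(path, "name.*", "*.middle"))
# → name.first True
# → name.middle False
# → name.last True
```

Only `name.first` and `name.last` are copied to `full_name`, which is why "John Lennon" matches and "John Winston" does not.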

🔹 Define an Index Lifecycle Management (ILM) policy (ILP) for a time-series index

⭐ ILM

🔗 Official docs

💡 Automatically manage indices according to your performance, resiliency, and retention requirements.
🦂 Don’t confuse ILM with backups: the backup process has its own dedicated feature, SLM (Snapshot Lifecycle Management)
Actions you could trigger:
- Rollover:
  create a new index when the current reaches some limits
- Shrink:
  reduce the number of primary shards
  - Possible with shrink api
    - 💡 Tip: Why shrink an index? → To reduce overhead - link
- Force merge:
  reduce the number of segments in the index’s shards
  - Possible with force merge api
- Freeze:
  freeze an index - possible with freeze api
- Delete:
  delete the index
Index lifecycle temperatures
- Lifecycle phases - with data behavior description
  - Hot: insert and queries
  - Warm: queries
  - Cold: queries infrequently
  - Frozen: queries rarely
  - Delete: no longer used
- ILM moves indices through the lifecycle according to their age
- Actions available for each phase: list
- Lifecycle phases are useful if we move the data on less expensive HW,
  so we move data to different nodes belonging to different Data tiers
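
💡 The idea that ILM moves an index through the phases according to its age can be sketched as follows. This is a simplified model with illustrative `min_age` thresholds (7d, 14d, 365d) and a hypothetical `phase_for` helper. It ignores two real-world details: when rollover is used the age is counted from the rollover, and ILM waits for the actions of the current phase to complete before moving on.

```python
from datetime import timedelta

# Illustrative thresholds (assumption): each phase starts at its min_age.
PHASE_MIN_AGE = [
    ("hot", timedelta(0)),
    ("warm", timedelta(days=7)),
    ("cold", timedelta(days=14)),
    ("delete", timedelta(days=365)),
]


def phase_for(age: timedelta) -> str:
    """Return the last phase whose min_age has been reached."""
    current = PHASE_MIN_AGE[0][0]
    for phase, min_age in PHASE_MIN_AGE:
        if age >= min_age:
            current = phase
    return current


for days in (0, 10, 30, 400):
    print(days, phase_for(timedelta(days=days)))
# → 0 hot
# → 10 warm
# → 30 cold
# → 400 delete
```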

Create an ILM

Using Kibana: Stack Management > Index Lifecycle Policies - example

API - docs

🦂 Hot phase and read-only action:
You can set read-only under the hot phase when creating the policy; without context this doesn’t seem to make sense.
- The read-only action applies to the index after it has been rolled over, as described in the GitHub issue
- Moreover: to use read-only in the hot phase through the API, the rollover action must be present - doc
🦂 What happens if we set min_age > 0ms in the hot phase?
- I found no official answer, but the Kibana Edit policy section doesn’t let you set the min_age parameter for the hot phase, so we can assume it is ignored there
Get an index lifecycle status using ilm explain API - docs
- GET <target>/_ilm/explain

🖱️ Code example

# ─────────────────────────────────────────────
# ILM
# > Example of almost all ILM settings available
# ─────────────────────────────────────────────
                  
PUT _ilm/policy/my_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",               # Assumption: will be ignored (see notes)
        "actions": {
          "readonly": {},               # Applies to the rolled-over index
          "shrink": {
            "number_of_shards": 1       # Reduce the number of primary shards
          },
          "rollover": {                 # Move the insertions to new indexes
            "max_primary_shard_size": "50gb", 
            "max_age": "30d",
            "max_docs": 1000
          },
          "forcemerge": {               # Merge the lucene segments
            "max_num_segments": 1,
            "index_codec": "best_compression"
          },
          "set_priority": {             # Set index recovery priority
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "7d",                # Min age to enter in the phase
        "actions": {
          "set_priority": {
            "priority": 50
          },
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          },
          "allocate": {
            "number_of_replicas": 3
          },
          "readonly": {}
        }
      },
      "cold": {
        "min_age": "14d",
        "actions": {
          "set_priority": {
            "priority": 0
          }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": {
          "delete": {
            "delete_searchable_snapshot": true
          }
        }
      }
    }
  }
}
                  
GET _ilm/policy/my_policy
                  
# Create an index that uses the policy
PUT my-index-3
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "index.lifecycle.name": "my_policy"
  }
}
                  
# Check 
GET my-index-3/_ilm/explain

Time series & Data streams

🔗 Time series docs

🦂 Some documentation is under the How To section,
maybe not available during the exam

🔗 Data streams docs
- Time series:
  - “A series of data points indexed (or listed or graphed) in time order” - wiki
- Data streams
  
  ES structure to manage time-series data
  - 💡 “Data Streams is just an improved API and a better user-experience for using the Rollover API for partitioning data into more indexes” - web
    - “A data stream lets you store append-only time series data across multiple indices while giving you a single named resource for requests.” - doc
  - Data streams components:
    - ILM policy with rollover definition
      - Each time the rollover process runs, a new backing index is created
        
        Search queries are forwarded to all the backing indices
        
        When you index a new document, only the last backing index is used
        
        You cannot add new documents to backing indices other than the latest, even by sending requests directly to the index.
    - Index template
      - "data_stream": { } - mandatory parameter, specifies that the template creates data streams
      - "@timestamp" - mandatory field, used to order the data by time
      - You cannot delete a template used by a data stream
      - See the chapter Define an index template that creates a new data stream for the Kibana code
ILM & Time series

ES offers features to help you store, manage, and search time series data,
such as logs and metrics - doc
- 💡 To manage time-series data, ES offers different technologies that you should use together:
  1. [optional] Data tiers: create multiple nodes with different HW specs
    (fast HW for hot data, slow and cheap HW for cold data)
  2. [optional] Create a snapshot repository: store the data on distributed file storage (e.g. Google Cloud Storage) for backup purposes
  3. Create an ILM policy: define the backing indices' lifecycle, from creation to storage on cold tiers and potential deletion
  4. Create an index template: the mapping must contain the @timestamp field
  5. Create the data stream and use it
- 🔗 More on ES documentation

🔹 Define an index template that creates a new data stream

What is a data stream? → see the previous chapter
- tl;dr; “Data Streams is just an improved API and a better user-experience for using the Rollover API for partitioning data into more indexes” - web

⭐ Create a data stream

📎 Official docs

Five steps
1. [optional] Create an index lifecycle policy
2. [optional] Create component templates
3. Create an index template
4. Create the data stream
5. [optional] Secure the data stream

Basic data stream creation (only steps 3 & 4)

📎 Data stream creation API

💡 Data stream index must include @timestamp field - doc
💡 For the data stream index naming there is an official naming scheme

🖱️ Code example

🦂 The following example is good for getting the hang of the API but not so useful in a real-world scenario: without ILM, a data stream is not so different from a normal index

# ─────────────────────────────────────────────
# Basic data stream creation: 
# define an index template that  
# creates a new data stream
# ─────────────────────────────────────────────
              
# ---
# Create an index template
# ---
              
# Create data stream template
PUT _index_template/my-stream-template
{
  "index_patterns": [
    "my-logs-backend-*"
  ],
  "data_stream": {},
  "template": {
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "message": {
          "type": "text"
        },
        "relevance": {
          "type": "long"
        }
      }
    }
  }
}
# Note:
# - data stream must include `@timestamp` field
# - template must include `data_stream` field
              
# ---
# Create the data stream
# ---
              
PUT _data_stream/my-logs-backend-test
GET _data_stream/my-logs-backend-test
# > 200
# Note: 
# - "index_name" is not a human-friendly string, and starts with a dot
# - "timestamp_field" is automatically linked to "@timestamp"
# - "template" used is the previous "my-stream-template"
              
GET _cat/indices
# > 200
# Note: the previous "index_name" is present
              
# ---
# Load & search some data
# ---
              
POST my-logs-backend-test/_doc
{
  "@timestamp": "2020-01-01T00:00:00",
  "message": "my first message",
  "relevance": 1
}
              
POST my-logs-backend-test/_doc
{
  "@timestamp": "2020-01-02T00:00:00",
  "message": "bla bla bla bla - my second message",
  "relevance": 2
}
              
POST my-logs-backend-test/_doc
{
  "@timestamp": "2020-01-02T00:00:01",
  "message": "low level message: relevance 3",
  "relevance": 3
}
              
GET my-logs-backend-test/_search
{
  "query": {
    "match": {
      "message": "message"
    }
  }
}
# > all docs returned
              
GET my-logs-backend-test/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "relevance": 3
          }
        }
      ]
    }
  }
}
# > 1 hit found

Use a data stream

🔗 Official doc

🖱️ Code example

- Section 1: exploring ILM operations

# Cluster to use: `04_snapshots-locals`
          
# -------------------------------------
# Create ILM: exploring ILM operations
# -------------------------------------
          
DELETE _ilm/policy/my-hwc-policy
          
PUT _ilm/policy/my-hwc-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_docs": 1
          },
          "set_priority": {
            "priority": 100
          },
          "forcemerge": {
            "max_num_segments": 1
          }
        },
        "min_age": "0ms"
      },
      "warm": {
        "min_age": "10s",
        "actions": {
          "set_priority": {
            "priority": 50
          },
          "allocate": {
            "number_of_replicas": 0
          }
        }
      },
      "cold": {
        "min_age": "1m",
        "actions": {
          "set_priority": {
            "priority": 0
          }
        }
      }
    }
  }
}
# > 200
# Note: rollover after 1 document, 
#       warm after 10s, cold after 1m
          
PUT _cluster/settings
{
  "transient": {
    "indices.lifecycle.poll_interval": "3s" 
  }
}
# > 200
# Note: https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-with-existing-indices.html#ilm-existing-indices-reindex
          
# ---
# Test the ILM
# ---
          
DELETE test-index-01
PUT test-index-01
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index.lifecycle.name": "my-hwc-policy"
  }
}
# Note: no alias provided
          
GET _cat/shards/test*?v
# > node es01
          
GET _cat/indices/test*?v
# > test-index-01, count 0
          
PUT test-index-01/_doc/01
{"foo":"bar"}
          
GET test-index-01/_ilm/explain
# > stack_trace: java.lang.IllegalArgumentException: setting [index.lifecycle.rollover_alias] for index [test-index-01] is empty or not defined
          
# [!] To use rollover, we need to set the alias!
DELETE test-index-01
PUT test-index-01
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index.lifecycle.name": "my-hwc-policy",
    "index.lifecycle.rollover_alias": "test-index-01-alias"
  }
}
# > 200
          
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "test-index-01",
        "alias": "test-index-01-alias",
        "is_write_index": true
      }
    }
  ]
}
# [!] The alias must exist for rollover to work
          
PUT test-index-01/_doc/01
{"foo":"bar"}
PUT test-index-01/_doc/02
{"foo":"bar"}
          
GET test-index-01/_count
# > 2
          
GET _cat/shards/test*?v
# > Index: test-index-000002 - Node: es01
# > Index: test-index-01 - Node: es01
          
# Wait for 10s ...
          
GET _cat/shards/test*?v
# > Index: test-index-000002 - Node: es01
# > Index: test-index-01 - Node: es01
# [!] Index not moved because no template was defined,
#     so `test-index-000002` doesn't have the ILM policy
          
# ---
# Tip: use the alias to enable writing
# ---
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "test-index-01",
        "alias": "test-index-01-alias",
        "is_write_index": true
      }
    }
  ]
}
          
PUT test-index-01-alias/_doc/03
{"foo":"bar"}
# > 200
          
PUT _cluster/settings
{
  "transient": {
    "indices.lifecycle.poll_interval": null
  }
}
# > 200
# Note: https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-with-existing-indices.html#ilm-existing-indices-reindex

🖱️ Code example

- Section 2: ILM data-stream like

# Cluster to use: `04_snapshots-locals`
          
# -------------------------------------
# Create index with data-stream like behaviour
# -------------------------------------
          
DELETE _ilm/policy/my-hwc-policy
          
PUT _ilm/policy/my-hwc-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_docs": 1
          },
          "set_priority": {
            "priority": 100
          },
          "forcemerge": {
            "max_num_segments": 1
          }
        },
        "min_age": "0ms"
      },
      "warm": {
        "min_age": "10s",
        "actions": {
          "set_priority": {
            "priority": 50
          },
          "allocate": {
            "number_of_replicas": 0
          }
        }
      },
      "cold": {
        "min_age": "1m",
        "actions": {
          "set_priority": {
            "priority": 0
          }
        }
      }
    }
  }
}
# > 200
# Note: rollover after 1 document, 
#       warm after 10s, cold after 1m
          
PUT _cluster/settings
{
  "transient": {
    "indices.lifecycle.poll_interval": "3s" 
  }
}
# > 200
# Note: https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-with-existing-indices.html#ilm-existing-indices-reindex
          
# ---
# Create an index that rolls over after 1 document,
# move the old index to warm and then to the cold,
# and meanwhile an alias is updated accordingly
# ---
          
DELETE _index_template/test-index-template
PUT _index_template/test-index-template
{
  "index_patterns": ["test-index-*"], 
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "index.lifecycle.name": "my-hwc-policy",
      "index.lifecycle.rollover_alias": "test-index-alias"
    }
  }
}
          
DELETE test-index-000001
PUT test-index-000001
# [!] Warning: the name MUST end with a hyphen and a number (e.g. `-000001`), or
#     the ILM rollover will break
          
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "test-index-000001",
        "alias": "test-index-alias",
        "is_write_index": true
      }
    }
  ]
}
          
GET _cat/shards/test-index*?v
# > index: test-index-000001; node: es01
          
PUT test-index-000001/_doc/01
{"foo":"bar"}
PUT test-index-alias/_doc/02
{"foo":"bar"}
          
GET test-index-000001/_count
# > 2
          
GET _cat/shards/*test*?v
# > index: test-index-000001; node: es01
# > index: test-index-000002; node: es01
          
GET test-index-000001/_ilm/explain
# Note: use to check if errors occur
          
# Wait for 10s...
          
GET _cat/shards/*test*?v
# > index: test-index-000001; node: es02
# > index: test-index-000002; node: es01
# Note: the first index is moved to es02 (warm)
#       a new index (...0002) is created, and
#       the alias is updated, with ...0002 index as writing index
          
GET _alias/test-index-alias
# > test-index-000002; "is_write_index" : true
# > test-index-000001; "is_write_index" : false
          
# wait 60s...
          
GET _cat/shards/*test*?v
# > index: test-index-000001; node: es03
# > index: test-index-000002; node: es01
          
# ---
# And so on...
# ---
          
PUT test-index-alias/_doc/03
{"foo":"bar"}
          
GET test-index-alias/_count
          
GET _cat/shards/*test*?v
# > index: test-index-000001; node: es03
# > index: test-index-000002; node: es01
# > index: test-index-000003; node: es01
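
💡 The index names seen in the outputs above (`test-index-01` rolling over to `test-index-000002`, then `-000003`) follow the rollover naming rule: the current name must end with a hyphen and a number, the number is incremented, and the result is zero-padded to six digits. A minimal Python sketch of that rule, where `next_rollover_name` is a hypothetical helper and not part of any Elasticsearch client:

```python
import re


def next_rollover_name(index_name: str) -> str:
    """Compute the name of the index created by a rollover."""
    found = re.fullmatch(r"(.*-)(\d+)", index_name)
    if found is None:
        # Without a trailing "-<number>" the new name cannot be derived
        raise ValueError(f"index name [{index_name}] does not end with -<number>")
    prefix, number = found.groups()
    return f"{prefix}{int(number) + 1:06d}"


print(next_rollover_name("test-index-01"))      # → test-index-000002
print(next_rollover_name("test-index-000002"))  # → test-index-000003
```

This also shows why a name like `test-index` breaks the ILM rollover: there is no trailing number to increment.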

🖱️ Code example

- Section 3: use data stream

# ---
# Create ILM
# ---
          
PUT _ilm/policy/hwc-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_docs": 1
          },
          "set_priority": {
            "priority": 100
          }
        },
        "min_age": "0ms"
      },
      "warm": {
        "min_age": "10s",
        "actions": {
          "set_priority": {
            "priority": 50
          },
          "allocate": {
            "number_of_replicas": 0
          }
        }
      },
      "cold": {
        "min_age": "60s",
        "actions": {
          "set_priority": {
            "priority": 0
          }
        }
      }
    }
  }
}
# > 200
# Note: rollover after 1 doc, warm after 10s, cold after 60s
          
# ---
# Create data stream
# ---
          
PUT _index_template/data-stream-template
{
  "index_patterns": ["test-data-stream*"], 
  "data_stream": { },
  "template": {
    "settings": {
      "index.lifecycle.name": "hwc-policy",
      "number_of_replicas": 0
    },
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        }
      }
    }
  }
}
# [!] one `date` field is mandatory (@timestamp)
# Note: we don't need to define an alias even though we are using 
#       the rollover functionality (as we had to without data_stream)
          
PUT _data_stream/my-new-data-stream
# > 400; no matching index template found
          
DELETE _data_stream/test-data-stream-01
PUT _data_stream/test-data-stream-01
# > 200
# Note: the data stream name must match one of the index_patterns
#       of a template with data_stream enabled
          
GET _data_stream
# > "name" : "test-data-stream-01"
# > "index_name" : ".ds-test-data-stream-01-2021.12.05-000001",
          
GET _cat/indices/*test*?v
# > .ds-test-data-stream-01-2021.12.05-000001
          
GET _alias
# >  ".ds-test-data-stream-01-2021.12.05-000001" : {
# >    "aliases" : { }
# >  },
# Note: alias not yet created
          
GET _cat/shards/*test*?v
# > index: .ds-test-data-stream-01-2021.12.05-000001; node: 01
          
# ---
# Ingest some data
# ---
PUT test-data-stream-01/_bulk
{ "create":{ } }
{ "@timestamp": "2099-05-06T16:21:15.000Z", "message": "192.0.2.42 - - [06/May/2099:16:21:15 +0000] \"GET /images/bg.jpg HTTP/1.0\" 200 24736" }
{ "create":{ } }
{ "@timestamp": "2099-05-06T16:25:42.000Z", "message": "192.0.2.255 - - [06/May/2099:16:25:42 +0000] \"GET /favicon.ico HTTP/1.0\" 200 3638" }
          
GET test-data-stream-01/_count
# > count: 2
          
# Wait for 10s...
          
GET _cat/indices/*test*?v
# > ...001
# > ...002
# Note: new index ...002 created
          
GET _cat/shards/*test*?v
# > ...001; node: es02
# > ...002; node: es01
          
# Wait for 60s...
          
GET _cat/shards/*test*?v
# > ...01; node: es03
# > ...02; node: es01
          
# ---
# Explore aliases
# ---
          
GET _alias
# > .ds-test-data-stream-01-2021.12.05-000001; no alias
# > .ds-test-data-stream-01-2021.12.05-000002; no alias
          
GET _data_stream
# > "indices" : [ ...01, ...02 ]
          
GET _cat/indices
# Note: data stream not shown!
          
PUT .ds-test-data-stream-01-2021.12.05-000001/_doc/99
{
  "foo": "bar",
  "@timestamp": "2099-05-06T16:21:15.000Z"
}
# > 400
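
💡 The backing index names in the outputs above follow the pattern `.ds-<data-stream>-<yyyy.MM.dd>-<generation>`, with a six-digit, zero-padded generation. A small Python sketch that rebuilds such a name (`backing_index_name` is a hypothetical helper, shown only to make the pattern explicit):

```python
from datetime import date


def backing_index_name(data_stream: str, created: date, generation: int) -> str:
    """Build a backing index name from the stream name, creation date and generation."""
    return f".ds-{data_stream}-{created:%Y.%m.%d}-{generation:06d}"


print(backing_index_name("test-data-stream-01", date(2021, 12, 5), 1))
# → .ds-test-data-stream-01-2021.12.05-000001
```

The leading dot is why `index_name` is not human-friendly, and the generation grows by one at each rollover.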

🔷 Searching Data

Questions

🔹 Write and execute a search query for terms and/or phrases in one or more fields of an index

Elasticsearch is queried through a dedicated language, the Query DSL (Domain Specific Language). We combine multiple clauses and elements of this language to retrieve the data with the desired characteristics - more on the doc
Use full-text queries to search analyzed text fields - doc
- Some interesting queries from the list:
  - match - the standard search mode
  - term - search for an exact term (a term-level query, not a full-text one; avoid it on analyzed text fields)
  - 💡 match_phrase - search for phrases, use when the order of the words is important
    - Use slop parameter to set the maximum number of intervening unmatched positions
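
💡 The difference between `match` and `term` on an analyzed text field can be sketched in Python. This is a rough model under stated assumptions: the `analyze` function approximates the standard analyzer with a lowercase word split, and `term_query` / `match_query` are hypothetical helpers that only decide whether a document matches (no scoring).

```python
import re


def analyze(text: str) -> list:
    """Rough stand-in for the standard analyzer: lowercase, split into words."""
    return re.findall(r"\w+", text.lower())


def term_query(field_value: str, term: str) -> bool:
    """The term is NOT analyzed: it must equal one indexed token exactly."""
    return term in analyze(field_value)


def match_query(field_value: str, query: str, operator: str = "or") -> bool:
    """The query text IS analyzed; tokens are combined with OR (default) or AND."""
    doc_tokens = set(analyze(field_value))
    combine = all if operator == "and" else any
    return combine(token in doc_tokens for token in analyze(query))


comment = "Important but boring EULA informations - not recommended"

print(term_query(comment, "not recommended"))   # → False
print(term_query(comment, "recommended"))       # → True
print(match_query(comment, "not recommended"))  # → True
```

No single indexed token equals "not recommended", so the term query never matches it, while `match` splits the text and matches on either word unless `operator` is set to `and`.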

🖱️ Code example

💡 The following code block contains a question of this form:
Search all results that must satisfy clause X; satisfying clause Y is a nice-to-have.
How can we solve this? - using filter + “should” with a normal match

Highlight extracted from the next code block:

# Search all docs where the comment must contain the word "film",
# and it is a "nice-to-have" if the "phrase" field contains the word "life"
GET test-index-01/_search
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "comment": "film"
        }
      },
      "should": [
        {
          "match": {
            "phrase": "life"
          }
        }
      ]
    }
  }
}

# ─────────────────────────────────────────────
# Search of text and keywords types
# with multiple "full text queries" and
# multiple index fields
# ─────────────────────────────────────────────
      
DELETE test-index-01
PUT test-index-01
{
  "mappings": {
    "properties": {
      "phrase": {
        "type": "text"
      },
      "book": {
        "type": "keyword"
      },
      "author": {
        "type": "keyword"
      },
      "comment": {
        "type": "text"
      },
      "review_date": {
        "type": "date",
        "format": "yyyy/MM/dd HH:mm:ss||HH:mm:ss yyyy/MM/dd"
      }
    }
  }
}
      
PUT test-index-01/_doc/1
{
  "phrase": "It was a bright cold day in April, and the clocks were striking thirteen",
  "book": "1984",
  "author": "George-Orwell",
  "comment": "A book everyone should read - recommended",
  "review_date": "2021/05/25 12:10:30"
}
      
PUT test-index-01/_doc/2
{
  "phrase": "Mr. Jones, of the Manor Farm, had locked the hen-houses for the night, but was too drunk to remember to shut the pop-holes",
  "book": "Animal Farm",
  "author": "George-Orwell",
  "comment": "A great classic of literature - recommended",
  "review_date": "2021/02/02 16:10:30"
}
       
PUT test-index-01/_doc/3
{
  "phrase": "Review the software license agreements for currently shipping Apple products",
  "book": "Software License Agreements",
  "author": "Apple",
  "comment": "Important but boring EULA informations - not recommended",
  "review_date": "2021/10/25 12:10:30"
}
      
PUT test-index-01/_doc/4
{
  "phrase": "Tyler gets me a job as a waiter, after that Tyler's pushing a gun in my mouth and saying, the first step to eternal life is you have to die.",
  "book": "Fight Club",
  "author": "Chuck Palahniuk",
  "comment": "The book behind the grat film - recommended",
  "review_date": "2019/10/25 12:10:30"
}
      
PUT test-index-01/_doc/5
{
  "phrase": "noise noise noise",
  "book": "Test book 1",
  "author": "Jhon Doe",
  "comment": "The book behind the grat film - not yet recommended",
  "review_date": "2021/01/10 09:10:30"
}
      
PUT test-index-01/_doc/6
{
  "phrase": "noise noise noise",
  "book": "Test book 2",
  "author": "Jhon Doe",
  "comment": "Not a book, not a film - recommended",
  "review_date": "10:10:30 2021/10/25"
}
      
PUT test-index-01/_doc/7
{
  "phrase": "Mr. Jones, of the Manor Farm, had locked the hen-houses for the night, but was too drunk to remember to shut the pop-holes",
  "book": "Animal Farm - not recommended",
  "author": "George-Orwell",
  "comment": "Test test test test - not recommended",
  "review_date": "2019/12/31 00:01:30"
}
      
PUT test-index-01/_doc/8
{
  "phrase": "It's my life",
  "book": "foo",
  "author": "John Doe",
  "comment": "Only a boook - recommended"
}
      
# ---
# Search recommended film
# 
# > different attempts reported
#   with discussion on each behavior
# ---
      
GET test-index-01/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "term": {
            "comment": {
              "value": "not recommended"
            }
          }
        }
      ]
    }
  }
}
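# > wrong - all docs returned
# Note: you should not use the `term` query on `text` fields,
# as the official documentation advises: the field is analyzed
# into single tokens, so the exact term "not recommended"
# never matches and no doc is excluded.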
      
GET test-index-01/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "match": {
            "comment": "not recommended"
          }
        }
      ]
    }
  }
}
# > wrong - no docs returned
# Note: `match` uses the OR operator by default, so the
# query excludes every doc that contains "not" OR "recommended".
# Because all docs contain the word "recommended", no
# results are found. Same output as `"comment": "recommended"`
      
GET test-index-01/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "match": {
            "comment": "not"
          }
        }
      ]
    }
  }
}
# > correct - but not resilient
# Note: excluding every comment that contains the
# word `not` works for this small example but isn't
# a reliable solution: `not` can easily appear in
# the comment before the final verdict.
# [!] E.g. the test document "6" is wrongly excluded.
      
GET test-index-01/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "bool": {
            "must": [
              {
                "match": {
                  "comment": "not"
                }
              },
              {
                "match": {
                  "comment": "recommended"
                }
              }
            ]
          }
        }
      ]
    }
  }
}
# > correct - but not resilient
# Note: same problem as the previous example
      
GET test-index-01/_search
{
  "query": {
    "query_string": {
      "default_field": "comment",
      "query": "NOT not recommended"
    }
  }
}
# > correct - but not resilient
# Note: same problem as the previous example.
# This query is similar to the previous one,
# but written in a more concise format, and it
# returns a _score for each hit
      
GET test-index-01/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "match_phrase": {
            "comment": "not recommended"
          }
        }
      ]
    }
  }
}
# > correct - can be improved
# Note: with match_phrase we are asking for documents
# that contain the words "not recommended" one right
# after the other (the order of the words matters).
# [!] The test document "5" is returned, with the
# text "not yet recommended": this behaviour may or
# may not be desired depending on the use case.
      
GET test-index-01/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "match_phrase": {
            "comment": "not recommended"
          }
        },
        {
          "match_phrase": {
            "comment": "not yet recommended"
          }
        }
      ]
    }
  }
}
# > correct
# Note: excludes every phrase variant
# that is not the plain "recommended"
# Note: another approach could be to use the `slop` parameter
      
# ---
# Recommended books written by Orwell
# ---
      
GET test-index-01/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "author": "George-Orwell"
          }
        },
        {
          "bool": {
            "must_not": [
              {
                "match_phrase": {
                  "comment": "not recommended"
                }
              },
              {
                "match_phrase": {
                  "comment": "not yet recommended"
                }
              }
            ]
          }
        }
      ]
    }
  }
}
# > correct - can be improved
# Note: we don't need to score how well
# "George-Orwell" matches the `author` field (a keyword),
# so `match` isn't the best query to use
      
GET test-index-01/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "match_phrase": {
            "comment": "not recommended"
          }
        },
        {
          "match_phrase": {
            "comment": "not yet recommended"
          }
        }
      ],
      "filter": [
        {
          "term": {
            "author": "George-Orwell"
          }
        }
      ]
    }
  }
}
# > correct
# Note: the `filter` clause skips scoring and
# can speed up the search thanks to caching
      
# ---
# Books whose comment must talk about "film",
# possibly with a phrase that speaks about `life`
# ---
      
GET test-index-01/_search
{
 "query": {
   "bool": {
     "should": [
       {
         "match": {
           "comment": "film"
         }
       },
       {
         "match": {
           "phrase": "life"
         }
       }
     ]
   }
 } 
}
# > wrong - doc 8 should not be present
# Note: the "must talk about film" restriction
# is not represented in this query
      
GET test-index-01/_search
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "comment": "film"
        }
      },
      "should": [
        {
          "match": {
            "phrase": "life"
          }
        }
      ]
    }
  }
}
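# > correct
# Note: for the conjunction of two assertions,
# "X must be true" and "Y is nice-to-have",
# we can put X into a `filter` and use a standard
# `match` inside `should` to evaluate Y.
# `term` works on the `comment` text field only because
# "film" is already a single lowercase token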
      
# ---
# Documents with comments written in 2021
# ---
      
GET test-index-01/_search
{
  "query": {
    "range": {
      "review_date": {
        "gte": "2021/01/01 00:00:00",
        "lt": "2022/01/01 00:00:00"
      }
    }
  }
}
# > correct
# Note: `lt` keeps the first instant of 2022 out of the range;
# all docs have the same _score
      
# ---
# All books whose author
# has a name that contains
# the character "o"
# ---
      
GET test-index-01/_search
{
  "query": {
    "wildcard": {
      "author": {
        "value": "*o*",
        "case_insensitive":true
      }
    }
  }
}
      
# ---
# All docs written by an author
# with a name similar to `Jeorge-Orbell`
# ---
      
GET test-index-01/_search
{
  "query": {
    "fuzzy": {
      "author": {
        "value": "Jeorge-Orbell",
        "fuzziness": 2
      }
    }
  }
}
# > All the books that are written by "George-Orwell"
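
The different outcomes of `term`, `match` and `match_phrase` seen above come down to how a `text` field is analyzed. The following is a minimal Python sketch, not Elasticsearch code: it assumes a simplified analyzer (lowercase, split on non-alphanumeric characters) that only approximates the standard analyzer, and all helper names are hypothetical.

```python
import re

def analyze(text):
    """Simplified stand-in for the standard analyzer: lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def term_query(tokens, value):
    """`term`: the value is NOT analyzed and must equal one indexed token."""
    return value in tokens

def match_query(tokens, text):
    """`match` (default operator OR): one analyzed query token is enough."""
    return any(tok in tokens for tok in analyze(text))

def match_phrase_query(tokens, text):
    """`match_phrase`: query tokens must appear adjacent and in order."""
    wanted = analyze(text)
    n = len(wanted)
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))

# Three comments taken from the test documents above.
comments = {
    3: "Important but boring EULA informations - not recommended",
    5: "The book behind the grat film - not yet recommended",
    6: "Not a book, not a film - recommended",
}
indexed = {doc_id: analyze(text) for doc_id, text in comments.items()}

query = "not recommended"
term_hits = [d for d, toks in indexed.items() if term_query(toks, query)]
match_hits = [d for d, toks in indexed.items() if match_query(toks, query)]
phrase_hits = [d for d, toks in indexed.items() if match_phrase_query(toks, query)]

print(term_hits)    # [] : no single token equals "not recommended"
print(match_hits)   # [3, 5, 6] : every comment has "not" or "recommended"
print(phrase_hits)  # [3] : only doc 3 has the two words side by side
```

Wrapped in `must_not`, these three hit lists explain the "all docs returned", "no docs returned" and "doc 5 still returned" behaviours noted in the comments.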

🔹 Write and execute a search query that is a Boolean combination of multiple queries and filters

See the previous question “Write and execute a search query for terms and/or phrases in one or more fields of an index”.
In the search API, the bool query clauses are used as follows:
- must → the clause must match and contributes to the score
- filter → like must, but without contributing to the score
- should → a match is not required, but if present it increases the score (when the bool has no must or filter clause, at least one should clause has to match)
- must_not → if the clause matches, the doc is discarded
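
The four clauses can be modelled with a small Python sketch. It is a toy model under stated assumptions, not how Elasticsearch scores documents: every clause is a hypothetical `(predicate, score)` pair with a fixed score, and `bool_query` is an invented helper name.

```python
def bool_query(docs, must=(), filter_=(), should=(), must_not=()):
    """Toy model of the `bool` query; each clause is a (predicate, score) pair."""
    hits = []
    for doc in docs:
        if any(pred(doc) for pred, _ in must_not):
            continue  # must_not: a match discards the doc
        if not all(pred(doc) for pred, _ in list(must) + list(filter_)):
            continue  # must and filter: every clause is required
        should_scores = [score for pred, score in should if pred(doc)]
        if should and not must and not filter_ and not should_scores:
            continue  # only `should` clauses: at least one has to match
        # filter never adds to the score, must and should do
        score = sum((s for _, s in must), 0.0) + sum(should_scores, 0.0)
        hits.append((doc["id"], score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

docs = [
    {"id": 4, "comment": "the book behind the great film", "phrase": "eternal life"},
    {"id": 5, "comment": "the book behind the great film", "phrase": "noise"},
    {"id": 8, "comment": "only a book", "phrase": "it's my life"},
]
has_film = (lambda d: "film" in d["comment"], 1.0)
has_life = (lambda d: "life" in d["phrase"], 1.0)

only_should = bool_query(docs, should=[has_film, has_life])
filtered = bool_query(docs, filter_=[has_film], should=[has_life])

print(only_should)  # [(4, 2.0), (5, 1.0), (8, 1.0)] : doc 8 slips in
print(filtered)     # [(4, 1.0), (5, 0.0)] : "film" required, "life" boosts
```

This mirrors the "film is required, life is nice-to-have" exercise of the previous question: moving the required condition into `filter` removes doc 8 while `should` still ranks doc 4 first.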

🔹 Write an asynchronous search *

“Asynchronous search makes long-running queries feasible and reliable” - blog
“The async search API let you asynchronously execute a search request, monitor its progress, and retrieve partial results as they become available.” - doc
- Create a standard query and submit it as async: the response contains an id used to monitor the query progress and gather the data as it executes

🖱️ Code example

You can also specify how long the async search needs to be available through the keep_alive parameter - doc
💡 Async search does not support scroll - doc

# ─────────────────────────────────────────────
# Make an async search query
# ─────────────────────────────────────────────
      
# Add "Sample eCommerce orders" sample
# data directly from Kibana:
# https://www.elastic.co/guide/en/kibana/7.13/get-started.html#gs-get-data-into-kibana
      
# ---
# Check data existence
# ---
      
GET _cat/indices?v
# > "kibana_sample_data_ecommerce"
      
PUT _cluster/settings
{
  "transient": {
    "search.max_buckets": 2290000
  }
}
# Note: increase max_buckets to allow 
# a heavy query to be performed.
      
# [!] Warning: the following is a heavy query
GET kibana_sample_data_ecommerce/_search
{
  "query": {
    "range": {
      "order_date": {
        "gte": "now-1d"
      }
    }
  },
  "aggs": {
    "time_buckets": {
      "date_histogram": {
        "field": "order_date",
        "fixed_interval": "1s",
        "extended_bounds": {
          "min": "now-1d"
        },
        "min_doc_count": 0
      }
    }
  },
  "size": 0
}
# > [wait ~1m] - results returned
      
# ---
# Try Async search
# ---
      
POST kibana_sample_data_ecommerce/_async_search?size=0
{
  "query": {
    "range": {
      "order_date": {
        "gte": "now-1d"
      }
    }
  },
  "aggs": {
    "time_buckets": {
      "date_histogram": {
        "field": "order_date",
        "fixed_interval": "1s",
        "extended_bounds": {
          "min": "now-1d"
        },
        "min_doc_count": 0
      }
    }
  },
  "size": 0
}
# > "is_running": true - no hits 
# Note: copy the "id" value from the response and use it
# in place of the sample id in the next two requests
      
GET _async_search/status/FmxTakt4Z0dDU0MyaG9TUC1GNVhqamcbVHhDQV9EcENTV21EOHNtWWt0b3hIdzo0ODg2
# > wait for "is_running": false
      
GET _async_search/FmxTakt4Z0dDU0MyaG9TUC1GNVhqamcbVHhDQV9EcENTV21EOHNtWWt0b3hIdzo0ODg2
# > the query results
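
The submit / poll / fetch flow above can be sketched in Python. This is only an illustration: `FakeAsyncSearch` is a hypothetical in-memory stand-in for the three endpoints, it does not talk to a cluster, and the hit count it returns is invented.

```python
import itertools

class FakeAsyncSearch:
    """In-memory stand-in for the async search endpoints (hypothetical)."""

    def __init__(self, polls_needed):
        self._polls_needed = polls_needed
        self._polls_left = {}
        self._ids = itertools.count(1)

    def submit(self, index, body):
        # Models: POST <index>/_async_search
        search_id = f"search-{next(self._ids)}"
        self._polls_left[search_id] = self._polls_needed
        return {"id": search_id, "is_running": True}

    def status(self, search_id):
        # Models: GET _async_search/status/<id>
        if self._polls_left[search_id] > 0:
            self._polls_left[search_id] -= 1
        return {"id": search_id, "is_running": self._polls_left[search_id] > 0}

    def get(self, search_id):
        # Models: GET _async_search/<id> (42 is an invented hit count)
        running = self._polls_left[search_id] > 0
        total = 0 if running else 42
        return {"id": search_id, "is_running": running,
                "response": {"hits": {"total": {"value": total}}}}

def run_async_search(client, index, body, max_polls=10):
    """Submit, poll the status until the search stops running, then fetch."""
    search_id = client.submit(index, body)["id"]
    for _ in range(max_polls):
        if not client.status(search_id)["is_running"]:
            return client.get(search_id)
    raise TimeoutError(f"{search_id} still running after {max_polls} polls")

result = run_async_search(FakeAsyncSearch(polls_needed=3),
                          "kibana_sample_data_ecommerce", {"size": 0})
print(result["is_running"])  # False
```

Against a real cluster the loop would also wait between polls, and the stored result stays available only for the `keep_alive` period.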

🔹 Write and execute metric and bucket aggregations

“Aggregation summarizes your data as metrics, statistics, or other analytics” - doc

📎 High-Level Concepts - doc
📎 Official doc

⭐ Aggregations are a powerful resource offered by Elasticsearch: they group the data into buckets (bucketization) and calculate metrics on those buckets.
- With some powerful characteristics:
  - Efficiency: aggregations use internal structures for fast calculation and leverage the ES cluster scaling system
  - Near real time: as soon as a document is indexed and searchable, it is counted in the aggregations
  - Powerful: aggregations can be nested to group and measure any sort of data; moreover they can be used in conjunction with the usual search system (the query field)

Bucket aggregations

“Bucket aggregations that group documents into buckets, also called bins, based on field values, ranges, or other criteria.” - doc

🖱️ Code example

# ─────────────────────────────────────────────
# Bucket aggregations
# ─────────────────────────────────────────────
          
# ---
# Data creation
# ---
# Add "Sample eCommerce orders" data directly from kibana,
# Follow this guide: https://www.elastic.co/guide/en/kibana/7.13/get-started.html#gs-get-data-into-kibana
          
GET _cat/indices?v
# > `kibana_sample_data_ecommerce` 
          
GET kibana_sample_data_ecommerce/_search?size=1
# > get an idea of the doc structure
          
# ---
# Bucket aggregation
# ---
          
GET kibana_sample_data_ecommerce
# > check "category" field: is indexed both as text and keyword
          
# How many "category" exist?
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "category_census": {
      "terms": {
        "field": "category.keyword"
      }
    }
  }
}
# > There are 2024 "Men's Clothing", 1136 "Women's Shoes"...
# Note:
# - "size": 0 because we don't need the search
#     hits, this setting speeds up the process
# - we use `terms` although only one field is used 
# - `category.keyword` is required because `category` is a "Multi-Fields" field [1]
          
# What are the manufacturers for each category?
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "sale_categories": {
      "terms": {
        "field": "category.keyword"
      },
      "aggs": {
        "category_manufacturers": {
          "terms": {
            "field": "manufacturer.keyword"
          }
        }
      }
    }
  }
}
# > e.g. "Elitelligence" is the most important manufacturer of category "Men's Clothing" ...
          
# In which categories do the top 3 manufacturers sell products?
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "manufacturers": {
      "terms": {
        "field": "manufacturer.keyword",
        "size": 3
      },
      "aggs": {
        "categories": {
          "terms": {
            "field": "category.keyword"
          }
        }
      }
    }
  }
}
# Note: `aggs.manufacturers.terms.size` (the `size` parameter of
# a `terms` aggregation) sets how many buckets are returned.
# Buckets are ordered by doc count (descending) by default, so
# these are the top 3 manufacturers; on multi-shard indices
# the doc counts can be approximate
          
# ---
# Resources
# ---
# [1] https://www.elastic.co/guide/en/elasticsearch/reference/7.13/mapping-types.html#types-multi-fields
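
What a `terms` aggregation and its `size` parameter do can be sketched in Python. This is a local simulation under stated assumptions: the orders are made up, `terms_agg` is a hypothetical helper, and shard-level approximation is ignored.

```python
from collections import Counter

def terms_agg(docs, field, size=10):
    """Toy `terms` aggregation: one bucket per distinct value of `field`,
    returning the top `size` buckets by doc count (descending)."""
    counts = Counter()
    for doc in docs:
        # dict.fromkeys de-duplicates: a doc counts once per bucket
        for value in dict.fromkeys(doc[field]):
            counts[value] += 1
    return counts.most_common(size)

# Made-up orders: like the sample data, both fields are multi-valued.
orders = [
    {"category": ["Men's Clothing"], "manufacturer": ["Elitelligence"]},
    {"category": ["Men's Clothing", "Men's Shoes"],
     "manufacturer": ["Elitelligence", "Oceanavigations"]},
    {"category": ["Women's Shoes"], "manufacturer": ["Pyramidustries"]},
    {"category": ["Men's Shoes"], "manufacturer": ["Elitelligence"]},
]

all_categories = terms_agg(orders, "category")
top_manufacturer = terms_agg(orders, "manufacturer", size=1)

print(all_categories)
# [("Men's Clothing", 2), ("Men's Shoes", 2), ("Women's Shoes", 1)]
print(top_manufacturer)
# [('Elitelligence', 3)]
```

Because an order can belong to several categories, the bucket counts can add up to more than the number of orders.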

Metric aggregations

Calculate metrics on the data searched and/or grouped into buckets

“Calculate metrics, such as a sum or average, from field values.” - doc

🖱️ Code example

# ─────────────────────────────────────────────
# Metrics aggregations
# ─────────────────────────────────────────────
          
# ---
# Data creation
# ---
# Add "Sample eCommerce orders" data directly from kibana,
# Follow this guide: https://www.elastic.co/guide/en/kibana/7.13/get-started.html#gs-get-data-into-kibana
          
GET _cat/indices?v
# > `kibana_sample_data_ecommerce` 
          
GET kibana_sample_data_ecommerce/_search?size=1
# > get an idea of the doc structure
          
# ---
# Metrics aggregation
# ---
          
# AVG of products "price"
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "avg_price": {
      "avg": {
        "field": "products.price"
      }
    }
  }
}
# > average products price: 34.78€
          
# Most recent order
GET kibana_sample_data_ecommerce/_search
{
  "size": 0, 
  "aggs": {
    "max_order_date": {
      "max": {
        "field": "order_date"
      }
    }
  }
}
# > "2021-11-13T23:45:36.000Z"
          
# Oldest order
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "min_order_date": {
      "min": {
        "field": "order_date"
      }
    }
  }
}
# > "2021-10-14T00:04:19.000Z"
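
A detail worth keeping in mind for the `avg` request above: `products.price` holds several values per order, and the metric is computed over all the values, not per document. A minimal Python sketch with made-up orders (the field names are simplified and hypothetical):

```python
# Made-up orders: each one carries the prices of its products.
orders = [
    {"order_date": "2021-10-14T00:04:19", "prices": [10.0, 30.0]},
    {"order_date": "2021-11-13T23:45:36", "prices": [50.0]},
]

# `avg`: every value of the field takes part in the average.
all_prices = [price for order in orders for price in order["prices"]]
avg_price = sum(all_prices) / len(all_prices)

# `max` / `min` on a date field: ISO-8601 strings in the same
# format sort chronologically, so plain comparison is enough here.
max_order_date = max(order["order_date"] for order in orders)
min_order_date = min(order["order_date"] for order in orders)

print(avg_price)       # 30.0 (three values, not two orders)
print(max_order_date)  # 2021-11-13T23:45:36
print(min_order_date)  # 2021-10-14T00:04:19
```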

Bucket and metrics together

We can combine both functionalities to calculate metrics for all buckets at once
🦂 Inside the aggs field of the request, a Bucket or a Metric aggregation can be used in the same place. It is the aggregation type that differentiates a bucketization from a metric calculation.
- e.g. the terms aggregation creates sub-groups (buckets) while avg calculates the average value of each bucket

🖱️ Code example

# ─────────────────────────────────────────────
# Bucket & Metrics aggregations
#
# Note:
# - We will mix `query`, bucket and metric aggregations
# ─────────────────────────────────────────────
          
# ---
# Data creation
# ---
# Add "Sample eCommerce orders" data directly from kibana,
# Follow this guide: https://www.elastic.co/guide/en/kibana/7.13/get-started.html#gs-get-data-into-kibana
          
GET _cat/indices?v
# > `kibana_sample_data_ecommerce` 
          
GET kibana_sample_data_ecommerce/_search?size=1
# > get an idea of the doc structure
          
# ---
# Bucket & Metrics aggregation
# ---
          
# AVG price per category
GET kibana_sample_data_ecommerce/_search?size=0
{
  "aggs": {
    "avg_categories_price": {
      "terms": {
        "field": "category.keyword"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "products.price"
          }
        }
      }
    }
  }
}
# > "Men's Clothing" avg price: 33.44
# > "Women's Clothing" avg price: 32.91 
# > [...]
          
# AVG price of "Men's Clothing"
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "category.keyword": "Men's Clothing"
          }
        }
      ]
    }
  },
  "aggs": {
    "category_mens_clothing_avg_price": {
      "avg": {
        "field": "products.price"
      }
    }
  }
}
# > "value": 33.44
# Note: same result as before, but without the overhead
# of calculating the AVG of all categories
          
# Number of products bought per day
POST kibana_sample_data_ecommerce/_search?size=0
{
  "aggs": {
    "daily_orders": {
      "date_histogram": {
        "field": "order_date",
        "calendar_interval": "day"
      },
      "aggs": {
        "products_counter": {
          "value_count": {
            "field": "products._id.keyword"
          }
        }
      }
    }
  }
}
# > 2021-10-21 - 318 products bought
# > 2021-10-22 - 334 products bought
          
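# Number of distinct products bought per day
# Note: `cardinality` counts distinct values (approximately), so the
# result can be lower than the `value_count` of the previous query
GET kibana_sample_data_ecommerce/_search?size=0
{
  "aggs": {
    "LEVEL_1": {
      "date_histogram": {
        "field": "order_date",
        "calendar_interval": "day"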
      },
      "aggs": {
        "LEVEL_2": {
          "cardinality": {
            "field": "products._id.keyword"
          }
        }
      }
    }
  }
}
          
# Date with the max products bought
GET kibana_sample_data_ecommerce/_search?size=0
{
  "aggs": {
    "daily_orders": {
      "date_histogram": {
        "field": "order_date",
        "calendar_interval": "day"
      },
      "aggs": {
        "products_counter": {
          "value_count": {
            "field": "products._id.keyword"
          }
        }
      }
    },
    "date_max_products_bought": {
      "max_bucket": {
        "buckets_path": "daily_orders>products_counter"
      }
    }
  }
}
# > "2021-10-29T00:00:00.000Z" with 368 products bought
# Note: `buckets_path` references "daily_orders" and
#       "products_counter", both user-defined aggregation
#       names, joined by the `>` separator
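
The three steps used above (bucket, metric inside each bucket, pipeline on top) can be sketched in Python. The orders are made up and the variable names are only illustrative; the real `cardinality` aggregation is approximate, while `len(set(...))` here is exact.

```python
from collections import defaultdict

# Made-up orders: order date and the ids of the products bought.
orders = [
    {"order_date": "2021-10-21T09:15:00", "product_ids": ["p1", "p2"]},
    {"order_date": "2021-10-21T18:40:00", "product_ids": ["p1"]},
    {"order_date": "2021-10-22T11:05:00", "product_ids": ["p3", "p4", "p5", "p1"]},
]

# Bucket step: a `date_histogram` with a daily interval groups docs by day.
daily = defaultdict(list)
for order in orders:
    day = order["order_date"][:10]
    daily[day].extend(order["product_ids"])

# Metric step, computed inside every bucket.
value_count = {day: len(ids) for day, ids in daily.items()}       # all values
cardinality = {day: len(set(ids)) for day, ids in daily.items()}  # distinct values

# Pipeline step: `max_bucket` picks the bucket with the highest metric.
best_day = max(value_count, key=value_count.get)

print(value_count)  # {'2021-10-21': 3, '2021-10-22': 4}
print(cardinality)  # {'2021-10-21': 2, '2021-10-22': 4}
print(best_day, value_count[best_day])  # 2021-10-22 4
```

On 2021-10-21 the product `p1` is bought twice, which is why `value_count` and `cardinality` disagree for that day.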

🔹 Write and execute aggregations that contain sub-aggregations

With sub-aggregations, we can go deeper in the analysis of the data. We can create buckets with some criteria inside other buckets.
- E.g. we can create buckets, one for each day and aggregate inside each day with some other policy

🖱️ Code example

# ─────────────────────────────────────────────
# Aggregations with sub-aggregations
# ─────────────────────────────────────────────
      
# Add "Sample eCommerce orders" sample
# data directly from Kibana:
# https://www.elastic.co/guide/en/kibana/7.13/get-started.html#gs-get-data-into-kibana
      
# ---
# Check data existence
# ---
      
GET _cat/indices?v
# > "kibana_sample_data_ecommerce"
      
GET kibana_sample_data_ecommerce/_search?size=1
      
# ---
# Manufacturers inside each category
# ---
      
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "category_family": {
      "terms": {
        "field": "category.keyword"
      },
      "aggs": {
        "manufacturer_family": {
          "terms": {
            "field": "manufacturer.keyword"
          }
        }
      }
    }
  }
}
# > the "Men's Clothing" category has 1242 orders with products from the "Elitelligence" manufacturer
# Note: "manufacturer_family" is a sub-aggregation
      
# Check that the above statement is true
GET kibana_sample_data_ecommerce/_count
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "category.keyword": "Men's Clothing"
          }
        },
        {
          "match": {
            "manufacturer.keyword": "Elitelligence"
          }
        }
      ]
    }
  }
}
# > 1242
# Note: the count matches the aggregation result above
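
The relation between a sub-aggregation and the `_count` check above can be reproduced locally. A minimal Python sketch with made-up orders; `nested_terms` and `count` are hypothetical helpers, not Elasticsearch APIs:

```python
from collections import Counter, defaultdict

def nested_terms(docs, outer_field, inner_field):
    """Toy `terms` aggregation with a `terms` sub-aggregation."""
    buckets = defaultdict(Counter)
    for doc in docs:
        for outer in dict.fromkeys(doc[outer_field]):
            # the sub-aggregation only sees the docs of its parent bucket
            for inner in dict.fromkeys(doc[inner_field]):
                buckets[outer][inner] += 1
    return {outer: dict(inner) for outer, inner in buckets.items()}

def count(docs, category, manufacturer):
    """Toy `_count` with two required conditions (bool + must)."""
    return sum(1 for doc in docs
               if category in doc["category"]
               and manufacturer in doc["manufacturer"])

orders = [
    {"category": ["Men's Clothing"], "manufacturer": ["Elitelligence"]},
    {"category": ["Men's Clothing", "Men's Shoes"],
     "manufacturer": ["Elitelligence", "Oceanavigations"]},
    {"category": ["Women's Shoes"], "manufacturer": ["Pyramidustries"]},
]

result = nested_terms(orders, "category", "manufacturer")
print(result["Men's Clothing"])
# {'Elitelligence': 2, 'Oceanavigations': 1}
print(count(orders, "Men's Clothing", "Elitelligence"))  # 2
```

Each inner bucket counts the documents that satisfy both the outer and the inner condition, which is why the `_count` request returns the same number.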

🔹 Write and execute a query that searches across multiple clusters

🔗 Official doc

You can connect ES clusters so that a search query is performed across all of them
- 🦂 Not all APIs are supported, here is the complete list.
  - e.g. you cannot get a document from a remote cluster by _doc id:
```
GET local-index/_doc/01 # ← allowed
GET remote-cluster:remote-index/_doc/01 # ← not allowed
```
Basically, use the following format if you want to query a remote cluster:
GET <remote-cluster-name>:<remote-index-name>/<API>

💡 You can check the remote cluster connection using the _remote/info API

GET /_remote/info
# > "<cluster name>"
# > "connected" : true,
# > "num_nodes_connected" : 1,
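
The `<cluster>:<index>` notation can be illustrated with a small Python sketch that splits an index expression into its targets. It is a simplification under stated assumptions: `split_index_expression` is a hypothetical helper, and unlike Elasticsearch it does not check the prefix against the configured remote cluster aliases.

```python
def split_index_expression(expression):
    """Split 'c1-index,cluster2:c2-index' into (cluster, index) pairs.
    A cluster of None means the local cluster."""
    targets = []
    for part in expression.split(","):
        cluster, separator, index = part.strip().partition(":")
        if separator:
            targets.append((cluster, index))
        else:
            targets.append((None, cluster))  # no prefix: local index
    return targets

targets = split_index_expression("c1-index,cluster2:c2-index")
print(targets)
# [(None, 'c1-index'), ('cluster2', 'c2-index')]
```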

🖱️ Code example

The cluster used for the following example is multicluster-configured

# ─────────────────────────────────────────────
# Multiple clusters search
#
# Architecture:
#     Two clusters (`cluster1`,`cluster2`), 
#     with `cluster1` configured to be
#     connected to `cluster2`
# Note:
#     - To run the experiment use the cluster "multicluster-configured":
#     https://github.com/pistocop/elastic-certified-engineer/tree/master/dockerfiles
#     - Pay attention to the comments: some kibana
#     code should be run on a different host
# ─────────────────────────────────────────────
      
# ---
# Kibana code for `cluster2`
# Tip: open `cluster2` kibana at localhost:5602
# and paste the following code
# ---
      
GET /
# > `cluster2`
      
GET /_remote/info
# > no connections
      
# Create some data
PUT c2-index/_doc/01
{
  "msg": "Hello world form cluster 2!"
}
      
GET cluster1:c1-index/_search
{
  "query": {
    "match_all": {}
  }
}
# > index not found
      
# ---
# [!] Kibana code for `cluster1`
# Tip: open `cluster1` kibana at localhost:5601
# and paste the following code
# ---
      
GET /
# > `cluster1`
      
GET /_remote/info
# `cluster2` connected
# Note: "num_nodes_connected" should be at least 1
      
GET cluster2:c2-index/_search
{
  "query": {
    "match_all": {}
  }
}
# > "msg" : "Hello world form cluster 2!"
      
# Create some data on `cluster1`
PUT c1-index/_doc/01
{
  "msg": "Hello world form cluster 1!"
}
      
GET /c1-index,cluster2:c2-index/_search
{
  "query": {
    "match_all": {}
  }
}
# > "Hello world form cluster 1!"
# > "Hello world form cluster 2!"
      
# ---
# Kibana code for `cluster2`
# Tip: open `cluster2` kibana at localhost:5602
# and paste the following code
# ---
      
PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "cluster1": {
          "mode": "sniff",
          "seeds": [
            "c1n1:9300"
          ],
          "transport.ping_schedule": "30s"
        }
      }
    }
  }
}
# > 200 OK
      
GET cluster1:c1-index/_search
{
  "query": {
    "match_all": {}
  }
}
# > "msg" : "Hello world form cluster 1!"

🔷 Developing Search Applications

Questions

🔹 Highlight the search terms in the response of a query

🔗 Official doc
🔗 Official highlighting examples - doc

“Highlighters enable you to get highlighted snippets from one or more fields in your search results so you can show users where the query matches are” - doc
💡 ES internals thoughts:
At indexing time the text is parsed and tokenized, and the tokens are used to build the inverted index used for searching. This structure alone doesn't meet the requirements to “highlight” a piece of the original text: e.g. we also need the original text and the position (offsets) of each token in it.
ES has multiple solutions for resolving this issue, explained in the chapter Offset Strategy.
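
Why the original text and the token offsets are both needed can be shown with a minimal Python sketch. It is not one of the Elasticsearch highlighters: `highlight` is a hypothetical helper that re-tokenizes the stored text with a simple regex and wraps the matching tokens.

```python
import re

def highlight(text, query, pre="<em>", post="</em>"):
    """Wrap every token of `text` that matches a query token.
    The offsets (m.start(), m.end()) point back into the original text."""
    wanted = {token.lower() for token in re.findall(r"\w+", query)}
    pieces, last = [], 0
    for m in re.finditer(r"\w+", text):
        if m.group().lower() in wanted:
            pieces.append(text[last:m.start()])  # untouched text before the match
            pieces.append(pre + m.group() + post)
            last = m.end()
    pieces.append(text[last:])
    return "".join(pieces)

snippet = highlight("To be, or not to be, that is the question", "to be")
print(snippet)
# <em>To</em> <em>be</em>, or not <em>to</em> <em>be</em>, that is the question
```

Without the original text (for example with `_source` disabled and no stored field) there is nothing to slice, which matches the "no highlight" results of the experiments in the code example.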

🖱️ Code example

The cluster used for the following example is single-node

# ─────────────────────────────────────────────
# Highlighting
# ─────────────────────────────────────────────
      
# ---
# Data creation
# ---
# Add "Sample eCommerce orders" data directly from kibana,
# Follow this guide: https://www.elastic.co/guide/en/kibana/7.13/get-started.html#gs-get-data-into-kibana
      
# ---
# Basic highlight test
# ---
      
# Note: drop the index created in a previous exercise, if any
DELETE test-index-01
PUT test-index-01
{
  "mappings": {
    "properties": {
      "msg": {
        "type": "text"
      }
    },
    "_source": {
      "enabled": false
    }
  }
}
      
PUT test-index-02
{
  "mappings": {
    "properties": {
      "msg": {
        "type": "text"
      }
    },
    "_source": {
      "enabled": true
    }
  }
}
      
PUT test-index-01/_doc/01
{
  "msg": "To be, or not to be, that is the question"
}
      
PUT test-index-02/_doc/01
{
  "msg": "To be, or not to be, that is the question"
}
      
GET test-index-01/_doc/01
# > "found" : true
# Note: the body is not returned, because it wasn't stored
      
GET test-index-02/_doc/01
# > "found" : true + "_source" with body
      
GET test-index-02/_search
{
  "query": {
    "match": {
      "msg": "to be"
    }
  },
  "highlight": {
    "fields": {
      "msg": {}
    }
  }
}
# > <em>To</em> <em>be</em>
      
GET test-index-01/_search
{
  "query": {
    "match": {
      "msg": "to be"
    }
  },
  "highlight": {
    "fields": {
      "msg": {}
    }
  }
}
# > hits with score, no highlight
# Note: if the source text is not stored,
#       you cannot get the highlight
      
# ---
# Index settings and highlights behaviour
# Experiment 1
# ---
      
PUT test-index-03
{
  "mappings": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "msg":{
        "type": "text",
        "term_vector": "with_positions_offsets"
      }
    }
  }
}
      
PUT test-index-03/_doc/01
{
  "msg": "To be, or not to be, that is the question"
}
      
GET test-index-03/_search
{
  "query": {
    "match": {
      "msg": "to be"
    }
  },
  "highlight": {
    "fields": {
      "msg": {
        "type": "unified"
      }
    }
  }
}
# > no highlights
# Note: even with "term_vector": "with_positions_offsets",
#       highlighting cannot work if the source is not stored,
#       as described in the official documentation:
#       https://www.elastic.co/guide/en/elasticsearch/reference/7.13/mapping-source-field.html
      
# ---
# Index settings and highlights behaviour
# Experiment 2
# ---
      
GET test-index-02
# > _source enabled and "msg" type "text"
      
GET test-index-02/_search
{
  "query": {
    "match": {
      "msg": "to be"
    }
  },
  "highlight": {
    "fields": {
      "msg": {
        "type": "fvh"
      }
    }
  }
}
# > error
# Note: you must index the term vectors
#       if you want to use the `fvh` highlighter
      
PUT test-index-04
{
  "mappings": {
    "_source": {
      "enabled": true
    },
    "properties": {
      "msg": {
        "type": "text",
        "term_vector": "with_positions_offsets"
      }
    }
  }
}
      
PUT test-index-04/_doc/01
{
  "msg": "To be, or not to be, that is the question"
}
      
GET test-index-04/_search
{
  "query": {
    "match": {
      "msg": "to be"
    }
  },
  "highlight": {
    "fields": {
      "msg": {
        "type": "fvh"
      }
    }
  }
}
# > <em>To</em> <em>be</em>
# Note: the fast vector highlighter (`fvh`) can be the
#       fastest option, especially on large fields, but it needs
#       a field mapped with "term_vector": "with_positions_offsets",
#       and this parameter doubles the size of the field's index:
#       https://www.elastic.co/guide/en/elasticsearch/reference/7.13/term-vector.html
      
# ---
# Highlights behaviour
# ---
      
PUT test-index-05
{
  "mappings": {
    "_source": {
      "enabled": true
    },
    "properties": {
      "msg": {
        "type": "text"
      }
    }
  }
}
      
PUT test-index-05/_doc/01
{
  "msg": "To be, or not to be, that is the question"
}
      
GET test-index-05/_search
{
  "query": {
    "match": {
      "msg": "to be"
    }
  },
  "highlight": {
    "fields": {
      "msg": {
        "type": "unified"
      }
    },
    "boundary_scanner": "word"
  }
}
# > "<em>To</em>",
      
# ---
# Highlight and bool
# ---
      
PUT test-index-06
{
  "mappings": {
    "_source": {
      "enabled": true
    },
    "properties": {
      "phrase": {
        "type": "text"
      },
      "comment": {
        "type": "text"
      }
    }
  }
}
      
PUT test-index-06/_doc/01
{
  "phrase": "To be, or not to be, that is the question",
  "comment": "he is questioning the value of life..."
}
      
PUT test-index-06/_doc/02
{
  "phrase": "The greatest glory in living lies not in never falling, but in rising every time we fall",
  "comment": "he was speaking about the power of persistence..."
}
      
PUT test-index-06/_doc/03
{
  "phrase": "I’ve learned that life is one crushing defeat after another until you just wish Flanders was dead.",
  "comment": "he is obese, immature, outspoken, aggressive, balding, lazy, ignorant,..."
}
      
GET test-index-06/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "phrase": "Flanders"
          }
        },
        {
          "match": {
            "comment": "he is"
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "phrase": {
        "type": "plain"
      }
    }
  }
}
# > <em>Flanders</em>
# Note: only the `phrase` field is highlighted
      
GET test-index-06/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "phrase": "Flanders"
          }
        },
        {
          "match": {
            "comment": "immature ignorant"
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "phrase": {
        "type": "plain"
      },
      "comment": {
        "type": "plain"
      }
    }
  }
}
# > both fields highlighted
      
GET test-index-06/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "phrase": "Flanders"
          }
        },
        {
          "match": {
            "comment": "he is"
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "comment": {
        "highlight_query": {
          "match": {
            "comment": "ignorant"
          }
        }
      }
    }
  }
}
# > <em>ignorant</em>
# Note: the query searches some fields with some clauses,
#       while `highlight_query` makes the highlighter work on a different query.

🔹 Sort the results of a query by a given set of requirements

🔗 Official doc

“Allows you to add one or more sorts on specific fields.” - doc
Usually the _score field is used to order the documents, but other fields can also be involved in the ordering
🦂 If you use the sort field to order the results on another field, the scores (and so the max_score value) are no longer computed; use "track_scores": true if you want them
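
The effect of `sort` and `track_scores` on the scores can be modelled with a short Python sketch. It is a toy model under stated assumptions: the hits and their scores are made up, and `search` is a hypothetical helper that only mimics the shape of the response.

```python
def search(hits, sort_field=None, order="asc", track_scores=False):
    """Toy model of `sort`: `hits` are dicts with a precomputed `_score`."""
    if sort_field is None:
        # default behaviour: best score first, scores always available
        ordered = sorted(hits, key=lambda h: h["_score"], reverse=True)
        keep_scores = True
    else:
        ordered = sorted(hits, key=lambda h: h[sort_field],
                         reverse=(order == "desc"))
        keep_scores = track_scores  # field sort: scores only on request
    scores = [h["_score"] for h in ordered] if keep_scores else []
    return {
        "max_score": max(scores) if scores else None,
        "hits": [dict(h, _score=h["_score"] if keep_scores else None)
                 for h in ordered],
    }

hits = [
    {"order_id": 550375, "_score": 0.8},
    {"order_id": 550412, "_score": 5.0},
    {"order_id": 550390, "_score": 2.1},
]

by_score = search(hits)
by_field = search(hits, sort_field="order_id")
tracked = search(hits, sort_field="order_id", track_scores=True)

print(by_score["max_score"], by_score["hits"][0]["order_id"])  # 5.0 550412
print(by_field["max_score"], by_field["hits"][0]["order_id"])  # None 550375
print(tracked["max_score"], tracked["hits"][0]["_score"])      # 5.0 0.8
```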

🖱️ Code example

The cluster used for the following example is single-node

# ─────────────────────────────────────────────
# Sorting the search results
# ─────────────────────────────────────────────
      
# Add "Sample eCommerce orders" sample
# data directly from Kibana:
# https://www.elastic.co/guide/en/kibana/7.13/get-started.html#gs-get-data-into-kibana
      
GET _cat/indices?v
# > "kibana_sample_data_ecommerce"
      
GET kibana_sample_data_ecommerce
GET kibana_sample_data_ecommerce/_search?size=1
      
GET kibana_sample_data_ecommerce/_search
{
  "_source": ["products.product_name"], 
  "query": {
    "match": {
      "products.product_name": "basic dark"
    }
  }
}
# > "max_score" : 5.0528083,
# > "Basic T-shirt - Dark Salmon"
      
GET kibana_sample_data_ecommerce/_search
{
  "_source": ["products.product_name"], 
  "query": {
    "match": {
      "products.product_name": "basic dark"
    }
  },
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    }
  ]
}
# > "max_score" : 5.0528083,
# > "Basic T-shirt - Dark Salmon"
# Note: same results as before because ES
#       orders by _score by default
      
GET kibana_sample_data_ecommerce/_search
{
  "_source": ["products.product_name"], 
  "query": {
    "match": {
      "products.product_name": "basic dark"
    }
  },
  "sort": [
    {
      "_score": {
        "order": "asc"
      }
    }
  ]
}
# > "max_score" : null,
# > "Cocktail dress / Party dress - peacoat" | "_score" : 0.81010413
# Note: we have reversed the score order, so now ES
#       doesn't return the max_score value and the first hit
#       is the least relevant one with respect to the query
      
GET kibana_sample_data_ecommerce/_search
{
  "_source": [
    "products.product_name",
    "order_id"
  ],
  "query": {
    "match": {
      "products.product_name": "basic dark"
    }
  },
  "sort": [
    {
      "order_id": {
        "order": "asc"
      }
    }
  ]
}
# > "max_score" : null,
# > "order_id" : 550375,
# > "product_name" : "Basic T-shirt - Medium Slate Blue"
# Note: the results have lost the _score for the same
#       reason as in the last query. Here we are ordering on the
#       `order_id` value: this means that the first hit
#       is the document that matches the query (maybe with the
#       lowest score, but this doesn't matter) AND has
#       the smallest `order_id` value (the order is `asc`)
      
GET kibana_sample_data_ecommerce/_search
{
  "_source": [
    "products.product_name",
    "order_id"
  ],
  "track_scores": true,
  "query": {
    "match": {
      "products.product_name": "basic dark"
    }
  },
  "sort": [
    {
      "order_id": {
        "order": "asc"
      }
    }
  ]
}
# > "max_score" : 5.0528083,
# Note: we can specify that we want the max score also
      
# ---
# Order an aggregation
# ---
      
# Number of products bought per day, in asc order,
# of the "Elitelligence" manufacturer
GET kibana_sample_data_ecommerce/_search?size=0
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "manufacturer.keyword": "Elitelligence"
          }
        }
      ]
    }
  },
  "aggs": {
    "daily_bucket": {
      "date_histogram": {
        "field": "order_date",
        "calendar_interval": "day"
      },
      "aggs": {
        "n_products": {
          "value_count": {
            "field": "products._id.keyword"
          }
        },
        "n_products_sort": {
          "bucket_sort": {
            "sort": [
              {
                "n_products": {
                  "order": "asc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
# > "key_as_string" : "2021-11-07T00:00:00.000Z",
# > "n_products" : { "value" : 70 }
# Note: to order the buckets we haven't used
#       the "sort" field; buckets need a
#       pipeline aggregation instead.
#       In this case the `Bucket sort` pipeline
#       was used: https://www.elastic.co/guide/en/elasticsearch/reference/7.13/search-aggregations-pipeline-bucket-sort-aggregation.html

🔹 Implement pagination of the results of a search query

🔗 Official doc

A query can match a large number of documents, so we usually retrieve the results one page at a time rather than all at once.
There are two main ways to paginate the documents:
- using the from and size fields: recommended if the total hits to paginate are < 10,000 (by default, from + size cannot exceed the index.max_result_window setting, which is 10,000)
- using the search_after field: recommended if the total hits to paginate are > 10,000
Both pagination approaches assume that the index doesn’t change between one page request and the next; otherwise, pages can be inconsistent.
- To overcome this problem, Elasticsearch has a feature named Point In Time (PIT) - 🔗 doc
  - 💡 Basically, it generates a token (the PIT id) that represents the state of the index data at that moment, and you then pass this token with each pagination request.
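To make the difference between the two approaches more concrete, here is a small pure-Python simulation; it is a sketch under simplifying assumptions and not how Elasticsearch works internally. A list of dicts stands in for the index, a frozen copy of that list stands in for a PIT, and the helper names (`page_from_size`, `page_search_after`, `sort_values`) are hypothetical.

```python
# In-memory stand-in for an index: each doc has a sort field and a tiebreaker.
docs = [{"order_id": 700 + i // 2, "_shard_doc": i} for i in range(10)]


def sort_key(doc):
    # order_id descending, then the tiebreaker ascending.
    return (-doc["order_id"], doc["_shard_doc"])


def sort_values(doc):
    # The values a client would copy into `search_after` for the next page.
    return [doc["order_id"], doc["_shard_doc"]]


def page_from_size(source, from_, size):
    # Re-sorts everything and skips `from_` hits on every request.
    return sorted(source, key=sort_key)[from_:from_ + size]


def page_search_after(source, size, search_after=None):
    # Keeps only the hits that sort strictly after the last hit already seen.
    ordered = sorted(source, key=sort_key)
    if search_after is not None:
        after_key = (-search_after[0], search_after[1])
        ordered = [d for d in ordered if sort_key(d) > after_key]
    return ordered[:size]


first = page_search_after(docs, 3)
second = page_search_after(docs, 3, search_after=sort_values(first[-1]))

snapshot = list(docs)                             # PIT-like frozen view
docs.append({"order_id": 999, "_shard_doc": 10})  # indexed between two pages

shifted = page_from_size(docs, 3, 3)      # live index: page 2 starts with a repeated hit
stable = page_from_size(snapshot, 3, 3)   # frozen view: page 2 is unchanged
print(shifted[0] == first[-1], stable == second)
```

In the simulation the new document pushes every hit down by one position, so the second from/size page on the live data starts with the last hit of the first page, while the frozen view and the search_after cursor are not affected.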

🖱️ Code example

🦂 Note the different pagination using search_after: without PIT we don’t receive and don’t use the _shard_doc value (see Dictionary paragraph and the following code for more explanation)

# ─────────────────────────────────────────────
# Queries pagination
# ─────────────────────────────────────────────
      
# ---
# Data creation
# ---
# Add "Sample eCommerce orders" data directly from kibana,
# Follow this guide: https://www.elastic.co/guide/en/kibana/7.13/get-started.html#gs-get-data-into-kibana
      
# ---
# Use the basic "from" / "size" duo
# ---
      
GET kibana_sample_data_ecommerce/_search?size=1
      
GET kibana_sample_data_ecommerce/_count
{
  "query": {
    "match": {
      "manufacturer": "Elitelligence"
    }
  }
}
# > 1370 hits
      
GET kibana_sample_data_ecommerce/_search
{
  "_source": ["order_id"], 
  "from": 0,
  "size": 10,
  "query": {
    "match": {
      "manufacturer": "Elitelligence"
    }
  },
  "sort": [
    {
      "order_id": {
        "order": "desc"
      }
    }
  ]
}
# > last `order_id` returned is '723049'
      
GET kibana_sample_data_ecommerce/_search
{
  "_source": ["order_id"], 
  "from": 9,
  "size": 20,
  "query": {
    "match": {
      "manufacturer": "Elitelligence"
    }
  },
  "sort": [
    {
      "order_id": {
        "order": "desc"
      }
    }
  ]
}
# > first `order_id` returned is '723049'
      
POST kibana_sample_data_ecommerce/_doc
{
  "manufacturer": "Elitelligence",
  "order_id" : 723050
}
      
GET kibana_sample_data_ecommerce/_search
{
  "_source": ["order_id"], 
  "from": 9,
  "size": 20,
  "query": {
    "match": {
      "manufacturer": "Elitelligence"
    }
  },
  "sort": [
    {
      "order_id": {
        "order": "desc"
      }
    }
  ]
}
# > first `order_id` returned is '723050'
# Note: same query, different result ('723050' != '723049'),
#       because a new document was indexed in the meantime.
#       This behavior highlights how this approach
#       redoes the search, without persistence,
#       each time we ask for a new page.
      
# ---
# Use PIT: Point In Time with "from" / "size" duo
# ---
      
GET kibana_sample_data_ecommerce/_search
{
  "_source": [
    "order_id",
    "type"
  ],
  "from": 0,
  "size": 10,
  "query": {
    "match": {
      "geoip.region_name": "Cairo Governorate"
    }
  },
  "sort": [
    {
      "order_id": {
        "order": "desc"
      }
    }
  ]
}
# > hits.total: 491
# > first document returned: "order_id" : 723213
      
POST kibana_sample_data_ecommerce/_pit?keep_alive=60m
# > "id" : "85ez..."
      
POST kibana_sample_data_ecommerce/_doc
{
  "order_id": 730000,
  "geoip":{
    "region_name": "Cairo Governorate"
  }
}
      
GET kibana_sample_data_ecommerce/_search
{
  "_source": [
    "order_id",
    "type"
  ],
  "from": 0,
  "size": 10,
  "query": {
    "match": {
      "geoip.region_name": "Cairo Governorate"
    }
  },
  "sort": [
    {
      "order_id": {
        "order": "desc"
      }
    }
  ]
}
# > hits.total: 492
# > first document returned: "order_id" : 730000
# Note: same problem as before: a document was
#       indexed between the two queries
      
POST _search
{
  "_source": [
    "order_id",
    "type"
  ],
  "from": 0,
  "size": 10,
  "query": {
    "match": {
      "geoip.region_name": "Cairo Governorate"
    }
  },
  "pit": {
    "id": "85ezAwEca2liYW5hX3NhbXBsZV9kYXRhX2Vjb21tZXJjZRZHSGJGY19SNlJ6MkhyMTZTemNCWm5BABZ0MlBFd2hNQVRybUZ5SzZRcEU4WTF3AAAAAAAAAAWLFkVxTndfdC1VUXU2cEJTVVVkRDJzcmcAARZHSGJGY19SNlJ6MkhyMTZTemNCWm5BAAA"
  },
  "sort": [
    {
      "order_id": {
        "order": "desc"
      }
    }
  ]
}
# > hits.total: 491
# > first document returned: "order_id" : 723213
# Note: same results as before the document was indexed.
#       With PIT we can search and paginate docs
#       without inconsistency.
# Note: you need to replace the "pit.id" value
# Note: with the use of PIT, we receive an
#       additional value in the "sort" array of each hit:
#       it is the value of the `_shard_doc` tiebreaker field
      
# ---
# Use `search_after` to paginate
#
# Tip: recommended approach to paginate > 10.000 hits
# ---
      
GET kibana_sample_data_ecommerce/_search
{
  "_source": [
    "order_id",
    "type"
  ],
  "from": 0,
  "size": 10,
  "query": {
    "match": {
      "geoip.region_name": "Cairo Governorate"
    }
  },
  "sort": [
    {
      "order_id": {
        "order": "desc"
      }
    }
  ]
}
# > last document "order_id" : 722406
      
GET kibana_sample_data_ecommerce/_search
{
  "_source": [
    "order_id",
    "type"
  ],
  "from": -1,
  "size": 10,
  "query": {
    "match": {
      "geoip.region_name": "Cairo Governorate"
    }
  },
  "sort": [
    {
      "order_id": {
        "order": "desc"
      }
    }
  ],
  "search_after": [
    722406
  ]
}
# > first document "order_id" : 722373
# Note: the `from` value is -1 (unset) because it is not relevant:
#       we are asking for the 10 documents after the
#       document with `order_id` 722406
      
GET kibana_sample_data_ecommerce/_search
{
  "_source": [
    "order_id",
    "type"
  ],
  "from": 10,
  "size": 11,
  "query": {
    "match": {
      "geoip.region_name": "Cairo Governorate"
    }
  },
  "sort": [
    {
      "order_id": {
        "order": "desc"
      }
    }
  ]
}
# > first document "order_id" : 722373
# Note: same result as before, because both
#       methods have the same goal: paginate
      
# ---
# Recommended way to paginate >10 000 hits:
# queries with both `search_after` and `PIT` fields
# ---
      
POST kibana_sample_data_ecommerce/_pit?keep_alive=60m
# > "id" : "85ezA..."
      
GET _search
{
  "_source": [
    "order_id",
    "type"
  ],
  "from": -1,
  "size": 10,
  "query": {
    "match": {
      "geoip.region_name": "Cairo Governorate"
    }
  },
  "sort": [
    {
      "order_id": {
        "order": "desc"
      }
    }
  ],
  "pit": {
    "id": "85ezAwEca2liYW5hX3NhbXBsZV9kYXRhX2Vjb21tZXJjZRZHSGJGY19SNlJ6MkhyMTZTemNCWm5BABZ0MlBFd2hNQVRybUZ5SzZRcEU4WTF3AAAAAAAAAAqbFkVxTndfdC1VUXU2cEJTVVVkRDJzcmcAARZHSGJGY19SNlJ6MkhyMTZTemNCWm5BAAA="
  }
}
# > last document "order_id" : 722406
# > last document "sort": ["722406", 4634]
# Note: we don't specify the index in the GET API
# Note: 4634 is the `_shard_doc` value and needs
#       to be used in the next request
      
GET _search
{
  "_source": [
    "order_id",
    "type"
  ],
  "from": -1,
  "size": 10,
  "query": {
    "match": {
      "geoip.region_name": "Cairo Governorate"
    }
  },
  "sort": [
    {
      "order_id": {
        "order": "desc"
      }
    }
  ],
  "pit": {
    "id": "85ezAwEca2liYW5hX3NhbXBsZV9kYXRhX2Vjb21tZXJjZRZHSGJGY19SNlJ6MkhyMTZTemNCWm5BABZ0MlBFd2hNQVRybUZ5SzZRcEU4WTF3AAAAAAAAAAqbFkVxTndfdC1VUXU2cEJTVVVkRDJzcmcAARZHSGJGY19SNlJ6MkhyMTZTemNCWm5BAAA="
  },
  "search_after": [
    "722406",
    4634
  ]
}
# > first document "order_id" : 722373
# Note: we need to use both the "order_id" 
#       pagination value and the `_shard_doc` value
# Note: you need to replace the pit.id
      
# --------------------------------
# Other examples, based on Kibana 
# provided example indices
# --------------------------------
      
GET kibana_sample_data_flights/_search
{
  "_source": [
    "FlightNum"
  ],
  "from": 0,
  "size": 3,
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "FlightNum": {
        "order": "asc"
      }
    }
  ]
}
# Second to last: 009NIGR
# Last flight: 00CGX81
      
GET kibana_sample_data_flights/_search
{
  "_source": [
    "FlightNum"
  ],
  "from": 0,
  "size": 3,
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "FlightNum": {
        "order": "asc"
      }
    }
  ],
  "search_after": [
    "009NIGR"
    ],
  "track_total_hits": false
}
# 1st: 00CGX81
# Note: since we used the "second to last" value of the previous
# query, the 1st element now is the same as the last one of that query

🔹 Define and use index aliases

🔗 Official doc

“An index alias is a secondary name for one or more indices.” - doc
You could use aliases for multiple purposes, e.g.:
- To always use the same name in a hot-warm-cold architecture with data streams
  - See the Test hot-warm-cold architecture chapter for more info
- To join and/or filter some indexes and expose the results as an independent index - doc
- To make ES changes transparent to the user
- 🔗 More on so and doc
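As a hedged sketch of the “transparent changes” use case, the Python snippet below only builds the body of a `POST /_aliases` request that moves an alias from one index to another in a single call. The index and alias names (`orders-v1`, `orders-v2`, `orders`) and the helper functions are hypothetical; the `add` and `remove` actions and the optional `filter` are the ones offered by the aliases API.

```python
import json


def add_alias_action(indices, alias, filter_query=None):
    """Build an `add` action for one or more indices, optionally filtered."""
    if isinstance(indices, str):
        indices = [indices]
    action = {"indices": list(indices), "alias": alias}
    if filter_query is not None:
        action["filter"] = filter_query
    return {"add": action}


def remove_alias_action(index, alias):
    """Build a `remove` action for a single index."""
    return {"remove": {"index": index, "alias": alias}}


# Both actions travel in the same request, so clients that query the
# `orders` alias never see a moment where it points to no index.
body = {
    "actions": [
        remove_alias_action("orders-v1", "orders"),
        add_alias_action("orders-v2", "orders"),
    ]
}
print(json.dumps(body, indent=2))
```

The application keeps querying `orders` before and after the swap, which is what makes the change invisible to it.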

🖱️ Code example

# ─────────────────────────────────────────────
# Index aliases
# ─────────────────────────────────────────────
      
# ---
# Data creation
# ---
# Add "Sample eCommerce orders" data directly from kibana,
# Follow this guide: https://www.elastic.co/guide/en/kibana/7.13/get-started.html#gs-get-data-into-kibana
      
# ---
# Alias basics
# ---
      
GET kibana_sample_data_ecommerce/_search
{
  "_source": [
    "customer_full_name"
  ],
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "order_id": 584677
          }
        }
      ]
    }
  }
}
# > hits.total.value = 1
# > "customer_full_name" : "Eddie Underwood"
      
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "kibana_sample_data_ecommerce",
        "alias": "my-index-001"
      }
    }
  ]
}
# > 200
      
GET my-index-001/_search
{
  "_source": [
    "customer_full_name"
  ],
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "order_id": 584677
          }
        }
      ]
    }
  }
}
# > hits.total.value = 1
# > "customer_full_name" : "Eddie Underwood"
# Note: we have used the alias as index name
      
GET _cat/aliases?v
# > my-index-001 | kibana_sample_data_ecommerce
      
# ---
# Alias and multiple indexes
# ---
      
GET kibana_sample_data_ecommerce/_search
{
  "_source": [
    "customer_full_name",
    "order_id",
    "type"
  ],
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "customer_full_name.keyword": "Eddie Underwood"
          }
        }
      ]
    }
  }
}
# > "customer_full_name" : "Eddie Underwood",
# > "type" : "order",
# > "order_id" : 584677
      
PUT customers-additional-info
{
  "mappings": {
    "properties": {
      "customer_full_name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "customer_favorite_colour": {
        "type": "keyword"
      }
    }
  }
}
# > 200
      
PUT customers-additional-info/_doc/01
{
  "customer_full_name": "Eddie Underwood",
  "customer_favorite_colour": "red"
}
# > 200
      
POST /_aliases
{
  "actions": [
    {
      "add": {
        "indices": [
          "kibana_sample_data_ecommerce",
          "customers-additional-info"
        ],
        "alias": "customers-info"
      }
    }
  ]
}
# > "acknowledged" : true
      
GET customers-info/_search
{
  "_source": [
    "customer_full_name",
    "customer_gender",
    "customer_favorite_colour"
  ],
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "customer_full_name.keyword": "Eddie Underwood"
          }
        }
      ]
    }
  }
}
# > "customer_full_name" : "Eddie Underwood",
# > "customer_favorite_colour" : "red"
# > "customer_gender" : "MALE"
# Note: the information retrieved comes from both indices
      
# ---
# Alias as index filter
# ---
      
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "kibana_sample_data_ecommerce",
        "alias": "men-clothing",
        "filter": {
          "bool": {
            "filter": [
              {
                "term": {
                  "category.keyword": "Men's Clothing"
                }
              }
            ]
          }
        }
      }
    }
  ]
}
# > "acknowledged" : true
      
GET men-clothing/_search?size=3
# > Only docs with "category" : [ "Men's Clothing" ]

🔹 Define and use a search template

🔗 Official doc

“A search template is a stored search you can run with different variables.” - doc
- Useful for many things:
  - To avoid exposing the ES query syntax externally: the final API will be more user-friendly and straightforward to use (you only need to provide the id of the template and the runtime parameters)
  - If an app makes the query, we can change the query structure without changing the app’s code
To parametrize the query, use Mustache variables (e.g. {{product_category}})
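To give an idea of what the template rendering step does, here is a deliberately simplified Python stand-in: it only replaces plain `{{name}}` placeholders and ignores every other Mustache feature (sections, defaults, `toJson`, and so on). The `render` function is hypothetical and is not the engine Elasticsearch uses; the template is the same shape as the one stored in the console example of this chapter.

```python
import json
import re

TEMPLATE = {
    "query": {"match": {"category": "{{product_category}}"}},
    "size": 10,
}


def render(template, params):
    """Replace {{name}} placeholders in a JSON template with parameter values."""
    source = json.dumps(template)

    def substitute(match):
        # Missing parameters render as an empty string, like plain Mustache variables.
        value = params.get(match.group(1), "")
        # JSON-escape the value, then drop the surrounding quotes added by dumps().
        return json.dumps(str(value))[1:-1]

    rendered = re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, source)
    return json.loads(rendered)


print(render(TEMPLATE, {"product_category": "Men's Clothing"}))
```

The caller only supplies the parameter values, while the query structure stays inside the template and can change without touching the caller.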

🖱️ Code example

# ─────────────────────────────────────────────
# Search templates
# ─────────────────────────────────────────────
      
# ---
# Data creation
# ---
# Add "Sample eCommerce orders" data directly from kibana,
# Follow this guide: https://www.elastic.co/guide/en/kibana/7.13/get-started.html#gs-get-data-into-kibana
      
# ---
# Create and use a search template
# ---
      
PUT _scripts/ten-products-by-category-template
{
  "script": {
    "lang": "mustache",
    "source": {
      "query": {
        "match": {
          "category": "{{product_category}}"
        }
      },
      "size": 10
    },
    "params": {
      "product_category": "The category name to search"
    }
  }
}
# > "acknowledged" : true
# Note: the API to call is `_scripts` and we define
#       the query structure using the Mustache language
#       for the parameter placeholders
      
POST _render/template
{
  "id": "ten-products-by-category-template",
  "params": {
    "product_category": "category to search"
  }
}
# > "template_output" ...
# Note: with _render we can see the final 
#       query body produced from the template
      
GET kibana_sample_data_ecommerce/_search/template
{
  "id": "ten-products-by-category-template",
  "params": {
    "product_category": "Men's Clothing"
  }
}
# > "category" : [ "Men's Clothing", ...
# Note: the query format is simpler than ES DSL query
# Note: we have used a template against a 
#       specific index, but the template
#       could be used with other indexes also
      
PUT test-index-001/_doc/01
{
  "category": "my-test-category"
}
# > 200
      
GET test-index-001/_search/template
{
  "id": "ten-products-by-category-template",
  "params": {
    "product_category": "test"
  }
}
# > "category" : "my-test-category"
# Note: same template, different index
      
# ---
# Search templates with default values
# ---
GET kibana_sample_data_ecommerce/_search

🔷 Data Processing

Questions

🔹 Define a mapping that satisfies a given set of requirements

🔗 Official doc

Mapping is the process of defining how a document and the fields it contains are stored, analyzed and indexed

Mapping process

The mapping is specified using the field of the same name (mappings) inside the create index API

⭐ Inside the mapping term we can use two families of fields: Metadata Fields and Mapping Parameters

Metadata fields - doc
- Fields linked to the document but not directly part of its content,
  e.g. _id is the unique internal (inside the ES world) document id
  - 🦂 Some Kibana suggestions are outdated, like the "_all": {"enabled": true} metadata field, which is deprecated and unsupported
- 💡 All the metadata fields start with the underscore “_” and they are “internal” fields, similar in spirit to private class attributes in Python
Mapping parameters - doc
- Fields that specify how to manage the document fields,
  e.g. properties field is used to declare the document fields with the relative properties (like the type, define an analyzer to use etc.)

🖱️ Code example

# ─────────────────────────────────────────────
# Basic Index mapping
# ─────────────────────────────────────────────
              
PUT my-index-01
{
  "mappings": {
    "_source": {
      "enabled": true
    },
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "simple"
      },
      "country": {
        "type": "keyword",
        "store": false
      }
    }
  }
}
# > 200
# Note: `_source` is a metadata field; if enabled,
#       the original document data is stored.
#       Inside `properties` we define which fields
#       are mapped and how

For the fields type declaration we can use two approaches: the Dynamic mapping and the Explicit mapping

Field type declaration: declare what a document field will contain (e.g. text, keyword, date, object)

Explicit mapping - doc
- Specify the fields information (name, type, analyzer etc.) at index creation time
- 💡 You can’t change the mapping or field type of an existing field.
  (although there are some exceptions - info)
  - 🦂 If you want to change the index mapping, you need to reindex the data
Dynamic mapping - doc
- Using the dynamic parameter, we can declare how new (not yet declared) fields are managed
  - The dynamic parameter allows the following values:
    
    🔗 list copied from docs
    - true - New fields are added to the mapping (default)
    - runtime - New fields are added to the mapping as runtime fields.
      These fields are not indexed and are loaded from _source at query time.
    - false - New fields are ignored and only stored on _source
    - strict - If new fields are detected, an exception is thrown
  - 💡 We can also use the dynamic templates to declare how to define certain fields matching some naming criteria, see the dedicated chapter for more
- Elasticsearch automatically assigns a type to new fields found in a new document using those rules
- 💡 You can use both mapping approaches at the same time: explicitly declare the fields you already know and delegate the type management of new fields to Elasticsearch using the dynamic parameter
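The behaviour of the dynamic parameter can be mimicked with a few lines of Python. This is a toy model under stated assumptions: it handles flat documents only, it skips the `runtime` value and date detection, and the function names (`infer_type`, `index_document`) are hypothetical. The type inference follows the basic rules of dynamic mapping (JSON integer to long, JSON float to float, string to text, boolean to boolean, object to object).

```python
def infer_type(value):
    """Very rough stand-in for dynamic field type detection."""
    # bool must be checked before int: bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "text"  # real ES also adds a `keyword` sub-field
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"unsupported value: {value!r}")


def index_document(mapping, doc, dynamic="true"):
    """Update `mapping` in place and return the fields that were ignored."""
    ignored = []
    for field, value in doc.items():
        if field in mapping:
            continue  # explicitly (or previously) mapped field
        if dynamic == "strict":
            raise ValueError(f"strict_dynamic_mapping_exception: [{field}]")
        if dynamic == "false":
            ignored.append(field)  # kept in _source, but not searchable
        else:
            mapping[field] = infer_type(value)
    return ignored


mapping = {"name": "text"}
index_document(mapping, {"name": "tyler", "age": 33}, dynamic="true")
print(mapping)
```

With `dynamic="false"` the same call would leave the mapping untouched and return the new field names, and with `dynamic="strict"` it would raise.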

🖱️ Code example

# ─────────────────────────────────────────────
# Explicit / Dynamic Mapping
# ─────────────────────────────────────────────
              
# ---
# Explicit mapping
# ---
              
PUT /my-index-01
{
  "mappings": {
    "properties": {
      "age": {
        "type": "integer"
      },
      "email": {
        "type": "keyword"
      },
      "name": {
        "type": "text"
      }
    }
  }
}
# > "acknowledged" : true
              
GET _cat/shards/my-index-01?v
# > 2 entries
# Note: by default, Elasticsearch creates an index
#       with 1 primary shard and 1 replica
              
PUT my-index-01/_doc/01
{
  "name" : "tyler",
  "age" : 33,
  "email" : "tyler@hotmail.com",
  "employee-id" : 1
}
# > 200
# Note: "employee-id" wasn't declared in the mapping,
#       this field is dynamically mapped
              
GET my-index-01/_doc/01
# > "employee-id" : 1
              
PUT /my-index-01/_mapping
{
  "properties": {
    "employee-id": {
      "type": "long",
      "index": false
    }
  }
}
# > 400; "conflicts with existing mapper"
# Note: we cannot change the mapping of a field that already exists,
#       regardless of whether it was mapped dynamically or explicitly
              
PUT /my-index-01/_mapping
{
  "properties": {
    "born-city": {
      "type": "text",
      "analyzer": "simple"
    }
  }
}
# > 200
# Note: this update works because "born-city" is a new field:
#       no indexed document has it and it is not
#       in the mapping properties yet
              
PUT /my-index-01/_doc/02
{
  "name": "magda",
  "age": 45,
  "email": "magda@hotmail.com",
  "employee-id": 2,
  "born-city": "New York"
}
# > 200
              
GET my-index-01/_search
{
  "query": {
    "match": {
      "born-city": "new"
    }
  }
}
# > "name" : "magda",
# Note: this search is possible because the
#       `born-city` field uses the `simple` analyzer,
#       which splits on non-letters and lowercases the tokens
              
# ---
# Dynamic mapping
# ---
              
DELETE my-index-02
PUT my-index-02
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "user": {
        "properties": {
          "name": {
            "type": "text"
          },
          "social_networks": {
            "dynamic": true,
            "properties": {}
          }
        }
      }
    }
  }
}
# > 200
# Note: we have provided the field "dynamic" : "strict",
#       so no new fields are allowed on this index
              
PUT my-index-02/_doc/1
{
  "user": {
    "name": "tyler" 
  }
}
# > 200
              
PUT my-index-02/_doc/2
{
  "user": {
    "name": "tyler"
  },
  "provider": "AWS"
}
# > 400; "strict_dynamic_mapping_exception"
              
PUT my-index-02/_doc/2
{
  "user": {
    "name": "tyler",
    "social_networks": {
      "fb": "tyler-official"
    }
  }
}
# > 200
# Note: here the new field is correctly indexed,
#       although not declared at mapping time,
#       because of social_networks.dynamic: true
              
PUT my-index-03
{
  "mappings": {
    "dynamic": "false",
    "properties": {
      "user": {
        "properties": {
          "name": {
            "type": "text"
          },
          "social_networks": {
            "dynamic": true,
            "properties": {}
          }
        }
      }
    }
  }
}
# 200
# Note: we have provided the field "dynamic" : "false",
#       new fields will be ignored
              
PUT my-index-03/_doc/01
{
  "user": {
    "name": "giorgio",
    "social_networks": {
      "fb": "giorgione"
    }
  }
}
# > 200
              
PUT my-index-03/_doc/02
{
  "user": {
    "name": "maccio",
    "social_networks": {
      "fb": "The Real Maccio"
    },
    "provider": "AWS"
  }
}
# > 200
              
GET my-index-03/_search
{
  "query": {
    "match": {
      "user.name": "maccio"
    }
  }
}
# > 200; hits.total: 1
# Note: the document retrieved has "provider" : "AWS"
              
GET my-index-03/_search
{
  "query": {
    "match": {
      "user.provider": "AWS"
    }
  }
}
# > 0 hits found 
# Note: `user.provider` is stored in the `_source` field
#       but isn't indexed for search because we had
#       defined "dynamic": "false" at mapping time

Field data types

🔗 Official doc
- Each field has a field data type, whether it was user-defined (explicit mapping) or inferred by Elasticsearch (dynamic mapping)
- Interesting data types:
  
  🔗 complete list
  - keyword: used for structured content (e.g. ID, mail, tags).
    - The keyword family has three types (keyword, constant_keyword and wildcard), see Keyword type family
      - 💡 One of the keyword types is the wildcard: “for unstructured machine-generated content” - doc
        
        “users wanting to search machine-generated logs should find this new field type a more natural fit than existing options.” - blog
        
        Usage example: extracting info from log data - doc
  - alias: defines an alternate name for a field in the index
  - object: A JSON object.
  - join: special field that creates parent/child relation
  - range: continuous range of values between an upper and lower bound
  - aggregate_metric_double: Pre-aggregated metric values.
Runtime Fields

🔗 Official doc
- “A runtime field is a field that is evaluated at query time.” - doc
- “At its core, the most important benefit of runtime fields is the ability to add fields to documents after you’ve ingested them” - doc

🖱️ Code example

🦂 We cannot multi-field a field of type object or nested

PUT my-index-000004
{
  "mappings": {
    "properties": {
      "my-field": {
        "type": "object",
        "fields": {
          "raw": { 
            "type":  "keyword"
          }
        }
      }
    }
  }
}
# > 400; Failed to parse mapping [_doc]: Mapping definition for [my-field] has unsupported parameters
          
PUT my-index-000004
{
  "mappings": {
    "properties": {
      "my-field": {
        "type": "text",
        "fields": {
          "raw": { 
            "type":  "keyword"
          }
        }
      }
    }
  }
}
# > 200

# ─────────────────────────────────────────────
# Index mapping wrap up
# ─────────────────────────────────────────────
      
# ---
# Try different mapping parameters & types
# 
# All list here:
# https://www.elastic.co/guide/en/elasticsearch/reference/7.13/mapping-params.html
# ---
      
PUT test-index-01
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "copy_to": "full_name",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "surname": {
        "type": "text",
        "copy_to": "full_name",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "full_name": {
        "type": "keyword"
      }
    }
  }
}
# > 200
# Note: properties `name` and `surname` are
#       indexed both as `text` and `keyword` types
      
PUT test-index-01/_doc/01
{
  "name":"bat",
  "surname": "man"
}
      
GET test-index-01/_doc/01
# > 200; `full_name` not present
      
GET test-index-01/_search
{
  "query": {
    "match": {
      "full_name": "man"
    }
  }
}
# > 200; document found: `full_name` is queryable but not returned in `_source`
      
PUT test-index-02
{
  "mappings": {
    "properties": {
      "full_name": {
        "type": "text",
        "analyzer": "simple",
        "term_vector": "with_positions",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "hobbies": {
        "type": "nested",
        "properties": {
          "name": {
            "type": "keyword"
          },
          "outdoor": {
            "type": "boolean"
          }
        }
      },
      "personal_info":{
        "type": "flattened"
      }
    }
  }
}
# > 200
# Note: we use some interesting data types and fields,
#       like `term_vector`, `nested` and `flattened` type
      
PUT test-index-02/_doc/01
{
  "full_name": "Angry Bird",
  "hobbies": [
    {
      "name": "fly",
      "outdoor": true
    },
    {
      "name": "Mobile gaming",
      "outdoor": "IDK"
    }
  ]
}
# > 400; Failed to parse value [IDK]
# Note: we had defined `outdoor` as
#       boolean field
      
PUT test-index-02/_doc/01
{
  "full_name": "Angry Bird",
  "hobbies": [
    {
      "name": "fly",
      "outdoor": true
    },
    {
      "name": "Mobile gaming",
      "outdoor": false
    }
  ],
  "personal_info":{
    "born_on":"20101105",
    "android_user": true,
    "labels":[
      "green",
      "red"
    ]
  }
}
# > 200
      
GET test-index-02/_doc/01
# > 200
      
GET test-index-02/_termvectors/01
# > 200
# Note: with `term_vector` enabled we can
#       explore how the analyzer has tokenized
#       the text
      
# ---
# Explicit & Dynamic mapping
# ---
      
PUT test-index-03
{
  "mappings": {
    "properties": {
      "product_id": {
        "type": "keyword"
      }
    }
  }
}
# > 200
# Note: explicit mapping
      
PUT test-index-04
{
  "mappings": {
    "dynamic": "true"
  }
}
# > 200
# Note: index that could ingest new fields,
#       and automatically detect field type
# Warning: `dynamic` field is not suggested
#         by Kibana Webapp
      
PUT test-index-04/_doc/01
{
  "product_id": "kiqhfi2iu3hf"
}
# > 200
      
GET test-index-04
# > "product_id" : "type" : "text" | "type" : "keyword"
      
PUT test-index-05
{
  "mappings": {
    "dynamic": "runtime"
  }
}
# > 200
      
PUT test-index-05/_doc/01
{
  "product_id": "kiqhfi2iu3hf"
}
# > 200
      
GET test-index-05
# > "product_id" : "type" : "keyword"
# Note: product_id isn't indexed as `text` field
#       like before in "test-index-04"
      
# ---
# Dynamic templates
# ---
      
# We can also define how to index some fields 
# without explicitly specifying the field's name,
# but instead use some matching conditions
# 
# More on the dedicated chapter and the official doc:
# https://www.elastic.co/guide/en/elasticsearch/reference/7.13/dynamic-templates.html
      
PUT test-index-06
{
  "mappings": {
    "dynamic": "true",
    "dynamic_templates": [
      {
        "customer_info_as_keywords": {
          "match_mapping_type": "string",
          "match": "customer_*",
          "mapping": {
            "type": "keyword"
          }
        }
      }
    ]
  }
}
# > 200
# Note: all string fields starting with the `customer_` prefix
#       will be indexed as `keyword` type.
#       Instead, other string fields will be indexed as
#       both `text` and `keyword` type because of the
#       `dynamic`:`true` parameter
# Warning: Kibana webapp doesn't suggest
#         "dynamic_templates" as available field
      
PUT test-index-06/_doc/01
{
  "customer_name": "giorgio",
  "customer_gender": "male",
  "product_comment": "Everything about the physical device, i feel like is pretty well made."
}
# > 200
      
GET test-index-06
# > 200
# Note: fields indexed as expected
#   customer_gender -> keyword
#   customer_name   -> keyword
#   product_comment -> text and keyword

🔹 Define and use a custom analyzer that satisfies a given set of requirements

Analyzer
- Analyzers are instruments used in text fields and provide different ways to analyze and search the text.
  - Through the analyzers, ES can return all relevant results, rather than just exact matches.
- The analyzer is composed of multiple components:
  
  🔗 Original doc
  - An analyzer may have zero or more character filters, which are applied in order.
    - e.g. mapping character filter, that replaces a sequence of characters with another sequence following a provided map
  - An analyzer must have exactly one tokenizer.
    - e.g. character group tokenizer, which splits the text into tokens whenever it encounters a character from a provided set
  - An analyzer may have zero or more token filters, which are applied in order.
    - e.g. stop token filter, used to remove the stop words from the text before the tokens are inserted into the inverted index - doc
- 🦂 Note that the analyzer modifies the text only to enhance the search process; your original document text will not be changed when retrieved and displayed.
  - This could lead to some mismatches, e.g. as explained in the documentation, highlighting can break if the analysis process changes the length of the original text
- Since the analyzer can alter the text, it should be applied to both the document text and the query - link
  - An analyzer used to parse new text that will be stored in an index is called an index analyzer, while an analyzer used to parse the query at search time is called a search analyzer
  - In most cases, the same analyzer should be used at index and search time.
    However, sometimes they could be different,
    e.g. here is a good example of the use of different analyzers at indexing and query time: link
- Create an analyzer at search time
  - You can create a new analyzer at search time, but be aware: that analyzer will be used only on the search query text, not on the indexed documents
    - 💡 This makes sense because at indexing time the analyzers parse the document and build all the internal structures that are later used to search quickly through the documents. With a “runtime” analyzer those internal structures cannot be created/updated “on the fly” for just one query.
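
The three stages can be pictured with a small Python model. It only illustrates the order "character filters, then one tokenizer, then token filters"; it is not how Elasticsearch implements them, and every function name here is invented for the sketch.

```python
import re

def mapping_char_filter(text, mappings):
    """Toy character filter: replace each key with its value in the raw text."""
    for source in sorted(mappings, key=len, reverse=True):
        text = text.replace(source, mappings[source])
    return text

def lowercase_tokenizer(text):
    """Toy tokenizer: split on anything that is not a letter, then lowercase."""
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]

def stop_filter(tokens, stop_words):
    """Toy token filter: drop the stop words."""
    return [token for token in tokens if token not in stop_words]

def analyze(text, char_filters, tokenizer, token_filters):
    for char_filter in char_filters:      # zero or more, applied in order
        text = char_filter(text)
    tokens = tokenizer(text)              # exactly one
    for token_filter in token_filters:    # zero or more, applied in order
        tokens = token_filter(tokens)
    return tokens

tokens = analyze(
    "I have 2 bike and 4 laptop",
    char_filters=[lambda t: mapping_char_filter(t, {"2": "two", "4": "four"})],
    tokenizer=lowercase_tokenizer,
    token_filters=[lambda toks: stop_filter(toks, {"and"})],
)
print(tokens)
```

Without the character filter the digits would be dropped by the tokenizer, because the replacement has to happen on the raw text before tokenization.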

🖱️ Code example

# ─────────────────────────────────────────────
# Create a custom analyzer that
# defines all three components:
# 1. character filter [0+]
# 2. tokenizer [1]
# 3. token filters [0+]
#
# Legend: [X] = how many items of that 
#               type we can define
# ─────────────────────────────────────────────
      
# ---
# 1. Character filter
# ---
      
# Replace digits and emoji with text
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "0 => zero",
        "1 => one",
        "2 => two",
        "3 => three",
        "4 => FOUR",
        "5 => five",
        "6 => six",
        "7 => seven",
        "8 => eight",
        "9 => nine"
      ]
    }
  ],
  "text": "I have 2 bike and 4 laptop :)"
}
      
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        ":) => _happy_"
      ]
    }
  ],
  "text": "I have 2 bike and 4 laptop :)"
}
      
# ---
# 2. Tokenizer
# ---
      
# The `lowercase` tokenizer splits the text on non-letter
# characters, so digits are dropped. In the final analyzer we replace
# digits with words using the character filter before tokenization
      
POST _analyze
{
  "tokenizer": "lowercase",
  "text": "I have 2 bike and 4 laptop :)"
}
      
# ---
# 3. Token filters
# ---
      
# We will use filters to replace `two bike` with `bikes`
GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter" : [
    {
      "type": "common_grams",
      "common_words": ["two", "bike"]
    }
  ],
  "text" : "I have two bike and four laptop :)"
}
      
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "pattern_replace",
      "pattern": "(two_bike)",
      "replacement": "bikes"
    }
  ],
  "text": "I have two_bike and four laptop :)"
}
      
# ---
# Create the index
# ---
      
DELETE my-index-0001
PUT my-index-0001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "char_filter": [
            "my_mapping_numbers_to_text",
            "my_mapping_emoji"
          ],
          "tokenizer": "lowercase",
          "filter": [
            "my_ngram_filter",
            "my_plural_bike_filter"
          ]
        }
      },
      "char_filter": {
        "my_mapping_numbers_to_text": {
          "type": "mapping",
          "mappings": [
            "0 => zero",
            "1 => one",
            "2 => two",
            "3 => three",
            "4 => four",
            "5 => five",
            "6 => six",
            "7 => seven",
            "8 => eight",
            "9 => nine"
          ]
        },
        "my_mapping_emoji": {
          "type": "mapping",
          "mappings": [
            ":) => _happy_"
          ]
        }
      },
      "filter": {
        "my_ngram_filter": {
          "type": "common_grams",
          "common_words": [
            "two",
            "bike"
          ]
        },
        "my_plural_bike_filter": {
          "type": "pattern_replace",
          "pattern": "(two_bike)",
          "replacement": "bikes"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
      
PUT my-index-0001/_doc/1
{
  "my_field": "I have 2 bike and 4 laptop :)"
}
      
GET my-index-0001/_doc/1
# > Original text returned
# [!] Remember: the analyzer applies changes
# only for search purposes and doesn't change
# the original document text. To inspect the
# terms used for search use `termvectors`
      
GET my-index-0001/_termvectors/1?fields=my_field&field_statistics=false
# > "two", "four", "_happy_", "bikes" tokens found
#
# [!] The 'lowercase' tokenizer removes the numbers from the text,
# so why do we have "four" and "two" in the termvectors? Because
# before the tokenizer runs we map all digits to words
# using the character filter 'my_mapping_numbers_to_text'
#
# [!] The "bikes" token is present because the token filters
# are applied in order, and "my_ngram_filter" builds
# the token "two_bike" before "my_plural_bike_filter"
# applies the conversion from "two_bike" to "bikes".
# In fact, the token "two_bike" is not reported
# in the termvectors list
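
Why the order of the token filters matters can be checked with a toy Python model. It is a simplification, not the real filters: `common_grams` here just adds an `a_b` token whenever one of two adjacent tokens is a common word, and `pattern_replace` is a plain regex substitution on each token.

```python
import re

def common_grams(tokens, common_words):
    """Toy model: keep every token and add `a_b` when a or b is a common word."""
    out = []
    for i, token in enumerate(tokens):
        out.append(token)
        if i + 1 < len(tokens) and (token in common_words or tokens[i + 1] in common_words):
            out.append(f"{token}_{tokens[i + 1]}")
    return out

def pattern_replace(tokens, pattern, replacement):
    """Toy model: regex substitution applied to every token."""
    return [re.sub(pattern, replacement, token) for token in tokens]

tokens = ["i", "have", "two", "bike", "and", "four", "laptop"]
common = {"two", "bike"}

# Same order as in `my_analyzer`: build the bigrams first, then rewrite `two_bike`.
grams_then_replace = pattern_replace(common_grams(tokens, common), "(two_bike)", "bikes")

# Reversed order: the pattern finds nothing to rewrite, because `two_bike` does not exist yet.
replace_then_grams = common_grams(pattern_replace(tokens, "(two_bike)", "bikes"), common)

print(grams_then_replace)
print(replace_then_grams)
```

Only the first ordering produces the `bikes` token; with the filters swapped, `two_bike` would stay in the index as it is.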

🔹 Define and use multi-fields with different data types and/or analyzers

🔗 Official doc

A way to “index the same field in different ways for different purposes” - doc
- The different ways include using a different analyzer or a different field type
🦂 In the documentation, the multi-fields specs are under the following path of the official index page:
Mapping → Mapping parameters → fields
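
A toy Python picture of the idea (not Elasticsearch code, names invented for the sketch): the same source value produces two independent index entries, and an exact `term` lookup behaves differently on each.

```python
def index_multi_field(value):
    """Toy model: one source value, two independent index entries."""
    return {
        "movie_title": value.lower().split(),   # text: analyzed into single terms
        "movie_title.keyword": [value],         # keyword: one exact, unanalyzed term
    }

def term_query(indexed, field, term):
    """Toy exact-term lookup on one of the indexed variants."""
    return term in indexed[field]

indexed = index_multi_field("american history x")

print(term_query(indexed, "movie_title", "american"))                    # text sub-terms
print(term_query(indexed, "movie_title.keyword", "american"))            # partial value
print(term_query(indexed, "movie_title.keyword", "american history x"))  # whole value
```

The `keyword` variant only matches the whole value, while the `text` variant matches the single terms.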

🖱️ Code example

# ─────────────────────────────────────────────
# Multi-fields examples
# ─────────────────────────────────────────────
      
# ---
# Basic usage
# ---
DELETE test-index-01
PUT test-index-01
{
  "mappings": {
    "properties": {
      "movie_title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "commentary": {
        "type": "text"
      }
    }
  }
}
# > 200
      
PUT test-index-01/_doc/01
{
  "movie_title": "american history x",
  "commentary": "American History X is a 1998 American crime drama film directed by Tony Kaye and written by David McKenna."
}
# > 200
      
GET test-index-01/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "movie_title.keyword": "american"
          }
        }
      ]
    }
  }
}
# > 0 hit
      
GET test-index-01/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "movie_title": "american"
          }
        }
      ]
    }
  }
}
# > 1 hit
      
# ---
# Define multiple types
# ---
      
PUT test-index-02
{
  "mappings": {
    "properties": {
      "movie_title": {
        "type": "text",
        "analyzer": "english", 
        "fields": {
          "sayt": {
            "type": "search_as_you_type"
          },
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
# > 200
# Note: "sayt" is an acronym of "Search As You Type"
# Note: we have defined two more "different ways" to index
#       the same document field
      
PUT test-index-02/_doc/01?refresh
{
  "movie_title": "The Lord of the Rings: The Return of the King"
}
# > 200
      
GET test-index-02/_search
{
  "query": {
   "prefix": {
     "movie_title": {
       "value": "of the"
     }
   }
  }
}
# > 0 hits
      
GET test-index-02/_search
{
  "query": {
   "prefix": {
     "movie_title.sayt": {
       "value": "of the"
     }
   }
  }
}
# > 1 hit
# Note: on the same field indexed as `text` with the `english` analyzer
#       the prefix `of the` cannot match: stop words like "of" and "the"
#       are removed and only single terms are indexed
      
# ---
# Define multiple analyzers
# ---
DELETE test-index-03
PUT test-index-03
{
  "settings": {
    "analysis": {
      "analyzer": {
        "agnostic_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "agnostic_filter"
          ]
        }
      },
      "filter": {
        "agnostic_filter": {
          "type": "pattern_replace",
          "pattern": "(christianity)",
          "replacement": "<religion>"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "user_id": {
        "type": "keyword"
      },
      "user_opinion": {
        "type": "text",
        "analyzer": "english",
        "term_vector": "with_positions_offsets_payloads",
        "store": true,
        "fields": {
          "agnostic": {
            "type": "text",
            "analyzer": "agnostic_analyzer",
            "search_analyzer": "agnostic_analyzer",
            "term_vector": "with_positions_offsets_payloads",
            "store": true
          }
        }
      }
    }
  }
}
# > 200
      
PUT test-index-03/_doc/01
{
  "user_id": "A001",
  "user_opinion": "I have a long family tradition around christianity and their celebrations"
}
# > 200
      
GET test-index-03/_search
{
  "query": {
    "match": {
      "user_opinion": "christianity tradition"
    }
  }
}
# > 0.575 score
      
GET test-index-03/_search
{
  "query": {
    "match": {
      "user_opinion": "buddhist tradition"
    }
  }
}
# > 0.28 score
      
GET test-index-03/_termvectors/01
# > "<religion>" is present with  "term_freq" : 1
      
GET test-index-03/_search
{
  "query": {
    "match": {
      "user_opinion.agnostic": "<religion> tradition"
    }
  }
}
# > 0.28 score
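# Note: I expected a score higher than 0.28. A likely explanation:
#   at search time the `standard` tokenizer strips `<` and `>`, so the
#   query term becomes `religion`, which does not match the indexed
#   token `<religion>` produced by the token filter.
#   TODO follow the ticket:
#   https://discuss.elastic.co/t/custom-analyzer-with-token-replacement/289236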
      
GET test-index-03/_search
{
  "query": {
    "match": {
      "user_opinion.agnostic": "tradition"
    }
  }
}
# > 0.28 score, same as with the <religion> tag

🔹 Use the Reindex API and Update By Query API to reindex and/or update documents

🦂 In the documentation, these specs are under the following path of the official index page:
REST APIs → Document APIs → [Update by query | Reindex]

Reindex

🔗 Reindex API official doc

“Copies documents from a source to a destination.” - doc
Basically, it uses a source index (which must have _source enabled) to index the documents into a destination index
The reindexing process is useful in many situations, thanks also to its properties, like
- Reindex from multiple sources
- Reindex only data that match a specific query
- Reindex with a max cap of documents
- … more examples in the code block
You could also reindex data from a remote cluster - doc
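
The three properties listed above can be combined in one request. Below is a sketch that only builds the JSON body for `POST _reindex` with Python's standard library; the index names and the query are invented for the example, and sending the request is left to whatever client you use.

```python
import json

# Hypothetical index names and query, chosen only for this sketch.
reindex_body = {
    "max_docs": 100,                                 # cap on the copied documents
    "source": {
        "index": ["test-index-a", "test-index-b"],   # several sources at once
        "query": {"match": {"category": "shoes"}},   # copy only the matching docs
    },
    "dest": {"index": "test-index-merged"},
}

# Paste the printed JSON as the body of `POST _reindex`.
print(json.dumps(reindex_body, indent=2))
```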

🖱️ Code example

# ─────────────────────────────────────────────
# Reindex API
# ─────────────────────────────────────────────
# Add "Sample eCommerce orders" sample
# data directly from Kibana:
# https://www.elastic.co/guide/en/kibana/7.13/get-started.html#gs-get-data-into-kibana
          
GET _cat/indices/kibana*?v
# > "kibana_sample_data_ecommerce"
          
# ---
# Change index settings and reindex
# ---
          
GET kibana_sample_data_ecommerce
# > "number_of_shards" : "1"
          
PUT test-index-01
{
  "settings": {
    "number_of_shards": 3
  },
  "mappings": {
    "properties": {
      "category": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
# > 200
# Note: the original index mapping properties are skipped
#       for space and readability, but in a real-world
#       scenario all the mapping properties should
#       be included
          
POST _reindex
{
  "source": {
    "index": "kibana_sample_data_ecommerce"
  },
  "dest": {
    "index": "test-index-01"
  }
}
# > 200; "total" : 4675,
          
GET _cat/shards?v
# > kibana_sample_data_ecommerce | one primary shard
# > test-index-01 | three rows for primary shards
          
# ---
# Alias + reindex for transparent 
# index structure changes
# ---
          
PUT test-index-02
{
  "aliases": {
    "movies-info": {}
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      }
    }
  }
}
# > 200
          
PUT movies-info/_doc/01
{
  "title": "Star Wars"
}
# > 200
          
GET movies-info/_search
{
  "query": {
    "match": {
      "title": "Star"
    }
  }
}
# > "title" : "Star Wars"
          
# [!] Now we want the `search_as_you_type`
#     field type under the `title` field.
#     One way to get this functionality also
#     on the already indexed documents
#     is the reindex process.
          
PUT test-index-03
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "sayt": {
            "type": "search_as_you_type"
          }
        }
      }
    }
  }
}
# > 200
# Note: create the index with the new requirements
          
PUT test-index-02/_settings
{
  "settings": {
    "index.blocks.write": true
  }
}
# > 200
# Note: block the insertion of new documents
#       on the "source" index
          
PUT movies-info/_doc/02
{
  "title": "Fight Club"
}
# > 403; index [test-index-02] blocked
# Note: alias movies-info refers to test-index-02
          
POST _reindex
{
  "source": {
    "index": "test-index-02"
  },
  "dest": {
    "index": "test-index-03"
  }
}
# > 200
          
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "test-index-03",
        "alias": "movies-info"
      }
    }
  ]
}
# > 200
# Note: now behind `movies-info` we have two indexes
          
GET movies-info/_search
{
  "query": {
    "match": {
      "title": "Star"
    }
  }
}
# > "title" : "Star Wars"
# > "title" : "Star Wars"
# Note: One hit from `test-index-02`, one
#       from `test-index-03`
          
DELETE test-index-02
# > 200
# Note: remove old data
          
GET movies-info/_search
{
  "query": {
    "multi_match": {
      "query": "star",
      "type": "bool_prefix",
      "fields": [
        "title.sayt",
        "title.sayt._2gram",
        "title.sayt._3gram"
      ]
    }
  }
}
# > "title" : "Star Wars"
# Note: query possible only with the
#       mapping of `test-index-03`
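
In the walkthrough above there is a window in which the alias points to both indexes and the search returns duplicates. A possible variation, sketched here, is to put a `remove` and an `add` action in the same `POST /_aliases` call, since the actions of one request are applied atomically. The helper below is made up for the example and only builds the request body.

```python
import json

def swap_alias_body(alias, old_index, new_index):
    """Hypothetical helper: build the body for a single `POST /_aliases` call."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

body = swap_alias_body("movies-info", "test-index-02", "test-index-03")
print(json.dumps(body, indent=2))
```

With this body the alias moves from the old index to the new one in one step, and the old index can be deleted afterwards.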

Update by query

🔗 Update by query official doc

“Updates documents that match the specified query” - doc
Useful to apply changes to a large number of documents that satisfy a query
🦂 While the update by query is running, the documents could be changed by other requests (and their _version field with them); we can use the conflicts field to specify how to handle this event - doc
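
The difference between the two `conflicts` values can be pictured with a toy Python model. It is a simplification of the real behaviour and every name in it is invented: the "snapshot" is the version of each document seen when the query started, and a document whose version changed in the meantime counts as a conflict.

```python
def update_by_query(index, snapshot, change, conflicts="abort"):
    """Toy model: `snapshot` maps doc id -> version seen when the query started."""
    result = {"updated": 0, "version_conflicts": 0}
    for doc_id, seen_version in snapshot.items():
        doc = index[doc_id]
        if doc["_version"] != seen_version:     # changed since the snapshot
            result["version_conflicts"] += 1
            if conflicts == "abort":
                break                           # stop at the first conflict
            continue                            # "proceed": count it and move on
        change(doc["_source"])
        doc["_version"] += 1
        result["updated"] += 1
    return result

def make_index():
    return {doc_id: {"_version": 1, "_source": {"n": 0}} for doc_id in ("01", "02", "03")}

def bump(source):
    source["n"] += 1

def run(conflicts):
    index = make_index()
    snapshot = {doc_id: doc["_version"] for doc_id, doc in index.items()}
    index["02"]["_version"] += 1                # another writer touches doc 02
    return update_by_query(index, snapshot, bump, conflicts)

print(run("abort"))
print(run("proceed"))
```

In both cases the documents updated before the conflict stay updated; with `proceed` the remaining documents are processed as well.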

🖱️ Code example

# ─────────────────────────────────────────────
# Update by query API
# ─────────────────────────────────────────────
# Add "Sample flight data" sample
# data directly from Kibana (used in the
# "Special attributes" part below):
# https://www.elastic.co/guide/en/kibana/7.13/get-started.html#gs-get-data-into-kibana
          
GET _cat/indices/kibana*?v
# > "kibana_sample_data_flights"
          
# ---
# Basic usage
# ---
          
PUT test-index-01/_doc/01
{
  "movie_name": "once upon a time in hollywood",
  "director": "Quentin Tarantino"
}
# > 200
          
GET test-index-01/_doc/01
# > _version: 1
          
POST test-index-01/_update_by_query
# > 200
          
GET test-index-01/_doc/01
# > _version: 2
# Note: `_update_by_query` takes the document from
#       `_source` and uses it to re-index the data
#       in the index. This process increases the _version
          
# ---
# Use update by query to change
# the documents fields contents
# ---
          
PUT test-index-01/_doc/02
{
  "movie_name": "Paz!",
  "director": "Renato De Maria"
}
# > 200
          
GET test-index-01/_search
{
  "query": {
    "match": {
      "movie_name": "once upon"
    }
  }
}
# > 1 hit; "movie_name" : "once upon a time in hollywood",
          
POST test-index-01/_update_by_query
{
  "conflicts": "proceed",
  "query": {
    "match": {
      "director": "quentin"
    }
  },
  "script": {
    "source": "ctx._source.movie_name='obfuscated'",
    "lang": "painless"
  }
}
# > 200
# Note: we change the movie_name only for docs
#       whose "director" matches "quentin"
          
GET test-index-01/_doc/01
# > "movie_name" : "obfuscated"
          
GET test-index-01/_doc/02
# > "movie_name" : "Paz!"
          
GET test-index-01/_search
{
  "query": {
    "match": {
      "movie_name": "once upon"
    }
  }
}
# > no hits
# Note: the change has also updated
#       the structures used for search
          
# ---
# Special attributes
# ---
          
GET kibana_sample_data_flights/_search?version=true
{
  "query": {
    "wildcard": {
      "Dest": "Sydney Kingsford *"
    }
  }
}
# > _version: 1
          
POST kibana_sample_data_flights/_update_by_query?conflicts=proceed
{
  "query": {
    "wildcard": {
      "Dest": "Sydney Kingsford *"
    }
  },
  "script": {
    "source": """
    long version = ctx['_version'];
    ctx["_source"]["dangerous"] = true;
    ctx["_version"] = version;
    """,
    "lang": "painless"
  }
}
          
GET kibana_sample_data_flights/_search?version=true
{
  "query": {
    "wildcard": {
      "Dest": "Sydney Kingsford *"
    }
  }
}
# > "dangerous" : true,
# > _version: 2
# Note: version is read only and cannot be managed

🔹 Define and use an ingest pipeline that satisfies a given set of requirements, including the use of Painless to modify documents

🔗 Official doc

Ingest pipeline

“perform common transformations on your data before indexing” - doc
With an ingest pipeline we can parse the input document and change its structure and content (unlike the analyzer, which parses and changes the text only for internal search purposes).
An ingest pipeline is composed of one or more processors, the “working units” that each apply a specific change to the document
- The processor list is here; it can also be retrieved by API and extended with plugins
- Each processor is configurable; for maximum flexibility there is the Script processor, which runs an inline or stored script
💡 A good approach could be to create the ingest pipeline from the Kibana GUI (Stack Management → Ingest Node Pipelines) and then use the Show request button to get the equivalent console request
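
A pipeline can be pictured as an ordered list of small functions applied to the document. The Python sketch below is only a mental model, not the real processors: the three toy processors mimic the `split`, `rename` and `set` processors used in the example further down, and all the names are invented.

```python
def split(field, separator):
    def processor(doc):
        doc[field] = doc[field].split(separator)
    return processor

def rename(field, target_field):
    def processor(doc):
        doc[target_field] = doc.pop(field)
    return processor

def set_value(field, value):
    def processor(doc):
        doc[field] = value
    return processor

def run_pipeline(processors, doc):
    doc = dict(doc)                 # work on a copy of the incoming document
    for processor in processors:    # processors run in the declared order
        processor(doc)
    return doc

pipeline = [
    split("folder_path", "/"),
    rename("folder_path", "folder_path_parsed"),
    set_value("parsed", True),
]

result = run_pipeline(pipeline, {"folder_path": "/foo/bar/folder/file.txt"})
print(result)
```

The order matters: `rename` works on the output of `split`, so swapping them would make `split` look for a field that no longer exists.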

🖱️ Code example

You can also create an ingest pipeline from the Kibana GUI
At least one node should have the ingest role

# ─────────────────────────────────────────────
# Ingest pipeline
# ─────────────────────────────────────────────
          
# ---
# Cluster
# ---
# The cluster must have at least one node with the `ingest` role,
# or an "illegal_state_exception" will be returned
          
# ---
# Pipeline basics
# ---
          
PUT _ingest/pipeline/test-pipeline-01
{
  "description": "Basic pipeline example: test 'split' and 'rename' processors",
  "processors": [
    {
      "split": {
        "field": "folder_path",
        "separator": "/"
      }
    },
    {
      "rename": {
        "field": "folder_path",
        "target_field": "folder_path_parsed"
      }
    },
    {
      "set": {
        "field": "parsed",
        "value": true
      }
    }
  ],
  "version": 1
}
# > 200
          
POST _ingest/pipeline/test-pipeline-01/_simulate
{
  "docs": [
    {
      "_source": {
        "folder_path": "/foo/bar/folder/file.txt"
      }
    },
    {
      "_source": {
        "folder_path": "file.txt"
      }
    }
  ]
}
# > "folder_path_parsed" : "", "foo", "bar", "folder", "file.txt"
# > "parsed" : true
# Note: use the `_simulate` endpoint to be sure
#       the pipeline performs as desired
          
PUT test-index-01/
{
  "settings": {
    "default_pipeline": "test-pipeline-01"
  }
}
# > 200
# Note: the `default_pipeline` field is not suggested by Kibana
          
PUT test-index-01/_doc/01
{
  "folder_path": "/foo/bar/folder/file.txt"
}
# > 200
          
GET test-index-01/_doc/01
# > "folder_path_parsed" : "", "foo", "bar", "folder", "file.txt"
          
# ---
# Modify already indexed documents
# ---
          
PUT test-index-02/_doc/01
{
  "user_id": 123,
  "nikname": "Dr1ppy"
}
# > 200
          
PUT _ingest/pipeline/test-pipeline-02
{
  "description": "Enrich forum data pipeline",
  "processors": [
    {
      "set": {
        "field": "forum",
        "value": "warrock"
      }
    }
  ],
  "version": 1
}
# > 200
          
POST test-index-02/_update_by_query?pipeline=test-pipeline-02
# > 200
          
GET test-index-02/_doc/01
# > "forum" : "warrock"
# Note: we have updated one field of **all** the documents
#       inside the index. What about updating only "some" documents?
          
POST _bulk
{"index":{"_index":"test-index-02","_id":"02"}}
{"user_id":234,"nikname":"BadKarma"}
{"index":{"_index":"test-index-02","_id":"03"}}
{"user_id":234,"nikname":"DankGamer","forum":"steam"}
# > 200
# Note: the last entry has "forum":"steam";
#       how can we set "forum" only for documents that
#       don't have it?
          
# [Solution 1]: use the `query` inside update_by_query
POST test-index-02/_update_by_query?pipeline=test-pipeline-02
{
  "query": {
    "bool": {
      "must_not": [
        {
          "exists": {
            "field": "forum"
          }
        }
      ]
    }
  }
}
# > 200
# Note: To find documents that are missing an indexed value for a field see
#       https://www.elastic.co/guide/en/elasticsearch/reference/7.13/query-dsl-exists-query.html#find-docs-null-values
          
GET test-index-02/_doc/02
# > "forum" : "warrock"
          
GET test-index-02/_doc/03
# > "forum" : "steam"
# Note: the document that already had the
#       "forum" field was not modified, as desired

Painless language

🔗 Official doc
🔗 Official guide

“With scripting, you can evaluate custom expressions in Elasticsearch” - doc
- There are several script languages: painless, expression, mustache, java; see the doc to understand which one to use
“Painless is a performant, secure scripting language designed specifically for Elasticsearch” - doc
We can use painless - and scripts in other languages - for a wide range of reasons:
- “You can write a script to do almost anything, and sometimes, that’s the trouble” - doc
💡 Store the script when possible: script compilation is expensive. For the same reason, don’t hardcode values inside the script; pass them as params so the compiled script can be reused.
🦂 Scripts are incredibly useful, but can’t use Elasticsearch’s index structures or related optimizations - doc
💡 Painless takeaways
- Access document fields using doc['field_name']
  - To use the field content we need to specify what we want:
    - doc['goals'].length ← count the list length
    - doc['name.keyword'].value ← access the keyword content
- Declare and initialize variables, e.g. int total = 0
- Access the document _source using ctx._source
- Use params.<parameter_name> to parametrize a script
- Painless debugging is based on the Debug.explain utility, which throws an exception and prints useful information such as the type of an object
- Use emit to return calculated values inside runtime_mappings - doc
- 🦂 When to use doc and when ctx._source?
  “Depending on where a script is used” - stack overflow
  - It depends on the context: each context has its own values available as local variables, here are the ones for the ingest context - doc

🖱️ Code example

Official guides and docs

🔗 Painless available keywords list
🔗 Painless syntax - doc
Use a Painless script in an update by query operation to add, modify, or delete fields within each of a set of documents collected as the result of a query - doc
🦂 When reading a single element of a multi-valued field by index, it looks like .value is not required:
total += doc['grades'][i] and not total += doc['grades'][i].value

# ─────────────────────────────────────────────
# Painless language
# ─────────────────────────────────────────────
          
# ---
# Add some data
# ---
          
PUT test-index-01/_doc/01
{
  "name": "John",
  "grades": [
    9.4,
    8.0,
    3.0
  ]
}
# > 200
          
PUT test-index-01/_doc/02
{
  "name": "Bob",
  "grades": [
    10.0,
    7.0,
    8.5,
    9.0
  ]
}
# > 200
          
PUT test-index-01/_doc/03
{
  "name": "Zen",
  "grades": [
    4.4,
    5.0
  ]
}
# > 200
          
# ---
# Use Painless for search
# ---
          
GET test-index-01/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "script": {
            "script": "doc['grades'].length > 3"
          }
        }
      ]
    }
  }
}
# > _id: 02
# Note: with painless we can return only
#       documents with more than 3 grades.
#       Hardcoding the values in the script is
#       inefficient (each variant is compiled again),
#       let's use script parameters
          
GET test-index-01/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "script": {
            "script": {
              "source": "doc[params.field_name].length > params.min_cardinality",
              "lang": "painless",
              "params": {
                "field_name": "grades",
                "min_cardinality": 3
              }
            }
          }
        }
      ]
    }
  }
}
# > _id: 02
# Note: same results as before, but with params
# Note: the structure is "must -> script -> script"
          
PUT _scripts/list-cardinality-filter
{
  "script": {
    "lang": "painless",
    "source": """
       doc[params.field_name].length > params.min_cardinality
    """
  }
}
# > 200
# Note: the params don't need to be declared in the stored
#       script: they are passed when the script is used
          
GET test-index-01/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "script": {
            "script": {
              "id": "list-cardinality-filter",
              "params": {
                "field_name": "grades",
                "min_cardinality": 3
              }
            }
          }
        }
      ]
    }
  }
}
# > _id: 02
# Note: we can also use stored scripts, same results as 
#       last two queries
# Note: pay attention to the query nested structure
          
# ---
# Update document using painless
# ---
          
GET test-index-01/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "script": {
            "script": {
              "lang": "painless",
              "source": """
                        double total = 0;
                        for (int i = 0; i < doc['grades'].length; i++) {
                          total += doc['grades'][i];
                        }
                        double avg = total / doc['grades'].length;
                        return avg > 7.0;
                        """
            }
          }
        }
      ]
    }
  }
}
# > _id : 02
# Note: only documents with an average
#       grade greater than 7.0 are returned
          
POST test-index-01/_update_by_query
{
  "script": {
    "source": """
              double total = 0;
              for (int i = 0; i < ctx._source[params.field_name].length; i++) {
                total += ctx._source[params.field_name][i];
              }
                        
              double avg = total / ctx._source[params.field_name].length;
                        
              if (avg > params.threshold){
                  ctx._source.elegible = true;
              } else {
                  ctx._source.elegible = false;
              }
              """,
    "params": {
      "field_name": "grades",
      "threshold": 7
    },
    "lang": "painless"
  }
}
# > 200
# Note: we are updating the documents' _source:
#       set elegible=true if the grades average is > 7.0
# Note: params and ctx are combined in the form `ctx._source[params.field_name]`
# Note: we have switched from `doc` to `ctx._source` because
#       we are in the `update_by_query` API
          
GET test-index-01/_search?size=10
# > _id : 01 -> elegible : false
# > _id : 02 -> elegible : true
# > _id : 03 -> elegible : false
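
A note on the accumulator type in averaging scripts like the ones above. The sketch below is a toy Python model, not Painless: it assumes that Painless, like Java, narrows the result of `+=` back to the declared type of the variable and truncates the division of two integers. Under that assumption an `int` accumulator silently drops the decimals of each grade, which can flip a threshold check. The function names and the sample grades are made up for the illustration.

```python
def average_with_int_accumulator(grades):
    """Toy model of `int total; total += grade;` with Java-like narrowing."""
    total = 0
    for grade in grades:
        total = int(total + grade)    # the fraction is dropped at every step
    return total // len(grades)       # int / int is an integer division

def average_with_double_accumulator(grades):
    """Toy model of `double total; total += grade;`."""
    return sum(grades) / len(grades)

grades = [7.9, 7.9]
threshold = 7

print(average_with_int_accumulator(grades) > threshold)
print(average_with_double_accumulator(grades) > threshold)
```

With these grades the truncated average is exactly 7 and fails the `> 7` check, while the real average passes it; declaring the accumulator and the average as `double` avoids the problem.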

⭐ Ingest Pipeline & Painless

# ─────────────────────────────────────────────
# Ingest pipeline & Painless
# ─────────────────────────────────────────────
      
# ---
# Basic ingest pipeline
# ---
PUT _ingest/pipeline/test-ingest-01
{
  "description": "Lowercase the csv row and extract the fields",
  "version": 1,
  "processors": [
    {
      "lowercase": {
        "field": "csv_data"
      }
    },
    {
      "csv": {
        "field": "csv_data",
        "target_fields": [
          "nickname",
          "city",
          "degree",
          "role"
        ],
        "separator": ";",
        "trim": true,
        "empty_value": "None",
        "tag": "my_csv_processor"
      }
    }
  ]
}
# > 200
# Note: created from Kibana GUI and then
#       pasted using "Show request"
      
PUT test-index-01/_doc/01?pipeline=test-ingest-01
{
  "csv_data": "pistocop; Bologna; CS; Data Engineer;"
}
# > 200
      
GET test-index-01/_doc/01
# > "role" : "data engineer"
# > "city" : "bologna",
# > "nickname" : "pistocop",
# > "degree" : "cs"
      
# ---
# Insert Painless script
# ---
PUT _ingest/pipeline/test-ingest-02
{
  "description": """Lowercase the csv row, extract the fields and create "cities_short" field""",
  "version": 1,
  "processors": [
    {
      "lowercase": {
        "field": "csv_data"
      }
    },
    {
      "csv": {
        "field": "csv_data",
        "target_fields": [
          "nickname",
          "city",
          "degree",
          "role"
        ],
        "separator": ";",
        "trim": true,
        "empty_value": "None",
        "tag": "my_csv_processor"
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": """
                  Map cities = new HashMap();
                  cities.put('bologna','bo');
                  cities.put('roma','rm');
                  cities.put('milano','mi');
                        
                  String city_shorted = cities.get(ctx[params.city_field]);
                  ctx[params.city_shorted_field] = city_shorted;
                  """,
        "params": {
          "city_field": "city",
          "city_shorted_field": "city_shorted"
        }
      }
    }
  ]
}
# > 200
# Note: use the `script` processor
#       to run Painless code
# Note: Painless is Java-like, do not forget to
#       declare the variable types
# Note: the script uses a field that has just been
#       created by the pipeline ("city"), so it is
#       important that the script processor is
#       executed after the csv processor
# Warning: in the ingest context the fields are accessed directly on `ctx`, not via `ctx._source`
      
PUT test-index-02/_doc/01?pipeline=test-ingest-02
{
  "csv_data": "pistocop; Bologna; CS; Data Engineer;"
}
# > 200
      
GET test-index-02/_doc/01
# > ... same as before
# > "city_shorted" : "bo"
      
PUT test-index-02/_doc/02?pipeline=test-ingest-02
{
  "csv_data": "magneto; MiLaNo; History; Teacher;"
}
GET test-index-02/_doc/02
# > ...
# > "city_shorted" : "mi",
      
# ---
# Create a dispatcher
# using pipeline + stored script
# ---
      
PUT _scripts/my-script
{
  "script":{
    "lang": "painless",
    "source": """
    String checkString = ctx[params['fieldToCheck']];
    if (checkString == params['checkValue']){
      ctx["_index"] = params['destinationIndex'];
    }
    """
  }
}
      
PUT _ingest/pipeline/my-dispacher-pipeline
{
  "processors": [
    {
      "script": {
        "id": "my-script",
        "params": {
          "fieldToCheck": "dispacher-type",
          "checkValue": "storic",
          "destinationIndex": "storic-index"
        }
      }
    }
  ]
}
# Note: params are set at pipeline level
      
POST _ingest/pipeline/my-dispacher-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "my-keyword-field": "FOO",
        "dispacher-type": "storic"
      }
    },
    {
      "_source": {
        "my-keyword-field": "BAR"
      }
    }
  ]
}
# > "_index" : "storic-index"
      
PUT storic-index
      
PUT my-index-01
{
  "settings": {
    "number_of_shards": 1,
    "default_pipeline": "my-dispacher-pipeline"
  }
}
      
PUT my-index-01/_doc/01
{
  "my-keyword-field": "FOO",
  "dispacher-type": "storic"
}
PUT my-index-01/_doc/02
{
  "my-keyword-field": "FOO",
  "dispacher-type": "non-storic"
}
      
GET storic-index/_search
# > "_id" : "01"
      
GET my-index-01/_search
# > "_id" : "02",
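
The routing decision taken by the stored script can be mimicked in plain Python. This is only a sketch of the logic, not Painless code: the `dispatch` function and its argument names are hypothetical, while the field names and values are the ones used in the example above.

```python
def dispatch(doc, default_index, field_to_check, check_value, destination_index):
    """Return the index a document ends up in, mimicking the stored script."""
    # A missing field yields None, which never equals the check value
    if doc.get(field_to_check) == check_value:
        return destination_index
    return default_index


# Same params as the ones set at the pipeline level
params = {
    "field_to_check": "dispacher-type",
    "check_value": "storic",
    "destination_index": "storic-index",
}
docs = {
    "01": {"my-keyword-field": "FOO", "dispacher-type": "storic"},
    "02": {"my-keyword-field": "FOO", "dispacher-type": "non-storic"},
}

routed = {
    doc_id: dispatch(doc, "my-index-01", **params)
    for doc_id, doc in docs.items()
}
print(routed)  # {'01': 'storic-index', '02': 'my-index-01'}
```

Only document 01 is redirected, which matches the two searches at the end of the example.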

🔹 Configure an index so that it properly maintains the relationships of nested arrays of objects

🔗 Official doc

“allows arrays of objects to be indexed in a way that they can be queried independently of each other.” - doc
In ES, and in the NoSQL world in general, there are no true relationships between data: each document is independent, and operations like SQL joins are not expected.
Nevertheless, real-world data has relations, and in ES they can be modeled with specific (nested) fields. 💡 But be aware: we are “merely” storing all the related data together in the same document.
🦂 Nested vs Object vs Arrays:
- An array like ["text1", "text2"] should be indexed as a “string” field and not as an “object”
  - In other words, declare the field as usual (e.g. text) and then index an array of values.
    It is important that all elements have the same type.

To capture the relationships between pieces of information we can use three different field types:

Object arrays - doc

The default type when sub-fields are found dynamically. The values of all the objects in the array are grouped under their key, so every key ends up associated with a list of values

# e.g.
> Document:
{
    "my_subfield":[
        {
            "key1":"val11",
            "key2":"val12"
        },
        {
            "key1":"val21",
            "key2":"val22"
        },
        {
            "key3":"val3"
        }
    ]
}
              
# Will be mapped as:
key1 : [val11, val21]
key2 : [val12, val22]
key3 : [val3]
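
The grouping above can be reproduced with a few lines of Python. This is a simplified model of the behavior, not how Elasticsearch stores data internally, and `flatten_object_array` is a hypothetical helper name. It also shows why an object array can return false positives: a condition on two keys matches even when the values come from different objects.

```python
from collections import defaultdict


def flatten_object_array(objects):
    """Group values by key, dropping the boundaries between objects."""
    flat = defaultdict(list)
    for obj in objects:
        for key, value in obj.items():
            flat[key].append(value)
    return dict(flat)


my_subfield = [
    {"key1": "val11", "key2": "val12"},
    {"key1": "val21", "key2": "val22"},
    {"key3": "val3"},
]

flat = flatten_object_array(my_subfield)
print(flat)
# {'key1': ['val11', 'val21'], 'key2': ['val12', 'val22'], 'key3': ['val3']}

# "key1 = val11 AND key2 = val22" matches, although no single object has both
cross_match = "val11" in flat["key1"] and "val22" in flat["key2"]
print(cross_match)  # True
```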

Nested arrays - doc

Treats each object of the array as independent (each one is stored as a hidden document under the hood)

# e.g.
> Document:
{
    "my_subfield":[
        {
            "key1":"val11",
            "key2":"val12"
        },
        {
            "key1":"val21",
            "key2":"val22"
        },
        {
            "key3":"val3"
        }
    ]
}
              
# Will be mapped as:
> Hidden1
{
    "key1":"val11",
    "key2":"val12"
}
              
> Hidden2
{
    "key1":"val21",
    "key2":"val22"
}
              
> Hidden3
{
    "key3":"val3"
}
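
A minimal Python sketch of what "independent" means at query time: with nested fields, all the conditions must hold inside the same object. The `nested_match` function is hypothetical and only models the matching rule, not the real query execution.

```python
def nested_match(objects, conditions):
    """True only if a single object satisfies every key/value condition."""
    return any(
        all(obj.get(key) == value for key, value in conditions.items())
        for obj in objects
    )


my_subfield = [
    {"key1": "val11", "key2": "val12"},
    {"key1": "val21", "key2": "val22"},
    {"key3": "val3"},
]

# Both values live in the first object: match
same_object = nested_match(my_subfield, {"key1": "val11", "key2": "val12"})
# The values belong to two different objects: no match
cross_object = nested_match(my_subfield, {"key1": "val11", "key2": "val22"})
print(same_object, cross_object)  # True False
```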

Flattened - doc

Stores the values of all the keys in a single list of type keyword

# e.g.
> Document:
{
    "my_subfield":[
        {
            "key1":"val11",
            "key2":"val12"
        },
        {
            "key1":"val21",
            "key2":"val22"
        },
        {
            "key3":"val3"
        }
    ]
}
              
# Will be mapped as:
["val11","val12","val21","val22","val3"]
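
The list above can be obtained by walking the document and collecting every leaf value. The sketch below is a simplified model with a hypothetical `leaf_values` helper; converting each value with `str` is an assumption made here to imitate the keyword-like treatment.

```python
def leaf_values(node):
    """Recursively collect every leaf value as a string (keyword-like)."""
    if isinstance(node, dict):
        return [value for child in node.values() for value in leaf_values(child)]
    if isinstance(node, list):
        return [value for child in node for value in leaf_values(child)]
    return [str(node)]


my_subfield = [
    {"key1": "val11", "key2": "val12"},
    {"key1": "val21", "key2": "val22"},
    {"key3": "val3"},
]

keywords = leaf_values(my_subfield)
print(keywords)  # ['val11', 'val12', 'val21', 'val22', 'val3']
```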

🖱️ Code example

# ---
# First test with flattened
# ---
DELETE test04
PUT test04
{
  "mappings": {
    "properties": {
      "f-flat":{
        "type": "flattened"
      }
    }
  }
}
              
PUT test04/_doc/02
{
  "f-flat": [
    {
      "field1": {
        "sub1": "sky",
        "sub2": "earth"
      }
    },
    {
      "field1": {
        "sub1": "sky",
        "sub2": "earth"
      }
    }
  ]
}
              
GET test04
GET test04/_search
{
  "query": {
    "match": {
      "f-flat.field1": "sky"
    }
  }
}
# > 0 hits
              
GET test04/_search
{
  "query": {
    "match": {
      "f-flat": "sky"
    }
  }
}
# > 1 hit

🖱️ Code example

🦂 To search on a nested field, the search body must use a nested query with the path parameter

# Example
GET my-index-01/_search
{
    "query": {
        "nested": {
            "path": "my_nested_collection",
            "query": {...}
        }
    }
}

🦂 To get highlights from nested sub-fields, add inner_hits inside the nested query, at the same level as path and query

# ─────────────────────────────────────────────
# Mapping relationships: objects, nested, flattened
# ─────────────────────────────────────────────
      
# ---
# Basic object mapping
# ---
      
PUT test-index-01/_doc/01
{
  "user_id": 1,
  "user_stats": {
    "last_access": "20211117T101500",
    "device": "smartphone",
    "ip_country": "italy"
  },
  "user_friends": [
    {
      "name": "markus",
      "nationality": "canadian"
    },
    {
      "name": "alice",
      "nationality": "belgian"
    },
    {
      "name": "stephen",
      "deleted": true
    }
  ]
}
# > 200
# Note: both "user_stats" and "user_friends" are
#       Object types (type dynamically inferred).
#       They will be flattened like
#       "user_stats.device" = "smartphone"
#       "user_friends.name" = ["markus", "alice", "stephen"]
      
GET test-index-01/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "user_friends.name.keyword": "markus"
          }
        },
        {
          "term": {
            "user_friends.nationality.keyword": "belgian"
          }
        }
      ]
    }
  }
}
# > "user_id" : 1
# Warning: if the desired query was "return users with
#          at least one belgian friend named markus",
#          this result is wrong: markus is canadian.
#          This happens because we haven't used a nested field.
#          We will resolve this issue in the next block.
      
# ---
# Basic nested mapping
# ---
      
PUT test-index-02/
{
  "mappings": {
    "properties": {
      "user_friends": {
        "type": "nested",
        "properties": {
          "name": {
            "type": "keyword"
          },
          "nationality": {
            "type": "keyword"
          },
          "deleted": {
            "type": "boolean"
          }
        }
      }
    }
  }
}
# > 200
# Note: "nested" type used, each entry of the
#       field will be treated as an individual object.
# Note: not all properties were mapped,
#       the others will be inferred dynamically
      
PUT test-index-02/_doc/01
{
  "user_id": 1,
  "user_stats": {
    "last_access": "20211117T101500",
    "device": "smartphone",
    "ip_country": "italy"
  },
  "user_friends": [
    {
      "name": "markus",
      "nationality": "canadian"
    },
    {
      "name": "alice",
      "nationality": "belgian"
    },
    {
      "name": "stephen",
      "deleted": true
    }
  ]
}
# > 200
# Note: same document as 1st block,
#       but different index
      
GET test-index-02/_search
{
  "query": {
    "nested": {
      "path": "user_friends",
      "query": {
        "bool": {
          "filter": [
            {
              "term": {
                "user_friends.name": "markus"
              }
            },
            {
              "term": {
                "user_friends.nationality": "belgian"
              }
            }
          ]
        }
      }
    }
  }
}
# > 0 hits
# Note: same query as before, but now we don't
#       get any results because each entry in
#       `user_friends` is managed as an independent
#       document and there are no friends with
#       name `markus` and nationality `belgian`
# Note: to search on a nested field,
#       the search body must use a `nested` query
#       with the `path` parameter
      
GET test-index-02/_search
{
  "query": {
    "nested": {
      "path": "user_friends",
      "query": {
        "bool": {
          "filter": [
            {
              "term": {
                "user_friends.name": "markus"
              }
            },
            {
              "term": {
                "user_friends.nationality": "canadian"
              }
            }
          ]
        }
      }
    }
  }
}
# > "user_id" : 1
# Note: correct match, markus is canadian and
#       belongs to the friends list of user_id 1
      
# ---
# Flattened field
# ---
      
PUT test-index-03/
{
  "mappings": {
    "properties": {
      "user_friends": {
        "type": "flattened"
      }
    }
  }
}
# > 200
# Note: with `flattened` the entire object 
#       is mapped as a single field.
      
PUT test-index-03/_doc/01
{
  "user_id": 1,
  "user_stats": {
    "last_access": "20211117T101500",
    "device": "smartphone",
    "ip_country": "italy"
  },
  "user_friends": [
    {
      "name": "markus",
      "nationality": "canadian"
    },
    {
      "name": "alice",
      "nationality": "belgian"
    },
    {
      "name": "stephen",
      "deleted": true
    }
  ]
}
# > 200
# Note: same document as 1st block,
#       but different index
      
GET test-index-03/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "user_friends": "markus"
          }
        },
        {
          "term": {
            "user_friends": "belgian"
          }
        }
      ]
    }
  }
}
# > "_id" : "01"
# Note: we had to change the query structure
#       because ".<subfield>.keyword" is no
#       longer available: all the values of the keys
#       are stored as keywords.
      
# ---
# Object vs Flattened
#
# What is the difference?
# -> in object, sub-field values are
#    grouped by key
# -> in flattened, all the values are stored
#    together as keywords in a single field
# ---
      
GET test-index-01
# > user_friends - type not specified: is Object
      
GET test-index-03
# > "user_friends" : "type" : "flattened"
      
PUT test-index-01/_doc/02
{
  "user_friends": [
    {
      "name": "mike"
    },
    {
      "name": "robert"
    }
  ]
}
# > 200
      
PUT test-index-03/_doc/02
{
  "user_friends": [
    {
      "name": "mike"
    },
    {
      "name": "robert"
    }
  ]
}
# > 200
      
GET test-index-01/_search
{
  "query": {
    "bool": {
      "minimum_should_match": 2,
      "should": [
        {
          "match_phrase": {
            "user_friends.name.keyword": "mike"
          }
        },
        {
          "match_phrase": {
            "user_friends.name.keyword": "robert"
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "user_friends.name.keyword": {}
    }
  }
}
# > "_id" : "02" + highlight
# Note: with `minimum_should_match: 2` we are sure
#       that both queries have a match.
#       A better way is to combine the queries with
#       `must` or `filter` (a logical AND);
#       this format is only for study purposes.
      
GET test-index-03/_search
{
  "query": {
    "bool": {
      "minimum_should_match": 2,
      "should": [
        {
          "match_phrase": {
            "user_friends": "mike"
          }
        },
        {
          "match_phrase": {
            "user_friends": "robert"
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "user_friends": {}
    }
  }
}
# > "_id" : "02" - but without highlight
# Note: with `flattened` we cannot get highlights
      
# ---
# Nested highlight
# ---
      
GET test-index-02
# > "type" : "nested"
      
PUT test-index-02/_doc/02
{
  "user_friends": [
    {
      "name": "mike",
      "age": 22
    },
    {
      "name": "robert",
      "age": 30
    }
  ]
}
# > 200
      
GET test-index-02/_search
{
  "query": {
    "nested": {
      "path": "user_friends",
      "query": {
        "bool": {
          "minimum_should_match": 2,
          "should": [
            {
              "match_phrase": {
                "user_friends.name": "mike"
              }
            },
            {
              "match_phrase": {
                "user_friends.name": "robert"
              }
            }
          ]
        }
      }
    }
  }
}
# > 0 hits
# Note: same query as last block, but here
#       no hits are returned because each
#       nested object is matched individually
      
GET test-index-02/_search
{
  "query": {
    "nested": {
      "path": "user_friends",
      "query": {
        "bool": {
          "minimum_should_match": 2,
          "should": [
            {
              "match_phrase": {
                "user_friends.name": "mike"
              }
            },
            {
              "match_phrase": {
                "user_friends.age": 22
              }
            }
          ]
        }
      },
      "inner_hits": {
        "highlight": {
          "fields": {
            "user_friends.name": {}
          }
        }
      }
    }
  }
}
# > "_id" : "02" + highlights
# Note: highlighting on nested fields requires
#       a special field named `inner_hits`,
#       placed **inside the `nested` query**, at the
#       same level as `path` and `query`

🔷 Cluster Management

Questions

🔹 Diagnose shard issues and repair a cluster’s health

🔗 cluster health API
🔗 Fix common cluster issues doc

Repair corrupted shard

We will corrupt a shard to simulate a hardware issue, explore how ES behaves, and then recover the corrupted shard using CLI utilities.

Note: in the real world you should, if possible, use a backup system to recover the index shard: the approach in the following example may lose index data

🖱️ Code example

⚠️ We will break ES files, so be sure to run this in a containerized environment created only for the exercise
💡 We will use the elasticsearch-shard CLI program
💡 The cluster used for the next code block is 07_autorun_disabled

# ─────────────────────────────────────────────
# Shard issues repair
# ─────────────────────────────────────────────
          
# ---
# Init 
# ---
          
# 1. Start the cluster, `bash rerun` if
#     you are using https://github.com/pistocop/elastic-certified-engineer/tree/master/dockerfiles/07_autorun_disabled
# 2. The cluster will start one master node: es01
          
GET _cat/nodes?v
# > master: * ; name: es01
          
GET _cluster/health?human
# > "status" : "green"
          
# ---
# Start es02
# ---
          
# 1. Open new WSL/CLI
# 2. Run `$ docker exec -it es02 /bin/bash`
# 3. Run `$ su - elasticsearch bin/elasticsearch &`
# 4. Close the shell
# Now a new ES instance is running on the second node
          
GET _cat/nodes?v
# > master: * ; name: es01
# > master: - ; name: es02
          
# ---
# Data creation
# ---
# Add "Sample eCommerce orders" data directly from kibana,
# Follow this guide: https://www.elastic.co/guide/en/kibana/7.13/get-started.html#gs-get-data-into-kibana
          
PUT /kibana_sample_data_ecommerce/_settings
{
  "index": {
    "number_of_replicas": 0,
    "auto_expand_replicas": false
  }
}
# > 200
          
GET _cat/shards/kib*?v
# > prirep : p ; node: es02
# Note: if the primary shard isn't on es02, 
#       restart the cluster and the tutorial
          
GET _cat/indices/kib*?v
# > 4675
          
# ---
# Invalidate the index
# ---
          
# 1. Go into es02: `$ docker exec -it es02 /bin/bash`
# 2. Find where `kibana_sample_data_ecommerce` is stored:
#     - Go into `/usr/share/elasticsearch/data/nodes/0/indices` folder
#     - Search for a folder ~4.1M using `du -h`
#     - Go into the folder, e.g. `./G0u2hp4aSb2YUb_ukHaSNA/0/index`
#     - Open the first file and "mess" with its content, e.g. `vi _0.cfs`
#       - [!] Tricky point: write some data, save the file and check
#           with the following query if the index is broken.
#           Keep messing with the data until the next query returns:
#           `corrupt_index_exception`
          
GET kibana_sample_data_ecommerce/_search
{
  "query": {
    "match": {
      "manufacturer": "Oceanavigations"
    }
  }
}
# > `corrupt_index_exception`
          
# ---
# Remove corrupted shard
# ---
          
# 1. Go into es02: `$ docker exec -it es02 /bin/bash`
# 2. Stop the ES instance:
#     - Read the process ID using `$ ps -aux`
#     - Kill the process using `kill <pid>`
# 3. Run the recovery program:
#   `$ bin/elasticsearch-shard remove-corrupted-data --index kibana_sample_data_ecommerce --shard-id 0`
# 4. Answer yes to all questions
# 5. [!] Copy the last block of code, after the note: 
#     "You should run the following command to allocate this shard:"
#     printed on CLI by the recovery program
# 6. Paste the code in Kibana, it should look like the next block
# 7. Re-run ES on es02 node: `$ su - elasticsearch bin/elasticsearch &`
# 8. Run the code you have pasted, with accept_data_loss set to true
          
POST /_cluster/reroute
{
  "commands" : [
    {
      "allocate_stale_primary" : {
        "index" : "kibana_sample_data_ecommerce",
        "shard" : 0,
        "node" : "uG690rhBQ9GJTfDGqe9BIg",
        "accept_data_loss" : true
      }
    }
  ]
}
# > 200
          
GET kibana_sample_data_ecommerce/_search
{
  "query": {
    "match": {
      "manufacturer": "Oceanavigations"
    }
  }
}
# > 200
# Note: the query now is working!
          
GET kibana_sample_data_ecommerce/_count
# > 4597
# Note: originally we had 4675 documents, now 4597,
#       because the recovery process can lose some
#       data, as warned by the CLI program

Red or yellow cluster status

🔗 official doc

We will simulate a hardware crash with a node shutdown

# ─────────────────────────────────────────────
# Repair cluster health
# ─────────────────────────────────────────────
      
# ---
# Start the cluster
# ---
      
# 1. Run the cluster named `08_autorun-disabled-3nodes`
#   $ bash rerun
# 2. Run ES on es02:
#   $ docker exec -u elasticsearch es02 /usr/share/elasticsearch/bin/elasticsearch
#   Tip: you could escape from the command (`ctrl + c`) without problems:
#       the ES instance will continue to run
# 3. Wait...
      
GET _cat/nodes?v
# > name: es01; master: *; node.role: m
# > name: es02; master: -
      
PUT test-index-01
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}
# > 200
      
GET _cat/shards/test*?v
# > shard:0; prirep: p; node: es02
# Note: we have the primary shard of the index
#       stored inside node es02
      
GET _cluster/health
# > status: green
      
# ---
# Go to yellow state
# ---
      
PUT test-index-01/_settings
{
  "index" : {
    "number_of_replicas" : 1
  }
}
# > 200
      
GET _cluster/health
# > status: yellow
# > "unassigned_shards" : 1
# Note: ES should allocate a replica shard,
#       but no nodes are available.
#       `es01` is technically an available node
#       but doesn't have the `data` role
      
GET _cluster/allocation/explain
{
  "index": "test-index-01",
  "shard": 0,
  "primary": true,
  "current_node": "es02"
}
# > 200
      
# ---
# Go to green state: start new instance
# ---
      
# 1. Start ES instance inside es03:
#   $ docker exec -u elasticsearch es03 /usr/share/elasticsearch/bin/elasticsearch
# 2. Wait...
      
GET _cat/nodes?v
# > ...same as before
# > name: es03
      
GET _cat/shards/test*?v
# > prirep:r; node: es03
# Note: the replica shard was created
#       and placed on node es03
      
GET _cluster/allocation/explain
{
  "index": "test-index-01",
  "shard": 0,
  "primary": false,
  "current_node": "es03"
}
# > 200
      
GET _cluster/health
# > "status" : "green"
      
# ---
# VM fault simulation
# ---
      
# > What happens if we kill
#   the node with the primary shard?
      
GET _cat/shards/test*?v
# > primary shard on es02
      
# 1. Connect to the node
#   $ docker exec -u root -it es02 /bin/bash
# 2. Find the ES process PID and kill it
#   $ ps -aux
#   $ kill 11   # 11 is the PID found in this run, yours may differ
      
GET _cat/nodes?v
# > node es02 disappeared
      
GET _cat/shards/test*?v
# > node: es03; prirep: p
# Note: after es02 was killed, the replica shard allocated
#       on es03 was promoted to primary
      
GET _cluster/health
# > "status" : "yellow"
      
GET _cluster/allocation/explain
{
  "index": "test-index-01",
  "shard": 0,
  "primary": true,
  "current_node": "es03"
}
# > 200
      
# ---
# VM recovery
# ---
      
# > What happens if the node
#   comes back online?
      
# 1. Start ES on es02 node
#   $ docker exec -u elasticsearch es02 /usr/share/elasticsearch/bin/elasticsearch
# 2. Wait...
      
GET _cat/nodes?v
# > name: es02
      
GET _cat/shards/test*?v
# > prirep: r; node: es02
      
GET _cluster/health
# > "status" : "green",
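
The status changes seen in this exercise follow a simple rule: red when at least one primary shard is unassigned, yellow when all primaries are assigned but at least one replica is not, green otherwise. Below is a minimal sketch of that rule; the `cluster_status` function and the shard dictionaries are hypothetical structures used only for illustration.

```python
def cluster_status(shards):
    """shards: list of dicts with boolean 'primary' and 'assigned' keys."""
    if any(s["primary"] and not s["assigned"] for s in shards):
        return "red"
    if any(not s["assigned"] for s in shards):
        return "yellow"
    return "green"


primary_only = [{"primary": True, "assigned": True}]
replica_pending = primary_only + [{"primary": False, "assigned": False}]
replica_ready = primary_only + [{"primary": False, "assigned": True}]
primary_lost = [{"primary": True, "assigned": False}]

statuses = {
    "0 replicas": cluster_status(primary_only),
    "replica requested, no data node free": cluster_status(replica_pending),
    "replica allocated": cluster_status(replica_ready),
    "primary unassigned": cluster_status(primary_lost),
}
for scenario, status in statuses.items():
    print(f"{scenario}: {status}")
```

The second scenario corresponds both to the moment `number_of_replicas` was raised to 1 and to the moment es02 was killed and the replica on es03 was promoted.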

🔹 Backup and restore a cluster and/or specific indices

🔗 Official doc

🔗 For more info see the Backup/restore snapshots chapter of this guide, under Deepenings → Index management
💡 Takeaways
- use snapshots to store ES resources on disk, i.e. indices and settings
- we create a family of snapshots inside a resource named repository
- the path where the snapshot files are stored is defined inside the repository and must be declared in the settings of each node (elasticsearch.yml - see doc)
- we could schedule the snapshots lifecycle (when making a snapshot, when deleting etc.) using Snapshot Lifecycle Management (SLM) - doc
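
The retention part of an SLM policy (the `expire_after`, `min_count` and `max_count` values used in the code example of this section) can be approximated with the sketch below. It is a simplified model under my own assumptions, not the actual SLM implementation, and `snapshots_to_delete` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def snapshots_to_delete(snapshot_times, now, expire_after, min_count, max_count):
    """Return the snapshot timestamps a simplified retention pass would delete."""
    kept = sorted(snapshot_times)  # oldest first
    deleted = []
    # max_count: drop the oldest snapshots beyond the cap, even if not expired
    while len(kept) > max_count:
        deleted.append(kept.pop(0))
    # expire_after: drop expired snapshots, but never go below min_count
    while len(kept) > min_count and now - kept[0] > expire_after:
        deleted.append(kept.pop(0))
    return deleted


now = datetime(2022, 1, 31)
month = timedelta(days=30)

# 40 daily snapshots, aged 0..39 days: the 9 older than 30 days are removed
daily = [now - timedelta(days=age) for age in range(40)]
old = snapshots_to_delete(daily, now, month, min_count=5, max_count=50)
print(len(old))  # 9

# Only 4 snapshots, all expired: min_count protects them
few = [now - timedelta(days=60 + age) for age in range(4)]
print(snapshots_to_delete(few, now, month, min_count=5, max_count=50))  # []
```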

🖱️ Code example

Almost all the functionalities and tasks shown in the following code block can also be done with the Kibana UI
The cluster to use for the next code block is 04_snapshots-locals

# ─────────────────────────────────────────────
# Backup and restore a cluster and/or specific indices
# ─────────────────────────────────────────────
      
# Run a cluster with a repo path registered:
# https://github.com/pistocop/elastic-certified-engineer/tree/master/dockerfiles/04_snapshots-locals
      
# ---
# Register the repository
# ---
      
PUT /_snapshot/my-repository
{
  "type": "fs",
  "settings": {
    "location": "/mnt/bkp"
  }
}
# > 200
# Note: the "location" value must be (or be under) a path
#       registered in the `path.repo` setting inside
#       the elasticsearch.yml file of each node
      
# ---
# Create a snapshot
# ---
      
PUT test-index-01/_doc/01
{
  "name": "donald",
  "surname": "duck"
}
# > 200
      
PUT test-index-02/_doc/01
{
  "song": "song2"
}
# > 200
      
PUT _snapshot/my-repository/my-first-snapshot
{
  "indices": "test-index-01,test-index-02",
  "ignore_unavailable": true,
  "include_global_state": false,
  "metadata": {
    "taken_by": "es exercises",
    "taken_because": "test the backup system"
  }
}
# > "state" : "SUCCESS"
# Note: we are creating a snapshot named `my-first-snapshot`,
#       it will include `test-index-01` and `test-index-02`
# Warning: don't put spaces in the "indices" value: no error
#       is raised, but the second index will not be included
      
GET _cat/snapshots/my-repository?v
# > id: my-first-snapshot
# > failed_shards: 0
      
# ---
# Recovery from a snapshot
# ---
      
PUT test-index-01/_doc/02
{
  "name": "donald",
  "surname": "Knuth"
}
# > 200
      
DELETE test-index-01/_doc/01
      
GET test-index-01/_search
# > 1 hit, donald Knuth
      
POST test-index-01/_close
# > 200
# Note: a closed index is blocked for read/write operations,
#       we need to close an index before restoring it
      
POST /_snapshot/my-repository/my-first-snapshot/_restore
{
  "indices": "test-index-01",
  "ignore_unavailable": true,
  "include_global_state": false,              
  "include_aliases": false
}
# > 200
      
GET test-index-01/_search
# > 2 hits, both Knuth and duck
# Note: the index has recovered the 
#       deleted document.
      
# ---
# Recover a changed document
# ---
      
PUT test-index-01/_doc/01
{
  "name":"salvo",
  "surname": "errori"
}
# > 200
      
POST test-index-01/_close
# > 200
      
POST /_snapshot/my-repository/my-first-snapshot/_restore
{
  "indices": "test-index-01",
  "ignore_unavailable": true,
  "include_global_state": false,
  "include_aliases": false
}
# > 200
      
GET test-index-01/_search
# > 2 hits, both Knuth and duck
# Note: the changed document is overwritten by the snapshot recovery
      
PUT /_slm/policy/nightly-snapshots
{
  "schedule": "0 30 1 * * ?",
  "name": "<nightly-snap-{now/d}>",
  "repository": "my-repository",
  "config": {
    "indices": [
      "*"
    ]
  },
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 50
  }
}
# > 200
# Note: the policy will create a snapshot of all indices
#       daily at 1:30AM UTC, then delete snapshots when there
#       are more than 50 of them or when they are older than 30 days.
#       Expired snapshots are not deleted while fewer than
#       5 snapshots exist.
      
# ---
# Restore on different index
# ---
      
POST /_snapshot/my-repository/my-first-snapshot/_restore
{
  "indices": "*",
  "ignore_unavailable": true,
  "include_global_state": false,
  "rename_pattern": "index_*",
  "rename_replacement": "restored_index_$1",
  "include_aliases": false
}
# > 500
# Note: "index_out_of_bounds_exception": `rename_pattern` is a
#       regular expression, not a wildcard, and "index_*" defines
#       no capture group, so "$1" refers to a group that doesn't exist
      
POST /_snapshot/my-repository/my-first-snapshot/_restore
{
  "indices": "*",
  "ignore_unavailable": true,
  "include_global_state": false,
  "rename_pattern": "test-index-(.+)",
  "rename_replacement": "restored-$0",
  "include_aliases": false
}
# > 200
# Note: the restored indices are named
#       restored-<original index name>
#       because $0 refers to the whole matched name
      
GET _cat/indices/restored*?v
# > restored-test-index-01
# > restored-test-index-02
      
GET restored-test-index-01/_search
# > both donald duck and knuth
      
# ---
# Store and restore the cluster
# ---
      
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "test-index-01",
        "alias": "users-census"
      }
    }
  ]
}
# > 200
      
PUT _snapshot/my-repository/my-cluster-snapshot
{
  "indices": "*",
  "ignore_unavailable": true,
  "include_global_state": true,
  "metadata": {
    "taken_by": "es exercises",
    "taken_because": "first cluster complete backup"
  }
}
# > 200
# Note: we set "*" to say "all indices" and
#       "include_global_state": true to also store the cluster state
      
GET _snapshot/my-repository/my-cluster-snapshot
# > indices: .kibana_task_manager...
# Note: more indices are stored than the ones we defined,
#       this is because system indices are included in the backup
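
The `rename_pattern` / `rename_replacement` pair used in the "Restore on different index" step behaves like a regular expression substitution. The sketch below reproduces it with Python's `re` module as an approximation: `rename_index` is a hypothetical helper, and it translates the `$N` group references into Python's `\g<N>` syntax.

```python
import re


def rename_index(name, rename_pattern, rename_replacement):
    """Rename an index name the way a restore with rename_* options would."""
    # "$N" group references become Python-style "\g<N>"
    python_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", rename_replacement)
    return re.sub(rename_pattern, python_replacement, name)


# $0 is the whole match, $1 is the first capture group
renamed_whole = rename_index("test-index-01", "test-index-(.+)", "restored-$0")
renamed_group = rename_index("test-index-02", "test-index-(.+)", "restored-$1")
print(renamed_whole)  # restored-test-index-01
print(renamed_group)  # restored-02

# "index_*" is a valid regex but has no capture group, so "$1" cannot be resolved
wildcard_failed = False
try:
    rename_index("test-index-01", "index_*", "restored_index_$1")
except re.error:
    wildcard_failed = True
print(wildcard_failed)  # True
```

The last call fails for the same kind of reason as the 500 response in the example: the replacement refers to a group that the pattern never defines.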

🔹 Configure a snapshot to be searchable

🔗 Official doc - api

“use snapshots to search infrequently accessed and read-only data” - doc
With searchable snapshots we can search data stored in a repository without loading the full indices on the cluster nodes: at the cost of slower searches, we save hardware resources on the nodes
- Searchable snapshots are a versatile functionality, so the code block shows several different usages:
  - how to create indices that turn into searchable snapshots after X seconds,
  - how to use them in a Hot-Warm-Cold architecture,
  - how to mount an existing snapshot as a searchable index,
  - how to integrate a searchable snapshot inside a data stream
Some Q&A about the searchable snapshots:
- Can we use searchable snapshots without templates?
  - Yes, just add the searchable snapshot action to the ILM policy
- Can we set the searchable snapshot functionality on a hot index?
  - Yes, but you must use the rollover functionality
- Should we make the snapshot before creating the searchable snapshot?
  - Yes and no: if the searchable snapshot action is defined inside an ILM policy, the snapshot is automatically created and mounted so that it can be searched.
    If you already have a snapshot, you can mount it and search it
- Can we create ILM without rollover?
  - Yes
- 💡 Can we create ILM with rollover and no index template?
  - Yes, but you must specify the index alias using the index.lifecycle.rollover_alias setting at index creation: no rollover is performed if ES doesn't know which alias to update

🖱️ Code example

🦂 If you create a new Index Lifecycle Policy from the Kibana UI, you may not be able to enable the Searchable snapshot option: this isn’t caused by some index setting you forgot, it is a license-related limitation.
You must unlock the functionality by activating a license, go to:
Stack Management → License management → Start a 30-day trial
(or use the Kibana code as described in the next code block)
🦂 The ILM system often isn’t very responsive, especially if the time between phases is in the order of seconds. This delay is caused by the ILM polling system, described here.
- To make ILM check more frequently, set the following cluster parameter - doc
```
PUT _cluster/settings
{
  "persistent": {
    "indices.lifecycle.poll_interval": "5s" # <-- default is 
  }
}
```
💡 How the min_age parameter between phases is calculated - blog
- If the rollover is used, min_age is calculated off the rollover date
- Otherwise, min_age is calculated off the original index’s creation date.
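
The two rules above can be written as a small function. This is a sketch that ignores the ILM poll interval (so the real transition can happen later), and `phase_entry_time` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def phase_entry_time(created_at, min_age, rolled_over_at=None):
    """Earliest moment ILM may move the index into a phase."""
    # With rollover, the clock starts at the rollover date;
    # otherwise it starts at the index creation date
    reference = rolled_over_at if rolled_over_at is not None else created_at
    return reference + min_age


created = datetime(2022, 1, 1, 0, 0, 0)
rolled_over = datetime(2022, 1, 8, 0, 0, 0)
cold_min_age = timedelta(seconds=60)

without_rollover = phase_entry_time(created, cold_min_age)
with_rollover = phase_entry_time(created, cold_min_age, rolled_over)
print(without_rollover)  # 2022-01-01 00:01:00
print(with_rollover)     # 2022-01-08 00:01:00
```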

🖱️ Section 1: explore ILM and searchable snapshot

🦂 In one of the examples, during the Searchable snapshot phase ILM changes the name of the index and creates an alias that points to it under the “original” name. The new index is named restored-<original-name> and has the snapshot mounted.

# ─────────────────────────────────────────────
# Configure a snapshot to be searchable
#
# Section 1: explore ILM and searchable snapshot
# ─────────────────────────────────────────────
          
# Cluster requirements:
#   - nodes with Hot & Cold tiers
#   - path registered for snapshots
# Cluster to use:
# `04_snapshots-locals`
# https://github.com/pistocop/elastic-certified-engineer/tree/master/dockerfiles/04_snapshots-locals
          
GET _cat/nodes?v
# > es03 node.role: cm
# Note: the es03 node has the cold role
          
# ---
# Cluster init
# ---
          
PUT /_snapshot/my-repository
{
  "type": "fs",
  "settings": {
    "location": "/mnt/bkp/"
  }
}
# > 200
          
PUT _ilm/policy/my_policy
{
  "policy": {
    "phases": {
      "cold": {
        "actions": {
          "searchable_snapshot": {
            "snapshot_repository": "my-repository"
          }
        }
      }
    }
  }
}
# > 400
# > "current license is non-compliant for [searchable-snapshots]"
# Note: the basic license doesn't allow searchable-snapshots functionality
          
GET _license
# > "type" : "basic"
          
POST /_license/start_trial?acknowledge=true
# > "trial_was_started" : true
# Note: now functionalities like searchable-snapshots
#       are unlocked
          
PUT _ilm/policy/my_policy
{
  "policy": {
    "phases": {
      "cold": {
        "actions": {
          "searchable_snapshot": {
            "snapshot_repository": "my-repository"
          }
        }
      }
    }
  }
}
# > 200
# Note: now we can use searchable snapshot functionality
          
DELETE _ilm/policy/my_policy
# > 200
          
PUT _cluster/settings
{
  "persistent": {
    "indices.lifecycle.poll_interval": "5s"
  }
}
# > 200
# Note: shorten the ILM poll interval (check more often)
#       because we will test ILM policies with
#       time between phases in the order of seconds
          
# ---
# Basic ILP
# ---
          
PUT _ilm/policy/test-policy-1
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "10s",
        "actions": {
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "60s",
        "actions": {
          "set_priority": {
            "priority": 0
          }
        }
      }
    }
  }
}
# > 200
# Note: move the index to warm after 10s
#       and to cold after 60s
# Tip: code generated from Kibana webapp
#     under `Index Lifecycle Policies`
          
PUT test-index-01
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index.lifecycle.name": "test-policy-1" 
  }
}
# > 200
# Note: it is important to set replicas to 0:
#       with only 3 nodes (hot - warm - cold)
#       the replica shard cannot be allocated
          
GET _cat/shards/test*?v
# > node: es01
          
PUT test-index-01/_doc/01
{
  "msg": "payload"
}
# > 200
          
# Wait 10s...
          
GET _cat/shards/test*?v
# > node: es02
# Note: now is in warm node es02
          
# Wait 60s...
          
GET _cat/shards/test*?v
# > node: es03
# Note: the shard is finally moved to the cold node es03
          
# ---
# ILP with a searchable snapshot
# ---
          
PUT _ilm/policy/test-policy-2
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "10s",
        "actions": {
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "60s",
        "actions": {
          "set_priority": {
            "priority": 0
          },
          "searchable_snapshot": {
            "snapshot_repository": "my-repository"
          }
        }
      }
    }
  }
}
# > 200
# Note: same policy as before but with 
#       snapshot_repository in the cold phase
          
PUT test-index-02
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index.lifecycle.name": "test-policy-2"
  }
}
# > 200
          
PUT test-index-02/_doc/1
{
  "msg": "payload"
}
# > 200
          
GET _cat/shards/test-index-02?v
# > es01
          
# Wait 10s...
          
GET _cat/shards/test-index-02?v
# > es02
          
# Wait 60s... (maybe >> 60s)
          
GET _cat/shards/test-index-02?v
# > index: restored-test-index-02
# > node: es03 (memo: es03 is the cold node)
# Note: the index name is changed! Under the hood
#       the ILM system did some things, let's explore...
          
# From CLI we can visit the `04_snapshots-locals/backup` folder,
# inside we can find some files: they are the test-index-02
# searchable snapshot
          
GET test-index-02/_ilm/explain
# > "index" : "restored-test-index-02"
# > "phase" : "cold"
          
GET _cat/aliases/test*?v
# > alias: test-index-02
# > index: restored-test-index-02
# Note: the ILM has created an alias with the
#       index name and a redirection to the restored index
          
GET test-index-02/_search
# > "_id" : "1"
# Note: we can use the index for search
          
PUT test-index-02/_doc/2
{
  "msg": "2nd payload"
}
PUT restored-test-index-02/_doc/2
{
  "msg": "2nd payload"
}
# > 403 - cluster_block_exception
# Note: both requests fail: we cannot index new data into an
#       index backed by a searchable snapshot, it is read-only
          
GET /_searchable_snapshots/stats
# > restored-test-index-02 | "num_files" : 1
          
# ---
# ILP with searchable snapshot on hot phase
# ---
          
PUT _ilm/policy/test-policy-3
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "set_priority": {
            "priority": 100
          },
          "searchable_snapshot": {
            "snapshot_repository": "my-repository"
          }
        }
      }
    }
  }
}
# > 400 - the [searchable_snapshot] action(s) could not be used in the [hot] phase without an accompanying [rollover] action
# Note: we cannot create a searchable snapshot in the hot
#       phase without the rollover functionality
          
PUT _ilm/policy/test-policy-3
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "10s"
          },
          "set_priority": {
            "priority": 100
          },
          "searchable_snapshot": {
            "snapshot_repository": "my-repository",
            "force_merge_index" : true
          }
        },
        "min_age": "0ms"
      }
    }
  }
}
# > 200
# Note: searchable snapshot and rollover
# Note: "force_merge_index" : true is a best practice, see
#       https://www.elastic.co/guide/en/elasticsearch/reference/7.13/ilm-searchable-snapshot.html#ilm-searchable-snapshot-options
          
PUT test-index-03
{
  "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "index.lifecycle.name": "test-policy-3"
    }
}
# > 200
          
GET _cat/shards/test-index-03?v
# > node: es01
          
PUT test-index-03/_doc/01
{
  "msg": "payload"
}
# > 200
          
GET _cat/indices/test-index-03*?v
          
GET _cat/shards/test-index-03?v
# > node: es01
# Note: the rollover cannot be done
#       because no index alias is found
          
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "test-index-03",
        "alias": "test-index-03-alias"
      }
    }
  ]
}
# > 200
          
# > setting [index.lifecycle.rollover_alias] for index [test-index-03] is empty or not defined
# Note: the error is reported by ILM, e.g. in the output of
#       `GET test-index-03/_ilm/explain`; call it a few times until it shows up
# Note: the rollover cannot be completed
#       because we haven't set the rollover alias name
          
PUT test-index-03/_settings
{
  "index.lifecycle.name": "test-policy-3",
  "index.lifecycle.rollover_alias": "test-index-03-alias"
}
# > 200
# Note: we need to provide the index alias to update after the rollover,
#       api body structure from
#       https://www.elastic.co/guide/en/elasticsearch/reference/7.13/getting-started-index-lifecycle-management.html#ilm-gs-alias-apply-policy
          
GET _cat/indices/test*?v
# > test-index-000004 | docs.count: 0
# > restored-test-index-03 | docs.count: 1
# Note: the `test-index-000004` is the index created
#       after the rollover
# Note: the `restored-test-index-03` is the "original" index
#       after the rollover process, mounted as a searchable snapshot
          
GET /_searchable_snapshots/stats
# > restored-test-index-03 | "num_files" : 1
          
GET test-index-03/_search
# > "_id" : "01"
          
GET test-index-03-alias/_search
# > 0 hits
# Note: why zero hits?
#       -> because the alias now points to
#         the index created by the rollover process
          
GET _cat/aliases/test*?v
# > alias: test-index-03-alias | index: test-index-000004
# > alias: test-index-03 | index: restored-test-index-03
# Note: like before, a new alias is created that points
#       to the restored index backed by the searchable snapshot
          
PUT test-index-03/_doc/02
{
  "msg": "2nd payload"
}
# > cluster_block_exception
# Note: cannot insert data on a snapshot
          
GET _cat/snapshots/my-repository?v
# > 2 entries: the test-index-02 and test-index-03 searchable snapshots

🖱️ Section 2: real-world usages

# ─────────────────────────────────────────────
# Configure a snapshot to be searchable
#
# Section 2: real-world usages
# ─────────────────────────────────────────────
          
# Cluster requirements:
#   - nodes with Hot & Cold tiers
#   - path registered for snapshots
# Cluster to use:
# `04_snapshots-locals` - https://github.com/pistocop/elastic-certified-engineer/tree/master/dockerfiles/04_snapshots-locals
          
GET _cat/nodes?v
# > es03 node.role: cm
# Note: the es03 node has a cold role
          
# ---
# Cluster init
# ---
          
POST /_license/start_trial?acknowledge=true
          
PUT _cluster/settings
{
  "persistent": {
    "indices.lifecycle.poll_interval": "5s"
  }
}
          
PUT /_snapshot/my-repository
{
  "type": "fs",
  "settings": {
    "location": "/mnt/bkp/"
  }
}
          
# ---
# Make indices in existing snapshot searchable
# ---
          
PUT test-index-01/_doc/01
{
  "msg": "payload"
}
          
PUT /_snapshot/my-repository/test-index-01-snapshot?wait_for_completion=true
{
  "indices": "test-index-01",
  "include_global_state": false
}
# > "state" : "SUCCESS"
          
GET _cat/snapshots/my-repository?v
# > successful_shards: 1
          
POST /_snapshot/my-repository/test-index-01-snapshot/_mount?wait_for_completion=true
{
  "index": "test-index-01",
  "renamed_index": "test-index-01-snapshot",
  "index_settings": {
    "index.number_of_replicas": 0
  }
}
# > "successful" : 1
# Note: we have just mounted a snapshot as a new
#       index named `test-index-01-snapshot`, it
#       is a searchable snapshot
          
GET _cat/indices/test*?v
# > index: test-index-01 | health yellow | docs.count 1
# > index: test-index-01-snapshot | health green | docs.count 1
# Note: test-index-01 is yellow because it wants to allocate
#       a replica shard but cannot (no other nodes with the hot role).
#       test-index-01-snapshot is green instead because its replicas are set to 0
          
PUT test-index-01-snapshot/_doc/02
{
  "msg": "2nd payload"
}
# > cluster_block_exception
# Note: cannot insert data on a searchable snapshot
          
PUT test-index-01/_doc/02
{
  "msg": "2nd payload"
}
# > 200
# Note: the "normal" index continues to work as usual
          
GET test-index-01-snapshot/_doc/02
# > found: false
# Note: how can we align the two indices?
          
PUT /_snapshot/my-repository/test-index-01-snapshot02?wait_for_completion=true
{
  "indices": "test-index-01",
  "include_global_state": false
}
          
DELETE test-index-01-snapshot
          
POST /_snapshot/my-repository/test-index-01-snapshot02/_mount?wait_for_completion=true
{
  "index": "test-index-01",
  "renamed_index": "test-index-01-snapshot",
  "index_settings": {
    "index.number_of_replicas": 0
  }
}
          
GET test-index-01-snapshot/_doc/02
# > "_id" : "02"
          
# Apply best practices: 
# > To mount an index from a snapshot that contains multiple indices, 
# we recommend creating a clone of the snapshot that contains only the 
# index you want to search, and mounting the clone.
# https://www.elastic.co/guide/en/elasticsearch/reference/7.13/searchable-snapshots.html#using-searchable-snapshots
          
PUT /_snapshot/my-repository/test-index-01-snapshot02/_clone/test-index-01-snapshot02-searchable
{
  "indices": "test-index-01"
}
# > 200
          
DELETE test-index-01-snapshot
          
POST /_snapshot/my-repository/test-index-01-snapshot02-searchable/_mount?wait_for_completion=true
{
  "index": "test-index-01",
  "renamed_index": "test-index-01-snapshot",
  "index_settings": {
    "index.number_of_replicas": 0
  }
}
          
GET test-index-01-snapshot/_doc/02
# > "_id" : "02"
          
# Info: the above process could be automated
#       using the ILM functionalities + aliases: let's do it
          
# ---
# Searchable snapshot and Hot-Warm-Cold ILM
# ---
          
PUT _ilm/policy/test-policy-02
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_docs": 1
          },
          "set_priority": {
            "priority": 100
          },
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          }
        },
        "min_age": "0ms"
      },
      "warm": {
        "min_age": "0d",
        "actions": {
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "60s",
        "actions": {
          "set_priority": {
            "priority": 0
          },
          "searchable_snapshot": {
            "snapshot_repository": "my-repository"
          }
        }
      }
    }
  }
}
# > 200
# Note: create new index after 1 document indexed,
#       move the old index to warm immediately,
#       then wait 1m and move to cold node
#       and make the index a searchable snapshot
          
PUT test-index-02-000001
{
  "aliases": {
    "test-index-02": {}
  },
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index.lifecycle.name": "test-policy-02",
    "index.lifecycle.rollover_alias": "test-index-02" 
  }
}
# > 200
# Warning: without `index.lifecycle.rollover_alias`
#         the rollover will not start
          
GET _cat/aliases/test*?v
# > alias: test-index-02 | index: test-index-02-000001 
          
PUT test-index-02/_doc/01
{
  "msg": "payload"
}
# > 200
          
GET _cat/indices/test-index-02*?v
# > index: test-index-02-000001 | health green
# > index: test-index-02-000002 | health yellow
# Note: the rollover has created the new index `test-index-02-000002`,
#       but without a template the new index keeps the default replicas
#       setting of 1, and no other hot node is available for the replica shard
          
GET _cat/shards/test-index-02*?v
# > test-index-02-000001 | node: es02
          
GET _cat/shards/test-index-02*?v
# > restored-test-index-02-000001 | node: es03
# Note: the searchable snapshot index is named with the `restored-` prefix
          
GET _cat/aliases/test*?v
# > alias: test-index-02-000001 | index: restored-test-index-02-000001
          
PUT test-index-02/_doc/02
{
  "msg": "2nd payload"
}
          
GET _cat/shards/test-index-02*?v
# > restored-test-index-02-000001: the searchable snapshot
# > test-index-02-000002: the index created from rollover
# Note: no new indices are created when the new document is indexed.
#       This is because `test-index-02-000002`, created by the rollover
#       process, doesn't have the policy attached (no template was used)
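
# Sketch (not part of the original run): an index template that matches
# the rolled-over indices would attach the policy and the rollover alias
# to `test-index-02-000002` and its successors. The template name is
# illustrative, and it must exist BEFORE the first rollover happens.
PUT _index_template/test-index-02-template
{
  "index_patterns": ["test-index-02-*"],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "index.lifecycle.name": "test-policy-02",
      "index.lifecycle.rollover_alias": "test-index-02"
    }
  }
}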
          
# ---
# Searchable snapshot and data stream
# ---
          
PUT _ilm/policy/test-policy-03
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_docs": 1
          },
          "set_priority": {
            "priority": 100
          },
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          }
        },
        "min_age": "0ms"
      },
      "warm": {
        "min_age": "0d",
        "actions": {
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "60s",
        "actions": {
          "set_priority": {
            "priority": 0
          },
          "searchable_snapshot": {
            "snapshot_repository": "my-repository"
          }
        }
      }
    }
  }
}
# > 200
          
PUT _index_template/my-index-template
{
  "index_patterns": [
    "test-index-03*"
  ],
  "data_stream": {},
  "template": {
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date",
          "format": "date_optional_time||epoch_millis"
        }
      }
    },
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "index.lifecycle.name": "test-policy-03"
    }
  },
  "priority": 500
}
# > 200
# Note: under settings we don't need to specify the
#       "index.lifecycle.rollover_alias" parameter,
#       the data stream manages the rollover target itself
          
PUT _data_stream/test-index-03
GET _data_stream/test-index-03
# > 200
          
POST test-index-03/_doc?refresh=true
{
  "@timestamp": "2020-01-01T00:00:00",
  "msg": "payload"
}
# Note: unlike normal indices, data streams only accept
#       `create` operations: index new data with a POST
#       to `_doc` and do NOT specify the document ID
          
GET _cat/shards/*03*?v
# > index: xxxx-000001 | node: es02
# > index: xxxx-000002 | node: es01
# Note: the ILM policy has created the new index and moved the old one
          
# Wait 60s...
          
GET _cat/shards/*03*?v
# > index: restored-xxxx-000001 | node: es03
# > index: xxxx-000002 | node: es01
# Note: the ILM policy has created the searchable snapshot
          
POST test-index-03/_doc?refresh=true
{
  "@timestamp": "2021-01-01T00:00:00",
  "msg": "2nd payload"
}
          
GET _cat/shards/*03*?v
# > index: restored-xxxx-000001 | node: es03
# > index: xxxx-000002 | node: es02
# > index: xxxx-000003 | node: es01
          
# Wait 60s...
          
GET _cat/shards/*03*?v
# > index: restored-xxxx-000001 | node: es03
# > index: restored-xxxx-000002 | node: es03
# > index: xxxx-000003 | node: es01
# Note: the data stream keeps applying the policy: roll over
#       when a new document is indexed, move the old
#       index to the warm node, and after 1m move it to the cold node
#       and create a searchable snapshot.
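
# Sketch (not part of the original run): two read-only calls to follow
# what ILM is doing on the data stream while waiting.
# List the backing indices and the ILM policy of the data stream
GET _data_stream/test-index-03

# Show the current phase, action and step of every backing index
GET test-index-03/_ilm/explain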

🔹 Configure a cluster for cross-cluster search (remote cluster)

🔗 Official doc

“You can connect a local cluster to other Elasticsearch clusters, known as remote clusters.” - doc
To use the cross-cluster functionality you must first configure a connection to the remote cluster; after that you can search across all configured clusters
- Not only simple searches:
  - here the list of available APIs that could be used on remote clusters
  - we will see also how to sync data between clusters using cross-cluster replication (next exam question)
How to configure a remote cluster

Steps to connect cluster2 as a remote cluster on cluster1
- Run cluster 1 with (at least) one node with the remote_cluster_client role - info1 info2
- Make sure the cluster 2 nodes can be reached from cluster 1
  - e.g. if you are on docker-compose: open a shell on a node in cluster 1 and use curl to test the connection
- Connect the remote cluster
  
  There are two ways to create the connection
  - Hot mode
    - Open Kibana and connect the remote cluster using the dedicated API
      - 🦂 Warning: you must specify the remote cluster host and port in the API; note that the port to use is not 9200 but the transport port (default 9300) - doc
  - Cold mode
    - Editing the elasticsearch.yml settings file of the remote_cluster_client node - doc
  There are also two connection architectures
  - Sniff mode (default)
    - (remote) cluster state is retrieved from one of the seed nodes and up to three gateway nodes are selected as part of remote cluster requests
    - 🦂 Dedicated master nodes (on the remote cluster) are never selected as gateway nodes - we will test this setting in the 🖱️ Code example block
  - Proxy mode
    - a remote cluster is registered using a name and a single proxy address
    - The proxy is required to route those connections to the remote cluster
    - Proxy mode is not the default connection mode and must be configured explicitly
- Search on the remote cluster using cluster2:<idx name> as index name, you could also search in multiple remote clusters and local index - doc
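
The two connection modes described above can be sketched with the cluster settings API. This is a minimal sketch, not taken from the original notes: `es02:9300` is the seed node used in the code example, while the alias `clusterproxy` and the address `my-proxy:9400` are hypothetical placeholders.

```console
# Sniff mode (default): register `cluster2` through a seed node,
# using the transport port and not the HTTP one
PUT _cluster/settings
{
  "persistent": {
    "cluster.remote.cluster2.mode": "sniff",
    "cluster.remote.cluster2.seeds": ["es02:9300"]
  }
}

# Proxy mode: a single address that routes the connections to the remote cluster
# (`clusterproxy` and `my-proxy:9400` are placeholders)
PUT _cluster/settings
{
  "persistent": {
    "cluster.remote.clusterproxy.mode": "proxy",
    "cluster.remote.clusterproxy.proxy_address": "my-proxy:9400"
  }
}

# Check the mode and the connection state of every registered remote cluster
GET _remote/info
```

The same `cluster.remote.*` keys can be written in elasticsearch.yml for the static ("cold") variant.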

🖱️ Code example

# ─────────────────────────────────────────────
# Configure a cluster for cross-cluster search
# ─────────────────────────────────────────────
      
# Cluster requirements:
#   - 3 clusters
#   - 3 networks
#   - 1 node with some specs:
#      - registered on all 3 networks
#      - with the role `remote_cluster_client`
# Cluster to use:
# `10_cross-cluster` - https://github.com/pistocop/elastic-certified-engineer/tree/master/dockerfiles/10_cross-cluster
       
      
# ---
# Connect cluster2 as remote cluster of cluster1
#
# > Run the following code on Cluster 1, 
#   using kibana at localhost:5601
# ---
      
GET _cat/nodes?v
# > name: es01 | role: dmr
# Note: the node *must* have the `r` role, which stands for the `remote_cluster_client` role
      
# Optional: check the cluster2 connection
#   - From CLI enter in es01 and query cluster2:
#       - $ docker exec -u elasticsearch -it es01 /bin/bash
#       - $ curl es02:9200
#       - > ..."cluster_name" : "cluster2"...
      
PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "cluster2": {
          "seeds": [
            "es02:9300"
          ]
        }
      }
    }
  }
}
# > "acknowledged" : true
# Note: the port 9300 is used
      
GET _remote/info
# > "connected" : true
# Note: if you receive `node [es01] does not have the [remote_cluster_client] role`
#       you should add the `remote_cluster_client` role to the master node
      
# ---
# Insert data on cluster2
#
# > Run the following code on Cluster 2, 
#   using kibana at localhost:5602
# ---
      
GET _cat/nodes?v
# > name: es02 | role: dm
      
GET _remote/info
# > 400 | "node [es02] does not have the [remote_cluster_client] role"
# Note: the `remote_cluster_client` isn't required on the remote cluster
      
PUT idx-cluster2/_doc/01
{
  "msg" : "Hello from `cluster2`!"
}
# > 200
      
# ---
# Query data from cluster 2
#
# > Run the following code on Cluster 1, 
#   using kibana at localhost:5601
# ---
      
GET cluster2:idx-cluster2/_search
{
  "query": {
    "match_all": {}
  }
}
# > "msg" : "Hello from `cluster2`!"
      
# ---
# Run cluster 3 nodes
# ---
      
# Run the master and data ES nodes of cluster 3
#   - Connect to both the nodes and run the ES program,
#     - $ docker exec -u elasticsearch -it es03d /bin/bash
#     - $ bin/elasticsearch &
#     - $ exit
#     - $ docker exec -u elasticsearch -it es03m /bin/bash
#     - $ bin/elasticsearch &
#     - $ exit
#   - wait ~1m
      
# > Run the following code on Cluster 3,
#   using kibana at localhost:5603
      
GET _cat/nodes?v
# > name: es03m | role: m
# > name: es03d | role: d
      
PUT idx-cluster3/_doc/01
{
  "msg" : "Hello from `cluster3`!"
}
# > 200
      
# ---
# Connect cluster3 as remote cluster of cluster1
#
# > Run the following code on Cluster 1, 
#   using kibana at localhost:5601
# ---
      
GET _cat/nodes?v
# > name: es01 | role: dmr
      
PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "cluster3m": {
          "seeds": [
            "es03m:9300"
          ],
          "transport.ping_schedule": "30s"
        },
        "cluster3d": {
          "seeds": [
            "es03d:9300"
          ],
          "transport.ping_schedule": "30s"
        }
      }
    }
  }
}
# > "acknowledged" : true
# Note: we try to connect to both the
#       "master only" node and the "data" node
      
GET _remote/info
# > cluster3d.num_nodes_connected: 1
# > cluster3m.num_nodes_connected: 1
      
GET cluster3m:idx-cluster3/_search
{
  "query": {
    "match_all": {}
  }
}
# > "msg" : "Hello from `cluster3`!"
      
GET cluster3d:idx-cluster3/_search
{
  "query": {
    "match_all": {}
  }
}
# > "msg" : "Hello from `cluster3`!"
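
# Sketch (not part of the original run): several remote targets can be
# combined in one request, as a comma-separated list of
# "cluster alias:index" entries.
GET cluster2:idx-cluster2,cluster3d:idx-cluster3/_search
{
  "query": {
    "match_all": {}
  }
}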

🔹 Implement cross-cluster replication *

🔗 Official doc

“With cross-cluster replication (CCR) you can replicate indices across clusters” - doc
Benefits:
- In case of disaster, you have a hot backup - doc
- Distribute copies of the data near the users' location to cut network latency - doc
- Build different architectures to meet project requirements such as disaster recovery resilience, higher data availability, etc.
CCR is an X-Pack feature and requires an activated license
CCR works in an active-passive model:
- “You index to a leader index, and the data is replicated to one or more read-only follower indices” - doc
- When the leader index receives new data, the follower indices pull the changes from the leader index
  - You can also chain replicas: attach a follower index to another follower index
Replication mechanism - doc
- Elasticsearch achieves replication at the shard level, so the follower index will have the same number of shards as its leader index.
  - As a matter of fact, you cannot change the number of shards in the create follower API
- The follower index shard updates shard information, and immediately sends another read request to the leader index shard
- If a follower index read request fails:
  - If the read fails with an error that can be recovered automatically (e.g. network issue), the follower index enters a retry loop
  - For errors that cannot be recovered automatically, the follower index pauses the read requests until you resume it
    - Tip: we will test both cases in the 🖱️ Code example block
- Cross-cluster replication works by replaying the history of individual write operations that were performed on the shards of the leader index.
  - This works only if the leader index retains its operation history (soft deletes) - doc
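
Since replication replays the operation history, the relevant index settings live on the leader index. A minimal sketch, assuming a hypothetical leader index `idx-leader` and an illustrative retention of 24h (the default of `index.soft_deletes.retention_lease.period` is 12h):

```console
# Create a leader index that keeps its operation history
PUT idx-leader
{
  "settings": {
    "index.soft_deletes.enabled": true,
    "index.soft_deletes.retention_lease.period": "24h"
  }
}

# Read back the soft-deletes settings of the leader index
GET idx-leader/_settings/index.soft_deletes*
```

If a follower stays disconnected for longer than this period, the leader may discard the history the follower still needs, and the follower index then has to be recreated.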
How to set up cross-cluster replication

🔗 Official tutorial
- Set up both clusters
  - There are cluster-wide settings (elasticsearch.yml) to tune different CCR aspects (e.g. the requested chunk size)
  - A license that includes cross-cluster replication must be activated on both clusters.
- Set up the leader cluster
  - The leader indices must have the soft-deletes feature enabled - API
- Set up the follower cluster
  - 🦂 In the cluster that will hold the follower indices, all nodes with the master node role must also have the remote_cluster_client role - doc
  - Create follower indices using the dedicated API
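
Follower indices can also be created automatically. A minimal sketch of an auto-follow pattern, to be run on the follower cluster; the pattern name `my-auto-follow` and the index patterns are illustrative, while `cluster2` is the remote cluster alias used in the code example:

```console
# Follow every NEW leader index on `cluster2` whose name matches the pattern
PUT _ccr/auto_follow/my-auto-follow
{
  "remote_cluster": "cluster2",
  "leader_index_patterns": ["idx-auto-*"],
  "follow_index_pattern": "follower-{{leader_index}}"
}

# List the configured auto-follow patterns
GET _ccr/auto_follow
```

An auto-follow pattern only applies to indices created on the remote cluster after the pattern exists; already existing leader indices still need the create follower API.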
🔗 Resources
- Webinar link
- Tutorial ccr-getting-started
- Set up cross-cluster replication
- Manage cross-cluster replication
- Manage auto-follow patterns
- Upgrading clusters
- Remote clusters

🖱️ Code example

🦂 You must enable the cross-cluster functionality by activating the license, go to:
Stack Management → License management → Start a 30-day trial
(or use the Kibana code as described in the next code block)

🖱️ Section 1: create a follower index

# ─────────────────────────────────────────────
# Implement cross-cluster replication
#
# Section 1: create a follower index
# ─────────────────────────────────────────────
          
# Cluster to use for the test: `10_cross-cluster`
# https://github.com/pistocop/elastic-certified-engineer/tree/master/dockerfiles/10_cross-cluster
          
# ---
# Connect es02 as remote cluster
#
# > Run the following code on Cluster 1, 
#   using kibana at localhost:5601
# ---
          
GET _cat/nodes?v
# > name: es01 | role: dmr
          
POST _license/start_trial?acknowledge=true
GET _license
# > "status" : "active",
          
PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "cluster2": {
          "seeds": [
            "es02:9300"
          ]
        }
      }
    }
  }
}
# > "acknowledged" : true,
          
GET _remote/info
# > "num_nodes_connected" : 1
          
# ---
# Create indices on cluster2
#
# > Run the following code on Cluster 2,
#   using kibana at localhost:5602
# ---
          
GET _cat/nodes?v
# > name: es02 | role: dm
          
POST _license/start_trial?acknowledge=true
GET _license
# > "status" : "active"
          
PUT idx-cluster2
{
  "settings": {
    "index.soft_deletes.enabled": true
  }
}
# > 200
          
PUT idx-cluster2-nosoft
{
  "settings": {
    "index.soft_deletes.enabled": false
  }
}
# > 200
          
PUT idx-cluster2/_doc/01
{
  "msg" : "Hello from `cluster2`!"
}
# > 200
          
PUT idx-cluster2-nosoft/_doc/01
{
  "msg" : "Hello from `cluster2`!"
}
# > 200
          
# ---
# Create follower index on cluster1
#
# > Run the following code on Cluster 1, 
#   using kibana at localhost:5601
# ---
          
GET _cat/nodes?v
# > name: es01 | role: dmr
          
PUT follower-idx-cluster2/_ccr/follow?wait_for_active_shards=1
{
  "remote_cluster": "cluster2",
  "leader_index": "idx-cluster2"
}
# > "follow_index_shards_acked" : true
          
GET follower-idx-cluster2/_search?size=10
# > "msg" : "Hello from `cluster2`!"
          
PUT follower-idx-cluster2-nosoft/_ccr/follow?wait_for_active_shards=1
{
  "remote_cluster" : "cluster2",
  "leader_index" : "idx-cluster2-nosoft"
}
# > 400 | leader index [idx-cluster2-nosoft] does not have soft deletes enabled
# Note: indices without soft-delete parameter enabled cannot be
#       used for cross-cluster replications
          
GET follower-idx-cluster2/_ccr/info
# > "remote_cluster" : "cluster2"
# > "status" : "active"
          
GET follower-idx-cluster2/_ccr/stats
# > "remote_cluster" : "cluster2"
          
PUT follower-idx-cluster2/_doc/99
{
  "foo": "bar"
}
# > 403 | status_exception
          
# ---
# Add more data on cluster2
#
# > Run the following code on Cluster 2,
#   using kibana at localhost:5602
# ---
          
PUT idx-cluster2/_doc/02
{
  "msg" : "2nd msg"
}
# > 200
          
# ---
# Check automatically updated data
#
# > Run the following code on Cluster 1,
#   using kibana at localhost:5601
# ---
          
GET follower-idx-cluster2/_search?size=10
# > "msg" : "2nd msg"

🖱️ Section 2: simulate different outages

# ─────────────────────────────────────────────
# Implement cross-cluster replication
#
# Section 2: simulate different outages
# ─────────────────────────────────────────────
          
# Cluster to use for the test: `10_cross-cluster`
# https://github.com/pistocop/elastic-certified-engineer/tree/master/dockerfiles/10_cross-cluster
          
# ---
# Run cluster 3 nodes
# ---
          
# Run the master and data ES nodes of cluster 3
#   - Connect to both the nodes and run the ES program,
#     - $ docker exec -u elasticsearch -it es03d /bin/bash
#     - $ bin/elasticsearch &
#     - $ exit
#     - $ docker exec -u elasticsearch -it es03m /bin/bash
#     - $ bin/elasticsearch &
#     - $ exit
#   - wait ~1m
          
# > Run the following code on Cluster 3,
#   using kibana at localhost:5603
          
GET _cat/nodes?v
# > name: es03m | role: m
# > name: es03d | role: d
          
PUT idx-cluster3/_doc/01
{
  "msg" : "Hello from `cluster3`!"
}
# > 200
          
POST _license/start_trial?acknowledge=true
GET _license
# > "status" : "active"
          
# ---
# Create follower index on cluster1
#
# > Run the following code on Cluster 1, 
#   using kibana at localhost:5601
# ---
          
POST _license/start_trial?acknowledge=true
GET _license
# > "status" : "active"
          
PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "cluster3": {
          "seeds": [
            "es03m:9300"
          ]
        }
      }
    }
  }
}
# > "acknowledged" : true
          
GET _remote/info
# > cluster3 | "connected" : true
          
PUT /follower-idx-cluster3/_ccr/follow
{
  "remote_cluster": "cluster3",
  "leader_index": "idx-cluster3",
  "max_read_request_operation_count": 5120,
  "max_outstanding_read_requests": 12,
  "max_read_request_size": "32mb",
  "max_write_request_operation_count": 5120,
  "max_write_request_size": "9223372036854775807b",
  "max_outstanding_write_requests": 9,
  "max_write_buffer_count": 2147483647,
  "max_write_buffer_size": "512mb",
  "max_retry_delay": "500ms",
  "read_poll_timeout": "1m"
}
# > "follow_index_created" : true
# Note: the above configuration could be 
#       created from Kibana GUI under 
#       Stack Management -> Cross-Cluster Replication
          
GET follower-idx-cluster3/_search?size=10
# > "msg" : "Hello from `cluster3`!"
          
# ---
# Simulate connection interruption
# ---
          
# From CLI:
#     - $ docker network disconnect 10_cross-cluster_cluster03net es01
#     - $ docker exec -u elasticsearch -it es01 /bin/bash
#     - $ curl es03m:9200
#       > curl: (6) Could not resolve host: es03m
#     - $ exit
          
# ---
# Add data on cluster3 index
#
# > Run the following code on Cluster 3,
#   using kibana at localhost:5603
# ---
          
PUT idx-cluster3/_doc/02
{
  "msg" : "2nd payload"
}
# > 200
          
PUT idx-cluster3/_doc/03
{
  "msg" : "3th payload"
}
# > 200
          
# ---
# Check index isn't updated
#
# > Run the following code on Cluster 1,
#   using kibana at localhost:5601
# ---
          
GET follower-idx-cluster3/_search?size=10
# > total.value: 1
# Note: new data from idx-cluster3 aren't fetched
          
# ---
# Reestablish connection
# ---
          
# From CLI:
#     - $ docker network connect 10_cross-cluster_cluster03net es01
#     - $ docker exec -u elasticsearch -it es01 /bin/bash
#     - $ curl es03m:9200
#       > "cluster_name" : "cluster3"
#     - $ exit
          
# ---
# Check index is automatically updated
#
# > Run the following code on Cluster 1,
#   using kibana at localhost:5601
# ---
          
GET follower-idx-cluster3/_search?size=10
# > total.value: 3
# Note: new data fetched from idx-cluster3 
          
# ---
# Simulate outage
# 
# Tip: ES automatically handles the reconnection to the remote cluster
#     if the problem is at the network level (like before), but suspends the
#     reconnection if the problem is of a different nature
# ---
          
# From CLI:
#     - $ docker exec -u elasticsearch -it es03d /bin/bash
#     - $ ps -aux
#       # copy the PID of ES process
#     - $ kill -s SIGKILL <ES PID>
#     - $ exit
#     - wait ~1m
          
# ---
# Check "follower index" connection error
#
# > Run the following code on Cluster 1,
#   using kibana at localhost:5601
# ---
          
GET follower-idx-cluster3/_ccr/info
# > "status" : "active"
# Note: the index is active, but let's check the stats
          
GET follower-idx-cluster3/_ccr/stats
# > java.lang.IllegalStateException: Unable to open any connections to remote cluster [cluster3]
# Note: the follower index cannot connect to the remote index,
#       because we have shut down the data node on cluster3,
#       essential for the cluster functioning
          
# ---
# Recovery from the outage
# ---
          
# From CLI:
#     - $ docker exec -u elasticsearch -it es03d /bin/bash
#     - $ bin/elasticsearch &
#     - $ exit
#     
          
# ---
# Add some data on cluster3
#
# > Run the following code on Cluster 3,
#   using kibana at localhost:5603
# ---
          
GET _cat/nodes?v
# > name: es03m | role: m
# > name: es03d | role: d
# Note: after the restart, wait ~1m if the es03d isn't displayed
          
PUT idx-cluster3/_doc/01
{
  "msg" : "Hello from `cluster3`! ---updated---"
}
# 200
          
# ---
# Check if index is automatically recovered (no)
#
# > Run the following code on Cluster 1,
#   using kibana at localhost:5601
# ---
          
GET follower-idx-cluster3/_search?size=10
# > total.value: 3
# > "msg" : "Hello from `cluster3`!"
# Note: the msg of document 01 isn't updated
          
GET follower-idx-cluster3/_ccr/stats
# > java.lang.IllegalStateException: Unable to open any connections to remote cluster [cluster3]
# Note: ES hasn't recovered the connection although we have
#       restarted the service on es03d.
#       We need to restart the follower process
          
POST follower-idx-cluster3/_ccr/pause_follow
# > ack: true
# Note: we need to both pause & resume the index
          
POST follower-idx-cluster3/_ccr/resume_follow 
# > ack: true
          
GET follower-idx-cluster3/_search?size=10
# > "msg" : "Hello from `cluster3`! ---updated---"
# Note: the index is again up to date with cluster3 data

🔹 Define role-based access control (RBAC) using Elasticsearch Security

🔗 Official doc

“The Elastic Stack security features add authorization, which is the process of determining whether the user behind an incoming request is allowed to execute the request.” - doc
Security is based on two different processes:
- User authentication:
  the process of identifying a specific user (username + password) - doc
  - Basic security features (like RBAC and a basic logging system) are included in the ES basic license; more advanced features require a paid license or the 30-day trial
  - Must be enabled on all nodes, in elasticsearch.yml, using the xpack.security.enabled: true setting
    - For a complete cluster setup see the Minimal security guide
  - There are some special built-in users that serve specific purposes and are not intended for general use (e.g. the underlying Kibana connection) - doc
    - The elastic built-in user can be used to set all of the built-in user passwords (superuser)
    - The kibana_system built-in user is used by Kibana to connect and communicate with Elasticsearch.
    - These built-in users are stored in a special .security index that is a full-fledged index: “If your .security index is deleted or restored from a snapshot, however, any changes you have applied are lost” - doc
      - What happens if we lose the admin credentials? How could we continue to use the cluster? A solution could be to recreate a superuser account
      - The provided CLI program bin/elasticsearch-setup-passwords can be used to set up the passwords of the built-in users - doc
        
        Warning: you cannot run the elasticsearch-setup-passwords command a second time.
  - “Standard” users (e.g. people who will work with the ES infrastructure) are authenticated through realms, which manage the login process - doc
    - Realms basically define who checks the user credentials and how; some realms are: - doc
      - native, An internal realm where users are stored in a dedicated Elasticsearch index
      - kerberos, authenticates a user using Kerberos authentication
- User authorization:
  the process of checking whether a user can access a specific resource (e.g. cluster settings) - doc
  - We can create users with specific roles, and each role specifies the privileges granted to its users
  - “assigning privileges to roles and assigning roles to users or groups” - doc
  - Glossary
    - Secured Resource = what will be protected, could be “indices, aliases, documents, fields, users, and the Elasticsearch cluster itself”
    - Privilege = what the user could do with the resource
    - Permissions = set of privileges; available privileges list
    - Role = permissions + a name to identify the set
    - User = authenticated user
    - Group = set of users
  - Users, roles and the mappings between them can be managed:
    - Using configurations files - doc
      - Some files used by ES security:
        
        ES_PATH_CONF/roles.yml
        
        ES_PATH_CONF/users and ES_PATH_CONF/users_roles (managed with the bin/elasticsearch-users tool)
        
        ES_PATH_CONF/role_mapping.yml
    - Using the Kibana GUI - doc
    - Using the ES API - doc
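
To make the glossary above concrete, here is a toy Python model of how a role grants privileges on index patterns. It is only an illustrative sketch, not how Elasticsearch implements authorization: the `PRIVILEGE_ACTIONS`, `ROLES` and `USERS` tables and the `is_allowed` helper are hypothetical names, and the privilege-to-action mapping is a deliberately reduced assumption (in ES, `write` covers indexing, updating and deleting documents, but not reading them).

```python
from fnmatch import fnmatch

# Assumption: a tiny subset of index privileges and the actions they cover.
PRIVILEGE_ACTIONS = {
    "read": {"search", "get"},
    "write": {"index", "update", "delete"},
    "all": {"search", "get", "index", "update", "delete"},
}

# Role = a named set of permissions (index patterns + privileges).
ROLES = {
    "protected-index-writer": [
        {"names": ["test-index-protected"], "privileges": ["write"]},
    ],
}

# User = an authenticated identity holding one or more roles.
USERS = {"protected-writer": ["protected-index-writer"]}


def is_allowed(user: str, index: str, action: str) -> bool:
    """Return True if any role of `user` grants `action` on `index`."""
    for role in USERS.get(user, []):
        for permission in ROLES.get(role, []):
            if not any(fnmatch(index, pattern) for pattern in permission["names"]):
                continue
            for privilege in permission["privileges"]:
                if action in PRIVILEGE_ACTIONS.get(privilege, set()):
                    return True
    return False


print(is_allowed("protected-writer", "test-index-protected", "index"))   # True
print(is_allowed("protected-writer", "test-index-protected", "search"))  # False
print(is_allowed("protected-writer", "test-index-01", "index"))          # False
```

The second call returns `False` because a role with only the `write` privilege cannot read the index it writes to.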

🖱️ Code example

🦂 The ES passwords can be generated only when all the ES instances are up & running

# ─────────────────────────────────────────────
# Configure a RBAC access
# ─────────────────────────────────────────────
      
# Cluster to use: 12_basic-security
# https://github.com/pistocop/elastic-certified-engineer/tree/master/dockerfiles
      
# ---
# Generate the built-in credentials
# ---
      
# Open new CLI:
# $ docker exec -u elasticsearch -it es01 /bin/bash
# $ bin/elasticsearch
      
# Open new CLI:
# $ bin/elasticsearch-setup-passwords auto
# > store the printed psw
# $ exit
      
# ---
# Connect Kibana to ES
# ---
      
# Open new CLI:
# $ docker exec -u kibana -it kibana /bin/bash
# $ ./bin/kibana-keystore create
# $ ./bin/kibana-keystore add elasticsearch.password
# > Paste the psw of the user: kibana_system
# $ curl es01:9200
# > Error: security_exception
# $ curl --user kibana_system:<PASSWORD> es01:9200
# > "cluster_name" : "es-docker-cluster"
# $ bin/kibana
      
# ---
# Connect to Kibana
# ---
      
# Open Kibana at http://localhost:5601/
# Use the username and psw of user `elastic`
      
GET .security-7/_count
# > 55
      
GET .security-7/_search
{
  "_source": [
    "type",
    "password"
  ]
}
# > _id: reserved-user-kibana_system
# Note: psw are hashed
      
# Create indices for future tests
PUT test-index-01/_doc/01
{
  "foo": "bar"
}
# > 200
      
PUT test-index-protected
# > 200
      
# ---
# Create Users and Roles
# ---
      
# Two possible approaches:
# - API: https://www.elastic.co/guide/en/elasticsearch/reference/7.13/security-api-put-user.html
# - GUI: management -> security -> create user
      
# From GUI, create:
# - New User `test-user` with role `editor`
      
# ---
# Test the user `test-user` roles
# ---
      
# Open a new Incognito page on the browser
# Open Kibana at http://localhost:5601/
# Use the credentials of the just created `test-user`
      
GET _cat/indices
# > Error, "type" : "security_exception"
      
PUT test-index-02
# > Error, "type" : "security_exception"
      
GET test-index-01/_search
# > 200; "foo" : "bar"
      
# ---
# Create new role
# ---
      
# From the 1st Kibana page (user `elastic`)
# Create new role from GUI: management -> security -> Roles
# New role info:
#    - Name: `protected-index-writer`
#    - Indices: add `test-index-protected` with privileges `write`
#    Under Kibana section (bottom page):
#        - Add Kibana privilege -> all spaces -> All privileges -> Create
      
# Create new user from GUI: management -> security -> users
# New User info:
#    - Name: `protected-writer`
#    - Role: `protected-index-writer`
      
# Open a new Incognito page on the browser
# Open Kibana at http://localhost:5601/
# Use the credentials of the just created `protected-writer`
      
PUT test-index-protected/_doc/01
{
  "foo": "bar"
}
# > 200
      
GET test-index-protected/_search
# > 403; security_exception

👨‍🏭 How to

Guides to setting up ES and running experiments

Run ES locally: docker setup

🔗 Docker based: official guide
🔗 Docker compose based: official guide

ES docker setup

Single instance

# Run es node on `elastic` network
docker network create elastic
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.13.0
docker run --name es01-test --net elastic -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.13.0
  
# Run kibana service on `elastic` network
docker pull docker.elastic.co/kibana/kibana:7.13.0
docker run --name kib01-test --net elastic -p 5601:5601 -e "ELASTICSEARCH_HOSTS=http://es01-test:9200" docker.elastic.co/kibana/kibana:7.13.0
  
# Stop everything
docker stop es01-test
docker stop kib01-test

Multiple instances

Use docker compose with those configurations:
- https://github.com/pistocop/elastic-certified-engineer
- 🔗 Original documentation - doc

Docker containers troubleshooting

Use case: you have messed up the elasticsearch.yml file and now the container doesn’t start.

Steps:

# Create a new image from the container
$ docker commit $CONTAINER_NAME user/test_image
      
# Create a new container from the image
$ docker run -ti --entrypoint=bash user/test_image
      
# Explore the image and find the problem.
# E.g. an error on the file `/usr/share/elasticsearch/config/elasticsearch.yml`
      
# Copy the file from the container
$ docker cp $CONTAINER_NAME:/usr/share/elasticsearch/config/elasticsearch.yml .
      
# Apply the fix to the file
$ vi ./elasticsearch.yml
      
# Replace the config file of the container
$ docker cp ./elasticsearch.yml $CONTAINER_NAME:/usr/share/elasticsearch/config/

Test hot-warm-cold architecture

Process based on docker containers

Start an ES cluster with 3 nodes, each with a different role
- Tip: use the hot-warm-cold architecture from elastic-certified-engineer repo

Kibana code

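🦂 The min_age parameter sets the minimum age an index must reach before entering a phase, but in practice you will wait longer before the shards are moved: ILM checks the conditions only periodically (indices.lifecycle.poll_interval, 10 minutes by default)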

# Check the cluster status
GET _cluster/health
GET _cat/nodes?v
# > you should have 3 nodes with [mw, hms, cm] roles
      
# Create the policy
PUT _ilm/policy/hwc-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "30d",
            "max_primary_shard_size": "50gb",
            "max_docs": 5
          },
          "set_priority": {
            "priority": 100
          },
          "readonly": {}
        },
        "min_age": "0ms"
      },
      "warm": {
        "min_age": "5m",
        "actions": {
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "15m",
        "actions": {
          "set_priority": {
            "priority": 0
          }
        }
      }
    }
  }
}
      
GET _ilm/policy
      
# Create the index template
PUT _template/my-index-template
{
  "index_patterns": [
    "my-index-*"
  ],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index.lifecycle.name": "hwc-policy",
    "index.lifecycle.rollover_alias": "my-index"
  },
  "mappings": {
    "properties": {
      "foo": {
        "type": "keyword"
      }
    }
  }
}
      
# Create the index
PUT my-index-01
{
  "aliases": {
    "my-index": {
      "is_write_index": true
    }
  }
}
      
# Check alias creation
GET _cat/aliases
      
# Check index ILM
GET my-index-01/_ilm/explain?human
# > "phase": "hot"
# > "policy" : "hwc-policy"
      
# Check shard allocation
GET _cat/shards/my-index*?v
# > index `my-index-01` primary shard on node `es01` (hot node)
      
# Fill the index
PUT my-index/_doc/1
{
  "foo":"bar"
}
PUT my-index/_doc/2
{
  "foo":"bar"
}
PUT my-index/_doc/3
{
  "foo":"bar"
}
PUT my-index/_doc/4
{
  "foo":"bar"
}
PUT my-index/_doc/5
{
  "foo":"bar"
}
PUT my-index/_doc/6
{
  "foo":"bar"
}
      
# Wait 5 minutes...
      
GET _cat/indices/my-*?v
# > New index: my-index-000002
      
PUT my-index-01/_doc/7
{
  "foo":"bar"
}
# > Error: the policy sets "old" indices to `Read only`
      
PUT my-index/_doc/7
{
  "foo":"bar"
}
# > 200: the alias points to the new index
      
GET _cat/shards/my-index*?v
# > `my-index-01` is on `es01` node
      
# Wait 20/30 minutes...
# [hot -> warm]
      
GET my-index-01/_ilm/explain
# > "phase":"warm"
      
GET _cat/shards/my-index*?v
# > index `my-index-01` primary shard on node `es02` (warm node)
      
# Wait 20/30 minutes...
# [warm -> cold]
      
GET my-index-01/_ilm/explain
# > "phase":"cold"
      
GET _cat/shards/my-index*?v
# > index `my-index-01` primary shard on node `es03` (cold node)
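
The rollover in the walkthrough above turned `my-index-01` into `my-index-000002`. The naming rule behind it can be sketched in a few lines of Python: when the write index name ends with `-` followed by a number, the new index increments that number and zero-pads it to six digits. `next_rollover_name` is a hypothetical helper written for this note, and it ignores date-math index names.

```python
import re


def next_rollover_name(index_name: str) -> str:
    """Sketch of the default rollover naming rule (date-math names ignored)."""
    match = re.fullmatch(r"(.*)-(\d+)", index_name)
    if match is None:
        # Without a trailing "-<number>" the new index name must be given explicitly.
        raise ValueError(f"{index_name!r} does not end with '-<number>'")
    prefix, counter = match.groups()
    return f"{prefix}-{int(counter) + 1:06d}"


print(next_rollover_name("my-index-01"))      # my-index-000002
print(next_rollover_name("my-index-000002"))  # my-index-000003
```

This is why the second index is not called `my-index-02`: the padding of the original name is not preserved.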

Configure a multicluster architecture

Process based on docker containers

Create two clusters on two networks, with only one node (c1n1) attached to both networks, then use the code below to connect cluster1 to cluster2 and query one of its indices
- 🔗 Cluster creation & configuration docker files: GitHub

🖱️ Code tutorial

Another cluster could be used: 10_cross-cluster

# ─────────────────────────────────────────────
# Connect a remote cluster
#
# Note:
#     - To run the experiment cluster architecture:
#     https://github.com/pistocop/elastic-certified-engineer/tree/master/dockerfiles/05_multicluster
#     - Pay attention to the comments: some kibana
#     code should be run on a different host
# ─────────────────────────────────────────────
      
# ---
# Kibana code for `cluster2`
# Tip: open `cluster2` kibana at localhost:5602
# and paste the following code
# ---
      
GET /
# > `cluster2`
      
GET _cat/nodes
# > 1 node
      
# Check remote cluster
GET /_remote/info
# > no results
      
# Create some data
PUT c2-index/_doc/01
{
  "msg": "Hello world form cluster 2!"
}
      
GET c2-index/_doc/01
# > 200
      
# ---
# [!] Kibana code for `cluster1`
# Tip: open `cluster1` kibana at localhost:5601
# and paste the following code
# ---
      
GET /
# > `cluster1`
      
GET /_cat/nodes?v
# > 2 nodes
      
# Check remote cluster
GET /_remote/info
# > no results
      
# Connect to `cluster2`
PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "cluster2": {
          "mode": "sniff",
          "seeds": [
            "c2n1:9300"
          ],
          "transport.ping_schedule": "30s"
        }
      }
    }
  }
}
      
# Check remote cluster
GET /_remote/info
# > `cluster2` found
# Note: "num_nodes_connected" : 1,
#   if a wrong port is specified on the
#   seeds list (e.g. 9200) this number is zero
      
GET c2-index/_doc/01
# > Error: index not found
      
GET cluster2:c2-index/_search
{
  "query": {
    "match_all": {}
  }
}
# > "msg" : "Hello world form cluster 2!"
      
GET cluster2:c2-index/_doc/01
# > error
# Note: not all the APIs can be
# run against a remote cluster
      
PUT c1-index/_doc/01
{
  "msg": "Hello world form cluster 1!"
}
      
# ---
# Kibana code for `cluster2`
# Tip: open `cluster2` kibana at localhost:5602
# and paste the following code
# ---
      
GET cluster1:c1-index/_search
{
  "query": {
    "match_all": {}
  }
}
# > Error
# Note: a connection is not bidirectional,
# you should also open a connection 
# from `cluster2` to `cluster1`
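
The behaviour seen in the tutorial above (the `cluster:index` prefix and the one-way connection) can be mimicked with a toy Python model. This is only a sketch for intuition: the `clusters` dictionary and the `resolve` helper are hypothetical, and real cross-cluster search involves much more than a name lookup.

```python
# Toy model: each cluster knows its own indices and the remotes IT has registered.
clusters = {
    "cluster1": {"indices": {"c1-index"}, "remotes": {"cluster2"}},
    "cluster2": {"indices": {"c2-index"}, "remotes": set()},
}


def resolve(local_cluster: str, target: str) -> str:
    """Resolve `index` or `alias:index` as seen from `local_cluster`."""
    alias, separator, index = target.partition(":")
    if not separator:
        # No cluster prefix: the index is looked up locally.
        owner, index = local_cluster, target
    elif alias in clusters[local_cluster]["remotes"]:
        owner = alias
    else:
        raise LookupError(f"no remote cluster named {alias!r} on {local_cluster}")
    if index not in clusters[owner]["indices"]:
        raise LookupError(f"index {index!r} not found on {owner}")
    return f"{owner}/{index}"


for cluster, target in [
    ("cluster1", "cluster2:c2-index"),  # works: cluster1 registered cluster2
    ("cluster1", "c2-index"),           # fails: no such local index
    ("cluster2", "cluster1:c1-index"),  # fails: the connection is one-way
]:
    try:
        print(cluster, target, "->", resolve(cluster, target))
    except LookupError as error:
        print(cluster, target, "-> error:", error)
```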

🐳 Deepenings

More in-depth topics useful for a more comprehensive learning

Cluster infrastructure

Cluster formation

How to set configurations when creating a new cluster

🔗 Official doc
🔗 Good Stack-Overflow recap - link
- Some glossary before start:
  - bootstrapping = the event of starting a cluster for the very first time
  - master = a node with the master role (master-eligible); it takes part in the voting system and, if elected, is responsible for managing the cluster - doc
  - voting system = cluster-level decisions (like deciding which shards to allocate to which nodes) are taken by master nodes, but because ES is distributed some master nodes could be unreachable (connection error, node fault etc.). To prevent two sub-groups of nodes from working independently after a connection error (split brain), decisions are taken through a voting system that requires a quorum - doc
- Cluster (in)formation
  - At bootstrapping, nodes don’t know how many of them are present in the cluster, nor how many and which are the master nodes; moreover, both can change over time as nodes are added or removed.
  - ES has a system to “automagically” form the cluster, balance the voting system and allow the cluster to be resized, but some information must be provided for these features to work well
- Information to provide
  - At bootstrapping, each node with the master role should have the cluster.initial_master_nodes parameter set to the list of all the master-eligible nodes
    - This parameter should be removed after the bootstrapping
    - After the first start, this information is stored (with other cluster information) inside the data folder of each node
  - Nodes without the master role should instead have the discovery.seed_hosts parameter set. This parameter contains a list of hosts to contact when the node starts, in order to ask to join the cluster. Those hosts do not necessarily coincide with the master nodes, but it is a good idea if they do, because the list should contain resilient and stable nodes.
  - 🦂 We say master nodes to indicate ES instances with the master role (master-eligible nodes), but once the cluster is formed only one of them is the master, elected through the voting system
  - 🦂 Note that both the cluster.initial_master_nodes and discovery.seed_hosts parameters are required on each master-eligible node at bootstrap time. This makes sense because the first parameter is used only once and should be removed after bootstrapping, so the latter remains essential for the node to find the cluster on later restarts
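
The quorum idea behind the voting system can be illustrated with a small Python sketch. It assumes the simplest possible rule (a strict majority of the master-eligible nodes); ES manages its voting configuration on its own and the real behaviour has more cases, so treat `quorum` and `can_elect_master` as hypothetical helpers for intuition only.

```python
def quorum(voting_nodes: int) -> int:
    """Smallest strict majority of the voting (master-eligible) nodes."""
    return voting_nodes // 2 + 1


def can_elect_master(voting_nodes: int, reachable: int) -> bool:
    """A partition can elect a master only if it holds a majority of the votes."""
    return reachable >= quorum(voting_nodes)


# 3 master-eligible nodes split 2 / 1 by a network fault:
print(can_elect_master(3, 2))  # True  -> this side keeps working
print(can_elect_master(3, 1))  # False -> this side cannot elect a master
```

Since at most one side of a partition can hold a strict majority, two independent masters (split brain) cannot be elected at the same time.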

Index management

Removal of mapping types

ES has decided to remove the concept of *mapping types* from Elasticsearch.
- “In an Elasticsearch index, fields that have the same name in different mapping types are backed by the same Lucene field internally” - link
- Alternatives to types
  - Have an index per document type
  - Custom type field - link
    - implement your own custom type field which will work in a similar way to the old _type

Change Static Index modules (reindex)

Change a *static* mapping parameter and use *reindex/aliases* to update the indices

Index modules - doc
- Basically all the settings linked to an index (e.g. shards, replicas, analyzers…); some are static and cannot be changed without reindexing the data, others are dynamic (e.g. replicas) and can be changed using the _settings index endpoint

PUT test02
{
  "mappings": {
    "properties": {
      "text-field":{
        "index_options": "docs",
        "type": "text"
      }
    }
  }
}
  
PUT test02/_doc/01
{
  "text-field": "hello i'm a computer and this is a test"
}
  
# ---
# We want different index_options: "offsets"
# this parameter cannot be "hot changed"
# ---
PUT test02/_mapping
{
  "properties": {
    "text-field": {
      "index_options": "offsets",
      "type": "text"
    }
  }
}
# > 400; Mapper for [text-field] conflicts with existing mapper
  
PUT test03
{
  "mappings": {
    "properties": {
      "text-field": {
        "index_options": "offsets",
        "type": "text"
      }
    }
  }
}
  
POST _reindex
{
  "source": {
    "index": "test02"
  },
  "dest": {
    "index": "test03"
  }
}
  
GET test03/_search
GET test03
# Check everything is fine
  
# ---
# Two solutions:
# 1. Delete test02 and use an alias to redirect test02 to test03
# 2. Delete test02 and reindex test03 to test02
# ---
  
DELETE test02
# > 200
  
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "test03",
        "alias": "test02"
      }
    }
  ]
}
# > 200
  
GET test02/_search
{
  "query": {
    "match": {
      "text-field": "hellooooo test"
    }
  },
  "highlight": {
    "fields": {
      "text-field": {}
    }
  }
}
# > "hello i'm a computer and this is a <em>test</em>"

Search

Access the analyzers tokens

Define custom analyzers through templates and inspect their tokens

The following code covers various topics (composable templates, custom analyzers, custom tokenizers, termvectors, subfields) that you should already be familiar with

# ─────────────────────────────────────────────
# Intermediate example:
# Create a template that defines custom analyzers
# and inspect their behaviour
# ─────────────────────────────────────────────
  
# ---
# Create the template
# ---
  
# Tip: test the analyzer's behaviour before defining it:
GET _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [
      ","
    ]
  },
  "text": [
    "To be, or not to be, that is the question"
  ]
}
  
# Template components
PUT _component_template/whitespace_analyzer_template
{
  "template": {
    "settings": {
      "analysis": {
        "analyzer": {
          "my_whitespace_analyzer": {
            "type": "custom",
            "tokenizer": "whitespace"
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "text_whitespace_field": {
          "type": "text",
          "analyzer": "my_whitespace_analyzer",
          "fields": {
            "length": {
              "type": "token_count",
              "analyzer": "my_whitespace_analyzer"
            }
          }
        }
      }
    }
  }
}
  
PUT _component_template/ngram_analyzer_template
{
  "template": {
    "settings": {
      "analysis": {
        "tokenizer": {
          "my_ngram_tokenizer": {
            "type": "ngram",
            "min_gram": 3,
            "max_gram": 3
          }
        },
        "analyzer": {
          "my_ngram_analyzer": {
            "type":"custom",
            "tokenizer": "my_ngram_tokenizer"
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "text_ngram_field": {
          "type": "text",
          "analyzer": "my_ngram_analyzer",
          "fields": {
            "length": {
              "type": "token_count",
              "analyzer": "my_ngram_analyzer"
            }
          }
        }
      }
    }
  }
}
  
PUT _component_template/char_group_analyzer_template
{
  "template": {
    "settings": {
      "analysis": {
        "tokenizer": {
          "my_chargroup_tokenizer": {
            "type": "char_group",
            "tokenize_on_chars": [
              ","
            ]
          }
        },
        "analyzer": {
          "my_char_group_analyzer": {
            "type": "custom",
            "tokenizer": "my_chargroup_tokenizer"
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "text_chargroup_field": {
          "type": "text",
          "analyzer": "my_char_group_analyzer",
          "fields": {
            "length": {
              "type": "token_count",
              "analyzer": "my_char_group_analyzer"
            }
          }
        }
      }
    }
  }
}
  
PUT _component_template/pattern_analyzer_template
{
  "template": {
    "settings": {
      "analysis": {
        "tokenizer": {
          "my_pattern_tokenizer": {
            "type": "pattern",
            "pattern": "to be"
          }
        },
        "analyzer": {
          "my_pattern_analyzer": {
            "type": "custom",
            "tokenizer": "my_pattern_tokenizer",
            "filter": [
              "lowercase"
            ]
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "text_pattern_field": {
          "type": "text",
          "analyzer": "my_pattern_analyzer",
          "fields": {
            "length": {
              "type": "token_count",
              "analyzer": "my_pattern_analyzer"
            }
          }
        }
      }
    }
  }
}
  
PUT _component_template/pattern_analyzer_enhanced_template
{
  "template": {
    "settings": {
      "analysis": {
        "tokenizer": {
          "my_pattern_enhanced_tokenizer": {
            "type": "pattern",
            "pattern": "[Tt]o be"
          }
        },
        "analyzer": {
          "my_pattern_enhanced_analyzer": {
            "type": "custom",
            "tokenizer": "my_pattern_enhanced_tokenizer",
            "filter": [
              "lowercase"
            ]
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "text_pattern_enhanced_field": {
          "type": "text",
          "analyzer": "my_pattern_enhanced_analyzer",
          "fields": {
            "length": {
              "type": "token_count",
              "analyzer": "my_pattern_enhanced_analyzer"
            }
          }
        }
      }
    }
  }
}
  
# Create the template
POST _index_template/analyzer_family_template
{
  "index_patterns": ["test_*"],
  "composed_of": [
    "whitespace_analyzer_template",
    "ngram_analyzer_template",
    "char_group_analyzer_template",
    "pattern_analyzer_template",
    "pattern_analyzer_enhanced_template"
  ]
}
  
# ---
# Index creation & insertion
# ---
  
DELETE test_index
PUT test_index
{
  "mappings": {
    "properties": {
      "text_standard_field": {
        "type": "text",
        "analyzer": "standard", 
        "fields": {
          "length": {
            "type": "token_count",
            "analyzer": "standard"
          }
        }
      },
      "text_simple_field": {
        "type": "text",
        "analyzer": "simple",
        "fields": {
          "length": {
            "type": "token_count",
            "analyzer": "simple"
          }
        }
      },
      "text_stop_field": {
        "type": "text",
        "analyzer": "stop",
        "fields": {
          "length": {
            "type": "token_count",
            "analyzer": "stop",
            "enable_position_increments": "false"
          }
        }
      },
      "text_keyword_field": {
        "type": "text",
        "analyzer": "keyword",
        "fields": {
          "length": {
            "type": "token_count",
            "analyzer": "keyword"
          }
        }
      },
      "keyword_field": {
        "type": "keyword",
        "fields": {
          "length": {
            "type": "token_count",
            "analyzer": "keyword"
          }
        }
      }
    }
  }
}
# [!] Note the "enable_position_increments": "false",
#     here why: https://github.com/elastic/elasticsearch/issues/39276#issuecomment-466278696
  
GET test_index
  
PUT test_index/_doc/1
{
  "text_standard_field": "To be, or not to be, that is the question",
  "text_chargroup_field": "To be, or not to be, that is the question",
  "text_ngram_field": "To be, or not to be, that is the question",
  "text_whitespace_field": "To be, or not to be, that is the question",
  "text_pattern_field": "To be, or not to be, that is the question",
  "text_pattern_enhanced_field": "To be, or not to be, that is the question",
  "text_simple_field": "To be, or not to be, that is the question",
  "text_stop_field": "To be, or not to be, that is the question",
  "text_keyword_field": "To be, or not to be, that is the question",
  "keyword_field": "To be, or not to be, that is the question"
}
  
# ---
# Inspect the analyzer's behaviour
# ---
  
GET test_index/_search
{
  "_source": [
    ""
  ],
  "fields": [
    "*.length"
  ],
  "query": {
    "term": {
      "_id": 1
    }
  }
}
# > text_stop_field.length = 1 because only "question" isn't a stopword
#
# > text_pattern_field.length = 2 because we split on the "to be" text.
#   Tip: note that the first part of the sentence, "To be", is not
#        used for the split but is kept in the resulting text.
#        This occurs because the tokenizer runs before the `lowercase` filter
#
# > text_pattern_enhanced_field.length = 2 because we split on the "[Tt]o be" pattern.
#   Tip: use the termvectors API below to compare this result with `text_pattern_field`
#
# > text_whitespace_field.length,
#   text_standard_field.length,
#   text_simple_field.length
#   = 10 because the sentence is composed of 10 words
#
# > text_ngram_field.length = 39 because the string is 41 characters long and
#   a sliding window of size 3 has 39 positions (41 - 3 + 1)
# 
# > text_keyword_field.length,
#   keyword_field.length
#   = 1 because a single keyword token is created with the whole field text
# 
# > text_chargroup_field.length = 3 because the text is split at each comma,
#   and the two commas in the sentence produce three tokens
  
# Inspect the analyzers tokens
GET test_index/_termvectors/1?fields=text_stop_field&field_statistics=false
GET test_index/_termvectors/1?fields=text_pattern_field&field_statistics=false
GET test_index/_termvectors/1?fields=text_pattern_enhanced_field&field_statistics=false
GET test_index/_termvectors/1?fields=text_whitespace_field&field_statistics=false
GET test_index/_termvectors/1?fields=text_standard_field&field_statistics=false
GET test_index/_termvectors/1?fields=text_simple_field&field_statistics=false
GET test_index/_termvectors/1?fields=text_ngram_field&field_statistics=false
GET test_index/_termvectors/1?fields=text_keyword_field&field_statistics=false
GET test_index/_termvectors/1?fields=keyword_field&field_statistics=false
GET test_index/_termvectors/1?fields=text_chargroup_field&field_statistics=false
  
# Tip: test an index analyzer on the fly
GET test_index/_analyze
{
  "analyzer": "my_ngram_analyzer",
  "text": ["Text not indexed"]
}
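
The token counts discussed above can be double-checked without a cluster. The following Python sketch only approximates the tokenizers with plain string operations (it does not call Elasticsearch, and `split_tokens` / `ngrams` are hypothetical helpers), but it reproduces the same numbers for the sample sentence.

```python
import re

TEXT = "To be, or not to be, that is the question"


def split_tokens(pattern: str, text: str) -> list:
    """Split on a regex and drop empty tokens, roughly like a `pattern` tokenizer."""
    return [token for token in re.split(pattern, text) if token]


def ngrams(text: str, size: int) -> list:
    """Sliding window of `size` characters (min_gram == max_gram == size)."""
    return [text[i:i + size] for i in range(len(text) - size + 1)]


counts = {
    "whitespace": len(TEXT.split()),
    "char_group": len(split_tokens(",", TEXT)),
    "ngram": len(ngrams(TEXT, 3)),
    "pattern": len(split_tokens("to be", TEXT)),
    "pattern_enhanced": len(split_tokens("[Tt]o be", TEXT)),
}
print(len(TEXT), counts)
# 41 {'whitespace': 10, 'char_group': 3, 'ngram': 39, 'pattern': 2, 'pattern_enhanced': 2}
```

Both pattern variants give two tokens, but not the same ones: with `to be` the first token is `To be, or not `, while with `[Tt]o be` the leading `To be` is consumed as a separator.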

Backup

Backup/restore snapshots

The supported way to back up a cluster is by taking a snapshot

🔗 docs

For a complete cluster backup you should:

Back up the data:

Based on *snapshot* API, you can backup a cluster including all its data streams and indices
Elasticsearch takes snapshots incrementally
Snapshot repository
- The snapshot could be stored on different repositories, like GCS or S3.
  Here the list of available repositories.
- API to create a snapshot repository:
```
PUT /_snapshot/my_repository
{
  "type": "fs", # Types: [fs, source, url]
  "settings": {
    "location": "my_backup_location", # only if "fs" type - folder path e.g. /mnt/my-fs/
    "url": "url_root_filesystem",     # only if "url" type - URL location of the root of the shared filesystem
    "compress": true,                 # metadata (e.g. mappings) compressed
    "max_number_of_snapshots": 500    # Maximum number of snapshots the repository can contain
  }
}
```
  - For a complete example see the Shared file system repository official guide
  - 🦂 Be aware: the shared file system used for the backup must be mounted on every node at the same path, and you must also register that path in the elasticsearch.yml file (path.repo setting) and perform a rolling restart. - see the official guide
    - If the path isn’t registered in the elasticsearch.yml file, an error like this will be returned:
      “[my_backup] location [/this/path/doesnt/exist] doesn’t match any of the locations specified by path.repo because this setting is empty”

Create a snapshot

API to create a snapshot:

# Default: includes all data streams and open indices in the cluster
PUT /_snapshot/my_repository/my_snapshot
              
# Query parameters
PUT /_snapshot/my_repository/snapshot_2?wait_for_completion=true # request returns a response when the snapshot is complete
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,      # ignores missing or closed data streams and indices
  "include_global_state": false,   # whether to also store the global state (index templates, ILM, ...)
  "partial": true,                 # do not fail if one or more indices included in the snapshot do not have all primary shards available
  "metadata": {                    # arbitrary metadata
    "taken_by": "user123",
    "taken_because": "backup before upgrading"
  }
}

Snapshots can also be searched directly (searchable snapshots)

Use the SLM (Snapshot lifecycle management) to automatically take and manage snapshots

🔗 Tutorial: Automate backups with SLM

API to create an SLM policy

# Create the policy
PUT /_slm/policy/nightly-snapshots
{
  "schedule": "0 30 1 * * ?", # cron syntax
  "name": "<nightly-snap-{now/d}>", 
  "repository": "my_repository", 
  "config": {              # same info of snapshot API
    "indices": ["*"] 
  },
  "retention": {
    "expire_after": "30d", # period after which a snapshot is considered expired
    "min_count": 5, 
    "max_count": 50        # Maximum number of snapshots to retain - should not exceed 200
  }
}
          
# Test the policy
POST /_slm/policy/nightly-snapshots/_execute # trigger the policy
GET /_slm/policy/nightly-snapshots?human     # get info about the execution
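
How the three `retention` settings interact can be sketched in Python. This is a simplified model written for this note, not the actual SLM algorithm: `snapshots_to_delete` is a hypothetical helper and it assumes that `max_count` is applied first, then `expire_after`, with the `min_count` most recent snapshots always kept.

```python
from datetime import datetime, timedelta


def snapshots_to_delete(taken_at, now, expire_after, min_count, max_count):
    """Return the snapshot timestamps a retention run would delete (simplified)."""
    newest_first = sorted(taken_at, reverse=True)
    # Anything beyond `max_count` is deleted even if it has not expired yet.
    kept, deleted = newest_first[:max_count], newest_first[max_count:]
    for position, timestamp in enumerate(kept):
        expired = now - timestamp > expire_after
        # The `min_count` most recent snapshots survive even when expired.
        if expired and position >= min_count:
            deleted.append(timestamp)
    return sorted(deleted)


now = datetime(2022, 3, 1)
# One snapshot per day for the last 40 days (1 to 40 days old).
snapshots = [now - timedelta(days=age) for age in range(1, 41)]
doomed = snapshots_to_delete(snapshots, now, timedelta(days=30), min_count=5, max_count=50)
print(len(doomed))  # 10 -> the snapshots that are 31..40 days old
```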

Restore the data

🔗 official docs

API to restore a snapshot

🦂 index_out_of_bounds_exception error
```
"type" : "index_out_of_bounds_exception",
"reason" : "index_out_of_bounds_exception: No group 1"
```
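- If the above error is raised, you have probably made a mistake in the rename_pattern and rename_replacement parameters: typically rename_replacement references a capture group (e.g. $1) that rename_pattern does not define. Try changing those settings (e.g. use a parenthesized group such as index_(.+) instead of a * wildcard)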

# [opt] Close local index that exist
POST /index_1/_close
          
# Restore a snapshot
POST /_snapshot/my_repository/snapshot_2/_restore?wait_for_completion=true # the request returns a response when the restore operation completes
{
  "indices": "index_1,index_2",              # which indices to restore
  "ignore_unavailable": true,
  "include_global_state": false,
  "rename_pattern": "index_(.+)",            # indices matching this pattern...
  "rename_replacement": "restored_index_$1", # ... are renamed using this replacement ($1 = first capture group)
  "include_aliases": false                   # do not restore aliases from the snapshot
}
          
# [opt] Open local index
POST /index_1/_open
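To make the "No group 1" error concrete, here is a sketch of a request that should trigger it next to one that should not. It assumes the same repository and snapshot names used above; the only difference is the capture group in rename_pattern.

```console
# Expected to fail with "No group 1":
# the pattern has no parenthesized group, but the replacement uses $1
POST /_snapshot/my_repository/snapshot_2/_restore
{
  "indices": "index_1",
  "rename_pattern": "index_.+",
  "rename_replacement": "restored_index_$1"
}

# Expected to work: (.+) is group 1, so index_1 becomes restored_index_1
POST /_snapshot/my_repository/snapshot_2/_restore
{
  "indices": "index_1",
  "rename_pattern": "index_(.+)",
  "rename_replacement": "restored_index_$1"
}
```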

💡 You are not obligated to restore everything from the snapshot:

“You can select specific data streams or indices to restore.”
🦂 “Existing indices can only be restored if they are closed and have the same number of shards as the indices in the snapshot.”

🖱️ Code example

# Register snapshot repository
PUT /_snapshot/fs_bkp
{
  "type": "fs",
  "settings": {
    "location": "/mnt/cluster_fs/es_bkp/"
  }
}
      
POST /_snapshot/fs_bkp/_verify
# > Check passed
      
# [opt] Create an index
PUT test_index
      
# Create a cluster snapshot
PUT /_snapshot/fs_bkp/snapshot_001?wait_for_completion=true
{
  "metadata":{
    "taken_by": "My first bkp attempt",
    "taken_because": "Test es bkp functionality, all snapshot defaults maintained"
  }
}
      
GET /_snapshot/fs_bkp/_current
# > "state" : "IN_PROGRESS"
      
# Waiting...
      
GET /_snapshot/fs_bkp/_current
# > [<empty>]
      
GET _snapshot/fs_bkp/snapshot_001
# > "state": "SUCCESS"
      
PUT test_index/_doc/bkp_test_01
{
  "foo":"bar"
}
GET test_index/_doc/bkp_test_01
# > 200
      
POST _snapshot/fs_bkp/snapshot_001/_restore
{
  "indices": "test_index",
  "rename_pattern": "test_(.+)",
  "rename_replacement": "restored_$1"
}
      
GET restored_index/_doc/bkp_test_01
# > "found": false (the document was indexed after the snapshot was taken)
      
GET test_index/_doc/bkp_test_01
# > "found": true
      
# ---
# Make a policy for daily snapshots
# ---
PUT /_slm/policy/daily-snapshots
{
  "schedule": "0 30 22 * * ?", 
  "name": "<daily-snap-{now/d}>", 
  "repository": "fs_bkp", 
  "config": { 
    "ignore_unavailable": false,
    "include_global_state": true,
    "metadata": {
      "taken_by": "Policy named: `daily-snapshots`"
    }
  },
  "retention": { 
    "expire_after": "30d", 
    "min_count": 7, 
    "max_count": 60
  }
}

Security

Unsecured node connecting to a cluster with minimal security

Set up a minimal ES security system and connect an unsecured node

We will see how a new node on the cluster can get access to “secured” data.
- This example shows why “If your cluster has multiple nodes, you must enable minimal security and then configure Transport Layer Security (TLS) between nodes.” - doc

# ─────────────────────────────────────────────
# Setup a minimal ES security system 
# and connect unsecured node
# ─────────────────────────────────────────────
  
# Cluster to use: 11_blank-minicluster
# https://github.com/pistocop/elastic-certified-engineer/tree/master/dockerfiles/11_blank-minicluster
  
# ---
# Configure ES security
# ---
  
# Open new shell:
# $ docker exec -u elasticsearch -it es01 /bin/bash
# $ echo "xpack.security.enabled: true" >> config/elasticsearch.yml
# $ bin/elasticsearch
  
# Open new shell:
# $ docker exec -u elasticsearch -it es01 /bin/bash
# $ ./bin/elasticsearch-setup-passwords auto
# > store all the generated passwords (we will use the kibana_system & elastic ones)
# Test the credentials:
# $ curl es01:9200
# > missing authentication credentials
# $ curl --user elastic:oM2vXErEqaxhznsDilB0 -XGET localhost:9200
# > "cluster_name" : "es-docker-cluster"
  
# ---
# Start kibana
# ---
  
# Open new shell:
# $ docker exec -u kibana -it kibana /bin/bash
# $ echo "elasticsearch.username: kibana_system" >> config/kibana.yml
# [1] $ echo "elasticsearch.password: YVU119gR44nO0Qh6A0Zt" >> config/kibana.yml
# $ bin/kibana
  
# Create new index:
# Visit localhost:5601 & use usr:"elastic" psw:"<es psw generated before>"
GET _cat/nodes?v
# > name: es01
  
PUT secret_index/_doc/01
{
  "psw": "secret"
}
# 200
  
# Open new shell:
# $ docker exec -u elasticsearch -it es01 /bin/bash
# $ curl -XGET "http://es01:9200/secret_index/_search"
# > error: security_exception
  
# [1] Note: this is an insecure way to set the password, use
# the keystore instead: https://www.elastic.co/guide/en/elasticsearch/reference/7.13/security-minimal-setup.html#add-built-in-users
  
# ---
# Connect an insecure node
# ---
  
# Open new shell:
# $ docker exec -u elasticsearch -it es02 /bin/bash
# $ cat config/elasticsearch.yml
# node.name: es02
# cluster.name: es-docker-cluster
# network.host: 0.0.0.0
# discovery.seed_hosts:
#   - es01
# cluster.initial_master_nodes:
#   - es01
# bootstrap.memory_lock: true
# $ bin/elasticsearch
  
# Open new shell:
# $ docker exec -u elasticsearch -it es02 /bin/bash
# $ curl -XGET "http://es01:9200/secret_index/_search"
# > error: security_exception
# $ curl -XGET "http://es02:9200/secret_index/_search" 
# [!] > 200; "psw": "secret"
  
# Note: the new node, started without any password, can read the index content

💊 Pills

Bullets for a last-minute review

Hot topics
- Use _index_template instead of _template (the latter is the deprecated legacy API)
- Attach multiple analyzers to the same field with the fields mapping parameter, e.g. fields: { "raw": { type: ... } };
  this feature is named multi-fields
- 🦂 Be careful with the Kibana autocomplete suggestions: they are often misleading or incorrect.
  Always open the documentation page.
- Painless functions available for String objects: Painless doc → Painless API Reference (contains all the available APIs) → Ingest API (APIs available inside an ingest pipeline) → String (API list)
- The query -> match clause has a lot of settings,
  e.g. operator=AND to require all the words to match (the default is OR). Use it when every term must be present, or use match_phrase if the order is also relevant
```
# Example
GET test01/_search
{
  "query": {
    "match": {
      "message_field": {
        "query": "the old",
        "operator": "OR"    # default value; set "AND" to require every term
      }
    }
  }
}
```
- In the search API, the bool query clauses are:
  - must → the clause must match and it contributes to the score
  - filter → like must, but without contributing to the score
  - should → a match is not required, but when it happens the score increases
  - must_not → if the clause matches, the document is discarded
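The four clauses can be combined in one request. The sketch below uses a hypothetical index `my-index` and made-up field names (`title`, `publish_date`, `tags`, `status`) only to show where each clause goes.

```console
GET my-index/_search
{
  "query": {
    "bool": {
      "must":     [ { "match": { "title": "elasticsearch engineer" } } ],
      "filter":   [ { "range": { "publish_date": { "gte": "2021-01-01" } } } ],
      "should":   [ { "match": { "tags": "certification" } } ],
      "must_not": [ { "term":  { "status": "draft" } } ]
    }
  }
}
```

Only the must and should clauses influence `_score`; filter and must_not just include or exclude documents.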
- To query a date use the range query: query.range.<field_name>
- wildcard queries can be run on both keyword and text fields (on text fields they match the individual analyzed terms, not the whole text)
- Access the keys of an object field using the dot notation, e.g. products.price
- Pipeline/nested aggs should be read top-down: the first-level aggregation/metric is computed before the nested one.
```
# E.g. to calculate the products bought daily:
# first (1_level) bucket the orders by day and
# **then** (2_level) calculate the value for each bucket
POST kibana_sample_data_ecommerce/_search?size=0
{
  "aggs": {
    "1_level": {
      "date_histogram": {
        "field": "order_date",
        "calendar_interval": "day"
      },
      "aggs": {
        "2_level": {
          "value_count": {
            "field": "products._id.keyword"
          }
        }
      }
    }
  }
}
```
- The highlighter needs an offset strategy to know “where” the matching sections are,
  and the original field text: it is read from _source unless the field is mapped with store: true
- Field mappings accept a lot of parameters, don’t forget to use them
  - index_options “controls what information is added to the inverted index” - doc
    - Useful to speed up highlights - doc
  - ignore_malformed controls whether a value that cannot be parsed raises an error or is simply ignored
- Pagination
  - There are two main ways to paginate the documents:
    - using the from and size fields: recommended if the total hits to paginate are fewer than 10,000 (the default index.max_result_window limit)
    - using the search_after field: recommended if the total hits to paginate are more than 10,000
      
      It can be used only if a sort order is provided (memo: keyword fields can be sorted in alphabetical order)
  - Both pagination systems can use a PIT (point in time): generate an id that represents the state of the index at that moment, then pass this id with every page request.
    Usage
    1. generate the PIT from the index
      POST kibana_sample_data_ecommerce/_pit?keep_alive=60m
    2. pass the received id in the search body as pit.id:...
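Put together, the PIT + search_after flow could look like the sketch below. `<pit_id>` and the search_after values are placeholders that must be copied from the previous responses; `order_date` is the date field already used in the aggregation example above.

```console
# 1. Open a point in time
POST kibana_sample_data_ecommerce/_pit?keep_alive=1m
# > "id": "<pit_id>"

# 2. First page: no index in the path, the PIT already identifies it
GET _search
{
  "size": 100,
  "pit": { "id": "<pit_id>", "keep_alive": "1m" },
  "sort": [ { "order_date": "asc" } ]
}

# 3. Next page: copy the whole "sort" array of the last hit into search_after
GET _search
{
  "size": 100,
  "pit": { "id": "<pit_id>", "keep_alive": "1m" },
  "sort": [ { "order_date": "asc" } ],
  "search_after": [ <sort values of the last hit> ]
}

# 4. Close the PIT when done
DELETE _pit
{ "id": "<pit_id>" }
```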
- An alias can group multiple indices behind one name and apply a filter to the data
- Search template: store the query as a mustache script with parameters using PUT _scripts/<script_name>, then use it at query time with GET <index>/_search/template { "id": "<script_name>", "params": { ... } }
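A minimal search template sketch follows. The template id, the index `my-index`, the `message` field and the parameter names are all hypothetical.

```console
# Store the template
PUT _scripts/my-search-template
{
  "script": {
    "lang": "mustache",
    "source": {
      "query": { "match": { "message": "{{query_string}}" } },
      "size": "{{size}}"
    }
  }
}

# Preview the rendered query without running it
POST _render/template
{
  "id": "my-search-template",
  "params": { "query_string": "hello world", "size": 5 }
}

# Run it
GET my-index/_search/template
{
  "id": "my-search-template",
  "params": { "query_string": "hello world", "size": 5 }
}
```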
- Dynamic mapping
  - Dynamic field mapping = how the index automatically manages new fields that weren’t declared (e.g. with strict an error is raised).
    - 💡 A subfield can override the dynamic parameter; in this way we can, for example, allow new fields only under some subfields.
      
      Example:
      
      PUT my-index-02
      {
        "mappings": {
          "dynamic": "strict",
          "properties": {
            "user": {
              "properties": {
                "name": { "type": "text" },
                "social_networks": {
                  "dynamic": true,
                  "properties": {}
                }
              }
            }
          }
        }
      }
      # > 200
      # Note: we have provided the field "dynamic" : "strict",
      # so no new fields are allowed on this index
      
      PUT my-index-02/_doc/1
      { "user": { "name": "tyler" } }
      # > 200
      
      PUT my-index-02/_doc/2
      { "user": { "name": "tyler" }, "otherfield": "foo" }
      # > 400; mapping set to strict
      
      PUT my-index-02/_doc/2
      { "user": { "name": "tyler", "surname": "foo" } }
      # > 400; mapping set to strict
      
      PUT my-index-02/_doc/2
      {
        "user": {
          "name": "tyler",
          "social_networks": { "facebook": { "nick": "foo" } }
        }
      }
      # > 200
      # Note: possible because the latter "dynamic": true,
      # overwrite the general "dynamic": "strict"
  - Dynamic template = we declare some matching rules to catch the new fields (e.g. location-*) and how to map them
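A small dynamic template sketch: every new string field whose name starts with `location-` is mapped as keyword. The index name `my-index-03` and the template name `location_as_keyword` are made up.

```console
PUT my-index-03
{
  "mappings": {
    "dynamic_templates": [
      {
        "location_as_keyword": {
          "match_mapping_type": "string",
          "match": "location-*",
          "mapping": { "type": "keyword" }
        }
      }
    ]
  }
}

PUT my-index-03/_doc/1
{ "location-city": "Rome", "note": "free text" }

# location-city should be mapped as keyword, note with the default text + keyword mapping
GET my-index-03/_mapping
```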
- Nested arrays of objects
  - If the field will store unknown fields, we can simply store them as an object.
    Use nested only for arrays of objects whose sub-fields must be queried together; flattened is meant for objects with many or unpredictable keys
    DELETE test04
    PUT test04
    {
      "mappings": {
        "properties": {
          "f-obj":    { "type": "object" },
          "f-nested": { "type": "nested" },
          "f-flat":   { "type": "flattened" }
        }
      }
    }
    PUT test04/_doc/01
    {
      "f-obj":    { "field1": "mouse", "field2": "keyboard" },
      "f-nested": { "field1": "mouse", "field2": "keyboard" },
      "f-flat":   { "field1": "mouse", "field2": "keyboard" }
    }
    PUT test04/_doc/02
    {
      "f-obj":    { "field1": "keyboard", "field2": "mouse" },
      "f-nested": { "field1": "keyboard", "field2": "mouse" },
      "f-flat":   { "field1": "keyboard", "field2": "mouse" }
    }
    # ---
    # Test searches
    # ---
    GET test04/_search
    { "query": { "match": { "f-obj.field1": "mouse" } } }
    # "_id" : "01", as expected
  - Object vs flattened types: object keeps a mapped sub-field for every key, while flattened maps the whole JSON as a single field and indexes all its leaf values as keywords
- Custom analyzer: composed of (and applied in this order)
  - character filters - preprocess the characters
  - 🦂 tokenizer - splits the text into tokens and can do other things (e.g. lowercase, remove punctuation etc.)
  - token filters - manage the tokens: remove (stopwords), add (synonyms), lowercase
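The three stages can be seen at work with a small custom analyzer and the _analyze API. The index name `my-index-04` and the analyzer name `my_analyzer` are hypothetical; `html_strip`, `standard`, `lowercase` and `stop` are built-in components.

```console
PUT my-index-04
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],       # 1. character filters
          "tokenizer": "standard",               # 2. tokenizer
          "filter": [ "lowercase", "stop" ]      # 3. token filters, applied in this order
        }
      }
    }
  }
}

POST my-index-04/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>The QUICK fox</p>"
}
# > tokens: "quick", "fox"
```

The order of the token filters matters: `stop` removes the lowercase English stopwords, so it is placed after `lowercase`.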
- _update_by_query takes the document stored in _source and uses it to re-index the data in the same index. This process increases the _version
- In the _reindex API we can specify an ingest pipeline (dest.pipeline) - doc
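A sketch of both calls with a pipeline attached. The index names and the pipeline name `my-pipeline` are hypothetical, and the pipeline must already exist.

```console
# Reindex through an ingest pipeline
POST _reindex
{
  "source": { "index": "my-source-index" },
  "dest":   { "index": "my-dest-index", "pipeline": "my-pipeline" }
}

# Re-process the documents of an index in place through the same pipeline
POST my-source-index/_update_by_query?pipeline=my-pipeline
```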
- Ingest pipeline + script
  1. Write and store the script under _scripts, with parameters
  2. Create a pipeline with a processor of type script and set the parameter values
  3. Call the pipeline using PUT <index>... ?pipeline=<pipName> or set it in the index settings using default_pipeline=<pipName>
  - Example for a quick look
    # ---
    # Create a dispatcher
    # using pipeline + stored script
    # ---
    PUT _scripts/my-script
    {
      "script": {
        "lang": "painless",
        "source": """
          String checkString = ctx[params['fieldToCheck']];
          if (checkString == params['checkValue']) {
            ctx["_index"] = params['destinationIndex'];
          }
        """
      }
    }
    PUT _ingest/pipeline/my-dispacher-pipeline
    {
      "processors": [
        {
          "script": {
            "id": "my-script",
            "params": {
              "fieldToCheck": "dispacher-type",
              "checkValue": "storic",
              "destinationIndex": "storic-index"
            }
          }
        }
      ]
    }
    # Note: params are set at pipeline level
    POST _ingest/pipeline/my-dispacher-pipeline/_simulate
    {
      "docs": [
        { "_source": { "my-keyword-field": "FOO", "dispacher-type": "storic" } },
        { "_source": { "my-keyword-field": "BAR" } }
      ]
    }
    # > "_index" : "storic-index"
    PUT storic-index
    DELETE my-index-01
    PUT my-index-01
    {
      "settings": {
        "number_of_shards": 1,
        "default_pipeline": "my-dispacher-pipeline"
      }
    }
    PUT my-index-01/_doc/01
    { "my-keyword-field": "FOO", "dispacher-type": "storic" }
    PUT my-index-01/_doc/02
    { "my-keyword-field": "FOO", "dispacher-type": "non-storic" }
    GET storic-index/_search
    # > "_id" : "01"
    GET my-index-01/_search
    # > "_id" : "02",
- Snapshots
  - 💡 If you register the same snapshot repository with multiple clusters, only one cluster should have write access to the repository (register it as readonly on the others)
  - Declare where snapshots can be stored (path.repo in elasticsearch.yml, for fs repositories) on each node and create a repository that uses it.
    Then you can take snapshots using the PUT /_snapshot API and, moreover, schedule them with snapshot lifecycle management (SLM) using the PUT /_slm/policy/ API.
    Note: everything can also be done with the Kibana UI.
  - Restore using the POST /_snapshot/<repoName>/<snapName>/_restore API; we can restore everything or cherry-pick only some indices
    - an existing index must be closed before it is restored from a snapshot
- Searchable snapshots
  - 💡 Searchable snapshots is the feature; it can be used as part of an ILM policy or we can mount a snapshot manually
    - mount = restore an index stored in a snapshot without creating new shards on the cluster, searching directly in the snapshot instead
    - ilm = we can include searchable snapshots (ss) inside the ILM phases (hot or cold, usually the latter)
      
      cold phase + ss = when the cold phase is reached, under the hood ILM:
      stores the index in a snapshot, deletes the original index, mounts the index from the snapshot as a new index (restored-<indexName>), creates an alias <index-name> --> restored-<indexName>
      
      Best practice is to reserve a clone of the snapshot only for mounting and the ss service
      
      🦂 Use the Kibana console (code) instead of the GUI. The Searchable snapshot button in the GUI is disabled and only shown under the cold section
      
      Follow the ILM process through the GET test-index-03/_ilm/explain API
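An ILM policy that uses a searchable snapshot in the cold phase could be sketched as follows. The policy name `my-ss-policy` and the ages are made up; `fs_bkp` is the repository registered in the snapshot example of these notes.

```console
PUT _ilm/policy/my-ss-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": { "rollover": { "max_age": "1d" } }
      },
      "cold": {
        "min_age": "7d",
        "actions": {
          "searchable_snapshot": { "snapshot_repository": "fs_bkp" }
        }
      }
    }
  }
}

# Check in which phase/action/step the index currently is
GET test-index-03/_ilm/explain
```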
- Cross-cluster (CC)
  - Create one-way connections between two ES clusters, for cross-cluster search or data replication
    - The node of the cluster that wants to establish the connection must have the remote_cluster_client role
    - The node chosen as seed node on the cluster to reach must be reachable on the transport port (:9300) and should be stable (better to choose a master)
  - CC replication
    - Copy indices (leader) to a remote cluster (follower).
      
      If we want to replicate idx1 on c1 as idxr1 on c2:
      
      c2 must have the remote_cluster_client role and connect to c1
      
      Use the PUT /idxr1/_ccr/follow API on c2
      
      The leader indices must have soft deletes enabled - API
      
      🦂 In the cluster that will have follower indices all nodes with the master node role must also have the remote_cluster_client role - doc
    - If an outage that does not recover automatically appears: pause & resume the follower index
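The c1/c2 scenario above could be sketched like this; all requests run on c2. `<c1-node-host>` is a placeholder for the transport address of a node of c1, and the alias `c1` for the remote cluster is an arbitrary choice.

```console
# Register c1 as a remote cluster
PUT /_cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "c1": { "seeds": [ "<c1-node-host>:9300" ] }
      }
    }
  }
}

# Check that the connection is established
GET /_remote/info

# Cross-cluster search: <remote alias>:<index>
GET /c1:idx1/_search

# Cross-cluster replication: create the follower index idxr1
PUT /idxr1/_ccr/follow?wait_for_active_shards=1
{
  "remote_cluster": "c1",
  "leader_index": "idx1"
}

# Pause & resume the follower after an outage
POST /idxr1/_ccr/pause_follow
POST /idxr1/_ccr/resume_follow
```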
- Security
  - Enable security on the cluster
    xpack.security.enabled: true
    xpack.license.self_generated.type: trial # <-- optional but good to have
  - Run ES, generate the passwords, create the Kibana keystore, add elasticsearch.password to the keystore, use kibana_system as username (kibana.yml → elasticsearch.username: kibana_system), run Kibana, log in with the generated elastic credentials, create a new role, create a new user that uses that role
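The last two steps (role and user) can also be done from the console. This is a sketch: the role name, the user name and the password placeholder are hypothetical, while `secret_index` is the index created in the security example of these notes.

```console
# Role that can only read secret_index
POST /_security/role/secret_reader
{
  "indices": [
    { "names": [ "secret_index" ], "privileges": [ "read" ] }
  ]
}

# User bound to that role
POST /_security/user/jdoe
{
  "password": "<choose-a-password>",
  "roles": [ "secret_reader" ],
  "full_name": "John Doe"
}

# Shows the user (and roles) of the credentials used for the request
GET /_security/_authenticate
```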
- Data streams
  - How to set up a data stream:
    - Create an ILM policy
    - Create an index template with the mandatory @timestamp field and data_stream:{}
      
      💡 Note that data_stream:{} is an index template parameter!
    - Create the data stream using the dedicated API: PUT _data_stream/<dataStreamName>
    - Use <dataStreamName> like a normal index; under the hood ES automatically rolls over the backing index and applies the ILM policy
  - The difference with “standard” ILM:
    We can obtain similar functionality without the specific data_stream parameter:
    - Create an ILM policy with rollover
    - Create an index template with the index.lifecycle.rollover_alias: <aliasName> parameter
    - Create an index that uses the template
      
      🦂💡 The index name must be in the form <index-name>-000001
    - Create an alias that links <aliasName> to <index-name>-000001 with "is_write_index": true
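The data stream steps above, as a compact sketch. The policy, template and data stream names are made up, and the rollover and delete conditions are arbitrary values for testing.

```console
# 1. ILM policy
PUT _ilm/policy/my-ds-policy
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_age": "1d", "max_docs": 1000 } } },
      "delete": { "min_age": "7d", "actions": { "delete": {} } }
    }
  }
}

# 2. Index template with data_stream:{} that applies the policy
PUT _index_template/my-ds-template
{
  "index_patterns": [ "my-ds-*" ],
  "data_stream": {},
  "priority": 500,
  "template": {
    "settings": { "index.lifecycle.name": "my-ds-policy" }
  }
}

# 3. Create the data stream (its name must match the template pattern)
PUT _data_stream/my-ds-01

# 4. Index into it: every document needs @timestamp
POST my-ds-01/_doc
{ "@timestamp": "2022-01-08T10:00:00Z", "message": "first event" }

# Inspect the backing indices
GET _data_stream/my-ds-01
```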
Less relevant topics
- To use the Data Visualizer to upload a file, at least one ingest node must be declared
- ssh {{username}}@{{remote_host}} to ssh as a specific user
- For a Debian installation, the doc lists the location of every directory (logs, config…)
- curl usages
```
# Tip: write the curl on vim and use 2nd CLI to run the script
# Tip: in vim ":set tabstop=2" for a better indentation
      
# Curl with body
# Tip: generate using a Kibana UI and adapt
curl -XPUT localhost:9200/test-index-01 -H 'Content-Type: application/json' -d'
{
  "mappings":
  {
    "properties":
    {
      "foo":
      {
        "type":"text"
      }
    }
  }
}'
      
# Curl with body and security enabled
curl --user elastic:LxZ9PHGTh07oOWhnwKjn -XPUT localhost:9200/test-index-01 -H 'Content-Type: application/json' -d'
{
  "mappings":
  {
    "properties":
    {
      "foo":
      {
        "type":"text"
      }
    }
  }
}'
```
- Remember to use size=0 with aggs if no query is provided
- We can connect a remote cluster through the Kibana UI
- Fast highlighting with the fvh highlighter, but it requires the field to be indexed with "term_vector": "with_positions_offsets", and this doubles the size of the field - doc
- If sort is specified and max_score is required, set "track_scores": true on the query
- There is a page named “Fix common cluster issues” under the “How To” section: a useful guide on how to resolve some cluster problems
- The path where snapshot files are stored is defined inside the repository and must be declared in each node's settings (elasticsearch.yml - see doc)
- In a hot-warm-cold architecture, the number of replicas for each phase is defined inside the ILM policy

🤝 Advices

Some exam advice and tips

Use Kibana shortcuts
- Use the Kibana shortcuts; the complete list is in the Kibana UI “help” window
  - ctrl + i → indent the block
  - ctrl + ↑ and ctrl + ↓ → navigate between blocks
  - ctrl + / → open the API documentation page
Search through documentation
- At the exam, the official documentation will be provided
- To better search through the documentation, expand all the sections of the official Guide and use the browser finder (ctrl + f)
  - Where to click to expand all sections: image
- We can also use the integrated website search, but if you are familiar with the documentation the browser-find approach above is faster

Use the API *common options*

API parameters that make working in Kibana easier; see the examples to understand how to use them - doc

Most useful:

?v - adds the column names to the output

Example

GET _cat/shards
# > .kibana_7.13.0_001                  0 p STARTED     105  5.1mb 172.20.0.5 es01
              
GET _cat/shards?v
# > index                               shard prirep state      docs  store ip         node
# > .kibana_7.13.0_001                  0     p      STARTED     105  5.1mb 172.20.0.5 es01

Make an index backup

During the exam you could mess up an index, so making a backup before running index changes is a good idea - idea from Guido Lena Cota post

# ---
# Clone an index
# ---
PUT kibana_sample_data_ecommerce/_settings
{
  "settings": {
    "index.blocks.write": true
  }
}
      
POST kibana_sample_data_ecommerce/_clone/bkp_kibana_sample_data_ecommerce
      
PUT kibana_sample_data_ecommerce/_settings
{
  "settings": {
    "index.blocks.write": false
  }
}

Useful bash commands

You will ssh into a VM, so it is better to know some useful commands

# Get VM users
$ cat /etc/passwd
  
# Get all processes
$ ps aux
  
# Run as <user>
$ su - <user>
# e.g. run a single command as that user: `su - elasticsearch -c "bin/elasticsearch"`

📔 Dictionary

Relevant Keywords/Concepts explanations

Closed index
- A closed index is blocked for read/write/search operations
- “A closed index is blocked for read/write operations and does not allow all operations that opened indices allow. […] resulting in a smaller overhead on the cluster.” - doc
- Usually you close an index before some maintenance operations (e.g.)
Data tier

“A data tier is a collection of nodes with the same data role that typically share the same hardware profile” - doc
(ECS) Elastic Common Schema
- “common set of fields to be used when storing event data in Elasticsearch, such as logs and metrics.” - doc
- With ECS we can use a standardised field mapping; this leads to better data analytics, charts, and other common goals (ECS fields integrate with several Elastic Stack features by default)
Heap size
- The JVM heap: area of memory used to store objects instantiated by applications running on the JVM. Objects in the heap can be shared between threads. - 📎 azul
- Elasticsearch automatically sets the heap size based on the node’s roles and total memory - doc
  - You can always override the heap size using the ES_JAVA_OPTS environment variable - doc:
    - E.g. on Docker it is useful to limit the heap memory using the env variable:
      "ES_JAVA_OPTS=-Xms512m -Xmx512m"
History retention
- “Elasticsearch keeps track of the operations it expects to need to replay in future using a mechanism called shard history retention leases.” - doc
- 💡 ES stores the operations (insertions and deletions) done on an index, so they can be replayed.
  In CCR the follower index fetches only the latest operations done on the leader index and replays them

⭐ Mapping *Fields* term

🔗 official doc

“index the same field in different ways for different purposes” - doc

Use the fields parameter inside the mapping of the field:

🖱️ Code example

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "city": {
        "type": "text",
        "fields": {
          "raw": {              # <--- we will refer as city.raw
            "type":  "keyword"
          }
        }
      }
    }
  }
}
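Once the multi-field is mapped, the two versions of the field serve different purposes. In the sketch below the text version (`city`) is used for full-text search and the keyword version (`city.raw`) for sorting and aggregating; the aggregation name `Cities` is arbitrary.

```console
GET my-index-000001/_search
{
  "query": {
    "match": { "city": "york" }
  },
  "sort": { "city.raw": "asc" },
  "aggs": {
    "Cities": {
      "terms": { "field": "city.raw" }
    }
  }
}
```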

Memory locking requested
- When you run an ES instance, for example in a new container, you could receive an error like this in the command line logs:
  bootstrap check failure [1] of [1]: memory locking requested for elasticsearch process but memory is not locked
  - To resolve this issue allow memory locking on the machine, e.g. for docker compose using the parameter
    ulimits:
      memlock:
        soft: -1
        hard: -1
- References:
  - ulimit set to -1 - so
⭐ Node roles
- Every running instance of Elasticsearch is a node - docs
- Usually, you start multiple nodes on different VMs with different hardware (e.g. for the Hot-Warm-Cold Architecture, or for ML purposes)
- 🦂 If you set custom node.roles, ensure you specify every node role your cluster needs - docs
  - e.g. if you don’t use the data role, be sure to have defined both data_content and data_hot
- List of available node roles
  - 💡 data_content is the role of the nodes preferred for data that doesn’t fit a time series - docs
Realm

“The authentication process is handled by one or more authentication services called realms” - doc
- A realm is used to resolve and authenticate users based on authentication tokens. - doc
- The system that stores and checks the user credentials. There are internal and external realms (external realms like kerberos require interaction between ES and third parties)
Remote recovery process
- In CCR (Cross Cluster Replication) it is the process of copying data from the leader index to the follower
- To get information about an in-progress remote recovery use the cat recovery API
Rollover
- “Creates a new index for a data stream or index alias.” - doc
- ELI5: create a new index, assign it the same alias as the “old” index, and move the new writes to this index - reddit
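A manual rollover sketch on an alias. The index and alias names are hypothetical, and the conditions are arbitrary values.

```console
# Bootstrap: first index of the series + write alias
PUT my-index-000001
{
  "aliases": { "my-alias": { "is_write_index": true } }
}

# Check the conditions without performing the rollover
POST my-alias/_rollover?dry_run
{
  "conditions": { "max_age": "7d", "max_docs": 1000 }
}

# Roll over if at least one condition is met: the new index is my-index-000002
POST my-alias/_rollover
{
  "conditions": { "max_age": "7d", "max_docs": 1000 }
}
```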
Runtime Fields
- A runtime field is a field that is evaluated at query time - docs
- 💡 Useful for adding fields to existing documents without reindexing your data
- Defined at mapping time or query time
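The two places where a runtime field can be defined, sketched on a hypothetical index `my-index-rt` with a `@timestamp` date field. The script follows the day-of-week example of the official documentation.

```console
# Defined at mapping time
PUT my-index-rt
{
  "mappings": {
    "runtime": {
      "day_of_week": {
        "type": "keyword",
        "script": {
          "source": "emit(doc['@timestamp'].value.dayOfWeekEnum.getDisplayName(TextStyle.FULL, Locale.ROOT))"
        }
      }
    },
    "properties": {
      "@timestamp": { "type": "date" }
    }
  }
}

# Defined at query time: exists only for this request
GET my-index-rt/_search
{
  "size": 0,
  "runtime_mappings": {
    "hour_of_day": {
      "type": "long",
      "script": { "source": "emit(doc['@timestamp'].value.getHour())" }
    }
  },
  "aggs": {
    "per_hour": { "terms": { "field": "hour_of_day" } }
  }
}
```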
Soft deletes
- See the history retention dictionary entry, they are closely related
- Soft deletes is the ES feature that history retention relies on
Seed node | Seed hosts

Official doc
- Inside each node configuration (elasticsearch.yml) there is a setting named discovery.seed_hosts that takes a list of host names.
  This host list is used to join the cluster and for the cluster formation
- “In short discovery.seed_hosts is the list of master nodes” - so
Segments

Info from medium article
- “The Lucene index is divided into smaller files called segments.
  A segment is a small Lucene index. Lucene searches in all segments sequentially.”
- Segments are immutable
  - More segments: slower searches (because ES searches them sequentially)
    - So you can merge: “During a merge, Lucene takes 2 segments, and moves the content into a third, new one”
    - The merge also allows us to not copy the “deleted” documents into the new segment
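Segments can be inspected and merged by hand. A short sketch on a hypothetical index `my-index`:

```console
# List the segments of each shard (docs.count, docs.deleted, size, ...)
GET _cat/segments/my-index?v

# Merge down to a single segment per shard (better on indices that are no longer written)
POST my-index/_forcemerge?max_num_segments=1

# Or only rewrite the segments that contain deleted documents
POST my-index/_forcemerge?only_expunge_deletes=true
```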
Shard Doc (search field)

🔗 Official doc
- “The _shard_doc value is the combination of the shard index within the PIT and the Lucene’s internal doc ID, it is unique per document and constant within a PIT”
- Used inside the _search API to paginate using the search after functionality
Shrink
- “Shrinks an existing index into a new index with fewer primary shards.” - doc
  - This is because the number of primary shards of an index cannot be changed after its creation - doc
- Note: there are some important things to check before shrinking (e.g. disk space) - doc
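The shrink procedure, sketched with hypothetical index names. `<node_name>` is a placeholder for the node that will hold a copy of every shard, and the target number of shards must be a factor of the source one.

```console
# 1. Prepare the source index: one copy of every shard on the same node + block writes
PUT /my-index/_settings
{
  "settings": {
    "index.number_of_replicas": 0,
    "index.routing.allocation.require._name": "<node_name>",
    "index.blocks.write": true
  }
}

# 2. Shrink into a new index and clear the settings copied from the source
POST /my-index/_shrink/my-index-shrunk
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1,
    "index.routing.allocation.require._name": null,
    "index.blocks.write": null
  }
}
```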
Soft delete

The underlying mechanism used to retain these operations (history of individual write operations) is soft deletes. - doc
- ES maintains a history of the individual write operations, useful for example to update the follower index in a Cross Cluster Replication architecture.
  The mechanism that retains this history is called soft deletes - doc
Split brain
- After a network error, we could reach a situation where two nodes think they’re the master of the cluster - doc
- Avoid it:
  - using an odd number of master nodes - doc
  - sizing the minimum_master_nodes parameter (needed only in versions before 7.0) - doc
- 💡 It is no longer a problem to take into account:
  “No matter how it is configured, Elasticsearch will not suffer from a “split-brain” inconsistency.” - doc
X-pack
- X-Pack is an Elastic Stack extension
- Provides security, alerting, monitoring, reporting, machine learning, and many other capabilities.
- X-Pack is open, but not everything is free
  - “Many features in X-Pack are free […], some features in X-Pack are paid” - link
    - You always have a 30-day trial
    - Here is the list of what is free and what is paid

🙏 Resources

Useful online resources
- Preparing for the Elastic Certified Engineer Exam - Get Elasticsearch Certified - youtube
- ⭐ Elastic Certified Engineer Exam - My Experience and How I Prepped - linkedin
- ⭐ Guido Lena Cota medium posts (2019)
  - Elastic Certified Engineer Exam — what to expect and how to rock it - medium
  - Exercises for the Elastic Certified Engineer Exam: Deploy and Operate a Cluster - medium
  - Exercises for the Elastic Certified Engineer Exam: Store Data into Elasticsearch - medium
  - Exercises for the Elastic Certified Engineer Exam: Model Data into Elasticsearch - medium
  - Exercises for the Elastic Certified Engineer Exam: Search and Aggregations - medium
- Querying and aggregating time series data in Elasticsearch - ES blog
- Designing the Perfect Elasticsearch Cluster: the (almost) Definitive Guide - medium
- Official Elasticsearch examples - GitHub
- Searchable Snapshots - Daily Elastic Byte S01E14 - yt
- Troubleshooting Elasticsearch ILM: Common issues and fixes - blog
- 📚 Books
  - Old but gold: Elasticsearch: The Definitive Guide - physical book, online version
    - 🦂 A lot of the new features were created after the book (first edition: 2015) and some APIs used in the book are now deprecated.
    - 💡 Anyway, the book is really well written and has some meaningful insights and descriptions about the internal operation of Elasticsearch that aren’t version-specific and are not easily deducible from the official documentation
  - A book about running Elasticsearch: running-elasticsearch-fun-profit - web, github