OpenSearch

Índex / Index

General

  • OpenSearch
    • Introduction to OpenSearch
      • OpenSearch Database
        index = table
        document (JSON) = row
        shard: each shard stores a subset of all documents in an index

      • Index configured with:
        • mapping: a collection of fields and the types of those fields
        • settings: include index data like the index name, creation date, and number of shards
    • ...
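As a rough sketch of the mapping/settings split above: the body sent when creating an index combines both. Field names and values here are illustrative, not taken from any real cluster.

```python
# Illustrative create-index body (hypothetical fields):
# "mappings" declares fields and their types; "settings" holds
# index-level data such as the number of shards.
index_body = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1,
    },
    "mappings": {
        "properties": {
            "name": {"type": "text"},
            "grad_year": {"type": "date"},
        }
    },
}

print(index_body["mappings"]["properties"]["grad_year"]["type"])  # date
```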

Servidor / Server

  • Getting started
    • Installation quickstart (using Docker)
      1. system setup
        • sudo -i
        • swapoff -a
        • echo "vm.max_map_count=262144" >>/etc/sysctl.conf
        • sysctl -p
      2. download compose file
        • mkdir ~/opensearch
        • cd ~/opensearch
        • curl -O https://raw.githubusercontent.com/opensearch-project/documentation-website/2.18/assets/examples/docker-compose.yml
      3. set admin password
        • cd ~/opensearch
        • echo "OPENSEARCH_INITIAL_ADMIN_PASSWORD=xxxxxx" >.env
      4. start
        • docker compose up -d
      5. verify (3 lines should appear)
        • docker compose ps
          • if the three lines do not appear, you need to redo the actions from step 1 (system setup)
      6. dashboards:
        • open http://localhost:5601 (user admin, password set in step 3)
      7. Experiment with sample data

        • mapping
          • generate your own from an existing index:
            • elasticdump --debug --input=https://master:xxx@<my_cluster_host>/myindex --output=myindex_mappings.json --type=mapping
          • or download a sample: ecommerce-field_mappings.json
          • apply the mapping:
            • curl -H "Content-Type: application/json" -X PUT "https://localhost:9200/ecommerce" -ku admin:<custom-admin-password> --data-binary "@ecommerce-field_mappings.json"
            • curl -H "Content-Type: application/json" -X PUT "https://localhost:9200/myindex" -ku admin:<custom-admin-password> --data-binary "@myindex_mappings.json"
        • data
          • generate your own from an existing index:
            • elasticdump --debug --input=https://master:xxx@<my_cluster_host>/myindex --output=myindex.ndjson
          • or download a sample: ecommerce.ndjson
          • ingest the data:
            • curl -H "Content-Type: application/x-ndjson" -X PUT "https://localhost:9200/ecommerce/_bulk" -ku admin:<custom-admin-password> --data-binary "@ecommerce.ndjson"
            • curl -H "Content-Type: application/x-ndjson" -X PUT "https://localhost:9200/myindex/_bulk" -ku admin:<custom-admin-password> --data-binary "@myindex.ndjson"
  • Import / Export
  • Managing indexes
    • CRUD



        bulk
        template: create index template
        • create index template:
          • ...


        template: create data stream template
        • create data stream template:
          • PUT _index_template/<datastream_template_name>
          • {
              "index_patterns": "logs-nginx",
              "data_stream": {
                "timestamp_field": {
                  "name": "request_time"
                }
              },
              "priority": 200,
              "template": {
                "settings": {
                  "number_of_shards": 1,
                  "number_of_replicas": 0
                }
              }
            }

        index: create index
        • only needed if parameters are non-default
          • PUT <index>
            { "settings": { "number_of_shards": 6, "number_of_replicas": 2 } }


        rollover index or datastream
        (can be automated with ISM)
        • rollover:
          • POST <index_or_datastream>/_rollover

        data stream: create data stream
        • create explicit data stream
          (will use matching datastream template, if any; error if no matching datastream template):
          • PUT _data_stream/<datastream_name>
        • create implicit data stream by creating a document in a new index:
          • ...


        retrieve data stream
        • retrieve info about all datastreams:
          • GET _data_stream
        • retrieve info about a datastream:
          • GET _data_stream/<datastream_name>
        • retrieve stats about a datastream:
          • GET _data_stream/<datastream_name>/_stats


        delete data stream
        • delete a data stream:
          • DELETE _data_stream/<name_of_data_stream>

        document: create documents
        • if index:
          1. exists:
            • a document will be added to existing index
          2. matches an index template (among several matching templates, the highest "priority" wins):
            • specified index will be created, with settings from template
          3. matches a data stream template:
            1. a data stream will be created: <index>
            2. a backing index will be created (.ds-<index>-00001)
          4. does not match a template:
            • specified index will be created, with default settings
        • specifying id:
          • PUT <index>/_doc/<id>
            { "A JSON": "document" }
        • without specifying id:
          • POST <index>/_doc
            { "A JSON": "document" }
        • bulk (using ndjson)
          • POST _bulk
            { "index": { "_index": "<index>", "_id": "<id>" } }
            { "A JSON": "document" }
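The _bulk body above is NDJSON: an action line, then a source line, per document, terminated by a newline. A minimal Python sketch (with made-up documents) of assembling such a payload:

```python
import json

def build_bulk_body(index, docs):
    """Assemble an NDJSON _bulk payload: for each (id, source) pair,
    emit an action line followed by the document line."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # a _bulk body must end with a newline

body = build_bulk_body("students", [
    ("1", {"name": "John Doe"}),
    ("2", {"name": "Jane Doe"}),
])
print(body)
```

Sending it would be POST _bulk with Content-Type: application/x-ndjson.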

        retrieve documents
        • specifying id:
          • GET <index>/_doc/<id>
        • multiple documents with all fields:
          • GET _mget
            {
              "docs": [
                {
                  "_index": "<index>",
                  "_id": "<id>"
                },
                {
                  "_index": "<index>",
                  "_id": "<id>"
                }
              ]
            }
        • multiple documents with selected fields:
          • GET _mget
            {
              "docs": [
                {
                  "_index": "<index>",
                  "_id": "<id>",
                  "_source": "field1"
                },
                {
                  "_index": "<index>",
                  "_id": "<id>",
                  "_source": "field2"
                }
              ]
            }

        search documents
        • search documents:
          • GET <index>/_search
            {
              "query": {
                "match": {
                  "message": "login"
                }
              }
            }


        check documents
        • verify whether a document exists:
          • HEAD <index>/_doc/<id>


        update documents
        • total update (replace), specifying id
          (same as creating a new document with the same id): 
          • PUT <index>/_doc/<id>
            { "A JSON": "document" }
        • partial update, specifying id:
          • POST <index>/_update/<id>
            {
              "doc": { "A JSON": "document" }
            }
        • conditional update (upsert), specifying id
          (if it exists: update its info with doc; if it does not exist: create a document with upsert):
          • POST movies/_update/2
            {
              "doc": {
                "title": "Castle in the Sky"
              },
              "upsert": {
                "title": "Only Yesterday",
                "genre": ["Animation", "Fantasy"],
                "date": 1993
              }
            }
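The upsert behavior above can be sketched with an in-memory dict standing in for the index (API semantics only, not a client call):

```python
def update_with_upsert(store, doc_id, doc, upsert):
    """Mimic POST <index>/_update/<id> with "doc" + "upsert":
    if the document exists, merge "doc" into it; otherwise index "upsert"."""
    if doc_id in store:
        store[doc_id].update(doc)
    else:
        store[doc_id] = dict(upsert)
    return store[doc_id]

movies = {}
# First call: id 2 does not exist yet, so the "upsert" document is indexed.
update_with_upsert(movies, "2",
                   {"title": "Castle in the Sky"},
                   {"title": "Only Yesterday",
                    "genre": ["Animation", "Fantasy"],
                    "date": 1993})
print(movies["2"]["title"])   # Only Yesterday
# Second call: id 2 now exists, so "doc" is merged into it.
update_with_upsert(movies, "2", {"title": "Castle in the Sky"}, {})
print(movies["2"]["title"])   # Castle in the Sky
```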


        delete documents
        • delete a document, specifying id:
          • DELETE <index>/_doc/<id>

    • Templates
      • when an index or a data stream is created (explicitly, or implicitly when a document is created), OpenSearch checks whether the name matches any template. If it matches, the index or data stream is created with the configuration specified in the template
      • Tipus
        • Index template: useful, for example, when AWS Firehose automatically creates indexes with rotation (daily, weekly, monthly...)
        • Data stream template: configures a set of indexes as a data stream
    • Data streams
      • "A data stream is internally composed of multiple backing indexes. Search requests are routed to all the backing indexes, while indexing requests are routed to the latest write index. ISM policies let you automatically handle index rollovers or deletions."
      • one of the fields must be the timestamp field ("@timestamp" by default; configurable via timestamp_field, as in the template above)
      • Info
      • steps
        1. create a data stream template
        2. create a data stream
        3. ingest data into data stream
        4. search documents
        5. rollover a data stream
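A sketch of the backing-index naming and rollover described above (.ds-<index>-00001); the zero-padding follows the example in these notes:

```python
def backing_index_name(stream, generation):
    # Backing indexes are named .ds-<stream>-<generation>;
    # padding follows the .ds-<index>-00001 example above.
    return f".ds-{stream}-{generation:05d}"

def rollover(current):
    """Return the next backing index name, bumping the generation number."""
    stream, gen = current.rsplit("-", 1)
    return f"{stream}-{int(gen) + 1:05d}"

name = backing_index_name("logs-nginx", 1)
print(name)            # .ds-logs-nginx-00001
print(rollover(name))  # .ds-logs-nginx-00002
```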
  • Cluster
  • Shards and nodes
    • Each shard stores a subset of all documents in an index
      (figure: index and its shards)
    • Shards are used for even distribution across nodes in a cluster. A good rule of thumb is to limit shard size to 10–50 GB. (index 1: split into 2 shards; index 2: split into 4 shards)
      (figure: cluster)
    • Primary and replica shards (index 1: 2 primary shards + 2 replica shards; index 2: 4 primary shards + 4 replica shards). Default: 5 primary shards + 1 replica = 10 shards
      (figure: cluster with replicas)
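The shard counts above follow from: total shards = primaries * (1 + replicas). A one-line sketch:

```python
def total_shards(primaries, replicas):
    # Each replica is a full copy of every primary shard.
    return primaries * (1 + replicas)

# Default mentioned above: 5 primaries + 1 replica = 10 shards
print(total_shards(5, 1))  # 10
print(total_shards(2, 1))  # index 1: 2 primaries + 2 replicas = 4
print(total_shards(4, 1))  # index 2: 4 primaries + 4 replicas = 8
```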
  • ...

Seguretat / Security

Clients

  • Clients
    • OpenSearch Dashboards
      • Self-hosted
      • Amazon OpenSearch Service
        • Go to details and click on url
        • Problemes / Problems
          • {"Message":"User: anonymous is not authorized to perform: es:ESHttpGet with an explicit deny in a resource-based policy"}
            • Solució / Solution
              • Amazon OpenSearch Service / Domains
                • select your domain and go to tab "Security configuration"
                • Access policy:
                  • ...
                    change "Effect": "Deny" to "Effect": "Allow"
      • OpenSearch Dashboards quickstart guide
      • Dark mode
        • Management / Dashboards Management / Advanced settings / Appearance / Dark mode
      • Dev Tools console
      • Discover
      • ...
    • REST API
    • Python
    • ...

  • REST API (curl -X ...) vs Dev Tools (OpenSearch Dashboards)
    health
    • curl: GET "https://localhost:9200/_cluster/health"
    • Dev Tools: GET _cluster/health
    index a document
    (add an entry to an index)
    (index students is automatically created)
    • generic: PUT https://<host>:<port>/<index-name>/_doc/<document-id>
    • Dev Tools: PUT /students/_doc/1
    {
      "name": "John Doe",
      "gpa": 3.89,
      "grad_year": 2022
    }
    dynamic mapping

    GET /students/_mapping
    Search your data

    GET /students/_search
    GET /students/_search
    {
      "query": {
        "match_all": {}
      }
    }

    Updating documents (total update, i.e. replace)

    PUT /students/_doc/1
    {
      "name": "John Doe",
      "gpa": 3.91,
      "grad_year": 2022,
      "address": "123 Main St."
    }

    Updating documents (partial update)
    POST /students/_update/1
    {
      "doc": {
        "gpa": 3.74,
        "address": "123 Main St."
      }
    }
    Delete a document
    DELETE /students/_doc/1
    Delete index

    DELETE /students
    Index mapping and settings
    PUT /students
    {
      "settings": {
        "index.number_of_shards": 1
      },
      "mappings": {
        "properties": {
          "name": {
            "type": "text"
          },
          "grad_year": {
            "type": "date"
          }
        }
      }
    }

    GET /students/_mapping
    Bulk ingestion
    • curl: POST "https://localhost:9200/_bulk" -H 'Content-Type: application/json' -d'
    { "create": { "_index": "students", "_id": "2" } }
    { "name": "Jonathan Powers", "gpa": 3.85, "grad_year": 2025 }
    { "create": { "_index": "students", "_id": "3" } }
    { "name": "Jane Doe", "gpa": 3.52, "grad_year": 2024 }
    '
    POST _bulk
    { "create": { "_index": "students", "_id": "2" } }
    { "name": "Jonathan Powers", "gpa": 3.85, "grad_year": 2025 }
    { "create": { "_index": "students", "_id": "3" } }
    { "name": "Jane Doe", "gpa": 3.52, "grad_year": 2024 }
    Ingest from local JSON files (sample mapping)
    • curl -H "Content-Type: application/json" -X PUT "https://localhost:9200/ecommerce" -ku admin:<custom-admin-password> --data-binary "@ecommerce-field_mappings.json"
    Ingest from local JSON files (sample data)
    • curl -H "Content-Type: application/x-ndjson" -X PUT "https://localhost:9200/ecommerce/_bulk" -ku admin:<custom-admin-password> --data-binary "@ecommerce.ndjson"
    Query
    GET /ecommerce/_search
    {
      "query": {
        "match": {
          "customer_first_name": "Sonya"
        }
      }
    }
    Query string queries
    GET /students/_search?q=name:john
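A query-string query like q=name:john is shorthand for a query on a field. A toy sketch of the simplest field:value form (real query-string syntax supports operators, wildcards, ranges, etc.; the default_field name below is illustrative):

```python
def parse_query_string(q, default_field="message"):
    """Toy parser for the simplest query-string forms:
    "field:value", or a bare "value" searched in a default field."""
    field, sep, value = q.partition(":")
    if not sep:
        return {"match": {default_field: q}}
    return {"match": {field: value}}

print(parse_query_string("name:john"))  # {'match': {'name': 'john'}}
print(parse_query_string("john"))       # {'match': {'message': 'john'}}
```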



  • Ingest your data into OpenSearch
  • Search your data
  • ...

Índexs / Indexes

Query DSL

  • query
    • Boolean query
      • must: AND
      • must_not: NOT
      • should: OR
      • filter: AND (filter context: no relevance scoring)
      • example:
        GET _search
        {
          "query": {
            "bool": {
              "must": [
                {}
              ],
              "must_not": [
                {}
              ],
              "should": [
                {}
              ],
              "filter": {}
            }
          }
        }
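The must/must_not/should/filter semantics above can be sketched as a predicate evaluator (scoring ignored; the minimum_should_match default is simplified to: should is only required when there are no must/filter clauses):

```python
def bool_matches(doc, must=(), must_not=(), should=(), filter=()):
    """Evaluate bool-query clause semantics for one document.
    Each clause is a predicate doc -> bool; relevance scoring is ignored."""
    if not all(p(doc) for p in must):       # must: AND
        return False
    if any(p(doc) for p in must_not):       # must_not: NOT
        return False
    if not all(p(doc) for p in filter):     # filter: AND, no scoring
        return False
    if should and not (must or filter):     # should: OR (only required
        return any(p(doc) for p in should)  # when nothing else constrains)
    return True

doc = {"message": "failed login", "status": 401}
print(bool_matches(doc,
                   must=[lambda d: "login" in d["message"]],
                   must_not=[lambda d: d["status"] == 200]))  # True
```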

    Filter context vs Query context

    Term-level queries (typically filter context):
    • no relevance
    • cached
    • exact matches
    • not for text (except keyword)

    Full-text queries (query context):
    • relevance
    • not cached
    • non-exact matches
    • for text

    term
    • value
    • boost
    • case_insensitive

    terms

    terms_set
    • terms
    • minimum_should_match_field
    • minimum_should_match_script
    • boost

    ids
    • values
    • boost

    range
    • operators
      • gte
      • gt
      • lte
      • lt
    • format
    • relation
    • boost
    • time_zone

    prefix
    • value
    • boost
    • case_insensitive
    • rewrite

    exists
    • boost

    fuzzy
    • value
    • boost
    • fuzziness
    • max_expansions
    • prefix_length
    • rewrite
    • transpositions
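fuzziness counts edits under the Damerau-Levenshtein distance. A plain Levenshtein sketch (without the transpositions option, so a swap of adjacent characters costs 2 edits here):

```python
def edit_distance(a, b):
    """Classic Levenshtein distance (insert / delete / substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(edit_distance("login", "logni"))  # 2 (a transposition costs 2 here)
print(edit_distance("wines", "wine"))   # 1
```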

    wildcard
    • value
    • boost
    • case_insensitive
    • rewrite

    regexp
    • value
    • boost
    • case_insensitive
    • flags
    • max_determinized_states
    • rewrite

    intervals (rules, each with its own parameters):
    • match
      • query
      • analyzer
      • filter
      • max_gaps
      • ordered
      • use_field
    • prefix
      • ...
    • wildcard
    • fuzzy
    • all_of
    • any_of

    match
    • query
    • auto_generate_synonyms_phrase_query
    • analyzer
    • boost
    • enable_position_increments
    • fuzziness
    • fuzzy_rewrite
    • fuzzy_transpositions
    • lenient
    • max_expansions
    • minimum_should_match
    • operator
    • prefix_length
    • zero_terms_query

    match_bool_prefix
    • query
    • analyzer
    • fuzziness
    • fuzzy_rewrite
    • fuzzy_transpositions
    • max_expansions
    • minimum_should_match
    • operator
    • prefix_length

    match_phrase
    • query
    • analyzer
    • slop
    • zero_terms_query

    match_phrase_prefix
    • query
    • analyzer
    • max_expansions
    • slop

    multi_match
    • query
    • auto_generate_synonyms_phrase_query
    • analyzer
    • boost
    • fields
    • fuzziness
    • fuzzy_rewrite
    • fuzzy_transpositions
    • lenient
    • max_expansions
    • minimum_should_match
    • operator
    • prefix_length
    • slop
    • tie_breaker
    • type
    • zero_terms_query

    query_string
    • query
    • allow_leading_wildcard
    • analyze_wildcard
    • analyzer
    • auto_generate_synonyms_phrase_query
    • boost
    • default_field
    • default_operator
    • enable_position_increments
    • fields
    • fuzziness
    • fuzzy_max_expansions
    • fuzzy_transpositions
    • max_determinized_states
    • minimum_should_match
    • phrase_slop
    • quote_analyzer
    • quote_field_suffix
    • rewrite
    • time_zone

    simple_query_string

    aggs


  • ...

Anàlisi de text / Text analysis

  • Mapping parameters
  • Analyzer:
    • source text -> 1. char_filter -> 2. tokenizer -> 3. token filter -> terms
    • Classification:
    • Testing an analyzer
    • Exemples / Examples
      • url
        • Analyze URL paths to search individual elements in Amazon Elasticsearch Service
        • PUT scratch_index
          {
            "settings": {
              "analysis": {
                "char_filter": {
                  "my_clean": {
                    "type": "mapping",
                    "mappings": ["/ => \\u0020",
                                 "s3: => \\u0020"]
                  }
                },
                "tokenizer": {
                  "my_tokenizer": {
                    "type": "simple_pattern",
                    "pattern": "[a-zA-Z0-9\\.\\-]*"
                  }
                },
                "analyzer": {
                  "s3_path_analyzer": {
                    "char_filter": ["my_clean"],
                    "tokenizer": "my_tokenizer",
                    "filter": ["lowercase"]
                  }
                }
              }
            },
            "mappings": {
                "properties": {
                  "s3_key": {
                    "type": "text",
                    "analyzer": "s3_path_analyzer"
                  }
                }
            }
          }
        • PUT scratch_index
          {
            "settings": {
              "analysis": {
                "char_filter": {
                  "url_clean": {
                    "type": "mapping",
                    "mappings": ["/ => \\u0020",
                                 "https: => \\u0020"]
                  }
                },
                "tokenizer": {
                  "url_tokenizer": {
                    "type": "simple_pattern",
                    "pattern": "[a-zA-Z0-9\\.\\-]*"
                  }
                },
                "analyzer": {
                  "url_path_analyzer": {
                    "char_filter": ["url_clean"],
                    "tokenizer": "url_tokenizer",
                    "filter": ["lowercase"]
                  }
                }
              }
            },
            "mappings": {
                "properties": {
                  "my_url_field": {
                    "type": "text",
                    "analyzer": "url_path_analyzer"
                  }
                }
            }
          }
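The char_filter -> tokenizer -> token filter pipeline of the URL analyzer above can be mimicked in plain Python; this is only an approximation of what the real analyzer produces:

```python
import re

def url_path_analyzer(text):
    # char_filter "url_clean": map "https:" and "/" to spaces
    for pattern, repl in (("https:", " "), ("/", " ")):
        text = text.replace(pattern, repl)
    # tokenizer "url_tokenizer": keep runs matching [a-zA-Z0-9.-]
    tokens = re.findall(r"[a-zA-Z0-9.\-]+", text)
    # token filter: lowercase
    return [t.lower() for t in tokens]

print(url_path_analyzer("https://Example.COM/Docs/Intro.html"))
# ['example.com', 'docs', 'intro.html']
```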
  • Normalizer
  • set / get
    PUT my_index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "my_char_filter": {}
          },
          "tokenizer": {
            "my_tokenizer": {}
          },
          "filter": {
            "my_filter": {}
          },
          "analyzer": {
            "my_analyzer": {
              "type": "custom",
              "char_filter": ["my_char_filter"],
              "tokenizer": "my_tokenizer",
              "filter": ["my_filter"]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "my_field": {
            "analyzer": "my_analyzer"
          }
        }
      }
    }
    GET my_index/_settings
    GET my_index/_mapping
  • ...

http://www.francescpinyol.cat/opensearch.html
Primera versió: / First version: 9.XI.2024
Darrera modificació: 19 de gener de 2025 / Last update: 19th January 2025

