4 Methods of Data Modeling in Elasticsearch with Pythonic Code Samples

In this blog post, you will learn about four methods of data modeling in Elasticsearch, with concrete code samples and the benefits and tradeoffs of each method. The four methods are application side joins, data denormalization, nested objects, and parent-child modeling.

Method 1 – Application Side Joins

Application side joins involve joining documents programmatically within the application (Zhao, 2022). Consider modeling books and authors as a many-to-many relationship, where a book can have many authors and an author can write many books. For an application side join, books and authors are stored in separate indices, and each document holds the ids of its related documents. The following are schemas for a Book document and an Author document, respectively:

Book:
Title: string
AuthorIds: string[]

Author:
Name: string
BookIds: string[]

The following book and author JSON documents are modeled for application side join:

Book Document

{
    "title" : "Discipline Equals Freedom", 
    "authorIds" : ["1"] 
}

Author Document

{
    "name" : "Jocko Willink", 
    "bookIds" : ["1"] 
}

The following curl commands insert a book document into the book index and an author document into the author index. (Note: We assume HTTPS and HTTP Basic Authentication are enabled on the Elasticsearch instance. For more information about securely accessing Elasticsearch, you can read my other blog post, Securely and Programmatically Accessing Elasticsearch with curl and Python.)

curl --cacert tmp/elasticsearch-8.11.1/config/certs/http_ca.crt -u elastic:fL1F3_72R7tD8KwlvjQC -H "Content-Type: application/json" -XPUT "https://localhost:9200/book/_doc/1" -d '{"title" : "Discipline Equals Freedom", "authorIds" : ["1"] }'

curl --cacert tmp/elasticsearch-8.11.1/config/certs/http_ca.crt -u elastic:fL1F3_72R7tD8KwlvjQC -H "Content-Type: application/json" -XPUT "https://localhost:9200/author/_doc/1" -d '{"name" : "Jocko Willink", "bookIds" : ["1"] }'

To find the books of an author with an application side join, you must first query the author index to get the bookIds, and then run a separate query to fetch the book for each book id. The following code sample demonstrates performing an application side join:

from elasticsearch import Elasticsearch


ELASTIC_PASSWORD = "fL1F3_72R7tD8KwlvjQC"


def main():
   es = Elasticsearch("https://localhost:9200",
       ca_certs="/Users/gcdrocella/tmp/elasticsearch-8.11.1/config/certs/http_ca.crt",
       basic_auth=("elastic", ELASTIC_PASSWORD))
  


   # Query 1 - query for the author
   author_resp = es.search(index="author", query= {
       "term" : {
           "name.keyword" : "Jocko Willink"
       }
   })


   author_hits = author_resp["hits"]["hits"]


   for author_hit in author_hits:
       author_source = author_hit["_source"]


       for bookId in author_source["bookIds"]:
           # Query 2 - query for the book


           book_resp = es.get(index="book", id=bookId)
           book_source = book_resp["_source"]
           print(author_source["name"] + " authored " + book_source["title"])
          
main()

This approach works well when there are only a few documents; however, as you can imagine, the sample Python script becomes more complicated when pagination is required. Also, because the join issues multiple queries (here, one per book id), there is a performance cost that grows with the number of related documents.
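
One way to soften the multiple-query cost is to fetch all referenced books in a single round trip. The following is a minimal sketch rather than part of the original sample: the helper names `unique_book_ids` and `fetch_books` and the sample hits are hypothetical, and `fetch_books` assumes an 8.x Python client object whose `mget` method accepts `index` and `ids`.

```python
def unique_book_ids(author_hits):
    """Collect bookIds from author search hits, keeping order and dropping duplicates."""
    seen = set()
    ordered = []
    for hit in author_hits:
        for book_id in hit["_source"].get("bookIds", []):
            if book_id not in seen:
                seen.add(book_id)
                ordered.append(book_id)
    return ordered


def fetch_books(es, author_hits, index="book"):
    """Fetch every referenced book with one multi-get instead of one get per id."""
    ids = unique_book_ids(author_hits)
    if not ids:
        return {}
    resp = es.mget(index=index, ids=ids)
    # Skip ids that were not found; only found docs carry a _source.
    return {doc["_id"]: doc["_source"] for doc in resp["docs"] if doc.get("found")}


# Offline illustration with hand-written hits shaped like a search response.
sample_hits = [
    {"_source": {"name": "Jocko Willink", "bookIds": ["1", "2"]}},
    {"_source": {"name": "Hypothetical Co-Author", "bookIds": ["2", "3"]}},
]
print(unique_book_ids(sample_hits))  # ['1', '2', '3']
```

With a connected client, `fetch_books(es, author_resp["hits"]["hits"])` would return a dictionary keyed by book id, so the join itself still happens in the application but with two requests in total.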

Method 2 – Data Denormalization

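Denormalizing the data involves flattening and joining the inner objects into a single document when Elasticsearch indexes the document (Zhao, 2022). To learn more about the specifics of flattening, you can read the following: https://www.elastic.co/guide/en/elasticsearch/reference/current/flattened.html. Note that this page covers the flattened field type, which is a dedicated mapping type; the default flattening of arrays of objects is described in the nested field type documentation (“Nested field type”, n.d.).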

It’s also important to note that denormalization creates redundant data in the documents since there are multiple copies of the data (Kashyap, 2020). An example schema that denormalizes data for modeling books and authors is the following:

Book:
Title: string
Authors: [
    {
        Name: string
    }
]

This means that a copy of the author object is stored in every book the author has written in the book index. The document models a one-to-many relationship where one book has many authors. Consider the following documents stored in the book index, which demonstrate that the author object is copied into multiple book documents:

{
    "title" : "Discipline Equals Freedom",
    "authors" : [ 
        {
            "name" : "Jocko Willink"
        }
    ] 
}

{
    "title" : "Leadership Strategy and Tactics",
    "authors" : [ 
        {
            "name" : "Jocko Willink" 
        } 
    ] 
}

The following curl commands insert the book documents into the book index (if you ran the Method 1 example, the document with id 1 in the book index is overwritten):

curl --cacert tmp/elasticsearch-8.11.1/config/certs/http_ca.crt -u elastic:fL1F3_72R7tD8KwlvjQC -H "Content-Type: application/json" -XPUT "https://localhost:9200/book/_doc/1" -d '{"title" : "Discipline Equals Freedom", "authors" : [ {"name" : "Jocko Willink" } ] }'

curl --cacert tmp/elasticsearch-8.11.1/config/certs/http_ca.crt -u elastic:fL1F3_72R7tD8KwlvjQC -H "Content-Type: application/json" -XPUT "https://localhost:9200/book/_doc/2" -d '{"title" : "Leadership Strategy and Tactics", "authors" : [ {"name" : "Jocko Willink" } ] }'

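You can see that each book document has its own copy of the author object in the array. By default, however, Elasticsearch automatically flattens arrays of inner objects at index time, so each inner field (for example, authors.name) is indexed as a multi-value field on the book document.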
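
To make the flattening concrete, here is a rough pure-Python imitation of its effect, not Elasticsearch's actual implementation. The function name `flatten_object_array`, the `first_name` and `last_name` fields, and the sample book are all hypothetical and chosen only for illustration.

```python
def flatten_object_array(doc, field):
    """Mimic how an array of inner objects becomes multi-value dotted fields."""
    flat = {}
    for inner in doc.get(field, []):
        for key, value in inner.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


# Hypothetical book with two authors, each having two inner fields.
book = {
    "title": "Hypothetical Co-Authored Book",
    "authors": [
        {"first_name": "Alice", "last_name": "Smith"},
        {"first_name": "Bob", "last_name": "Jones"},
    ],
}

flat = flatten_object_array(book, "authors")
print(flat)
# {'authors.first_name': ['Alice', 'Bob'], 'authors.last_name': ['Smith', 'Jones']}

# The link between fields of the same inner object is lost, so a search for
# "Alice Jones" appears to match even though no such author exists.
matches_wrongly = (
    "Alice" in flat["authors.first_name"] and "Jones" in flat["authors.last_name"]
)
print(matches_wrongly)  # True
```

With a single inner field such as `name`, as in the book documents here, this loss of association does not matter; it becomes relevant once inner objects have several fields that are queried together, which is the case nested objects address.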

The following query returns all the books authored by Jocko Willink:

{
    "query" : {
        "term" : {
            "authors.name.keyword" : "Jocko Willink"
        }
    }
}

You can retrieve the books of an author with a single query, as seen in the following code snippet:

from elasticsearch import Elasticsearch


ELASTIC_PASSWORD = "fL1F3_72R7tD8KwlvjQC"


def main():
   es = Elasticsearch("https://localhost:9200",
       ca_certs="/Users/gcdrocella/tmp/elasticsearch-8.11.1/config/certs/http_ca.crt",
       basic_auth=("elastic", ELASTIC_PASSWORD))
  
   resp = es.search(index="book", query = {
       "term" : {
           "authors.name.keyword" : "Jocko Willink"
       }
   })


   hits = resp["hits"]["hits"]


   for hit in hits:
       source = hit["_source"]


       for author in source["authors"]:
           print(author["name"] + " authored " + source["title"])


# Run the example when the script is executed.
main()

The query is fast because the join is effectively performed when the document is indexed (Zhao, 2022). One caveat of data denormalization is that if an author’s name needs to be updated, it must be changed in every document where the data is copied (Zhao, 2022).
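
The update caveat can be handled in bulk with an update-by-query request. The sketch below only builds the request arguments; the helper name `build_author_rename_request` and the new name are hypothetical, the Painless script is an untested sketch that should be checked against your own cluster, and sending it assumes an 8.x Python client whose `update_by_query` accepts `index`, `query`, and `script`.

```python
def build_author_rename_request(old_name, new_name, index="book"):
    """Build keyword arguments for an update-by-query that renames an author in every book."""
    return {
        "index": index,
        # Only touch books that currently contain the old name.
        "query": {"term": {"authors.name.keyword": old_name}},
        "script": {
            "lang": "painless",
            # Walk the authors array and rewrite matching names in place.
            "source": (
                "for (def author : ctx._source.authors) {"
                " if (author.name == params.old_name) {"
                " author.name = params.new_name; } }"
            ),
            "params": {"old_name": old_name, "new_name": new_name},
        },
    }


# "J. Willink" is a made-up replacement value used only for illustration.
request = build_author_rename_request("Jocko Willink", "J. Willink")
print(request["query"])
# With a connected 8.x client this could be sent as: es.update_by_query(**request)
```

Every matching book document is rewritten, so the cost of a rename grows with the number of copies, which is the tradeoff denormalization makes in exchange for fast reads.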

Method 3 – Nested Objects

Elasticsearch enables you to create an array of nested inner objects by explicitly stating the schema of the document (“Nested field type”, n.d.; Zhao, 2022). It’s important to note that nested objects model a one-to-many relationship; in this scenario, one book has many authors. The following specifies a schema for a document that has an “authors” property, which holds an array of nested author objects:

{
    "mappings" : { 
        "properties" : { 
            "title": { 
                "type" : "text" 
            }, 
            "authors" : { 
                "type" : "nested"
            } 
        }
    } 
}

The following curl command will set the schema for the books index:

curl -H "Content-Type: application/json" -u elastic:fL1F3_72R7tD8KwlvjQC --cacert tmp/elasticsearch-8.11.1/config/certs/http_ca.crt -XPUT https://localhost:9200/books -d '{"mappings" : { "properties" : { "title": { "type" : "text" } , "authors" : { "type" : "nested" } } } }'

The following is a sample document (notice it’s similar to data denormalization):

{
    "title" : "Discipline Equals Freedom", 
    "authors" : [ 
        {
            "name" : "Jocko Willink"
        }
    ]
}

The following curl command will insert the document into the index:

curl --cacert tmp/elasticsearch-8.11.1/config/certs/http_ca.crt -u elastic:fL1F3_72R7tD8KwlvjQC -H "Content-Type: application/json" -XPUT "https://localhost:9200/books/_doc/1" -d '{"title" : "Discipline Equals Freedom", "authors" : [ {"name" : "Jocko Willink" } ] }'

Internally, Elasticsearch indexes each author object in the array as a separate hidden document, which means each object can be queried independently of the others using nested queries.

The following will query the nested author objects in the book document:

{
    "query" : {
        "nested" : {
            "path" : "authors",
            "query" : {
                "term" : {
                    "authors.name.keyword" : "Jocko Willink"
                }
            }
        }
    }
} 

The following code sample demonstrates how you can use nested queries in Elasticsearch, assuming the index uses the aforementioned schema:

from elasticsearch import Elasticsearch


ELASTIC_PASSWORD = "fL1F3_72R7tD8KwlvjQC"


def main():
   es = Elasticsearch("https://localhost:9200",
       ca_certs="/Users/gcdrocella/tmp/elasticsearch-8.11.1/config/certs/http_ca.crt",
       basic_auth=("elastic", ELASTIC_PASSWORD))


   resp = es.search(index="books", query = {
       "nested" : {
           "path" : "authors",
           "query" : {
               "term" : {
                   "authors.name.keyword" : "Jocko Willink"
               }
           }
       } 
   })


   hits = resp["hits"]["hits"]


   for hit in hits:
       source = hit["_source"]


       for author in source["authors"]:
           print(author["name"] + " authored " + source["title"])


# Run the example when the script is executed.
main()
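
Because each nested author is indexed separately, a nested query can also report which inner objects matched by adding `inner_hits`. The following is a minimal sketch under stated assumptions: the helper names `build_nested_author_query` and `matched_author_names` are hypothetical, and the sample hit is hand-written to resemble a search response rather than taken from a live cluster.

```python
def build_nested_author_query(name):
    """Nested query on authors that also asks for the matching inner objects."""
    return {
        "nested": {
            "path": "authors",
            "query": {"term": {"authors.name.keyword": name}},
            "inner_hits": {},  # return the nested objects that matched
        }
    }


def matched_author_names(hit):
    """Pull the names of the nested authors that matched out of one search hit."""
    inner = hit.get("inner_hits", {}).get("authors", {})
    return [h["_source"]["name"] for h in inner.get("hits", {}).get("hits", [])]


# Hand-written hit shaped like a search response that includes inner_hits.
sample_hit = {
    "_source": {"title": "Discipline Equals Freedom"},
    "inner_hits": {
        "authors": {"hits": {"hits": [{"_source": {"name": "Jocko Willink"}}]}}
    },
}
print(matched_author_names(sample_hit))  # ['Jocko Willink']
```

With a connected client, the query could be passed as `es.search(index="books", query=build_nested_author_query("Jocko Willink"))`, and `matched_author_names` applied to each hit in the response.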

Method 4 – Join Fields and Parent-Child Modeling

Parent-child modeling enables a special join operation on a one-to-many relationship between parent and child documents in Elasticsearch (“Join field type”, n.d.; Zhao, 2022). To demonstrate parent-child modeling, we model employees, where some employees are managers and other employees are subordinates. In this scenario, the managers are the parent documents and the subordinates are the child documents.

The following schema will create a join field in the employee document:

{
    "mappings" : { 
        "properties" : { 
            "manager_subordinate" : {
                "type" : "join",
                "relations" : { 
                    "manager" : "subordinate"
                }
            }
        }
    } 
}

This means the document has a property called “manager_subordinate”, which is a join field that can take the value “manager” or “subordinate”, where “manager” is the parent and “subordinate” is the child. The following curl command sets the schema for the employee index:

curl -H "Content-Type: application/json" -u elastic:fL1F3_72R7tD8KwlvjQC --cacert tmp/elasticsearch-8.11.1/config/certs/http_ca.crt -XPUT https://localhost:9200/employee -d '{"mappings" : { "properties" : { "manager_subordinate" : {"type" : "join", "relations" : { "manager" : "subordinate" } } } } }'

The following is a manager employee document:

{
    "name" : "Bob",
    "manager_subordinate" : "manager"
}

You can insert the document as follows:

curl -H "Content-Type: application/json" -u elastic:fL1F3_72R7tD8KwlvjQC --cacert tmp/elasticsearch-8.11.1/config/certs/http_ca.crt -XPUT https://localhost:9200/employee/_doc/1 -d '{"name" : "Bob", "manager_subordinate" : "manager"}'

Notice, in the document, we specify that Bob is a manager, and we set the property “manager_subordinate” to the value “manager”.

The following is a subordinate employee document. The parent id is set to 1, which means that Bob is Alice’s manager:

{
    "name" : "Alice", 
    "manager_subordinate" : { 
        "name" : "subordinate",
        "parent": "1"
    }
}

You can insert the document with the following curl command:

curl -H "Content-Type: application/json" -u elastic:fL1F3_72R7tD8KwlvjQC --cacert tmp/elasticsearch-8.11.1/config/certs/http_ca.crt -XPUT "https://localhost:9200/employee/_doc/2?routing=1" -d '{"name" : "Alice", "manager_subordinate" : { "name" : "subordinate", "parent": "1" } }'

When inserting the document, we set the routing value to 1, which is the id of the parent (manager) document. This tells Elasticsearch to store the child document on the same shard as its parent.
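
The same routed insert can be expressed with the Python client. The sketch below only assembles the arguments; the helper name `build_child_index_request` is hypothetical, and sending the request assumes an 8.x client whose `index` method accepts `index`, `id`, `routing`, and `document`.

```python
def build_child_index_request(child_id, name, parent_id, index="employee"):
    """Build keyword arguments for indexing a subordinate routed to its manager's shard."""
    return {
        "index": index,
        "id": child_id,
        # Using the parent id as the routing value places the child on the parent's shard.
        "routing": parent_id,
        "document": {
            "name": name,
            "manager_subordinate": {"name": "subordinate", "parent": parent_id},
        },
    }


request = build_child_index_request("2", "Alice", "1")
print(request["routing"])  # 1
# With a connected 8.x client this could be sent as: es.index(**request)
```

Deriving both `routing` and `parent` from the same argument avoids the easy mistake of routing a child to a different value than its parent id.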

The following query will return all the subordinates in the employee index:

{
    "query" : { 
        "has_parent" : { 
            "parent_type" : "manager", 
            "query" : { 
                "match_all" : { }
            } 
        }
    } 
}

The following Python code sample will query the employee index for subordinates:

from elasticsearch import Elasticsearch


ELASTIC_PASSWORD = "fL1F3_72R7tD8KwlvjQC"


def main():
   es = Elasticsearch("https://localhost:9200",
       ca_certs="/Users/gcdrocella/tmp/elasticsearch-8.11.1/config/certs/http_ca.crt",
       basic_auth=("elastic", ELASTIC_PASSWORD))


   resp = es.search(index="employee", query = {
       "has_parent" : {
           "parent_type" : "manager",
           "query" : {
               "match_all" : { }
           }
       }
   })


   hits = resp["hits"]["hits"]


   for hit in hits:
       source = hit["_source"]


       print(source["name"] + " is a subordinate.")


# Run the example when the script is executed.
main()

One limitation is that parents and children are joined in memory, so parent and child documents must reside on the same shard (Zhao, 2022).
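
The has_parent query finds children; the join field also supports lookups in the other direction. As a minimal sketch, the two hypothetical helpers below build a `has_child` query (managers that have at least one subordinate) and a `parent_id` query (subordinates of one specific manager) for the employee mapping used in this post.

```python
def managers_with_subordinates_query():
    """Query body matching parent documents that have at least one subordinate child."""
    return {
        "has_child": {
            "type": "subordinate",
            "query": {"match_all": {}},
        }
    }


def subordinates_of_query(manager_id):
    """Query body matching the subordinate children of a single manager document."""
    return {
        "parent_id": {
            "type": "subordinate",
            "id": manager_id,
        }
    }


print(managers_with_subordinates_query())
print(subordinates_of_query("1"))
```

Either body could be passed as the `query` argument of `es.search(index="employee", ...)` with a connected client; with the documents inserted above, the first would be expected to match Bob and the second Alice.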

In this blog post, we examined the four methods of data modeling in Elasticsearch, which are application side joins, data denormalization, nested objects, and parent-child modeling. Sample documents and concrete code samples were given for querying data for each data modeling method.

Thanks for reading! For more blog posts just like this, subscribe and share!


References

Ghriss, J. [@jassemghriss2611]. (2019, April 22). 17 Data Modeling with Elasticsearch. YouTube. https://www.youtube.com/watch?v=fPNiGjB8JR8

Join field type. (n.d.). Elastic.Co. Retrieved November 30, 2023, from https://www.elastic.co/guide/en/elasticsearch/reference/current/parent-join.html

Kashyap, V. (2020, April 18). ElasticSearch – Denormalization & Invtered Index. Linkedin.com. https://www.linkedin.com/pulse/elasticsearch-denormalization-invtered-index-vaibhav-kashyap/

Nested field type. (n.d.). Elastic.Co. Retrieved November 30, 2023, from https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html

Zhao, J. Y. (2022, July 25). Different ways to model your data in Elasticsearch – Joey Yi Zhao. Medium. https://medium.com/@zhaoyi0113/different-ways-to-model-your-data-in-elasticsearch-bbc719f3d4fc

