Building Enterprise Ontology — Part III — Data Store Overview — What is AWS Cloud Directory?

Published in AWS Tip · 7 min read · Jun 11, 2024

Selecting the appropriate data store for a system is a complex task. The initial design phase, which includes setting up the schema, is particularly challenging because business requirements may not yet be fully defined. Accommodating growth and modifying the schema add further complexity over time. AWS Cloud Directory, which supports hierarchical data, is a less familiar type of data store service. However, it is well suited to managing the extensive ontology data of large enterprises, which can encompass millions of resources and thousands of policies. This part explores the Cloud Directory service in depth, highlighting how hierarchical data stores differ from other database models and why they are advantageous for certain organizational needs.

This is part III of the series on “Building Enterprise Ontology,” and you are welcome to read the other parts here:

Part I — Overview of Ontology Solution on AWS.
Part II — Business Overview — What is an Enterprise Ontology?
Part III — Ontology Data Store — AWS Cloud Directory (this post).
Part IV — Ontology Access API — GraphQL on AppSync and Lambda Resolver.
Part V — Ontology Setup and Data Loading — AWS Step Functions Automation (coming soon).
Part VI — Reliable Authorization with Ontology — Amazon Verified Permissions (coming soon).
Part VII — Natural Language Interfaces for Policies — Amazon Bedrock Agents.

Overall Solution Architecture with AWS Cloud Directory as its Data Store

Why not a Relational Database?

Over the years, I have seen hundreds of data systems fail to scale. It is not that I bring bad karma with me; I was usually called in to help fix these systems. They were usually built by great teams, were mostly successful, and attracted lots of usage. When analyzing the reasons for the failures (for example, in a Correction of Error (COE) analysis at Amazon.com), the problems were often with the system's database. And when analyzing why the teams chose these databases, the reasons narrow down to the following database dilemma:

The diagram above shows the exponential cost (“c”) increase of relational databases over time (“t”) compared to the sublinear cost increase of specialized databases. On the other hand, the diagram highlights the significant additional investment (marked in red) needed to learn the specialized database foundations, schema, query language, and similar technical details compared to the more familiar relational database technology stack.

The diagram above also shows three distinct phases in the life of a data system. In the initial development phase, many developers choose a relational database because they don’t have the time to learn the new syntax of a specialized database. In the last phase, when the service is, hopefully, growing exponentially, it is evident that the specialized database is more suitable (marked in green). In between lies the initial growth phase, where the scale issues start to show but the situation is still manageable with a traditional relational database. Unless you measure the slope of the increase in cost and complexity during this phase, the decision to migrate to a specialized database usually comes too late.
Therefore, if you know that the system needs to support large scale, such as an enterprise-scale ontology, it is worth investing the effort to choose and learn the best specialized database for the use case.

AWS Cloud Directory Commands

As with many other specialized data stores, the way to define the schema, add, update, or delete items, and query them differs from the familiar SQL. Redis has a large set of commands, MongoDB has its own query language, OpenSearch requires Query DSL, and graph databases such as Neptune use query languages such as Gremlin, openCypher, or SPARQL.
Cloud Directory exposes many API commands that are difficult to navigate and use. GraphQL on AppSync with a Lambda resolver, which is covered in the next part, makes working with the data in Cloud Directory reasonably simple.

Let’s understand the core concepts and how to connect them to the ontology:

The core concept of Cloud Directory is a facet, which is similar to a table in other data stores. The facet defines an entity, such as a region, product, IoT device, manager, or employee, and its attributes (name, email, type, model, etc.). The objects created and added to the directory are assigned to a specific facet (table) and then connected to other objects to create the hierarchy.
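To make the facet idea concrete, here is a minimal sketch that builds a request shaped like the CreateFacet API call. The helper name `build_create_facet_request`, the "employee" facet, and the placeholder ARN are assumptions for illustration and are not taken from the ontology solution itself. The block only builds the dictionary and does not call AWS.

```python
from typing import Any, Iterable


def build_create_facet_request(
    schema_arn: str,
    facet_name: str,
    attributes: dict[str, str],
    required: Iterable[str] = (),
    object_type: str = "NODE",
) -> dict[str, Any]:
    """Build keyword arguments shaped like a CreateFacet request.

    `attributes` maps attribute name -> Cloud Directory type
    (for example STRING, NUMBER, BOOLEAN, DATETIME).
    """
    required_names = set(required)
    return {
        "SchemaArn": schema_arn,
        "Name": facet_name,
        "ObjectType": object_type,  # NODE, LEAF_NODE, POLICY, or INDEX
        "Attributes": [
            {
                "Name": attr_name,
                "AttributeDefinition": {"Type": attr_type, "IsImmutable": False},
                "RequiredBehavior": (
                    "REQUIRED_ALWAYS" if attr_name in required_names else "NOT_REQUIRED"
                ),
            }
            for attr_name, attr_type in attributes.items()
        ],
    }


# Hypothetical facet; the ARN is a placeholder, not a real schema.
employee_facet = build_create_facet_request(
    schema_arn="arn:aws:clouddirectory:us-east-1:123456789012:schema/development/ontology-demo",
    facet_name="employee",
    attributes={"name": "STRING", "email": "STRING"},
    required=["name"],
    object_type="LEAF_NODE",
)
print(employee_facet["Name"], [a["Name"] for a in employee_facet["Attributes"]])
```

With boto3 installed and credentials configured, the dictionary could be passed on as `boto3.client("clouddirectory").create_facet(**employee_facet)`.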

A specialized type of object is a policy (from Cloud Directory Core Concepts):

Q: What is a policy?
A policy is a specialized object type with attributes that define the type of policy and policy document. A policy can be attached to objects or the root of a hierarchy. By default, objects inherit policies from their parents. Amazon Cloud Directory does not interpret policies.
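The inheritance rule in the quote can be illustrated with a small in-memory model. This is only a sketch of the concept, not of the service: the `Node` class and the policy names are made up, and each node is assumed to have a single parent, whereas Cloud Directory objects may have several.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    parent: Optional["Node"] = None
    policies: list[str] = field(default_factory=list)


def effective_policies(node: Node) -> list[tuple[str, str]]:
    """Collect (owner, policy) pairs from the node itself up to the root."""
    collected: list[tuple[str, str]] = []
    current: Optional[Node] = node
    while current is not None:
        collected.extend((current.name, policy) for policy in current.policies)
        current = current.parent
    return collected


# Hypothetical hierarchy: organization -> region -> store
root = Node("organization", policies=["data-retention"])
region = Node("emea", parent=root, policies=["gdpr"])
store = Node("store-42", parent=region)

print(effective_policies(store))
# [('emea', 'gdpr'), ('organization', 'data-retention')]
```

The store has no policy of its own, yet it ends up with both the region and the organization policies. What those policies mean is left to the application, since Cloud Directory does not interpret them.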

AWS Cloud Directory doesn’t come with convenient tooling such as a query language or a studio, and its 70 API commands are available only through the different SDKs, such as Java or Python. The core API of the service has the following types of calls:

Schema definition — adding facets and their attributes, creating a schema and publishing it, creating a new directory, and applying a schema. Overall, 34 out of the 70 different API commands of the service are related to the directory schema.
Directory mutation — adding objects and indices, attaching, and detaching them. Overall, 12 of the 70 API commands are related to directory mutations.
Directory traversal — listing children and parents and typed links, searching indices, and getting object attributes. Overall, 13 of the 70 API commands are related to directory traversal.
Policy management — creating policies, attaching them to objects, and looking them up through the hierarchy. Overall, 5 of the 70 API commands are related to policy management.
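As a taste of the traversal calls, the sketch below pages through the children of one object with `list_object_children`. The `iter_children` helper and the `FakeDirectoryClient` stand-in are assumptions so the example runs without AWS access. With boto3, the same function should work with `boto3.client("clouddirectory")` and a real directory ARN.

```python
from typing import Any, Iterator


def iter_children(
    client: Any, directory_arn: str, selector: str
) -> Iterator[tuple[str, str]]:
    """Yield (link_name, object_identifier) for every child, following NextToken."""
    kwargs: dict[str, Any] = {
        "DirectoryArn": directory_arn,
        "ObjectReference": {"Selector": selector},
    }
    while True:
        page = client.list_object_children(**kwargs)
        yield from page.get("Children", {}).items()
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token


class FakeDirectoryClient:
    """Stand-in for the real client, returning two canned pages."""

    def list_object_children(self, **kwargs: Any) -> dict[str, Any]:
        if "NextToken" not in kwargs:
            return {"Children": {"emea": "id-1"}, "NextToken": "page-2"}
        return {"Children": {"apac": "id-2"}}


children = list(iter_children(FakeDirectoryClient(), "arn:placeholder", "/organization"))
print(children)
# [('emea', 'id-1'), ('apac', 'id-2')]
```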

The Cloud Directory API is far from simple to use. However, it offers the functionality needed to build our enterprise ontology solution. Here are the main ways to simplify the creation and operation of a Cloud Directory.

Step Functions Flows

The complex API command flow that is needed to create a new Cloud Directory and its schema is mapped into a few Step Functions flows:

Cloud Directory Setup — Once the schema, including the facets and their attributes, is defined, multiple API commands must be executed in order. The following flow takes the name of the ontology and its customized schema as input.

Create Cloud Directory Step Functions Flow

{
  "StartAt": "createSchemaTask",
  "States": {
    "createSchemaTask": {
      "Next": "PutSchemaFromJson",
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:clouddirectory:createSchema",
      "Parameters": {
        "Name.$": "States.Format('ontology-{}', $.ontology_name)"
      },
      "ResultPath": "$.Schema"
    },
    "PutSchemaFromJson": {
      "Next": "PublishSchema",
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:clouddirectory:putSchemaFromJson",
      "Parameters": {
        "Document.$": "$.document",
        "SchemaArn.$": "$.Schema.SchemaArn"
      },
      "ResultPath": "$.input"
    },
    "PublishSchema": {
      "Next": "CreateDirectory",
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:clouddirectory:publishSchema",
      "Parameters": {
        "DevelopmentSchemaArn.$": "$.Schema.SchemaArn",
        "Version": "001"
      },
      "ResultPath": "$.PublishedSchema"
    },
    "CreateDirectory": {
      "Next": "CreateIndexNode",
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:clouddirectory:createDirectory",
      "Parameters": {
        "Name.$": "States.Format('ontology-{}', $.ontology_name)",
        "SchemaArn.$": "$.PublishedSchema.PublishedSchemaArn"
      },
      "ResultPath": "$.Directory"
    },
    "CreateIndexNode": {
      "Next": "ForEachIndexMap",
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:clouddirectory:createObject",
      "Parameters": {
        "DirectoryArn.$": "$.Directory.DirectoryArn",
        "SchemaFacets": [
          {
            "SchemaArn.$": "$.Directory.AppliedSchemaArn",
            "FacetName": "indices"
          }
        ],
        "ParentReference": {
          "Selector": "/"
        },
        "LinkName": "indices"
      },
      "ResultPath": null
    },
    "ForEachIndexMap": {
      "Type": "Map",
      "End": true,
      "ItemsPath": "$.indices",
      "ItemSelector": {
        "DirectoryArn.$": "$.Directory.DirectoryArn",
        "SchemaArn.$": "$.Directory.AppliedSchemaArn",
        "FacetName.$": "$$.Map.Item.Value.IndexFacet",
        "IndexFacetField.$": "$$.Map.Item.Value.IndexFacetField",
        "LinkName.$": "States.Format('{}LeafIndex', $$.Map.Item.Value.IndexFacet)"
      },
      "ItemProcessor": {
        "ProcessorConfig": {
          "Mode": "INLINE"
        },
        "StartAt": "CreateIndex",
        "States": {
          "CreateIndex": {
            "End": true,
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:clouddirectory:createIndex",
            "Parameters": {
              "DirectoryArn.$": "$.DirectoryArn",
              "IsUnique": true,
              "OrderedIndexedAttributeList": [
                {
                  "FacetName.$": "$.FacetName",
                  "Name.$": "$.IndexFacetField",
                  "SchemaArn.$": "$.SchemaArn"
                }
              ],
              "ParentReference": {
                "Selector": "/indices"
              },
              "LinkName.$": "$.LinkName"
            }
          }
        }
      }
    }
  },
  "TimeoutSeconds": 30,
  "Comment": "Create the cloud directory from the schema and initialize indices for the leaves"
}
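The flow above reads three fields from its input: `$.ontology_name`, `$.document`, and `$.indices`, where each index item carries `IndexFacet` and `IndexFacetField`. The following sketch assembles such an input. The helper name, the sample facet names, and the abbreviated schema document are assumptions for illustration. Note that `document` must be a JSON string, because PutSchemaFromJson expects the schema as text.

```python
import json
from typing import Any


def build_setup_input(
    ontology_name: str, schema_document: dict[str, Any], index_fields: dict[str, str]
) -> dict[str, Any]:
    """Assemble the execution input that the setup flow reads.

    `index_fields` maps a facet name to the attribute that should be indexed.
    """
    return {
        "ontology_name": ontology_name,
        "document": json.dumps(schema_document),  # schema passed as a string
        "indices": [
            {"IndexFacet": facet, "IndexFacetField": attribute}
            for facet, attribute in index_fields.items()
        ],
    }


# Placeholder schema: a real document would define the facets, including
# the "indices" facet that the CreateIndexNode state refers to.
payload = build_setup_input(
    ontology_name="retail",
    schema_document={"facets": {}},
    index_fields={"user": "email", "store": "store_id"},
)

# Link names the Map state derives with States.Format('{}LeafIndex', ...)
link_names = [f"{item['IndexFacet']}LeafIndex" for item in payload["indices"]]
print(link_names)
# ['userLeafIndex', 'storeLeafIndex']
```

With boto3, the payload could be submitted through `boto3.client("stepfunctions").start_execution(stateMachineArn=..., input=json.dumps(payload))`, where the state machine ARN depends on your deployment.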

Ontology Data Loading — The most straightforward way to fill the ontology with the data of the nodes and leaves is to use an Excel or CSV file with the relevant columns. The following Step Functions flow takes as input the location of the file and the mapping of its columns to the different facets (node types) and their attributes. The flow iterates through the ontology hierarchy and creates any node or leaf it doesn’t find. It supports multiple types of hierarchies (organization → region → store → user, or organization → region → store → department → counter, for example) that are managed within the same ontology.

Loading Data into Cloud Directory from CSV file
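The "create it if it is missing" logic of the loading flow can be sketched locally with an in-memory tree. This is not the Step Functions flow itself: the column names and sample rows are made up, and `dict.setdefault` stands in for the lookup followed by a CreateObject call that a real loader would make against the directory.

```python
import csv
import io


def load_hierarchy(csv_text: str, levels: list[str]) -> dict:
    """Build a nested dict tree from CSV rows, creating missing nodes on the way."""
    tree: dict = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        node = tree
        for column in levels:
            value = (row.get(column) or "").strip()
            if not value:
                break  # shorter hierarchy: stop at the last populated level
            node = node.setdefault(value, {})  # create only if missing
    return tree


SAMPLE = """organization,region,store,user
acme,emea,store-1,alice
acme,emea,store-1,bob
acme,apac,store-7,carol
"""

tree = load_hierarchy(SAMPLE, ["organization", "region", "store", "user"])
print(tree)
# {'acme': {'emea': {'store-1': {'alice': {}, 'bob': {}}}, 'apac': {'store-7': {'carol': {}}}}}
```

Rows that share a prefix reuse the existing nodes, so "acme" and "emea" are created once even though they appear in several rows.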

Traversal Query using GraphQL

The next part in this series covers the GraphQL API, which simplifies directory mutation, traversal, and policy management. GraphQL maps nicely to the hierarchical structure of Cloud Directory and allows simple and meaningful access to the data while hiding the technical complexity of the proprietary Cloud Directory API calls.

For example, the following GraphQL query:

query regions {
  organization {
    regions {
      name
      policies {
        policy_type
      }
    }
  }
}

is translated to a set of GetObjectAttributes, ListObjectChildren, GetObjectInformation, ListObjectPolicies, LookupPolicy, and ListObjectAttributes API calls.
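To show how such a query fans out into API calls, here is a simplified sketch of a resolver for the `regions` field. It is not the Lambda resolver from the next part. The `/organization` path, the `region` facet name, and the recording fake client are assumptions. The sketch covers only three of the calls listed above, skips pagination, and returns policy identifiers rather than `policy_type`, which would require one more attribute lookup per policy.

```python
from typing import Any


def resolve_regions(client: Any, directory_arn: str, schema_arn: str) -> list[dict]:
    """Resolve organization.regions with one call per child object and field."""
    children = client.list_object_children(
        DirectoryArn=directory_arn,
        ObjectReference={"Selector": "/organization"},  # assumed path
    )["Children"]
    regions = []
    for _link_name, object_id in children.items():
        ref = {"Selector": f"${object_id}"}  # "$<id>" selects by object identifier
        attrs = client.get_object_attributes(
            DirectoryArn=directory_arn,
            ObjectReference=ref,
            SchemaFacet={"SchemaArn": schema_arn, "FacetName": "region"},
            AttributeNames=["name"],
        )["Attributes"]
        policy_ids = client.list_object_policies(
            DirectoryArn=directory_arn, ObjectReference=ref
        )["AttachedPolicyIds"]
        regions.append(
            {"name": attrs[0]["Value"]["StringValue"], "policy_ids": policy_ids}
        )
    return regions


class RecordingFakeClient:
    """Stand-in client that returns canned data and records the calls made."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def list_object_children(self, **kwargs: Any) -> dict:
        self.calls.append("ListObjectChildren")
        return {"Children": {"emea": "id-1"}}

    def get_object_attributes(self, **kwargs: Any) -> dict:
        self.calls.append("GetObjectAttributes")
        return {"Attributes": [{"Key": {"Name": "name"}, "Value": {"StringValue": "EMEA"}}]}

    def list_object_policies(self, **kwargs: Any) -> dict:
        self.calls.append("ListObjectPolicies")
        return {"AttachedPolicyIds": ["policy-1"]}


client = RecordingFakeClient()
result = resolve_regions(client, "arn:placeholder:directory", "arn:placeholder:schema")
print(result)
print(client.calls)
```

Even this reduced version needs three API calls for a single region, which is the complexity the GraphQL layer hides from its callers.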

Summary

Adopting a new specialized database like Cloud Directory is not a simple decision. The learning curve of a new data store technology and the operational complexity of the proprietary APIs can deter many people and organizations from using it. However, such a solution's superior performance, scale, security, and cost should be a good incentive to invest in learning and using it for systems with distinct hierarchical data structures.

In this post and the full Ontology series, I described the reasoning behind choosing Cloud Directory and the methods for reducing its overall complexity, including using Step Functions and AppSync to wrap and simplify the service API.

Written by ML-Guy
