bug: aws_regions node duplication in Neo4j #12414

@smokentar

Description

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

aws_regions: a new set of region nodes is created on every CloudQuery sync, even if they already exist for a given AWS account ID.
This happens because the load Cypher query uses _cq_id (which is always unique) as the only property in the MERGE pattern, so an existing node is never matched.
To avoid this duplication, I suggest using the account_id and region fields as the MERGE properties instead.
One AWS account should only have one set of regions discovered, currently 27.
With this change, a separate set of regions (27) will still be discovered for each account (as accounts can have different opt_in_status / enabled values), but any one account will only have one set.
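To illustrate the difference, here is a minimal Python sketch that mimics the MERGE behavior with a plain dict instead of a real Neo4j database. The helper names (`sync_rows`, `merge`), the shortened region list, and the account ID are all made up for illustration, and the sketch assumes, as described above, that every sync produces a fresh `_cq_id` for each row.

```python
import uuid

# Shortened, illustrative region list (a real account currently reports 27).
REGIONS = ["us-east-1", "eu-west-1", "ap-southeast-2"]


def sync_rows(account_id):
    """Return one row per region, with a new _cq_id on every call (assumed behavior)."""
    return [
        {"_cq_id": str(uuid.uuid4()), "account_id": account_id, "region": region}
        for region in REGIONS
    ]


def merge(store, rows, key_fields):
    """Mimic `MERGE (t {keys}) SET t = row`: match on key_fields, then overwrite."""
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        store[key] = dict(row)  # creates the node if missing, replaces it otherwise
    return store


by_cq_id = {}
by_account_region = {}

for _ in range(3):  # three syncs of the same (hypothetical) account
    rows = sync_rows("123456789012")
    merge(by_cq_id, rows, ["_cq_id"])
    merge(by_account_region, rows, ["account_id", "region"])

print(len(by_cq_id))           # 9: three nodes added per sync
print(len(by_account_region))  # 3: one node per region, updated in place
```

Keyed on `_cq_id`, the node count grows with every sync; keyed on `(account_id, region)`, it stays at one node per region for the account.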

Current load Cypher query:
TRC RUN "UNWIND $rows AS row MERGE (t:aws_regions {_cq_id: row._cq_id}) SET t = row"{...}

Expected Behavior

Each discovered AWS account has exactly one set of regions (27), no matter how many syncs have run.

Desired load Cypher query:
TRC RUN "UNWIND $rows AS row MERGE (t:aws_regions {account_id: row.account_id, region: row.region}) SET t = row"{...}
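As a rough sketch of how the two queries relate, the only difference is which columns go into the MERGE pattern. The `build_merge_query` helper below is hypothetical (it is not taken from the neo4j destination plugin's code) and simply rebuilds the query strings shown in the logs from a table name and a list of key columns.

```python
def build_merge_query(table, key_fields):
    """Build a load query that MERGEs on key_fields (hypothetical helper)."""
    # Each key column becomes `name: row.name` inside the MERGE pattern.
    props = ", ".join(f"{field}: row.{field}" for field in key_fields)
    return f"UNWIND $rows AS row MERGE (t:{table} {{{props}}}) SET t = row"


current = build_merge_query("aws_regions", ["_cq_id"])
desired = build_merge_query("aws_regions", ["account_id", "region"])

print(current)
print(desired)
```

With `["_cq_id"]` this reproduces the current query, and with `["account_id", "region"]` it reproduces the desired one.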

CloudQuery (redacted) config

kind: source
spec:
  name: "aws"
  registry: "github"
  path: "cloudquery/aws"
  version: "v19.2.0"
  tables: ["*"]
  skip_tables:
    - aws_ec2_vpc_endpoint_services
    - aws_cloudtrail_events
    - aws_docdb_cluster_parameter_groups
    - aws_docdb_engine_versions
    - aws_ec2_instance_types
    - aws_elasticache_engine_versions
    - aws_elasticache_parameter_groups
    - aws_elasticache_reserved_cache_nodes_offerings
    - aws_elasticache_service_updates
    - aws_elasticsearch_versions
    - aws_neptune_cluster_parameter_groups
    - aws_neptune_db_parameter_groups
    - aws_rds_cluster_parameters
    - aws_rds_cluster_parameter_groups
    - aws_rds_cluster_parameter_group_parameters
    - aws_rds_db_parameter_groups
    - aws_rds_engine_versions
    - aws_servicequotas_services
    - aws_servicequotas_quotas
    - aws_iam_role_last_accessed_details
    - aws_iam_user_last_accessed_details
    - aws_iam_policy_last_accessed_details
    - aws_directconnect_locations
    - aws_ram_resource_types
    - aws_lambda_runtimes
    - aws_docdb_event_categories
  destinations: ["neo4j"]
  spec:
    accounts:
      - id: "${ACCOUNT_ID}"
        role_arn: "${ROLE_ARN}"
        role_session_name: "Discovery"
    aws_debug: false
---
kind: destination
spec:
  name: "neo4j"
  registry: "github"
  path: "cloudquery/neo4j"
  version: "v4.0.1"
  # batch_size: 10000 # optional
  # batch_size_bytes: 5242880 # optional
  spec:
    connection_string: "${DB_ENDPOINT}"
    username: "${DB_USERNAME}"
    password: "${DB_PASSWORD}"

Steps To Reproduce

No response

CloudQuery (redacted) logs

TRC RUN "UNWIND $rows AS row MERGE (t:aws_regions {_cq_id: row._cq_id}) SET t = row"{...}

CloudQuery version

v3.8.0

Additional Context

No response

Pull request (optional)

  • I can submit a pull request
