Using AWS Athena's UNNEST Function to Filter JSON Arrays with AND Conditions

AWS Athena Query JSON Array with AND Condition

Introduction

AWS Athena is a serverless query service that allows users to analyze data stored in Amazon S3 using SQL. When working with JSON data, it can be challenging to write efficient queries that extract specific fields or apply conditions. In this article, we will explore how to use AWS Athena’s UNNEST function to flatten an array of objects and then filter the results based on AND conditions.

Background

JSON (JavaScript Object Notation) is a popular data format used to represent structured data in a human-readable format. When working with JSON data, it is often stored as a single string or document in Amazon S3. To analyze this data using AWS Athena, we need to convert the JSON string into a tabular format that can be queried using SQL.

UNNEST Function

The UNNEST function in AWS Athena allows us to flatten an array of objects into separate rows. This is useful when working with nested arrays or objects that contain multiple values. However, it also means that we need to carefully consider how to filter the results based on conditions that apply to individual elements.

Sample Data

Let’s take a look at some sample JSON data stored in S3:

{
    "sample_data": {
        "people": [
            {
                "firstName": "Emily",
                "address": {
                    "streetAddress": "101",
                    "city": "abc",
                    "state": "",
                    "phoneNumbers": [
                        {
                            "type": "home",
                            "number": "11"
                        },
                        {
                            "type": "city",
                            "number": "4"
                        }
                    ]
                }
            },
            {
                "firstName": "Smily",
                "address": {
                    "streetAddress": "102",
                    "city": "def",
                    "state": "",
                    "phoneNumbers": [
                        {
                            "type": "home",
                            "number": "1"
                        },
                        {
                            "type": "city",
                            "number": "1"
                        }
                    ]
                }
            }
        ]
    }
}

This data contains two people with their address and phone numbers. The UNNEST function can be used to flatten the phoneNumbers array into separate rows.

Querying JSON Array

To query this data, we need to use the UNNEST function to flatten the array of objects:

SELECT 
    idx,
    JSON_EXTRACT_SCALAR(x.n, '$.address.streetAddress') as street_address,
    JSON_EXTRACT_SCALAR(x.n, '$.address.city') as city,
    JSON_EXTRACT_SCALAR(x.m, '$.type') as type,
    JSON_EXTRACT_SCALAR(x.m, '$.number') as value
FROM 
    sample_data1 CROSS JOIN 
    UNNEST (CAST(JSON_EXTRACT(sample_data,'$.people') AS ARRAY<JSON>) ) AS x(n)
CROSS JOIN 
    UNNEST (CAST(JSON_EXTRACT(x.n,'$.address.phonenumbers') AS ARRAY<JSON>) ) WITH ordinality AS x(m,idx);

This query uses UNNEST to flatten the people array and then another UNNEST to flatten the phoneNumbers array.

Applying AND Conditions

To apply an AND condition, we can use the WHERE clause with a combination of cardinality and filter. The cardinality function returns the number of elements that match the filter condition. By using this combination, we can check for both conditions (home = 2 and city = 4) in a single query.

Query

Here is an example query that applies the AND conditions:

SELECT 
    street_address,
    city
FROM (
        SELECT 
            JSON_EXTRACT_SCALAR(json_str, '$.address.streetAddress') as street_address,
            JSON_EXTRACT_SCALAR(json_str, '$.address.city') as city,
            CAST(
                JSON_EXTRACT(json_str, '$.address.phoneNumbers') AS ARRAY<JSON>) phones
        FROM dataset
    )
WHERE 
    cardinality(
        filter(
            phones,
            js->json_extract_scalar(js, '$.type') = 'home'
                AND try_cast(json_extract_scalar(js, '$.number') as integer) > 2
        )
    ) > 0 
AND 
    cardinality(
        filter(
            phones,
            js->json_extract_scalar(js, '$.type') = 'city'
                AND json_extract_scalar(js, '$.number') = '4'
        )
    ) > 0;

This query uses UNNEST to flatten the phoneNumbers array and then applies the AND conditions using a combination of cardinality and filter.

Output

The output of this query is:

street_address   city
101            abc   

Only one row matches both conditions (home = 2 and city = 4).

Conclusion

AWS Athena’s UNNEST function allows us to flatten arrays of objects into separate rows, making it easier to apply conditions. By combining cardinality and filter, we can check for multiple conditions in a single query. This example demonstrates how to use these features to extract specific fields from JSON data stored in S3 using AWS Athena.

Additional Tips

  • When working with arrays or objects, consider using UNNEST to flatten the data into separate rows.
  • Use cardinality and filter combinations to apply multiple conditions in a single query.
  • Be mindful of performance when applying filters to large datasets.

Last modified on 2024-01-14