Using Amazon Athena With Spring Boot to Query S3 Data | Baeldung

1. Overview

We often store large amounts of data in Amazon S3, but analyzing this data can be challenging. Traditional methods require us to move the data or set up complex systems like a data warehouse.

Amazon Athena offers a simpler solution, allowing us to query our S3 data directly using SQL.

In this tutorial, we’ll explore using Amazon Athena to analyze data in our S3 buckets using Spring Boot. We’ll walk through the necessary configurations, execute Athena queries programmatically, and handle the results.

2. Understanding Amazon Athena

Amazon Athena is a serverless query service that allows us to perform ad-hoc queries on the data stored in our S3 buckets without setting up any infrastructure.

One of the key benefits of using Athena is that we only pay for the amount of data consumed while executing the query, making it a cost-effective solution for ad-hoc and occasional data analysis.

Athena also uses schema-on-read to translate our S3 data in-flight into a table-like structure. Specifically, this means we query our data without altering the source and without performing any extract, transform, and load (ETL) operations. The tables we define in Athena don’t contain the actual data like traditional databases. Instead, they store instructions on how to convert the source data for querying.

The data in our S3 buckets can originate from various AWS services, such as CloudTrail logs, VPC Flow Logs, and ALB Access Logs, or even custom data that we store in S3 in formats such as JSON, XML, Parquet, etc.

3. Setting up the Project

Before we use Amazon Athena, we’ll need to include the dependency for it and configure our application correctly.

3.1. Dependencies

Let’s start by adding the Amazon Athena dependency to our project’s pom.xml file:

<dependencies>
    <dependency>
        <groupId>software.amazon.awssdk</groupId>
        <artifactId>athena</artifactId>
        <version>2.26.0</version>
    </dependency>
</dependencies>

This dependency provides us with the AthenaClient and other related classes, which we’ll use to interact with the Athena service.

3.2. Defining Athena Configuration Properties

Now, to interact with the Athena service and execute queries, we need to configure our AWS credentials for authentication, the Athena database name to use for running our SQL queries, and the query result location, which is an S3 bucket where Athena stores the results of our queries.

We’ll store these properties in our project’s application.yaml file and use @ConfigurationProperties to map the values to a POJO, which our service layer references when interacting with Athena:

@Getter
@Setter
@Validated
@ConfigurationProperties(prefix = "com.baeldung.aws")
class AwsConfigurationProperties {

    @NotBlank
    private String accessKey;

    @NotBlank
    private String secretKey;

    @Valid
    private Athena athena = new Athena();

    @Getter
    @Setter
    public class Athena {

        @Nullable
        private String database = "default";

        @NotBlank
        private String s3OutputLocation;

    }

}

The s3OutputLocation field represents the S3 bucket location where Athena stores the results of our queries. This is necessary because Athena is serverless and doesn’t store any data itself. Instead, it performs the queries and writes the results to the specified S3 location, which our application can then read from.

We’ve also added validation annotations to ensure all the required properties are configured correctly. If any of the defined validations fail, it results in the Spring ApplicationContext failing to start up. This allows us to conform to the fail fast pattern.

Below is a snippet of our application.yaml file, which defines the required properties that will be mapped to our AwsConfigurationProperties class automatically:

com:
  baeldung:
    aws:
      access-key: ${AWS_ACCESS_KEY}
      secret-key: ${AWS_SECRET_KEY}
      athena:
        database: ${AMAZON_ATHENA_DATABASE}
        s3-output-location: ${AMAZON_ATHENA_S3_OUTPUT_LOCATION}

Accordingly, this setup allows us to externalize the Athena properties and easily access them in our application.

4. Configuring Athena in Spring Boot

Now that we’ve defined our properties, let’s reference them to configure the necessary beans for interacting with Athena.

4.1. Creating the AthenaClient Bean

The AthenaClient is the main entry point for interacting with the Athena service. We’ll create a bean to set it up:

@Bean
public AthenaClient athenaClient() {
    String accessKey = awsConfigurationProperties.getAccessKey();
    String secretKey = awsConfigurationProperties.getSecretKey();
    AwsBasicCredentials awsCredentials = AwsBasicCredentials.create(accessKey, secretKey);
    
    return AthenaClient.builder()
      .credentialsProvider(StaticCredentialsProvider.create(awsCredentials))
      .build();
}

Here, we create an instance of AthenaClient using the configured AWS credentials. This client is used to start query executions and retrieve results from the S3 bucket.

4.2. Defining the QueryExecutionContext Bean

Next, we need to tell Athena which database to use when running our SQL queries:

@Bean
public QueryExecutionContext queryExecutionContext() {
    String database = awsConfigurationProperties.getAthena().getDatabase();
    return QueryExecutionContext.builder()
      .database(database)
      .build();
}

We create a QueryExecutionContext bean and specify the database to be used for our queries. The database name is retrieved from our configuration properties, which defaults to the default database if not explicitly specified.

4.3. Setting up the ResultConfiguration Bean

Finally, we need to configure where Athena should store the results of our SQL queries:

@Bean
public ResultConfiguration resultConfiguration() {
    String outputLocation = awsConfigurationProperties.getAthena().getS3OutputLocation();
    return ResultConfiguration.builder()
      .outputLocation(outputLocation)
      .build();
}

It’s important to note that the S3 bucket we use to store query results should differ from the bucket containing our source data.

This separation prevents query results from being interpreted as additional source data, which would lead to unexpected query results. Moreover, Athena should have read-only access to the source bucket to maintain data integrity, with write permissions only granted on the bucket we’ve provisioned to store results.

5. Executing Athena Queries

With the necessary configuration in place, let’s look at how we can execute queries using Athena. We’ll create a QueryService class, autowiring all the beans we’ve created, and expose a single public execute() method that encapsulates the query execution logic.

5.1. Starting a Query Execution

First, we’ll use the AthenaClient instance to start query execution:

public <T> List<T> execute(String sqlQuery, Class<T> targetClass) {
    String queryExecutionId;
    try {
        queryExecutionId = athenaClient.startQueryExecution(query -> 
            query.queryString(sqlQuery)
              .queryExecutionContext(queryExecutionContext)
              .resultConfiguration(resultConfiguration)
        ).queryExecutionId();
    } catch (InvalidRequestException exception) {
        log.error("Invalid SQL syntax detected in query {}", sqlQuery, exception);
        throw new QueryExecutionFailureException();
    }

    // ...rest of the implementation in the upcoming sections
}

We provide the SQL query string, the QueryExecutionContext, and the ResultConfiguration when starting the query execution. The startQueryExecution() method returns a unique queryExecutionId that we’ll use to track the query’s status and retrieve the results.

The targetClass argument specifies the Java class to which we’ll be mapping the query results.

We also handle the InvalidRequestException that the Athena SDK throws if the provided SQL query contains syntax errors. We catch this exception, log the error message along with the invalid query, and throw a custom QueryExecutionFailureException.

5.2. Waiting for Query Completion

After starting the query execution, we need to wait for it to complete before attempting to retrieve the results:

private static final long WAIT_PERIOD = 30;

private void waitForQueryToComplete(String queryExecutionId) {
    QueryExecutionState queryState;

    do {
        GetQueryExecutionResponse response = athenaClient.getQueryExecution(request -> 
            request.queryExecutionId(queryExecutionId));
        queryState = response.queryExecution().status().state();

        switch (queryState) {
            case FAILED:
            case CANCELLED:
                String error = response.queryExecution().status().athenaError().errorMessage();
                log.error("Query execution failed: {}", error);
                throw new QueryExecutionFailureException();
            case QUEUED:
            case RUNNING:
                TimeUnit.MILLISECONDS.sleep(WAIT_PERIOD);
                break;
            case SUCCEEDED:
                queryState = QueryExecutionState.SUCCEEDED;
                return;
        }
    } while (queryState != QueryExecutionState.SUCCEEDED);
}

We create a private waitForQueryToComplete() method and periodically poll the query’s status using the getQueryExecution() method until it reaches the SUCCEEDED state.

If the query fails or is canceled, we log the error message and throw our custom QueryExecutionFailureException. If it’s queued or running, we wait for a short period before checking again.

We invoke the waitForQueryToComplete() method from our execute() method with the queryExecutionId we received from starting the query execution.

5.3. Processing Query Results

After the query execution completes successfully, we can retrieve the results:

GetQueryResultsResponse queryResult = athenaClient.getQueryResults(request -> 
    request.queryExecutionId(queryExecutionId));

The getQueryResults() method returns a GetQueryResultsResponse object containing the result set. We can process these results and convert them into instances of the class specified by the targetClass argument of our execute() method:

private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper().registerModule(new JsonOrgModule());

private <T> List<T> transformQueryResult(GetQueryResultsResponse queryResultsResponse, Class<T> targetClass) {
    List<T> response = new ArrayList<T>();
    List<Row> rows = queryResultsResponse.resultSet().rows();
    List<String> headers = rows.get(0).data().stream().map(Datum::varCharValue).toList();

    rows.stream()
      .skip(1)
      .forEach(row -> {
          JSONObject element = new JSONObject();
          List<Datum> data = row.data();
           
          for (int i = 0; i < headers.size(); i++) {
              String key = headers.get(i);
              String value = data.get(i).varCharValue();
              element.put(key, value);
          }
          T obj = OBJECT_MAPPER.convertValue(element, targetClass);
          response.put(obj);
      });
    return response;
}

Here, we extract the headers from the first row of the result set and then process each subsequent row, converting it into a JSONObject where the keys are the column names and the values are the corresponding cell values. We then use the ObjectMapper to convert each JSONObject into an instance of the specified target class, representing the domain model. These domain model objects are added to a list that is returned.

It’s important to note that our transformQueryResult() implementation is generic and works for all types of read queries, regardless of the table or domain model.

5.4. Executing SQL Queries With the execute() Method

With our execute() method fully implemented, we can now easily run SQL queries against our S3 data and retrieve the results as domain model objects:

String query = "SELECT * FROM users WHERE age < 25;";
User user = queryService.execute(query, User.class);

record User(Integer id, String name, Integer age, String city) {};

Here, we define a SQL query that selects all users younger than 25 years. We pass this query and the User class to our execute() method. The User class is a simple record representing the structure of the data we expect to retrieve.

The execute() method takes care of starting the query execution, waiting for its completion, retrieving the results, and transforming them into a list of User objects. This abstraction allows us to focus on the query and the domain model, without worrying about the underlying interactions with Athena.

5.5. Parameterized Statements With Athena

It’s important to note that when constructing SQL queries with user input, we should be cautious about the risk of SQL injection attacks. Athena supports parameterized statements, which allow us to separate the SQL query from the parameter values, providing a safer way to execute queries with user input. While we’ve used a raw SQL query here for demonstration purposes, using parameterized statements when building queries with user-supplied input is strongly recommended.

To use parameterized queries, we can modify our execute() method to accept an optional list of parameters:

public <T> List<T> execute(String sqlQuery, List<String> parameters, Class<T> targetClass) {
    // ... same as above
    
    queryExecutionId = athenaClient.startQueryExecution(query -> 
        query.queryString(sqlQuery)
          .queryExecutionContext(queryExecutionContext)
          .resultConfiguration(resultConfiguration)
          .executionParameters(parameters)
    ).queryExecutionId();
    
    // ... same as above
}

We’ve added a new parameters argument to the execute() method, which is a list of string values that will be used in the parameterized query. When starting the query execution, we pass these parameters using the executionParameters() method.

Let’s look at how we can use our updated execute() method:

public List<User> getUsersByName(String name) {
    String query = "SELECT * FROM users WHERE name = ?";
    return queryService.execute(query, List.of(name), User.class);
}

This example defines a SQL query with a placeholder ‘?’ for the name parameter. We pass the name value as a list containing a single element to the execute() method, along with the query and the target class.

6. Automating Database and Table Creation

To query our S3 data using Athena, we need to first define a database and a table that’ll map to the data stored in our S3 bucket. While we can create these manually using the AWS Management Console, it’s more convenient to automate this process as part of our application startup.

We’ll place our SQL scripts for setting up the necessary database and table in a new athena-init directory, which we’ll create inside the src/main/resources directory.

To execute these SQL scripts, we’ll create an AthenaInitializer class that implements the ApplicationRunner interface:

@Component
@RequiredArgsConstructor
class AthenaInitializer implements ApplicationRunner {

    private final QueryService queryService;
    private final ResourcePatternResolver resourcePatternResolver;

    @Override
    public void run(ApplicationArguments args) {
        Resource[] initScripts = resourcePatternResolver.getResources("classpath:athena-init/*.sql");
        for (Resource script : initScripts) {
            String sqlScript = FileUtils.readFileToString(script.getFile(), StandardCharsets.UTF_8);
            queryService.execute(sqlScript, Void.class);
        }
    }

}

Using constructor injection via Lombok, we inject instances of ResourcePatternResolver and QueryService that we created earlier.

We use the ResourcePatternResolver to locate all our SQL scripts in the athena-init directory. We then iterate over these scripts, read their contents using Apache Commons IO, and execute them using our QueryService.

We’ll first begin by creating a create-database.sql script to create a custom database:

CREATE DATABASE IF NOT EXISTS baeldung;

We create a custom database named baeldung if it doesn’t already exist. The database name used here can be configured in the application.yaml file, as we’ve seen earlier in the tutorial.

Similarly, to create a table named users in the baeldung database, we’ll create another script named create-users-table.sql with the following content:

CREATE EXTERNAL TABLE IF NOT EXISTS users (
  id INT,
  name STRING,
  age INT,
  city STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://baeldung-athena-tutorial-bucket/';

This script creates an external table named users with columns corresponding to the fields in the JSON data that we’ll store in S3. We specify JsonSerDe as the row format and provide the S3 location where we’ll store our JSON files.

Significantly, to correctly query the data stored in S3 using Athena, it’s important to ensure that each JSON record is entirely on a single line of text with no spaces or line breaks between keys and values:

{"id":1,"name":"Homelander","age":41,"city":"New York"}
{"id":2,"name":"Black Noir","age":58,"city":"Los Angeles"}
{"id":3,"name":"Billy Butcher","age":46,"city":"London"}

7. IAM Permissions

Finally, for our application to function, we’ll need to configure some permissions for the IAM user configured in our app.

Our policy should configure Athena and S3 access:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowAthenaQueryExecution",
            "Effect": "Allow",
            "Action": [
                "athena:StartQueryExecution",
                "athena:GetQueryExecution",
                "athena:GetQueryResults"
            ],
            "Resource": "arn:aws:athena:region:account-id:workgroup/primary"
        },
        {
            "Sid": "AllowS3ReadAccessToSourceBucket",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::baeldung-athena-tutorial-bucket",
                "arn:aws:s3:::baeldung-athena-tutorial-bucket/*"
            ]
        },
        {
            "Sid": "AllowS3AccessForAthenaQueryResults",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::baeldung-athena-tutorial-results-bucket",
                "arn:aws:s3:::baeldung-athena-tutorial-results-bucket/*"
            ]
        },
        {
            "Sid": "AllowGlueCatalogAccessForAthena",
            "Effect": "Allow",
            "Action": [
                "glue:CreateDatabase",
                "glue:GetDatabase",
                "glue:CreateTable",
                "glue:GetTable"
            ],
            "Resource": [
                "arn:aws:glue:region:account-id:catalog",
                "arn:aws:glue:region:account-id:database/baeldung",
                "arn:aws:glue:region:account-id:table/baeldung/users"
            ]
        }
    ]
}

The IAM policy consists of four key statements to build the permissions required for our Spring Boot application. The AllowAthenaQueryExecution statement provides the necessary permissions to interact with Athena itself, including starting queries, checking their status, and retrieving results.

Then, the AllowS3ReadAccessToSourceBucket statement allows read access to our S3 bucket that contains the source data we intend to query. The AllowS3AccessForAthenaQueryResults statement focuses on the S3 bucket where Athena stores query results. It grants permissions for Athena to write results to the configured S3 bucket and for our application to retrieve them

Finally, to allow interactions with AWS Glue, which Athena uses as its metadata store, we define the AllowGlueCatalogAccessForAthena statement. It allows us to create and retrieve database and table definitions which are essential for Athena to understand the structure of our S3 data and execute SQL queries.

Our IAM policy conforms to the least privilege principle, granting only the necessary permissions required by our application to function correctly.

8. Conclusion

In this article, we’ve explored using Amazon Athena with Spring Boot to query data directly from our S3 buckets without setting up any complex infrastructure.

We discussed starting a query execution, waiting for its completion, and generically processing the query results. Additionally, we automated the creation of databases and tables using SQL scripts executed during application startup.

As always, all the code examples used in this article are available over on GitHub.

Persistence

REST

Security