GitHub Project

Getting Started

The following assumes you're on macOS or Linux when executing commands. If you are running Windows, use gradlew.bat to execute Gradle commands.

Project Dependencies

Java 25

This project uses JDK 25 and has been tested with the OpenJDK distribution.

General Setup

Follow the OpenJDK installation instructions that correspond to your operating system.

Brew and macOS

To install OpenJDK 25 with brew, run:

brew install openjdk@25

After installation, you may need to symlink the JDK so macOS's system Java wrappers can find it:

ln -sfn /opt/homebrew/opt/openjdk/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk.jdk

To verify this installation, both of these commands should return information about your JDK 25 install:

/usr/libexec/java_home -v 25
java --version

Build and Test

./gradlew build

Running for Development

./gradlew bootRun

Running the Packaged JAR

First, produce the jar:

./gradlew bootJar

Then run it from the command line:

java -jar build/libs/branch-0.0.1-SNAPSHOT.jar

Note that in production, we'd publish fixed (non-SNAPSHOT) versions and update the run command to point at the most recent one.

Quick cURL

The server starts on port 8080. You can exercise the service locally with the following cURL command:

curl localhost:8080/profiles/octocat

where the last path parameter (ex. octocat) is the username to look up.

Decision Log

Java Version and Distribution

I wanted a readily available distribution that supports virtual threads. The Spring Boot docs recommend using at least Java 24. Since Java 25 is an LTS release and OpenJDK is free, I figured OpenJDK 25 would be easy to set up and would work well with Spring.

At a previous company, we used Amazon's Corretto distribution without issue and picked it specifically because the base image was produced by our cloud provider. The Lambda cold start times were shorter, and their base Docker image had better integration with ECR, AWS's container registry. To extend this project past a take-home, I'd investigate how different distributions perform with the intended cloud provider.

Virtual Threads and Concurrency

The throughput of this project is limited by waiting for responses from the GitHub API. This I/O-bound setting is a primary use case for virtual threads, which allow the application to handle more connections with the same memory allocation compared to platform threads.

This project configures Spring to use virtual threads for handling concurrent inbound requests. The GitHub API calls themselves are made sequentially per request, following GitHub's suggestion to avoid concurrent requests so we don't burn through the application's quota. This comes at the cost of latency - we have to wait for the /users/{username} request to finish before making the call to /users/{username}/repos. In the interest of time, this project does not use the rate limit response headers returned with each API call; those headers would be useful for implementing backpressure or for telling clients how long to wait before the limit resets.

If the client needed lower latency, virtual threads would provide a simple way to make the two required GitHub API calls concurrently. Because there are fewer system parameters to tune - we don't need to size or manage a request thread pool - the overall maintenance cost stays low. This benefit compounds if we were to add other hosted git providers, like GitLab or BitBucket, since each integration would not require tuning a dedicated thread pool.
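For illustration, here is a minimal sketch of what the concurrent variant could look like with a virtual-thread-per-task executor. GithubClient, GithubUser, GithubRepo, and Profile are placeholder names, not the actual types in this repo:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ConcurrentLookupSketch {

    // Placeholder types standing in for whatever the real adapter returns.
    record GithubUser(String login, String name) {}
    record GithubRepo(String name, String url) {}
    record Profile(GithubUser user, List<GithubRepo> repos) {}

    // Hypothetical client with one method per GitHub endpoint used by this service.
    interface GithubClient {
        GithubUser getUser(String username);          // GET /users/{username}
        List<GithubRepo> getRepos(String username);   // GET /users/{username}/repos
    }

    Profile lookupProfile(GithubClient client, String username) throws Exception {
        // One cheap virtual thread per task: both GitHub calls are in flight at once.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            Future<GithubUser> user = executor.submit(() -> client.getUser(username));
            Future<List<GithubRepo>> repos = executor.submit(() -> client.getRepos(username));
            // Blocking get() only parks the calling virtual thread, not an OS thread.
            return new Profile(user.get(), repos.get());
        }
    }
}
```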

Connection Pooling

In order to get the most out of the virtual threads, requests to GitHub need to go through a connection pool. Unauthenticated requests are limited to 60 per hour (see GitHub's rate limit docs), so the connection pool for this take-home doesn't need to be large - I went with 10 for this single instance.

In production, or if we were to add a personal access token, which increases the limit to 5000 per hour, we'd want to tune the connection pool based on the expected requests per second and SLAs. With 10 connections and ~500 ms per GitHub request, we'd exhaust the 5000 request/hour quota in about 4 minutes of continuous requests for non-cached usernames.

500 ms/call × 2 calls/username = 1 s/username, i.e. 1 username/s per connection
10 connections => 10 usernames/s
5000 calls/hour ÷ 2 calls/username = 2500 usernames/hour
2500 usernames ÷ 10 usernames/s = 250 s ≈ 4.16 minutes

This is significantly below the one-hour reset period, suggesting we'd need to investigate a GitHub Enterprise account for higher throughput. Capping the connection pool at 10 lets us burst up to ten usernames at once, which seems reasonable for this project. Increasing past 10 would waste resources since the pool would quickly sit idle due to the rate limit.

Connecting this back to the previous section on virtual threads: we can accept more concurrent inbound requests than we can forward to GitHub. Those requests queue up while waiting for a connection from the pool, so GitHub's rate limiting becomes the system's bottleneck.
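As a sketch, here is one way the pooled client could be wired, assuming Apache HttpClient 5 sits underneath Spring's RestClient; the request factory in this project may be configured differently:

```java
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.client5.http.impl.io.PoolingHttpClientConnectionManager;
import org.apache.hc.client5.http.impl.io.PoolingHttpClientConnectionManagerBuilder;
import org.springframework.http.client.HttpComponentsClientHttpRequestFactory;
import org.springframework.web.client.RestClient;

class PooledGithubClientSketch {

    RestClient gitHubRestClient() {
        // Cap the pool at 10 connections to match the sizing discussion above.
        PoolingHttpClientConnectionManager pool = PoolingHttpClientConnectionManagerBuilder.create()
                .setMaxConnTotal(10)
                .setMaxConnPerRoute(10) // every request goes to api.github.com, i.e. a single route
                .build();
        CloseableHttpClient httpClient = HttpClients.custom()
                .setConnectionManager(pool)
                .build();
        return RestClient.builder()
                .baseUrl("https://api.github.com")
                .requestFactory(new HttpComponentsClientHttpRequestFactory(httpClient))
                .build();
    }
}
```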

Spring RestClient

I stuck with standard Spring dependencies, namely RestClient. This made in-memory caching, serialization, and deserialization straightforward to set up.

For systems requiring deployment flexibility (CLI tools, Lambda functions), a standalone HTTP client would avoid the Spring Web dependency. I've used Retrofit + OkHttp for banking integrations without issue.

I opted for RestClient over WebFlux because virtual threads provide sufficient concurrency for this I/O-bound workload. WebFlux requires reactive types throughout your contracts (ex. Mono, Flux) and reactive versions of all of your clients. Reactive clients (ex. WebFlux, R2DBC) bring more operational complexity than throughput benefit for this problem. The juice isn't worth the squeeze.
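As an example of why the setup is straightforward, a typical RestClient call for the /users/{username} endpoint looks roughly like this. GithubUserResponse is an illustrative record, not the repo's actual DTO:

```java
import org.springframework.web.client.RestClient;

class RestClientUsageSketch {

    // Hypothetical DTO; field names follow GitHub's JSON, but this is not the repo's actual class.
    record GithubUserResponse(String login, String name) {}

    GithubUserResponse fetchUser(RestClient restClient, String username) {
        // RestClient handles the HTTP call and Jackson deserialization in one chain.
        return restClient.get()
                .uri("/users/{username}", username)
                .retrieve()
                .body(GithubUserResponse.class);
    }
}
```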

Code Organization

For simplicity, I kept this as a single-module project.

$ tree -d src/main/java/xyz/sanfordtech/branch/
src/main/java/xyz/sanfordtech/branch/
├── common_errors
├── github_adapter
├── profile_api
│   ├── dtos
│   └── errors
├── profile_service
└── web
    └── profile

The profile_api module prevents a cyclic dependency between github_adapter and profile_service if we moved to a multi-module Gradle project. If we eventually added support for other git hosting services, like GitLab and BitBucket, each of those services would need to:

  • implement the ProfileAdapter defined in profile_api
  • add persistence to the profile_service to keep track of supported platforms per username.

A multi-module Gradle project would be over-engineering here, but Gradle's module-level build caching makes that structure appealing for larger projects.
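A rough sketch of what that adapter contract might look like; the real ProfileAdapter in profile_api may use different names and return types:

```java
import java.util.Optional;

// Rough sketch only: the real ProfileAdapter in profile_api may differ.
public interface ProfileAdapter {

    // Placeholder DTO standing in for whatever profile_api actually exposes.
    record Profile(String username, String displayName) {}

    // Which hosting service this adapter integrates, e.g. "github" or "gitlab".
    String platform();

    // Look up a public profile; empty if the username does not exist on that platform.
    Optional<Profile> lookupProfile(String username);
}
```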

Username Validation

I based the username validation logic on the Enterprise Administrator username documentation.

The key requirements are that usernames:

  • contain at least one character
  • are no more than 39 characters and
  • only contain alphanumeric characters and dashes

Performing these checks before making an HTTP call to GitHub helps conserve our meager request limit.
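A sketch of that pre-flight check, implementing exactly the rules listed above; the actual validation code in this repo may differ:

```java
import java.util.regex.Pattern;

final class UsernameValidator {

    // 1 to 39 characters, alphanumeric or dash, per the requirements listed above.
    private static final Pattern GITHUB_USERNAME = Pattern.compile("^[A-Za-z0-9-]{1,39}$");

    static boolean isValid(String username) {
        return username != null && GITHUB_USERNAME.matcher(username).matches();
    }
}
```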

If the core modules (ex. profile_service, github_adapter) were going to be used outside of this Spring application, the validation would also need to be performed in those modules.

Caching

This project uses Caffeine as an in-memory cache to reduce calls to GitHub's rate-limited API. The cache is configured with Spring's @Cacheable and applied to the lookupProfile method in GithubAdapter.java.

I picked it because it was simple to wire into Spring and allowed me to configure a TTL.

Configuration

The cache is configured in application.properties with the following settings:

  • TTL: 12 hours (gh.adapter.cache-ttl-hrs)
  • Maximum Size: 10,000 entries (gh.adapter.cache-size)
  • Statistics: Enabled via recordStats() for monitoring cache performance

These defaults balance freshness with API rate limit conservation. A user's profile is unlikely to change within a 12-hour window, and 10,000 entries should accommodate a reasonable working set for this service.
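As a sketch, the CacheManager wiring for these properties could look like the following; the real GithubAdapterConfig may differ in structure and cache names, and the @Cacheable annotation on lookupProfile would reference the same cache name:

```java
import java.time.Duration;

import com.github.benmanes.caffeine.cache.Caffeine;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.cache.CacheManager;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.cache.caffeine.CaffeineCacheManager;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Sketch of how the CacheManager could be built from the properties above.
@Configuration
@EnableCaching
class CacheConfigSketch {

    @Bean
    CacheManager cacheManager(
            @Value("${gh.adapter.cache-ttl-hrs}") long ttlHours,
            @Value("${gh.adapter.cache-size}") long maxSize) {
        CaffeineCacheManager manager = new CaffeineCacheManager("profiles"); // cache name is illustrative
        manager.setCaffeine(Caffeine.newBuilder()
                .expireAfterWrite(Duration.ofHours(ttlHours))
                .maximumSize(maxSize)
                .recordStats());
        return manager;
    }
}
```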

Limitations

The current implementation has two key limitations:

  1. No Persistence: The cache exists only in memory and is lost on service restart. All cached data must be re-fetched after deployment or crashes.
  2. No Distribution: Each service instance maintains its own cache. Multiple instances cannot share cached data, meaning the same GitHub profile could be fetched multiple times across different instances, wasting rate limit quota.

Production Considerations

For production deployment, especially with multiple service instances, a distributed cache like Redis or Memcached would be recommended. This would:

  • Share cached data across all service instances
  • Persist cache data across service restarts
  • Provide better utilization of GitHub's rate limit quota
  • Support cache invalidation strategies (e.g., webhook-triggered updates when GitHub profiles change)

The abstraction provided by Spring's caching annotations makes swapping to Redis straightforward: the CacheManager bean in GithubAdapterConfig would need to be reconfigured for the Redis client, and the appropriate dependencies added. The bigger lift is operating and maintaining Redis in your environment.
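As a sketch, the Redis-backed replacement could look like the following, assuming spring-boot-starter-data-redis is on the classpath; this is not the project's actual configuration:

```java
import java.time.Duration;

import org.springframework.cache.CacheManager;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.cache.RedisCacheConfiguration;
import org.springframework.data.redis.cache.RedisCacheManager;
import org.springframework.data.redis.connection.RedisConnectionFactory;

// Sketch of swapping the in-memory cache for Redis while keeping @Cacheable untouched.
@Configuration
class RedisCacheSketch {

    @Bean
    CacheManager cacheManager(RedisConnectionFactory connectionFactory) {
        RedisCacheConfiguration defaults = RedisCacheConfiguration.defaultCacheConfig()
                .entryTtl(Duration.ofHours(12)); // mirror the in-memory TTL above
        return RedisCacheManager.builder(connectionFactory)
                .cacheDefaults(defaults)
                .build();
    }
}
```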

Observability

This project would benefit from configuring Spring Boot Actuator for monitoring and observability. The cache is already set up to expose its statistics through Actuator's metrics endpoint. Care needs to be taken to ensure sensitive details are not exposed in production.

Production Setup

For production deployments, consider:

  • Enabling only necessary Actuator endpoints and securing them with authentication
  • Integrating with Prometheus + Grafana for metrics visualization
  • Setting up alerts on key metrics:
    • High cache miss rates (may indicate TTL is too short)
    • GitHub API rate limit exhaustion
    • Elevated error rates or latencies
  • Adding distributed tracing (e.g., OpenTelemetry, Zipkin) to track request flows across service boundaries
  • Structured logging with correlation IDs for request tracing
