At Sumo Logic, we manage petabytes of unstructured log data as part of our core log search and analytics offering. Multiple terabytes of data are indexed every day and stored persistently in AWS S3. When a query is executed against this data via the UI, the API, a scheduled search or pre-installed apps, the indexed files are retrieved from S3 and held in a custom read-through cache for S3 objects.
For the most part, this caching scheme works reasonably well. This is primarily due to the implicit locality of reference of log search queries, which tend to have a healthy bias towards recently ingested data. This access pattern lends itself to simple cost optimization based on the judicious use of S3 storage classes.
S3 Storage Classes
Every object stored in S3 belongs to a particular storage class. S3 storage classes come primarily in two variants1: S3 Standard and S3 Standard-Infrequent Access (S3 Standard-IA).
Since use-cases and requirements vary across customers and “one size doesn’t fit all”, these storage classes provide the levers to manage the trade-offs for specific customer scenarios.
While S3 Standard is suitable for frequently accessed data, S3 Standard-IA, as the name implies, is more cost-effective for long-lived data that is rarely accessed. This follows from the pricing: S3 Standard-IA charges less2 than S3 Standard for per-GB, per-month storage, but it also charges a rather hefty premium on data access. In fact, the cost of accessing a GB of data just twice from S3 Standard-IA is approximately equal to the storage cost per-GB of S3 Standard for an entire month. This data retrieval cost doesn’t apply to objects residing in S3 Standard. Therefore, it makes sense to use S3 Standard-IA only for “truly” infrequently accessed data3.
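The break-even arithmetic behind this trade-off can be checked with a short sketch. The prices below are assumptions for illustration (they resemble list prices published for one US region at one point in time) and are not taken from this post; substitute the current rates for your region.

```python
# Assumed list prices in USD; illustrative only, not figures from the post.
STANDARD_STORAGE_PER_GB_MONTH = 0.023
IA_STORAGE_PER_GB_MONTH = 0.0125
IA_RETRIEVAL_PER_GB = 0.01


def monthly_cost_per_gb(storage_class: str, retrievals_per_month: int) -> float:
    """Storage plus retrieval cost for one GB held for a full month."""
    if storage_class == "STANDARD":
        return STANDARD_STORAGE_PER_GB_MONTH  # no per-GB retrieval charge
    if storage_class == "STANDARD_IA":
        return IA_STORAGE_PER_GB_MONTH + retrievals_per_month * IA_RETRIEVAL_PER_GB
    raise ValueError(f"unknown storage class: {storage_class}")


def break_even_retrievals() -> float:
    """Retrievals per month above which Standard-IA is no longer cheaper."""
    storage_saving = STANDARD_STORAGE_PER_GB_MONTH - IA_STORAGE_PER_GB_MONTH
    return storage_saving / IA_RETRIEVAL_PER_GB


if __name__ == "__main__":
    for n in range(4):
        ia = monthly_cost_per_gb("STANDARD_IA", n)
        std = monthly_cost_per_gb("STANDARD", n)
        print(f"{n} retrievals/month: Standard-IA ${ia:.4f} vs Standard ${std:.4f}")
    print(f"break-even at about {break_even_retrievals():.2f} retrievals/month")
```

Under these assumed prices, the break-even sits at roughly one full retrieval per month, which is why a handful of retrievals quickly erases the storage saving.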
Given the above semantics of S3 storage classes and data access patterns biased towards recent data in Sumo Logic, the following simple S3 object lifecycle management rule makes sense: Ingest fresh data in S3 Standard and once the age (time elapsed since object creation) of an S3 object crosses a predetermined threshold, transition objects to S3 Standard-IA. The threshold age for the transition should be selected in a manner such that objects older than the threshold age are rarely accessed. This ensures that the reduced per-GB, per-month storage cost of S3 Standard-IA more than compensates for the increased data retrieval cost.
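As a minimal sketch, such a rule could be expressed as an S3 lifecycle configuration. The rule ID, prefix, bucket name and the 45-day value are hypothetical placeholders, not the configuration used at Sumo Logic. The helper only builds the rule; applying it through boto3 is shown as a comment because it needs credentials and a real bucket.

```python
import json


def standard_to_ia_rule(threshold_days: int, prefix: str = "") -> dict:
    """Build one S3 lifecycle rule that moves objects to Standard-IA by age."""
    # S3 does not allow Standard -> Standard-IA transitions before 30 days.
    if threshold_days < 30:
        raise ValueError("S3 requires at least 30 days before a STANDARD_IA transition")
    return {
        "ID": f"standard-to-ia-after-{threshold_days}d",  # hypothetical rule name
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},  # empty prefix applies to the whole bucket
        "Transitions": [{"Days": threshold_days, "StorageClass": "STANDARD_IA"}],
    }


if __name__ == "__main__":
    # 45 days and the "index/" prefix are placeholder values.
    config = {"Rules": [standard_to_ia_rule(45, prefix="index/")]}
    print(json.dumps(config, indent=2))
    # With boto3 installed and credentials configured, this could be applied as:
    #   boto3.client("s3").put_bucket_lifecycle_configuration(
    #       Bucket="example-bucket", LifecycleConfiguration=config)
```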
It should be noted here that the economics of the above lifecycle management scheme depend crucially on one parameter: the threshold age for transition from S3 Standard to S3 Standard-Infrequent Access storage class. Choose a value that is too large and one forgoes the benefits of the low per-GB, per-month storage cost of S3 Standard-IA. Pick a value that is too small and one can end up draining a lot of cash to AWS through the per-GB data retrieval cost of S3 Standard-IA. Like most engineering problems, this one involves finding the sweet spot.
Optimizing the threshold age for transition from S3 Standard to S3 Standard-IA
At Sumo Logic, we dogfood our production application logs. A simple analysis in Sumo Logic log analytics helped us understand our access patterns and what had the biggest impact on costs. One of our most interesting findings was that many objects were retrieved several (three or more) times within the first few days after the S3 Standard-IA transition. In these cases, we lost much more on retrievals than we gained from lower storage fees4.
To find the optimal value for the threshold age parameter, one essentially needs two pieces of information:
1. A plot of the data retrieved (in TBs) from S3 by age (say, in days).
Fig. 1: A representative image for bytes_retrieved by age of data in days.
2. Average amount of data ingested per day in S3 Standard (assuming all fresh data goes into S3 Standard).
To construct (1), we used S3 server access logs and sampled Sumo Logic application logs. S3 server access logs are relatively cheap and include information about bytes retrieved in every S3 access. However, we could not infer the “age” of an accessed object from these logs alone: we had enabled S3 server access logs just a couple of weeks before the analysis, so creation records were missing for the most interesting (older) objects. Fortunately, we log the S3 object creation time in our application logs. We therefore joined the S3 server access logs (which contain S3 object sizes) with the sampled Sumo Logic application logs (which contain the S3 object creation timestamp) on S3 object ID to plot (1).
(2) is available from daily reported AWS CloudWatch metrics for S3.
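The join described above can be sketched in a few lines. The record shapes here (tuples of object ID, access time and bytes sent, and a dictionary of creation times) are assumptions for illustration; real S3 server access log lines and application log lines would need to be parsed into this form first.

```python
from collections import defaultdict
from datetime import datetime, timezone

SECONDS_PER_DAY = 86_400


def bytes_retrieved_by_age(access_records, creation_times):
    """Sum bytes retrieved per object age, in whole days since creation.

    access_records: iterable of (object_id, access_time, bytes_sent) tuples,
        for example parsed from S3 server access logs.
    creation_times: dict of object_id -> creation time, for example taken
        from sampled application logs.
    """
    totals = defaultdict(int)
    for object_id, access_time, bytes_sent in access_records:
        created = creation_times.get(object_id)
        if created is None:
            continue  # inner join: no creation record for this object
        age_days = int((access_time - created).total_seconds() // SECONDS_PER_DAY)
        if age_days >= 0:
            totals[age_days] += bytes_sent
    return dict(sorted(totals.items()))


if __name__ == "__main__":
    utc = timezone.utc
    created = {
        "obj-a": datetime(2020, 1, 1, tzinfo=utc),
        "obj-b": datetime(2020, 1, 10, tzinfo=utc),
    }
    accesses = [
        ("obj-a", datetime(2020, 1, 3, 12, tzinfo=utc), 100),
        ("obj-a", datetime(2020, 1, 31, tzinfo=utc), 50),
        ("obj-b", datetime(2020, 1, 10, 6, tzinfo=utc), 10),
        ("obj-c", datetime(2020, 1, 5, tzinfo=utc), 999),  # not in the sample
    ]
    print(bytes_retrieved_by_age(accesses, created))
```

Because the application logs are sampled, the totals cover only the sampled objects; scaling them up by the sampling rate is left out of this sketch.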
Let’s assume that the current value of threshold age for S3 Standard → S3 Standard-IA transition is T days. Now, consider a candidate for the optimal threshold age: T + d days.
In our work, we focused on estimating the difference between our current spending and the theoretical spending for different values of d. The choice of the parameter d affects the overall cost in two ways.
Storage cost
Since data now resides in S3 Standard for d more days (and for d fewer days in S3 Standard-IA), roughly d days’ worth of ingested data sits in S3 Standard instead of S3 Standard-IA at any given time. Storage cost therefore increases as follows:
Δ(Storage Cost per GB/month) = Storage Cost(S3 Standard) - Storage Cost(S3 Standard-IA)
Δ(Storage Cost per month) = Average Daily Ingest in GBs x d x Δ(Storage Cost per GB/month)
Data retrieval cost
When the threshold age for transition from S3 Standard to S3 Standard-IA is increased from T to T + d days, the data retrieval cost for S3 objects with age greater than T days but less than T + d days drops to zero, because S3 Standard, unlike S3 Standard-IA, does not charge for data retrieval.
In other words:
D_retrieved (in GBs) = Average data retrieved daily with T < age < T + d days
Δ(Data Retrieval Cost per month) = - (D_retrieved x 30 x (Data retrieval cost per GB))
If the decrease in data retrieval cost is greater than the increase in storage cost, then the net S3 cost will decrease.
Cost savings can be similarly calculated when the threshold age for transition is decreased by d days. By iterating over a range of candidate values for d, one can determine the value d_optimal that would result in maximum dollar savings.
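The iteration over candidate values of d can be sketched as follows. All names, prices and sample numbers are hypothetical. The sketch expresses both terms in dollars per month and assumes a steady state in which shifting the threshold by d days keeps about d days’ worth of ingest in S3 Standard rather than S3 Standard-IA. Ages are whole days since creation, so the affected objects are those with T <= age < T + d.

```python
def net_monthly_delta(d, threshold_days, retrieved_gb_per_day_by_age,
                      daily_ingest_gb, storage_delta_per_gb_month,
                      retrieval_cost_per_gb, days_per_month=30):
    """Estimated change in monthly S3 cost if the threshold moves by d days.

    A negative result means savings. d may be negative (earlier transition).
    """
    lo, hi = sorted((threshold_days, threshold_days + d))
    # Average GB retrieved per day from objects whose storage class changes.
    moved_gb_per_day = sum(gb for age, gb in retrieved_gb_per_day_by_age.items()
                           if lo <= age < hi)
    # d more days in Standard costs more storage; d fewer days costs less.
    storage_delta = daily_ingest_gb * d * storage_delta_per_gb_month
    # Later transition removes retrieval charges; earlier transition adds them.
    direction = 1 if d >= 0 else -1
    retrieval_delta = (-direction * moved_gb_per_day
                       * days_per_month * retrieval_cost_per_gb)
    return storage_delta + retrieval_delta


def best_shift(candidates, **model):
    """Return the candidate d with the lowest estimated monthly cost change."""
    return min(candidates, key=lambda d: net_monthly_delta(d, **model))


# Hypothetical inputs: current threshold, retrieval profile, ingest and prices.
EXAMPLE_MODEL = dict(
    threshold_days=45,
    retrieved_gb_per_day_by_age={44: 80.0, 45: 50.0, 46: 40.0, 47: 30.0,
                                 48: 5.0, 49: 1.0, 50: 0.5},
    daily_ingest_gb=100.0,
    storage_delta_per_gb_month=0.0105,
    retrieval_cost_per_gb=0.01,
)

if __name__ == "__main__":
    for d in range(-1, 7):
        print(f"d={d:+d}: {net_monthly_delta(d, **EXAMPLE_MODEL):+.2f} USD/month")
    print("best d:", best_shift(range(-1, 7), **EXAMPLE_MODEL))
```

A real sweep would also discard candidates that push the threshold below the 30-day minimum S3 imposes on transitions to S3 Standard-IA.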
Other Cost Components
Note that we have only considered data retrieval cost and storage cost in the above analysis as these are the dominating cost components for our use-case. The above analysis can be similarly extended for S3 API cost with a plot of number of S3 API requests by age.
Data transfer costs are not applicable to our use-case as the EC2 instances retrieving the data are co-located in the same region as the source S3 bucket.
Considering the petabyte scale at Sumo Logic and the fact that our S3 access patterns have changed since we started using S3 Standard-IA a few years ago, the tuning of the above parameter alone is expected to save a million dollars in S3 cost. As highlighted above, the data access pattern has an important role to play here. The key learning here is that voluminous, long-lived data, when retrieved from S3 Standard-IA, can play havoc with your AWS bills.
We should note here that our S3 access patterns have changed in the past and are likely to change in the future as well. Therefore, it is very important to evaluate the transition policy periodically. We cannot be sure that the current configuration will be optimal in, say, twelve months.
But in the long run, there’s still time to change the road you are on
S3 Intelligent-Tiering is another appealing alternative to the combination of S3 Standard and S3 Standard-IA. It automatically moves objects to the right storage tier based on the data access pattern, without any explicit lifecycle configuration. Specifically, new objects are kept in the Frequent Access Tier and are moved to the Infrequent Access Tier if not accessed for 30 days. Objects in the Infrequent Access Tier are moved back to the Frequent Access Tier when accessed.
The above semantics imply that S3 Intelligent-Tiering could be a very good choice for long-lived data with rather unpredictable access patterns. This is because S3 Intelligent-Tiering, unlike S3 Standard-IA, does not have any data retrieval charges. Any object accessed from the Infrequent Access Tier is moved to the Frequent Access Tier and charged for a minimum of 30 days. Even then, this extra monthly per GB storage cost of Frequent Access Tier is still less than the per GB data retrieval cost of S3 Standard-IA. There is a monitoring fee per object, but with decent sized objects (few MBs or more), it is rather insignificant. Of course, S3 Standard-IA will still be a better option for data that is known to be rarely accessed (because of lower storage costs from day zero). But in more uncertain scenarios, S3 Intelligent-Tiering could be a reasonable substitute for a lifecycle configuration based on more sophisticated analysis.
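To get a feel for how the per-object monitoring fee scales with object size, here is a rough sketch. Both prices are assumptions for illustration and are not taken from this post; check the current price list for your region.

```python
# Assumed list prices in USD; illustrative only, not figures from the post.
MONITORING_FEE_PER_1000_OBJECTS_MONTH = 0.0025
FREQUENT_TIER_STORAGE_PER_GB_MONTH = 0.023


def monitoring_overhead_ratio(object_size_mb: float) -> float:
    """Monitoring fee as a fraction of one object's monthly storage cost
    in the Frequent Access Tier."""
    storage_cost = (object_size_mb / 1024) * FREQUENT_TIER_STORAGE_PER_GB_MONTH
    monitoring_cost = MONITORING_FEE_PER_1000_OBJECTS_MONTH / 1000
    return monitoring_cost / storage_cost


if __name__ == "__main__":
    for size_mb in (0.25, 1, 5, 50):
        ratio = monitoring_overhead_ratio(size_mb)
        print(f"{size_mb} MB object: monitoring fee is {ratio:.1%} of storage cost")
```

Under these assumed prices the fee is a few percent of the storage cost for objects of a few MBs, and it grows quickly as objects get smaller.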
Given the above background, while it is possible to consider Intelligent-Tiering for future projects, rearchitecting current solutions to use it was not an option for us. In addition, we need more granular control over storage policies than the 30-day blanket rule used in S3 Intelligent-Tiering.
When performing a cost-saving analysis like the one above, one may also consider using the AWS S3 storage class analysis tool. However, its cost is proportional to the number of S3 objects monitored and it starts producing results only 30 days after it is enabled. For our use-case, it was simpler, quicker and more cost-effective to do the analysis in-house. In addition, the analysis can be easily extended for more fine-grained cost optimizations. For example, instead of a single global optimal transition age, one can determine the optimal transition threshold age for each group of objects based on some object attribute(s).
1Other storage classes include S3 Glacier, S3 Glacier Deep Archive and S3 One Zone-IA. However, we will stick to S3 Standard and S3 Standard-IA for the purpose of this post as they have more stringent latency and/or availability SLAs in tune with Sumo Logic’s log search requirements. Another storage class, S3 Intelligent-Tiering, isn’t suitable for our use-case, as discussed in a subsequent section.
2Actual cost savings may vary due to cross-region pricing difference which can be significant.
3Additional considerations: The minimum billable storage period for S3 Standard-IA is 30 days, so moving objects to S3 Standard-IA and removing them after a day or two makes no sense. In addition, S3 Standard-IA may not be suitable for very small objects as objects are charged for a minimum of 128 KB anyway.
4If an object is retrieved once a month, S3 Standard and S3 Standard-IA costs are very similar. But if an object is retrieved three times a month, storing it in S3 Standard-IA will be approximately twice as expensive as storing it in S3 Standard!