In this blog series, we will cover how Amazon Redshift and Sumo Logic deliver best-in-class data storage, processing, analytics, and monitoring. In this first post, we will discuss how Amazon Redshift works and why it is the fastest-growing cloud data warehouse in the market, used by over 15,000 customers around the world.
When an organization gains traction, the volume of data that needs to be stored, monitored, and analyzed grows rapidly. On traditional data warehouses, queries start taking longer and the data becomes difficult to manage.
With the rise of cloud computing, the need for warehousing solutions that can scale with the increasing demands of data storage and analysis has become apparent, and organizations are looking for alternatives to traditional on-premise warehousing.
AWS’s Amazon Redshift is a direct response to this demand.
What is Amazon Redshift?
Amazon Redshift is a fully managed, petabyte-scale, cloud-based data warehouse product designed for storing and analyzing large data sets. It is also used to perform large-scale database migrations.
Redshift’s column-oriented database is designed to connect to SQL-based clients and business intelligence tools, making data available to users in real time. Based on PostgreSQL 8, Redshift delivers fast performance and efficient querying that help teams make sound business analyses and decisions.
Each Amazon Redshift data warehouse contains a collection of computing resources (nodes) organized in a cluster. Each Redshift cluster runs its own Redshift engine and contains at least one database.
Amazon Redshift vs Traditional Data Warehouses
Amazon Redshift is a direct alternative to traditional on-premise data warehouses. Let’s look at how Redshift stacks up against traditional warehousing in four areas: performance, cost, scalability, and security.
Amazon Redshift is best known for its speed. It delivers fast query performance on large data sets, handling data sizes of a petabyte or more. Query speeds at this scale are very hard to attain with traditional data warehousing, which makes Redshift a strong choice for applications that run large numbers of queries on demand.
This level of performance comes from two architectural elements: columnar data storage and a massively parallel processing (MPP) design. We will delve deeper into both later.
Amazon Redshift is markedly faster than traditional warehousing, but when it comes to choosing tech solutions, organizations are arguably most concerned about cost.
As a cloud-based solution, Amazon Redshift is able to provide high-level performance affordably. IT executives know that traditional warehousing is extremely costly from the beginning, with the initial outlay for hardware potentially running into millions of dollars. By contrast, there are no substantial upfront costs to getting set up and started with Redshift. Because it is a fully managed solution, Redshift carries no recurring hardware and maintenance costs. Database admins can set up data warehouses that handle massive amounts of data without going through the lengthy procurement process and strategic buy-in from leadership that multi-million-dollar on-premise hardware requires.
Traditional on-premise data warehousing poses a challenge when your data needs increase or decrease.
When an organization’s data needs change, it is forced to make another round of costly investments in new hardware purchases and implementation.
Redshift allows for more flexibility and elastic scaling. As your requirements change, you can scale a cluster up or down to match your capacity and performance needs with a few clicks in the management console.
Cost-wise, on-demand pricing ensures you only pay for what you use. Not being tied down to expensive hardware and lengthy maintenance contracts means organizations are free to change course without absorbing sunk costs. From a single 160GB DC1.Large node all the way up to multiple 16TB DS2.8XLarge nodes for a petabyte or more of data, you have access to processing power on demand.
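As a rough, back-of-the-envelope illustration of the node sizes mentioned above, the sketch below estimates how many nodes of a given capacity would be needed to hold a target volume of data. It assumes decimal units (1 PB = 1,000 TB) and ignores compression, replication, and working space, so the result is an order-of-magnitude figure rather than sizing guidance. The function and variable names are made up for this example.

```python
import math

# Per-node storage figures quoted in the text, in terabytes
# (decimal units are an assumption of this sketch).
NODE_CAPACITY_TB = {
    "DC1.Large": 0.16,
    "DS2.8XLarge": 16.0,
}


def nodes_needed(data_tb: float, node_type: str) -> int:
    """Smallest whole number of nodes whose raw capacity covers data_tb."""
    return math.ceil(data_tb / NODE_CAPACITY_TB[node_type])


petabyte_in_tb = 1000  # assumption: 1 PB = 1,000 TB
# 1000 / 16 = 62.5, which rounds up to 63 nodes
print(nodes_needed(petabyte_in_tb, "DS2.8XLarge"))
```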
Although Amazon Redshift outperforms traditional warehousing in the areas above, security remains the sticking point for many enterprises. This is not because of known security vulnerabilities; the reality is that some still feel uneasy about not having their data physically on site.
Amazon, for its part, treats security as a top concern, knowing it is a decisive factor when organizations choose a warehousing solution.
Amazon follows the shared responsibility model of security where Amazon is responsible for the security of the cloud, and the organization is responsible for security in the cloud.
- Security of the cloud: AWS protects the infrastructure on which AWS services run in the cloud. It is responsible for providing features and services that users can use securely. AWS also ensures that security levels are regularly tested and verified as part of AWS compliance.
- Security in the cloud: The security responsibility of organizations using Redshift is determined by the AWS service they use. Organizations are also responsible for other factors such as the sensitivity of their data, their own internal requirements, and compliance with laws and regulations.
Amazon Redshift inherits most of the security features of the larger Amazon Web Services platform. Credentials and access are granted and managed at the AWS level through Identity and Access Management (IAM). Cluster security groups are created and associated with clusters to control inbound access. For organizations that use a private cloud, access through a Virtual Private Cloud (VPC) environment is available as well. Data encryption can be enabled at cluster creation, and a cluster cannot be switched from encrypted to unencrypted directly.
For data in transit, Redshift uses SSL encryption to communicate with S3 or Amazon DynamoDB for COPY, UNLOAD, backup, and restore operations.
Amazon Redshift Performance
As mentioned above, Amazon Redshift is able to deliver performance with best-in-class speed due to the use of two main architectural elements: Massively Parallel Processing (MPP) design and columnar data storage. Let’s look at each one and see how they enable fast processing in Redshift.
Massively Parallel Processing (MPP) Explained
Redshift’s Massively Parallel Processing (MPP) design automatically distributes workload evenly across multiple nodes in each cluster, enabling speedy processing of even the most complex queries operating on massive amounts of data. Multiple nodes share the processing of all SQL operations in parallel, leading up to final result aggregation. Users can optimize the distribution of data by locating the data where it needs to be before the query is executed. This is done by choosing the appropriate distribution style, minimizing the impact of the redistribution step.
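The idea can be illustrated with a small simulation. This is a minimal sketch, not Redshift's actual implementation: rows are assigned to nodes by hashing a distribution key, each node computes a partial aggregate in parallel, and the partial results are merged at the end. The node count, the use of `crc32` as the hash, and all function names are assumptions made for this example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32

NUM_NODES = 4  # hypothetical cluster size


def node_for(key: str, num_nodes: int = NUM_NODES) -> int:
    """Map a distribution-key value to a node using a stable hash."""
    return crc32(key.encode("utf-8")) % num_nodes


def distribute(rows, key_field, num_nodes=NUM_NODES):
    """Place each row on the node chosen by its distribution key."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[node_for(row[key_field], num_nodes)].append(row)
    return nodes


def local_sum(node_rows):
    """Per-node work: a partial SUM(amount) grouped by customer."""
    partial = defaultdict(int)
    for row in node_rows:
        partial[row["customer"]] += row["amount"]
    return dict(partial)


def parallel_sum(nodes):
    """Run the per-node step in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(local_sum, nodes))
    total = defaultdict(int)
    for partial in partials:
        for customer, amount in partial.items():
            total[customer] += amount
    return dict(total)


rows = [
    {"customer": "acme", "amount": 10},
    {"customer": "globex", "amount": 5},
    {"customer": "acme", "amount": 7},
    {"customer": "initech", "amount": 3},
]
nodes = distribute(rows, key_field="customer")
totals = parallel_sum(nodes)
print(totals)
```

Because the rows are distributed on the same column the query groups by, every row for a given customer already sits on one node, so no data has to move between nodes before the final merge. That is the effect a well-chosen distribution style aims for.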
Columnar Data Storage Explained
By using columnar storage for database tables, Amazon Redshift reduces the disk I/O requirements, contributing to the optimization of analytic query performance. When database table information is stored in a columnar fashion, the number of disk I/O requests and the amount of data needed to be loaded from disk are reduced. When less data is loaded into memory, Redshift can perform more in-memory processing for executed queries. The amount of time needed to perform a query is reduced using this method compared to when data is stored by row.
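A toy comparison makes the I/O saving concrete. The sketch below is a simplification under stated assumptions: it counts values touched rather than disk blocks, ignores compression, and uses an invented `orders` data set and hypothetical helper names. It shows only why a query that needs one column reads less data from a column layout than from a row layout.

```python
# A tiny, invented data set: 3 rows with 4 fields each.
rows = [
    {"order_id": 1, "customer": "acme", "region": "us-east", "amount": 120},
    {"order_id": 2, "customer": "globex", "region": "eu-west", "amount": 80},
    {"order_id": 3, "customer": "acme", "region": "us-east", "amount": 45},
]


def to_columns(rows):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def values_read_row_store(rows):
    """A row layout stores whole records together, so a scan touches every field."""
    return sum(len(row) for row in rows)


def values_read_column_store(columns, wanted):
    """A column layout lets a scan touch only the columns the query asks for."""
    return sum(len(columns[name]) for name in wanted)


columns = to_columns(rows)

# Equivalent of: SELECT SUM(amount) FROM orders
total = sum(columns["amount"])
print(total)
print(values_read_row_store(rows), "values touched in the row layout")
print(values_read_column_store(columns, ["amount"]), "values touched in the column layout")
```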
How Do I Set Up Amazon Redshift?
Completing the prerequisites
- AWS Account
To get started with Amazon Redshift, you need an AWS account. You may start with a free trial if you don’t already have an account.
- Open firewall port
You also need an open port that Redshift can use. By default, Redshift uses port 5439, but the connection will not work if that port is not open in your firewall. Either make sure that port is open, or identify another open port in your firewall and enter it when you create the cluster. The port number cannot be changed once the cluster has been created.
- Permission to access other AWS resources
To access resources in another AWS service such as Amazon S3, the Redshift cluster you’re about to create needs the necessary access permissions. Those permissions can be provided in two ways:
- Providing the AWS access key of an IAM user that has the necessary permissions
- Creating a dedicated IAM role that is attached to the Redshift cluster (recommended)
You can create an IAM role by following these instructions from AWS.
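For the recommended IAM role option, the role needs a trust policy that lets the Redshift service assume it. The sketch below only builds and prints that policy document; it does not call AWS. The function name is made up for this example, and the permissions the role grants (for instance, read access to S3) would be attached to the role separately.

```python
import json


def redshift_trust_policy() -> dict:
    """Trust policy allowing the Redshift service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "redshift.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Serialize to the JSON form expected when the role is created.
print(json.dumps(redshift_trust_policy(), indent=2))
```

If you script role creation instead of using the console, this JSON string is what you would pass as the role's assume-role policy document.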
Launching a Redshift cluster
After completing the prerequisites, you’re ready to launch a Redshift cluster.
- Step 1: While logged in as a user with the necessary permissions to perform cluster operations, open the Amazon Redshift console.
- Step 2: Select the region in which you want to create the cluster.
- Step 3: Choose Quick Launch Cluster and enter the following values. These are default values for those wanting to explore Redshift while incurring minimal charges. If you already have specific values in mind for your use case, replace these values with those.
- Node type: dc2.large.
- Number of compute nodes: 2.
- Cluster identifier: examplecluster.
- Master user name: awsuser.
- Master user password and Confirm password: Enter a password for the master user account.
- Database port: 5439.
- Available IAM roles: Choose myRedshiftRole.
- Step 4: Click Launch Cluster and wait a few minutes for the launch to finish. When done, click Close to return to the list of clusters. The cluster you just launched should be listed there. Check that Cluster Status says available, and Database Health says healthy.
- Step 5: Choose the cluster you just launched. Click the Cluster button just above the list then click on Modify cluster. In the dialog box that appears, choose the VPC security groups you want to associate with this cluster then click Modify to save the association.
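The console steps above can also be scripted. The following is a minimal sketch, assuming the AWS SDK for Python (boto3) and its Redshift `create_cluster` operation. The helper names are hypothetical, the values mirror the Quick Launch defaults listed above, and the role ARN uses a placeholder account ID. The code that actually contacts AWS is left commented out so the block runs without credentials.

```python
def build_cluster_params(password: str, role_arn: str) -> dict:
    """Collect the Quick Launch values as keyword arguments for create_cluster."""
    return {
        "ClusterIdentifier": "examplecluster",
        "ClusterType": "multi-node",
        "NodeType": "dc2.large",
        "NumberOfNodes": 2,
        "MasterUsername": "awsuser",
        "MasterUserPassword": password,
        "Port": 5439,
        "IamRoles": [role_arn],
    }


def launch_cluster(client, params: dict):
    """Ask the given Redshift client to create the cluster."""
    return client.create_cluster(**params)


# Placeholder values for illustration only.
params = build_cluster_params(
    password="Replace-Me-123",
    role_arn="arn:aws:iam::123456789012:role/myRedshiftRole",
)

# With boto3 installed and credentials configured, the call would look like:
#   import boto3
#   client = boto3.client("redshift", region_name="us-east-1")
#   launch_cluster(client, params)
```

Keeping the parameters in a plain dictionary makes them easy to review or adjust before anything billable is created.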
Authorizing access to the cluster
With the cluster launched, the next step is to configure a security group that authorizes access so that you can connect to it. If the cluster is launched in the EC2-VPC platform, follow these instructions from AWS.
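Once the security group is in place, a quick reachability check can confirm that the cluster's port (5439 by default) is open from your machine, both in your own firewall and in the security group. This is a minimal sketch using only the Python standard library. The function name is made up for this example, and the host you pass would be your own cluster's endpoint.

```python
import socket


def port_is_reachable(host: str, port: int = 5439, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

Calling `port_is_reachable("<your-cluster-endpoint>", 5439)` returns `False` when a firewall or security group rule blocks the connection. A `True` result only shows that the TCP port is open, not that your database credentials are valid.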
Connecting to the cluster and running queries
Now that you have launched a cluster, you may connect to it and start running queries. Running queries can be done in two ways:
- Connect to your cluster from the AWS Management Console using the AWS Query Editor.
- Connect to your cluster through a SQL client tool like SQL Workbench/J.
At this point, your Redshift cluster is ready to use. You can create tables in the database, upload data to the tables, and try running queries. These activities can be done through the AWS Query Editor or through a SQL client tool of your choice.
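To give a feel for that first session, here is a minimal sketch of the create, load, and query sequence through a Python database cursor. It uses the standard library's `sqlite3` as a local stand-in so the example runs anywhere; it is not a Redshift connection. Against a real cluster you would get the connection from a PostgreSQL-compatible driver or use the Query Editor, the placeholder style may differ from the `?` used here, and bulk loads would normally use COPY rather than row inserts. The table and function names are invented for illustration.

```python
import sqlite3


def run_demo(conn):
    """Create a table, load a few rows, and run an aggregate query."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE sales (customer VARCHAR(32), amount INTEGER)")
    cur.executemany(
        "INSERT INTO sales (customer, amount) VALUES (?, ?)",
        [("acme", 10), ("globex", 5), ("acme", 7)],
    )
    cur.execute(
        "SELECT customer, SUM(amount) AS total "
        "FROM sales GROUP BY customer ORDER BY customer"
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")  # local stand-in, not a Redshift cluster
result = run_demo(conn)
conn.close()
print(result)
```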
How to Monitor Amazon Redshift
Now you know how Amazon Redshift works and why it’s fast and efficient. Still, the best way to know for sure is to monitor its performance yourself. In the next posts in this series, we will take a deep dive into how to analyze Redshift queries and how to monitor Amazon Redshift performance with Sumo Logic. Stay tuned.