Amazon Web Services
The size and complexity of the data that needs to be analyzed today means that the technology and approaches that worked in the past no longer work. To get the most value from your data, AWS provides the most comprehensive, secure, scalable, and cost-effective portfolio of services, enabling you to build your data lake in the cloud and analyze all of your data, including data from IoT devices, with a variety of analytical approaches, including machine learning. More organizations run their data lakes and analytics on AWS than anywhere else, with customers such as NASDAQ, Zillow, Yelp, iRobot, and FINRA trusting AWS to run their business-critical analytics workloads.
Data Movement
The first step to building data lakes on AWS is to move data to the cloud. The physical limitations of bandwidth and transfer speeds restrict the ability to move data without major disruption, high cost, and long lead times. To make data transfer easy and flexible, AWS provides the widest range of options for moving data to the cloud.
On-premises data movement
AWS provides multiple ways to move data from your datacenter to AWS. To establish a dedicated network connection between your network and AWS, you can use AWS Direct Connect. To move petabytes to exabytes of data to AWS using physical appliances, you can use AWS Snowball and AWS Snowmobile. To have your on-premises applications store data directly into AWS, you can use AWS Storage Gateway.
Real-time data movement
AWS provides multiple ways to ingest real-time data generated from new sources such as websites, mobile apps, and internet-connected devices. To make it simple to capture and load streaming data or IoT device data, you can use Amazon Kinesis Data Firehose, Amazon Kinesis Video Streams, and AWS IoT Core.
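As a rough sketch of how a producer can feed streaming data into the data lake, the snippet below sends one telemetry record to Kinesis Data Firehose with boto3; the delivery stream name is a placeholder and the stream is assumed to already exist with a destination such as S3 configured.

```python
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Example payload from an internet-connected device (illustrative values).
record = {"device_id": "sensor-42", "temperature": 21.7}

# PutRecord hands the payload to the delivery stream, which batches and
# delivers it to the stream's configured destination (for example, S3).
firehose.put_record(
    DeliveryStreamName="iot-telemetry-stream",  # hypothetical stream name
    Record={"Data": json.dumps(record).encode("utf-8") + b"\n"},
)
```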
Data Lake
Once data is ready for the cloud, AWS makes it easy to store data in any format, securely and at massive scale, with Amazon S3 and Amazon Glacier. To make it easy for end users to discover the relevant data to use in their analysis, AWS Glue automatically creates a single catalog that is searchable and queryable by users.
Object Storage - Amazon S3
Amazon S3 is secure, highly scalable, durable object storage with millisecond latency for data access. S3 is built to store any type of data from anywhere: websites and mobile apps, corporate applications, and data from IoT sensors or devices. It can store and retrieve any amount of data with unmatched availability, and it is built from the ground up to deliver 99.999999999% (11 nines) of durability. S3 provides comprehensive security and compliance capabilities that meet even the most stringent regulatory requirements.
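A minimal illustration of the object storage model, using boto3 with placeholder bucket and key names: upload a file as an object, then read it back by key.

```python
import boto3

s3 = boto3.client("s3")

# Bucket and key names are placeholders for this sketch.
bucket = "example-data-lake-bucket"
key = "raw/clickstream/2018/06/01/part-0000.json"

# Upload a local file as an object; S3 stores it durably and serves it by key.
s3.upload_file("clickstream-2018-06-01.json", bucket, key)

# Retrieve it later (or from another application) with a simple GET.
obj = s3.get_object(Bucket=bucket, Key=key)
body = obj["Body"].read()
```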
Backup and Archive - Amazon Glacier
Amazon Glacier is secure, durable, and extremely low-cost storage for long-term backup and archive, from which data can be retrieved in minutes. It is designed to deliver 99.999999999% (11 nines) of durability, and it provides comprehensive security and compliance capabilities that can help meet even the most stringent regulatory requirements. Customers can store data for as little as $0.004 per gigabyte per month, a significant savings compared to on-premises solutions.
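In a data lake, one common way to use Glacier is an S3 lifecycle rule that automatically transitions older objects to the Glacier storage class. A sketch with placeholder names, assuming boto3:

```python
import boto3

s3 = boto3.client("s3")

# Transition objects under the "raw/" prefix to Glacier after 90 days.
# Bucket name, prefix, and the 90-day threshold are illustrative choices.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```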
Data Catalog - AWS Glue
AWS Glue is a fully managed service that provides a data catalog to make data in the data lake discoverable, along with the ability to extract, transform, and load (ETL) data to prepare it for analysis. The data catalog is automatically created as a persistent metadata store for all data assets, making all of the data searchable and queryable in a single view.
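One possible flow, sketched with boto3 and placeholder names, ARNs, and paths: create a crawler over an S3 prefix, run it, and then list the tables it registered in the catalog.

```python
import boto3

glue = boto3.client("glue")

# A crawler scans the S3 prefix, infers schemas, and registers tables
# in the Glue Data Catalog. All names below are placeholders.
glue.create_crawler(
    Name="data-lake-raw-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
    DatabaseName="data_lake_raw",
    Targets={"S3Targets": [{"Path": "s3://example-data-lake-bucket/raw/"}]},
)
glue.start_crawler(Name="data-lake-raw-crawler")

# Once the crawler finishes, the tables it discovered are queryable
# from the catalog by Athena, EMR, and Redshift Spectrum.
tables = glue.get_tables(DatabaseName="data_lake_raw")
```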
Analytics
AWS provides the broadest and most cost-effective set of analytic services that run on the data lake. Each analytic service is purpose-built for a wide range of analytics use cases, such as interactive analysis, big data processing using Hadoop and Spark, data warehousing, real-time analytics, operational analytics, dashboards, and visualizations.
Interactive Analytics - Amazon Athena
For interactive analysis, Amazon Athena makes it easy to analyze data directly in S3 and Glacier using standard SQL queries. Athena is serverless, so there is no infrastructure to set up or manage. Simply point to your data in Amazon S3, define the schema, and start querying with standard SQL; most results are delivered within seconds, and you pay only for the queries you run.
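A hedged sketch of the query flow with boto3 (the database, table, and result bucket are placeholders, and the table would typically come from the Glue Data Catalog): submit a query, wait for it to finish, then fetch the results.

```python
import time
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString=(
        "SELECT page, COUNT(*) AS hits FROM clickstream "
        "GROUP BY page ORDER BY hits DESC LIMIT 10"
    ),
    QueryExecutionContext={"Database": "data_lake_raw"},
    ResultConfiguration={"OutputLocation": "s3://example-query-results/athena/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query reaches a terminal state, then read the result set.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
```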
Big Data Processing - Amazon EMR
For big data processing using the Hadoop and Spark frameworks, Amazon EMR provides a managed service that makes it easy, fast, and cost-effective to process vast amounts of data. Amazon EMR supports 19 different open-source projects, including Hadoop, Spark, HBase, and Presto, and each project is updated in EMR within 30 days of a version release, ensuring you have the latest and greatest from the community.
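For illustration, a Spark cluster can be launched programmatically; the release label, instance types, counts, and roles below are placeholder choices rather than recommendations.

```python
import boto3

emr = boto3.client("emr")

# Launch a small Spark/Hadoop cluster that stays alive for interactive work.
response = emr.run_job_flow(
    Name="data-lake-spark-cluster",
    ReleaseLabel="emr-5.16.0",  # illustrative release; pick a current one
    Applications=[{"Name": "Spark"}, {"Name": "Hadoop"}],
    Instances={
        "MasterInstanceType": "m4.xlarge",
        "SlaveInstanceType": "m4.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
cluster_id = response["JobFlowId"]
```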
Data Warehousing - Amazon Redshift
For data warehousing, Amazon Redshift provides the ability to run complex analytic queries against petabytes of structured data, and it includes Redshift Spectrum, which runs SQL queries directly against exabytes of structured or unstructured data in S3 without the need to load or move that data. Amazon Redshift costs less than a tenth of traditional solutions: start small for just $0.25 per hour, and scale out to petabytes of data for $1,000 per terabyte per year.
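Because Redshift speaks the PostgreSQL wire protocol, a standard driver such as psycopg2 can run Spectrum queries from Python. In this sketch the endpoint, credentials, catalog database, IAM role, and table are all placeholders.

```python
import psycopg2

# Connect to the cluster endpoint with standard PostgreSQL tooling.
conn = psycopg2.connect(
    host="example-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="awsuser",
    password="placeholder-password",
)
conn.autocommit = True
cur = conn.cursor()

# Redshift Spectrum: expose the Glue Data Catalog as an external schema,
# then query S3 data in place alongside local Redshift tables.
cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG DATABASE 'data_lake_raw'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
""")
cur.execute("SELECT page, COUNT(*) FROM spectrum.clickstream GROUP BY page")
rows = cur.fetchall()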
Real-Time Analytics - Amazon Kinesis
For real-time analytics, Amazon Kinesis makes it easy to collect, process, and analyze streaming data such as IoT telemetry, application logs, and website clickstreams. This enables you to process and analyze data as it arrives in your data lake and respond in real time, instead of waiting until all of your data has been collected before processing can begin.
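A minimal producer and consumer sketch against a Kinesis data stream with boto3; the stream name and shard ID are placeholders, and a production consumer would iterate over all shards rather than a single hard-coded one.

```python
import json
import time
import boto3

kinesis = boto3.client("kinesis")

# Producer: write a clickstream event, keyed by user so related events
# land on the same shard and stay ordered.
kinesis.put_record(
    StreamName="clickstream-events",  # hypothetical stream name
    Data=json.dumps({"user": "u-123", "page": "/checkout", "ts": time.time()}).encode("utf-8"),
    PartitionKey="u-123",
)

# Consumer: read the newest records from one shard as they arrive.
shard_iterator = kinesis.get_shard_iterator(
    StreamName="clickstream-events",
    ShardId="shardId-000000000000",
    ShardIteratorType="LATEST",
)["ShardIterator"]
records = kinesis.get_records(ShardIterator=shard_iterator)["Records"]
```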
Operational Analytics - Amazon Elasticsearch Service
For operational analytics such as application monitoring, log analytics and clickstream analytics, Amazon Elasticsearch Service allows you to search, explore, filter, aggregate, and visualize your data in near real-time. Amazon Elasticsearch Service delivers Elasticsearch's easy-to-use APIs and real-time analytics capabilities alongside the availability, scalability, and security that production workloads require.
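For illustration, an Amazon Elasticsearch Service domain is queried with the standard Elasticsearch APIs; the domain endpoint, index pattern, and field names below are placeholders, and the domain's access policy must permit the caller.

```python
import requests

# Placeholder domain endpoint; in practice, look this up in the console or API.
endpoint = "https://search-example-domain.us-east-1.es.amazonaws.com"

# Find recent HTTP 500 errors in application logs and aggregate them by path.
query = {
    "query": {"match": {"status": "500"}},
    "aggs": {"by_path": {"terms": {"field": "request_path.keyword"}}},
    "size": 10,
}

resp = requests.post(f"{endpoint}/app-logs-*/_search", json=query)
hits = resp.json()["hits"]["hits"]
```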
Dashboards and Visualizations - Amazon QuickSight
For dashboards and visualizations, Amazon QuickSight provides a fast, cloud-powered business analytics service that makes it easy to build stunning visualizations and rich dashboards that can be accessed from any browser or mobile device.
Machine Learning
For predictive analytics use cases, AWS provides a broad set of machine learning services and tools that run on your data lake on AWS. These services draw on the knowledge and capability we have built up at Amazon, where ML powers Amazon.com's recommendation engines, supply chain, forecasting, fulfillment centers, and capacity planning.
Frameworks and Interfaces
For expert machine learning practitioners and data scientists, AWS provides AWS Deep Learning AMIs that make it easy to build deep learning models and to stand up clusters of GPU instances optimized for ML and DL. AWS supports all the major machine learning frameworks, including TensorFlow, Caffe2, and Apache MXNet, so that you can bring or develop any model you choose. These capabilities provide the power, speed, and efficiency that deep learning and machine learning workloads require.
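As a sketch, a Deep Learning AMI can be launched on a GPU instance with a single API call; the AMI ID and key pair name below are hypothetical, and the current Deep Learning AMI ID for your region should be looked up before launching.

```python
import boto3

ec2 = boto3.client("ec2")

# Launch one GPU instance from a Deep Learning AMI (IDs are placeholders).
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical Deep Learning AMI ID
    InstanceType="p3.2xlarge",        # GPU instance type for training
    MinCount=1,
    MaxCount=1,
    KeyName="my-keypair",             # placeholder key pair for SSH access
)
instance_id = response["Instances"][0]["InstanceId"]
```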
Platform Services
For developers who want to go deep with ML, Amazon SageMaker is a platform service that makes the entire process of building, training, and deploying ML models easy by providing everything you need to connect to your training data, select and optimize the best algorithm and framework, and deploy your model on auto-scaling clusters of Amazon EC2 instances. SageMaker also includes hosted Jupyter notebooks that make it easy to explore and visualize your training data stored in Amazon S3.
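A hedged sketch of kicking off a SageMaker training job with boto3; every name, ARN, container image URI, and S3 path below is a placeholder.

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_training_job(
    TrainingJobName="clickstream-model-2018-06-01",
    AlgorithmSpecification={
        # Container image for a built-in or custom algorithm (hypothetical URI).
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/example-algo:latest",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    InputDataConfig=[{
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://example-data-lake-bucket/curated/train/",
            "S3DataDistributionType": "FullyReplicated",
        }},
    }],
    OutputDataConfig={"S3OutputPath": "s3://example-data-lake-bucket/models/"},
    ResourceConfig={"InstanceType": "ml.m4.xlarge", "InstanceCount": 1, "VolumeSizeInGB": 10},
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)
```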
Application Services
For developers who want to plug pre-built AI functionality into their apps, AWS provides solution-oriented APIs for computer vision and natural language processing. These application services let developers add intelligence to their applications without developing and training their own models.
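For example, with boto3, image labels and text sentiment can each be obtained with a single call; the bucket, object key, and sample text below are placeholders.

```python
import boto3

rekognition = boto3.client("rekognition")
comprehend = boto3.client("comprehend")

# Computer vision: detect objects and scenes in an image already stored in S3.
labels = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "example-data-lake-bucket",
                        "Name": "images/storefront.jpg"}},
    MaxLabels=10,
)

# Natural language processing: extract sentiment from free-form text.
sentiment = comprehend.detect_sentiment(
    Text="The checkout flow was fast and the support team was helpful.",
    LanguageCode="en",
)
```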