Machine Learning Lens
AWS Well-Architected Framework
April 2020
This paper has been archived. The latest version is now available est/machine-learning-lens/welcome.html

Notices
Customers are responsible for making their own independent assessment of the information in this document. This document: (a) is for informational purposes only, (b) represents current AWS product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided “as is” without warranties, representations, or conditions of any kind, whether express or implied. The responsibilities and liabilities of AWS to its customers are controlled by AWS agreements, and this document is not part of, nor does it modify, any agreement between AWS and its customers.
© 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved.

Contents
Introduction
Definitions
  Machine Learning Stack
  Phases of ML Workloads
General Design Principles
Scenarios
  Build Intelligent Applications using AWS AI Services
  Use Managed ML Services to Build Custom ML Models
  Managed ETL Services for Data Processing
  Machine Learning on Edge and on Multiple Platforms
  Model Deployment Approaches
The Pillars of the Well-Architected Framework
  Operational Excellence Pillar
  Security Pillar
  Reliability Pillar
  Performance Efficiency Pillar
  Cost Optimization Pillar
Conclusion
Contributors
Further Reading
Document Revisions

Abstract
This document describes the Machine Learning Lens for the AWS Well-Architected Framework. The document includes common machine learning (ML) scenarios and identifies key elements to ensure that your workloads are architected according to best practices.

Amazon Web Services Machine Learning Lens

Introduction
The AWS Well-Architected Framework helps you understand the pros and cons of decisions you make while building systems on AWS. Using the Framework allows you to learn architectural best practices for designing and operating reliable, secure, efficient, and cost-effective systems in the cloud. It provides a way for you to consistently measure your architectures against best practices and identify areas for improvement. We believe that having well-architected systems greatly increases the likelihood of business success.

In the Machine Learning Lens, we focus on how to design, deploy, and architect your machine learning workloads in the AWS Cloud. This lens adds to the best practices included in the Well-Architected Framework. For brevity, we only include details in this lens that are specific to machine learning (ML) workloads. When designing ML workloads, you should use applicable best practices and questions from the AWS Well-Architected Framework whitepaper.

This lens is intended for those in a technology role, such as chief technology officers (CTOs), architects, developers, and operations team members. After reading this paper, you will understand the best practices and strategies to use when you design and operate ML workloads on AWS.

Definitions
The Machine Learning Lens is based on five pillars: operational excellence, security, reliability, performance efficiency, and cost optimization. AWS provides multiple core components for ML workloads that enable you to design robust architectures for your ML applications.

There are two areas that you should evaluate when you build a machine learning workload:
• Machine Learning Stack
• Phases of Machine Learning Workloads

Machine Learning Stack
When you build an ML-based workload in AWS, you can choose from different levels of abstraction to balance speed to market with level of customization and ML skill level:
• Artificial Intelligence (AI) Services
• ML Services
• ML Frameworks and Infrastructure

AI Services
The AI Services level provides fully managed services that enable you to quickly add ML capabilities to your workloads using API calls. This gives you the ability to build powerful, intelligent applications with capabilities such as computer vision, speech, natural language, chatbots, predictions, and recommendations. Services at this level are based on pre-trained or automatically trained machine learning and deep learning models, so that you don’t need ML knowledge to use them.

AWS provides many AI services that you can integrate with your applications through API calls. For example, you can use Amazon Translate to translate or localize text content, Amazon Polly for text-to-speech conversion, and Amazon Lex for building conversational chat bots.

ML Services
The ML Services level provides managed services and resources for machine learning to developers, data scientists, and researchers. These types of services enable you to label data, build, train, deploy, and operate custom ML models without having to worry about the underlying infrastructure needs. The undifferentiated heavy lifting of infrastructure management is managed by the cloud vendor, so that your data science teams can focus on what they do best.

In AWS, Amazon SageMaker enables developers and data scientists to quickly and easily build, train, and deploy ML models at any scale. For example, Amazon SageMaker Ground Truth helps you build highly accurate ML training datasets quickly and Amazon SageMaker Neo enables developers to train ML models once, and then run them anywhere in the cloud or at the edge.

ML Frameworks and Infrastructure
The ML Frameworks and Infrastructure level is intended for expert machine learning practitioners. These people are comfortable with designing their own tools and workflows to build, train, tune, and deploy models, and are accustomed to working at the framework and infrastructure level.

In AWS, you can use open source ML frameworks, such as TensorFlow, PyTorch, and Apache MXNet. The Deep Learning AMI and Deep Learning Containers in this level have multiple ML frameworks preinstalled that are optimized for performance. This optimization means that they are always ready to be launched on the powerful, ML-optimized compute infrastructure, such as Amazon EC2 P3 and P3dn instances, that provides a boost of speed and efficiency to machine learning workloads.

Combining Levels
Workloads often use services from multiple levels of the ML stack. Depending on the business use case, services and infrastructure from the different levels can be combined to satisfy multiple requirements and achieve multiple business goals.
For example, you can use AI services for sentiment analysis of customer reviews on your retail website, and use managed ML services to build a custom model using your own data to predict future sales.

Phases of ML Workloads
Building and operating a typical ML workload is an iterative process that consists of multiple phases. We identify these phases loosely based on the Cross Industry Standard Process for Data Mining (CRISP-DM), an open standard process model, as a general guideline. CRISP-DM is used as a baseline because it’s a proven tool in the industry and is application neutral, which makes it an easy-to-apply methodology for a wide variety of ML pipelines and workloads.

The end-to-end machine learning process includes the following phases:
• Business Goal Identification
• ML Problem Framing
• Data Collection and Integration
• Data Preparation
• Data Visualization and Analytics
• Feature Engineering
• Model Training
• Model Evaluation
• Business Evaluation
• Production Deployment (Model Deployment and Model Inference)

Figure 1 – End-to-End Machine Learning Process

Business Goal Identification
Business Goal Identification is the most important phase. An organization considering ML should have a clear idea of the problem to be solved, and the business value to be gained by solving that problem using ML. You must be able to measure business value against specific business objectives and success criteria. While this holds true for any technical solution, this step is particularly challenging when considering ML solutions because ML is a disruptive technology.

After you determine your criteria for success, evaluate your organization's ability to realistically execute toward that target. The target should be achievable and provide a clear path to production.

You will want to validate that ML is the appropriate approach to deliver your business goal. When deciding your approach, evaluate all of the options that you have available for achieving the goal, how accurate the resulting outcomes would be, and the cost and scalability of each approach.

For an ML-based approach to be successful, having an abundance of relevant, high-quality data that is applicable to the algorithm that you are trying to train is essential. Carefully evaluate the availability of the data to make sure that the correct data sources are available and accessible. For example, you need training data to train and benchmark your ML model, but you also need data from the business to evaluate the value of an ML solution.

Apply these best practices:
• Understand business requirements
• Form a business question
• Determine a project’s ML feasibility and data requirements
• Evaluate the cost of data acquisition, training, inference, and wrong predictions
• Review proven or published work in similar domains, if available
• Determine key performance metrics, including acceptable errors
• Define the machine learning task based on the business question
• Identify critical, must-have features

ML Problem Framing
In this phase, the business problem is framed as a machine learning problem: what is observed and what should be predicted (known as a label or target variable). Determining what to predict and how performance and error metrics need to be optimized is a key step in ML.

For example, imagine a scenario where a manufacturing company wants to identify which products will maximize profits. Reaching this business goal partially depends on determining the right number of products to produce.
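As a minimal sketch of this framing step, the following Python turns a sales history into pairs of observations and a label, then scores a naive baseline against an error metric. The sales figures, the window size, the function names, and the choice of mean absolute error are all illustrative assumptions and are not taken from this paper.

```python
from typing import Callable, List, Tuple

Pair = Tuple[List[float], float]


def frame_as_supervised(sales: List[float], window: int) -> List[Pair]:
    """Turn a sales history into (observed inputs, label) pairs."""
    if window < 1:
        raise ValueError("window must be at least 1")
    pairs: List[Pair] = []
    for i in range(window, len(sales)):
        observed = sales[i - window:i]  # what is observed: the previous periods
        label = sales[i]                # what should be predicted: the next period
        pairs.append((observed, label))
    return pairs


def mean_absolute_error(pairs: List[Pair],
                        predict: Callable[[List[float]], float]) -> float:
    """Average absolute gap between predictions and labels."""
    errors = [abs(predict(observed) - label) for observed, label in pairs]
    return sum(errors) / len(errors)


def naive_baseline(observed: List[float]) -> float:
    """Hypothetical baseline: predict that the next period repeats the last one."""
    return observed[-1]


# Hypothetical monthly unit sales, used only for illustration.
monthly_units_sold = [100.0, 110.0, 120.0, 130.0, 125.0, 140.0]

pairs = frame_as_supervised(monthly_units_sold, window=3)
baseline_mae = mean_absolute_error(pairs, naive_baseline)
print(f"{len(pairs)} training pairs, baseline MAE = {baseline_mae:.1f}")
```

A baseline like this gives a reference point: a trained model is only worth its cost if it beats the baseline error by a margin that matters to the business.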
In this scenario, you want to predict the future sales of the product, based on past and current sales. Predicting future sales becomes the problem to solve, and using ML is one approach that can be used to solve it.

Apply these best practices:
• Define criteria for a successful outcome of the project
• Establish an observable and quantifiable performance metric for the project, such as accuracy, prediction latency, or minimizing inventory value
• Formulate the ML question in terms of inputs, desired outputs, and the performance metric to be optimized
• Evaluate whether ML is a feasible and appropriate approach
• Create a data sourcing and data annotation objective, and a strategy to achieve it
• Start with a simple model that is easy to interpret, and which makes debugging more manageable

Data Collection
In ML workloads, the data (inputs and corresponding desired output) serves three important functions:
• Defining the goal of the system: the output representation and the relationship of each output to each input, by means of input/output pairs
• Training the algorithm that will associate inputs to outputs
• Measuring the performance of the trained model, and evaluating whether the performance target was met

The first step is to identify what data is needed for your ML model, and evaluate the various means available for collecting that data to train your model.

As organizations collect and analyze increasingly large amounts of data, traditional on-premises solutions for data storage, data management, and analytics can no longer keep pace. A cloud-based data lake is a centralized repository that allows you to store all your structured and unstructured data regardless of scale. You can store your data as-is, without first having to structure the data, and run different types of