Hadoop For Windows
DBI-B335
Rohit Bakhshi

Speaker
Rohit Bakhshi, Product Manager, Hortonworks

Agenda
  Modern Data Architecture
  Hadoop for Windows
  Hortonworks Data Platform under the covers
  Q&A

Modern Data Architecture

What Makes Up Big Data?
Transactions + Interactions + Observations = BIG DATA
[Figure: data sources plotted by data volume against increasing data variety and complexity.]
  Volume tiers: Megabytes (ERP), Gigabytes (CRM), Terabytes (WEB), Petabytes (BIG DATA)
  ERP: Purchase detail, Purchase record, Payment record
  Other labels in the figure: Segmentation, Offer details, Offer history, Customer Touches, Support Contacts, Web logs, A/B testing, Behavioral Targeting, Dynamic Pricing, Dynamic Funnels, Search Marketing, Affiliate Networks, Business Data Feeds, External Demographics, Mobile Web, Sentiment, SMS/MMS, Speech to Text, User Click Stream, Social Interactions & Feeds, Spatial & GPS Coordinates, Sensors / RFID / Devices, User Generated Content, HD Video, Audio, Images, Product/Service Logs

A data architecture under pressure from new data
  APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
  DATA SYSTEM (REPOSITORIES): RDBMS, EDW, MPP
  SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); OLTP, ERP, CRM Systems; Unstructured documents, emails; Server logs; Sentiment, Web Data; Sensor, Machine Data; Geolocation; Clickstream
  2.8 ZB in 2012
  85% from New Data Types
  15x Machine Data by 2020
  40 ZB by 2020
  Source: IDC

Hadoop within an emerging Modern Data Architecture
  APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
  DEV & DATA TOOLS: Build & Test
  OPERATIONS TOOLS: Provision, Manage & Monitor
  DATA SYSTEM (REPOSITORIES): RDBMS, EDW, MPP
  Hadoop layers: Data Management, Data Access, Governance & Integration, Security, Operations
  SOURCES: OLTP, ERP, CRM Systems; Unstructured documents, emails; Server logs; Sentiment, Web Data; Sensor, Machine Data; Geolocation; Clickstream
  Source categories: OLTP, ERP, CRM Systems; Documents, Emails; Web Logs, Click Streams; Social Networks; Machine Generated; Sensor Data; Geolocation Data

Hadoop for Windows
HDP for Windows
Hortonworks Data Platform 2.2
[Figure: HDP 2.2 component stack. The labels that survive extraction are listed below.]
  GOVERNANCE (Data Workflow, Lifecycle & Governance): Falcon, Sqoop, Flume, WebHDFS
  BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: Script (Pig), SQL (Hive), Java/Scala (Cascading), NoSQL (HBase), Stream (Storm), In-Memory (Spark), Search (Solr), Others (ISV Engines); Tez and Slider appear beneath the engines
  YARN: Data Operating System (Cluster Resource Management)
  SECURITY: Pipeline: Falcon; Cluster: Knox; Cluster: Ranger
  Ambari, Zookeeper; Scheduling: Oozie
  Deployment choices: Linux, Windows; On-Premises, Cloud

Hortonworks Data Platform (HDP)
  The Only Completely Open Distribution for Apache Hadoop
  Fundamentally Versatile and Comprehensive enterprise capabilities
  Wholly Integrated for deep ecosystem interoperability

HDP: Enterprise Data Platform
HDP certifies the most recent & stable community innovation
[Figure: component version matrix across the HDP 2.0, HDP 2.1 and HDP 2.2 releases. Component and group labels that survive extraction: YARN, Storm, Hive & HCatalog, Ranger, Knox, Zookeeper; Data Management, Operations, Security. The individual version numbers can no longer be matched to components.]
  * version numbers are targets and subject to change at time of general availability in accordance with ASF release

Seamless Interoperability
Integrations with Microsoft tools for native big data analysis
[Figure labels: APPLICATIONS, DEV & DATA TOOLS, OPERATIONAL TOOLS, DATA SYSTEM, SOURCES, INFRASTRUCTURE; Power BI (New!), HDInsight, Azure.]

HDP: Powered by Apache Hadoop
HDP certifies the most recent & stable community innovation

Apache Hadoop
Open Source Data Management
Storage: HDFS
  Distributed across nodes
  Natively redundant
  Single File System
Processing: YARN
  Cluster Resource Manager
  Built in Fault Tolerance
  High Cluster Utilization
Scalable: Linearly scale to store Petabytes of data
Reliable: Redundant storage protects against node failures
Flexible: Store all types of data, apply flexible schemas for analysis and sharing
Economical: Utilize cost-efficient commodity hardware; achieve high cluster utilization

YARN: Data Operating System
[Figure labels: ResourceManager, Scheduler; task labels include reduce1.1 and vertex1.2.1.]

Right Tool for the Right Usage
SCALE (storage & processing)
Traditional Database (EDW, MPP Analytics) compared with NoSQL / Hadoop Platform:
  schema: Required on write / Required on read
  speed: Reads are fast / Writes are fast
  governance: Standards and structured / Loosely structured
  processing: Limited, no data processing / Processing coupled with data
  data types: Structured / Multi and unstructured
  best fit use: Interactive OLAP Analytics, Complex ACID Transactions, Operational Data Store / Data Discovery, Processing unstructured data, Massive ...

Maximize Hadoop Deployment Choice
  Hortonworks Data Platform (HDP) for Windows: 100% Apache open source Hadoop software for Windows Server
  Microsoft Azure HDInsight: Hadoop-based managed service in the cloud via Microsoft Azure
  Microsoft Analytics Platform System (APS): Scale-out appliance with data warehousing and Hadoop in one box
  All offerings co-engineered by Hortonworks and Microsoft
  Enjoy seamless interoperability across on-premises and cloud

HDP under the covers

HDP 2.2: Core Platform
  Data Operating System of Hadoop
  Single Cluster, Shared Data Set, Multiple Workloads
  Support a range of access patterns
  Shared operational services
[Figure: HDP 2.2 core platform stack.]
  DATA ACCESS: Batch (Map Reduce), Script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), Others (In-Memory Analytics, ISV engines)
  YARN: Data Operating System
  DATA MANAGEMENT: HDFS (Hadoop Distributed File System), nodes 1 to N

Flexible Ingest into HDP
HORTONWORKS DATA PLATFORM (HDP) For Windows
  Sqoop
  Flume
  RPC
  REST (HTTP)
  C LibHDFS

SQL Access: Stinger Initiative
Stinger Initiative: Next generation SQL based interactive query in Hadoop
  Speed: Interactive Hive Query response
  Scale: queries that scale from TB to PB
  SQL: broadest range of SQL semantics for analytic applications
[Figure labels, top to bottom: Business Analytics, Custom Apps; SQL; Apache Hive; Apache Tez, Apache MapReduce; Apache YARN; HDFS (Hadoop Distributed File System), nodes 1 to N.]

Apache Hive Contribution: an Open Community at its finest
  1,672 Jira Tickets Closed
  145 Developers
  44 Companies
  ~390,000 Lines Of Code Added (2x)
  13 Months

Apache Tez (Speed)
  Replaces MapReduce as primitive for Hive, Pig, etc
  Task with pluggable Input, Processor and Output
  [Figure: Tez Task = Input + Processor + Output.]

Hive with Tez as execution engine
  SELECT a.state, COUNT(*), AVG(c.price)
  FROM a
  JOIN b ON (a.id = b.id)
  JOIN c ON (a.itemId = c.itemId)
  GROUP BY a.state
Tez avoids unneeded writes to HDFS
[Figure: the query above drawn twice, as "Hive MR" (map (M) and reduce (R) stages with HDFS shown between stages) and as "Hive Tez" (M and R stages without the intermediate HDFS steps). Stage labels: SELECT a.state, c.itemId; SELECT b.id; SELECT c.price; JOIN (a, c); JOIN (a, b); GROUP BY a.state; COUNT(*); AVG(c.price).]

Hive: Enhanced SQL Semantics
Hive provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop
Legend: Hive 0.11; Hive 0.12 (HDP 2.0); Hive 0.13 (HDP 2.1); SQL Compliance
Hive SQL Datatypes
  INT
  TINYINT/SMALLINT/BIGINT
  BOOLEAN
  FLOAT
  DOUBLE
  STRING
  TIMESTAMP
  BINARY
  DECIMAL
  ARRAY, MAP, STRUCT, UNION
  DATE
  VARCHAR
  CHAR
Hive SQL Semantics
  SELECT, INSERT
  GROUP BY, ORDER BY, SORT BY
  JOIN on explicit join key
  Inner, outer, cross and semi joins
  Sub-queries in FROM clause
  ROLLUP and CUBE
  UNION
  Windowing Functions (OVER, RANK, etc)
  Custom Java UDFs
  Standard Aggregation (SUM, AVG, etc.)
  Advanced UDFs (ngram, Xpath, URL)
  Sub-queries for IN/NOT IN, HAVING
  Expanded JOIN Syntax
  INTERSECT / EXCEPT

Stream Processing
Apache Storm: Real-time event processing for sensor and business activity monitoring
  Scale: Ingest millions of events per second. Fast query on petabytes of data
  Implement new real time business cases with your Hadoop platform
  http://storm.incubator.apache.org/

NoSQL Database
Store and Process Petabytes of NoSQL Data
HBase
  Scale out on Commodity Servers
[Figure: HBase on YARN: Data Operating System, over HDFS (Permanent Data Storage), nodes 1 to N.]
  High Performance
  Highly Available
  Integrated with YARN
  SQL Interface

HDP Search
Apache Solr: High performance indexing and simple UI for advanced search applications
[Figure labels: Raw Files (HTML, PDF, Word, XML, Logs); MapReduce Indexing Job; Indexed Documents; HDFS (Hadoop Distributed File System); Solr; Lucene; Search Web App; Query; Solr Response.]

All Processing on Shared Infrastructure
[Figure labels: SQL, Script, Kafka, Tez, Slider; YARN: Data Operating System (Cluster Resource Management); HDFS (Hadoop Distributed File System).]

YARN: Next Generation Hadoop
  1st Gen of Hadoop: Single Use System (Batch Apps)
    Classic Hadoop Apps
    MapReduce (cluster resource management & data processing)
    HDFS (redundant, reliable storage)
  2nd Gen of Hadoop: Multi Use Data Platform (Batch, Interactive, Online, Streaming, ...)
    Batch: MapReduce
    Batch & Interactive: Tez
    Flexible Data Processing: Hive, Pig, others
    Online Data Processing: HBase, Accumulo
    Stream Processing: Storm
    others
    Efficient Cluster Resource Management & Shared Services (YARN)
    Redundant, Reliable Storage (HDFS)

Data Governance & Integration
Apache Falcon
Simplified Data Governance for Enterprise Hadoop
Provides key governance framework for:
  Acquisition & processing of data sets
  Replication & Retention of datasets
  Redirect datasets to non-Hadoop extensions
  Provides audit trail & lineage

Apache Falcon
  Define sophisticated Workflows and DLM Policies
  Enable audit, compliance, and data re-processing
  [Figure: pipeline stages, in order: Staged Data, Cleansed Data, Conformed Data, Presented Data. Retention labels, in order: Retain 5 Years, Retain 3 Years, Retain 3 Years, Retain Last Copy Only.]

Apache Falcon
  Site to Site: Disaster Recovery and Backup between environments
  Site to Cloud: Publishing data between environments for Discovery

Extend with the Cloud
Hybrid = On-premises + Cloud
  Cloud: Cloud Hadoop, HDInsight
  On-premises: Hadoop, APS, Appliances, Software
Constraints of on-premises
  Scale constrained to on-premises procurement
  Up-front capex costs
  Expertise for tuning and deployment
Benefits of Cloud
  Unlimited elastic scale
  Auto geo redundancy
  No hardware costs
  Pay only for what you need

Central Security Administration
HDP Advanced Security
  Single Pane of Glass
  Centralizes administration of security policy across entire HDP
  Project: Apache Ranger

Perimeter Security
Apache Knox
  A common place to perform authentication across Hadoop and all related projects
  Integrated to LDAP and AD
  Secure interfaces for: WebHDFS, WebHCAT, Oozie, Hive & HBase
  Broad community effort, incubated with Microsoft, broad set of developers involved

Apache Knox: Perimeter Security
[Figure labels: Browser, REST Client, JDBC Client; Firewall; DMZ; Knox Gateway (GW); Identity Providers; Enterprise Identity Provider (LDAP/AD); HDP Cluster 1 and Hadoop Cluster 2, each with Masters (NN, JT), DN, TT, Web HCat, Oozie, Hive, HBase, YARN.]
  A stateless reverse proxy instance deployed in DMZ
  Requests streamed through GW to Hadoop services after auth.
  URLs rewritten to refer to gateway
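
The "URLs rewritten to refer to gateway" point can be illustrated with a small Python sketch. It is not Knox's implementation. It only shows the shape of the mapping from a cluster-internal service URL to a gateway URL, following the gateway URL layout https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/... The host names, the cluster name "hdp1", and the defaults of port 8443 and path "gateway" are assumptions chosen for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def to_gateway_url(internal_url, gateway_host, cluster,
                   gateway_port=8443, gateway_path="gateway"):
    """Map a cluster-internal service URL to a URL on the perimeter gateway.

    Illustrative only: the internal host and port are dropped, the scheme
    becomes https, and the service path is placed under
    /{gateway_path}/{cluster}. Query string and fragment are preserved.
    """
    parts = urlsplit(internal_url)
    netloc = "{}:{}".format(gateway_host, gateway_port)
    path = "/{}/{}{}".format(gateway_path, cluster, parts.path)
    return urlunsplit(("https", netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    # Hypothetical internal WebHDFS URL and gateway host.
    internal = "http://namenode.internal:50070/webhdfs/v1/tmp?op=LISTSTATUS"
    print(to_gateway_url(internal, "knox.example.com", "hdp1"))
    # prints https://knox.example.com:8443/gateway/hdp1/webhdfs/v1/tmp?op=LISTSTATUS
```

A client outside the firewall would then only ever see the gateway host, which is the point of placing the proxy in the DMZ.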
Ambari: Deploy on Windows
Ambari: Manage on Windows
Ambari: Monitor on Windows

Ambari SCOM
  Enables Microsoft System Center Operations Manager (SCOM) to monitor Hadoop
  Ambari SCOM Management Pack gives insight into the performance and health of Hadoop
  Ambari SCOM leverages the Ambari framework to aggregate and expose Hadoop metrics
  [Figure labels: Ambari SCOM Mgmt Pack; Ambari SCOM Server; HADOOP: Storage & Process at Scale.]
  Ambari SCOM Server aggregates + exposes Hadoop metrics
  Ambari SCOM monitors health + alerts in case of problems

For More Information
  Web: hortonworks.com/products/hdp-windows/, hortonworks.com/labs/microsoft/, microsoft.com/bigdata
  Training: hortonworks.com/hadoop-training/hadoop-on-windows/
  Online documentation: docs.hortonworks.com
  Forums: hortonworks.com/community/forums/

Questions?

DBI Track resources
  27 Hands on Labs + 8 Instructor Led Labs in Hall 7
  Free SQL Server 2014 Technical Overview ebook: microsoft.com/sqlserver and Amazon Kindle Store
  Free online training at Microsoft Virtual Academy: microsoftvirtualacademy.com
  Try new Azure data services previews! Azure Machine Learning, DocumentDB, and Stream Analytics

Resources
  Sessions on Demand: http://channel9.msdn.com/Events/TechEd
  Learning: Microsoft Certification & Training Resources, www.microsoft.com/learning
  TechNet: Resources for IT Professionals, http://microsoft.com/technet
  Developer Network: http://developer.microsoft.com

We value your feedback!
SUBMIT YOUR TECHED EVALUATIONS
  Fill out an evaluation via CommNet Station/PC: Schedule Builder
  LogIn: europe.msteched.com/catalog
  TechEd Mobile app for session evaluations is currently offline

© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.