Hadoop For Windows

Session DBI-B335
Speaker: Rohit Bakhshi, Product Manager, Hortonworks

Agenda
Modern Data Architecture
Hadoop for Windows
Hortonworks Data Platform under the covers
Q&A

Modern Data Architecture

What Makes Up Big Data?
Transactions + Interactions + Observations = BIG DATA
[Figure: data volume grows from megabytes (ERP) through gigabytes (CRM) and terabytes (WEB) to petabytes (BIG DATA) with increasing data variety and complexity.]
Data types labeled in the figure: purchase detail, purchase record, payment record, segmentation, offer details, offer history, customer touches, support contacts, web logs, A/B testing, behavioral targeting, dynamic pricing, dynamic funnels, search marketing, affiliate networks, business data feeds, external demographics, mobile web, sentiment, SMS/MMS, speech to text, user click stream, social interactions & feeds, spatial & GPS coordinates, sensors / RFID / devices, user generated content, HD video, audio, images, product/service logs.

A data architecture under pressure from new data
Applications: business analytics, custom applications, packaged applications
Data system: repositories (RDBMS, EDW, MPP)
Sources: existing sources (CRM, ERP, clickstream, logs); OLTP, ERP, CRM systems; unstructured documents, emails; server logs; sentiment, web data; sensor, machine data; geolocation; clickstream
2.8 ZB in 2012
85% from new data types
15x machine data by 2020
40 ZB by 2020
Source: IDC

Hadoop within an emerging Modern Data Architecture
Applications: business analytics, custom applications, packaged applications
Dev & data tools: build & test
Operations tools: provision, manage & monitor
Data system: repositories (RDBMS, EDW, MPP); data management, data access, governance & integration, security, operations
Sources: OLTP, ERP, CRM systems; documents, emails; web logs, click streams; social networks; machine generated; sensor data; geolocation data; server logs; sentiment, web data

Hadoop for Windows

HDP for Windows

Hortonworks Data Platform 2.2
[Figure: HDP 2.2 component stack.]
Governance (data workflow, lifecycle & governance): Falcon, Sqoop, Flume, WebHDFS
Batch, interactive & real-time data access: Script (Pig), SQL (Hive), Java/Scala (Cascading), NoSQL (HBase), Stream (Storm), In-Memory (Spark), Search (Solr), others (ISV engines); Tez and Slider layers
YARN: Data Operating System (cluster resource management)
HDFS (Hadoop Distributed File System)
Security (authentication, authorization, accounting, data protection): Storage: HDFS; Resources: YARN; Access: Hive, ...; Pipeline: Falcon; Cluster: Knox; Cluster: Ranger
Operations (provision, manage & monitor): Ambari, Zookeeper; scheduling: Oozie
Deployment choice: Linux, Windows, on-premises, cloud

Hortonworks Data Platform (HDP)
The only completely open distribution for Apache Hadoop
Fundamentally versatile and comprehensive enterprise capabilities
Wholly integrated for deep ecosystem interoperability

HDP: Enterprise Data Platform

HDP certifies the most recent & stable community innovation
[Figure: matrix of Apache component versions certified in each release: HDP 2.0 (October 2013), HDP 2.1 (April 2014), and HDP 2.2 (October 2014).]
Data management: Hadoop & YARN
Data access: Pig, Hive & HCatalog, HBase, Phoenix, Storm, Tez, Spark, Solr, Slider
Governance & integration: Falcon, Sqoop, Flume
Operations: Ambari, Oozie, Zookeeper
Security: Knox, Ranger
Hortonworks Data Platform 2.2
* version numbers are targets and subject to change at time of general availability in accordance with ASF release

Seamless Interoperability
Integrations with Microsoft tools for native big data analysis
[Figure: HDP in the data system layer between sources and applications, alongside dev & data tools, operational tools, and infrastructure; Microsoft components shown: Power BI (New!), HDInsight, Azure.]

HDP: Powered by Apache Hadoop
[Figure: the same component version matrix as on the previous version slide, with the same footnote.]

Apache Hadoop: Open Source Data Management
Storage: HDFS

Distributed across nodes
Natively redundant
Single file system

Processing: YARN
Cluster resource manager
Built-in fault tolerance
High cluster utilization

Scalable: linearly scale to store petabytes of data
Reliable: redundant storage protects against node failures
Flexible: store all types of data, apply flexible schemas for analysis and sharing
Economical: utilize cost-efficient commodity hardware, achieve high cluster utilization

YARN: Data Operating System
[Figure: a ResourceManager with its Scheduler above a grid of NodeManagers running mixed workloads: batch (map 1.1, map 1.2, reduce 1.1), interactive SQL (vertex 1.1.1, vertex 1.1.2, vertex 1.2.1, vertex 1.2.2), and real-time (nimbus0, nimbus1, nimbus2).]

Right Tool for the Right Usage
[Figure labels: traditional database, EDW, MPP analytics, NoSQL, Hadoop platform.]

Traditional Database
schema: required on write
speed: reads are fast
governance: standards and structured
processing: limited, no data processing
data types: structured
best fit use: interactive OLAP analytics, complex ACID transactions, operational data store

Hadoop Platform
schema: required on read
speed: writes are fast
governance: loosely structured
processing: processing coupled with data
data types: multi and unstructured
best fit use: data discovery, processing unstructured data, massive scale (storage & processing)

Maximize Hadoop Deployment Choice
Hortonworks Data Platform (HDP) for Windows: 100% Apache open source Hadoop software for Windows Server
Microsoft Azure HDInsight: Hadoop-based managed service in the cloud via Microsoft Azure
Microsoft Analytics Platform System (APS): scale-out appliance with data warehousing and Hadoop in one box
All offerings co-engineered by Hortonworks and Microsoft
Enjoy seamless interoperability across on-premises and cloud

HDP under the covers

HDP 2.2: Core Platform
Data Operating System of Hadoop
Single cluster, shared data set, multiple workloads
Support a range of access patterns
Shared operational services
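The "required on read" schema entry in the Right Tool for the Right Usage comparison above can be illustrated with a minimal sketch. It is plain Python, not Hadoop code: the in-memory list stands in for files landed in HDFS, and the function names, field names, and sample records are all hypothetical. The write path stores each record untouched, and a schema is applied only when the data is read, so two different schemas can be projected over the same raw data.

```python
import json

# Hypothetical stand-in for raw files in a cluster: records are kept exactly as written.
raw_store = []


def write_raw(line):
    """Schema on read: the write path does no parsing or validation."""
    raw_store.append(line)


def read_with_schema(schema):
    """Apply a schema (field name -> type constructor) at read time.

    Records that do not fit the requested schema are skipped rather than
    rejected when they were written.
    """
    rows = []
    for line in raw_store:
        try:
            record = json.loads(line)
            rows.append({name: cast(record[name]) for name, cast in schema.items()})
        except (ValueError, KeyError, TypeError):
            continue
    return rows


write_raw('{"state": "CA", "price": "9.50", "clicks": 3}')
write_raw('{"state": "WA", "price": "4.25"}')
write_raw("not json at all")

# Two different schemas projected over the same raw records.
pricing = read_with_schema({"state": str, "price": float})
clicks = read_with_schema({"state": str, "clicks": int})
print(pricing)
print(clicks)
```

A schema-on-write system would instead reject the second and third records at insert time if they did not match the table definition.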

[Figure: HDP 2.2 core platform stack.]
Data access: Batch (MapReduce), Script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), Others (in-memory analytics, ISV engines)
YARN: Data Operating System
Data management: HDFS (Hadoop Distributed File System), nodes 1 to N

Flexible Ingest into HDP
Ingest paths into Hortonworks Data Platform (HDP) for Windows: Sqoop, Flume, RPC, REST (HTTP), C (LibHDFS)

SQL Access: Stinger Initiative
Stinger Initiative: next generation SQL based interactive query in Hadoop
Speed: interactive Hive query response
Scale: queries that scale from TB to PB
SQL: broadest range of SQL semantics for analytic applications
[Figure: custom apps and business analytics issue SQL to Apache Hive, which runs on Apache Tez and Apache MapReduce over Apache YARN and HDFS (Hadoop Distributed File System), nodes 1 to N.]

Apache Hive Contribution: an Open Community at its finest
1,672 Jira tickets closed
145 developers
44 companies
~390,000 lines of code added (2x)
13 months

Apache Tez (Speed)
Replaces MapReduce as primitive for Hive, Pig, etc.
Tez Task: a task with pluggable Input, Processor and Output

Hive with Tez as execution engine
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

Tez avoids unneeded writes to HDFS
[Figure: the query's execution plan under Hive on MapReduce (Hive MR) and under Hive on Tez; M marks a map task and R a reduce task.]
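To experiment with the shape of the query on the Tez slide above without a cluster, the following sketch runs it against an in-memory SQLite database. SQLite is only a stand-in for Hive here and says nothing about Tez execution. The table and column names come from the slide; the column types and sample rows are invented for illustration. The sketch uses AVG, the averaging aggregate that Hive and SQLite both provide, and adds an ORDER BY so the output order is deterministic.

```python
import sqlite3

# In-memory SQLite database standing in for Hive tables a, b and c.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (id INTEGER, itemId INTEGER, state TEXT);
    CREATE TABLE b (id INTEGER);
    CREATE TABLE c (itemId INTEGER, price REAL);
    -- Hypothetical sample rows.
    INSERT INTO a VALUES (1, 10, 'CA'), (2, 11, 'CA'), (3, 10, 'WA');
    INSERT INTO b VALUES (1), (2), (3);
    INSERT INTO c VALUES (10, 4.0), (11, 6.0);
    """
)

# Same joins and grouping as the slide's query.
query = """
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
ORDER BY a.state
"""
rows = conn.execute(query).fetchall()
for state, count, avg_price in rows:
    print(state, count, avg_price)
```

With the sample rows, CA has two joined rows with prices 4.0 and 6.0, and WA has one row with price 4.0.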

Plan stages shown in the figure: SELECT a.state, c.itemId; SELECT b.id; SELECT c.price; JOIN (a, c); JOIN (a, b); GROUP BY a.state with COUNT(*) and AVG(c.price). In the Hive on MapReduce plan, HDFS appears between the map and reduce stages of successive jobs.

Hive: Enhanced SQL Semantics
Hive provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop

Hive SQL Datatypes: INT, TINYINT/SMALLINT/BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, TIMESTAMP, BINARY, DECIMAL, ARRAY, MAP, STRUCT, UNION, DATE, VARCHAR, CHAR

Hive SQL Semantics: SELECT, INSERT; GROUP BY, ORDER BY, SORT BY; JOIN on explicit join key; inner, outer, cross and semi joins; sub-queries in FROM clause; ROLLUP and CUBE; UNION; windowing functions (OVER, RANK, etc.); custom Java UDFs; standard aggregation (SUM, AVG, etc.); advanced UDFs (ngram, Xpath, URL); sub-queries for IN/NOT IN, HAVING; expanded JOIN syntax; INTERSECT / EXCEPT

SQL Compliance legend: Hive 0.11, Hive 0.12 (HDP 2.0), Hive 0.13 (HDP 2.1)

Stream Processing: Apache Storm
Real-time event processing for sensor and business activity monitoring
Scale: ingest millions of events per second. Fast query on petabytes of data
Implement new real time business cases with your Hadoop platform
http://storm.incubator.apache.org/

NoSQL Database: HBase
Store and process petabytes of NoSQL data
Scale out on commodity servers
High performance
Highly available
Integrated with YARN
SQL interface
[Figure: HBase on YARN: Data Operating System, over HDFS (permanent data storage), nodes 1 to N.]

HDP Search: Apache Solr
High performance indexing and simple UI for advanced search applications
[Figure: raw files (HTML, PDF, Word, XML, logs) in HDFS (Hadoop Distributed File System), a MapReduce indexing job, indexed documents, and Solr (on Lucene); a search web app sends a query and receives a Solr response.]

All Processing on Shared Infrastructure
[Figure: processing engines sharing one cluster.]
Engine categories: Script, SQL, Java/Scala, NoSQL, Stream, In-Memory, Search, ISV engines, others
Engines: Pig, Hive, Cascading, HBase, Accumulo, Storm, Kafka, Spark, Solr, others
Framework layers: Tez, Slider
YARN: Data Operating System (cluster resource management)
HDFS (Hadoop Distributed File System)

YARN: Next Generation Hadoop
1st Gen of Hadoop: single use system for batch apps (classic Hadoop apps)
Batch: MapReduce
MapReduce (cluster resource management & data processing)
HDFS (redundant, reliable storage)
2nd Gen of Hadoop: multi use data platform for batch, interactive, online, streaming, ...
Flexible data processing: Hive, Pig, others
Online data processing: HBase, Accumulo
Stream processing: Storm
Batch: MapReduce
Batch & interactive: Tez
others
Efficient cluster resource management & shared services (YARN)
Redundant, reliable storage (HDFS)

Data Governance & Integration: Apache Falcon

Simplified Data Governance for Enterprise Hadoop
Provides key governance framework for:
Acquisition & processing of data sets
Replication & retention of datasets
Redirect datasets to non-Hadoop extensions
Provides audit trail & lineage

Apache Falcon
Define sophisticated workflows and DLM policies
Enable audit, compliance, and data re-processing
[Figure: data pipeline with a retention policy per stage: staged data (retain 5 years), cleansed data (retain 3 years), conformed data (retain 3 years), presented data (retain last copy only).]

Apache Falcon
Disaster recovery and backup between environments
Site to site
Publishing data between environments for discovery
Site to cloud

Extend with the Cloud
Hybrid = On-premises + Cloud
On-premises: Hadoop software, APS appliances
Cloud: Hadoop (HDInsight)
Constraints of on-premises:
Scale constrained to on-premises procurement
Capex up-front costs
Expertise for tuning and deployment
Benefits of cloud:
Unlimited elastic scale
Auto geo redundancy
No hardware costs
Pay only for what you need

Central Security Administration
HDP Advanced Security
Single pane of glass
Centralizes administration of security policy across entire HDP
Project: Apache Ranger

Perimeter Security: Apache Knox
A common place to perform authentication across Hadoop and all related projects
Integrated with LDAP and AD
Secure interfaces for: WebHDFS, WebHCAT, Oozie, Hive & HBase
Broad community effort, incubated with Microsoft, broad set of developers involved

Apache Knox: Perimeter Security
[Figure: a browser, a REST client and a JDBC client outside the firewall; the Knox gateway (GW) in the DMZ; enterprise identity providers (LDAP/AD); and HDP Hadoop clusters 1 and 2 behind the gateway, each with masters (NN, JT), DN and TT nodes, and Web HCat, Oozie, Hive, HBase and YARN services.]
A stateless reverse proxy instance deployed in DMZ
Requests streamed through GW to Hadoop services after auth.
URLs rewritten to refer to gateway
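The URL rewriting described on the Knox slide above can be sketched as a small function that maps an internal WebHDFS URL to the address a client outside the firewall would use. This is an illustration of the idea, not Knox's implementation. The host names are hypothetical, and the gateway port 8443, the gateway path "gateway", the topology name "default", and the NameNode port 50070 are assumed values that an actual deployment may configure differently.

```python
from urllib.parse import urlsplit, urlunsplit


def to_gateway_url(internal_url, gateway_host, gateway_port=8443,
                   gateway_path="gateway", topology="default"):
    """Rewrite an internal service URL so that it refers to the gateway.

    The service path and query string are kept; the scheme, host and port
    are replaced, and the gateway path and topology name are prefixed.
    """
    parts = urlsplit(internal_url)
    new_path = "/" + gateway_path + "/" + topology + parts.path
    netloc = f"{gateway_host}:{gateway_port}"
    return urlunsplit(("https", netloc, new_path, parts.query, parts.fragment))


# Hypothetical internal WebHDFS address of a NameNode behind the firewall.
internal = "http://namenode.internal:50070/webhdfs/v1/tmp/data.csv?op=OPEN"
external = to_gateway_url(internal, "knox.example.com")
print(external)
```

Clients then only need to know the gateway address, and the cluster's internal host names stay hidden behind the DMZ.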

Operating Enterprise Hadoop

Ambari: Deploy, Manage, Monitor
[Figure: Ambari Web and REST APIs on top of Ambari Server, which provisions, manages and monitors the compute & storage nodes.]
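The REST APIs box in the Ambari figure above can be made concrete with a short sketch that prepares a request listing the clusters an Ambari Server manages. The server name and credentials are hypothetical, and port 8080 and the /api/v1/clusters resource are assumptions about a typical Ambari Server setup rather than details taken from the slides. The request is only constructed, not sent.

```python
import base64
import urllib.request


def ambari_request(server, path, user, password, port=8080):
    """Build (but do not send) an authenticated GET request to the Ambari REST API."""
    url = f"http://{server}:{port}/api/v1/{path.lstrip('/')}"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req = urllib.request.Request(url)
    # HTTP Basic authentication header.
    req.add_header("Authorization", f"Basic {token}")
    return req


# Hypothetical server and credentials.
req = ambari_request("ambari.example.com", "clusters", "admin", "admin")
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would send the request to a reachable server.
```

The same helper could address other resources under the API root by changing the path argument.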

Ambari: Deploy on Windows
Ambari: Manage on Windows
Ambari: Monitor on Windows

Ambari SCOM
Enables Microsoft System Center Operations Manager (SCOM) to monitor Hadoop
Ambari SCOM Management Pack gives insight into the performance and health of Hadoop
Ambari SCOM leverages the Ambari framework to aggregate and expose Hadoop metrics
[Figure: Hadoop (storage & process at scale); Ambari SCOM Server (aggregates + exposes Hadoop metrics); Ambari SCOM Mgmt Pack (monitors health + alerts in case of problems).]

For More Information
Web: hortonworks.com/products/hdp-windows/, hortonworks.com/labs/microsoft/, microsoft.com/bigdata
Training: hortonworks.com/hadoop-training/hadoop-on-windows/
Online documentation: docs.hortonworks.com
Forums: hortonworks.com/community/forums/

Questions?

DBI Track resources
27 Hands on Labs + 8 Instructor Led Labs in Hall 7
Free SQL Server 2014 Technical Overview ebook: microsoft.com/sqlserver and Amazon Kindle Store
Free online training at Microsoft Virtual Academy: microsoftvirtualacademy.com
Try new Azure data services previews! Azure Machine Learning, DocumentDB, and Stream Analytics

Resources
Sessions on Demand: http://channel9.msdn.com/Events/TechEd
Learning (Microsoft Certification & Training Resources): www.microsoft.com/learning
TechNet (Resources for IT Professionals): http://microsoft.com/technet
Developer Network: http://developer.microsoft.com

SUBMIT YOUR TECHED EVALUATIONS
We value your feedback!
Fill out an evaluation via
CommNet Station/PC: Schedule Builder
LogIn: europe.msteched.com/catalog
TechEd Mobile app for session evaluations is currently offline

© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
