Web Data Management - Laurentian

Web Data Management
COSC 4806

Introduction
The World Wide Web is:
- a vast, widely distributed collection of semi-structured multimedia documents
- a heterogeneous collection of documents
- documents in the form of web pages
- documents connected via hyperlinks

World Wide Web
- The web is growing rapidly
- Business organizations are increasingly presenting information on the Web
- Business on the highway
- Myriad of raw data to be processed for information

World Wide Web
- The web is a fast-growing, distributed and non-administered global information resource
- The WWW allows access to text, images, video, sound and graphical data
- An ever-increasing number of businesses are building web servers
- A chaotic environment in which to locate information of interest
- "Lost in hyperspace" syndrome

World Wide Web
Characteristics of the WWW:
- it is a set of directed graphs
- data is heterogeneous, self-describing and schema-less
- unstructured, deeply nested information
- no central authority for information management
- dynamic information vs. static information
- web information discovery: search engines
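The "set of directed graphs" characteristic can be made concrete with a small sketch. The page names and links below are invented for illustration; the only point is that hyperlinks are directed edges, so a page's outgoing links and incoming links are different sets.

```python
from collections import defaultdict

# Hypothetical pages and hyperlinks, invented for illustration only.
links = [
    ("home.html", "courses.html"),
    ("home.html", "staff.html"),
    ("courses.html", "cosc4806.html"),
    ("staff.html", "cosc4806.html"),
]

def build_graph(edges):
    """Return (out_links, in_links) adjacency maps for directed edges."""
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)   # src points to dst
        in_links[dst].add(src)    # dst is pointed to by src
    return out_links, in_links

out_links, in_links = build_graph(links)
print(sorted(out_links["home.html"]))     # pages that home.html links to
print(sorted(in_links["cosc4806.html"]))  # pages that link to cosc4806.html
```

Keeping both maps is a design choice for this sketch: link-based queries and popularity measures need the incoming direction, which a crawler following only outgoing links does not see directly.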

World Wide Web
Rapid growth of the web: in 1994, the WWW grew by 1758%!
- June 1993: 130
- June 1994: 1,265
- Dec. 1994: 11,576
- April 1995: 15,768
- July 1995: 23,000+
- January 2005: 11.5 billion publicly indexed web pages

World Wide Web
.com domains on the rise, as of July 2006:
- 76,683,115 hosts for com domains
- 10,232,188 hosts for edu domains
- 185,919,955 hosts for net domains
- 727,773 hosts for gov domains
- 1,933,551 hosts for mil domains
- 1,660,470 hosts for org domains

World Wide Web
The exponential growth of the Internet is reflected in the number of hosts on the net:
- 1,000 in 1984
- 10,000 in 1987
- 100,000 in 1989
- 1,000,000 in 1992
- 10,000,000 in 1996
- 100,000,000 in 2000
- 171,638,297 in 2003
- 489,774,269 in July 2007
Net Timeline (http://www.pbs.org/internet/timeline/)
Internet Domain Survey (http://www.isc.org/ds/)

World Wide Web
Distribution of hosts (worldwide)

Country/region    Hosts
US                195,138,696
European Union    22,000,414
Japan             21,304,292
Germany           7,657,162
Netherlands       6,781,729
South Korea       5,433,591
Australia         5,351,622
UK                4,688,307
Brazil            4,392,693
Taiwan            3,838,383

World Wide Web
Popular search methods

- Email: 77%
- Search engine: 63%
- Get news: 46%
- Job-related search: 29%
- Instant messaging: 18%
- Online banking: 18%
- Chat room: 8%
- Travel reservation: 5%
- Read blogs: 3%
- Online auction: 3%

World Wide Web

Key limitations of search engines:
- do not exploit hyperlinks
- search limited to string matching
- queries evaluated on archived data rather than up-to-date data; no indexing on current data
- low accuracy; replicated results
- no further manipulation possible

World Wide Web
Key limitations of search engines (contd.):
- ERROR 404!
- No efficient document management
- Query results cannot be further manipulated
- No efficient means for knowledge discovery

World Wide Web
More issues:
- specifying/understanding what information is wanted
- the high degree of variability of accessible information
- the variability in conceptual vocabulary or ontology used to describe information
- complexity of querying unstructured data

World Wide Web (contd.)
- complexity of querying structured data
- uncontrolled nature of web-based information content
- determining which information sources to search/query

World Wide Web
Search engine capabilities:
- Selection of language
- Keywords with disjunction, adjacency, presence, absence, ...
- Word stemming (Hotbot)
- Similarity search (Excite)
- Natural language (LycosPro)
- Restrict by modification date (Hotbot) or range of dates (AltaVista)
- Restrict result types (e.g., must include images) (Hotbot)
- Restrict by geographical source (content or domain) (Hotbot)
- Restrict within various structured regions of a document: titles or URLs (LycosPro); summary, first heading, title, URL (Opentext)
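The keyword operators listed above (presence, absence, disjunction, adjacency) can be illustrated with a minimal matcher. This is a sketch under assumptions, not how any of the named engines worked: the function name `matches`, its parameters, and the default window of 3 words are all invented here, and query terms are assumed to be given in lowercase.

```python
import re

def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches(text, required=(), excluded=(), any_of=(), near=None, window=3):
    """Check one document against a simple keyword query.

    required: every term must be present (presence)
    excluded: no term may be present (absence)
    any_of:   at least one term must be present (disjunction)
    near:     pair (a, b) that must occur within `window` words (adjacency)
    """
    words = tokenize(text)
    vocab = set(words)
    if any(term not in vocab for term in required):
        return False
    if any(term in vocab for term in excluded):
        return False
    if any_of and not any(term in vocab for term in any_of):
        return False
    if near is not None:
        a, b = near
        pos_a = [i for i, w in enumerate(words) if w == a]
        pos_b = [i for i, w in enumerate(words) if w == b]
        if not any(abs(i - j) <= window for i in pos_a for j in pos_b):
            return False
    return True

doc = "Linux user groups are active in Iceland and Norway"
print(matches(doc, required=["linux"], excluded=["windows"]))
print(matches(doc, near=("linux", "iceland"), window=6))
```

Every condition here is plain string matching on words, which is exactly the limitation noted earlier: the query never looks at hyperlinks or document structure.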

World Wide Web
Search & Retrieval

Search engine     % web covered
Hotbot            34
AltaVista         28
Northern Light    20
Excite            14
Infoseek          10
Lycos             3

Using several search engines is better than using only one.

World Wide Web

Schemes to locate information:
- Supervised links between sites: "ask at the reference desk"
  - Gopher (Univ. of Minnesota): menu format with links both to sites and content
- Classification of documents: "search in the catalog"
  - Archie (McGill Univ.): system to automatically gather, index and serve information from all anonymous FTP sites
- Automated searching: "wander around the library"
  - Use META tags to gather metadata
  - Spiders (robots, web crawlers)

World Wide Web
Popular search engines

Year 2000        Year 2001
AltaVista        Yahoo
HotBot           Google
NorthernLight    AltaVista
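The automated-searching scheme mentions using META tags to gather metadata. A minimal sketch of that step, using Python's standard `html.parser` module, is shown here. The class name `MetaTagCollector`, the helper `extract_meta`, and the sample page are assumptions made for illustration; a real spider would also fetch pages and follow links.

```python
from html.parser import HTMLParser

class MetaTagCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content

def extract_meta(html):
    """Return a dict of META name -> content found in an HTML string."""
    collector = MetaTagCollector()
    collector.feed(html)
    return collector.meta

# Hypothetical page, invented for this example.
page = """<html><head>
<title>Web Data Management</title>
<meta name="keywords" content="web, data, query languages">
<meta name="description" content="Course notes">
</head><body>...</body></html>"""

print(extract_meta(page))
```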

World Wide Web
Boolean search in AltaVista

World Wide Web
Specifying field content in HotBot

World Wide Web
Natural language interface in AskJeeves

World Wide Web
Examples of search strategies:
- Rank web pages based on popularity
- Rank web pages based on word frequency
- Match query to an expert database
The major search engines use a mixed strategy.

World Wide Web
Frequency-based ranking
Library analogue: keyword search
Basic factors in HotBot ranking of pages:
- words in the title
- keyword meta tags
- word frequency in the document
- document length

World Wide Web
Alternative word frequency measures:
- Excite uses a thesaurus to search for what you want, rather than what you ask for
- AltaVista allows you to look for words that occur within a set distance of each other
- NorthernLight weights results by search term sequence, from left to right

World Wide Web
Popularity-based ranking
Library analogue: citation index

The Google strategy for ranking pages:
- Rank is based on the number of links to a page
- Pages with a high rank have a lot of other web pages that link to them
- The formula is on the Google help page

World Wide Web
More on popularity ranking:
- The Google philosophy is also applied by others, such as NorthernLight
- HotBot measures the popularity of a page by how frequently users have clicked on it in past search results

World Wide Web
Expert databases: Yahoo
- An expert database contains predefined responses to common queries
- A simple approach is a subject directory, e.g. in Yahoo!, which contains a selection of links for each topic
- The selection is small, but can be useful
Library analogue: trustworthy references

World Wide Web
Expert databases: AskJeeves
- AskJeeves has predefined responses to various types of common queries
- These prepared answers are augmented by a meta-search, which searches other search engines
Library analogue: reference desk
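The frequency-based and popularity-based strategies described above, and the remark that major engines mix them, can be sketched as follows. This is a toy model under stated assumptions: the pages, links, and the `link_weight` mixing factor are invented, popularity is simply the count of inbound links, and nothing here reproduces the actual formula of Google, HotBot, or any other engine.

```python
import re
from collections import Counter

# Hypothetical pages and links, invented for illustration only.
pages = {
    "a": "wine regions of france and french wine tasting",
    "b": "linux distributions overview",
    "c": "best wine shops and cellars in paris",
}
links = [("b", "c"), ("a", "c")]  # (source, target) pairs

def term_frequency(text, term):
    """Occurrences of term divided by document length in words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return 0.0
    return Counter(words)[term] / len(words)

def inbound_counts(edges):
    """Number of links pointing at each page (the popularity signal)."""
    return Counter(dst for _, dst in edges)

def rank(pages, edges, term, link_weight=0.1):
    """Order pages containing `term` by frequency plus weighted inbound links."""
    inbound = inbound_counts(edges)
    scored = []
    for name, text in pages.items():
        tf = term_frequency(text, term)
        if tf == 0:
            continue  # page does not mention the term at all
        scored.append((tf + link_weight * inbound[name], name))
    return [name for _, name in sorted(scored, reverse=True)]

print(rank(pages, links, "wine", link_weight=0.0))  # frequency only
print(rank(pages, links, "wine"))                   # frequency plus popularity
```

With `link_weight=0.0` the ordering depends only on word frequency and document length; with the default weight, the two inbound links to page "c" are enough to move it ahead of page "a".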

World Wide Web
Example: best wines in France; AskJeeves

World Wide Web
Best wines in France; HotBot

World Wide Web
Best wines in France; Google

World Wide Web
Linux in Iceland; Google

World Wide Web
Linux in Iceland; HotBot

World Wide Web
Linux in Iceland; AskJeeves

Web Data Management

Web Data Management; key objectives
- Design a suitable data model to represent web information
- Development of web algebra and query language, query optimization
- Maintenance of Web data: view maintenance
- Development of knowledge discovery and web mining tools
- Web warehouse
- Data integration, secondary storage, indexes

Web Data Management
Limitations of the web

- Applications cannot consume HTML
- HTML wrapper technology is brittle
- Companies merge and need interoperability

Web Data Management
Paradigm shift
- New Web standards: XML
- XML is generated by applications and consumed by applications
- Data exchange
  - Across platforms: enterprise interoperability
  - Across enterprises
- Web: from documents to data

Web Data Management
Database challenges:
- Query optimization and processing
- Views and transformations
- Data warehousing and data integration
- Mediators and query rewriting
- Secondary storage
- Indexes

Web Data Management
DBMS needs a paradigm shift too
Web data differs from database data:
- self-describing, schema-less
- structure changes without notice
- heterogeneous, deeply nested, irregular
- documents and data mixed
- designed by document experts, not DB experts
- need Web Data Management
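The differences listed above (self-describing, schema-less, heterogeneous, deeply nested, irregular) can be seen in a small sketch. The records below are invented; the function `label_paths` is a hypothetical helper that discovers which attribute paths actually occur, since no schema is declared in advance.

```python
def label_paths(value, prefix=()):
    """Yield every path of attribute labels found in nested dicts/lists."""
    if isinstance(value, dict):
        for label, child in value.items():
            path = prefix + (label,)
            yield path
            yield from label_paths(child, path)
    elif isinstance(value, list):
        for child in value:
            yield from label_paths(child, prefix)

# Hypothetical records: the same attribute ("author") has different types,
# and attributes appear or disappear from one record to the next.
records = [
    {"title": "Page A", "author": "Lee", "links": [{"href": "b.html"}]},
    {"title": "Page B", "author": {"first": "Ann", "last": "Roy"}},
    {"title": "Page C", "year": 1999},
]

schema = sorted(set(label_paths(records)))
for path in schema:
    print(".".join(path))
```

In a relational database the schema would be fixed before any row is stored; here it is recovered from the data after the fact and may change with the next record.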

Web Data Management
Web data representation
- HTML - Hypertext Markup Language
  - fixed grammar, no regular expressions
  - simple representation of data
  - good for simple data and intended for human consumption
  - difficult to extract information
- SGML - Standard Generalized Markup Language
  - good for publishing deeply structured documents
- XML - Extensible Markup Language
  - a subset of SGML

Web Data Management
Terminology
- HTML - Hypertext Markup Language
- HTTP - Hypertext Transfer Protocol
- URL - Uniform Resource Locator
  example: <URL> := <protocol>://<host>/<path>/<filename>[<#location>]
  where
  - <protocol> is http, ftp, gopher
  - <host> is the internet address
  - <#location> is a textual label in the file

Web Data Management
Prevalent, persistent and informative
- HTML documents (now XML) created by humans or applications

- Accessed day in and day out by humans and applications
- Persistent HTML documents
Can database technology help?

Web Data Management
Some recent research projects
- Web query systems: W3QS, WebSQL, AKIRA, NetQL, RAW, WebLog, Araneus
- Semi-structured data management: LOREL, UnQL, WebOQL, Florid
- Website management systems: STRUDEL, Araneus
- Web warehouse: WHOWEDA

Web Data Management
Main tasks
- Modeling and querying the Web
  - view the web as a directed graph
  - content- and link-based queries
  - example: find the pages that contain the word "Clinton" and are linked from a page containing the word "Monica"

Web Data Management
Main tasks (contd.)
- Information extraction and integration
  - wrapper: a program that extracts a structured representation of the data (a set of tuples) from HTML pages
  - mediator: software that integrates data by accessing multiple sources through a uniform interface
- Web site construction and restructuring
  - creating sites
  - modeling the structure of web sites
  - restructuring data

Web Data Management
What to model?
- Structure of Web sites
- Internal structure of web pages

- Contents of web sites in finer granularities

Web Data Management
Data representation of Web data
- Graph data models
- Semi-structured data models (also graph based)

Web Data Management
Graph data model
- Labeled graph data model where nodes represent web pages and arcs represent links between pages
- Labels on arcs can be viewed as attribute names
- Regular path expression queries

Web Data Management
Semi-structured data models
- Irregular data structure; no fixed schema is known, and the schema may be implicit in the data
- Schema may be large and may change frequently
- Schema is descriptive rather than prescriptive: it describes the current state of the data, but violations of the schema are still tolerated

Web Data Management
Semi-structured data models
- Data is not strongly typed; for different objects, the values of the same attribute may be of differing types (heterogeneous sources)
- No restriction on the set of arcs that emanate from a given node in a graph or on the types of the values of attributes
- Ability to query the schemas: arc variables get bound to labels on arcs, rather than to nodes in the graph

Web Data Management
Graph-based query languages
- Use graphs to model databases
- Support regular path expressions and graph construction in queries

- Examples:
  - GraphLog for hypertext queries
  - graph query language for OO

Web Data Management
Query languages for semi-structured data:
- Use labeled graphs
- Query the schema of data
- Ability to accommodate irregularities in the data, such as missing links
- Examples: Lorel (Stanford), UnQL (AT&T), STRUQL (AT&T)

Web Data Management
Comparing query systems

Web Data Management
Types of query languages
- First generation
- Second generation
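The labeled-graph model and regular path expressions mentioned in the slides above can be sketched with a small evaluator. This is a deliberately restricted form, assumed for illustration: a path is a list of steps, where each step is an arc label, `"_"` (any single arc), or `"*"` (zero or more arcs of any label). It is not the syntax of Lorel, UnQL, STRUQL, or GraphLog, and the graph data is invented.

```python
def eval_path(graph, start, steps):
    """Return the set of nodes reachable from `start` by following `steps`.

    graph maps a node to a list of (arc_label, target_node) pairs.
    """
    current = {start}
    for step in steps:
        if step == "*":
            # Closure: everything reachable through any number of arcs.
            seen = set(current)
            frontier = list(current)
            while frontier:
                node = frontier.pop()
                for _, target in graph.get(node, []):
                    if target not in seen:  # guards against cycles
                        seen.add(target)
                        frontier.append(target)
            current = seen
        else:
            current = {
                target
                for node in current
                for label, target in graph.get(node, [])
                if step == "_" or label == step
            }
    return current

# Hypothetical labeled graph; note the cycle s1 -> root.
graph = {
    "root": [("course", "c1"), ("staff", "s1")],
    "c1": [("title", "t1"), ("instructor", "s1")],
    "s1": [("name", "n1"), ("homepage", "root")],
}

print(eval_path(graph, "root", ["course", "instructor", "name"]))
print(eval_path(graph, "root", ["*", "name"]))
```

The `"*"` step is what lets a query tolerate irregular structure: it finds every `name` arc without knowing how deep in the graph it sits.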

Web Data Management
First-generation query languages
- Combine the content-based queries of search engines with structure-based queries
- Combine conditions on text patterns in documents with graph patterns describing link structures
- Examples: W3QL (Technion, Israel), WebSQL (Toronto), WebLOG (Concordia)
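A first-generation style query, combining a text condition on documents with a condition on link structure, might be sketched like this. The function `pages_linked_from`, the page contents, and the links are assumptions for illustration; this is plain Python, not W3QL, WebSQL, or WebLOG syntax.

```python
def pages_linked_from(pages, links, source_word, target_word):
    """Find pages containing target_word that are linked from a page
    containing source_word.

    pages maps a URL to its text; links is a list of (source, target) URLs.
    """
    def contains(url, word):
        return word.lower() in pages.get(url, "").lower().split()

    return sorted({
        dst
        for src, dst in links
        if contains(src, source_word) and contains(dst, target_word)
    })

# Hypothetical pages and links.
pages = {
    "p1": "Iceland travel notes",
    "p2": "Linux user group",
    "p3": "Linux kernel news",
    "p4": "Cooking recipes",
}
links = [("p1", "p2"), ("p4", "p3"), ("p1", "p4")]

print(pages_linked_from(pages, links, "iceland", "linux"))
```

Each page is treated as an opaque bag of words plus outgoing links, so the query can say which pages match but cannot return or restructure parts of a page.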

Web Data Management
Second-generation query languages
- Called web data manipulation languages
- Go beyond the first-generation view of web pages as atomic objects whose only properties are that they contain or do not contain certain text patterns and that they point to other objects
- Useful for data wrapping, transformation, and restructuring
- Useful for web site transformation and restructuring

Web Data Management
How do they differ?
- Provide access to the structure of the web objects they manipulate (return structure)
- Model the internal structures of web documents as well as the external links that connect them
- Support references to model hyperlinks, and some support ordered collections of records for more natural data representation
- Ability to create new complex structures as the result of a query

Web Data Management
Examples
- WebOQL
- STRUQL
- Florid

Web Data Management
Information integration
- To answer queries that may require extracting and combining data from multiple web sources
- Example: a movie database with data about movies, their star casts, directors, schedules, etc.
- "Give me the playing time and a review of movies starring Frank Sinatra that are playing tonight in Paris"

Web Data Management
Approaches
- Web warehouse: data from multiple web sources is loaded into a warehouse, and all queries are applied to the warehouse data
  - Disadvantage: the warehouse needs to be updated when data sources change
  - Advantage: performance improvement
- Virtual warehouse: data remains in the web sources, and queries are decomposed at run time into queries to the sources
  - Data is not replicated and is fresh
  - Because web sources are autonomous, query optimization and execution methodology may differ and performance may be affected
  - Good when the number of sources is large, data changes frequently, and there is little control over the web sources

Web Data Management
Virtual approach vs. DBMS
- In the virtual approach, the system does not communicate directly with a storage manager; instead it communicates with wrappers
- The user does not pose queries directly in the schema in which data is stored, and so is free from knowing the structure
- The user poses queries to a mediated schema: virtual relations (not stored anywhere) designed for a particular application

Web Data Management
Data integration steps
- Specification of mediated schema and reformulation
  - The mediated schema is the set of collection and attribute names needed to formulate queries
  - The data integration system translates the query on the mediated schema into queries to the data sources
- Completeness of data in web sources
- Differing query processing capabilities
- Query optimization: selecting a set of minimal sources and minimal queries
- Wrapper construction
- Matching objects across sources
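The virtual approach described above (wrappers per source, a mediated schema, and run-time decomposition of a query) can be sketched for the movie example. Everything concrete here is an assumption: the three wrapper functions return hard-coded tuples in place of parsing real web pages, and the film titles, actor name, and reviews are placeholders.

```python
def schedule_wrapper():
    """Hypothetical wrapper: would turn a cinema listing page into tuples."""
    return [
        {"title": "Film A", "city": "Paris", "time": "20:00"},
        {"title": "Film B", "city": "Paris", "time": "21:30"},
        {"title": "Film A", "city": "Lyon", "time": "19:00"},
    ]

def cast_wrapper():
    """Hypothetical wrapper over a movie database source."""
    return [
        {"title": "Film A", "actor": "Actor X"},
        {"title": "Film B", "actor": "Actor Y"},
    ]

def review_wrapper():
    """Hypothetical wrapper over a review site."""
    return [
        {"title": "Film A", "review": "A classic."},
        {"title": "Film B", "review": "Uneven."},
    ]

def mediated_query(actor, city):
    """Answer a query posed on the mediated schema (title, time, review).

    The query is decomposed into one sub-query per source, and the
    results are matched across sources on the movie title.
    """
    titles = {row["title"] for row in cast_wrapper() if row["actor"] == actor}
    reviews = {row["title"]: row["review"] for row in review_wrapper()}
    return [
        {"title": s["title"], "time": s["time"], "review": reviews.get(s["title"])}
        for s in schedule_wrapper()
        if s["city"] == city and s["title"] in titles
    ]

print(mediated_query("Actor X", "Paris"))
```

The caller never sees how the three sources are laid out; it only uses the mediated attribute names. Matching on an exact title string is the simplest possible stand-in for the "matching objects across sources" step, which in practice is much harder.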
