WebScienceCourse - Leibniz Universität Hannover

Mining the Social Web
Jwan Alhussein, [email protected], 12 July 2016

Mining cross-cultural relations from Wikipedia: A study of 31 European food cultures
Authors: Paul Laufer, Claudia Wagner, Fabian Flöck, Markus Strohmaier

Introduction
Wikipedia is one of the primary sources of knowledge about foreign cultures. Uncovering diverging representations of cultures provides important insight, since such divergences may foster cross-cultural stereotypes, misunderstandings and potentially even conflict.

The Wikipedia article on "French cuisine" in the Romanian-language edition might surprise a French national when translated into her mother tongue. Unlike the French "original", it has no extended mention of French wines and only a very short paragraph on croissants and pastries. On the other hand, it features a section on foie gras and lamb dishes, so its coverage differs widely from that of the French-language edition.

Problem
Mining the relations between cultures as expressed on Wikipedia.

Approach and contributions
We introduce a computational approach for mining and assessing relations between cultural communities on Wikipedia along three dimensions:
- cultural understanding
- cultural similarity
- cultural affinity
by exploring the communities' descriptions of, and interest in, cultural practices.

2. RELATED WORK
Previous research has acknowledged that interesting differences exist between the language editions of Wikipedia. Unlike previous work, we exploit the collectively generated descriptions of cultures in different language versions of Wikipedia, since each language community may perceive and document its own and other cultures through its particular cultural lens. Countries with a lower Human Development Index, such as Russia or Poland, show less interest in editing and maintaining Wikipedia than more developed countries such as Denmark or Germany.

3. METHODS & DATASETS
We use language as a proxy for cultural communities, since language is closely linked to both national and cultural boundaries.

3.1 Cross-Cultural Relation Mining
- Cultural Similarity
- Cultural Understanding
- Cultural Affinity and Bias

Cultural Relations
Similarity, Understanding, Affinity

Cultural Similarity
The similarity of two cuisines is the Jaccard similarity of their concept sets: sim(A, B) = |A ∩ B| / |A ∪ B|.
[Figure: concept sets of Italian cuisine and German cuisine, containing Wheat Beer, Riesling, Pasta, Sausage, Parmigiano, Tortano, Sauerkraut and Pizza; sim(Italian cuisine, German cuisine) = 1/8]
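As a minimal sketch of the Jaccard computation on this slide: the eight concept names come from the figure, but the extracted text does not show which concept the two cuisines share, so the split below (with "Sausage" in both sets) is an assumption chosen only to reproduce the 1/8 value. The function name `jaccard` is likewise illustrative.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two concept sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty sets: define the similarity as 0 to avoid dividing by zero.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical assignment of the slide's eight concepts to the two cuisines.
# Which concept is shared is an assumption; the slide only gives the result 1/8.
italian = {"Pasta", "Parmigiano", "Tortano", "Pizza", "Sausage"}
german = {"Wheat Beer", "Riesling", "Sauerkraut", "Sausage"}

similarity = jaccard(italian, german)  # one shared concept out of eight
```

With one shared concept and eight concepts overall, the sketch yields 1/8, matching the value on the slide.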

Cultural Similarity between Neighbors

Cultural Understanding
[Figure: Understanding the Italian food culture. Columns: Wikipedia edition, Used concepts, Understanding; the used concepts are compared against the native definition. Example understanding values: 2/5 and 0/6.]

What may explain Cultural Understanding?
Example: Germany
- Create for each country a list of countries ranked by where most of its immigrants come from.
- Create for each country a list of countries ranked by how similar their values and beliefs are according to the ESS.

Pair                Correlation (p-value)
wiki - ess          0.18 (0.00019)
wiki - migration    0.36 (1.74e-22)
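The slides do not say which coefficient was used to compare the ranked lists. As a sketch under the assumption that a rank correlation such as Spearman's rho is meant, two rankings of the same countries can be compared as follows. The function name `spearman_rho` and the example rankings are hypothetical, not data from the study.

```python
def spearman_rho(ranking_a, ranking_b):
    """Spearman rank correlation of two tie-free rankings of the same items."""
    n = len(ranking_a)
    if n < 2 or len(ranking_b) != n:
        raise ValueError("rankings must have the same length of at least 2")
    if len(set(ranking_a)) != n or set(ranking_a) != set(ranking_b):
        raise ValueError("rankings must contain the same distinct items")
    # Position of every item in the second ranking.
    position_b = {item: rank for rank, item in enumerate(ranking_b)}
    # Sum of squared rank differences.
    d_squared = sum((rank - position_b[item]) ** 2
                    for rank, item in enumerate(ranking_a))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical rankings for one country (illustration only).
by_wikipedia_understanding = ["AT", "NL", "PL", "IT"]
by_migration = ["PL", "AT", "IT", "NL"]
rho = spearman_rho(by_wikipedia_understanding, by_migration)
```

A value near 1 means the two lists order the countries almost identically, a value near -1 means the orders are reversed, and a value near 0 means the orders are unrelated.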

ESS (European Social Survey) is a biennial 30-country survey of attitudes, beliefs and behavior.

Cultural Affinity
Based on view statistics of cuisine pages in different language editions.
How much more attention than we would expect does language community A pay to the culture of community B?

Self-Focus & Regional Bias

Summary
- Affinities between language communities are present in Wikipedia and drive the attention process.
- Cultural understanding can to some extent be explained by migration.
- Cultural similarities inferred from Wikipedia are fairly plausible (CrowdFlower).

Relation between similarity, understanding and affinities?
- Understanding and affinity: -0.35
- Similarity and affinity: 0.27
- Similarity and understanding: 0.19
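The affinity question on the Cultural Affinity slide ("how much more attention than we would expect") can be read as an observed-to-expected ratio. The slides do not give the exact formula, so the sketch below is an assumption: the expected share of a culture is taken to be its share of views across all language editions. The function name `affinity` and the view counts are invented for illustration.

```python
def affinity(views, community, culture):
    """Observed share of a community's views going to a culture,
    divided by that culture's share of views across all communities."""
    row = views[community]
    community_total = sum(row.values())
    culture_total = sum(r.get(culture, 0) for r in views.values())
    overall_total = sum(sum(r.values()) for r in views.values())
    if community_total == 0 or culture_total == 0:
        raise ValueError("need non-zero view counts")
    observed = row.get(culture, 0) / community_total
    expected = culture_total / overall_total
    return observed / expected


# Hypothetical page-view counts per language edition (not real statistics).
views = {
    "de": {"Italian cuisine": 60, "French cuisine": 40},
    "ro": {"Italian cuisine": 20, "French cuisine": 80},
}
de_to_italian = affinity(views, "de", "Italian cuisine")
```

Under this reading, a ratio above 1 means the community pays more attention to the culture than expected, and a ratio below 1 means it pays less.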

Democrats, Republicans and Starbucks Afficionados: User Classification in Twitter
Authors: Marco Pennacchiotti, Ana-Maria Popescu

Classification task
The starting point is to fill in incomplete user attributes by classifying each user with respect to the missing attribute. Most users do not explicitly state their political view, for example. There are various methods for solving the user classification problem.
What do we have in the social media domain?
- User attributes: users have many attributes, such as age, gender, etc. A classifier can be trained on these attributes.
- Social network: users have friends whom they follow.
How can the classification task be defined so that these two types of information, user attributes and the social network, are combined?

Machine learning model
A novel architecture combining user-centric information and social network information:
- User-centric information consists of the attributes of the users, called features hereafter.
- Social network information is the information about the friends of the users.
- This combination is the main contribution of the paper.
Procedure:
- Use the Gradient Boosted Decision Trees (GBDT) framework as the classification algorithm.
- Train the GBDT on the given labeled input data.
- Label the users with the resulting classifier.
- Apply the same classifier to the friends of the users and label the friends as well.
- Finally, update each user's label with respect to her friends' labels using an update formula.

User-Centric Information
User-centric information is represented as features. The feature set is very large and consists of four main parts:
- Profile features (PROF): user name, use of an avatar picture, date of account creation, etc.
- Tweeting behavior features (BEHAV): average number of tweets per day, number of replies, etc.
- Linguistic content features: the richest feature set, comprised of five sub-feature sets; uses Latent Dirichlet Allocation (LDA) as the language model.
  - Prototypical words (LING-WORD): words that are typical of the users in a class, found probabilistically from the data. First partition the users into n classes, then find the most frequent words for each class and take the k most used words per class.
  - Prototypical hashtags (LING-HASH): hashtags (#) denote topics; same technique as for prototypical words.
  - Generic LDA (LING-GLDA): topics are extracted with the LDA model, and each user is represented as a distribution over topics. The LDA model is trained on all users.
  - Domain-specific LDA (LING-DLDA): same as generic LDA, but trained on a task-specific training set, such as users who are only Democrats or Republicans.
  - Sentiment words (LING-SENT): a small, manually collected set of terms (for example, Ronald Reagan: good or bad?). The OpinionFinder tool gives the sentiment as positive, negative or neutral.
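The prototypical-word procedure above (partition the users into classes, count words per class, keep the k most used) can be sketched as below. This follows the frequency-based description on the slide only; whitespace tokenisation, the function name `prototypical_words` and the sample tweets are assumptions for illustration, and the paper's actual scoring may differ.

```python
from collections import Counter


def prototypical_words(tweets_by_class, k):
    """Return the k most frequent words for each class of users."""
    proto = {}
    for label, tweets in tweets_by_class.items():
        # Count lower-cased, whitespace-separated tokens over all tweets of the class.
        counts = Counter(word
                         for tweet in tweets
                         for word in tweet.lower().split())
        proto[label] = [word for word, _ in counts.most_common(k)]
    return proto


# Hypothetical tweets, already grouped by class label.
tweets_by_class = {
    "democrat": ["healthcare reform now", "healthcare vote"],
    "republican": ["tax cuts now", "tax tax vote"],
}
proto = prototypical_words(tweets_by_class, k=1)
```

The same counting scheme would apply to prototypical hashtags by keeping only tokens that start with "#".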

User-Centric Information
Social network features, a combination of two different feature types:
- Friend accounts (SOC-FRIE): captures whether users with different labels, such as Democrats and Republicans, share the same friends.
- Prototypical replied (SOC-REP) and retweeted (SOC-RET) users: the most frequently mentioned (@) and retweeted (RT) users for each label.
That is all for the user-centric information: a very large feature set indeed.

Experimental Evaluation
Three binary classification tasks:
- Political affiliation: Democrat or Republican; 5169 Democrats and 5169 Republicans; 1.2 million friends.
- Ethnicity: African American or not; 3000 African Americans and 3000 non-African Americans; 508K friends.
- Following a business: following Starbucks or not; 5000 Starbucks followers and 5000 non-followers; 981K friends.

Experimental Results, Political Affiliation Task
- Among the three tasks, the combined HYBRID model achieves its best result here; however, the increase over the ML model alone is not significant.
- Social network features are very successful. This is because users with a particular political view are friends with users holding similar views.
- Supporting this, the graph-based label update is also very successful on its own.

Overall Comments
#1 The ML method is mostly good enough, and the update part of the architecture does not bring significant improvement. If the task allows users to form a community, the update function works; otherwise it may even hurt the stand-alone ML system, as in the ethnicity case.
#2 Linguistic features are always reliable.
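The slides do not reproduce the update formula, so the following is only a sketch of how a graph-based label update could work, assuming the final score is a weighted blend of a user's own classifier score and the mean score of her friends. The weight `alpha`, the function name `update_scores` and the scores are all hypothetical.

```python
def update_scores(user_scores, friend_scores, friends_of, alpha=0.5):
    """Blend each user's classifier score with the mean score of her friends.

    alpha weights the user's own score; (1 - alpha) weights the friends' mean.
    Users without scored friends keep their original score.
    """
    updated = {}
    for user, own in user_scores.items():
        scored = [friend_scores[f]
                  for f in friends_of.get(user, [])
                  if f in friend_scores]
        if scored:
            friends_mean = sum(scored) / len(scored)
            updated[user] = alpha * own + (1 - alpha) * friends_mean
        else:
            updated[user] = own
    return updated


# Hypothetical classifier outputs in [0, 1] (for example, probability of "Democrat").
user_scores = {"u1": 0.2, "u2": 0.9}
friend_scores = {"f1": 1.0, "f2": 0.6}
friends_of = {"u1": ["f1", "f2"]}
updated = update_scores(user_scores, friend_scores, friends_of)
```

In this sketch u1 is pulled from 0.2 towards her friends' mean of 0.8. That helps when friends share the label, as in the political task, and can hurt when they do not, which is consistent with comment #1.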

Review
The novelty of combining the two types of information is attractive; however, there are serious points that should be criticized.
- First of all, the classifier does only binary classification, and nothing is said about multi-dimensional classification. Doing multi-dimensional classification with binary classifiers is time-consuming and weakens the claim about scalability.
- As said, the idea of the novel architecture is attractive; however, the results show that the label update does not work well. Why? The authors give no appreciable comment on why the label update does not work well. This, I believe, shows that the feature set and the novel architecture are not well studied.
- There are too many features, and the reasons why these features were selected are not given.
- Moreover, applying the same ML model to the users and their friends replicates information. Obviously, connected users will have some common and some different attributes; what is the point?
- The social graph should be used more effectively. I think it should not be used to update the labels but as a heavily weighted feature in the ML model, because we should superpose different information types instead of using one to compensate for the other. The difference can be seen by thinking in terms of vector spaces: updating means spanning the same vector again, whereas superposing means using both vectors concurrently. For example, the prototypical words could have been extracted using the network.

Review
- The authors mention Gradient Boosted Decision Trees (GBDT) but say nothing about this classification algorithm; an explanation of GBDT is expected, at least in principle.
- The same holds for the Latent Dirichlet Allocation (LDA) language model. This is the first time I have heard of this language model, and the authors say nothing about it beyond that it is used as the language model and is associated with topics. But what is LDA, and how is it associated with topics?
- There is no data analysis, a crucial omission in the paper: everything is data! The authors give only the number of users used in training, but what about the test set? A development set? Any other statistics about the data?
- Moreover, they use a different number of samples for each task. The label update is far less successful for the ethnicity task than for the political affiliation task; however, there are 1.2M friends for the political affiliation task but almost half as many, 508K, for the ethnicity task. Hence the cross-task comments are not reliable.
- The experiments are not done in a structured way. The authors have simply run the experiments and shown the results, without useful commentary. Besides, they do not explain why they chose these experiments. For example, I would want to see results for feature subsets: since individual feature sets alone mostly give very good results, some subset might improve the overall HYBRID result.

Thanks for your attention!
