Options for voted sessions

Please vote by 1/27 here


Text as data

  • Text analysis in social science research: Overview
    • Key points: typical process and applications, research design, text corpus resources
    • Readings (TBD):
      • GRS: Introduction, Social science research and text analysis
  • Preprocessing
    • Key points: regular expressions, tokenization, part-of-speech tagging, meaningful versus meaningless words, stopwords
    • Readings (TBD):
      • GRS: Selection and representation
      • JM: Regular Expressions, Text Normalization, Edit Distance
  • Text representation and vectorization methods
    • Key points: bag-of-words, count vector, word vector, distributed representation of words, word embedding, contextual word embedding
    • Readings (TBD):
      • JM: Vector semantics and embeddings
  • Text analysis: Scaling
    • Key points: semantic similarity, sentiment analysis
    • Readings (TBD):
      • Grimmer, Justin, and Brandon M. Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21 (3): 267–97. https://doi.org/10.1093/pan/mps028.
  • Text analysis: Identification
    • Key points: Classification, multilingual topic modeling, named-entity recognition
    • Readings (TBD):
      • Grimmer, Justin, and Brandon M. Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21 (3): 267–97. https://doi.org/10.1093/pan/mps028.
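
To make the preprocessing and vectorization key points above concrete, here is a minimal sketch using only the Python standard library. It tokenizes with a regular expression, drops stopwords, builds bag-of-words count vectors, and compares documents by cosine similarity. The stopword list, the example sentences, and all function names are illustrative assumptions, not course material; in practice a package would normally handle these steps.

```python
import math
import re
from collections import Counter

# Toy stopword list (an assumption for illustration; real lists are much longer).
STOPWORDS = {"the", "a", "of", "and", "is", "to", "in"}


def tokenize(text):
    """Lowercase the text and extract alphabetic tokens with a regular expression."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Return a count vector (token -> frequency) with stopwords removed."""
    return Counter(t for t in tokenize(text) if t not in STOPWORDS)


def cosine_similarity(a, b):
    """Cosine similarity between two count vectors stored as Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example documents.
doc1 = "The senator praised the budget bill."
doc2 = "The budget bill is praised in the senate."
doc3 = "Volunteers cleaned the river."

sim_12 = cosine_similarity(bag_of_words(doc1), bag_of_words(doc2))
sim_13 = cosine_similarity(bag_of_words(doc1), bag_of_words(doc3))
print(sim_12, sim_13)
```

Note that "senator" and "senate" count as different words here, which is one limitation of plain count vectors that word embeddings are meant to address.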

Relation as data

  • Network analysis in social science research: Overview
    • Key points: Basic concepts and applications, research design, network components and levels of analysis
    • Readings (TBD):
      • Scott, John. 2017. “What Is Social Network Analysis?” In Social Network Analysis, Fourth edition. Thousand Oaks, CA: SAGE Publications Ltd.
      • Watts, Duncan J. 2004. “The ‘New’ Science of Networks.” Annual Review of Sociology 30 (1): 243–70. https://doi.org/10.1146/annurev.soc.30.020404.104342.
      • Scott, John. 2017. “Terminology for Network Analysis.” In Social Network Analysis, Fourth edition, 73–94. Thousand Oaks, CA: SAGE Publications Ltd.
  • Data collection: How to generate networks
    • Readings (TBD):
      • Scott, John. 2017. “Organising and Analysing Network Data.” In Social Network Analysis, Fourth edition. Thousand Oaks, CA: SAGE Publications Ltd.
      • Scott, John. 2017. “Data Collection for Social Network Analysis.” In Social Network Analysis, Fourth edition. Thousand Oaks, CA: SAGE Publications Ltd.
  • Analysis of nodes
    • Key concepts: degree, betweenness, eigenvector centrality, etc.
    • Readings (TBD):
      • Scott, John. 2017. “Popularity Mediation and Exclusion.” In Social Network Analysis, Fourth edition. Thousand Oaks, CA: SAGE Publications Ltd.
  • Analysis of communities
    • Key concepts: community detection (Louvain clustering, “rich club”)
    • Readings (TBD):
      • Scott, John. 2017. “Groups, Factions and Social Divisions.” In Social Network Analysis, Fourth edition. Thousand Oaks, CA: SAGE Publications Ltd.
  • Network topology and hypothesis testing
    • Key concepts: modularity, clustering coefficients, random graph.
    • Readings (TBD).
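
As a rough illustration of two of the key concepts listed above (degree centrality and the clustering coefficient), the following sketch computes both by hand on a toy undirected network, using only the Python standard library. The edge list and function names are assumptions made up for this example; a network package would usually do this work.

```python
from itertools import combinations

# Hypothetical toy network: a triangle (A, B, C) with a tail C - D - E.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]


def build_adjacency(edge_list):
    """Build an undirected adjacency map: node -> set of neighbors."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_centrality(adj):
    """Degree divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}


def clustering_coefficient(adj, node):
    """Share of a node's neighbor pairs that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2 * links / (k * (k - 1))


adj = build_adjacency(edges)
centrality = degree_centrality(adj)
clustering = {node: clustering_coefficient(adj, node) for node in adj}
print(centrality)
print(clustering)
```

In this toy network, C has the highest degree centrality because it bridges the triangle and the tail, while A and B have the highest clustering because their two neighbors are connected to each other.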

Here I recommend some Python packages based on my own research experience (I may cover some of them in class). Neither the list nor my descriptions are comprehensive. As a social science researcher, I usually define the goals of my analysis first, then look for appropriate packages or functions. The technical documentation often enlightens (or empowers) me to respond to more novel questions.

  • NLTK: Preprocessing.
  • Stanza: Preprocessing, POS, NER, sentiment analysis.
  • Gensim: Preprocessing, vectorization, topic modeling (fixed word-embedding).
  • BERTopic: Topic modeling (fixed and contextualized word-embedding, multilingual support, visualization).
  • Top2Vec: Topic modeling (fixed and contextualized word-embedding, multilingual support). I recently used it for a multilingual topic modeling task.
  • SentenceTransformers: Vectorize sentences or documents. Used by many of the preceding packages. I sometimes use it to obtain the raw vector values if the analysis requires them (e.g., calculating text similarity in this and this article, visualizing semantic spaces, etc.).
  • Transformers: Train or fine-tune pretrained BERT models. Used by many of the preceding packages. I used it to fine-tune a BERT model for classifying nonprofits according to their mission statements.
  • NetworkX: Network analysis.
  • igraph: Network analysis. More efficient than NetworkX, but I primarily use it for visualization or for functions that NetworkX does not have.
  • Gephi: Network visualization. Computing metrics on large networks in Gephi is very slow and strongly discouraged. I usually use NetworkX for crunching numbers, then Gephi for visualization.
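
The NetworkX-then-Gephi workflow mentioned in the last item might look roughly like the sketch below: compute centrality measures in NetworkX, attach them as node attributes, and write a GEXF file that Gephi can open. It assumes NetworkX is installed; the toy edge list, attribute names, and output file name are hypothetical choices for illustration.

```python
import os
import tempfile

import networkx as nx

# Hypothetical toy network: a triangle (A, B, C) with a tail C - D - E.
G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])

# Crunch the numbers in NetworkX.
degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)

# Store the results on the nodes so they travel with the exported file.
nx.set_node_attributes(G, degree, "degree_centrality")
nx.set_node_attributes(G, betweenness, "betweenness")

# Write a GEXF file; the path below is a temporary location chosen for this sketch.
gexf_path = os.path.join(tempfile.mkdtemp(), "toy_network.gexf")
nx.write_gexf(G, gexf_path)
print("Wrote", gexf_path)
```

Once the file is opened in Gephi, the stored attributes should be available for sizing or coloring nodes, so no metrics need to be recomputed there.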