Late submissions are not accepted.

Assignment 1: Complete the data topic page (individual, 5% of course grade)


  1. Complete the sign-up sheet.
  2. Find a dataset of your interest and a related academic paper.
  3. Fork the course website, edit the data topic page accordingly, and create a pull request.

Knowledge and skills practiced:

  • Using GitHub;
  • Markdown language.

Assignment 2: Create your own cloud computing server (individual, 5% of course grade)


  1. Create a 48-core VM on ChameleonCloud;
  2. Install Anaconda Python;
  3. Run a Jupyter Notebook server with password and SSL for encrypted communication;
  4. Log in to your Jupyter Notebook server through a web browser;
  5. Save an image of your instance, submit screenshots through Canvas showing: 1) instance image is saved successfully, and 2) Jupyter server is started successfully;
  6. After submission, release your IP and server to other users (if you don’t plan to use the instance).
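For step 3, the password and SSL settings can go in Jupyter's configuration file. A minimal sketch, assuming the classic Notebook server (the file paths and password hash below are placeholders; generate the hash with `notebook.auth.passwd()` and a self-signed certificate with `openssl`):

```python
# ~/.jupyter/jupyter_notebook_config.py -- illustrative values only
c.NotebookApp.ip = '0.0.0.0'                    # listen on all interfaces
c.NotebookApp.open_browser = False              # headless server
c.NotebookApp.password = 'sha1:...'             # hash from notebook.auth.passwd()
c.NotebookApp.certfile = '/home/cc/mycert.pem'  # self-signed certificate (placeholder path)
c.NotebookApp.keyfile = '/home/cc/mykey.key'    # private key (placeholder path)
```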

** Special attention: ChameleonCloud often has technical glitches, so please DON’T put this assignment off until the last minute. If you run into a technical issue, submit a ticket through the Help Desk. The staff generally work only on weekdays and will reply within one or two business days. Again, don’t procrastinate on this assignment. **

Knowledge and skills practiced:

  • Using cloud computing platform;
  • Using command line terminal and Linux system.

Assignment 3: Parallel computing (individual, 5% of course grade)


  1. Start a Jupyter server as you did in Assignment 2;
  2. Install htop;
  3. Define a function to clean, process, or analyze a large dataset of your interest;
  4. Compare the efficiency of serial computing to that of parallel computing;
  5. Submit a screenshot of htop showing all cores are crunching numbers.
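Step 4's serial-versus-parallel comparison can be sketched with the standard library's multiprocessing module. In this sketch, `tokenize` and the generated record strings are placeholders for your own dataset and processing function:

```python
import time
from multiprocessing import Pool, cpu_count

def tokenize(record):
    """Toy per-record task; replace with your own cleaning/processing function."""
    return record.lower().split()

def run_serial(records):
    """Process records one at a time on a single core."""
    return [tokenize(r) for r in records]

def run_parallel(records, workers=None):
    """Fan records out across worker processes (all cores by default)."""
    with Pool(workers or cpu_count()) as pool:
        return pool.map(tokenize, records)

if __name__ == "__main__":
    records = ["Nonprofit Studies Record %d" % i for i in range(100_000)]
    t0 = time.perf_counter()
    serial = run_serial(records)
    t1 = time.perf_counter()
    parallel = run_parallel(records)
    t2 = time.perf_counter()
    assert serial == parallel  # both strategies must agree on the results
    print(f"serial: {t1 - t0:.3f}s  parallel: {t2 - t1:.3f}s")
```

Note that for a task this cheap, inter-process overhead can outweigh the speedup; the benefit appears with genuinely expensive per-record work, which is also what makes all of htop's cores light up.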

Knowledge and skills practiced:

  • Using cloud computing platform;
  • Using command line terminal and Linux system;
  • Python programming: parallel computing.

Assignment 4: Algorithmic disambiguation (individual/group, 15% of course grade)

Entity disambiguation is a common task in data preprocessing. For example, the University of Texas at Austin can be written as “UT Austin,” “UT-Austin,” or even “UTA.” How can we recognize these records and give them a unique ID? This assignment practices that skill.

Tasks: The dataset provided to you was retrieved from Scopus, one of the largest bibliographic databases in the world. Each line is the record of a published paper in nonprofit studies. You are expected to:

  1. Generate a codebook for this dataset;
  2. Create criteria for disambiguating authors and affiliations;
  3. Explain why the criteria can generate valid results;
  4. Write a function implementing the criteria, run it on the records, and assign unique IDs to the entities;
  5. Verify accuracy: choose a random sample and manually check the false positive and false negative rates;
  6. Describe: 1) who the most productive authors in nonprofit studies are; 2) which institutions are the most productive; and 3) how you define “productive”;
  7. Document everything in detail in a Jupyter Notebook, and submit the notebook through Canvas.
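One possible approach to steps 2 and 4 is normalize-then-fuzzy-match using the standard library's difflib. This is a minimal sketch, not the required method: the stopword list, similarity threshold, and greedy clustering are all illustrative assumptions you would need to justify for your own criteria.

```python
import re
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, strip punctuation, and drop common filler words (example list)."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    stopwords = {"the", "of", "at"}  # illustrative; tune for your data
    return " ".join(t for t in tokens if t not in stopwords)

def similar(a, b, threshold=0.85):
    """True if two normalized names are close enough to merge."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

def assign_ids(names, threshold=0.85):
    """Greedy clustering: each name gets the ID of the first similar name seen."""
    ids, canon = {}, []
    for name in names:
        for i, seen in enumerate(canon):
            if similar(name, seen, threshold):
                ids[name] = i
                break
        else:
            ids[name] = len(canon)
            canon.append(name)
    return ids

# "UT Austin" and "UT-Austin" normalize to the same string and share an ID
ids = assign_ids(["UT Austin", "UT-Austin", "Harvard"])
```

Step 5's manual verification matters precisely because any single threshold will merge some distinct entities and miss some variants.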

Example studies using this dataset:

  • Ma, Ji, and Sara Konrath. 2018. “A Century of Nonprofit Studies: Scaling the Knowledge of the Field.” VOLUNTAS: International Journal of Voluntary and Nonprofit Organizations 29 (6): 1139–58.

Assignment 5: Network analysis (individual/group, 10% of course grade)

Review the dataset from Assignment 4 and prepare two sets of questions:

  • Descriptive. For example, who is the most connected scholar/institution?
  • Inferential. For example, are scholars from wealthier countries/institutions more likely to be “structural holes”?

Once approved by the instructor, these are the questions you will answer in Assignment 5. You are expected to submit a detailed report.
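For the descriptive questions, a co-authorship degree count ("most connected scholar") can be built from the Assignment 4 records with the standard library alone; a sketch, with invented author names beyond those in the cited paper. Packages such as networkx then supply centrality and constraint measures for the inferential questions.

```python
from collections import defaultdict
from itertools import combinations

def coauthor_degrees(papers):
    """papers: one list of author names per record.
    Returns each author's number of distinct co-authors (network degree)."""
    neighbors = defaultdict(set)
    for authors in papers:
        for a, b in combinations(set(authors), 2):  # every co-author pair on a paper
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {author: len(ns) for author, ns in neighbors.items()}

# Invented example records: "Smith" and "Lee" are hypothetical
papers = [["Ma", "Konrath"], ["Ma", "Smith"], ["Smith", "Konrath", "Lee"]]
degrees = coauthor_degrees(papers)  # e.g. "Ma" has two distinct co-authors
```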