Instructors: Dr. Thi Thanh Huyen Nguyen
Event type:
Interactive class
Displayed in timetable as:
Hours per week:
3
Credits:
6,0
Language of instruction:
English
Min. | Max. participants:
- | 37
Comments/contents:
PRE-REQUISITES
This course requires a basic understanding of Python, statistics, probability theory and the applied econometric techniques used in the social sciences.
NOTE: All students are required to go through the following links and materials to (1) prepare for the course and (2) check whether the course level is suitable for them.
Math: If you don’t have a solid background in calculus, linear algebra, and probability, read part 1 from this online book.
Python: Check this free course on Practical Python Programming.
Natural Language Processing: Check Chapters 1 to 3 of the NLTK book.
Learning objectives:
We are living in a rapidly digitizing world, with an ever-increasing availability of large-scale textual corpora in law, politics and economics. This rapid growth in data poses exciting challenges for social scientists who seek to understand the fabric and functioning of our societies beyond just numbers. Coupling the proliferation of legal and political corpora with the speedy growth of data science toolkits, we have at hand a powerful infrastructure to extract novel hidden insights about relevant institutional and human patterns in texts.
This course provides a comprehensive introduction to the basic theory and hands-on applications of text analysis and machine learning for social science in Python. The course begins with a quick introduction to the Python language and moves on to the challenge of representing texts as data. Next, it gives an overview of key techniques to clean texts, extract relevant information, and represent documents as vectors. These techniques include, but are not limited to, measuring document similarity, clustering documents based on topics, and visualization methods such as word clouds and spatial relation plots between documents. Students are also provided with sources of different text corpora, tips and techniques for querying a programming issue online, and self-study materials to deepen their understanding beyond the scope of the course.
Finally, we consider text-based prediction problems. For instance, given the evidence of a particular case, how will a judge decide on the sentence? Given recorded speeches and transcripts of politicians, how ideological is a politician? Such predictions are then incorporated into social science analysis. Students will investigate and implement the relevant machine learning tools for making these types of predictions, including regression, classification, and deep learning models. If time permits, we will also touch upon causal inference methods using texts, either as treatment or outcome in a given data context.
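As a rough illustration of what "representing documents as vectors" and "measuring document similarity" can look like, here is a minimal sketch using only the Python standard library. It is not taken from the course materials: the toy documents, the tiny stop-word list, and the helper names (tokenize, to_vector, cosine_similarity) are all assumptions made for this example.

```python
import math
import re
from collections import Counter

# Hypothetical toy documents, invented for illustration only.
DOCS = {
    "ruling_a": "The court dismissed the appeal and upheld the sentence.",
    "ruling_b": "The appeal was dismissed; the court upheld the original sentence.",
    "speech_c": "We will cut taxes and invest in schools.",
}

# Tiny illustrative stop-word list; real projects use a fuller one.
STOP_WORDS = {"the", "and", "was", "we", "will", "in", "a", "of"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def to_vector(text):
    """Represent a document as a bag-of-words count vector (a Counter)."""
    return Counter(tokenize(text))


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[term] * v[term] for term in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = {name: to_vector(text) for name, text in DOCS.items()}
sim_ab = cosine_similarity(vectors["ruling_a"], vectors["ruling_b"])
sim_ac = cosine_similarity(vectors["ruling_a"], vectors["speech_c"])
print(f"ruling_a vs ruling_b: {sim_ab:.2f}")
print(f"ruling_a vs speech_c: {sim_ac:.2f}")
```

In this sketch the two court rulings share most of their content words and so score much higher than the ruling and the political speech, which share none. Libraries such as NLTK or scikit-learn offer more complete tokenization and weighting schemes than this hand-rolled version.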
Didactic concept:
The course is organized as an interactive, in-person class focusing on hands-on applications of the text analysis tools taught in a specific week. For every weekly meeting in the first two-thirds of the course, the first hour is dedicated to a lecture on the theory and know-how of text tools, whereas the remaining time is interactive coding practice in Python to build up students' programming and text analysis skills.
During these hours, students can implement the text tools they learn during the lecture themselves, while exchanging their solutions and struggles with fellow students and the course instructor. The course also provides a week-by-week discussion forum on Slack, where students can form groups and discuss questions and solutions with one another.
Since in-class “learning by doing” is the fundamental learning block of this course, please do NOT register for the course if you intend to attend online.
Literature:
REFERENCE MATERIALS
The course will be mainly based on the weekly lecture slides, but the following books and code exercises can be used as reference along with the slide content.
· Natural Language Processing with Python, Third Edition, available at nltk.org/book.
· Aurélien Géron, Hands-On Machine Learning with Scikit-Learn & TensorFlow, O’Reilly 2017.
· GitHub repository of Jupyter notebooks for Géron’s book.
· Google Developers Text Classification Guide (this guide contains some practical tips and code examples for using text data).
· For Python syntax and programming, check the book Fluent Python (O'Reilly 2015).
· For research ideas and project design, check the book Bit by Bit: Social Research in the Digital Age (Matthew Salganik).
PROGRAMMING
Python is the best option for text data and machine learning, and it is used and developed by most data scientists in this field. The examples in the course will use Python. Additionally, when we move to regression exercises, Stata might be used in economic examples. To install and set up Python before the course, please use the following links:
Python Setup Instructions
Codecademy Online Python Course
Additional examination information:
There will be no exams. Course assessment includes the following components:
30%: Midterm report (& presentation) assignment
70%: Final group research project presentations
In addition, active participation in weekly class sessions is expected throughout the course.