It's estimated that 1 in 3 women worldwide face some form of intimate partner violence. Women and girls account for nearly 71% of trafficking victims. 650 million women alive today were married before the age of 18. Gender-based violence (GBV) is the reality for a sickeningly large proportion of our world. Mitigation efforts have been held back in part because we don't have comprehensive data on the issue. That's where social media comes in. What if social media could be the key to learning how GBV is discussed across demographics and communities?
I developed a model that rapidly classifies tweets for relevance to GBV and further divides relevant tweets into categories that indicate which form of violence is being discussed. I defined four classes that pertain to GBV: Physical Violence, Sexual Violence, Harmful Practices, and Other. I extracted data from the Twitter API based on class-specific search criteria and employed the Natural Language Toolkit for natural language processing on tweet text. I used the Naive Bayes algorithm to train the machine learning model. I went on to conduct an analysis of two feature sets, and constructed a confusion matrix to understand the model’s deficiencies.
I was able to pinpoint the countries, US states, and US cities where GBV is most frequently discussed on Twitter. I also identified the words and topics that come up most often. Overall, I was able to find data that could significantly advance our efforts to end GBV and allow us to draw important conclusions about the societal reaction to the issue.
Gender-based violence (GBV), including aggressive behavior, is a ubiquitous human rights violation that victimizes 1 in 3 women worldwide, but there still remains a dearth of viable data surrounding it. Social media generates a plethora of valuable data across users, locations, and topics, with its users spanning a multitude of demographics. Twitter is one of the most popular and frequently used social media platforms; fortunately, it also has an API that allows the public to easily access its data. In this project, I aim to make use of natural language processing and Twitter data to identify content related to gender-based violence and aggressive behavior, and answer the question: how can Twitter data be used to address the paucity of information about GBV and help reduce its prevalence? I intend to mine geographic, quantitative, and qualitative data about GBV and to develop a model that is able to classify tweets rapidly with at least 70% accuracy. I expect that because my processing is limited to tweets in English, English-speaking countries will turn up the most in my search. When it comes to cities and states, I think that GBV discussion frequency will be proportional to population. Issues like domestic violence and rape, which have garnered significant media attention, will likely be what most people are talking about. However, I might come across some surprises when I analyze the data. My findings will allow us to discover how GBV comes about and will equip us with the knowledge necessary to prevent it.
While the AI for GBV space is largely uncharted territory, three strands of research shaped my approach: natural language processing on tweet data, machine learning with the Naive Bayes classification algorithm, and the current state of knowledge about GBV. My project grew out of my interest in human rights worldwide, after I watched a few documentaries and perused some United Nations reports investigating violence against women. This helped me to fully comprehend the pervasiveness of GBV and motivated me to find a way to address the issue.
Once I decided to introduce Twitter data to the project, I probed into how data can be extracted from Twitter’s API, as well as what security and authentication mechanisms are in place. Furthermore, I learned about how individual tweet attributes, whether pertaining to the user posting it or the text in itself, can be obtained, processed, and appropriately formatted. With the support of the other literature I’d read through that had applied machine learning techniques to problems like cyberbullying, I worked out how to design a processing pipeline and filter the data I collected. It also gave me the understanding necessary to conduct natural language processing, specifically processes like lemmatization, removing plurals, and feature extraction, so that I could gather the main ideas of a tweet without any extraneous words that don’t contribute to its meaning.
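As a rough illustration of how individual tweet attributes can be pulled out and formatted, the sketch below flattens one tweet's JSON into a CSV row. It is not the project's actual code. The field names (`text`, `retweet_count`, `favorite_count`, `user.followers_count`, `user.friends_count`, `user.listed_count`) follow Twitter's v1.1 tweet object, while the sample tweet, the column names, and the helper names `flatten_tweet` and `tweets_to_csv` are hypothetical.

```python
import csv
import io
import json

# Field names follow Twitter's v1.1 tweet object; the values are made up.
SAMPLE_TWEET = json.dumps({
    "id_str": "1",
    "text": "Speaking up about domestic violence matters.",
    "lang": "en",
    "retweet_count": 3,
    "favorite_count": 7,
    "user": {
        "screen_name": "example_user",
        "location": "San Jose, CA",
        "followers_count": 120,
        "friends_count": 80,
        "listed_count": 2,
    },
})

# Hypothetical column layout for the flattened records.
COLUMNS = ["id", "text", "lang", "retweets", "likes",
           "user", "location", "followers", "following", "lists"]


def flatten_tweet(raw_json):
    """Turn one tweet's JSON string into a flat dict of selected attributes."""
    tweet = json.loads(raw_json)
    user = tweet.get("user", {})
    return {
        "id": tweet.get("id_str", ""),
        "text": tweet.get("text", ""),
        "lang": tweet.get("lang", ""),
        "retweets": tweet.get("retweet_count", 0),
        "likes": tweet.get("favorite_count", 0),
        "user": user.get("screen_name", ""),
        "location": user.get("location", ""),
        "followers": user.get("followers_count", 0),
        "following": user.get("friends_count", 0),
        "lists": user.get("listed_count", 0),
    }


def tweets_to_csv(raw_tweets):
    """Write flattened tweets to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    for raw in raw_tweets:
        writer.writerow(flatten_tweet(raw))
    return buffer.getvalue()


row = flatten_tweet(SAMPLE_TWEET)
csv_text = tweets_to_csv([SAMPLE_TWEET])
```

Keeping user attributes and text attributes in one flat record makes later filtering and labeling simpler, since each tweet becomes a single row.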
I also inquired into how machine learning can be used for feature engineering along with how Naive Bayes in particular computes the probability of individual features occurring. My research allowed me to identify how to separate labeled data and divide between training and testing data; additionally, I discovered confusion matrices and what kinds of insight they can give us into an algorithm’s performance.
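For reference, the textbook Naive Bayes decision rule can be written as follows. This is the standard form with add-one smoothing, not necessarily the exact variant used in this project. The notation is an assumption of this sketch: $c$ is a class such as Physical Violence, $f_1, \dots, f_n$ are the features observed in a tweet, $\operatorname{count}(c)$ is the number of training tweets labeled $c$, and $\operatorname{count}(f_i, c)$ is the number of those tweets that contain $f_i$.

```latex
\hat{c} \;=\; \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(f_i \mid c),
\qquad
P(f_i \mid c) \;\approx\; \frac{\operatorname{count}(f_i, c) + 1}{\operatorname{count}(c) + 2}
```

The product is what makes the method "naive": each feature contributes its own factor, as if the features were independent of one another given the class.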
I validated my research and gauged the impact of my project first by looking into GBV incidence statistics in the cities I researched. When I found that how much GBV was discussed in a given city appeared to track how common issues like rape and human trafficking were there, I realized that I was developing a potentially invaluable tool that would allow us to more effectively allocate funds, establish prevention programs, and craft powerful, data-driven policy. I also became cognizant of the fact that the work I was doing had implications even beyond the GBV space, and that my algorithm could be modified to assess other problems ranging from gun violence to voter suppression to cyberterrorism. The research I did both on my project and its scope will also benefit me in the future, should I choose to take this idea further or use similar methodologies for other topics.
I set up four classes (Physical Violence, Sexual Violence, Harmful Practices, and Other). Based on the search criteria I established for each class, I received only matching tweets. I accumulated 1,943,255 tweets posted by over 400,000 unique users. The data I collected was split into class-specific hourly chunks, with one file allocated to each hour during the 10-day streaming period. I filtered the tweets based on user (followers, following, lists) and text attributes (punctuation, likes, retweets).
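The filtering step described above could be sketched roughly as follows. The attribute categories (followers, following, lists, punctuation, likes, retweets) come from the text, but every threshold, the direction of each comparison, and the record layout are hypothetical placeholders chosen only to make the example run. They are not the values used in the project.

```python
import string

# Hypothetical thresholds: placeholders for illustration only.
MIN_FOLLOWERS = 10
MAX_FOLLOWING = 5000
MIN_LISTS = 0
MAX_PUNCT_RATIO = 0.3
MIN_ENGAGEMENT = 1  # likes + retweets


def punctuation_ratio(text):
    """Fraction of characters in the text that are ASCII punctuation."""
    if not text:
        return 1.0
    return sum(ch in string.punctuation for ch in text) / len(text)


def keep_tweet(tweet):
    """Return True if a flattened tweet record passes every filter."""
    return (
        tweet["followers"] >= MIN_FOLLOWERS
        and tweet["following"] <= MAX_FOLLOWING
        and tweet["lists"] >= MIN_LISTS
        and punctuation_ratio(tweet["text"]) <= MAX_PUNCT_RATIO
        and tweet["likes"] + tweet["retweets"] >= MIN_ENGAGEMENT
    )


# Made-up records: one that passes, one from a near-empty account,
# and one whose text is mostly punctuation.
SAMPLE = [
    {"text": "We need to talk about domestic violence.",
     "followers": 250, "following": 300, "lists": 4, "likes": 5, "retweets": 2},
    {"text": "We need to talk about domestic violence.",
     "followers": 2, "following": 300, "lists": 0, "likes": 5, "retweets": 2},
    {"text": "!!! ??? $$$ !!!",
     "followers": 250, "following": 300, "lists": 4, "likes": 5, "retweets": 2},
]

kept = [keep_tweet(t) for t in SAMPLE]
```

Expressing each criterion as a named constant keeps the filter easy to tune when the surviving sample turns out too large or too small.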
After the filtering process, I was left with 4,000 tweets, 3,200 (80%) of which served as training data and 800 (20%) of which served as testing data. I performed NLP on the tweet set, which entailed tokenization, lemmatization, and punctuation removal. All 4,000 tweets were subsequently labeled, and my algorithm used feature engineering to identify key words and trends that appeared in tweets labeled as belonging to each category. After generating a list of “features,” or frequently appearing words across GBV-relevant tweets, my algorithm classified the tweets in the testing set, and I compared the class it assigned to each tweet to the class I had given that tweet during the labeling stage. I had two sets of features: features I had generated from the text I was left with after NLP, and features extracted from the initial search criteria. Each set contained unigrams, which are one-word features, and bigrams, which are two-word features. All one-word terms that appeared more than four times across all tweets became unigrams, and all recurring pairs of unigrams became bigrams.
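The unigram and bigram rule described above can be sketched in plain Python. This is a simplified stand-in rather than the project's NLTK pipeline: it lowercases and strips punctuation but skips lemmatization, and it assumes that a "recurring pair" means two unigram features appearing next to each other at least twice. The names `tokenize`, `extract_features`, and `min_bigram_count` are hypothetical.

```python
import string
from collections import Counter

MIN_COUNT = 5  # "more than four times", as stated in the text


def tokenize(text):
    """Lowercase the text, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def extract_features(tweets, min_count=MIN_COUNT, min_bigram_count=2):
    """Return (unigrams, bigrams) as sets, following the thresholds above."""
    token_lists = [tokenize(t) for t in tweets]

    # Unigrams: single terms seen at least min_count times across all tweets.
    unigram_counts = Counter(tok for toks in token_lists for tok in toks)
    unigrams = {w for w, c in unigram_counts.items() if c >= min_count}

    # Bigrams: adjacent pairs of unigram features that recur (assumption).
    bigram_counts = Counter(
        (a, b)
        for toks in token_lists
        for a, b in zip(toks, toks[1:])
        if a in unigrams and b in unigrams
    )
    bigrams = {bg for bg, c in bigram_counts.items() if c >= min_bigram_count}
    return unigrams, bigrams


# Made-up example tweets.
TWEETS = ["Child marriage must end."] * 5 + ["End it"]
unigrams, bigrams = extract_features(TWEETS)
```

In this toy input, "it" appears only once and so never becomes a feature, which also keeps the pair ("end", "it") out of the bigram set.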
I generated 1,665 NLP-based unigrams and 1,160 NLP-based bigrams, as well as 44 search criteria-based unigrams and 41 search criteria-based bigrams. After comparing algorithms, I chose to work with the Naive Bayes classification algorithm because it is widely used in the domain of NLP. Naive Bayes treats each feature as independent of the others given the class: it estimates the likelihood of each feature being present in a tweet of a given class without regard to which other features are present. For both my NLP-based and search criteria-based models, I ran two trials to gauge my algorithm’s consistency and check for wide disparities between runs. I generated metrics for both models in order to better assess their performance. I identified my algorithm’s overall accuracy and its F-score, precision, and recall for each class to see if my model fared differently for one class than another. Additionally, to understand how frequently it had difficulty differentiating between classes, I constructed a confusion matrix. The confusion matrix points out the deficiencies in my model: if, for example, my model frequently misclassified Physical Violence tweets as Sexual Violence, the confusion matrix would show that misclassification rate for both trials.
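To make the classification step concrete, here is a minimal from-scratch Naive Bayes sketch over presence/absence features. It is not the project's implementation. The add-one smoothing, the tiny training set, the feature list, and the function names `train` and `classify` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled, features):
    """Fit a presence/absence Naive Bayes model.

    labeled: list of (set_of_tokens, label) pairs.
    Returns {label: (log_prior, {feature: P(feature present | label)})}.
    """
    class_counts = Counter(label for _, label in labeled)
    present = defaultdict(Counter)
    for tokens, label in labeled:
        for f in features:
            if f in tokens:
                present[label][f] += 1

    total = len(labeled)
    model = {}
    for label, n in class_counts.items():
        log_prior = math.log(n / total)
        # Add-one smoothing so no probability is exactly 0 or 1.
        probs = {f: (present[label][f] + 1) / (n + 2) for f in features}
        model[label] = (log_prior, probs)
    return model


def classify(model, tokens):
    """Return the label with the highest log-probability for the token set."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, probs) in model.items():
        score = log_prior
        # Each feature contributes independently of the others.
        for f, p in probs.items():
            score += math.log(p if f in tokens else 1 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data and feature list.
TRAIN = [
    ({"he", "hit", "her"}, "Physical Violence"),
    ({"beaten", "by", "partner"}, "Physical Violence"),
    ({"child", "marriage", "ban"}, "Harmful Practices"),
    ({"end", "child", "marriage"}, "Harmful Practices"),
]
FEATURES = ["hit", "beaten", "child", "marriage"]

model = train(TRAIN, FEATURES)
label_a = classify(model, {"stop", "child", "marriage", "now"})
label_b = classify(model, {"she", "was", "hit"})
```

Summing log-probabilities instead of multiplying raw probabilities avoids numerical underflow once the feature list grows into the thousands.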
The control group in my project was my labeled data. Of the 4,000 labeled tweets, 570 were labeled Physical Violence, 491 were labeled Sexual Violence, 739 were labeled Harmful Practices, 354 were labeled Other, and the remaining 1,846 were labeled as unrelated to GBV. My variables in this project were the unigrams and bigrams, or features, extracted from both the initial search criteria and the results of natural language processing.
Over 10 days of streaming tweets, I amassed 396 files, with nearly 2 million tweets in total posted by about 400,000 users. 231,376 tweets were classified as Physical Violence, 1,380,358 tweets were classified as Sexual Violence, 45,716 were classified as Harmful Practices, and 285,505 were classified as Other. Nearly every country was represented across the four classes along with over 60 languages. When I filtered the tweets, I was left with 4,000 tweets, with 1,000 in each class.
These 4,000 filtered tweets then went on to the labeling stage, where 570 were labeled Physical Violence, 491 were labeled Sexual Violence, 739 were labeled Harmful Practices, and 354 were labeled Other. I was able to collect data as to which countries, US states, and US cities most frequently discussed GBV; which of the four initial classes were most widely discussed; which topics were most frequently discussed; and how my model fared in regard to both feature sets. I found that, in decreasing order, the countries that posted the largest amounts of GBV-relevant tweets were the US, the UK, Canada, India, Australia, Ireland, South Africa, and Nigeria. I also found that, in decreasing order, the US states that posted the largest amounts of GBV-relevant tweets were California, Texas, Florida, Georgia, New York, and Ohio. Finally, in decreasing order, the US cities that posted the largest amounts of GBV-relevant tweets were Los Angeles, New York City, Washington, D.C., Houston, and Philadelphia. From my search criteria feature set, I extracted 44 unigrams and 41 bigrams. From my NLP-based feature set, I extracted 1,665 unigrams and 1,160 bigrams. Sexual violence was the most frequently discussed, and hot topics across all tweets included domestic violence, child marriage, rape, sexism in the media, harassment, genital mutilation, and human trafficking, as evidenced by the extracted unigrams and bigrams. This was consistent for both feature sets.
The model was most accurate when trained on the search criteria-based features, suggesting that feature reduction boosts model performance. For the NLP-based features, Harmful Practices had the highest precision and Other had the lowest. For the search criteria-based features, Harmful Practices had the highest precision and Physical Violence had the lowest. F-score and recall statistics were similar. The accuracy ranged between 82.25% and 85.38%, depending on the feature set. My confusion matrix shows that my algorithm had the most difficulty differentiating between Physical Violence and Sexual Violence. It also had difficulties when it came to differentiating between Other and most other classes. Overall, I was able to develop a highly accurate tweet classification model for GBV relevance and I unearthed data about where and how we discuss GBV. A model like this could allow us to establish a correlation between the frequency of tweets related to GBV being posted in a specific area and GBV incidence rates in the same area.
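The accuracy, precision, recall, and F-score figures and the confusion matrix discussed above all derive from pairs of (labeled class, predicted class). The sketch below shows one standard way to compute them. The labels and predictions are made-up toy data, not the project's results, and the function names are hypothetical.

```python
from collections import Counter


def confusion_matrix(gold, predicted):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(gold, predicted))


def accuracy(gold, predicted):
    """Share of tweets whose predicted class matches the labeled class."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def per_class_metrics(gold, predicted):
    """Return {label: (precision, recall, f_score)} for every label seen."""
    matrix = confusion_matrix(gold, predicted)
    labels = sorted(set(gold) | set(predicted))
    report = {}
    for label in labels:
        tp = matrix[(label, label)]
        fp = sum(matrix[(g, label)] for g in labels if g != label)
        fn = sum(matrix[(label, p)] for p in labels if p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        report[label] = (precision, recall, f_score)
    return report


# Toy data for illustration only.
GOLD = ["Physical", "Physical", "Sexual", "Sexual", "Other", "Other"]
PRED = ["Physical", "Sexual", "Sexual", "Sexual", "Other", "Physical"]

matrix = confusion_matrix(GOLD, PRED)
report = per_class_metrics(GOLD, PRED)
overall = accuracy(GOLD, PRED)
```

Reading the off-diagonal cells of the matrix, such as the count for ("Physical", "Sexual"), is what reveals which pairs of classes the model confuses most often.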
In conclusion, I was able to develop a model that can label a tweet as relevant to GBV and classify it among four categories with up to 85.38% accuracy. I was able to uncover information about which geographical regions most actively discuss gender-based violence and what specific topics were tweeted about the most. Finally, I was able to analyze the performance of two different feature sets, and I concluded that the search criteria-based feature set gave me better results.
My project results aligned with my expected outcome for the most part. I surpassed the 70% accuracy goal significantly and was able to find the data that I sought. The geographic data largely corresponded with known information about population and language, with some surprises. For example, I didn't expect Ireland or South Africa to rank on the list, and I would've expected more GBV discussion from Nigeria and European countries besides the UK. I also would've imagined more tweets coming from Chicago, as it didn't rank on the list despite having a large population, and I was taken aback by the fact that more discussion originated in Los Angeles than New York City. I was also shocked by how many tweets were posted regarding sexual violence, and how little attention problems like child marriage and genital mutilation, though widespread, got. I didn't initially think that the features extracted from the search criteria would lead to more accurate results than the features generated from natural language processing, but the analysis that I conducted has overall refined my understanding of how to maximize the accuracy of classification algorithms and leave the narrowest possible margin of error.
This project allowed me to learn more about gender-based violence and the applications of social media data. Previously, I had little to no understanding of the workings of machine learning and natural language processing. Even though I was already well-versed in Python programming, working on the algorithm helped me improve. My project enabled me to leverage computer science to address human rights issues while bringing all these skills together. I realized through presenting my project that a similar classification model can be used to combat other problems, not just gender-based violence. When it comes to potential modifications I could make, having a wider sample set and including crowdsourcing for labeling would decrease bias. The confusion matrix shows that many Other tweets were incorrectly classified, which calls for better NLP-based handling of Other tweets. I noticed that some tweets were misclassified because my model was not able to distinguish between different senses of the same word. For example, a tweet about an offensive back’s playing during a recent football game may get classified as an Other tweet because of the word “offensive.” To analyze how GBV is discussed across cultures, I could also expand my language processing beyond English.
A few years ago, on a trip to visit family in India, I met a girl named Rajeshwari. She was no older than eight or nine but had already experienced a world of horrors. Her older sister was seventeen and married, with two kids and an eighth-grade education. Her abusive father had abandoned them following Rajeshwari's birth to pursue work elsewhere and hadn't made contact since. Her mother, my cousins' nanny, had faced a host of maternal complications that triggered numerous miscarriages, and yet had never seen the interior of a hospital. But heart-wrenching as these stories sound, they are by no means unique — they're a reality for millions of people around the world plagued by issues like child marriage, human trafficking, and sexual violence.
I'm Sneha Revanur. I'm a high school student from San Jose who believes that computing is a superpower that, when wielded, has the capacity to produce legitimate social change. I'm on a mission to discover how AI can help us prevent Rajeshwari's life from becoming the norm. Outside of STEM, I'm interested in law, politics, writing, and language learning. I intend to pursue coursework in computer science and history in college, and see myself on the entrepreneurial side of tech as the CEO of a company leveraging AI for social good. My inspirations include Katherine Johnson, Grace Hopper, Reshma Saujani, Joy Buolamwini, Alexandria Ocasio-Cortez, and Ruth Bader Ginsburg. I’d love to continue exploring STEM to see where it takes me.
My project was all on my computer, and there aren't any health or safety guidelines that I would've had to consider before starting. I didn't work with any external labs; everything was done at home.
Bonzanini, M., Mining Twitter Data with Python. March 2, 2015.
Enthought, 5 Simple Steps to Create a Real-Time Twitter Feed in Excel using Python and PyXLL. June 23, 2016.
Python Central, Introduction to Tweepy, Twitter for Python. January 23, 2013.
Kumar, Shamanth; Morstatter, Fred; Liu, Huan, Twitter Data Analytics. August 19, 2013.
United Nations Population Fund, Gender-based violence. Not dated.
United States Agency for International Development, Gender-based violence. Not dated.
My school science teacher supported me over the course of my algorithm’s development. We discussed my progress throughout the project. My Python instructor, from whom I have learned programming for more than a year outside of school, allowed me to solidify my foundation in the language before I began this technical project. My family members helped me label tweets; they also helped me with my presentation and board. Before beginning my work, I read myriad research papers relating to the application of similar methodologies to fight gun violence and cyberbullying. I also read articles about accessing and extracting data from Twitter’s API, authentication mechanisms, JSON to CSV data conversion, and CSV-specific processing toolkits. Previous work within computational humanitarianism and computational linguistics allowed me to develop my initial ideas for my algorithm and procedure. My project was completed entirely at home using the Python programming language, the Natural Language Toolkit for natural language processing, the Naive Bayes classification algorithm, and the Tweepy module in Python. I didn't work at any external facilities or use any special equipment.