Response to “The Silent Revolution: Free Speech, Censorship, and the Campus Dilemma in America (Part 2)” by Joshua Rauh and Gregory Kearney

December 8, 2023

Dear Joshua and Gregory,

Thank you for taking the time to comment on our paper “Anonymity and Identity Online”. However, I am a bit confused by some of your comments. I believe you may have read an old version of our paper, because several of the points you raised in your Substack post of December 1, 2023, are explicitly addressed in the version of our paper (available here) that I emailed you on October 26, 2023. To correct the record, I respond to these points below.

First, in Appendix A we describe the provenance of our data in detail. Our study uses only publicly available pages on EJMR, the same pages viewed by other EJMR users and indexed by search engines such as Google, Yandex, Baidu, Bing, and Archive.org. At no point did we access any non-public pages, hidden URLs, or APIs. Every page of EJMR used in our study was reachable through a chain of links from the EJMR homepage, and access to every page was permitted by EJMR's robots.txt; the site has neither a EULA nor terms of service. Furthermore, every single page from EJMR used in our study carried advertisements. In other words, everything in our study was visible to us as ordinary consumers of ordinary EJMR content designed for public consumption. What we present in the paper is a statistical analysis of that ordinary, publicly available content, in particular the usernames shown on EJMR until May 2023. We look at the same EJMR content as everybody else, but our methods allow us to assign probabilities over IP addresses for posts. In short, our study is very plainly not a “hack”; it is just like any other economics paper that uses crawled web data. I would appreciate it if you would clarify this on your Substack, because many people will not bother to read our paper and could be misinformed.
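The kind of robots.txt compliance described above can be checked mechanically with Python's standard-library parser. This is a minimal sketch, not the paper's actual crawling code; the robots.txt content and the "ResearchBot" agent name below are illustrative assumptions, not EJMR's real file.

```python
import urllib.robotparser

# Illustrative robots.txt (assumed, not EJMR's actual file):
# a file whose Disallow rule is empty permits crawling of all pages.
ROBOTS_TXT = """\
User-agent: *
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A polite crawler checks each URL against the rules before fetching it.
print(rp.can_fetch("ResearchBot", "https://www.econjobrumors.com/topic/example"))  # True
```

A crawler restricted to following links from the homepage and gated by a check like `can_fetch` never touches hidden URLs or disallowed paths.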

Second, our study does not doxx anyone and does not reveal any personally identifying information. That is not the purpose of the paper and there is no content of that kind in the paper. Instead, the paper describes aggregated features of posting behavior on EJMR and only reveals a single IP address, which is provided specifically in response to a public million-dollar prize offered by EJMR's owner for identifying the IP address of one particular post.

Third, our study received extremely careful IRB review at multiple junctures. In fact, I am certain that our project received more IRB scrutiny and staff time than 99% of economics studies this year. I am confused and disheartened by your statement that we “misled the IRB”. I do not know how to respond other than to say that this is a completely false allegation.

Fourth, you raise some concerns about the classifiers we use. Specifically, we used three classifiers:

Each of these classifiers has false positives and false negatives. The misclassifications you found are real and disappointing to us, but these are the best models available; they are literally best-in-class. The Toxigen model in particular, which we use for toxicity classification, was recently used for the same purpose in Meta's Llama 2 large language model, Meta's equivalent of ChatGPT (see page 69 of https://arxiv.org/abs/2307.09288). In light of Meta's choice, we feel comfortable using Toxigen. Furthermore, in our paper we compare EJMR to Reddit. We have no reason to think that the classifiers behave inconsistently across those datasets, so we expect misclassifications to be similarly distributed across the two.
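The point that consistent classifier error still leaves a cross-corpus comparison informative can be illustrated with a back-of-envelope calculation. All rates below are invented for illustration and are not estimates from the paper.

```python
# Hypothetical true toxicity rates in two corpora (illustrative numbers only).
true_rate_a = 0.20  # corpus A
true_rate_b = 0.05  # corpus B

# Hypothetical classifier error rates, identical across both corpora.
false_negative = 0.10  # share of toxic posts the classifier misses
false_positive = 0.02  # share of clean posts the classifier flags

def measured_rate(true_rate):
    """Expected share of posts flagged as toxic under the noisy classifier."""
    return true_rate * (1 - false_negative) + (1 - true_rate) * false_positive

# The measured gap shrinks, but the ordering of the corpora is preserved.
print(measured_rate(true_rate_a))  # ≈ 0.196
print(measured_rate(true_rate_b))  # ≈ 0.064
```

Because the same error rates apply to both corpora, misclassification attenuates the measured difference rather than reversing it.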

Fifth, we do not cherry-pick particular subreddits but instead compare EJMR to the 1,000 most popular subreddits and provide a systematic and coherent justification for our sample. In contrast, the other empirical analysis that you cite does use a cherry-picked set of subreddits. Our analysis uses both weighted and unweighted comparisons of post counts to show that EJMR is substantially more misogynistic and toxic than Reddit.

Sixth, you wrote that it is “unclear to this point whether the authors are potentially double, triple, quadruple, etc. counting ‘toxic’ phrases.” This is not true. On page 21 of our manuscript (Section 2.5) we clearly state that we “removed any quoted blocks of text belonging to other posts to avoid misattribution of the content of posts.” There is no double counting of content, toxic or otherwise.
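The quote-removal step can be sketched in a few lines. The `<blockquote>` markup below is an assumption for illustration; the actual EJMR markup and the paper's implementation may differ.

```python
import re

# Hypothetical raw post HTML in which a quoted earlier post appears
# inside <blockquote> tags (assumed markup, not necessarily EJMR's).
raw_post = (
    "<blockquote>toxic phrase quoted from another post</blockquote>"
    "my own reply text"
)

# Strip quoted blocks before classification, so quoted content is
# attributed only to the post that originally wrote it.
own_text = re.sub(r"<blockquote>.*?</blockquote>", "", raw_post, flags=re.DOTALL)

print(own_text)  # my own reply text
```

After a step like this, a toxic phrase is counted once, in the post where it first appeared, no matter how often later posts quote it.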

Finally, your Substack post contains significant speculation about what one could do with our data, but says little about what our paper actually does.

Florian Ederer
Allen and Kelli Questrom Professor in Markets, Public Policy & Law
Boston University