Part One: Sources of bias when gathering open source information
In the first of this three-part guide examining open source information gathering, we look at the potential for information bias to influence an investigation and the importance of mitigating such biases when designing an effective research methodology
While not usually given as much attention as techniques for verifying open source content, the initial stage of identifying and gathering content, which the Berkeley Protocol on Digital Open Source Investigations refers to as ‘online inquiry’, is where it all starts. This process more often than not determines the material that the researcher will work with and, ultimately, the conclusions that will be drawn. It is also a stage when bias can significantly impact an investigation, stemming from the researcher’s approach to gathering information and the tools and platforms they make use of. As Yvonne McDermott, Alexa Koenig, and Daragh Murray note in their recent article, “while open source information has a clearly democratizing potential, there is a risk that a rush towards greater adoption of open source research methods in investigations may inadvertently silence some of the most marginalized populations,” and that “digital open source information can be as vulnerable to subjectivity and bias as any other form of evidence.”
Bias in online inquiry
Whether searching for open source content relevant to a specific investigation or monitoring for potential human rights violations in real-time, the process of online inquiry involves experimenting with different search terms, search engines, and other digital tools and resources to identify a thorough collection of relevant information. However, given the immense volume of content available through open sources and the often limited time that can be allocated to an investigation, an exhaustive inquiry process is generally not possible. The main challenge for researchers is thus to strike the right balance between maximizing the amount of relevant content returned by their searches and minimizing irrelevant results. Because of this, when designing and conducting such online inquiries, researchers must carefully consider what information is most effectively captured by their methodology, as well as what might be overlooked. This includes an awareness of biases in the researchers themselves and in the tools used to carry out searching and monitoring. McDermott et al. refer to these as cognitive and technical biases respectively.
The content identified during open source discovery is heavily determined by the researcher’s approach and decision-making, which is inevitably influenced by biases. For example, the search terms selected, the online platforms explored, and the types of events or violations focused on during the evidence-gathering process will likely be heavily shaped by the researcher’s knowledge of the relevant language(s) and the context surrounding the subject being investigated. If researchers are searching in a language other than the primary language used in a particular location, their results will leave out large amounts of available material, most likely content shared by those closest to the incident. Online translation tools may be useful for open source researchers confronted by these challenges; however, such tools are no substitute for skilled human translation and local knowledge. An overreliance on them will likely introduce additional technical bias (discussed below), as such tools often do not produce translations that replicate what a native speaker might say and do not effectively account for local terminology. Even where the researcher knows the relevant language(s) but not the research context, certain slang terms or coded language may not be understood or accounted for when developing search terms. This can result in certain types of human rights violations or crimes being underrepresented or entirely omitted from the information gathered. For example, according to Alexa Koenig and Ulic Egan in their forthcoming article “Hiding in Plain Site: Using Online Open Source Information to Investigate Sexual Violence and Gender-Based Crimes”, despite common misconceptions, content related to sexual and gender-based violence is in fact shared online, but it is often described using coded language that researchers are generally unfamiliar with and thus do not look for.
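One practical mitigation is to make term selection explicit and reviewable: expand each seed query systematically with local-language and coded-term variants rather than relying on a single phrasing. A minimal sketch in Python, assuming a hand-maintained glossary (the terms and translations below are purely illustrative, not a real research vocabulary):

```python
# Sketch: expanding seed search terms with local-language and slang variants.
# The glossary entries here are illustrative placeholders; in practice this
# would be compiled with native speakers familiar with the research context.
from itertools import product

GLOSSARY = {
    "airstrike": ["airstrike", "غارة جوية", "bombardment"],
    "hospital": ["hospital", "مستشفى", "clinic"],
}

def expand_query(terms):
    """Return every combination of known variants for the given seed terms."""
    variant_lists = [GLOSSARY.get(t, [t]) for t in terms]  # fall back to the term itself
    return [" ".join(combo) for combo in product(*variant_lists)]

queries = expand_query(["airstrike", "hospital"])  # 3 x 3 = 9 query strings
```

A glossary like this would need to be updated continually as new coded terms emerge, but even a simple version forces the researcher to document which phrasings were and were not searched, making gaps in coverage visible rather than invisible.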
Cognitive bias may also result from the researcher’s own preconceptions or opinions about a research topic. For example, researchers may already have adopted a narrative of the events that they are investigating, leading to expectations about what types of violations they will find evidence of and what content will be most relevant, assumptions about victim/perpetrator roles, or otherwise shaping the online inquiry around predetermined hypotheses. This can, according to McDermott, Koenig and Murray, lead to confirmation bias, with certain content being dismissed or overlooked in favour of the information that best supports or reinforces these initial understandings. Such biases may be particularly relevant when researching highly politicized events, where partisan media coverage or individual opinion may lead the investigator to ‘take sides’ (either consciously or unconsciously), which will influence how they interpret and identify content while conducting online inquiries.
Parallel to the researcher’s own cognitive bias, open source information gathering may also be influenced by the digital media landscape in which the investigation is being conducted. This includes who has access to technology, inequitable levels of digital connectivity, and the fact that certain events, such as airstrikes or police violence, are more likely to be visible than others. Similarly, visually compelling evidence, particularly content depicting egregious abuses, is often amplified by social media users. Incidents that ‘go viral’ may drown out other content, making that content harder for researchers to identify and creating a risk that viral incidents receive a disproportionate amount of attention. Likewise, because certain types of violations are more visible, and thus more readily documented, such events may be disproportionately represented in open source evidence gathering, producing a situation where the information collected does not capture the full scope of violations that have actually taken place (McDermott et al. identify this as a form of selection bias).
Other forms of bias affecting the sourcing of online information may be ‘built in’ to the very tools and platforms that open source researchers rely on. Search engines such as Google and Bing are not ‘neutral’ – the results returned by the search algorithm, and the priority given to certain results, are influenced by factors such as the researcher’s location, search history, the popularity of content, and the time when the content was created. This ‘algorithmic bias’ results in search engines amplifying certain sources and voices over others, particularly those with heavy traffic or those that can pay for prioritization. This makes it more difficult for open source researchers to identify the content most relevant to their investigation and, in turn, can result in certain information being overlooked. While getting closer to a neutral search is possible and is often an effective strategy for researchers to implement when searching for content online (covered in part two of this series), the influence of algorithmic bias when using search engines can never be fully eliminated – search algorithms will always play a role in the results researchers see.
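One way researchers approximate a more neutral search is to pin down the variables a search engine would otherwise infer about them. The sketch below builds a Google search URL using the long-documented `pws=0` (request no personalized results), `gl` (country), and `hl` (interface language) query parameters; whether Google still fully honors `pws=0` is an assumption worth verifying, and at best this reduces, rather than removes, algorithmic bias:

```python
# Sketch: constructing a search URL that fixes location and language context
# instead of letting the engine infer them. pws=0, gl, and hl are widely
# documented Google parameters; their current behavior should be verified.
from urllib.parse import urlencode

def neutralized_search_url(query, country="us", language="en"):
    """Build a Google search URL with personalization-reducing parameters."""
    params = {
        "q": query,
        "pws": "0",     # request non-personalized results
        "gl": country,  # fix the geographic market
        "hl": language, # fix the interface language
    }
    return "https://www.google.com/search?" + urlencode(params)

url = neutralized_search_url("protest video", country="ke", language="sw")
```

Repeating the same query with different `gl`/`hl` pairs is also a cheap way to surface how much the result set shifts with inferred context, which itself makes the engine’s bias visible.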
Technical bias is also present in the many social media platforms that open source researchers rely on. Each platform sorts and serves content in a way that is heavily determined by algorithms (particularly in the case of YouTube or TikTok), and each offers varying degrees of ‘searchability’ for researchers to identify content based on keywords, location, publication date, and so on. For example, Facebook has very limited search functionality, making it much more difficult for open source researchers to identify relevant content, particularly when Facebook is the predominant social media platform in the context they are working in. Additionally, an increasing amount of online content is shared within semi-closed networks such as WhatsApp chats, Telegram channels, or private Facebook groups. Information shared on these networks may eventually emerge on ‘open’ platforms where the researcher can more easily identify it; however, particularly in countries or contexts where semi-closed networks are more widely used, open source researchers must assume that large amounts of potentially relevant content are confined to these spaces. While researchers may seek to gain access to these networks, doing so can present substantial ethical, legal, and security considerations.
Content uploaded to social media is also subjected to ever-increasing moderation, with each platform developing its own methods and criteria for removing information and users, often relying heavily on algorithms. Content documenting certain violations, particularly those involving graphic injuries or violence, as well as content related to particular groups and ideologies, is more likely to be flagged by content moderators and removed before researchers are able to identify and preserve it.
Social media platforms introduce additional biases during the information-gathering process, largely because they are designed for communication with peers rather than as repositories for human rights documentation. For one, nearly all major social media platforms remove potentially valuable metadata from uploaded images and videos. This makes it more challenging for researchers to filter content by location, for example, when searching for open source information, particularly as platform-specific options for users to ‘geotag’ their posts are rarely used. Instead, researchers must rely on the uploader’s own captioning or tagging of a post with location information, what Scott Edwards refers to as “self-structured tagging,” which is often highly variable. For example, one social media user may include the exact town or city where an event took place, making it much more likely that a researcher will identify this post in their online searches, while another may provide only general location information, leaving such content undiscovered.
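The metadata stripping described above is straightforward to verify: a JPEG re-downloaded from most social media platforms will no longer contain an EXIF (APP1) segment, and therefore no embedded GPS coordinates. A stdlib-only Python sketch of that check:

```python
# Sketch: a minimal check for whether a JPEG byte stream still carries an
# EXIF (APP1) segment. Social media platforms typically strip this segment
# on upload, which is why location filtering must fall back on captions.

EXIF_HEADER = b"Exif\x00\x00"

def jpeg_has_exif(data: bytes) -> bool:
    """True if the JPEG data contains an APP1 segment with the EXIF header."""
    # The APP1 marker is 0xFFE1, followed by a 2-byte length and "Exif\0\0".
    i = data.find(b"\xff\xe1")
    while i != -1:
        if data[i + 4 : i + 10] == EXIF_HEADER:
            return True
        i = data.find(b"\xff\xe1", i + 1)
    return False
```

Running this against an original photo and its re-downloaded copy will typically show the EXIF segment present in the first and absent in the second; confirming the absence first-hand is useful before building a methodology that assumes location metadata is available.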
Also, since social media platforms are mostly used for everyday communication by members of the public, the language used by uploaders is often colloquial and may be influenced by their beliefs or attitudes relevant to the subject of the content. Uploaders often use first-person terminology and expressive language characteristic of someone who witnessed a notable event, but not necessarily those terms that human rights researchers are most likely to use when conducting searches. Likewise, social media users may seek to advance a particular narrative or agenda when sharing content. As Edwards writes, “the same piece of video shared in the open by two people could be characterized by one as ‘indigenous protestors massacred by military’ and by another as ‘police repel foreign terrorist attack’.” Part three of this series will discuss how understanding the perspective of someone experiencing such events and the language they might use to describe them can help investigators identify relevant content. However, the lack of a consistent system of tagging and classification on social media platforms means that researchers must adapt their research methodologies to this media landscape in order to identify such content.
While researchers may not be able to completely eliminate these challenges related to the process of online inquiry, an increased mindfulness of potential sources of bias in digital open source research, together with effective strategies to recognise and account for such biases, can greatly decrease their impact. These strategies are essential components of a successful methodology for open source data collection or monitoring, serving both to limit the influence of bias and, in doing so, to produce more effective and thorough information gathering. Parts two and three of this blog series will discuss useful tips and techniques for improving the effectiveness of online searches while mitigating the impact of biases when gathering open source information.