In this World of Smart Data blog, Deborah Wiltshire from the GESIS Leibniz Institute for the Social Sciences in Germany considers the challenges and benefits of digital behavioural data for a small Trusted Research Environment.
Trusted Research Environments (TREs) are an important part of smart data research. They are secure computing environments designed to hold sensitive data. Researchers can access and analyse data held in a Trusted Research Environment while maintaining strong privacy protections.
Here at GESIS, we are creating digital behavioural data products. These are tools and services built using data about what people do online. We want to archive these products and share them with the research community through our Trusted Research Environment. So we’re running a pilot in which two data products will be created. The first consists of web-tracking data linked with survey data. The second is made of social media data.
The challenges of using digital behavioural data in TREs
There are many challenges with these types of data. For example, some significant legal considerations must be addressed, starting with the legal basis for data sharing. ‘Informed consent’ is the preferred legal basis used in Germany. With web-tracking data collected from social survey participants, we can assume that research participants who consent to their web browsing behaviour being tracked were properly informed. The picture is less clear with other data, such as scraped X (formerly Twitter) data. There is much deliberation over the platform’s changing terms and conditions and what they permit.
These kinds of data also come with increased potential for disclosure, so our team plans to create two versions of each data product: one that will be fully anonymised and another that will be pseudonymised. These pseudonymised datasets will be shared through the Secure Data Center, our Trusted Research Environment that currently offers on-site access to sensitive social survey data.
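To illustrate what pseudonymisation can look like in practice, here is a minimal sketch of one common approach: replacing direct identifiers with a keyed hash. This is not a description of the method used at GESIS. The field names, identifiers, key and the `pseudonymise` helper are all hypothetical and chosen only for illustration.

```python
import hashlib
import hmac


def pseudonymise(participant_id: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for participant_id using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, participant_id.encode("utf-8"), hashlib.sha256)
    # Truncated for readability; a real deployment might keep the full digest.
    return digest.hexdigest()[:16]


# Hypothetical web-tracking records; identifiers and fields are invented.
records = [
    {"participant_id": "P-0001", "domain": "example.org"},
    {"participant_id": "P-0002", "domain": "example.com"},
    {"participant_id": "P-0001", "domain": "example.net"},
]

# Assumption: the key is stored separately from the shared dataset.
key = b"replace-with-a-secret-held-outside-the-dataset"

pseudonymised = [
    {"pid": pseudonymise(r["participant_id"], key), "domain": r["domain"]}
    for r in records
]
```

The same participant always maps to the same pseudonym, so records can still be linked within the dataset, but the original identifier cannot be recovered without the key. Pseudonymisation of this kind does not by itself remove disclosure risk, because the behavioural content (here, the visited domains) may still be identifying.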
While digital behavioural data is not new to the research community, it is still relatively new to Trusted Research Environments and is completely new to the Secure Data Center. As a small team with limited resources, adapting to digital behavioural data is presenting us with new challenges. We’re trying to work out how resolving these can bring us new opportunities for developing our service. These challenges and opportunities can be divided into three areas – technical, legal and ethical, and organisational.
Technical challenges
The key challenge for any Trusted Research Environment is building an infrastructure that balances security with meeting the needs of researchers. It must be secure against potential, unknown external forces and must prevent misuse. It must also give researchers the computational power and software they need, coupled with ease of use. Some consideration of future-proofing this infrastructure is also wise.
It’s hard to find this balance before we know the exact nature of the datasets and the exact needs of researchers. But adapting our existing technical and governance infrastructures will take time, so we cannot wait until everything is known. Instead, we are turning to other Trusted Research Environments and to research communities to try to anticipate what is needed.
We have gained insight into what researchers are likely to need through discussions with teams at organisations like SOMAR and Smart Data Research UK, and with researchers experienced in working with these data types. Through these discussions, we can already see that our existing infrastructure will not meet researchers’ needs. Our current system is built for social survey datasets, which are smaller and simpler in structure than many digital behavioural datasets. So computational capacity and software availability will need to be reassessed while remaining cognizant of our limited technical resources. To this end, we are exploring the possibility of working with an external infrastructure provider. This would allow us to tap into resources and expertise that will help us to develop our existing on-site access. It would also allow us to expand to provide remote desktop access for the first time.
Legal challenges
By the time data are made available via the Secure Data Center, the potential issues around the legal basis for sharing have (thankfully!) been addressed and resolved by the ‘ingest team’. So legal considerations for us as a Trusted Research Environment focus on the data governance that we have in place. This sets out clear definitions of who can use the data, for what and for how long. Our current governance policies are well established. They should be sufficient for digital behavioural data. Nevertheless, two areas need some additional consideration – ethics and disclosure risk.
Ethical challenges
There is currently no formal assessment of whether a project is ethical when a researcher applies to access our data. Whether this remains the case with the addition of digital behavioural data depends on the content of the data and what harms could arise from its use. For example, analysis of data relating to accessing certain types of websites (such as websites relating to sensitive areas like mental health issues or adult content) could lead to stigmatisation of particular groups. We might want researchers to demonstrate that they are conscious of this before granting access.
We also need to consider the disclosure risk of these data. The risk of a violation of privacy or breach of confidentiality is primarily a legal issue. Arguably it’s also an ethical one. The risk of disclosure can be particularly high when it comes to digital behavioural data. This is due to the easy availability of additional data and the increased level of potential linkability to individuals. The piecing together of more than one source of information to identify an individual is referred to as secondary or ‘jigsaw’ disclosure. It’s a significant challenge for Trusted Research Environment teams. That said, we have well-established output-checking procedures and sensitivity rules that should address the additional risk posed by digital behavioural data, assuming the results produced by researchers are quantitative. For qualitative outputs, our community does not yet have a solution, although work is underway in this area.
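One widely used kind of rule in quantitative output checking is a minimum cell count: any cell in a frequency table that represents fewer than a set number of individuals is flagged before release. The sketch below illustrates the idea only. The threshold of 10, the `check_frequency_table` helper and the example data are assumptions for illustration, not the Secure Data Center’s actual rules.

```python
from collections import Counter

MIN_CELL_COUNT = 10  # assumed threshold; actual rules vary between TREs


def check_frequency_table(rows, group_keys, threshold=MIN_CELL_COUNT):
    """Return the cells of a frequency table whose counts fall below the threshold."""
    counts = Counter(tuple(row[k] for k in group_keys) for row in rows)
    return {cell: n for cell, n in counts.items() if n < threshold}


# Hypothetical researcher output: one row per participant, invented categories.
rows = (
    [{"age_group": "18-29", "site_category": "news"}] * 25
    + [{"age_group": "18-29", "site_category": "health"}] * 3
    + [{"age_group": "30-49", "site_category": "news"}] * 12
)

flagged = check_frequency_table(rows, ["age_group", "site_category"])
print(flagged)  # prints {('18-29', 'health'): 3}
```

A flagged cell would typically be suppressed or merged with a neighbouring category before the output leaves the secure environment. A check like this only covers tabular, quantitative results.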
Organisational challenges
The final area of challenge is in the day-to-day running of the Secure Data Center. We anticipate good levels of demand for digital behavioural datasets. As a small team, we need to make our existing application process more streamlined to meet the increased demand for our services. One of the opportunities of the digital behavioural data project is that it is lending support to the development and implementation of a Customer Record Management system to replace our manual application process.
While integrating digital behavioural data into a Trusted Research Environment presents technical, legal, and organisational hurdles, it also offers a chance for reflection and improvement. TREs can use this challenge and others like it to build a more robust technical infrastructure, strengthen data governance, and streamline work. This will expand our capabilities and better serve researcher communities.
Additional reading
Kavianpour S, Sutherland J, Mansouri-Benssassi E, Coull N, Jefferson E. Next-Generation Capabilities in Trusted Research Environments: Interview Study. J Med Internet Res. 2022 Sep 20;24(9):e33720. doi: 10.2196/33720. PMID: 36125859; PMCID: PMC9533202. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9533202/
Munzert, S., Ramirez-Ruiz, S., Watteler, O., Breuer, J., Batzdorfer, V., Eder, C., … Yang, J. (2023, June 28). Publishing Combined Web Tracking and Survey Data. https://doi.org/10.31219/osf.io/y4v8z
Useful links
Secure Data Center: gesis.org/en/services/processing-and-analyzing-data/analysis-of-sensitive-data/secure-data-center-sdc
SOMAR: https://socialmediaarchive.org/?ln=en
Smart Data Research UK: https://www.sdruk.ukri.org/researchers/