Technical Approaches to Managing Personal Data in Large Language Models

March 25, 2024
Authored by
Franklin Cardenoso Fernandez
Researcher at Holistic AI

In the current and rapidly evolving age of artificial intelligence (AI), large language models (LLMs) have become a popular tool because of their ability to analyze and generate human-like text in a previously unseen way. These models, usually trained on huge datasets, use attention modules within their architectures to process and understand text input, enabling the straightforward development of applications such as chatbots, translation tools, and content generation.

However, despite their immense potential, their use raises ethical concerns, particularly with respect to Personally Identifiable Information (PII), which often contains sensitive details. In this blog post, we explore some of the challenges associated with protecting user information in the era of LLMs and the approaches being developed to protect that information when LLMs are used.

LLMs and personal data

It is widely known that LLMs are the state of the art for many applications, from understanding and analyzing information to generating coherent, contextual text. Thanks to their versatility, these models have changed the way many applications are traditionally built, as is evident in their use for virtual assistants, translation, sentiment analysis, and content generation.

However, alongside the positive impact these models have on technological development, there are a number of ethical concerns about their responsible use and handling of data, especially during training. The models require vast amounts of training data, and even seemingly non-sensitive data can turn into Personally Identifiable Information (PII) at inference time if users share confidential information with the LLM, making privacy protection a crucial part of using LLMs and mitigating the associated risks.

Here, Personally Identifiable Information (PII), according to the General Data Protection Regulation (GDPR), is “any information relating to an identified or identifiable natural person (‘data subject’)..., in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.”

Common examples of PII are:

  • Name and surname
  • Email address
  • Phone number
  • Home address
  • Date of birth
  • Race

If it gets into the wrong hands, this information can expose individuals to risks such as identity theft, financial fraud, or unauthorized access.

A recent experiment by scientists at Google DeepMind highlighted this risk: when they asked ChatGPT to repeat the word “poem” endlessly, the request bypassed the application’s guardrails and caused the chatbot to reveal parts of its training data, which contained personally identifiable information such as email addresses and phone numbers. This was not the first time that malicious prompt injection bypassed the guardrails of AI chatbots – researchers at the University of California, Santa Barbara bypassed safeguards by providing the application with examples of harmful content.

Protecting PII in LLMs using anonymization

Given these vulnerabilities and the sheer number of users of generative AI chatbots, it is crucial that steps are taken to protect PII when using LLMs. While regulation and guidance can support this, there are also technical solutions that can be implemented.

For example, a number of technical strategies can be applied to mitigate personal data leakage, especially during dataset curation, by using anonymization. Here, personal information is removed or modified to prevent it from being directly associated with specific individuals.

Figure 1: Masking PII information process used for LLM fine-tuning.

For example, let’s imagine that our original information is:

“Jhon Smith is a white person that was born in 15/11/1980, he lives in 2875 Olentangy River Rd, Columbus, Ohio. His phone number is (402) 346-0444 and his email is jhon.smith@mydomain.com”

This can be anonymized by masking it with any symbol, such as:

"XXXX XXXXX is a XXXXX person that was born in XX/XX/XXXX, he lives in XXXXXXXXXX, XXXXXX, XXXX. His phone number is (XXX) XXX-XXXX and his email is XXXXXXXXXXXXXXXXXXXX"

Or by using entity labels, as in the pii-masking-200k dataset:

“[NAME] is a [RACE] person that was born in [DATE], he lives in [ADDRESS], [CITY]. His phone number is [PHONE_NUMBER] and his email is [EMAIL_ADDRESS]”
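
As a minimal sketch of how this masking step might look in practice, the Python snippet below replaces a few PII patterns with entity labels using regular expressions. The patterns are simplified assumptions for illustration; a real pipeline would rely on trained recognizers (see the tools discussed later) rather than hand-written rules, particularly for names and addresses.

import re

# Simplified, illustrative regexes; real pipelines use trained PII recognizers,
# especially for names and addresses, which these patterns do not cover.
PATTERNS = {
    "EMAIL_ADDRESS": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "PHONE_NUMBER": r"\(\d{3}\)\s?\d{3}-\d{4}",
    "DATE": r"\b\d{2}/\d{2}/\d{4}\b",
}

def mask_pii(text: str) -> str:
    """Replace every matched PII span with its entity label, e.g. [EMAIL_ADDRESS]."""
    for label, pattern in PATTERNS.items():
        text = re.sub(pattern, f"[{label}]", text)
    return text

masked = mask_pii(
    "Jhon Smith was born in 15/11/1980. His phone number is (402) 346-0444 "
    "and his email is jhon.smith@mydomain.com"
)
print(masked)
# Jhon Smith was born in [DATE]. His phone number is [PHONE_NUMBER] and his email is [EMAIL_ADDRESS]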

Personal data can also be encrypted, which encodes it to add a more robust layer of security while keeping the original values recoverable for authorized parties:

“G4gU21pdGgiLCJpYXQiOjE1MTYyMzkwMjJ9 is a XRlIiwiaWF0IjoxNTE2MjM5MDIyfQ person that was born in zExLzE5ODAiLCJpYXQiOjE1MTYyMzkwMjJ9, he lives in zUgT2xlbnRhbmd5IFJpdmVyIFJkLCBDb2x1bWJ1cywgT2hpbyIsImlhdCI6MTUxNjIzOTAyMn0. His phone number is DIpIDM0Ni0wNDQ0IiwiaWF0IjoxNTE2MjM5MDIyfQ and his email is 24uc21pdGhAbXlkb21haW4uY29tIiwiaWF0IjoxNTE2MjM5MDIyfQ”
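
A minimal sketch of this pseudonymization-by-encryption idea, assuming the third-party cryptography package (not named in the original article): each PII value is replaced by a ciphertext token, and only holders of the key can recover the original.

from cryptography.fernet import Fernet  # assumed dependency: pip install cryptography

# The key must be stored securely; anyone who holds it can reverse the encryption.
key = Fernet.generate_key()
fernet = Fernet(key)

def pseudonymize(value: str) -> str:
    """Encrypt a single PII value so the text can be shared without exposing it."""
    return fernet.encrypt(value.encode()).decode()

def recover(token: str) -> str:
    """Decrypt a token back to the original value (authorized users only)."""
    return fernet.decrypt(token.encode()).decode()

token = pseudonymize("jhon.smith@mydomain.com")
print(token)           # opaque ciphertext that can replace the email in the text
print(recover(token))  # jhon.smith@mydomain.com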

Integrating anonymization using prompt engineering

Furthermore, pre-trained LLMs themselves are being used to implement these strategies, detecting and removing PII from datasets or public data through prompt engineering.

Figure 2. Pre-trained LLM used for PII information detection.

For example:

“From the following text, please find all the instances related to a person’s name, address, email, race, birth date, or any other information related to PII. Present the output as a JSON object with the instances found, and remove them from the original text, replacing them with the ‘X’ symbol. Text: ‘Jhon Smith is a white person that was born in 15/11/1980, he lives in 2875 Olentangy River Rd, Columbus, Ohio. His phone number is (402) 346-0444 and his email is jhon.smith@mydomain.com’”

Model’s output:

{"Name": "Jhon Smith", "Address": "2875 Olentangy River Rd, Columbus, Ohio", "Email": "jhon.smith@mydomain.com", "Race": "white", "Birth": "15/11/1980", "Phone number": "(402) 346-0444"}

Text: “XXXX XXXXX is a XXXXX person that was born in XX/XX/XXXX, he lives in XXXXXXXXXX, XXXXXX, XXXX. His phone number is (XXX) XXX-XXXX and his email is XXXXXXXX@XXXXXXX.XXX”
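
As a rough sketch of how this prompt-based detection could be automated, the snippet below sends the same instruction to a chat model using the OpenAI Python client. The client, model name, and the shape of the returned JSON are assumptions for illustration; any chat-capable LLM API could be substituted, and the model's output should be validated before it is trusted.

from openai import OpenAI  # assumed dependency: pip install openai

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

PROMPT = (
    "From the following text, please find all the instances related to a person's name, "
    "address, email, race, birth date, or any other PII. Present the output as a JSON "
    "object with the instances found, and return the original text with each instance "
    "replaced by the 'X' symbol. Text: '{text}'"
)

def detect_and_mask(text: str) -> str:
    """Ask the model to report detected PII and return a masked copy of the text."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; use whichever model you have access to
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
    )
    return response.choices[0].message.content

print(detect_and_mask(
    "Jhon Smith is a white person that was born in 15/11/1980, he lives in "
    "2875 Olentangy River Rd, Columbus, Ohio."
))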

More information about data anonymization and encryption can be found in this paper, which expands on the concept of masking commercial information.

Tools for PII anonymization in LLMs

Beyond manual anonymization, a number of tools have been developed to help developers detect and anonymize PII. These include spaCy’s well-known Named Entity Recognition (NER) component, which identifies spans such as names, locations, and dates. Another interesting tool is Microsoft’s Presidio Analyzer SDK, which provides modules not only for identifying but also for anonymizing private entities in text using different recognizers. Finally, the Amazon Comprehend service available on AWS also allows for the detection and anonymization of PII in a given text.
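
As an illustration of the detect-then-mask workflow these tools support, here is a minimal sketch using spaCy's NER (it assumes the en_core_web_sm model has been downloaded); Presidio and Amazon Comprehend follow a similar analyze-and-anonymize pattern through their own SDKs.

import spacy  # assumed setup: pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

def anonymize_with_ner(text: str) -> str:
    """Replace every entity spaCy detects (PERSON, GPE, DATE, ...) with its label."""
    doc = nlp(text)
    masked = text
    # Work backwards through the entities so character offsets remain valid.
    for ent in reversed(doc.ents):
        masked = masked[:ent.start_char] + f"[{ent.label_}]" + masked[ent.end_char:]
    return masked

print(anonymize_with_ner(
    "Jhon Smith was born on 15 November 1980 and lives in Columbus, Ohio."
))
# e.g. "[PERSON] was born on [DATE] and lives in [GPE], [GPE]."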

Legal frameworks regulating PII

Finally, there are a number of legal resources that set out guidance and legislation governing the use of PII, introducing requirements such as consent and placing limitations on how PII can be used.

These include the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), which provide frameworks that regulate the collection and use of PII by businesses and give users control over their personal data. There are also PII laws that target specific applications, such as the Health Insurance Portability and Accountability Act (HIPAA) and the Fair Credit Reporting Act (FCRA), which protect the privacy and confidentiality of health and credit information, respectively. Finally, the National Institute of Standards and Technology (NIST) in the US published its widely known “Guide to Protecting the Confidentiality of Personally Identifiable Information (PII)” in 2010, which provides a powerful guide to assist organizations in protecting personal information.

Ensure the safety of your LLMs with Holistic AI

The handling of personally identifiable information presents significant challenges for the use of AI that must be addressed to uphold the safety of users and the general public. Accordingly, efforts must be made to maintain the balance between technological advances and the protection of user privacy, and doing so requires a combination of legal and technical expertise.

Schedule a demo to find out how Holistic AI’s Safeguard can help you secure your data and govern your generative AI.

DISCLAIMER: This blog article is for informational purposes only. This blog article is not intended to, and does not, provide legal advice or a legal opinion. It is not a do-it-yourself guide to resolving legal issues or handling litigation. This blog article is not a substitute for experienced legal counsel and does not provide legal advice regarding any situation or employer.
