Creating an application to identify the presence of government-issued personally identifiable information (PII) embedded in documents and data, inadvertently or otherwise

Abstract

Background: In today’s digital age, a wide variety of services and processes take place online. To use them, people are required to upload government-issued documents or provide data. These documents and data contain personally identifiable information (PII), i.e. any data that can be used to identify an individual uniquely. Typical documents include the Aadhaar card, PAN card, credit card, and driving license, and typical data include the user’s name, address, phone number, email address, and financial information, among others. Careful handling of PII is crucial because its exposure can lead to privacy breaches, identity theft, financial fraud, and other cyber-related harm.

Detailed Description: The problem statement envisages an application that identifies whether PII, in the form of government-issued documents such as Aadhaar, driving license, or MHA-issued ID cards, is embedded in an uploaded document or in provided data. Notably, the PII may have been included inadvertently. PII is by nature sensitive, and it must be protected against exposure in order to safeguard users’ privacy. Entities and organizations handling documents or data containing users’ PII must be mindful of the complex challenges that come with it: they have to balance data storage, encryption, access controls, data retention policies, and data management processes with users’ knowledge and consent, notification of breaches to users, grievance redressal, etc. Such an application will alert individual users so that they can verify whether it is necessary to upload or provide a PII-containing document. Simultaneously, it will allow the entity processing the personal data to check whether the PII document or data is required and, if it is not, help remove, redact, or mask the PII in the uploaded document or provided data, or alert the individual user about it. The application would be useful for data protection compliance, risk mitigation, enhanced security, improved data quality, operational efficiency, and legal and regulatory compliance.

Expected Solution: A software application or library package that detects PII related to identified government-issued identification documents (Aadhaar card, PAN, and driving license to start with) embedded in uploaded documents or provided data, and alerts users during upload or review. In addition, the software application may be placed in the public domain and shall allow the receiver of the document to remove, redact, or mask the PII in the document and data, if required.
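The detect-and-mask behaviour described in the expected solution can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function names (verhoeff_valid, find_pii, mask_pii) are hypothetical, and the formats are assumptions, namely that an Aadhaar number is 12 digits not starting with 0 or 1 and ending in a Verhoeff check digit, and that a PAN is five letters, four digits, and one letter. Driving license numbers are not covered here.

```python
import re

# Standard Verhoeff multiplication (_D) and permutation (_P) tables.
_D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6], [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8], [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2], [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4], [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
_P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2], [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0], [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5], [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]

def verhoeff_valid(digits: str) -> bool:
    """Return True if the digit string passes the Verhoeff checksum."""
    c = 0
    for i, ch in enumerate(reversed(digits)):
        c = _D[c][_P[i % 8][int(ch)]]
    return c == 0

# Assumed formats (see the paragraph above); adjust to the real specifications.
AADHAAR_RE = re.compile(r"(?<!\d)[2-9]\d{3}[ -]?\d{4}[ -]?\d{4}(?!\d)")
PAN_RE = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")

def find_pii(text):
    """Return (kind, start, end) spans for candidate Aadhaar and PAN numbers."""
    hits = []
    for m in AADHAAR_RE.finditer(text):
        # The checksum filters out most random 12-digit numbers.
        if verhoeff_valid(re.sub(r"\D", "", m.group())):
            hits.append(("AADHAAR", m.start(), m.end()))
    for m in PAN_RE.finditer(text):
        hits.append(("PAN", m.start(), m.end()))
    return sorted(hits, key=lambda h: h[1])

def _mask_span(span, keep_last):
    """Replace letters and digits with 'X', keeping the last keep_last of them."""
    total = sum(ch.isalnum() for ch in span)
    seen, out = 0, []
    for ch in span:
        if ch.isalnum():
            seen += 1
            out.append(ch if seen > total - keep_last else "X")
        else:
            out.append(ch)
    return "".join(out)

def mask_pii(text, keep_last=4):
    """Return text with every detected span masked (length is preserved)."""
    for _kind, start, end in reversed(find_pii(text)):
        text = text[:start] + _mask_span(text[start:end], keep_last) + text[end:]
    return text
```

An uploader could call find_pii to warn the user before submission, while a receiving entity could call mask_pii to keep only the last few characters of each identifier. Scanned documents would first need text extraction (for example OCR), which this sketch does not attempt.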

Existing System

Most existing schemes are rule-based, which limits their scope and applicability in real settings. Rule-based approaches rely on a set of procedures and principles to represent knowledge from structured data. In real-world cases, however, organizations mostly maintain large corpora that hold PII in textual form, such as emails, contracts, IPv4 and MAC addresses, and telephone numbers. Among these, emails carry more entropy than the other kinds of text. The textual corpus, and email in particular, is also mostly unstructured: the information is kept in its native format, and email bodies tend to be short, subjective, and written in many different styles. Because rule-based approaches mainly deal with structured data formats, they are poorly suited to identifying PII in a large unstructured text corpus. With the advent of machine learning (ML) models and advances in natural language processing (NLP), the PII of individuals can be identified efficiently in large unstructured text corpora, a task that the existing rule-based solutions cannot address [11-13]. Researchers have put forward many efforts in this direction in the existing literature.
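To make the limitation concrete, a rule-based scanner of the kind described above might look like the following sketch. The rule set and the function name rule_based_scan are illustrative assumptions, not taken from any cited system. Fixed patterns find well-structured items such as email, IPv4, and MAC addresses, but they have nothing to match against PII that is expressed in free-form prose.

```python
import re

# Hypothetical rule set: one regular expression per structured PII type.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
RULES = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "IPV4": re.compile(r"\b(?:" + _OCTET + r"\.){3}" + _OCTET + r"\b"),
    "MAC": re.compile(r"\b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\b"),
}

def rule_based_scan(text):
    """Return (start, kind, value) for every rule match, ordered by position."""
    return sorted(
        (m.start(), kind, m.group())
        for kind, rx in RULES.items()
        for m in rx.finditer(text)
    )

structured = "Contact ravi@example.com from 192.168.1.10 (aa:bb:cc:dd:ee:ff)."
unstructured = "He lives in the blue house next to the old mill."

print(rule_based_scan(structured))    # three matches, one per rule
print(rule_based_scan(unstructured))  # no matches, although the sentence locates a person
```

The second sentence helps to locate an individual, yet no pattern fires on it. That gap is what the ML and NLP approaches mentioned above aim to close.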

Disadvantages

Complexity of PII Detection: Government-issued PII can be highly variable in format and context (e.g., social security numbers, tax IDs). Designing an application that accurately identifies all types of PII across diverse documents and formats can be technically challenging.

False Positives/Negatives: There is a risk of false positives (identifying non-PII as PII) and false negatives (failing to identify actual PII). The former can cause unnecessary panic, and the latter can leave compliance issues undetected.

Privacy Concerns: The application itself will handle sensitive data, raising concerns about how this data is processed, stored, and protected. Ensuring the application complies with data protection regulations is critical.

Cost and Resource Intensive: Developing, maintaining, and updating the application can be resource-intensive. Regular updates may be needed to keep up with evolving PII formats and regulatory requirements.
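The false positive and false negative risk noted above is usually quantified with precision and recall. The sketch below assumes that detections and ground-truth labels are both available as sets of (document, PII type) pairs. The function name and the data layout are illustrative choices only.

```python
def precision_recall(predicted, actual):
    """Compare detected PII items with labelled ones.

    predicted, actual: iterables of hashable items, e.g. (doc_id, pii_type).
    Returns (precision, recall, false_positives, false_negatives).
    """
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)   # correctly flagged PII
    fp = len(predicted - actual)   # flagged, but not really PII
    fn = len(actual - predicted)   # real PII that was missed
    # By convention here, an empty denominator yields 0.0.
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if actual else 0.0
    return precision, recall, fp, fn

detected = {("doc1", "AADHAAR"), ("doc2", "PAN"), ("doc3", "PAN")}
labelled = {("doc1", "AADHAAR"), ("doc2", "PAN"), ("doc4", "AADHAAR")}
precision, recall, fp, fn = precision_recall(detected, labelled)
print(f"precision={precision:.2f} recall={recall:.2f} fp={fp} fn={fn}")
```

Low precision corresponds to the "unnecessary panic" case and low recall to the missed compliance case, so a deployment would tune the detector toward whichever error is costlier.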

Proposed System

The proposed system, C-PIIM, is validated with respect to clustering performance metrics and the PII probability in the text corpus. The significant contributions of the proposed work can be summarized as follows:

• The proposed system addresses the problem of automatically detecting and classifying PII attributes that may be present in large text data.
• The study also addresses the problem of precise feature extraction from low-quality text data by introducing an effective data modeling and processing mechanism.
• Topic modeling is used to determine a set of contexts that show which category of text document is most likely to contain vulnerable PII.
• A hybrid deep learning technique is employed to achieve higher accuracy in classifying the PII in email or text data.

The proposed system is designed and developed so that it can meet corporate production requirements by automating the monitoring and detection of PII and ensuring compliance.
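The idea of finding which category of document is most likely to contain PII can be illustrated with a deliberately simplified sketch. It assumes each document already carries a topic label, which in the proposed system would come from the topic model. It also uses two placeholder regular expressions in place of the hybrid deep learning classifier. All names here are hypothetical.

```python
import re
from collections import defaultdict

# Placeholder detectors standing in for the real classifier (assumption).
PAN_RE = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")

def contains_pii(text):
    """Return True if any placeholder detector fires on the text."""
    return bool(PAN_RE.search(text) or EMAIL_RE.search(text))

def pii_rate_by_topic(docs):
    """docs: iterable of (topic_label, text) pairs.

    Returns {topic: fraction of that topic's documents containing PII}.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for topic, text in docs:
        totals[topic] += 1
        if contains_pii(text):
            hits[topic] += 1
    return {topic: hits[topic] / totals[topic] for topic in totals}

corpus = [
    ("hr", "PAN ABCDE1234F attached for payroll"),
    ("hr", "Team meeting moved to noon"),
    ("sales", "Quarterly numbers look strong"),
    ("sales", "Please call me back tomorrow"),
]
rates = pii_rate_by_topic(corpus)
riskiest = max(rates, key=rates.get)
print(rates, riskiest)
```

Ranking topics by this rate gives a rough per-category estimate of PII probability, which can then direct closer monitoring toward the riskiest categories.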

Advantages

Data Security and Compliance: Helps organizations comply with regulations and standards such as GDPR, CCPA, or sector-specific regulations like HIPAA, which often require the protection of PII. By identifying PII, organizations can implement appropriate safeguards.

Mitigation of Data Breaches: By scanning and identifying sensitive information, the application helps prevent accidental exposure or unauthorized access, thus reducing the risk of data breaches.

Improved Data Management: Facilitates better data governance by locating and categorizing PII. This can be especially useful for data cleanup and ensuring that only necessary information is retained.

Efficiency and Automation: Automates the process of identifying PII, saving time and resources compared to manual scanning and review. This can be particularly advantageous in handling large volumes of documents.
