Data Tokenization
Updated: December 4, 2023
The process applied to datasets to protect sensitive information is known as Data Tokenization. It is most commonly applied to data at rest. Sensitive data is replaced with a non-sensitive stand-in, known as a token, that preserves the format of the original data.
The non-sensitive token remains in the dataset, while the mapping between each token and the original sensitive data is stored securely outside of the system, typically in a token server (or token vault). When the original sensitive data is needed again, the mapping is looked up on the token server; this reverse lookup is called detokenization.
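The tokenize/detokenize flow described above can be sketched as follows. This is a minimal, illustrative sketch: the `TokenVault` class, its dictionary-backed storage, and the sample card number are assumptions for demonstration, not a real token server, which would be a hardened, access-controlled service.

```python
import secrets

class TokenVault:
    """Toy in-memory token vault (illustrative only; a real vault is a
    secured external service, not a Python dict)."""

    def __init__(self):
        self._token_to_value = {}   # token -> original sensitive value
        self._value_to_token = {}   # reuse the same token for repeated values

    def tokenize(self, value: str) -> str:
        if value in self._value_to_token:
            return self._value_to_token[value]
        # Generate a random stand-in with the same format:
        # digits are replaced by random digits, other characters copied through.
        while True:
            token = "".join(
                secrets.choice("0123456789") if ch.isdigit() else ch
                for ch in value
            )
            if token != value and token not in self._token_to_value:
                break
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # Reverse lookup of the stored mapping (the detokenization step).
        return self._token_to_value[token]

vault = TokenVault()
token = vault.tokenize("4111-1111-1111-1111")
print(token)                     # random digits, same length and dash positions
print(vault.detokenize(token))   # original value recovered from the vault
```

Because the token is random, it has no mathematical relationship to the original value; an attacker who obtains only the tokenized dataset learns nothing without access to the vault.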
Vault tokenization and vaultless tokenization are the two main types of data tokenization. Data tokenization is most commonly used in the payment processing industry: it helps organizations comply with the Payment Card Industry Data Security Standard (PCI DSS), which requires the protection of sensitive data such as credit card numbers. However, any kind of sensitive data can be protected using data tokenization.
Companies use data tokenization to meet industry security standards, reduce data misuse, and improve customer confidence. Its most common benefit is securing sensitive information: by removing sensitive values from everyday systems, it reduces threat vectors and shrinks the scope of systems that need advanced security controls.
Types of data tokenization
- Vault tokenization — tokens are random values with no relationship to the original data, and the token-to-value mapping is stored in a secure token vault; detokenization is a lookup against the vault.
- Vaultless tokenization — tokens are derived cryptographically from the original data (typically with format-preserving encryption), so no mapping database is needed; detokenization reverses the transformation using the key.
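Vaultless tokenization can be illustrated with a toy reversible digit transform. This is only a sketch of the idea: the HMAC-based keystream, the hard-coded key, and the per-position shift below are illustrative assumptions and are not a secure format-preserving encryption scheme (production systems use vetted constructions such as NIST FF1, with managed keys).

```python
import hmac
import hashlib

KEY = b"demo-key"  # illustrative only; real systems use managed, rotated keys

def _keystream(n: int) -> list:
    """Derive n decimal digits from HMAC-SHA256 (toy keystream, not FF1)."""
    out = []
    counter = 0
    while len(out) < n:
        block = hmac.new(KEY, counter.to_bytes(4, "big"), hashlib.sha256).digest()
        out.extend(b % 10 for b in block)
        counter += 1
    return out[:n]

def tokenize(value: str) -> str:
    # Shift each digit by the keystream mod 10; copy non-digits through,
    # so the token keeps the original format.
    digits = [ch for ch in value if ch.isdigit()]
    ks = _keystream(len(digits))
    enc = iter((int(d) + k) % 10 for d, k in zip(digits, ks))
    return "".join(str(next(enc)) if ch.isdigit() else ch for ch in value)

def detokenize(token: str) -> str:
    # Reverse the shift with the same key; no vault lookup is needed.
    digits = [ch for ch in token if ch.isdigit()]
    ks = _keystream(len(digits))
    dec = iter((int(d) - k) % 10 for d, k in zip(digits, ks))
    return "".join(str(next(dec)) if ch.isdigit() else ch for ch in token)

t = tokenize("4111-1111-1111-1111")
print(t)                # same format as the input, digits transformed
print(detokenize(t))    # original value recovered using only the key
```

The trade-off between the two types is visible here: vaultless schemes avoid the operational burden of a mapping database, but their security rests entirely on the cryptographic transform and key management, whereas vault tokens are pure random values.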