For storing mobile phone numbers of 3 billion global users, should the data type be int, string, varchar, or char? And why?

With platforms like Facebook and YouTube each hosting over 2.5 to 3 billion users globally, the scale of user data storage has become increasingly complex. If every valid user is required to register with a mobile phone number, this results in billions of phone numbers to be stored and managed. Given that most countries use mobile numbers ranging from 10 to 11 digits, several challenges arise.

One of the key challenges we need to address is: if we are storing 3 billion mobile phone numbers, should we use int or string? varchar or char? And why?

With 3 billion records, what problems might arise if we use Int or BigInt?

1.1 Can an Int hold an 10 or 11-digit phone number?

First, we all know that mobile phone numbers typically have 10 or 11 digits—for example, 01511234567. 

In Java, the int type is 32-bit, with a maximum value of 2³¹ - 1, which equals 2,147,483,647 (approximately 2 × 10⁹). Clearly, this is not enough to store an 11-digit phone number. 

To store such numbers, we would need to use a 64-bit long type, which corresponds to BIGINT in mysql database.

1.2 Data integrity

For example, if the phone number is 01365618899, using a long would drop the leading zero, and the stored value would become 1365618899—this directly breaks data integrity.

Long phoneNumber = 01365618899L;  // Compilation error: Java does not allow leading zeros in Long integers

Moreover, sometimes phone numbers may include country codes such as (+81), or contain hyphens, for example, 90-2819-9213. These reasons make it impossible to store phone numbers using integer data types.

1.3 Query difficulties

For example, if you want to find phone numbers starting with "5168", using BigInt (or Long type) requires converting the number to a string first for pattern matching, which greatly reduces query efficiency.

Using String to store mobile phone numbers 

2.1 Benefits of using String

Therefore, in development and design, we usually use the String data type to store phone numbers. The main advantages of doing so are:

• Fidelity: preserves numbers, symbols, and leading zeros exactly as they are.

• Flexibility: supports fuzzy searches and international numbers, allowing easy expansion.

• Convenience: no need to worry about overflow or format conversion issues.

2.2 Why Use VARCHAR(20) Instead of VARCHAR(11)?

Let’s take mobile phone numbers as an example. Why is it generally recommended to use VARCHAR(20) rather than VARCHAR(11)? 

Suppose a standard phone number is 11 digits long, so why not simply use VARCHAR(11)?

If you have the habit of considering data fault tolerance in your daily development, you would realize:

  • International numbers like +49 151 23456789 (14 characters)

  • Numbers with country codes like 008193822223333 (15 characters)

  • Extensions like 93822223333#123 (longer than 11 characters)

In these cases, VARCHAR(11) would cause errors or crashes.

Secondly, from a business scalability perspective:

VARCHAR(11) can only store exactly 11-digit numbers. But future requirements might include:

  • Landline numbers (e.g., 510-62223333, containing a hyphen)

  • Virtual numbers (e.g., 67012341234-1238)

  • Other login methods (e.g., mixed storage of emails and phone numbers)

Therefore, the field length and type should be designed with room for future business changes to avoid frequent schema modifications. This reflects the mindset of business scalability in daily development.

There is also the consideration of data fault tolerance and input unpredictability: users may input numbers with spaces or symbols (e.g., 38 2222 3333), so storing the original value directly makes data cleaning easier.

From a design compromise perspective: if you strictly use VARCHAR(11), you must rigorously filter out non-digit characters in the code, which increases complexity.

Lastly, consider storage cost:

  • VARCHAR(11) uses up to 11 bytes (if under utf8mb4 encoding, each character can take up to 4 bytes, but digits and the plus sign only take 1 byte each)

  • VARCHAR(20) uses up to 20 bytes

The difference for 3 billion records is about 27GB, which is acceptable considering the cost of data storage.

2.3 Extreme Scenarios

If the mobile phone number consists of pure digits and does not start with a zero, you may use BIGINT; however, you should never use INT. 

Common Pitfalls to Avoid in Development and Design

When designing phone number storage, there are several pitfalls to watch out for. The main ones include:

3.1 Field Length Too Short

Using VARCHAR(11) to store only pure digits will cause truncation when encountering numbers like +8613822223333 (14 characters).

Best Practice:

Use VARCHAR(20) to accommodate international numbers and extensions (e.g., 9822223333#123).

3.2 Character Set and Collation

Using the utf8 character set cannot store emojis or special symbols.

Best Practice:

Use utf8mb4 with utf8mb4_unicode_ci collation to support all Unicode characters (such as +, *, #).

3.3 Improper Index Design

Failing to add a unique index on the phone number can lead to duplicate data.

Best Practice:

Add a UNIQUE constraint.

ALTER TABLE user ADD UNIQUE INDEX idx_phone (mobile_phone);

3.4 Missing Data Cleaning and Validation

Users may input numbers like 98-2222-3333 or 98 222 23333, and storing them directly causes inconsistent formatting.

Best Practice:

Clean data before storage by removing spaces, hyphens, and other symbols—only keep + and digits.

Use regex validation, for example: ^+?\d{8,20}$ (allowing 8 to 20 digits with an optional leading +).

3.5 Ignoring Privacy and Security

Storing phone numbers in plain text risks leaking user privacy.

Best Practice:

Encrypt data using AES or built-in database encryption functions.

Mask display results, e.g., show 98****3333 when querying.

3.6 Risk Control Validation

// Strict validation (11-digit pure number, no international code)
String regex = "^1(3[0-9]|4[579]|5[0-35-9]|6[2567]|7[0-8]|8[0-9]|9[0-35-9])\\d{8}$";

// Loose validation (allows international code, e.g. +81 19812345678)
String looseRegex = "^(\\+\\d{1,3})?1(3\\d|4[579]|5[0-35-9]|6[2567]|7[0-8]|8\\d|9[0-35-9])\\d{8}$";

Comments

Popular posts from this blog

Why Do Remote Java Transmission Objects Need to Be Serialized?

Usage of MD5 Encryption and Decryption Technology in Java Application Development