Your Online Data Never Dies, Until Now
By Ronin Bae
Have you ever heard the expression “once it's on the Internet, it's there forever”? Well, that is no longer entirely true, thanks to new laws such as the CCPA. The CCPA (California Consumer Privacy Act) was passed in 2018 and took effect in 2020. It grants consumers several important rights, including the Right to Know, the Right to Delete, and the Right to Opt-Out.
Specifically, the Right to Delete states that companies must delete an individual's personal data within a fixed period after that individual requests it (for example, 45 days under the CCPA, with one extension allowed). Privacy laws of this kind have been costly for data-centric companies. For example, the General Data Protection Regulation (GDPR) is the European Union’s counterpart to the CCPA and took effect on May 25, 2018. Under the GDPR, Amazon was fined $877 M, WhatsApp was fined $255 M, and other companies were fined over $1.2 B in total between January 2020 and January 2022.
But why would companies risk billions of dollars in penalties instead of simply deleting user data when asked? Because, in practice, they often can’t. Stored data suffers from data persistence: it remains on the storage medium long after it is no longer needed and is physically erased only much later. This is true of almost all modern storage drives, such as the ones in your computer, laptop, and even phone. Because of this, the systems built on top of those drives, such as data structures and data management systems, have been developed without sufficient means to delete data. Instead, companies are forced to run third-party programs over their databases every few weeks to remove the specific data that users asked to have deleted. This is extremely costly and time-consuming, and it can leave servers slow or even unavailable.
Data persistence is caused by both the hardware and the systems used to store data. Data systems place a high value on speed, since speed naturally creates a better experience for users, and as a result the most common data structures give up other capabilities to get it. The LSM-Tree (log-structured merge-tree), one of the most widely used storage structures and found in systems built by Google, Amazon, and Facebook, is a perfect example. LSM-Trees are notable for their fast write speed (the time it takes to receive and store information) while still offering competitive read speed (the time it takes to locate and return requested data). This is achieved in part by not deleting data at all, and instead marking it so that it is ignored. The marked data is eventually removed from the system, but only under specific conditions: when part of the system fills up, its contents are merged and rewritten, and the oldest ‘ignored’ data is dropped first. This lets servers delete the bare minimum of data at a time and postpone deletion for as long as possible, so active users either never wait on deletes or see only short, predictable pauses. However, it also means that data stays on a server long after it's marked for deletion. Bigger servers suffer even more: the more data there is, the longer a piece of data has to wait before it becomes the oldest.
Furthermore, deleting data takes a toll on the hardware that servers use. Although it is not widely known, the SSDs (solid-state drives) found in servers, computers, and phones wear out with use. SSDs store data as bits, 1s and 0s, and each memory cell can be rewritten only a limited number of times. Deleting data directly would spend some of those remaining rewrites on scrambling the old bits, whereas an LSM-Tree simply lets new data take the place of the oldest ignored data. As such, deleting data immediately would force companies to replace their drives much sooner than usual, at considerable extra cost.
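To make the LSM-Tree's "ignore now, delete later" behavior concrete, here is a minimal Python sketch. It is a toy model written for this article, not the code of any real database: the class name `TinyLSM`, the `TOMBSTONE` marker, the helper `physically_present`, and the tiny size limits are all assumptions chosen for illustration. Real LSM-Trees keep sorted files on disk and merge them level by level, but the basic idea of writing a delete marker and purging old values only during a later merge is the same.

```python
TOMBSTONE = object()  # hypothetical marker meaning "this key was deleted"


class TinyLSM:
    """Toy LSM-Tree: an in-memory buffer plus a list of flushed runs."""

    def __init__(self, buffer_limit=2, max_runs=2):
        self.buffer = {}   # newest writes, held in memory
        self.runs = []     # flushed batches of writes, newest first
        self.buffer_limit = buffer_limit
        self.max_runs = max_runs

    def put(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.buffer_limit:
            self._flush()

    def delete(self, key):
        # A delete is just another write; the old value is left untouched.
        self.put(key, TOMBSTONE)

    def get(self, key):
        # Search from newest to oldest; the first match wins.
        for table in [self.buffer, *self.runs]:
            if key in table:
                value = table[key]
                return None if value is TOMBSTONE else value
        return None

    def physically_present(self, key):
        # True if a real value for the key is still stored anywhere.
        return any(
            key in table and table[key] is not TOMBSTONE
            for table in [self.buffer, *self.runs]
        )

    def _flush(self):
        self.runs.insert(0, self.buffer)
        self.buffer = {}
        if len(self.runs) > self.max_runs:
            self._compact()

    def _compact(self):
        merged = {}
        for run in reversed(self.runs):  # oldest first, so newer entries win
            merged.update(run)
        # Every run is merged here, so tombstones and the values they hide
        # can finally be dropped for good.
        self.runs = [{k: v for k, v in merged.items() if v is not TOMBSTONE}]


db = TinyLSM()
db.put("alice", "alice@example.com")
db.put("bob", "bob@example.com")             # buffer is full, so it is flushed
db.delete("alice")                           # only writes a tombstone
hidden = db.get("alice")                     # readers no longer see the value
lingering = db.physically_present("alice")   # but the old value is still stored
db.put("carol", "carol@example.com")
db.put("dave", "dave@example.com")
db.put("erin", "erin@example.com")           # third flush triggers a merge
purged = not db.physically_present("alice")  # only now is the value gone
print(hidden, lingering, purged)
```

In this sketch, Alice's email address survives three more writes after she "deletes" it. With larger limits, as on a real server, it would survive far longer, which is exactly the problem for the Right to Delete.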
Security and practicality have reached an unforeseen crossroads. Because current servers don’t delete data in a timely fashion, malicious actors can retrieve data long after it has been marked for deletion. Yet deleting data promptly and completely can cost companies billions of dollars. Thankfully, many new programs and techniques are being developed to solve this issue, or at least to minimize its costs.