Tradeoffs In Archiving Data

Just storing everything in the cloud isn’t the answer. It’s a lot more complicated than that.

If you’ve ever had to sort through old technical documents, wondering what still has value and what can be safely tossed, you can identify with the quandary of Thomas Levy, UCSD professor of anthropology and co-founder of the field of cyber-archaeology. Staring at thousands of pieces of pottery in a Jordanian desert, he erred on the side of keeping it all.

“My personal perspective when I excavate a site is to save everything that we physically collect — even all the broken pieces of pottery because we don’t know the questions that could be asked in the future,” Levy said. “That ends up creating a huge physical curation problem.”

For the semiconductor industry, the great irony of digital preservation is that we may not be able to trust the technology we’ve created. Electronics fail and companies get bought or go out of business. Moreover, storing data has an economic component, whether that is measured in memory, energy costs, real estate, or simply maintaining a database that allows data to be accessed whenever it’s needed. And the more data that is stored, the greater the likelihood that something could go wrong.

“Everybody’s going to cloud storage now because it’s cheaper, but what happens over the long term, when prices go up and it gets more expensive?” asks Paula Jabloner, senior director of collections and archives at the Computer History Museum. “What happens if a company goes bankrupt or gets bought out? As an archivist, I’m not thinking short-term. I’m thinking 50 years from now.”

Birth of a new field
As a field archaeologist working on Iron Age history (circa 1200 to 586 BCE), Levy faced several collecting challenges, not least of which was modern-day politics. Like many archaeologists working outside their home countries, he could never be sure if a twist in international relations might restrict access to the site of his work.

As a solution, he and his colleagues developed protocols for digital archiving. “It’s all about creating a geospatial database, where we record either the archaeological monuments, historical monuments, or the actual archaeological excavation process,” Levy explained. “Let’s say you have a three-week expedition. You might create 3 to 4 terabytes of data, so if you want to archive that it’s a challenge. We have a workflow, which goes from physical artifacts data to digital data capture, to the curation of our data to the analyses of the data, and then to the dissemination of the data over the internet and in 3D visualization platforms.”

Archaeologists examine “material culture,” which covers everything from potshards to gold jewelry to clothing fibers to papyri and more. As Levy digitized his material data, he realized that digital and material archives suffer from many of the same concerns. They need properly temperature-controlled environments, and their structural integrity can be threatened by vermin, fires, floods, and theft. With digital, there’s also hardware and software future-proofing, including maintenance and migration.

The same holds true for the semiconductor industry. To preserve an exact representation of a particular chip design, whether it is required to replace a part that no longer works or used for future debug, may not sound like a big problem. But when there are dozens of complex SoCs in a particular automobile, for example, and those SoCs are updated or changed every year or two, the amount of data that needs to be kept is enormous.

Cloud storage: solution or problem?
Even in the short term, cloud companies severely limit their legal liabilities, warns technology attorney Richard Santalesa. “If you’re storing items that are vital to your patents or to your IP, you need to have multiple redundancies because there’s no cloud storage provider in the world that you can go through to recover your full losses,” he said. “All of the big three cloud providers (Amazon Web Services, Microsoft Azure, and Google Cloud Platform) cap their damages, typically at either 1x or 2x the fees paid over the past 12 months. Or if you have an exceptionally large contract with them, they will up that to a set amount or include a super-cap that applies to data breach situations or loss of personal information. But those super-caps don’t apply if they’ve lost the files related to your multi-million dollar patent.”

On top of that, there’s a legal concept called “spoliation.” If a company gets sued, it can be accused of deliberately “spoiling” documents, which makes it all the more important to be able to demonstrate why a particular back-up solution was chosen.

“You need to ask your storage provider about their practices,” said Marc Greenberg, group director for product marketing at Cadence. “When do they retire equipment? When do they transfer the data on to a new medium? How many copies of the data will they store? Is there a diversity of physical locations where the data is stored?”

That last question is vital. When the power went out in Texas in February 2021, it was out so long that back-up generators ran out of fuel.

The value of metadata
One of the challenges in data storage is how readily that data is available. This has been a problem since the early days of mainframes, when archival data typically was stored off-site using the least expensive technology, which in those days was tape. As the amount of data increases exponentially, along with the need for quicker access, that option is no longer feasible.

But recent solutions aren’t perfect, either.

“It’s much harder to retrieve data from cloud storage than if it’s on a server in your space,” Jabloner said. “You need all kinds of discovery metadata to understand what you have. You need to know its integrity. You need to be able to control whose hands it was in, to have checksums on it to make sure it’s the original file.”
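The checksum idea Jabloner mentions can be illustrated with a small sketch. It is not a description of any archive's actual tooling: it assumes SHA-256 as the hash, and the names `sha256_of`, `build_manifest`, and `verify` are hypothetical helpers invented for this example. It records a digest for every file when the archive is created, then re-checks the files later to detect anything missing or altered.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or no longer match."""
    problems = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel)
    return problems


if __name__ == "__main__":
    # Demo on a throwaway directory; the file name is made up.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "design.txt").write_bytes(b"netlist v1")
        manifest = build_manifest(root)
        print(verify(root, manifest))  # nothing has changed yet
        (root / "design.txt").write_bytes(b"netlist v2")
        print(verify(root, manifest))  # the altered file is reported
```

In practice the manifest itself would need to be stored separately from the data it describes, since a manifest kept next to the files can be lost or altered along with them.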

Alan Porter, vice president, electronics and semiconductor strategy at Siemens Digital Industries Software, echoed the importance of retrievability. “You want to be able to correct the data. You need to be able to find that data and make the correction. And if your plant burned down and your data is sitting in that plant or on your server, and you can’t get back at that, how do you maintain your product line? How do you support your customers moving forward? You also need a way, if your source company goes out of business, that you can contractually control the rights to that data so you can move forward with it if you need to.”

Napkins and beyond
As Hewlett-Packard famously enshrined a garage, Cisco pays tribute to napkins. Cisco, whose archive Jabloner helped establish at the museum, found a photocopy of the napkins on which engineers Kirk Lougheed of Cisco and Yakov Rekhter of IBM first sketched out the Border Gateway Protocol (RFC 1105), which still controls much of Internet routing.

Reproductions of the napkins now hold a place over many Cisco engineers’ desks. But there’s more to saving napkins than just nostalgia.

“It’s a crucial marker in time for when you came up with the idea,” Santalesa said, warning that patent law is so complex that even lawyers hire patent specialists. The general consensus is that nothing should be thrown away anytime soon. “I would hold on to that for the entire life of the patent: 20 years for utility patents, 15 years for design patents. You can get sued anywhere along the line. And that’s just in the United States. There’s a whole worldwide patent regime through the WIPO. And then there’s copyrights, and they last a lot longer than patents.”

It doesn’t stop there. “There are retention protocols that are dictated by government entities like the FCC and the FDA and others that you must follow no matter what,” said Siemens’ Porter. “For example, there’s an audit trail, which includes trusted traceability. Like a CSI crime scene, you have to understand where each piece of data came from. You need to understand when in the lifecycle of a product that data was introduced, when it was changed. And now, given the focus on sustainability and socially responsible engineering and development, you also have to follow the materials through that cycle. And that can get down to the atomic level. If you have a product that contains different materials that each maybe has one atom of lead, and you aggregate that, at some point that lead becomes measurable and significant. So not only do you have to follow all that, but you have to be able to effectively store an audit trail of that information.”
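One way to make an audit trail tamper-evident is to chain each entry to the one before it with a hash. The sketch below is an illustration of that general technique only. It is not Porter's method, a Siemens product feature, or a format required by any regulator, and the `AuditTrail` class, its field names, and the sample records are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, field

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, record: dict) -> str:
    """Hash a record together with the hash of the entry before it."""
    payload = json.dumps({"prev": prev_hash, "record": record}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass
class AuditTrail:
    entries: list = field(default_factory=list)

    def append(self, who: str, what: str, when: str) -> None:
        """Record who changed what, and when, linked to the prior entry."""
        record = {"who": who, "what": what, "when": when}
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS
        self.entries.append(
            {"record": record, "prev": prev_hash,
             "hash": _entry_hash(prev_hash, record)}
        )

    def is_intact(self) -> bool:
        """Recompute every link; any edited or reordered entry breaks it."""
        prev_hash = GENESIS
        for entry in self.entries:
            if entry["prev"] != prev_hash:
                return False
            if entry["hash"] != _entry_hash(prev_hash, entry["record"]):
                return False
            prev_hash = entry["hash"]
        return True


if __name__ == "__main__":
    trail = AuditTrail()
    trail.append("alice", "created bill of materials", "2021-03-01T09:00:00Z")
    trail.append("bob", "changed solder alloy", "2021-03-02T14:30:00Z")
    print(trail.is_intact())
    trail.entries[0]["record"]["what"] = "edited after the fact"
    print(trail.is_intact())
```

A chain like this only shows that the log was altered. It does not prevent alteration, so the latest hash would still need to be kept somewhere the log's editors cannot reach.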

The priority and the liability can vary, depending upon the industry. This is particularly true for medical devices or autonomous vehicles, which may be in use for 20 years or more.

Determining how long to retain data depends on who wants the data, the analytics associated with it, and what they want to do with it.

“For the design team, they may not want to look at it right now,” said Simon Rance, vice president of marketing at Cliosoft. “It may be that the architect wants to look at it to improve the next architecture of the design for performance and low power. But there’s also the other aspect of what is not working as intended. How can the software team potentially do an over-the-air software update to compensate for the hardware?”

This also means it is possible to take measurable and meaningful actions immediately.

Rob Conant, vice president, software and ecosystem at Infineon Technologies, gave a specific example of how this plays out in real-time.

“We deploy WiFi chips into the field,” Conant said. “In one case we started collecting data specifically for battery-operated products, and the energy consumption of these devices in the field. We found that there was an average. But, 20% of the devices consumed eight times the average, which means if a device was supposed to have a sixteen-month battery life, that 20% has a two-month battery life. This leads to a bunch of angry customers. We analyzed the data, correlated it, figured out what it correlated with, and eventually figured out what was happening. We were then able to provide a software update to the companies that operated these products in the field. They performed over-the-air software updates, and tripled the battery life for those outliers. This is after the fact. After the devices have already shipped. That is what an IoT company does, and I believe that the semiconductor industry is going to do more of that.”
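The arithmetic in Conant's example can be worked through in a few lines. This is a sketch under a simplifying assumption that battery life is inversely proportional to energy consumption. The function names, the device ids, and the choice of the fleet median as the baseline are all made up for illustration and are not Infineon's analysis.

```python
from statistics import median


def battery_life_months(rated_months: float, consumption_ratio: float) -> float:
    """Battery life under the assumption that it scales as 1 / consumption."""
    return rated_months / consumption_ratio


def flag_outliers(readings: dict[str, float], threshold_ratio: float) -> list[str]:
    """Return ids of devices drawing more than threshold_ratio x the median."""
    baseline = median(readings.values())
    return sorted(
        device for device, value in readings.items()
        if value > threshold_ratio * baseline
    )


if __name__ == "__main__":
    # Ten hypothetical devices: eight typical, two drawing eight times as much.
    readings = {f"dev{i}": 1.0 for i in range(8)}
    readings.update({"dev8": 8.0, "dev9": 8.0})

    print(flag_outliers(readings, threshold_ratio=4.0))
    outlier_life = battery_life_months(16, 8)  # 16 months / 8 = 2 months
    print(outlier_life)
    print(outlier_life * 3)  # tripled by the update: 6 months
```

The median is used as the baseline here because a plain mean would be pulled upward by the very outliers being searched for. Note that tripling a two-month battery life gives six months, which is still short of the sixteen-month target.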

There are important operational aspects as well, Porter said. “A developer may alter the code, whether it’s for hardware or software design, some multiple of x times in one day. And then it’s released to the cross-functional teams at the enterprise, maybe once every two or three days, or a week. So if you don’t have any kind of storage of that work in process, you’re going to lose that day no matter what. And you have to go back to the last time it was vaulted to the enterprise. And if somebody gets sick or quits the company, you need to continue your work in process. You can’t just say to yourself, ‘I have to go back and figure out where we are in the process’ and then try to move forward. You need to know precisely what that last state was.”

Conclusion
Whether it involves Fortune 100 companies with a full roster of attorneys, or startups with only a few employees, everyone needs formal protocols around data retention. If companies are large enough, they should also employ a dedicated archivist or data librarian in addition to legal counsel. And discussions should extend beyond legal and regulatory compliance to consider company culture, and how celebrating history can help improve recruitment and morale.

“You’re building on a legacy,” said Jabloner. “An engineer might like to relish that he or she works in a company that was the first to do X, even if that engineer was still in college or grad school at the time.”

Yet despite best practices, companies are still vulnerable to data loss. “Frankly, there’s nothing that’s permanent,” said Porter. “You have to have redundancy. The data that I put on the cloud, I still back up to my 4 terabyte drive.”
