By Andrew Purcell and Alberto Pace
As described in the previous article, the meteorological community is rapidly advancing its capacity to manage and distribute the growing amounts of data and information being generated by WMO Members and other organizations. Not surprisingly, other technical and scientific communities are also grappling with how best to organize massive quantities of data. Like WMO, they are tackling this challenge by taking advantage of the rise of the Internet, the accelerating power of computers, and the increasing sophistication of software.
The following article describes the strategy being adopted by the European Organization for Nuclear Research, known by its French acronym as CERN and located just a few kilometers from WMO's Geneva headquarters. While the nature of CERN's data-management needs is in many ways different from WMO's, its strategy offers an interesting comparison to the approach taken by the WMO Information System.
When particles collide, data explodes. The Large Hadron Collider (LHC) at CERN produces roughly one million gigabytes of data per second. Using sophisticated selection systems, researchers at CERN are able to filter out all but the most promising data, but this still leaves the organisation with over 25 million gigabytes of data to handle annually — this is the equivalent of over 5 million DVDs. Analysing and understanding this data in a meaningful way is a tremendous challenge, requiring a global, collaborative effort. While the scale of the challenge at CERN may be larger than that faced by many other organisations, important lessons may be learned from how CERN handles this data.
The Grid never sleeps: this image shows activity on 1 January 2013, just after midnight, with almost 250,000 jobs running.
Modern scientific computing can be divided into three main pillars:
- Processing — many computing cores may be required to transform and analyse the data associated with a particular research project. LHC data analysis, for instance, requires computing power equivalent to approximately 300,000 of today's fastest PC processing cores.
- Networking — a fast network, including a large bandwidth to the Internet, is vital for providing geographically dispersed research centres and laboratories with access to the computing infrastructure.
- Data handling — data needs to be stored, moved to where processing resources are available at any given time, and distributed among a large number of stakeholders (universities, research laboratories, etc.). Long-term preservation of data is also often very important.
While computers are essential tools in almost all fields of science, never have they been more integral to research than today. Historically, scientific computing was focused primarily on processing information, but recent years have seen significant evolution in technologies relating to storage, processing, distribution and long-term preservation of data. CERN has both driven and profited from this evolution, making it a prime example of how to successfully deal with large amounts of scientific data.
A worldwide computer
The CERN data centre has an electrical power capacity of 3.5 megawatts and boasts a staggering 88,000 processing cores. Nevertheless, the organisation provides only 15 per cent of the computing capacity required to process the mountain of data generated by the LHC, even after filtering by complex algorithms has discarded all but around 1 per cent of the data.
As far back as the late 1990s, it was already clear that the expected amount of LHC data would far exceed the computing capacity available at CERN alone. This is why, in 2001, CERN initiated the Worldwide LHC Computing Grid (WLCG) project, a distributed computing solution connecting the data-processing and storage facilities of more than 150 sites in nearly 40 countries around the world. Starting in 2003, the WLCG service was gradually built up through a series of increasingly demanding performance challenges, before being officially inaugurated in 2008. Today, it is used by about 10,000 physicists and, on average, well in excess of 250,000 'jobs' run concurrently.
Strength in numbers
WLCG uses a tiered structure, with the primary copy of all the data stored at the CERN data centre, often referred to as Tier-0. From here, CERN sends out a second copy of the data to 11 major data centres around the world that together form the first level, or Tier-1. Data centres in Russia and South Korea are set to join in the near future, bringing the total number of WLCG Tier-1 sites up to 13.
The handling of the magnetic tape cartridges is now fully automated, as they are racked in vaults where they are moved between the storage shelves and the tape drives by robotic arms. Over 100 petabytes of data are permanently archived, equivalent to 700 years of full HD-quality movies.
Tier-1 sites are responsible for the long-term guardianship of the data and provide a second level of data processing. Each Tier-1 site links to a number of Tier-2 sites, usually located in the same geographical region. In total there are around 140 Tier-2 sites, typically university departments or physics laboratories. Although individually small, Tier-2 sites now regularly deliver more than half of the total resources, and it is at these sites that the real physics analysis takes place.
Earlier this year, CERN and the Wigner Research Centre for Physics inaugurated an extension of the CERN data centre in Budapest, Hungary. About 500 servers, 20,000 computing cores, and 5.5 petabytes of storage are already operational at that site. The two dedicated and redundant 100 gigabits-per-second circuits connecting the two sites have been functional since February 2013 and are among the first transnational links at this distance. The capacity at Wigner is remotely managed from CERN and substantially extends the capabilities of WLCG Tier-0.
With a large distributed computing solution, the most important factor limiting processing is that CPU resources may not be located close to the data they need. This forces large amounts of data to be transferred, often across the academic Internet or private fibre-optic links, during which some CPU resources sit idle. Consequently, a good data-placement strategy is essential to maximise the computing resources available at all times.
Overall efficiency is heavily influenced by data-caching strategies as well as the speed of data transfer from offline to online storage, from remote to local sites, from data servers to local disk, and from local disk to local RAM. Similarly, the strategies used for error correction and data replication may have considerable consequences on both the availability and reliability of data. For these reasons, good data management — the architecture, policies and procedures that manage all data activities and lifecycle — is important in large-scale scientific computing.
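The idea of placing work where the data already sits can be illustrated with a toy scheduler. This is only a sketch of the locality principle described above; the site names, core counts and bandwidths are invented, not CERN's real topology:

```python
# Toy data-locality scheduler: prefer a site that already holds a replica,
# otherwise pick the best-connected site with enough free capacity.
# All names and numbers below are illustrative.

replicas = {
    "dataset_A": {"CERN", "FNAL"},
    "dataset_B": {"RAL"},
}

# Free CPU cores and Internet bandwidth (Gbit/s) per site.
sites = {
    "CERN": {"free_cores": 2000, "bandwidth": 100},
    "FNAL": {"free_cores": 500,  "bandwidth": 10},
    "RAL":  {"free_cores": 1500, "bandwidth": 10},
}

def place_job(dataset, cores_needed):
    """Return the best site for a job that reads `dataset`."""
    holders = replicas[dataset]
    # First choice: a site that holds the data and has enough free cores,
    # so no bulk transfer is needed and no CPU sits idle waiting for data.
    local = [s for s in holders if sites[s]["free_cores"] >= cores_needed]
    if local:
        return max(local, key=lambda s: sites[s]["free_cores"])
    # Fallback: transfer the data to the best-connected site with capacity.
    remote = [s for s in sites if sites[s]["free_cores"] >= cores_needed]
    return max(remote, key=lambda s: sites[s]["bandwidth"])

print(place_job("dataset_A", 1000))  # CERN: holds a replica, has the cores
print(place_job("dataset_B", 2000))  # RAL holds the data but lacks cores
```

A real grid scheduler weighs many more factors (queue lengths, storage latency, fair shares), but the core trade-off is the one shown: run next to the data when possible, and pay for a transfer only when you must.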
In large research projects, data workflows can be extremely complex: data can be duplicated across multiple sites to reduce the probability of loss in case of severe operational incidents, to increase the number of CPU cores able to process the data in parallel, or to increase the available throughput for on-demand transfers (as multiple copies can serve the data in parallel).
The LHC physics experiments at CERN provide a good example of such data management activities. The data generated by these experiments is regularly moved from one storage pool to another, so as to minimize costs or because the quality of service differs among the different pools. A storage pool based on solid state drives with multiple data replicas and fast networking may, for instance, be used to store high-value raw data that needs to be analysed. Once the data processing has been completed, this same data can be moved to a high-latency, low-cost archive repository, where long-term reliability is key. Equally, some intermediate analysis results, which can be recalculated if lost, will likely be moved to a low-latency, high-throughput, low-cost, unreliable temporary storage pool, known as "scratch space".
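The lifecycle just described — high-value raw data on fast reliable storage, processed data moved to a cheap archive, recomputable intermediates on scratch space — can be sketched as a simple placement policy. The pool names and attributes are illustrative, not CERN's actual configuration:

```python
# Toy storage-pool placement policy mirroring the lifecycle in the text.
# Pool names and attributes are illustrative assumptions.

POOLS = {
    "ssd":     {"latency": "low",  "cost": "high", "reliable": True},
    "archive": {"latency": "high", "cost": "low",  "reliable": True},
    "scratch": {"latency": "low",  "cost": "low",  "reliable": False},
}

def choose_pool(is_raw, processing_done, recomputable):
    if is_raw and not processing_done:
        return "ssd"        # active analysis needs fast, safe storage
    if recomputable:
        return "scratch"    # losing it only costs CPU time, not data
    return "archive"        # long-term guardianship at minimal cost

# Raw data under analysis stays on the fast reliable pool...
assert choose_pool(is_raw=True, processing_done=False, recomputable=False) == "ssd"
# ...is archived once processing completes...
assert choose_pool(is_raw=True, processing_done=True, recomputable=False) == "archive"
# ...while recalculable intermediate results go to cheap scratch space.
assert choose_pool(is_raw=False, processing_done=True, recomputable=True) == "scratch"
```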
The data model of the LHC experiments is, of course, much more complex than this rough outline and there is a clear need for data pools with different qualities of service (in terms of reliability, performance and cost). Also, a variety of tools are required to transfer and process this data across pools.
The need for a multi-pool architecture to efficiently handle complex data workflows in scientific research is evident from the fact that even the simplest computer has multiple types of storage: L1 and L2 caches, RAM and hard disk. Running a computer from a single type of storage would be a simplification that would lead to inefficiencies.
This requirement of having multiple pools with a specific quality of service is probably the major difference between what may be termed "cloud storage" and "big data" approaches to handling large amounts of data:
- In the cloud storage model, all data are stored in a huge, flat, uniform storage pool. As there is only one pool containing all the data (typically with three replicas spread across multiple sites), there is no need to move the data. This single-pool approach implies a uniform quality of service, which rapidly becomes suboptimal (and therefore uneconomic) once the amount of data grows to the point where storage accounts for a large fraction of the cost of the whole scientific project.
- Conversely, the big data approach goes beyond the single storage pool. It introduces the concepts of data workflows, data lifetime, data movement and data placement, together with a storage infrastructure based on multiple pools with variable quality of service, whereby the cost of storage can be significantly reduced and, ideally, optimized.
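A back-of-the-envelope comparison shows why the uniform pool becomes uneconomic at scale. The per-gigabyte media costs reuse the disk and tape figures quoted later in this article; the 80/20 cold/hot split and the replica counts are illustrative assumptions:

```python
# Rough cost comparison: one uniform disk pool versus a disk + tape split.
# Media costs are the figures quoted in this article (disk ~0.03 EUR/GB,
# tape ~0.02 EUR/GB); the cold/hot split and replica counts are
# illustrative assumptions, not CERN's actual accounting.

DISK_EUR_PER_GB = 0.03
TAPE_EUR_PER_GB = 0.02

total_gb = 100e6          # ~100 petabytes, the scale quoted in the article
cold_fraction = 0.8       # assumption: most data is rarely accessed

# Cloud-style single pool: everything on disk, three replicas.
single_pool = total_gb * 3 * DISK_EUR_PER_GB

# Multi-pool: hot data on disk (two replicas), cold data on tape (one copy,
# kept simple here).
multi_pool = (total_gb * (1 - cold_fraction) * 2 * DISK_EUR_PER_GB
              + total_gb * cold_fraction * TAPE_EUR_PER_GB)

print(f"single pool: {single_pool / 1e6:.1f} MEUR")   # 9.0 MEUR
print(f"multi pool:  {multi_pool / 1e6:.1f} MEUR")    # 2.8 MEUR
```

Even with these crude assumptions, matching the quality of service to the data's actual needs cuts the media bill by roughly a factor of three — which is the economic argument for the multi-pool approach.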
Everybody using a computer at home or in the workplace has a clear picture of what can be expected from hard disks. Common sense says that they are fairly fast and cheap. Most laptop or desktop users would also say that they are fairly reliable, because they have never lost data to a disk failure — although they are probably aware that this can happen. Yet at the CERN data centre we can measure the quality of the various media systematically, over a very large volume of data, and we have reached some surprising conclusions.
We have found that hard disks are a cost-effective (€0.03 per gigabyte) solution for online storage (not counting the electricity costs of 24/7 operation) and their performance is acceptable (100 megabytes per second when reading or writing a single stream, with seek times of a few milliseconds). However, their reliability is too low to ensure an acceptable service without data loss: the CERN data centre has 80,000 disks and typically experiences around five disk failures per day. This gives a daily data loss of around 10,000 gigabytes, the equivalent of about 5,000 average-sized user mailboxes. This is clearly unacceptable and is why storing only one copy of data on a single disk is not a viable strategy for a storage service.
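The daily-loss figure follows directly from the failure rate and the average disk capacity, which the quoted numbers imply is around 2,000 gigabytes per disk:

```python
# Reproducing the daily data-loss estimate from the figures in the text:
# 80,000 disks, ~5 failures per day. The ~2,000 GB average disk capacity
# is implied by the quoted 10,000 GB daily loss, not stated directly.

disks = 80_000
failures_per_day = 5
avg_disk_gb = 2_000

daily_loss_gb = failures_per_day * avg_disk_gb
annual_loss_tb = daily_loss_gb * 365 / 1_000

print(daily_loss_gb)   # 10,000 GB per day, matching the text
print(annual_loss_tb)  # ~3,650 TB per year if nothing were recoverable
```

The annual figure assumes every failed disk is a total loss; in practice replication and recovery mechanisms reduce the data actually lost to a small fraction of this.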
Long live long-lived storage
The situation we measure for tapes is rather different. Tapes come in cartridges of 4,000-5,000 gigabytes and have a cost (€0.02 per gigabyte) comparable to disks. The major drawback with tapes is their high latency (long access time), as it can take a couple of minutes to rewind a tape, demount it, mount another cartridge and fast-forward to the location where the data needs to be read. Nevertheless, despite their reputation for being slow, once a tape has been mounted and positioned, data can actually be written or read at speeds that are typically twice those experienced with hard disks. In addition, tape drives have separate heads for reading and writing, allowing the data to be read "on the fly" just after having been written, giving an additional factor-of-two improvement in streaming rates when data verification is necessary. Unlike hard disks, tapes consume no power when not being read or written.
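This latency/throughput trade-off can be quantified. Taking the figures above — disk streaming around 100 megabytes per second, tape at roughly twice that but paying a mount penalty of about two minutes — tape overtakes disk for sufficiently large sequential reads; the exact break-even depends on the real drive specifications:

```python
# Where does tape beat disk for a single sequential read?
# Figures from the text: disk streams ~100 MB/s; tape streams ~2x that
# but pays a mount/positioning penalty of roughly two minutes.

DISK_MB_S = 100
TAPE_MB_S = 200
TAPE_MOUNT_S = 120  # rewind, demount, mount, fast-forward

def read_time_s(size_mb, medium):
    if medium == "disk":
        return size_mb / DISK_MB_S
    return TAPE_MOUNT_S + size_mb / TAPE_MB_S

# Break-even: size/100 = 120 + size/200  ->  size = 24,000 MB (24 GB)
for size_mb in (1_000, 24_000, 100_000):
    d = read_time_s(size_mb, "disk")
    t = read_time_s(size_mb, "tape")
    print(f"{size_mb / 1000:>5.0f} GB: disk {d:6.0f} s, tape {t:6.0f} s")
```

Below the break-even point the mount penalty dominates and disk wins; well above it, tape's doubled streaming rate makes it the faster medium — one reason it remains attractive for bulk archival reads.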
Another difference between disks and tapes comes when comparing reliability. When a tape fails, the data loss is limited to a localized area of the tape, and the rest of the tape remains readable. Such a "tape incident" causes a loss ranging from a few hundred megabytes up to a few gigabytes — three orders of magnitude less than the loss from a disk failure. This figure is confirmed by the CERN data centre, which, with more than 50,000 tapes, loses just a few hundred gigabytes of tape data annually. This compares with the few hundred terabytes of data lost to disk failures every year.
The CERN data centre houses servers and data storage systems not only for Tier-0 of WLCG and for other physics analysis, but also for systems that are critical to the daily functioning of the laboratory.
Tape failures are also much less correlated with one another: the probability of a read operation failing on tape A is largely independent of the probability of a failure on tape B. With hard disks there is greater correlation when disks sit in the same server or behind the same controller – the probability of a second failure after a first one is much higher than the probability of the initial failure. Independence of failures is key to increasing service reliability, and this independence is often missing when servers with large numbers of disks are used in the infrastructure.
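Why independence matters can be seen with just two replicas of a file: if failures are independent, the chance of losing both copies is the product of the individual probabilities, whereas correlation (same server, same controller) destroys that multiplication. The probabilities below are purely illustrative, not measured CERN figures:

```python
# Effect of failure correlation on a two-replica copy of a file.
# All probabilities here are illustrative assumptions.

p = 0.01  # chance that one copy fails during some period

# Independent media (e.g. two tapes, or disks in different servers):
# losing both copies requires two independent failures.
p_lose_both_independent = p * p

# Correlated media (e.g. disks behind the same controller): suppose a
# second failure is 50x more likely once the first has occurred.
p_second_given_first = min(1.0, 50 * p)
p_lose_both_correlated = p * p_second_given_first

print(p_lose_both_independent)  # around 1 in 10,000
print(p_lose_both_correlated)   # around 1 in 200 - fifty times worse
```

The replica count has not changed between the two cases; only the placement has. This is why spreading copies across failure domains is worth more than simply adding copies within one.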
Conflicts and data integrity
The CERN Data Centre has so far recorded over 100 petabytes (equal to 100 million gigabytes) of physics data from the LHC, roughly equivalent to 700 years of full HD-quality movies. Particle collisions from the LHC have generated about 75 petabytes of this data in the past three years alone. At CERN, the bulk of the data (about 88 petabytes) is archived on tape. In total, the organisation has eight robotic tape libraries distributed over two buildings, and each tape library can contain up to 14,000 tape cartridges; CERN currently has around 52,000 tape cartridges. The remaining physics data (13 petabytes) is stored on a system of hard disks optimized for fast analysis and accessible by many concurrent users. The data are stored on over 17,000 disks attached to 800 disk servers.
In a multi-pool system, such as that used by CERN, it is important to ensure that there are no discrepancies between the multiple copies of the data replicated in different pools. In a perfect world, where software always behaves as designed, replication based on metadata information should be enough to ensure a sufficient level of consistency. However, where conflicts do arise between data sets in different pools, it is important to have a well-defined strategy in place to handle them. Security is another vital aspect of ensuring long-term data integrity. Every request that a storage service executes needs to be mapped to a particular identity, and this identity needs to be traceable to a person. Security may, for example, involve the encryption of data or placing limits on what data can be accessed or modified by particular users. High security may, however, require compromises to be struck between performance, scalability and costs.
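Replica-consistency checks of the kind described above typically boil down to comparing content checksums across pools. A minimal sketch — the pool layout, file names and contents are invented for illustration:

```python
# Minimal replica-consistency check: compare content checksums of the
# copies of a file held in different storage pools.
# Pool names, file names and contents are invented for illustration.

import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

pools = {
    "ssd":     {"run1.dat": b"event data v2"},
    "archive": {"run1.dat": b"event data v1"},  # stale replica
}

def replicas_consistent(name):
    """True if every pool holding `name` has byte-identical content."""
    sums = {checksum(pool[name]) for pool in pools.values() if name in pool}
    return len(sums) <= 1

print(replicas_consistent("run1.dat"))  # False: the archive copy is stale
```

In practice the metadata catalogue stores the expected checksum, so a stale or corrupted replica can be detected and re-replicated from a good copy without reading every pool at once; which copy wins a conflict is exactly the well-defined strategy the text calls for.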
This article shows that while distributing petabytes of scientific data 24/7 may be complex, it is certainly possible. Components within such a large system will inevitably fail, so it is important to build resilient systems. It is worth bearing in mind that it took over a decade to build WLCG. During the lifespan of WLCG, networks have proven to be far more reliable than anticipated, as well as more affordable. Global federated identities have also been key to the success of WLCG, since researchers regularly change organisations, but still want to access 'their' data.
CERN is at the forefront of handling extremely large data sets, with sophisticated solutions existing for data processing, distribution and analysis. The organisation is now tackling challenges related to long-term data preservation. Data preservation and sustainability require permanent efforts, even after funding for experiments ceases, with measures to achieve this now frequently becoming a requirement of funding agencies. This is a significant and welcome trend, since we have an important duty to future generations to accurately and efficiently preserve the data generated by CERN experiments.