By Alexander A. Kharlamov, Researcher – Blockchain for Healthcare
19 March 2019
This is Part I of a three-part series about long data. Part I focuses on the definition, quality and value of long data. Part II explores the promises of blockchain technology to deliver data integrity for long data. Finally, Part III outlines the potential obstacles to guaranteeing data integrity for long data, even when supported by technologies such as blockchain.
We define ‘long data’ as longitudinal data, or “data which tracks the same sample at different points in time”. An example of ‘long data’ (sometimes referred to as panel data[1]) in the health context would be data that tracks the same cohort of clinical trial subjects over time in relation to the same health variable (for example, blood pressure).
[1] See https://www.nlsinfo.org/content/getting-started/what-are-longitudinal-data for complete definition.
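To make the definition concrete, the sketch below shows one possible way to hold long (panel) data for the blood pressure example: one record per subject per visit, from which each subject's change from baseline can be followed over time. The subject IDs, visit months, readings and the function name change_from_baseline are all invented for illustration and do not come from any real trial.

```python
from collections import defaultdict

# Hypothetical long-format records: one row per (subject, visit).
# Subject IDs, visit months and readings are invented for illustration.
records = [
    {"subject": "S01", "month": 0, "systolic_bp": 142},
    {"subject": "S01", "month": 6, "systolic_bp": 135},
    {"subject": "S01", "month": 12, "systolic_bp": 128},
    {"subject": "S02", "month": 0, "systolic_bp": 130},
    {"subject": "S02", "month": 6, "systolic_bp": 131},
    {"subject": "S02", "month": 12, "systolic_bp": 127},
]


def change_from_baseline(rows):
    """Return {subject: [(month, reading - baseline), ...]} ordered by month."""
    by_subject = defaultdict(list)
    for row in rows:
        by_subject[row["subject"]].append((row["month"], row["systolic_bp"]))

    result = {}
    for subject, visits in by_subject.items():
        visits.sort()  # order each subject's visits by month
        baseline = visits[0][1]  # earliest reading is treated as the baseline
        result[subject] = [(month, bp - baseline) for month, bp in visits]
    return result


trajectories = change_from_baseline(records)
for subject, trajectory in trajectories.items():
    print(subject, trajectory)
```

The key property is that the same subjects reappear at every time point, which is what separates long data from a series of unrelated snapshots.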
The Data: Its Quality and Value
For the majority of public and private organisations, data plays a major role: it helps them understand consumers and supports organisational planning, strategy and forecasting. The source of this data varies widely (Woerner and Wixom, 2015). The recent shift toward the New Industrial Revolution (known in business jargon as “Industrial Revolution 4.0”) catalysed the use of data for a wide variety of business and public tasks and operations (Lee et al., 2015). But not all data is created equal: some data is of lower quality (integrity) than other data, and some is ‘noisier’ (e.g., Pogrebna, 2015). This means that datasets vary greatly in the value and volume of insights they can deliver (Monino, 2016; Demchenko et al., 2013). To understand this variability, it is important to recognise that:
- All data has a certain ‘life cycle’
- The value of data varies greatly depending on how relevant it is in solving the problem at hand and how much intelligence can be derived from a particular dataset (Levitin and Redman, 1993)
- The value of data is also related to its longevity (its “time horizon”) as well as its quality (integrity)
How Long is Long Data?
Consider the data which your organisation is collecting. Will this data be useful to you in 5, 10 or 15 minutes? In hours or days? How about years? Of course, many factors may affect your answers. In some sectors, data has a high depreciation rate and expires rapidly; in others, it never goes out of date. For example, in the context of fast-moving consumer goods, businesses need customer demand data quickly, ideally in real time, in order to provide effective supply while controlling cost. If customer demand data is not received in a timely manner, this can lead to unnecessary waste. Climate records from across the world are a contrasting example: the longer the time horizon, the greater the insight they provide to researchers, governments and societies.
Other factors influencing data longevity include the way in which datasets are affected by global (universal) factors (such as technological change), local factors (such as the way data is processed, stored and analysed) and even individual factors (such as the leadership vision of business management). For example, a business that is run on a “make to order” principle will tend to focus on a shorter time frame than a business that employs a “make to stock” strategy. Similarly, a decision to progress a space exploration programme is based on many years of data, whilst a driver’s decision to reduce the speed of her vehicle is based on data from the speedometer that is only relevant for a fraction of a second.
A wide range of variables therefore affects data longevity and value, and the longevity of data clearly depends on its context. Some datasets are highly perishable. Other types of data may have limited value at the time of collection but grow in value over time (e.g. as the sample size grows). Finally, some types of data have high value and retain it throughout their life cycle. A dataset’s longevity ends when it can no longer provide any additional insights or intelligence because, for example, the data has become too difficult to analyse.
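The three value profiles described above can be sketched as simple functions of a dataset's age. The functional forms here (exponential decay, saturating growth and a constant) and all parameter names are assumptions chosen purely for illustration. They are not a model proposed in the literature cited in this article, and real datasets may behave quite differently.

```python
import math


def perishable_value(age, initial_value=1.0, half_life=1.0):
    """Assumed exponential decay: value halves every `half_life` time units."""
    return initial_value * 0.5 ** (age / half_life)


def accumulating_value(age, max_value=1.0, growth_rate=0.1):
    """Assumed saturating growth: value rises toward `max_value` as data accumulates."""
    return max_value * (1.0 - math.exp(-growth_rate * age))


def stable_value(age, value=1.0):
    """Value that stays constant over the whole life cycle."""
    return value


# Compare the three hypothetical profiles at a few ages (arbitrary time units).
for age in (0, 1, 5, 20):
    print(
        age,
        round(perishable_value(age), 3),
        round(accumulating_value(age), 3),
        stable_value(age),
    )
```

Under these assumptions, a perishable dataset such as real-time demand data loses most of its value within a few time units, while an accumulating dataset such as a climate record keeps gaining value as its time horizon lengthens.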
Data longevity is important because it is crucial to understanding change: change in social structures, psychology, nature and more. Long data matters more than ever, since human activity is generating unprecedented amounts of data over time and dealing with it is challenging. Improper analysis and handling of long data can lead to wrong conclusions, which in turn can influence policy, regulation and human behaviour. Part II will explore the challenges of long data integrity.
References
Maletic, J.I. and Marcus, A., 2000, October. Data Cleansing: Beyond Integrity Analysis. In IQ (pp. 200-209).
Demchenko, Y., Grosso, P., De Laat, C. and Membrey, P., 2013, May. Addressing big data issues in scientific data infrastructure. In Collaboration Technologies and Systems (CTS), 2013 International Conference on (pp. 48-55). IEEE.
Gaetani, E., Aniello, L., Baldoni, R., Lombardi, F., Margheri, A. and Sassone, V., 2017. Blockchain-based database to ensure data integrity in cloud computing environments.
Godlee, F., Smith, J. and Marcovitch, H., 2011. Wakefield’s article linking MMR vaccine and autism was fraudulent. BMJ, 342, c7452.
Katal, A., Wazid, M. and Goudar, R.H., 2013, August. Big data: issues, challenges, tools and good practices. In Contemporary Computing (IC3), 2013 Sixth International Conference on (pp. 404-409). IEEE.
Lee, J., Bagheri, B. and Kao, H.A., 2015. A cyber-physical systems architecture for industry 4.0-based manufacturing systems. Manufacturing Letters, 3, pp.18-23.
Levitin, A.V. and Redman, T.C., 1993. A model of the data (life) cycles with application to quality. Information and Software Technology, 35(4), pp.217-223.
Panzer-Steindel, B., 2007. Data integrity. CERN Technical Report Draft 1.3, CERN/IT, 8 April.
Monino, J.L., 2016. Data value, big data analytics, and decision-making. Journal of the Knowledge Economy, pp.1-12.
Wakefield, A., Murch, S.A., Linnell, J., Casson, D., Malik, M., 1998. Ileal-lymphoid-nodular hyperplasia, non specific colitis, and pervasive developmental disorder in children. The Lancet, 351, pp.637-641.
Woerner, S.L. and Wixom, B.H., 2015. Big data: extending the business strategy toolbox. Journal of Information Technology, 30(1), pp.60-62.