Data Storage Visualisation
If you’re running this in Colab, make sure to save a copy of the notebook in Google Drive to save your changes.
To help visualise the scale of the data used in physics, here is a figure representing some major numbers in fields such as particle physics and cosmology. You can compare these with the standard units used when quantifying data. The size of each dot follows a base-2 logarithmic scale.
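To get a feel for what a base-2 logarithmic scale does to data sizes, here is a minimal sketch. It assumes binary units (1 kibibyte = 2**10 bytes, and so on); the `UNITS` table and the `log2_size` helper are illustrative names and are not part of the figure code or of Datasize.xlsx.

```python
import math

# Assumed binary units of data, expressed in bytes (illustrative only)
UNITS = {
    "kibibyte": 2**10,
    "mebibyte": 2**20,
    "gibibyte": 2**30,
    "tebibyte": 2**40,
    "pebibyte": 2**50,
}

def log2_size(n_bytes):
    """Return the base-2 logarithm of a size given in bytes."""
    return math.log2(n_bytes)

# Print each unit next to its log2 value
for name, n_bytes in UNITS.items():
    print(f"{name:>9}: {n_bytes:>25,d} bytes -> log2 = {log2_size(n_bytes):.0f}")
```

A pebibyte is 2**40 times larger than a kibibyte, yet on this scale the two values are 50 and 10. This compression is what lets very small and very large quantities share one figure.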
Below is the code used to create the figure. It is hidden as it is not relevant to data storage concepts. However, feel free to have a look at it.
[1]:
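# If you're running this notebook, uncomment the code in this cell to install the required packages.
# time and warnings are part of the Python standard library and do not need to be installed.
# openpyxl is the engine pandas uses to read .xlsx files.
# ! pip install numpy
# ! pip install plotly
# ! pip install matplotlib
# ! pip install pandas
# ! pip install openpyxl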
[2]:
import numpy as np
import matplotlib.pyplot as mat
from plotly.offline import init_notebook_mode, iplot
import time
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
init_notebook_mode()
# Change this path to the folder that contains Datasize.xlsx
path = ''
df = pd.read_excel(path + "Datasize.xlsx")
df['Size (for scaling)'] = df['Size (for scaling)'].values.astype(int)
## creating a difference between the different domains so that we can colour them differently later
domains = []
for domain in df['Domain']:
    if domain not in domains:
        domains.append(domain)
#print(domains)
custom_colors = {
    'Units': 'blue',
    'Comparaison with Real World': 'pink',
    'CERN': 'maroon',
    'Astronomy': 'green'
}
columns = []
# make grid
x_column = 'X pos'
y_column = 'Y pos'
dot_column = 'Title'
time_column = 'SI Scale'
dotsize_column = 'Size (for scaling)'
cat_column = 'Domain'
# Drop rows without an SI scale first, so that no NaN reaches int(s) further down
df = df.dropna(subset=['SI Scale'])
SI_Scale = df[time_column].unique()
grid = pd.DataFrame()
### GRID
col_name_template = '{SI_Scale}_{domain}_{header}_grid'
col_names = [x_column, y_column, dot_column, dotsize_column]
grid_rows = []
for s in SI_Scale:
    for d in domains:
        dataset_by_scale = df[(df['SI Scale'] == int(s))]
        dataset_by_scale_and_domain = dataset_by_scale[(dataset_by_scale['Domain'] == d)]
        for col_name in dataset_by_scale_and_domain:
            column_name = col_name_template.format(
                SI_Scale=s, domain=d, header=col_name)
            # The values are taken from dataset_by_scale (every domain at this scale),
            # so each scale/domain key is present in the grid.
            if dataset_by_scale[col_name].size != 0:
                grid_rows.append({'key': column_name, 'value': list(dataset_by_scale[col_name])})
# DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
grid = pd.DataFrame(grid_rows)
#print(grid.sample(20))
figure = {
    'data': [],
    'layout': {},
    'frames': []
}
# Get the smallest scale
scale = min(SI_Scale)
# Make the data dictionary
for d in domains:
    data_dict = {
        'x': grid.loc[grid['key']==col_name_template.format(
            SI_Scale=scale, domain = d, header=x_column
        ), 'value'].values[0],
        'y': grid.loc[grid['key']==col_name_template.format(
            SI_Scale=scale, domain = d, header=y_column
        ), 'value'].values[0],
        'mode': 'markers',
        'text': grid.loc[grid['key']==col_name_template.format(
            SI_Scale=scale, domain = d, header=dot_column
        ), 'value'].values[0],
        'marker': {
            'sizemode': 'area',
            'sizeref': 1/1000,
            'size': grid.loc[grid['key']==col_name_template.format(
                SI_Scale=scale, domain = d, header=dotsize_column
            ), 'value'].values[0],
            # Colour by the domain of the current iteration
            'color': custom_colors[d]
        }
    }
    # Append the data dictionary to the figure (not needed when calculating the other frames)
    #figure['data'].append(data_dict)
figure['layout']['title'] = 'Visualisation of the scale of data generation in physics'
figure['layout']['showlegend'] = True
figure['layout']['hovermode'] = 'closest'
figure['layout']['xaxis'] = {'title': 'X',
                             'range': [0, 3000], 'visible': False}
figure['layout']['yaxis'] = {'title': 'Y',
                             'range': [0, 4000], 'visible': False}
##Creating the different frames
for s in SI_Scale:
    frame = {'data': [], 'name': str(s)}
    # Every scale/domain key holds the rows of the whole scale, so any domain works for this lookup
    domain_dict = {'domain': grid.loc[grid['key']==col_name_template.format(
        SI_Scale=s, domain = domains[-1], header=cat_column), 'value'].values[0]}
    # Name the trace after the first domain found at this scale
    string = domain_dict['domain'][0]
    #print(s)
    # Make a frame for each SI scale
    for d in domains:
        # Make data dictionary for each frame
        data_dict = {
            'x': grid.loc[grid['key']==col_name_template.format(
                SI_Scale=s, domain = d, header=x_column
            ), 'value'].values[0],
            'y': grid.loc[grid['key']==col_name_template.format(
                SI_Scale=s, domain = d, header=y_column
            ), 'value'].values[0],
            'mode': 'markers',
            'text': grid.loc[grid['key']==col_name_template.format(
                SI_Scale=s, domain = d, header=dot_column
            ), 'value'].values[0],
            'marker': {
                'sizemode': 'area',
                'sizeref': 1/40,
                'size': grid.loc[grid['key']==col_name_template.format(
                    SI_Scale=s, domain = d, header=dotsize_column
                ), 'value'].values[0]
            },
            'name': string
        }
        data_dict['marker']['color'] = custom_colors[string]
        # Add data dictionary to the frame
        frame['data'].append(data_dict)
    # Add one trace per SI scale to the figure
    figure['data'].append(data_dict)
#print(figure['data'])
# The Plotly config key is spelled 'scrollZoom'
iplot(figure, config={'scrollZoom': True})
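The figure code stores its data in a key/value grid, where each key follows the pattern `{SI_Scale}_{domain}_{header}_grid` and each value is a list of column entries. As a minimal sketch of that lookup pattern, the example below rebuilds a tiny grid from made-up rows. The `toy` table, its values, and the `lookup` helper are hypothetical and only stand in for Datasize.xlsx.

```python
import pandas as pd

# Hypothetical rows standing in for Datasize.xlsx (all values are made up)
toy = pd.DataFrame({
    'Title': ['Byte', 'Kilobyte', 'Example dataset'],
    'Domain': ['Units', 'Units', 'CERN'],
    'SI Scale': [0, 3, 3],
    'Size (for scaling)': [1, 10, 12],
})

key_template = '{SI_Scale}_{domain}_{header}_grid'

# Collect one row per scale/domain/column, then build the grid in one step
rows = []
for s in toy['SI Scale'].unique():
    by_scale = toy[toy['SI Scale'] == s]
    for d in toy['Domain'].unique():
        for header in toy.columns:
            rows.append({
                'key': key_template.format(SI_Scale=s, domain=d, header=header),
                # As in the cell above, the values cover every domain at this scale
                'value': list(by_scale[header]),
            })
toy_grid = pd.DataFrame(rows)

def lookup(grid, scale, domain, header):
    """Return the list stored under one scale/domain/header key."""
    key = key_template.format(SI_Scale=scale, domain=domain, header=header)
    return grid.loc[grid['key'] == key, 'value'].values[0]

print(lookup(toy_grid, 3, 'CERN', 'Title'))
```

Collecting the rows in a plain list and calling `pd.DataFrame(rows)` once avoids growing a DataFrame row by row, and works with pandas versions where `DataFrame.append` is no longer available.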
References
You can find more documentation on how plotly works here
Computerhistory.org gives a good timeline for the evolution of memory storage
Official CERN documentation about storage: Storage at CERN
Data sharing varies across physics. Nat Rev Phys 5, 73 (2023). link to article
Official documentation about CTA: CTA website
