Difference between revisions of "HvABigDataVisualisation"

From PDP/Grid Wiki
 
This data is used not only here, but by over 20,000 researchers worldwide, who transfer the data back and forth between the data curation centres, the more than 300 compute centres, and thousands of small analysis workstations and desktops. Data flows globally: even just at Nikhef, the compute and disk clusters are interconnected by 240 gigabit-per-second links, and international connectivity exceeds 100 gigabit-per-second.
= More questions than answers? =
The Phase-II visualisation challenge leaves you with plenty of things to try out. Use your creativity to visualise, explain and analyse the data: big data lives by propaganda (and agitation)!

* How can global data flows be presented?
* Can one conceive visualisations for the general public? For users? Or for both?
* Identifying troublesome inter-peer links (and local disk servers) via analytics techniques
* Identifying "hot" (interesting) data by analysing usage
* Pro-active security incident detection: are there users "behaving strangely"?
* What's the way to explain WLCG data flows to the world?
* Which systems (or groups of systems) use the most bandwidth (in and out separately)?
* Which systems (or groups of systems) generate the most connections?
* What does the spectrum of transfers look like? Mostly big, mostly small, ...?
* What does the time distribution of transfers look like (bandwidth as well as number)?
* What "funny behaviour" is there (machine learning anomaly detection)?

But there's surely more to do with the data you have!
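The anomaly-detection bullet above can be made concrete with a very small sketch. Everything here is illustrative: the host names and counts are made up, and a real analysis would pull such numbers out of the log data in the cluster. A robust outlier test based on the median absolute deviation (MAD) could look like:

```python
from statistics import median

def flag_anomalies(counts, threshold=3.5):
    """Return the hosts whose value deviates from the median by more
    than `threshold` robust z-scores (median-absolute-deviation test,
    which a single huge outlier cannot skew the way a mean/stdev can)."""
    values = list(counts.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # all values (nearly) identical: nothing to flag
    return [host for host, v in counts.items()
            if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical connections-per-host numbers extracted from the logs
transfers = {"wn-01": 102, "wn-02": 98, "wn-03": 95, "wn-04": 104,
             "wn-05": 99, "wn-06": 101, "wn-07": 97, "wn-08": 5421}
print(flag_anomalies(transfers))  # → ['wn-08']
```

The MAD is used instead of the plain standard deviation because with one extreme host among a handful, the outlier itself inflates the standard deviation enough to hide itself; the median-based score does not suffer from that.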
  
 
= About the data analytics cluster =
[[File:RTM-2D-busy-2007.jpg|200px|thumb|right|Jobs distribution across EGEE and WLCG]]
 
But where does all that data go? Based on the log files produced by the data servers, in 2015 HvA students from the technical informatics group set up a search engine based on big data analytics techniques: leveraging ElasticSearch and Logstash, with a basic analysis front-end using Kibana and custom (Java-coded) queries. This "ELK" stack is hosted next to the data processing facility and contains the last month's worth of log file data in a distributed search cluster: four ElasticSearch servers, a data ingest and processing server, and a query gateway proxy.
But there's much more to analysis than just collecting the data. Understanding what is happening in the data network, noticing anomalies in the system, and detecting and understanding fault conditions all need visualisation: a picture says more than a thousand words (which means this text is much too long already!), and keeping track of data flows in the LHC computing grid, with over 300 participating sites, is not done by reading long lists. Nikhef also likes to explain what we do, why we do it, and how: making both sub-atomic physics and the experimental techniques used understandable not only to experts, but also to the general public. It is essential to visualise ongoing activity and its meaning in a way that is both appealing and conveys the gist of what LHC data traffic is about.
 
The TI student team in 2015 set up the ELK cluster and has taken care of a continuous stream of fresh data being inserted into the database. At any time there should be about one month's worth of historical detailed data in the database, and it is updated in near-real-time.
An ElasticSearch (ES) interface and API is accessible (through an application proxy) on a dedicated system: "vm6.stud1.ipmi.nikhef.nl" (172.23.1.16) on port 9200/tcp.
 
Details of the API and how to use the interface (alongside some example queries) are available in your course documentation pack, kindly prepared by Jouke, Olivier, and Rens. We expect clients to talk to the ElasticSearch API via the network. The client should run on your local system (own laptop, desktop, HvA VDI system, &c.), and connect to the system via the public API only.
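As a sketch of such a client, the following builds a "top talkers" aggregation, time-bounded because only about a month of data is kept, and POSTs it to the ES _search endpoint over plain HTTP. The index pattern and the field names (@timestamp, src_host, bytes_transferred) are assumptions for illustration; the real names are in the course documentation pack.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib import request

ES_URL = "http://vm6.stud1.ipmi.nikhef.nl:9200"  # the public API endpoint

def top_talkers_query(days=7, field="src_host", size=10):
    """Rank the busiest hosts by total bytes moved over the last `days`
    days. Field names are illustrative (Logstash's default timestamp
    field is assumed); check the actual index mapping."""
    since = (datetime.now(timezone.utc) - timedelta(days=days)).isoformat()
    return {
        "size": 0,  # only the aggregation, not individual hits
        "query": {"range": {"@timestamp": {"gte": since}}},
        "aggs": {
            "per_host": {
                "terms": {"field": field, "size": size},
                "aggs": {"bytes": {"sum": {"field": "bytes_transferred"}}},
            }
        },
    }

def run_query(index, query):
    """POST a query body to the ES _search endpoint and decode the reply."""
    req = request.Request(
        ES_URL + "/" + index + "/_search",
        data=json.dumps(query).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# Example (only works with access to the stud1 network):
# print(run_query("logstash-*", top_talkers_query(days=3)))
```

The same query body can of course be fired from curl, Kibana's console, or the Java query code the TI team wrote; only the HTTP transport differs.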
 
  comp-lzo
  verb 3
  cipher aes-256-cbc
  #auth-user-pass ../keys/secret-schaapscheerder.conf
  
 
  -----END CERTIFICATE-----
  
and if you want auto-login, *only on your own laptop*, create a secure, protected file "../keys/secret-schaapscheerder.conf" and uncomment the "auth-user-pass" line above. The file "../keys/secret-schaapscheerder.conf" should contain something like
  
 
  nvahva16x2342

Latest revision as of 12:28, 7 September 2016