Standing Committee Meeting Report Data Services

October 2022

Open Action Items 

[F2020:1-2] Enhance the ability of authors to cite data by 1) improving instructions to authors on  webpages, 2) promotion through IRIS newsletter, 3) informing journals of the citation services  provided, 4) investigating the use of tags on data distributed to users. Develop a DRAFT Data  Licensing policy in coordination with major funding agencies and UNAVCO; consider legal advice.

Responsible: Carter 

Status: (March 2021) Instructions to authors on IRIS and FDSN web pages have been updated and  an article on citation was included in the winter newsletter. Items 3 and 4 have not yet been  addressed. The Joint Data Services committee of IRIS and UNAVCO met in January 2021 to  discuss data licensing. It was recommended that no restrictions be imposed on the organization  regarding the data that might be accepted. Carter has reached out to FDSN to discuss the issue with  the executive committee. 

(October 2021) The UNAVCO DS governance committee is recommending that a workshop proposal be prepared to address these issues and is inviting IRIS to be involved in this workshop. This item is expected to be linked to the citation workshop. In addition, the FDSN has been approached about preparing a policy statement about licensing, with a recommendation that all metadata be in the public domain and all data be either in the public domain or minimally licensed to require attribution. 

(March 2022) The UNAVCO DS governance committee chairpersons (Julie and Suzan) have started  to organize a workshop on Data Citation/Licensing. 

(October 2022) The UNAVCO DS governance committee chairpersons (Julie and Suzan) organized  a workshop on Data Citation/Licensing on 17, 19, and 21 October.  

[F2020:3] Develop a DAS data directive that provides a consistent approach to requests for storing  DAS data in the data repositories. 

Responsible: Carter 

Status: (March 2021) IRIS has submitted an MSRI design proposal to address DAS data storage. 

(October 2021) The MSRI proposal was not asked to continue to the second round. More community involvement is needed and this will be sought through a community workshop. 

(March 2022) A draft proposal for a community workshop has been introduced to the DAS RCN working group on Data Management. Members of the organizing committee are being sought. It should be noted that IRIS and UNAVCO are continuing to work on better metadata/format that would be appropriate for accepting DAS data. This does not, however, solve the data volume issues. 

(October 2022) No progress. 

[S2021:1] Investigate a way of finding data that “looks different”. What are the most common  reasons for which data are tossed? Link to new “funny squiggles” paper (Ringler et al. 2021  preprint). 

Responsible: QAAC 

Status: (October 2021) The QAAC is considering this at their next meeting (winter 2021). 

(March 2022) No progress. The QAAC is still planning to discuss. 

(October 2022) No progress. 

[S2022:1] Wordsmith a list of principles for the data services governance committee and share it  with Julie E., David M., Jerry. 

Responsible: Jonathan A.

Status: (October 2022) A draft version was sent to Suzan for consideration. 

[S2022:2] Begin to coordinate dates/times for a regular joint governance meeting beginning in May. 

Responsible: Carter 

Status: (October 2022) As the merger will happen in two months’ time and the DS governance committees are in the process of selecting members for the new committee, this action item is recommended for removal. 

Meeting Summary 

Virtual Meetings on 12 and 14 October 2022 

Present: DSSC members: Ebru Bozdag, Marine Denolle, Heather Ford, Jonathan Ajo-Franklin,  Suzan van der Lee 

Staff: Jerry Carter, Rob Casey, Chad Trabant, Gillian Sharer. 

Reporters/Observers: Bruce Beaudoin, Eric Sandvol, Rebecca Rodd, Julie Elliott, David Mencin,  Dan Auerbach, Rob Mellors, Adam Ringler 

Approval of Spring 2022 Minutes:  

The minutes from the Spring 2022 DSSC meeting were approved.  

Requests to Store New Data – (Carter) 

There were three requests to store new datasets: 

• DAS from global earthquake experiment in Feb 2023 (Andreas Wuesterfeld) 

Data from a community experiment. IRIS cannot accommodate DAS data in SEED or PH5. As data are being stored in pubDAS for now, this is an opportunity for IRIS to experiment with ingesting DAS data because the amount is not huge. Jerry proposes to store the DAS data at IRIS as a format-agnostic “assembled data set”, which cannot take advantage of advanced search and discovery. However, getting the data in a chunk (assembled) is better than not having it. DSSC discussion was supportive of IRIS storing DAS data from this event. 

• SEGY data from oil exploration (Joseph Dellinger) 

Exploration data, license CC-BY-NC-SA, not typical for IRIS data. There are 2 datasets: one of airgun data and one that captured an earthquake from ocean-bottom seismometers. DSSC does not want to lose these data sets and wants to make them more findable. For now they can be accepted as assembled data sets; perhaps in the future some indexing and findability might be added. 

• Antipodal earthquake database (Rhett Butler) 

Derivative data product: catalog of antipodal earthquakes – there are too many stations to systematically find antipodal earthquakes. At the moment, all the tools exist at the IRIS DMC and they need to be put together to accomplish building the database. The script/code can live on GitHub and be indexed in SeisCode on the IRIS web site. Action: Jerry will contact Rhett and recommend the development of a script which can be shared on GitHub and indexed in SeisCode. 
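As an illustration of the kind of script discussed for the antipodal earthquake catalog, the following is a minimal sketch only, not the tooling at the IRIS DMC. It assumes that event and station coordinates have already been obtained elsewhere, and it flags event-station pairs whose great-circle separation is close to 180°. The function names, the tuple layout, and the 178° threshold are all hypothetical choices made for this example.

```python
import math


def antipode(lat, lon):
    """Return the antipode (lat, lon) in degrees of a point given in degrees."""
    anti_lon = lon + 180.0 if lon <= 0.0 else lon - 180.0
    return -lat, anti_lon


def angular_distance_deg(lat1, lon1, lat2, lon2):
    """Great-circle distance in degrees on a sphere (haversine formula)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    a = min(1.0, max(0.0, a))  # guard against rounding just outside [0, 1]
    return math.degrees(2 * math.atan2(math.sqrt(a), math.sqrt(1.0 - a)))


def antipodal_pairs(events, stations, min_distance_deg=178.0):
    """Return (event_id, station_id, distance) for near-antipodal pairs.

    events and stations are iterables of (id, lat, lon) tuples.
    min_distance_deg is an assumed, illustrative threshold.
    """
    pairs = []
    for ev_id, ev_lat, ev_lon in events:
        for st_id, st_lat, st_lon in stations:
            dist = angular_distance_deg(ev_lat, ev_lon, st_lat, st_lon)
            if dist >= min_distance_deg:
                pairs.append((ev_id, st_id, dist))
    return pairs


# Made-up coordinates, for illustration only.
events = [("ev1", -20.0, -175.0)]
stations = [("STA_A", 20.0, 5.0), ("STA_B", 40.0, -105.0)]
pairs = antipodal_pairs(events, stations)
```

With these made-up coordinates, only the station located at the antipode of the event is flagged. A catalog builder would replace the hard-coded lists with event and station metadata retrieved from the data center.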

Director’s Report – (Carter) 

• Jerry returned to full-time work and effectiveness after successful leave. He is working remotely,  which works well. 

• Uptime was 100% for the last 6 months, except for 1 month that was a bit less. As is to be expected for any big data center, there are a few glitches, which are being worked on. 

• One Data Services all-hands meeting: Staff from IRIS and UNAVCO attended to prepare for working as a single unit, following the example of working together on the Common Cloud Platform. The DS managers are setting the standard and serving as a role model for the overall IRIS-UNAVCO merger into EarthScope. 

• CCP funding from NSF continues, but is not sufficient. Carry-over funds have supplemented this  amount.  

• Three new hires: Thaddeus Megow (cloud infrastructure); Bill Fassbinder, who replaces Rick Benson (leads infrastructure section); and Emily Maher, who replaces Forrest Thompson (data engineer, focused on ingestion). 

• DS are well prepared for merger. EarthScope DS Org chart will be presented on October 24 to  DS staff. Org chart might evolve over time, but goal is to keep changes minimal. 

• Deadlines: NSF solicitation for next geophysical facility - proposal due in first Q of 2024. Need to  know what EarthScope DS facility will look like by middle of 2023. 

• Worldwide Data Centers are watching EarthScope DS developments closely, including TileDB.  Bring computing to the data (in the cloud). This has implications for other data centers, as data  exchange will be happening less and back-end (data/computation server-side) technology might  need to become more standardized. 

• New data collections: specifically DAS. CCP should contribute to data formats and data handling  efforts.  

• SZ4D submitted SZNet proposal (UCSC). IRIS is a subawardee on this proposal for 1) making legacy volcano monitoring data (esp. from Chile) available, 2) building an umbrella web site with links to data and sample repositories of volcano monitoring data. 

• Jerry sincerely thanks staff and deputy directors, they make IRIS DMC great. He has much  appreciated working with Chuck Meertens and David Mencin. 

Section Status Reports: Operations – (Sharer/Trabant) 

• Hardware transitions are going well. SeismiQuery had critical vulnerabilities and was disabled. It  had an effect on the community as it has some functions that are not easily replaced by other  tools. Shipments package improved and provides usage stats. A DASK engine was added. 

• Emily Maher joined Gillian’s team as data engineer. MT (previously EM) data were re-archived to allow attribution. Smart nodal data ingestion was worked on. New data from the QZ and AB networks were archived (part of SNECCA). NV network data were converted to OW. 

Section Status Reports: Quality Assurance – (Sharer)

• Nominal Response Library, version 2 released. Major improvement with postgres db.  Maintainable + extendable. New web service for delivery of responses in RESP, StationXML, or “sub-XML” format. NRL also available in zip archive. A PDCC replacement will implement  version 2. The STS6 response has not yet been added (Streckeisen will only distribute custom  responses by serial number, not a nominal response. We will contact them again about this.) 

Section Status Reports: QAAC report – (Sandvol /Sharer) 

• Should QAAC exist or has it served its time? Eric S. queried previous chairs. A: QAAC has been  important for providing community perspective and advice on MUSTANG and PIQQA.

• QAAC accomplishments: 

  o Creation of the MUSTANG quality assurance tools. MUSTANG is probably the most  significant system undertaken by IRIS to assure quality. And it has been very successful.  This accomplishment by itself is a strong argument to continue the QAAC.  

  o QAAC served a critically important role that helps to find balance between what users of  the facility need and what maintainers of the facility think is important. 

  o Support and advice in development of PIQQA, a tool that can be used to help PASSCAL  PI’s summarize the quality of the data acquired by their deployments. 

  o QAAC provides an important mechanism that balances the needs of different  constituents that use the Data Management System. 

• The committee discussed various options for the role, makeup, and functionality of an EarthScope QAAC that might be presented to the board. It was agreed that the QAAC is valuable and should continue in some form after the merger. EarthScope QAAC would cut across data, instrumentation, and engagement. 

Section Status Reports: Cyberinfrastructure – (Casey) 

• Cyberinfrastructure upgraded our services to Tomcat 9 and openJDK 11. Upgrade was  completed in August 2022. 

• New usage stats pipeline. Web services log to this new system: accepts user ID code, return  code, data content. 

• LUCID: enhanced identity management, with UNAVCO. Single Sign On, Auth0 portal with CILogon. Use your own univ’s login credentials. Documentation is being developed. Can manage restricted datasets and PIs can control data set access. People are signing up and using Auth0 to log in; it is off to a start, and more participation from community members is encouraged. 

• Web services for PH5 data. PH5 is preferred format for active-source data. SEGY output format  issues are fixed. 

• Cybersecurity: logging tickets, collaboration with LLNL security team (offered guidance and  performed penetration testing). Rob participates in a six-month cybersecurity workshop by  TrustedCI (NSF). Most exploits logged at IRIS were related to legacy code and quickly resolved.  

• An open source beta version of the Yasmine StationXML editor (GUI tool and a command-line tool) is on GitHub. Resif and ISTI contributed to the tools. An NRL web service will be developed. 

• Some discussion ensued about accessing embargoed data by manuscript reviewers, who are requested by the journals to verify that shared data is indeed available, but the journals want them to do so in an overall anonymous way.

  

Section Status Reports: DMC Architecture and Products – (Trabant)  

• MiniSEED 3 is in review (next generation data format). SeedLink (v4) is in review. SeedLink 4  can do identity management.  

• Derivative products: Revision of Source Time Function product and Event Plots. Manoch is working on this. Products are ported away from MATLAB and into Python 3. Software will be published on GitHub and will be ready for the cloud. 

• EMC-tools can handle projected coordinate systems. Bugs have been fixed in EMC web service,  which is not yet released. EMC model explorer is a Jupyter Notebook that will be shared with  community at Fall AGU Meeting. More notebooks will be developed in the future. The EMC data  product remains popular. 

• The EARS repository is not maintainable; the source code is lost and there is little value in  continued operation as most stations have been saturated. The preservation plan is to  Dockerize the old system and share on dockerhub.  

• Download stats are now mapped instead of tabled. 

• Lots of Jupyter notebooks will be published. The DSSC recommends that the engagement arm  of EarthScope be informed as soon as they are ready, and maybe pulled into testing them. E.g.  ROSES leaders would love to know about them and potentially use them.  

DCC Reports: ASL DCC – (Ringler) 

• ASL is in Isleta de Pueblo, quietest in US at 1 Hz, 40 staff, 15 USGS, 25 KBR. Runs 2/3 of the GSN; 1/3 is run by IDA. ASL runs the US backbone, New England, intermountain west, and N4 networks. All is contributed to IRIS. STS1 upgraded to STS6, 260 s low corner. T360 is the Nanometrics competitor to STS6. There are not many other BB instruments. 

• Many stations upgraded, now it’s the turn of harder stations: Africa, Norway, etc. GSN station  quality is good, also below 0.001 Hz.  

• New station in Antarctica (QSPA upgrade?): 2500 m depth near IceCube neutrino detector. 

• Articulated value of GSN to NSF: reviews of Geophysics paper. Lots of science results published by ASL/IDA authors, for example on the Hunga Tonga–Hunga Ha'apai eruption. 

• Internship focused on selecting filmstrip records of interest for scanning. 

• SRL special issue on Global Seismology upcoming.  

DCC Reports: IDA DCC – (Mellors)  

• IDA sends data to DMC via AWS cloud. All stations are up and sending data. Some issues in Kazakhstan, Pakistan, and Easter Island, but overall recovery from COVID-related issues. Data availability is good in general. Water in vault at Diego Garcia. 6 months of data have bad timing. No nearby stations. GSN committee weighed in about adding timing that could be a few seconds off. 

• New metadata db (StationXML) – also using VPN for data transmission in the cloud. Looking for  single internet provider for all stations (e.g. Starlink). 

• New station in Uzbekistan. Rotational seismometer (blueseis) deployed at PFO.

• Cybersecurity requires increased effort: vulnerabilities can be unexpected.

• SRL special issue. Two sessions at SSA. 

DCC Reports: PASSCAL Instrument Center Report – (Beaudoin) 

• 2022 experiments: floodgate: 72 new ones in 2022, greatest since 2010. Many BB and Nodes,  also in polar. Also MT experiments.  

• New sensors and datalogger purchased, incl. 1000 Nodes. 85% of inventory is reported to NSF  (for availability assessment).  

• 1 PB NAS storage server purchased for Node data. 

• PQL replaced with SQLX, commercial version of Richard Boaz. 

• MT capability is now fully functional. Receivers and coils are there. MT short course at  SAGE/GAGE workshop. Additional equipment is acquired. October 2022 MT workshop next  week at New Mexico Tech.  

• PASSCAL contributed to ObsPy for interfacing with NRL. Nexus also utilizes NRL. Are there  mutual benefits to leveraging PASSCAL and DMC software at DMC and PASSCAL,  respectively? 

Project Reports: CCP Project update (Trabant) 

• CCP is prioritized, various teams have various tasks, coordination planned with PASSCAL on  TileDB. GeoCrate is modern data container for cloud and would like to work with researchers,  especially those using HPC. Phased set of milestones. MiniSEED will remain for a while. TileDB  will likely be added. 

• Old metadata will be retained to meet FAIR standards. 

• Focus on Development teams, building in AWS. 

• Exploring user direct access to data. 

• Exploring process for transferring data to AWS. 

Project Reports: Citation and Data Licensing (Elliott) 

• Next week: workshop. Invited: publishers. Data policies, ethics policies, challenges, licensing,  citation and attribution will be discussed. 

Board Discussion Items (Van der Lee) 

• Review progress against the five-year SAGE-II plan 

- Core activities: data ingestion, curation, and distribution: 328 TiB of data in archive, 4PiB  sent to users. 

- Maintain Quality Assurance: MUSTANG was expanded and turned into a web service. 

- Support scientific software: SAC (community code), SeisCode (repository), DMC programs and libraries are on GitHub. SAC had a license, but is now in public domain. 

- Host multidomain data sets: Not all data (e.g. MT) can be squeezed into existing formats, which work for one data type but not for another; looking for more flexible formats. 

- Develop shared data center in the cloud, provide seamless access to seismic and geodetic  data: CCP was designed and is being built and utilizes AWS.  

- Seamlessly integrate seismic and geodetic data to help the community generate integrated data products: Infrastructure exists. One repository for data from both data centers. This will be so when CCP is done, early 2024. Physical archive in Seattle might be decommissioned in March 2024. It is already at the end of its life, and on extended support. 

- Expand Seismic Data Center Federation across world, incl. Africa & Asia: 24 federated data centers have registered with FDSN, they can be looked up on FDSN web site (https://www.fdsn.org/datacenters/: none in Africa have registered, one each in Korea, Japan, etc.). Some data sit in more than one data center, but only one is the primary source. 

- Improve SEED and expand utility of SEED for other types of time series: SEED improvements were made (they overcome limitations, e.g. lengths of strings), but SEED was found not to be compatible with many data types. Looking for other “universal” formats. MiniSEED can still be an export format? 

- Seamless access to high-frequency active and passive source seismic data: PH5 was  developed (is HDF5, metadata are included in it rather than separated out as for TileDB)  specifically for passive and active data from Nodes. HDF5 is not cloud friendly, currently  looking for better formats. 

- Support data formats that are useful in HPC environments: Looking at TileDB. Client-side ROVER can help with downloading very large data sets, even with intermittent connectivity. https://iris-edu.github.io/rover

- Support domain-agnostic formats like GeoCSV: GeoCSV is supported as export format.

- Improve availability of on-line tutorials: Little progress because of major shift in data archiving  practices, but working on Jupyter notebooks.  

- Establish capability to support workflows in the cloud where the data reside: CCP provides  capability. 

- Generate higher-level data products: Many products developed. 

• Develop any specific recommendations for SAGE-II work-plans and budgets for award years 6  and 7 

- Complete CCP: Provides flexibility and scalability for multiple data/metadata formats from  seismology, geodesy, and new types of data such as DAS as well as proximity to cloud  compute resources 

- Operate CCP: Working with international data centers on data services standards (e.g. web  services) and ways for direct access, common systems, standardize direct access.

- Training the user base: Find ways to provide a bridge/support/training between the massive  data archive and the researchers/other users/computation. Education & training of  researchers will be key: 

▪ Collaborate with ROSES and other EarthScope Engagement programs; this can also be  a testing ground for cloud-based access tools. 

▪ Involve graduate students in providing the education & training, develop documentation.  Employ graduate student interns at DMC to build valuable skills that are broader than  just geophysical research and to build affinity with and understanding of DS.

- Large N ingestion & preparing for this (streaming, DAS, ubiquitous sensors like MyShake, hi-rate GNSS): Storage and distribution of legacy data (historical, analog seismic data from microfilms and microfiches). People like Miaki Ishii and Tim Ahern are working on metadata and data formats. NSF will decide what to fund. Community needs to come up with justification for what data to scan and digitize. 

• Document key directions and priorities (strategic plan) for your program (include non-SAGE  activities as relevant), 

- Transition of staff to EarthScope DS 

- Complete full transition to the cloud-based system, including integrated data, data products,  and user training. 

- Respond to Facility Solicitation (NSF) 

- Encourage, guide, and optimize user transition to cloud computing near to data via  ROSES/SCOPED educational efforts & engagement of graduate students. 

- Some discussion ensued about financial support for new “bring workflow to cloud data”  users. It is about trying to create a small "energy barrier" to fully open use. 

- Prepare for new data types (e.g., DAS and hi-rate GNSS) 

- Develop new policies for data acceptance 

• Enumerate key science accomplishments, justifications, objectives, and concerns that the  program has/will facilitate (please note how these objectives tie to spending priorities for the  program) 

- Near 100% uptime  

- Virtually all data-driven research in seismology and geodesy uses data findable via IRIS. 

- Continual modernization of data formats, delivery methods, and other data and metadata  access methods. 

- Facilitated and supported the growing research fields of environmental seismology and  geodesy 

- Contributed to modern workforce development 

- Maintained high standards and provided metrics of data and metadata quality 

• Identify concerns or issues that should be brought to the attention of the IRIS and EarthScope  Boards 

- Moving data to the cloud requires investment in user training, which is a big deal and a big  opportunity for workforce development. 

- Identity management might decrease data usage and create unintended conflicts or  consequences, as well as cause cloud expenses. Identity management that involves  granular usage metrics can slow down computations with data in the cloud and data delivery  to users. 

- Discussion ensued about the goals of NSF when asking for identity management and about  the value of data products not being directly linked to the frequency and volume of data  used, and the value of data products for US research even if non-US researchers also  created value from the data. More dialogue with NSF could be useful.  

- Financial operations related to data flow and identity management in and out of the CCP are  a big unknown and can have critical impacts on operational budgets. 

- New data types (e.g. DAS) have huge storage needs.

- Data sources are often not cited, or are cited incompletely or incorrectly, in professional and other publications that used data via IRIS or UNAVCO. 

- An idea was proposed to appoint postdocs at EarthScope who can do research with the  EarthScope data. Pros and cons were discussed.  

• Include a summary of, and guidance for, incorporating input from subcommittees / advisory  committees that report to SCs. 

- DSSC recommendation for QAAC: DSSC agrees that there is important work to be done in QA, which requires community input. DSSC also understands that having too many committees does not always promote good governance. Hence we recommend that the type of work currently done by QAAC be managed by an ad-hoc committee (“ephemeral” committee) or working group, which can be called into existence for the duration of the task at hand. Membership can be composed of a small number of members from various standing advisory committees as well as experts from the community, who bring the necessary expertise for the task at hand. 

Briefings to DSSC (Carter) 

• The Director provided briefings to the DSSC on funding plans/priorities from recent NSF end-of-year supplementary funding request as well as on the EarthScope Consortium governance structure and member nomination process. The DSSC made recommendations for future committee memberships.