Our recommendations for software, literature and data sources for the implementation of Data Science projects

Welcome to the Data Science materials collection, curated by GEOMAR DSU. Here we provide links to and descriptions of materials that we find useful in the context of Data Science: courses, books, use case publications, datasets, etc. We are constantly updating this collection as we come across relevant contributions. Also, let us know your recommendations; we'll be happy to add them here.

Helmholtz Summer School

"From Data to Knowledge": Check out the inspiring and varied program for the 2-week event: https://events.hifis.net/event/1590/program
The Summer School will take place virtually from 16 – 27 September 2024 and is open to all researchers and staff in the Helmholtz Association.

Trainings - Material

Courses

Internal:

External:

  • HiDA: The Helmholtz Information & Data Science Academy is Germany's largest continuing education network in the field of information and data science.
  • MATLAB support: If you have specific questions or problems with your MATLAB code, there are two people at MathWorks (the company behind MATLAB) who support Helmholtz centers: Mihaela Jarema (mjarema@mathworks.com) and Kostas Leptokaropoulos (kleptoka@mathworks.com).
  • Seminar ML in Earth sciences by HEREON: Every second Tuesday, 3 pm.
  • Data Carpentry workshops: Regularly scheduled workshops on various tasks and software; newsletter subscription available.
  • Helmholtz-AI-Consultants Earth and Environment: Get a consultant to help guide you through your ML project.
  • Open Campus SH: Online and offline (in-person) courses on ML and many other topics.
  • Roboflow: Online Computer Vision library & tutorials
  • HPC courses at the university: Introduction to the HPC Infrastructure and how to use it at CAU Kiel.
  • Data Train Uni Bremen Research Alliance: Online courses are open to everyone interested in improving their data skills, and hands-on workshops can be opened to post-doctoral researchers, scientific staff or advanced students if places are available. You can find courses on Big Data Handling, Machine Learning, Python Basics and many more.
  • Online free Python courses

    It is difficult to recommend a particular course without knowing your programming background and the application you have in mind. If you already have programming experience, I find it most useful to simply start with a Python cheat sheet; it can quite easily replace a beginners' course, and then you can move on to more advanced courses.

    Some example cheat sheets:

    Online cheat sheet

    pdf cheat sheet (a bit messy but also helpful)

    pdf cheat sheet

    In my experience it is good to look for a course that is geared towards data science, because Python is so versatile that some courses cover many topics that are not necessarily useful for natural scientists.

  • A highly recommended self-study course on "Multivariate Exploratory Data Analysis" via Open Classrooms can be found here.

    The carpentries platform

    This is the platform that we also use to teach our course; it is generally well maintained and validated.

    Software carpentry

    This is a Python course geared towards data science; it teaches by applying Python to a real-world problem.

    Kaggle platform

    Kaggle is geared towards data science and datasets, and offers beginners' courses in Python:

    Python homepage

    The Python community itself offers a lot of material for learning:

     

  • More platforms offering a huge range of courses. Most of them are free, especially if you do not take the exams or need an official certificate:

     

    EdX platform

    Georgia Tech Python course: very high quality. The beginners' courses are very good but also very slow if you already have coding experience; it helps to speed up the videos…

    Coursera platform

    Like EdX, Coursera offers a lot of courses, including Python courses for beginners and for other applications:

    Codecademy platform

    Again, a huge range of free courses.

    Udemy platform

    Another course platform that also has a good reputation.

     

Books

Machine learning textbooks

  • Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition, by Aurélien Géron. Released September 2019. Publisher: O'Reilly Media, Inc.

Python textbooks

  • "Python for Data Analysis, 3E" by Wes McKinney: internal recommendation as "bible for pandas and numpy" (pdf-link: https://oceanofpdf.com/authors/wes-mckinney/pdf-python-for-data-analysis-3rd-edition-download/?id=001835841878)
  • "Python Crash Course: A Hands-On, Project-Based Introduction to Programming" by Eric Matthes

  • "Automate the Boring Stuff with Python: Practical Programming for Total Beginners" by Al Sweigart

  • "Python for Everybody: Exploring Data in Python 3" by Charles Severance

  • "Learning Python, 5th Edition" by Mark Lutz

  • "Python Programming: An Introduction to Computer Science, 3rd Edition" by John Zelle

  • "Introduction to Python for Science and Engineering" by David J. Pine

  • "Python Basics: A Practical Introduction to Python 3" by Real Python

  • "Think Python: How to Think Like a Computer Scientist" by Allen B. Downey

  • "Python 101: A Crash Course in Python Programming" by Mike Driscoll

  • "Python Programming for the Absolute Beginner, 3rd Edition" by Michael Dawson

Predictive (habitat) mapping

 

Resources

Selection of Data sets & portals

Marine Data Portal: Shows available data from your research area from all DAM partners: bathymetry, sediment and observation datasets, CONMAR datasets.

PANGAEA: Marine and environmental datasets published in the PANGAEA World Data Center.

Geoserver: Publication and sharing of geodata

OSIS: All information about expeditions, numerical models and experiments.

ZPL: Search for the rock samples and sediment cores stored at GEOMAR.

WDC Climate: Published datasets in the World Data Center Climate at the German Climate Computing Center (DKRZ).

GEOMAR OPeNDAP Service: Data from peer-reviewed articles with results from numerical models.

DSHIP Underway Data of RVs: The recorded underway data of the German research vessels are transferred ashore and archived in the long term. They can be accessed and exported via interlinked web services at GEOMAR, BSH and AWI.

Google Earth Engine: Find, download and process global satellite data.

USGS Earth Explorer: Source for satellite data; choice from many different satellites; ability to import shape files to export imagery for specific areas.

Boknis Eck Time series data: Monthly samples since 1957 at the time series station Boknis Eck in the western Baltic Sea.

IHO DCDB Bathymetry Data Viewer: Collection of bathymetric data available worldwide, including data from the major international bathymetric data repositories.

Real time Data: Real-time data from scientific platforms installed by GEOMAR research groups.

BIS Biosample Management: Biological samples from GEOMAR expeditions

MDI DE Portal: Platform for marine geodata from Marine Data Infrastructure Germany

OBIS: Marine Biodiversity Database

IMLGS from NOAA: Marine and lacustrine geological samples

EarthChem: Global collection of seabed geochemical samples

Kaggle datasets: AI-ready datasets for a wide range of applications

 

Data Viewer

Digital Earth Viewer: Visualizes spatial time series datasets in real time. The viewer is able to handle different types of data and facilitates interactive exploration of different datasets in one place. As an in-house product, direct support can be provided.

ARENA 2: Explore your data in an in-house projection dome. It visualizes 2D to 4D geodata, model runs, large-format videos and photos, and enables telepresence.

BELUGA: Visualization of data from different platforms. Besides the visualization of platform data, an essential part of BELUGA is the underwater network (communication and navigation under water).

 

Tools

Geospatial data

ArcGIS Add-on Benthic Terrain Modeller: Tool compilation for the analysis and classification of benthic terrain

GeoPandas: GeoPandas is a popular open source Python library for working with geospatial data that allows users to easily manipulate, analyze, and visualize geographic information within the Python environment.

QGIS: QGIS is a free and open source geographic information system (GIS) that allows users to create, edit, visualize and analyze geographic data.

GDAL: GDAL (Geospatial Data Abstraction Library) is an open source software library that provides a set of tools and libraries for working with raster and vector geospatial data formats and enables versatile geospatial data editing and conversion.

R landscape metrics: R landscape metrics are a collection of quantitative measures and statistics used in the R programming language to assess and analyze the spatial patterns and characteristics of landscapes, making them a valuable tool for landscape ecology and land use planning.

Computer Vision and Image processing

OpenCV: Python OpenCV is a powerful open-source computer vision library that allows developers to perform a wide range of image and video processing tasks using the Python programming language.

COLMAP: COLMAP (Structure-from-Motion and Multi-View Stereo) is a computer vision software package that specializes in reconstructing 3D scenes from 2D images, making it valuable for tasks like photogrammetry and 3D modeling.

Metashape: Metashape is a professional photogrammetry software that allows users to create high-quality 3D models and maps from a collection of 2D images.

Python packages

Pandas: The Python package pandas is a powerful and popular data manipulation and analysis library that provides easy-to-use data structures and tools for working with structured data.

Bokeh: The Python package Bokeh is a data visualization library that provides a simple and interactive way to create web-based visualizations for modern browsers.

HoloViz: The Python HoloViz package is a collection of open-source data visualization and exploration tools that allow users to quickly create interactive visualizations with minimal code.

Panel: The Python Panel package is a library that allows users to easily create interactive web-based dashboards and applications from Python code, supporting a wide range of data sources and visualization tools.

Blender: Blender is a versatile and open source 3D computer graphics toolset that supports modeling, animation, rendering, compositing and much more.

Other

D3.js: Excellent JavaScript library for data visualization (more precisely, DOM manipulation). Comparatively low level with a steep learning curve.

Machine Learning Playground: Machine Learning Playground is an open-source project with the goal of providing students and interested parties with a guided introduction to the complex world of machine learning.

Hands on ML: A series of Jupyter notebooks that walk through the basics of machine learning and deep learning in Python with Scikit-Learn, Keras, and TensorFlow 2.

R Basics — Everything You Need to Know to Get Started with R: An introductory "Towards Data Science" article on working with R.

Seeing Theory: Seeing Theory is an interactive online resource that provides an intuitive and visual approach to understanding complex probability and statistics concepts.

Distill: Distill is an open access online publishing platform that emphasizes clear, interactive, and visually appealing articles to effectively communicate research findings and concepts across academic disciplines.

Colah: Colah is the blog of a prominent researcher and blogger in the field of artificial intelligence, known for his insightful and accessible writing on deep learning and neural networks.

Scientific color maps: Various citable color maps, available for download, designed for different scientific visualization applications.

Environmental Data Science book: EDS book showcases and supports the publication of data, research and open-source tools using Data Science and AI for characterizing, monitoring and/or modelling a wide diversity of environmental systems.