Hello, and welcome to my website!

My name is Julia, and I'm a computer science PhD student located in Heidelberg, Germany. My main interests lie in bioinformatics, data analysis, and machine learning. I am currently doing a PhD in bioinformatics with the CME group at HITS in Heidelberg. Additionally, I work as a part-time freelance software engineer. I love learning new things and new technologies.

If you have an interesting project for me or just want to chat, feel free to contact me!

Education

Mar 2022 – Present
Heidelberg Institute for Theoretical Studies
  • Research focus on data analysis and machine learning applications in phylogenetics.
  • For more details on what I am working on in my PhD, have a look at the Experience, the Publications and Preprints, or the Projects section.
Oct 2019 – Feb 2022
Karlsruhe Institute of Technology
  • Master's Thesis: "Empirical Numerical Properties of Maximum Likelihood Phylogenetic Inference" (Mark: 1.0)
  • Specialization subjects: Data-Intensive Computing, Machine Learning and Artificial Intelligence
  • Minor subject: Biology
  • Final mark: 1.1 (graduation with distinction)
Oct 2015 – Sep 2019
Karlsruhe Institute of Technology
  • Bachelor's Thesis: "Patient Tracking in Surgery: An Image-Guided, Markerless Approach" (Mark: 1.0)
  • Minor subject: Physics
  • Final mark: 1.9
Sep 2006 – Jul 2015
Robert-Gerwig-Gymnasium Hausach
  • General qualification for university entrance
  • Final mark: 1.3

Further Education

January 2024
DeepLearning.AI

Coursera course teaching the most important concepts of probability and statistics for Machine Learning and Data Science.

You can verify the certificate here.

November 2023
Imperial College London

Coursera course teaching the most important mathematical concepts for Machine Learning. The specialization consists of the following courses:

  • Linear Algebra
  • Multivariate Calculus
  • PCA

You can verify the certificate here.

November 2023
Stanford University

Coursera course teaching the basics of statistics and hypothesis testing.

You can verify the certificate here.

September 2020
DeepLearning.AI

Coursera course teaching the basics of medical diagnosis prediction using machine learning and deep learning.

You can verify the certificate here.

July 2020
DeepLearning.AI

Coursera specialization consisting of the following courses:

  • Convolutional Neural Networks
  • Sequence Models
  • Structuring Machine Learning Projects
  • Regularization and Optimization

You can verify the certificate here.

Note that all Coursera certificates state my maiden name Julia Schmid, but yes, that is me ☺

Experience

Mar 2022 – Present
Heidelberg Institute for Theoretical Studies

PhD in Computer Science in the interdisciplinary field of computational biology.

Machine Learning Bioinformatics Python Pandas numpy Plotly scikit-learn LightGBM Optuna Snakemake Biopython C/C++ Git GitHub Actions Jenkins RAxML-NG IQ-TREE FastTree

I am currently pursuing my PhD in Computer Science at HITS in Heidelberg in the field of Bioinformatics. The working title of my thesis is "Applications of Machine Learning and Data Science in Phylogenetics". One main aim of phylogenetics, and a research focus of my group at HITS, is to infer phylogenetic trees. Phylogenetic trees represent hypothetical evolutionary relationships between organisms or species. With advances in machine learning and deep learning over the last decades, deep learning and large-scale data analytics are becoming more popular in phylogenetics.

The goal of my thesis is to explore potential applications of machine learning and deep learning techniques to improve phylogenetic inferences, both in terms of accuracy and runtime. Furthermore, I am trying to improve current techniques and explore future directions by analyzing vast amounts of biological data and research results.

The following list is a brief summary of what I am working on in my day-to-day work and projects that I have finished.

  • Pandora: Quantification of the uncertainty of population genetics genotype datasets under dimensionality reduction. See below for more details on this project.
  • Debunking Simulations: In a joint work with researchers in France, we demonstrated that current state-of-the-art models of sequence evolution in phylogenetics cannot simulate empirical-like data. See below for more details on this project.
  • Pythia: Predicting the difficulty of phylogenetic analysis. See below for more details on this project.
  • Numerical Analysis of thresholds in phylogenetic inference tools. See below for more details on this project.
  • My group develops a C library providing frequently used phylogenetic inference functionality (Coraxlib). I am currently working on setting up a CI pipeline in Jenkins for this project.

Jun 2024 – Sep 2024
Apple

Internship with one of Apple's MLPT teams.

Python TypeScript React TODO :-)

To be updated :-)

I'm very happy to share that I am joining Apple for an AIML-Internship this summer!

Oct 2021 – Mar 2024
Freelance

Major refactoring and migration of a business-critical Python 2 codebase to Python 3.

Python MySQL Redis FastAPI GraphQL Docker docker-compose Apache Kafka

In addition to pursuing my PhD in computer science, I worked as a part-time freelance software engineer. The idea was to gain further experience in the industry with frameworks and technologies I don't use in my research.

The primary task was to refactor and migrate a business-critical, yet undocumented and untested, Python 2 codebase to Python 3. The original developers were unavailable, and there was no specification of the expected functionality. The challenge was circular: without tests, changes made during refactoring could not be validated, but writing tests required first understanding and refactoring the codebase. The initial structure, with database interaction logic scattered throughout, did not allow for testing. Potential logical bugs posed another problem, as it was often unclear whether they were actual bugs or undocumented, intended behavior. I made substantial progress in refactoring, adding unit tests and documentation, reducing technical debt, and improving maintainability. I also identified potential bugs, data inconsistencies, and opportunities for speedup and complexity reduction.

In various side projects for the same company, I worked with a few additional technologies, including FastAPI, GraphQL, Docker, docker-compose, and Apache Kafka.

Dec 2020 – Feb 2022
Heidelberg Institute for Theoretical Studies

Large-scale analysis of numerical properties of phylogenetic inference using the maximum likelihood method.

Python Snakemake Pandas Plotly SQLite

For more information, see the Projects section below.

Apr 2020 – May 2020
Freelance

Development of a web-based filesharing tool.

Python Django JavaScript HTML CSS AWS S3

I developed a web-based filesharing tool for a company organizing large events.

The web app allows file upload and file sharing, including support for share links that grant non-registered users access to certain files. All files are stored in an AWS S3 bucket to ensure scalability.

The project was developed using the Django web framework, and the frontend was implemented using HTML, CSS, and JavaScript.

Oct 2019 – Mar 2020
ArtiMinds Robotics GmbH

Development of a web-based analytics software suite for large-scale robotics data.

JavaScript TypeScript HTML CSS MariaDB Python Pandas Matplotlib

ArtiMinds' main product is a robot programming suite (RPS) for industrial robots. Instead of writing code, users of the RPS formulate the task of the robot using pre-programmed building blocks. The RPS continuously monitors the task execution while recording measurements such as velocity and force. This data can be directly transferred to the second key product of ArtiMinds: the LAR (Learning & Analytics for Robots). The LAR is a web-based monitoring interface that displays the recorded measurements and task executions of the robot.

During my work as a working student, I was part of the LAR development team. This work included writing tests for the previously largely untested LAR frontend and backend codebase.

I further analyzed velocity and force data of an industrial robot for customers to identify errors and optimization potential.

Aug 2017 – Oct 2019
Walk In Fitness (KIT University Gym)

As a balance to sitting in the library programming and studying, I wanted to do something other than computer science. So I decided to do an internship and later work in the university gym. I helped clients achieve their fitness goals and did some basic nutrition counselling.

Apr 2018 – Jul 2018
Chair for Embedded Systems, KIT

Tutor for the subject “Digitaltechnik und Entwurfsverfahren” (Digital Technology and Design Methods).

Study aid for 28 undergraduate computer science students.

Jan 2017 – Sep 2017
Institute for Anthropomatics and Robotics, KIT

Working with the ArmarX robot programming framework.

Preparation of experiments for marker-based human motion capture.

Mar 2016 – Sep 2016
Geophysical Institute, KIT

Writing lecture notes for the lecture "Introduction to Geophysics".

LaTeX

I wrote lecture notes for the lecture "Introduction to Geophysics". The notes are used by the students as a study aid for the exam.

The resulting notes comprise 70 pages, including 50 custom-made graphics, and are available on GitHub.

Projects

This is an (incomplete) list of projects I worked on or that I am still working on.

conda-forge

I contribute and update conda-forge recipes.


In my day-to-day work I rely on a lot of open-source software, and I think it's important to contribute to open-source projects myself. So far my contributions mainly concern conda-forge recipes. Namely, I contributed the following:

  • addition of the PyPythia feedstock (maintainer)
  • addition of the apricot-select feedstock (maintainer)
  • update of the pomegranate feedstock (maintainer)
  • update of the scikit-allel feedstock (maintainer)
  • update of the r-curl feedstock

Additionally, all my software projects are available open-source on GitHub.

Python Eigensoft Plotly Pandas Scikit-Learn GitHub Actions

Pandora is a tool to estimate the uncertainty or stability of dimensionality reduction methods applied to genotype data in population genetics.

Repository

Paper (preprint)


Genotype datasets typically contain a large number of single nucleotide polymorphisms for a comparatively small number of individuals. To identify similarities between individuals and to infer an individual’s origin or membership to a cultural group, dimensionality reduction techniques are routinely deployed. However, inherent (technical) difficulties such as missing or noisy data need to be accounted for when analyzing a lower dimensional representation of genotype data, and the intrinsic uncertainty of such analyses should be reported in all studies. However, to date, there exists no stability assessment technique for genotype data that can estimate this uncertainty.

To address this issue, I developed Pandora, a stability estimation framework for genotype data based on bootstrapping. Pandora computes an overall score to quantify the stability of the entire embedding, infers per-individual support values, and also deploys a k-means clustering approach to assess the uncertainty of assignments to potential cultural groups. In addition to this bootstrap-based stability estimation, Pandora offers a sliding-window stability estimation for whole-genome data. In the respective publication, I demonstrate the usage and utility of Pandora for studies that rely on dimensionality reduction techniques using published empirical and simulated datasets.

Pandora is implemented in Python 3 and unit-tested, and I set up a CI pipeline in GitHub Actions.

The tool is available on GitHub.

The paper is currently under review; a preprint is available on bioRxiv.

Python Pandas Scikit-Learn LightGBM Optuna Plotly

In a joint work with researchers in France, we demonstrated that current state-of-the-art models of sequence evolution in phylogenetics cannot simulate empirical-like data.

Paper


Simulating multiple sequence alignments (MSAs) using probabilistic models of sequence evolution plays an important role in the evaluation of phylogenetic inference tools and is crucial to the development of novel learning-based approaches for phylogenetic reconstruction, for instance, neural networks. These models and the resulting simulated data need to be as realistic as possible to be indicative of the performance of the developed tools on empirical data and to ensure that neural networks trained on simulations perform well on empirical data. Over the years, numerous models of evolution have been published with the goal to represent as faithfully as possible the sequence evolution process and thus simulate empirical-like data. In this study, we simulated DNA and protein MSAs under increasingly complex models of evolution with and without insertion/deletion (indel) events using a state-of-the-art sequence simulator. We assessed their realism by quantifying how accurately supervised learning methods are able to predict whether a given MSA is simulated or empirical.

Our results show that we can distinguish between empirical and simulated MSAs with high accuracy using two distinct and independently developed classification approaches across all tested models of sequence evolution. Our findings suggest that the current state-of-the-art models fail to accurately replicate several aspects of empirical MSAs, including site-wise rates as well as amino acid and nucleotide composition.

This is joint work with my colleagues at HITS and a team of researchers in Lyon. Johanna Trost, Dimitri Höhler, and I all contributed equally to this work.

The peer-reviewed publication is available at Molecular Biology and Evolution.

Python Pandas Scikit-Learn LightGBM TreeLite Snakemake RAxML-NG GitHub Actions

Pythia is a machine learning model that predicts the difficulty of phylogenetic inferences on a dataset based on its multiple sequence alignment.

Repository

Paper


Phylogenetic analyses under the Maximum-Likelihood (ML) model are time- and resource-intensive. To adequately capture the vastness of tree space, one needs to infer multiple independent trees. On some datasets, multiple tree inferences converge to similar tree topologies, on others to multiple, topologically highly distinct yet statistically indistinguishable topologies. At present, no method exists to quantify and predict this behavior. We introduce a method to quantify the degree of difficulty for analyzing a dataset and present Pythia, a Random Forest Regressor that accurately predicts this difficulty. Pythia predicts the degree of difficulty of analyzing a dataset prior to initiating ML-based tree inferences. Pythia can be used to increase user awareness with respect to the amount of signal and uncertainty to be expected in phylogenetic analyses, and hence inform an appropriate (post-)analysis setup. Further, it can be used to select appropriate search algorithms for easy-, intermediate-, and hard-to-analyze datasets.

In the current version of Pythia we replaced the Random Forest Regressor with boosted trees implemented in LightGBM and retrained the predictor on approximately 10k MSAs, including DNA, AA, and morphological datasets.

For convenient usage, we provide the command line interface PyPythia, which is built around Pythia. PyPythia is implemented in Python and unit-tested, and I set up CI using GitHub Actions. PyPythia is available on GitHub.

Additionally, CPythia wraps Pythia in a C library as a plugin for the Coraxlib project. This integration is a crucial part of our group's latest version of RAxML-NG, adaptive RAxML-NG.

The tool is available on GitHub.

The peer-reviewed publication is available at Molecular Biology and Evolution.

Python Snakemake Plotly RAxML-NG IQ-Tree FastTree

I performed a large-scale numerical analysis of the influence of threshold parameters on the runtime and results of Maximum Likelihood phylogenetic inferences.

Paper


Maximum Likelihood (ML) is a widely used phylogenetic inference model. ML implementations heavily rely on numerical optimization routines that use internal numerical thresholds to determine convergence. We systematically analyze the impact of these threshold settings on the log-likelihood and runtimes for ML tree inferences with RAxML-NG, IQ-TREE, and FastTree on empirical datasets. We provide empirical evidence that we can substantially accelerate tree inferences with RAxML-NG and IQ-TREE by changing the default values of two such numerical thresholds. At the same time, altering these settings does not significantly impact the quality of the inferred trees. We further show that increasing both thresholds accelerates the RAxML-NG bootstrap without influencing the resulting support values. For RAxML-NG, increasing the likelihood thresholds ϵ_LnL and ϵ_brlen to 10 and 10^3, respectively, results in an average tree inference speedup of 1.9 ± 0.6 on Data collection 1, 1.8 ± 1.1 on Data collection 2, and 1.9 ± 0.8 on Data collection 2 for the RAxML-NG bootstrap. Increasing the likelihood threshold ϵ_LnL to 10 in IQ-TREE results in an average tree inference speedup of 1.3 ± 0.4 on Data collection 1 and 1.3 ± 0.9 on Data collection 2.
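As a rough illustration of how an "average speedup ± spread" figure like the ones above can be computed, here is a minimal sketch. It assumes that the speedup of a dataset is its runtime with default thresholds divided by its runtime with the changed thresholds, and that the value after ± is a sample standard deviation; both are assumptions for this example, and all runtimes and names are made up.

```python
from statistics import mean, stdev


def speedups(default_runtimes, tuned_runtimes):
    """Per-dataset speedup: default runtime divided by runtime with changed thresholds."""
    if len(default_runtimes) != len(tuned_runtimes):
        raise ValueError("need exactly one runtime pair per dataset")
    return [d / t for d, t in zip(default_runtimes, tuned_runtimes)]


def summarize(values):
    """Return (mean, sample standard deviation) of the given speedups."""
    return mean(values), stdev(values)


# Hypothetical runtimes in seconds for three made-up datasets.
default_runtimes = [120.0, 300.0, 45.0]
tuned_runtimes = [60.0, 200.0, 30.0]

per_dataset = speedups(default_runtimes, tuned_runtimes)
avg, spread = summarize(per_dataset)
print(f"speedup: {avg:.1f} ± {spread:.1f}")
```

A speedup above 1 means the changed thresholds made the inference faster; a large spread relative to the mean indicates that the gain varies strongly between datasets.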

Our research comprises four studies. I conducted Study 1 during my Master's thesis with the CME group at HITS. You can find the full thesis here.

The peer-reviewed publication is available at Bioinformatics Advances.

Python Django HTML CSS Bootstrap

I implemented a private, web-based diary to write down memories and store images of vacations and other cool activities.

Repository


To remember vacations, holidays, and trips, I implemented a diary website using the Django web framework. This website also stores images associated with certain trips and image descriptions, so I can show the best images to friends and family. It also includes a map, so I can see where on this beautiful earth I have already travelled to ☺ In its latest version, I can also upload files, e.g. GPS tracks of hikes.

The tool is available on GitHub.

Python Django HTML CSS Bootstrap

I implemented a web-based tool to organize courses for the computer science master's degree at KIT.

Repository


For my master’s studies at KIT I implemented a web-based tool to organize the courses I intend to take. The tool checks if the planned courses satisfy all requirements to get the master’s degree. It further shows a schedule of exam dates, as well as an overview of all grades. Since the computation of the final grade at KIT is a bit tedious, I also implemented an automatic computation of the final grade, as well as the average grades per module.
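The grade computation mentioned above can be sketched as a credit-weighted average. This is a simplified illustration, not the actual examination rules: it assumes every course counts in proportion to its credits and ignores any rounding or truncation rules, and all course names, credits, and grades are made up.

```python
from collections import defaultdict

# (module, course, credits, grade); every entry is a made-up example.
courses = [
    ("Machine Learning", "Deep Learning", 6, 1.0),
    ("Machine Learning", "Pattern Recognition", 3, 1.7),
    ("Data-Intensive Computing", "Parallel Algorithms", 5, 1.3),
]


def weighted_average(entries):
    """Credit-weighted mean of (credits, grade) pairs."""
    total_credits = sum(credits for credits, _ in entries)
    if total_credits == 0:
        raise ValueError("no credits to average over")
    return sum(credits * grade for credits, grade in entries) / total_credits


def module_averages(rows):
    """Average grade per module, weighted by course credits."""
    by_module = defaultdict(list)
    for module, _course, credits, grade in rows:
        by_module[module].append((credits, grade))
    return {module: weighted_average(e) for module, e in by_module.items()}


def final_grade(rows):
    """Overall grade as the credit-weighted mean over all courses."""
    return weighted_average([(credits, grade) for _m, _c, credits, grade in rows])


for module, avg in module_averages(courses).items():
    print(f"{module}: {avg:.2f}")
print(f"Final grade: {final_grade(courses):.2f}")
```

Keeping the per-module and overall computation on the same helper means a change to the weighting rule only has to be made in one place.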

The tool is available on GitHub.

Python OpenCV scikit-image

I developed and implemented a markerless patient movement tracking algorithm for cochlear implant surgeries.

Repository

Paper


As my bachelor's thesis, at the Intelligent Process Automation and Robotics Lab (IPR) at KIT, I implemented a system for markerless patient movement tracking during cochlear implant surgeries. The approach uses only the images obtained by the microscopic camera used by the surgeon. I implemented the algorithm using Python and the image processing frameworks OpenCV and scikit-image. It also includes a neural network for semantic image segmentation.

My bachelor's thesis is available upon request only as it contains sensitive patient images.

The tool is available on GitHub.

The peer-reviewed publication is available at Frontiers in Surgery.

Python Django HTML CSS Bootstrap

I implemented a web-based recipe website to store and share recipes with friends and family.

Repository


This project was inspired by my love for baking. It is meant for storing my friends' and my own ideas and recipes. Different users can create new recipes, categorize them, or search for a specific recipe. The website also includes functionality to filter recipes by ingredients and categories.

This website is an ongoing project and I keep adding new features. The latest features I added were a shopping list (which is especially useful when planning to bake multiple different recipes) and an idea collection where I can save interesting recipes or ideas I come up with.

The tool is available on GitHub.

Java Google Web Toolkit JavaScript HTML CSS

We implemented a web-based Lambda Calculus IDE and interpreter for the programming paradigm lecture at KIT.

Repository


As part of my undergraduate studies, we implemented a browser-based lambda calculus IDE and interpreter in a team of six students. We used Java with the Google Web Toolkit, JavaScript, and CSS and wrote about 10,000 lines of code plus 6,000 lines of tests. The project was graded with mark 1.0 (best possible grade). I mostly implemented the interactive frontend to explore the lambda terms and their reduction steps. I also wrote many of the tests for the controller layer (MVC architecture).

The tool is available on GitHub.

Publications & Preprints

Publications

Simulations of Sequence Evolution: How (Un)realistic They Are and Why

J. Trost*, J. Haag*, D. Höhler*, L. Jacob, A. Stamatakis, B. Boussau (2024) Simulations of Sequence Evolution: How (Un)realistic They Are and Why. Molecular Biology and Evolution, 41(1). https://doi.org/10.1093/molbev/msad277
* equal contribution

Adaptive RAxML-NG: Accelerating Phylogenetic Inference under Maximum Likelihood using dataset difficulty

A. Togkousidis, A. M. Kozlov, J. Haag, D. Höhler, A. Stamatakis (2023) Adaptive RAxML-NG: Accelerating Phylogenetic Inference under Maximum Likelihood using dataset difficulty. Molecular Biology and Evolution, 40(10). https://doi.org/10.1093/molbev/msad227

The Free Lunch is not over yet – Systematic Exploration of Numerical Thresholds in Maximum Likelihood Phylogenetic Inference

J. Haag, L. Hübner, A. M. Kozlov, A. Stamatakis (2023) The Free Lunch is not over yet – Systematic Exploration of Numerical Thresholds in Maximum Likelihood Phylogenetic Inference. Bioinformatics Advances, 3(1). https://doi.org/10.1093/bioadv/vbad124

From Easy to Hopeless - Predicting the Difficulty of Phylogenetic Analyses

J. Haag, D. Höhler, B. Bettisworth, A. Stamatakis (2022) From Easy to Hopeless - Predicting the Difficulty of Phylogenetic Analyses. Molecular Biology and Evolution, 39(12). https://doi.org/10.1093/molbev/msac254

Continuous Feature-Based Tracking of the Inner Ear for Robot-Assisted Microsurgery

C. Marzi, T. Prinzen, J. Haag, T. Klenzner, F. Mathis-Ullrich (2021) Continuous Feature-Based Tracking of the Inner Ear for Robot-Assisted Microsurgery. Front. Surg., 8(742160). https://doi.org/10.3389/fsurg.2021.742160

Preprints

Pandora: A Tool to Estimate Dimensionality Reduction Stability of Genotype Data

J. Haag, A. I. Jordan, A. Stamatakis (2024) Pandora: A Tool to Estimate Dimensionality Reduction Stability of Genotype Data. bioRxiv. https://doi.org/10.1101/2024.03.14.584962

Predicting Phylogenetic Bootstrap Values via Machine Learning

J. Wiegert, D. Höhler, J. Haag, A. Stamatakis (2024) Predicting Phylogenetic Bootstrap Values via Machine Learning. bioRxiv. https://doi.org/10.1101/2024.03.04.583288

A representative Performance Assessment of Maximum Likelihood based Phylogenetic Inference Tools

D. Höhler, J. Haag, A. M. Kozlov, A. Stamatakis (2022) A representative Performance Assessment of Maximum Likelihood based Phylogenetic Inference Tools. bioRxiv. https://doi.org/10.1101/2022.10.31.514545

Talks

Educated Bootstrap Guesser: Predicting Phylogenetic Bootstrap Values

legend2024 (FORTH Heraklion, Crete, May 2024)
In this talk, I presented the work of my master's student Julius.
Recording
Talk slides

Simulations of Sequence Evolution: How (Un)realistic They Are and Why

legend2024 (FORTH Heraklion, Crete, May 2024)
Recording
Talk slides

Predicting the Difficulty of Phylogenetic Analyses

ERGA BioGenome Analysis and Applications Seminar (Online Seminar, November 2023)
Recording
Talk slides

Predicting the Difficulty of Phylogenetic Analyses

Peder Sather/Invertomics Symposium “Progress and Development in Phylogenetic Methods” (University of Oslo, Norway, March 2023)
Talk slides

Predicting the Difficulty of a Phylogenetic Analysis

EVOLCYP Workshop on Biodiversity Genomics (University of Cyprus, Cyprus, September 2022)

Skills

Programming Languages & Tools

I feel most comfortable working with Python, but I have also worked with many other programming languages. Since most of my hobby projects are websites, I am also very familiar with HTML5 and CSS. During my work at ArtiMinds I wrote code in JavaScript and TypeScript, and during my computer science studies I also coded in Java and C. I recently also learned the basics of C++ and R.

For all my projects I'm using Git, and I know the basic usage (commit, push, merge, rebase, ...). However, advanced Git features still feel like magic to me and I have to google a lot of commands ☺︎

Python

HTML5 & CSS

Git

SQL

JavaScript & TypeScript

Java

C++

C

R



Frameworks

Here is a list of frameworks I frequently work with and that I would say I know my way around.

Pandas Plotly Dash Matplotlib Numpy scikit-learn LightGBM Django Snakemake Biopython


Languages

  • German (Native)
  • English (Professional)
  • Spanish (Elementary)
  • Greek (Elementary)

Miscellaneous

Hobbies
  • Baking and cooking! You can see pictures of my baked goods here ☺︎
    ...yes I will bring cake to the office ☺︎
  • Sports, especially bouldering, cycling, and tennis.
  • Drinking good coffee ☺︎ (I even have a barista certificate!)
  • Meeting friends
  • Reading a good book