Hello, and welcome to my website!
My name is Julia, and I'm a computer science PhD student located in Heidelberg, Germany. My main interests lie in bioinformatics, data analysis, and machine learning. I am currently doing a PhD in bioinformatics with the CME group at HITS in Heidelberg. Additionally, I work as a part-time freelance software engineer. I love learning new things and new technologies.
If you have an interesting project for me or just want to chat, feel free to contact me!
Education
- Research focus on data analysis and machine learning applications in phylogenetics.
- For more details on what I am working on in my PhD, have a look at the Experience, the Publications and Preprints, or the Projects section.
- Master's Thesis: "Empirical Numerical Properties of Maximum Likelihood Phylogenetic Inference" (Mark: 1.0)
- Specialization subjects: Data-Intensive Computing, Machine Learning and Artificial Intelligence
- Minor subject: Biology
- Final mark: 1.1 (graduation with distinction)
- Bachelor's Thesis: "Patient Tracking in Surgery: An Image-Guided, Markerless Approach" (Mark: 1.0)
- Minor subject: Physics
- Final mark: 1.9
- General qualification for university entrance
- Final mark: 1.3
Further Education
Coursera course teaching the most important concepts of probability and statistics for Machine Learning and Data Science.
You can verify the certificate here.
Coursera course teaching the most important mathematical concepts for Machine Learning. The specialization consists of the following courses:
- Linear Algebra
- Multivariate Calculus
- PCA
You can verify the certificate here.
Coursera course teaching the basics of statistics and hypothesis testing.
You can verify the certificate here.
Coursera course teaching the basics of medical diagnosis prediction using machine learning and deep learning.
You can verify the certificate here.
Coursera specialization consisting of the following courses:
- Convolutional Neural Networks
- Sequence Models
- Structuring Machine Learning Projects
- Regularization and Optimization
You can verify the certificate here.
Experience
PhD in Computer Science in the interdisciplinary field of computational biology.
Machine Learning Bioinformatics Python Pandas numpy Plotly scikit-learn LightGBM Optuna Snakemake Biopython C/C++ Git GitHub Actions Jenkins RAxML-NG IQ-TREE FastTree
I am currently pursuing my PhD in Computer Science at HITS in Heidelberg in the field of bioinformatics. The working title of my thesis is "Applications of Machine Learning and Data Science in Phylogenetics". One main aim of phylogenetics, and a research focus of my group at HITS, is to infer phylogenetic trees. Phylogenetic trees represent hypothetical evolutionary relationships between organisms or species. With advances in machine learning and deep learning over the last decades, deep learning and large-scale data analytics are becoming more popular in phylogenetics.
The goal of my thesis is to explore potential applications of machine learning and deep learning techniques to improve phylogenetic inferences, both in terms of accuracy and runtime. Furthermore, I am trying to improve current techniques and explore future directions by analyzing vast amounts of biological data and research results.
The following list is a brief summary of what I am working on in my day-to-day work and projects that I have finished.
- Pandora: Quantification of the uncertainty of population genetics genotype datasets under dimensionality reduction. See below for more details on this project.
- Debunking Simulations: In a joint work with researchers in France, we demonstrated that current state-of-the-art models of sequence evolution in phylogenetics cannot simulate empirical-like data. See below for more details on this project.
- Pythia: Predicting the difficulty of phylogenetic analysis. See below for more details on this project.
- Numerical Analysis of thresholds in phylogenetic inference tools. See below for more details on this project.
- My group develops a C library providing frequently used phylogenetic inference functionality (Coraxlib). I am currently working on setting up a CI pipeline in Jenkins for this project.
Internship with one of Apple's MLPT teams.
Python TypeScript React TODO :-)
To be updated :-)
I'm very happy to share that I am joining Apple for an AIML-Internship this summer!
Major refactoring and migration of a business-critical Python 2 codebase to Python 3.
Python MySQL Redis FastAPI GraphQL Docker docker-compose Apache Kafka
In addition to pursuing my PhD in computer science, I worked as a part-time freelance software engineer. The idea was to gain further experience in the industry with frameworks and technologies I don't use in my research.
The primary task involved refactoring and migrating a business-critical, yet undocumented and untested Python 2 codebase to Python 3. The original developers were unavailable, and there was no specification of the expected functionality. The challenge was circular: refactoring safely requires tests to validate changes, but writing tests first requires understanding and refactoring the codebase. The initial structure, with scattered database interaction logic, did not allow for testing. Potential logical bugs presented another issue, as it was unclear whether they were actual bugs or undocumented, intended behavior. I made substantial progress in refactoring, adding unit tests and documentation, reducing technical debt, and enhancing maintainability. I further identified potential bugs, data inconsistencies, and opportunities for speedup and complexity reduction.
In various side projects for the same company, I worked with a few additional technologies, including FastAPI, GraphQL, Docker, docker-compose, and Apache Kafka.
Large scale analysis of numerical properties of phylogenetic inference using the maximum likelihood method.
Python Snakemake Pandas Plotly SQLite
For more information, see the Projects section below.
Development of a web-based filesharing tool.
Python Django JavaScript HTML CSS AWS S3
I developed a web-based filesharing tool for a company organizing large events.
The web app allows file upload and file sharing, including support for share links that grant non-registered users access to certain files. All files are stored in an AWS S3 bucket to ensure scalability.
The project was developed using the Django web framework, and the frontend was implemented using HTML, CSS, and JavaScript.
Development of a web-based analytics software suite for large-scale robotics data.
JavaScript TypeScript HTML CSS MariaDB Python Pandas Matplotlib
ArtiMinds' main product is a robot programming suite (RPS) for industrial robots. Instead of writing code, users of the RPS formulate the task of the robot using pre-programmed building blocks. The RPS continuously monitors the task execution while recording measurements such as velocity and force. This data can be directly transferred to the second key product of ArtiMinds: the LAR (Learning & Analytics for Robots). The LAR is a web-based monitoring interface that displays the recorded measurements and task executions of the robot.
During my work as a working student, I was part of the LAR development team. This work included writing tests for the previously largely untested LAR frontend and backend codebase.
I further analyzed velocity and force data of an industrial robot for customers to identify errors and optimization potential.
As a balance to sitting in the library programming and studying, I wanted to do something other than computer science. So I decided to do an internship and later work in the university gym. I helped clients achieve their fitness goals and did some basic nutrition counselling.
Tutor for the subject “Digitaltechnik und Entwurfsverfahren” (Digital Technology and Design Methods).
Study aid for 28 undergraduate computer science students.
Working with the ArmarX robot programming framework.
Preparation of experiments for marker-based human motion capture.
Writing lecture notes for the lecture "Introduction to Geophysics".
LaTeX
I wrote lecture notes for the lecture "Introduction to Geophysics". The notes are used by the students as a study aid for the exam.
The resulting notes comprise 70 pages, including 50 custom-made graphics and are available on GitHub.
Projects
This is an (incomplete) list of projects I worked on or that I am still working on.
I contribute and update conda-forge recipes.
In my day-to-day work I rely on a lot of open-source software, and I think it's important to contribute to open-source projects myself. So far my contributions mainly concern conda-forge recipes. Namely, I contributed the following:
- addition of the PyPythia feedstock (maintainer)
- addition of the apricot-select feedstock (maintainer)
- update the pomegranate feedstock (maintainer)
- update the scikit-allel feedstock (maintainer)
- update the r-curl feedstock
Additionally, all my software projects are available open-source on GitHub.
Pandora is a tool to estimate the uncertainty or stability of dimensionality reduction methods applied to genotype data in population genetics.
Genotype datasets typically contain a large number of single nucleotide polymorphisms for a comparatively small number of individuals. To identify similarities between individuals and to infer an individual’s origin or membership to a cultural group, dimensionality reduction techniques are routinely deployed. However, inherent (technical) difficulties such as missing or noisy data need to be accounted for when analyzing a lower dimensional representation of genotype data, and the intrinsic uncertainty of such analyses should be reported in all studies. However, to date, there exists no stability assessment technique for genotype data that can estimate this uncertainty.
To address this issue, I developed Pandora, a stability estimation framework for genotype data based on bootstrapping. Pandora computes an overall score to quantify the stability of the entire embedding, infers per-individual support values, and also deploys a k-means clustering approach to assess the uncertainty of assignments to potential cultural groups. In addition to this bootstrap-based stability estimation, Pandora offers a sliding-window stability estimation for whole-genome data. In the respective publication, I demonstrate the usage and utility of Pandora for studies that rely on dimensionality reduction techniques using published empirical and simulated datasets.
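The idea of bootstrap-based stability estimation can be illustrated with a minimal sketch. This is not Pandora's implementation: it only uses the first principal component (computed with a simple power iteration), compares embeddings via the absolute Pearson correlation instead of a full embedding comparison, and the function names `first_pc_scores` and `bootstrap_stability` as well as the toy genotype matrix are assumptions made for this illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def first_pc_scores(matrix, iterations=200):
    """Scores of each individual (row) on the first principal component,
    up to scale, obtained by power iteration on the Gram matrix."""
    n, m = len(matrix), len(matrix[0])
    col_means = [sum(row[j] for row in matrix) / n for j in range(m)]
    centered = [[row[j] - col_means[j] for j in range(m)] for row in matrix]
    gram = [[sum(a * b for a, b in zip(centered[i], centered[k]))
             for k in range(n)] for i in range(n)]
    vec = [1.0 + i for i in range(n)]  # deterministic start vector
    for _ in range(iterations):
        nxt = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = sum(x * x for x in nxt) ** 0.5
        if norm == 0:
            return [0.0] * n  # replicate without any variance
        vec = [x / norm for x in nxt]
    return vec


def bootstrap_stability(matrix, n_bootstraps=20, seed=0):
    """Mean absolute correlation between the reference embedding and the
    embeddings of SNP-resampled bootstrap replicates (1.0 = fully stable)."""
    rng = random.Random(seed)
    m = len(matrix[0])
    reference = first_pc_scores(matrix)
    similarities = []
    for _ in range(n_bootstraps):
        # Resample SNP columns with replacement.
        cols = [rng.randrange(m) for _ in range(m)]
        replicate = [[row[j] for j in cols] for row in matrix]
        # abs() because the sign of a principal component is arbitrary.
        similarities.append(abs(pearson(reference, first_pc_scores(replicate))))
    return sum(similarities) / len(similarities)


# Toy data (hypothetical): 6 individuals x 8 SNPs, two clearly separated groups.
genotypes = [
    [0, 0, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 1],
    [2, 2, 1, 2, 2, 2, 2, 2],
    [2, 2, 2, 2, 1, 2, 2, 2],
    [2, 1, 2, 2, 2, 2, 2, 1],
]
stability = bootstrap_stability(genotypes)
```

For well-separated groups like in this toy matrix, the score should be close to 1; noisy or sparse data would yield lower values.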
Pandora is implemented in Python 3 and unit-tested, and I set up a CI pipeline in GitHub Actions.
The tool is available on GitHub.
The paper is currently under review; a preprint is available on bioRxiv.
In a joint work with researchers in France, we demonstrated that current state-of-the-art models of sequence evolution in phylogenetics cannot simulate empirical-like data.
Simulating multiple sequence alignments (MSAs) using probabilistic models of sequence evolution plays an important role in the evaluation of phylogenetic inference tools and is crucial to the development of novel learning-based approaches for phylogenetic reconstruction, for instance, neural networks. These models and the resulting simulated data need to be as realistic as possible to be indicative of the performance of the developed tools on empirical data and to ensure that neural networks trained on simulations perform well on empirical data. Over the years, numerous models of evolution have been published with the goal to represent as faithfully as possible the sequence evolution process and thus simulate empirical-like data. In this study, we simulated DNA and protein MSAs under increasingly complex models of evolution with and without insertion/deletion (indel) events using a state-of-the-art sequence simulator. We assessed their realism by quantifying how accurately supervised learning methods are able to predict whether a given MSA is simulated or empirical.
Our results show that we can distinguish between empirical and simulated MSAs with high accuracy using two distinct and independently developed classification approaches across all tested models of sequence evolution. Our findings suggest that the current state-of-the-art models fail to accurately replicate several aspects of empirical MSAs, including site-wise rates as well as amino acid and nucleotide composition.
This is joint work with my colleagues at HITS and a team of researchers in Lyon. Johanna Trost, Dimitri Höhler, and I all contributed equally to this work.
The peer-reviewed publication is available at Molecular Biology and Evolution.
Pythia is a machine learning model that predicts the difficulty of phylogenetic inferences on a dataset based on its multiple sequence alignment.
Phylogenetic analyses under the Maximum-Likelihood (ML) model are time and resource intensive. To adequately capture the vastness of tree space, one needs to infer multiple independent trees. On some datasets, multiple tree inferences converge to similar tree topologies, on others to multiple, topologically highly distinct yet statistically indistinguishable topologies. At present, no method exists to quantify and predict this behavior. We introduce a method to quantify the degree of difficulty for analyzing a dataset and present Pythia, a Random Forest Regressor that accurately predicts this difficulty. Pythia predicts the degree of difficulty of analyzing a dataset prior to initiating ML-based tree inferences. Pythia can be used to increase user awareness with respect to the amount of signal and uncertainty to be expected in phylogenetic analyses, and hence inform an appropriate (post-)analysis setup. Further, it can be used to select appropriate search algorithms for easy-, intermediate-, and hard-to-analyze datasets.
In the current version of Pythia we replaced the Random Forest Regressor with boosted trees implemented in LightGBM and retrained the predictor on approximately 10k MSAs, including DNA, AA, and morphological datasets.
For convenient usage, we provide the command line interface PyPythia that is built around Pythia. PyPythia is implemented in Python and unit-tested, and I set up CI using GitHub Actions. PyPythia is available on GitHub.
Additionally, CPythia wraps Pythia in a C library as a plugin for the Coraxlib project. This integration is a crucial part of our group's latest version of RAxML-NG, adaptive RAxML-NG.
The tool is available on GitHub.
The peer-reviewed publication is available at Molecular Biology and Evolution.
I performed large scale numerical analysis of the influence of threshold parameters on the runtime and results of Maximum Likelihood phylogenetic inferences.
Maximum Likelihood (ML) is a widely used phylogenetic inference model. ML implementations heavily rely on numerical optimization routines that use internal numerical thresholds to determine convergence. We systematically analyze the impact of these threshold settings on the log-likelihood and runtimes for ML tree inferences with RAxML-NG, IQ-TREE, and FastTree on empirical datasets. We provide empirical evidence that we can substantially accelerate tree inferences with RAxML-NG and IQ-TREE by changing the default values of two such numerical thresholds. At the same time, altering these settings does not significantly impact the quality of the inferred trees. We further show that increasing both thresholds accelerates the RAxML-NG bootstrap without influencing the resulting support values. For RAxML-NG, increasing the likelihood thresholds ϵ_LnL and ϵ_brlen to 10 and 10^3 respectively results in an average tree inference speedup of 1.9 ± 0.6 on Data collection 1, 1.8 ± 1.1 on Data collection 2, and 1.9 ± 0.8 on Data collection 2 for the RAxML-NG bootstrap. Increasing the likelihood threshold ϵ_LnL to 10 in IQ-TREE results in an average tree inference speedup of 1.3 ± 0.4 on Data collection 1 and 1.3 ± 0.9 on Data collection 2.
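To give an intuition for how such a convergence threshold trades runtime against result quality, here is a toy sketch. It is not the optimizer of RAxML-NG, IQ-TREE, or FastTree: the "log-likelihood" is a simple quadratic function, the optimizer is plain gradient ascent, and the function name `optimize` and all parameter values are assumptions chosen for illustration.

```python
def optimize(epsilon, start=0.0, step=0.1, max_iter=10_000):
    """Gradient ascent on a toy log-likelihood with its maximum at x = 3.

    The loop stops as soon as one iteration improves the log-likelihood
    by less than `epsilon` (the convergence threshold).
    Returns (x, log_likelihood, number_of_iterations).
    """
    def log_likelihood(x):
        return -(x - 3.0) ** 2

    x = start
    current = log_likelihood(x)
    for iteration in range(1, max_iter + 1):
        x = x + step * (-2.0 * (x - 3.0))  # step along the gradient
        updated = log_likelihood(x)
        if updated - current < epsilon:
            return x, updated, iteration
        current = updated
    return x, current, max_iter


# A strict threshold versus a more lenient one.
strict = optimize(epsilon=1e-6)
loose = optimize(epsilon=1e-2)
print(f"strict: {strict[2]} iterations, logL = {strict[1]:.6f}")
print(f"loose:  {loose[2]} iterations, logL = {loose[1]:.6f}")
```

In this toy setting, the lenient threshold stops after fewer iterations while the final log-likelihood differs only marginally, which mirrors the kind of trade-off analyzed in the study.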
Our research comprises four studies. I conducted Study 1 during my Master's thesis with the CME group at HITS. You can find the full thesis here.
The peer-reviewed publication is available at Bioinformatics Advances.
I implemented a private, web-based diary to write down memories and store images of vacations and other cool activities.
To remember vacations, holidays, and trips, I implemented a diary website using the Django web framework. This website also stores images associated with certain trips and image descriptions, so I can show the best images to friends and family. It also includes a map, so I can see where on this beautiful earth I have already travelled to ☺ In its latest version, I can also upload files, e.g. GPS tracks of hikes.
The tool is available on GitHub.
I implemented a web-based tool to organize courses for the computer science master's degree at KIT.
For my master’s studies at KIT I implemented a web-based tool to organize the courses I intend to take. The tool checks if the planned courses satisfy all requirements to get the master’s degree. It further shows a schedule of exam dates, as well as an overview of all grades. Since the computation of the final grade at KIT is a bit tedious, I also implemented an automatic computation of the final grade, as well as the average grades per module.
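The grade computation can be sketched roughly as follows. This is a simplified illustration that assumes a purely credit-weighted average; the actual KIT examination rules (rounding, weighting of the thesis, and so on) are not reproduced here, and the module names, grades, and credits are made up.

```python
def weighted_average(courses):
    """Credit-weighted mean grade of a list of (grade, credits) pairs."""
    total_credits = sum(credits for _, credits in courses)
    if total_credits == 0:
        raise ValueError("cannot average courses without credits")
    return sum(grade * credits for grade, credits in courses) / total_credits


def summarize(modules):
    """Return the average grade per module and the overall average."""
    per_module = {name: weighted_average(courses)
                  for name, courses in modules.items()}
    all_courses = [course for courses in modules.values() for course in courses]
    return per_module, weighted_average(all_courses)


# Hypothetical example data: module name -> list of (grade, credits).
modules = {
    "Module A": [(1.0, 6), (2.0, 3)],
    "Module B": [(1.3, 5)],
}
per_module, final_grade = summarize(modules)
print(per_module, round(final_grade, 2))
```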
The tool is available on GitHub.
I developed and implemented a markerless patient movement tracking algorithm for cochlear implant surgeries.
As my bachelor's thesis, at the Intelligent Process Automation and Robotics Lab (IPR) at KIT, I implemented a system for markerless patient movement tracking during cochlear implant surgeries. The approach uses only the images obtained by the microscopic camera used by the surgeon. I implemented the algorithm using Python and the image processing frameworks OpenCV and scikit-image. It also includes a neural network for semantic image segmentation.
My bachelor's thesis is available upon request only as it contains sensitive patient images.
The tool is available on GitHub.
The peer-reviewed publication is available at Frontiers in Surgery.
I implemented a web-based recipe website to store and share recipes with friends and family.
This project was inspired by my love for baking. It is meant for storing all the ideas and recipes of me and my friends. Different users can create new recipes, categorize them or search for a specific recipe. The website also includes functionality to filter recipes by ingredients and categories.
This website is an ongoing project and I keep adding new features. The latest features I added were a shopping list (which is especially useful when planning to bake multiple different recipes) and an idea collection where I can save interesting recipes or ideas I come up with.
The tool is available on GitHub.
We implemented a web-based Lambda Calculus IDE and interpreter for the programming paradigm lecture at KIT.
As part of the undergraduate studies, we implemented a browser-based lambda calculus IDE and interpreter in a team of six students. We used Java with the Google Web Toolkit, JavaScript and CSS and wrote about 10 000 lines of code plus 6000 lines of tests. The project was graded with mark 1.0 (best possible grade). I mostly implemented the interactive frontend to explore the lambda terms and their reduction steps. Also, I wrote lots of tests for the controller layer (MVC architecture).
The tool is available on GitHub.
Publications & Preprints
Publications
Simulations of Sequence Evolution: How (Un)realistic They Are and Why
J. Trost*, J. Haag*, D. Höhler*, L. Jacob, A. Stamatakis, B. Boussau (2024) Simulations of Sequence Evolution: How (Un)realistic They Are and Why. Molecular Biology and Evolution, 41(1). https://doi.org/10.1093/molbev/msad277
* equal contribution
Adaptive RAxML-NG: Accelerating Phylogenetic Inference under Maximum Likelihood using dataset difficulty
A. Togkousidis, A. M. Kozlov, J. Haag, D. Höhler, A. Stamatakis (2023) Adaptive RAxML-NG: Accelerating Phylogenetic Inference under Maximum Likelihood using dataset difficulty. Molecular Biology and Evolution, 40(10). https://doi.org/10.1093/molbev/msad227
The Free Lunch is not over yet – Systematic Exploration of Numerical Thresholds in Maximum Likelihood Phylogenetic Inference
J. Haag, L. Hübner, A. M. Kozlov, A. Stamatakis (2023) The Free Lunch is not over yet – Systematic Exploration of Numerical Thresholds in Maximum Likelihood Phylogenetic Inference. Bioinformatics Advances, 3(1). https://doi.org/10.1093/bioadv/vbad124
From Easy to Hopeless - Predicting the Difficulty of Phylogenetic Analyses
J. Haag, D. Höhler, B. Bettisworth, A. Stamatakis (2022) From Easy to Hopeless - Predicting the Difficulty of Phylogenetic Analyses. Molecular Biology and Evolution, 39(12). https://doi.org/10.1093/molbev/msac254
Continuous Feature-Based Tracking of the Inner Ear for Robot-Assisted Microsurgery
C. Marzi, T. Prinzen, J. Haag, T. Klenzner, F. Mathis-Ullrich (2021) Continuous Feature-Based Tracking of the Inner Ear for Robot-Assisted Microsurgery. Front. Surg., 8(742160). https://doi.org/10.3389/fsurg.2021.742160
Preprints
Pandora: A Tool to Estimate Dimensionality Reduction Stability of Genotype Data
J. Haag, A. I. Jordan, A. Stamatakis (2024) Pandora: A Tool to Estimate Dimensionality Reduction Stability of Genotype Data. bioRxiv. https://doi.org/10.1101/2024.03.14.584962
Predicting Phylogenetic Bootstrap Values via Machine Learning
J. Wiegert, D. Höhler, J. Haag, A. Stamatakis (2024) Predicting Phylogenetic Bootstrap Values via Machine Learning. bioRxiv. https://doi.org/10.1101/2024.03.04.583288
A representative Performance Assessment of Maximum Likelihood based Phylogenetic Inference Tools
D. Höhler, J. Haag, A. M. Kozlov, A. Stamatakis (2022) A representative Performance Assessment of Maximum Likelihood based Phylogenetic Inference Tools. bioRxiv. https://doi.org/10.1101/2022.10.31.514545
Talks
Educated Bootstrap Guesser: Predicting Phylogenetic Bootstrap Values
legend2024 (FORTH Heraklion, Crete, May 2024)
In this presentation, I presented the work of my master's student Julius.
Recording
Talk slides
Simulations of Sequence Evolution: How (Un)realistic They Are and Why
legend2024 (FORTH Heraklion, Crete, May 2024)
Recording
Talk slides
Predicting the Difficulty of Phylogenetic Analyses
ERGA BioGenome Analysis and Applications Seminar (Online Seminar, November 2023)
Recording
Talk slides
Predicting the Difficulty of Phylogenetic Analyses
Peder Sather/Invertomics Symposium “Progress and Development in Phylogenetic Methods” (University of Oslo, Norway, March 2023)
Talk slides
Predicting the Difficulty of a Phylogenetic Analysis
EVOLCYP Workshop on Biodiversity Genomics (University of Cyprus, Cyprus, September 2022)
Skills
Programming Languages & Tools
I feel most comfortable working with Python, but I have worked with many other programming languages. Since most of my hobby projects are websites, I am also very familiar with HTML5 and CSS. During my work at ArtiMinds I wrote code in JavaScript and TypeScript, and during my computer science studies I also coded in Java and C. I recently also learned the basics of C++ and R.
For all my projects I'm using Git, and I know the basic usage (commit, push, merge, rebase, ...). However, advanced Git features still feel like magic to me and I have to google a lot of commands ☺︎
Python
HTML5 & CSS
Git
SQL
JavaScript & TypeScript
Java
C++
C
R
Frameworks
Here is a list of frameworks I frequently work with and that I would say I know my way around.
Pandas Plotly Dash Matplotlib Numpy scikit-learn LightGBM Django Snakemake Biopython
Languages
- German (Native)
- English (Professional)
- Spanish (Elementary)
- Greek (Elementary)
Miscellaneous
Hobbies
- Baking and cooking! You can see pictures of my baked goods here ☺︎
...yes I will bring cake to the office ☺︎
- Sports, especially bouldering, cycling, and tennis.
- Drinking good coffee ☺︎ (I even have a barista certificate!)
- Meeting friends
- Reading a good book