Marselinus Alberto

Data Science Project Portfolio

View the Project on GitHub albertomoa/albert-portfolio

Junior Data Scientist | Health Data Science

Technical Skills: Python, Tableau, PyMOL, RStudio, Chemaxon

Education

Certification

Work Experience

AI Trainer @ Outlier (April 2024 - Present)

Freelance Ads Quality Rater @ Welocalize (August 2023 - Present)

BCG Data Science Job Simulation @ Forage (July 2024)

Production Supervisor @ Berlico Farma, a member of Sido Muncul (April 2019 - April 2023)

Projects

Identification of Potential Inhibitors of the Tyrosine Kinase Enzyme Using Machine Learning

Notebook

This project develops a machine learning pipeline for predicting the biological activity (pIC50 values) of chemical compounds. The focus is drug discovery: cheminformatics and bioinformatics techniques are used to evaluate compounds as potential inhibitors of Tyrosine Kinase, an enzyme critical in cancer cell signaling and progression. Using molecular descriptors calculated from the compounds, several regression models were built and tested to predict pIC50 values. The AdaBoost Regressor was the best-performing model on the training and validation sets, with an R² score of 0.8049 and an RMSE of 1.0175. On the unseen test set, however, its performance was only moderate, with an R² of 0.73 and an RMSE of 1.60.
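For readers unfamiliar with the two metrics quoted above, here is a minimal sketch of how R² and RMSE are computed for a regression model. It is not the notebook's code, and the pIC50 values in it are made up purely for illustration.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Hypothetical pIC50 values (assumption: invented for this example only).
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 5.5, 7.5, 7.5]

print(f"R2 = {r2_score(y_true, y_pred):.2f}, RMSE = {rmse(y_true, y_pred):.2f}")
```

Because RMSE is expressed in pIC50 units, an RMSE of 1.60 on the test set corresponds to a typical error of more than one order of magnitude in IC50.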

Predicting Biological Activity of Compounds using Machine Learning

Notebook

This project develops a machine learning pipeline for predicting the biological activity (pIC50 values) of chemical compounds. It uses bioinformatics tools to evaluate compounds as potential drug candidates targeting acetylcholinesterase, an enzyme involved in neurological function. Using molecular descriptors calculated from the compounds, several regression models were built and tested to predict biological activity.
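The pIC50 target used in these projects is the negative base-10 logarithm of the IC50 expressed in molar units. The sketch below shows that conversion; it assumes the raw IC50 values are reported in nanomolar, which may differ from the units used in the notebook, and the function name is hypothetical.

```python
import math


def pic50_from_ic50_nm(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    ic50_molar = ic50_nm * 1e-9  # nM -> M
    return -math.log10(ic50_molar)


# Hypothetical IC50 values in nM (assumption: invented for this example).
for ic50_nm in (1.0, 100.0, 10_000.0):
    print(f"IC50 = {ic50_nm:>8.1f} nM -> pIC50 = {pic50_from_ic50_nm(ic50_nm):.2f}")
```

A higher pIC50 means a more potent compound: each additional unit corresponds to a tenfold lower IC50.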

Student Performance Prediction: An End-to-End Machine Learning Project

Project Code

This project builds an end-to-end machine learning pipeline to predict student performance. I compared several models, with Linear Regression emerging as the top performer. The project covers the entire process, from data preprocessing and feature engineering to model training and evaluation. The pipeline incorporates CI/CD practices to keep the model reliable and scalable. The complete system is deployed locally.

TripleTen - Temperature Prediction for Steelproof Steel Mill

Notebook

Using Python, I identified and addressed several data issues, including renaming columns and handling missing values by creating new aggregate columns or excluding certain datasets. Key insights from exploratory data analysis include typical heating duration, energy consumption, material usage, and the number of iterations needed for optimal steel composition at 1590 degrees. Outliers were cleaned, and feature engineering was performed to improve model quality. I merged the datasets and dropped highly correlated features, resulting in a final dataset of 2329 observations. Among the five models developed, a Linear Regression model achieved an MAE of 3.9 degrees, indicating potential energy savings from reducing the number of iterations.
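Dropping highly correlated features can be sketched as a greedy filter: keep a column only if its absolute Pearson correlation with every column already kept stays at or below a threshold. This is an illustrative sketch rather than the notebook's code; the column names, the data, and the 0.9 threshold are all assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (a constant would divide by zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def drop_correlated(columns, threshold=0.9):
    """Return the names of columns kept by a greedy correlation filter."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature columns (assumption: names and values are invented).
columns = {
    "heating_seconds": [100, 200, 300, 400, 500],
    "energy_used": [10, 20, 30, 40, 50],  # proportional to heating_seconds
    "bulk_material": [3, 1, 4, 1, 5],
}

print(drop_correlated(columns))
```

The result depends on column order, since the first column of a correlated pair is the one that survives.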

TripleTen - Zyfra (Au Concentrate Prediction)

Notebook

I analyzed three gold extraction and cleaning datasets using Python, filling missing values with medians and confirming that the gold recovery values in the training set were calculated correctly. I identified key features for model development and explored the data, finding significant increases in gold concentration at each stage while silver and lead remained stable. The particle size distribution was consistent across datasets, and anomalies with zero metal concentrations were removed. I developed and evaluated three models (Random Forest Regressor, Linear Regression, and Decision Tree Regressor) using K-fold cross-validation. The Random Forest Regressor achieved the best performance with an sMAPE score of 5.67% on the test set, accurately predicting rougher and final recovery.
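The sMAPE metric (symmetric mean absolute percentage error) has several variants. The sketch below uses a common one, in which each absolute error is divided by the mean of the absolute true and predicted values; whether the notebook uses exactly this form is an assumption, and the recovery values are made up.

```python
def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent.

    Pairs where both values are zero contribute zero error.
    """
    total = 0.0
    for t, p in zip(y_true, y_pred):
        denom = (abs(t) + abs(p)) / 2
        if denom == 0:
            continue  # both values are zero: treat as a perfect prediction
        total += abs(t - p) / denom
    return 100.0 * total / len(y_true)


# Hypothetical recovery percentages (assumption: invented for this example).
y_true = [90.0, 80.0]
y_pred = [110.0, 80.0]

print(f"sMAPE = {smape(y_true, y_pred):.2f}%")
```

Unlike MAE, sMAPE is scale-free, which makes it convenient for comparing errors across targets such as rougher and final recovery.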