Junior Data Scientist | Health Data Science

Technical Skills: Python, Tableau, PyMOL, RStudio, ChemAxon

Education

Data Scientist Bootcamp TripleTen, January 2023 – November 2023
Apothecary Profession Sanata Dharma University, August 2016 – October 2017
Bachelor of Pharmacy Sanata Dharma University, August 2012 – May 2016

Certification

Advanced Learning Algorithms DeepLearning.AI, Stanford University, March 2024
Supervised Machine Learning: Regression and Classification DeepLearning.AI, Stanford University, January 2024
Certificate of Completion - The Data Science Course: Complete Data Science Bootcamp 2023 Udemy, November 2023

Work Experience

AI Trainer @ Outlier (April 2024 - Present)

Freelance Ads Quality Rater @ Welocalize (August 2023 - Present)

Evaluate search engine ads using a proprietary tool.
Work effectively in a remote environment, applying online research skills and managing diverse workloads.
Independently set and manage a flexible schedule to ensure timely project delivery.

BCG Data Science Job Simulation @ Forage (July 2024)

Completed a customer churn analysis simulation for XYZ Analytics, identifying essential client data and outlining a strategic investigation approach.
Analyzed the data in Python with pandas and NumPy, and used data visualization to interpret trends.
Engineered and optimized a random forest model, achieving 85% accuracy in predicting customer churn.
Wrote a concise executive summary for the Associate Director, delivering actionable insights from the analysis to support decision-making.

Production Supervisor @ Berlico Farma, a member of Sido Muncul (April 2019 - April 2023)

Drove a 50% improvement in packaging line efficiency through strategic initiatives, optimized resource utilization, and operational cost reductions.
Engineered and implemented SOP standardization for 11 production machines, elevating safety protocols and overall production efficiency.
Surpassed production fulfilment targets, maintaining rates above 95% through hands-on leadership and meticulous process optimization.
Managed and directed group activities and projects, aligning the team with strategic objectives to exceed key performance indicators (KPIs) and production targets.

Projects

Identification of Potential Inhibitors of the Tyrosine Kinase Enzyme Using Machine Learning

Notebook

This project aims to develop a machine learning pipeline for predicting the biological activity (pIC50 values) of chemical compounds. The focus lies in drug discovery, leveraging cheminformatics and bioinformatics techniques to evaluate compounds for their potential as inhibitors of Tyrosine Kinase, an enzyme critical in cancer cell signaling and progression. Using molecular descriptors calculated from chemical compounds, various regression models were built and tested to predict pIC50 values. The AdaBoost Regressor was the best-performing model on the training and validation sets, with an R² score of 0.8049 and an RMSE of 1.0175. However, on the unseen test set, the model’s performance was moderate, with an R² of 0.73 and an RMSE of 1.60.

Predicting Biological Activity of Compounds using Machine Learning

Notebook

This project aims to develop a machine learning pipeline for predicting the biological activity (pIC50 values) of chemical compounds. The project leverages bioinformatics tools to evaluate compounds for their potential as drug candidates by targeting acetylcholinesterase, an enzyme relevant in neurological functions. Using molecular descriptors calculated from chemical compounds, various regression models were built and tested to predict their biological activity.

Student Performance Prediction: An End-to-End Machine Learning Project

Project Code

This project builds a robust end-to-end machine learning pipeline to predict student performance based on key features. I utilized several models, with Linear Regression emerging as the top performer. The project covers the entire process, from data preprocessing and feature engineering to model training and evaluation. The pipeline incorporates CI/CD practices, ensuring the model is both reliable and scalable. The complete system is deployed locally.

TripleTen - Temperature Prediction for Steelproof Steel Mill

Notebook

Using Python, I identified and addressed several data issues, including renaming columns and handling missing values by creating new aggregate columns or excluding certain datasets. Key insights from exploratory data analysis include typical heating duration, energy consumption, material usage, and the number of iterations needed for optimal steel composition at 1590 degrees. Outliers were cleaned, and feature engineering was performed to enhance model quality. I merged datasets and dropped highly correlated features, resulting in a final dataset of 2329 observations. Among the five models developed, a Linear Regression model achieved an MAE of 3.9 degrees, indicating potential energy savings by reducing iteration processes.

TripleTen - Zyfra (Au Concentrate Prediction)

Notebook

I analyzed three gold extraction and cleaning datasets using Python, filling missing values with medians and confirming accurate gold recovery calculations in the training set. I identified key features for model development and explored the data, finding significant increases in gold concentration at each stage while silver and lead remained stable. The particle size distribution was consistent across datasets, and anomalies with zero metal concentrations were removed. I developed and evaluated three models (Random Forest Regressor, Linear Regression, and Decision Tree Regressor) using K-fold cross-validation. The Random Forest Regressor achieved the best performance with an sMAPE score of 5.67% on the test set, accurately predicting rougher and final recovery.