Hi, I'm Andy.

I'm a data scientist with four years of software development experience. Currently, I'm an intern at Zyper, a machine learning-driven community marketing platform.

After graduating from UC Berkeley with a degree in bioengineering, I spent most of my early career as a software engineer at biotech companies. I learned how to write clear, reproducible code, how to tackle challenging problems, and the often overlooked importance of communicating effectively with diverse audiences. Along the way, I also witnessed how data can positively impact organizations and the people they seek to help, from informing crucial business decisions to creating life-saving drugs.

As a recent master's graduate and data science practitioner, I hope to apply my technical skills to solve meaningful problems across different industries. Let's connect!

My skillset

I'm an experienced Python and R developer who has built software ranging from machine learning models to complex web applications. I've been involved in every part of the data science workflow, from data cleaning, EDA, and model development to deployment and communicating results. Personal growth is important to me, and I'm constantly learning new technologies and tools.

  • Machine learning (scikit-learn, NumPy, pandas)
  • EDA & data visualization (ggplot, matplotlib)
  • Deep learning (PyTorch, computer vision, NLP)
  • Databases, SQL
  • Distributed computing (PySpark, Spark SQL)
  • AWS, Docker, Git

My projects

Sparkle: ML-based medication adherence

Many people simply fail to take their prescribed medication, which is costly to both their health and the healthcare system. Sparkle is a platform consisting of Apple Watch, iOS, and web apps that encourage patients to stay on track and allow doctors to monitor progress. At its core is a machine learning model trained on smartwatch sensor data to verify medication intake motions.
Our paper was accepted to the IEEE Engineering in Medicine and Biology Society Conference 2020, Montréal.
GitHub repository

Tools used: scikit-learn, XGBoost, PySpark, Flask, AWS, PostgreSQL, Docker

Image Captioner

"A boy sitting on a towel under an umbrella at the beach." Image captioning is a notoriously challenging problem. Here's one solution with a CNN-LSTM encoder-decoder network.

Tools used: PyTorch, OpenCV, word embeddings

Scikit-learn Emulator

From-scratch implementations of core machine learning algorithms, including random forests, decision trees, Ridge and Lasso regression, Naive Bayes, and k-means clustering.

Tools used: Python, NumPy

Zillow housing price predictions

Using various time series techniques, we predicted the median selling price of houses in California from January 2016 to August 2017 and compared model results.

Tools used: time series forecasting (ARIMA, VAR, exponential smoothing)

Instacart Product Repurchases

Which Instacart products are likely to be repurchased? We performed EDA, data manipulation, feature engineering, and binary classification using five different models and compared results.

Tools used: scikit-learn, pandas

Feature importance

A deep dive into the world of model interpretability and analysis of different feature importance methods in machine learning.

Tools used: scikit-learn, XGBoost

K-means

Exploring the k-means and k-means++ clustering algorithms in depth with original implementations and visualizations.

Tools used: scikit-learn

Resume

You may download my resume here.

Contact me

Feel free to shoot me an email at cheon.andy@gmail.com or message me on LinkedIn!