Between July 7 and September 5, 2025, I joined AppNava in Ankara as a Data Science Intern. AppNava is a fast-growing AI company providing prediction and optimization tools for mobile games, focusing on player-level metrics like Lifetime Value (LTV), churn, and Return on Ad Spend (ROAS).

The goal of my internship was to build and evaluate machine learning pipelines for LTV prediction, using real-world game data at production scale. The challenge: 1 million+ rows, 479 features, 98% of users with zero LTV, and heavy-tailed distributions.

This post covers the technical workflow in depth:

  • Data exploration and preprocessing
  • Feature selection and transformations
  • Two-stage LTV prediction models
  • Comprehensive evaluations
  • Interpretations of results
  • Parallel R&D efforts at AppNava

1. Data Exploration

The dataset contained 1,032,565 rows and 479 columns of raw player telemetry from a real mobile game.

Key Findings

  • Missingness: 13.7% of entries had missing values, but the target column (ltv) had none.
  • Redundancy: 29 columns were entirely empty; 237 were duplicates of others.
  • Imbalance: 97.89% of users had 0 LTV. 2.11% had non-zero LTV.
  • Distribution: Skewness = 85.28, Kurtosis = 14,060.23. Histograms, Q-Q plots, and boxplots all confirmed strong right skew with extreme outliers.
  • Transformations tested to normalize the target (see the sketch after this list):
    • Log: reduced skewness but preserved the long tail.
    • Box-Cox: reduced skewness to -0.009 (best normalization).
    • Yeo-Johnson: skewness = 0.109, effective but slightly worse.
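
A minimal sketch of that comparison on synthetic heavy-tailed data (the real telemetry is not reproduced here, so the printed skewness values are illustrative only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
ltv = rng.lognormal(mean=1.0, sigma=2.0, size=100_000)  # stand-in for non-zero LTV

transforms = {
    "log1p": np.log1p(ltv),
    "Box-Cox": stats.boxcox(ltv)[0],          # requires strictly positive input
    "Yeo-Johnson": stats.yeojohnson(ltv)[0],  # also handles zeros and negatives
}
for name, x in transforms.items():
    print(f"{name:12s} skewness = {stats.skew(x):+.3f}")
```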

2. Data Preprocessing

2.1 Dimensionality Reduction

Removed 29 empty columns, 237 duplicates, and 3 constant-value columns.

Final feature set reduced to 252 columns.

2.2 Missing Value Handling

Missing values were replaced with 0, interpreted as "no event."

This is justified because, in behavioral telemetry, the absence of an action is itself a meaningful signal.
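
A minimal pandas sketch of steps 2.1 and 2.2 together (the DataFrame name and the exact order of operations are assumptions):

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna(axis=1, how="all")             # 29 entirely empty columns
    df = df.loc[:, ~df.T.duplicated()]            # 237 duplicate columns
    df = df.loc[:, df.nunique(dropna=False) > 1]  # 3 constant-value columns
    return df.fillna(0)                           # missing = "no event"
```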

2.3 Feature Selection

Hybrid Lasso + Random Forest: combined Lasso's linear signal with Random Forest's non-linear interactions.

Pure Random Forest importance: selecting 124 features by importance alone outperformed the hybrid, so it became the final feature set (sketched below).
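
A minimal sketch of importance-based selection (the data here is synthetic; only the counts of 252 input and 124 selected features come from the real pipeline):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

X, y = make_regression(n_samples=5_000, n_features=252, random_state=42)

rf = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=42)
# threshold=-inf makes max_features the only criterion: keep the top 124.
selector = SelectFromModel(rf, max_features=124, threshold=-float("inf"))
X_selected = selector.fit_transform(X, y)
```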

2.4 Correlation Filtering

I tested removing one feature from each pair with absolute correlation above 0.95.

Removal worsened performance, so it was reverted: highly correlated features can still add value, for example when trees exploit small differences between near-duplicates.
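
A sketch of the (reverted) filter, assuming X is a pandas DataFrame of features:

```python
import numpy as np
import pandas as pd

def high_corr_columns(X: pd.DataFrame, threshold: float = 0.95) -> list[str]:
    corr = X.corr().abs()
    # keep only the upper triangle so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# X = X.drop(columns=high_corr_columns(X))  # tested, then reverted
```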

3. Modeling Approaches

The two-stage pipeline:

  • Stage 1 – Classification: predict whether a player will spend at all.
  • Stage 2 – Regression: predict how much, conditional on spending.
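
The two stages are combined at scoring time; a common rule (the production combination rule is an assumption here) multiplies the spend probability by the conditional amount:

```python
import numpy as np

def predict_ltv(clf, reg, X) -> np.ndarray:
    p_spend = clf.predict_proba(X)[:, 1]       # stage 1: probability of spending
    amount = reg.predict(X)                    # stage 2: amount, given spending
    return p_spend * np.clip(amount, 0, None)  # expected LTV per player
```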

I implemented and compared three pipelines:

3.1 Naive Heuristic Baseline

Classifier: always predicts “non-spender.”

Regressor: assigns the mean spender LTV (9.35) to every player.

Results:

  • Classification: ROC AUC = 0.50, F1 = 0.00.
  • Regression: MAE = 9.37, RMSE = 10.11, R² = -4.51.

Usefulness: only as a lower-bound baseline.
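
For reference, this baseline can be reproduced with scikit-learn's dummy estimators (9.35 is the mean spender LTV quoted above):

```python
from sklearn.dummy import DummyClassifier, DummyRegressor

naive_clf = DummyClassifier(strategy="most_frequent")           # always "non-spender"
naive_reg = DummyRegressor(strategy="constant", constant=9.35)  # mean spender LTV
```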

3.2 Logistic Regression + Tweedie Regressor

  • Classifier: Logistic Regression with class weighting.
    • Oversampling (SMOTE, ADASYN) was tested but performed poorly on this sparse, large-scale data.
    • Class weighting worked best, with {0: 1, 1: 10} the strongest setting.
    • Metrics: ROC AUC = 0.9472, F1 = 0.5425, AP = 0.5225.
  • Regressor: Tweedie GLM (Compound Poisson-Gamma), which naturally models zero-inflated outcomes.
    • Spearman = 0.40, MedAE = 4.18.
    • RMSE = 30.01 (driven by high variance among large outliers).
    • R² = -0.14 (worse than the mean baseline).
  • Overall: worked decently for low and mid spenders, failed for high-value users. A sketch of this pipeline follows the list.
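
A minimal sketch of the second pipeline (the Tweedie power of 1.5 and the solver settings are assumptions; only the class weights come from the experiments above):

```python
from sklearn.linear_model import LogisticRegression, TweedieRegressor

clf = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000)
# power in (1, 2) gives the Compound Poisson-Gamma family: a point mass at
# zero plus a continuous positive part.
reg = TweedieRegressor(power=1.5, link="log", max_iter=1000)
```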

3.3 LightGBM Classifier + LightGBM Regressor

  • Classifier: LightGBM with class weighting; hyperparameters tuned with Optuna's LightGBMTunerCV.
    • Cross-validation: StratifiedKFold (5 splits).
    • Metrics: ROC AUC = 0.9626, F1 = 0.5931, AP = 0.6029.
  • Regressor: LightGBM with quantile loss (α = 0.6), which is robust to long tails.
    • MAE = 8.72
    • RMSE = 24.06
    • MedAE = 4.34
    • Spearman = 0.44
    • R² = 0.18
  • Outperformed Tweedie on high-value LTV users, where the business impact is greatest. A sketch follows the list.
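
A minimal sketch of the third pipeline (the class weights and most hyperparameters shown are assumptions; the real values came from LightGBMTunerCV):

```python
from lightgbm import LGBMClassifier, LGBMRegressor
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

clf = LGBMClassifier(class_weight={0: 1, 1: 10}, n_estimators=500)
# alpha=0.6 targets the 60th percentile, which damps the pull of extreme
# whales compared with a squared-error objective.
reg = LGBMRegressor(objective="quantile", alpha=0.6, n_estimators=500)
# e.g. cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
```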

4. Comprehensive Evaluation

4.1 Classification Comparison

Model                 F1     ROC AUC   Log Loss   MCC     AP
Naive                 0.00   0.50      0.1015     0.00    0.0209
Logistic Regression   0.54   0.9472    0.0750     0.539   0.5225
LightGBM              0.59   0.9626    0.0835     0.586   0.6029

Takeaway: both Logistic Regression and LightGBM worked well, but LightGBM had the best balance of recall and precision on minority spenders.

4.2 Regression Comparison

Model      MAE    RMSE    MedAE   Spearman   R²
Naive      9.37   10.11   9.35    0.00       -4.51
Tweedie    8.78   30.01   4.18    0.40       -0.14
LightGBM   8.72   24.06   4.34    0.44       0.18

Takeaway: Tweedie slightly better for MedAE (median error, robust to outliers), but LightGBM dominates across other metrics and uniquely achieves positive R².

4.3 Segment-Based Analysis

  • Players were grouped into 3 LTV segments (per-segment evaluation sketched after this list):
    • 0–5 (low spenders)
    • 5–15 (mid spenders)
    • 15+ (whales)
  • Tweedie Results:
    • Low: MedAE = 3.49, Spearman = 0.16.
    • Mid: MedAE = 5.26, Spearman = 0.30.
    • High: MedAE = 22.82, Spearman = 0.08 → almost random.
  • LightGBM Results:
    • Low: MedAE = 2.64, Spearman = 0.13.
    • Mid: MedAE = 6.06, Spearman = 0.24.
    • High: MedAE = 15.31, Spearman = 0.34, Pearson = 0.38.
  • Interpretation:
    • Tweedie: stable for low spenders.
    • LightGBM: strong for high-value whales, which is business-critical since whales drive revenue.
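
A sketch of the per-segment evaluation (the series names are assumptions; the segment edges are the ones above):

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.metrics import median_absolute_error

def segment_report(y_true: pd.Series, y_pred: pd.Series) -> pd.DataFrame:
    seg = pd.cut(y_true, bins=[0, 5, 15, np.inf], include_lowest=True,
                 labels=["low (0-5)", "mid (5-15)", "whales (15+)"])
    rows = []
    for name, idx in y_true.groupby(seg, observed=True).groups.items():
        rows.append({
            "segment": name,
            "MedAE": median_absolute_error(y_true.loc[idx], y_pred.loc[idx]),
            "Spearman": spearmanr(y_true.loc[idx], y_pred.loc[idx]).correlation,
        })
    return pd.DataFrame(rows)
```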

5. R&D at AppNava: Sequential and Deep Learning Approaches

5.1 LSTMs & GRUs

  • Used on session-level sequences (logins, purchases, ad views).
  • Captured time-to-first-purchase dynamics and churn signals.
  • Outperformed static models in early-stage LTV prediction (first 24–48h).

5.2 Transformers for User Event Sequences

  • Event streams were embedded into contextual vectors (actions, timestamps, metadata).
  • Transformer’s self-attention captured long-range dependencies — e.g., a purchase on day 3 influencing LTV weeks later.
  • They are being tested against LSTM baselines and have shown better stability on sparse data (generic sketch below).
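
To make the idea concrete, here is a generic PyTorch sketch of a transformer encoder over event-ID sequences. This is not AppNava's architecture; the vocabulary size, dimensions, and mean-pooling head are all illustrative assumptions:

```python
import torch
import torch.nn as nn

class EventTransformer(nn.Module):
    def __init__(self, n_event_types: int = 1000, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(n_event_types, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)  # predicted LTV

    def forward(self, event_ids: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(event_ids))       # (batch, seq, d_model)
        return self.head(h.mean(dim=1)).squeeze(-1)   # mean-pool the sequence

# usage: EventTransformer()(torch.randint(0, 1000, (32, 200)))
```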

5.3 Temporal Convolutional Networks (TCNs)

  • Explored for scalability → dilated convolutions handled long histories in parallel.
  • Practical for millions of daily active users where RNNs became bottlenecks.

5.4 Hybrid Models

  • Ongoing work combined:
    • Boosting (LightGBM/XGBoost) on static tabular features.
    • Transformers/LSTMs on raw sequential events.
    • Ensembles of the two provided both interpretability and predictive accuracy.

6. Key Learnings

  • Two-stage modeling is mandatory for zero-inflated outcomes.
  • Gradient boosting (LightGBM) is robust against high-dimensional, skewed, and imbalanced data.
  • Simple techniques (Box-Cox, class weighting) often outperform more complex oversampling methods at scale.
  • Segment-specific evaluation is critical: Tweedie ≈ low-LTV stability, LightGBM ≈ high-LTV accuracy.
  • Sequential deep models (LSTMs, Transformers, TCNs) are actively researched and already promising for early prediction of whales.

7. Reflection

This internship gave me hands-on experience in end-to-end ML pipelines, from BigQuery preprocessing to model design, evaluation, and interpretation. I also witnessed how cutting-edge R&D in sequence modeling is shaping the future of gaming analytics.

The key takeaway: LTV prediction is not just a technical challenge but a strategic one. Identifying whales early can change how studios allocate marketing budgets, personalize gameplay, and design monetization strategies.

At AppNava, my work on boosting-based models complemented ongoing R&D on deep learning architectures, making it clear that the future lies in hybrid approaches that combine tabular + sequential signals.