TUTORIAL 4

Barrie describes his data…

Barrie attempts ASSIGMENT 2! Will his glimpse of the Data Science Mythos drive him MAD?
Tutorial
Resources
Quarto
Data Sets
Author

Barrie D. Robison

Published

January 22, 2024

ASSIGNMENT 2: Your Data

In this assignment (detalied here), I will identify, import, describe, and host a data set that will be used throughout the remainder of the BCB 520 course for Data Visualizations.

MY DATASET

I’ve chosen a subset of a large dataset produced by our evolutionary video game, Project Hastur. We built Project Hastur to be an evolutionary video game, and we are bold in our assertions of that fact. But we haven’t really published any evidence that the evolutionary model works. This data set is the beginning of that exercise.

Note

PROJECT HASTUR creates a unique challenge by combining elements of 3D tower defense and real-time strategy with biological evolution. Fight against alien Proteans that evolve - using biologically accurate models of evolution - to overcome the player’s defenses.

Each creature you will face has its own unique genome controlling its abilities, behaviors, and appearance. Those that make it the furthest and do the most damage to your defenses have the most offspring you will have to defeat in the next generation. The result? Evolution responds to the player’s strategy and makes every playthrough a unique experience.

Use four upgradable turret classes, plus airstrikes and combat robots, to fight against the Protean invasion. Make strategic decisions about which turrets to build, when to upgrade them, and where to place them on the hex grid. A well-timed airstrike can change the flow of the game, but you’ll have to wait before you can use it again. Unlock powerful upgrades for each turret class as you move across the Nyx system. As you play, the Proteans evolve new weapon resistances, behaviors, and movement capabilities to better destroy your defenses.

In CAMPAIGN MODE, battle through a series of maps as a military defense commander to protect the planet Nyx from the ever-evolving threat of the Proteans. Unlock weapons and upgrades and use them to fight against the Protean swarm and learn about the mysteries of Project Hastur.

In EXPERIMENT MODE, choose any map, tweak the parameters, and play infinitely to see what you can evolve. Change the number of creatures and the parameters of evolution, make your turrets invincible, or crank up the biomatter and experiment with the most powerful turret upgrades. Experiment mode lets you experience Project Hastur your way.

Data Collection

The data were collected by running Project Hastur in Experiment mode using four predefined conditions:

I: The CHIP SHREDDER towers when Fitness Functions were turned ON and Civilians were PRESENT.

H: The CHIP SHREDDER towers when Fitness Functions were turned OFF and Civilians were PRESENT.

G: The CHIP SHREDDER towers when Fitness Functions were turned ON and Civilians were ABSENT.

K: The AUTOCANNON towers when Fitness Functions were turned ON and Civilians were ABSENT.

Each experimental condition was run 9 times (9 replicates).

IMPORTING THE DATA

I’m going to use the vroom package to import multiple files. Each file is a replicate and the filename tells us about the experimental condition. Below I convert the filename variable (I named it path) into a a single categorical attribute called Fit that uses the letter codes above.

Code
library(vroom)
library(stringr)
library(tidyverse)
library(readxl)
files <- fs::dir_ls(glob = "*.csv")

Hastur <- vroom(files, id = "path", 
                col_select = c(path, Generation, ID, Origin, AsexualReproduction, Fitness, Health,
                               SightRange, Armor, Damage, WalkSpeed, RunSpeed, Acceleration, 
                               TurnRate, Attraction0, Attraction1, Attraction2))

Hastur$Fit <- str_split_i(Hastur$path, pattern = "", 1)
Hastur$replicate <- str_split_i(Hastur$path, pattern = "", 4)

The glimpse command in the Tidyverse package is a nice way to summarize the data frame:

Code
glimpse(Hastur)
Rows: 412,246
Columns: 19
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
$ path                <chr> "GSC1.csv", "GSC1.csv", "GSC1.csv", "GSC1.csv", "G…
$ Generation          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ ID                  <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
$ Origin              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ AsexualReproduction <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Fitness             <dbl> 57.83508, 66.87755, 66.14652, 65.88873, 62.12119, …
$ Health              <dbl> 1006, 1012, 1011, 992, 983, 1020, 982, 963, 996, 9…
$ SightRange          <dbl> 9.952521, 10.096590, 9.954091, 10.066170, 10.02955…
$ Armor               <dbl> 0.05081077, 0.05080924, 0.05010696, 0.04903501, 0.…
$ Damage              <dbl> 49, 51, 51, 50, 50, 51, 50, 49, 51, 50, 49, 50, 49…
$ WalkSpeed           <dbl> 6.930266, 7.034348, 6.970608, 6.903729, 6.962081, …
$ RunSpeed            <dbl> 20.03562, 19.88800, 19.80754, 19.94738, 19.95583, …
$ Acceleration        <dbl> 14.70648, 15.05868, 14.85994, 14.89853, 15.01570, …
$ TurnRate            <dbl> 356.3890, 361.2032, 358.9919, 361.8476, 360.6143, …
$ Attraction0         <dbl> 0.158477500, -0.007134318, -0.063494000, 0.0125864…
$ Attraction1         <dbl> -0.070695680, 0.059872570, -0.003332689, 0.0458469…
$ Attraction2         <dbl> -0.108705200, 0.015018780, -0.007600136, 0.0120656…
$ Fit                 <chr> "G", "G", "G", "G", "G", "G", "G", "G", "G", "G", …
$ replicate           <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", …

DESCRIBE THE DATA

Data Set Type

What we have here is a (big) Flat Table. The Items are the rows, and each row is an individual alien enemy that existed during one of the replicates. Each Item (alien) is described by Attributes, which are arranged in the columns.

Attribute Types

The glimpse we did in the preceding section gives us a hint as to what each attribute type might be. Let’s flesh that out a bit though. I’m going to create a new data frame that describes the attributes.

Code
Attributes <- read_excel("Attributes.xlsx")
knitr::kable(Attributes)
Attribute Type Note
path Categorical each File Name is a unique replicate
Generation Quantitative Each Enemy Wave is a Generation
ID Ordinal Each enemy has a unique ID within each replicate
Origin Categorical The hive from which the enemy was spawned
AsexualReproduction Categorical Was the enemy spawned by infectiing a civilian?
Fitness Quantitative The value of Fitness is used to determine probability of reproduction
Health Quantitative Hit Points
SightRange Quantitative How far they can see civilians, towers, etc
Armor Quantitative resistance to physical damage
Damage Quantitative how much damage they do to towers
WalkSpeed Quantitative how fast they can walk
RunSpeed Quantitative how fast they can run
Acceleration Quantitative how fast they can transition from walking to running
TurnRate Quantitative how fast they can turn
Attraction0 Quantitative negative values is attraction to civilians, positive is avoidance
Attraction1 Quantitative negative values is attraction to towers, positive is avoidance
Attraction2 Quantitative negative values is attraction to the base, positive is avoidance
Fit Categorical I, H, G, or K - an inscrutable code about fitness conditions

The problem here is my inscrutable filename codes for that Fit variable. Those letter codes actually contain information on a couple hidden variables. I’m going to create a new variable called Gun and another called Civilians. I’ll add those to the main data file and also the Data Dicttionary.

Code
Hastur$Gun <- "CHIP SHREDDER"
Hastur$Civilians <- "Present"
  

  Hastur$Gun[Hastur$Fit=="K"]<- "AUTOCANNON"
     
  Hastur$Civilians[Hastur$Fit=="K" | Hastur$Fit =="G"] <- "ABSENT"
     

  Attributes<-rbind(Attributes, c("Gun","Categorical", "Autocannon or Chip Shredder"))
  Attributes<-rbind(Attributes, c("Civilians","Categorical", "Present or Absent"))
Code
  knitr::kable(Attributes)
Attribute Type Note
path Categorical each File Name is a unique replicate
Generation Quantitative Each Enemy Wave is a Generation
ID Ordinal Each enemy has a unique ID within each replicate
Origin Categorical The hive from which the enemy was spawned
AsexualReproduction Categorical Was the enemy spawned by infectiing a civilian?
Fitness Quantitative The value of Fitness is used to determine probability of reproduction
Health Quantitative Hit Points
SightRange Quantitative How far they can see civilians, towers, etc
Armor Quantitative resistance to physical damage
Damage Quantitative how much damage they do to towers
WalkSpeed Quantitative how fast they can walk
RunSpeed Quantitative how fast they can run
Acceleration Quantitative how fast they can transition from walking to running
TurnRate Quantitative how fast they can turn
Attraction0 Quantitative negative values is attraction to civilians, positive is avoidance
Attraction1 Quantitative negative values is attraction to towers, positive is avoidance
Attraction2 Quantitative negative values is attraction to the base, positive is avoidance
Fit Categorical I, H, G, or K - an inscrutable code about fitness conditions
Gun Categorical Autocannon or Chip Shredder
Civilians Categorical Present or Absent

HOST THE DATA

I’m publishing to GitHub! We will elaborate on this step as everyone progresses through the assignment.

TASK ABSTRACTION

For this data set, I am currently defining the user as … me! My hypothesis is that the two Fitness conditions create different evolutionary outcomes of the aliens in Project Hastur. Some relevant ACTION TARGET pairs might be:

DISCOVER TRENDS

DISCOVER DISTRIBUTION

DISCOVER SIMILARITY

COMPARE TRENDS

COMPARE DISTRIBUTION

I’m going to try COMPARE TRENDS. I want to COMPARE the TREND in Health over time (Generation) between the two Gun types. To do this, I’ll create a scatterplot, faceted by Gun. I’m suspicious that Acceleration is involved somehow, so I’m coloring with that variable.

Code
ggplot(Hastur, aes(x=Generation, y = Health))+
  geom_point(aes(color=Acceleration), alpha = 0.01, size = 1)+
  scale_color_continuous(low="red", high = "blue")+
  facet_grid(replicate~Gun)

Interesting… it looks like a clear trend for Health to increase under the withering fire of the AUTO CANNONS, but not when the player uses the CHIP SHREDDER. It is a bit hard to see what is going on with Acceleration, so let’s reverse the graphs so that we plot Acceleration on the y axis but color by Health.

Code
ggplot(Hastur, aes(x=Generation, y = Acceleration))+
  geom_point(aes(color=Health), alpha = 0.01, size = 1)+
  scale_color_continuous(low="red", high = "blue")+
  facet_grid(replicate~Gun)

I’m now confident that the replicates within each Gun type are pretty similar, and I can SUMMARIZE the individual data points. This will help with the COMPARE TRENDS task, I think.

Code
ggplot(Hastur, aes(x=Generation, y = Health))+
  geom_point(aes(color=Acceleration), alpha = 0.01, size = 1)+
  geom_smooth()+
  scale_color_continuous(low="red", high = "blue")+
  facet_wrap(~Gun)
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Code
ggplot(Hastur, aes(x=Generation, y = Acceleration))+
  geom_point(aes(color=Health), alpha = 0.01, size = 1)+
  geom_smooth()+
  scale_color_continuous(low="red", high = "blue")+
  facet_wrap(~Gun)
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'