Babynames Project Maria Scricco

MEA 3290, Fall 2023

Introduction

Using the babynames dataset on CRAN, I decided to filter out Italian names and track them over time. Overall, I believed that the top Italian names in Italy would have shown a steady increase in popularity over time in the US.

In order to locate popular Italian names, I researched online databases. I found a dataset on Kaggle which outlined these most popular names. This dataset can be found through the link here:

Download the Dataset Here!

Step 1: Libraries

I started by loading the libraries needed for this project. First, the tidyverse is loaded to provide the pipe operator (%>%) and the data-wrangling functions used throughout. Next, the babynames package is loaded to provide the babynames dataset. The ggplot2 package is loaded to draw the plots for this project, and readr is loaded to read in the new dataset of Italian names. (Both ggplot2 and readr are already attached by the tidyverse, as the startup message below shows, so those two calls are redundant but harmless.)

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.0     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.1     ✔ tibble    3.1.8
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(babynames)
library(ggplot2)
library(readr)
italy_names <- read_csv("~/Downloads/archive (3)/italy_names.csv")

Rows: 5967 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Name, Country
dbl (2): Count, Year
lgl (1): Gender

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Step 2: Arranging

The next step was to find the most popular names across all years in the Italian names dataset. Because each name appears on more than one row, the rows first needed to be grouped by the variable “Name” so that each unique name appears only once. The group_by() function was used to do this, and the summarize() function was used to sum Count within each name.

The totals were then arranged in descending order, and the head() function was used to print the top 20 names, which was enough to identify the top 5 names for each gender.

italy_names %>% 
  group_by(Name) %>% 
  summarize(total = sum(Count)) %>% 
  arrange(desc(total)) %>% 
  head(20)

# A tibble: 20 × 2
   Name       total
   <chr>      <dbl>
 1 Andrea     12534
 2 Francesco  11060
 3 Marco      10373
 4 Maria      10255
 5 Giuseppe   10137
 6 Alessandro  9924
 7 Luca        9883
 8 Anna        8689
 9 Antonio     8257
10 Francesca   8069
11 Giovanni    7929
12 Paolo       7776
13 Stefano     7434
14 Roberto     7174
15 Matteo      7164
16 Elena       6766
17 Davide      6447
18 Giulia      6253
19 Sara        6058
20 Laura       5941
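
To make the grouping step concrete, here is a minimal sketch on a made-up table. The toy_names data and its counts are hypothetical and are not taken from the Kaggle file; only the column names match italy_names.

```r
library(dplyr)

# Hypothetical toy data: each name appears once per year
toy_names <- tibble(
  Name  = c("Luca", "Anna", "Luca", "Anna", "Sara"),
  Year  = c(2019, 2019, 2020, 2020, 2020),
  Count = c(10, 7, 12, 6, 9)
)

# Collapse the repeated rows into one total per name, largest first
toy_totals <- toy_names %>%
  group_by(Name) %>%
  summarize(total = sum(Count)) %>%
  arrange(desc(total))

toy_totals
```

Without the group_by() call, summarize() would return a single grand total instead of one row per name.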

Step 3: Subsetting Data for Females

The next goal is to create a subset of the original babynames dataset that includes only the top 5 Italian female names. This was done with the filter() function, listing the top 5 female names taken from the Italian names dataset.

femalenames <- babynames %>% 
  filter(name %in% c('Andrea', 'Maria', 'Anna', 'Francesca', 'Elena'),
         sex == "F")

Step 4: Subsetting Data for Males

The next goal is to create a subset of the original babynames dataset that includes only the top 5 Italian male names. This was also done with the filter() function, listing the top 5 male names taken from the Italian names dataset.

malenames <- babynames %>% 
  filter(name %in% c('Andrea', 'Francesco', 'Marco', 'Giuseppe', 'Alessandro'),
         sex == "M")

Step 5: Graphing!

I continued by graphing an outline of the popularity of these names in the US over time. I split the two graphs by gender, looking at the most popular female names, and then the most popular male names, respectively.

Plot the Female Names

# prop is a proportion (0 to 1), so the y axis is labeled accordingly
femalenames %>% 
  ggplot(aes(year, prop, color = name)) +
  geom_line() +
  labs(title = "Female Italian Names in the US Over Time",
       subtitle = "MEA 3290 Project 1, Graph 1",
       x = "Year", y = "Proportion") +
  theme_classic()

Plot the Male names

# prop is a proportion (0 to 1), so the y axis is labeled accordingly
malenames %>% 
  ggplot(aes(year, prop, color = name)) +
  geom_line() +
  labs(title = "Male Italian Names in the US Over Time",
       subtitle = "MEA 3290 Project 1, Graph 2",
       x = "Year", y = "Proportion") +
  theme_classic()

So, What Do We See?

In the female plot, Anna accounts for a very high proportion of baby girls in the early 1880s, followed by a steep decline that continues into the 2000s. The other four names show no clear trend, although Maria and Andrea have a small spike in the mid-1970s.

In the male plot, Marco rises sharply from about 1975 onward before a general decrease in the late 2010s. Most of the other names show no clear trend over time. However, Alessandro appears to be trending upward after 2000, which may suggest that its popularity will keep increasing over the next decade or so.

Step 6: Ranking Comparisons

I also thought it could be insightful to compare how these names rank in the US, by gender, with how they rank in Italy. This helps show whether the top Italian names hold the same order of popularity outside of Italy.

Specifically, here is the ranking from the dataset for females in Italy:

1. Andrea
2. Maria
3. Anna
4. Francesca
5. Elena

And here is the ranking of the most popular male names in Italy:

1. Andrea
2. Francesco
3. Marco
4. Giuseppe
5. Alessandro

femalenames %>% 
  group_by(name) %>% 
  summarise(total = sum(n)) %>% 
  arrange(desc(total))

# A tibble: 5 × 2
  name       total
  <chr>      <int>
1 Anna      888505
2 Maria     543324
3 Andrea    431157
4 Elena      73997
5 Francesca  28922

Based on this ranking from the babynames dataset, we see a flip in the popularity of these names when comparing the US to Italy. Anna is the most popular of these Italian names in the US, whereas it ranked only 3rd in Italy. Francesca lands on the lower side of the ranking in both the US and Italy, and Maria holds spot number 2 in both countries.
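
One way to put a number on how far the two female rankings agree is a Spearman rank correlation. The sketch below is only illustrative: the totals are copied by hand from the two tables printed in this report, and the Italian totals are assumed to be a fair stand-in for female popularity even though the Italian data carries no gender information.

```r
# Totals copied from the tables printed in this report
italy_total <- c(Andrea = 12534, Maria = 10255, Anna = 8689,
                 Francesca = 8069, Elena = 6766)
us_total <- c(Anna = 888505, Maria = 543324, Andrea = 431157,
              Elena = 73997, Francesca = 28922)

# Put the US totals in the same name order as the Italian totals
us_total <- us_total[names(italy_total)]

# Rank 1 = most popular in each country
rank_table <- data.frame(
  name       = names(italy_total),
  italy_rank = unname(rank(-italy_total)),
  us_rank    = unname(rank(-us_total))
)

# Spearman correlation between the two rankings (1 = identical order)
rho <- cor(rank_table$italy_rank, rank_table$us_rank, method = "spearman")

rank_table
rho
```

With only five names this is a rough summary at best, but it gives a single value to compare against the male ranking if the same steps are repeated there.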

malenames %>% 
  group_by(name) %>% 
  summarise(total = sum(n)) %>% 
  arrange(desc(total))

# A tibble: 5 × 2
  name       total
  <chr>      <int>
1 Marco      69756
2 Alessandro  7957
3 Francesco   7301
4 Andrea      5865
5 Giuseppe    5471

Based on this ranking of male names, we again see a flip in popularity. The most popular name is Marco (which is not surprising after looking at the plot), even though it ranked only 3rd in Italy. Andrea is one of the least popular names, ranking 4th in the US but 1st in Italy.

Step 7: Proportions & 2017 US Data

Since our past visualizations and observations looked at the US dataset as a whole, which spans the years 1880-2017, and our Italy dataset only covers the years 2017-2020, I thought it could be useful to subset the US dataset to those years to see whether any of the top names overlapped in that period.

The following code calculates a new proportion variable in the italy_names dataset:

italy_names %>% 
  group_by(Year) %>%  
  summarize(Total = sum(Count)) -> i_totals

italy_names %>%  
  left_join(i_totals, by = 'Year') -> i_totals2

i_totals2 %>% 
  mutate(Prop = Count / Total ) -> i_totals3
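
The same per-year proportion can also be computed in one pipeline with a grouped mutate(), which avoids the intermediate join. This is a sketch on hypothetical toy data (toy_names and its counts are made up; only the column names match italy_names), not a replacement for the code above.

```r
library(dplyr)

# Hypothetical toy data with the same column names as italy_names
toy_names <- tibble(
  Name  = c("Luca", "Anna", "Luca", "Anna"),
  Year  = c(2019, 2019, 2020, 2020),
  Count = c(30, 10, 20, 20)
)

# Within each year, divide every count by that year's total
toy_props <- toy_names %>%
  group_by(Year) %>%
  mutate(Total = sum(Count),
         Prop  = Count / Total) %>%
  ungroup()

toy_props
```

Note that Total here is the sum of the counts listed in the data for that year, so Prop is a share of the listed names rather than of all births, unless the dataset happens to cover every name.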

Plot the 10 most popular names in Italy for the whole dataset:

i_totals3 %>%  
  group_by(Name) %>%  
  summarize(average = mean(Prop)) %>%  
  arrange(desc(average)) %>%  
  head(10) %>%  
  ggplot(aes(reorder(Name, average),average, fill = Name)) + geom_col() +
  labs( title= "Top 10 Italy Babynames Popularity", x= "Name",
        y = "Proportion", subtitle = "MEA 3290 Project 1, Graph 3")+
  theme_classic()+
  coord_flip()

head(italy_names$Year)

[1] 2020 2020 2020 2020 2020 2020

tail(italy_names$Year)

[1] 2017 2017 2017 2017 2017 2017

The head and tail functions show that the first rows are from 2020 and the last rows are from 2017, so, assuming the rows are ordered by year, the dataset ranges from 2017 to 2020.
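
Because head() and tail() only look at the first and last rows, a more direct check that does not depend on row order is range(). The sketch below uses a small hypothetical vector (years) in place of italy_names$Year, deliberately unsorted.

```r
# Hypothetical stand-in for italy_names$Year, deliberately unsorted
years <- c(2020, 2020, 2018, 2017, 2019, 2017)

# Smallest and largest year, regardless of row order
year_range <- range(years)
year_range

# Every distinct year present, in ascending order
sort(unique(years))
```

On the real data the equivalent call would be range(italy_names$Year).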

Next, we filter the US babynames dataset to these years using the filter function. We average the proportions by name and then arrange the data in descending order by our new average variable. Then, we plot this on a column chart to view the average proportion for each name.

babynames %>%  
  filter(year > 2016 & year < 2021) %>%  
  group_by(name) %>%  
  summarize(average = mean(prop)) %>%  
  arrange(desc(average)) %>%  
  head(10) %>%  
  ggplot(aes(reorder(name, average),average, fill = name)) + geom_col() +
  labs( title= "Top 10 US Babyname Popularity in 2017", x= "Name",
        y = "Proportion", subtitle = "MEA 3290 Project 1, Graph 4")+
  theme_classic()+
  coord_flip()
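
One caveat about grouping by name alone: the babynames data has a separate row, and a separate prop, for each sex, so a name recorded for both sexes has its two proportions averaged together. The sketch below uses hypothetical toy rows (the names and values in toy_us are made up) to show how grouping by both name and sex can change the order.

```r
library(dplyr)

# Hypothetical toy rows shaped like babynames for a single year
toy_us <- tibble(
  year = c(2017, 2017, 2017),
  sex  = c("F", "M", "M"),
  name = c("Emma", "Emma", "Liam"),
  prop = c(0.0100, 0.0001, 0.0090)
)

# Grouping by name only: Emma's large female share is averaged
# with her tiny male share
by_name <- toy_us %>%
  group_by(name) %>%
  summarize(average = mean(prop)) %>%
  arrange(desc(average))

# Grouping by name and sex keeps the two shares separate
by_name_sex <- toy_us %>%
  group_by(name, sex) %>%
  summarize(average = mean(prop), .groups = "drop") %>%
  arrange(desc(average))

by_name
by_name_sex
```

In this toy case Liam tops the name-only ranking, while Emma (F) tops the name-and-sex ranking, so it may be worth checking whether the real top 10 changes in the same way.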

As both bar charts show, none of the top 10 names in Italy (regardless of gender) reached the top 10 list of names in the US in 2017.
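
The overlap can also be checked in code rather than by eye, using intersect() on the two lists of names. In this sketch, italy_top10 is taken from the Step 2 table (ranked by total count, which is assumed to match the chart's ranking by average proportion), and us_top10 is an illustrative placeholder, not the actual output of the chart above.

```r
# Top 10 Italian names from the Step 2 table (ranked by total count)
italy_top10 <- c("Andrea", "Francesco", "Marco", "Maria", "Giuseppe",
                 "Alessandro", "Luca", "Anna", "Antonio", "Francesca")

# Illustrative placeholder for the US top 10; replace with the names
# actually produced by the US pipeline
us_top10 <- c("Emma", "Olivia", "Liam", "Noah", "Ava",
              "Isabella", "Sophia", "Mia", "William", "James")

# Names that appear in both lists
overlap <- intersect(italy_top10, us_top10)
overlap
length(overlap)
```

A length of zero means that no name is shared between the two lists.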

Conclusion

Overall, the visualizations show two Italian names that stand out in the US, one for each gender: Marco and Anna. Anna was far more prevalent before the 1950s, and Marco shows a large spike in the 2000s. The rest of the top 5 names by gender in Italy show no clear trend over time, other than a slight rise and fall from about 1970 to 1980 in the female names Maria and Andrea.

From what we see, I hypothesize that the increases in the female names from 1970-2000 could be a result of immigration waves into the US from Italy during that time, alongside a steady presence of Italian culture and names in media (e.g., The Godfather, Italian music). I also hypothesize that the small hump seen for the name Maria in the 1960s-1970s is likely a result of the release of the movie West Side Story in October of 1961.

We did not see any overlap between the top names in the US in 2017 and the top names in Italy in 2017-2020. Does this mean that immigrants from Italy to the US are “Americanizing” their names? Is there pressure to do so? A study from 2016 may give us some insight. The study, by Carneiro et al., found that “at any given time between 1900 and 1930, about 77 percent of immigrants had an American-sounding first name, and it was the norm for them to have dropped their original name within a year of entering the U.S.” This push for “Americanized” first names may also have had economic causes, as the research article also outlines that “Native-born sons of Irish, Italian, German, and Polish immigrant fathers who were given very ethnic names ended up [earning] $50 to $100 less per year than sons who were given very ‘American’ names.”

If you’d like to read more, the study is available here: https://www.ucl.ac.uk/~uctppca/Paper_Immigration_11Feb2016_SL.pdf

Limitations:

Since Italian is a Latin-based language, names listed in the dataset may be incorrectly labeled as “Italian” when they could just as easily be Spanish, Portuguese, Romanian, or from another Romance language.

On top of this, male Italian names clearly show a larger increase from the 2000s onwards than female names do. However, it is worth noting that female names in the US are generally more numerous and more variable than male names, which may also affect these results.

It is also important to note that although Gender was a column in the original dataset, it contained no information on gender for any of the names. This required manually subsetting the names, which meant assuming a gender for each name provided. For example, Francesco was assumed to be a male name and Francesca a female name, which may not always have been accurate. On top of this, Andrea is a gender-ambiguous name, so it was included in both the female subset and the male subset.
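
The Andrea case can be checked against the US totals already printed in Step 6. This is a small sketch using those two numbers only; it says nothing about how the name splits by gender in Italy, which the Italian dataset cannot tell us.

```r
# US totals for Andrea, copied from the Step 6 tables in this report
andrea_us <- c(F = 431157, M = 5865)

# Share of all US babies named Andrea recorded under each sex
share <- andrea_us / sum(andrea_us)
round(share, 3)
```

In the US data the name is recorded overwhelmingly for girls, which is one reason its rank in the male list differs so much from its rank in Italy.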

It is also important to note that although we filtered the US babynames dataset by year > 2016 and year < 2021, the US babynames dataset only includes values up to 2017, meaning our US visualization is based on that year alone.