Using the babynames dataset on CRAN, I decided to pick out Italian names and track them over time. Overall, I expected that the most popular names in Italy would show a steady increase in popularity in the US over time.
In order to locate popular Italian names, I searched online databases and found a dataset on Kaggle listing the most popular names in Italy. This dataset can be found through the link here:
I started by loading the libraries needed for this project. First, the tidyverse is loaded to make the pipe operator (%>%) available. Next, the babynames package is loaded to provide the babynames data. The ggplot2 package is used to build the plots for this project, and readr is used to read in the new dataset of Italian names; both are attached as part of the tidyverse, as the startup message below shows.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.0 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.1 ✔ tibble 3.1.8
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 5967 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Name, Country
dbl (2): Count, Year
lgl (1): Gender
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
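The call that reads the Italian names file is not shown above. A minimal sketch of that step follows, assuming the Kaggle file is a comma-separated file with the five columns listed in the column specification. The temporary file and its three rows are hypothetical stand-ins so that the example runs on its own; in practice the path would point at the downloaded file.

```r
library(readr)

# Hypothetical stand-in for the Kaggle file: same five columns as in the
# column specification above, with Gender left empty.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("Name,Count,Year,Gender,Country",
    "Andrea,3100,2017,,Italy",
    "Maria,2600,2017,,Italy",
    "Andrea,3050,2018,,Italy"),
  csv_path
)

# Read the file; show_col_types = FALSE quiets the column specification message.
# A column with no values at all, such as Gender here, is guessed as logical (all NA).
italy_names <- read_csv(csv_path, show_col_types = FALSE)
str(italy_names)
```

This also suggests why Gender is reported as a logical column in the output above: a column that contains no values is read as all NA.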
Step 2: Arranging
The next step was to arrange the new Italian names dataset in decreasing order of popularity to show the most popular names across the years. Because the same name appears in more than one row, the rows first needed to be grouped by the variable “Name” so that each name is listed only once. The group_by() function was used for this, and the summarize() function was used to sum the Count values for each name.
After arranging the totals in descending order, the head() function was used to print the top 20 names, which gives enough rows to pick out the top 5 names for each gender.
# A tibble: 20 × 2
Name total
<chr> <dbl>
1 Andrea 12534
2 Francesco 11060
3 Marco 10373
4 Maria 10255
5 Giuseppe 10137
6 Alessandro 9924
7 Luca 9883
8 Anna 8689
9 Antonio 8257
10 Francesca 8069
11 Giovanni 7929
12 Paolo 7776
13 Stefano 7434
14 Roberto 7174
15 Matteo 7164
16 Elena 6766
17 Davide 6447
18 Giulia 6253
19 Sara 6058
20 Laura 5941
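The grouping and ranking described in this step can be sketched as follows. The toy tibble and its counts are hypothetical; only the column names Name, Year, and Count and the result column total are taken from the output above.

```r
library(dplyr)

# Toy stand-in for the Italian names data (hypothetical counts)
italy_names <- tibble(
  Name  = c("Andrea", "Maria", "Andrea", "Anna", "Maria", "Anna"),
  Year  = c(2017, 2017, 2018, 2017, 2018, 2018),
  Count = c(300, 250, 310, 200, 240, 190)
)

top_names <- italy_names %>%
  group_by(Name) %>%                 # one group per unique name
  summarize(total = sum(Count)) %>%  # add up the counts across all years
  arrange(desc(total)) %>%           # most popular first
  head(20)                           # keep at most the top 20 rows

top_names
```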
Step 3: Subsetting Data for Females
The next goal is to create a subset of the original babynames dataset that contains only the top 5 Italian female names. This was done with the filter() function, listing the top 5 female names taken from the Italian names dataset.
The same was then done for the top 5 Italian male names: a second subset of the original babynames dataset was created with the filter() function, listing the top 5 male names taken from the Italian names dataset.
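The two subsets might be built along these lines. The object names femalenames and malenames match the plotting code later in this post, but the filter conditions are an assumption: the original filter code is not shown. Filtering on sex as well as name is inferred from the later ranking tables, where Andrea has different totals in the female and male tables.

```r
library(dplyr)
library(babynames)

# Top 5 names per gender, taken from the Italian ranking
top_female <- c("Andrea", "Maria", "Anna", "Francesca", "Elena")
top_male   <- c("Andrea", "Francesco", "Marco", "Giuseppe", "Alessandro")

# Assumption: restrict by sex as well as by name, so that a name such as
# Andrea contributes only its female rows to one subset and only its male
# rows to the other
femalenames <- babynames %>%
  filter(name %in% top_female, sex == "F")

malenames <- babynames %>%
  filter(name %in% top_male, sex == "M")
```

Without the sex condition, each name would contribute two rows per year (one per sex), and a line plot colored by name would zigzag between them.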
I continued by plotting the popularity of these names in the US over time. I made one graph per gender: first the most popular female names, then the most popular male names.
Plot the Female Names
femalenames %>%
  ggplot(aes(year, prop, color = name)) +
  labs(title = "Female Italian Names in the US Over Time", subtitle = "MEA 3290 Project 1, Graph 1", x = "Year", y = "Percentage") +
  geom_line() +
  theme_classic()
Plot the Male Names
malenames %>%
  ggplot(aes(year, prop, color = name)) +
  geom_line() +
  labs(title = "Male Italian Names in the US Over Time", subtitle = "MEA 3290 Project 1, Graph 2", x = "Year", y = "Percentage") +
  theme_classic()
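One detail worth noting: the y axis is labeled "Percentage", but prop in babynames is a fraction between 0 and 1. A small sketch of one way to make the axis match the label, using scales::label_percent() (the scales package is installed alongside ggplot2). The toy data frame and its values are hypothetical.

```r
library(ggplot2)

# Toy stand-in with the same columns the plots use (hypothetical values)
toy <- data.frame(
  year = rep(c(1880, 1950, 2017), times = 2),
  name = rep(c("Anna", "Maria"), each = 3),
  prop = c(0.027, 0.004, 0.003, 0.001, 0.006, 0.002)
)

p <- ggplot(toy, aes(year, prop, color = name)) +
  geom_line() +
  # prop is a fraction between 0 and 1; print the axis ticks as percentages
  scale_y_continuous(labels = scales::label_percent()) +
  labs(x = "Year", y = "Percentage") +
  theme_classic()

p
```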
So, What Do We See?
From the female babynames dataset, we see a very high proportion of babies being named Anna in the early 1880s, followed by a steep decrease that continues into the 2000s. The other four names do not show any clear trend, although Maria and Andrea have a small spike in the mid-1970s.
From the male babynames dataset, we see a large spike in the name Marco from about 1975 onwards, followed by a general decrease in the late 2010s. Most of the other names do not show a clear trend over time. However, the name Alessandro has been trending upwards since 2000, which may suggest that its popularity will continue to increase over the next decade or so.
Step 6: Ranking Comparisons
I also thought it could be insightful to compare the ranking of these Italian baby names in the US, by gender, with their ranking in Italy. This shows whether the top Italian names hold the same order of popularity outside of Italy.
Specifically, here is the ranking from the dataset for females in Italy:
1. Andrea
2. Maria
3. Anna
4. Francesca
5. Elena
And here is the ranking of the most popular male names in Italy:
1. Andrea
2. Francesco
3. Marco
4. Giuseppe
5. Alessandro
# A tibble: 5 × 2
name total
<chr> <int>
1 Anna 888505
2 Maria 543324
3 Andrea 431157
4 Elena 73997
5 Francesca 28922
Based on this ranking from the babynames dataset, we see a flip in the popularity of these names when comparing the US to Italy. Anna is the most popular of these names in the US, while it only ranked 3rd in Italy. Francesca lands on the lower side of the ranking in both the US and Italy, and Maria holds spot number 2 in both countries.
# A tibble: 5 × 2
name total
<chr> <int>
1 Marco 69756
2 Alessandro 7957
3 Francesco 7301
4 Andrea 5865
5 Giuseppe 5471
Based on this ranking of male names, we again see a flip in popularity. Specifically, the most popular name in the US is Marco (which is not surprising after looking at the graph), while it only ranked 3rd in Italy. Andrea is one of the least popular names in the US, ranking 4th, but it ranked 1st in Italy.
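The comparison for the male names can be made explicit by putting both rankings side by side. This is only a sketch: the US totals are copied from the table above and the Italy ranks from the list in this step, while the column names us_total, italy_rank, us_rank, and shift are introduced here for illustration.

```r
library(dplyr)

# US totals copied from the male ranking table above;
# Italy ranks copied from the male list in this step
male_ranks <- tibble(
  name       = c("Marco", "Alessandro", "Francesco", "Andrea", "Giuseppe"),
  us_total   = c(69756, 7957, 7301, 5865, 5471),
  italy_rank = c(3, 5, 2, 1, 4)
) %>%
  mutate(
    us_rank = min_rank(desc(us_total)),  # 1 = most common in the US
    shift   = italy_rank - us_rank       # positive = ranks higher in the US
  ) %>%
  arrange(us_rank)

male_ranks
```

A positive shift means the name ranks higher in the US than in Italy (Marco and Alessandro), and a negative shift means the opposite (Francesco, Andrea, and Giuseppe).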
Step 7: Proportions & 2017 US Data
Since our past visualizations and observations looked at the US dataset as a whole, which spans the years 1880-2017, and our Italy dataset only covers the years 2017-2020, I thought it could be useful to subset the US dataset by those years to see if any of the top names overlapped in those years specifically.
The following code calculates a new proportion variable in the italy_names dataset:
# Total count per year
italy_names %>%
  group_by(Year) %>%
  summarize(Total = sum(Count)) -> i_totals
# Attach each year's total to every row
italy_names %>%
  left_join(i_totals, by = 'Year') -> i_totals2
# Each name's share of its year's total
i_totals2 %>%
  mutate(Prop = Count / Total) -> i_totals3
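The same proportion can also be computed in a single pipeline with a grouped mutate(), without the intermediate join. This is a sketch of an equivalent approach rather than the code used for this project; the toy tibble and the name italy_props are hypothetical.

```r
library(dplyr)

# Toy stand-in for italy_names (hypothetical counts)
italy_names <- tibble(
  Name  = c("Andrea", "Maria", "Andrea", "Maria"),
  Year  = c(2017, 2017, 2018, 2018),
  Count = c(300, 100, 150, 50)
)

italy_props <- italy_names %>%
  group_by(Year) %>%
  mutate(
    Total = sum(Count),    # yearly total, repeated on every row of that year
    Prop  = Count / Total  # share of the counts listed for that year
  ) %>%
  ungroup()

italy_props
```

Note that Prop computed this way is a share of the names listed in the dataset for each year, whereas prop in babynames is a share of all recorded births of that sex in that year, so the two are not measured on the same base.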
Plot the 10 most popular names in Italy for the whole dataset:
Applying the head() and tail() functions to the dataset shows that it covers the years 2017-2020.
Next, we filter the US babynames dataset to these years using the filter() function. We average the proportions by name and then arrange the data in descending order by the new average variable. Then, we plot the top 10 names on a column chart to compare their average proportions.
babynames %>%
  filter(year > 2016 & year < 2021) %>%
  group_by(name) %>%
  summarize(average = mean(prop)) %>%
  arrange(desc(average)) %>%
  head(10) %>%
  ggplot(aes(reorder(name, average), average, fill = name)) +
  geom_col() +
  labs(title = "Top 10 US Babyname Popularity in 2017", x = "Name", y = "Percentage", subtitle = "MEA 3290 Project 1, Graph 4") +
  theme_classic() +
  coord_flip()
As both bar charts show, none of the top 10 names in Italy (regardless of gender) appears in the list of top 10 names in the US in 2017.
Conclusion
Overall, from the visualizations provided, we can see that two Italian names stand out in the US, one per gender: Marco and Anna. Anna was much more prevalent before the 1950s, and Marco shows a large spike in the 2000s. The rest of the top 5 names by gender in Italy do not show any clear trend over time, other than a slight rise and fall from about 1970 to 1980 in the female names Maria and Andrea.
From what we see, I hypothesize that the increases in the female names from 1970-2000 could be a result of immigration waves into the US from Italy during that time, alongside a steady presence of Italian culture and names in media (e.g., The Godfather, Italian music, etc.). I also hypothesize that the small hump seen for the name Maria in the 1960s-1970s is likely a result of the 1961 release of the movie West Side Story.
We did not see any overlap between the top names in the US in 2017 and the top names in Italy. Does this mean that immigrants from Italy to the US are “Americanizing” their names? Is there pressure to do so? A study from 2016 may give us some insight. The study by Carneiro et al. found that “at any given time between 1900 and 1930, about 77 percent of immigrants had an American-sounding first name, and it was the norm for them to have dropped their original name within a year of entering the U.S.” This push for “Americanized” first names may also have been driven by economic consequences, as the research article also notes that “Native-born sons of Irish, Italian, German, and Polish immigrant fathers who were given very ethnic names ended up [earning] $50 to $100 less per year than sons who were given very ‘American’ names.”
Since Italian is a Latin-based language, some of the names labeled as “Italian” in the dataset could just as easily be Spanish, Portuguese, Romanian, or from another related language.
On top of this, the male Italian names show a clearly larger increase from the 2000s onwards than the female names. However, it is worth noting that female names in the US are generally more varied than male names, which may also affect these results.
It is also important to note that although Gender was a column in the original dataset, it contained no information on gender for any of the names. The names therefore had to be split by hand, which meant assuming a gender for each one. For example, Francesco was assumed to be a male name and Francesca a female name, which may not always be accurate. In addition, Andrea is a gender-ambiguous name, so it was included in both the female subset and the male subset.
It is also important to note that although we filtered the US babynames dataset by year > 2016 and year < 2021, the US babynames dataset only includes values up to 2017, meaning the visualizations for that comparison are based on that year alone.
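A quick check of the year coverage before filtering would surface this limitation early. The sketch below assumes the babynames package version used in this post, whose data ends in 2017; the name year_range is introduced here for illustration.

```r
library(dplyr)
library(babynames)

# Check which years the US data actually covers
year_range <- range(babynames$year)
year_range

# List the distinct years that survive the 2017-2020 filter
babynames %>%
  filter(year > 2016 & year < 2021) %>%
  distinct(year)
```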