
I imported 720 sentences from this website (https://www.cs.columbia.edu/~hgs/audio/harvard.html). There are 72 lists, each containing 10 sentences, and I saved them in an appropriate structure. I did these steps in R. The code is shown below.

library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%

url <- 'https://www.cs.columbia.edu/~hgs/audio/harvard.html'
sentences <- read_html(url) %>%
  html_nodes("li") %>%
  html_text()
headers <- read_html(url) %>%
  html_nodes("h2") %>%
  html_text()

harvardList <- list()
sentenceList <- list()
n <- 1

for(sentence in sentences){
  sentenceList <- c(sentenceList, sentence)
  if(length(sentenceList) == 10) { # once we have collected 10 sentences
    harvardList[[headers[n]]] <- sentenceList # store those 10 sentences under the name of the list they came from
    sentenceList <- list() # empty the temporary list that held those 10 sentences
    n <- n+1 # move on to the next list name
  }
}

sentences1 <- split(sentences, ceiling(seq_along(sentences)/10))
setwd("/Users/juliayudkovicz/Documents/Homework 4 Datascience")
sentences.df <- do.call("rbind", lapply(sentences1, as.data.frame))
names(sentences.df)[1] <- "Sentences"
write.csv(sentences.df, file = "sentences1.csv", row.names = FALSE)

Then, in Python, I computed a list of all the words ending in "ing" and their frequencies, i.e., how many times each one appeared across all 72 lists.

import os
import pandas as pd

path = "/Users/juliayudkovicz/Documents/Homework 4 Datascience"
cwd1 = os.getcwd()

df = pd.read_csv(r'/Users/juliayudkovicz/Documents/Homework 4 Datascience/sentences1.csv', sep='\t', engine='python')
# regex=False so that "." is treated as a literal period rather than a regex wildcard
df['Sentences'] = df['Sentences'].str.replace(".", "", regex=False)
sen_List = df['Sentences'].values.tolist()

ingWordList = []
for line in sen_List:
    for word in line.split():
        if word.endswith('ing'):
            ingWordList.append(word)  # keep every word that ends in "ing"

ingWordCountDictionary = {}

for word in ingWordList:
    word = word.replace('"', "")
    word = word.lower()
    if word in ingWordCountDictionary:
        ingWordCountDictionary[word] = ingWordCountDictionary[word] + 1
    else:
        ingWordCountDictionary[word] = 1  # first time this word is seen


f = open("ingWordCountDictionary.txt", "w")

for key, value in ingWordCountDictionary.items():
    keyValuePairToWrite = "%s, %s\n"%(key, value)
    f.write(keyValuePairToWrite)  # one "word, count" pair per line

f.close()

Now I am being asked to create a dataset that shows which list (1 of 72) each "ing" word comes from. This is what I don't know how to do. I know the words are a subset of the full collection of 72 lists, but how do I figure out which list each word came from?

The expected output should look something like this:

[List Number] [-ing Word]
List 1        swing, ring, etc.,
List 2        moving

and so on.

Here is one way to do it. Judging from the expected result, you seem to want verbs in progressive form (V-ing). (I do not understand why you have king in your result. If you include king, you should include spring as well, for example.) If you need to consider lexical classes, I think you want to use the koRpus package. If not, you can use the textstem package, for example.
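To illustrate why lexical class matters here, the following is a minimal Python sketch showing that a plain suffix test keeps nouns such as king, spring and ring alongside real progressive verbs. The sample sentences and the helper name `ing_words` are made up for illustration and are not taken from the Harvard lists.

```python
import re

# Hypothetical sample sentences, invented for this illustration.
sample = [
    "The king was sleeping in the spring.",
    "Bring the ring to the waiting room.",
]

def ing_words(sentence):
    """Return the lowercase words in `sentence` that end in 'ing'."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w.endswith("ing")]

# A suffix test cannot tell nouns (king, spring, ring) or base-form verbs
# (bring) apart from progressive forms (sleeping, waiting).
matches = [w for s in sample for w in ing_words(s)]
print(matches)
```

Only two of the six matches are progressive forms, which is the gap a part-of-speech tagger is meant to close.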

First, I scraped the link and created a data frame. Then, I split the sentences into words using unnest_tokens() in the tidytext package, and kept the words ending with 'ing'. Then, I used treetag() in the koRpus package. You need to install TreeTagger yourself before you use the package. Finally, I counted how many times these verbs in progressive form appear in the data set. I hope this will help you.


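library(rvest)          # read_html(), html_nodes(), html_text()
library(dplyr)          # tibble(), filter(), count(), %>%
library(tidytext)       # unnest_tokens()
library(koRpus)         # treetag()
library(koRpus.lang.en) # English language support for koRpus

read_html("https://www.cs.columbia.edu/~hgs/audio/harvard.html") %>% 
  html_nodes("h2") %>% 
  html_text() -> so_list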

read_html("https://www.cs.columbia.edu/~hgs/audio/harvard.html") %>% 
  html_nodes("li") %>% 
  html_text() -> so_text

# Create a data frame

sodf <- tibble(list_name = rep(so_list, each = 10),
           text = so_text)

# Split sentences into words and keep the words ending with "ing".

unnest_tokens(sodf, input = text, output = word) %>% 
  filter(grepl(x = word, pattern = "ing$")) -> sowords

# Use koRpus package to lemmatize the words in sowords$word.

treetag(sowords$word, treetagger = "manual", format = "obj",
        TT.tknz = FALSE , lang = "en", encoding = "UTF-8",
        TT.options = list(path = "C:\\tree-tagger-windows-3.2\\TreeTagger",
                          preset = "en")) -> out

# Access the data frame and filter the words. It seems that you are looking
# for verbs, so I kept only those and counted each token.

filter(out@TT.res, grepl(x = token, pattern = "ing$") & wclass == "verb") %>% 
  count(token)

# A tibble: 16 x 2
#   token         n
#   <chr>     <int>
# 1 adding        1
# 2 bring         4
# 3 changing      1
# 4 drenching     1
# 5 dying         1
# 6 lodging       1
# 7 making        1
# 8 raging        1
# 9 shipping      1
#10 sing          1
#11 sleeping      2
#12 wading        1
#13 waiting       1
#14 wearing       1
#15 winding       2
#16 working       1
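The count above no longer says which list each word came from. As a rough sketch of how the list could be kept, assuming the sentences are in page order with 10 per list, the following Python groups the "-ing" words under a "List N" label. The function name `ing_words_by_list`, its `list_size` parameter and the demo sentences are assumptions made for illustration, and the suffix test does no part-of-speech filtering.

```python
from collections import defaultdict
import re

def ing_words_by_list(sentences, list_size=10):
    """Map 'List N' to the sorted unique '-ing' words found in its sentences.

    Assumes `sentences` is in page order with `list_size` sentences per list.
    """
    grouped = defaultdict(set)
    for index, sentence in enumerate(sentences):
        list_name = f"List {index // list_size + 1}"
        for word in re.findall(r"[a-z']+", sentence.lower()):
            if word.endswith("ing"):
                grouped[list_name].add(word)
    return {name: sorted(words) for name, words in grouped.items()}

# Hypothetical stand-in data, two sentences per list to keep the demo short.
demo = [
    "The swing hung near the ring.",
    "Rice is often served in round bowls.",
    "They were moving the heavy box.",
    "The juice of lemons makes fine punch.",
]
result = ing_words_by_list(demo, list_size=2)
for name, words in result.items():
    print(f"{name}\t{', '.join(words)}")
```

Lists without any matching word do not appear in the result. With the real data, `sen_List` from the question could be passed in with the default `list_size=10`, provided the CSV rows are still in the original order.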

How did you store the data from the lists (i.e., what does your data.frame look like)? Could you provide an example?

Without seeing this, I suggest you save the data in a list as follows:

COLUMN 1,      COLUMN 2,   COLUMN 3
"List number", "Sentence", "-ING words (as vector)"

I hope this makes sense, let me know if you need more help. I wasn't able to comment on this post unfortunately.
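The three-column layout suggested above could be sketched in Python roughly as follows, with one record per sentence. This assumes the sentences are stored in page order. The name `sentence_records`, the `list_size` parameter and the demo sentences are hypothetical, and the CSV is written to an in-memory buffer purely for demonstration.

```python
import csv
import io
import re

def sentence_records(sentences, list_size=10):
    """Return one (list_number, sentence, ing_words) tuple per sentence."""
    records = []
    for index, sentence in enumerate(sentences):
        words = re.findall(r"[a-z']+", sentence.lower())
        ing = [w for w in words if w.endswith("ing")]
        records.append((index // list_size + 1, sentence, ing))
    return records

# Hypothetical stand-in data, two sentences per list to keep the demo short.
demo = ["Bring the ring.", "Oak is strong.", "He was sleeping."]
records = sentence_records(demo, list_size=2)

# Write the records as CSV; the "-ing" words are joined with spaces.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["List number", "Sentence", "-ING words"])
for number, sentence, ing in records:
    writer.writerow([number, sentence, " ".join(ing)])
print(buffer.getvalue())
```

Keeping the sentence next to its words makes it easy to check by eye whether a match is really a verb.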