0 votes

I imported 720 sentences from this website (https://www.cs.columbia.edu/~hgs/audio/harvard.html). There are 72 lists, each containing 10 sentences, and I saved them in an appropriate structure. I did those steps in R. The code is immediately below.

#Q.1a
library(xml2)
library(rvest)
url <- 'https://www.cs.columbia.edu/~hgs/audio/harvard.html'
# scrape the 720 sentences (the <li> elements on the page)
sentences <- read_html(url) %>%
  html_nodes("li") %>%
  html_text()
# scrape the 72 list headers (the <h2> elements)
headers <- read_html(url) %>%
  html_nodes("h2") %>%
  html_text()

#Q.1b
harvardList <- list()
sentenceList <- list()
n <- 1

for(sentence in sentences){
  sentenceList <- c(sentenceList, sentence)
  print(sentence)
  if(length(sentenceList) == 10) { #if we have 10 sentences
    harvardList[[headers[n]]] <- sentenceList #append these 10 sentences under the header of the list they came from
    sentenceList <- list() #empty the temporary list for the next batch of 10
    n <- n+1 #move on to the next list name
  }
}

#Q.1c
# split the sentence vector into 72 groups of 10
sentences1 <- split(sentences, ceiling(seq_along(sentences)/10))
getwd()
setwd("/Users/juliayudkovicz/Documents/Homework 4 Datascience")
# stack the groups into a single one-column data frame and save it
sentences.df <- do.call("rbind", lapply(sentences1, as.data.frame))
names(sentences.df)[1] <- "Sentences"
write.csv(sentences.df, file = "sentences1.csv", row.names = FALSE)
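
As a quick sanity check before writing the CSV (a small sketch, not part of the original code), the scraped objects should line up as 72 lists of 10 sentences:

# sketch: expect 72 headers and 720 sentences (10 per list)
stopifnot(length(headers) == 72,
          length(sentences) == 72 * 10)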

Then, in Python, I computed a list of all the words ending in "ing" and their frequency, i.e., how many times each appeared across all 72 lists.

path="/Users/juliayudkovicz/Documents/Homework 4 Datascience"
os.chdir(path)
cwd1 = os.getcwd()
print(cwd1)

import pandas as pd
df = pd.read_csv(r'/Users/juliayudkovicz/Documents/Homework 4 Datascience/sentences1.csv', sep='\t', engine='python')
print(df)
df['Sentences'] = df['Sentences'].str.replace(".", "")
print(df)
sen_List = df['Sentences'].values.tolist()
print(sen_List)

# collect every word that ends in "ing"
ingWordList = []
for line in sen_List:
    for word in line.split():
        if word.endswith('ing'):
            ingWordList.append(word)

# count occurrences of each -ing word, case-insensitively
ingWordCountDictionary = {}

for word in ingWordList:
    word = word.replace('"', "")
    word = word.lower()
    if word in ingWordCountDictionary:
        ingWordCountDictionary[word] = ingWordCountDictionary[word] + 1
    else:
        ingWordCountDictionary[word] = 1

print(ingWordCountDictionary)

# write one "word, count" pair per line
with open("ingWordCountDictionary.txt", "w") as f:
    for key, value in ingWordCountDictionary.items():
        f.write("%s, %s\n" % (key, value))

Now, I am being asked to create a dataset which shows which list (1 through 72) each "ing" word is derived from. THIS IS WHAT I DON'T KNOW HOW TO DO. I obviously know they are a subset of the huge 72-item list, but how do I figure out which list each word came from?

The expected output should look something like this:

[List Number] [-ing Word]
List 1        swing, ring, etc.
List 2        moving

and so forth.
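
For reference, the list index can be recovered from sentence position alone, because the sentences were scraped in page order, 10 per list. Here is a minimal R sketch reusing the ceiling(seq_along(...)/10) idea from Q.1c above (the object names are illustrative, not from the original code):

# sketch: tag each sentence with its list number, then collect its -ing words
listNumber <- ceiling(seq_along(sentences) / 10)
ingPerSentence <- lapply(sentences, function(s) {
  words <- unlist(strsplit(tolower(s), "[^a-z']+"))
  words[grepl("ing$", words)]
})
ing.df <- data.frame(list = rep(listNumber, lengths(ingPerSentence)),
                     word = unlist(ingPerSentence))
# one row per list, with its -ing words collapsed into a single string
aggregate(word ~ list, data = ing.df, FUN = paste, collapse = ", ")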

Please don't vandalize your post by removing its content. See meta.stackoverflow.com/q/275864/1288408 for further information. – Modus Tollens

Please don't make more work for other people by vandalizing your posts. By posting on the Stack Exchange network, you've granted a non-revocable right, under the CC BY-SA 4.0 license, for Stack Exchange to distribute that content (i.e. regardless of your future choices). By Stack Exchange policy, the non-vandalized version of the post is the one which is distributed. Thus, any vandalism will be reverted. If you want to know more about deleting a post, please see: How does deleting work? – Machavity♦

Please read up on how to use Stack Overflow. You will have a better experience. Two people here posted answers, as volunteers in their free time. Trying to get this question deleted invalidates their work. Saying that nobody is helping and that it is useless is not true. Your answer pointing to another post is not what SO considers a valid answer. Please try to value the community by being open to learning how to use Stack Overflow. Thanks. – Modus Tollens

I just saw that you even tried to delete an answer by changing its text. That's a big no-go. Value the time of users trying to help. – Modus Tollens

lol then why was it the only answer that got the output required for my assignment? – help

2 Answers

1 vote

Here is one way for you. As far as I can see from the expected result, you seem to want to get verbs in progressive form (V-ing). (I do not understand why you have king in your result; if you have king, you should have spring here as well, for example.) If you need to consider lexical classes, I think you want to use the koRpus package. If not, you can use the textstem package, for example.

First, I scraped the link and created a data frame. Then I split the sentences into words using unnest_tokens() from the tidytext package and subsetted the words ending in 'ing'. Then I used treetag() from the koRpus package; you need to install TreeTagger yourself before using the package. Finally, I counted how many times these verbs in progressive form appear in the data set. I hope this will help you.

library(tidyverse)
library(rvest)
library(tidytext)
library(koRpus)

read_html("https://www.cs.columbia.edu/~hgs/audio/harvard.html") %>% 
  html_nodes("h2") %>% 
  html_text() -> so_list

read_html("https://www.cs.columbia.edu/~hgs/audio/harvard.html") %>% 
  html_nodes("li") %>% 
  html_text() -> so_text


# Create a data frame

sodf <- tibble(list_name = rep(so_list, each = 10),
               text = so_text)

# Split sentences into words and get words ending with ING.

unnest_tokens(sodf, input = text, output = word) %>% 
  filter(grepl(x = word, pattern = "ing$")) -> sowords

# Use koRpus package to lemmatize the words in sowords$word.

treetag(sowords$word, treetagger = "manual", format = "obj",
        TT.tknz = FALSE , lang = "en", encoding = "UTF-8",
        TT.options = list(path = "C:\\tree-tagger-windows-3.2\\TreeTagger",
                          preset = "en")) -> out

# Access the data frame and filter the words. It seems that you are looking
# for verbs, so I did that here. out@TT.res holds the tagged tokens.

filter(out@TT.res, grepl(x = token, pattern = "ing$") & wclass == "verb") %>% 
  count(token)

# A tibble: 16 x 2
#   token         n
#   <chr>     <int>
# 1 adding        1
# 2 bring         4
# 3 changing      1
# 4 drenching     1
# 5 dying         1
# 6 lodging       1
# 7 making        1
# 8 raging        1
# 9 shipping      1
#10 sing          1
#11 sleeping      2
#12 wading        1
#13 waiting       1
#14 wearing       1
#15 winding       2
#16 working       1
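
If you also want the [List Number] [-ing word] layout from the question, the sowords data frame above already carries the list name; a small sketch (skipping the TreeTagger step, so without the verb filtering):

sowords %>%
  group_by(list_name) %>%
  summarise(ing_words = paste(unique(word), collapse = ", "))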
0 votes

How did you store the data from the lists (i.e., what does your data.frame look like)? Could you provide an example?

Without seeing this, I suggest you save the data in a list as follows:

COLUMN 1,       COLUMN 2,    COLUMN 3
"List number",  "Sentence",  "-ING words (as vector)"

I hope this makes sense; let me know if you need more help. Unfortunately, I wasn't able to comment on this post.
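
For illustration, a minimal sketch of that layout in R; the column names and the list-column holding the -ING words are assumptions, not from the original post:

library(rvest)

sentences <- read_html("https://www.cs.columbia.edu/~hgs/audio/harvard.html") %>%
  html_nodes("li") %>%
  html_text()

# COLUMN 1: list number, COLUMN 2: sentence, COLUMN 3: -ING words (as vector)
df <- data.frame(list_number = ceiling(seq_along(sentences) / 10),
                 sentence = sentences,
                 stringsAsFactors = FALSE)
df$ing_words <- lapply(strsplit(tolower(df$sentence), "[^a-z']+"),
                       function(w) w[grepl("ing$", w)])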