Code:
from collections import defaultdict
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Map chapter title -> list of chapter texts seen, used to detect duplicates.
txt_dict = defaultdict(list)

def extract_text(link, level=0, MAX_LEVEL=1000):
    # Bail out if we've recursed too deep.
    if level == MAX_LEVEL:
        return
    with urlopen(link) as response:
        soup = BeautifulSoup(response, 'html.parser')
    title = soup.title.contents[0]

    # Collect the chapter body text.
    txt = ''
    for c in soup.find_all('div', {'class': "chapter-content"}):
        for x in c.strings:
            txt += x + "\n"
    print(level, title)

    # Skip chapters we've already seen (same title AND same text).
    if title in txt_dict:
        for _t in txt_dict[title]:
            if _t == txt:
                print("\tDUPLICATE")
                return
    txt_dict[title].append(txt)

    wordcount = len(txt.split())
    print("\t", wordcount)

    # Recurse into each choice link, skipping the login link.
    # (Use a separate name for child links so we don't clobber this
    # chapter's own `link`, which is returned below.)
    children = []
    for c in soup.find_all('div', {'class': "question-content"}):
        for x in c.find_all('a'):
            child_link = x.get('href', '/')
            if child_link != 'https://chyoa.com/auth/login':
                children.append(extract_text(child_link, level + 1))

    return {'title': title, 'text': txt, 'link': link,
            'wordcount': wordcount, 'children': children}

def sum_wordcounts(results):
    # Duplicate or depth-limited chapters come back as None; count them as zero.
    if results is None:
        return 0
    return results['wordcount'] + sum(sum_wordcounts(r) for r in results['children'])

Run as follows (on the root link of your story):

Code:
results = extract_text('https://chyoa.com/story/Vampire-Newborn.20536')

This walks the chapters and builds the tree structure of the story. Then:

Code:
print("wordcount = ", sum_wordcounts(results))
Funny, I also wrote a quickie Python script to track hits by hour. I was curious when the story was being read, and found 7-10 PST is my prime time. I take a snapshot of the side panel and just throw it into a file; it could be far more sophisticated, but I didn't want to spend much time on it.
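For anyone curious, the hour-tracking idea can be sketched roughly like this. Big assumption up front: I'm pretending the snapshot file holds one `ISO-timestamp cumulative-hits` pair per line; the real side-panel snapshot format will differ, so adapt the parsing to whatever you actually save.

Code:
from collections import Counter
from datetime import datetime

def hits_per_hour(path):
    # Hypothetical file format, one sample per line:
    #   2023-05-01T19:05:00 1234
    # where the number is the cumulative hit count at snapshot time.
    samples = []
    with open(path) as f:
        for line in f:
            ts, hits = line.split()
            samples.append((datetime.fromisoformat(ts), int(hits)))
    samples.sort()

    # Difference consecutive cumulative counts to get hits per interval,
    # then bucket each interval by the hour of day it started in.
    by_hour = Counter()
    for (t0, h0), (t1, h1) in zip(samples, samples[1:]):
        by_hour[t0.hour] += max(h1 - h0, 0)
    return by_hour

Calling `hits_per_hour('snapshots.txt')` gives a Counter keyed by hour (0-23) that you can print or plot to find your own prime time.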