Wrote a super clunky Python script to do story wordcounts

Discussion in 'CHYOA General' started by krm2116, Aug 22, 2019.

  1. krm2116

    krm2116 Virgin

    Code:
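    from collections import defaultdict
    from urllib.parse import urljoin
    from urllib.request import urlopen
    
    from bs4 import BeautifulSoup
    
    LOGIN_URL = 'https://chyoa.com/auth/login'
    
    # Chapter texts already seen, keyed by page title, so repeats are skipped.
    txt_dict = defaultdict(list)
    
    def extract_text(link, level=0, MAX_LEVEL=1000):
        """Crawl the chapter at `link` and, recursively, every chapter it links to."""
        if level == MAX_LEVEL:
            return None
    
        with urlopen(link) as response:
            soup = BeautifulSoup(response, 'html.parser')
    
        title = soup.title.contents[0]
    
        # Collect the chapter text, one string fragment per line.
        txt = ''
        for c in soup.find_all('div', {'class': "chapter-content"}):
            for x in c.strings:
                txt += x + "\n"
    
        print(level, title)
        # Same title and same text means this chapter was already counted.
        if txt in txt_dict[title]:
            print("\tDUPLICATE")
            return None
        txt_dict[title].append(txt)
    
        wordcount = len(txt.split())
        print("\t", wordcount)
    
        children = []
        for c in soup.find_all('div', {'class': "question-content"}):
            for x in c.find_all('a'):
                # Use a separate name so the `link` argument is not overwritten,
                # and resolve relative hrefs against the current page.
                href = x.get('href')
                if not href:
                    continue
                child_link = urljoin(link, href)
                if child_link != LOGIN_URL:
                    children.append(extract_text(child_link, level + 1, MAX_LEVEL))
        return {'title': title, 'text': txt, 'link': link, 'wordcount': wordcount, 'children': children}
    
    def sum_wordcounts(results):
        """Total word count of a chapter and everything beneath it."""
        # Duplicate or depth-limited chapters come back as None and count as zero.
        if results is None:
            return 0
        return results['wordcount'] + sum(sum_wordcounts(r) for r in results['children'])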
    
    Run it as follows (on the root link of your story):

    Code:
    results = extract_text('https://chyoa.com/story/Vampire-Newborn.20536')
    
    This works out the chapters and the tree structure of the story. The total word count can then be printed:


    Code:
    print("wordcount = ", sum_wordcounts(results))
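
The returned tree can also be walked to see where the words are, chapter by chapter. This is only a rough sketch: `subtree_wordcount` and `format_tree` are hypothetical helper names, and `sample` is a hand-built stand-in with made-up numbers in the same shape as the dict returned by `extract_text`, so nothing is fetched from the site.

```python
def subtree_wordcount(node):
    """Words in this chapter plus every chapter below it."""
    if node is None:  # duplicates and depth-limited pages come back as None
        return 0
    return node['wordcount'] + sum(subtree_wordcount(c) for c in node['children'])


def format_tree(node, depth=0):
    """Return one indented line per chapter: title, own count, branch total."""
    if node is None:
        return []
    lines = ["{}{}: {} words ({} including branches)".format(
        "  " * depth, node['title'], node['wordcount'], subtree_wordcount(node))]
    for child in node['children']:
        lines.extend(format_tree(child, depth + 1))
    return lines


# Hypothetical data, shaped like the result of extract_text.
sample = {
    'title': 'Intro', 'text': '', 'link': '', 'wordcount': 100,
    'children': [
        {'title': 'A', 'text': '', 'link': '', 'wordcount': 50, 'children': [None]},
        None,
        {'title': 'B', 'text': '', 'link': '', 'wordcount': 25, 'children': []},
    ],
}

print("\n".join(format_tree(sample)))
```

With a real crawl, `format_tree(results)` would be used in place of `format_tree(sample)`.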
    
     
  2. cmc

    cmc Virgin

    Funny, I wrote a quick Python script to track hits by hour. I was curious when the story was being read, and found that 7-10 PST is my prime time. I take a snapshot of the side panel and just throw it into a file. It could be far more sophisticated, but I didn't want to spend much time on it.
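
For anyone wanting to try the same thing, here is a minimal sketch of the counting step only. It assumes the snapshots have already been saved as (timestamp, cumulative view count) pairs. How the number is read off the side panel is not shown, and `hits_by_hour` and the sample numbers are hypothetical, not the actual script.

```python
from collections import Counter
from datetime import datetime


def hits_by_hour(snapshots):
    """Sum the views gained between consecutive snapshots, per hour of day.

    snapshots: list of (ISO timestamp string, cumulative view count),
    oldest first. Each gain is credited to the hour of the later snapshot.
    """
    per_hour = Counter()
    for (_, prev_views), (stamp, views) in zip(snapshots, snapshots[1:]):
        hour = datetime.fromisoformat(stamp).hour
        # Ignore negative differences in case a counter is ever reset.
        per_hour[hour] += max(views - prev_views, 0)
    return per_hour


# Made-up hourly snapshots for illustration.
sample = [
    ("2019-08-22T18:00:00", 1000),
    ("2019-08-22T19:00:00", 1010),
    ("2019-08-22T20:00:00", 1050),
    ("2019-08-22T21:00:00", 1065),
]

counts = hits_by_hour(sample)
for hour, hits in sorted(counts.items()):
    print("{:02d}:00  {}".format(hour, hits))
```

Crediting each gain to the later snapshot's hour is a simplification. It is only meaningful if snapshots are taken about once an hour.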
     
  3. krm2116

    krm2116 Virgin

    If anyone is interested, I'm happy to do a word count for their story.