extract method can only be called once per goose instance #191

@ghost

Description

Prior to 1.0.24, the following code worked:

from goose import Goose
g = Goose()
article_1 = g.extract(url=...)
article_2 = g.extract(url=...)

The extract() method could be called multiple times on a single Goose instance.

This no longer seems to work, apparently because of the fix for #161. The second call to extract() raises the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../package/lib/python2.7/site-packages/goose_extractor-1.0.24-py2.7.egg/goose/__init__.py", line 56, in extract
    return self.crawl(cc)
  File ".../package/lib/python2.7/site-packages/goose_extractor-1.0.24-py2.7.egg/goose/__init__.py", line 63, in crawl
    parsers.remove(self.config.parser_class)
ValueError: list.remove(x): x not in list

The lxml parser is used by default. On the first call to extract(), parsers changes from ['lxml', 'soup'] to ['soup']. The second call then fails when it tries to remove 'lxml' from ['soup']. See https://github.com/grangier/python-goose/blob/develop/goose/__init__.py#L62
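
The behaviour described above can be reproduced without Goose itself. The following is only a minimal sketch: it assumes that crawl() removes the parser from a list shared with the config object rather than from a copy. The class names `Config`, `BuggyExtractor`, and `FixedExtractor` and the attribute name `available_parsers` are hypothetical stand-ins for illustration, not the library's actual code.

```python
class Config(object):
    """Hypothetical stand-in for the Goose configuration object."""
    def __init__(self):
        self.available_parsers = ['lxml', 'soup']  # assumed attribute name
        self.parser_class = 'lxml'


class BuggyExtractor(object):
    """Removes the active parser from the shared list (assumed cause)."""
    def __init__(self):
        self.config = Config()

    def crawl(self):
        # Alias of the config list, not a copy: remove() mutates the config.
        parsers = self.config.available_parsers
        parsers.remove(self.config.parser_class)
        return parsers


class FixedExtractor(BuggyExtractor):
    """One possible fix: work on a copy so each call starts from the full list."""
    def crawl(self):
        parsers = list(self.config.available_parsers)
        parsers.remove(self.config.parser_class)
        return parsers


if __name__ == '__main__':
    buggy = BuggyExtractor()
    buggy.crawl()                      # first call succeeds
    try:
        buggy.crawl()                  # second call fails
    except ValueError as exc:
        print('second call failed: %s' % exc)

    fixed = FixedExtractor()
    fixed.crawl()
    fixed.crawl()                      # repeated calls succeed
```

If the cause is as assumed, copying the list inside crawl() would restore the old behaviour. Until then, creating a new Goose instance for each extract() call should avoid the error, since the report only concerns repeated calls on the same instance.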
