Creating Tests For This Website: Links

Posted by Ryan Himmelwright on Mon, Mar 30, 2020
Tags: website, hugo, dev, python, testing
PNC Arena, Raleigh NC

In my previous two posts, I created a test framework for my website and automated it using Jenkins. But we can do better. One of the most annoying things about maintaining (or even reading) anything on the internet is broken links. While I cannot control the availability of content outside the website, I can choose to remove links once they break. So, in this post, we will add tests to ensure that the links in our posts are working. Well, at least the markdown ones.

What to Test

Google 404 Error Page

For this test set, we will scan the content files of all of the posts and grab every markdown link defined in them. With the links known, we will then make a request to each one to check that it is available. If we can connect, the test passes. If not (e.g., we get a 404), it fails.

Adding Utility Functions

Before we can write the test function, we first need to add a few utility functions. These will allow us to get the posts' file paths, grab their content, and extract all the markdown links from that content.

get_file_paths

First, let's define a new helper function, get_file_paths:

from os import listdir, path


def get_file_paths(src, extension=None):
    """Collects the paths of all files in a directory."""
    file_list = []
    root_path = path.expanduser(src)
    for file in listdir(root_path):
        # If an extension is provided, only keep files with that extension
        if extension:
            if file.endswith(extension):
                file_list.append(path.join(root_path, file))
        # Otherwise, add everything
        else:
            file_list.append(path.join(root_path, file))
    return file_list

When provided a directory path (src), this function returns a list of all the file paths in that directory. Optionally, the extension parameter can be supplied to only return files with that extension (in our case, .md). This will be used to grab the paths of all of the website's post source files.
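For example, pointing it at a hypothetical Hugo content directory (the path and output below are made up for illustration):

post_files = get_file_paths("~/website/content/post", extension=".md")
# -> ['/home/user/website/content/post/post1.md',
#     '/home/user/website/content/post/post2.md', ...]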

get_file_content

Now let's define get_file_content. This function takes the file list generated by get_file_paths, grabs the content from each of those files, and returns a dictionary of all the data.

def get_file_content(file_list):
    """Grabs all the content from a list of file paths."""
    content_all_files = {}
    for file in file_list:
        # Use a context manager so the file handle is always closed
        with open(file, "r") as f:
            file_content = f.read()
        content_all_files[path.basename(file)] = file_content
    return content_all_files

The returned dictionary uses the filename as the key and the file's content as the value. For example:

{
  'post1.md': 'This is the text of post1.',
  'post2.md': "This is the text of post2. Basically the same ol' stuff."
}

get_md_links

Lastly, let's define get_md_links. This function takes the content dictionary returned by get_file_content and uses some regular expression magic to match the markdown links:

import re


def get_md_links(content_dict, regex=r"\[.*?\]\((.*?)\)"):
    """Parses the dictionary of content strings, and pulls out the url of any links."""
    p = re.compile(regex)
    all_links = []
    for file in content_dict:
        content = content_dict[file].replace("\n", "")
        match_iter = p.finditer(content)
        for match in match_iter:
            # The regex can't properly match urls with parens in them, so skip those.
            if "(" not in match.group(1):
                all_links.append(match.group(1))
    return all_links

First, the function compiles the regular expression defined by the regex parameter. Next, it loops through all the data in the content dictionary, strips the newline characters, and then grabs all the regex matches.

Unfortunately, the regex can't properly match markdown-formatted urls with parentheses in them, so we have to check whether each match contains a (. If it does, the url is thrown away, because we cannot be sure we matched the full url. If there are no parentheses, the url is added to our saved list. After parsing all the values of the content dictionary, a list of the matched urls is returned.
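To make that concrete, here is roughly what the function returns for a small, made-up content dictionary:

content = {
    'post1.md': 'Check out [some post](/post/some-post/) '
                'and the [pytest docs](https://docs.pytest.org/).'
}
get_md_links(content)
# -> ['/post/some-post/', 'https://docs.pytest.org/']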

Adding to conftest.py

With our new utility functions defined, we can next add a new fixture (and its helper function) to the conftest.py file. Let's start with the fixture's helper function, post_md_links():

def post_md_links():
    """Returns a de-duplicated list of all the md links found in the posts."""
    all_post_files = get_file_paths(POST_DIR)
    all_post_contents = get_file_content(all_post_files)
    all_post_md_links = get_md_links(all_post_contents)
    # Return a de-duplicated list
    return list(set(all_post_md_links))

This function uses the utility functions we just wrote to extract all of the markdown links from the post files found at the location defined by the POST_DIR constant. It then returns a de-duplicated list of all the links.
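Since the same link often shows up in more than one post, the list(set(...)) pass collapses the duplicates, so each url only gets requested once per test run. Roughly (with made-up values):

links = ['https://example.com/', '/post/some-post/', 'https://example.com/']
list(set(links))
# -> ['/post/some-post/', 'https://example.com/']  (order is not guaranteed)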

Now, we can define the fixture, post_md_link:

@pytest.fixture(params=post_md_links())
def post_md_link(request):
    """Returns a single md link found in one of the posts."""
    return request.param

Similar to the fixtures in the pages tests, this one will allow tests to map across all the links found in the markdown pages, so a test will run for each link.
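Because the fixture is parametrized, pytest generates a separate test for every collected link. Running with -v, the report ends up looking something like this (the file name and links here are hypothetical):

test_links.py::test_md_links[https://example.com/] PASSED
test_links.py::test_md_links[/post/some-post/] PASSED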

Finally, time to write the one and only test function in this post:

def test_md_links(post_md_link):
    """Checks that the markdown links are not broken."""
    if post_md_link.startswith("http") or post_md_link.startswith("https"):
        url = post_md_link
    else:
        url = BASE_URL + post_md_link.lower()
    response = requests.get(url)
    assert response.status_code != 404, f"The link {post_md_link} is not found."
    assert response.status_code != 403, f"The link {post_md_link} is forbidden."

Because I link to both internal and external pages in my posts, I have to prep my urls a bit. So, I first check if the link starts with http (which also matches ones starting with https). If it does, we can leave the link as is. If it doesn't, we can assume the link is an internal one (e.g., /post/creating-website-tests-links/), and we need to prepend the BASE_URL constant to it.
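For example, assuming a BASE_URL of "https://example-site.com" (the real value is just the site's root url), the prep works out like this:

BASE_URL = "https://example-site.com"  # assumed value, for illustration only

# External links are left alone:
"https://docs.pytest.org/"              # -> "https://docs.pytest.org/"

# Internal links are lower-cased and prefixed with BASE_URL:
"/Post/Creating-Website-Tests-Links/"   # -> "https://example-site.com/post/creating-website-tests-links/"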

With a proper url, we can use requests.get() to attempt to retrieve a response code from the page. If we get a response, I then assert that the status_code is not 404 or 403.

Side Note: I started by asserting that each link returned a 200 status, but quickly learned that it was a bad idea, because I was testing mostly external links. I never got all the tests to pass, because they would often return odd 500-level errors for issues that, quite frankly, don't matter to me. For example, one site kept returning a 500-level error, I think because its servers were 'under a slightly higher load'… but when I went to the link, the page loaded fine.

In the end, I decided I wasn't trying to test the issues the websites I linked to were having, but instead just wanted to make sure that my links weren't broken. So, I now just ensure that I'm not getting 404s or 403s, and I'm happy with that.
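If the list of disallowed codes grows over time, one way to keep the test easy to extend (just a sketch, not what the test above currently does) would be to collect the codes in a single set:

# Sketch: keep the "broken" status codes in one place, so the list is easy to
# extend later without adding more assert lines.
BROKEN_STATUSES = {403, 404}

assert response.status_code not in BROKEN_STATUSES, (
    f"The link {post_md_link} returned {response.status_code}."
)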

Limitations

While I am very happy with the coverage these tests provide, they do have some limitations to keep in mind:

  • They cannot match urls that contain parentheses.
  • Currently, they only check that pages do not return 403 and 404 errors. This means I could possibly still have broken links that fail with other status codes. I plan to expand this assert list in the future to cover more cases.
  • I'm currently only testing markdown links. This doesn't grab any HTML links I have in my posts.
    • On a similar note, because I link most of my images with HTML, it also isn't testing whether my images are broken.
    • I'd like to add tests for both of these issues eventually, but decided testing the markdown links was the best place to start.
  • Sometimes tests fail because a site is down. No biggie. I usually just wait a bit and run the tests again before deciding to remove the link.

Conclusion

Passing tests, including the new markdown link tests

That's it. By adding a few easy helper functions, a new fixture, and a single test function, we've expanded my test suite from 70 to over 420 tests (and growing).

More important than the number of tests is what the results tell us. A failing test tells me that one of my markdown links might be broken. These tests have already been beneficial to me, as I ended up updating or removing roughly 100 bad links from my archived posts while implementing them. So, I'd say it was worthwhile!

Next Post: Creating Tests For This Website: Docker Jenkins Nodes
Prev Post: Creating Tests For This Website: CI