Introduction

Why code linting?

The importance of keeping code base aligned to some standards is unquestionable. The main idea is to be consistent and rules themselves are secondary.

It doesn’t really matter if you follow the official language guidance or it is you the one who defines the convention.

Some of the obvious benefits of bringing in style checks are:

  • Defining one way to solve similar small pieces of tasks that could potentially be written differently. Example: defining a list in python: list(f(x) for x in foo) and [f(x) for x in foo] mean the same thing, why not just stick to one approach?
  • Forgetting about “tabs or spaces” discussions once and for all. Since now your style checks define that.
  • Makes code look prettier. No more 200 character long lines, unused imports, commented code... I can go on and on, but you get the point, right?
  • There is a class of errors that potentially could be caught with static checkers. And catching them in such a way is relatively “free”.

Drawbacks? Well, broken CI builds or failing VCS hooks are pretty annoying but mostly that is the only price to pay.

Code linting in Python at Zaleos

There are a bunch of tools you can use for that purpose in Python, some of them serve for catching errors, like PyFlake, others like pycodestyle and Pylint check code style. Fortunately, there are also “combiners” that bring in the best of those modules with one dependency. We stopped our search on Flake8 which combines PyFlake, pycodestyle and McCabe. McCabe keeps an eye on the complexity of your structures.

Let’s start from an easier case, imagine, we are starting a new Python project. The simplest version of a script that runs this linter would look like this:

PROJECT_DIR=$1
# Creating temporary directory and new virtual environment not to mess things up
TMP_DIR=$(mktemp -d)
virtualenv -p python3 ${TMP_DIR} > /dev/null
source ${TMP_DIR}/bin/activate > /dev/null
# installing flake
pip install flake8 > /dev/null
flake8 ${PROJECT_DIR}
deactivate
rm -rf ${TMP_DIR}
echo "Finished linting"

You can check the documentation in order to find rules that you consider overkill in order to ignore them. It is also common to slightly increase the maximum line length when all developers work under the same condition and, let’s say, they have 27’ monitors. In our case the final flake8 execution command looks like this:

flake8 --ignore E402,E731,E501 --max-line-length=99 ${PROJECT_DIR}

Existing python projects

In the real world, people often face big existing projects that have existed without linters for years. So did we. The good news is that there is an alternative to fixing 5000 linter errors at once and drowning in a sea of merge conflicts: linting diffs. It means that linting will be applied only to lines that were updated in your branch/commit. Using this method, the code base will be fixed iteratively and in a more natural way.

There are two ways to do that with Flake8:

  1. Using a flake8-diff wrapper module instead of bare flake8. It is a good tool that gives you an interface for applying Flake8 to a diff between commits or branches and hides VCS internals.
  2. Passing diff directly to Flake8 with the --diff flag - it's a little bit more of a “dirty” approach, but more flexible.

The main advantage of flake8 against flake8-diff is that the former supports plugins and the latter doesn't, that's why we have chosen the second approach.

Before showing the final script we need to mention the plugins we have applied. If you install flake8-awesome instead of the basic flake8 from PyPI you will get all the regular flake8 checks plus a bunch of plugins:

  • Flake8-breakpoint - highlighting forgotten debug breakpoints.
  • Flake8-builtins - check for Python builtins being used as variables or parameters.
  • Flake8-comprehensions - unifies the way to write list/set/dict comprehensions.
  • Flake8-eradicate - detects commented out (or so called "dead") code.
  • Flake8-if-expr - forbids "[on_true] if [expression] else [on_false]" syntax.
  • Flake8-isort - allows configuring the way imports must be sorted.
  • Flake8-logging-format - make sure non-constant data is always passed via the extra keyword and, therefore, to never use format strings, concatenation, or other similar techniques to construct a log string.
  • Flake8-print - simply make sure there are no print statements.
  • Flake8-pytest - Check for uses of Django-style assert-statements in tests. So no more self.assertEqual(a, b), but instead assert a == b.
  • Flake8-pytest-style - checking common style issues or inconsistencies with pytest-based tests.
  • Flake8-return - a good list of rules that make returns more transparent preventing bugs.
  • Pep8-naming - adds some string naming rules.

The second iteration of our script looks something like this:

PROJECT_DIR=$1
# Now we need to act from a project source directory to have VCS in place
pushd ${PROJECT_DIR}

# We never push directly to develop and we want to run linter only on feature branches
# Skip when on master branch as well
curr_branch_name=$(git rev-parse --abbrev-ref HEAD)
if [ "${curr_branch_name}" = "origin/develop" ] || [ "${curr_branch_name}" = "origin/master" ]; then
     echo "Skipping linter for ${curr_branch_name}."
     exit 0
fi

# Creating temporary directory and new virtual environment not to mess things up
TMP_DIR=$(mktemp -d)
virtualenv -p python3 ${TMP_DIR} > /dev/null
source ${TMP_DIR}/bin/activate > /dev/null

# Install flake8-awesome which includes flake8 and a bunch of extension plugins
pip install flake8-awesome > /dev/null

# Use the diff between current branch and develop branch and pass it as an input to flake8 --diff
git diff origin/develop -- "*.py" | flake8 --diff --ignore E402,E731,E501 --max-line-length=99

deactivate
rm -rf ${TMP_DIR}
echo "Finished linting"
popd

The proposed solution with feature branches fits well with our strict branching policy, but there is clearly some room for improvement regarding the method used to obtain the diff.

Custom flake8 checks

Apart from the common code checks (like the length of the lines or other code formatting checks), we are incorporating our custom checks to enforce some additional good practices in python, like avoiding the declaration of global variables, warning for specific words in the code comments (like TODO and its derivative and imaginative words) or checking for duplicated code. Since we didn’t find any flake8 plugins to run those specific checks we decided to write our own, so in this section we are going to explain how we are developing the new flake8 plugins in case you want to create your own plugin or just want to collaborate with us with the plugins we have open to the public.

Create a new flake8 plugin

Flake8 gives you a simple way to create your plugins, you just need a Python class that implements one or more checks and a setup.py file to indicate the entry point of your plugin.

Setup.py

Let’s start seeing an example of one of the plugins we have developed, the flake8-global-variables plugin. In the setup.py file we are using the setuptools Python module, which is used to install most of the Python modules, with flake8 as a requirement and an entry point configuration, which tells flake8 that this is a plugin and how to load it.

import osimport setuptools
...
setuptools.setup(
    ...
    install_requires=[
        "flake8 > 3.0.0",
    ],
    entry_points={
        'flake8.extension': [
            'W = flake8_global_variables.globalvariables:GlobalVariables',
        ],
    },
    ...
)

As you can see, entry_points is a Python dictionary that maps the type of entry point with a list of entry points for that type. Right now flake8 has two types of entry points: extensions, which are used for plugins that add new checks to the process, and reports, which are for plugins that perform extra report handling (formatting, filtering, ...), so in our case, since we are going to do some extra checks on the code, we have used the extension entry point.

Flake8 defines a format for each element of the entry point list, which for the extension entry point is an assignment. The first part of the assignment is a filter for the error codes reported by the plugin. These error codes start with a letter (usually is E for error, W for warning and I for info, but it can start with any letter) and it’s followed by one or more digits. Since the extension entry point is used to do new checks on the analyzed code this filter is mandatory to indicate the type of error codes we are generating, but you can filter it as much as you can. For example, if you report only E100 errors you can set the filter to E100, E10, E1 or E, and all the error codes starting with that filter definitions are going to be passed to flake8.

The second part of the assignment indicates the entry point class that is going to handle the actual checks using the format module:class, considering that the module part should be as you put it in a python import (which is what flake8 does in the end). In our case, our entry point class is GlobalVariables, which is placed in flake8_global_variables/globalvariables.py, that’s why in our configuration we’re indicating flake8_global_variables.globalvariables:GlobalVariables as the entry point.

Main Class

After writing our setup.py file we can start coding our plugin. The first thing to consider when you start implementing the new plugin is the class signature of the entry point because that determines the data you receive in your plugin and how often the plugin is going to be run.

Some of the properties are set once per file for plugins which iterate themselves over the data while others are set on each physical or logical line, so if you are planning to iterate over the data inside your plugin you will probably need only some of the properties that are generated for the file, like filename or total_lines, but if you want your plugin to be called line by line you will need some of the parameters thought for that, like indent_level or line_number.

You can also combine both types of properties, in which case the plugin is going to be called line by line but the file properties are going to be the same and will be calculated at the beginning of the file parsing. For the full list of parameters you can use in your __init__ function you can check the flake8 documentation.

In our example, we are going to parse the file content ourselves, so we have defined the following __init__ function:

class GlobalVariables(object):
    name = "global-variables"
    version = __version__

    def __init__(self, tree, filename, lines=None):
        self.filename = filename
        self.tree = tree
        self.lines = lines

As you can see, we are receiving three parameters, the syntax tree, which contains the code already parsed in an ast object, the filename and the lines, which is the file content without any parsing done. We have also defined a couple of variable classes to indicate to flake8 both the name of the plugin and its version to make it easier to show the proper information in our checks.

When parsing the file content inside your plugin you also have a couple of options, parsing line by line and do the checks by yourself or use the abstract syntax tree module. An Abstract Syntax Tree is a Python module that helps Python applications process trees of the Python abstract syntax grammar, providing a common interface to define visitors over the code to analyze them or making changes on the code before generating it again.

In this specific module we are using ast, but for another plugin that we are developing, flake8-banned-words, we are analyzing the content parsing line by line, because ast doesn’t let you visit the inline comments, so you can use one or the other depending on your needs when you develop the plugin.

So, to do our plugin's custom checking  we are receiving both the ast tree and the file name to check. Also, in the case the tree has not been passed properly to our plugin we can generate it ourselves:

    def load_file(self):
        if self.filename in ("stdin", "-", None):
            self.filename = "stdin"
            self.lines = pycodestyle.stdin_get_value().splitlines(True)
        else:
            self.lines = pycodestyle.readlines(self.filename)

        if self.tree is None:
            self.tree = ast.parse(''.join(self.lines))

NodeVisitor

Once we have the tree we can use the Visitor class to check the parsed tree line by line, so we are going to explain this class before using it.

The ast module provides two main classes to process a tree, NodeVisitor and NodeTransformer. The first one is suited to analyze the tree, that’s why it’s the one we are using as a superclass to implement our visitor, and the second one is used when you want to make changes in the code tree so we won't use it.

import ast

from collections import namedtuple

Error = namedtuple('Error', ['lineno', 'code', 'message'])


class Visitor(ast.NodeVisitor):
    def __init__(self):
        self.errors = []

    def visit_Assign(self, node):
        if node.col_offset == 0:
            for target in node.targets:
                # We consider that the variables in upper case are constants,
                # not global variables.
                if not target.id.isupper():
                    self.errors.append(Error(node.lineno, 'W001',
                                             'Global variable {0} defined'.format(target.id)))

        super(Visitor, self).generic_visit(node)

    def visit_Global(self, node):
        for name in node.names:
            # We consider that the variables in upper case are constants,
            # not global variables.
            if not name.isupper():
                self.errors.append(Error(node.lineno, 'W002',
                                         'Global variable {0} used'.format(name)))
        super(Visitor, self).generic_visit(node)

In the constructor function you can also define some additional parameters to be used during the checking (like words to check, optional checkings, …).

NodeVisitor defines several functions that are called in the parsing steps that you can override to do your extra checks. These functions are usually called visit_Element, where Element is the element you want to analyze. For example, if you want to check the imports of the source code you can override the visit_Import function and ast will only call your function if the line to analyze is an Import line. You can check all the elements ast visits in the ast documentation.

If you don’t know which element you want to visit or you want to do some checks in more than one type of element you can always override the generic_visit function, which is called for all the elements parsed.

Any visit function receives the node that matches that specific element and depending on the kind of element it can store different context information, but it usually has the col_offset and lineno variables. It can also have the parent variable to give you more information about the class/function that belongs or if it's inside an if or a for loop. Again, for more information on the context of the element visited you can check the ast module documentation to have more information.

In our example we are just interested in the assignment lines, to know if the variable assigned is global and all the statements with the global keyword that indicates that a class or a function is using a global variable, so we are just overriding visit_Assign and visit_Global. Inside those functions, we are generating and storing the errors found in the self.errors array using a tuple to be able to recover the errors later in the GlobalVariables class. The final step for each visit function is to call the generic_visit function of the superclass to continue with the visit over the rest of the tree.

Once we have the Visitor class defined we can use it in the GlobalVariables class:

    def error(self, error):
        return (
            error.lineno,
            0,
            "{0} {1}".format(error.code, error.message),
            GlobalVariables,
        )

    def check_variables(self):
        if not self.tree or not self.lines:
            self.load_file()

        visitor = Visitor()
        visitor.visit(self.tree)

        for err in visitor.errors:
            yield self.error(err)

As you can see, we have defined a check_variables function that initializes the tree if it’s not already initialized, create the visitor object and runs the visitor over the tree and it finally returns all the errors found during the visit. There are two important aspects of returning the errors in flake8:

  • The error needs to be a tuple to be parsed by flake8 that contains the line number, the column offset, the error message and the check that fails (which, in our simple plugin, is just one, but you can define multiple checks inside the same plugin).
  • The errors should use yield to be processed properly by flake8.

That being said , the last piece of code to show from our plugin is how to run it calling our check_variables function.

    def run(self):
        for error in self.check_variables():
            yield error

Conclusions

As we said at the beginning, code linting has several important benefits, like catching bugs at an early stage or having a more maintainable code (no more dead code or commented code, common code practices, ...), and implementing it in Python is easy, so you don't have any excuse not to do it. The "hard" part is to decide between the different tools you have available in Python and which linting checks you want to enforce. We hope that by sharing our experience you can make a better choice and decide once and for all to start checking your code frequently.

References

In this section you can find several links we have used during our work with Flake8.