Thumbtack Engineering

GoSF Meetup: Cryptography, Memory Leaks, and More

Go is a fantastic language for building highly scalable web applications, and the GoSF community is leading the way in understanding how to use this new language effectively and efficiently. Thumbtack has recently been using Go in production with great success.

We were happy to host the recent GoSF meetup "Cryptography, Memory Leaks, and More."

Big thanks to Ken Fromm for organizing the event, and to the great speaker lineup:

  • Kyle Isom, Cloudera: An overview of the cryptography packages in Go
  • Oleg Shaldybin, Apcera: Debugging and Profiling to Find Memory Leaks
  • Quinn Slack, Beyang Liu, Sourcegraph: Building a large-scale web app in Go

More details about the event are at the GoSF Meetup page.

We will see you at a future GoSF meetup soon!

The Butcher's Knife and the Dependency Graph in Python

Let's say you've just decided that dependency injection is a good idea. You may have been reading blog posts about how this transforms your code into this amazing, readable explanation of what it does. Maybe you got annoyed with writing painful setup code for your unit tests. Somehow you've been bitten by the dependency injection bug (or should we say, feature?) and you are setting out to revolutionize your code. Great!

There's one thing, though. As you go down the path of dependency injected perfection – taking your logic out of constructors, making dependencies explicit – you start to find that constructing your objects becomes a pain. Before, when you referenced global dependencies, you might have constructed an object with no parameters. You left it up to the constructor to find everything it needs, like so:

index_view = IndexView()

Now, though, you have to do all the wiring by hand. This takes all the logic you had hidden away in constructors and brings it up to the place where you are constructing the class.

engine = sqlalchemy.create_engine('sqlite:///posts.db')
Base.metadata.create_all(engine)
session_maker = orm.sessionmaker(engine)
post_storage = PostStorage(session_maker)

loader = jinja2.PackageLoader('postchan', 'templates')
jinja_env = jinja2.Environment(loader=loader)
jinja_env.filters['nl2br'] = nl2br
index_template = jinja_env.get_template('posts.html')

index_view = IndexView(post_storage, index_template)

Even though all the messy wiring is exposed, you get the advantages of dependency injection. You can see exactly what your class needs in order to run, and it is very easy to swap out dependencies when testing. Before, you had no idea from looking at it that IndexView depended on a PostStorage instance and a jinja2 template. This code forces you to acknowledge all the pieces needed to create an IndexView.

Now that you've pulled it out of the constructors, this code needs a new home. You need a factory. There are a couple of different ways of making a factory, with different trade-offs. The one that works best depends on your code.

Factory methods

Factory methods are a way to construct smaller, simpler applications. They are good at the scale where your application is a class or two and has a handful of dependencies.

Factory methods are simply methods on a class that construct and return an instance of that class. They take care of creating and injecting all the dependencies that class needs, allowing the constructor to simply save the values it is given. Since factory methods are run before you have an instance of the class, they are class methods.

Let's see what a class with a factory method might look like in Python. The entire example above is too complex for a factory method, so instead we'll construct something else:

from pymongo import MongoClient


class LogConverter:
    def __init__(self, input_file, output_collection):
        self._input_file = input_file
        self._output_collection = output_collection

    @classmethod
    def create_from_args(cls, cli_args):
        """ This is the factory method """
        input_file = open(cli_args.file_name, 'r')
        mongo = MongoClient(cli_args.mongo_uri)
        collection = mongo[cli_args.mongo_collection]
        return cls(input_file, collection)

    def run(self):
        for line in self._input_file:
            ...

If you are familiar with Python, skip this paragraph. Otherwise, here's what you need to know about the above code. self in Python refers to the current instance of the class, like this in most other OO languages. __init__ is the name of the constructor. Prefixing methods and member variables with _ is a convention meaning private (it doesn't actually make anything private, but people know not to access it). All instance methods, including the constructor, take self as the first argument. There are also class methods, which are denoted with the @classmethod decorator. These are given the class, rather than an instance, as the first parameter (cls is the conventional name for the variable). If you call a class, you get a new instance of it (Python doesn't use the new keyword for object instantiation).
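As a small illustration of how a class method receives the class it was called on, here is a sketch. Point, LabeledPoint and from_string are hypothetical names made up for this example and do not appear elsewhere in this post.

```python
class Point:
    def __init__(self, x, y):
        # The constructor only saves what it is given.
        self._x = x
        self._y = y

    @classmethod
    def from_string(cls, text):
        # cls is the class the method was called on, not an instance.
        x, y = (int(part) for part in text.split(','))
        return cls(x, y)

    def coordinates(self):
        return (self._x, self._y)


class LabeledPoint(Point):
    """Hypothetical subclass, used only to show what cls refers to."""


p = Point.from_string('3,4')
q = LabeledPoint.from_string('5,6')
```

Because from_string calls cls(...) rather than Point(...), calling it on the subclass produces a LabeledPoint.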

Creating the application's dependencies in a class method seems awfully similar to creating the dependencies in the constructor, but the difference is that now you can construct an instance of the class with different values for the dependencies. You could inject a mock file and output collection into LogConverter in order to test it.
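Here is a rough sketch of what such a test setup could look like for the LogConverter class from the example above. The body of run() is elided in the original, so the version below is an assumption made for illustration: it stores each line as a document. FakeCollection is likewise a hypothetical stand-in, not part of any library.

```python
import io


class LogConverter:
    def __init__(self, input_file, output_collection):
        self._input_file = input_file
        self._output_collection = output_collection

    def run(self):
        # Assumed behavior: the original post does not show what run() does.
        for line in self._input_file:
            self._output_collection.insert_one({'line': line.rstrip('\n')})


class FakeCollection:
    """Hypothetical stand-in for a collection that records what it is given."""

    def __init__(self):
        self.documents = []

    def insert_one(self, document):
        self.documents.append(document)


# No real file and no database connection are needed for the test.
fake_file = io.StringIO('first error\nsecond error\n')
fake_collection = FakeCollection()

LogConverter(fake_file, fake_collection).run()
```

The test never touches create_from_args(); it goes straight through the constructor, which is exactly what the factory method makes possible.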

The class now also lists its dependencies as arguments to the constructor. Because using the factory method hides the list of dependencies from you, you shouldn't use factory methods for injecting dependencies while your application is running (using the technique to do some processing before construction is still fine, though). Factory methods should be used like a factory – they should be used to do the initial setup of your application. If you just use a factory method at runtime where you would otherwise use a constructor, you haven't gained much in terms of keeping the code clear and truthful.

It is possible to have too much complexity for a factory method. Remember that the code to construct a class is ancillary to what the class actually does – the class is just a convenient place to tack on the method. If the construction of the application involves a significant amount of code (especially if it is broken into multiple functions), then it starts to overwhelm the actual purpose of the class. When that happens, it's time to break out the setup into a factory class.

Factory classes

Factory classes are the next step up from factory methods. They give the construction logic its own place to live, instead of being grafted onto another class, and give you room to split the construction logic into multiple methods. Using a class for the factory also opens up more possibilities, such as having multiple different factories involving the same class.

Let's consider an example where there is too much setup for one function. (In case you are wondering, Configurator is a Pyramid class.)

def make_app():
    engine = sqlalchemy.create_engine('sqlite:///posts.db')
    Base.metadata.create_all(engine)
    session_maker = orm.sessionmaker(engine)
    post_storage = PostStorage(session_maker)

    loader = jinja2.PackageLoader('postchan', 'templates')
    jinja_env = jinja2.Environment(loader=loader)
    jinja_env.filters['nl2br'] = nl2br
    index_template = jinja_env.get_template('posts.html')

    index_view = IndexView(post_storage, index_template)
    add_post_view = AddPostView(post_storage)

    config = Configurator()

    config.add_route('index', '/')
    config.add_view(index_view, route_name='index')

    config.add_route('add-post', '/add-post', request_method='POST')
    config.add_view(add_post_view, route_name='add-post')

    return config.make_wsgi_app()

Since that's too much code for one function, let's split it up a bit. We'll create a function for each view, and one for each stage of dependencies along the way. Since there are multiple functions all related to factory code, let's also put them together in a class.

class AppFactory:
    def get_session_maker(self):
        engine = sqlalchemy.create_engine('sqlite:///posts.db')
        Base.metadata.create_all(engine)
        return orm.sessionmaker(engine)

    def get_post_storage(self, session_maker):
        return PostStorage(session_maker)

    def get_jinja_env(self):
        loader = jinja2.PackageLoader('postchan', 'templates')
        jinja_env = jinja2.Environment(loader=loader)
        jinja_env.filters['nl2br'] = nl2br
        return jinja_env

    def get_index_template(self, jinja_env):
        return jinja_env.get_template('posts.html')

    def get_index_view(self, post_storage, index_template):
        return IndexView(post_storage, index_template)

    def get_add_post_view(self, post_storage):
        return AddPostView(post_storage)

    def make_app(self):
        session_maker = self.get_session_maker()
        post_storage = self.get_post_storage(session_maker)
        jinja_env = self.get_jinja_env()
        index_template = self.get_index_template(jinja_env)

        config = Configurator()

        config.add_route('index', '/')
        config.add_view(
            self.get_index_view(post_storage, index_template),
            route_name='index',
        )

        config.add_route('add-post', '/add-post', request_method='POST')
        config.add_view(
            self.get_add_post_view(post_storage),
            route_name='add-post',
        )

        return config.make_wsgi_app()

This is a little better. Setting up the sqlalchemy session and the jinja environment are now separated out from the app construction itself. Each dependency gets its own function.

(You might argue that some of the functions like get_post_storage() and get_index_template() are too simple and shouldn't be functions, and you're right. The thing is, this is a contrived example, so pretend that those are actually harder to set up.)

Passing previously-constructed values like post_storage around to get_index_view() and get_add_post_view() is a little clumsy, and you still have to manually construct all the dependencies in the right order so that you can pass them along to the next function. These problems only get worse as the application grows in complexity.

To fix this, let's change get_index_view() and get_add_post_view() to call get_post_storage() directly. Assuming that you don't want to create a new instance of PostStorage for every class that uses it, you need to make sure to only construct it once. One way to do this is to create it the first time you call the function, and save the result for subsequent calls:

class AppFactory:
    def __init__(self):
        self._session_maker = None
        self._post_storage = None

    def get_session_maker(self):
        if self._session_maker is None:
            engine = sqlalchemy.create_engine('sqlite:///posts.db')
            Base.metadata.create_all(engine)
            self._session_maker = orm.sessionmaker(engine)
        return self._session_maker

    def get_post_storage(self):
        if self._post_storage is None:
            self._post_storage = PostStorage(self.get_session_maker())
        return self._post_storage

    def get_index_view(self, index_template):
        return IndexView(self.get_post_storage(), index_template)

    def get_add_post_view(self):
        return AddPostView(self.get_post_storage())

    def make_app(self):
        ...

The get_session_maker() and get_post_storage() functions check to see if they have a saved result from last time, and only construct their result if none exists.

It's not 'memorized' spelled wrong

The problem with the approach above is that it adds the same boilerplate to every dependency creation function. We are lazy programmers; we don't like to type the same thing out multiple times. Let's solve this.

Python has a very useful language feature called a decorator. These let you modify an existing function, wrapping it in another function that changes its behavior in some way.

What we want is a decorator that will take a normal function and change it to memoize its result, so that the function is only run at most once. Something like this:

class AppFactory:
    @memoized
    def get_session_maker(self):
        ...
        return orm.sessionmaker(engine)

    @memoized
    def get_post_storage(self):
        return PostStorage(self.get_session_maker())

Sounds nice, right? Here's how to set that up.

import functools

def memoized(fn):
    """ A decorator that memoizes a method.

    The instance the wrapped method is called on must have
    a ._memoized_values attribute that is a dictionary.
    """
    name = fn.__name__
    @functools.wraps(fn)
    def wrapped(self):
        if name not in self._memoized_values:
            self._memoized_values[name] = fn(self)
        return self._memoized_values[name]
    return wrapped

class AppFactory:
    def __init__(self):
        self._memoized_values = {}

    ...

The decorator takes in your function and returns a wrapped version. When you call the decorated function, you actually call the wrapped version. The wrapped version of the function (def wrapped) is a closure that has a reference to fn (the function you passed it) and name (the function's name) since those exist in the parent scope.

Using the original function's name as a key, the wrapped function looks in a dictionary called _memoized_values, stored on the factory instance, to see if the original function has been called before. If this is the first time it is being called, the wrapped function will call the original function to get its value, and will store the value back into the _memoized_values dictionary. From then on, it will just return _memoized_values[name] without calling the original function again.
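To see this behavior in action, here is a small self-contained sketch. It repeats the memoized decorator from above; CountingFactory and get_resource are hypothetical names that exist only to count how often the original function runs.

```python
import functools


def memoized(fn):
    name = fn.__name__

    @functools.wraps(fn)
    def wrapped(self):
        if name not in self._memoized_values:
            self._memoized_values[name] = fn(self)
        return self._memoized_values[name]
    return wrapped


class CountingFactory:
    def __init__(self):
        self._memoized_values = {}
        self.calls = 0

    @memoized
    def get_resource(self):
        # Runs only on the first call for a given factory instance.
        self.calls += 1
        return object()


factory = CountingFactory()
first = factory.get_resource()
second = factory.get_resource()

# A second factory has its own _memoized_values, so it builds its own object.
other = CountingFactory().get_resource()
```

Both calls on the same factory return the same object, and the counter stays at one.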

There is a potential simplification of the implementation of the memoized decorator. Instead of storing values in _memoized_values on the instance that owns the wrapped method, it could instead just cache the value in the closed-over state (i.e. store it in something declared in def memoized). This approach has a caveat, though: If you create multiple instances of the factory, the memoized decorator will share state between the instances, which is likely not what you want. Additionally, that makes it less obvious how it works and harder to debug.
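The caveat is easy to reproduce. The sketch below is one possible way to write the closure-based variant; memoized_in_closure, Factory and get_connection_string are hypothetical names chosen for this illustration.

```python
import functools


def memoized_in_closure(fn):
    # The cache lives in the closure, so there is exactly one per
    # decorated function, shared by every instance of the class.
    cache = []

    @functools.wraps(fn)
    def wrapped(self):
        if not cache:
            cache.append(fn(self))
        return cache[0]
    return wrapped


class Factory:
    def __init__(self, connection_string):
        self._connection_string = connection_string

    @memoized_in_closure
    def get_connection_string(self):
        return self._connection_string


a = Factory('sqlite:///a.db')
b = Factory('sqlite:///b.db')

a_value = a.get_connection_string()
b_value = b.get_connection_string()  # served from a's cached result
```

Here b_value ends up being 'sqlite:///a.db', because the second factory silently reuses what the first one cached.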

The functools.wraps decorator is useful when creating decorators, but it's not vital. It copies over the docstring and name of the function (otherwise all the memoized functions would look like they were called wrapped).
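A quick way to see what functools.wraps changes is to compare two otherwise identical decorators. The decorator, class and method names below are hypothetical and only serve the comparison.

```python
import functools


def plain_decorator(fn):
    def wrapped(self):
        return fn(self)
    return wrapped


def wrapping_decorator(fn):
    @functools.wraps(fn)
    def wrapped(self):
        return fn(self)
    return wrapped


class Example:
    @plain_decorator
    def get_without_wraps(self):
        """Docstring that gets lost."""

    @wrapping_decorator
    def get_with_wraps(self):
        """Docstring that is preserved."""


name_without = Example.get_without_wraps.__name__
name_with = Example.get_with_wraps.__name__
doc_without = Example.get_without_wraps.__doc__
doc_with = Example.get_with_wraps.__doc__
```

Without functools.wraps the method reports its name as wrapped and has no docstring; with it, the original name and docstring are kept.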

There's no argument

What if you want to be able to pass arguments to the memoized functions? Your first option is to cheat and create memoized wrapper functions that take no arguments, but call the underlying (non-memoized) function with arguments:

class SomeOtherFactory:
    def __init__(self):
        self._memoized_values = {}

    def _alchemy_engine(self, connection_string):
        return sqlalchemy.create_engine(connection_string)

    @memoized
    def in_memory_engine(self):
        return self._alchemy_engine('sqlite:///:memory:')

    @memoized
    def file_backed_engine(self):
        return self._alchemy_engine('sqlite:///my_database.db')

If that isn't enough, you could also modify the memoized decorator to allow extra arguments to the wrapped function (def wrapped(self, *args, **kwargs)), and use the combination of arguments and function name as a key into the _memoized_values dictionary.
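One possible shape for that modified decorator is sketched below. It is an assumption about how you might build the key, not the only way to do it; memoized_with_args, EngineFactory and get_engine are hypothetical names, and object() stands in for a real engine so the example runs without SQLAlchemy.

```python
import functools


def memoized_with_args(fn):
    name = fn.__name__

    @functools.wraps(fn)
    def wrapped(self, *args, **kwargs):
        # All arguments must be hashable to be part of a dictionary key.
        key = (name, args, tuple(sorted(kwargs.items())))
        if key not in self._memoized_values:
            self._memoized_values[key] = fn(self, *args, **kwargs)
        return self._memoized_values[key]
    return wrapped


class EngineFactory:
    def __init__(self):
        self._memoized_values = {}

    @memoized_with_args
    def get_engine(self, connection_string):
        # Stand-in for sqlalchemy.create_engine(connection_string).
        return object()


factory = EngineFactory()
memory_engine = factory.get_engine('sqlite:///:memory:')
same_engine = factory.get_engine('sqlite:///:memory:')
file_engine = factory.get_engine('sqlite:///my_database.db')
```

One limitation of this sketch: passing the same value positionally in one call and by keyword in another produces two different keys, so the function would run twice.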

Testing factories

If you are using dependency injection, chances are good that you also write tests for your code. Since factories are code, you may be considering testing your factories. In general, testing code is a great idea, but there are a few reasons why you might want to avoid writing tests for your factories.

Factories are fundamentally about wiring. They take existing (presumably correct, thanks to your unit tests) classes and stick them together. The first kind of bug that can happen when wiring things together is an integration problem where two classes or systems don't work correctly with each other. This kind of bug can be caught using integration tests that have nothing to do with the factory.

This leaves bugs in the factory itself. It might wire up the wrong structure or call functions incorrectly. Unlike logic or integration bugs, this class of bug is often very easy to catch. It tends to happen early on in the life of the application, and tends to fail noisily. You are much less likely to have a bug in factory code and not know about it. Since it is easier to avoid bugs in a factory, the benefits of having a test for the factory may not outweigh the cost of the extra code. Remember that tests are code that must be maintained like any other code. You should add a test only if it pulls its own weight.

Another problem with tests for factories is that they tend to assert that the code is what the code is. If for example you were testing some logic like a sorting algorithm, you wouldn't test that it selects some element as a pivot, and recurses on the right portion. You would test a level above that and assert that it returns lists in sorted order. When testing factories, on the other hand, the outcome is just the result of the wiring. If the factory injects an instance of class B into class A, you can't test much more deeply than asserting that object A has a reference to object B.

This leads to brittle tests. If you were to change the implementation of the sorting algorithm to use a median of three partitioning scheme, your test wouldn't change – you just want a sorted list. However, if you want to change the implementation of your application (say you want to wrap class B in a decorator C), your test would break even though there may be nothing wrong with the factory.

Just like you don't want to test that your sorting algorithm picks the median correctly, you don't want to test factories at the level of their implementation. You want to test that the entire output is correct, but this is hard to do since the output is the entire application. There is one kind of testing that can do this, though: an end-to-end test. Setting up an end-to-end test can be time consuming, so whether this kind of test is a good idea depends a lot on your application.

If there is business logic in your factory, that might be something worth testing separately. If that's the case, that logic should probably be extracted to its own unit that can be tested independently of the factory with a unit test.

For more on this topic, there is a great talk on testing by Misko Hevery that you should watch.

From here

Like with any pattern, it is possible to overuse factories. There is always a danger of using a pattern for the sake of using patterns instead of using them only when they serve some practical need. Factories are often useful for setting up dependency-injected applications, but you should remember that they are there to make your life easier and your code better. You shouldn't add the complexity of a factory class where a factory method will do. It pays to be judicious about keeping your code (to paraphrase Einstein) as simple as possible, but not simpler.

Factories aren't the only way to construct applications. There are dependency injection frameworks that automate most of the process of injection. These frameworks let you specify dependencies with annotations or decorators, and they take care of wiring up the objects for you. If you search the Python package index for dependency injection frameworks, you can find several, many of which are modeled after Google Guice, a DI framework for Java. Like with factories, you should only use a DI framework if it serves some practical need. You can have the benefits of dependency injection without adding the overhead of a framework.

P.S. Think dependency injection is the best thing since sliced bread? You should join us!

Welcome our newest engineer, Glen

Glen Oakley

Glen is joining Thumbtack's engineering team after graduating from The College of New Jersey with a degree in computer science. Glen has been a hacker for many years, and while in school he organized the college's first hackathon. He's especially passionate about functional programming (Haskell, anyone?) as well as modern programming techniques in Python and Node.

Outside of coding, Glen likes to spend time hiking, biking, or practicing yoga. You also might find Glen enjoying classic PC games like TF2, RuneScape, or Sid Meier's Civilization series.

At Thumbtack, Glen is looking for new ways to improve the user experience for users on both sides of the marketplace.

Looking for more about Glen? Check out his website or follow him on Twitter.

Welcome, Glen!

Welcome our newest engineer, Richard

Richard Whalen

Richard has just joined Thumbtack from Vanderbilt University with a master's in CS. At Vanderbilt, Richard helped teach a course in software patterns. He worked at FedEx and as a freelance developer. You can read about his audiophile tendencies on his website.

If you ask Richard if he's a "mountain or beach person" the answer is an easy "both". He is easily stoked for the West Coast surf and sun. But you can easily get him excited about bouldering, climbing, cycling, or any other mountain pursuit.

Since starting at Thumbtack, Richard is already hard at work on his first projects to help Thumbtack's infrastructure scale to the next level, and he's contributing to the brew club's efforts to make its best tasting ales yet.

Welcome, Richard!

Working at Thumbtack, the New Guy's Perspective

Hi, I’m Tommy. I’m an engineer, a tinkerer, a cyclist, a dive master and generally, a maker of things. In my career so far, I’ve had the opportunity to work on small teams, large teams, founding teams, and I even spent 6 years on teams in a sea of other teams at Google. A month and a half ago I was very happy to join the engineering team at Thumbtack.

Starting a new job can be a scary process. You do as much due diligence as possible to make sure it’s going to be great, but at some point you take the leap and decide to say yes. As you proceed through your first month of work, your friends are kind enough to constantly be grilling you with questions about your new job as you’re just trying to manage the firehoses of new information and remember everyone’s name.

Of course, I’ve been answering those questions a lot recently. Happily for me, I’ve found myself very excited to share my perspective on Thumbtack with my friends. Why? Well, there’s the $30 million Series C from Sequoia Capital and Tiger Global Management that gives us fuel to grow, the amazing new San Francisco SoMa office that gives us room to grow, and the delicious, free meals that happen because food is a priority.

However, as important as all that is, I’ve found that the more in depth parts of those discussions tend to revolve around these 6 points.

1. I’m working with brilliant engineers who are articulate, thoughtful, and intellectually curious.

This manifests itself in many ways, but one of the most evident I’ve seen so far is in the technical design discussions. Most of us have encountered a bad version of these before where it feels like everyone is talking past each other and decisions seem to come out arbitrarily or not at all. At Thumbtack, however, people aren’t afraid to ask the hard questions, dig deeper, or defer to the data. This means decisions get made correctly and quickly, which is critical because these types of discussions happen many times a day. I’ve seen great examples of these from the small one-line code reviews, all the way up to picking a web framework that we’ll be working with for years.

2. We realize that handholding and micromanaging don’t scale; ownership and mentorship do.

For a new engineer, that means ownership comes immediately. Your first month at Thumbtack isn’t fixing bugs, it’s a real project, of your choice, that gets you up to speed on the code base and makes a real difference to the company. On top of that, management at Thumbtack is necessarily flat and lightweight. This encourages engineers to step up and mentor each other, a practice that manifests itself as pair programming sessions, walks around the block, and engaged lunch discussions.

3. We’re dedicated to relentless self-improvement; we’re addicted to learning things.

Steve teaches an OS class. Alex is coordinating a Machine Learning class. Katie is teaching me Vim. Specific skills, tools, and areas of knowledge are helpful, but what’s really valuable is knowing how to learn something new. It gets a little meta, but like any skill, it gets rusty unless you practice it.

4. We’re not afraid to change everything and we’re rigorous when deciding anything.

In a field where there are constantly new tools, novel processes, and interesting technologies at such a high volume, it can be difficult to reason about them all. However, arbitrary decisions suck. They lead to wasted time and frustration. At Thumbtack, we make a point of making any and all decisions rigorously. Our open sourced test analysis library is a great example of how we go about this for changes to our website.

Every check needs a balance, however, and to ensure we don’t get lazy, change has been built into our process. For example, every 6 weeks we review our product process. Sometimes there are minor tweaks, sometimes it changes drastically, but every time we take the opportunity to evaluate our prior decision with a month and a half of additional data.

5. We’re working on novel and interesting technical challenges.

The volume of activity in our marketplace is up 4x year over year and we need to build and rebuild our systems to manage that growth seamlessly. This means automating, sharding, optimizing, and redesigning as the size of the problem changes drastically. For example, right now we’re working on extracting a number of services to reduce both the size of and the query load of our database master by half.

Additionally, scaling a marketplace is particularly difficult. We serve more than 700 unique categories that can behave remarkably differently from each other. We need to create ways to help define and maintain high quality on both sides of the market. While some manual curation helps with this, that doesn’t scale. Ultimately we’re continually figuring out new, creative ways to reduce manual work loads by an order of magnitude.

6. What we’re doing has a real, significant impact.

We’re helping newlyweds remodel their first home, parents plan their child’s first birthday party, and aspiring singers find a music teacher. We’re now sending professionals $1.8 billion of business a year. Many of those 63,000 paying pros have even used Thumbtack to double or even triple the size of their businesses. And we’re just getting started -- we’re still just a drop in the $800B US local services market.

However you add up the numbers, we’re making two sides of this big problem much more efficient and having a huge, positive effect on people’s lives.

Conclusion

I’m ecstatic to be working on amazingly interesting things with a team that I’m proud of. If that resonates with you, we’re hiring!
