An easy way to get started with open source in 2017

Pete Huang
Published 2016-12-29, updated 2017-02-07

This week, I kicked off a personal challenge: devote at least 30 minutes to an open source project every day for 30 days (to start). I've chosen to work with matplotlib, a 2D plotting library written in Python. The library, part of the SciPy stack and the broader Python data ecosystem, was a huge help when I was at Northwestern and is a community staple. Hopefully, my efforts are a net positive for the project.

As someone who prefers not to code and probably couldn't do so well enough to cobble together a meaningful pull request, I've always faced a mental barrier when thinking about contributing to open source. Why would any project need someone who doesn't or can't code?

As I've learned this week, there's a surprising shortage of people who focus on the process of an open source project rather than its content. There are a ton of projects with stale issues that were opened in 2013 and have been left sitting ever since. There are a ton of projects with unclear or broken documentation that just needs some editing. Many open source contributors just don't want to spend time on these things - both are great places for you to jump in!

Today, if you want an excellent, low-barrier and low-effort way to get involved with open source, I'll show you exactly how to get started with issue triage in six steps.

Step 1: Find a project that you love or that needs help

If you are an active software user, chances are you've interacted with plenty of open source libraries and projects. If you can, always start with something you use. You'll be more familiar with what the code does and you've probably seen the docs before.

If you are not an active software user or aren't inspired by one of the projects that you use, head to Code Triage for a great list.

When searching for repos, I recommend searching for those with:

  • a large number of issues (e.g., 500+)
  • a fairly active commit history (e.g., a last commit two years ago is a bad sign)
  • a programming language you're familiar with or are interested in learning
  • activity on Gitter

A large number of issues means there's clearly room for you to jump in, and activity via commits and Gitter means there will be people around to help you produce the outcome that you're looking for. Overall, I was looking for a repo where I could make a measurable and distinct impact, which would include both a reduction in open issues and a piece of structure or a guideline for the overall project.
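The criteria above can be sketched as a small screening function. This is only an illustration: the `RepoSnapshot` record, the one-year idle cutoff and the sample repos are all made up for the example, and you would fill in the numbers by hand from each repo's page.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RepoSnapshot:
    """Hypothetical record of numbers read off a repo's page by hand."""
    name: str
    open_issues: int
    last_commit: date
    language: str
    has_active_chat: bool  # e.g., recent messages on Gitter


def is_good_triage_candidate(repo, today, languages,
                             min_issues=500, max_idle_days=365):
    """Apply the four criteria from the list; the thresholds are illustrative."""
    idle_days = (today - repo.last_commit).days
    return (
        repo.open_issues >= min_issues
        and idle_days <= max_idle_days
        and repo.language in languages
        and repo.has_active_chat
    )


# Made-up sample data, not real statistics for any project.
candidates = [
    RepoSnapshot("busy-lib", 820, date(2016, 12, 20), "Python", True),
    RepoSnapshot("quiet-lib", 900, date(2014, 6, 1), "Python", False),
    RepoSnapshot("tiny-lib", 12, date(2016, 12, 28), "Python", True),
]
today = date(2016, 12, 29)
shortlist = [r.name for r in candidates
             if is_good_triage_candidate(r, today, {"Python"})]
print(shortlist)
```

Only "busy-lib" passes here: "quiet-lib" has been idle for too long and "tiny-lib" has too few issues to leave you much room to help.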

I recommend the PyData community, specifically the SciPy stack:

  • matplotlib, a plotting library
  • scipy, a library for math, science and engineering
  • sympy, a symbolic mathematics library
  • IPython, a tool for interactive computing
  • numpy, a scientific computing library
  • pandas, a data structure and analysis library

Each of these has a large number of issues, some totaling over 1000, with recent development activity.

Step 2: Read the contributing guide

Before you do anything, read as much as you can about the project. Each repo usually has a section on how to contribute to the project. This is a great starting point and a necessary read to get you oriented on how things work.

As Steve Klabnik notes in his popular 2014 post on issue triage, each project sets its own rules on where certain comments go and which channels the project uses.

For example, bugs are often tracked in GitHub Issues, but they may be tracked in another tracker instead (like Django does with Trac). Feature requests may be submitted on a Trello kanban board, but they may also be lumped in at GitHub Issues. Discussion topics may be located on mailing lists, but they may also be lumped in at GitHub Issues.

These are your ground rules, and they can point you to easy issue closes when you triage. Steve noted in his post that a quarter of the open issues didn't abide by these rules and therefore should be closed: an easy source of impact for you.

Step 3: Communicate with the core team

Still before you do anything, try your best to get in contact with someone from the core team to talk about what you want to do. Nobody knows you exist yet, so firing off 100 comments on issues without prior notice would be a bit of a surprise. We want to make sure we're spending our time in a valuable and helpful manner.

To find the core team, look at the commit history on the GitHub repo. Most of the time, these are the folks with the most active commit history on the master branch.

Reach out to them on the Gitter channel with a quick message:

"Hi <name>! I was going through the issues on the <project> repo and noticed that many of them are incredibly stale. Is this an issue for you or the core team? Are there any existing efforts to manage this that I can join or start?"

In essence, do your best to learn more about what's going on behind the scenes. If you think it's a problem, you can't really be sure until you ask. If you think it's not a problem, you can't really be sure until you ask. So ask!

Another point: keep communication in mind as you continue with open source. 

First, it's terribly hard to read intent and tone in online interaction, and it gets even harder when you add binary outcomes ("accept" vs. "reject" a pull request, "open" vs. "close" an issue). Most people are incredibly nice and generous, but the communication may seem impersonal at times. You are not immune to giving this impression yourself! Do your best to be nice and make "I" statements.

Second, it's good to just know what's going on. Communicate early and often about your work and your questions. Open source is a community, after all. Do your best to actually be a part of it.

Step 4: Learn the labeling landscape

Just as each project has set its own guidelines about what goes where, each project also manages its issues differently. In GitHub Issues, labeling an issue is incredibly important and is the best way to sort and filter all issues.

As an example, pandas does a fantastic job of labeling its recent issues. Take a look at its issue list.

There, you can see exactly what type of issue each one is, what part of the code it refers to and the expected difficulty and effort levels of the fix.

You can then filter all issues by label. A common approach is to filter for issues with a "new contributor" label, which are great starting points for beginners to the project.

What's important for now isn't what the labeling landscape should be, but what it actually is. Take a look at the labels and figure out what's used and what's not - this will help you be effective at suggesting labels when you go through issues.
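One way to get a feel for which labels are actually in use is a simple tally. The sketch below assumes you have jotted down the labels on a handful of issues; the label names and the sample list are hypothetical, not taken from any real project.

```python
from collections import Counter

# Hypothetical sample: each inner list holds the labels on one issue.
issue_labels = [
    ["Bug", "Docs"],
    ["Bug"],
    ["Enhancement"],
    [],
    ["Bug", "Difficulty: Easy"],
]


def label_usage(labels_per_issue):
    """Return (label counts, number of issues with no label at all)."""
    counts = Counter(
        label for labels in labels_per_issue for label in labels
    )
    unlabeled = sum(1 for labels in labels_per_issue if not labels)
    return counts, unlabeled


counts, unlabeled = label_usage(issue_labels)
for label, n in counts.most_common():
    print(f"{label}: {n}")
print(f"unlabeled issues: {unlabeled}")
```

Frequently used labels are the safe ones to suggest; labels that never show up in the tally are probably not part of how the project really works.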

Step 5: Install the software (optional)

For bug reports, it helps to try to reproduce an old bug with the most current version of the software. Sometimes, the bug has already been addressed and the issue can be closed.

It can also be a helpful way for you to get familiar with the code if you intend to contribute code in the future.

If you want to be fully effective in issue triage, I recommend that you try your best to get your own copy of the software installed - you're pretty hamstrung if you don't.
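When you report a reproduction attempt, it helps to state exactly which version you tested. Here is a minimal sketch, assuming a Python package installed in your current environment and Python 3.8 or later for `importlib.metadata`; the helper names `installed_version` and `reproduction_note` are made up for this example.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


def reproduction_note(package_name, reproduced):
    """Draft a one-line comment stating the version the bug was tested on."""
    version = installed_version(package_name)
    if version is None:
        return f"{package_name} is not installed here, so I could not test this."
    if reproduced:
        return f"I was able to reproduce the bug in version {version}."
    return f"I could not reproduce the bug in version {version}."


print(reproduction_note("matplotlib", reproduced=True))
```

If the package is missing, the note says so instead of guessing a version, which is a good reminder to install it before commenting.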

Step 6: Start triaging

Now, you're ready to get going. Head over to the issues section and sort the issues by least recently updated. These are the ones that haven't been touched and should probably have someone provide an update. Open one up and take a look.
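If you would rather pull this list with a script than click through the web page, the same ordering can be requested from GitHub's REST API. The sketch below only builds the request URL and makes no network call. It assumes the "list repository issues" endpoint with its `state`, `sort` and `direction` parameters; the helper name `stale_issues_url` is made up. Note that, as far as I know, this endpoint also returns pull requests, so you may need to skip those.

```python
from urllib.parse import urlencode


def stale_issues_url(owner, repo, per_page=30):
    """Build a URL listing open issues, least recently updated first."""
    query = urlencode({
        "state": "open",
        "sort": "updated",
        "direction": "asc",  # ascending = oldest update first
        "per_page": per_page,
    })
    return f"https://api.github.com/repos/{owner}/{repo}/issues?{query}"


print(stale_issues_url("matplotlib", "matplotlib"))
```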

Your goal when triaging issues is twofold: 1) ensure that the issue is still current and relevant and 2) provide as much detail as possible to help contributors address the issue efficiently, especially if it's a bug.

Issue triage looks a little different for each type of issue:

All issues

  • Recommend labels based on your best understanding of the issues and the commonly used labels
  • Redirect the issue to the proper channel and recommend closing the issue if GitHub Issues is not the right channel (see Step 2)

Bug reports

  • If the issue does not have all the information that's typically useful for a bug report (including example code to reproduce, actual outcome and expected outcome), leave a comment asking for the missing pieces
  • If the issue does have all of that information, try to reproduce the bug with the example code
    • If it works fine, say, "I think the bug has been addressed. In version 1.5.3, I get the following output: <x>. I recommend closing this issue."
    • If the bug persists, say, "I was able to reproduce the bug in version 1.5.3."
  • If you are not reproducing bugs, it may be useful for you to ask if anyone is able to reproduce the bug, though I recommend again that you try to do so and report your results

Enhancements / feature requests

  • Check the documentation and code, especially any "what's new" documents, to see if the feature was implemented. If so, point to the version that implemented it and recommend closing the issue
  • If the issue was mid-discussion (e.g., is this request a good request?), leave a question asking if there have been any updates since the last comment that affect the viability of the request
  • If there appears to be some work done (e.g., a person assigned or a pull request in progress), try to summarize or ask someone to summarize the current status of the work
  • If the issue is just plain stale and nothing else, ask if the enhancement is still useful

Discussion

  • Ask if this is still a relevant discussion to have

One GitHub feature that makes this easier is saved replies, which you can set up for the things you find yourself repeating.
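To draft such replies, the bug report checklist above can be sketched as a few templates plus a small chooser. The two reproduction messages reuse the wording from the list; the "needs_info" text, the dictionary keys and the function name are my own placeholders, so adapt them to the project's tone.

```python
SAVED_REPLIES = {
    # Placeholder wording; not taken from any project's guidelines.
    "needs_info": (
        "Could you add example code to reproduce this, along with the "
        "actual outcome and the expected outcome?"
    ),
    "bug_persists": "I was able to reproduce the bug in version {version}.",
    "bug_fixed": (
        "I think the bug has been addressed. In version {version}, I get "
        "the following output: {output}. I recommend closing this issue."
    ),
}


def bug_report_reply(has_repro_info, reproduced, version, output=""):
    """Pick a reply: ask for details first, otherwise report the retest."""
    if not has_repro_info:
        return SAVED_REPLIES["needs_info"]
    if reproduced:
        return SAVED_REPLIES["bug_persists"].format(version=version)
    return SAVED_REPLIES["bug_fixed"].format(version=version, output=output)


print(bug_report_reply(True, reproduced=True, version="1.5.3"))
```

Keeping the wording in one place makes your comments consistent, and it is easy to paste the same text into GitHub's saved replies afterwards.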

For the most part, you will be relying on the core team and other active contributors who know the code base well to help you actually address the issue. This is especially true with feature requests; ultimately, they are the best people to evaluate whether a feature request is worthwhile, especially on the impact-feasibility tradeoff front.

You are critical to open source

Focusing on issue triage means that you're focused on making contributing code as easy as possible. Part of this job is to help make each issue as standalone as possible - ideally, someone could jump into the issues and know that everything is valid and ready to work on.

This is hard! When I look at a repo with 800 issues (matplotlib!), I always wonder how the contributors know what to fix. If there's a bug from 2013, how do the devs know that it still exists?

Turns out they don't, because they're volunteers doing this out of their own generosity. As an issue triager, you're helping get all the organizational knowledge out on paper to make it easy for people to code.

I say all this because it's easy to feel like you're not truly adding value by not adding code. Remember that strong process makes developing strong content easier.

See you in the issues section! I'll be there through January (and hopefully beyond!).
