Like most early companies we started to manually keep track of our numbers in google sheets. Obviously, this approach is hardly scalable and massively prone to human error.
Albeit the right solution in the early days of a company — the challenge is to grow the system week by week into a more robust business information system.
But where to start?
In any case, you need to set-up a centralized data warehouse first, which feels like a comprehensive project to implement. I believe most companies are outright scared by the extent and efforts it takes to set up and maintain their BI system. It seems that sticking to google sheets is often the cheaper and leaner approach. This post gives a short introduction to a lean data warehouse and visualization approach.
How to escape the google sheets massacre
Part I: You don’t want to write code
- Setting-up a cloud-based reporting database. I suggest using a PostgresQL or MySQL. AWS has free tiers for small business and an easy setup approach (https://aws.amazon.com/rds/postgresql/)
- Accessing your database and designing your database structure. I suggest to use pgAdmin (or similar free tools for MySQL) to access your database locally and set-up your reporting tables. Your initial google sheets or excel reporting structure can serve as a good glimpse on how to set-up your table structure.
- Setting-up a layer to retrieve and store data in Postgres. In the first step, you can already achieve a lot using Zapier without coding. Zapier allows you to connect to any webhook / API and push data into your database routinely without coding.
With this set-up, you can already aggregate a lot of external data and save it in your database (or namely data warehouse, which is nothing more than a data repository). Some examples:
- Saving data from google sheets in your database
- Pushing data from any API through Zapier into your database
- Setting-up daily or hourly routines in Zapier to retrieve data from one of the tools you are using (webshop, Trello, email, Google, etc) and pushing it into your database
- Zapier already provides a ton of apps by default. Just have a look here
How to escape the google sheets massacre
Part II: Zapier is not doing the job
At some point, Zapier might not be sufficient anymore to serve as a layer between your business data and your data warehouse. In that case, you need a more elaborate piece of software working as the middleware between all your data and your warehouse.
We have become a huge fan of python and their data manipulation libraries such as Pandas. If you are using python exclusively for BI purposes, I suggest using a service like python anywhere to avoid code deployment or other set-up headaches. Alternatively, Google app engine comes with a lot of specific and handy google APIs built into the service.
Pythonanywhere provides you with a virtual environment and console to host and execute python scripts as well as to set-up routines (e.g. pulling and pushing data daily)
The most important libraries to get a head start with python are:
- sqlalchemy: Query databases and push data into SQL databases
- requests: Make API requests
- pandas: Manipulate all sorts of data
- gspread: Connect to google sheets and Google APIs
To pull the data from the various services you may be using, you will need to make API calls to them (if they provide an API service). The methods of authentication for API services vary and are sometimes complex, but python normally has a library with decent documentation for you to work through. Your most frequent ‘go-to’ library would be ‘requests’, but we have also used ‘sha’ and ‘hashlib’ to deal with APIs that require unique hash signatures on every call.
Certain APIs sometimes return different paginated responses, but normally the methods for using them are explained in the API documentation. After the API call to retrieve your data, you’ll normally receive a JSON file, which you can read using the ‘json’ library in python and can manipulate how you like with the rest of your code.
.. so how to visualize and manipulate your data?
Having a data warehouse and aggregating data from different sources is nice, but still cumbersome to work with. There is a ton of data visualization and BI tools on the market. The major difference lies between simple dashboard tools visualizing data or more elaborate tools, allowing also colleagues without SQL knowledge to query the database and build customized reports and graphs.
We had a look at holistics, tableau, chart, and a looker. Holistics and Chartio both provide a free test-version you can test. Unfortunately, some of the tools are getting a bit pricy for startups as soon as you add more seats. (chartio 199$ per month / seat).
The beauty with most tools is that they can be easily connected with your database. So once you have a well-structured database, which is routinely fed with all relevant business data, you can easily manipulate data, build dashboards or schedule reports.
The entire set-up doesn’t take long. It takes you 30 minutes to set-up Postgresql on AWS. Another 30 minutes to install local database access such as pgAdmin. The most expensive part is setting up the database structure and Zaps to populate your tables. As mentioned before, the best way is probably to start iteratively by moving different data sources step by step into PostgreSQL tables. The most important part is: you don’t need to be extraordinarily python literate to start building your own data warehouse.
AWS offers free tiers for PostgreSQL, in any case, it shouldn’t be more than €50 per month. Pythonanywhere €12 per month. The most expensive part is the Chartio, which starts at €199 per month and seat.
Disclaimer: The post originally appeared on medium.