23 TIPS ABOUT WHAT, HOW AND WHERE DATA SHOULD BE COLLECTED
You can only extract information from data if the data is of good quality. The more data you have, and the higher its quality, the more information you can get.
To that end, here are 23 tips to help your organization start getting, or finally get, the most value from its data.
3 factors to consider when deciding where to save data
- Amount of data you will collect.
- Frequency with which data will be queried.
- Urgency with which data must be obtained when requested.
Why should we consider these three factors? Because together they determine which kind of storage device suits each kind of data.
3 tips to know where to save data
- If retrieving data quickly is not critical, store it on slower devices with a large storage capacity.
- Store data that is requested frequently, or that must be obtained urgently, on high-speed devices (how urgent depends on how important it is to get your data on time).
- If you have a large amount of data for a given device, design the layout so that the data with the highest value and access frequency sits on the fastest devices.
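The placement rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Dataset` fields, the `HOT_READS_PER_DAY` threshold and the tier names `"fast"` and `"slow"` are hypothetical and should be adapted to your own hardware and workloads.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    size_gb: float        # amount of data collected
    reads_per_day: float  # how often the data is queried
    urgent: bool          # must it be returned quickly when requested?


# Hypothetical threshold: above this read rate, data counts as "hot".
HOT_READS_PER_DAY = 100


def choose_tier(ds: Dataset) -> str:
    """Return 'fast' for urgent or frequently read data, otherwise 'slow'."""
    if ds.urgent or ds.reads_per_day >= HOT_READS_PER_DAY:
        return "fast"
    return "slow"


def plan_placement(datasets, fast_capacity_gb):
    """Fill the limited fast tier with urgent data first, then the most-read data."""
    placement = {}
    remaining = fast_capacity_gb
    ranked = sorted(datasets, key=lambda d: (d.urgent, d.reads_per_day), reverse=True)
    for ds in ranked:
        if choose_tier(ds) == "fast" and ds.size_gb <= remaining:
            placement[ds.name] = "fast"
            remaining -= ds.size_gb
        else:
            # Either the data is cold, or the fast tier is already full.
            placement[ds.name] = "slow"
    return placement
```

When the fast tier cannot hold everything that qualifies, the ranking decides what is left on slow storage, which is the design decision the third tip asks you to make explicitly.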
Now for the most important part: what data you need to store. Whether you end up with quality data depends on it, so we recommend reading this section several times until it is completely understood.
12 tips to know what data need to be stored
- Store non-urgent data on devices with high capacity and slow access times. These data can be referenced with pointers located on faster devices.
- For such non-urgent data, compression is also a good idea.
- Do not worry about defining different schemas for different users of the data (usually needed for privacy reasons), since you can create logical schemas later from the data already saved.
- Study the relationships between concepts, since a query may otherwise have to retrieve a lot of unnecessary data. Think about the operations that will be performed most frequently and keep the number of intermediate relationships between related concepts low.
- The more numerical data you store, the better.
- If you categorize data (for example, assigning 1 to people of legal age and 0 to everyone else), always save how those values were calculated; in this case, store the raw age value as well.
- Save dates and times for operations and transactions.
- Your data source may sometimes deliver missing values. Decide on a logical default value for them.
- Know the encoding in which you will save your data (UTF-8?).
- Write validations that prohibit data values that do not make sense. When a value violates a validation, decide whether to reject the insert or to set a default value.
- When you have a high urgency to get certain data that requires a lot of computationally expensive operations, you may well need to pre-calculate that data and store it redundantly so that it can be retrieved faster.
- Operations on date types are often slower than on integer types, so if you have no storage constraints, it is preferable to store the date and time components (year, month, hour, etc.) as separate values and perform operations on that numeric data instead of on the date type.
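Several of the tips above (keeping the raw value next to a derived category, default values for missing data, validations, timestamps and separate numeric date components) can be combined in one small sketch. Everything named here is an assumption made for illustration: the `prepare_record` function, the field names, the legal age of 18, the accepted age range and the `"unknown"` default are hypothetical choices, not a prescribed design.

```python
from datetime import datetime, timezone
from typing import Optional

LEGAL_AGE = 18               # assumed threshold; it varies by jurisdiction
DEFAULT_COUNTRY = "unknown"  # assumed logical default for a missing value


def prepare_record(raw: dict, now: Optional[datetime] = None) -> dict:
    """Validate a raw record and enrich it before it is stored."""
    age = raw.get("age")
    # Validation: reject values that do not make sense instead of storing them.
    if not isinstance(age, int) or isinstance(age, bool) or not 0 <= age <= 130:
        raise ValueError(f"invalid age: {age!r}")

    ts = now or datetime.now(timezone.utc)
    return {
        "age": age,                                 # raw value is kept
        "is_adult": 1 if age >= LEGAL_AGE else 0,   # derived category
        "is_adult_rule": f"age >= {LEGAL_AGE}",     # how it was calculated
        "country": raw.get("country") or DEFAULT_COUNTRY,
        "recorded_at": ts.isoformat(),              # full timestamp
        # Separate integer components for cheap numeric filtering.
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
    }
```

Here an invalid age is rejected while a missing country falls back to a default; which of the two policies fits each field is a decision to make per column.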
8 tips on how to start collecting data
- Always start with your most important assets and processes. Do not give much weight to your current goals or KPIs; focus on getting data from those assets and processes. If you design around your goals and they change over time, you will need to completely change the schema, and the data obtained so far will have little value (if any).
- Start collecting data little by little. Do not try to create a big data schema at the beginning; start small and grow your schema as you get comfortable (be agile).
- The ingestion process must be periodic and methodical: the frequency with which data is obtained must be set according to how often that data needs to be collected.
- Be realistic: collecting data before you have an automated process can be laborious, so be aware of the resources you have available to devote to it.
- Follow standard formats; it will save you time and money.
- Make regular backups. Errors can always occur, so we consider it extremely important that you create decentralized, redundant backups.
- Give importance to metadata (data that describes data), such as timestamps.
- Once you have all of the above, you can start thinking about collecting more data. For example, you can start getting data about the weather on a specific day (if that’s important to you), about your competition, about your market share, etc. In other words, data that is not directly part of your business, but that has an impact on it.
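As a small illustration of the tips on standard formats and metadata, the sketch below wraps each collected record in a JSON envelope with an ISO 8601 timestamp. The function name `wrap_with_metadata`, the envelope layout and the `source` and `schema_version` fields are assumptions chosen for this example; use whatever metadata matters to your organization.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def wrap_with_metadata(
    payload: dict,
    source: str,
    schema_version: int = 1,
    collected_at: Optional[datetime] = None,
) -> str:
    """Serialize a record as JSON together with metadata describing it."""
    ts = collected_at or datetime.now(timezone.utc)
    envelope = {
        "metadata": {
            "collected_at": ts.isoformat(),    # ISO 8601 timestamp
            "source": source,                  # where the data came from
            "schema_version": schema_version,  # lets the schema grow over time
        },
        "data": payload,
    }
    # ensure_ascii=False keeps non-ASCII text readable; write the result as UTF-8.
    return json.dumps(envelope, ensure_ascii=False, sort_keys=True)
```

Recording a schema version with every record fits the advice to start small: the schema can grow later while older records remain interpretable.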