Privacy Breaches and Big Data: Solutions and Suggestions in India’s Context

By: Navreet Kaur[+]

This article discusses the unprecedented rate at which data is growing and the resulting possibilities for privacy infringement.  It views the problem in the context of a developing nation, India, because India formed a committee last year to devise regulations for data protection.  The article also discusses the approaches adopted by other countries so that India can meet global standards when formulating its data protection policies.  Finally, it discusses the features of the Data Protection Bill that the committee recently submitted to the Ministry of Electronics and Information Technology, Government of India.

In this article, we restrict our focus to how social media giants contribute to the creation of Big Data and how they shy away from liability for protecting the data in millions of their users’ accounts.

Big data is the term given to collections of data sets produced from multiple sources in different volumes and varieties, and at varying velocities.  This data is analyzed to reveal patterns and trends within the data sets.  According to Forbes, this data is growing faster than ever before, and by the year 2020 about 1.7 megabytes of new information will be created every second for every human being on the planet.[1]  Sources such as logs, IoT[2] sensors, the web, and social media alone contribute a substantial amount to Big Data.

Of this, the data from the web and social media is collected from a user’s activities in his or her personal space.  Whenever a user logs in to a personal account, whatever he or she browses, shares, or posts contributes to Big Data irrespective of the user’s personal choice.  A user cannot avoid this continuous collection of the activity log, having agreed to the Terms and Conditions, and cannot select which activities to share and which to withhold.

So, all the data on social media is, or can be, streamed or transferred to a storage medium.  From there it can either be used by the parent organization (which could be a Data Controller)[3] or be sold to another organization (a Data Processor)[4] for analysis to obtain some tangible benefit.

This continuous stream of data is procured in real time through the firehose API.[5]

Data Firehose is the name given to an API that continuously streams data from the web in real time.  Its use cases include weather forecast sources, social media, geographical location data, RSS feeds, blogs, comments, and review ratings.

It is also possible that, once the data from various sources is procured, third parties (Data Processors) who work as strategists get involved.  They receive the data set, analyze it, and share the results.  Such a third party has no interest in the contents of individual records.  It uses analytical tools and languages to mine the data and extract the information relevant to it.

Even though none of this data is of personal use to the third party, it receives the data in its entirety and goes through all of it.  In this process, the data reaches an altogether new holder and becomes more vulnerable to a breach of privacy.  The more the data travels, the higher the risk of a breach of privacy.

To avoid this privacy contravention, companies use techniques like Anonymization[6] and Pseudonymization[7] of data.

Various techniques such as generalization, encryption, blurring of images, truncation, character masking, and dropping identifiable information are used to anonymize data.[8]  But none of these techniques is 100 percent effective.
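A few of the techniques named above can be illustrated with a minimal Python sketch.  The record, its field names, and the helper functions are hypothetical and chosen only for illustration; the keyed hash is one common way to pseudonymize a value, not a method prescribed by any of the sources cited here.

```python
import hashlib
import hmac

# Hypothetical user record; all field names and values are invented.
record = {
    "name": "Asha Rao",
    "email": "asha.rao@example.com",
    "age": 34,
    "postal_code": "110021",
    "phone": "9876543210",
}


def generalize_age(age: int, width: int = 10) -> str:
    """Generalization: replace an exact age with a range."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def truncate(value: str, keep: int) -> str:
    """Truncation: keep only the leading characters of a value."""
    return value[:keep]


def mask(value: str, visible: int = 2, fill: str = "*") -> str:
    """Character masking: hide all but the last `visible` characters."""
    hidden = max(len(value) - visible, 0)
    return fill * hidden + value[hidden:]


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Pseudonymization: a keyed hash, so equal inputs map to equal tokens."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def anonymize(rec: dict, secret_key: bytes) -> dict:
    """Build a reduced record; the name is dropped entirely."""
    return {
        "user_token": pseudonymize(rec["email"], secret_key),
        "age_range": generalize_age(rec["age"]),
        "postal_prefix": truncate(rec["postal_code"], 3),
        "phone": mask(rec["phone"]),
    }


print(anonymize(record, b"demo-key-not-for-production"))
```

Even in this reduced record, the age range and postal prefix remain, and a combination of such fields may still narrow down an individual, which is why none of these techniques can be treated as failsafe.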

There have been incidents of private data being re-identified by finding correlations among databases.  One famous incident is the re-identification of Governor Weld’s health record, in which Professor Latanya Sweeney demonstrated that the medical record of the then Governor of Massachusetts could be re-identified by comparing the medical records with publicly known facts about the governor.[9]  So, a question arises: when an organization is analyzing a data set, or has shared its data set with third parties for analysis, and a breach of privacy is reported, who should be held responsible?  The organization that sold the data, the organization that bought the data for its own use, or the third-party organization that was hired only for analytics because of its specialization?  Who should be punishable by law?

As the methods of concealing private information are limited and not failsafe, who shall take responsibility for the breach: the seller, the buyer, or the user who has accepted the Terms and Conditions?  In the rest of this article, we analyze the approaches adopted by India as well as other nations.
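Re-identification by correlating databases can be sketched in a few lines of Python.  Everything in the sketch is invented for illustration: the two data sets, the field names, and the choice of postal code, birth date, and sex as the shared attributes are assumptions, not details taken from the incident described above.

```python
from collections import defaultdict

# Attributes assumed to appear in both data sets (hypothetical choice).
QUASI_IDENTIFIERS = ("postal_code", "birth_date", "sex")

# "Anonymized" records: names removed, other attributes kept (invented data).
deidentified_records = [
    {"postal_code": "P1", "birth_date": "1950-01-15", "sex": "M", "diagnosis": "condition A"},
    {"postal_code": "P2", "birth_date": "1962-07-04", "sex": "F", "diagnosis": "condition B"},
    {"postal_code": "P1", "birth_date": "1971-03-22", "sex": "F", "diagnosis": "condition C"},
]

# A publicly available list that includes names (invented data).
public_list = [
    {"name": "Person One", "postal_code": "P1", "birth_date": "1950-01-15", "sex": "M"},
    {"name": "Person Two", "postal_code": "P2", "birth_date": "1962-07-04", "sex": "F"},
    {"name": "Person Three", "postal_code": "P2", "birth_date": "1962-07-04", "sex": "F"},
    {"name": "Person Four", "postal_code": "P3", "birth_date": "1980-11-30", "sex": "M"},
]


def key_of(row: dict, fields=QUASI_IDENTIFIERS) -> tuple:
    """Return the tuple of shared attributes used to link the data sets."""
    return tuple(row[f] for f in fields)


def reidentify(deidentified: list, public: list, fields=QUASI_IDENTIFIERS) -> list:
    """Return (name, diagnosis) pairs where exactly one public entry matches."""
    index = defaultdict(list)
    for person in public:
        index[key_of(person, fields)].append(person["name"])

    matches = []
    for row in deidentified:
        names = index.get(key_of(row, fields), [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches.append((names[0], row["diagnosis"]))
    return matches


print(reidentify(deidentified_records, public_list))
```

In this toy data only the first record is re-identified: the second matches two people and stays ambiguous, and the third matches no one.  The point is that removing names does not help when the remaining attributes are unique to one person.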

Understanding the Role of the Third Party

It is important to understand why the analyst or strategist organization is considered a third party here.

The reason is that this third-party organization has no interest in the content of the data it receives for analysis.  It gets paid only for its analytics and uses automated tools and high-level languages to analyze data.  Considering the nature of its work, there will be only rare instances where this organization is liable for an infringement.  Such a third party is always hired by a parent party for the parent party’s own consumption.  By contrast, an entity that has purchased data with the intention of analyzing it for its own profit has a much higher chance of breaching the law.

The two entities should not be held to the same strict law.

The third party can comply merely by signing a contract stating that it shall protect the data.  In case of any data breach it shall be punishable as per the terms of the contract, whereas the obligation to protect the data still lies with the parent organization.  Since it is the parent organization’s platform on which users have shared their personal or sensitive information, the trust between the user and the service provider cannot be compromised.

India’s Approach Towards Dealing with Data Protection

As for where India stands on data protection, the committee for Data Protection (a nine-member expert committee headed by Justice BN Srikrishna) was formed in July 2017 and released a white paper for public comment in November 2017.[10]  In July 2018, it submitted the final bill to the government.[11]

Drafting policies and regulation is a complex task as there are multiple dimensions that need to be addressed.  Seeing data protection in the light of the “Right to Privacy”[12] and maintaining the compatibility of rules with global standards is the basis on which the bill has been drafted.

The Personal Data Protection Bill redefines the basic terms such as data, personal data, anonymization, data principal, and sensitive personal data.[13]  Apart from definitions, it talks about fair and reasonable processing of data and the Purpose Limitation.[14]

Since the concept is relatively new, well-laid policies and regulation are part of the bill proposed. The bill has taken “consent” very seriously and mentions that it should be capable of being withdrawn, and the ease of withdrawal should be comparable to the ease of giving consent. This also gives a hint of the strictness with which the laws are formulated.

The bill also clearly states the exemptions for processing of data.  The exemptions apply to legal, journalistic, domestic, and research purposes.  The bill recommends the establishment of a Data Protection Authority (DPA).  The role of this authority will include investigation, education, policy drafting, and policy enforcement.

The bill also prescribes penalties which can be as high as INR 15 crores or 4 per cent of the total worldwide turnover of any data collection entity.[15]  Quoting from the bill—“[w]here the data [sic] fiduciary contravenes any of the following provisions, it shall be liable to a penalty which may extend up to five crore rupees or [sic] two per cent of its total worldwide turnover of the preceding financial year, whichever is higher. . . .” [16]
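The “whichever is higher” rule in the quoted provision is simple arithmetic, sketched below in Python.  The function name and the turnover figures are hypothetical; the defaults follow the quoted five crore rupees and two per cent.  Applying the same rule to the fifteen crore and four per cent tier is an assumption, since the sentence describing that tier does not say how the two amounts combine.

```python
CRORE = 10_000_000  # 1 crore = 10 million rupees


def max_penalty_inr(worldwide_turnover_inr: int,
                    fixed_cap_crore: int = 5,
                    turnover_percent: int = 2) -> int:
    """Return the higher of the fixed cap and the turnover-based cap, in rupees."""
    fixed_cap = fixed_cap_crore * CRORE
    turnover_cap = worldwide_turnover_inr * turnover_percent // 100
    return max(fixed_cap, turnover_cap)


# Hypothetical turnovers of 100 crore and 1,000 crore rupees.
small = max_penalty_inr(100 * CRORE)    # 2% is 2 crore, so the 5 crore cap applies
large = max_penalty_inr(1_000 * CRORE)  # 2% is 20 crore, which exceeds 5 crore
print(small // CRORE, large // CRORE)   # prints: 5 20
```

Under these defaults the turnover-based amount overtakes the fixed amount once worldwide turnover exceeds 250 crore rupees, so the fixed figure matters mainly for smaller entities.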

The Right to Be Forgotten is also included in the India Data Protection Bill.[17]  It states that the user has the right to restrict the disclosure of personal data if the purpose for which it was collected is fulfilled.  This will make it mandatory for the Data Controller to state its purpose more clearly and visibly.  It is an essential step for the organization to create procedures to implement these provisions on an individual level.

International Approaches

Similar data protection acts in other nations are already in practice, and it is advantageous to adopt worthy portions of those acts suitable to our situation.

For instance, the policies of South Africa and the European Union are strict enough to impose a fine on the defaulter (both data controller and data processor), whereas the UK leaves it to the discretion of the Information Commissioner.[18]

In another example, the Personal Data Protection Commission of Singapore, which was formed to promote and enforce personal data protection, treats data transfer across geographical territories as an exemption.[19]  It also explains how to apply for an exemption from any provision under the PDPA.[20]

It is worth appreciating the fact that Singapore has gone a step ahead.  Along with well-laid procedures and regulations, there is a chat window on Singapore’s website which can provide answers to questions related to Data Protection in order to keep its citizens well informed. [21]

The intention is to strike a fine balance between treating data sharing as a crime and letting data flow seamlessly.  The stricter the laws, the less information will flow in or out; the more lenient the laws, the higher the probability of invasion of privacy.


The proposed solution is as follows.  Of the three parties involved, the party immediately preceding the party in breach in the chain of transactions should be entrusted with the responsibility of securely transferring only those fields of records that contain neither private information nor information that could be used to re-identify anonymized data.  Because it bears the liability, it will have an incentive to do so.

The same can be achieved if the parent organization exposes a set of APIs to external applications through which only ‘relevant’ fields of data are transferred and private information is discarded.  The word ‘relevant’ is important here because some social media sites have already exposed their APIs, but those are analogous to firehose APIs.
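The difference between a firehose-style stream and an API that transfers only relevant fields can be sketched as follows.  This is a minimal Python illustration under stated assumptions: the event structure, the field names, and the allow-list are all hypothetical, and what counts as ‘relevant’ would in practice depend on the purpose agreed with the recipient.

```python
# Hypothetical allow-list of fields an external application may receive.
RELEVANT_FIELDS = frozenset({"post_id", "hashtags", "language", "created_at"})


def to_shareable(event: dict, allowed=RELEVANT_FIELDS) -> dict:
    """Keep only allow-listed fields; everything else is discarded."""
    return {key: value for key, value in event.items() if key in allowed}


def filtered_stream(events, allowed=RELEVANT_FIELDS):
    """Wrap a firehose-like iterable so private fields never leave the source."""
    for event in events:
        yield to_shareable(event, allowed)


# Invented events standing in for a real-time stream.
firehose = [
    {
        "post_id": 1,
        "user_name": "Asha Rao",
        "email": "asha.rao@example.com",
        "location": "New Delhi",
        "hashtags": ["privacy"],
        "language": "en",
        "created_at": "2018-09-13T10:00:00Z",
    },
    {
        "post_id": 2,
        "user_name": "Person Two",
        "email": "person.two@example.com",
        "location": "Mumbai",
        "hashtags": ["bigdata", "india"],
        "language": "hi",
        "created_at": "2018-09-13T10:00:05Z",
    },
]

for shared in filtered_stream(firehose):
    print(shared)
```

An allow-list is used rather than a block-list so that any field added to the source later is withheld by default until someone decides it is relevant.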

This solution is feasible because no new setup needs to be established for external applications to send API requests and fetch processed information: an API mechanism already exists in the form of the Firehose API.

Instead of sharing the entire data stream, information should be shared on a controlled basis.  Sharing irrelevant information incurs extra transport cost, and the more widely information is shared across boundaries, the more susceptible it becomes to breaches of privacy.

It is also true that not all data sets are structured.  And not all structured data sets have the same pattern.  But this mechanism can be followed for social media giants at least because they alone contribute hundreds of petabytes of data every day.

Moreover, the introduction of tags on social media serves the purpose of structuring the data to an extent.

Another advantage of sharing only relevant data through APIs is that the data sets will be sorted at a micro level.  Instead of accumulating all data sets and then sorting them for relevant information, it is better to sort them while they are comparatively small.

In the context of India, the data protection laws are flexible enough to meet the ever-changing technologies, and the following suggestions can also be incorporated at any later stage:

  • Limiting the sale of data: The sale of data can be limited by maintaining slabs.  A range of data transfer volumes should be defined for each slab, which can also serve as the basis for a pricing mechanism.
  • Licensing the process: Licenses can be categorized as corporate, government, direct marketing, or third-party analysis.  The amount of data one can sell should be based on the license held.  Licenses could be issued by telecommunication authorities.
  • Geographical restriction: One of the most complicated issues is the ownership of data.  Because data is intangible, its ownership multiplies with every copy.  Who owns the data once it crosses geographical boundaries?  Does the recipient country have well-laid laws?  Who shall be liable for any infringement of privacy rights: the sender company or the recipient?  How relevant is the data of one country’s users to any other country?  Obviously, there is no unit for measuring the degree of relevance, but data can certainly be categorized into that which is free to be shared and that which is not.
  • Complaint mechanism: When laws are formulated, the possibility of their being distorted also arises.  Keeping this in mind, a strong complaint reporting and resolution mechanism must exist for the system to run seamlessly.
  • Informed citizens: If the government is engineering laws for the privacy and protection of its citizens’ data, citizens should be well informed.  The government should make its citizens aware of, and vigilant against, any attempt to flout the data protection laws.
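The first two suggestions, slabs and licenses, can be combined in a short Python sketch.  All numbers, slab names, and the mapping from license category to permitted slab are hypothetical placeholders; the suggestions above do not specify any of them.

```python
from bisect import bisect_left

# Hypothetical slabs: each upper bound (in gigabytes) closes one slab.
SLAB_UPPER_BOUNDS_GB = [100, 1_000, 10_000]
SLAB_NAMES = ["small", "medium", "large"]

# Hypothetical mapping from license category to the largest slab it permits.
LICENSE_MAX_SLAB = {
    "third-party analysis": "small",
    "direct marketing": "medium",
    "corporate": "large",
    "government": "large",
}


def slab_for(volume_gb: float):
    """Return the slab name for a transfer volume, or None if above all slabs."""
    if volume_gb <= 0:
        raise ValueError("volume must be positive")
    position = bisect_left(SLAB_UPPER_BOUNDS_GB, volume_gb)
    if position == len(SLAB_UPPER_BOUNDS_GB):
        return None  # exceeds the largest defined slab
    return SLAB_NAMES[position]


def sale_permitted(license_type: str, volume_gb: float) -> bool:
    """Check whether a license category may sell the requested volume."""
    slab = slab_for(volume_gb)
    if slab is None:
        return False
    allowed = LICENSE_MAX_SLAB[license_type]
    return SLAB_NAMES.index(slab) <= SLAB_NAMES.index(allowed)


print(slab_for(250))                               # medium
print(sale_permitted("third-party analysis", 250)) # False
print(sale_permitted("corporate", 250))            # True
```

A price per slab could be attached in the same way, for example as a third list parallel to the slab names.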


It can be concluded that personal data protection is a citizen’s right and that providing it is the government’s concern.  The Indian Government is gearing up to secure this right for its citizens.

Ethically speaking, the onus lies with the data controllers, referred to as “parent organizations” in this paper.  But mere obligation cannot produce the desired results.  There is a dire need to devise and enforce the necessary guidelines.

Also indispensable is the need to enlighten the citizens about their rights and how they can remain vigilant about them.  If the above-mentioned practices are adopted, the privacy of data can definitely be retained and maintained for a long time.

[+] The author currently works as a Technical Assistant at JIRICO, a research initiative at O P Jindal Global University, India. Her areas of interest include policy making related to Big Data, IoT, and AI.

[1] Bernard Marr, Big Data: 20 Mind-Boggling Facts Everyone Must Read, Forbes (Sept. 30, 2015, 02:19 AM).

[2] Internet of Things.

[3] What is a Data Controller or a Data Processor?, Eur. Comm’n (last visited Sept. 13, 2018).

[4] Id.

[5] The firehose API is a steady stream of all available data from a source in real time: a giant spigot that delivers data to any number of subscribers at a time.  The stream is constant, delivering new, updated data as it happens. See Joe Hanson, What is a Data Firehose API?, PubNub (Nov. 14, 2014) (describing the functions of firehose API).

[6] Anonymisation and Pseudonymisation, Data Protection Comm’n (Ir.) (last visited Sept. 13, 2018).

[7] Id.

[8] Gregory S. Nelson, Practical Implications of Sharing Data: A Primer on Data Privacy, Anonymization, and De-Identification 13–15 (ThotWave Tech., No. 1884, 2015).

[9] Edward W. Felten & Arvind Narayanan, No Silver Bullet: De-Identification Still Doesn’t Work (July 9, 2014).

[10] White Paper on Data Protection Framework for India – Public Comments Invited, Indian Ministry Elec. & Info. Tech. (last visited Sept. 13, 2018).

[11] The Personal Data Protection Bill (2018) (last visited Sept. 17, 2018).

[12] Justice K Puttaswamy (Retd.) and Anr. v. Union of India and Ors., Writ Petition (Civil) No. 494 of 2012 (Sup. Ct. India Aug. 24, 2017).

[13] The Personal Data Protection Bill, supra note 11, at 7.

[14] The Personal Data Protection Bill, supra note 11, at 11.

[15] Approximately USD $700,000.

[16] The Personal Data Protection Bill, supra note 11, at 41.

[17] The Personal Data Protection Bill, supra note 11, at 21.

[18] See Guidance About Issuing Monetary Penalties, Info. Comm’n Off. (describing the UK’s considerations when assessing fines); Commission Regulation 2016/679, 2016 O.J. (L 119) (GDPR) (detailing the fines for non-compliance).

[19] Who We Are, Pers. Data Protection Comm’n of Sing. (last visited Sept. 13, 2018).

[20] Id.

[21] Exemption Requests, Pers. Data Protection Comm’n of Sing. (last visited Sept. 13, 2018).