Research:Ethical and human-centered AI/Bias in recommender systems

Tracked in Phabricator:
Task T203049

Contact

Jonathan Morgan

Wikimedia Foundation

Duration: 2018-August – 2018-December

Research:Projects

This page documents a completed research project.

This page in a nutshell: This document is an early draft of Research:Ethical_and_human-centered_AI

This document provides an overview of recent research, policy, process, methods, and critique related to addressing bias in machine learning applications, focusing on recommender systems.

In the past few years, researchers, designers, developers and legal scholars have begun to develop new methods for preventing, identifying, and addressing bias in recommender systems and other machine learning applications, and to formulate best practices for organizations that engage in machine learning development.

Below is an outline of considerations, methods, best practices, and external resources that inform the development of the Wikimedia Foundation's recommender systems.

Overview

Recommender systems are among the most widespread and visible machine learning applications. Wikimedia has begun to develop recommender systems to power a variety of reader- and contributor-facing features across our platforms, such as suggesting articles to read next, articles to translate, and links and sections to add to existing articles.

Recommender systems can be a powerful tool for increasing reader engagement, encouraging contribution, and filling knowledge gaps. However, recommender systems, like all machine learning applications, have the potential to reflect and reinforce harmful biases. Bias in machine learning applications can stem from a wide variety of sources, such as:

limitations in training data
social and analytical assumptions of system developers
technical limitations of software infrastructure, and
evolving social norms

Related issues around user expectations of privacy and anonymity, appropriate stewardship of personal data, and accountability and interpretability in algorithmic decision-making can compound the risks associated with bias in recommender systems.

Goals

This document is intended to:

summarize the state of the art in research, scholarship, and organizational policy related to the causes and consequences of bias in recommender systems
describe general best practices and industry-standard methods for preventing, identifying, and addressing bias during software development
identify important considerations and external resources that are relevant at each stage of the recommender system development process

Concepts and key terms

Recommender systems: Algorithmic tools for identifying items (e.g. products, services) of interest to users of a technology platform (Ekstrand et al. 2018)

Bias: Systematically and unfairly discriminating against certain individuals or groups of individuals in favor of others (Friedman and Nissenbaum, 1996)

Harm: Denying a group access to valuable resources and opportunities (allocative harm), or underscoring or reinforcing the subordination of some social or cultural group (representative harm) (Reisman et al. 2018)

Accountability: the process of assigning responsibility for harm when algorithmic decision-making results in discriminatory and inequitable outcomes (Donovan et al. 2018)

Transparency: the degree to which the factors that influence the decisions made by an algorithmic system are visible to the people who use, regulate or are impacted by those systems

Interpretability: the degree to which the people who use, regulate, or are impacted by algorithmic systems can understand how that system makes decisions and/or the reason why a particular decision was made

How to use this document

This document is intended to be used as a menu, not a checklist: it is not assumed that every consideration presented here will be applicable to the development of any particular recommender system in any particular context. This document is also not a comprehensive set of all possible considerations, methods, best practices, and resources related to bias in the development of recommender systems.

As such, this document is intended to give teams that develop or evaluate recommender systems guidance on approaches they may want to apply, singly or in combination; common pitfalls to watch out for; and principles that others engaged in this work have identified as important for understanding the causes and indicators of bias, assessing risk and preventing potential negative impacts before they occur, and addressing those impacts (which in some cases may be inevitable) where they appear and when they matter most.

Document format: The recommendations, considerations and resources presented in this document are organized according to three stages of design and development identified by the Fairness, Accountability, and Transparency in Machine Learning organization: design, pre-launch, and post-launch.^[1]

Stage 1: Design

Articulating design goals

Identifying audience, purpose, and context

Direct and indirect stakeholders. Identifying the direct stakeholders (intended end users) and indirect stakeholders (non-users who may be impacted) of a recommender system is a key first step in human-centered algorithm design.^[2] Considering who will use the system, what they may use it for, and the context of use helps ensure that the resulting product is usable and addresses a real human need.^[3] Considering potential impacts on indirect stakeholders helps the developers understand the scope of their accountability: for example, whether the resulting system may benefit some people at the expense of others.
Risk assessment. Several agencies have developed checklists, frameworks, and principles that development teams can use to assess the potential risks associated with building and deploying algorithmic systems.^[4]^[5]^[1] In the case of recommender systems, risk assessments should focus on risks associated with potential sources of pre-existing, technical, and emergent bias in the planned system.^[6]

Developing hypotheses and benchmarks

Current conditions. An assessment of current conditions related to the problem the system is intended to address (e.g. gaps in Wikipedia content coverage mediated by low contributor diversity) and contributing factors should be conducted in order to develop benchmarks against which the impact of the system can be assessed.
Anticipated impacts. Anticipated impacts should be clearly stated and justified based on relevant empirical research and social science theory. Specific, testable hypotheses should be developed in order to ensure that the system is evaluated according to criteria that are consistent with stated goals and against pre-specified benchmarks.

Selecting models, data, and features

Using interpretable models

Model opacity. A recommender model is opaque if an important stakeholder (an end user or anyone else invested in or impacted by the model) is unable to understand how or why the algorithm arrived at a particular recommendation. Thus, opacity can be a function of both the computational literacy of the stakeholder and the characteristics of the model.^[7] Opaque models present a greater risk of introducing or reinforcing bias because the ability to understand how the model works overall, and how it makes specific recommendations, is restricted to stakeholders with high levels of computational literacy. When selecting and designing recommender models, developers should consider not just accuracy and efficiency, but also the computational literacy of their stakeholders and the degree to which the stakeholders need to be able to interpret model outputs.
Model complexity. Some recommender models may be so operationally complex that it is very difficult, if not impossible, to articulate the model's features, operations, and underlying logic to an end user with sufficient completeness and correctness.^[7] When selecting a model, developers should consider whether they can get sufficient accuracy and efficiency from a less complex (but more explainable) model, in order to reduce the risk that bias in the model will go undetected. They should also consider whether, given the model chosen, the anticipated context of use, the likelihood and anticipated consequences of bias, and the computational literacy of the stakeholders, it is more important to present stakeholders with a more descriptive or a more persuasive explanation of the model's decision-making.^[8]

Collecting training data

Labeling campaigns. Human annotators may embed their own biases in training data based on their level of expertise and sociocultural background.^[9] The way the annotation task is presented to annotators—including the task description, question prompts, choice architecture, and user interface—can also introduce bias in the training dataset. In the context of developing training data for recommender systems, developers should consider whether the interests and attitudes of the labelers match those of the intended audience.
Behavioral trace data. When drawing on log data of user-system interactions (e.g. webrequest logs, edit histories) to train recommender algorithms, it is important to identify sources of noise and potential mismatches between the population the sample is drawn from and the target population. For example, a dataset used to train a recommender system that suggests articles to good-faith new editors should exclude edits by likely vandals. Similarly, a dataset of section headings from English Wikipedia, if used to train a section heading recommender system for Spanish Wikipedia, may suggest sections that are discouraged by the Spanish community's policies or Manual of Style.
Existing datasets. Datasets of behavioral traces, or annotated datasets, collected from other platforms or other contexts may introduce bias when used as training data for recommender systems. Recent proposals have called for the development of data statements for NLP datasets that include information such as speaker and annotator demographics, text characteristics, and speech situation in order to help researchers who re-use these corpora in different contexts assess the potential sources of and consequences of bias.^[10]
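The vandal-exclusion example under "Behavioral trace data" can be sketched as a simple pre-processing filter. This is a minimal illustration, not a method described by the sources cited here: the record fields ("user", "reverted"), the revert-rate heuristic, and the 0.5 threshold are all assumptions standing in for whatever vandalism signal is actually available.

```python
from collections import defaultdict


def filter_good_faith_edits(edits, max_revert_rate=0.5):
    """Drop all edits by users whose share of reverted edits is too high.

    `edits` is a list of dicts with hypothetical keys "user" and "reverted".
    The revert rate is only an illustrative proxy for likely vandalism.
    """
    totals = defaultdict(int)
    reverted = defaultdict(int)
    for edit in edits:
        totals[edit["user"]] += 1
        if edit["reverted"]:
            reverted[edit["user"]] += 1

    # Keep users whose revert rate does not exceed the threshold.
    kept_users = {
        user for user, n in totals.items()
        if reverted[user] / n <= max_revert_rate
    }
    return [e for e in edits if e["user"] in kept_users]


# Hypothetical sample data.
sample_edits = [
    {"user": "A", "article": "Cat", "reverted": False},
    {"user": "A", "article": "Dog", "reverted": False},
    {"user": "B", "article": "Cat", "reverted": True},
    {"user": "B", "article": "Fox", "reverted": True},
    {"user": "C", "article": "Owl", "reverted": True},
    {"user": "C", "article": "Elk", "reverted": False},
]
training_edits = filter_good_faith_edits(sample_edits)
```

Any filter of this kind is itself a modelling assumption: a threshold that removes vandals may also remove good-faith newcomers whose early edits were reverted, so the excluded population should be reviewed for the same mismatches described above.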

Engaging relevant stakeholders

Eliciting stakeholder input

Generative research. Performing qualitative research with a diverse set of stakeholders before development begins can help identify potential sources of bias, as well as helping to define the problem space and user requirements for the system. Successful approaches include surveys, interviews,^[11] focus groups,^[12] and participatory design activities.^[2]
Community consultation. Pitching your proposal to the impacted community and inviting feedback allows community members to raise concerns and point out un-interrogated assumptions, and can build trust. The AI Now Institute notes that holding multiple rounds of request for comment "provides a strong foundation for building public trust through appropriate levels of transparency... subsequent requests can solicit further information or the presentation of new evidence, research, or other inputs that the agency may not have adequately considered."^[5]

Building diverse teams

Demographic diversity. The team that develops recommender systems should include members with a diversity of backgrounds, including but not limited to age, sex, race, gender identity, culture, language, and geographic location. People are often blind to the biases that impact their own opinions and decisions. Including a diverse set of voices in discussions about the goals and process of system development, and actively encouraging team members to articulate their own perspectives in those discussions, can help surface hidden assumptions, edge cases, values conflicts, and disparate impacts.^[13]
Subject-matter expertise. Social scientists, designers and researchers, legal and policy experts, and community leaders can provide invaluable insights on who important stakeholders and stakeholder communities are; their motivations, wants, needs, and identified problems; and the broader social, organizational, and societal impacts of developing and deploying a new system in an existing sociotechnical system. Incorporate people who have expertise beyond those of the core model and infrastructure development team—i.e. beyond data scientists, software engineers and project managers—into the team early on. Give them meaningful, ongoing roles in project scoping, goal setting, requirements development, evaluation criteria, and other important decision points and decision-making processes.

Defining roles, goals, and processes

Internal accountability. Decide who on the team is responsible for making final decisions about design, deployment, and evaluation. Define internal reporting processes, long-term maintenance and post-launch monitoring plans (and associated roles), and contingency plans that cover issues such as discovery of data breaches, harmful biases, and other unintended consequences.^[1]
External review. People not directly involved in the development of the recommender system (inside and outside the organization) should be invited to review project plans and resources before launch. Develop mechanisms for external review of development, deployment, and evaluation plans; risk assessments; potential training datasets, models, and infrastructure components. External review can flag issues related to bias, risk, ethics, organizational policy and strategy, law, user privacy, and sustainability.^[5]

Stage 2: Pre-launch

Performing comparative and iterative testing

Offline evaluation

Evaluation protocols. Different offline evaluation metrics suit different recommender algorithms.^[14] Recent work suggests that the design of the protocol selected for offline evaluation can exaggerate particular kinds of bias in recommender systems, resulting in divergence between assessed accuracy and actual utility (as measured by online evaluation and user studies).^[15]
Comparative evaluation. Research has shown that models that perform similarly according to accuracy-based metrics can exhibit very different degrees of utility when compared using online evaluation or user studies.^[16]^[17] Experimenting with multiple models and model parameterizations, rather than committing to a specific modeling approach at the start of development, enables comparative evaluation according to utility, bias, and fairness.
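As a rough sketch of the offline half of a comparative evaluation, the snippet below scores two candidate models on the same held-out data using precision@k. The model names, the held-out interactions, and the choice of precision@k are illustrative assumptions; as noted above, similar accuracy scores do not guarantee similar utility, so a comparison like this is a starting point rather than a verdict.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are in the relevant set."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    return sum(1 for item in top_k if item in relevant) / k


def mean_precision_at_k(recommendations, held_out, k):
    """Average precision@k over every user that appears in `held_out`."""
    scores = [
        precision_at_k(recommendations.get(user, []), relevant, k)
        for user, relevant in held_out.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical held-out interactions: user -> set of items they engaged with.
held_out = {"u1": {"a", "b"}, "u2": {"c"}}

# Hypothetical ranked outputs from two candidate models.
model_outputs = {
    "popularity": {"u1": ["a", "x", "y"], "u2": ["a", "x", "c"]},
    "item_knn": {"u1": ["a", "b", "z"], "u2": ["z", "c", "y"]},
}

results = {
    name: mean_precision_at_k(recs, held_out, k=2)
    for name, recs in model_outputs.items()
}
```

Keeping the evaluation function independent of any one model makes it straightforward to add further models, parameterizations, or metrics to the same comparison.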

User studies

Expectations, explanations, and mental models. Testing model prototypes with representative system users can help system developers understand the results people expect from the model, how they understand the way the model generates recommendations, and what kind of explanations of model processes and outcomes are useful for evaluating user trust and acceptance. Allowing users to explore input datasets, sample outputs, and interactive visualizations of model decision-making can help surface new features that can be included to improve the model performance. These studies can also be conducted before any model development work has been done, using "Wizard of Oz" protocols.^[18]
Usability and utility. The usability and utility of recommender systems can be quantitatively measured through user studies^[19] consisting of questionnaires and/or task protocols based on established Human-Recommender Interaction frameworks,^[20]^[21] psychometric evaluation protocols,^[16] or protocols designed to suit the particular goals and context of the recommender application. User study methods can be used to identify relationships between subjective factors such as user satisfaction, novelty, and item diversity when comparing different models or different iterations of a single model.^[22] Explicit user input can also be elicited interactively to improve personalization.^[23]

Online evaluation

Pilot testing. Pilot testing (usually A/B testing) should be performed in a way that is non-disruptive and that does not unfairly advantage or disadvantage particular groups. Results of pilot tests should be evaluated against pre-defined and publicly disclosed benchmarks or pre-formulated hypotheses. Additionally, pilot tests provide the first opportunity to study the unintended impacts of the recommender system on non-participants (e.g. platform users who are not part of an experimental cohort), and on the dynamics of the platform or community as a whole.
UI design. Recommendations are surfaced to users in many different ways depending on the platform, the content, the purpose of the system, and the anticipated desires of the users.^[24] Design decisions (even small ones) in the user interface, such as the number of recommendations provided, the information about the recommendations, and the mechanisms used to gather implicit feedback, can have consequences for how, and how much, people use the system,^[25] and for what the algorithm learns.^[26] There are a variety of common design patterns for the user interfaces of recommender systems, and different patterns may be better suited to different audiences, purposes, and contexts.^[27]
Explicit feedback. Explicit feedback should be gathered from system users during short-term, experimental, or beta deployments. The recommender interface should clearly communicate to users that it is an experimental feature, and there should be clear calls to action for the users to provide substantial feedback (beyond binary or scalar relevance ratings or a "dismiss" icon) on the content of the recommendations and the way they are presented. Allowing meaningful feedback (e.g. via free-text comments or short multiple-choice surveys) not only provides additional insights into users' expectations and impressions of the recommender system, it also allows them to flag potentially harmful unintended consequences, such as the presence of bias in recommendations, or negative impacts on user experience or community health.

Interrogating disparate impacts and unintended consequences

Offline evaluation

False positives and false negatives. When determining the proper thresholds for false-positives and false-negatives, consider the consequences of each type of error on user experience and user outcomes.^[28]
Subgroup fairness. Recommender systems that perform well overall may nonetheless work better for some types of users than others. Therefore, it is important to evaluate the effectiveness of the recommender for particular subgroups of users when possible in order to identify and mitigate this source of bias.^[15]^[29] However, evaluating accuracy for individual, identified subgroups (e.g. race, gender, geographic location, or site tenure) may not itself be sufficient to prevent harmful bias, as bias can also be reflected in intersectional or constructed subgroups, such as "newer female users" or "established users from Southeast Asia".^[30]
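The two offline checks above can be combined by computing false-positive and false-negative rates separately for each subgroup, including subgroups formed by pairs of attributes. The sketch below is a minimal illustration under stated assumptions: the record fields ("predicted", "actual"), the attribute names, and the sample data are hypothetical, and exhaustively enumerating attribute pairs is a simplification rather than the auditing approach of the cited work.

```python
from collections import defaultdict
from itertools import combinations


def error_rates_by_subgroup(records, attributes):
    """Compute false-positive and false-negative rates per subgroup.

    Each record is a dict with boolean "predicted" and "actual" fields plus
    one key per attribute. Subgroups are formed from every single attribute
    and from every pair of attributes (a simple intersectional grouping).
    """
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "pos": 0, "neg": 0})
    groupings = [(a,) for a in attributes] + list(combinations(attributes, 2))
    for rec in records:
        for keys in groupings:
            c = counts[tuple((k, rec[k]) for k in keys)]
            if rec["actual"]:
                c["pos"] += 1
                if not rec["predicted"]:
                    c["fn"] += 1
            else:
                c["neg"] += 1
                if rec["predicted"]:
                    c["fp"] += 1
    return {
        group: {
            # A rate is None when the subgroup has no cases of that kind.
            "fpr": c["fp"] / c["neg"] if c["neg"] else None,
            "fnr": c["fn"] / c["pos"] if c["pos"] else None,
            "n": c["pos"] + c["neg"],
        }
        for group, c in counts.items()
    }


# Hypothetical evaluation records.
records = [
    {"tenure": "new", "gender": "f", "predicted": False, "actual": True},
    {"tenure": "new", "gender": "f", "predicted": False, "actual": True},
    {"tenure": "new", "gender": "m", "predicted": True, "actual": True},
    {"tenure": "old", "gender": "f", "predicted": True, "actual": True},
    {"tenure": "old", "gender": "m", "predicted": True, "actual": True},
    {"tenure": "old", "gender": "m", "predicted": False, "actual": False},
]
rates = error_rates_by_subgroup(records, ["tenure", "gender"])
```

In this toy data the false-negative rate for the intersectional subgroup (new, f) is higher than for either "new" or "f" taken alone, which is the kind of gap that single-attribute evaluation can hide. Subgroup sizes ("n") should be reported alongside the rates, since small subgroups give unstable estimates.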

User studies

Purposeful sampling. User studies often attempt to recruit a relatively random sample of system users, in order to make generalizable claims about the usability or utility of a system. However, in cases where the purpose of the study is to identify issues, such as bias, where the system may have differential impacts on different people based on background, attitudes, or activities, it is important to purposefully sample from these groups when recruiting for user studies.^[31] In the case of bias, it is important to test the system with users who are a) under-represented in the system, b) known to have needs that are poorly addressed by the current systems, c) likely to be more dependent on the functionality that the recommender system provides, and/or d) known to be subject to harmful biases in society at large.
Edge and corner cases. User studies provide opportunities to investigate how well the recommender system performs (in terms of perceived utility and in terms of sources of bias in the recommended items) in cases where the recommender system is making recommendations outside of its normal operating parameters^[32] (edge cases) and in cases where multiple variables are near, but not exceeding, their extreme limits (corner cases). For example, a system that recommends articles to editors based on edit history should be tested with very new users, who have made very few edits (edge case); a system that recommends articles to readers based on a combination of article quality, user browsing history and user geographic location should be tested to see how well it performs for readers who live in regions in which there are relatively few geo-tagged articles, and most articles are of comparatively low assessed quality (corner case).

Stage 3: Post-launch

Assessing impacts

Short term

Research outcomes. The most robust measure of the success of a recommender system is a large-scale evaluation with real users in the intended context of use. Pre-launch A/B test experiments can be effective for testing particular hypotheses about the impacts of the system. However, formulating more comprehensive metrics related to research outcomes, and evaluating the system against those outcomes after final deployment, is important to assure that the results from A/B tests (which are typically simplified or short-term, compared to real-world use cases) are generalizable and ecologically valid. Per Shani and Gunawardana: "When choosing an algorithm for a real application, we may want our conclusions to hold on the deployed system, and generalize beyond our experimental data set. Similarly, when developing new algorithms, we want our conclusions to hold beyond the scope of the specific application or data set that we experimented with."^[19]
Community response. It is important to monitor the degree to which users adopt the system after deployment: how many people adopt it, who adopts it, and how quickly. The degree to which people who try out the system continue to use it over time can be a strong indicator of the system's long-term utility. User study participants and people involved in short-term experimental deployments may exhibit a high degree of interest in or engagement with a system because of its novelty or imagined utility, but a decreased adoption rate (or decreased regular use among early adopters) can be a sign that regular use of the system decreases user trust, or that the system is poorly integrated into users' accustomed workflows. Explicit feedback from early adopters, as well as non-users, in the form of bug reports, posts to technical support forums, or other public discussions can also provide insights into the impact that the system is having on the community as a whole.

Long term

Top-line metrics. After a new recommender system is fully deployed, the impact of the system should be assessed in terms of the top-line metrics, key performance indicators, and strategic goals of the organization.^[33] Systems that show positive impacts according to their own specific success criteria may nevertheless have unintended consequences that are only apparent through retrospective analysis of the dynamics of the platform as a whole.^[34]
Iteration. All software systems can benefit from iterative improvement. Feedback from system users and other community members can help developers identify sources of bias in recommender systems, as well as technical bugs and usability issues. Machine-learning driven applications may lose accuracy over time without explicit re-training and re-evaluation, as changes over time in user behavior impact the accuracy of existing model features.^[35] Recommender systems are further susceptible to algorithmic confounding, in which users' interactions with the system increase the homogeneity of recommendations without increasing the utility of the recommender system.^[36]
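One way to watch for the homogenization described under "Iteration" is to track how similar different users' recommendation lists are from one retraining round to the next. The sketch below uses mean pairwise Jaccard similarity as a simple proxy; the metric choice, function names, and sample data are assumptions for illustration and are not necessarily the measure used in the cited work.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two collections of item identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_jaccard(rec_lists):
    """Average Jaccard similarity over all pairs of users' recommendations."""
    pairs = list(combinations(rec_lists.values(), 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical recommendations from two successive retraining rounds.
round_1 = {"u1": ["a", "b"], "u2": ["c", "d"], "u3": ["a", "c"]}
round_2 = {"u1": ["a", "b"], "u2": ["a", "b"], "u3": ["a", "c"]}

homogeneity_1 = mean_pairwise_jaccard(round_1)
homogeneity_2 = mean_pairwise_jaccard(round_2)
```

A rising value across rounds indicates that users are being shown increasingly similar items. On its own it does not show that utility has dropped, so it is best read alongside the utility measures from user studies and online evaluation.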

Enabling ongoing monitoring and re-evaluation

Accountability

External auditing. The AI Now Institute recommends that public agencies "provide a meaningful and ongoing opportunity for external researchers to review, audit, and assess [algorithmic] systems using methods that allow them to identify and detect problems" in order to ensure greater accountability.^[5] Journalists and social science researchers are in the process of adapting established methods for detecting fraud and bias in other domains to perform audits of 'black box' algorithms.^[37] External auditing can be made easier and more effective by publishing source code, providing public APIs,^[38] and using interpretable algorithmic models.^[39] In cases where full (public) transparency of models and data is not feasible, developers should work with researchers and community members to establish research access provisions, and the organization should maintain a public log of who is provided access to system code and/or data, and on what basis that access has been granted.^[5]
Reporting channels. The development team should maintain rich reporting channels—such as mailing lists, public wiki pages, and bug tracking systems—to both disseminate and collect information about system performance.

Transparency

Logging. Complete and comprehensive logs of model versions, UI changes, data schemas and inputs, performance evaluations, and deployments should be maintained and made publicly available (if possible) to facilitate auditing, error reporting, and retrospective analysis.
Documentation. It is important to develop detailed, public, and readable documentation of the system, including the system's limitations and the assumptions embedded in its design.^[40] Documentation of input sources, features, and public APIs should be regularly updated to reflect changes in the system.
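The logging and documentation practices above could be supported by a small structured record that is written once per deployment. The following is a sketch only: the field names, the JSON-lines format, and every value in the example entry are hypothetical placeholders, not a schema used by any existing Wikimedia system.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DeploymentLogEntry:
    """One publishable record describing a deployed model version."""

    model_version: str
    deployed_on: str  # ISO 8601 date string
    training_data: str  # description of (or pointer to) the input data
    ui_changes: list = field(default_factory=list)
    evaluation: dict = field(default_factory=dict)
    known_limitations: list = field(default_factory=list)

    def to_json_line(self):
        """Serialize to a single JSON line suitable for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)


# Placeholder values for illustration only.
entry = DeploymentLogEntry(
    model_version="0.3.1",
    deployed_on="2018-11-05",
    training_data="edit histories snapshot (hypothetical)",
    ui_changes=["show 3 recommendations instead of 5"],
    evaluation={"precision_at_5": 0.0},
    known_limitations=["not evaluated for users with fewer than 5 edits"],
)
line = entry.to_json_line()
```

Recording known limitations next to the model version keeps the documentation and the log in step, so that an auditor reading the log can see which assumptions applied to which deployment.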

References

↑ ^a ^b ^c "Principles for Accountable Algorithms and a Social Impact Statement for Algorithms :: FAT ML". www.fatml.org. Retrieved 2018-09-24.
↑ ^a ^b Baumer, Eric PS (2017-07-25). "Toward human-centered algorithm design". Big Data & Society 4 (2): 205395171771885. ISSN 2053-9517. doi:10.1177/2053951717718854.
↑ Kling, R. & Star, L. (1997) "Organizational and Social Informatics for Human Centered Systems". scholarworks.iu.edu. Retrieved 2018-09-25.
↑ Ethical OS (2018) Risk Mitigation Checklist. https://ethicalos.org/
↑ ^a ^b ^c ^d ^e Reisman, D., Schultz, J., Crawford, K., & Whittaker, M. (2018). Algorithmic Impact Assessments: a Practical Framework for Public Agency Accountability. Retrieved from https://ainowinstitute.org/aiareport2018.pdf
↑ Friedman, Batya; Nissenbaum, Helen (1996-07-01). "Bias in computer systems". ACM Transactions on Information Systems (TOIS) 14 (3): 330–347. ISSN 1046-8188. doi:10.1145/230538.230561.
↑ ^a ^b Burrell, Jenna (2016-01-05). "How the machine ‘thinks’: Understanding opacity in machine learning algorithms". Big Data & Society 3 (1): 205395171562251. ISSN 2053-9517. doi:10.1177/2053951715622512.
↑ Herman, Bernease (2017-11-20). "The Promise and Peril of Human Evaluation for Model Interpretability". arXiv:1711.07414 [cs, stat].
↑ Sen, Shilad; Giesel, Margaret E.; Gold, Rebecca; Hillmann, Benjamin; Lesicko, Matt; Naden, Samuel; Russell, Jesse; Wang, Zixiao (Ken); Hecht, Brent (2015-02-28). "Turkers, Scholars, Arafat and Peace: Cultural Communities and Algorithmic Gold Standards". ACM. pp. 826–838. ISBN 9781450329224. doi:10.1145/2675133.2675285.
↑ Bender, E. M., & Friedman, B. (2018). Data Statements for NLP: Toward Mitigating System Bias and Enabling Better Science. Transactions of the ACL. Retrieved from https://openreview.net/forum?id=By4oPeX9f
↑ Lee, Min Kyung; Kim, Ji Tae; Lizarondo, Leah (2017-05-02). "A Human-Centered Approach to Algorithmic Services: Considerations for Fair and Motivating Smart Community Service Management that Allocates Donations to Non-Profit Organizations". ACM. pp. 3365–3376. ISBN 9781450346559. doi:10.1145/3025453.3025884.
↑ Tintarev, Nava; Masthoff, Judith (2007-10-19). "Effective explanations of recommendations: user-centered design". ACM. pp. 153–156. ISBN 9781595937308. doi:10.1145/1297231.1297259.
↑ Ethical OS (2018), A guide to Anticipating the Future Impact of Today's Technology, https://ethicalos.org/
↑ Gunawardana, A., & Shani, G. (2009). A survey of accuracy evaluation metrics of recommendation tasks. Journal of Machine Learning Research, 10(Dec), 2935-2962. Retrieved from http://jmlr.csail.mit.edu/papers/volume10/gunawardana09a/gunawardana09a.pdf
↑ ^a ^b Ekstrand, Michael; Tian, Mucun; Azpiazu, Ion Madrazo; Ekstrand, Jennifer D.; Anuyah, Oghenemaro; McNeill, David; Pera, Maria Soledad (2017). "Scripts for All The Cool Kids, How Do They Fit In". Boise State Data Sets. doi:10.18122/b2gm6f. Retrieved 2018-09-25.
↑ ^a ^b McNee, Sean M.; Kapoor, Nishikant; Konstan, Joseph A. (2006-11-04). "Don't look stupid: avoiding pitfalls when recommending research papers". ACM. pp. 171–180. ISBN 1595932496. doi:10.1145/1180875.1180903.
↑ Beel, Joeran; Genzmehr, Marcel; Langer, Stefan; Nürnberger, Andreas; Gipp, Bela (2013-10-12). "A comparative analysis of offline and online evaluations and discussion of research paper recommender system evaluation". ACM. pp. 7–14. ISBN 9781450324656. doi:10.1145/2532508.2532511.
↑ Amershi, Saleema; Cakmak, Maya; Knox, William Bradley; Kulesza, Todd (2014-12-22). "Power to the People: The Role of Humans in Interactive Machine Learning". AI Magazine 35 (4): 105–120. ISSN 2371-9621. doi:10.1609/aimag.v35i4.2513.
↑ ^a ^b Shani G., Gunawardana A. (2011) Evaluating Recommendation Systems. In: Ricci F., Rokach L., Shapira B., Kantor P. (eds) Recommender Systems Handbook. Springer, Boston, MA. Retrieved from: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.600.7100&rep=rep1&type=pdf
↑ McNee, Sean M.; Riedl, John; Konstan, Joseph A. (2006-04-21). "Making recommendations better: an analytic model for human-recommender interaction". ACM. pp. 1103–1108. ISBN 1595932984. doi:10.1145/1125451.1125660.
↑ Knijnenburg, Bart P.; Willemsen, Martijn C.; Gantner, Zeno; Soncu, Hakan; Newell, Chris (2012-03-10). "Explaining the user experience of recommender systems". User Modeling and User-Adapted Interaction 22 (4-5): 441–504. ISSN 0924-1868. doi:10.1007/s11257-011-9118-4.
↑ Ekstrand, Michael D.; Harper, F. Maxwell; Willemsen, Martijn C.; Konstan, Joseph A. (2014-10-06). "User perception of differences in recommender algorithms". ACM. pp. 161–168. ISBN 9781450326681. doi:10.1145/2645710.2645737.
↑ Loepp, Benedikt; Hussein, Tim; Ziegler, Jürgen (2014-04-26). "Choice-based preference elicitation for collaborative filtering recommender systems". ACM. pp. 3085–3094. ISBN 9781450324731. doi:10.1145/2556288.2557069.
↑ Cremonesi, Paolo; Elahi, Mehdi; Garzotto, Franca (2016-09-24). "User interface patterns in recommendation-empowered content intensive multimedia applications". Multimedia Tools and Applications 76 (4): 5275–5309. ISSN 1380-7501. doi:10.1007/s11042-016-3946-5.
↑ Cosley, Dan; Lam, Shyong K.; Albert, Istvan; Konstan, Joseph A.; Riedl, John (2003-04-05). "Is seeing believing?: how recommender system interfaces affect users' opinions". ACM. pp. 585–592. ISBN 1581136307. doi:10.1145/642611.642713.
↑ Sharma, Abhinav (2016-05-17). "Designing Interfaces for Recommender Systems". The Graph. Retrieved 2018-09-25.
↑ Cremonesi, Paolo; Elahi, Mehdi; Garzotto, Franca (2016-09-24). "User interface patterns in recommendation-empowered content intensive multimedia applications". Multimedia Tools and Applications 76 (4): 5275–5309. ISSN 1380-7501. doi:10.1007/s11042-016-3946-5.
↑ Diakopoulos, Nicholas (2016-01-25). "Accountability in algorithmic decision making". Communications of the ACM 59 (2): 56–62. ISSN 0001-0782. doi:10.1145/2844110.
↑ World Economic Forum. (2016). How to Prevent Discriminatory Outcomes in Machine Learning. Global Future Council on Human Rights, (March). Retrieved from http://www3.weforum.org/docs/WEF_40065_White_Paper_How_to_Prevent_Discriminatory_Outcomes_in_Machine_Learning.pdf
↑ Kearns, Michael; Neel, Seth; Roth, Aaron; Wu, Zhiwei Steven (2017-11-14). "Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness". arXiv:1711.05144 [cs].
↑ Palinkas, Lawrence A.; Horwitz, Sarah M.; Green, Carla A.; Wisdom, Jennifer P.; Duan, Naihua; Hoagwood, Kimberly (2013-11-06). "Purposeful Sampling for Qualitative Data Collection and Analysis in Mixed Method Implementation Research". Administration and Policy in Mental Health and Mental Health Services Research 42 (5): 533–544. ISSN 0894-587X. PMC 4012002. PMID 24193818. doi:10.1007/s10488-013-0528-y.
↑ McNee, Sean M.; Riedl, John; Konstan, Joseph A. (2006-04-21). "Making recommendations better: an analytic model for human-recommender interaction". ACM. pp. 1103–1108. ISBN 1595932984. doi:10.1145/1125451.1125660.
↑ "IBM Knowledge Center". www.ibm.com. Retrieved 2018-09-28.
↑ Halfaker, Aaron; Geiger, R. Stuart; Morgan, Jonathan T.; Riedl, John (2012-12-28). "The Rise and Decline of an Open Collaboration System". American Behavioral Scientist 57 (5): 664–688. ISSN 0002-7642. doi:10.1177/0002764212469365.
↑ Harford, Tim (December 2014). "Big data: A big mistake?". Significance 11 (5): 14–19. ISSN 1740-9705. doi:10.1111/j.1740-9713.2014.00778.x.
↑ Chaney, Allison J. B.; Stewart, Brandon M.; Engelhardt, Barbara E. (2017-10-30). "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility". arXiv:1710.11214 [cs, stat].
↑ Sandvig, C., Hamilton, K., Karahalios, K., & Langbort, C. (2014). Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms. Data and Discrimination: Converting Critical Concerns into Productive Inquiry, a preconference at the 64th Annual Meeting of the International Communication Association. Seattle, Washington, USA. Retrieved from http://www-personal.umich.edu/~csandvig/research/Auditing%20Algorithms%20--%20Sandvig%20--%20ICA%202014%20Data%20and%20Discrimination%20Preconference.pdf
↑ Diakopoulos, Nicholas (2014). "Algorithmic Accountability Reporting: On the Investigation of Black Boxes". Academic Commons. doi:10.7916/D8ZK5TW2.
↑ Bamman, D. (2016). Interpretability in human-centered data science. In CSCW Workshop on Human-Centered Data Science.
↑ Shahriari, Kyarash; Shahriari, Mana (July 2017). "IEEE standard review — Ethically aligned design: A vision for prioritizing human wellbeing with artificial intelligence and autonomous systems". 2017 IEEE Canada International Humanitarian Technology Conference (IHTC) (IEEE). ISBN 9781509062645. doi:10.1109/ihtc.2017.8058187.


